Design of the Software Backplane

Michael John Muuss
John R. Anderson
Paul J. Tanenbaum
Ballistic and Nuclear Vulnerability/Lethality Division
Survivability and Lethality Analysis Directorate
The U.S. Army Research Laboratory
APG, MD 21005-5068 USA

Christopher T. Johnson
Paladin Software
Havre de Grace, MD USA

Hardware environment

The hardware environment in which our simulations will operate has these properties:

The Backplane Philosophy

We take inspiration from standard hardware backplanes such as the VME bus, which provides standardized slots, a standardized interface, and a standardized signaling protocol over the bus. All connectors are identical (location independent), aside from possible performance differences.

We strive to achieve the same flexibility and decoupling in our software simulations, where interface modules, vehicle dynamics modules, thermal simulation modules, atmospheric simulation modules, vulnerability/lethality simulation modules, dynamic terrain simulation modules, and countermeasures modules can all reside on the same "backplane" with multiple simultaneous synthetic image generators of three very different classes:

Implementation Plan

We will attempt to implement this software backplane using the DoD High-Level Architecture (HLA), with extensions as necessary to provide the minimal set of backplane services.

Note that the HLA RTI library is not multi-threaded, so some care will have to be taken when using it with multi-threaded "plug-in" modules.

Required Backplane Features

To serve as a software backplane, the system must provide not only reconfigurability, a standardized interface, and location independence for the plug-in modules, but also four types of data exchange services:

  1. Event services. These are messages broadcast to any interested Federate, for which (a) no reply is expected, and (b) the information has no persistence, i.e. it is strictly non-modal. An example might be a general notification of a weapon detonation. These can be implemented directly with HLA interactions.
  2. Data-Object services. These maintain "variables" across multiple Federates, with one Federate computing the values of those variables on a time schedule independent of when other Federates might wish to operate on the then-current value; the variables are thus persistent. Examples include recording the current simulation time-of-day and planetary positions, the current operating parameters of a given sensor, and the current state of a vehicle. These can be implemented directly with HLA data objects.
  3. Query/Response Services. These are messages in which a question is uni-cast from a Federate to a specifically addressed "server" Federate, each paired with a single uni-cast response from the server to the original asker. An example would be a request for collision detection from a vehicle dynamics module to the terrain server. These can be implemented using HLA interactions with a custom routing space.
  4. Continuous/Bulk Data Services. To de-couple the thermal simulators from the image generators, a "distributed shared memory" paradigm is necessary, such that the thermal solvers can update large thermal meshes at one rate (e.g. one full solution time step every 90 seconds of simulated time) while the image generators sample some subset of that data (the current sensor field-of-view) at very high rates (e.g. 60 Hz; if simulated time runs at real time, that is thousands of image samples per thermal step). There are two sub-modes of operation: Continuous, in which updates are pushed into each subscriber's local memory as they are published, and On-Demand, in which the subscriber is only notified of changes and must explicitly request the updated region. Both are sketched below and detailed in the Programming API.
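
As a purely illustrative sketch, the two sub-modes might appear to application code as a simple mode flag. The names below are assumptions, not part of HLA or any settled SHM interface; they are reused in the API sketches later in this paper:

	/* Hypothetical delivery-mode flag for Continuous/Bulk Data
	 * subscriptions; these names are illustrative only. */
	enum shm_delivery_mode {
		SHM_CONTINUOUS,	/* updates pushed into local memory as published */
		SHM_ON_DEMAND	/* notification only; subscriber pulls changed
				 * regions with an explicit update request */
	};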

Conceptual Model

There may be an arbitrary number of SHM "objects" active in a simulation.

Each SHM object is intended to store an array of data elements, which may themselves be C structs. For example, the "Solar Flux" SHM might consist of a 2-dimensional array of structures like this:

	struct datum {
		double	last_updated;		/* simulation_time */
		double	incident_flux;		/* mw/cm**2 */
	};

If additional per-cell reference data that does not change is needed, a second SHM object should be used, to avoid retransmitting constant or nearly-constant values each time the dynamic fields change. In this example, such data might include a per-cell surface normal, ground-cover type, albedo, etc., as sketched below.
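
For instance, a companion reference SHM for the "Solar Flux" example might hold the static per-cell data just mentioned; this struct is illustrative only, with assumed field names:

	/* Illustrative static reference data, published once in a second
	 * SHM so the dynamic SHM carries only changing values. */
	struct ref_datum {
		double	normal[3];	/* per-cell surface normal (unit vector) */
		int	ground_cover;	/* ground-cover type code */
		double	albedo;		/* dimensionless reflectivity, 0..1 */
	};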

Each SHM object will be directly addressable (using C/C++ pointers) in the address space of each Federate subscribed to that SHM object. The actual pointer value (object address) will be different in each Federate, so SHM objects may not themselves contain pointers. If references must be stored, use subscripts inside the SHM objects instead, as in the sketch below.
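
For example (illustrative only), an element that must refer to another element stores that element's subscript, which is valid in every Federate, rather than a pointer, which is valid only locally:

	struct linked_datum {
		double	value;
		int	next_index;	/* subscript of a related element, or -1;
					 * NOT a pointer, which would only be
					 * meaningful in one Federate's
					 * address space */
	};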

Each SHM object is intended to have two parts: first, a fixed-size n-dimensional array of fixed-size elements, so that notification regions can be specified in terms of coordinates; and second, an unstructured collection of data (typically smaller in size, for the exchange of parameters).

Each SHM will be created with certain parameters, including its name, the element size, and the dimensions of the array.

One Federate may both publish and subscribe to different SHMs.

The philosophy is that, in general, each SHM should be written into by only one Federate, and that all other subscribers to the SHM will only be reading. (This will be enforced by mapping each non-publishing subscriber's memory segment read-only.)

In certain cases there may be a need for two Federates to update values in one SHM, so more than one Federate may become a publisher of a single SHM. The backplane will not provide data consistency across Federates, so in general this should be avoided. An example might be the solar-load SHM, which can be written by multiple SIGs, and also updated in the background by a "Sun Skulker" process.

SHM In-memory Layout

An SHM segment consists of three regions, in order:

  1. a fixed-size library-provided header,
  2. the n-dimensional data array, and
  3. unstructured application "trailer" data.

The "trailer" data area is a place where the source application might place "by-product" information that is on-hand but expensive for a recipient to recompute, such as minimum and maximum values in the array, current operating parameters, etc. Depending on update rate, should more properly be made attributes of an HLA data object linked to the SHM. (Maybe we shouldn't even offer this).

It is expected that each subscriber will have mapped in the full SHM segment, even if it only requests updates for a small sub-rectangle.
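
A minimal sketch of how a Federate might locate the three regions of a mapped segment follows. The library-provided header's actual fields are not specified here; this struct, and the assumption of no padding between regions, are purely illustrative:

	#include <stddef.h>

	/* Hypothetical stand-in for the library-provided header. */
	struct shm_header {
		size_t	elem_size;	/* size of one array element */
		size_t	n_elems;	/* total elements in the array */
		size_t	trailer_size;	/* bytes of trailer data */
	};

	/* Locate the data array and trailer within the mapped segment. */
	void shm_regions(void *base, void **array, void **trailer)
	{
		struct shm_header *h = (struct shm_header *)base;

		*array = (char *)base + sizeof(*h);
		*trailer = (char *)*array + h->elem_size * h->n_elems;
	}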

Example Application

Consider an example configuration of three SHMs. The first ("Therm Texture") has one publisher (Texture Mapper) and one subscriber (PTN SIG). The second ("Ground Temp") has one publisher (GroundTherm) and two subscribers (RT SIG and Texture Mapper). The third ("Solar Flux") is the most complicated, with two publisher/subscribers (RT SIG and Sunlight Skulker) and one subscriber (GroundTherm).

Implementation Details

There will be one SHM_helper process running on each cabinet that uses SHM services. SHM data will be moved from cabinet to cabinet over connections between these SHM_helper processes.

SHM_helper processes will themselves be HLA Federates. The event signaling between the SHM_helper and the SHM clients will be done using HLA.

There will probably be one SHM_exec Federate, which will serve to coordinate the creation of the SHM_helpers, and establish communication between them.

As presently envisioned, at the first call to Create_and_Publish_SHM() or Request_Subscribe_SHM(), the SHM library will attempt to contact the Federate "SHM_exec". If that Federate does not yet exist, the calling process forks and execs one, which then runs on the local cabinet; once it is running, the library contacts it.

The library will forward its cabinet-ID (ATM IP address and/or NSAP address) to the SHM_exec, and ask whether an SHM_helper is presently running on this cabinet. If there is not one, the library forks and execs one.

The library will then register the name of the requested SHM with the SHM_exec, determining whether such an SHM already exists or needs to be created. The SHM_exec will maintain a list of publishers and subscribers for each SHM. Whenever a new subscriber joins an existing SHM, connections will be established between the publishers' SHM_helpers and the new subscriber's SHM_helper. Possibly this state information could all be stashed in objects of an SHM_Object class, such that an SHM_exec process/Federate isn't necessary.

Possibly the presence/absence of the local SHM_helper Federate can be determined by encoding the network address into the Federate's name, so that a single call to get_fedid() could be used to see if one needed to be started on this cabinet, without need of an SHM_exec.
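
A sketch of that bootstrap sequence follows. Only get_fedid() is named in the text, so everything else here (the naming scheme, signatures, and the helper's command line) is an assumption:

	#include <stdio.h>
	#include <unistd.h>

	/* Named in the text; the signature here is assumed. */
	extern int get_fedid(const char *federate_name);

	/* Hypothetical bootstrap run on the first SHM library call. */
	static void shm_bootstrap(const char *cabinet_id)
	{
		char name[256];

		/* Encode the network address into the helper's Federate
		 * name, so one get_fedid() lookup tests for its presence. */
		snprintf(name, sizeof(name), "SHM_helper.%s", cabinet_id);

		if (get_fedid(name) < 0 && fork() == 0) {
			/* No helper on this cabinet: start one here. */
			execlp("SHM_helper", "SHM_helper", cabinet_id,
				(char *)NULL);
		}
	}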

Programming API

(Perhaps these should be named more closely to the HLA data object routines, rather than the Federate routines?)
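
For concreteness, the routines described below might have C bindings along the following lines. Every signature here is an assumption for illustration, not a settled interface; shm_delivery_mode is the hypothetical flag sketched earlier:

	#include <stddef.h>	/* for size_t */

	typedef int shm_id;	/* hypothetical handle for a joined SHM */

	shm_id Create_and_Publish_SHM(const char *name, size_t elem_size,
		int ndim, const int *dims);
	shm_id Request_Subscribe_SHM(const char *name,
		enum shm_delivery_mode mode);
	int Additional_Publisher_SHM(shm_id id);
	int SetRangeBounds_SHM(shm_id id, const int *lo, const int *hi);
	int Send_Subscriber_Update_SHM(shm_id id, const int *lo, const int *hi);
	int Update_Request_SHM(shm_id id, const int *lo, const int *hi);
	int Resign_SHM(shm_id id);
	int Destroy_SHM(shm_id id);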

Create_and_Publish_SHM

Causes a new named SHM to come into existence, specifying its element size and array dimensions. No Subscribed_SHM messages for this SHM will be issued by the library to any Federate until some Federate performs Create_and_Publish_SHM().

This will fail if the SHM has already been created, and will respond with a Subscribed_SHM-style return "code" if successful.

If the SHM has already been created, then there is already a publisher, so this Federate must instead request to join, and verify that the first publisher's element size and dimensions are appropriate.

Request_Subscribe_SHM

Request to join a named SHM as a subscriber. Notification will not be received until the SHM has been created by its first publisher, but this Federate will continue executing in the meantime.

The request specifies Continuous vs. On-Demand delivery (the two sub-modes described under Continuous/Bulk Data Services).

Subscribed_SHM (Message)

Notification that the named SHM has been created by its first publisher, and the SHM is now mapped into this Federate's address space read-only. (Or that joining has failed).

Information about the size and dimensionality will be provided, and should be checked.

Additional_Publisher_SHM

Notifies the library that this Federate, having previously joined the SHM as a subscriber, will now also publish updates to some region(s) of this SHM, as a "secondary" publisher.

The shared memory segment will be re-mapped to read/write.

Additional_Publisher_Joined_SHM (Message)

Message sent to the original publisher, notifying it that an additional publisher has joined this SHM.

SetRangeBounds_SHM

Indicates a restricted region of interest for an SHM subscription. This can be changed at any time while the subscription is active. The region is given as an array of min/max subscript values (int), as in the example below.
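
For example, a subscriber interested only in a 100x100 sub-rectangle of a 2-D SHM might specify (using the hypothetical bindings above; flux_shm is an assumed handle from a previous join):

	/* Restrict interest to rows 200..299, columns 50..149. */
	int lo[2] = { 200,  50 };
	int hi[2] = { 299, 149 };
	SetRangeBounds_SHM(flux_shm, lo, hi);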

Send_Subscriber_Update_SHM

A publisher calls this to notify the SHM library that a region of the SHM has been updated, and subscribers should be notified of this.

Receive_Subscriber_Update_SHM (Message)

Message received by a subscriber, indicating that a region of the SHM has been changed by the publisher. If this subscriber is in Continuous mode, this is a notification that the local shared memory has already been brought up to date. If this subscriber is in On-Demand mode, this is only a notification; an Update_Request_SHM() needs to be sent to bring the local shared-memory copy of this region up to date.

The message specifies the affected region (which may be larger than the region the publisher actually updated, but never smaller).

Update_Request_SHM

Subscriber requests that the specified region of the SHM be brought up to date. The subscriber will receive a Receive_Update_Reply_SHM message when the requested region is up to date.

Receive_Update_Reply_SHM (Message)

Response to Update_Request_SHM. Specifies the region that is now up to date (which may be larger than the requested region, but never smaller).
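
Putting the On-Demand pieces together, a subscriber's handling might look like this sketch, with the hypothetical bindings above; how the library dispatches these messages to handlers is not specified here:

	/* Illustrative On-Demand flow. */
	void on_subscriber_update(shm_id id, const int *lo, const int *hi)
	{
		/* Local copy of [lo..hi] is stale; ask that it be fetched. */
		Update_Request_SHM(id, lo, hi);
	}

	void on_update_reply(shm_id id, const int *lo, const int *hi)
	{
		/* [lo..hi] of the mapped segment is now current and may be
		 * read directly through the local pointer. */
	}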

Resign_SHM

Stop all updates of this SHM for this Federate, un-map the shared memory from this Federate's process. The SHM will continue to exist, and updates will continue to be delivered to other subscribers.

If the last subscriber resigns, the SHM continues to exist, in case additional subscribers might show up to read it, or maybe even become a new publisher.

Analogy: unix close() sys-call.

Destroy_SHM

When destroyFederationExecution() is called for this Federation, all SHMs in that Federation are also destroyed. Because of this property, most Federates will not need to explicitly call Destroy_SHM.

If a Federate needs to purge the SHM at the earliest safe opportunity, and does not expect a transient or late-joining Federate to need the current state, destruction can be made to happen sooner by calling Destroy_SHM(). This posts a request to the library to destroy the SHM once all publishers and subscribers have resigned. The SHM is not destroyed immediately, to keep Federates still subscribed to it from dumping core by dereferencing stale pointers.

Analogy: unix unlink() sys-call.
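
Tying the calls together, a publisher's lifecycle might look like this sketch, using the hypothetical bindings and the struct datum example from earlier; the mesh size and the writes into the mapped array are elided assumptions:

	void publish_solar_flux(void)
	{
		int dims[2] = { 512, 512 };	/* assumed mesh dimensions */
		int lo[2] = { 0, 0 }, hi[2] = { 511, 511 };
		shm_id flux;

		flux = Create_and_Publish_SHM("Solar Flux",
			sizeof(struct datum), 2, dims);

		/* ... write new values into the mapped array ... */

		Send_Subscriber_Update_SHM(flux, lo, hi);
		Resign_SHM(flux);	/* like close(): the SHM lives on */
	}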

Future Directions

With multiple publishers, the concurrent-update problem can arise. In the initial implementation, the library will provide no protection against this; rectangles will be bulk-replaced, and there will be no assurance of global consistency or ordering.

Various techniques are being considered to address this issue, should it become useful to application writers. Possibilities include time-tagged data elements, a bit-vector of "array element modified" indicators, and rules for how to replace or modify (e.g. sum) values from multiple publishers.
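
As one speculative example of such a rule, a time-tagged "last writer wins" replacement could key on the last_updated field already present in struct datum:

	/* Speculative last-writer-wins merge; just one of the rules
	 * under consideration, not a committed design. */
	void merge_datum(struct datum *dst, const struct datum *src)
	{
		if (src->last_updated > dst->last_updated)
			*dst = *src;	/* newer simulation time wins */
	}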


