Workstations, Networking, Distributed Graphics, and Parallel Processing

Michael John Muuss

Published in Computer Graphics Techniques: Theory and Practice D. F. Rogers, R. A. Earnshaw editors, Springer-Verlag, New York, 1990.
Please read the Postscript Version, which contains all the figures.

.\" groff -X -tepRs -ms abst a b c d e
.\" groff -tepRs -ms abst a b c d e | print-postscript
.\"
.\" Commands for grefer style selection
.\" label "A1.n.u D.y-2"
.R1
accumulate
sort
database ../bib
label-in-text
move-punctuation
bracket-label " [" ] ", "
label K
.R2
.\" Actual text
.EQ
delim $$
.EN
.\"gsize 24
.RP
.TL
Workstations,
.br
Networking,
.br
Distributed Graphics,
.br
and
.br
Parallel Processing
.AU
Michael John Muuss
.AI
Leader, Advanced Computer Systems Team
Ballistic Research Laboratory
Aberdeen Proving Ground
Maryland 21005-5066 USA
.AB
.LP
The process of design is iterative in nature;
designs are formulated, analyzed, and improved until the goals are met.
Modern graphics workstations provide a powerful platform for
performing detailed design and limited analysis of solid model designs,
with faster computers accessed via network links for full resolution analysis.
A broad spectrum of analysis tools exist, and most express their
output in graphical form.
Their three main styles of graphics display are examined, along
with a look at the underlying software mechanisms to support them.
.LP
Several enabling technologies
are required for analysis tools to run on one computer,
with graphical display on another computer, including the
network-transparent framebuffer capability.
The importance of portable external data representations is
reviewed, and several specific
external data representations are examined in significant detail.
By carefully dividing application software into appropriate tools
and connecting them with UNIX pipes, a measure of parallel processing
can be achieved within one system.
In a tool-oriented environment
with machine-independent data formats,
network-distributed computation can be accomplished.
.LP
The next step is to 
make a single tool execute faster using parallel processing.
Many analysis codes are implemented using the ray-tracing paradigm,
which is ideal for execution in parallel on
tightly coupled shared-memory multiprocessors and loosely
coupled ensembles of computers.
Both are exploited using different mechanisms.
The strategies used for operating
on shared-memory multiprocessors such as the Denelcor HEP, Alliant FX/8, and
Cray X-MP are presented along with measured performance data.
.LP
The strategies used for dividing the work among network-connected
loosely coupled processors (each of which may itself be a parallel
processor)
are presented, including details
of the dispatching algorithm and the design of the distribution protocol.
The performance issues of this type of parallel processing are presented,
including a
set of measured speeds on a variety of hardware.
.AE
.RT
.\" Make troff use nroff style bracketed references
.ds [. " [
.ds .] ]
.NH 1
INTRODUCTION
.PP
The fundamental purpose of the Ballistic Research Laboratory CAD Package is to
enable designers and engineers to create and analyze highly detailed
three dimensional solid models.
These models are considered ``solid models''
because the models are a computer
description of closed, solid, three dimensional shapes
represented by an analytical framework within which the
three-dimensional material can be completely and unambiguously defined.
.[
Deitz improved acquisition
.]
Completeness is assured because the
representation contains a full description of a piece of solid matter;
there is no view-specific information.
.[
Muuss preparation and analysis of solid models
.]
.PP
The software in the BRL CAD Package can be grouped
into three main categories.
First and foremost, it includes a complete solid modeling system.
Secondly, there is a set of libraries for producing graphical
displays on a variety of display hardware.
Finally, there is a large collection of software tools, each of which
performs a single function quite well, and which can be easily combined in
powerful ways using UNIX pipes.
.[
Ritchie time-sharing system
.]
.PP
The purpose of this paper is to describe the graphics tools, and
graphics support environment necessary to enable visual displays
to be the center of the design process, and to discuss how network
capabilities and parallel processing
are important in such an environment, both to support collaboration,
and to improve performance.
The first section will set the stage by reviewing the design process.
A broad spectrum of design analysis tools exist, and most express their
output in graphical form.
In the second section,
the three main styles of graphics display are examined, along
with a look at the underlying software mechanisms to support them.
In the third section, the extension of these graphics mechanisms to
the network environment is examined in significant detail.
In the fourth section,
the power of parallel processing is exploited for
multi-tool applications,
both on multiprocessor hardware,
and utilizing multiple uniprocessor machines on the network.
In the fifth section,
shared memory parallel processing techniques are employed to
make \fIa single application tool\fR run much faster.
Finally, an ensemble of network computers is employed,
to make a single application tool run much faster.
.NH 2
The Modern Computing Environment
.PP
The cost of computer power continues to plummet.
Therefore, computer systems will become more and more prevalent
in the modern workplace.
They will be purchased to improve productivity, lower costs,
or improve the quality of the workplace.
Due to system proliferation, users will be located on different
machines, separated from each other.
It is vitally important that they not become isolated from necessary
information, or the proliferation of computer systems will bring
a family of new problems.
.PP
To exploit the influx of new computer power,
\fIautomation systems and facilities,
regardless of manufacturer, model, and location,
must interoperate.\fR
This interoperation requires the connection of these systems to
data networks.
A user should be able to move data between any pair of computers,
without physical intervention, using one simple set of commands,
regardless of the data's source or destination.
Single vendor networks are impossible in the government arena,
and difficult elsewhere, forcing computer planners into a systems
integration role.
Networking standards are young, or still emerging, so there are
many issues still to be resolved.
This paper will detail one powerful set of applications built
for a network environment full of computers.
.PP
As one of the nation's foremost research and development laboratories,
BRL has continued to help advance the state of the art in computer
science and computer practice,
starting in 1943 with what became the ENIAC project,
which resulted in BRL possessing one of the worlds first electronic computers.
The proliferation of hardware started immediately thereafter.
The diversity of computers installed resulted in
BRL scientists having to deal with
issues of software portability in the 1950s.
In the early 1960s, BRL developed solid modeling and ray-tracing
technology which
went into production use in 1965.
BRL was part of the initial ARPANET in 1967, and by the advent of
the TCP/IP protocols in 1982 had the largest campus area network in DoD.
In 1988, the BRL facility includes two SuperComputers, a Cray X-MP/48
and a Cray-2, a fiber-optic based campus area network, dozens of
super-minicomputer systems,
over a dozen distributed computer facilities,
and hundreds of workstations and personal computers
(See Figure 1).
More than 30 high performance high resolution color
graphics workstations are located throughout the laboratory.
The use of color graphics to combine \fIadvanced modeling\fR
with \fIscientific visualization\fR accounts for BRL's
continuing ability to increase the pool of human knowledge
in times of dwindling funding.
.[
Visualization in Scientific Computing
.]
.PP
The BRL CAD Package is BRL's third generation of modeling software,
and represents over 150,000 lines of \fBC\fR source code written
since 1979.
Parts of this system
have roots in work done over two decades ago,
.[
geometric description technique MAGI
.]
most
notably the solid modeling, and the ray tracing.
It is highly portable, having run on five distinct
generations of hardware running various versions of the UNIX operating
system.
.[
Muuss Why buy UNIX
.]
As of this writing, this software is known to run on hardware
from many different vendors, and is in use at over 300 sites
worldwide.
The non-military portions of this software are
available;  contact the author for details.
.TS
center box;
c s
l l.
Hardware Supported
_
Vendor	System
=
Alliant	FX/4, FX/8, FX/80
Convex	C1, C2
Cray	Cray-1, Cray X-MP, Cray-2
DEC	VAX-11/750, 780, etc
DEC	VaxStation-II GPX
Denelcor	HEP H-1000
Elxsi	6410, 6420
Gould	PN 6000, PN 9000
ISI	68020
Multiflow	Trace 7/200
Pyramid	90MX
Ridge	330
SGI	3-D (2400, 3030, etc)
SGI	4-D (4D/60, etc)
Sun	Sun-2, Sun-3, Sun-4
Tektronix	4132
.TE
.NH 2
The Iterative Nature of Design
.PP
\fIDesign\fR is an iterative process.
Starting with a set of goals, an initial concept is defined.
The initial concept is refined
until the design is believed to have reached a worthwhile plateau, and
then the design is subjected to a variety of analyses techniques in an attempt
to determine how well it meets the original goals.
The result of the analysis determines whether the goals have been
met.  If the goals have not been met, then the outcome of the analysis
drives another iteration of the refinement process.
.PP
In a full scale solid modeling system, there is no need for initial
drawings;  the designer expresses the initial structures directly
into the modeling system's editor,
just as a modern author creates his ``rough draft''
directly into a word processor.
At the completion of each version of the design, the model is
subjected to a battery of analyses appropriate to the function of the
object being designed.
Strength, volume, weight, level of protection, and other similar evaluations
can be reported, along with the
production of a variety of images and/or drawings.
These automated analyses help identify weaknesses or deficiencies in
a design \fIearly in the design process\fR.
By detecting flaws early, the designer has the
opportunity to correct his plans before having invested too much time
in a bad design, or the designer can
switch to an entirely different approach which may seem more promising
than the original one.
In this way, the solid modeling system allows the designer to
concentrate on the important, creative aspects of the design process.
Freeing the designer of routine analysis permits designs to be
finished in less time than previously required, or allows much more
rigorously optimized designs to be delivered in comparable
timeframes and at the same cost as unoptimized designs created using
older techniques.
.[
Deitz item-level modeling
.]
.PP
The creative process that converts goals into designs has been
the subject of some study, but the process is not understood well enough 
for there to be much opportunity for automation yet.
However, the analysis of a given design against a fixed set of
standards (the design goals) is amenable to a great deal of automation.
Given a full description of the geometry of an object, and an indication
of the material used to construct each part, it is possible to conduct
a battery of analyses automaticly.  These analyses might range
from determining the center of mass, to creating an optical image of
the object, to performing a detailed structural analysis of the object
under various forms of loading.
In light of the iterative nature of design, and the current lack of
deep insight into the creative process,
the BRL CAD Package has been focused on appropriately applying the power of
computers to each part of the design loop.
Each of these parts are supported by at least one
main piece of software.
.IP 1)
The designer is provided
with powerful, interactive, and vision-oriented
tools for specifying the geometry of the objects that they wish to build.
The specification and modification of geometry is done using the
solid model editor \fBmged\fR.
.[
Weaver Muuss interactive construction of com-geom targets
.]
.[
Muuss GED interactive solid modeling system for vulnerability
.]
.IP 2)
The application writer is
provided a powerful platform upon which to build analysis tools,
where the physics of a particular analysis is insulated from
the complexities of the shapes under analysis.
Analysis tools are built, free from the details of the underlying
geometry using \fBlibrt\fR, a ray tracing library for model interrogation.
.IP 3)
A broad family of automatic analysis tools are provided to the designer,
to help assess the merits of each design change.
Many analysis tools are built around a generic skeleton of 
code that provides some common housekeeping functions,
and thus reduce development costs and help provide consistent interfaces.
.IP 4)
The output of the analysis tools is provided to the designer
in graphical form, to permit rapid comprehension of complex assessments.
The model analysis tools produce graphical output using \fBlibfb\fR, a generic
framebuffer library with full network display capability,
and \fBlibplot3\fP, a library for the production of
2-D and 3-D UNIX-plot files.
.IP 5)
The ability to store and manipulate images and signals is provided,
to permit convenient handling of analysis output, and to permit
easy comparison of simulation results with actual test data.
The handling of all the images and data resulting from the
analysis codes is managed via several groups of tools, including
a collection of software tools for framebuffer manipulation,
a collection of software tools for
image processing and analysis,
a collection of software tools for handling 2-D and 3-D
UNIX-Plot wireframe plots,
and
a collection of software tools for signal generation and processing.
.PP
There is a substantial amount of technology and software required
to achieve the environment described in the last paragraph.
The technology of solid modeling is discussed in much greater
detail elsewhere.
.[
Muuss preparation and analysis of solid models
.]
.NH 2
Overview of Analysis Tools
.PP
Many phenomena that are ordinarily difficult to model
can be handled simply and elegantly with ray-tracing.
For example,
an illumination model
based on ray-tracing merely needs to fire a ray at each light source
to determine the total light energy at each point.
By dithering the location of the light sources, this gives
a statistically good rendition of shadow penumbra effects.
Unlike conventional rendering packages, ray-tracing also makes it
easy to deal with objects that are partly or entirely reflective,
and with transparent objects that have varying refractive indices.
Furthermore, by applying the proper sorts of dither,
.[
Cook distributed ray tracing
.]
motion-blur, depth-of-field, translucency, and other effects are
easily achieved.
.PP
Given that lighting model code exists for making static images,
it is straightforward to develop code which
can animate the position of the ``eye'' (camera) and light sources
within the model, and only a modest additional effort to develop
the capability of articulating the model geometry itself.
Using animation can be very beneficial when trying to gain
comprehension of complex geometry, especially when the object
being modeled has never actually been physically built before.
There is nothing that communicates complex structures
more clearly than to observe them passing by.
.PP
The power of the lighting model code
can further extended by making a provision to record the paths of all
the rays followed when computing the light intensity for each pixel
in an auxiliary file.
This capability
allows one to follow the path of the light rays passing through lenses
reflecting from mirrors while performing image rendering,
with no additional computation,
which makes a fantastic debugging tool.
Studying the paths of light rays as they are repeatedly bent
by passing from air to glass and back again has traditionally
been a painstaking manual procedure for lens designers.
By modeling,
it becomes possible to predict lens behavior, including making a
determination of the exact focal length, finding the precise influence of
spherical distortions and edge effects, determining the amount of
image distortion due to internal reflection and scattering,
and finding the level of reflections from the lens mounting hardware.
Furthermore, experiments can be conducted to
determine the effects of adding or removing
baffles, irises, special lens coatings, etc.
.PP
In the design of vehicles,
moments and products of inertia play a central role.
Particularly when designing aircraft, weights,
the center of gravity,
and parameters related to inertia are vital aspects of
creating a stable design with high performance.
For ground vehicles and fixed structures these figures are
quite significant, providing a good estimate of structural
loading, and allowing transportation costs to be assessed.
Moments of inertia are important in determining what conditions
a vehicle may be overturned by maneuvering or impact, and can also
help predict the vehicle's handling when maneuvering over rough terrain.
.[
Deitz improved materiel acquisition
.]
.PP
An important extension to the study of simple static and dynamic loading of
structures is to consider dynamic stress caused by
the impact of a high velocity object with the model.
This could be as ordinary as a rock striking a car traveling at
80 kilometers per hour, hailstones falling from the sky, or
a collision with another vehicle.
On the other hand, if the object being designed is intended for
use in a military setting, the range of threats which must be considered
is greatly enlarged.
.[
Weaver Deitz solid modeling aircraft survivability
.]
.PP
Synthetic aperture radar (SAR) is a technique with which a variety of
image information about a distant object can be obtained by correlating
multiple radar samples taken from various positions.
.[
Toomay radar principles
.]
While standard radars only report target backscatter and range information,
SAR techniques can resolve distinct scattering regions of a target.
.[
Deitz signature modeling
.]
.NH 2
Tradeoff Analysis
.PP
The philosophy adopted at BRL has been to develop a broad set of
analyses which are supported from the same geometry database.
.[
Deitz predictive signature modeling
.]
These analyses cover the spectrum from engineering decision aids,
to design validators, to drafting and milling interfaces, to
the generation of manufacturing drawings,
to image generation for management comprehension and sales advantage.
Key analysis capabilities have been developed to assess the 
strength, weight, protection, and performance levels offered by
the structures represented by a solid model.
Using this analysis information and additional domain-specific
applications tools makes it possible to produce highly detailed
designs
constructed with a philosophy of \fIsystem optimization\fR
right from the start.
.[
Deitz Mermagen integrated environment for target description
.]
This should allow the rapid
development of systems with the desired levels
of performance at the best attainable price.
.NH 2
Local Patterns of Use
.PP
At any stage of a design,
the model can be used to make
realistic images and engineering drawings.
This capability is so powerful that it ordinarily justifies the
expense of constructing the solid model.
However, the real payoff from building a geometric model
comes when it is time to analyze it.
The model can be subjected to
numerous engineering analyses, allowing
the effects of varying many parameters
to be studied in a controlled way.
.PP
From the perspective of a designer working in
a highly interactive analysis environment,
the majority of his time is spent
using the graphics screen
to interact with the model geometry,
contemplating it, viewing it, and changing it.
After a significant design change, one or more analyses may be
conducted on the new design, with the output images being displayed
on the graphics screen as they are computed.
Then the interaction with the model continues.
.PP
.NH 1
LOCAL VIEWING OF LOCAL DATA
.NH 2
Main Styles of Graphics Display
.PP
There are several distinct types of graphics support that are
required to implement the use of the software just described.
First, interaction with the model geometry using \fBmged\fR
is done in a highly interactive mode using wireframe displays.
The \fBmged\fR program contains a library of \fIdisplay manager\fR
modules that support 
a wide variety of hardware types
to produce these wireframe displays, and to deal with knobs, buttons,
mice, joysticks, and other hardware-specific input devices.
.PP
Second, many of the applications programs produce full-color shaded
images.
To free the applications programs from having to deal with
the specifics of each type of hardware needed to display these images,
a hardware-independent framebuffer library called \fBlibfb\fR
has been built.
By using this library, each application can be written to perform
abstract operations on an idealized 24-bit deep RGB framebuffer,
and leave the details of managing the particular hardware to the library.
This library also allows applications to open an arbitrary number of
different framebuffers, where each one may potentially be made by a different
manufacturer.
.PP
Third, quite a few of the applications programs produce two dimensional
or three dimensional plots for the designer to view.
To provide this capability, the traditional UNIX-Plot capability
has been extended to provide three dimensional plotting, plotting
using floating point coordinates, and plotting in color.
This extended UNIX-Plot support is provided to the applications in the
form of the library \fBlibplot3\fR, which is described in more detail
below.
.PP
Finally, a number of applications presently under development
will produce collections of polygons, suitable for display on
hardware with intrinsic polygon drawing capabilities.
A definite format for these polygons has not been adopted yet.
.NH 2
Graphics Standards -vs- New Hardware
.PP
There are a great many graphics standards available today;  so many as
to make a mockery of the word ``standard''.
Support for the three main styles of graphic display in the BRL CAD
Package could perhaps be provided using one or more standards in existence
today, with some effort.
However, most of these capabilities were created \fIcirca\fR 1980, and
were oriented towards providing high performance over a well defined
and moderately limited domain of operations. As a result, the
capabilities provided by the BRL CAD Package overlap with some of the
capabilities that exist in current day standard graphics packages.
There are a number of compelling reasons for continuing to support
the existing BRL styles of operation, even in a climate rich with standards.
First, there is no known standard that can provide operations with
semantics that match the existing operations and capabilities fairly
closely. Second, there is no standard that is implemented across as wide
a range of hardware as the present BRL library implementations. Third,
there is no known implementation of a standard graphics package that can
deliver performance roughly comparable to the existing implementations.
Finally, implementing support for each new type of display hardware
typically requires from four to forty man-hours of effort, which is
significantly less effort than would be required to implement a
full standard package on the same hardware, the assumption here
being that new hardware does not support the necessary standard packages.
New hardware rarely supports the graphics standards of interest.
.PP
One of the strongest arguments in favor of standard graphics libraries
is based on the notion that if new hardware is always delivered with
support for the existing standards, and if all important user code
is written to use the existing standards, then moving the existing
user code to new hardware should just require recompilation, and
everything will work.
Unfortunately, in practice there have been several problems encountered
when attempting to realize this ideal.
Graphics hardware has been evolving at a fantastic pace.
Not only have speeds increased, but completely new features
never seen before keep appearing.
Standards activities, on the other hand, tend to succeed best in
fields of endeavor that have a more moderate growth rate,
because establishing standards is a very time consuming task.
Standards are therefore forced to lag the ``state of the art''
by half a decade or so.
.PP
The strategy adopted early in the development of the BRL CAD Package
was based on the premise that a graphics interface standard
with suitable power, flexibility, and performance
would probably not emerge during the lifetime of the software.
Therefore, it was necessary to be positioned such that new hardware
could be supported with great agility.
All current BRL applications can be satisfied using some combination
of the three basic families of graphics operations.
In each case, a low level interface has been constructed that
provides the absolute minimum set of features possible
while retaining the ability to actually accomplish something,
following the ``small is beautiful'' philosophy.
.[
Kernighan Pike UNIX Programming Environment
.]
.PP
These remarks should not be taken as a global condemnation of standards.
Because of the highly focused domain of operations, the lack of
committee participation in the design, and the orientation towards
performance in favor of generality, the support for graphic displays
in the BRL CAD Package has been easy to implement, cleanly designed,
and highly effective.
For uses that fall within the supported domain of operations,
these libraries are likely to prove highly satisfying.
For other uses, some other solution is required.
However, it is important to note that all of the display
generation and handling in the BRL CAD Package has been built
on these abstractions without significant strain.
.PP
The \fBmged\fR Display Managers and the \fBlibfb\fR interface routines
both provide capabilities to interactively manipulate display hardware,
so high performance is a significant issue.  To retain high performance
while also achieving good modular software design, an object-oriented
interface has been used in both cases.
While it is possible to use \fBlibplot3\fR in an interactive setting,
the portable machine-independent plot file provides the standard
interface.
Here, only a single hardware-specific plotfile
display program is needed for each
type of hardware.
.PP
As a result of the graphics interface strategy selected, it is necessary
to create three new software modules each time
new hardware is to be supported.
However, because of the very careful selection of the interfaces,
it typically requires less than one man-week of programming effort
to support each new type of hardware.
Because only several pieces of exciting new hardware appears each
year, this strategy has proven to have been most effective.
.NH 2
Handling Full-Color Images
.NH 3
The Frame Buffer Abstraction
.PP
The framebuffer that \fBlibfb\fR provides is very simple, and
has only a few assumptions. 
The display surface is rectangular, and is covered with individually
addressable pixels.
The display surface is \fInpixels\fR wide, and \fInlines\fR high.
Pixels are addressed as (x,y)
using first quadrant Cartesian coordinates
so that (0,0) is the lower left corner of the display,
with 0 <= x < \fInpixels\fR, and 0 <= y < \fInlines\fR.
.PP
The rectangular display surface requested must fit entirely on the
screen space available for the selected piece of hardware.
Therefore, the library does not have to provide any
services to clip the display surface to the hardware screen.
Multi-pixel writes start at a given x,y coordinate, and proceed
to the right (the +X direction) until x = \fInpixels\fR\-1.
If more pixels remain to be written, the y coordinate is increased
by one and the x coordinate is reset to zero, so that pixel writing
proceeds on the next line \fIup\fR on the screen.
.NH 3
Pixels
.PP
Each pixel is 24 bits deep, with eight bits assigned to each
of Red, Green, and Blue.
Pixels are dealt with by \fBlibfb\fR as atomic units of type \fBRGBpixel\fR.
The \fBC\fR language declaration for this is:
.sp .5
.ti +1i
\fBtypedef unsigned char RGBpixel[3];\fR
.sp .5
Where element [0] is Red, [1] is Green, and [2] is Blue.
As a result of this, pixels are stored tightly packed in memory,
using three bytes per pixel, in a large \fBunsigned char\fR array.
The sequence of colors in the array is thus Red Green Blue, Red Green Blue....
.PP
One seemingly attractive alternative to this format might have been
to define the pixel differently, as:
.sp .5
.in +1i
.nf
.ft B
typedef struct {
.in +1i
unsigned char red;
unsigned char green;
unsigned char blue;
.in -1i
} RGBpixel;
.in -1i
.ft R
.fi
.sp .5
However, such a choice would invite the compiler to provide structure
padding, to ensure word alignment of the start of the structure.
This could waste a significant amount of memory.
The worst case is machines with 64-bit word lengths like the Crays;
such a pixel definition would expand each pixel from three to eight bytes!
.NH 3
The Color Map
.PP
The 8-bit Red value of a pixel is mapped
by a Red colormap (color lookup table)
with 256 (2**8) entries to determine the actual
intensity displayed on the screen.  Green and Blue are handled similarly.
Each of the 256 colormap
entries is expected to be 16 bits wide, with the
values represented as binary fixed point notation, with the radix point
to the left of bit zero, so that intensities in the color map can be
thought of as varying from 0.0 to 1.0 (or, an \fBunsigned int\fR varying
from 0x0000 to 0xFFFF, in hexadecimal).
The colormap transformation will generally be
handled by hardware.  When the hardware color map is less than
16 bits wide, the low order bits are discarded.
Where the  hardware colormap capability is missing, it is
simulated by hardware-specific \fBlibfb\fR routines.
Such simulation implies that the hardware-specific routines will
need to retain a memory copy of the original
un-mapped pixels, to permit the original pixel values to be read back.
The colormap is important for certain image processing operations,
as well as to permit approximate corrections for the nonlinearity
of the display device (gamma correction).  For details
on better methods for achieving the gamma correction,
consult Hall.
.[
Hall methodology realistic image synthesis
.]
.NH 3
The ``libfb'' Interface
.PP
The interface to the \fBlibfb\fR routines has been loosely
modeled after the interface to the UNIX \fBstdio\fR I/O library.
In particular, the routine \fIfb_open()\fR returns a pointer to a
object of type FBIO.  While the contents of the FBIO object are
opaque to the application, all other \fBlibfb\fR routines
require that this pointer be provided as the first parameter.
All the internal state data for the library routines
and the underlying hardware is stored in the FBIO object.
With this formalism, it now becomes easy to write applications
that utilize an arbitrary number of display devices simultaneously,
even if each device is potentially a different type of hardware.
The routines that comprise \fBlibfb\fR are listed in the table
below.
.[
Muuss CAD Package Release 1.21
.]
.TS
center box;
c s
l l.
libfb routines
_
fb_open(device,width,height)	open the device
fb_close(fbp)	close the device
fb_read(fbp,x,y,buf,count)	read count pixels at x,y
fb_write(fbp,x,y,buf,count)	write count pixels at x,y
fb_clear(fbp,color)	clear to an optional color
fb_rmap(fbp,colormap)	read a colormap
fb_wmap(fbp,colormap)	write a colormap
fb_window(fbp,x,y)	place x,y at center
fb_zoom(fbp,xzoom,yzoom)	pixel replicate zoom
fb_getwidth(fbp)	actual device width in pixels
fb_getheight(fbp)	actual device height
fb_cursor(fbp,mode,x,y)	cursor in image coords
fb_scursor(fbp,mode,x,y)	cursor in screen coords
fb_log(format,arg,...)	user replaceable error logger
.TE
.PP
A set of buffered I/O routines is also provided, in which
a \fIband\fR of scanlines is kept in memory, with the library
transparently providing band exchanges via bulk framebuffer
reads and writes.
While using these routines can speed up programs that make
``well behaved'' single pixel reads and writes, it
does not make the drawing of vertical lines significantly faster,
since such a line could run through quite a few bands,
causing many band exchanges.
In practice, very few programs use buffered I/O, but because
it can be provided by the library
independent of the specifics of the underlying
display hardware, it seemed appropriate to implement it
as a general capability.
.TS
center box;
c s
l l.
libfb buffered I/O
_
fb_ioinit(fbp)	set up a memory buffer
fb_seek(fbp,x,y)	move to an x,y location
fb_tell(fbp,xp,yp)	gives the current location
fb_rpixel(fbp,pixelp)	read a pixel and bump location
fb_wpixel(fbp,pixelp)	write and bump current location
fb_flush(fbp)	bring display up to date
.TE
.NH 3
Specifying a Particular Framebuffer
.PP
The display to be used is selected
by the string parameter \fIdevice\fR in the call to \fBfb_open()\fR.
If that parameter is a null string,
the UNIX runtime environment variable FB_FILE is used.
If FB_FILE is undefined, the default display for that system type
is selected (a library compile-time option).
The format of the \fIdevice\fR string is
.sp .5
.ce 1
[host:]/dev/\fIdevice_name\fR[#]
.sp .5
to designate a hardware device (where the square brackets ``[]'' denote
an optional parameter), or simply
.sp .5
.ce 1
disk_filename
.sp .5
to specify the name
of an ordinary file to act as a virtual framebuffer (with the image
conveniently being stored in \fBpix\fR(5) format into that file).
The string prefix ``/dev/'' is used to identify that this request
is for a hardware display device, as opposed to naming an ordinary file.
In such cases, the
\fIdevice_name\fR part generally will \fInot\fR
correspond to actual system hardware entries in the directory /dev.
If the /dev prefix is not given, then a pathname to an ordinary disk file is
assumed.
.PP
If a hostname is given, a network connection is opened to
the framebuffer library daemon (\fBrfbd\fR) on that machine.
The remaining part of
the string is passed to the remote daemon to complete the open
(this generalizes
the open to allow multiple "hops" in order
to reach a desired host).  More about this later.
.PP
The \fIdebug\fR interface prints a message each time a library
routine is called, as an aid to debugging.
The disk file interface is used to directly manipulate images in a
\fBpix\fR(5) format file, as if that image was in a framebuffer.
Support for specific display hardware is optional, and is
determined by compile-time configuration options.  The current
list of supported devices is given in the table.
.sp
.TS
center box;
c
l l.
libfb supported interfaces
_
	STANDARD
_
file	Manipulate disk image
debug	Print library calls
_
	OPTIONAL
_
remote	Network access to remote display
adage	Adage RDS-3000 ("Ikonas")
sgi	Silicon Graphics Iris display (3D)
sgi	Silicon Graphics Iris display (4D)
sun	Sun Microsystems SunView (bw & color)
X	MIT X-windows interface, for X11
ptty	AT&T 5620 w/special Moss layer
rat	Raster Technology One/80
ug	Ultra Graphics
.TE
.PP
Because the model of the display is that of a traditional framebuffer,
providing support for much simpler display devices like
Sun workstations requires a lot of software simulation,
because of the many features that are not implemented in the hardware.
The first device supported (in 1981) was
an Ikonas (now Adage RDS-3000), which operates
as either a 512x512 or 1024x1024 display with 24 bit pixels.
The Ikonas hardware
does not allow the current display
resolution to be sensed from software, so
any program opening the display must ``know'' whether the
framebuffer should be opened in low or high resolution mode.
As a result, every program that deals with a framebuffer,
even those which have little to do with display size (such as those
which read or write colormaps), includes a \fB\-h\fR
``high resolution'' flag (for 1024x1024
operation) and a default resolution of 512x512, to make it convenient to
open the display with the proper resolution.
.NH 2
Wireframe Display support in MGED
.PP
The \fBmged\fR editor is used to specify and modify the geometry
in the solid model database.
The editor presents a representation of the geometry using
wireframe displays.
.PP
Support for a variety of display hardware is achieved through
the use of an object-oriented programming interface between the
main \fBmged\fR program and the \fIDisplay Manager\fR routines.
When objects are first called up for view, vector representations
of the wireframes are created in an internal form, and made available
to the currently attached Display Manager.
.PP
Display Managers for two main families of display systems are built
with a single common formalized interface.
First, display systems with displaylist capability are supported.
When objects are first called up for view,
the internal vector lists are converted to device-specific formats and
placed into displaylist memory.
All subsequent operations are
performed by transmitting displaylist updates, and
replacing viewing matrices.
Second, display systems with no displaylist memory are supported.
For these displays, each time the main \fBmged\fR program signals
that the screen needs to be updated, the current viewing matrices
are combined with the internal vector list,
the vectors are clipped, and output to
the display.
At present, all versions of \fBmged\fR have support for these types of
display devices by default:
.TS
center box;
l l.
plot	any UNIX-plot filter
tek	Tektronix 4014 and family
tek4109	Tektronix 4109
.TE
These optional display devices are also supported,
when specifically configured:
.TS
center box;
l l.
sgi	Silicon Graphics Iris 3D
sgi	Silicon Graphics Iris 4D
sun	Sun Microsystems SunView
X	MIT X-Windows X11R3
X10	MIT X-Windows X10R4
vg	Vector General 3300
mg	Megatek 7250
rat	Raster Tech One/180,380
rat80	Raster Tech One/80
ps	Evans and Sutherland PS-300
mer	Megatek Merlin 9200
sab	Saber Technology
.TE
.NH 2
libplot3:  Plotting Support
.PP
UNIX systems have traditionally supported device-independent
plotting on a variety of displays, using the \fBlibplot\fP(3)
library routines to produce \fBplot\fR(5) format plot files.
These plot files can be displayed on a variety of displays
using the \fBplot\fR(1) program (renamed \fBtplot\fR(1) in SystemV).
Standard device support on SystemV includes the
DASI 300, DASI 450, and Versatec D1200A.
Standard device support on Berkeley UNIX includes
the Tektronix 4013, 4014, DASI 450,
DASI 300, AED 512, BBN Bitgraph, Imagen laserprinter,
HP 2648, Versatec D1200A, and the Benson Varian plotter.
Crude plots can also be printed on any terminal with cursor-addressing
capability.
The BRL SystemV extensions
.[
Gwyn VLD UNIX Supplementary Manual for System V
.]
provide support for
the Selanar HiREZ-100, HP 2648, HP 7750A plotter, HP 2686A LaserJet,
Megatek 7255, Tektronix 4014, Tektronix 4105, Trilog ColorPlot II,
Teletype 5620, and Digital Engineering's Retrographics VT100.
.PP
The standard plotting interface permits plotting in two dimensions,
using 16-bit signed integer coordinates.
The intervals to be used in X and Y must be specified in advance.
There is support for text strings, circles, arcs, points, and lines,
with an option for selecting line types.
Plot files created using \fBlibplot\fR are binary, and portable to
other machines.  Unfortunately,
due to the origins of this software on the PDP-11 and the VAX,
the integers are written in VAX (``Little-Endian'') byte order.
.PP
BRL's \fBlibplot3\fR(3) library provides a great many additional features
over the standard UNIX \fBlibplot\fR(3) library.
Colors may be specified as 24-bit RGB quantities. Additional routines
exist to plot using three coordinates. The two dimensional and three
dimensional integer-valued routines have been supplemented with routines
that take floating-point values. These floating point values are written
into the plot file in a transportable, machine-independent binary format
using \fBhtond\fR(3). The 3-D floating point routines have two interface
styles, one where each coordinate is a formal parameter to the
subroutine, and one where location in space is passed as a pointer to a
vector. The second form is especially convenient when using the vector
math routines.
.PP
The \fBlibplot3\fR(3) library provides a powerful and portable mechanism
for creating 2-D and 3-D plot files that are upwards (but not downwards)
compatible with the original UNIX \fBlibplot\fR(3) library.
In addition to the low level routines mentioned so far,
additional library routines exist for drawing coordinate axes,
scaling data, etc, so that this capability can be used for traditional
data plotting,
.[
Muuss terminal independent graphics package
.]
as well as for representing three dimensional wireframes.
.de Cs
.\" Set constant spacing mode.  Turn off with ".cs R"
.br
.\" The 2nd arg to .cs is a fraction X/36 of an "m" width
.cs R 18
..
.NH 1
VIEWING OF REMOTE DATA
.NH 2
The Motivation for Complicating Graphics by adding Networking
.PP
Collaboration is a very important aspect of
creative endeavors, both scientific and artistic.
Collaboration implies having more than one person involved,
which implies that there is probably more than one workstation
or computer involved, which implies that some form of machine
communication is required.
.PP
In ``the old days'',
graphics displays were either terminals attached to a mainframe with
an ordinary serial line, such as the Tektronix 4014,
or graphics displays were high-speed peripherals connected directly
to an I/O channel, such as the Evans and Sutherland PS-1.
Networking involved little more than
connecting all the terminals to \fIthe\fR computer.
In either case, sharing of data between users was automatic,
because all users used the same computer.
.PP
Such simplicity no longer exists.
In the modern computing environment, computers are everywhere.
To retain the benefits of synergistic interaction,
the barriers between machines must be removed.\(dg
.FS
\(dg For authorized users.
A sensible security posture must still prevail,
but that is the topic for some other paper.
.FE
.NH 3
The Modern Computing Environment
.PP
The state of most computer facilities in the late 1980s can be
best summarized by the word ``diversity''.
.[
Quarterman Notable Computer Networks
.]
A single computer is no longer able to serve the needs of an entire
facility;  in many cases a single computer may be allocated to serve
the needs of a small group of people, or even an individual.
Fortunately, the cost per unit of processing power has fallen
steadily and rapidly for the last fifteen years.
As a result of the increase in demand and decrease in cost,
computer systems have proliferated wildly, appearing in all aspects of
business, education, and government.
.PP
Proliferation of computer systems within organizations has served to
highlight the importance of networking.
Some organizations have opted to depend on a single vendor for all
their hardware needs;  such organizations have typically been able to
enjoy a reasonable degree of interoperability between their systems,
but often at a sharp expense of having to forgo more powerful or less
expensive offerings from other vendors.
Educational and government institutions, always forced to be very
short-term cost conscious, have often been forced to diversify.
Having a collection of dissimilar machines has complicated the task
of achieving communication between them.
.PP
Accomplishing communication between
two different vendor specific communication systems
has always been possible, although often expensive.
The development of the vendor independent TCP/IP protocol family,
.[
Feinler DDN Protocol Handbook
.]
the use of TCP/IP on the ARPANET,
and the development of fast, inexpensive local area network (LAN)
technologies like Ethernet
.[
Xerox ethernet specification
.]
has resulted in a
common communication standard that is in widespread use.
The TCP/IP protocol family took
the promise of Open Systems Interconnection (OSI)
from an ideal to a production capability in 1982,
and TCP/IP has been in widespread use ever since.
Through the use of the TCP/IP protocols,
computers from any manufacturer can communicate reliably,
effectively, and efficiently with all other computers on the network.
.PP
With processing power getting less expensive,
computers becoming physically smaller,
and the invention of Ethernet making local area networking inexpensive
enough to be practical in any facility,
the emergence of the workstation was inevitable.
Personal computers (PCs) and workstations share the
common trait of having a processor, main memory, and video display
dedicated to working on the problems of a single user.
A workstation is distinguished from a personal computer
mainly by the ability to multiprogram, ie,
the workstation has the ability to interleave the running of
more than one program without interference.
The ability to multiprogram is necessary to make a computer system into
a good network citizen.
The network environment is populated by many ``daemon'' programs
that provide services to, and demand services of other machines on
the network;  these programs use a small amount of processor time on
an ongoing basis to conduct those services.
.PP
Workstations are a commonplace item in all current computer installations.
Recent advances in VLSI have made it clear that the workstation
is the heir apparent to the throne once held by the ordinary terminal.
For example, the Sun-3/50 is a 1-1.5 million instruction per second
(MIPS) Motorola 68020 based machine with a Motorola 68881
floating point unit, 4 Mbytes of main memory,
a 10 million bit per second (Mbps) Ethernet interface,
and an 1152x896 pixel display -- yet as of this writing
the selling price in the USA is only four thousand dollars,
a fraction of the cost of a graphics terminal only a decade earlier.
Furthermore,
high performance workstations offer compute performance that
rivals that of many common minicomputers,
often at a small fraction of the price.
What is happening in the computer industry is that while
the super-minicomputers grow upwards to compete with the SuperComputers,
the workstations grow upwards to compete with the conventional
minicomputer.
.NH 3
Computer Categories
.PP
A properly balanced computer network has a mix of different resources.
For a laboratory working on science and engineering problems,
the resources typically fall into three categories:  SuperComputers,
super-minicomputers, and workstations.
It can sometimes be difficult to distinguish between different
types of computers, so a small diversion to define terms is necessary.
.TS
expand box;
c s s
l | l | l.
Processor Categorization
_
Category	MFLOPS	I/O
=
SuperComputer	>500	T{
High performance I/O
channels with dedicated
I/O processors.
3-15 Mbytes/sec per channel.
T}
_	_	\^
Mini-SuperComputer	>50	\^
_	_	\^
Mainframe	3 - 50	\^
=
Super-Minicomputer	7 - 100	T{
Medium performance I/O controlled by
same CPU(s) that run user code.
1-3 Mbytes/sec per channel.
Hardware error recovery.
T}
_	_	\^
Minicomputer	0.5 - 7	\^
=
Workstation	0.1 - 3	T{
Low cost. Low speed I/O controlled by CPU.
Software error correction.
<1 Mbytes/sec per channel.
T}
_	_	\^
Microcomputer	0.0 - 0.5	\^
.TE
.IP "A SuperComputer\ 
represents the fastest processing and I/O capabilities
available.  The instruction set is optimized for scientific processing.
The SuperComputer processor is best utilized 
primarily for computationally intensive
processing loads, because of the long time needed to quench the processor
pipelines and save the state of the processor.  Due to the complexity
of the processor, some machine-specific software development is often
necessary to obtain maximum performance.
.IP "A Mini-SuperComputer\ 
is processor which executes the exact same instruction
set and binary software as a SuperComputer, but at a much slower speed
and at a much lower cost.  These systems are typically purchased to
offload software development from SuperComputers, or in circumstances
where a full SuperComputer is not required, or prohibitively expensive.
Having the exact same hardware complexity as a full SuperComputer,
similar machine-specific software development is necessary for maximum
performance.
.IP "A Mainframe computer\ 
typically has an instruction set balanced for
general-purpose programming and business applications, and usually
runs a business-oriented operating system.
.IP "A Super-Minicomputer\ 
is a very large mini-computer architecture device,
utilizing lower performance I/O devices and a collection of mini-computer
processing elements to make a system which has a substantial aggregate
processing power, but retains the low-overhead interactive orientation of
the minicomputer architecture.  Suitable for highly interactive tasks,
especially graphics display, as well as medium-scale numeric processing.
I/O devices are typically controlled by the same processors that runs users
programs, with hardware to perform I/O error correction.
.IP "A Minicomputer\ 
is intended for interactive computing, to be shared between a
small to medium community of users.  The hardware is optimized for
interactive processing, and generally low cost, with modest I/O bandwidths.
I/O devices are typically controlled by the same processor that runs users
programs, with hardware to perform I/O error correction.
.IP "A Workstation\ 
is intended for a single user at at time, with high
resolution monochrome or medium resolution color displays, and high
interactivity for a single application.  Only small amounts of processing
can be performed local to the workstation.
.IP "A Microcomputer\ 
is cheap.  Typically they lack multi-programming, and lack
TCP/IP network support, making the effective utilization of multiple
microcomputers difficult.  However, they are often extremely cost-effective
sources of computing power for clerical and administrative applications.
.NH 3
Network Concepts, Standard Protocols
.PP
The International Standards Organization (ISO)
has adopted a reference model for Open Systems Interconnection (OSI)
that divides network service into
seven layers of abstraction, to permit the different aspects of
a complete network implementation to be more easily considered.
.[
Tanenbaum computer networks
.]
.TS
center box;
n l l.
7	Application	(User)
6	Presentation	Data transformation
5	Session	Create connections
4	Transport	Segmentation & Reliability
3	Network	Routing
2	Link	Build bit streams
1	Physical	Electrical transitions
.TE
Except for the discussion about network bandwidth, which is primarily
determined by the characteristics of the hardware that is providing
a particular Physical Layer, this paper will simply assume that
a reliable, full duplex, eight-bit binary path is available between
pairs of computers, on demand, and that the Session Layer and lower layers
will handle all necessary details.
Instead, the focus is on the
activities in the Presentation and Application Layers,
to take the raw network capability, and use it to provide meaningful
graphics across network links.
.PP
There are quite a few network protocol families in existence that
can be viewed using the ISO model (with varying degrees of success).
Most are vendor specific, and two are standards devised by non-vendors.
.TS
center box;
l l.
Vendor	Protocol
_
Xerox	NS
Digital	DECNET
IBM	SNA
_
US Govt	TCP/IP
ISO	TP4
.TE
.NH 3
The Need for Bandwidth
.PP
People who are tasked with planning for network bandwidth requirements
generally are not prepared for the volume of data that graphical
applications generate and consume.
At BRL, there are two functional requirements for network bandwidth,
which define the minimum acceptable performance of a network,
and they are based on user impatience factors.
.[
IBM economic value of rapid response
.]
Any user should be able to achieve file transfer rates at the
rate of one megabyte per minute, or 140 Kbits per second.
A CAD user should be able to retrieve a 512x512 24-bit deep
image in 30 seconds, or 210 Kbits per second.
These values are bare minimums.  More reasonable performance
would be to transfer data at the rate of one megabyte per ten seconds,
or 838 Kbits per second, with the image transfer happening in three
seconds, or 2 megabits per second.
While the minimum values are no problem, the more desirable speeds
begin to approach the limits of conventional technology
(considering that the network supports hundreds of users).
.TS
center box;
c s
n l.
Various LAN Speeds (Mbps)
_
0.056	Point-to-Point
1.544	Point-to-Point
10	Ethernet
10	ProNet-10 Ring
16	DEC PCL-11
50	Hyperchannel
80	ProNet-80 Ring
100	FDDI
1200	UltraNet
.TE
.PP
Fortunately, for graphics problems there is an upper bound on
the amount of network bandwidth that is likely to be required per user.
In this case, the limiting factor is the bandwidth of the
human eye-brain system, which has been estimated at approximately
10 Gbps (10,000 Mbps).
For most purposes, the eye can be deceived with images of
significantly lower resolution.  The best example of this
is the 4.2 Mhz bandwidth of an NTSC television signal,
.[
Conrac raster graphics handbook
.]
which provides low but very usable image quality.
For computer graphics applications, more spatial resolution than
that provided by NTSC is often desirable, while 30 frames per second
animation rates are only needed for the most demanding work.
The table below provides data rates for several types of animation.
.TS
center box;
l l l
l n n.
Resolution	FPS	Data rate
_
512x512x24	15	95 Mbps
512x512x24	30	190 Mbps
1024x1024x24	15	375 Mbps
1024x1024x24	30	755 Mbps
1024x1024x24	60	1510 Mbps
2048x2048x24	24	2415 Mbps
.TE
.[
Ricart research bandwidth estimates
.]
There are not many computers that can compute animation sequences in
real time, although there are quite a few that have I/O systems
capable of reading, postprocessing, and displaying precomputed data at
these data rates.
The first line in the table describes a data rate comparable to
the maximum throughput of the new generation of LAN technology, the
proposed FDDI (fiber-optic data distribution
interconnect), 100 Mbps.
To date, only the Ultra device is capable of operating in
this range of data rates,
with a peak speed of 1200 Mbps,
but a variety of other devices of comparable bandwidth are presently
in development.
.[
Visualization in Scientific Computing
.]
.NH 2
The BRL Remote Frame Buffer Capability
.PP
The machine at BRL which had the nicest display hardware on it
had one of the slowest processors (a VAX 11/780).
To allow computation and display to occur on separate machines,
the \fBlibfb\fR frame buffer library was given support
for an extra type of display hardware:  the remote frame buffer.
The goal of this capability was to be able to direct graphics
operations to any display screen attached to the network, merely
by specifying in the \fIdevice\fR string
the name of the host computer and the name of the display device.
(Refer back to the section on \fBlibfb\fR for a discussion of the
three methods for providing the \fIdevice\fR string, and the syntax).
The details of establishing the network connection to the
appropriate computer, as well as the hardware-specific display support,
should all be entirely transparent to the user, except perhaps
for some performance variations.
.PP
Detection that the given \fIdevice\fR string indicates the use of a network
display is handled by the \fBlibfb\fR routines that implement the
fb_open() function;  rather than vectoring to a module to support
a local display, the fb_open() function forces all operations to
vector through the if_remote.c module.
When the if_remote.c module is called to open the framebuffer,
it assumes the role of a ``network client'', and
causes a network connection to be established
to the ``network server'' on the indicated remote machine.
The client then passes the hardware device specification to
the server.  The server process calls fb_open() on it's
copy of \fBlibfb\fR, and returns the status code (success or failure)
to the client, which returns the status code to the calling application.
.PP
The routines in the if_remote module
convert all the parameters to the
library calls into machine independent messages.
These messages are sent from the client to the server using
the ``package'' (PKG) protocol.
.[
MUUS87c Muuss CAD Package Release 1.21
.]
When the server receives a message, it determines which operation
is being requested, converts the machine independent values in
the message back into the native format of the server machine,
and then invokes the intended library routine.
.PP
The definition of the protocol exchanged by the client
and server machines is a level seven (Application) protocol
that depends on level six services (Presentation) for the
external data representation conversions.
The machine independent messages generated by this protocol are
transmitted via \fBlibpkg\fR, which provides level
six (Presentation) and level five (Session) services.
\fBlibpkg\fR provides capabilities for the transmission of both
synchronous messages and asynchronous messages.
Synchronous messages are used by the if_remote protocol
when return codes from the server are significant.
In this case, the client library routine does not return until
a reply message has been received from the server.
For significantly improved performance, the fb_write() routine
uses asynchronous messages, so that the client library routine
returns a ``success'' indication immediately, while the network
and server continue to work on displaying the data.
This allows the production of data and the display of the data
to proceed at the maximum possible rate.
With this design, the library depends on 
reliable delivery services from the level four
(Transport) services of the particular communication protocol.
.PP
One of the delightful features of this design is that the remote
framebuffer capability can be stacked to an arbitrary depth.
That is to say, one could specify the \fIdevice\fR as
.sp .5
.ce 1
host1:host2:/dev/\fIdevice_name\fR[#]
.sp .5
In this case, the original client would contact \fIhost1\fR as
a server.  \fIhost1\fR, in processing the remainder of the
\fIdevice\fR string, would vector to if_remote.c, thus
acting as a client, and would contact \fIhost2\fR, which would
perform as the actual server.
This sort of an arrangement was a good test of the symmetry of
the protocol, and ensured that no implementation details had been overlooked.
Stacking servers in this manner is rarely useful,
but the capability comes in handy just often enough so as to be
worth keeping around.
One use is to bridge across a change underlying protocols.
For example, assume
three hosts, A, B, and C, where A and B communicate via the TCP/IP
protocols, and B and C communicate via the DECNET protocols.
There is no way for an image to be transmitted directly from A to C.
However, in this case, machine B can act as an application-level
``relay'', and bridge the gap.
Another use for stacking servers is to bypass
network software or traffic routing problems, by forcing the data
to flow through a suitable intermediary.
.PP
By careful implementation, the use of remote framebuffers across
reasonably high performance network connections has performance
which is only slightly slower than local display performance.
The network framebuffer concept is a very powerful one;
once accustomed to it, users refuse to deal with more restrictive
environments.
This capability has been used to conduct data analysis at NASA
facilities, with live graphics transmitted from BRL computers,
miles away.
.PP
Will new standards like the
MIT X Window System will make the
BRL framebuffer library unnecessary?  In the short term, no.
X Version 11 does not support 24-bit color images, nor does it provide
a powerful enough interface for controlling many important framebuffer
operations such as the colormaps, pan and zoom.
However, to provide compatibility with X-based systems,
an if_X module
implementing the framebuffer library functions
is being developed, so that all systems that run X
will have \fBlibfb\fR support.
.NH 2
The Remote MGED Display Manager
.PP
In a manner similar to that just described for \fBlibfb\fR,
there is a special display manager for \fBmged\fR called dm-net.c
that communicates with an mged server process on a remote machine,
and facilitates network execution of \fBmged\fR.
This software is still considered experimental, but it has been
demonstrated to work correctly.
However, the proper level of interactive response depends on having
a network path between the client and the server with
low latency and low packet loss, so that display motions remain
smooth.
.PP
The underlying mechanism is very similar to that just described for
\fBlibfb\fR, with specially formatted messages carried via
\fBlibpkg\fR between the client and server processes,
and it will not be further described here.
.NH 2
The PKG Protocol
.PP
The ``package'' (PKG) protocol
is a critical ``enabling technology'' upon which a great many
BRL network applications are built.
The \fBlibpkg\fR services bridge levels six (presentation) and
five (session) in the OSI model, and
is layered on top of a virtual circuit provided
by the level five and four (transport) capabilities
of the native operating system.
One of the main purposes of \fBlibpkg\fR is to
insulate the applications programmer from the details of
establishing and managing a network connection.
.PP
\fBlibpkg\fR frees the programmer from concerns
about storage and buffer management, and other details
relating to the network communications internals, by
allowing exchange of messages of any size (up to 2\u32\d-1 bytes),
with automatic allocation of sufficient dynamic memory
for the receive buffer handled by the library.
Thus,  messages exchanged by users of
\fBlibpkg\fR may be of arbitrary size.  There is no requirement
for different messages of the same type to have the same size.
It is perfectly reasonable to transmit messages with no data,
for example, to signal that a particular event has happened.
The \fBlibpkg\fR interface supports 
a dynamic mix of synchronous and asynchronous message transmission
paradigms, allowing the implementation of traditional
blocking remote procedure call operations, as well as
permitting non-blocking operations (such as used by the fb_write()
operation described above), and ``surprise''
asynchronous operations (such as used by the fb_log() operation).
.PP
Each message is transmitted across the virtual circuit with
an eight byte header, to permit the receiving \fBlibpkg\fR routines
to properly receive and dispatch the message.
.sp 1
.Cs
.nf
.ce 999
.ne 1i
 0                   1                   2                   3 3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         Magic Number          |         Message Type          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                  Count of actual data bytes                   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
:                                                               :
:                             [Data]                            :
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
.ce 0
.fi
.cs R
.sp 1
All three header fields are transmitted in Big-Endian byte order;
the format of the data field is determined by the calling application
and is opaque to the \fBlibpkg\fR routines.
The \fImagic number\fR field is set to the value 0x41FE hexadecimal,
and is used by the
receiver routines to
defend against various system failures, such as an application
program writing data to the wrong file descriptor (the network connection,
in this case).
The \fItype\fR field is provided by the caller of pkg_send(),
and is used on the receiving end to appropriately dispatch the message:
either a blocking read is satisfied by the arrival of the appropriate
message type, or an asynchronous event is vectored through the pkg_switch
table and the message is dispatched to an application specific handler.
.PP
Typically, PKG is layered on top of a TCP connection, although PKG has
also been run over DECNET and X.25.
Multiple PKG connections per
process are supported, which is used by \fBremrt\fR.
When using TCP, the TCP option SO_KEEPALIVE
is enabled so that all communications failures and remote system
failures are noticed by the TCP layer after an appropriate time
interval, avoiding the need for application-level timeouts.
\fBlibpkg\fR
handles the incremental aggregation of received data into full messages,
freeing the calling application from having to handle the details of
short reads.
The Berkeley UNIX \fIselect\fR(3) system call provides the ability to
easily handle asynchronous communications traffic on multiple
connections.
.TS
center box;
c s
l l.
libpkg Routines
_
pkg_open	Open net conn to host
pkg_permserver	Be permanent server, and listen
pkg_transerver	Be transient server, and listen
pkg_getclient	Server: accept new connection
pkg_close	Close net connection
_
pkg_send	Send message
_
pkg_waitfor	Get specific msg, do others
pkg_bwaitfor	Get specific msg, user buffer
pkg_get	Read bytes, assembling msg
pkg_block	Wait for full msg to be read
.TE
This
protocol had originally been developed to make the remote \fBmged\fR
display possible, but later found uses in command and control
experiments,
.[
Chamberlain Muuss fire support application for addcompe
.]
distributed processing programs,
.[
Muuss rt and remrt shared memory parallel
.]
and other network applications.
The functionality provided by \fBlibpkg\fR is not significantly greater
than the functionality provided by the virtual circuit support
of the basic operating system, and amounts to to only about 1300 lines
of source code, yet it provides a convenient and worthwhile
abstraction of the underlying network mechanism that makes the
implementation of distributed applications very much simpler.
.NH 2
External Data Representation
.PP
When transmitting information on a network, careful attention
must be given to the format and order that data is transmitted in.
In a heterogeneous network
the internal processor architectures of the communication
endpoints are likely to be different, yet effective and accurate
transmission of data is still desired.
.PP
Moving binary data between disparate computer architectures is not
a new idea, and has been done for at least three decades.
So far, the communication has typically involved specific pairs of
computers, typically a mainframe or supercomputer sending results to
some kind of ``front-end''.  In such cases, one machine would
be given the task of always converting the data to and from the format
used by the other machine.
When constructing distributed applications that run on a wide
variety of hardware that can all communicate with each other,
it is no longer possible to predict in advance
how the machines may be coupled together.
.PP
Two strategies exist for coping with such heterogeneity.  Either (1) all
machines can agree to adopt a single standard for communication,
or (2) all machines must have the software to enable them to convert
internal data to the internal formats of all other machine types.
The first strategy is weakest when two computers of similar type
wish to communicate, where their internal format is not the standard
format.  Transmission of a block of data would therefore require
converting it to the standard format, sending it, and converting it
back to the internal format, when using the internal format would
have been acceptable \fIin this case\fR.
The second strategy always performs well when similar machines converse,
because no format conversion is required.  However, this strategy
creates a significant software coding and testing burden on
the implementors when
the number of different formats \fBN\fR
becomes significant, ie, \fBN\fR greater than 3,
because of the need to write \fBN**2\fR conversion routines.
Also, the task of adding support for a new type of machine becomes
formidable:  \fBN\fR new conversion routines must be written,
and every existing machine needs to be updated to support the
new format.
.PP
The overall strategy favored by the DARPA Network Research community
has been to establish a single standard representation
for each type of communication,
and then to have all machines communicate using that single standard.
The BRL CAD Package has continued in this fine tradition.
The external data representation described here
fits into level six (Presentation)
of the OSI reference model, and serves generally the same purpose as
the ISO X.409 Abstract Syntax Notation (ASN.1).
Note that both the data representation described here
and the Sun XDR standard
.[
Sun xdr standard
.]
use implicit typing, ie,
no indication of the data format becomes part of the external
data representation,
while the ASN.1 standard includes many details of the
of the data format and data structuring in the external
representation of the data.
.NH 3
Ordering and Numbering Bits in a Byte
.PP
In the discussion that follows, it is assumed that the
underlying hardware handles the disassembly and reassembly of
serial bit streams into 8-bit bytes in a transparent manner.
In this document, the left-most bit in a diagram is the high order
or most significant bit (MSB), and this bit is labeled bit zero.
For example, the following diagram represents the value 170 decimal,
0xCC hex, 0252 octal:
.sp 1
.Cs
.nf
.ce 99
.ne 1i
 0 1 2 3 4 5 6 7
+-+-+-+-+-+-+-+-+
|1 0 1 0 1 0 1 0|
+-+-+-+-+-+-+-+-+
.ce 0
.fi
.cs R
.sp 1
.NH 3
Transmitting sequences of bytes
.PP
The organization and order of transmission of data described in this
document is specified at the byte level.
The order of transmission of a group of bytes is
the normal order in which they are read in English.
In the following example, the bytes are transmitted in
ascending numerical order (ie, 1, 2, 3, ...).
.sp 1
.Cs
.nf
.ce 999
.ne 1i
 0                   1                   2                   3 3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|       1       |       2       |       3       |       4       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|       5       |       6       |       7       |       8       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|       9       |      10       |      11       |      12       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
.ce 0
.fi
.cs R
.sp 1
This convention is often referred to as the "Big-Endian" order
.[
Cohen holy wars and plea for peace IEN-137
.]
because the big (most significant) end of the binary number is
transmitted first.  Big-Endian notation is the standard order
in the TCP/IP protocols of the DARPA InterNet,
.[
Postel Internet Protocol Specification
.]
now formalized as MIL-STD-1777.
.[
MIL-STD-1777 Internet protocol
.]
The Big-Endian order
is also the internal format commonly used by IBM, Gould, Sun, SGI,
Alliant, Apple, and many other computers.
For these reasons, the Big-Endian order is used for
the BRL CAD Package external data representation.
.NH 3
Transmitting Integers
.PP
All integers are transmitted in Big-Endian,
or "network order", regardless of the internal representation of
the particular machines.
To send a  signed 16-bit quantity,
the number is represented in twos-compliment form, and the
byte sequence is:
.sp 1
.Cs
.nf
.ce 999
.ne 1i
 0                   1         1
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|       1       |       2       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|S|MSB                       LSB|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
.ce 0
.fi
.cs R
.sp 1
where "S" is the sign bit, MSB is the most significant bit,
and LSB is the least significant bit.
Byte "1" is transmitted first, and byte "2" is transmitted second.
.PP
To send a 32-bit signed integer,
the number is represented in twos-compliment form, and the
byte sequence is:
.sp 1
.Cs
.nf
.ce 999
.ne 1i
 0                   1                   2                   3 3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|       1       |       2       |       3       |       4       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|S|MSB                                                       LSB|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
.ce 0
.fi
.cs R
.sp 1
.PP
Unsigned integers are encoded similarly, occupying either 16 or 32 bit
positions.
.NH 3
Transmitting Strings
.PP
Character strings are represented using the ASCII character set.
They are placed into the external message without regard for
alignment, and are terminated by a zero byte (all bits off).
There is no limit to the length of a character string.
.sp 1
.Cs
.nf
.ce 999
.ne 1i
+---------+---------+---//---+---------+---------+-+-+-+-+-+-+-+-+
|  Byte 1 |  Byte 2 |  ...   |Byte N-1 | Byte N  |0 0 0 0 0 0 0 0|
+---------+---------+---//---+---------+---------+-+-+-+-+-+-+-+-+
.ce 0
.fi
.cs R
.sp 1
.NH 3
Floating Point Representations
.PP
The most portable representation of a floating point number would
be to encode it as a printable ASCII string.  However,
using a printable ASCII representation
would require 16 decimal digits to contain the mantissa, two digits
for the signs of the exponent and mantissa, three decimal digits for the
exponent, and at least two additional characters for the mantissa decimal
(radix) point and the exponent marker.  Such an encoding would look
like "-1.234567890123456e\-103", and requires at least 23 characters
(bytes).
.PP
Exchanging floating point numbers in binary form is a very
compact representation, using a small and constant amount of bandwidth:
eight bytes.  Therefore, using an eight byte binary format
requires about one third the network bandwidth as the printable
ASCII representation.  In addition to the significant bandwidth
savings, using a binary format is also likely to require significantly
less computer time for format conversion to machine-specific formats,
because of the strong similarities in internal binary formats and
the simplicity of using bit operations to implement the conversions.
.NH 3
Binary Floating Point Representations
.PP
Virtually all modern computers utilize twos compliment binary format
as the internal representation for integers.  Thus, the only issue with
transmitting integers was to define a byte order;  the underlying
representation was already common to all the computers.
In the case of floating point numbers, the situation is more complex.
Not only does the byte ordering need to be defined, but the
network representation of the floating point number itself
needs to be defined.
.PP
Just as integers come in various sizes, floating point numbers
also come in different sizes, the most common of which are called
``single precision'' (typically 32 bits), and
``double precision'' (typically 64 bits).
This discussion will focus entirely on the double precision format.
.PP
Between the various vendors, there have been quite a few different
internal representations for floating point numbers.  Typically,
each format has several advantages over the others.
The main alternatives are:
.IP 1)
IEEE Standard 754 double precision floating point is most popularly
used in new microprocessor designs, and is found in a wide variety of
new computer systems, especially new workstations.
As stated by Kahan
.[
Kahan mathematical library functions
.]
"This standard is on its way to becoming more widely adopted
than any other design for computer arithmetic.  VLSI chips
that conform to some version of that standard have been produced
by a host of manufacturers, among them
Intel i8087, i80287,
Motorola 68881,
National Semiconductor 32081,
Weitek WTL-1032 and WTL-1165,
Western Electric (AT&T) WE32106,
Zilog Z8070"
and MIPS FPS2000.
The radix is 11 bits wide, in base 2, excess-1023 format.
The fraction has 52 significant bits,
a ``hidden'' leading one to the left of the high-order bit,
and a radix point to the \fIright\fR of the hidden bit,
giving roughly 16 decimal digits of significance.
Overflow threshold is 2.0**1024, or 1.8e+308.
Underflow threshold is 0.5**1022, or 2.2e\-308.
There are explicit representations for \fIinfinity\fR,
and not-a-number (NaN), both Signaling Nan and Quite NaN.
.sp 1
.Cs
.nf
.ce 999
.ne 1i
 0                   1                   2                   3     6 6
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1   2 3
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+//-+-+
|S|   Exponent          |  Fraction                                   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+//-+-+
.ce 0
.fi
.cs R
.sp 1
When the exponent is 2047 (all bits set), a special condition exists.
If all fraction bits are zero, the value is infinity times the sign.
If the fraction is non-zero, and the MSB (bit 12) of the fraction
is zero, then this represents a signaling NaN, otherwise
this represents a quiet NaN.
.IP 2)
IBM System/360 double precision (eight-byte, or ``long'') floating
point,
.[
IBM 370 principles of operation
.]
is also used by other vendors such as Gould. The radix is 7 bits
wide, in base 16, excess-64 format. The fraction has 56 significant
bits, having a radix point to the left of the high-order bit,
with no hidden bits, giving roughly 17 decimal digits of significance.
Overflow threshold is 16.0**63, or 7.2e+75.
Underflow threshold is 16.0**-65, or 5.4e\-79.
There is no representation of \fIinfinity\fR.
There is no representation of not-a-number (NaN).
.sp 1
.Cs
.nf
.ce 999
.ne 1i
 0                   1                   2                   3     6 6
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1   2 3
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+//-+-+
|S|   Exponent  |  Fraction                                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+//-+-+
.ce 0
.fi
.cs R
.sp 1
.IP 3)
Digital Equipment Corporation's VAX-11 double-precision ``D'' format,
also used on the earlier PDP-11 machines.
The radix is 8 bits wide, in base 2, excess-128 format.
The fraction has 56 significant bits, with a ``hidden''
leading bit, giving roughly 17 decimal digits of significance.
Overflow threshold is 2.0**127, or 1.7e+38.
Underflow threshold is 0.5**128, or 2.9e\-39.
This range of representable numbers is comparatively narrow.
There is no representation of \fIinfinity\fR.
There are reserved values to represent signaling NaN.
.[
Digital pdp-11/70 processor handbook
.]
.sp 1
.Cs
.nf
.ce 999
.ne 1i
 0 0   1 1 1 1   3 3 3 3   4 4 4 4 5 5 5 5 5 5 5 5 5 5 6 6 6 6
 0 1   4 5 6 7   0 1 2 3   6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3
+-+-+//-+-+-+-+//-+-+-+-+//-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Fract_D | Fract_C | Fract_B |S|    Exponent   |   Fract_A   |
+-+-+//-+-+-+-+//-+-+-+-+//-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
.ce 0
.cs R
.fi
.sp 1
Note that the actual fraction is made by combining (left to right)
Fract_A, Fract_B, Fract_C, and Fract_D.
Also note that this ordering is neither pure ``Big-Endian'' or
``Little-Endian'', but a peculiar mixture of the two, being
Big-Endian within each 16-bit unit, but Little-Endian when combining
the 16-bit sections.
.IP 4)
Cray Research Incorporated full precision floating point.
The radix is 15 bits wide, in base 2, excess-16384 format.
The fraction has 48 significant bits, having a radix point
to the left of the high-order bit, with no hidden bits,
giving at least 14 decimal digits of significance.
Overflow threshold is 2.0**8191, or 1.0e+2466.
Underflow threshold is 0.5**8192, or 1.0e\-2466.
There is no representation of \fIinfinity\fR.
There are reserved values to represent NaN.
Whether the NaN is signaling or quiet depends
on the setting of the hardware Floating-point Mode flag.
.[
Cray-1 mainframe reference manual
.]
.sp 1
.Cs
.nf
.ce 999
.ne 1i
 0                   1                   2                   3     6 6
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1   2 3
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+//-+-+
|S|          Exponent           |          Fraction                   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+//-+-+
.ce 0
.fi
.cs R
.sp 1
.IP 5)
Digital Equipment Corporation's VAX-11 double-precision ``G'' format.
This format is considered optional, and
is not supported on all VAX machines, with
hardware support available only on some models.
Thus it is much less frequently used than the DEC ``D'' format, above,
even though it is very close to the IEEE format.
The radix is 11 bits wide, in base 2, excess-1024 format.
The fraction has 52 significant bits, having
a ``hidden'' leading one to the left of the high-order bit,
and a radix point to the \fIleft\fR of the hidden bit,
giving roughly 15 decimal digits of significance.
Overflow threshold is 2.0**1023, or 8.9e+307.
Underflow threshold is 0.5**1024, or 5.5e\-309.
There is no representation of \fIinfinity\fR.
There are reserved values to represent signaling NaN.
.[
VAX Technical Summary
.]
.sp 1
.Cs
.nf
.ce 999
.ne 1i
 0 0   1 1 1 1   3 3 3 3   4 4 4 4 5 5 5 5 5 5 5 5 5 5 6 6 6 6
 0 1   4 5 6 7   0 1 2 3   6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3
+-+-+//-+-+-+-+//-+-+-+-+//-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Fract_D | Fract_C | Fract_B |S|      Exponent       |Fract_A|
+-+-+//-+-+-+-+//-+-+-+-+//-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
.ce 0
.fi
.cs R
.sp 1
.IP 6)
The native double-precision format of the
Convex machines (some of which also have IEEE capability), is
a Big-Endian version of
Digital Equipment Corporation's VAX-11 double-precision ``G'' format
described above, and has the same properties.
.[
Convex Vector C Compiler
.]
.sp 1
.Cs
.nf
.ce 999
.ne 1i
 0                   1                   2                   3     6 6
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1   2 3
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+//-+-+
|S|   Exponent          |  Fraction                                   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+//-+-+
.ce 0
.fi
.cs R
.sp 1
.PP
The salient features of these different formats are summarized
in this table:
.TS
center box;
l n n n | n n.
Format	Radix	Fraction	Decimal	Overflow	Underflow
	bits	bits	digits	threshold	threshold
_
IEEE	11	52+1	16	1.8e+308	2.2e\-308
IBM 360	7	56,53	17	7.2e+75	5.4e\-79
VAX ``D''	8	56+1	17	1.7e+38	2.9e\-39
Cray	15	48	14	1.0e+2466	1.0e\-2466
VAX ``G''	11	52+1	15	8.9e+307	5.5e\-309
Convex	11	52+1	15	8.9e+307	5.5e\-309
.TE
.NH 3
A Standard for Network Floating Point Numbers
.PP
From all of the possibilities just mentioned,
the ANSI/IEEE Standard 754 for Binary Floating-Point Arithmetic
.[
IEEE binary floating-point
.]
was chosen.
This choice was made on the basis of the IEEE format's
overall technical merits,
including extended dynamic range,
it's status as an accepted and respected standard,
and on the basis of it's
widespread implementation in common graphics workstations.
This choice was made even though this format has a few
less bits in the fraction than other formats, most notably that of the VAX,
and thus delivers slightly less accuracy.
The differences in accuracy were considered to be small enough to
not matter;  depending on more than 15 decimal digits of accuracy
is an impediment to portability in any case, and can not be improved
upon in the external data representation.
.PP
Therefore, when communicating binary floating point data across a
network connection, all floating point values must be converted from the
host-specific representation to the
64-bit IEEE representation before transmitting it, and upon reception,
all floating point values must be converted from the 64-bit IEEE
representation to the host-specific representation.
When the IEEE representation is transmitted, it is to be transmitted in
Big-Endian byte order.  That is, bits 0-7 are transmitted in the
first byte, bits 8-15 are transmitted in the second byte, and so forth.
.NH 3
Floating Point Converter Interface
.PP
A pair of library routines have been written (in the C
.[
Kernighan Ritchie C programming language Bell System Technical Journal
.]
.[
Kernighan Ritchie Lesk Johnson C programming language 
.]
programming language) to convert between a local host computer's
C \fBdouble\fR data type (typically 64-bit double precision)
and the \fInetwork floating point format\fR, which is the
64-bit IEEE double precision representation, in ``network''
(ie, Big-Endian) order.
The subroutines are defined in C as:
.sp 1
.nf
.in +1i
.ft B
htond( netptr, hostptr, count );
unsigned char *netptr;
unsigned char *hostptr;
int count;
.sp .5
ntohd( hostptr, netptr, count );
unsigned char *hostptr;
unsigned char *netptr;
int count;
.ft R
.in -1i
.fi
.sp 1
The names are acronyms for Host TO Network Double (\fBhtond\fR)
and Network TO Host Double (\fBntohd\fR),
in the style of the existing Berkeley network data representation
subroutines, such as \fBhtons\fR (host to network short),
\fBhtonl\fR (host to network long), etc.
However, unlike their Berkeley counterparts, these routines
have two significant differences.  First, neither the input
or output buffers need to be word aligned, to permit the greatest
flexibility in converting the data, even though this may impose a
speed penalty on some architectures.
This property can be especially useful when creating tightly packed
network messages containing mixed data types, freeing the programmer
from the necessity to add ``pad'' bytes to ensure alignment.
The concept of alignment is hard to make both portable and storage efficient,
and is thus properly avoided as a subroutine constraint.
Second, these subroutines operate on a sequential block of
numbers, rather than on just a single value.
This allows entire arrays to be conveniently converted with
a single subroutine call, thus improving the clarity of the
code, saving on subroutine linkage execution costs, and
allowing the hope for effective vectorization of the subroutines
on those machines with vector hardware capability.
.NH 3
Significance
.PP
The significance of having established a standard for the exchange
of floating point numbers should not be underestimated.
Most existing network applications restrict themselves
to the transmission of integers, either for reasons of efficiency,
or to simplify the conversion problem.
For the purposes of scientific computing, converting data to
integers is rarely acceptable, because of the requirement for
having a large dynamic range available while also maintaining
many digits of significance.
.PP
Having the capability of conveniently and efficiently
transmitting floating point numbers over the the network
has given rise to a whole variety of important new capabilities.
One example which has already had significant impact,
was the result of 
including the 3-D floating point plotting capabilities of \fBlibplot3\fR,
which provides a machine independent ``meta-file'' capability
for plots of full resolution, allowing arbitrary amounts of zooming
to examine extraordinarily fine details in the data.
.PP
The \fBhtond\fR and \fBntohd\fR
routines have been implemented and tested on a wide variety of
machine families, including the Cray X-MP, Cray-2, Alliant, Sun-3,
Silicon Graphics 3-D and 4-D, DEC VAX, and Gould PowerNode.
When combined with the Berkeley routines
\fBhtons\fR, \fBntohs\fR, \fBhtonl\fR, and \fBntohl\fR,
or with the \fBlibpkg\fR versions
(\fBpkg_gshort\fR, \fBpkg_pshort\fR, \fBpkg_glong\fR, and \fBpkg_plong\fR),
it therefore becomes possible to easily read and write messages
in a portable, machine independent format.
This capability is the cornerstone of
the successful growth of the BRL CAD Package software
into networked environments.
.NH 3
Comparison with Sun XDR
.PP
This document has described a standard for the encoding of data,
without making any attempt to provide a mechanism for describing
the data, either in the program source code, or in the transmitted
external representation.
The format described in this document specifically assumes that
the external representation does not have any alignment restrictions,
so that strings and binary data may be mixed freely.
.PP
The Sun Microsystems XDR standard
.[
Sun xdr standard
.]
is very similar to the encoding proposed here, with a few significant
differences.
It is worthwhile noting that the BRL encoding was selected in 1984,
and developed independently of the work at Sun.  It is therefore
interesting that, with the exception of alignment issues and
the representation of strings, the same external format
was selected.
This is almost certainly due to the exhortations of Cohen.
.[
Cohen a plea for peace IEN
.]
Therefore, with some careful coding
it is possible for XDR software and BRL CAD Package
software to interoperate.
.PP
XDR assumes that all quantities are aligned on four byte
boundaries, for the greatest efficiency of processing on 32-bit
word-addressable computers.
XDR also prefixes all variable length fields with a 32-bit length
indicator, and then pads the variable field.
These choices consume some
additional network bandwidth, but in most cases the difference in
bandwidth requirements is small (for cases seen to date,
XDR generates between 0% and 20% more bytes in the external representation).
.PP
Sun XDR goes a step further than just defining a standard representation
for data exchange;  XDR also provides a syntax for describing the
external data structure.
This description is preprocessed to produce compilable code that
automaticly converts data between the internal and external formats,
using the description of the data to be exchanged.
This is a very nice feature indeed, 
and will certainly make the exchange of complex data structures
much more convenient than the explicit conversion strategy.
The XDR preprocessing step
presumes a much closer coupling to the compiler environment than
the BRL CAD approach, which may result in some additional (one-time)
portability efforts to support each new type of system.
In summary, it seems that either interface can be readily used,
with the BRL CAD interface being easier to use for simpler data
structures, with a few less restrictions, while the Sun XDR is
far better at making the handling of complex data structures easy.
.NH 1
PIPELINE ORIENTED DISTRIBUTED COMPUTATION
.PP
Carefully dividing application software into appropriate tools
provides significant rewards for software developers and maintainers,
in terms of decreased program complexity, decreased incidence
of bugs, and increased maintainability.
.[
Kernighan software tools
.]
For the user, having the ability to combine a set of software tools
in arbitrary ways provides the potential for performing functions
never imagined by the original software designers.
.PP
Traditional operating systems require that
each command in a multi-command sequence be run sequentially,
with the output from each command stored in a temporary file,
for input into later steps.
In the UNIX system, this style of operation can be improved upon,
if the data passing between pairs of commands
was originally stored in a single temporary file.  In this case,
pairs of commands can be connected together using UNIX pipes.
In so doing, not only has the requirement for managing the temporary files
(and their attendant storage) been eliminated, but a degree of parallelism
has been introduced as well.
As an example, this three command sequence:
.sp 1
.in +1i
.ft B
.nf
step1  < input  > tmp1
step2  < tmp1  > tmp2
step3  < tmp2  > output
rm tmp1 tmp2
.fi
.ft R
.in -1i
.sp 1
can be reduced to this pipeline:
.sp 1
.ft B
.ti +1i
step1 < input | step2 | step3 > output
.ft R
.sp 1
The standard UNIX shell notation is used here, with
\fB<\fR meaning ``read from'' (redirect standard input),
\fB>\fR meaning ``write to'' (redirect standard output), and
\fB|\fR meaning ``pipe standard output from command on the
left into standard input of command on the right''.
.PP
Systems with multiple processors have existed since the 1960s.
Experimental UNIX systems with multiple processors existed
in the late 1970s
.[
Muuss multi-processor unix system
.]
The Dual-VAX work at Purdue University
.[
Goble dual VAX
.]
popularized the use of multiple processors with the UNIX system,
and paved the way for widespread commercial support.
Multiprocessor UNIX systems are now common, with most major vendors supporting
at least one model with multiple processors.
Vendors like Alliant and Sequent have based their entire product lines
on parallel processing.
Utilizing the UNIX pipeline constructs permits the convenient and
transparent exploitation of local multiprocessor resources with
no reprogramming.
.NH 2
Tool-Oriented Image Processing
.PP
A large number of simple tools for manipulating images and framebuffers
are provided in the CAD package.  They have been written in the
traditional UNIX Software Tools fashion: each performs a single
basic function, and is intended to be
connected together with other tools to achieve an overall
goal.  A fair amount of effort has gone into making a standard interface
to the tools.  All tools provide a usage
message if executed with no arguments,
and a common collection
of flags is defined for all of the tools.
If a tool expects to see binary data, and discovers that the
binary data is routed to a user's terminal, the tool aborts, rather
than splattering a binary stream on an unsuspecting terminal.
.PP
The use of software tools for computer graphics is not
new.  Recent systems advocating this tools based approach include
those of
Duff
.[
Duff
.]
and Peterson.
.[
Peterson Utah Raster Toolkit
.]
The BRL CAD Package has proven to be extremely flexible as
as result of this approach.  Generally, a new tool is
added whenever the existing ones are found to be inadequate.
Success can be claimed if users can easily
achieve day-to-day tasks without having to write any new programs.
.NH 2
File Formats
.PP
The effective construction of software tools depends on having
agreed upon standards for the format of the data that is passed between
the various tools.
It is important to note that UNIX itself does not impose any
form on data files;  there are no end of record markers or other
operating system generated information.  Therefore, the
interpretation of the bytes within a file is implicit, and
left entirely up to the programs which are used to read the file.
The tools that are provided with the UNIX system are primarily
intended for dealing with data stored in ASCII text files.
These tools are quite powerful in their problem domain,
but representing graphics constructs in a printable ASCII form
typically would represent an unacceptable performance penalty.
Therefore, it is necessary to define a variety of file formats
that are more appropriate to the graphical tasks at hand.
For efficiency reasons, binary formats are generally to be preferred,
both in terms of being a compact representation, and also requiring
the least processing time to handle by virtue of needing little or
no conversion before use.
.PP
Significantly more stringent requirements are placed on the
specification of file formats if files containing binary data are to be
portable between different types of machines without need for
explicit data conversion.
.NH 3
Storing Images:  pix(5) and bw(5) files
.PP
By far the most common image formats are
either eight bit per pixel black and white \fBbw\fR(5)
and 24-bit per pixel color \fBpix\fR(5) formats.
The files have the simplest possible format, in order to
enable the rapid development of new tools.
As a result of pixel arrays being stored tightly packed in memory
in the well defined order of
RGB, the RGBpixel definition is a useful format for storing pixels in files.
RGBpixels are tightly packed arrays with three
\fBunsigned char\fR elements.
The values in each
byte are viewed as intensities from 0 (off), through 255 (full on).
Therefore, \fBpix\fR(5) format picture files are defined as being
a sequence of RGBpixels.
The ordering of the pixels in the file
is first quadrant Cartesian, and
corresponds to the pixel writing
order of \fBlibfb\fR.  The first pixel has coordinates (0,0) and
is the lower left corner of the image.  The second pixel has
coordinates (1,0), and subsequent pixels are stored left-to-right.
The first pixel after the end of the first scanline has coordinates (0,1).
Subsequent scanlines are stored bottom-to-top.
This can be more simply thought of
as a byte stream composed of repeated RGB triples, eg, RGBRGBRGB.
Because \fBpix\fR files are defined as a byte stream, they are
inherently portable to all machines with eight bit bytes.
.PP
A close relative of the \fBpix\fR(5) format file is the \fBbw\fR(5)
format file.  This is used for storing eight bit deep monochrome
(``black and white'') images.  \fBbw\fR files are stored in the
same first quadrant layout as \fBpix\fR files, with the exception
that each pixel is only one byte wide instead of being three bytes wide.
.PP
The \fBpix\fR(5) and \fBbw\fR(5) file formats have no header.
The use of a simple headerless image format
is the only fundamental design choice that has
continued to be debated.
The primary advantage of the headerless format is
the ease of connecting tools together with pipes,
and the ease of creating new tools without having to
use I/O routines specific to the reading and writing of images.
Each program is simply handed data, and the interpretation of the data
is implicit in the definition of the tool used and the active options.
This also allows the whole wealth of UNIX tools to be applied
to images as well,
as there is no special-purpose header that has to be skipped.
More importantly, tools don't have to know how to do the
``right thing'' with the header information, in the presence of
seemingly contradictory information from either the definition of the
tool or from the command line options.
For example, processing a \fBpix\fR image with a \fBbw\fR tool like
\fBbwmod\fR is often done.
If files had headers, tools would almost certainly 
balk at being feed input with the wrong pixel depth, which in this
example would be very frustrating, yet having tools infer how to
do the ``right thing'' could also be
extremely complicated if the header contains very much
information, such as with the NASA FITS image format.
.PP
Having ``raw'' headerless data has its price however.
It is difficult to tell whether a given image is color or not, what
its dimensions are, etc.
Consistent file naming conventions
(using ``.bw'' or ``.pix'' as filename suffixes)
address the first issue;
doing most work in ``standard sizes'' of 512x512 or 1024x1024
pixels helps alleviate the second issue.
In general,
only the scanline length needs to be known, as the number of scanlines
can then be found by dividing the file size by the scanline length.
The algorithms of many tools simply
run until all of the data is gone, and some don't even care
about scanline lengths at all, so the inconvenience
of having headerless files tends to be minor.
Using a good Shell with history and commandline screen editing features,
or writing sets of simple goal-specific Shell scripts
tends to minimize the aggravation of having to repetitively specify
unusual image sizes on every command.
.NH 3
libplot3 Files
.PP
The original UNIX \fBplot\fR(5) format file was machine independent,
although with the unfortunate choice of Little-Endian (VAX) byte
ordering for the 16-bit integers.
The extensions to the file format to provide all of the additional
features (3-D, floating point coordinates, color) described above
were all done in such a way to preserve the machine independent property
of UNIX-plot files.  In this way, not only is it possible to view
plot files on displays attached to local computers, but it is also
very convenient to produce images on remote displays.
.NH 3
Image Compression:  The Utah RLE Format
.PP
Another format for storing images is the
University of Utah Run Length Encoded \fBrle\fR images,
which typically have a file suffix of ``.rle''.
This is the standard image format of the Utah Raster Toolkit.
.[
Peterson Raster Toolkit
.]
Because this format typically requires less disk storage than
the equivalent \fBpix\fR or \fBbw\fR file, and the compression
operation takes some processor time, this format is typically
used for long-term image storage, and image exchange with
other institutions that use the Utah Raster Toolkit.
Institutions that have neither the Utah nor BRL image processing
tools often have an easier time importing images in \fBpix\fR format,
because it is so simple.
.NH 3
Model Databases
.PP
Model databases are normally stored in a machine-specific binary form
with a typical filename extension of ``.g'',
with portable ASCII versions of those databases
having a filename extension of ``.asc''.
It is beyond the scope of this paper to describe the format
in detail, other than to note that \fBmged\fR and \fBlibrt\fR
know how to read the model databases, and that programs
that wish to create geometry using procedural methods
may do so with the services of the \fBlibwdb\fR library for
writing databases.
.NH 2
Framebuffer Tools
.PP
Most image manipulation and processing is performed
either on data streams, or on disk files.  This was done in order
to separate the operations for handling a display device
from the generic operations of image handling.
A common beginning to a processing pipeline is a \fBfb-pix\fR
command to get an image from a framebuffer, just as a common
end of processing pipeline is a \fBpix-fb\fR command to
display the final result.
Several useful operations can be performed with
direct interaction with a framebuffer,
so some device independent tools are provided,
including tools to allow changing colormaps,
panning and zooming through an image, moving a pointer, adding labeling, etc.
Where tools require the user to
move a cursor or the image and cursor support is not available,
both \fBemacs\fR-style and \fBvi\fR-style keyboard motion commands are
accepted by all programs.
.TS
center box;
c s
l l.
Selected Framebuffer Tools
_
fb-pix	framebuffer to color image
fb-bw	framebuffer to black and white
fb-cmap	read a framebuffer colormap
fbcmap	can load several "standard" colormaps
fbclear	clear to an optional RGB color
fbgamma	load or apply gamma correcting colormaps
fbzoom	general zoom and pan routine
fbpoint	select pixel coordinates
fblabel	put a label on an image
fbcolor	a color selecting tool
fbscanplot	scanline RGB intensity plotter
fbanim	a "postage-stamp" animator
fbcmrot	a colormap rotator
fbed	a framebuffer image editor
.TE
.NH 2
Image Manipulation
.PP
A substantial collection of tools for image manipulation are provided
as part of the package.
These can generate statistics, histograms, extract parts of
an image, rotate, scale, and filter them, etc.
Space does not permit a discussion of all of the type of manipulation
supported, but some of the more interesting tools
are listed in the table.
.TS
center box;
c s
l l.
Selected Image Tools
_
pixstat	statistics - min, max, mean, etc.
pixhist	histogram
pixhist3d	RGB color space cube histogram
pixfilter	apply selected 3x3 filters
pixrect	extract a rectangle
pixrot	rotate, reverse, or invert
pixscale	scale up or down
pixdiff	compare two images
pixmerge	merge two/three images
pixtile	mosaic images together
gencolor	source a byte pattern
bwmod	apply expressions to each byte
.TE
.NH 2
Format Conversion
.PP
The \fBN**2\fR problem of format
conversion between all the different ``external'' file formats
is simplified by providing tools to convert all external
file formats into the simple \fBpix\fR and \fBbw\fR formats.
A selection of these conversion tools is listed in
the table below.
In all of the tables, the tool for the reverse conversion
is omitted, e.g. in addition to the \fBrle-pix\fR tool,
there is also a \fBpix-rle\fR for converting color
images into \fBrle\fR format.  Also, only the color (\fBpix\fR) version of
a tool has been listed, even though most have black and white (\fBbw\fR)
equivalents.
.PP
Most of the tools listed have a wide variety of
options consistent with their basic function.
For example, the tool to convert a
color image to a black and white image (\fBpix-bw\fR),
allows a selection of color blending values to be selected:
equal weighting, NTSC weighting, or ``typical CRT'' weighting.
It also allows arbitrary weights to be given for selecting
or mixing of the color planes in any way desired.
.TS
center box;
c s
l l.
Selected Format Conversion Tools
_
g2asc	model database to portable ascii form
bw-pix	black and white to color image
bw3-pix	three black and whites to color RGB
rle-pix	Utah's RLE format to color image
ap-pix	Applicon Ink-Jet to color image
sun-pix	Sun bitmap to color or black and white
mac-pix	MacIntosh MacPaint bitmaps to color
.TE
.NH 2
User Interface
.PP
Using software tools effectively comes with experience.
The BRL CAD Package has tried to ease the difficulty of
learning a new set of tools by using a common set of flags
and common tool naming conventions throughout the
package.
The ``user interface'' is ultimately the Unix shell,
and its conventions for establishing pipes, passing arguments
to programs, etc.
A shell with history recall and screen-oriented command editing,
such as the \fBtcsh\fR, is a major convenience
when constructing complicated command pipelines.
.PP
Constructing very complex interconnections between processing tools
from the command line is sometimes difficult.  One limitation
is the single input, single output notion of a Unix pipe.
Image manipulation often calls for three or more channels
of data, as with \fBpix-bw3\fR.
The most common solution to this problem is the use of intermediate
files.  Other approaches include extensions to the \fBtee\fR program,
or a special tool such as \fBchan\fR
.[
Moore CARL
.]
which demultiplexes a
stream, feeds each channel to a different program, and remultiplexes
the results.
.NH 2
Pipeline Processing
.PP
Systems which facilitate
the coupling of dataflow oriented tools
allow complex custom applications to be
put together without writing any code.
Consider this simple image processing example:
.sp
.in +1i
.ft B
.nf
.ne 1i
pixinterp2x -s512  < image.pix |  \\
pixfilter -s1024 -lo |  \\
pixmerge -n 63/0/127 -- - background.pix |  \\
pixrot -r -i 1024 1024 | \\
pix-fb -h
.fi
.ft R
.in -1i
.sp
which roughly says:
take the file 512x512 color picture in \fIimage.pix\fR
and perform bi-linear interpolation to increase the image size to
1024x1024, then low-pass filter the large image with a 3x3 filter kernel,
composite the filtered image with an existing color picture in
\fIbackground.pix\fR with background replacement done on all
pixels with RGB value 63/0/127,
then rotate the image 180 degrees and display it on the
current frame buffer.
.PP
Note how no intermediate images are stored in disk files throughout
the whole procedure.
For this example this may not be significant
because the image at most stages required only 3
Mbytes of storage,
but this becomes more important when manipulating 400Mbyte image
data, which is processed in exactly the same way.
.PP
It is also worth noting that for larger images, this type of
image processing can take a significant amount of processing time.
While operations like this could be performed in a batch mode,
there is a significant advantage to being able to observe the
progress of the computation.
As the results arrive on the display, the opportunity exists to
abort the whole process if something is going wrong.
In addition to the savings of significant amounts of
computer time, there is also the potential to save ``people time''
as well, by increasing the number of cases that can be attempted
within a single day, and thus increasing the speed of the
project (assuming that the computer processing requirements are part
of the critical path of the project).
.PP
This type of processing interactive computing works most effectively in an
environment where each graphics user has their own multi-window workstation
(not necessarily a graphics workstation)
so that the user can switch to another
window and continue working while
continuing to monitor the progress of the graphics output.
For example, a group of ten people, each with a dedicated inexpensive
monochrome workstation, might share the services of two or three color
workstations, using the network capabilities described below.
.NH 2
General Network Computing
.PP
Using UNIX interactively on a Cray is pretty heady stuff.
Simply being
able to open a "Cray X-MP" window or a "Cray-2" window on a workstation,
and having the same environment
(Shells, screen editors, compilers, source code tools, TCP
networking, etc.), as on all the rest of the the network machines
(Suns, SGIs, VAXen, Goulds, Alliants, etc)
can be worth a lot.
However, being able to \fBrsh\fR an image processing
command over to a Cray without having to make special arrangements to
put the files over on the Cray
first, or having to submit a batch job, allows the power of
SuperComputing to be harnessed without having to pay a stiff premium
in inconvenience.
For people acquainted with the Berkeley UNIX networking capabilities,
the capabilities described in this section may seem trivial,
and almost not worthy of mention.
People not acquainted with the ease and power of
the Berkeley \fBrsh\fR remote shell command may be surprised
at the hidden synergistic power that appears when this capability is
combined with a good collection of tools.
.PP
While logged in on the console of an SGI workstation,
to dynamically produce some plot data, and then
locally view the resulting \fBplot\fR file,
this pipeline would be needed:
.sp
.in +1i
.ft B
.nf
cruncher | pl-sgi
.fi
.ft R
.in -1i
.sp
In this case, ``cruncher'' is assumed to produce a \fBplot\fR file
on it's standard output (\fIstdout\fR).  The \fBpl-sgi\fR program
reads a \fBplot\fR file on standard input (\fIstdin\fR) and produces
a color wireframe display in a new window.
However, if the computation speed of the workstation was inadequate,
a slight variation could be used:
.sp
.in +1i
.ft B
.nf
rsh Cray.arpa cruncher | pl-sgi
.fi
.ft R
.in -1i
.sp
Simply by directing remote execution (via the \fBrsh\fR command)
and naming the remote machine to perform the computation
(in this case ``Cray.arpa''), the
power of another machine can be brought to bear on the task.
The \fBrsh\fR command
.[
Leffler bug fixes and changes
.]
provides a number of significant features:
(a) it connects to the specified machine, and validates access permission,
(b) it passes the specified command to a Shell on the specified
machine for parsing and execution,
(c) it copies its standard input to the remote command,
(d) the standard output of the remote command is copied to \fBrsh\fRs
standard output,
(e) the standard error output of the remote command is returned
separately, and is copied to \fBrsh\fRs standard error output,
(f) local interrupt, quit, and terminate signals are propagated
to the remote command,
(g) \fBrsh\fR exits when the remote command does, returning the
remote exit status.
Therefore, performing an operation on a remote machine with \fBrsh\fR
produces an effect that is indistinguishable from having performed
the same operation locally.
.PP
Another variation of this example might be useful if, instead
of being logged in on the graphics workstation, you were logged in
directly on the Cray, and after having finished some code development
you wished to see an image.
In this case, the command would be:
.sp
.in +1i
.ft B
.nf
cruncher | rsh Vax.arpa pl-fb
.fi
.ft R
.in -1i
.sp
This would send the \fBplot\fR file to the machine ``Vax.arpa''
and cause it to be displayed on a framebuffer.
To generate a videotape to
display the effect of varying a parameter in a simulation running
on the Cray, with the display and videotape
capability on the VAX,
consider the following modest Shell script
(which may be typed at the keyboard or run from a file):
.EQ
delim @@
.EN
.sp
.in +1i
.ft B
.nf
.ne 1i
for parm in `loop 1 100 2`
do
     cruncher $parm | rsh Vax.arpa "pl-fb; vas4 record 1"
done
.fi
.ft R
.in -1i
.sp
.EQ
delim $$
.EN
This simple script runs the \fBloop\fR tool to generate all the
integers between 1 and 100 inclusive, with an increment of 2,
which are assigned to the shell variable \fIparm\fR.
For each value of \fIparm\fR, the \fBcruncher\fR program is
run with \fIparm\fR as a formal argument.
The \fBplot\fR format output is sent to the Vax,
displayed on the framebuffer, and then the \fBvas4\fR
directs the Lyon-Lamb video animation controller to record
the image onto one frame of videotape.
.PP
Now, consider this variation on the earlier image processing example:
.sp
.in +1i
.ft B
.nf
.ne 1i
pixinterp2x -s512  < image.pix |  \\
rsh Cray.arpa "pixfilter -s1024 -lo" |  \\
rsh Alliant.arpa "pixmerge -n 63/0/127 -- - background.pix |  \\
rsh Vax.arpa "pixrot -r -i 1024 1024 | pix-fb -h"
.fi
.ft R
.in -1i
.sp
which roughly says:  grab an image on my local machine, perform
bi-linear interpolation locally, then send it to the Cray for
low-pass filtering, then send it to the Alliant for compositing
with a background image, then send it to a trusty VAX to
(a) rotate the image 180 degrees and (b) display it on the frame buffer.
.PP
Having learned about \fBrsh\fR,
the possibilities of such combinations of different tools
running on different machines are staggering.
With the proper infrastructure of computers, operating systems,
display hardware, network software, and image processing tools
all connected together in compatible ways, the tremendous
potential of distributed computing can be easily harnessed,
without users having to write any programs!
.NH 1
SHARED-MEMORY PARALLEL PROCESSING
.PP
In the preceding section, we have seen how
different processors
can be harnessed to achieve a single goal.
The discussion so far has focused on using multiple processors
(a) within a single, multi-CPU system, through the use of UNIX pipes,
and (b) by distributing different tools in a pipeline to different
machines on the network.
This section extends the investigation into parallel processing
one level further, to harness the power of multiple processor
machines to make a single application run faster.
For the purposes of this discussion, the application to be
parallelized is a ray-tracing program, but the techniques
developed here are quite general.
.NH 2
The Need for Speed
.PP
Images created using ray-tracing have a reputation for consuming large
quantities of computer time.  For complex models,
10 to 20 hours of processor time to render a single frame
on a DEC VAX-11/780 class machine is not uncommon.
Using the ray-tracing paradigm for engineering analysis
.[
Muuss Understanding Solid Models
.]
often requires many times more processing than rendering
a view of the model.  Examples of such engineering analyses
include the predictive calculation of radar cross-sections,
heat flow, and bi-static laser reflectivity.  For models of
real-world geometry, running these analyses approaches
the limits of practical execution times, even with modern SuperComputers.
There are three main strategies that are being employed to
attempt to decrease the amount of elapsed time it takes to
ray-trace a particular scene:
.IP 1)
Advances in algorithms for ray-tracing.
Newer techniques in partitioning space
.[
Kaplan Space-Tracing
.]
and in taking advantage of ray-to-ray coherence
.[
Arvo Ray Classification
.]
promise to continue to yield algorithms that do fewer and fewer
ray/object intersections which do not contribute to the final results.
Significant work remains to be done in this area, and an order of
magnitude performance gain remains to be realized.  However, there
is a limit to the gains that can be made in this area.
.IP 2)
Acquiring faster processors. A trivial method for decreasing the elapsed
time to run a program is to purchase a faster computer.  However, even
the fastest general-purpose computers such as the Cray X-MP and Cray-2
do not execute fast enough to permit practical analysis of all
real-world models in appropriate detail. Furthermore, the speed of light
provides an upper bound on the fastest computer that can be built out of
modern integrated circuits; this is already a significant
factor in the Cray X-MP and Cray-2 processors, which operate with 8.5 ns
and 4.3 ns clock periods respectively.
.IP 3)
Using multiple processors to solve a single problem. By engaging the
resources of multiple processors to work on a single problem, the
speed-of-light limit can be circumvented.  However, the price is that
explicit attention must be paid to the distribution of data to the
various processors, synchronization of the computations, and collection
of the results.
.PP
Parallel processing is still a relatively young art, and presently there
is only limited support available for
the automatic parallelization of existing
code, with newer vendors like Alliant leading the crowd.
For now, there are few general techniques
for taking programs intended
for serial operation on a single processor, and automatically adapting
them for operation on multiple processors.
.[
Ohr Minisupercomputers Speed
.]
The \fBWorm\fR program developed at Xerox PARC
.[
Shoch Worm Distributed Computation
.]
is one of the earliest known network image-rendering applications.
More recently at Xerox PARC, Frank Crow has attempted to distribute
the rendering of a single image across multiple processors,
.[
Crow Distributed Execution Work in Progress
.]
but discovered that communication overhead and synchronization problems
limited parallelism to about 30% of the available processing power.
A good summary of work to date has been collected by Peterson.
.[
Peterson Distributed Computation
.]
.PP
Ray-tracing analysis of a model has the very nice property that
the computations for each ray/model intersection are entirely
independent of other ray/model intersection calculations.
Therefore, it is easy to see how the calculations for each ray
can be performed by separate, independent processors.  The
underlying assumption is that each processor has read-only access
to the entire model database.
While it would be possible to partition the ray-tracing algorithm
in such a way as to require only a portion of the model database
being resident in each processor, this would significantly increase
the complexity of the implementation as well as the
amount of synchronization and control traffic needed.
Such a partitioning has therefore
not yet been seriously attempted.
.PP
It is the purpose of the research reported in the rest of this paper to
explore the performance limits of parallel operation of ray-tracing
algorithms where available processor memory is not a limitation.
While it is not expected that this research will result
in a general purpose technique for distributing arbitrary programs
across multiple processors, the issues of the control and distribution of work
and providing reliable results in a potentially unreliable system
are quite general.  The techniques used here are likely to be
applicable to a large set of other applications.
.NH 2
Raytracing Background
.PP
The origins of modern ray-tracing come from work at MAGI
under contract to BRL, initiated in the early 1960s.
The initial results were reported by MAGI
.[
geometric description technique MAGI
.]
in 1967.
Extensions to the early developments were undertaken by a
DoD Joint Technical Coordinating Group effort, resulting in
publications in 1970
.[
MAGIC User Manual
.]
and 1971.
.[
MAGIC Analyst Manual
.]
A detailed presentation of the
fundamental analysis and implementation of the ray-tracing algorithm
can be found in these two documents.
Also see
.[
Appel shading machine renderings solids
.]
.PP
More recently, interest in ray-tracing developed in the academic
community, with Kay's
.[
Kay Ray Tracing 1979
.]
thesis in 1979
being a notable early work.
One of the central papers in the ray-tracing literature is
the work of Whitted.
.[
Whitted Improved Illumination Model
.]
Model sampling techniques can be improved to provide substantially
more realistic images by using the ``Distributed Ray Tracing'' strategy.
.[
Cook Carpenter Distributed Ray Tracing
.]
For an excellent, concise discussion of ray-tracing,
consult pages 363-381 of Rogers.
.[
Rogers Procedural Elements
.]
.PP
There are several implementation strategies for interrogating
the model by computing ray/geometry intersections.
The traditional approach has been batch-oriented,
with the user defining a set of ``viewing angles'',
turning loose a big batch job to compute all the ray intersections,
and then post-processing all the ray data into some meaningful form.
However, the major drawback of this approach is that the application
has no dynamic control over ray paths, making another batch run
necessary for each level of reflection, etc.
.PP
In order to be successful, applications need: (1) dynamic control of ray
paths, to naturally implement reflection, refraction, and fragmentation into
multiple subsidiary rays, and (2) the ability to fire rays in arbitrary
directions from arbitrary points.
Nearly all non-batch ray-tracing implementations have a specific closely
coupled application (typically a model of illumination), which
allows efficient and effective control of the ray paths.
However, the most flexible approach is to implement the ray-tracing
capability as a general-purpose library, to make the functionality
available to any application as needed, and
this is the approach taken in the BRL CAD Package.
.[
Muuss CAD Package Release 1.21
.]
The ray-tracing library is called \fBlibrt\fR, while the ray-tracing
application of interest here (an optical spectrum lighting model)
is called \fBrt\fR.
.NH 2
The Structure of librt
.PP
In order to give all applications dynamic control over
the ray paths, and to allow the rays to be fired in arbitrary directions
from arbitrary points, BRL has implemented its third generation
ray-tracing capability as a set of library routines.
\fBlibrt\fR exists to allow application programs to
intersect rays with model geometry.  There are four parts to the
interface: three preparation routines and the actual ray-tracing routine.
The first routine which must be called is
rt_dirbuild(), which opens the database file, and builds the
in-core database table of contents.
The second routine to be called is rt_gettree(), which
adds a database sub-tree to the active model space.
rt_gettree() can be called multiple times
to load different parts of the database
into the active model space.
The third routine is rt_prep(), which
computes the space partitioning data structures and does other
initialization chores.
Calling this routine is optional,
as it is called by rt_shootray() if needed.
rt_prep() is provided as a separate routine to
allow independent timing of the preparation and ray-tracing phases of
applications.
.PP
To compute the intersection of a ray with the geometry in the active
model space, the application must call rt_shootray() once for each
ray. Ray-path selection for perspective, reflection, refraction, etc,
is entirely determined by the application program. The only parameter
to the rt_shootray() is a \fBlibrt\fR ``application'' structure, which
contains five major elements: the vector a_ray.r_pt which
is the starting point of the ray to be fired, the vector a_ray.r_dir
which is the unit-length direction vector of the ray,
the pointer *a_hit() which is the address of an application-provided
routine to call when the ray intersects the model geometry, the pointer
*a_miss() which is the address of an application-provided routine to
call when the ray does not hit any geometry, the flag a_onehit which is
set non-zero to stop ray-tracing as soon as the ray has intersected at
least one piece of geometry (useful for lighting models), plus various
locations for each application to store state (recursion level, colors,
etc). Note that the integer returned from the application-provided
a_hit()/a_miss() routine is the formal return of the function
rt_shootray(). The rt_shootray() function is prepared for full recursion
so that the a_hit()/a_miss() routines can themselves fire additional
rays by calling rt_shootray() recursively before deciding their own
return value.
.PP
In addition, the function rt_shootray() is serially and concurrently
reentrant, using only registers, local variables allocated on the stack, and
dynamic memory allocated with rt_malloc().
The rt_malloc() function serializes calls to \fBmalloc\fR(3).
By having the ray-tracing library fully prepared to run
in parallel with other instances of itself in the same address space,
applications may take full advantage of parallel hardware capabilities,
where such capabilities exist.
.NH 2
A Sample Ray-Tracing Program
.PP
A simple application program that fires one ray at a model and prints
the result is included below, to demonstrate the simplicity of the
interface to \fBlibrt\fR.
.LP
.sp .5
.nf
.ne 2i
#include 
struct application ap;
main() {
      rt_dirbuild("model.g");
      rt_gettree("car");
      rt_prep();
      ap.a_point = [ 100, 0, 0 ];
      ap.a_dir = [ -1, 0, 0 ];
      ap.a_hit = &hit_geom;
      ap.a_miss = &miss_geom;
      ap.a_onehit = 1;
      rt_shootray( &ap );
}
hit_geom(app, part)
struct application *app;
struct partition *part;
{
      printf("Hit %s", part->pt_forw->pt_regionp->reg_name);
}
miss_geom(){
      printf("Missed");
}
.fi
.NH 2
Normal Operation:  Serial Execution
.PP
When running the \fBrt\fR program on a serial processor, 
the code of interest is the top of the subroutine hierarchy.
The function
main()
first calls
get_args()
to parse any command line options, then calls
rt_dirbuild()
to acquaint \fBlibrt\fR with the model database, and
view_init()
to initialize the application
(in this case a lighting model, which may call
mlib_init()
to initialize the material-property library).  Finally,
rt_gettree()
is called repeatedly to load the model treetops.
For each frame to be produced,
the viewing parameters are processed, and
do_frame()
is called.
.PP
Within
do_frame(),
per-frame initialization is handled by calling
rt_prep(),
mlib_setup(),
grid_setup(),
and
view_2init().
Then,
do_run()
is called with the linear pixel indices of the start and end locations in
the image;  typically these values are zero and width*length-1,
except for the ensemble computer case.
In the non-parallel cases,
the do_run()
routine initializes the global variables
cur_pixel and last_pixel, and calls
worker().
At the end of the frame,
view_end()
is called to handle any final output, and print some statistics.
.PP
The worker() routine
obtains the index of the next pixel that needs to be computed by
incrementing cur_pixel, and calls
rt_shootray()
to interrogate the model geometry.
view_pixel()
is called to output the results for that pixel.
worker()
loops, computing one pixel at a time, until
\fIcur_pixel > last_pixel\fR,
after which it returns.
.PP
When
rt_shootray()
hits some geometry, it calls the a_hit() routine listed in the application
structure to determine the final color of the pixel.
In this case, colorview() is called.  colorview() uses view_shade()
to do the actual computation.
Depending on the properties of the material hit and the stack of shaders
that are being used, various material-specific renderers may be called,
followed by a call to
rr_render()
if reflection or refraction is needed.  Any of these routines may spawn
multiple rays, and/or recurse on
colorview().
.NH 2
Parallel Operation on Shared-Memory Machines
.PP
By capitalizing on the serial and concurrent
reentrancy of the \fBlibrt\fR routines, it is very easy to take advantage
of shared memory machines where it is possible to initiate multiple
``streams of execution'' or ``threads''
within the address space of a single process.
In order to be able to ensure that global variables are only
manipulated by one instruction stream at a time,
all such shared modifications are enclosed in critical sections.
For each type of processor, it is necessary to implement the routines
RES_ACQUIRE()
and
RES_RELEASE()
to provide system-wide semaphore operations.
When a processor acquires a resource, and any other processors need that
same resource, they will wait until it is released, at which time exactly
one of the waiting processors will then acquire the resource.
.PP
In order to minimize contention between processors
over the critical sections of code, all critical sections are kept
as short as possible: typically only a few lines of code.
Furthermore, there are different semaphores for each type of resource
accessed in critical sections.
.I res_syscall
is used to interlock all UNIX
system calls and some library routines,
such as
\fBwrite\fR(2),
\fBmalloc\fR(3),
\fBprintf\fR(3),
etc.
.I res_worker
is used by the function
worker()
to serialize access to the variable
cur_pixel,
which contains the index of the next pixel to be computed.
.I res_results
is used by the function
view_pixel
to serialize access to the result buffer.  This is necessary because
few processors have hardware multi-processor interlocking on
byte operations within the same word.
.I res_model
is used by the \fBlibspl\fR spline library routines to serialize operations
which cause the model to be further refined during the raytracing process,
so that data structures remain consistent.
.PP
Application of the usual client-server model of computing would
suggest that one stream of execution would be dedicated to dispatching the
next task, while the rest of the streams of execution would be used for
ray-tracing computations.  However, in this case, the dispatching operation
is trivial and a ``self-dispatching'' algorithm is used,
with a critical section
used to protect the shared variable
cur_pixel.
The real purpose of the function
do_run()
is to perform whatever machine-specific operation is required to
initiate
.I npsw
streams of execution within the address space of the \fBrt\fR program, and
then to have each stream call the function
worker(),
each with appropriate local stack space.
.PP
Each worker() function will loop until no more pixels remain,
taking the next available pixel index.
For each pass through the loop, RES_ACQUIRE(res_worker) is
used to acquire the semaphore, after which the index of the next
pixel to be computed, \fIcur_pixel\fR,
is acquired and incremented, and before the semaphore is released,
ie,
.nf
.sp .5
worker() {
	while(1)  {
		RES_ACQUIRE( &rt_g.res_worker );
		my_index = cur_pixel++;
		RES_RELEASE( &rt_g.res_worker );
		if( my_index > last_pixel )
			break;
		a.a_x = my_index%width;
		a.a_y = my_index/width;
		...compute ray parameters...
		rt_shootray( &a );
	}
}
.fi
.PP
On the Denelcor HEP H-1000 each word of memory has a full/empty tag bit
in addition to 64 data bits.
RES_ACQUIRE is implemented using the
Daread()
primitive, which uses the hardware capability to wait until the
semaphore word is full, then read it, and mark it as empty.
RES_RELEASE is implemented using the
Daset()
primitive, which marks the word as full.
do_run()
starts additional streams of execution using the
Dcreate(worker)
primitive, which creates another stream which immediately calls the
worker()
function.
.PP
On the Alliant FX/8, RES_ACQUIRE is implemented using the hardware
instruction test-and-set (TAS) which tests a location for being
zero.
As an atomic operation,
if the location is zero, it sets it non-zero and
sets the condition codes appropriately.  RES_ACQUIRE embeds this
test-and-set instruction in a polling loop to wait for acquisition of
the resource.
RES_RELEASE just zeros the
semaphore word.  Parallel execution is achieved by using the
hardware capability to spread a loop across multiple processors,
so a simple loop from 0 to 7 which calls
worker()
is executed in hardware concurrent mode.  Each concurrent instance
of worker() is given a separate stack area in the ``cactus stack''.
.PP
On the Cray X-MP and Cray-2, the Cray multi-tasking library is used.
RES_ACQUIRE maps into LOCKON, and RES_RELEASE maps into LOCKOFF,
while
do_run()
just calls
TSKSTART(worker)
to obtain extra workers.
.NH 2
Performance Measurements
.PP
An important part of the BRL CAD Package is a set of five benchmark model
databases and associated viewing parameters, which permit the relative
performance of different computers and configurations to be made using
a significant production program as the basis of comparison. For the
purposes of this paper, just the "Moss" database is used for
comparison.  Since this benchmark generates pixels the fastest, it will
place the greatest demands on any parallel processing scheme. The
benchmark image is computed at 512x512 resolution.
.PP
The relative performance figures for running \fBrt\fR in the parallel mode
with Release 1.20 of the BRL CAD Package are presented below. The
Alliant FX/8 machine was brl-vector.arpa, configured with 8
Computational Elements (CEs), 6 68012 Interactive Processors (IPs), 32
Mbytes of main memory, and was running Concentrix 2.0, a port of 4.2 BSD
UNIX.
The Cray X-MP/48 machine was brl-patton.arpa, serial number 213, with
4 processors, 8 Mwords of main memory, with a clock period
of 8.5 ns, and UNICOS 2.0, a port of System V UNIX.
Unfortunately, no comprehensive results are available for the
Denelcor HEP, the only other parallel computer known to have run this code.
.TS
center box;
c s s s s
n n n n n.
Parallel \fBrt\fR Speedup -vs- # of Processors
_
# Processors	FX/8	(eff)	X-MP/48	(eff)
_
1	1.00	100%	1.00	100%
2	1.84	92.0%	1.99	99.5%
3	2.79	93.0%	2.96	98.7%
4	3.68	92.0%	3.86	96.5%
5	4.80	96.0%	
6	5.70	95.0%	
7	6.50	92.9%	
8	7.46	93.3%	
.TE
.PP
The multiple-processor performance of \fBrt\fR increases nearly
linearly for shared memory machines
with small collections of processors.
The slight speedup of the Alliant when the fifth processor is added
comes from the fact that the first four processors share one cache
memory, while the second four share a second cache memory.
To date, \fBrt\fR holds the record for the best achieved speedup for parallel
processing on both the Cray X-MP/48 and the Alliant.
Measurements on the HEP, before
it was dismantled, indicated that near-linear improvements continued through
128 streams of execution.  This performance is due to the fact that
the critical sections are very small, typically just a few lines of code,
and that they account for an insignificant portion of the computation time.
When \fBrt\fR is run in parallel and the number of processors is increased,
the limit to overall performance is determined
by the total bandwidth of the shared memory, and by memory conflicts over
popular regions of code and data.
.NH 1
DISTRIBUTED COMPUTATION ON LOOSELY-COUPLED ENSEMBLES OF PROCESSORS
.PP
The basic assumption of this design is that network bandwidth is
modest, so that the number of bytes and packets of overhead should not exceed
the number of bytes and packets of results.
The natural implementation would be to provide
a remote procedure call (RPC) interface to
rt_shootray(), so that when additional subsidiary rays are needed,
more processors could potentially be utilized.  However, measurements
of this approach on VAX, Gould, and Alliant computers indicates that
the system-call and communications overhead is comparable to the
processing time for one ray/model intersection calculation.
This much overhead rules out the RPC-per-ray interface for practical
implementations.
On some tightly coupled
ensemble computers, there might be little penalty for such an approach,
but in general, some larger unit of work must be exchanged.
.PP
It was not the intention of the author to develop another protocol
for remote file access, so the issue of distributing the model database
to the \fBrtsrv\fR server machines is handled outside of the context of
the \fBremrt\fR and \fBrtsrv\fR software.
In decreasing order of preference, the
methods for model database distribution that are currently used
are Sun NFS, Berkeley \fBrdist\fR, Berkeley \fBrcp\fR,
and ordinary DARPA \fBftp\fR.
Note that the binary databases need to be converted to a portable
format before they are transmitted across the network, because \fBrtsrv\fR
runs on a wide variety of processor types.
Except for the model databases
and the executable code of the \fBrtsrv\fR server process itself,
no file storage is used on any of the server machines.
.NH 2
Distribution of Work
.PP
The approach used in \fBremrt\fR involves a single dispatcher
process, which communicates with an arbitrary number of server
processes.
Work is assigned in groups of scanlines.
As each server finishes a scanline, 
the results are sent back to the dispatcher,
where they are stored.
Completed scanlines are removed from the list of scanlines to be done
and from the list of scanlines currently assigned to that server.
Different servers may be working on entirely different frames.
Before a server is assigned scanlines from a new frame, it is sent
a new set of options and viewpoint information.
.PP
The underlying communications layer used
is the package (PKG) protocol, provided by the \fBlibpkg\fR library,
so that all communications are known to be
reliable, and communication disruptions are noticed.
Whenever the
dispatcher is notified by the \fBlibpkg\fR routines that contact with a server
has been lost, all unfinished scanlines assigned to that server will be
requeued at the head of the ``work to do'' queue, so that it will be
assigned to the very next available server, allowing tardy scanlines to
be finished quickly.
.NH 2
Distribution Protocol
.PP
When a server process \fBrtsrv\fR is started, the host name of the machine
running the dispatcher process is given as a command line argument.
The server process can be started from a command in the dispatcher
\fBremrt\fR, which uses \fBsystem\fR(3)
to run the \fBrsh\fR program, or directly via some other mechanism.
This avoids the need to register the \fBrtsrv\fR program as a system network
daemon and transfers issues of access control, permissions, and
accounting onto other, more appropriate tools. Initially, the \fBrtsrv\fR
server initiates a PKG connection to the dispatcher process and then
enters a loop reading commands from the dispatcher. Some commands
generate no response at all, some generate one response message, and
some generate multiple response messages.  However, note that the server
does not expect to receive any additional messages from the dispatcher
until after it has finished processing a request, so that requests do
not have to be buffered in the server.  While this simplifies the code,
it has some performance implications, which are discussed later.
.PP
In the first stage, the message received must be of type MSG_START, with
string parameters specifying the pathname of the model database and the
names of the desired treetops.  If all goes well, the server responds
with a MSG_START message, otherwise diagnostics are returned as string
parameters to a MSG_PRINT message and the server exits.
.PP
In the second
stage, the message received must be of type MSG_OPTIONS or MSG_MATRIX.
MSG_OPTIONS specifies the image size and shape, hypersampling, stereo
viewing, perspective -vs- ortho view, and control of randomization
effects (the ``benchmark'' flag), using the familiar UNIX command line
option format. MSG_MATRIX contains the 16 ASCII floating point numbers 
for the 4x4 homogeneous transformation matrix which represents the
desired view.
.PP
In the third stage, the server waits for messages of type MSG_LINES,
which specify the starting and ending scanline to be processed.
As each scanline is completed, it is immediately sent back to the
dispatcher process to minimize the amount of computation that could
be lost in case of server failure or communications outage.
Each scanline is returned in a message of type MSG_PIXELS.  The first
two bytes of that message contain the scanline number in 
network order 16-bit binary.
Following that is the 3*width bytes of RGB
data that represents the scanline.
When all the scanlines specified in the MSG_LINES command are processed,
the server again waits for another message, either another MSG_LINES
command or a MSG_OPTIONS/MSG_MATRIX command to specify a new view.
.PP
At any time, a MSG_RESTART message can be received by the server, which
indicates that it should close all it's files and
immediately re-\fBexec\fR(2) itself, either to prepare for processing
an entirely new model, or as an error recovery aid.
A MSG_LOGLVL message can be received at any time, to enable and disable
the issuing of MSG_PRINT output.
A MSG_END message suggests that the server should commit suicide,
courteously.
.NH 2
Dispatching Algorithm
.PP
The dispatching (scheduling) algorithm revolves around two main lists,
the first being a list of currently connected servers and
the second being a list of frames still to be done.  For each unfinished
frame, a list of scanlines remaining to be done is maintained.
For each server, a list of the currently assigned scanlines is kept.
Whenever a server returns a scanline, it is removed from the list of
scanlines assigned to that server, stored in the output image,
and also in the optional attached framebuffer.
(It can be quite entertaining to watch the scanlines racing up the screen,
especially when using processors of significantly different speeds).
If the arrival of this scanline completes a frame, then the
frame is written to disk on the dispatcher machine, timing data is computed,
and that frame
is removed from the list of work to be done.
.PP
When a server finishes the last scanline of its assignment and more
work remains to be done, the list of unfinished frames is searched and
the next available increment of work is assigned.  Work is assigned in
blocks of consecutive scanlines, up to a per-server maximum assignment size.
The block of scanlines is recorded as the server's new assignment and
is removed from the list of work to be done.
.NH 2
Reliability Issues
.PP
If the \fBlibpkg\fR communications layer looses contact with a server machine,
or if \fBremrt\fR is commanded to drop a server, then the scanlines remaining
in the assignment are requeued at the head of the list of scanlines remaining
for that frame.  They are placed at the head of the list so that the first
available server will finish the tardy work, even if it had gone ahead to
work on a subsequent frame.
.PP
Presently, adding and dropping server machines is a manual (or script
driven) operation.  It would be desirable to develop a separate
machine-independent network mechanism
that \fBremrt\fR could use to inquire about the current loading and availability
of server machines, but this has not been done.
This would permit periodic status requests to be made
and automatic reacquisition of eligible server machines
could be attempted.
Peterson's Distrib
.[
Peterson Distributed Computation
.]
System incorporates this as a built-in part of the distributed computing
framework, but it seems that using an independent transaction-based
facility such as Pistritto's Host Monitoring Protocol (HMP) facility
.[
Natalie Muuss BRL VAX UNIX Manual
.]
would be a more general solution.
.PP
If the dispatcher fails, all frames that have not been completed
are lost;  on restart,
execution resumes at the beginning of the first uncompleted frame.
By carefully choosing a machine that has excellent reliability to
run the dispatcher on, the issue of dispatcher failure can be largely
avoided.  However, typically no more than two frames will be lost,
minimizing the impact.  For frames that take extremely long times to
compute, it would be reasonable extend the dispatcher to snapshot the work
queues and partially assembled frames in a disk file, to permit operation
to resume from the last ``checkpoint''.
.NH 2
Distributed remrt Performance
.PP
Ten identical Sun-3/50 systems were used
to test the performance of \fBremrt\fR.
All had 68881 floating point units and 4 Mbytes of memory,
and all were in normal timesharing mode, unused except for
running the tests and the slight overhead
imposed by /etc/update, \fBrwhod\fR, etc.
To provide a baseline performance figure for comparison, the
benchmark image was computed in the normal way using \fBrt\fR, to
avoid any overhead which might be introduced by \fBremrt\fR.
The elapsed time to execute the ray-tracing portion of the benchmark
was 2639 seconds;  the preparation phase was not included, but amounted
to only a few seconds.
.TS
center box;
c s s   s s   s s
l c s | c s | c c
l c c | c c | c c
l n n | n n | n n.
\fBremrt\fR Speedup -vs- # of Processors
_
	Ratios	Elapsed Seconds
# CPUs	Theory	Sun-3/50	Theory	Sun-3/50	Total Speedup	Efficiency
_
1	1.0000	1.0072	2639.0	2658	0.993	99.3%
2	0.5000	0.5119	1319.5	1351	1.953	97.7%
3	0.3333	0.3357	879.6	886	2.979	99.3%
4	0.2500	0.2524	659.7	666	3.949	98.7%
5	0.2000	0.2027	527.8	535	4.916	98.3%
_
6	0.1666	0.1686	429.8	445	5.910	98.5%
7	0.1429	0.1470	377.0	388	6.778	96.8%
8	0.1250	0.1266	329.9	334	7.874	98.4%
9	0.1111	0.1133	293.2	299	8.796	97.7%
10	0.1000	0.1019	263.9	269	9.777	97.8%
.TE
.PP
The ``speedup'' figure of 0.993 for 1 CPU shows the loss of performance
of 0.7% introduced by the overhead of
the \fBremrt\fR to \fBrtsrv\fR communications,
versus the non-distributed \fBrt\fR performance figure. The primary result of note
is that the speedup of the \fBremrt\fR network distributed application is
very close to the theoretical maximum speedup, with a total efficiency
of 97.8% for the ten Sun case!
The very slight loss of performance noticed (2.23%)
is due mostly to ``new assignment latency'',
discussed further below.  Even so, it is worth noting that the speedup
achieved by adding processors with \fBremrt\fR was even better than the
performance achieved by adding processors in parallel mode with \fBrt\fR.
This effect is due mostly to the lack of memory and semaphore contention
between the \fBremrt\fR machines.
.PP
Unfortunately, time did not permit configuring and testing
multiple Alliants running \fBrtsrv\fR in full parallel mode,
although such operation is supported by \fBrtsrv\fR.
.PP
When \fBremrt\fR is actually being used for producing images, many different
types of processors can be used together.  The aggregate performance of
all the available machines on a campus network is truly awesome,
especially when a Cray or two is included! Even in this
case, the network bandwidth required does not exceed the capacity of an
Ethernet (yet). The bandwidth requirements are sufficiently small that
it is practical to run many \fBrtsrv\fR processes distributed over the
ARPANET/MILNET. On one such occasion in early 1986, 13 Gould PN9080
machines were used all over the east coast to finish some images for a
publication deadline.
.NH 2
Performance Issues
.PP
The policy of making work assignments in terms of multiple adjacent
scanlines
reduces 
the processing requirements of the dispatcher and also improves the
efficiency of the servers.
As a server finishes a scanline, it can give the scanline to
the local operating system to send to the dispatcher machine,
while the server
continues with the computation, allowing the transmission to be
overlapped with more computation.  When gateways and wide-area networks
are involved (with their accompanying increase in latency and packet loss),
this is an important consideration.  In the current
implementation, assignments are always blocks of three scanlines
because there is no general way for the \fBrtsrv\fR process to know what kind
of machine it is running on and how fast it is likely to go.  Clearly,
it would be worthwhile to assign larger blocks of scanlines to the
faster processors so as to minimize idle time and control traffic overhead.
Seemingly the best way to determine this would be to measure the rate of
scanline completion and dynamically adjust the allocation size.
This is not currently implemented.
.PP
By increasing the scanline block assignment size for the faster
processors, the amount of time the server spends waiting for a new
assignment (termed ``new assignment latency'') is diminished, but
not eliminated. Because the current design assumes that the server will
not receive another request until the previous request has been fully
processed, no easy solution exists.  Extending the server implementation
to buffer at least one additional request would permit this limitation
to be overcome, and the dispatcher would then have the option of sending
a second assignment before the first one had completed, to always keep
the server ``pipeline'' full.  For the case of very large numbers of
servers, this pipelining is important to keep delays in the
dispatcher from affecting performance. In the case of very fast servers,
pipelining is important in achieving maximum server utilization, by
overcoming network and dispatcher delays.
.PP
To obtain an advantage from the pipeline effect of the multiple scanline
work assignments, it is important that the network implementations
in both the servers and the dispatcher have adequate buffering to
hold an entire scanline (typically 3K bytes).  For the dispatcher,
it is a good idea to increase the default TCP receive space (and thus
the receive window size) from 4K bytes to 16K bytes.  For the server
machines, it is a good idea to increase the default TCP transmit space
from 4K bytes to 16K bytes.  This can be accomplished by modifying
the file /sys/netinet/tcp_usrreq.c to read:
.sp .5
.nf
int   tcp_sendspace = 1024*16;
int   tcp_recvspace = 1024*16;
.fi
.sp .5
or to make suitable modifications to the binary image of the kernel
using \fIadb\fR(1):
.sp .5
.nf
adb -w -k /vmunix
tcp_sendspace?W 0x4000
tcp_recvspace?W 0x4000
.fi
.sp .5
.PP
The dispatcher process must maintain an active network connection to
each of the server machines.  In all systems there is some limit to the
number of open files that a single process may use (symbol NOFILE);  in
4.3 BSD UNIX, the limit is 64 open files.  For the current implementation,
this places an upper bound on the number of servers that can be used.
As many campus networks have more than 64 machines available at night,
it would be nice if this limit could be eased.  One approach is
to increase the limit on the dispatcher machine.  Another approach is to
implement a special ``relay server'' to act as a fan-in/fan-out
mechanism, although the additional latency could get to be an issue.
A third approach is to partition the problem at a higher level.
For example, having the east campus do the beginning of a movie,
and the west campus do the end would reduce the open file problem.
Additionally, if gateways are involved, partitioning the problem may be kinder
to the campus network.
.NH 2
Conclusions
.PP
Parallel computing is good.
.PP
This paper has shown how it is possible to implement good graphics
interfaces within the confines of a single uniprocessor machine.
With the adoption of
a ``software tools'' approach when providing data handling capabilities,
it was shown how to transparently take advantage of multi-processor
machines, and thus realize a speed advantage.
Furthermore, the careful selection of file formats permitted
realizing a further speed advantage in the accomplishment of a single task
by utilizing multiple systems located across a network.
.PP
Carying the theme of increased speed further,
multi-processor sytems were examined as a vehicle for making single
image generation tools operate faster.
An important result was to note that
when operation in a shared memory parallel environment was an initial design
goal, the implementation of concurrently reentrant code did not significantly
increase the complexity of the software.  Having code with such properties allows
direct utilization of nearly any shared memory multiprocessor with
a minimum of system-specific support, namely the RES_ACQUIRE and RES_RELEASE
semaphore operations, and some mechanism for starting multiple streams
of execution within the same address space.
.PP
Finally, collections of processors connected only by conventional
speed network links are considered as the final 
environment for making a single tool operate faster using multiprocessing.
It was shown that
network distributed computing need not be inefficient or difficult.
The protocol and dispatching mechanism described in the preceding
sections has been shown to be very effective at taking the
computationally intensive task of generating ray-traced images and
distributing it across multiple processors connected only by a
communications network. There are a significant number of other
application programs that could directly utilize the techniques and
control software implemented in \fBremrt\fR to achieve network distributed
operation.  However, the development and operation of this type of
program is still a research effort; the technology
is not properly packaged for widespread, everyday use.
Furthermore, it is clear that the techniques used in \fBremrt\fR are not
sufficiently general to be applied to all scientific problems.  In
particular, problems where each ``cell'' has dependencies on some or
all of the neighboring cells will require different techniques.
.PP
Massive proliferation of computers is a trend that is likely to continue
through the 1980s into the 1990s and beyond.  Developing software to
utilize significant numbers of network connected processors is the
coming challenge. This paper has presented a strategy that meets this
challenge, and provides a simple, powerful, and efficient method for
distributing a significant family of scientific analysis codes across
multiple computers.
.SH
Acknowledgements
.PP
The author would like to thank
Dr. Paul Deitz for providing his unflagging support and encouragement,
Phil Dykstra, Paul Stay, Gary Moss, Chuck Kennedy, and Bob Reschly
for the long hours as we designed and built this software,
and Dr. Dave Rogers for once again
persuading me to write it all down.
.PP
The following strings which have been included in this paper
are known to enjoy protection as trademarks;
the trademark ownership is acknowledged in the table below.
.TS
center;
l l.
Trademark	Trademark Owner
_
Cray	Cray Research, Inc
Ethernet	Xerox Corporation
FX/8	Alliant Computer Systems Corporation
IBM 370	International Business Machines Corporation
Macintosh	Apple Computer, Inc
MacPaint	Apple Computer, Inc
NFS	Sun Microsystems, Inc
PowerNode	Gould, Inc
ProNet	Proteon, Inc
Sun Workstation	Sun Microsystems, Inc
SunView	Sun Microsystems, Inc
UNIX	AT&T Bell Laboratories
UNICOS	Cray Research, Inc
VAX	Digital Equipment Corporation
VaxStation	Digital Equipment Corporation
X Window System	Massachusetts Institute of Technology
.TE