An Architecture Independent Programming Language For Low-Level Vision
1. INTRODUCTION
In computer vision, the first, and often most time-consuming, step in image
processing is image to image operations. In this step, an input image is mapped into
an output image through some local operation that applies to a window around each
pixel of the input image. Algorithms that fall into this class include: edge detection,
smoothing, convolutions in general, contrast enhancement, color transformations,
and thresholding. Collectively, we call these operations low-level vision. Low-level
vision is often time consuming simply because images are quite large--a typical size
is 512 × 512 pixels, so the operation must be applied 262,144 times.
Fortunately, this step in image processing is easy to speed up, through the use of
parallelism. The operation applied at every point in the image is often independent
from point to point and also does not vary much in execution time at different
points in the image. This is because at this stage of image processing, nothing has
been done to differentiate one area of the image from another, so that all areas are
processed in the same way. Because of these two characteristics, many parallel
computers achieve good efficiency in these algorithms, through the use of input
partitioning [10].

*The research was supported in part by the Defense Advanced Research Projects Agency (DOD),
monitored by the Air Force Avionics Laboratory under Contract F33615-81-K-1539, and Naval
Electronic Systems Command under Contract N00039-85-C-0134; in part by the U.S. Army Engineer
Topographic Laboratories under Contract DACA76-85-C-0002; and in part by the Office of Naval
Research under Contracts N00014-80-C-0236, NR 048-659, and N00014-85-K-0152, NR SDRJ-007.

Copyright © 1989 by Academic Press, Inc. All rights of reproduction in any form reserved.
We define a language, called Apply, which is designed for implementing these
algorithms. Apply runs on the Warp machine, which has been developed for image
and signal processing. We discuss Warp, and describe its use at this level of vision.
The same Apply program can be compiled either to run on the Warp machine, or
under UNIX, and it runs with good efficiency in both cases. Therefore, the program-
mer is not limited to developing his programs just on Warp, although they run much
faster (typically 100 times faster) there; he can do development under the more
generally available UNIX system.
We consider Apply and its implementation on Warp to be a significant develop-
ment for image processing on parallel computers in general. The most critical
problem in developing new parallel computer architectures is a lack of software
which efficiently uses parallelism. While building powerful new computer architec-
tures is becoming easier because of the availability of custom VLSI and powerful
off-the-shelf components, programming these architectures is difficult.
Parallel architectures are difficult to program because it is not yet understood how
to "cover" parallelism (hide it from the programmer) and get good performance.
Therefore, the programmer either programs the computer in a specialized language
which exploits features of the particular computer and which can run on no other
computer (except in simulation), or he uses a general purpose language, such as
FORTRAN, which runs on many computers but which has additions that make it
possible to program the computer efficiently. In either case, using these special
features is necessary to get good performance from the computer. However, exploit-
ing these features requires training, limits the programs to run on one or at most a
limited class of computers, and limits the lifetime of a program, since eventually it
must be modified to take advantage of new features provided in a new architecture.
Therefore, the programmer faces a dilemma: he must either ignore (if possible) the
special features of his computer, limiting performance, or he must reduce the
understandability, generality, and lifetime of his program.
It is the thesis of Apply that application dependence, in particular programming
model dependence, can be exploited to cover this parallelism while getting good
performance from a parallel machine. Moreover, because of the application depen-
dence of the language, it is possible to provide facilities that make it easier for the
programmer to write his program, even as compared with a general-purpose
language. Apply was originally developed as a tool for writing image processing
programs on UNIX systems; it now runs on UNIX systems, Warp, and the Hughes
HBA. Since we include a definition of Apply as it runs on Warp, and because most
parallel computers support input partitioning, it should be possible to implement it
on other supercomputers and parallel computers as well.
Apply also has implications for benchmarking of new image processing comput-
ers. Currently, it is hard to compare these computers, because they all run different,
incompatible languages and operating systems, so the same program cannot be
tested on different computers. Once Apply is implemented on a computer, it is
possible to fairly test its performance on an important class of image operations,
namely low-level vision.
HAMEY, WEBB, AND WU

Apply is not a panacea for these problems; it is an application-specific,
machine-independent language. Since it is based on input partitioning, it cannot generate
programs which use pipelining, and it cannot be used for global vision algorithms
[9] such as connected components, Hough transform, FFT, and histogram. However,
Apply is in daily use at Carnegie Mellon and elsewhere, and has been used to
implement a significant library (100 programs) of algorithms covering most of
low-level vision. A companion paper [14] describes this library and evaluates
Apply's performance.
We begin by describing the Apply language and its utility for programming
low-level vision algorithms. Examples of Apply programs and Apply's syntax are
presented. Finally, we discuss implementations of Apply on various architectures:
Warp and Warp-like architectures, uni-processors, the Hughes HBA, bit-serial
processor arrays, and distributed memory machines.
2. INTRODUCTION TO APPLY
The Apply programming model is a special-purpose programming approach
which simplifies the programming task by making explicit the parallelism of
low-level vision algorithms. We have developed a special-purpose programming
language called the Apply language which embodies this parallel programming
approach. When using the Apply language, the programmer writes a procedure
which defines the operation to be applied at a particular pixel location. The
procedure conforms to the following programming model:
• It accepts a window or a pixel from each input image.
• It performs arbitrary computation, without side effects.
• It returns a pixel value for each output image.
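In C terms, this model can be sketched as a pure per-pixel function plus a driver that owns all looping and border logic (our own illustration, not the Apply compiler's actual output; `apply_op`, `run_apply`, and the 3 × 3 box filter are invented for the example):

```c
/* Sketch of the Apply model: the user writes only the per-pixel
   operation; the driver owns all looping and border handling. */

/* Per-pixel operation: receives a 3x3 window, returns one output pixel. */
typedef unsigned char (*apply_op)(unsigned char win[3][3]);

/* Driver: applies op at every pixel of a rows x cols image, filling
   the window with 0 outside the image (the BORDER 0 behavior). */
static void run_apply(const unsigned char *in, unsigned char *out,
                      int rows, int cols, apply_op op)
{
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < cols; c++) {
            unsigned char win[3][3];
            for (int dr = -1; dr <= 1; dr++)
                for (int dc = -1; dc <= 1; dc++) {
                    int rr = r + dr, cc = c + dc;
                    win[dr + 1][dc + 1] =
                        (rr >= 0 && rr < rows && cc >= 0 && cc < cols)
                        ? in[rr * cols + cc] : 0;
                }
            out[r * cols + c] = op(win);
        }
}

/* Example operation: 3x3 box average. */
static unsigned char box3(unsigned char win[3][3])
{
    int sum = 0;
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
            sum += win[i][j];
    return (unsigned char)(sum / 9);
}
```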
The Apply compiler converts the simple procedure into an implementation which
can be run efficiently on Warp, or on a uni-processor machine in C under UNIX.
The idea of the Apply programming model grew out of a desire for efficiency
combined with ease of programming for a useful class of low-level vision operations.
In our environment, image data is usually stored in disk files and accessed through a
library interface. This introduces considerable overhead in accessing individual
pixels so algorithms are often written to process an entire row at a time. While
buffering rows improves the speed of algorithms, it also increases their complexity.
A C language subroutine implementation of Apply was developed as a way to hide
the complexities of data buffering from the programmer while still providing the
efficiency benefits. In fact, the buffering methods which we developed were more
efficient than those which would otherwise be used, with the result that Apply
implementations of algorithms were faster than previous implementations.
After implementing Apply, the following additional advantages became evident:
• The Apply programming model concentrates the programming effort on the
actual computation to be performed instead of the looping in which it is embedded.
This encourages programmers to use more efficient implementations of their
algorithms. For example, a Sobel program gained a factor of four in speed when it was
reimplemented with Apply. This speedup primarily resulted from explicitly coding
the convolutions. The resulting code is more comprehensible than the earlier
implementation.
• Apply programs are easier to write, easier to debug, more comprehensible,
and more likely to work correctly the first time. A major benefit of Apply is that it
greatly reduces programming time and effort for a very useful class of vision
algorithms. The resulting programs are also faster than the programmer would
probably otherwise achieve.
    | 1  2  1 |        | 1  0 -1 |
    | 0  0  0 |        | 2  0 -2 |
    |-1 -2 -1 |        | 1  0 -1 |
    Horizontal           Vertical

FIG. 1. The Sobel convolution masks.

Diagonal edges produce some response from each mask, allowing the edge orienta-
tion and strength to be measured for all edges. Both masks are shown in Fig. 1.
An Apply implementation of Sobel edge detection is shown in Fig. 2. The lines
have been numbered for the purposes of explanation, using the comment conven-
tion. Line numbers are not a part of the language.
Line 1 defines the input, output, and constant parameters to the function. The
input parameter inimg is a window of the input image. The constant parameter
thresh is a threshold. Edges which are weaker than this threshold are suppressed in
the output magnitude image, mag. Line 3 defines horiz and vert which are internal
variables used to hold the results of the horizontal and vertical Sobel edge operator.
Line 1 also defines the input image window. It is a 3 × 3 window centered about
the current pixel processing position, which is filled with the value 0 when the
window lies outside the image. This same line declares the constant and output
parameters to be floating-point scalar variables.
The computation of the Sobel convolutions is implemented by the straightforward
expressions on lines 5 through 7. These expressions are readily seen to be a
direct implementation of the convolutions in Fig. 1.
2.3. Border Handling
Border handling is always a difficult and messy process in programming kernel
operations such as Sobel edge detection. In practice, this is usually left up to the
programmer, with varying results--sometimes borders are handled in one way,
sometimes another. Apply provides a uniform way of resolving the difficulty. It
supports border handling by extending the input images with a constant value, or by
replicating image elements. The constant value is specified as an assignment. Line 1
of Fig. 2 indicates that the input image inimg is to be extended by filling with the
constant value 0.
If the programmer says the border is to be MIRRORED then rows and columns of
the image are reflected about the border to supply the needed values. For example,
border element (-2, -3) outside the image has the value of image element (1, 3).
This method of providing borders is convenient in edge detection and smoothing
operations.
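Both border policies amount to a rule for mapping out-of-range coordinates, which can be sketched in C as follows (our illustration; the exact reflection convention used here, -1 → 0, -2 → 1, is our assumption and may differ from Apply's):

```c
/* Sketch of Apply's two border policies as coordinate-fetch helpers,
   shown in one dimension for clarity.

   Constant fill: out-of-range pixels read as a fixed value. */
static int fetch_const(const int *img, int n, int i, int fill)
{
    return (i >= 0 && i < n) ? img[i] : fill;
}

/* MIRRORED: reflect an out-of-range coordinate back into [0, n).
   Convention assumed here: -1 -> 0, -2 -> 1, ..., n -> n-1. */
static int mirror(int i, int n)
{
    if (i < 0)  i = -i - 1;
    if (i >= n) i = 2 * n - i - 1;
    return i;
}
```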
2.4. Image Reduction and Magnification
Apply allows the programmer to process images of different sizes, for example to
reduce a 512 × 512 image to a 256 × 256 image, or to magnify images. This is
implemented via the SAMPLE parameter, which can be applied to input images, and
by using output image variables which are arrays instead of scalars. The SAMPLE
parameter specifies that the Apply operation is to be applied not at every pixel, but
regularly across the image, skipping pixels as specified in the integer list after
SAMPLE. The window around each pixel still refers to the underlying input image.
For example, the following program performs image reduction, using overlapping
4 × 4 windows, to reduce an n × n image to an n/2 × n/2 image:
procedure reduce (inimg : in array (0..3, 0..3) of byte sample (2, 2),
                  outimg : out byte)
is
    sum : integer;
    i, j : integer;
begin
    sum := 0;
    for i in 0..3 loop
        for j in 0..3 loop
            sum := sum + inimg (i, j);
        end loop;
    end loop;
    outimg := sum / 16;
end reduce;
The semantics of SAMPLE (s1, s2) are as follows: the input window is placed
so that pixel (0, 0) falls on image pixels (0, 0), (0, s2), ..., (0, n × s2), ...,
(m × s1, n × s2). Thus, SAMPLE (1, 1) is equivalent to omitting the SAMPLE option
entirely.
Output image arrays work by expanding the output image in either the horizontal
or vertical direction, or both, and placing the resulting output windows so that they
tile the output image without overlapping.
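In C, the reduce program together with the SAMPLE (2, 2) placement rule corresponds to the following sketch (our translation; zero-filling outside the image is our assumption, since the program above does not declare a border):

```c
/* Sketch of the SAMPLE (2, 2) reduce program: output pixel (r, c)
   averages the 4x4 input window whose (0, 0) element sits at input
   pixel (2r, 2c), reading 0 outside the image. Windows overlap by
   two pixels, exactly as in the Apply version. */
static void reduce(const unsigned char *in, int n,   /* n x n input */
                   unsigned char *out)               /* n/2 x n/2   */
{
    for (int r = 0; r < n / 2; r++)
        for (int c = 0; c < n / 2; c++) {
            int sum = 0;
            for (int i = 0; i < 4; i++)
                for (int j = 0; j < 4; j++) {
                    int rr = 2 * r + i, cc = 2 * c + j;
                    sum += (rr < n && cc < n) ? in[rr * n + cc] : 0;
                }
            out[r * (n / 2) + c] = (unsigned char)(sum / 16);
        }
}
```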
[Figure: the comparison network used by the median program; the 3 × 3 window
elements a through i are partially sorted by pairwise comparisons within and
between columns.]
is
define(assignlmh, 'h := $1; m := $2; l := $3;')
define(medianAB, 'if A > B then median := A;
                  else median := B; end if;')
    l, m, h : integer;
    A, B : byte;
begin
    if si(-1, 0) > si(0, 0)
    then if si(0, 0) > si(1, 0)
         then assignlmh (-1, 0, 1)
         else if si(-1, 0) > si(1, 0)
              then assignlmh (-1, 1, 0)
              else assignlmh (1, -1, 0) end if; end if;
    else if si(0, 0) > si(1, 0)
         then if si(-1, 0) > si(1, 0)
              then assignlmh (0, -1, 1)
              else assignlmh (0, 1, -1) end if;
         else assignlmh (1, 0, -1) end if; end if;
    if si(l, -1) > si(m, 1)
    then A := si(l, -1)
    else A := si(m, 1) end if;
    if si(m, -1) < si(h, 1)
    then B := si(m, -1)
    else B := si(h, 1) end if;
    if A > si(m, 0)
    then if si(m, 0) > B
         then median := si(m, 0);
         else medianAB() end if;
    else if si(m, 0) > B
         then medianAB()
         else median := si(m, 0); end if; end if;
end median2;
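For comparison, the exact median of a 3 × 3 window can be computed naively in C by fully sorting the nine values; this is much slower than a comparison network but is useful as a reference when checking optimized versions (our own illustration; `median9` is an invented name):

```c
#include <stdlib.h>

/* Naive reference for the 3x3 median: copy the nine window values
   and fully sort them; the median is the fifth-smallest. */
static int cmp_int(const void *a, const void *b)
{
    return *(const int *)a - *(const int *)b;
}

static int median9(const int win[3][3])
{
    int v[9];
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
            v[3 * i + j] = win[i][j];
    qsort(v, 9, sizeof v[0], cmp_int);
    return v[4];
}
```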
The Warp-like architectures have in common that they are systolic arrays, in
which each processor is a powerful (10 MFLOPS or more) computer with high
word-by-word I/O bandwidth with adjacent processors, arranged in a simple
topology. Apply is implemented on these processors in similar ways, so we first
describe the basic model of low-level image processing on Warp, and then sketch
the implementations on FT Warp and iWarp.
We briefly describe each of the Warp-like architectures; a complete description of
Warp is available elsewhere [2]. Warp is a short linear array, typically consisting of
ten cells, each of which is a 10 MFLOPS computer. The array has high internal
bandwidth, consistent with its use as a systolic processor. Each cell has a local
program and data memory, and can be programmed in a Pascal-level language
called W2, which supports communication between cells using asynchronous word-
by-word send and receive statements. The systolic array is attached to an external
host, which sends and receives data from the array from a separate memory. The
external host in turn is attached to a Sun computer, which provides the user
interface.
Fault-tolerant (FT) Warp is a two-dimensional array, typically a five-by-five
array, being designed by Carnegie Mellon. Each cell is a Warp cell. Each row and
column can be fed data independently, providing for a very high bandwidth. As the
name suggests, this array has as a primary goal fault-tolerance, which is supported
by a virtual channel mechanism mediated by a separate hardware component called
a switch.
iWarp is an integrated version of Warp being designed by Carnegie Mellon and
Intel. In iWarp each Warp cell is implemented by a single chip, plus memory chips.
The baseline iWarp machine is an 8 × 8 two-dimensional array. iWarp
includes support for distant cells to communicate as if they were adjacent, while
passing their data through intermediate cells.
[Figure: an image row of 512 pixels streams through the cells of the Warp
array; each cell retains an adjacent block of about 52 columns.]

In GETROW, the first cell takes one-tenth of the numbers, removing them from
the stream, and passes through the rest to the next cell. The first cell then adds
a number of zeroes to replace the data it has removed, so that the number of data
received and sent are equal.
This process is repeated in each cell. In this way, each cell obtains one-tenth of
the data from a row of the image. As the program is executed, and the process is
repeated for all rows of the image, each cell sees an adjacent set of columns of the
image, as shown in Fig. 4.
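The resulting data distribution can be simulated sequentially in C (a sketch only; the real cells take and forward data concurrently, and `getrow_distribute` is an invented name; we assume the padded row length is a multiple of the number of cells, as GETROW arranges):

```c
/* Sequential sketch of GETROW's data distribution: cell k of NCELLS
   ends up holding the k-th adjacent block of columns of each row. */
#define NCELLS 10
#define ROWLEN 520             /* 512 columns padded up to 520 */

static void getrow_distribute(const unsigned char row[ROWLEN],
                              unsigned char cells[NCELLS][ROWLEN / NCELLS])
{
    int share = ROWLEN / NCELLS;   /* 52 columns per cell */
    for (int k = 0; k < NCELLS; k++)
        for (int j = 0; j < share; j++)
            cells[k][j] = row[k * share + j];
}
```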
We have omitted certain details of GETROW--for example, usually the image row
size is not an exact multiple of ten. In this case, the GETROW macro pads the row
equally on both sides by having the interface unit generate an appropriate number
of zeroes on either side of the image row. Also, usually the area of the image each
cell must see to generate its outputs overlaps with the next cell's area. In this case,
the cell copies some of the data it receives to the next. All this code is automatically
generated by GETROW.
PUTROW, the corresponding macro for output, takes a buffer of one-tenth of the
row length from each cell and combines them by concatenation. The output row
starts as a buffer of 512 zeroes generated by the interface unit. The first cell discards
the first one-tenth of these and adds its own data to the end. The second cell does
the same, adding its data after the first. When the buffer leaves the last cell, all the
zeroes have been discarded and the first cell's data has reached the beginning of the
buffer. The interface unit then converts the floating point numbers in the buffer to
bytes and outputs them to the external host, which receives an array of 512 bytes
packed into 128 32-bit words. As with GETROW, PUTROW handles image buffers that
are not multiples of ten, this time by discarding data on both sides of the buffer
before the buffer is sent to the interface unit by the last cell.
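The shifting buffer of PUTROW can likewise be simulated sequentially in C (a sketch; `putrow_collect` is an invented name, and we use a padded row of 520 = 10 × 52 entries):

```c
#include <string.h>

/* Sequential sketch of PUTROW's output collection: the row buffer
   starts as zeroes; each cell discards the first PSHARE entries and
   appends its own outputs, so when the buffer leaves the last cell
   it holds all the cells' data in order. */
#define PCELLS 10
#define PSHARE 52              /* padded row length 520 / 10 cells */

static void putrow_collect(unsigned char cells[PCELLS][PSHARE],
                           unsigned char buf[PCELLS * PSHARE])
{
    memset(buf, 0, PCELLS * PSHARE);
    for (int k = 0; k < PCELLS; k++) {
        /* discard the first PSHARE entries, shifting the rest left */
        memmove(buf, buf + PSHARE, (PCELLS - 1) * PSHARE);
        /* append this cell's outputs at the end */
        memcpy(buf + (PCELLS - 1) * PSHARE, cells[k], PSHARE);
    }
}
```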
During GETROW, no computation is performed; the same applies to PUTROW.
Warp's horizontal microword, however, allows input, computation, and output at
the same time. COMPUTEROW implements this. Ignoring the complications mentioned
above, COMPUTEROW consists of three loops. In the first loop, the data for the
cell is read into a memory buffer from the previous cell, as in GETROW, and at the
same time the first one-tenth of the output buffer is discarded, as in PUTROW. In
the second loop, nine-tenths of the input row is passed on to the next cell, as in
GETROW; at the same time, nine-tenths of the output buffer is passed through, as in
PUTROW. This loop is unwound by COMPUTEROW so that for every nine inputs and
outputs passed through, one output of this cell is computed. In the third loop, the
outputs computed in the second loop are passed on to the next cell, as in PUTROW.
There are several advantages to this approach to input partitioning:
• Work on the external host is kept to a minimum. In the Warp machine, the
external host tends to be a bottleneck in many algorithms; in the prototype
machines, the external host's actual data rate to the array is only about one-quarter
of the maximum rate the Warp machine can handle, even if the interface unit
unpacks data as it arrives. Using this input partitioning model, the external host
need not unpack and repack bytes, which it would have to if the data was requested
in another order. On the production Warp machine, the same concern applies; these
machines have DMA, which also requires a regular addressing pattern.
• Each cell sees a connected set of columns of the image, which are one-tenth
of the total columns in a row. Processing adjacent columns is an advantage since
many vision algorithms (e.g., median filter [6]) can use the result from a previous set
of columns to speed up the computation at the next set of columns to the right.
• Memory requirements at a cell are minimized, since each cell must store
only one-tenth of a row. This was important in the prototype Warp machines, since
they had only 4K words memory on each cell. On PC Warp, with 32K words of
memory per cell, this approach makes it possible to implement very large window
operations.
• An unexpected side effect of this programming model was that it made it
easier to debug the hardware in the Warp machine. If some portion of a Warp cell is
not working, but the communications and microsequencing portions are, then the
output from a given cell will be wrong, but it will keep its proper position in the
image. This means that the error will be extremely evident--typically a black stripe
is generated in the corresponding position in the image. It is quite easy to infer from
such an image which cell is broken!
3.2. Apply on FT Warp
The 2-dimensional FT Warp array can be viewed as several one-dimensional
arrays. An image is usually divided into several swaths (adjacent groups of rows) on
FT Warp. The data of each swath are fed into the corresponding row of these
two-dimensional processors, as an image is fed into a one-dimensional array. This
results in each cell of FT Warp seeing a rectangular portion of the image.
To make the bandwidth as high as possible and to use the COMPUTEROW model,
we input the data along the horizontal path and output data along the vertical path.
The typical FT Warp array is a five-by-five array, as opposed to ten cells in Warp,
and each cell is as powerful as a Warp cell. FT Warp, however, has much higher
bandwidth than Warp. Therefore, for complex image processing operations where
I/O bandwidth is not a factor, we expect FT Warp Apply programs to be 2.5 times
faster than Warp programs, and even faster in simple image processing operations
where I/O bandwidth limits Warp performance.
3.3. Apply on iWarp
The iWarp implementation of Apply uses a logical pathway mechanism to allow
each cell to process only data intended for that cell. This eliminates much of the
complication of Apply on Warp; there is no need for a cell to explicitly pass data on
to other cells, instead it can simply direct the rest of the data to pass on to later cells
without further intervention.
Our description of Apply on iWarp will be clearest if we describe the action of
GETROW and PUTROW on this machine. In GETROW, each cell accepts data intended
for that cell and then releases control of the data to be passed on to the next cell
automatically, until the arrival of the start of the next row. After releasing control, it
goes on to process the data it has just received. In the meantime, it is allowing data
to pass by on the output channel until the end of the output row arrives. It then
tacks on its computed output to the end of this output row, completing PUTROW.
We expect this method of implementing Apply to be at least as efficient as the
COMPUTEROW model on Warp. Since the baseline iWarp machine has 64 cells, each
of which is 1.6 to 2 times as powerful as a Warp cell, total performance should be
from about 10 to 13 times greater than Warp. iWarp's I/O bandwidth is much
higher than Warp's, so this performance should be achievable for all but the most
simple image processing operations.
[Figure: buffer layout for kernel processing. KB: the element pointed to by
the kernel base. RB: the element pointed to by the row base.]
processor and the bus that allows it to select a subwindow of the image to be stored
into its memory. The input image is sent over the bus and windows are stored in
each processor automatically using DMA. A similar interface exists for outputting
the image from each processor. This allows flexible real-time image processing.
The Hughes HBA Apply implementation is straightforward and similar to the
Warp implementation. The image is divided into "swaths," which are adjacent sets of
rows, and each processor takes one swath. (In the Warp implementation, the swaths
are adjacent sets of columns, instead of rows). Swaths overlap to allow each
processor to compute on a window around each pixel. The processors independently
compute the result for each swath, which is fed back onto the video bus for display.
The HBA implementation of Apply includes a facility for image reduction, which
was not included in earlier versions of Apply. The HBA implementation subsamples
the input images, so that the input image window refers to the subsampled image,
not the original image as in our definition. We prefer the approach here because it
has more general semantics. For example, using image reduction as we have defined
it, it is possible to define image reduction using overlapping windows as in Sec-
tion 2.4.
two steps down, and so on, storing the pixel value in each virtual processor the pixel
encounters, until an m × m square around each virtual processor is filled. This will
take m^2 steps.
• Compute: Each virtual processor now has all the inputs it needs to calculate
the output pixels. Perform this computation in parallel on all processors.
• Input: If there are n processors in use, divide the image into n regions, and
store one region in each of the n processors' memories. The actual shape of the
regions can vary with the particular machine in use. Note that compact regions have
smaller borders than long, thin regions, so that the next step will be more efficient if
the regions are compact.
• Window: For each IN variable, processors exchange rows and columns of
their image with processors holding an adjacent region from the image so that each
processor has enough of the image to compute the corresponding output region.
• Compute: Each processor now has enough data to compute the output
region. It does so, iterating over all pixels in its output region.
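The Input and Window steps can be sketched as a bounds computation in C (our illustration; `region_with_halo` and the square p × p decomposition are assumptions, since the text notes the region shape varies with the machine):

```c
/* Sketch of the Input/Window steps on a distributed-memory machine:
   for an n x n image split into p x p square regions and an m x m
   (odd m) input window, processor (pr, pc) needs its region plus a
   halo of m/2 pixels on every side, clipped to the image. */
typedef struct { int r0, r1, c0, c1; } rect;   /* half-open bounds */

static rect region_with_halo(int n, int p, int pr, int pc, int m)
{
    int side = n / p, h = m / 2;
    rect r;
    r.r0 = pr * side - h;        if (r.r0 < 0) r.r0 = 0;
    r.c0 = pc * side - h;        if (r.c0 < 0) r.c0 = 0;
    r.r1 = (pr + 1) * side + h;  if (r.r1 > n) r.r1 = n;
    r.c1 = (pc + 1) * side + h;  if (r.c1 > n) r.c1 = n;
    return r;
}
```

The halo rows and columns are exactly what the Window step exchanges with the processors holding adjacent regions.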
7. SUMMARY
                 [ BORDER border-type ]
                 [ SAMPLE ( integer-list ) ]
        | var-list : OUT type
        | var-list : CONST type
var-list ::= name [ , name ]*
type ::= ARRAY ( range [ , range ]* ) OF elementary-type
        | elementary-type
statement ::= assignment-stmt
        | if-stmt
        | for-stmt
        | while-stmt
alpha ::= A | B | C | D | E | F | G | H | I | J | K | L
        | M | N | O | P | Q | R | S | T | U | V | W | X | Y | Z
digit ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
border-type ::= integer-expr | MIRRORED
integer-list ::= integer [ , integer ]*
range ::= integer-expr .. integer-expr
elementary-type ::= sign-type object
        | object
assignment-stmt ::= scalar-var := expr
if-stmt ::= IF bool-expr THEN
                statements
            END IF
        | IF bool-expr THEN
                statements
          ELSE
                statements
          END IF
for-stmt ::= FOR variable IN range LOOP
                statements
             END LOOP
while-stmt ::= WHILE bool-expr LOOP
                statements
               END LOOP
integer-expr ::= integer-expr binary-op integer-expr
        | ! integer-expr
        | ( integer-expr )
        | integer
integer ::= [ sign ] digit [ digit ]*
sign-type ::= [ SIGNED | UNSIGNED ]
object ::= BYTE
        | INTEGER
        | REAL
scalar-var ::= variable
        | variable ( subscript-list )
expr ::= expr binary-op expr
        | ! expr
        | ( expr )
        | pseudo-function ( expr )
        | variable ( subscript-list )
        | variable
bool-expr ::= bool-expr AND bool-expr
        | bool-expr OR bool-expr
        | NOT bool-expr
        | ( bool-expr )
        | expr < expr
        | expr <= expr
        | expr = expr
        | expr >= expr
        | expr > expr
        | expr /= expr
binary-op ::= + | - | * | / | ^ | &
sign ::= + | -
pseudo-function ::= name
subscript-list ::= expr [ , expr ]*
number ::= integer [ . [ digit ]* ]
ACKNOWLEDGMENTS
We would like to acknowledge the contributions made by Steve Shafer, who helped develop the Apply
programming model. The Warp project is a large project at Carnegie Mellon and General Electric
Company. The authors are greatly indebted to this group, which has designed, built, and maintained the
Warp machine, as well as implemented the W2 programming language, which is the basis for the Warp
implementation of Apply. Apply itself grew out of work in the standard vision programming environ-
ment at Carnegie Mellon, which is based on C/UNIX. Apply benefited from the use and criticism of
members of the Image Understanding Systems and Autonomous Land Vehicles group at Carnegie
Mellon.
REFERENCES
1. Reference Manual for the Ada Programming Language, MIL-STD 1815 edition, U.S. Department of
Defense, AdaTEC, SIGPLAN Technical Committee on Ada, New York, AdaTEC, 1982. Draft
revised MIL-STD 1815; draft proposed ANSI Standard document.
2. M. Annaratone, E. Arnould, T. Gross, H. T. Kung, M. Lam, O. Menzilcioglu, K. Sarocky, and J. A.
Webb, Warp architecture and implementation, in Proceedings, 13th Annual International Sympo-
sium on Computer Architecture, June, 1986, pp. 346-356.
3. K. E. Batcher, Bit-serial parallel processing systems, IEEE Trans. Comput. C-31, No. 5, 1982,
377-384.
4. BBN Laboratories, The Uniform System Approach to Programming the Butterfly Parallel Processor, 1st
ed., Cambridge, MA, 1985.
5. W. D. Hillis, The Connection Machine, The MIT Press, Cambridge, MA, 1985.
6. T. S. Huang, G. J. Yang, and G. Y. Tang, A fast two-dimensional median filtering algorithm, in
International IEEE Conference on Pattern Recognition and Image Processing, 1978, pp. 128-130.
7. iPSC System Overview, Intel Corporation, 1985.
8. B. W. Kernighan and D. M. Ritchie, The M4 macro processor, in Unix Programmer's Manual, Bell
Laboratories, Murray Hill, NJ 07974, 1979.
9. H. T. Kung and J. A. Webb, Global operations on the CMU Warp machine, in Proceedings, 1985
AIAA Computers in Aerospace V Conference, October, 1985, pp. 209-218.
10. H. T. Kung and J. A. Webb, Mapping image processing operations onto a linear systolic machine,
Distributed Comput. 1, No. 4, 1986, 246-257.
11. T. J. Olson, An Image Processing Package for the BBN Butterfly Parallel Processor, Butterfly Project
Report 9, Department of Computer Science, University of Rochester, August 1985.
12. C. Seitz, The cosmic cube, Commun. ACM 28, No. 1, 1985, 22-33.
13. R. S. Wallace and M. D. Howard, HBA vision architecture: Built and benchmarked, in Computer
Architectures for Pattern Analysis and Machine Intelligence, IEEE Computer Society, Seattle,
Washington, December 1987.
14. R. S. Wallace, J. A. Webb, and I-C. Wu, Architecture independent image processing: Performance of
Apply on diverse architectures, Comput. Vision Graphics Image Process. 48 (1989).