
COMPUTER VISION, GRAPHICS, AND IMAGE PROCESSING 48, 246-264 (1989)

NOTE

An Architecture Independent Programming Language


for Low-Level Vision*
LEONARD G. C. HAMEY, JON A. WEBB, AND I-CHEN WU

School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213


Received October 8, 1987; revised February 22, 1989

Low-level vision is particularly amenable to implementation on parallel architectures, which offer an enormous speedup at this level. To take advantage of this, the algorithm must be
adapted to the particular parallel architecture. Having to adapt programs in this way poses a
significant barrier to the vision programmer, who must learn and practice a different method of
parallelization for each different parallel machine. There is also no possibility of portability for
programs written for a particular parallel architecture. We have developed a specialized
programming language, called Apply, which reduces the problem of writing the algorithm for
this class of programs to the task of writing the function to be applied to a window around a
single pixel. Apply provides a method for programming these applications which is easy,
consistent, and efficient. Apply is programming model specific--it implements the input
partitioning model--but is architecture independent. It is possible to implement versions of
Apply which run efficiently on a wide variety of computers. We describe implementations of
Apply on Warp, various Warp-like architectures, UNIX, and the Hughes HBA and sketch
implementations on bit-serial processor arrays and distributed memory machines. © 1989 Academic Press, Inc.

1. INTRODUCTION

In computer vision, the first, and often most time-consuming, step in image
processing is image to image operations. In this step, an input image is mapped into
an output image through some local operation that applies to a window around each
pixel of the input image. Algorithms that fall into this class include: edge detection,
smoothing, convolutions in general, contrast enhancement, color transformations,
and thresholding. Collectively, we call these operations low-level vision. Low-level
vision is often time consuming simply because images are quite large--a typical size
is 512 x 512 pixels, so the operation must be applied 262,144 times.
Fortunately, this step in image processing is easy to speed up, through the use of
parallelism. The operation applied at every point in the image is often independent
from point to point and also does not vary much in execution time at different
points in the image. This is because at this stage of image processing, nothing has
been done to differentiate one area of the image from another, so that all areas are
processed in the same way. Because of these two characteristics, many parallel

*The research was supported in part by Defense Advanced Research Projects Agency (DOD),
monitored by the Air Force Avionics Laboratory under Contract F33615-81-K-1539, and Naval Elec-
tronic Systems Command under Contract N00039-85-C-0134, in part by the U.S. Army Engineer
Topographic Laboratories under Contract DACA76-85-C-0002, and in part by the Office of Naval
Research under Contracts N00014-80-C-0236, NR 048-659, and N00014-85-K-0152, NR SDRJ-007.

computers achieve good efficiency in these algorithms, through the use of input
partitioning [10].
We define a language, called Apply, which is designed for implementing these
algorithms. Apply runs on the Warp machine, which has been developed for image
and signal processing. We discuss Warp, and describe its use at this level of vision.
The same Apply program can be compiled either to run on the Warp machine, or
under UNIX, and it runs with good efficiency in both cases. Therefore, the program-
mer is not limited to developing his programs just on Warp, although they run much
faster (typically 100 times faster) there; he can do development under the more
generally available UNIX system.
We consider Apply and its implementation on Warp to be a significant develop-
ment for image processing on parallel computers in general. The most critical
problem in developing new parallel computer architectures is a lack of software
which efficiently uses parallelism. While building powerful new computer architec-
tures is becoming easier because of the availability of custom VLSI and powerful
off-the-shelf components, programming these architectures is difficult.
Parallel architectures are difficult to program because it is not yet understood how
to "cover" parallelism (hide it from the programmer) and get good performance.
Therefore, the programmer either programs the computer in a specialized language
which exploits features of the particular computer and which can run on no other
computer (except in simulation), or he uses a general purpose language, such as
FORTRAN, which runs on many computers but which has additions that make it
possible to program the computer efficiently. In either case, using these special
features is necessary to get good performance from the computer. However, exploit-
ing these features requires training, limits the programs to run on one or at most a
limited class of computers, and limits the lifetime of a program, since eventually it
must be modified to take advantage of new features provided in a new architecture.
Therefore, the programmer faces a dilemma: he must either ignore (if possible) the
special features of his computer, limiting performance, or he must reduce the
understandability, generality, and lifetime of his program.
It is the thesis of Apply that application dependence, in particular programming
model dependence, can be exploited to cover this parallelism while getting good
performance from a parallel machine. Moreover, because of the application depen-
dence of the language, it is possible to provide facilities that make it easier for the
programmer to write his program, even as compared with a general-purpose
language. Apply was originally developed as a tool for writing image processing
programs on UNIX systems; it now runs on UNIX systems, Warp, and the Hughes
HBA. Since we include a definition of Apply as it runs on Warp, and because most
parallel computers support input partitioning, it should be possible to implement it
on other supercomputers and parallel computers as well.
Apply also has implications for benchmarking of new image processing comput-
ers. Currently, it is hard to compare these computers, because they all run different,
incompatible languages and operating systems, so the same program cannot be
tested on different computers. Once Apply is implemented on a computer, it is
possible to fairly test its performance on an important class of image operations,
namely low-level vision.
Apply is not a panacea for these problems; it is an application-specific, machine-
independent, language. Since it is based on input partitioning, it cannot generate

programs which use pipelining, and it cannot be used for global vision algorithms
[9] such as connected components, Hough transform, FFT, and histogram. However, Apply is in daily use at Carnegie Mellon and elsewhere, and has been used to
implement a significant library (100 programs) of algorithms covering most of
low-level vision. A companion paper [14] describes this library and evaluates
Apply's performance.
We begin by describing the Apply language and its utility for programming
low-level vision algorithms. Examples of Apply programs and Apply's syntax are
presented. Finally, we discuss implementations of Apply on various architectures:
Warp and Warp-like architectures, uni-processors, the Hughes HBA, bit-serial
processor arrays, and distributed memory machines.
2. INTRODUCTION TO APPLY
The Apply programming model is a special-purpose programming approach
which simplifies the programming task by making explicit the parallelism of
low-level vision algorithms. We have developed a special-purpose programming
language called the Apply language which embodies this parallel programming
approach. When using the Apply language, the programmer writes a procedure
which defines the operation to be applied at a particular pixel location. The
procedure conforms to the following programming model:
• It accepts a window or a pixel from each input image.
• It performs arbitrary computation, without side effects.
• It returns a pixel value for each output image.
The Apply compiler converts the simple procedure into an implementation which
can be run efficiently on Warp, or on a uni-processor machine in C under UNIX.
The idea of the Apply programming model grew out of a desire for efficiency
combined with ease of programming for a useful class of low-level vision operations.
In our environment, image data is usually stored in disk files and accessed through a
library interface. This introduces considerable overhead in accessing individual
pixels so algorithms are often written to process an entire row at a time. While
buffering rows improves the speed of algorithms, it also increases their complexity.
A C language subroutine implementation of Apply was developed as a way to hide
the complexities of data buffering from the programmer while still providing the
efficiency benefits. In fact, the buffering methods which we developed were more
efficient than those which would otherwise be used, with the result that Apply
implementations of algorithms were faster than previous implementations.
After implementing Apply, the following additional advantages became evident:
• The Apply programming model concentrates the programming effort on the
actual computation to be performed instead of the looping in which it is embedded.
This encourages programmers to use more efficient implementations of their algo-
rithms. For example, a Sobel program gained a factor of four in speed when it was
reimplemented with Apply. This speedup primarily resulted from explicitly coding
the convolutions. The resulting code is more comprehensible than the earlier
implementation.
• Apply programs are easier to write, easier to debug, more comprehensible,
and more likely to work correctly the first time. A major benefit of Apply is that it

greatly reduces programming time and effort for a very useful class of vision
algorithms. The resulting programs are also faster than the programmer would
probably otherwise achieve.

2.1. The Apply Language


The Apply language is designed for programming image to image computations
where the pixels of the output images can be computed from corresponding
rectangular windows of the input images. The essential feature of the language is
that each operation is written as a procedure for a single pixel position. The Apply
compiler generates a program which executes the procedure over an entire image.
No ordering constraints are provided for in the language, allowing the compiler
complete freedom in dividing the computation among processors.
Each procedure has a parameter list containing parameters of any of the follow-
ing types: in, out, or constant. Input parameters are either scalar variables or
two-dimensional arrays. A scalar input variable represents the pixel value of an
input image at the current processing coordinates. A two-dimensional array input
variable represents a window of an input image. Element (0, 0) of the array
corresponds to the current processing coordinates.
Output parameters are scalar variables. Each output variable represents the pixel
value of an output image. The final value of an output variable is stored in the
output image at the current processing coordinates.
Constant parameters may be scalars, vectors, or two-dimensional arrays. They
represent precomputed constants which are made available for use by the proce-
dure. For example, a convolution program would use a constant array for the
convolution mask.
The reserved variables ROW and COL are defined to contain the image coordinates
of the current processing location. This is useful for algorithms which are dependent
in a limited way on the image coordinates.
Appendix I gives a grammar of the Apply language. The syntax of Apply is based
on Ada [1]; we chose this syntax because it is familiar and adequate. However, as
should be clear, the application dependence of Apply means that it is not an Ada
subset, nor is it intended to evolve into such a subset.
The operators ^, |, &, and ! refer to the exclusive or, or, and, and not operations,
respectively. Variable and function names are alphanumeric strings of arbitrary
length, commencing with an alphabetic character. The INTEGER and REAL pseudo-
functions convert from real to integer, and from integer (or byte) to real types. Case
is not significant, except in the preprocessing stage which is implemented by the m4
macro processor [8].
BYTE, INTEGER, and REAL refer to (at least) 8-bit integers, 16-bit integers, and
32-bit floating point numbers. BYTE values are converted implicitly to INTEGER
within computations. The actual size of the type may be larger, at the discretion of
the implementor.

2.2. An Implementation of Sobel Edge Detection


As a simple example of the use of Apply, let us consider the implementation of
Sobel edge detection. Sobel edge detection is performed by convolving the input
image with two 3 × 3 masks. The horizontal mask measures the gradient of
horizontal edges, and the vertical mask measures the gradient of vertical edges.

| 1  2  1 |        | 1  0 -1 |
| 0  0  0 |        | 2  0 -2 |
|-1 -2 -1 |        | 1  0 -1 |

 Horizontal          Vertical

FIG. 1. The Sobel convolution masks.

Diagonal edges produce some response from each mask, allowing the edge orienta-
tion and strength to be measured for all edges. Both masks are shown in Fig. 1.
An Apply implementation of Sobel edge detection is shown in Fig. 2. The lines
have been numbered for the purposes of explanation, using the comment conven-
tion. Line numbers are not a part of the language.
Line 1 defines the input, output, and constant parameters to the function. The
input parameter inimg is a window of the input image. The constant parameter
thresh is a threshold. Edges which are weaker than this threshold are suppressed in
the output magnitude image, mag. Line 3 defines horiz and vert which are internal
variables used to hold the results of the horizontal and vertical Sobel edge operator.
Line 1 also defines the input image window. It is a 3 × 3 window centered about
the current pixel processing position, which is filled with the value 0 when the
window lies outside the image. This same line declares the constant and output
parameters to be floating-point scalar variables.
The computation of the Sobel convolutions is implemented by the straight-for-
ward expressions on lines 5 through 7. These expressions are readily seen to be a
direct implementation of the convolutions in Fig. 1.
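To make the programming model concrete, the following C fragment is a naive rendering of what the procedure in Fig. 2 means: a doubly nested loop visits every pixel, window accesses are resolved through the BORDER 0 rule, and the procedure body is evaluated at each position. It is only an illustration of the semantics, not the code the Apply compiler emits (the uni-processor implementation actually uses the buffering scheme of Section 4); the image size N, the float output, and the helper name window are assumptions of the sketch.

#include <math.h>

#define N 512                           /* assumed image size */

/* window access with the BORDER 0 rule of Fig. 2 */
static float window(const unsigned char img[N][N], int r, int c)
{
    return (r < 0 || r >= N || c < 0 || c >= N) ? 0.0f : (float)img[r][c];
}

void sobel_ref(const unsigned char inimg[N][N], float mag[N][N], float thresh)
{
    for (int row = 0; row < N; row++) {
        for (int col = 0; col < N; col++) {
            /* the body of procedure sobel, with inimg(i, j) read as
               window(inimg, row + i, col + j) */
            float horiz = window(inimg, row-1, col-1) + 2*window(inimg, row-1, col)
                        + window(inimg, row-1, col+1) - window(inimg, row+1, col-1)
                        - 2*window(inimg, row+1, col) - window(inimg, row+1, col+1);
            float vert  = window(inimg, row-1, col-1) + 2*window(inimg, row, col-1)
                        + window(inimg, row+1, col-1) - window(inimg, row-1, col+1)
                        - 2*window(inimg, row, col+1) - window(inimg, row+1, col+1);
            float m = sqrtf(horiz*horiz + vert*vert);
            mag[row][col] = (m < thresh) ? 0.0f : m;
        }
    }
}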
2.3. Border Handling
Border handling is always a difficult and messy process in programming kernel
operations such as Sobel edge detection. In practice, this is usually left up to the
programmer, with varying results--sometimes borders are handled in one way,
sometimes another. Apply provides a uniform way of resolving the difficulty. It
supports border handling by extending the input images with a constant value, or by
replicating image elements. The constant value is specified as an assignment. Line 1

procedure sobel (inimg  : in array (-1..1, -1..1) of byte       -- 1
                          border 0,
                 thresh : const real,
                 mag    : out real)
is                                                              -- 2
    horiz, vert : integer;                                      -- 3
begin                                                           -- 4
    horiz := inimg(-1,-1) + 2 * inimg(-1,0) + inimg(-1,1) -     -- 5
             inimg(1,-1) - 2 * inimg(1,0) - inimg(1,1);
    vert := inimg(-1,-1) + 2 * inimg(0,-1) + inimg(1,-1) -      -- 6
            inimg(-1,1) - 2 * inimg(0,1) - inimg(1,1);
    mag := sqrt(horiz*horiz + vert*vert);                       -- 7
    if mag < thresh then                                        -- 8
        mag := 0.0;                                             -- 9
    end if;                                                     -- 10
end sobel;                                                      -- 11

FIG. 2. An Apply implementation of thresholded Sobel edge detection.



Line 1 of Fig. 2 indicates that the input image inimg is to be extended by filling with the
constant value 0.
If the programmer says the border is to be MIRRORED, then rows and columns of
the image are reflected about the border to supply the needed values. For example,
border element (-2, -3) outside the image has the value of image element (1, 3).
This method of providing borders is convenient in edge detection and smoothing
operations.
2.4. Image Reduction and Magnification
Apply allows the programmer to process images of different sizes, for example to
reduce a 512 × 512 image to a 256 × 256 image, or to magnify images. This is
implemented via the SAMPLE parameter, which can be applied to input images, and
by using output image variables which are arrays instead of scalars. The SAMPLE
parameter specifies that the Apply operation is to be applied not at every pixel, but
regularly across the image, skipping pixels as specified in the integer list after
SAMPLE. The window around each pixel still refers to the underlying input image.
For example, the following program performs image reduction, using overlapping
4 × 4 windows, to reduce an n × n image to an n/2 × n/2 image:
procedure reduce (inimg  : in array (0..3, 0..3) of byte sample (2, 2),
                  outimg : out byte)
is
    sum  : integer;
    i, j : integer;
begin
    sum := 0;
    for i in 0..3 loop
        for j in 0..3 loop
            sum := sum + inimg(i, j);
        end loop;
    end loop;
    outimg := sum / 16;
end reduce;

Magnification can be done by using an output image variable which is an array.


The result is that, instead of a single pixel being output for each input pixel, several
pixels are output, making the output image larger than the input. The following
program uses this to perform a simple image magnification, using linear interpola-
tion:
procedure magnify (inimg  : in array (-1..1, -1..1) of byte
                            border mirrored,
                   outimg : out array (0..1, 0..1) of byte)
is
begin
    outimg(0,0) := (inimg(-1,-1) + inimg(-1,0)
                    + inimg(0,-1) + inimg(0,0)) / 4;
    outimg(0,1) := (inimg(-1,0) + inimg(-1,1)
                    + inimg(0,0) + inimg(0,1)) / 4;
    outimg(1,0) := (inimg(0,-1) + inimg(0,0)
                    + inimg(1,-1) + inimg(1,0)) / 4;
    outimg(1,1) := (inimg(0,0) + inimg(0,1)
                    + inimg(1,0) + inimg(1,1)) / 4;
end magnify;

The semantics of SAMPLE(s1, s2) are as follows: the input window is placed
so that pixel (0, 0) falls on image pixels (0, 0), (0, s2), ..., (0, n × s2), ...,
(m × s1, n × s2). Thus, SAMPLE(1, 1) is equivalent to omitting the SAMPLE option
entirely.
Output image arrays work by expanding the output image in either the horizontal
or vertical direction, or both, and placing the resulting output windows so that they
tile the output image without overlapping.
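The coordinate arithmetic implied by SAMPLE and by array-valued outputs can be summarized in a few lines of C. The sketch below is ours rather than part of Apply: for processing step (i, j), the input window element (0, 0) is anchored at input pixel (i * s1, j * s2), and an output array of extent o1 by o2 is placed at output pixel (i * o1, j * o2); the struct and function names are invented for the illustration.

#include <stdio.h>

struct placement {
    int in_row, in_col;    /* where window element (0,0) falls in the input image  */
    int out_row, out_col;  /* where output element (0,0) falls in the output image */
};

static struct placement place(int i, int j, int s1, int s2, int o1, int o2)
{
    struct placement p = { i * s1, j * s2, i * o1, j * o2 };
    return p;
}

int main(void)
{
    /* reduce: SAMPLE(2,2) with a scalar output, so the window steps by 2
       while the output steps by 1, halving the image. */
    struct placement r = place(3, 5, 2, 2, 1, 1);
    printf("reduce step (3,5): window at (%d,%d), output at (%d,%d)\n",
           r.in_row, r.in_col, r.out_row, r.out_col);

    /* magnify: no SAMPLE (i.e., SAMPLE(1,1)) with a 2x2 output array, so the
       window steps by 1 while the output block steps by 2, doubling the image. */
    struct placement m = place(3, 5, 1, 1, 2, 2);
    printf("magnify step (3,5): window at (%d,%d), output block at (%d,%d)\n",
           m.in_row, m.in_col, m.out_row, m.out_col);
    return 0;
}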

2.5. Multifunction Apply Modules


In many low-level image processing algorithms, partial results computed at one pixel are
saved in order to be used to calculate the results at an adjacent pixel; this results in
a more efficient algorithm. Because Apply programs do not share results from
adjacent pixels (doing so would violate Apply's order-independence, which is what
makes it easy to implement in parallel), Apply programmers cannot take advantage
of this trick. However, many of these algorithms can be factored into multiple
passes in a way that results in an efficient program without needing to introduce
order dependence.
These multiple functions can be efficiently implemented in Apply. Where memory
use is not a concern, the intermediate results can be saved, and used by the next
Apply program. In cases where memory is limited, multiple Apply functions can be
compiled together into a single pass.

procedure rowcol (inimg  : in array (-1..1, -1..1) of byte
                           border 0,
                  rowsum : out integer,
                  colsum : out integer)
is
begin
    rowsum := inimg(0,-1) + 2 * inimg(0,0) + inimg(0,1);
    colsum := inimg(-1,0) + 2 * inimg(0,0) + inimg(1,0);
end rowcol;

procedure sobel (rowsum : in array (-1..1, 0..0) of integer,
                 colsum : in array (0..0, -1..1) of integer,
                 thresh : const real,
                 mag    : out real)
is
    horiz, vert : integer;
begin
    horiz := rowsum(-1,0) - rowsum(1,0);
    vert := colsum(0,-1) - colsum(0,1);
    mag := sqrt(horiz*horiz + vert*vert);
    if mag < thresh then
        mag := 0.0;
    end if;
end sobel;

FIG. 3. A more efficient Sobel operator.



2.5.1. An Efficient Sobel Operator


A simple example is the Sobel operator. In the program shown in Fig. 2, the row
and column sums must be recalculated at each pixel. But the row and column sums
at a pixel are also needed by the pixels two to the left, right, top, and bottom.
Recomputing them is inefficient.
Figure 3 shows the same Sobel operator implemented as multiple functions. In
the ROWCOL procedure, the row and column sums are calculated--each is calculated
only once per pixel. In the SOBEL procedure, the row and column differences are
summed, and the result is computed just as before. This program does six fewer
additions and two fewer multiplications than the program in Fig. 2.
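The data flow of Fig. 3 can also be sketched in conventional C, with the intermediate row and column sums kept in full-size temporary images between the two passes. The fragment below is an illustration only, not the Apply compiler's output; the image size N, the zero border applied to both the input and the intermediate images, and the helper names are assumptions of the sketch.

#include <math.h>

#define N 512                                    /* assumed image size */

static float pix(const unsigned char img[N][N], int r, int c)
{
    return (r < 0 || r >= N || c < 0 || c >= N) ? 0.0f : (float)img[r][c];   /* BORDER 0 */
}

static float sum_at(float img[N][N], int r, int c)
{
    return (r < 0 || r >= N || c < 0 || c >= N) ? 0.0f : img[r][c];          /* same rule for the
                                                                                intermediate images */
}

void sobel_two_pass(const unsigned char in[N][N], float mag[N][N], float thresh,
                    float rowsum[N][N], float colsum[N][N])
{
    /* Pass 1 (ROWCOL): each weighted row and column sum is computed once per pixel. */
    for (int r = 0; r < N; r++)
        for (int c = 0; c < N; c++) {
            rowsum[r][c] = pix(in, r, c-1) + 2*pix(in, r, c) + pix(in, r, c+1);
            colsum[r][c] = pix(in, r-1, c) + 2*pix(in, r, c) + pix(in, r+1, c);
        }

    /* Pass 2 (SOBEL): only two subtractions per pixel remain before the magnitude. */
    for (int r = 0; r < N; r++)
        for (int c = 0; c < N; c++) {
            float horiz = sum_at(rowsum, r-1, c) - sum_at(rowsum, r+1, c);
            float vert  = sum_at(colsum, r, c-1) - sum_at(colsum, r, c+1);
            float m = sqrtf(horiz*horiz + vert*vert);
            mag[r][c] = (m < thresh) ? 0.0f : m;
        }
}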

2.5.2. An Efficient Median Filter


Another example is the median filter. Many median filter algorithms use results from
an adjacent calculation of the median filter to compute a new median filter, when
processing the image in raster order. Apply multiple functions lead to the following
3 × 3 median filter.
The algorithm works in two steps. The first step (MEDIAN1) produces, for each
pixel, a sort of the pixel and the pixels above and below that pixel. The result from
this step is an image three times higher than the original, with the same width. The
second step (MEDIAN2) sorts, based on their middle elements, the
three columns produced by the first step. Note the use of the SAMPLE clause in
this step to place MEDIAN2 at every third row produced by MEDIAN1--this
causes MEDIAN2 to produce an image the same size as the input to MEDIAN1.
MEDIAN2 produces the following relationships among the nine pixels at and
surrounding a pixel:

a   d   g
∨   ∨   ∨
b < e < h
∨   ∨   ∨
c   f   i

(each ∨ indicates that the element above it is greater than the element below it)

From this diagram, it is easy to see that none of pixels g, h, b, or c can be the

median, because they are all greater or less than at least five other pixels in the
neighborhood. The only candidates for median are a, d, e, f, and i. Now we
observe that f < (e, h, d, g), so that if f < a, f cannot be the median since it will
be less than five pixels in the neighborhood. Similarly, if a < f, a cannot be the
median. We therefore compare a and f, and keep the larger. By a similar argument,
we compare i and d and keep the smaller. This leaves three pixels: e and the two
pixels we chose from {a, f} and {d, i}. All of these are median candidates. We
therefore sort them and choose the middle element; this is the median.
This algorithm computes a 3 × 3 median filter with only eleven comparisons,
comparable to many techniques for optimizing median filter in raster-order process-
ing algorithms.
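For readers who prefer conventional code, the following C function is our own compact rendering of the comparison scheme just described: it sorts the three columns, orders the columns by their middle elements, and returns the middle element of {max(a, f), e, min(d, i)}. It performs more comparisons per call than the two-pass Apply version, because the column sorts are not shared between neighboring pixels, but it computes the same result.

/* median of a 3x3 neighborhood w, following the scheme in the text */
static void sort3_desc(float v[3])
{
    float t;
    if (v[0] < v[1]) { t = v[0]; v[0] = v[1]; v[1] = t; }
    if (v[1] < v[2]) { t = v[1]; v[1] = v[2]; v[2] = t; }
    if (v[0] < v[1]) { t = v[0]; v[0] = v[1]; v[1] = t; }
}

float median9(float w[3][3])
{
    /* MEDIAN1: sort each column so that col[k][0] >= col[k][1] >= col[k][2]. */
    float col[3][3];
    for (int k = 0; k < 3; k++) {
        col[k][0] = w[0][k]; col[k][1] = w[1][k]; col[k][2] = w[2][k];
        sort3_desc(col[k]);
    }

    /* MEDIAN2, first step: order the columns by their middle elements, so that
       col[idx[0]] has the smallest middle and col[idx[2]] the largest. */
    int idx[3] = { 0, 1, 2 }, t;
    if (col[idx[0]][1] > col[idx[1]][1]) { t = idx[0]; idx[0] = idx[1]; idx[1] = t; }
    if (col[idx[1]][1] > col[idx[2]][1]) { t = idx[1]; idx[1] = idx[2]; idx[2] = t; }
    if (col[idx[0]][1] > col[idx[1]][1]) { t = idx[0]; idx[0] = idx[1]; idx[1] = t; }
    float *lo = col[idx[0]], *mi = col[idx[1]], *hi = col[idx[2]];

    /* a = top of lo, f = bottom of mi, d = top of mi, i = bottom of hi, e = middle of mi */
    float A = (lo[0] > mi[2]) ? lo[0] : mi[2];   /* keep the larger of a and f  */
    float B = (mi[0] < hi[2]) ? mi[0] : hi[2];   /* keep the smaller of d and i */

    /* the median is the middle element of {A, e, B} */
    float cand[3] = { A, mi[1], B };
    sort3_desc(cand);
    return cand[1];
}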
M4 macro definitions are used to shorten the code.

-- Sort the three elements at, above, and below each pixel
procedure median1 (image : in array (-1..1, 0..0) of byte
                           border mirrored,
                   si    : out array (-1..1, 0..0) of byte)
is
    define(assigntosi, 'si(1,0)  := image($1,0);
                        si(0,0)  := image($2,0);
                        si(-1,0) := image($3,0);')
begin
    if image(-1,0) > image(0,0)
    then if image(0,0) > image(1,0)
         then assigntosi(-1, 0, 1)
         else if image(-1,0) > image(1,0)
              then assigntosi(-1, 1, 0)
              else assigntosi(1, -1, 0) end if; end if;
    else if image(0,0) > image(1,0)
         then if image(-1,0) > image(1,0)
              then assigntosi(0, -1, 1)
              else assigntosi(0, 1, -1) end if;
         else assigntosi(1, 0, -1) end if; end if;
end median1;

procedure median2 (si     : in array (-1..1, -1..1) of byte
                            sample (3, 1) border mirrored,
                   median : out byte)
-- Combine the sorted columns from the first step to give the median.
is
    define(assignlmh, 'h := $1; m := $2; l := $3;')
    define(medianAB, 'if A > B then median := A;
                      else median := B; end if;')
    l, m, h : integer;
    A, B : byte;
begin
    if si(-1,0) > si(0,0)
    then if si(0,0) > si(1,0)
         then assignlmh(-1, 0, 1)
         else if si(-1,0) > si(1,0)
              then assignlmh(-1, 1, 0)
              else assignlmh(1, -1, 0) end if; end if;
    else if si(0,0) > si(1,0)
         then if si(-1,0) > si(1,0)
              then assignlmh(0, -1, 1)
              else assignlmh(0, 1, -1) end if;
         else assignlmh(1, 0, -1) end if; end if;
    if si(l,-1) > si(m,1)
    then A := si(l,-1)
    else A := si(m,1) end if;
    if si(m,-1) < si(h,1)
    then B := si(m,-1)
    else B := si(h,1) end if;
    if A > si(m,0)
    then if si(m,0) > B
         then median := si(m,0);
         else medianAB() end if;
    else if si(m,0) > B
         then medianAB()
         else median := si(m,0); end if; end if;
end median2;

3. APPLY ON WARP AND WARP-LIKE ARCHITECTURES

The Warp-like architectures have in common that they are systolic arrays, in
which each processor is a powerful (10 MFLOPS or more) computer with high
word-by-word I/O bandwidth with adjacent processors, arranged in a simple
topology. Apply is implemented on these processors in similar ways, so we first
describe the basic model of low-level image processing on Warp, and then sketch
the implementations on FT Warp and iWarp.
We briefly describe each of the Warp-like architectures; a complete description of
Warp is available elsewhere [2]. Warp is a short linear array, typically consisting of
ten cells, each of which is a 10 MFLOPS computer. The array has high internal
bandwidth, consistent with its use as a systolic processor. Each cell has a local
program and data memory, and can be programmed in a Pascal-level language
called W2, which supports communication between cells using asynchronous word-
by-word send and receive statements. The systolic array is attached to an external
host, which sends and receives data from the array from a separate memory. The
external host in turn is attached to a Sun computer, which provides the user
interface.
Fault-tolerant (FT) Warp is a two-dimensional array, typically a five-by-five
array, being designed by Carnegie Mellon. Each cell is a Warp cell. Each row and
column can be fed data independently, providing for a very high bandwidth. As the
name suggests, this array has as a primary goal fault tolerance, which is supported
by a virtual channel mechanism mediated by a separate hardware component called
a switch.
iWarp is an integrated version of Warp being designed by Carnegie Mellon and
Intel. In iWarp each Warp cell is implemented by a single chip, plus memory chips.
The baseline iWarp machine is an 8 × 8 two-dimensional array. iWarp
includes support for distant cells to communicate as if they were adjacent, while
passing their data through intermediate cells.

3.1. Low-Level Vision on Warp


We map low-level vision algorithms onto Warp by the input partitioning method.
On a Warp array of ten cells, the image is divided into ten regions, by column, as
shown in Fig. 4. This gives each cell a tall, narrow region to process; for 512 × 512
image processing, the region size is 52 columns by 512 rows. To use technical terms
from weaving, the Warp cells are the "warp" of the processing; the "weft" is the
rows of the image as it passes through the Warp array.
The image is divided in this way using a series of macros called GETROW,
PUTROW, and COMPUTEROW. GETROW generates code that takes a row of an image
from the external host, and distributes one-tenth of it to each of ten cells. The
programmer includes a GETROW macro at the point in his program where he wants
to obtain a row of the image; after the execution of the macro, a buffer in the
internal cell memory has the data from the image row.
The GETROW macro works as follows. The external host sends in the image rows
as a packed array of bytes--for a 512-byte wide image, this array consists of 128
32-bit words. These words are unpacked and converted to floating point numbers in
the interface unit. The 512 32-bit floating point numbers resulting from this
operation are fed in sequence to the first cell of the Warp array. This cell takes

FIG. 4. Input partitioning method on Warp.

one-tenth of the numbers, removing them from the stream, and passes through the
rest to the next cell. The first cell then adds a number of zeroes to replace the data it
has removed, so that the number of data received and sent are equal.
This process is repeated in each cell. In this way, each cell obtains one-tenth of
the data from a row of the image. As the program is executed, and the process is
repeated for all rows of the image, each cell sees an adjacent set of columns of the
image, as shown in Fig. 4.
We have omitted certain details of GETROW--for example, usually the image row
size is not an exact multiple of ten. In this case, the GETROW macro pads the row
equally on both sides by having the interface unit generate an appropriate number
of zeroes on either side of the image row. Also, usually the area of the image each
cell must see to generate its outputs overlaps with the next cell's area. In this case,
the cell copies some of the data it receives to the next. All this code is automatically
generated by GETROW.
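The per-cell behavior of GETROW can be sketched in a few lines of C. The fragment below is an illustration only: receive_word and send_word are hypothetical stand-ins for W2's asynchronous word-by-word receive and send, the row length is assumed to divide evenly among the cells, and the padding and overlap handling described above are omitted.

#define ROW_LEN 510                  /* assumed: a row length that divides evenly */
#define CELLS   10
#define SLICE   (ROW_LEN / CELLS)

extern float receive_word(void);     /* hypothetical: next word from the previous cell */
extern void  send_word(float w);     /* hypothetical: send one word to the next cell   */

/* Executed on every cell: keep this cell's tenth of the row, pass the rest on,
   and send zeroes in place of the words removed, so that the number of words
   received and sent is the same on every cell. */
void getrow(float buf[SLICE])
{
    int i;
    for (i = 0; i < SLICE; i++)             /* take one-tenth of the row */
        buf[i] = receive_word();
    for (i = 0; i < ROW_LEN - SLICE; i++)   /* pass the remaining nine-tenths through */
        send_word(receive_word());
    for (i = 0; i < SLICE; i++)             /* replace the removed data with zeroes */
        send_word(0.0f);
}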
PUTROW, the corresponding macro for output, takes a buffer of one-tenth of the
row length from each cell and combines them by concatenation. The output row
starts as a buffer of 512 zeroes generated by the interface unit. The first cell discards
the first one-tenth of these and adds its own data to the end. The second cell does
the same, adding its data after the first. When the buffer leaves the last cell, all the
zeroes have been discarded and the first cell's data has reached the beginning of the
buffer. The interface unit then converts the floating point numbers in the buffer to
bytes and outputs them to the external host, which receives an array of 512 bytes
packed into 128 32-bit words. As with GETROW, PUTROW handles image buffers that
are not multiples of ten, this time by discarding data on both sides of the buffer
before the buffer is sent to the interface unit by the last cell.
During GETROW, no computation is performed; the same applies to PUTROW.
Warp's horizontal microword, however, allows input, computation, and output at
the same time. COMPUTEROW implements this. Ignoring the complications men-
tioned above, COMPUTEROW consists of three loops. In the first loop, the data for the
cell is read into a memory buffer from the previous cell, as in GETROW, and at the
same time the first one-tenth of the output buffer is discarded, as in PUTROW. In
the second loop, nine-tenths of the input row is passed through the next cell, as in
GETROW; at the same time, nine-tenths of the output buffer is passed through, as in

PUTROW. This loop is unwound by COMPUTEROW so that for every nine inputs and
outputs passed through, one output of this cell is computed. In the third loop, the
outputs computed in the second loop are passed on to the next cell, as in PUTROW.
There are several advantages to this approach to input partitioning:
• Work on the external host is kept to a minimum. In the Warp machine, the
external host tends to be a bottleneck in many algorithms; in the prototype
machines, the external host's actual data rate to the array is only about one-quarter
of the maximum rate the Warp machine can handle, even if the interface unit
unpacks data as it arrives. Using this input partitioning model, the external host
need not unpack and repack bytes, which it would have to if the data was requested
in another order. On the production Warp machine, the same concern applies; these
machines have DMA, which also requires a regular addressing pattern.
• Each cell sees a connected set of columns of the image, which are one-tenth
of the total columns in a row. Processing adjacent columns is an advantage since
many vision algorithms (e.g., median filter [6]) can use the result from a previous set
of columns to speed up the computation at the next set of columns to the right.
• Memory requirements at a cell are minimized, since each cell must store
only one-tenth of a row. This was important in the prototype Warp machines, since
they had only 4K words of memory on each cell. On PC Warp, with 32K words of
memory per cell, this approach makes it possible to implement very large window
operations.
• An unexpected side effect of this programming model was that it made it
easier to debug the hardware in the Warp machine. If some portion of a Warp cell is
not working, but the communications and microsequencing portions are, then the
output from a given cell will be wrong, but it will keep its proper position in the
image. This means that the error will be extremely evident--typically a black stripe
is generated in the corresponding position in the image. It is quite easy to infer from
such an image which cell is broken!
3.2. Apply on FT Warp
The 2-dimensional FT Warp array can be viewed as several one-dimensional
arrays. An image is usually divided into several swaths (adjacent groups of rows) on
FT Warp. The data of each swath are fed into the corresponding row of these
two-dimensional processors, as an image is fed into a one-dimensional array. This
results in each cell of FT Warp seeing a rectangular portion of the image.
To make the bandwidth as high as possible and to use the COMPUTEROW model,
we input the data along the horizontal path and output data along the vertical path.
The typical FT Warp array is a five-by-five array, as opposed to ten cells in Warp,
and each cell is as powerful as a Warp cell. FT Warp, however, has much higher
bandwidth than Warp. Therefore, for complex image processing operations where
I/O bandwidth is not a factor, we expect FT Warp Apply programs to be 2.5 times
faster than Warp programs, and even faster in simple image processing operations
where I/O bandwidth limits Warp performance.
3.3. Apply on iWarp
The iWarp implementation of Apply uses a logical pathway mechanism to allow
each cell to process only data intended for that cell. This eliminates much of the

complication of Apply on Warp; there is no need for a cell to explicitly pass data on
to other cells; instead, it can simply direct the rest of the data to pass on to later cells
without further intervention.
Our description of Apply on iWarp will be clear if we describe the action of
GETROW and PUTROW on this machine. In GETROW, each cell accepts data intended
for that cell and then releases control of the data to be passed on to the next cell
automatically, until the arrival of the start of the next row. After releasing control, it
goes on to process the data it has just received. In the meantime, it is allowing data
to pass by on the output channel until the end of the output row arrives. It then
tacks on its computed output to the end of this output row, completing PUTROW.
We expect this method of implementing Apply to be at least as efficient as the
COMPUTEROW model on Warp. Since the baseline iWarp machine has 64 cells, each
of which is 1.6 to 2 times as powerful as a Warp cell, total performance should be
from about 10 to 13 times greater than Warp. iWarp's I/O bandwidth is much
higher than Warp's, so this performance should be achievable for all but the most
simple image processing operations.

4. APPLY ON UNI-PROCESSOR MACHINES


The same Apply compiler that generates Warp code also generates C code to be
run under UNIX. We have found that an Apply implementation is usually at least as
efficient as any alternative implementation on the same machine. The computation
time of Apply-generated code is usually less than that of hand-coded pro-
grams. This efficiency results from the expert knowledge which is built into the
Apply implementation but which is too complex for the programmer to work with
explicitly. In addition, Apply focuses the programmer's attention on the details of
the computation, which often results in improved design of the basic computation.
The Apply implementation for uni-processor machines employs a technique,
called cyclic-scroll buffering here, which buffers the rows of the image using little
time and space. The technique allows the kernel to be shifted and
scrolled over the buffer at low cost.
The cyclic-scroll buffering technique which we developed for Apply on uni-
processor machines is described as follows. For an N × N image which will be

FIG. 5. Processing the first row by the cyclic-scroll buffering. (KB: the element pointed to by the kernel base; RB: the element pointed to by the row base.)


FIG. 6. Processing the second row by the cyclic-scroll buffering. (KB: the element pointed to by the kernel base; RB: the element pointed to by the row base.)

processed with an M × M kernel, a buffer with M × (N + M - 1) + (N - 1)
elements is required.
Figures 5 and 6 display the column-major arrangement for processing a 3 × 3
kernel. The pointers represent successive positions in memory. In addition, we keep
two base pointers for the buffer. One, called row base, points to the first pixel of the
three rows of the image and the other, called kernel base, points to the first pixel of
the kernel. C language subscripting can be used to directly access the elements of
the kernel except that the indices of row and column must be exchanged because the
rows of the images are stored in column-major order.
Initially, we put the first M rows of the image, including the border, into the
buffer in column-major order. When the first kernel is processed, row base points to
the first element of the buffer, and kernel base points to the center element of the
window to be processed. After the first kernel has been processed, the kernel base is
incremented by M to point to the first pixel of the next kernel. It is thus possible to
shift the kernel across the entire buffer of data with a cost of only one addition.
When processing an entire row is completed, the first row in the buffer from the
row base is discarded and the next row of the image is input into the discarded row
with a column displacement of one (i.e., beginning at the second element). Then the
row base is incremented by one. The purpose of column displacement 1 is that the
input row can be considered to be the Mth row of the buffer starting from the new
row base. Effectively, the rolling is done at the same time. After the kernel base is
reset to point to the center element of the new window, we can do another row
operation in the same way as the first until all the rows are processed. Figures 5 and
6 show the processing of the first and second rows.
For each row operation, one more memory element is needed in the buffer.
Therefore, the total number of elements in the buffer is M × (N + M - 1) + (N - 1).
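The following C program illustrates this bookkeeping for a small image and a square kernel. It is a sketch under stated assumptions (a BORDER 0 rule, a simple averaging window standing in for the Apply procedure body, and toy sizes), not the code generated by the Apply compiler, but it uses the same buffer size, the same column-major layout, and the same row-base and kernel-base pointers described above.

#include <stdio.h>
#include <stdlib.h>

#define N  8                    /* image is N x N (kept small for the demo) */
#define M  3                    /* kernel is M x M, M odd                   */
#define BW ((M - 1) / 2)        /* window "radius"                          */
#define BUFCOLS (N + M - 1)     /* buffered row length, border included     */

/* Store image row 'imrow' (or zeroes, if it lies outside the image) into one
   row slot of the column-major buffer: one element every M positions,
   starting at 'start'. */
static void load_row(float *buf, long start, const float img[N][N], int imrow)
{
    for (int j = 0; j < BUFCOLS; j++) {
        int imcol = j - BW;
        float v = 0.0f;                                   /* BORDER 0 */
        if (imrow >= 0 && imrow < N && imcol >= 0 && imcol < N)
            v = img[imrow][imcol];
        buf[start + (long)j * M] = v;
    }
}

/* Stand-in for the Apply procedure body: window element (dr, dc) of the
   current kernel lives at kb + dc*M + dr, where kb is the kernel base. */
static float window_op(const float *buf, long kb)
{
    float sum = 0.0f;
    for (int dr = -BW; dr <= BW; dr++)
        for (int dc = -BW; dc <= BW; dc++)
            sum += buf[kb + (long)dc * M + dr];
    return sum / (M * M);
}

int main(void)
{
    float img[N][N], out[N][N];
    for (int r = 0; r < N; r++)
        for (int c = 0; c < N; c++)
            img[r][c] = (float)(r * N + c);

    long bufsize = (long)M * BUFCOLS + (N - 1);           /* M*(N+M-1) + (N-1) */
    float *buf = malloc(sizeof *buf * bufsize);

    long rb = 0;                                          /* row base */
    for (int r = 0; r < M; r++)                           /* prime with rows -BW..BW */
        load_row(buf, rb + r, img, r - BW);

    for (int R = 0; R < N; R++) {
        long kb = rb + (long)BW * M + BW;                 /* kernel base, column 0 */
        for (int C = 0; C < N; C++) {
            out[R][C] = window_op(buf, kb);
            kb += M;                                      /* shift kernel: one addition */
        }
        if (R + 1 < N) {
            /* Overwrite the discarded oldest row, displaced by one column,
               then advance the row base by one: the buffer scrolls in place. */
            load_row(buf, rb + M, img, R + 1 + BW);
            rb += 1;
        }
    }

    printf("out[3][4] = %g\n", out[3][4]);
    free(buf);
    return 0;
}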

5. APPLY ON THE HUGHES HBA


Apply has been implemented on the Hughes HBA computer [13] by Richard
Wallace of Carnegie Mellon and Hughes. In this computer, several MC68000
processors are connected on a high-speed video bus, with an interface between each

processor and the bus that allows it to select a subwindow of the image to be stored
into its memory. The input image is sent over the bus and windows are stored in
each processor automatically using DMA. A similar interface exists for outputting
the image from each processor. This allows flexible real-time image processing.
The Hughes HBA Apply implementation is straightforward and similar to the
Warp implementation. The image is divided into "swaths," which are adjacent sets of
rows, and each processor takes one swath. (In the Warp implementation, the swaths
are adjacent sets of columns, instead of rows). Swaths overlap to allow each
processor to compute on a window around each pixel. The processors independently
compute the result for each swath, which is fed back onto the video bus for display.
The HBA implementation of Apply includes a facility for image reduction, which
was not included in earlier versions of Apply. The HBA implementation subsamples
the input images, so that the input image window refers to the subsampled image,
not the original image as in our definition. We prefer the approach here because it
has more general semantics. For example, using image reduction as we have defined
it, it is possible to define image reduction using overlapping windows as in Sec-
tion 2.4.

6. APPLY ON OTHER MACHINES


Here we briefly outline how Apply could be implemented on other parallel
machine types, specifically bit-serial processor arrays and distributed memory
general purpose processor machines. These two types of parallel machines are very
common; many parallel architectures include them as a subset, or can simulate them
efficiently.

6.1. Apply on Bit-Serial Processor Arrays


Bit-serial processor arrays [3] include a great many parallel machines. They are
arrays of large numbers of very simple processors which are able to perform a single
bit operation in every machine cycle. We assume only that it is possible to load
images into the array such that each processor can be assigned to a single pixel of
the input image, and that different processors can exchange information locally, that
is, processors for adjacent pixels can exchange information efficiently. Specific
machines may also have other features that may make Apply more efficient than the
implementation outlined here.
In this implementation of Apply, each processor computes the result of one pixel
window. Because there may be more pixels than processors, we allow a single
processor to implement the action of several different processors over a period of
time, that is, we adopt the Connection Machine's idea of virtual processors [5].
The Apply program works as follows:

• Initialize: For n × n image processing, use a virtual processor network of
n × n virtual processors.
• Input: For each variable of type IN, send a pixel to the corresponding
virtual processor.
• Constant: Broadcast all variables of type CONST to all virtual processors.
• Window: For each IN variable, with a window size of m × m, shift it in a
spiral, first one step to the right, then one step up, then two steps to the left, then
two steps down, and so on, storing the pixel value in each virtual processor the pixel
encounters, until an m × m square around each virtual processor is filled. This will
take m² steps. (A small sketch of this spiral schedule appears at the end of this
subsection.)
• Compute: Each virtual processor now has all the inputs it needs to calculate
the output pixels. Perform this computation in parallel on all processors.

Because memory on these machines is limited, it may be best to combine the
"window" and "compute" steps above, to avoid the memory cost of prestoring all
window elements on each virtual processor.
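The spiral schedule in the Window step can be made concrete with a short C sketch. The program below is ours, not a bit-serial program: it merely generates the sequence of single-step shifts (one right, one up, two left, two down, three right, and so on) and checks that after m² - 1 shifts the visited offsets cover the full m × m window. On a real array, each step would be a parallel nearest-neighbor shift of the whole image, with every virtual processor storing the value it then holds into the corresponding window position.

#include <stdio.h>

#define M 5                            /* window size, odd */
#define H (M / 2)

int main(void)
{
    /* visited[r][c] marks window offset (r-H, c-H) as filled */
    int visited[M][M] = { { 0 } };
    int dr = 0, dc = 0, filled = 1, steps = 0;
    visited[H][H] = 1;                 /* the pixel starts at its own processor */

    /* direction cycle: right, up, left, down; run lengths 1, 1, 2, 2, 3, 3, ... */
    const int move_r[4] = { 0, -1, 0, 1 };
    const int move_c[4] = { 1, 0, -1, 0 };

    for (int leg = 0; filled < M * M; leg++) {
        int len = leg / 2 + 1;
        for (int s = 0; s < len && filled < M * M; s++) {
            dr += move_r[leg % 4];
            dc += move_c[leg % 4];
            steps++;
            /* On the real array this is one nearest-neighbor shift of the whole
               image; every virtual processor stores the value it now holds into
               window position (dr, dc). */
            if (dr >= -H && dr <= H && dc >= -H && dc <= H && !visited[dr + H][dc + H]) {
                visited[dr + H][dc + H] = 1;
                filled++;
            }
        }
    }
    printf("window filled after %d shifts (m*m - 1 = %d)\n", steps, M * M - 1);
    return 0;
}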

6.2. Apply on Distributed Memory General Purpose Machines


Machines in this class consist of a moderate number of general purpose proces-
sors, each with its own memory. Many general-purpose parallel architectures
implement this model, such as the Intel iPSC [7] or the Cosmic Cube [12]. Other
parallel architectures, such as the shared-memory BBN Butterfly [4; 11], can
efficiently implement Apply in this way; treating them as distributed memory
machines avoids problems with contention for memory.
This implementation of Apply works as follows:

• Input: If there are n processors in use, divide the image into n regions, and
store one region in each of the n processors' memories. The actual shape of the
regions can vary with the particular machine in use. Note that compact regions have
smaller borders than long, thin regions, so that the next step will be more efficient if
the regions are compact.
• Window: For each IN variable, processors exchange rows and columns of
their image with processors holding an adjacent region from the image so that each
processor has enough of the image to compute the corresponding output region; a
sketch of this exchange follows this list.
• Compute: Each processor now has enough data to compute the output
region. It does so, iterating over all pixels in its output region.
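As an illustration of the Window step above, the fragment below sketches the boundary exchange for the simple case of a one-dimensional decomposition into horizontal swaths and a window of radius HALO. It is a sketch only: send_to and recv_from are hypothetical stand-ins for whatever message primitives the machine provides (for example, the iPSC send and receive operations), and a real implementation would also exchange columns for a two-dimensional decomposition and fill the padding at the image border according to the BORDER rule.

#define HALO 1                          /* window radius: 1 for a 3 x 3 window */

enum { NORTH = 0, SOUTH = 1, NONE = -1 };

extern void send_to(int neighbor, const float *buf, int nwords);     /* hypothetical */
extern void recv_from(int neighbor, float *buf, int nwords);         /* hypothetical */

/* 'swath' holds (rows + 2*HALO) x cols pixels in row-major order: HALO rows of
   padding above and below the rows this processor owns.  north and south are
   the neighboring processors, or NONE at the top and bottom of the image. */
void exchange_boundary_rows(float *swath, int rows, int cols, int north, int south)
{
    float *top_own     = swath + (long)HALO * cols;        /* first row this processor owns */
    float *bottom_own  = swath + (long)rows * cols;        /* first of its last HALO rows   */
    float *top_halo    = swath;
    float *bottom_halo = swath + (long)(HALO + rows) * cols;

    if (north != NONE) {
        send_to(north, top_own, HALO * cols);      /* my top rows -> neighbor's bottom halo */
        recv_from(north, top_halo, HALO * cols);   /* neighbor's bottom rows -> my top halo */
    }
    if (south != NONE) {
        send_to(south, bottom_own, HALO * cols);
        recv_from(south, bottom_halo, HALO * cols);
    }
}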

7. SUMMARY

The Apply language crystallizes our ideas on low-level vision programming on
parallel machines. It allows the programmer to treat certain messy conditions, such
as border conditions, uniformly. It also allows the programmer to get consistently good
efficiency in low-level vision programming, by incorporating expert knowledge
about how to implement such operators.
We have defined the Apply language as it is currently implemented and described
its use in low-level vision programming. Apply is in daily use at Carnegie Mellon
and elsewhere for Warp and vision programming in general; it has proved to be a
useful tool for programming under UNIX, as well as an introductory tool for Warp
programming.
We have described our programming techniques for low-level vision on Warp.
These techniques began with simple row-by-row image processing macros, which are
still in use for certain kinds of algorithms, and led to the development of Apply,
which is a specialized programming language for low-level vision. This language
could then be mapped onto other computers, including both uni-processors and
parallel computers.

One of the most exciting characteristics of Apply is that it is possible to
implement it on diverse parallel machines. We have outlined such implementations
on bit-serial processor arrays and distributed memory machines. Implementation of
Apply on other machines will make porting of low-level vision programs easier,
should extend the lifetime of programs for such supercomputers, and will make
benchmarking easier. Several implementation efforts are underway at other sites to
map Apply onto other parallel machines than those described here.
We have shown that the Apply programming model provides a powerful simplified
programming method which is applicable to a variety of parallel machines.
Whereas programming such machines directly is often difficult, the Apply language
provides a level of abstraction in which programs are easier to write, more
comprehensible, and more likely to work correctly the first time. Algorithm debug-
ging is supported by a version of the Apply compiler which generates C code for
uni-processor machines.

APPENDIX I: GRAMMAR OF THE APPLY LANGUAGE

procedure ::= PROCEDURE function-name ( function-args )
              IS
                  variable-declarations
              BEGIN
                  statements
              END function-name ;

function-name ::= name

function-args ::= function-argument [ , function-argument ]*

variable-declarations ::= [ var-list : type ; ]*

statements ::= [ statement ; ]*

name ::= alpha [ alpha | digit ]*

function-argument ::= var-list : IN type
                          [ BORDER border-type ]
                          [ SAMPLE ( integer-list ) ]
                      var-list : OUT type
                      var-list : CONST type

var-list ::= name [ , name ]*

type ::= ARRAY ( range [ , range ]* ) OF elementary-type
         elementary-type

statement ::= assignment-stmt
              if-stmt
              for-stmt
              while-stmt

alpha ::= A | B | C | D | E | F | G | H | I | J | K | L | M
          | N | O | P | Q | R | S | T | U | V | W | X | Y | Z

digit ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

border-type ::= integer-expr | MIRRORED

integer-list ::= integer [ , integer ]*

range ::= integer-expr .. integer-expr

elementary-type ::= sign-type object
                    object

assignment-stmt ::= scalar-var := expr

if-stmt ::= IF bool-expr THEN
                statements
            END IF
            IF bool-expr THEN
                statements
            ELSE
                statements
            END IF

for-stmt ::= FOR variable IN range LOOP
                 statements
             END LOOP

while-stmt ::= WHILE bool-expr LOOP
                   statements
               END LOOP

integer-expr ::= integer-expr binary-op integer-expr
                 ! integer-expr
                 ( integer-expr )
                 integer

integer ::= [ sign ] digit [ digit ]*

sign-type ::= [ SIGNED | UNSIGNED ]

object ::= BYTE
           INTEGER
           REAL

scalar-var ::= variable
               variable ( subscript-list )

expr ::= expr binary-op expr
         ! expr
         ( expr )
         pseudo-function ( expr )
         variable ( subscript-list )
         variable

bool-expr ::= bool-expr AND bool-expr
              bool-expr OR bool-expr
              NOT bool-expr
              ( bool-expr )
              expr < expr
              expr <= expr
              expr = expr
              expr >= expr
              expr > expr
              expr /= expr

binary-op ::= + | - | * | / | ^ | "|" | &

sign ::= + | -

pseudo-function ::= name

subscript-list ::= expr [ , expr ]*

number ::= integer [ . [ digit ]* ]

ACKNOWLEDGMENTS
We would like to acknowledge the contributions made by Steve Shafer, who helped develop the Apply
programming model. The Warp project is a large project at Carnegie Mellon and General Electric
Company. The authors are greatly indebted to this group, which has designed, built, and maintained the
Warp machine, as well as implemented the W2 programming language, which is the basis for the Warp
implementation of Apply. Apply itself grew out of work in the standard vision programming environment
at Carnegie Mellon, which is based on C/UNIX. Apply benefited from the use and criticism of
members of the Image Understanding Systems and Autonomous Land Vehicles group at Carnegie
Mellon.

REFERENCES
1. Reference Manual for the Ada Programming Language, MIL-STD 1815 edition, U.S. Department of
Defense, AdaTEC, SIGPLAN Technical Committee on Ada, New York, AdaTEC, 1982. Draft
revised MIL-STD 1815; draft proposed ANSI Standard document.
2. M. Annaratone, E. Arnould, T. Gross, H. T. Kung, M. Lam, O. Menzilcioglu, K. Sarocky, and J. A.
Webb, Warp architecture and implementation, in Proceedings, 13th Annual International Sympo-
sium on Computer Architecture, June, 1986, pp. 346-356.
3. K. E. Batcher, Bit-serial parallel processing systems, IEEE Trans. Comput. C-31, No. 5, 1982,
377-384.
4. BBN Laboratories, The Uniform System Approach to Programming the Butterfly Parallel Processor, 1st
ed., Cambridge, MA, 1985.
5. W. D. Hillis, The Connection Machine, The MIT Press, Cambridge, MA, 1985.
6. T. S. Huang, G. J. Yang, and G. Y. Tang, A fast two-dimensional median filtering algorithm, in
International IEEE Conference on Pattern Recognition and Image Processing, 1978, pp. 128-130.
7. iPSC System Overview, Intel Corporation, 1985.
8. B. W. Kernighan and D. M. Ritchie, The M4 macro processor, in Unix Programmer's Manual, Bell
Laboratories, Murray Hill, NJ 07974, 1979.
9. H. T. Kung and J. A. Webb, Global operations on the CMU Warp machine, in Proceedings, 1985
AIAA Computers in Aerospace V Conference, October, 1985, pp. 209-218.
10. H. T. Kung and J. A. Webb, Mapping image processing operations onto a linear systolic machine,
Distributed Comput. 1, No. 4, 1986, 246-257.
11. T. J. Olson, An Image Processing Package for the BBN Butterfly Parallel Processor, Butterfly Project
Report 9, Department of Computer Science, University of Rochester, August 1985.
12. C. Seitz, The cosmic cube, Commun. ACM 28, No. 1, 1985, 22-33.
13. R. S. Wallace and M. D. Howard, HBA vision architecture: Built and benchmarked, in Computer
Architectures for Pattern Analysis and Machine Intelligence, IEEE Computer Society, Seattle,
Washington, December 1987.
14. R. S. Wallace, J. A. Webb, and I-C. Wu, Architecture independent image processing: Performance of
Apply on diverse architectures, Comput. Vision Graphics Image Process. 48 (1989).
