Image Processing On The MPP (Massively Parallel Processor)
Abstract--The Massively Parallel Processor (MPP) is a 128 by 128 array of processing elements that communicate with their horizontal and vertical neighbors by shifting data one bit at a time. This paper describes the efficient use of MPP for various types of image processing operations, including point and local operations, discrete transforms and computation of image statistics. A comparison between MPP and ZMOB (a system consisting of 256 microprocessors) is also presented.
amount of image loading and unloading time to consider. On the MPP, data is loaded from the host VAX, via the UNIBUS, to a staging area of the MPP, where the data is input simultaneously to 128 edge PEs, 128 bits at a time. Letting r be the rate at which a byte of data is transferred on the UNIBUS (400 nsec) and p be the rate at which a bit of data is passed between PEs, let us compute how long it takes to load a 128 by 128 (say) image of byte-long pixels: (1) from the VAX to the MPP staging area via the UNIBUS; and (2) from the MPP staging area to the PEs (a concurrent process). Via the UNIBUS it takes 128 × 128 × r, or 6.554 msec. From the staging area to the PEs it takes 128 × 128 × 8 bits × 1/128 (the number of bits passed simultaneously) × p, or 1024p sec, well under a millisecond. Thus, the UNIBUS is the rate-limiting step of the MPP image loading process, and the total time to load and unload is rN² + rN² = 2rN².
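As a quick check of this arithmetic, the loading costs can be reproduced directly. The following is a minimal sketch for illustration only (not MPP software; it assumes the 300 nsec/bit PE transfer rate quoted later in this section):

N = 128                 # image side; one byte per pixel
r = 400e-9              # UNIBUS transfer time per byte (sec)
p = 300e-9              # assumed PE-to-PE transfer time per bit (sec)

t_unibus = N * N * r                  # VAX -> staging area, byte-serial
t_staging = (N * N * 8 // 128) * p    # staging -> PEs, 128 bits in parallel

print(f"UNIBUS load:  {t_unibus * 1e3:.3f} msec")    # 6.554 msec
print(f"staging load: {t_staging * 1e3:.3f} msec")   # 0.307 msec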
In summary, on the VAX the time to perform the operation on the entire image is C_vN², while the time to perform it on the MPP is 2rN² + C_mN²/16,384. If 32,768r + C_m < 16,384C_v, using the MPP is faster than using the VAX.

With local operations the situation is more complicated because information must be shared between neighboring processors. The next section will discuss the amount of time it takes to perform local operations using different neighborhood geometries. A comparison with performing an (iterated) operation on the host VAX will also be given. Due to the limited local memory of MPP PEs, the focus of the discussion will be the one pixel per PE case.

3. LOCAL OPERATIONS
Neighborhood   Step   Pass direction   No. of pixels passed
8-neighbor     1      Up               1
               2      Right            2
               3      Down             2
               4      Left             3
4-neighbor     1      Up               1
               2      Right            1
               3      Down             1
               4      Left             1
2 x 2          1      Down             1
               2      Right            2
8-component    1      Right            1
               2      Left             1
               3      Down             3
4-component    1      Down             1
               2      Right            1

Fig. 1. Neighbor-passing sequences for the common neighborhood types. (The 'Result' diagrams of the original, showing the block of pixels gathered after each step, are not reproduced.)
Each iteration of a local operation consists of two steps: a neighbor-passing step and a computation step involving the gathered neighborhood. Several types of local neighborhoods are commonly used and these (with the steps involved in passing neighbors) are outlined in Fig. 1. Every passing sequence involves the exact number of neighbors required, except for the 8-neighbor connected component case, where one extra neighbor transfer occurs (due to the interconnection structure of MPP). In all, eight pixels are passed in the 8-neighbor case; four pixels in the 4-neighbor case; three pixels in the 2 × 2 case; five pixels in the 8-neighbor connected component case; and two pixels in the 4-neighbor connected component case. In the following paragraphs we analyze a specific case, the 8-neighbor local operation, and give a comparison between the performance of MPP and of the host VAX itself.
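The 8-neighbor passing sequence of Fig. 1 can be simulated directly. The sketch below is illustrative only (it is not MPP code): each plane of values a PE has gathered is modeled as a whole-array shift, and toroidal wraparound stands in for the explicit border handling of the real machine.

import numpy as np

def shift(plane, dr, dc):
    # the value held at PE (i, j) moves to PE (i + dr, j + dc)
    return np.roll(plane, (dr, dc), axis=(0, 1))

img = np.arange(16).reshape(4, 4)   # tiny stand-in image, one pixel per PE
held = {(0, 0): img}                # source offset relative to each PE -> plane

steps = [                                       # per Fig. 1, 8 transfers in all
    ((-1, 0), [(0, 0)]),                        # up:    1 plane passed
    ((0, 1),  [(0, 0), (1, 0)]),                # right: 2 planes passed
    ((1, 0),  [(0, 0), (0, -1)]),               # down:  2 planes passed
    ((0, -1), [(-1, 0), (0, 0), (1, 0)]),       # left:  3 planes passed
]
for (dr, dc), sources in steps:
    for r, c in sources:
        held[(r - dr, c - dc)] = shift(held[(r, c)], dr, dc)

# after the four steps every PE holds its full 3 x 3 neighborhood
assert sorted(held) == [(r, c) for r in (-1, 0, 1) for c in (-1, 0, 1)]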
When is using MPP better than simply using the host VAX? In other words, when does the overhead of using MPP (loading and unloading an image via the UNIBUS) offset the time saved in performing an (iterated) local operation? To answer this we must first obtain formulas for computation times on VAX and MPP.

We will assume a 128 × 128 image, thus one pixel per MPP PE. The relevant parameters are

N = length of image side = 128
p = time to pass one bit between MPP PEs
m = number of bits per pixel (8, for 256 grey levels)
C_m = time to compute one local operation on MPP
C_v = time to compute one local operation on VAX
n = number of iterations of the local operation
r = time to pass one pixel over the UNIBUS

On the VAX, the time to compute n iterations of a local operation taking C_v time per pixel is

T_VAX = nC_vN².

On MPP, the computation must be split into three stages: loading (L_m), processing (P_m) and unloading (U_m). As we have already seen, the loading of the MPP PEs is limited by the amount of time it takes to transfer the image pixels over the UNIBUS (loading of the PEs from that point is much faster). Loading and unloading times are the same:

L_m = U_m = rN².

There are two stages for each iteration of a local operation on MPP: communication and computation. For an 8-neighbor operation with one pixel/PE, the pass time is 8mp per iteration and the compute time is C_m per iteration. Thus,

P_m = 8nmp + nC_m.

In summary, the total time for MPP processing is T_MPP = L_m + U_m + P_m, or

T_MPP = 2rN² + 8nmp + nC_m.

Given that the VAX takes some fraction α of the time that an MPP PE does for the given local operation (α will vary), how time-consuming must that local operation be (on MPP, say) before it is worth moving to MPP for processing? Let C_v = αC_m and solve

T_VAX = T_MPP:
αnC_mN² = 2rN² + 8nmp + nC_m
C_m = (2rN² + 8nmp) / (αnN² − n).

Tables 1 and 2 show typical results for the realistic values

N = 128
m = 8
p = 3 × 10⁻⁷ sec (300 nsec/bit PE transfer rate)
r = 4 × 10⁻⁷ sec (400 nsec/byte UNIBUS transfer rate).

Table 1 gives minimum MPP computation times for T_VAX = T_MPP; Table 2 gives minimum times for T_VAX = 10T_MPP.
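These thresholds are easy to evaluate numerically. A sketch (illustrative only), using the parameter values just listed:

N, m = 128, 8
p = 3e-7                       # sec per bit, PE to PE
r = 4e-7                       # sec per pixel (byte), UNIBUS

t_vax = lambda n, C_v: n * C_v * N**2
t_mpp = lambda n, C_m: 2 * r * N**2 + 8 * n * m * p + n * C_m

def break_even_C_m(n, alpha):
    # solve alpha*n*C_m*N**2 = 2*r*N**2 + 8*n*m*p + n*C_m for C_m
    return (2 * r * N**2 + 8 * n * m * p) / (alpha * n * N**2 - n)

print(break_even_C_m(n=1, alpha=1))     # ~8.0e-7 sec: one iteration
print(break_even_C_m(n=10, alpha=1))    # ~8.1e-8 sec: ten iterations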
We can see from these tables that MPP will usually be advantageous over, and often more than ten times faster than, the VAX, since 1-10 μsec is the minimum time for MPP PE operations. For short, once-iterated operations, MPP will be IO-bound: for C_m between 10⁻⁷ and 10⁻³ sec the fractional overhead in transferring the image between the VAX and MPP is over 90%; at C_m = 10⁻² sec the overhead is 57%; at C_m = 10⁻¹ sec the overhead is 12%; and, at higher C_m values or for more than one iteration, the overhead drops well below 1%. Generally, more than one iteration of a local operation must be performed before MPP is useful.

In the case where we have several pixels per PE (N by N image, N > 128) the situation is different. For local operations on images larger than 128 by 128, the general formula for the computation time is C_mmN²/P and for the communication time is [4(N/√P) + 4] (the number of points bordering the N²/P-pixel subregion) times mp. Thus, with increasing N (within the constraint of the limited PE local memory), the computation time rises by the square and the communication time rises linearly with N; consequently, the calculation becomes more CPU bound. In any case, the small amount of memory per PE limits the number of pixels that can be handled by a PE. The values of a pixel and its eight neighbors already take up a significant fraction of this memory (72 bits, or about 7%). To handle a 2 × 2 block of pixels and their neighbors (a 4 × 4 block in all) requires nearly twice this, and a 3 × 3 block (a 5 × 5 block in all) requires 40% of the memory. It would be difficult to handle much larger blocks.

4. COMPUTATION OF IMAGE STATISTICS

In this section we consider some MPP tasks involving computation of image statistics - in particular, the gray level histogram and the co-occurrence matrix.
Table 1. Thresholds (in seconds, MPP PE computation time) for MPP preferability to the VAX.
Table 2. Thresholds (in seconds, MPP PE computation time) for 10-fold speedup over the VAX when using MPP.
[Table entries not recoverable; columns correspond to α = 1, 1/2, 1/4, 1/8, 1/16, 1/32, 1/64.]
PE (level)   Step 1   Step 2   Step 3   Step 4   Step 5   Step 6   Step 7   Step 8
1            2/0      3/0      4/0      3/0      0/0      6/0      7/0      6/0
2            3/0      4/0      3/0      0/0      6/0      7/0      6/0      2/1
3            4/0      3/1      0/1      6/1      7/1      6/1      2/1      3/2
4            3/0      0/0      6/0      7/0      6/0      2/0      3/0      4/1
5            0/0      6/0      7/0      6/0      2/0      3/0      4/0      3/0
6            6/1      7/1      6/2      2/2      3/2      4/2      3/2      0/2
7            7/1      6/1      2/1      3/1      4/1      3/1      0/1      6/1
8 (0)        6/0      2/0      3/0      4/0      3/0      0/1      6/1      7/1

Fig. 2. Column histogramming. In entry a/b, a = value passing through the PE, b = counter contents. The values are cyclically shifted upward; each counter adds 1 when the value passing through it is equal to its row number. In this example there are 8 PEs and 8 gray levels (row 8 counts gray level 0).
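The cyclic-shift scheme of Fig. 2 is straightforward to simulate serially. In the sketch below (illustrative; for simplicity PE i counts gray level i, whereas the figure assigns level 0 to PE 8), each PE in the column compares the value passing through it against its own level and conditionally increments its counter:

def column_histogram(pixels):
    n = len(pixels)                 # one PE per pixel in the column
    counters = [0] * n              # PE i counts gray level i
    vals = list(pixels)
    for _ in range(n):              # n cyclic-shift steps
        for i in range(n):          # one SIMD compare-and-increment
            if vals[i] == i:
                counters[i] += 1
        vals = vals[1:] + vals[:1]  # shift the column upward by one
    return counters

# the 8-PE, 8-gray-level column of Fig. 2
assert column_histogram([2, 3, 4, 3, 0, 6, 7, 6]) == [1, 0, 1, 2, 1, 0, 2, 1]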
of the final bucket count. Meanwhile, the next-to-last bit propagates leftward after the LSB, being added to the next-to-last bit and the carry from the LSB addition, in the same fashion, until it propagates to the left column.

Since larger and larger counts are being formed as the column totals merge, the counters of each column must be extended to accommodate these sums. For column N (numbering from 1 at the right to 128 at the left), that column's counter must be extended to [(log₂N) + 8] bits. So that the algorithm may work in proper synchrony, every bit of each counter must be passed upward, even leading zeros. Figure 3 presents a worked-out example for a row of length 6.

Letting N = the number of processors in a row of the processor array (128 on MPP), it takes N steps to propagate the LSB to the left column. It then takes (2log₂N − 1) steps to pass the rest of the (2log₂N)-bit counter maintained by the PE in the second-to-left column. Thus, this part of the algorithm takes O(N + log₂N) steps.

The total complexity of histogramming on the MPP is O(N log₂m) (m the number of gray levels) from the first part plus O(N + log₂N) in the second part, which totals to O(N log m).

(c) Time requirements. The first step, column histogramming, involves cycling N m-bit pixels through the column PEs, comparing the pixel value to the row number and (potentially) incrementing a counter at each step (note that on an SIMD machine such as MPP a step such as this incrementing takes just as much time whether it occurs or not, since the instruction(s) must be sent to each processor anyway; they are simply disarmed if necessary). Thus, at each of N steps an m-bit pass, an m-bit compare and an (n + 1)-bit add occur; thus, the time taken for column histogramming is

T_col = N[mp + mc + (n + 1)a].
Fig. 3. Counter merging for a row of length 6: the column counts 11, 01, 00, 10, 11, 10 (binary) are passed leftward bit-serially and summed, leaving the total 1011 in the left column. In each entry, bits that have just been passed are underlined; primes denote positions of carry bits. (The intermediate rows of the example are not reproduced.)
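The bit-level mechanism of Fig. 3 is ordinary LSB-first serial addition. The sketch below (illustrative, not MPP code) adds two counts one bit position per step while keeping a single carry bit, which is what each PE does as the counts stream leftward past it:

def serial_add(a_bits, b_bits):
    # a_bits, b_bits: equal-length LSB-first bit lists
    out, carry = [], 0
    for a, b in zip(a_bits, b_bits):
        s = a + b + carry
        out.append(s & 1)           # sum bit leaves immediately
        carry = s >> 1              # carry waits for the next bit position
    out.append(carry)
    return out

def to_bits(x, width):              # LSB-first
    return [(x >> i) & 1 for i in range(width)]

def from_bits(bits):
    return sum(b << i for i, b in enumerate(bits))

assert from_bits(serial_add(to_bits(13, 8), to_bits(7, 8))) == 20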
For the 300 nsec cycle time of the VAX, t_o will typically be 0.1-1 μsec, depending on how the program is coded (assembly versus C). Thus, on the VAX, histogramming a 128 by 128 image will take about 0.0016384 to 0.016384 seconds. This is 0.15-1.5 times the total MPP time, or 1.6-16 times the MPP computation time alone. Thus, MPP seems to offer only a marginal, if any, improvement over using the VAX for this task.
4.2. Co-occurrence matrices

A co-occurrence matrix is essentially a 'histogram' of the occurrences of pairs of gray levels; if there are M different gray levels, it is an M by M matrix. To compute the co-occurrence matrix of an image, the neighbor of each point at some displacement δ is obtained and the appropriate entry (gray-level1, gray-level2) of the matrix is incremented by one. On the MPP this would be analogous to the histogram algorithm presented earlier: the M by M matrix would be treated
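For reference, the serial computation being parallelized is simply the following (an illustrative sketch; cooccurrence and its argument names are hypothetical):

import numpy as np

def cooccurrence(img, delta, M):
    # count pairs (img[x], img[x + delta]) over all points x with both in range
    dr, dc = delta
    H = np.zeros((M, M), dtype=int)
    rows, cols = img.shape
    for i in range(max(0, -dr), min(rows, rows - dr)):
        for j in range(max(0, -dc), min(cols, cols - dc)):
            H[img[i, j], img[i + dr, j + dc]] += 1
    return H

img = np.random.randint(0, 8, (16, 16))
print(cooccurrence(img, (0, 1), M=8))   # horizontally adjacent pairs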
obtained and the appropriate entry (gray-level1, gray- If the number of processors in ZMOB is regarded as
level2) of the matrix incremented by one. On the M P P proportional to the image diameter (N), and the
this would be analogous to the histogram algorithm number of processors in M P P as proportional to
presented earlier : the M by M matrix would be treated image size (N2), then we see in Tables 3 and 4 how
For the MPP entries (Table 3):

N² = size of (N by N) image
P = number of processors
M = number of gray levels
C_m = time to compute one operation, per bit, on MPP
r_b = UNIBUS transfer rate, per bit
p_b = PE intercommunication rate, per bit

For the ZMOB entries (Table 4):

N² = size of (N by N) image
P = number of processors
M = number of gray levels
C_z = time to compute one operation, per pixel, on ZMOB
r = UNIBUS transfer rate (per pixel)
p = conveyor belt transfer rate (per pixel)
[Tables 3-6: complexity entries not reproduced.]
In addition, other algorithms, where communication does not occur in a tightly orchestrated way, become intractable. The severely limited local memory space is also a difficulty in considering certain algorithms or certain (practical) image sizes. Nevertheless, MPP still manifests significant speed advantages, particularly when it is used for point and local space-domain operations or for transform-domain filtering. It will be a powerful tool for image processing and analysis.

Acknowledgement--The help of Sherry Palmer in preparing this report is gratefully acknowledged.

REFERENCES

1. K. E. Batcher, Design of a Massively Parallel Processor, IEEE Trans. Comput. C-29, 836-840 (1980).
2. T. Kushner, A. Y. Wu and A. Rosenfeld, Image Processing on ZMOB, TR-987, Computer Science Center, University of Maryland, College Park, MD, December 1980.

APPENDIX

Image reconstruction on MPP and ZMOB

The two methods of image reconstruction which will be discussed for implementation on MPP and ZMOB are the filtered back-projection and Fourier reconstruction methods. The former basically involves taking each point of a density projection and 'smearing' its value, divided by an appropriate measure of width, across the image. This is repeated for each projection, its points being smeared additively, with suitable (pre- and) postprocessing of the image to compensate for the spread function of the back-projection process. The latter method involves taking the Fourier transform of each projection and, by applying the Fourier Slice Theorem (which states that the transforms of the projections are the values of the central cross sections, at the same orientations, of the transformed image), using them as values from which to interpolate the Cartesian-grid representation of the transformed image, from which the reconstructed image is derived by inverse transformation.

On the MPP, the first method, filtered back-projection, is difficult, due to the non-linear nature of the reconstruction process. The problem may be restated thus: for any point in the image, what points from each projection must be used to get (interpolate) that projection's contribution to the final value? Since the projections are at various orientations, this becomes a geometric operation problem which, except for the two-projection situation, is of a form that the fixed geometry of the MPP cannot easily handle.

In the Fourier reconstruction method, while rows of processors may be able to transform the projections, and the projections, once in place among the appropriate processors, may be fairly readily interpolated (and the image inverse transformed by the method in Section 5), it is not clear how to smoothly get the transformed projection points to the processors where they belong.

For image reconstruction on ZMOB, there is an attractive way to implement the filtered back-projection method. Given P processors and projections, the circular image is partitioned into 2P sectors, and each processor is assigned two opposite sectors such that each projection bisects each pair of sectors. For an N by N image, each sector will contain approximately πN²/4P points (about 50 for a 32 by 32 image with 16 partitions). Each processor is then loaded with the projection data assigned to it. Each point in the sectors will add to a running sum, as the back-projected contribution from that projection, an interpolated value, depending on where a line from the point, normal to the projection, falls on the projection. After the first projection is processed each processor passes those values to its next neighbor, then again to the neighbor two over, and so on (note that in later rounds the normal each point drops onto the projection takes into account the ray number it is working on).

To calculate the computational, communication and space complexity of this algorithm, define the following variables:

N = image diameter (N by N image) (and projection length)
P = number of processors (and projections)
p = time to pass one point between processors
C_int = time to process one image point (interpolate and sum)
r = time to load one point into ZMOB via the UNIBUS.

The computation time is the time for each point in one processor's allocation of the image (2 sectors) to be processed, for each projection:

T_comp = P(πN²/4P)C_int = ¼πN²C_int.
The communication time will consist of two parts: the time to pass projections between processors and the time to load the projection data (via the UNIBUS, as shown earlier to be the rate-determining step). Thus,

T_comm = PNp + 2rN².

Finally, the amount of memory required is that for the projection and the portion of the image.

For the values

N = 512 (512 by 512 image, at 1 mm resolution)
P = 256
p = 10⁻⁵ sec (10 μsec/byte ZMOB transfer rate)
r = 4 × 10⁻⁷ sec (400 nsec/byte UNIBUS transfer rate),

we get

T_comm = 1.31 + 0.21 = 1.52 sec
T_comp = 205,887 C_int.
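A sketch (illustrative only) that plugs these values into the two formulas:

import math

N, P = 512, 256
p = 1e-5                    # sec per point, ZMOB conveyor belt
r = 4e-7                    # sec per point, UNIBUS

T_comm = P * N * p + 2 * r * N**2       # 1.31 + 0.21 = ~1.52 sec
T_comp_coeff = math.pi * N**2 / 4       # T_comp = T_comp_coeff * C_int

print(f"T_comm = {T_comm:.2f} sec")             # 1.52 sec
print(f"T_comp = {T_comp_coeff:.0f} * C_int")   # ~205887 * C_int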
About the Author--TODD KUSHNER was born in Bethesda, Maryland, on June 18, 1956. He received his B.S. degree in Life Sciences from the Massachusetts Institute of Technology in 1976, his M.S. in Computer Science from the University of Maryland in 1980, and is currently completing his Ph.D. on parallel image processing at that institution. While at MIT he was awarded a National Science Foundation Undergraduate Research Opportunities grant in Applied Biochemistry. He worked for two years at TMI Systems Corp. and GTE-Telenet from 1977-1979 and has held a number of research positions during his graduate years, including work at the National Institutes of Health; the Alcohol, Drug Abuse, and Mental Health Administration; and staff research work for Congressman Howard of New Jersey. He is a member of the ACM and AAAS.
About the Author--ANGELA Y. WU received her B.S. in Mathematics from Villanova University in 1970 and her Ph.D. in Computer Science from the University of Maryland in 1978. From 1978 to 1980 she was an assistant professor at the University of Maryland Baltimore County. Since Fall 1980 she has been an associate professor at The American University, Washington, D.C. She has also been with the Computer Vision Laboratory at the University of Maryland College Park since 1977.
About the Author--AZRIEL ROSENFELD received a Ph.D. in mathematics from Columbia University in 1957. After ten years in the defense electronics industry, in 1964 he joined the University of Maryland, where he is Research Professor of Computer Science. He edits the journal Computer Graphics and Image Processing, and is president of the consulting firm ImTech, Inc. He has published 13 books and over 250 papers, most of them dealing with the computer analysis of pictorial information. He is currently President of the International Association for Pattern Recognition.