
Pattern Recognition Vol. 15, No. 3, pp. 121-130, 1982.
0031-3203/82/030121-10 $03.00/0
Printed in Great Britain.
Pergamon Press Ltd.
© 1982 Pattern Recognition Society

IMAGE PROCESSING ON MPP: 1*


TODD KUSHNER, ANGELA Y. WU† and AZRIEL ROSENFELD
Computer Vision Laboratory, Computer Science Center, University of Maryland, College Park, MD 20742,
U.S.A.

(Received 2 April 1981)

Abstract--The Massively Parallel Processor (MPP) is a 128 by 128 array of processing elements that communicate with their horizontal and vertical neighbors by shifting data one bit at a time. This paper describes the efficient use of MPP for various types of image processing operations, including point and local operations, discrete transforms and computation of image statistics. A comparison between MPP and ZMOB (a system consisting of 256 microprocessors) is also presented.

Image processing     Parallel processing     Massively Parallel Processor     MPP     Cellular arrays

* This work was supported by the U.S. Air Force Office of Scientific Research under Grant AFOSR-77-3271.
† Also with the Department of Mathematics, Statistics, and Computer Science, American University, Washington, D.C., U.S.A.

1. INTRODUCTION

1.1. MPP

The Massively Parallel Processor (MPP) is a 128 by 128 array of processing elements (PEs) that communicate with their horizontal and vertical neighbors by shifting data one bit at a time. For a description of the MPP design see Batcher.(1) In the following paragraphs we outline only a few basic features of MPP that are needed in designing image processing algorithms for it.

Each image processing algorithm implemented on MPP will consist of two phases: computation and communication. To support the computational aspect of parallel algorithms, each PE, while being a 'bit-slice' processor, is capable of supporting a complete conventional instruction set. Each PE has a bit-addressable local memory of 1024 bits and a number of fast registers to support arithmetic and interprocessor communication.

Parallel algorithms generally require interprocessor communication: to accomplish this, every PE can synchronously shift data to its north, south, east or west neighbor. (At the array edges, processor passing may 'wrap around' to the PEs at the other end of the row or column.) When loading data from the host machine, a 128-long bit vector may be passed to the 128 edge processors all at once, which may, in turn, shift it across the image while the rest of the image is loaded. In the current configuration this data loading occurs over a UNIBUS from a VAX host.

1.2. Image processing on MPP

This paper deals with the efficient use of MPP for performing various types of image processing operations, including point and local operations, discrete transforms and computation of image statistics. The aim is to make the fullest possible use of MPP's parallelism, so as to achieve a speedup by a factor proportional to the number of PEs (128² = 16,384). We also compare MPP processing with performing operations on the host VAX itself, as well as with processing on ZMOB (a system consisting of 256 microprocessors that communicate via a fast shift-register bus). A more detailed treatment of image processing on ZMOB can be found in Kushner et al.(2)

2. POINT OPERATIONS

A point operation on an image maps the value of each pixel into a new value, independent of the values of other pixels. The image is divided equally among the PEs; 1 pixel/processor for a 128 by 128 image, 4 pixels/processor for a 256 by 256 image, 16 pixels/processor for a 512 by 512 image and so on. Images much larger than 512 by 512 cannot be held in the 1024 bits of local memory available to each PE. The PEs are loaded with the image data from the host VAX over the UNIBUS, the point operation is performed and the results are returned to the host VAX.

To compute the amounts of time needed to perform point operations on MPP and on the VAX, let Cm and Cv be the times for an MPP PE and for the VAX, respectively, to perform the given operation on one pixel. In an N by N image there are N² pixels: thus, CvN² and CmN²/16,384 are the times to perform the point operation on the VAX and MPP (with its 16,384 processors) respectively. However, in the case of the MPP there is also the amount of image loading and unloading time to consider.


On the MPP, data is loaded from the host VAX, via the UNIBUS, to a staging area of the MPP, where the data is input simultaneously to 128 edge PEs, 128 bits at a time. Letting r be the rate at which a byte of data is transferred on the UNIBUS (400 nsec) and p be the rate at which a bit of data is passed between PEs, let us compute how long it takes to load a 128 by 128 (say) image of byte-long pixels: (1) from the VAX to the MPP staging area via the UNIBUS; and (2) from the MPP staging area to the PEs (a concurrent process). Via the UNIBUS it takes 128 × 128 × r, or 6.554 msec. From the staging area to the PEs it takes 128 × 128 × 8 bits × 1/128 (the number of bits passed simultaneously) × p, or 1.024 μsec. Thus, the UNIBUS is the rate-limiting step of the MPP image loading process, and the total time to load and unload is rN² + rN² = 2rN².

In summary, on the VAX the time to perform the operation on the entire image is CvN², while the time to perform it on the MPP is 2rN² + CmN²/16,384. If 32,768r + Cm < 16,384Cv, using the MPP is faster than using the VAX.
With local operations the situation is more complicated because information must be shared between neighboring processors. The next section will discuss the amount of time it takes to perform local operations using different neighborhood geometries. A comparison with performing an (iterated) operation on the host VAX will also be given. Due to the limited local memory of MPP PEs, the focus of the discussion will be the one pixel per PE case.

3. LOCAL OPERATIONS

Neighborhood    Step   Pass direction   No. of pixels passed
8-neighbor       1     Up               1
                 2     Right            2
                 3     Down             2
                 4     Left             3
4-neighbor       1     Up               1
                 2     Right            1
                 3     Down             1
                 4     Left             1
2 x 2            1     Down             1
                 2     Right            2
8-component      1     Right            1
                 2     Left             1
                 3     Down             3
4-component      1     Down             1
                 2     Right            1

(The 'Result' column of the original figure, showing the block of pixels each PE has gathered after each step, is not reproduced here.)

Fig. 1. MPP passing sequences for various types of neighbourhoods.
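The sketch below (not from the paper; a word-level NumPy simulation in which np.roll stands in for the synchronous single-step shift with wraparound) traces the 8-neighbor sequence of Fig. 1: after the four passes, totalling 1 + 2 + 2 + 3 = 8 single-pixel transfers, every PE holds its own pixel and all eight neighbors.

```python
import numpy as np

def shift(plane, dr, dc):
    # np.roll with wraparound stands in for the MPP's synchronous
    # nearest-neighbor shift ('wrap around' at the array edges).
    return np.roll(plane, shift=(dr, dc), axis=(0, 1))

def gather_8_neighbors(img):
    """Word-level simulation of the Fig. 1 sequence for the 8-neighbor case.
    Returns a dict mapping a relative offset (drow, dcol) to the plane that
    now holds, at every PE, the value of the pixel at that offset."""
    held = {(0, 0): img}
    # Step 1: pass 1 pixel up           -> each PE receives its S neighbor
    held[(1, 0)] = shift(held[(0, 0)], -1, 0)
    # Step 2: pass 2 pixels right       -> receives W and SW
    held[(0, -1)] = shift(held[(0, 0)], 0, 1)
    held[(1, -1)] = shift(held[(1, 0)], 0, 1)
    # Step 3: pass 2 pixels down        -> receives N and NW
    held[(-1, 0)] = shift(held[(0, 0)], 1, 0)
    held[(-1, -1)] = shift(held[(0, -1)], 1, 0)
    # Step 4: pass 3 pixels left        -> receives E, NE and SE
    held[(0, 1)] = shift(held[(0, 0)], 0, -1)
    held[(-1, 1)] = shift(held[(-1, 0)], 0, -1)
    held[(1, 1)] = shift(held[(1, 0)], 0, -1)
    return held
```

A local operation then simply combines the nine planes held at each PE, e.g. np.maximum.reduce(list(gather_8_neighbors(img).values())) for a 3 × 3 local maximum.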



Each iteration of a local operation consists of two steps: a neighbor-passing step and a computation step involving the gathered neighborhood. Several types of local neighborhoods are commonly used and these (with the steps involved in passing neighbors) are outlined in Fig. 1. Every passing sequence involves the exact number of neighbors required, except for the 8-neighbor connected component case, where one extra neighbor transfer occurs (due to the interconnection structure of MPP). In all, eight pixels are passed in the 8-neighbor case; four pixels in the 4-neighbor case; three pixels in the 2 × 2 case; five pixels in the 8-neighbor connected component case; and two pixels in the 4-neighbor connected component case. In the following paragraphs we analyze a specific case, the 8-neighbor local operation, and give a comparison between the performance of MPP and of the host VAX itself.

When is using MPP better than simply using the host VAX? In other words, when does the overhead of using MPP (loading and unloading an image via the UNIBUS) offset the time saved in performing an (iterated) local operation? To answer this we must first obtain formulas for computation times on the VAX and MPP.

We will assume a 128 × 128 image, thus one pixel per MPP PE. The relevant parameters are

N = length of image side = 128
p = time to pass one bit between MPP PEs
m = number of bits per pixel (8, for 256 grey levels)
Cm = time to compute one local operation on MPP
Cv = time to compute one local operation on VAX
n = number of iterations of the local operation
r = time to pass one pixel over the UNIBUS

On the VAX, the time to compute n iterations of a local operation taking Cv time per pixel is

TVAX = nCvN².

On MPP, the computation must be split into three stages: loading (Lm), processing (Pm), and unloading (Um). As we have already seen, the loading of the MPP PEs is limited by the amount of time it takes to transfer the image pixels over the UNIBUS (loading of the PEs from that point is much faster). Loading and unloading times are the same:

Lm = Um = rN².

There are two stages for each iteration of a local operation on MPP: communication and computation. For an 8-neighbor operation with one pixel/PE, the pass time is 8mp per iteration and the compute time is Cm per iteration. Thus,

Pm = 8nmp + nCm.

In summary, the total time for MPP processing is TMPP = Lm + Um + Pm, or

TMPP = 2rN² + 8nmp + nCm.

Given that the VAX takes some fraction α of the time that an MPP PE does for the given local operation (α will vary), how time-consuming must that local operation be (on MPP, say) before it is worth moving to MPP for processing? Let Cv = αCm and solve

TVAX = TMPP
αnCmN² = 2rN² + 8nmp + nCm
Cm = (2rN² + 8nmp)/(αnN² − n).

Tables 1 and 2 show typical results for the realistic values

N = 128
m = 8
p = 3 × 10⁻⁷ sec (300 nsec/bit PE transfer rate)
r = 4 × 10⁻⁷ sec (400 nsec/byte UNIBUS transfer rate).

Table 1 gives minimum MPP computation times for TVAX = TMPP; Table 2 gives minimum times for TVAX = 10 TMPP.

We can see from these tables that MPP will usually be advantageous over, and often more than ten times faster than, the VAX, since 1-10 μsec is the minimum for MPP PE operations. For short, once-iterated operations, MPP will be IO-bound: for Cm between 10⁻⁷ and 10⁻³ sec the fractional overhead in transferring the image between the VAX and MPP is over 90%; at Cm = 10⁻² sec the overhead is 57%; at Cm = 10⁻¹ sec the overhead is 12%; and, at higher Cm values or for more than one iteration, the overhead drops well below 1%. Generally, more than one iteration of a local operation must be performed before MPP is useful.

In the case where we have several pixels per PE (N by N image, N > 128) the situation is different. For local operations on images larger than 128 by 128 the general formula for the computation time is nCmN²/P, and for the communication time it is [4(N/√P) + 4] (the number of points bordering the size N²/P subregion) times nmp. Thus, with increasing N (within the constraint of the limited PE local memory), the computation time rises by the square and the communication time rises linearly with N; consequently, the calculation becomes more CPU bound. In any case, the small amount of memory per PE limits the number of pixels that can be handled by a PE. The values of a pixel and its eight neighbors already take up a significant fraction of this memory (72 bits, or about 7%). To handle a 2 × 2 block of pixels and their neighbors (a 4 × 4 block in all) requires nearly twice this, and a 3 × 3 block (a 5 × 5 block in all) requires 40% of the memory. It would be difficult to handle much larger blocks.
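The entries of Tables 1 and 2 below follow directly from the threshold equation solved above. A small sketch (not from the paper) evaluating it for the stated constants:

```python
# Minimum MPP per-pixel operation time Cm at which T_VAX = k * T_MPP,
# given Cv = alpha * Cm (k = 1 for Table 1, k = 10 for Table 2).
N, m = 128, 8          # image side, bits per pixel
p = 3e-7               # PE-to-PE transfer time per bit (300 nsec)
r = 4e-7               # UNIBUS transfer time per pixel (400 nsec)

def threshold(alpha, n, k=1):
    """Solve alpha*n*Cm*N^2 = k*(2*r*N^2 + 8*n*m*p + n*Cm) for Cm."""
    return k * (2 * r * N**2 + 8 * n * m * p) / (alpha * n * N**2 - k * n)

print(threshold(alpha=1.0, n=1))           # ~8.01e-7, top-left entry of Table 1
print(threshold(alpha=1/64, n=64, k=10))   # ~9.11e-6, bottom-right entry of Table 2
```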
124 TODD KUSHNER, ANGELA Y. Wu and AZRIEL ROSENFELD

Table 1. Thresholds (in seconds, MPP PE computation time) for MPP preferability to the VAX

n \ α        1          1/2        1/4        1/8        1/16       1/32       1/64
 1       8.01×10⁻⁷  1.60×10⁻⁶  3.21×10⁻⁶  6.41×10⁻⁶  1.28×10⁻⁵  2.57×10⁻⁵  5.15×10⁻⁵
 2       4.01×10⁻⁷  8.02×10⁻⁷  1.61×10⁻⁶  3.21×10⁻⁶  6.43×10⁻⁶  1.29×10⁻⁵  2.58×10⁻⁵
 4       2.01×10⁻⁷  4.02×10⁻⁷  8.05×10⁻⁷  1.61×10⁻⁶  3.22×10⁻⁶  6.45×10⁻⁶  1.29×10⁻⁵
 8       1.01×10⁻⁷  2.02×10⁻⁷  4.05×10⁻⁷  8.13×10⁻⁷  1.62×10⁻⁶  3.24×10⁻⁶  6.50×10⁻⁶
16       5.12×10⁻⁸  1.02×10⁻⁷  2.05×10⁻⁷  4.10×10⁻⁷  8.20×10⁻⁷  1.64×10⁻⁶  3.29×10⁻⁶
32       2.62×10⁻⁸  5.24×10⁻⁸  1.05×10⁻⁷  2.09×10⁻⁷  4.19×10⁻⁷  8.39×10⁻⁷  1.68×10⁻⁶
64       1.37×10⁻⁸  2.73×10⁻⁸  5.47×10⁻⁸  1.09×10⁻⁷  2.19×10⁻⁷  4.38×10⁻⁷  8.78×10⁻⁷

(n = number of iterations of the local operation; α = ratio of the VAX to the MPP PE per-pixel operation time.)

Table 2. Thresholds (in seconds, MPP PE computation time) for 10-fold speedup over the VAX when using MPP

n \ α        1          1/2        1/4        1/8        1/16       1/32       1/64
 1       8.02×10⁻⁶  1.60×10⁻⁵  3.21×10⁻⁵  6.44×10⁻⁵  1.29×10⁻⁴  2.61×10⁻⁴  5.34×10⁻⁴
 2       4.01×10⁻⁶  8.03×10⁻⁶  1.61×10⁻⁵  3.23×10⁻⁵  6.48×10⁻⁵  1.31×10⁻⁴  2.67×10⁻⁴
 4       2.01×10⁻⁶  4.03×10⁻⁶  8.07×10⁻⁶  1.62×10⁻⁵  3.25×10⁻⁵  6.57×10⁻⁵  1.34×10⁻⁴
 8       1.01×10⁻⁶  2.03×10⁻⁶  4.06×10⁻⁶  8.13×10⁻⁶  1.63×10⁻⁵  3.30×10⁻⁵  6.74×10⁻⁵
16       5.12×10⁻⁷  1.02×10⁻⁶  2.05×10⁻⁶  4.11×10⁻⁶  8.27×10⁻⁶  1.67×10⁻⁵  3.41×10⁻⁵
32       2.62×10⁻⁷  5.24×10⁻⁷  1.05×10⁻⁶  2.10×10⁻⁶  4.23×10⁻⁶  8.54×10⁻⁶  1.74×10⁻⁵
64       1.37×10⁻⁷  2.74×10⁻⁷  5.48×10⁻⁷  1.10×10⁻⁶  2.21×10⁻⁶  4.46×10⁻⁶  9.11×10⁻⁶

4. COMPUTATION OF IMAGE STATISTICS

In this section we consider some MPP tasks involving computation of image statistics - in particular, the computation of image histograms and co-occurrence matrices on MPP.

4.1. Histograms

The histogram algorithm for MPP consists of two main steps: histogramming the columns of the image (creating a histogram for the pixels in each column, with the 'buckets' for each grey-level residing along with the pixels in the PEs of each row) and totalling the rows so that the (e.g.) leftmost column of PEs contains the final histogram for the image. For simplicity, the method described below is designed for one pixel and one histogram bucket per PE - a 128 × 128 image, and 128 (i.e., seven-bit) gray levels.

(a) Histogramming columns. The method for histogramming the columns of the image involves passing the gray levels cyclically (and synchronously) around the PEs of that column, using the 'wraparound' feature of the MPP when passing pixels between processors. The goal is to have the processor in row i of the given column contain a count of the number of occurrences of gray level i in that column. In this example each PE sets aside an eight-bit counter for the histogram 'bucket' and cycles the seven-bit gray levels through each of the 128 PEs in the column. Whenever a gray level corresponding to the row number of the PE passes through, the counter in that PE is incremented by 1. This method is extensible to more than 128 gray levels; the processors simply multiply their responsibility for gray levels (e.g., two each for 256 or four each for 512 gray levels); this is similarly true for larger images. Letting N = the number of processors in the column (128) and m = the number of gray levels (128, in this example), the complexity of this part of the algorithm is O(N log m). See Fig. 2 for an example of an eight-long column (and eight gray levels).

PE row \ pass step     1     2     3     4     5     6     7     8
        1             2/0   3/0   4/0   3/0   0/0   6/0   7/0   6/0
        2             3/0   4/0   3/0   0/0   6/0   7/0   6/0   2/1
        3             4/0   3/1   0/1   6/1   7/1   6/1   2/1   3/2
        4             3/0   0/0   6/0   7/0   6/0   2/0   3/0   4/1
        5             0/0   6/0   7/0   6/0   2/0   3/0   4/0   3/0
        6             6/1   7/1   6/2   2/2   3/2   4/2   3/2   0/2
        7             7/1   6/1   2/1   3/1   4/1   3/1   0/1   6/1
        8 (0)         6/0   2/0   3/0   4/0   3/0   0/1   6/1   7/1

In entry a/b, a = value passing through, b = counter contents. The values are cyclically shifted upward. Each counter adds 1 when the value passing through it is equal to its row number. In this example, there are 8 PEs and 8 gray levels.

Fig. 2. Column histogramming example.
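A word-level simulation of step (a) (a sketch, not from the paper; the real MPP works bit-serially) makes the cyclic scheme of Fig. 2 concrete. For clarity it uses 0-based rows, with row i counting gray level i, whereas the paper numbers rows 1-128 and lets row 128 count level 0.

```python
import numpy as np

def column_histograms(img):
    """img: an n x n array of gray values in [0, n). After n cyclic upward
    shifts, counters[i, j] holds the number of occurrences of gray level i
    in column j (one pixel and one histogram bucket per PE)."""
    n = img.shape[0]
    values = img.copy()
    counters = np.zeros_like(img)
    levels = np.arange(n).reshape(-1, 1)        # gray level each PE row counts
    for _ in range(n):
        counters += (values == levels)          # compare-and-increment step
        values = np.roll(values, -1, axis=0)    # shift values cyclically upward
    return counters

img = np.random.randint(0, 8, size=(8, 8))      # the 8-PE, 8-level case of Fig. 2
col_hist = column_histograms(img)
assert np.array_equal(col_hist.sum(axis=1),
                      np.bincount(img.ravel(), minlength=8))
```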

(b) Totalling rows. Totalling the rows to obtain the final histogram is done in a somewhat more complicated fashion. The method is to pass the counters derived from the column histogramming step leftward and sum them at each level. This summing may be done bit-by-bit (by adding two bits and saving the carry for the next round), since they must be passed bitwise anyway, to save time. The least significant bit (LSB) is passed leftward first and this is added to the LSB of the held counter (with the carry saved in a special register); the LSB of the resulting number is passed at the next step. This continues until the final LSB propagates to the leftmost column, where it is added to that column's counter and results in the LSB of the final bucket count. Meanwhile, the next-to-last bit propagates leftward after the LSB, being added to the next-to-last bit and the carry from the LSB addition, in the same fashion, until it propagates to the left column.

Since larger and larger counts are being formed as the column totals merge, the counters of each column must be extended to accommodate these sums. For column N (numbering from 1 at the right to 128 at the left), that column's counter must be extended to [(log₂ N) + 8] bits. So that the algorithm may work in proper synchrony, every bit of each counter must be passed leftward, even leading zeros. Figure 3 presents a worked out example for a row of length 6.

Letting N = the number of processors in a row of the processor array (128 on MPP), it takes N steps to propagate the LSB to the left column. It then takes (2 log₂ N − 1) steps to pass the rest of the (2 log N)-bit counter maintained by the PE in the second-to-left column. Thus, this part of the algorithm takes O(N + log₂ N) steps.

The total complexity of histogramming on the MPP is O(N log₂ m) (m the number of gray levels) from the first part plus O(N + log₂ N) in the second part, which totals to O(N log m).

(c) Time requirements. The first step, column histogramming, involves cycling N m-bit pixels through the column PEs, comparing the pixel value to the row number and (potentially) incrementing a counter at each step (note that on an SIMD machine such as MPP a step such as this incrementing takes just as much time whether it occurs or not, since the instruction(s) must be sent to each processor anyway; they are simply disarmed if necessary). Thus, at each of N steps an m-bit pass, an m-bit compare and an (n + 1)-bit add occur; thus, the time taken for column histogramming is

Tcol = N[mp + mc + (n + 1)a].

Step    Row contents

[Worked example of the bit-serial row totalling for a row of length 6; the step-by-step binary row contents are not legibly recoverable from this scan.]

In each entry, bits that have just been passed are underlined; primes denote positions of carry bits.

Fig. 3. Row totalling example.
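A matching word-level stand-in for step (b) (again a sketch, not from the paper, abstracting away the bit-serial pipelining that lets the real pass finish in only N + 2 log₂N − 1 shift steps): a partial sum streams leftward, every PE adding its own counter before forwarding, so the leftmost column ends up holding the row totals.

```python
import numpy as np

def total_rows(counters):
    """counters: rows x cols array from column histogramming. Returns the
    per-row totals, i.e. the final image histogram (one bin per row)."""
    rows, cols = counters.shape
    streaming = np.zeros(rows, dtype=np.int64)   # partial sums moving leftward
    for j in range(cols - 1, -1, -1):            # from the rightmost PE column
        streaming = counters[:, j] + streaming   # add own counter, pass it on
    return streaming

# With column_histograms from the previous sketch:
#   hist = total_rows(column_histograms(img))    # full histogram of img
```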



Here

N = length of image side = 128
n = log₂ N = 7
m = number of bits per gray level = 7 (128 gray levels)
p = time to pass one bit between MPP PEs (300 nsec)
a = time, per bit, to add two numbers on MPP (300 nsec)
c = time, per bit, to compare two numbers on MPP (400 nsec)
r = time to pass one pixel over the UNIBUS (400 nsec).

The MPP instruction timing will vary, depending on the exact programming of the algorithm.

For the second step, row totalling, there are (N + 2n − 1) steps where one bit is passed and one addition takes place; thus, the time taken for row totalling is

Trow = (N + 2n − 1)(p + a).

To this is added the time to load and unload the image, which is

Tload = Tunload = rN².

The total time for histogramming a 128 by 128 image (128 gray levels) on MPP is thus

TMPP = N[mp + mc + (n + 1)a] + (N + 2n − 1)(p + a) + 2rN²
     ≈ 0.001019 (compute) + 0.0098304 (load and unload)
     ≈ 0.0108494 sec.
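Plugging the stated constants into these formulas (a short check, not from the paper) reproduces the quoted compute-phase figure; the load/unload term is 2rN² as before.

```python
# Compute-phase time of MPP histogramming for the stated constants.
N, n, m = 128, 7, 7                    # image side, log2(N), bits per gray level
p, a, c = 3e-7, 3e-7, 4e-7             # pass, add and compare times per bit

t_col = N * (m * p + m * c + (n + 1) * a)    # column histogramming
t_row = (N + 2 * n - 1) * (p + a)            # row totalling
print(t_col + t_row)                         # ~0.001019 sec, as quoted above
```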
tercommunication, unless each processor can hold the
On the VAX, histogramming requires the time it block of data it needs to calculate the values of the
takes to update one histogram bin (say tv) times the output pixels, there is no 'smooth' way of getting the
number of pixels in the image, N 2. Thus, the time to needed data to its destination in a parallel fashion.
histogram an image on VAX is

TVAx = N2t~, 6. COMPARISON OF MPP AND ZMOB

For the 300nsec cycle time of the VAX, t owill typically Tables 3 and 4 show the performance of M P P and
be 1-10 #see, depending on how the program is coded ZMOB, respectively, at various basic image processing
(assembly versus C). Thus, on the VAX, histogram- tasks. The M P P table uses bits as the basic image units,
ming a 128 by 128 image will take about 0.0016384 to whereas the ZMOB table uses pixels. These tables
0.016384 seconds. This is 0.15-1.5 times the total include total complexity measures for computation
M P P time, or 1.6-16 times the M P P computation time, communication time and memory requirements
time alone. Thus, M P P seems to offer only a marginal, as a function of image size (N, the diameter}, number of
if any, improvement over using the VAX for this task. processors (P), the number of gray levels (M) and
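For reference, a plain host-side computation of a co-occurrence matrix (a sketch, not from the paper, and not the MPP scheme itself; it simply shows what is being counted for a displacement (dr, dc) with non-negative components):

```python
import numpy as np

def cooccurrence(img, dr, dc, n_gray):
    """Entry [g1, g2] counts pixels of value g1 whose neighbor at
    displacement (dr, dc) has value g2 (dr, dc >= 0 assumed for brevity)."""
    a = img[:img.shape[0] - dr, :img.shape[1] - dc]   # value g1 at each pixel
    b = img[dr:, dc:]                                  # value g2 at its neighbor
    mat = np.zeros((n_gray, n_gray), dtype=np.int64)
    np.add.at(mat, (a.ravel(), b.ravel()), 1)          # increment entry (g1, g2)
    return mat
```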
5. TWO-DIMENSIONAL DISCRETE TRANSFORMS

On MPP the following method calculates the two-dimensional Fourier transform (or other similar discrete transform) of an N by N image in O(N) time. The process is composed of two steps: the discrete transform of the image row-wise, then the discrete transform column-wise. To transform the rows, each processor computes the first complex term it will use in its summation, multiplies it by the pixel value and stores the result in a register. Then each pixel is shifted circularly, the second term is calculated, multiplied, added to the counter, and so on. This process is similarly repeated for the columns. Each takes N steps, thus the algorithm takes O(N) time. However, while this method does well on 128 by 128 images (one pixel per PE), the processors quickly run out of local memory with larger images.
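The row-wise pass can be simulated at word level as follows (a sketch, not from the paper; NumPy's roll again plays the part of the circular shift, and the result agrees with an ordinary FFT).

```python
import numpy as np

def row_dft_by_shifts(img):
    """Each PE keeps one complex accumulator (the coefficient for its own
    column index k); the pixels of its row are shifted circularly past it,
    and at each of the N steps it adds in pixel * twiddle factor."""
    rows, N = img.shape
    k = np.arange(N).reshape(1, N)                 # coefficient index per PE
    acc = np.zeros((rows, N), dtype=complex)
    vals = img.astype(complex)
    for t in range(N):
        x = (k + t) % N                            # original column now in hand
        acc += vals * np.exp(-2j * np.pi * k * x / N)
        vals = np.roll(vals, -1, axis=1)           # circular shift along the row
    return acc

img = np.random.rand(4, 8)
assert np.allclose(row_dft_by_shifts(img), np.fft.fft(img, axis=1))
```

Repeating the same accumulate-and-shift pass along the columns then gives the full two-dimensional transform in 2N shift steps.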
MPP is also very limited in its ability to perform geometric operations on images, primarily due to memory constraints. Due to the fixed geometry of the processors and the synchronous nature of their intercommunication, unless each processor can hold the block of data it needs to calculate the values of the output pixels, there is no 'smooth' way of getting the needed data to its destination in a parallel fashion.
6. COMPARISON OF MPP AND ZMOB

Tables 3 and 4 show the performance of MPP and ZMOB, respectively, at various basic image processing tasks. The MPP table uses bits as the basic image units, whereas the ZMOB table uses pixels. These tables include total complexity measures for computation time, communication time and memory requirements as a function of image size (N, the diameter), number of processors (P), the number of gray levels (M) and various constants. Tables 5 and 6 restate this information for the histogramming algorithm, based on the relations of P and M to N. Note that a factor of O(N²), due to the UNIBUS image loading and unloading step, appears in each communication complexity formula, separated by parentheses from the interprocessor communication complexity.

Table 3. MPP summary (measures per bit)

Point operations:
  Computation:   CmN²/P
  Communication: 2rBN²logM
  Memory:        (N²/P)logM

Local operations (8-neighbors; i = number of iterations):
  Computation:   iCmN²logM/P
  Communication: [4pBi(N/√P + 1) + 2rBN²]logM
  Memory:        (N/√P + 2)²logM

Histogramming (cm1, cm2 = 1st and 2nd phase time per bit):
  Computation:   cm1N²MlogM/(P√P) + cm2(N + 2logN − 1)M/√P
  Communication: N²pBlogM + pB(N + 2logN − 1)M/√P + 2rBN²logM
  Memory:        [N²/P + log(N²)M/√P]logM

Co-occurrence matrices ((l1, l2) = displacement; cm1, cm2 as above):
  Computation:   2cm1N²M²logM/(P√P) + cm2(N + 2logN − 1)M²/√P
  Communication: 2N²pBlogM/√P + pB(N + 2logN − 1)M²/√P + (N/√P + l1)(N/√P + l2)rBlogM + N²rBlogM
  Memory:        [(N/√P + l1)(N/√P + l2) + log(N²)M²/√P]logM

Discrete transforms:
  Computation:   2CmN
  Communication: 2NpBlogM + 2rBN²logM
  Memory:        (N²/P)logM

N² = size of (N by N) image; P = number of processors; M = number of gray levels;
Cm = time to compute one operation, per bit, on MPP; rB = UNIBUS transfer rate, per bit;
pB = PE intercommunication rate, per bit.

Table 4. ZMOB summary (measures per pixel)

Point operations:
  Computation:   CzN²/P
  Communication: 2rN²
  Memory:        N²/P

Local operations (8-neighbor; i = number of iterations):
  Computation:   iCzN²/P
  Communication: 4ρi(N/√P + 1) + 2rN²
  Memory:        (N/√P + 2)²

Histogramming:
  Computation:   CzN²/P
  Communication: Mρ(P − 1)log(N²/P)/(PlogM) + 2rN²
  Memory:        N²/P + [Mlog(N²/P) + log(N²)]/logM

Co-occurrence matrices ((l1, l2) = displacement):
  Computation:   CzN²/P
  Communication: M²ρ(P − 1)log(N²/P)/(PlogM) + (N/√P + l1)(N/√P + l2)r + rN²
  Memory:        (N/√P + l1)(N/√P + l2) + [Mlog(N²/P) + log(N²)]/logM

Discrete transforms:
  Computation:   2CzN²logN/P
  Communication: ρ(N²/P − N²/P²) + 2rN²
  Memory:        N²/P

N² = size of (N by N) image; P = number of processors; M = number of gray levels;
Cz = time to compute one operation, per pixel, on ZMOB; r = UNIBUS transfer rate (per pixel);
ρ = conveyor belt transfer rate (per pixel).

If the number of processors in ZMOB is regarded as proportional to the image diameter (N), and the number of processors in MPP as proportional to image size (N²), then we see in Tables 3 and 4 how computational complexity decreases, but intercommunication complexity increases, when the relative number of processors assigned to a task increases. A comparison of the actual timings of a histogram algorithm, in Tables 1 and 2, and Tables 3 and 4 in Kushner et al.,(2) shows that in reality the machines are quite close in their utility relative to the VAX.
Table 5. MPP histogramming complexity: Computation, Communication, Memory

[Entries give the order of the MPP histogramming computation time, communication time and memory requirement for each combination of the number of processors P (columns: O(N²), O(N), O(constant)) and the number of gray levels M (rows: O(N²), O(N), O(constant)); each communication entry carries the O(N²) UNIBUS loading/unloading term in parentheses. The individual entries are not legibly recoverable from this scan.]

Table 6. ZMOB histogramming complexity: Computation, Communication, Memory

[Companion to Table 5 for ZMOB: entries give the order of computation time, communication time and memory for each combination of P and M as O(N²), O(N) or O(constant), with the O(N²) UNIBUS term again shown in parentheses in the communication entries. The individual entries are not legibly recoverable from this scan.]

7. CONCLUDING REMARKS

Due to the inflexible intercommunication structure in MPP, certain algorithms are constrained to have a value or values propagate from one end of the array to the other, and thus have an unavoidable factor of N, or 8N for one-byte data, built into their complexity. In addition, other algorithms, where communication does not occur in a tightly orchestrated way, become intractable. The severely limited local memory space is also a difficulty in considering certain algorithms or certain (practical) image sizes. Nevertheless, MPP still manifests significant speed advantages, particularly when it is used for point and local space-domain operations or for transform-domain filtering. It will be a powerful tool for image processing and analysis.

Acknowledgement--The help of Sherry Palmer in preparing this report is gratefully acknowledged.

REFERENCES

1. K. E. Batcher, Design of a Massively Parallel Processor, IEEE Trans. Comput. C-29, 836-840 (1980).
2. T. Kushner, A. Y. Wu and A. Rosenfeld, Image Processing on ZMOB, TR-987, Computer Science Center, University of Maryland, College Park, MD, December 1980.
APPENDIX

Image reconstruction on MPP and ZMOB

The two methods of image reconstruction which will be discussed for implementation on MPP and ZMOB are the Filtered Back Projection and Fourier reconstruction methods. The former basically involves taking each point of a density projection and 'smearing' its value, divided by an appropriate measure of width, across the image. This is repeated for each projection, its points being smeared additively, with suitable (pre- and) postprocessing of the image to compensate for the spread function of the back projection process. The latter method involves taking the Fourier transform of each projection and, by applying the Fourier Slice Theorem (which states that the transforms of the projections are the values of the central cross sections, at the same orientations, of the transformed image), using them as values from which to interpolate the Cartesian-grid representation of the transformed image, from which the reconstructed image is derived by inverse transformation.

On the MPP, the first method, filtered back-projection, is difficult, due to the non-linear nature of the reconstruction process. The problem may be restated thus: for any point in the image, what points from each projection must be used to get (interpolate) that projection's contribution to the final value? Since the projections are at various orientations, this becomes a geometric operation problem which, except for the two-projection situation, is of a form that the fixed geometry of the MPP cannot easily handle.

In the Fourier reconstruction method, while rows of processors may be able to transform the projections, and the projections, once in place among the appropriate processors, may be fairly readily interpolated (and the image inverse transformed by the method in Section 5), it is not clear how to smoothly get the transformed projection points to the processors where they belong.

For image reconstruction on ZMOB, there is an attractive way to implement the filtered back-projection method. Given P processors and projections, the circular image is partitioned into 2P sectors, and each processor is assigned two opposite sectors such that each projection bisects each pair of sectors. For an N by N image, each sector will contain approximately πN²/4P points (about 50 for a 32 by 32 image with 16 partitions). Each processor is then loaded with the projection data assigned to it. Each point in the sectors will add to a running sum, as the back-projected contribution from that projection, an interpolated value, depending on where a line from the point, normal to the projection, falls on the projection. After the first projection is processed each processor passes those values to its next neighbor, then again to the neighbor two over and so on (note that in later rounds the normal that each point drops onto the projection takes into account the ray number it is working on).

To calculate the computational, communication and space complexity of this algorithm, define the following variables:

N = image diameter (N by N image) (and projection length)
P = number of processors (and projections)
ρ = time to pass one point between processors
Cint = time to process one image point (interpolate and sum)
r = time to load one point into ZMOB via the UNIBUS.

The computation time is the time for each point in one processor's allocation of the image (2 sectors) to be processed, for each projection:

Tcomp = P(πN²/4P)Cint = ¼πN²Cint.


The communication time will consist of two parts: the time to pass projections between processors and the time to load the projection data (via the UNIBUS, as shown earlier to be the rate-determining step). Thus,

Tcomm = PNρ + 2rN².

Finally, the amount of memory required is that for the projection and the portion of the image:

Memory size = πN²/4P + N.

To find how well this algorithm compares to commercial algorithm timings (around 10 sec), using the following representative values,

N = 512 (512 by 512 image, at 1 mm resolution)
P = 256
ρ = 10⁻⁵ sec (10 μsec/byte ZMOB transfer rate)
r = 4 × 10⁻⁷ sec (400 nsec/byte UNIBUS transfer rate),

we get

Tcomm = 1.31 + 0.210 = 1.52 sec
Tcomp = 205776 Cint

and for Cint = (1 μsec, 10 μsec, 100 μsec) we get Tcomp = (0.206 sec, 2.06 sec, 20.6 sec), for a total time of (1.73 sec, 3.58 sec, 22.1 sec).

For the range of Cint values used, which should be realistic since many of the values used in projection normal computation and interpolation may be precomputed instead of computed 'on-the-fly', the timings for ZMOB image reconstruction should be very attractive compared to commercial systems.
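A short check (not from the paper) of these figures from the formulas above:

```python
import math

# ZMOB filtered back-projection timing estimates for the representative values.
N, P = 512, 256            # image side / projection length, processors / projections
rho, r = 1e-5, 4e-7        # point pass time (10 usec), UNIBUS load time (400 nsec)

t_comm = P * N * rho + 2 * r * N * N          # ~1.52 sec, as above
for c_int in (1e-6, 1e-5, 1e-4):              # interpolate-and-sum time per point
    t_comp = 0.25 * math.pi * N * N * c_int   # ~0.206, 2.06, 20.6 sec
    print(c_int, t_comm + t_comp)             # totals ~1.73, 3.58, 22.1 sec
```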

About the Author--TODD KUSHNER was born in Bethesda, Maryland, on June 18, 1956. He received his B.S.
degree in Life Sciences from the Massachusetts Institute of Technology in 1976, his M.S. in Computer Science
from the University of Maryland in 1980 and is currently completing his Ph.D. on parallel image processing
from that institution. While at MIT he was awarded a National Science Foundation Undergraduate
Research Opportunities grant in Applied Biochemistry. He worked for two years at TMI Systems Corp. and
GTE-Telenet from 1977-1979 and has held a number of research positions during his graduate years,
including work at the National Institutes of Health; the Alcohol, Drug Abuse, and Mental Health
Administration; and staff research work for Congressman Howard of New Jersey. He is a member of the
ACM and AAAS.

About the Author--ANGELA Y. WU received her B.S. in Mathematics from Villanova University in 1970 and
Ph.D. in Computer Science from the University of Maryland in 1978. From 1978 to 1980 she was an assistant
professor at the University of Maryland Baltimore County. Since Fall 1980 Angela Y. Wu has been an
associate professor at The American University, Washington, D.C. She has also been with the Computer
Vision Laboratory at the University of Maryland College Park since 1977.

About the Author--AZRIEL ROSENFELD received a Ph.D. in mathematics from Columbia University in 1957.
After ten years in the defense electronics industry, in 1964 he joined the University of Maryland, where he is
Research Professor of Computer Science. He edits the journal Computer Graphics and Image Processing, and
is president of the consulting firm ImTech, Inc. He has published 13 books and over 250 papers, most of them
dealing with the computer analysis of pictorial information. He is currently President of the International
Association for Pattern Recognition.
