1350 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS, VOL. 36, NO. 10, OCTOBER 1989

An Efficient ASIC Architecture for Real-Time Edge Detection
CHEN-YI LEE, FRANCKY V. M. CATTHOOR, MEMBER, IEEE, AND
HUGO J. DE MAN, FELLOW, IEEE

Abstract - In this paper, an efficient application-specific architecture is presented for a real-time edge detection system. This architecture is based on the cooperating data-path model, which has made it possible to optimize both the throughput and the area for this recursive algorithm. Careful scheduling of the operations on the partly parallel, partly shared hardware balances the load on each of the 4 data-paths. In this way, the inherently high degree of concurrency in the algorithm has been effectively exploited in the parallel pipelined hardware. The layout of these data-paths has been generated by means of powerful CAD tools and the use of a parameterizable functional building block library. The corresponding global controller has been partitioned in order to optimize the critical path. This has increased the achievable clock rate even further, up to 10 MHz. The stringent I/O requirements have also been taken into account. The resulting ASIC has been verified by register-transfer simulation. It is more than twice as fast as existing designs. The effectiveness of the cooperating data-path model is thus clearly substantiated with this large practical test-vehicle.

Manuscript received November 11, 1988; revised May 26, 1989.
The authors are with the Interuniversity Micro-Electronics Centre (IMEC), B-3030 Heverlee, Belgium.
IEEE Log Number 8929986.
0098-4094/89/1000-1350$01.00 © 1989 IEEE
Authorized licensed use limited to: VISVESVARAYA NATIONAL INSTITUTE OF TECHNOLOGY. Downloaded on February 15,2024 at 16:24:26 UTC from IEEE Xplore. Restrictions apply.

I. INTRODUCTION

As VLSI technology advances, more and more complex digital signal processing (DSP) algorithms can be realized on a single chip. In the domain of high-speed image processing [1] the use of custom or application-specific IC's is still essential to match the throughput requirements imposed by the real-time environment. Moreover, the architecture designer has to consider the limitations on the power dissipation and pin-count in order to decrease the cost of the IC package. Finally, especially for relatively high volumes, the chip area and the corresponding yield are also important. General-purpose programmable solutions currently available, such as the DSP chips from Motorola, NEC, and TI, or even the more domain-specific image processors from NEC [2] and Toshiba [21], are not sufficient to reach all these goals. Therefore, at the Interuniversity Micro Electronic Center (IMEC) we have developed efficient ASIC methodologies, and CAD tools supporting them, which make it possible to combine the objectives listed above. The major problem for ASIC's lies in the large design times necessary to map a complex DSP algorithm into a suitable architecture. System designers always ask for more functionality without directly taking into account the requirements imposed by the physical implementation. It is the task of the architecture designer to correct this. In order to bridge the gap between system designers and silicon designers, and to reduce the design cycle, a "meet-in-the-middle" design methodology [3] is advocated at our laboratory. It has been implemented in the CATHEDRAL tool-boxes [4] which support high-level architectural synthesis, library-based and technology-scalable module generation, and verification at all levels. Each of these CATHEDRAL's is based on a different target architecture [7].

In this paper, the in-depth architecture design will be studied for a large practical demonstrator: an edge detector algorithm for gray-scale images. It is part of a robot vision system which is described in Section I-1.1. It has been used as a key vehicle to drive the development of one of our target architectures, namely the cooperating data-path model, which combines shared cooperating data-paths, distributed memory and hierarchically decomposed controllers (Section I-1.2). It will be shown that applying this model to the recursive edge detector leads to an ASIC which is superior to alternative application-specific approaches such as systolic arrays [5], [6], which are more suited for very modular front-end processing, or multiple microcoded processors [8], which are tuned towards the back-end operations with much decision-making and lower rates. In this way, each architectural model has its own application range [7], depending on the ratio between sample and clock rates, and on the signal flow graph (SFG) properties of the selected algorithm.

1.1. An Example: Real-time Robot Vision System
Currently, most industrial vision systems are based on binary image processing. Using binary vision imposes heavy restrictions on their applicability. Gray-level vision, on the other hand, offers a higher quality and would broaden the possibilities of machine vision. In the past decades, the bottleneck of the large amount of calculations has prevented progress towards real-time realizations, but efficient ASIC's are now emerging which do solve this problem.

A typical vision system is divided in several stages, each with its own task [1]. A particular implementation, as proposed in [9], is presented in Fig. 1. It consists of Sobel amplitude generation, edge detection, feature extraction and pattern recognition. The resulting information can for

instance be sent to the controller of a robot arm. To obtain a stable system for real-time processing, each stage has its own time specification, as indicated. The preprocessing stage collects gradient information over the gray-scale image (Sobel amplitudes) in three directions. These data are combined into a single figure which represents a coarse measure for the ridge profile. The second stage will then detect all edges based on a robust edge follower algorithm [10]. It will be explained in more detail in Section II. Next, the obtained edge information is sent to the feature extraction stage, where the edge segments produced are linked into contours or more high-level features of the objects. Finally, pattern recognition or any other back-end image operation can be performed.

Fig. 1. A typical robot vision system. (Stage time specifications shown in the figure, in order: 30 ms, 40 ms, 40 ms, 30 ms.)

1.2. The Cooperating Data-Path Architectural Model
Several application-specific architectural styles have been proposed for efficient image processing at different levels [7]. Application of these architectural models makes it possible to exploit the properties of the algorithm to be implemented and to obtain real-time behavior at reasonable area, power and pin-count costs. The edge detector demonstrator in this paper has to sustain a relatively high frame rate (one frame every 40 ms). However, it is recursive in nature, as a new prospective edge element (pixel) can only be addressed when processing of the previous edge element is finished. For this class of algorithms, the cooperating data-path style is very well suited [7].

As shown in Fig. 2, the organization consists of several bit-sliced data-paths optimized to their task, which communicate with each other over dedicated busses and which are potentially pipelined internally. The construction of each block is heavily dependent on the required operations, the critical timing path and the communication with other units. The data-paths are usually shared for several operations, as the ratio between achievable clock frequency and required sample rate is typically higher than 1. They are constructed by looking into the detailed data manipulations and by merging similar operations so that the hardware is fully exploited.

Fig. 2. General model for the cooperating data-path.

In addition, they have access to distributed on-chip memory and to the outside through a high-speed I/O block. I/O communication is an important item for image processing applications, especially because of the need for a huge frame-memory which has to reside off-chip. As fetching data from off-chip memories is rather slow and limits the performance of the complete chip, it is essential to take into account the I/O bottleneck during the design.

Mainly due to the time multiplexing, but also because of the necessary initialization and mode selection, the global controller which "orchestrates" the entire chip can become quite complex. It accepts all the flags from the dedicated data-paths or the outside world and generates the corresponding control signals required at the next cycle. As the recursive nature of the algorithms addressed here will lead to pipeline sections (and thus critical timing paths) which span both data and control hardware, the controller typically has to be decomposed hierarchically to break up these multiple critical paths into smaller segments. This task is crucial for obtaining high throughputs and it is very tedious.

The sequel of this paper is organized as follows. In Section II, the selected demonstrator algorithm, i.e., an edge follower, will be described briefly. Realistic system specifications will be included to guide the chip design. Construction of the detailed architecture with dedicated data-paths and a hierarchical controller will be discussed in Sections III and IV. For both of them, the pros and cons of the cooperating data-path model will be evaluated. The important I/O considerations will be mentioned in Section V. Finally, more global evaluations, some aspects of automated high-level synthesis and conclusions will be given in the last three sections.

II. ALGORITHM AND SYSTEM CONSIDERATIONS

In this section, a brief description of the edge detection algorithm will be given. An SFG representing the complete algorithm will be presented too. Starting from this SFG, the approximate number of cycles required for searching all edges can be calculated.

2.1. The Edge-Detection Algorithm
The original algorithm was developed at the Katholieke University, Leuven [10]. In this algorithm, edges which are a single pixel wide can be detected along the ridges. The input is a preprocessed image where Sobel gradients have been identified.

Several terms essential for understanding the algorithm are defined in Fig. 3. The algorithm makes use of several "windows" which are defined as sets of neighboring pixels either in vertical or in horizontal direction. There are basically two types: 2 main windows and 2 control windows. The former are defined along the orientation of the progress on the detection of the edge element (both a horizontal and a vertical segment are employed), while the latter are orthogonal to that orientation. For example, in Fig. 3 the current orientation is bottom-right. In this case, x and y main windows are selected at the bottom and the


right side, while x and y control windows are selected at the top and the left side of the current edge element. The size of the control window is almost half of that of the main window because only that part offers useful information. The next edge element can only be found in these two main windows. In order to decide whether the main windows provide enough information for the detection of the next edge pixel, the two centers of the horizontal (x) and vertical (y) main windows are checked. If they lie in the same quadrant, enough information is available; in the other case, additional information is collected by investigating the control windows. Continuous thresholding is applied to determine whether a pixel can reside on the edge or not. Sometimes, if the information from the main windows is not sufficient, an adaptation of the current edge threshold is also considered. This adaptation happens when the number of candidates (the pixels whose Sobel value exceeds the threshold) in either of the main windows is less than (or greater than) a pre-defined number (e.g., below "3" for a reduction, above "6" for an increase). This makes the algorithm much more robust. A 3-bit code is used to indicate the orientation of a newly found edge element (relative to the current edge pixel). This can be determined straightforwardly from the current and the new edge location.

Fig. 3. Some terms defined in this edge follower algorithm. (a) Direction defined in the algorithm. (b) Definitions of main window and control window. (c) In the same quadrant. (d) In different quadrants.

The complete algorithm is described in a procedural form:

1) Before starting edge detection, do initialization; wait until the edge detection can start.
2) Start searching a ridge larger than the threshold; IF found GO TO next step ELSE IF the whole frame is scanned, send out FINISH status signal and GO TO step 1 ELSE continue search for ridge.
3) Search first edge element and locate its address. This can be divided into the following steps:
   a) Search vertically down for more pixels located on the ridge and identify the central position of this vertical section.
   b) From this center, 2 horizontal windows are scanned. For both of these horizontal sections, the center is located.
   c) Again, starting from this newly found center, 2 vertical windows are scanned and the center is determined.
4) IF the final center is found THEN GO back TO step 2 for searching another segment ELSE define it as the first element of a new edge segment and GO TO next step.
5) Scan main windows (X and Y) and collect data items for decision-making. These data items are:
   a) the main candidates (MAX and MAY): those pixels defined in the main windows (see Fig. 3(b)) which are equal to or larger than the threshold;
   b) the control candidates (COX and COY): pixels, defined in the control window, which are equal to or larger than the threshold (as shown in Fig. 3(b));
   c) the center position of the main windows (CEX and CEY), to be used for identifying whether they belong to the same quadrant or not, as defined in Fig. 3(c).
6) IF the number of main candidates in each window is less than 3, decrease threshold and GO back TO step 5 ELSE IF it is equal to or more than 5, increase threshold and GO back TO step 5 ELSE GO TO next step.
7) IF two centers are not in the same quadrant as defined in Fig. 3(d) GO TO step 8 ELSE GO TO step 9.
8) Scan control window and collect data items (COX and COY).
9) Find the address of this new edge element and its orientation relative to the previously found edge element.
10) IF this new element does exist GO back TO step 2 ELSE GO TO step 4.

The previous description can be divided into 3 main stages: initialization, searching the first element, and searching the other elements. Except for the initialization, recursive operations have to be performed which are data-dependent. The output from this algorithm will be the location and direction of all edge elements, and also the starting address of each edge segment.

The implementation of this algorithm is different from other edge detection algorithms such as the ones in [12], [18]. There, the edge detection is only performed up to the production of the Sobel amplitude. Therefore the structure is very regular. However, many pixels in a neighborhood can reside on a specific ridge. In the algorithm we have selected, the output is a clearly defined edge which is only a single pixel wide, and which is already accompanied by the link information. To obtain this result, more calculations and decision-making are needed. This means the realization in the next two sections is inherently more complex.
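The candidate counting and threshold adaptation of steps 5 and 6 can be illustrated with a minimal Python sketch. It is not taken from the original design: the function names, the default step size, the threshold limits and the iteration cap are assumptions made for illustration, and the sketch adopts the "either window" reading of the adaptation rule. Only the bounds 3 and 5 come from step 6.

```python
def count_candidates(window, threshold):
    """Count the pixels of a window whose Sobel value is >= threshold."""
    return sum(1 for value in window if value >= threshold)


def adapt_threshold(main_x, main_y, threshold, step=4,
                    t_min=0, t_max=255, low=3, high=5, max_iter=32):
    """Hypothetical sketch of the adaptation loop of steps 5 and 6.

    Assumptions (not from the paper): step, t_min, t_max and max_iter.
    The iteration cap guards against oscillation between the two rules.
    """
    for _ in range(max_iter):
        counts = (count_candidates(main_x, threshold),
                  count_candidates(main_y, threshold))
        if min(counts) < low and threshold - step >= t_min:
            threshold -= step      # too few candidates: relax the threshold
        elif max(counts) >= high and threshold + step <= t_max:
            threshold += step      # too many candidates: tighten the threshold
        else:
            break                  # counts acceptable: continue with step 7
    return threshold


if __name__ == "__main__":
    window = [10, 20, 30, 40, 50]
    print(adapt_threshold(window, window, threshold=45))
```

In hardware the same decision is a pair of comparisons followed by one add or subtract, which is why the adaptation can share an adder with the comparison logic.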


Other gray-level edge follower algorithms can be found in [14]-[17]. Due to the huge amount of complex computations involved, they are not suited for a real-time IC implementation using state-of-the-art VLSI technology.
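The 3-bit orientation code mentioned earlier in this section can be derived from the current and the new edge location, as in the following Python sketch. The numbering of the eight directions is an assumption (code 0 points right, codes increase counter-clockwise, and the image y axis grows downward); the assignment actually used in Fig. 3(a) is not recoverable from the text. The sketch reduces the displacement to its sign pattern, which is exact only when the new element is an 8-neighbour of the current one.

```python
# Hypothetical direction table: (dx, dy) with y growing downward.
DIRECTIONS = [(1, 0), (1, -1), (0, -1), (-1, -1),
              (-1, 0), (-1, 1), (0, 1), (1, 1)]


def sign(value):
    """Return -1, 0 or +1 according to the sign of value."""
    return (value > 0) - (value < 0)


def orientation_code(current, new):
    """Return an assumed 3-bit code (0..7) for the step current -> new."""
    dx = sign(new[0] - current[0])
    dy = sign(new[1] - current[1])
    if dx == 0 and dy == 0:
        raise ValueError("new edge element coincides with the current one")
    return DIRECTIONS.index((dx, dy))


if __name__ == "__main__":
    # A step towards the bottom-right, the orientation used as example in Fig. 3.
    print(orientation_code((10, 10), (11, 11)))
```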
2.2. System Considerations
This chip design is part of the entire vision system as shown in Fig. 1. Unless indicated otherwise, all image sizes will be assumed to be 512 × 512. To obtain real-time processing, each block has its time constraint [9]. This determining factor is quite important in an ASIC design because it affects the design methodology of the whole chip. Indeed, we have to exploit the parallelism implicit in the algorithm to meet the high speed requirement of 40 ms per frame.
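The consequence of this time constraint can be made concrete with a small calculation. The Python sketch below assumes the 10 MHz clock rate quoted in the abstract; the variable names are illustrative only.

```python
CLOCK_HZ = 10_000_000      # clock rate quoted in the abstract
FRAME_TIME_MS = 40         # time constraint per frame
WIDTH = HEIGHT = 512       # assumed image size

cycles_per_frame = CLOCK_HZ * FRAME_TIME_MS // 1000   # clock cycles available
pixels = WIDTH * HEIGHT                               # pixels to be visited
spare_cycles = cycles_per_frame - pixels              # left after 1 cycle/pixel

print(cycles_per_frame, pixels, spare_cycles)
print(round(cycles_per_frame / pixels, 2), "cycles per pixel on average")
```

With roughly one and a half cycles available per pixel, a purely sequential evaluation of several operations per pixel cannot meet the constraint, which motivates the parallel data-paths.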
Another consideration is how to handle the data management. For front-end image processing, off-chip memories have to be used in order to store a complete image frame. Either static or dynamic memory can be used for this purpose. In order to reduce the I/O access time, we will have to insert a fast internal buffer unit. Furthermore, the number of I/O pins has to be limited in order not to increase the cost of the ASIC package too much.

2.3. SFG for the Algorithm
A reduced SFG corresponding to the procedure in Section II-2.1 is presented in Fig. 4(a). From this SFG, it is clear that several loops are present which represent the scanning of the many windows defined in Fig. 3. The upper bounds depend on the incoming data and cannot be defined in advance, so they are of the WHILE type. Nevertheless, the average throughput obtained can be calculated roughly. For example, without adaptation, the required number of cycles for obtaining an edge pixel is about 25. If a control window is needed, 12 cycles have to be added. Details for the different cases are illustrated in Table I [10]. In this table, the cycle counts required for a simple and a complex object are compared. The basic amount for scanning a 512 × 512 image is 262 144 cycles. In addition, about 70 400 (299 200) more cycles are needed for processing a simple (complex) object, and the detection time will be 33.3 (56.1) ms if the clock rate is 10 MHz. On average, this lies within the system requirements as defined above. An example should clarify this. For a simple object (like the ARC image in [10]) as given in Fig. 4(b), the cycle count to find an edge element by using only the main windows equals 25 cycles (i.e., steps 5, 6, 7, 9, 10 in Section II-2.1). In addition, 12 more cycles (step 8 in Section II-2.1) are required for each edge element which also needs the information from scanning the control windows. This happens only when there are breaks on the contour line, i.e., when the change of orientation is more than 45 deg. This happens only infrequently. More details are given in Table I.

Fig. 4. A reduced SFG and an example. (a) A reduced SFG for the edge follower algorithm. (b) An example: ARC (image size: 256 × 256). Left: original image; right: output from edge-detector.

TABLE I
NUMBER OF CYCLES FOR LOCATING AN EDGE ELEMENT IN A SIMPLE AND A COMPLEX OBJECT, FOR AN IMAGE SIZE OF 512 × 512
[Table body not recoverable from the source; only the notes survive.]
Notes:
1. MW: Main Window
2. CW: Control Window
3. CT: Cycle Count
4. R: ratio between processing rate and input rate
5. CACHE size: 7 × 7
6. Total CT (without CACHE): simple: 332 544; complex: 561 344
7. Gain by CACHE: simple: 8.5%; complex: 15.1%
Total edge elements (approximately): [values not recoverable]

The two loops indicated with dashed lines in Fig. 4(a) represent the search for the first edge element and the search for the following edge elements, respectively. In these loops, a lot of arithmetic and logic operations have to be performed. With a general-purpose processor, the required throughput could not be obtained because of the large amount of computations. This is possible only by exploiting the concurrency. From the given SFG, the required operators can be extracted and the corresponding control flow can be obtained easily. Through a careful allocation, partitioning and merging of the operators and a balanced assignment and scheduling of the operations, an efficient hardware organization is possible. It should be stressed that mapping the SFG to an ASIC architecture takes many iterations to come up with an acceptable solution. Only by applying a suitable architectural style and a number of emerging CAD tools supporting this stage can the design cycle be kept reasonable.

III. DEDICATED DATA-PATH DESIGN

The proposed hardware organization of this chip design is shown in Fig. 5. It can be divided into three parts: dedicated data-paths, memories and a global controller. The 3 off-chip memories are needed to make the Sobel amplitudes available at the input (Mem I) and to store the produced edge information (location of edges in Mem II; start of segments in Mem III). In this section we first describe how the data-paths are constructed and how they cooperate efficiently.

The detailed algorithmic description [10] includes many arithmetic/logic operations which are repeated over and over again. The basic algorithm involves mainly addition, subtraction, comparison, and logic decision. Although multiplication is also needed, it occurs only during the search for the first edge element. Hence, it should be replaced by an iterative shift/add-based scheme without affecting the throughput in a major way. As a result, the hardware realization is simplified.

Taking all the necessary operations into account, the required hardware for processing the data can be clustered into 4 blocks (i.e., dedicated data-paths): an Address Computation Unit (ACU), a Comparison and Adaptation Unit (CAU), a Decision-Making Unit (DMU), and a Counting and Detecting Unit (CDU). These 4 parallel blocks exchange data over customized busses. The detailed architecture of each block is illustrated in Fig. 6, and their construction is discussed in the sequel. It will be shown that they can be obtained by using a limited library of parameterizable functional building blocks (FBB's) [8], [11]. In this way, the layout effort is reduced considerably. It should be noted too that the implementation of Figs. 5-6 is not unique. It is the result of an intensive (manual) architectural exploration to find a solution which satisfies all the constraints identified in Section I.

3.1. Address Computation Unit (ACU)
Many address computations are required in the whole algorithm:
1) During horizontal scanning: increment or decrement by one column.
2) During vertical scanning: increment or decrement by one row (equal to the number of columns).
3) Random jumps: jump to the required address by incrementing or decrementing several rows (columns).

The main purpose of the ACU is to generate the required address for accessing the 3 off-chip memories. Due to the 512 × 512 image size, the required wordlength is 18 bits. Only a few FBB's are needed, namely, register files (RF), a shifter (SH), an adder (AD) and a constant decoder (DD) (Fig. 6(a)). Two RF's are used: one for storing the address, and one for storing the constants to be added (or subtracted). In the first RF, registers are needed during the scanning to store the next scanning address (nextscan), the temporary address (tempaddr) and the current working address (workaddr). In addition, a constant "0" is included for initialization. A feedback loop is connected from the output of the adder to the input of this register file so that address information can be updated easily. Simultaneously, the new address can be sent to the off-chip memories. The other RF stores parameters such as the image size: columns and rows. The input of this RF is connected to the databus so that the users can load these parameters before starting the system. A shifter SH is inserted after this RF to speed up the location of the center of the vertical (or horizontal) windows (see Section II-2.1). If the boundary of the image frame is reached during scanning, the search should be stopped because no more information can be collected. For this purpose, the left and right boundaries are detected by a special-purpose decoder DD at the output of AD which checks the 9 LSB's of the address (lower half). The top and bottom boundaries are checked using the sign and overflow flags of the adder AD. An output buffer BF is used to offer more driving capability (see Fig. 6(a)).

3.2. Comparison and Adaptation Unit (CAU)
Comparisons needed in the algorithm can be classified into 4 types:
1) input pixel with current threshold;
2) input pixel with the previous pixel;
3) current threshold with maximal threshold;
4) current threshold with minimal threshold.

In addition, threshold adaptation is performed in the CAU unit. The required flags are LT (less than) and EQ (equal to).

Taking all these operations into account, the FBB's indicated in Fig. 6(b) are needed. Two RF's are required to store the parameters to be compared. One is used to store the threshold and the previous input pixel. The other stores the current input pixel and those parameters which are loaded before starting the edge detection, such as the maximal threshold, the minimal threshold and the adaptation step. A feedback loop is also included to adapt the threshold whenever required. Multiplexer MUX is used to select the required inputs for the threshold adaptation. The flags LT and EQ can be generated by AD and DD, respectively. Although these two flags could be generated by a comparator instead, an adder and a detector are preferred because the threshold adaptation can also be handled through this combination. Obviously, this decision


depends on the type of operation needed in the subalgorithm and on the speed requirement.

Fig. 5. Block diagram of the edge detection chip design.

Fig. 6. Detailed architecture of each dedicated data-path. (a) ACU. (b) CAU. (c) DMU. (d) CDU.

3.3. Decision-Making Unit (DMU)
To find a new edge candidate, several items have to be collected from the main or control windows as defined in Fig. 3. The operations defined in the DMU unit are as follows.
1) Determination of the candidates (MAX, MAY) in the main windows (performed always).
2) Location of the two centers (CEX, CEY) in the main windows. They are used to see if control windows have to be scanned or not.
3) Determination of the candidates (COX, COY) in the control windows (performed conditionally).

Collection of these items can be done sequentially or in parallel. In order to reduce the cycle count in this critical part of the edge location algorithm, some hardware has been duplicated in this unit. The required FBB's are shown in Fig. 6(c). Two RF's are used: one for storing MAX, CEY, COX, the other for storing MAY, CEX, COY. A constant file (CF) is needed to offer constants for decision-making, initialization, and data collection. The other FBB's required are allocated twice: an adder, a decoder, and a multiplexer. Flags to be generated are LTX, LTY (from the adders) and EQX, EQY (from the decoders). The MUX's are used to select the data to be compared for the decision-making. The feedback loops are included for updating the data, which are stored in the RF's. In addition to collecting the data, this unit can also be used for helping to find the center of a given window during the search for the first edge element.

3.4. Counting and Detection Unit (CDU)
There are some constraints on the size of a scanning window which have to be checked continuously:
1) For searching the first element, the maximal number of pixels to be scanned is 10.
2) For the other cases, the maximal number is fixed to 5.

For this purpose, the CDU unit has been included. The required FBB's are: RF, CF, an AD, and a DD (Fig. 6(d)). The RF has two registers for initialization and counting. CF offers the constants for resetting and counting. Whenever necessary, the decoder detects whether the maximum pixel count is reached or not.

3.5. Timing Considerations and Global Operations
Throughout this paper, a synchronous clock with a period of 100 ns is assumed. To avoid a critical path in the 4 data-paths which would exceed this clock period, the timing of the FBB's required (for the instantiated parameters such as the wordlength) has been taken into account during the construction. In some cases, faster circuit realizations have been selected, such as a carry-bypass adder instead of a simple ripple adder. If the speed requirement still cannot be met, a register would have to be inserted.

In principle, these data-paths operate in parallel and are carefully balanced. The amount of idle time is very small. For example, while the ACU generates the address, an input pixel is compared in CAU, data are collected in DMU and the CDU detects whether the image boundary is reached or not. Input pixels come from MEM I, the starting address is stored in MEM III and the edge information is stored in MEM II.

IV. GLOBAL CONTROLLER DESIGN

To orchestrate the complex interaction between all the other units, a global controller is essential for the supervision. As mentioned in Section II-2.1, the edge detection has been divided into 3 stages. In each stage, the global controller has to handle different situations. The partitioned controller architecture which is introduced here is well suited for the cooperating data-path model.

4.1. Controller Architecture
As shown in Fig. 7, the global controller is essentially a sequential finite state machine (FSM) controlled by a two-phase clock. The numerous flags coming from the data-paths have to be combined with control signals and


flags from the outside world in order to generate internal control signals for the next operation. All input signals have to arrive before the falling edge of φ1. The outputs are produced and sent to the data-paths at the rising edge of φ2. In order to obtain consistency between all input and output signals during all modes, this complex design has seen many iterations. As mentioned before, off-chip memories are used to store all image data. Hence, at least two cycles are needed before data is read in the CAU and DMU corresponding to an address generated in the ACU and checked in the CDU. That means the flags from the ACU and CDU are delayed compared to those from the CAU and DMU. This has complicated the design of Fig. 7 considerably. The control signals from the outside world are unconditional so that no special arrangements are necessary. Flags from the dedicated data-paths are divided into two classes: one with instant responses, the other with delayed responses. The latter category can be further divided into several sub-classes depending on the delay relative to the ACU, which is selected as the reference. The corresponding delay elements have been added at the output port of the controller on the paths to those units which require them (Fig. 7). Inserting these delay elements would result in additional operations during the initialization phase. To avoid this, an UNDO block is also included, which bypasses the registers. For the operations which do not have to be processed relative to the addresses from the ACU, the delay operations are not needed.

Fig. 7. Controller for the cooperating data-path architecture.
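The delay compensation described above can be illustrated with a small software model. The following Python sketch is only an illustration under stated assumptions, not the circuit of Fig. 7: the names `DelayedPort`, `LAG`, and `simulate` are hypothetical, the lag of two cycles for the CAU and DMU is borrowed from the "at least two cycles" quoted above, the CDU is left out because its lag is not stated, and the UNDO-style bypass is assumed to leave the delay registers untouched.

```python
from collections import deque


class DelayedPort:
    """One controller output port, optionally followed by delay registers."""

    def __init__(self, depth, idle="NOP"):
        # depth: assumed number of cycles by which the unit lags the ACU reference
        self.depth = depth
        self.regs = deque([idle] * depth, maxlen=depth)

    def clock(self, word, undo=False):
        """Advance one cycle and return the control word the unit receives."""
        if undo or self.depth == 0:
            # UNDO-style bypass (or no delay needed): the word passes straight through
            return word
        oldest = self.regs[0]    # word issued `depth` cycles ago
        self.regs.append(word)   # shift in the new word; maxlen drops the oldest
        return oldest


# Assumed lags relative to the ACU (illustrative values only).
LAG = {"ACU": 0, "CAU": 2, "DMU": 2}


def simulate(words, undo=False):
    """Return, for every cycle, the control word seen by each unit."""
    ports = {unit: DelayedPort(lag) for unit, lag in LAG.items()}
    return [{unit: port.clock(w, undo) for unit, port in ports.items()}
            for w in words]


if __name__ == "__main__":
    for cycle, seen in enumerate(simulate(["op0", "op1", "op2", "op3"])):
        print(cycle, seen)
```

In this model the CAU and DMU receive `op0` in the same cycle in which the ACU already receives `op2`, so each unit works on the data that belong to the same address. Calling `simulate(..., undo=True)` passes every word through undelayed, which mimics the bypass used during the initialization phase.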
In this design, many critical paths are present which pass from the φ1 latch at the input of the FSM over the control signals to the data-paths and from there to the flags which are generated and fed back to the FSM. This large pipeline section has to be completed in a clock period of 100 ns. In some cases, this speed requirement cannot be met. Then, hierarchical partitioning has been performed in order to split off non-critical parts from the hardware in the critical path. This results in a reduced amount of logic gate levels in cascade and a speed up. However, typically the area increases due to the necessary duplication of logic gates for constructing the non-critical signals (speed-area tradeoff). It should be noted that during this partitioning, the I/O timing cannot be destroyed as the location of the latches is not affected. Still, it has taken many iterations before a suitable controller design has been obtained. The final result is shown in Fig. 8.

Fig. 8. Hierarchical controller for the edge follower chip design.

4.2. Three Sub-Blocks for the Global Controller
In order to handle the complete edge detection algorithm, the global controller has been partitioned in three blocks (Fig. 8):

MASTER is the supervisor of the entire chip. It generates the required control signals to the outside world, the ACU, the CDU, and the state for the other controllers. Inputs are flags from the dedicated data-paths, internal status signals and status plus control signals from the outside world.

The LOCAL controller is used to handle on-chip operations for the CAU and DMU units. The required input signals are state information from the MASTER, flags from the controlled data-paths and internal status signals. The output signals are processed by the delay-select block which will take care of the correct pipelining: it compensates the fact that the LOCAL block adds a pipeline section to the loop on the right.

DIROP is a special unit for generating some important information requested by the MASTER controller. To obtain the direction of a newly found edge pixel relative to the previous one, a lot of computations have to be performed. For speed-up purposes, a special direction generator is added which takes care of the selection between the four possible scanning directions. It controls both the operation type (+/-) of the adder, and the selection of either rows or columns in RF2 within the ACU. The output of this unit is stored in the off-chip memory Mem II for later processing.

The implementation of these controllers is performed by a silicon layout compiler, PLASCO [13]. For each controller block, the description file with the Boolean expressions can be constructed, minimized, and simulated. These PLA-


based controllers can now easily meet the speed requirement of the given system.

V. I/O CONSIDERATIONS
All image data are stored in the off-chip memories because the image frames are too large. To reduce the time to fetch off-chip data, an I/O communication block CACHE (Fig. 5) has been designed. This block can temporarily store a certain amount of input pixels, which are assumed to be processed later on. For example, if adaptation is needed during the search for edge candidates, the pixels in the main windows have to be scanned and compared. Because of the buffering in this I/O block, data can be obtained immediately without being fetched again from off-chip memory. It is then assumed that a fixed upper bound can be determined for the buffer size. Typically pointer-addressed memories (PAM's) such as a FIFO are very well suited for such I/O operations. Sometimes even a few useful addressing operations can be added, e.g., a direct jump to the location of the central pixel of the main window as needed during scanning operations.
In our CACHE unit, 3 PAM's steered by a local controller are included to buffer the pixels. The advantages of using this CACHE are: (a) input data can be reused, for instance, during the rescanning of the main windows after adaptation to the threshold; (b) direct two-dimensional addressing is possible: the central pixel of the main window can be directly accessed whenever requested. The I/O bottleneck is thus largely solved, even when the "data consumption rate" is (much) higher than the input rate provided. This would happen for the edge detector architecture when more than 1 clock cycle is needed to fetch off-chip data from slow but cheap bulk memories, or when the internal clock rate is higher than the I/O rate as in a more advanced 1-μm CMOS technology. If we assume for instance a ratio of 2 between these 2 rates, then for the image data from Table I, the gain in throughput obtained by using a CACHE (7 × 7) is about 8.5 percent for a simple object and 15.1 percent for a complex object. The computation for these two figures can be obtained from Table I.
Determination of the optimal size is also an important factor for the real implementation because of the large area required for the CACHE. A non-optimal size will result in either too much hardware overhead, or little speed-up gain from the insertion of this block. More details about the size of CACHE and its implementation will be presented later on. The overall gain by using such a CACHE for reducing the I/O bottleneck for different applications cannot be easily assessed. It is clear though that the higher the degree of reuse for the input data, the more important a CACHE becomes. This is especially so in recursive types of algorithms for image and video processing where the input data rate required to keep the internal processing busy (above 10 MHz) is higher than what can be supplied at the external I/O interface (typically limited below 10 MHz). It should be noted though that when the I/O rate equals the internal processing rate, the throughput advantage is eliminated. Hence, in that case, it is better to replace the CACHE unit by register files. The latter are then still needed to provide a limited buffering in time.
For our synchronous ASIC, communication with the outside world has to be supervised by the global controller. The off-chip memories to be used can be either dynamic or static. The difference between these types has been taken care of by including a robust uni-directional handshaking protocol with a READY signal. The data bus is only available for READ/WRITE operations when the READY signal is activated.

VI. EVALUATION
The ASIC architecture in this paper can produce the edges for an image frame of 512 × 512 Sobel amplitudes [10] in an average time of between 34 and 56 ms depending on the complexity of the objects in the image (see Table I). If other image sizes are needed (from 128 × 128 up to 512 × 512), the scanning time for locating the "initial" edge elements will differ, but this is only a small part of the total time needed. The main contribution is proportional to the number of edge pixels found in the image, and this depends mainly on the complexity of the objects in the images and not directly on the size. For images whose size is larger than the default size, the basic scanning time for the complete frame takes n² cycles. This may violate the real time constraint. In this case, the source image has to be divided into several sub-images, where each sub-image can be processed by the proposed ASIC. The final contour information can be obtained by linking all the edges extracted from these sub-images through a post-processor. By applying more hardware for parallel edge detection, the proposed ASIC architecture can still meet the real time constraint. However for most real-time applications, the default image size (512 × 512) is large enough to offer the required information [9]. Therefore, we believe this architecture can remain roughly the same also for other sizes than the default 512 × 512.
In order to achieve this critical timing specification, the features of the cooperating data-path model [8] have been exploited fully. Four customized data-paths have been constructed from a limited set of parameterizable FBB's which are available in our library. The operation load between these units is carefully balanced. The internal critical timing paths in the pipeline sections have been optimized to meet the required clock period of 100 ns (10 MHz). The exchange of data happens over a few dedicated busses. The communication with the 3 off-chip frame memories which store Sobel inputs and edge information results has been optimized by a careful scheduling of these I/O operations and by including a special input buffer unit with 3 FIFO's which allows data to be fetched in one cycle. The supervision of this complex design is performed by a sophisticated controller which has been decomposed into three sub-units in order to reduce the


global critical paths spanning both data-paths and control unit. Again, much effort has been spent to keep the timing within the 100-ns clock period.
During the architectural exploration for this application, many alternatives have been investigated. We believe the design presented in this paper meets the system requirements with a near-minimal area cost. The amount of parallelism has been fully matched to the throughput requirements and hardware is shared whenever possible. With the cells of our 3-μm double metal CMOS technology, the area has been estimated to be 42 mm², which indicates that a single chip realization is feasible. Moreover, at the same time, the power consumption has been minimized by reducing the number of FBB's to a minimum. In addition, too heavy internal pipelining has been avoided, which reduces useless switching activity in the registers. Finally, the pin-count is also restricted by sharing pins as much as possible. For instance, all data and address busses to the off-chip memories are common. Also the initial loading of parameters takes place over the existing paths. In this way, the total pin-count has been limited to 43, where 18 pins are for the address, 10 pins are for the data, and the rest are for the control and status signals. More details about the implementation and the characteristics of this chip will be described in a future paper.
We can summarize that compared to other edge detection chips [18], [12], [19], our edge follower architecture has the following distinctive features:

handles gray-level images;
performs adaptive thresholding and produces a single pixel wide edge;
offers the edge information (location and orientation) in real-time;
can be implemented on a single ASIC with a reasonable cost in area, power consumption and pin-count.

VII. CAD-TOOLS INVOLVED
Several tools available in the CATHEDRAL environment [4] have been used to support the main stages in this chip design. More details will be described in a future paper. Silage [22] has been selected as the input language for the architectural synthesis systems. Therefore, the edge detection algorithm has been described in this applicative language. At this initial level, the correctness of the behavior has to be verified. This has been performed by means of the "S2C" simulator [23]. Next, the architecture proposed here has been described at the Register-Transfer (RT) level and verified by the "Logmos-II" simulator [25]. The layout of the dedicated data-paths and controller has been generated with the help of the data-path synthesis tool "CHOPIN" [20], the module generator "MGE" and the PLA synthesis toolbox "PLASCO" [24]. The final floorplan has been assembled by a commercial floorplanner. Also the design verification at the layout level has been performed with commercial tools. It should be noted though, that a major part of the design effort is situated at the architectural synthesis stage where an efficient architecture has to be conceived. Due to extensive decision-making requirements (conditional expressions), this design could not yet be generated in a fully automated way from Silage. Actually, this design has served as a major test-vehicle to provide more insight in the nature of the tasks needed to support this step, especially in terms of the (hierarchical) controller. From this mainly manual architectural design, mapping rules are being formalized which will be coded in the CATHEDRAL-3 synthesis system. This will allow the complete design path from algorithm to chip level to be supported in the near future.

VIII. CONCLUSION
A novel ASIC VLSI architecture for robust edge detection is presented in this paper. With this design, we have demonstrated the power of the cooperating data-path model for medium-level (relatively high-speed) image processing applications. The result out-performs existing realizations. The complete chip design can also be easily adapted to changing specifications, such as the image size or the maximum and minimum thresholds for the edges which affect the robustness.
The architectural part of the design has been performed largely manually with the use of an RT-level simulator. This has demonstrated the need for high-level synthesis tools supporting the tedious and error-prone architectural exploration. Especially the distribution and the scheduling of the many operations over the multiplexed data-paths, but also the hierarchical decomposition of the global controller have cost a lot of design time. Algorithmic CAD tools are needed to take care of these optimizations. In addition, the allocation of different data-path and I/O units and the partitioning of the algorithm over the parallel hardware involves many design iterations. Rules have to be extracted and formalized in order to come to appropriate knowledge-based CAD tools suited to support these tasks. At present, the CATHEDRAL-III environment [20] is under development at IMEC which is targeted towards this class of architectures.
Currently a prototype chip is being assembled. Once the architecture was defined, the actual layout of the data-paths and the controller has been generated by powerful module and control generation tools [3]. Also the floorplan has been constructed automatically. Hence, existing tools in this area are more useful already. This work will be reported in another paper. In the near future, also more applications will be explored which fit into the cooperating data-path model.

ACKNOWLEDGMENT
The authors wish to express their gratitude to L. Van Gool, M. Proesmans, and P. Vandenbergen at the K. U. Leuven for making available the edge follower algorithm. Also our colleagues at IMEC are acknowledged for the fruitful discussions, the suggestions, and the CAD support for this complex demonstrator example.


REFERENCES
[1] J. Offen and R. Raymond, VLSI Image Processing. New York: McGraw-Hill, 1985.
[2] M. Yamashina, T. Enomoto, T. Kunio, I. Tamitani, H. Harasaki, T. Nishitani, M. Satoh, and K. Kikuchi, "A microprogrammable real-time video signal processor (VSP) LSI," IEEE J. Solid-State Circuits, vol. SC-22, Dec. 1987.
[3] H. De Man, J. Rabaey, P. Six, and L. Claesen, "Cathedral-II: A silicon compiler for digital signal processing," IEEE Des. Test of Comp., vol. 3, pp. 13-25, Dec. 1986.
[4] H. De Man et al., "Synthesis of DSP Systems at Leuven," in Proc. IEEE Int. Conf. on Computer Design: VLSI in Computers and Processors, ICCD87, (Rye Brook, NY), pp. 133-145, Oct. 5-8, 1987.
[5] H. T. Kung, "Why systolic architectures?," IEEE Computer, vol. 15, Jan. 1982.
[6] A. N. Venetsanopoulos, K. M. Ty, and A. C. P. Loui, "High-speed architectures for digital image processing," IEEE Trans. Circuits Syst., vol. CAS-34, Aug. 1987.
[7] F. Catthoor and H. De Man, "Customized architectural methodologies for high-speed image and video processing," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, New York, pp. 1985-1988, Apr. 1988.
[8] F. Catthoor, J. Rabaey, G. Goossens, J. L. Van Meerbergen, R. Jain, H. De Man, and J. Vandevalle, "Architectural strategies for an application-specific synchronous multi-processor environment," IEEE Trans. Acoust., Speech, Signal Processing, vol. 36, Feb. 1988.
[9] L. Van Gool, "The Leuven edge detector board," Int. Rep., ESAT, K. U. Leuven, May 1988.
[10] M. Proesmans and P. Vandenbergen, "A new edge follower," Eng. thesis, K. U. Leuven, July 1987.
[11] F. Catthoor, "Architectural design strategies for complex DSP-systems in an automated synthesis environment," Ph.D. dissertation, ESAT, K. U. Leuven, Belgium, May 1987.
[12] N. Kanopoulos, N. Vasanthavada, and R. Baker, "Design of an image edge detection filter using the Sobel operator," IEEE J. Solid-State Circuits, vol. 23, Apr. 1988.
[13] M. Barthololeus, L. Reynders, M. Pauwels, and H. De Man, "PLASCO: A procedural silicon compiler for PLA-based system," in Proc. IEEE Custom Integrated Circuits Conf., (Portland, OR), pp. 226-229, May 1985.
[14] [...] TX, pp. 285-288, Apr. 1987.
[15] Yi-Tong Zhou and Rama Chellappa, "Linear feature extraction based on an AR model edge detector," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Dallas, TX, pp. 555-558, Apr. 1987.
[16] J. E. Bevington and R. M. Mersereau, "Differential operator based edge and line detection," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Dallas, TX, pp. 249-252, Apr. 1987.
[17] O. J. Morris and M. de J. Lee, "A unified method for segmentation and edge detection using graph theory," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Tokyo, Japan, pp. 2051-2054, Apr. 1986.
[18] I. Agi, P. J. Hurst, and A. K. Jain, "An expandable VLSI processor array approach to contour tracing," in Proc. IEEE Circuits and Systems, pp. 1969-1972, 1988.
[19] P. A. Ruetz and R. W. Brodersen, "Architectures and design techniques for real-time image-processing IC's," IEEE J. Solid-State Circuits, vol. SC-22, Apr. 1987.
[20] S. Note, J. Van Meerbergen, F. Catthoor, and H. De Man, "Automated synthesis of a high-speed CORDIC algorithm with the Cathedral-III compilation system," in Proc. IEEE Int. Symp. on Circuits and Systems, (Helsinki, Finland), June 1988.
[21] M. Sugai, A. Kanuma, K. Suzuki, and M. Kubo, "VLSI processor for image processing," Proc. IEEE, vol. 75, pp. 1160-1165, Sept. 1987.
[22] P. N. Hilfinger, "A high-level language and silicon compiler for digital signal processing," in Proc. IEEE Custom Integrated Circuits Conf., (Portland, OR), pp. 213-216, May 1985.
[23] C. Scheers, R. Severyns, F. Catthoor, and H. De Man, "Compiling and Simulating an Applicative DSP Language," Int. Rep., IMEC, Heverlee, Belgium, Aug. 1988.
[24] P. Six, I. Vandeweert, K. Crees, and L. Rijnders, "Interactive generator based on symbolic layout," in IEEE Int. Conf. on Computer Design: VLSI in Computers and Processors, (Rye Brook, NY), pp. 133-145, Oct. 5-8, 1987.
[25] R. Severyns, P. De Worm, and E. Willems, "RT-LOGMOS Version 2: Register transfer compiler and simulator," Int. Rep., IMEC, Heverlee, Belgium, June 1988.

Chen-Yi Lee received the B.S. degree from the National Chiao Tung University, Hsinchu, Taiwan, in 1982, and the M.S. degree from the Katholieke Universiteit Leuven (KUL), Belgium, in 1986, both in electrical engineering. Currently he is working toward the Ph.D. degree at KUL.
Since September 1986, he has been a researcher in the Applications and Architectural Strategies group at Inter-university Micro-Electronic Center (IMEC), Belgium. His interests include the architecture design for application-specific IC's for DSP algorithms, and VLSI implementation for high-speed digital image and video signal processing.

Francky V. M. Catthoor (S'68-M'87) received the engineering degree and the Ph.D. in electrical engineering from the Katholieke Universiteit Leuven (KUL), Belgium, in 1982 and 1987, respectively.
From 1983 to 1987 he was with a group involved in VLSI design methodologies for Digital Signal Processing, first, at ESAT, KUL; and from January 1985 at the Inter-university Micro-Electronics Center (IMEC), Heverlee, Belgium. In the summer of 1987 he spent a 2-month Post-Doctoral NFWO research fellowship at the University of California, Berkeley. Currently, he is heading the Applications and Architectural Strategies group in the VSDM division at IMEC. His research activities are mainly in the field of architecture design for application-specific IC's intended for DSP-algorithms, including design for testability. He is also involved in the development of computer-aided design tools for the high-level (behavioral) synthesis and optimization of general DSP applications. In these fields, he has authored or coauthored about 30 papers, of which 2 received a Best Paper Award. He received the Young Scientist Award from the Marconi International Fellowship in 1986.

Hugo J. De Man (M'81-SM'81-F'86) received the electrical engineering degree and the Ph.D. degree in Applied Sciences from the Katholieke Universiteit Leuven, Heverlee, Belgium, in 1964 and 1968, respectively.
In 1968 he became a member of the staff of the Laboratory for Physics and Electronics of Semiconductors at the University of Leuven, working on device physics and integrated circuit technology. From 1969 to 1971 he was at the Electronic Research Laboratory, University of California, Berkeley, as an ESRO-NASA Post-Doctoral Research Fellow, working on Computer-Aided Device and Circuit Design. In 1971 he returned to the University of Leuven as a Research Associate of the NFWO (National Science Foundation).
In 1974 he became a Professor at the University of Leuven. During the winter quarter of 1974-1975 he was a Visiting Associate Professor at the University of California, Berkeley. He was an Associate Editor for the IEEE JOURNAL OF SOLID-STATE CIRCUITS from 1975 to 1980 and was European Associate Editor for the IEEE TRANSACTIONS ON CAD from 1982 to 1985. He received a Best Paper Award at the ISSCC of 1973 on Bipolar Device Simulation and at the 1981 ESSCIRC conference for work on an integrated CAD system. In 1986 he became fellow of the IEEE. His actual field of research is the design of integrated circuits and computer-aided design. Since 1984 he is Vice-president of the VLSI systems design group of IMEC (Leuven, Belgium).
