An Efficient ASIC Architecture For Real-Time Edge Detection
Authorized licensed use limited to: VISVESVARAYA NATIONAL INSTITUTE OF TECHNOLOGY. Downloaded on February 15,2024 at 16:24:26 UTC from IEEE Xplore. Restrictions apply.
LEE et al.: REAL-TIME EDGE DETECTION 1351
1352 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS, VOL. 36, NO. 10, OCTOBER 1989
TABLE I
NUMBER OF CYCLES FOR LOCATING AN EDGE ELEMENT IN A SIMPLE OBJECT AND A COMPLEX OBJECT FOR AN IMAGE SIZE OF 512 × 512
Note: 1. MW: Main Window
... Address Computation Unit (ACU), a Comparison and Adaptation Unit (CAU), a Decision-Making Unit (DMU), and a Counting and Detecting Unit (CDU). These 4 parallel blocks exchange data over customized busses. The detailed architecture of each block is illustrated in Fig. 6, and their construction is discussed in the sequel. It will be shown that they can be obtained by using a limited library of parameterizable functional building blocks (FBB's) [8], [11]. In this way, the layout effort is reduced considerably. It should also be noted that the implementation of Figs. 5-6 is not unique. It is the result of an intensive (manual) architectural exploration to find a solution which satisfies all the constraints identified in Section I.

3.1. Address Computation Unit (ACU)

Many address computations are required in the whole algorithm:
• During horizontal scanning: increment or decrement by one column.
• During vertical scanning: increment or decrement by one row (equal to the number of columns).
• Random jumps: jump to the required address by incrementing or decrementing several rows (columns).

The main purpose of the ACU is to generate the required address for accessing the 3 off-chip memories. Due to the 512 × 512 image size, the required wordlength is 18 bits. Only a few FBB's are needed, namely, register files (RF), a shifter (SH), an adder (AD) and a constant decoder (DD) (Fig. 6(a)). Two RF's are used: one for storing the address, and one for storing the constants to be added (or subtracted). In the first RF, registers are needed during the scanning to store the next scanning address (nextscan), the temporary address (tempaddr) and the current working address (workaddr). In addition, a constant "0" is included for initialization. A feedback loop is connected from the output of the adder to the input of this register file so that address information can be updated easily. Simultaneously, the new address can be sent to the off-chip memories. The other RF stores parameters such as the image size: columns and rows. The input of this RF is connected to the databus so that the users can load these parameters before starting the system. A shifter SH is inserted after this RF to speed up the location of the center of the vertical (or horizontal) windows (see Section II-2.1). If the boundary of the image frame is reached during scanning, the search should be stopped because no more information can be collected. For this purpose, the left and right boundaries are detected by a special-purpose decoder DD at the output of AD which checks the 9 LSB's of the address (lower half). The top and bottom boundaries are checked using the sign and overflow flags of the adder AD. An output buffer BF is used to offer more driving capability (see Fig. 6(a)).

3.2. Comparison and Adaptation Unit (CAU)

Comparisons needed in the algorithm can be classified into 4 types:
1) input pixel with current threshold;
2) input pixel with the previous pixel;
3) current threshold with maximal threshold;
4) current threshold with minimal threshold.

In addition, threshold adaptation is performed in the CAU. The required flags are LT (less than) and EQ (equal to).

Taking all these operations into account, the FBB's indicated in Fig. 6(b) are needed. Two RF's are required to store the parameters to be compared. One is used to store the threshold and the previous input pixel. The other stores the current input pixel and those parameters which are loaded before starting the edge detection, such as the maximal threshold, the minimal threshold and the adaptation step. A feedback loop is also included to adapt the threshold whenever required. Multiplexer MUX is used to select the required inputs for the threshold adaptation. The flags LT and EQ can be generated by AD and DD, respectively. Although these two flags could be generated by a comparator instead, an adder and a detector are preferred because the threshold adaptation can also be handled through this combination. Obviously, this decision
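The address arithmetic of Section 3.1 can be illustrated with a short behavioral model. This is only a sketch under stated assumptions, not the authors' register-transfer description: it assumes row-major addressing (address = row × 512 + column), which is consistent with the 18-bit wordlength and with the 9-LSB boundary check described in the text. The names `add_address`, `on_left_or_right_boundary` and `step` are hypothetical.

```python
COLS = 512                        # image width (a parameter held in the second RF)
ROWS = 512                        # image height
ADDR_BITS = 18                    # 512 * 512 = 2**18 addresses
ADDR_MASK = (1 << ADDR_BITS) - 1
COL_MASK = COLS - 1               # assumption: the 9 LSBs hold the column index


def add_address(addr, offset):
    """Model of the adder AD: return (new_address, sign, overflow)."""
    total = addr + offset
    sign = total < 0               # stepped past the top row
    overflow = total > ADDR_MASK   # stepped past the bottom row
    return total & ADDR_MASK, sign, overflow


def on_left_or_right_boundary(addr):
    """Model of the decoder DD: inspect only the 9 LSBs of the address."""
    col = addr & COL_MASK
    return col == 0 or col == COL_MASK


def step(addr, d_rows=0, d_cols=0):
    """Horizontal step, vertical step or random jump as a single addition."""
    return add_address(addr, d_rows * COLS + d_cols)
```

In this model a vertical step is `step(workaddr, d_rows=1)`, a horizontal step is `step(workaddr, d_cols=-1)`, and a random jump combines both offsets. A scan would stop when `on_left_or_right_boundary` is true or when the returned sign or overflow flag is set.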
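The reason Section 3.2 prefers an adder plus a detector over a comparator can be sketched in a few lines: one subtraction yields both flags, and the same adder can move the threshold. The sketch below is illustrative only. In particular, the adaptation rule (one step up or down, then clamped between the minimal and maximal threshold) is an assumption suggested by comparison types 3 and 4, and the names `compare` and `adapt_threshold` are hypothetical.

```python
def compare(a, b):
    """Derive LT and EQ from one subtraction, as an adder (AD) followed by a
    zero detector (DD) would: LT is the sign of a - b, EQ means a - b == 0."""
    diff = a - b
    return diff < 0, diff == 0


def adapt_threshold(threshold, step, t_min, t_max, raise_it):
    """Hypothetical adaptation rule: move the threshold by one step with the
    same adder, then keep the result inside [t_min, t_max]."""
    candidate = threshold + step if raise_it else threshold - step
    lt_max, eq_max = compare(candidate, t_max)   # comparison type 3
    if not (lt_max or eq_max):
        candidate = t_max
    lt_min, _ = compare(candidate, t_min)        # comparison type 4
    if lt_min:
        candidate = t_min
    return candidate
```

Comparison types 1 and 2 are then `compare(pixel, threshold)` and `compare(pixel, previous_pixel)`.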
based controllers can now easily meet the speed requirement of the given system.

V. I/O CONSIDERATIONS

All image data are stored in the off-chip memories because the image frames are too large. To reduce the time to fetch off-chip data, an I/O communication block CACHE (Fig. 5) has been designed. This block can temporarily store a certain number of input pixels which are expected to be processed later on. For example, if adaptation is needed during the search for edge candidates, the pixels in the main windows have to be scanned and compared. Because of the buffering in this I/O block, data can be obtained immediately without being fetched again from off-chip memory. It is then assumed that a fixed upper bound can be determined for the buffer size. Typically, pointer-addressed memories (PAM's) such as a FIFO are very well suited for such I/O operations. Sometimes even a few useful addressing operations can be added, e.g., a direct jump to the location of the central pixel of the main window as needed during scanning operations.

In our CACHE unit, 3 PAM's steered by a local controller are included to buffer the pixels. The advantages of using this CACHE are: (a) input data can be reused, for instance, during the rescanning of the main windows after adaptation of the threshold; (b) direct two-dimensional addressing is possible: the central pixel of the main window can be directly accessed whenever requested. The I/O bottleneck is thus largely solved, even when the "data consumption rate" is (much) higher than the input rate provided. This would happen for the edge detector architecture when more than 1 clock cycle is needed to fetch off-chip data from slow but cheap bulk memories, or when the internal clock rate is higher than the I/O rate, as in a more advanced 1-µm CMOS technology. If we assume for instance a ratio of 2 between these 2 rates, then for the image data from Table I, the gain in throughput obtained by using a CACHE (7 × 7) is about 8.5 percent for a simple object and 15.1 percent for a complex object. These two figures can be computed from Table I.

Determination of the optimal size is also an important factor for the real implementation because the CACHE requires a large area. A non-optimal size will result in either too much hardware overhead, or too little speed-up from the insertion of this block. More details about the size of the CACHE and its implementation will be presented later on. The overall gain from using such a CACHE to reduce the I/O bottleneck cannot be easily assessed for different applications. It is clear though that the higher the degree of reuse for the input data, the more important a CACHE becomes. This is especially so in recursive types of algorithms for image and video processing where the input data rate required to keep the internal processing busy (above 10 MHz) is higher than what can be supplied at the external I/O interface (typically limited to below 10 MHz). It should be noted though that when the I/O rate equals the internal processing rate, the throughput advantage is eliminated. Hence, in that case, it is better to replace the CACHE unit by register files. The latter are then still needed to provide a limited buffering in time.

For our synchronous ASIC, communication with the outside world has to be supervised by the global controller. The off-chip memories to be used can be either dynamic or static. The difference between these types has been taken care of by including a robust uni-directional handshaking protocol with a READY signal. The data bus is only available for READ/WRITE operations when the READY signal is activated.

VI. EVALUATION

The ASIC architecture in this paper can produce the edges for an image frame of 512 × 512 Sobel amplitudes [10] in an average time of between 34 and 56 ms depending on the complexity of the objects in the image (see Table I). If other image sizes are needed (from 128 × 128 up to 512 × 512), the scanning time for locating the "initial" edge elements will differ, but this is only a small part of the total time needed. The main contribution is proportional to the number of edge pixels found in the image, and this depends mainly on the complexity of the objects in the images and not directly on the size. For an image whose size is larger than the default size, the basic scanning time for the complete frame takes n² cycles. This may violate the real-time constraint. In this case, the source image has to be divided into several sub-images, where each sub-image can be processed by the proposed ASIC. The final contour information can be obtained by linking all the edges extracted from these sub-images through a post-processor. By applying more hardware for parallel edge detection, the proposed ASIC architecture can still meet the real-time constraint. However, for most real-time applications, the default image size (512 × 512) is large enough to offer the required information [9]. Therefore, we believe this architecture can remain roughly the same also for sizes other than the default 512 × 512.

In order to achieve this critical timing specification, the features of the cooperating data-path model [8] have been exploited fully. Four customized data-paths have been constructed from a limited set of parameterizable FBB's which are available in our library. The operation load between these units is carefully balanced. The internal critical timing paths in the pipeline sections have been optimized to meet the required clock period of 100 ns (10 MHz). The exchange of data happens over a few dedicated busses. The communication with the 3 off-chip frame memories, which store Sobel inputs and edge information results, has been optimized by a careful scheduling of these I/O operations and by including a special input buffer unit with 3 FIFO's which allows data to be fetched in one cycle. The supervision of this complex design is performed by a sophisticated controller which has been decomposed into three sub-units in order to reduce the
global critical paths spanning both data-paths and control unit. Again, much effort has been spent to keep the timing within the 100-ns clock period.

During the architectural exploration for this application, many alternatives have been investigated. We believe the design presented in this paper meets the system requirements with a near-minimal area cost. The amount of parallelism has been fully matched to the throughput requirements and hardware is shared whenever possible. With the cells of our 3-µm double metal CMOS technology, the area has been estimated to be 42 mm², which indicates that a single chip realization is feasible. Moreover, at the same time, the power consumption has been minimized by reducing the number of FBB's to a minimum. In addition, too heavy internal pipelining has been avoided, which reduces useless switching activity in the registers. Finally, the pin-count is also restricted by sharing pins as much as possible. For instance, all data and address busses to the off-chip memories are common. Also the initial loading of parameters takes place over the existing paths. In this way, the total pin-count has been limited to 43, where 18 pins are for the address, 10 pins are for the data, and the rest are for the control and status signals. More details about the implementation and the characteristics of this chip will be described in a future paper.

We can summarize that compared to other edge detection chips [18], [12], [19], our edge follower architecture has the following distinctive features:
• handles gray-level images;
• performs adaptive thresholding and produces a single pixel wide edge;
• offers the edge information (location and orientation) in real-time;
• can be implemented on a single ASIC with a reasonable cost in area, power consumption and pin-count.

VII. CAD-TOOLS INVOLVED

Several tools available in the CATHEDRAL environment [4] have been used to support the main stages in this chip design. More details will be described in a future paper. Silage [22] has been selected as the input language for the architectural synthesis systems. Therefore, the edge detection algorithm has been described in this applicative language. At this initial level, the correctness of the behavior has to be verified. This has been performed by means of the "S2C" simulator [23]. Next, the architecture proposed here has been described at the Register-Transfer (RT) level and verified by the "Logmos-11" simulator [25]. The layout of the dedicated data-paths and controller has been generated with the help of the data-path synthesis tool "CHOPIN" [20], the module generator "MGE" and the PLA synthesis toolbox "PLASCO" [24]. The final floorplan has been assembled by a commercial floorplanner. Also the design verification at the layout level has been performed with commercial tools. It should be noted though, that a major part of the design effort is situated at the architectural synthesis stage where an efficient architecture has to be conceived. Due to extensive decision-making requirements (conditional expressions), this design could not yet be generated in a fully automated way from Silage. Actually, this design has served as a major test-vehicle to provide more insight into the nature of the tasks needed to support this step, especially in terms of the (hierarchical) controller. From this mainly manual architectural design, mapping rules are being formalized which will be coded in the CATHEDRAL-3 synthesis system. This will make it possible to support the complete design path from algorithm to chip-level in the near future.

VIII. CONCLUSION

A novel ASIC VLSI architecture for robust edge detection is presented in this paper. With this design, we have demonstrated the power of the cooperating data-path model for medium-level (relatively high-speed) image processing applications. The result out-performs existing realizations. The complete chip design can also be easily adapted to changing specifications, such as the image size or the maximum and minimum thresholds for the edges which affect the robustness.

The architectural part of the design has been performed largely manually with the use of an RT-level simulator. This has demonstrated the need for high-level synthesis tools supporting the tedious and error-prone architectural exploration. Especially the distribution and the scheduling of the many operations over the multiplexed data-paths, but also the hierarchical decomposition of the global controller, have cost a lot of design time. Algorithmic CAD tools are needed to take care of these optimizations. In addition, the allocation of different data-path and I/O units and the partitioning of the algorithm over the parallel hardware involve many design iterations. Rules have to be extracted and formalized in order to arrive at appropriate knowledge-based CAD tools suited to support these tasks. At present, the CATHEDRAL-III environment [20], which is targeted towards this class of architectures, is under development at IMEC.

Currently a prototype chip is being assembled. Once the architecture was defined, the actual layout of the data-paths and the controller was generated by powerful module and control generation tools [3]. Also the floorplan has been constructed automatically. Hence, existing tools in this area are more useful already. This work will be reported in another paper. In the near future, also more applications will be explored which fit into the cooperating data-path model.

ACKNOWLEDGMENT

The authors wish to express their gratitude to L. Van Gool, M. Proesmans, and P. Vandenbergen at the K. U. Leuven for making available the edge follower algorithm. Also our colleagues at IMEC are acknowledged for the fruitful discussions, the suggestions, and the CAD support for this complex demonstrator example.