A Distributed Canny Edge Detector Algorithm and FPGA Implementation
IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 23, NO. 7, JULY 2014
Abstract—The Canny edge detector is one of the most widely used edge detection algorithms due to its superior performance. Unfortunately, not only is it computationally more intensive as compared with other edge detection algorithms, but it also has a higher latency because it is based on frame-level statistics. In this paper, we propose a mechanism to implement the Canny algorithm at the block level without any loss in edge detection performance compared with the original frame-level Canny algorithm. Directly applying the original Canny algorithm at the block level leads to excessive edges in smooth regions and to loss of significant edges in high-detailed regions since the original Canny computes the high and low thresholds based on the frame-level statistics. To solve this problem, we present a distributed Canny edge detection algorithm that adaptively computes the edge detection thresholds based on the block type and the local distribution of the gradients in the image block. In addition, the new algorithm uses a nonuniform gradient magnitude histogram to compute block-based hysteresis thresholds. The resulting block-based algorithm has a significantly reduced latency and can be easily integrated with other block-based image codecs. It is capable of supporting fast edge detection of images and videos with high resolutions, including full-HD, since the latency is now a function of the block size instead of the frame size. In addition, quantitative conformance evaluations and subjective tests show that the edge detection performance of the proposed algorithm is better than the original frame-based algorithm, especially when noise is present in the images. Finally, this algorithm is implemented using a 32-computing-engine architecture and is synthesized on the Xilinx Virtex-5 FPGA. The synthesized architecture takes only 0.721 ms (including the SRAM read/write time and the computation time) to detect edges of 512 × 512 images in the USC SIPI database when clocked at 100 MHz and is faster than existing FPGA and GPU implementations.

Index Terms—Distributed image processing, Canny edge detector, high throughput, parallel processing, FPGA.

I. INTRODUCTION

and has best performance [10]. Its superior performance is due to the fact that the Canny algorithm performs hysteresis thresholding, which requires computing high and low thresholds based on the entire image statistics. Unfortunately, this feature makes the Canny edge detection algorithm not only more computationally complex as compared to other edge detection algorithms, such as the Roberts and Sobel algorithms, but also necessitates additional pre-processing computations to be done on the entire image. As a result, a direct implementation of the Canny algorithm has high latency and cannot be employed in real-time applications.

Many implementations of the Canny algorithm have been proposed on a wide list of hardware platforms. There is a set of work [1]–[3] on Deriche filters that have been derived using Canny's criteria and implemented on ASIC-based platforms. The Canny-Deriche filter [1] is a network with four transputers that detects edges in a 256 × 256 image in 6 s, far from the requirement for real-time applications. Although the design in [2] improved the Canny-Deriche filter implementation of [1] and was able to process 25 frames/s at 33 MHz, the used off-chip SRAM memories consist of Last-In First-Out (LIFO) stacks, which increased the area overhead compared to [1]. Demigny proposed a new organization of the Canny-Deriche filter in [3], which reduces the memory size and the computation cost by a factor of two. However, the number of clock cycles per pixel of the implementation [3] varies with the size of the processed image, resulting in variable clock-cycles/pixel from one image size to another, with increasing processing time as the image size increases.

There is another set of work [4]–[6] on mapping the Canny edge detection algorithm onto FPGA-based platforms. The two FPGA implementations in [4] and [5] translate the software

1057-7149 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
XU et al.: DISTRIBUTED CANNY EDGE DETECTOR 2945
Recently, the General Purpose Graphic Processing Unit (GPGPU) has emerged as a powerful and accessible parallel computing platform for image processing applications [8], [9]. Studies of GPGPU-accelerated Canny edge detection have been presented [10]–[12]. All of these implementations are frame-based and do not have good edge detection performance since they use the same fixed pair of high and low threshold values for all images. Furthermore, as shown later in the paper, their timing performance is inferior compared to the proposed algorithm in spite of being operated at a very high clock frequency.

In the original Canny method, the computation of the high and low threshold values depends on the statistics of the whole input image. However, most of the above existing implementations (e.g., [4]–[6], [10]–[12]) use the same fixed pair of high and low threshold values for all input images. This results in a decreased edge detection performance as discussed later in this paper. The non-parallel implementation [7] computes the low and high threshold values for each input image. This results in increased latency as compared to the existing implementations (e.g., [4]–[6], [10]–[12]). Furthermore, the non-parallel implementations ([4]–[7]) result in a decreased throughput as compared to the parallel implementations ([6], [10]–[12]). The issue of increased latency and decreased throughput is becoming more significant with the increasing demand for large-size, high-spatial-resolution visual content (e.g., High-Definition and Ultra High-Definition).

Our focus is on reducing the latency and increasing the throughput of the Canny edge detection algorithm so that it can be used in real-time processing applications. As a first step, the image can be partitioned into blocks and the Canny algorithm can be applied to each of the blocks in parallel. Unfortunately, directly applying the original Canny at a block level would fail since it leads to excessive edges in smooth regions and loss of significant edges in high-detailed regions. In this paper, we propose an adaptive threshold selection algorithm which computes the high and low threshold for each block based on the type of block and the local distribution of pixel gradients in the block. Each block can be processed simultaneously, thus reducing the latency significantly. Furthermore, this allows the block-based Canny edge detector to be pipelined very easily with existing block-based codecs, thereby improving the timing performance of image/video processing systems. Most importantly, conducted conformance evaluations and subjective tests show that, compared with the frame-based Canny edge detector, the proposed algorithm yields better edge detection results for both clean and noisy images.

The block-based Canny edge detection algorithm is mapped onto an FPGA-based hardware architecture. The architecture is flexible enough to handle different image sizes, block sizes, and gradient mask sizes. It consists of 32 computing engines configured into 8 groups with 4 engines per group. All 32 computing engines work in parallel, lending to a 32-fold decrease in running time without any change in performance when compared with the frame-based algorithm. The architecture has been synthesized on the Xilinx Virtex-5 FPGA. It occupies 64% of the total number of slices and 87% of the local memory, and takes 0.721 ms (including the SRAM read/write time and the computation time) to detect edges of 512 × 512 images in the USC SIPI database when clocked at 100 MHz.

A preliminary version of this work was presented in [13]. The work presented herein not only provides more elaborate discussions and performance results, but also provides an improved distributed Canny edge detector both at the algorithm and architecture levels. The algorithm is improved by proposing a novel block-adaptive threshold selection procedure that exploits the local image characteristics and is more robust to changes in block size as compared to [13]. While the architecture presented in [13] was limited to a fixed image size of 512 × 512 and a fixed image block size of 64 × 64, this paper presents a general FPGA-based pipelined architecture that can support any image size and block size. Furthermore, no FPGA synthesis results and no comparison results with existing techniques were presented in [13]. In this paper, FPGA synthesis results, including the resource utilization, execution time, and comparison with existing FPGA implementations, are presented.

The rest of the paper is organized as follows. Section 2 gives a brief overview of the original Canny algorithm. Section 3 presents the proposed distributed Canny edge detection algorithm, which includes the adaptive threshold selection algorithm and a non-uniform quantization method to compute the gradient magnitude histogram. Quantitative conformance as well as subjective testing results are presented in Section 4 in order to illustrate the edge detection performance of the proposed distributed Canny algorithm as compared to the original Canny algorithm for clean as well as noisy images. In addition, the effects of the gradient mask size and the block size on the performance of the proposed distributed Canny edge detection scheme are discussed and illustrated in Section 4. The proposed hardware architecture and the FPGA implementation of the proposed algorithm are described in Section 5. The FPGA synthesis results and comparisons with other implementations are presented in Section 6. Finally, conclusions are presented in Section 7.

II. CANNY EDGE DETECTION ALGORITHM

Canny developed an approach to derive an optimal edge detector to deal with step edges corrupted by white Gaussian noise. The original Canny algorithm [14] consists of the following steps: 1) Calculating the horizontal gradient Gx and vertical gradient Gy at each pixel location by convolving with gradient masks. 2) Computing the gradient magnitude G and direction θG at each pixel location. 3) Applying Non-Maximal Suppression (NMS) to thin edges. This step involves computing the gradient direction at each pixel. If the pixel's gradient direction is one of the 8 possible main directions (0°, 45°, 90°, 135°, 180°, 225°, 270°, 315°), the gradient magnitude of this pixel is compared with two of its immediate neighbors along the gradient direction, and the gradient magnitude is set to zero if it does not correspond to a local maximum. For the gradient directions that do not coincide with one of the 8 possible main directions, an interpolation is done to compute the neighboring gradients. 4) Computing high and low thresholds based on the histogram of the gradient magnitude for the entire image.
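As an illustration of the per-pixel computations in steps 1)–3) of the Canny algorithm (gradient, magnitude, and direction), a minimal pure-Python sketch follows. The 3 × 3 Sobel masks used here are a stand-in assumption for illustration only; the paper's implementation uses FIR derivative-of-Gaussian gradient masks whose size depends on the standard deviation σ.

```python
import math

def sobel_gradients(img):
    """Horizontal/vertical gradients of a 2-D list `img` using 3x3 Sobel
    masks (a stand-in for the paper's FIR gradient masks).  Border pixels
    are left at zero for simplicity."""
    h, w = len(img), len(img[0])
    gx = [[0.0] * w for _ in range(h)]
    gy = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx[y][x] = (img[y-1][x+1] + 2*img[y][x+1] + img[y+1][x+1]
                        - img[y-1][x-1] - 2*img[y][x-1] - img[y+1][x-1])
            gy[y][x] = (img[y+1][x-1] + 2*img[y+1][x] + img[y+1][x+1]
                        - img[y-1][x-1] - 2*img[y-1][x] - img[y-1][x+1])
    return gx, gy

def magnitude_direction(gx, gy, y, x):
    """Gradient magnitude and direction (degrees in [0, 360)) at one pixel;
    NMS would quantize the direction and compare against the two neighbors
    along it."""
    g = math.hypot(gx[y][x], gy[y][x])
    theta = math.degrees(math.atan2(gy[y][x], gx[y][x])) % 360.0
    return g, theta
```

For a vertical step edge, the horizontal gradient dominates, the direction quantizes to 0°, and NMS would therefore compare the pixel against its left and right neighbors.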
Fig. 3. Proposed distributed Canny edge detection algorithm.

Fig. 5. (a) Original 512 × 512 Lena image; (b) uniform block; (c) uniform/texture block; (d) texture block; (e) edge/texture block; (f) medium edge block of the Lena image. Shown blocks are of size 64 × 64.

Fig. 9. Reconstruction values and quantization levels.

Fig. 11. (a) Original Houses image; edge map using the original Canny algorithm (b) with a 9 × 9 (σ = 1.4) gradient mask and (c) with a 3 × 3 (σ = 0.4) gradient mask.

TABLE I. Standard Deviations of P1 Values for Each Block Type for 64 × 64 Blocks.

TABLE II. P1 Values for Each Block Type With Different Block Sizes.

what is the best choice for the mask size for different types of images, and what is the smallest block size that can be used by our proposed distributed Canny algorithm without sacrificing performance.
1) The Effect of Mask Size: As indicated in Section 2, the size of the gradient mask is a function of the standard deviation σ of the Gaussian filter, and the best choice of σ is based on the image characteristics. Canny has shown in [14] that the optimal operator for detecting step edges in the presence of noise is the first derivative of the Gaussian operator. As stated in Section 2, for the original Canny algorithm as well as the proposed algorithm, this standard deviation is a parameter that is typically set by the user based on the knowledge of sensor noise characteristics. It can also be set by a separate application that estimates the noise and/or blur in the image. A large value of σ results in smoothing and improves the edge detector's resilience to noise, but it undermines the detector's ability to detect the location of true edges. In contrast, a smaller mask size (corresponding to a lower σ) is better for detecting detailed textures and fine edges, but it decreases the edge detector's resilience to noise.

An L-point even-symmetric FIR Gaussian pulse-shaping filter design can be obtained by truncating a sampled version of the continuous-domain Gaussian filter of standard deviation σ. The size L of the FIR Gaussian filter depends on the standard deviation σ and can be determined as follows:
L = 2 L_side + 1    (3)

L_side = σ √(2 log(1/C_T))    (4)
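Under this reading of Eqs. (3)–(4), with C_T taken as the truncation threshold below which the tails of the sampled Gaussian are dropped, the filter length can be computed as below. The ceil() rounding of L_side to an integer is an assumption on our part; the text does not state the rounding rule, and the paper's 9 × 9 and 3 × 3 masks would correspond to a different (unstated) value of C_T.

```python
import math

def gaussian_fir_length(sigma, c_t):
    """Length L of the truncated FIR Gaussian filter per Eqs. (3)-(4):
    taps whose amplitude falls below the truncation threshold c_t
    (relative to the peak) are dropped.  Rounding via ceil() is assumed."""
    l_side = sigma * math.sqrt(2.0 * math.log(1.0 / c_t))  # Eq. (4)
    return 2 * math.ceil(l_side) + 1                       # Eq. (3)
```

With an example threshold of C_T = 0.01, σ = 1.4 yields an 11-tap filter and σ = 0.4 yields a 5-tap filter.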
Fig. 12. (a) Houses image with Gaussian white noise (σ_n = 0.01); edge map using the original Canny algorithm (b) with a 9 × 9 (σ = 1.4) gradient mask and (c) with a 3 × 3 (σ = 0.4) gradient mask.

Fig. 13. (a) Gaussian blurred Houses image (σ_blur = 2); edge map using the original Canny algorithm (b) with a 9 × 9 (σ = 1.4) gradient mask and (c) with a 3 × 3 (σ = 0.4) gradient mask.

Fig. 15. (a) 512 × 512 Salesman image; edge maps of (b) the original Canny edge detector and (c) the proposed algorithm with a non-overlapping block size of 64 × 64, using a 3 × 3 gradient mask.

Fig. 16. (a) 512 × 512 Fruit image; edge maps of (b) the original Canny edge detector and (c) the proposed algorithm with a non-overlapping block size of 64 × 64, using a 3 × 3 gradient mask.

Fig. 17. (a) 512 × 512 Houses image; edge maps of (b) the original Canny edge detector and (c) the proposed algorithm with a non-overlapping block size of 64 × 64, using a 3 × 3 gradient mask.
TABLE III. Conformance Evaluation.

Fig. 20. Snapshot of the performed subjective test comparing the edge maps generated by the original Canny algorithm and the proposed algorithm.

Fig. 22. Block diagram of the embedded system for the proposed algorithm.
into the input local memory in the PU. The CEs read this data, process them, and store the edges into the output local memory. Finally, the edges are written back to the SRAM one output value at a time from the output local memory.

In order to increase the throughput, the SRAM external memory is organized into q memory banks, one bank per PU. Since only one b-bit data word, corresponding to one pixel value, can be read from a SRAM at a time, such an organization helps multiple PUs to fetch data at the same time and facilitates parallel processing by the PUs. For an image of size N × N, each SRAM bank stores a tile of size N²/q image data, where the term tile refers to an image partition containing several non-overlapping blocks. The SRAM bank is dual-port so that the PUs can read and write at the same time.

In order to maximize the overlap between data read/write and data processing, the local memory in each PU is implemented using dual-port block RAM (BRAM) based ping-pong buffers. Furthermore, in order that all the p CEs can access data at the same time, the local memory is organized into p banks. In this way, a total of pq overlapping blocks can be processed by q groups of p CEs at the same time. The processing time for an N × N image is thus reduced approximately by a factor of pq. If there are enough hardware resources to support more CEs and more PUs, the throughput would increase proportionally.

However, FPGAs are constrained by the size of on-chip BRAM memory, the number of slices, and the number of I/O pins, and the maximum throughput achievable is a function of these parameters. Since each CE processes an m × m overlapping image block, for a b-bit implementation, this would require 3 × m × m × b bits to store the vertical and horizontal gradient components and the gradient magnitude, as discussed in Section 5.2. To enable parallel access of these blocks, there are three BRAMs of size m × m × b in each CE. In addition, 2 × p × m × m × b bits are needed for each of the ping-pong buffers. Therefore, for each PU, 3 × p × m × m × b bits are needed for the p CEs and 4 × p × m × m × b bits are needed for the input and output ping-pong buffers. This results in a total of 7 × p × m × m × b × q = 7pqm²b bits for the FPGA memory. Thus, if there are more CEs and/or a larger block size, more FPGA memory is required. Similarly, if there are more PUs, more I/O pins are required to communicate with the external SRAM memory banks. Thus, the choice of p and q depends on the FPGA memory resources and the number of I/O pins. We do not consider the number of slices as a constraint since the number of available slices is much larger than required by the proposed algorithm.

The customized memory interface, shown in Fig. 24, has a 2b-bit wide internal data-bus. In our application, the dual-port SRAM, the memory interface, and the local memories, which connect with the SRAM interface, operate at the same frequency, which is f_SRAM MHz.

B. Computing Engine (CE)

As described before, each CE processes an m × m overlapping image block and generates the edges of an n × n non-overlapping block. The computations that take place in a CE can be broken down into the following five units: 1) block classification, 2) vertical and horizontal gradient calculation as well as magnitude calculation, 3) directional non-maximum suppression, 4) high and low threshold calculation, and 5) thresholding with hysteresis. Each of these units is mapped onto a hardware unit as shown in Fig. 25 and described in the following subsections. The communication between each component is also illustrated in Fig. 25 and will be described in detail in the following subsections.

Suppose the input image data has been stored in the external memory (SRAMs). For each PU, once the ping-pong buffers have loaded p m × m overlapping image blocks, which we refer to as a Group of Blocks (GOB), from the SRAM, all the p CEs can access block data from the ping-pong buffers at the same time. For each CE, the edge detection computation can start after an m × m overlapping block is stored in the CE's local memories. In addition, in order to compute the block type, vertical gradient, and horizontal gradient in parallel, the m × m overlapping block is stored in three local memories, marked as local memories 1, 2 and 3, as shown in Figs. 26 and 27.

1) Block Classification: As stated before, the m × m overlapping block is stored in the CE's local memory 1 and is used for determining the block type. The architecture for the block classification unit consists of two stages, as shown in Fig. 26. Stage 1 performs pixel classification, while stage 2 performs block classification. For pixel classification, the local variance of each pixel is utilized, and the variance is calculated as follows:

var = (1/8) Σ_{i=1}^{9} (x_i − x̄)²    (5)

where x_i is the pixel intensity and x̄ is the mean value of the 3 × 3 local neighborhood. Thus, the pixels in the 3 × 3 windows are fetched from the local memory and stored in one FIFO buffer to compute the local variance. The computation
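The memory accounting above (3pm²b bits for the CE BRAMs plus 4pm²b bits for the ping-pong buffers of each PU) can be captured in a small sizing helper. The example parameters (p = 4, q = 8, m = 68, b = 8) are taken from the synthesized design described later in the paper; the helper itself is a sketch of the Section's bookkeeping, not part of the design.

```python
def pu_memory_bits(p, q, m, b):
    """Total FPGA block-RAM demand per the Section's accounting:
    3*m*m*b bits per CE (Gx, Gy, |G|) plus 4*p*m*m*b bits for the
    input/output ping-pong buffers of each PU, i.e. 7*p*m*m*b bits
    per PU and 7*p*q*m^2*b bits for q PUs."""
    per_ce = 3 * m * m * b                   # three m x m x b BRAMs in a CE
    per_pu = p * per_ce + 4 * p * m * m * b  # p CEs + ping-pong buffers
    return q * per_pu                        # == 7 * p * q * m * m * b

# e.g. the synthesized design: p = 4 CEs, q = 8 PUs, m = 68, b = 8 bits
# -> about 8.3 Mbits of BRAM
```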
Fig. 27. Gradient and Magnitude Calculation Unit: (a) architecture; (b) execution time.
is done using one adder, two accumulators, two multipliers, and one squarer (a right shift by 3 bits achieves the multiplication by 1/8). Next, the local variance is compared with T_u and T_e [17] in order to determine the pixel type. Then, two counters are used to obtain the total number of pixels of each pixel type. The output of counter 1 gives C1, the number of uniform pixels, while the output of counter 2 gives C2, the number of edge pixels. The block classification stage is initialized once the C1 and C2 values are available. C2 is compared with 0, and the result is used as the enable signal of COMP 5, COMP 6, and MUX 2. C1 and C2 are compared with different values as shown in Fig. 10(a), and the outputs are used as the control signals of MUX 1 and MUX 2 to determine the value of P1. Finally, the P1 value is compared with 0 to produce the enable signal, marked as EN. If the P1 value is larger than 0, the EN signal enables the gradient calculation, magnitude calculation, directional non-maximum suppression, high and low threshold calculation, and thresholding with hysteresis units. Otherwise, these units do not need to be activated, and the edge map with all zero-value pixels is stored back into the SRAM. P1 and EN are the outputs of the block classification unit and are stored in registers for the threshold calculation. The latency between the first input and the output P1 is m × m + 12 clock cycles, and the total execution time for the block classification component is m × m + 13 clock cycles.

2) Gradient and Magnitude Calculation: Since the gradient and magnitude calculation unit is independent of the block classification unit, these two components can work in parallel. In addition, the horizontal gradient and vertical gradient can also be computed in parallel. The m × m overlapping block is also stored in the CE's local memories 2 and 3 and is used as input to compute the horizontal and vertical gradient, respectively. The architecture for the gradient and magnitude
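A bit-level software model of the pixel-classification variance of Eq. (5), including the divide-by-8 that the hardware realizes as a 3-bit right shift, might look as follows. The floating-point mean is a simplification here; the hardware datapath described above operates on fixed-point values.

```python
def local_variance(window9):
    """Local variance of a 3x3 neighborhood per Eq. (5): the sum of
    squared deviations from the neighborhood mean, divided by 8
    (implemented as a right shift by 3 bits in the hardware)."""
    assert len(window9) == 9, "expects the 9 pixels of a 3x3 window"
    mean = sum(window9) / 9.0
    return sum((x - mean) ** 2 for x in window9) / 8.0
```

The resulting variance would then be compared against the thresholds T_u and T_e to label the pixel as uniform or edge, feeding the C1/C2 counters.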
TABLE IV. Resource Utilization on XC5VSX240T for 1 CE.

TABLE V. Resource Utilization on XC5VSX240T for 1 PU.

TABLE VI. Resource Utilization on XC5VSX240T for an 8-PU Architecture.

TABLE VII. Clock Cycles for Each Unit.

implemented using DSP48Es. The on-chip memory is implemented using BRAMs. Table VI summarizes the resource utilization of the 8-PU architecture. It shows that the 8-PU architecture occupies 64% of the slices and 87% of the BRAM memory.

B. Execution Time

Fig. 31 shows the pipeline implementation of SRAM read/write with the CE computation, where each SRAM bank stores a tile of size N²/q image data and each ping or pong buffer stores a Group of Blocks (GOB) of size p × m × m image data. Since our design has q = 8 PUs, one SRAM holds a tile of size 32,768 (64 × 64 × 8) image data for a 512 × 512 image. In addition, for p = 4 CEs and for a 64 × 64 block size (n = 64; m = 68), the image data stored in the SRAM result in two GOBs. These GOBs can be pipelined. As shown in Fig. 31, while GOB 2 is loaded into the ping-pong buffers, the CEs process GOB 1. Also, while GOB 1 is written back into SRAM, the CEs process GOB 2 at the same time. Such a pipelined design increases the throughput.

Fig. 31 also shows the computation time of each stage in a CE during the processing of an m × m overlapping block (m = 68 for a 64 × 64 block and a 3 × 3 gradient mask). As shown in Fig. 31, T_BC, the time to classify the block type, is less than T_GRAD, the time for the gradient and magnitude calculation, which equals 9248 clock cycles. T_FIR, the FIR filter computation latency, equals 8 clock cycles for a 3 × 3 FIR separable filter. The high and low thresholds calculation unit is pipelined with the directional NMS unit, and the latency of the NMS unit is 20 clock cycles. This is referred to as T_NMS in Fig. 31. T_TC represents the latency of the thresholds calculation stage and is equal to 4630 cycles, while T_TH represents the latency of the thresholding with hysteresis stage and is equal to 4634 cycles. Table VII shows the latency for each unit. Therefore, one CE takes T_CE = T_GRAD + T_FIR + T_NMS + T_TC + T_TH = 18,548 cycles.

Each PU takes 18,496 cycles to load 4 68 × 68 overlapping blocks from the SRAM into the local memory. It also takes 16,384 cycles (4 64 × 64 non-overlapping blocks) to write the final edge maps into SRAM. If the SRAM operates at f_SRAM, the SRAM read time is 18,496/f_SRAM. The CE processing time equals 18,548/f_CE when the CE is clocked at f_CE. In order to completely overlap communication with computation and avoid any loss of performance due to communication with the SRAM, given a fixed f_SRAM, f_CE should be selected such that the processing time is approximately equal to the SRAM read time (since the SRAM write time is less than the read time). Thus, 18,496/f_SRAM = 18,548/f_CE, and f_CE can be set to be 1.003 times higher than f_SRAM.

The maximum speed of the employed SRAM device (CY7C0832BV) is 133 MHz. However, we choose the SRAM clock rate as f_SRAM = 100 MHz to allow for sufficient design margin. Thus, f_CE ≈ 100 MHz, which is lower than the maximum operating frequency (250 MHz) of the used FPGA according to the synthesis report. The total computation period for one CE is T_CE = 18,548/10^5 ≈ 0.186 ms when clocked at 100 MHz. Thus, for a 512 × 512 image, the total computation period is T_com = 0.372 ms, while the total execution time, including the SRAM read/write time and the computation time, is T_total = (18,496 + 16,384)/10^5 + 0.186 × 2 = 0.721 ms. The simulation results also show that, at a clock frequency of 100 MHz, the execution time for processing 512 × 512 images is 0.721 ms for the images in the USC SIPI database.

C. FPGA Experiments and Results

1) Conformance Tests: In order to validate the FPGA-generated results, two conformance tests are performed. One aims to evaluate the similarity between the edges detected by the fixed-point FPGA and the fixed-point Matlab implementation of the distributed Canny edge detection algorithm. The other is to measure the similarity between the edges detected by the fixed-point FPGA and the 64-bit floating-point Matlab implementation of the distributed Canny edge detection algorithm. Both tests were performed on the USC SIPI image database.

For the first test, the difference between the edges detected by the fixed-point FPGA and Matlab implementations is calculated. Fig. 32 shows an example of the obtained fixed-point Matlab and FPGA results for the Houses image using the proposed algorithm with a 64 × 64 block size and 3 × 3 gradient masks. The FPGA simulation result is obtained using Modelsim and assumes the original image data has been stored in the SRAMs. It can be seen that the two edge maps are the same. Furthermore, the quantitative difference between the FPGA simulation result and the fixed-point Matlab simulation result
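The cycle accounting of the execution-time analysis in this section can be reproduced numerically. The constants below are the Section's stated values; in particular, T_CE = 18,548 cycles is taken as given rather than re-derived from the individual unit latencies.

```python
def timing_summary(f_sram_mhz=100.0):
    """Back-of-the-envelope check of the Section's cycle counts for a
    512 x 512 image with q = 8 PUs and p = 4 CEs (m = 68, n = 64)."""
    t_read_cycles  = 4 * 68 * 68   # 18,496: load one GOB from SRAM
    t_write_cycles = 4 * 64 * 64   # 16,384: write edge maps back
    t_ce_cycles    = 18548         # stated T_GRAD+T_FIR+T_NMS+T_TC+T_TH
    # f_CE is chosen so CE compute time matches the SRAM read time:
    f_ce_ratio = t_ce_cycles / t_read_cycles        # ~1.003
    cycles_per_ms = f_sram_mhz * 1e3                # 10^5 at 100 MHz
    t_total_ms = (t_read_cycles + t_write_cycles) / cycles_per_ms \
                 + 2 * (t_ce_cycles / cycles_per_ms)  # two GOB passes
    return f_ce_ratio, t_total_ms
```

At 100 MHz this reproduces the reported figures: a frequency ratio of about 1.003 and a total execution time of about 0.721 ms per 512 × 512 image.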
Qian Xu received the B.E. degree in optoelectronics engineering from the Huazhong University of Science and Technology, Wuhan, China, and the M.E. degree in electrical engineering from the Shanghai Institute of Technical Physics, Chinese Academy of Sciences, China, in 2006 and 2009, respectively. She is pursuing the Ph.D. degree in electrical engineering, specializing in image processing and analysis, with Arizona State University, Tempe. Her research interests include image processing, computer vision, pattern recognition, and algorithm-architecture codesign.

Srenivas Varadarajan received the B.E. degree in electronics and communications engineering from the PSG College of Technology, Coimbatore, India, and the M.S. degree in electrical engineering from Arizona State University, in 2003 and 2009, respectively, where he is currently pursuing the Ph.D. degree in electrical engineering in the area of image processing. His research interests include texture analysis and synthesis, image and video compression, computer vision, 3-D modeling, and embedded-software optimizations for media processing algorithms. He has about seven years of industrial experience in image and video processing in companies, including Texas Instruments, Qualcomm Research, and Intel Corporation.

Chaitali Chakrabarti received the B.Tech. degree in electronics and electrical communication engineering from IIT Kharagpur, India, and the M.S. and Ph.D. degrees in electrical engineering from the University of Maryland, College Park, in 1984, 1986, and 1990, respectively. She is a Professor with the School of Electrical, Computer and Energy Engineering, Arizona State University (ASU), Tempe. Her research interests include all aspects of low-power embedded systems design and VLSI architectures, and algorithms for signal processing, image processing, and communications.

Dr. Chakrabarti was the recipient of the Best Paper Awards at SAMOS'07, MICRO'08, SiPS'10, and HPCA'13. She is a Distinguished Alumni with the Department of Electrical and Computer Engineering, University of Maryland. She was the recipient of several teaching awards, including the Best Teacher Award from the College of Engineering and Applied Sciences from ASU in 1994, the Outstanding Educator Award from the IEEE Phoenix Section in 2001, and the Ira A. Fulton Schools of Engineering Top 5% Faculty Award in 2012. She served as the Technical Committee Chair of the DISPS subcommittee, IEEE Signal Processing Society, from 2006 to 2007. She is currently an Associate Editor of the Journal of VLSI Signal Processing Systems and the IEEE TRANSACTIONS ON VLSI SYSTEMS.

Lina J. Karam (F'13) received the B.E. degree in computer and communications engineering from the American University of Beirut, Beirut, Lebanon, and the M.S. and Ph.D. degrees in electrical engineering from the Georgia Institute of Technology, Atlanta, GA, USA, in 1989, 1992, and 1995, respectively. She is a Full Professor with the School of Electrical, Computer and Energy Engineering, Arizona State University, Phoenix, AZ, USA, where she directs the Image, Video, and Usability (IVU) Research Laboratory. Her industrial experience includes image and video compression development at AT&T Bell Laboratories, Murray Hill, NJ, USA, multidimensional data processing and visualization at Schlumberger, and collaboration on computer vision, image/video processing, compression, and transmission projects with industries, including Intel, NTT, Motorola/Freescale, General Dynamics, and NASA. She has authored more than 100 technical publications, and she is a co-inventor on a number of patents.

Dr. Karam was the recipient of the U.S. National Science Foundation CAREER Award, the NASA Technical Innovation Award, the 2012 Intel Outstanding Researcher Award, and the Outstanding Faculty Award by the IEEE Phoenix Section in 2012. She has served on several journal editorial boards, several conference organization committees, and several IEEE technical committees. She served as the Technical Program Chair of the 2009 IEEE International Conference on Image Processing, the General Chair of the 2011 IEEE International DSP/SPE Workshops, and the Lead Guest Editor of the PROCEEDINGS OF THE IEEE, PERCEPTION-BASED MEDIA PROCESSING ISSUE (SEP. 2013). She has co-founded two international workshops (VPQM and QoMEX). She is currently serving as the General Chair of the 2016 IEEE International Conference on Image Processing and a member of the IEEE Signal Processing Society's Multimedia Signal Processing Technical Committee and the IEEE Circuits and Systems Society's DSP Technical Committee. She is a member of the Signal Processing, Circuits and Systems, and Communications societies of the IEEE.