Image Hardware PDF
Chandrajit Pal, Avik Kotal, Asit Samanta, Amlan Chakrabarti, Ranjan Ghosh
1 Introduction
Human beings have historically relied on their vision for tasks ranging from basic instinctive survival skills to the detailed and elaborate analysis of works of art. Our ability to guide our actions and engage our cognitive abilities based on visual input is a remarkable trait of the human species, and much of how exactly we accomplish this, and seem to do it so well, remains to be explored. The need to extract information from images and interpret their contents has been one of the driving factors in the development of image processing and computer vision over the past decades. Digital image processing (DIP) is an ever-growing area with a variety of applications, including medicine, video surveillance and many more.
To implement the increasingly sophisticated DIP algorithms and to process the large amounts of data captured from sources such as satellites or medical instruments, intelligent high-speed real-time systems have become imperative [1]. Image processing algorithms implemented in hardware (instead of software) have recently emerged as the most viable solution for improving the performance of image processing systems. Our goal is to familiarize application programmers with the state of the art in compiling high-level programs to FPGAs, and to survey the relevant research work on FPGAs. The outstanding features that FPGAs offer, such as scope for optimization, high computational density and low cost, make them an increasingly preferred choice of experts in the image processing field today. Technological advancement in the manufacture of semiconductor ICs offers opportunities to implement a wider range of imaging operations in real time, while implementations of existing operations still need improvement. The introduction of reconfigurable hardware devices, together with system-level hardware description languages, has further accelerated the design and implementation of image processing algorithms in hardware. Owing to the fine-grained parallelism possible in imaging operations, FPGA circuits are capable of competing with other computation-based implementation environments. This advancement has made it possible to design complete embedded systems on a chip (SoC) by combining sensor, signal processing and memory onto a single substrate. With the ideal use of System-on-a-Programmable-Chip (SOPC) technology, FPGAs prove to be a very efficient, cost-effective and attractive methodology for design verification [2].
An FPGA-based design also reduces cost and minimizes bottlenecks by maximizing data flow right from capture through the processing chain to the final output. Sometimes constant upgradation of the device is required, for which ASICs (Application Specific Integrated Circuits) do not fit well, as once an ASIC is fabricated it cannot be changed [6].
Most machine vision algorithms are dominated by low- and intermediate-level image processing operations, many of which are inherently parallel. This makes them amenable to parallel hardware implementation on an FPGA, which has the potential to significantly accelerate the image processing component of a machine vision system.
On an FPGA, each operation is implemented in parallel on a separate hardware component, allowing data to pass directly from one operation to another and significantly reducing, or even eliminating, the memory overhead. Fortunately, the low- and intermediate-level image processing operations typically used in a machine vision algorithm can be readily parallelized. An FPGA implementation results in a smaller and significantly lower-power design that combines the flexibility and programmability of software with the speed and parallelism of hardware [7].
Hence, we choose an FPGA platform to rapidly prototype and evaluate our design
methodology.
The work flow graph in Fig. 1 shows the basic steps of implementing an image processing algorithm in hardware. Step 1 requires a detailed algorithmic understanding and a subsequent software implementation. Secondly, the design should be optimized from both the algorithmic viewpoint (e.g. using algebraic transforms) and the hardware viewpoint (using efficient storage schemes and adjusting fixed-point computation specifications). Finally, an overall evaluation in terms of speed, resource utilization and image fidelity decides whether additional adjustments to the design decisions are needed. Once this is done, FPGA-in-the-loop verification is carried out, which enables us to run the test cases faster. It also opens the possibility of exploring more test cases and performing extensive regression testing on our designs, ensuring that the algorithm will behave as expected in the real world. A good software design does not necessarily correspond to a good hardware design, which is precisely why the steps outlined in Fig. 1 should be followed.
Since 2000 we have seen a good amount of research on utilizing FPGAs as a suitable prototyping platform for realizing image and video processing algorithms. Digital image processing algorithms are normally categorized into three levels: low, intermediate and high. Low-level operations are computationally intensive and operate on individual pixels, and sometimes on their neighborhoods, involving geometric operations etc. [7]. Intermediate-level operations include conversion of the pixel data into different representations such as histograms, segmentation and thresholding, and the operations related to these. High-level algorithms try to extract meaningful information from the image, such as object identification and classification. As we move up from low- to high-level operations there is an obvious decrease in the exploitable data parallelism, owing to a shift from pixel data to more descriptive and informative representations. Here we focus on low-level (local filter) algorithms to demonstrate the capabilities of FPGAs for computationally intensive tasks targeted at low- and intermediate-level operations. As is well known, a separate class of low-level computationally intensive tasks comprises image filtering operations based on convolution. Several related research works have been done so far.
Paper [10] has shown various hardware convolution architectures related to real-time vision systems. The underlying windowed filtering operation is

y[m,n] = \sum_{i=-a}^{a} \sum_{j=-b}^{b} h[i,j] \, x[m-i, n-j]        (1)
where (m,n) are pixel positions, h[m,n] denotes the filter response function and x[m,n] is the image to be filtered; [a,b] denotes the window filter size [16]. The process is illustrated in Fig. 2.
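For reference, the following is a minimal C model of the windowed filtering of Eqn. (1). The function name conv2d, the float data type and the clamped border handling are our illustrative choices and are not taken from the paper.

#include <stddef.h>

/* Software reference for Eqn. (1): a (2a+1)x(2b+1) window filter.
 * h is the filter response, x the input image (row-major, width W, height H).
 * Border pixels are handled by clamping coordinates; the paper does not
 * specify its border policy, so this choice is an assumption. */
static int clampi(int v, int lo, int hi) {
    return v < lo ? lo : (v > hi ? hi : v);
}

void conv2d(const float *x, float *y, int W, int H,
            const float *h, int a, int b) {
    for (int m = 0; m < H; m++) {
        for (int n = 0; n < W; n++) {
            float acc = 0.0f;
            for (int i = -a; i <= a; i++) {
                for (int j = -b; j <= b; j++) {
                    int r = clampi(m - i, 0, H - 1);
                    int c = clampi(n - j, 0, W - 1);
                    acc += h[(i + a) * (2 * b + 1) + (j + b)] * x[r * W + c];
                }
            }
            y[m * W + n] = acc;
        }
    }
}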
The fully parallel architecture (Fig. 3) employs some memory registers, which assist in loading a 3x3 neighborhood. The convolution operation needs nine multiplications and eight additions, and this is a generic architecture with the highest complexity. It computes a new output pixel at every clock cycle after an initial delay, but consumes more resources.
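The per-pixel datapath of this architecture can be sketched in C as an unrolled sum of nine products, mirroring the nine-multiplier, eight-adder structure described above; the array names p and k are illustrative only.

/* One output pixel of the fully parallel 3x3 architecture: nine multipliers
 * and an eight-adder tree evaluated per clock cycle.  p[0..8] are the pixels
 * of the 3x3 neighborhood delivered by the buffering logic, k[0..8] the
 * kernel coefficients. */
static inline int conv3x3_pixel(const short p[9], const short k[9]) {
    /* In hardware each product maps to a separate multiplier and the sums
     * form an adder tree, so a new pixel is produced every clock cycle. */
    return p[0]*k[0] + p[1]*k[1] + p[2]*k[2]
         + p[3]*k[3] + p[4]*k[4] + p[5]*k[5]
         + p[6]*k[6] + p[7]*k[7] + p[8]*k[8];
}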
In Fig. 4, the line buffer consists of a single-port RAM, as shown in unit 1.a of Fig. 4; the counter in it is incremented to write the current pixel data and to read it subsequently. The output of each of the five buffers of unit-1 connects to a respective input of unit-2. Each of the five parallel sub-circuits of unit-2 consists of a five-tap MAC FIR engine; one such unit is shown in detail in unit-2.a of Fig. 4, depicting the ASR (Addressable Shift Register) that implements the input delay buffer. The address port runs n times faster than the data port, where n is the number of filter taps. The ROM and ASR addresses are produced by the counter; the sequence counts from 0 to n-1, then repeats. Pipeline registers r0-r2 increase performance. A capture register is required for streaming operation, and a down-sampler reduces the capture register sample period to the output sample period. The filter coefficients are stored in the ROM. The five outputs of the five MAC engines are sequentially added to get the result, whose absolute value is computed, and the data is narrowed to 8 bits. The blue colored block is elaborated in unit-2.b (Fig. 4) as the multiply-accumulate (MAC) engine. Enabling the 'Pipeline to Greatest Extent Possible' mask configuration parameter ensures that the internal pipeline stages of the dedicated multipliers are used [17]. The yellow box is elaborated in unit-2.c (Fig. 4), which calculates the absolute value before multiplying by the scaling factor, the sum of the weights of the filter coefficients. This architecture has the advantage of using fewer resources, but it needs 5 clock cycles per pixel: the underlying 5-tap MAC FIR filters are clocked 5 times faster than the input rate. Therefore the throughput of the design is 100 MHz/5 = 20 million pixels per second. For a 64x64 image this is 20x10^6/(64x64) = 4883 frames/sec; for our experiment the image size is 150x150, giving 20x10^6/(150x150) = 889 frames/sec. This architecture consumes very few hardware resources.
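A behavioral C sketch of one such five-tap MAC FIR engine is given below: the shift-register update models the ASR delay line and the accumulation loop models the multiply-accumulate that the hardware executes once per fast clock, five fast clocks per pixel. Variable names are ours, not taken from [17] or [18].

#define NTAPS 5

/* Behavioral model of one 5-tap MAC FIR engine (cf. unit-2.a/2.b of Fig. 4).
 * taps[] models the addressable shift register, coeff[] the ROM contents. */
short mac_fir_step(short taps[NTAPS], const short coeff[NTAPS], short in) {
    /* shift the input delay buffer by one sample */
    for (int k = NTAPS - 1; k > 0; k--)
        taps[k] = taps[k - 1];
    taps[0] = in;

    /* one multiply-accumulate per fast clock; 5 fast clocks per pixel */
    int acc = 0;
    for (int k = 0; k < NTAPS; k++)
        acc += (int)taps[k] * coeff[k];
    return (short)acc;  /* captured and narrowed downstream */
}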
For linear operations, convolution has some interesting properties such as commutativity and associativity. A separable PxP kernel can therefore be redefined as the convolution of a Px1 kernel (Q1) with a 1xP kernel (Q2). As a result, the filtering of an image I can be formulated as

I * Q1 * Q2 = I * Q2 * Q1        (2)
Fig. 5 and Fig. 6 implement the right-hand and left-hand sides of Eqn. 2, respectively; both show the design with a separable convolution kernel architecture. In Fig. 5 the column convolution is carried out in the first section of the hardware, before the row buffering scheme. The row buffering is shown in the detailed architecture in unit 1.a of Fig. 4, as explained previously, and the row convolution in unit 4.a of Fig. 5. The partially processed pixels after the column convolution are passed through the row convolution section to obtain the filtered pixels. The design is capable of processing (100x10^6)/(256x256) = 1526 frames/sec, where 100 MHz is the clock frequency of the FPGA board and 256x256 the image size, and (100x10^6)/(150x150) = 4444 frames/sec for a 150x150 image.
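A software sketch of this separable scheme is given below, assuming a Px1 column kernel q1 followed by a 1xP row kernel q2 (the right-hand side of Eqn. 2); border handling and fixed-point scaling are omitted for brevity, and all names are illustrative. Swapping the two passes yields the left-hand side of Eqn. 2 and, for a separable kernel, the same result as a full PxP convolution at 2P instead of P*P multiplies per pixel.

/* Separable convolution: column pass with q1, then row pass with q2.
 * img/tmp/out are row-major W x H float buffers; P is odd. */
void conv_separable(const float *img, float *tmp, float *out,
                    int W, int H, const float *q1, const float *q2, int P) {
    int r = P / 2;
    /* column convolution with the Px1 kernel Q1 */
    for (int m = r; m < H - r; m++)
        for (int n = 0; n < W; n++) {
            float acc = 0.0f;
            for (int k = -r; k <= r; k++)
                acc += q1[k + r] * img[(m - k) * W + n];
            tmp[m * W + n] = acc;
        }
    /* row convolution with the 1xP kernel Q2 */
    for (int m = r; m < H - r; m++)
        for (int n = r; n < W - r; n++) {
            float acc = 0.0f;
            for (int k = -r; k <= r; k++)
                acc += q2[k + r] * tmp[m * W + (n - k)];
            out[m * W + n] = acc;
        }
}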
Fig. 4. Hardware blocks showing the filtering hardware architecture of a 5x5 filter kernel implementation [18].

The frame processing time can be estimated as [3]

t_frame = C/f = (N \cdot M/(t_p \cdot n_{core}) + \alpha)/f        (3)

where t_frame is the processing time for one frame, C is the total number of clock cycles required to process one frame of M pixels, f is the maximum clock frequency at which the design can run, n_core is the number of processing units, t_p is the pixel-level throughput with one processing unit (0 < t_p <= 1), N is the number of iterations in an iterative algorithm and \alpha is the overhead (latency) in clock cycles for one frame [3].
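As a quick sanity check, the following C snippet evaluates Eqn. (3) with the parameter values reported below; the symbol alpha for the per-frame latency is reconstructed from context.

#include <stdio.h>

/* Worked example of Eqn. (3): M = 150*150 pixels, N = 1 iteration,
 * tp = 1 pixel/clock, ncore = 1, latency alpha = 350 cycles, f = 100 MHz. */
int main(void) {
    double M = 150.0 * 150.0, N = 1.0, tp = 1.0, ncore = 1.0;
    double alpha = 350.0, f = 100e6;

    double cycles = N * M / (tp * ncore) + alpha;   /* C = 22850 cycles */
    double tframe = cycles / f;                     /* ~0.00022 s */
    printf("t_frame = %.6f s (%.0f frames/s)\n", tframe, 1.0 / tframe);
    return 0;
}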
We have tested the convolution architectures discussed above on a single image filtering application and have measured the time via the well-known Eqn. (3) [3]. For a 150x150 image, M = 22500, N = 1, t_p = 1 (one pixel processed per clock pulse), \alpha = 350 (the latency in clock cycles), f = 100 MHz and n_core = 1. Therefore t_frame = 0.00022 seconds, i.e. about 0.2 ms, much less than the roughly 33 ms available per frame at real-time video rate. We have measured the same execution in software and it came to 0.008 seconds; the acceleration in hardware is therefore 0.008/0.00022, approximately 36x. From Table 1 it is clear that the architectures in Figs. 5, 6 and 7 are the most suitable with respect to resource usage. We have also measured the power consumption of the individual hardware architectures, as shown in Table 2.
Fig. 5. Hardware blocks showing the filtering hardware architecture for the separable kernel, implementing the right-hand side of Eqn. 2.
From the data it is clear that the normal convolution hardware in Fig. 4 and the separable hardware architectures in Figs. 5 and 6 consume the least power among all the designs.
In this paper we have briefly discussed our motivation for realizing computer vision algorithms in hardware and presented various efficient convolution architectures with almost identical results, showing only minute changes in the PSNR of the filtered output images obtained after applying Gaussian filtering to a noisy image, as shown in Fig. 13. We have also tested our architectures within a particular edge-preserving algorithm, where they produced good results (with enhanced PSNR, as shown in Fig. 13). It has been shown that the Xilinx System Generator (XSG) environment can be used to develop hardware-based computer vision algorithms from a system-level approach, making it suitable for developing co-design environments. We have also used FPGA-in-the-loop (FIL) verification [19] to verify our design; this approach further ensures that the algorithm will behave as expected in the real world. In the future we intend to explore more high-level techniques and approaches to circuit optimization for energy efficiency.
Fig. 6. Hardware blocks showing the filtering hardware architecture for the separable kernel, implementing the left-hand side of Eqn. 2.
Acknowledgment
This work has been supported by the Department of Science and Technology, Government of India, under grant No. DST/INSPIRE FELLOWSHIP/2012/320, as well as by a grant from TEQIP Phase 2 (COE), University of Calcutta, for the experimental equipment. The authors wish to thank Dr. Kunal Narayan Chaudhury for his help with some of the theoretical understanding.
References
1. Gribbon, K., Bailey, D., Johnston, C.: Design Patterns for Image Processing Algorithm Development on FPGAs. In: TENCON 2005, IEEE Region 10 Conference, pp. 1-6 (2005). doi:10.1109/TENCON.2005.301109
2. Li, Y., Yao, Q., Tian, B., Xu, W.: Fast double-parallel image processing based on FPGA. In: Proceedings of the 2011 IEEE International Conference on Vehicular Electronics and Safety, pp. 97-102 (2011). doi:10.1109/ICVES.2011.5983754
3. Wu, W., Acton, S.T., Lach, J.: Real-Time Processing of Ultrasound Images with Speckle Reducing Anisotropic Diffusion. In: Fortieth Asilomar Conference on Signals, Systems and Computers (ACSSC '06), pp. 1458-1464 (2006). doi:10.1109/ACSSC.2006.355000
Fig. 8. Simulation results showing the time interval taken to process the image pixels for the normal convolution hardware architecture of Fig. 4, where 5 clock pulses are needed to process each pixel. Each clock pulse lasts 10 ns.
Fig. 9. Simulation results showing the time interval taken to process the image pixels. Each clock pulse lasts 10 ns and each pixel requires one clock pulse to process, a timing behavior shared by all the architectures except that of Fig. 4. This simulation corresponds to the architecture implementing the right-hand side of Eqn. 2.
Fig. 10. Simulation results showing the time interval taken to process the image pixels. Each clock pulse lasts 10 ns and each pixel requires one clock pulse to process. This simulation corresponds to the architecture implementing the left-hand side of Eqn. 2 (Fig. 6).
Fig. 11. Simulation results showing the time interval taken to process the image pixels. Each clock pulse lasts 10 ns and each pixel requires one clock pulse to process. This simulation corresponds to the architecture of Fig. 7.
Fig. 12. Simulation results showing the time interval taken to process the image pixels. Each clock pulse lasts 10 ns and each pixel requires one clock pulse to process. This simulation corresponds to the complete parallel architecture of Fig. 3.
4. Zatrepalek, R. (Hardent Inc.): Using FPGAs to solve tough DSP design challenges. EE Times, 23 July 2007. http://www.eetimes.com/document.asp?piddl_msgpage=2&doc_id=1279776&page_number=1
5. Kalomiros, J.A., Lygouras, J.: Design and evaluation of a hardware/software FPGA-based system for fast image processing. Microprocessors and Microsystems 32(2), 95-106 (2008).
6. Nelson, A.E.: Implementation of image processing algorithms on FPGA hardware. May 2000. http://www.isis.vanderbilt.edu/sites/default/files/Nelson_T_0_0_2000_Implementa.pdf
7. Bailey, D.: Machine Vision Handbook (2012). doi:10.1007/978-1-84996-169-1. ISBN 978-1-84996-168-4.
8. Rao, D.V., et al.: Implementation and Evaluation of Image Processing Algorithms on Reconfigurable Architecture using C-based Hardware Descriptive Languages. Available: www.gbspublisher.com/ijtacs/1002.pdf
9. Kuon, I., Tessier, R., Rose, J.: FPGA Architecture: Survey and Challenges. pp. 135-253 (2007). doi:10.1561/1000000005.
10. Wiatr, K., Jamro, E.: Implementation of image data convolutions operations in FPGA reconfigurable structures for real-time vision systems. In: Proceedings of the International Conference on Information Technology: Coding and Computing, pp. 152-157 (2000). doi:10.1109/ITCC.2000.844199.
Fig. 13. Gaussian filtered output for an image of size 150x150, applied over a noisy image with noise variance \sigma^2 = 0.005. Filter setting: \sigma_s = 20 (domain kernel standard deviation). The filtered images (a), (b), (c), (d) and (e) correspond to the architectures shown in Figs. 4, 5, 6, 7 and 3, respectively.
Fig. 14. Filter output for a checkerboard image of size 150x150 under additive Gaussian noise. Filter settings: \sigma_s = 20, \sigma_r = 50 and \sigma = 12 for the additive Gaussian noise [18].
12. Sriram, V., Kearney, D.: A FPGA implementation of variable kernel convolution. pp. 105-109 (2007). doi:10.1109/.45
13. Zhang, H., Xia, M., Hu, G.: A Multiwindow Partial Buffering Scheme for FPGA-Based 2-D Convolvers. vol. 54, issue 2, pp. 200-204 (2007).
14. Mohammad, K., Agaian, S.: Efficient FPGA implementation of convolution. pp. 3478-3483 (October 2009).
15. Ramírez, J.M., Flores, E.M., Martínez-Carballido, J., Enríquez, R.: An FPGA-based Architecture for Linear and Morphological Image Filtering. pp. 90-95, issue 3 (2010).
16. Gonzalez, R.C., Woods, R.E.: Digital Image Processing, 3rd edn. Pearson (2008). ISBN-13: 978-8131726952.
17. Hwang, J., Ballagh, J.: Building Custom FIR Filters Using System Generator. Springer Berlin Heidelberg, LNCS vol. 2438, pp. 1101-1104 (2002).
18. Pal, C., Chaudhury, K.N., Samanta, A., Chakrabarti, A., Ghosh, R.: Hardware software co-design of a fast bilateral filter in FPGA. In: 2013 Annual IEEE India Conference (INDICON), pp. 1-6 (2013). ISBN 978-1-4799-2274-1. doi:10.1109/INDCON.2013.6726034.
19. HDL Verifier, MathWorks. www.mathworks.com/products/hdl-verifier