I. INTRODUCTION

Image processing is integral to deciphering the intelligence associated with an image. Image processing generally involves handling large amounts of data, and modelling human intelligence places heavy computational demands on automated processes. Handling images involves loss of the original information at various stages of the environment and the processing chain. One such stage is the processing of an image for feature extraction [1]. Prior to this, pre-processing an image to recover the original information from non-Gaussian noise corruption is carried out mostly by nonlinear digital filters, dominant among which are median-based filters.

The median filter is robust to impulse noise; however, the development of median filtering algorithms has not addressed the requirements of real-time intelligent systems. Generic processing ICs do not provide a cost-effective solution for image processing because of predefined architectural limitations [2]. In addition, there is always a trade-off between the quality of the information contained in images and the resources required to handle the images. FPGAs are sufficiently flexible and cost effective for prototyping and reconfiguring applications [3] and therefore provide ample opportunity for developing application-specific architectures that cater to real-time requirements.

… programmable digital signal processing, ASICs, and reconfigurable architectures. In addition, the proposed architecture does not impose any constraint on the time required for reading and processing pixels.

II. MEDIAN FILTER MEETING REAL-TIME REQUIREMENTS

Vega-Rodríguez et al. [4] have presented an architecture for a basic median filter as a systolic-array implementation [10]. The architecture employs pipelining and parallelism and is implemented with an FPGA as the target device. The FPGA is interfaced with a computer through a 32-bit PCI port for real-time interfacing and better human interaction. Every read instruction on a 32-bit system can read
four pixels, each 8 bits wide. Reading multiple pixels in a single read cycle, combined with parallelism in the systolic array, leads to the simultaneous production of four filtered pixels. The parallel and pipelined median filter architecture is shown in Fig. 1. Smith's network [10] introduces parallelism and pipelining by splitting the nine-level systolic array into two stages. The first stage in Fig. 1 is called the elementary sorting stage, or E-stage. The next stage is called the network sorting stage, or N-stage, with six levels of comparators as shown in Fig. 2; it is also denoted as the Network Node in the figure.

Fig. 2. Part of the systolic array optimized by Smith [10].
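The two-stage split can be pictured in software as a first phase that sorts small groups of samples (in the spirit of the E-stage) followed by a second phase that combines the group results with a few more comparisons (in the spirit of the N-stage). The Python sketch below uses the well-known shortcut that the median of a 3x3 window equals the median of the maximum of the column minima, the median of the column medians, and the minimum of the column maxima; it illustrates the idea only and does not claim to reproduce the exact comparator wiring of the E-stage and N-stage in Figs. 1 and 2.

```python
def sort3(a, b, c):
    """Sort three values with three compare-exchange operations."""
    if a > b: a, b = b, a
    if b > c: b, c = c, b
    if a > b: a, b = b, a
    return a, b, c

def median9(window):
    """Median of a 3x3 window given as a list of three rows of three pixels."""
    # Phase 1: sort each column independently (elementary group sorting).
    cols = [sort3(window[0][j], window[1][j], window[2][j]) for j in range(3)]
    # Phase 2: combine the sorted columns with a small comparison network.
    lo  = max(c[0] for c in cols)                        # largest of the minima
    mid = sort3(cols[0][1], cols[1][1], cols[2][1])[1]   # median of the medians
    hi  = min(c[2] for c in cols)                        # smallest of the maxima
    return sort3(lo, mid, hi)[1]
```

For example, median9([[7, 200, 3], [5, 9, 250], [1, 8, 6]]) returns 7, the median of the nine samples.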
The output pixel value is calculated by considering the eight neighbors of a center pixel under a 3x3 moving mask. In order to read four pixels of the input image in one machine cycle, the first three rows are considered at the start. Within these three rows, the first four columns yield three row vectors of four pixels each. Each row vector, being a set of four 8-bit pixels, forms one 32-bit word, so three words are available for transfer to the architecture before processing starts. Three consecutive read operations therefore transfer a 3x4 matrix to the architecture. The last two columns of the previously processed 3x4 matrix are sorted and stored; these stored elements are combined with the incoming 3x4 matrix to form a 3x6 matrix. Consequently, four 3x3 masks can be placed simultaneously with respect to four center pixels.
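As an illustration of this word-oriented data handling, the sketch below (Python, for exposition only) unpacks three 32-bit row words into a 3x4 block, joins it with the two stored columns of the previous block to form a 3x6 window, and takes four 3x3 medians. The packing order of pixels within a word, the use of statistics.median in place of the hardware sorting network, and the omission of the pre-sorting of the stored columns and of image-border handling are all simplifying assumptions.

```python
from statistics import median

def unpack_row_word(word):
    """Split one 32-bit word into four 8-bit pixels (assumed little-endian packing)."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]

def filter_step(stored_cols, row_words):
    """stored_cols: 3x2 block, the last two columns of the previous 3x4 block.
       row_words:   three 32-bit words, one per row, each carrying four pixels.
       Returns four median outputs and the 3x2 block to store for the next step."""
    new_block = [unpack_row_word(w) for w in row_words]               # 3x4 block
    window = [stored_cols[r] + new_block[r] for r in range(3)]        # 3x6 window
    outputs = [median(window[r][c + k] for r in range(3) for k in range(3))
               for c in range(4)]                                     # four 3x3 medians
    return outputs, [row[-2:] for row in new_block]                   # keep last 2 columns
```

With this scheme a pixel is read once per three-row band, and each interior pixel belongs to three overlapping bands, which is consistent with the three reads per pixel listed for Vega's architecture in Table I.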
III. ARCHITECTURE WITH REDUCED DATA HANDLING

A fast median filtering scheme and a new architecture for hardware implementation on a 32-bit system are proposed. The proposed scheme is common to both reconfigurable architectures and general-purpose computing platforms.

A. The median filtering scheme for effective data handling and parallel filtering

The proposed scheme is shown in Fig. 3. The objective is to handle pixels effectively so as to reduce repeated read operations. Beginning with the input image, the first four rows of pixels are selected, and the pixels in the selected rows are grouped as column vectors of four pixels each. Each column vector is read as one 32-bit data word. Four column vectors are presented as a single 4x4 matrix on which 3x3 masks can be applied with respect to four center pixels. Only two words are transferred, instead of three or more, to complete one processing cycle, and only a single processing cycle is required to generate four filtered outputs; these four outputs are generated by four filters, F1 to F4. The two words can be transferred in one or two machine cycles depending on the capability of the system; for simplicity, reading two words per machine cycle is considered here. The first two column vectors are transferred in cycle 1 and stored in elements S1 and S2, each of size 4x1. In the next machine cycle, the third and fourth column vectors of the selected rows are transferred and combined with the previously stored columns to form a 4x4 matrix. In this process, the elements of the 4x4 matrix are grouped in such a way that four 3x3 sets are available with respect to the pixels encircled in Fig. 3; each encircled pixel is the center pixel of an 8-neighborhood. By median filtering each set, four output pixels, P1 to P4, are generated. Once the filtering process is over, the latest two column vectors are stored in S1 and S2, ready for concatenation with the new incoming set of vectors in the subsequent processing cycle. This process repeats until the last column vector is processed and the first set of two rows is made available in
the output image. For the selection of the next set of rows, only rows 3 to 6 need to be considered; the above process is repeated to obtain the next two rows of the output image.

Fig. 3. Proposed two-stage pipelining scheme.
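A software sketch of this data-handling scheme is given below (Python, for exposition only). It assumes a particular packing order of the four band pixels inside each 32-bit column word and an output order P1 to P4 that scans the 2x2 block row-wise; it uses statistics.median instead of the E-stage/N-stage hardware and ignores image borders.

```python
from statistics import median

def unpack_col_word(word):
    """Split one 32-bit word into a column of four 8-bit pixels (assumed packing)."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]

def process_band(column_words):
    """column_words: one 32-bit word per image column, each packing the four pixels
       of the current 4-row band. Yields the 2x2 output blocks P1..P4 produced as
       the 4x4 matrix slides two columns at a time."""
    s1 = unpack_col_word(column_words[0])        # stored column vector S1
    s2 = unpack_col_word(column_words[1])        # stored column vector S2
    for j in range(2, len(column_words) - 1, 2):
        c3 = unpack_col_word(column_words[j])
        c4 = unpack_col_word(column_words[j + 1])
        block = [[s1[r], s2[r], c3[r], c4[r]] for r in range(4)]   # 4x4 matrix
        # Four 3x3 masks centred on the four inner pixels of the 4x4 matrix.
        outputs = [median(block[r + dr][c + dc] for dr in range(3) for dc in range(3))
                   for r in range(2) for c in range(2)]            # P1..P4
        yield outputs
        s1, s2 = c3, c4      # latest two columns are kept for the next cycle
```

Each band contributes two rows of the output image, and the next band reuses rows 3 to 6 of the input, so every interior input pixel is handled in exactly two bands, consistent with the read count of two per pixel reported for CPMA in Table I.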
… holding the sorted set of values produced by the E-stage. All four filtered output pixels are organized as one word, and the word is subsequently loaded into memory by the write operation. CPMA therefore needs four machine cycles to complete the read, E-stage, N-stage, and write operations. Note that the machine cycles of this section are different from those of the data-handling scheme described in the previous section.
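The write-back side of this description can be sketched in the same spirit: the four filtered pixels are packed into one 32-bit word so that a single write operation stores them. The packing order below is an assumption chosen to mirror the read-side unpacking in the earlier sketches.

```python
def pack_output_word(p1, p2, p3, p4):
    """Pack four 8-bit filtered pixels into one 32-bit word for a single memory write."""
    word = 0
    for i, pixel in enumerate((p1, p2, p3, p4)):
        word |= (pixel & 0xFF) << (8 * i)
    return word
```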
TABLE I
3X3 MEDIAN FILTER IN XILINX FPGA VIRTEX 4 XC4VSX25 (N = 9, INPUT SAMPLE WIDTH = 8 BITS)

Performance Metrics                             | LCBP [5] | Cadenas's [8]   | Smith's [10] | FM-WCA [11] | Vega's (*)   | CPMA (*)
CLB                                             | 459      | 254 / 284       | 1552         | 1552        | 3790 (947.5) | 3770 (942.5)
DFF                                             | 516      | 507 / 478       | 368          | 344         | 200 (50)     | 224 (56)
LUT                                             | 632      | 336 / 567       | 152          | 152         | 832 (208)    | 832 (208)
fmax (MHz)                                      | 327      | 332 / 286 / 335 | 454          | 454         | 454 (1816)   | 454 (1816)
Latency (clock cycles)                          | 8        | 7               | 9            | 8           | 9            | 9
Throughput (median outputs per clock cycle)     | 1        | 1               | 1            | 1           | 4 (1)        | 4 (1)
No. of times a pixel of the input image is read | 9        | 9               | 9            | 9           | 3            | 2
Pipelined (P) / Pipelined & Parallel (P&P)      | P        | P               | P            | P           | P&P          | P&P
* Resources required per output pixel
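The starred per-output-pixel figures for the parallel designs follow from dividing the resource counts by the throughput of four medians per clock cycle; a quick check (illustrative script, values copied from Table I):

```python
# CLB counts from Table I divided by the 4 median outputs produced per clock cycle.
for name, clb in [("Vega's", 3790), ("CPMA", 3770)]:
    print(name, clb / 4)   # 947.5 and 942.5 CLBs per output pixel
```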
… number of pixel read operations are executed to produce the filtered output image. In CPMA, an input image pixel is read and transferred to the architecture only twice, which means that only 2 x m x n pixel read operations need to be executed to produce the filtered image. The proposed CPMA thus handles only two-thirds of the data handled by the best schemes available in the literature; the implementations presented in [5], [8], [10] and [11] require that each pixel of the input image be handled nine times.
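The data-handling comparison above reduces to simple counting; the snippet below (illustrative only, with a hypothetical 512 x 512 image) reproduces the read counts and the 33.3% and 66.6% ratios quoted in this section.

```python
m, n = 512, 512                                   # hypothetical image size
reads = {
    "pipelined [5], [8], [10], [11]": 9 * m * n,  # every pixel read nine times
    "Vega's [4]":                     3 * m * n,  # every pixel read three times
    "CPMA":                           2 * m * n,  # every pixel read twice
}
for name, count in reads.items():
    print(f"{name}: {count} pixel reads")
print("Vega's / pipelined:", 3 / 9)               # ~0.333
print("CPMA / Vega's:",      2 / 3)               # ~0.666
```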
Table I lists the latency cycles of the pipelined designs, followed by the throughput, defined as the number of median outputs generated per clock cycle. The fast median-finding word comparator array (FM-WCA) [11] is the latest median-finding word comparator array and is reported to be the fastest among the pipelined architectures, such as the low hardware complexity pipelined rank filter (LCBP) [5] and Cadenas's method [8]; Cadenas's method, however, has fewer latency cycles. FM-WCA uses 7% fewer DFFs and has a smaller latency than Smith's. LCBP, Cadenas's, Smith's and FM-WCA are implemented as pipelined architectures, so their throughput is the same and cannot be improved further at the same speed. Parallelism with pipelining is introduced in Vega's architecture and in CPMA to improve the throughput at 454 MHz. The introduction of parallelism reduces the amount of image data to be handled by these architectures: Vega's architecture handles only 33.3% of the image data handled by the pipelined methods, and CPMA handles only 66.6% of the image data handled by Vega's architecture. It is evident from Table I that a pixel of the input image is read only twice during the complete filtering of the input image. The per-throughput resources of CPMA are fewer than those of Vega's architecture, and its speed of 1816 MHz is significantly higher than that of FM-WCA and the other methods. Table II presents the implementation of the sorting-based methods for evaluating resource utilization and speed on a state-of-the-art prototyping platform. The performance of CPMA is further confirmed by the Xilinx FPGA Virtex 7 implementation: CPMA uses at least 30% less hardware than FM-WCA and offers four times its throughput. Similarly, CPMA uses 2.5% fewer logic resources and handles 33.3% less image data than the architecture in [4] for the same throughput.
V. CONCLUSION

Computationally intensive median filtering algorithms are a challenge in the context of real-time processing. For the basic median filter, optimization of the amount of data handled at the architecture level, together with pipelining and parallelism of the existing systolic array, has been considered. While the number of filter processing cycles is fixed by the size of the image to be filtered, median filtering by the proposed architecture requires reduced data handling. The proposed architecture saves a significant amount of resources and of the time required to handle median filtering of images. Further optimization of pipelining and parallelism may lead to further improvement in the context of real-time processing.

ACKNOWLEDGMENT

This work is partially supported by the project "Capacity building in the areas of EPDPT" of MeitY, Govt. of India, implemented by NIELIT Chennai Centre, and by the scholarship scheme "Visvesvaraya PhD Scheme for Electronics and IT" of MeitY, Govt. of India, granted through VIT University Chennai Campus, Chennai, Tamil Nadu, India.

REFERENCES

[1] Baxes, Gregory A. Digital Image Processing: Principles and Applications. New York: Wiley, 1994.
[2] Smith, Michael John Sebastian. Application-Specific Integrated Circuits. Addison-Wesley Professional, 2008.
[3] Hauck, Scott. "The roles of FPGAs in reprogrammable systems." Proceedings of the IEEE 86.4 (1998): 615-638.
[4] Vega-Rodríguez, Miguel A., Juan M. Sánchez-Pérez, and Juan A. Gómez-Pulido. "An FPGA-based implementation for median filter meeting the real-time requirements of automated visual inspection systems." Proc. 10th Mediterranean Conf. Control and Automation. 2002.
[5] Prokin, Dragana, and Milan Prokin. "Low hardware complexity pipelined rank filter." IEEE Transactions on Circuits and Systems II: Express Briefs 57.6 (2010): 446-450.
[6] Cadenas, J., et al. "Fast median calculation method." Electronics Letters 48.10 (2012): 558-560.
[7] Cadenas, J. "Pipelined median architecture." Electronics Letters 51.24 (2015): 1999-2001.
[8] Cadenas, José O., Graham M. Megson, and Robert Simon Sherratt. "Median filter architecture by accumulative parallel counters." IEEE Transactions on Circuits and Systems II: Express Briefs 62.7 (2015): 661-665.
[9] Morcego, B., J. Frau, and A. Català. "Suavizado de Imágenes en Tiempo Real mediante Filtrado por Mediana Utilizando Arrays Sistólicos." Proc. of VII DCIS (1992): 545-546.
[10] Smith, John L. "Implementing median filters in XC4000E FPGAs." Xilinx Xcell 23 (1996): 16.
[11] Subramaniam, Janarthanam, Jagadeesh Kannan Raju, and David Ebenezer. "Fast median-finding word comparator array." Electronics Letters (2017).
TABLE II
3X3 MEDIAN FILTER IN XILINX FPGA VIRTEX 7 XC7VX330T

Performance Metrics | Smith's | FM-WCA | Vega's (*)   | CPMA (*)
Slices              | 360     | 349    | 802 (200.5)  | 782 (195.5)
DFF                 | 96      | 80     | 200 (50)     | 224 (56)
LUT                 | 326     | 330    | 865 (216.25) | 851 (212.75)
fmax (MHz)          | 631     | 631    | 631 (2524)   | 631 (2524)
* Resources required per output pixel