
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/LES.2017.2771453, IEEE Embedded Systems Letters

Parallel and Pipelined 2D Median Filter Architecture
Janarthanam Subramaniam, Raju Jagadeesh Kannan and David Ebenezer, Senior Member, IEEE

Abstract—The existing two-dimensional median filters in the literature are computationally intensive. This letter proposes to reduce the amount of data handled in the architecture-level realization of the basic median filtering operation on images. The proposed architecture reads 4 pixels of the input image at a time, the 4 pixels forming one word on a 32-bit hardware processing system; the subsequent processing is carried out by a parallel and pipelined median filter architecture. Two read operations supply 8 input pixels, which results in the generation of 4 output pixels after an initial latency. The proposed architecture offers a reduced number of read operations and increased speed.

Index Terms—Median Filter, Pipelined Median Filter, Systolic Arrays, Parallel Median Filter.

Janarthanam Subramaniam is with SENSE, VIT University Chennai Campus, Chennai, Tamil Nadu, India and Scientist, VLSI & Embedded Systems Group, NIELIT Chennai, Chennai-600025, Tamil Nadu, India (e-mail: [email protected], [email protected]).
Raju Jagadeesh Kannan is with SCSE, VIT University Chennai Campus, Chennai – 600127, Tamil Nadu, India (e-mail: [email protected]).
David Ebenezer is with Dept. of ECE, Anna University, Chennai – 600025, Tamil Nadu, India (e-mail: [email protected]).

1943-0663 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

I. INTRODUCTION

Image processing is integral to deciphering the intelligence associated with an image. Generally, image processing involves handling huge amounts of data, and modelling human intelligence demands a heavy amount of computation from automated processes. Handling images involves loss of original information at various levels through the environment and the process stages. One such level is the processing of an image for feature extraction [1]. Prior to this, pre-processing an image to retrieve the original information from non-Gaussian noise corruption is carried out mostly by nonlinear digital filters. Dominant among these are median-based filters.

The median filter provides robustness to impulse noise; however, the development of median filtering algorithms does not include the requirements of real-time intelligence systems. Generic processing ICs do not provide a cost-effective solution for image processing because of predefined architectural limitations [2]. In addition, there is always a trade-off between the quality of information contained in images and the resources required to handle the images. FPGAs are sufficiently flexible and cost effective for prototyping and reconfiguring applications [3] and, therefore, provide sufficient opportunity for the development of application-specific architectures which cater to real-time requirements.

The basic median filter can be implemented on specific architectures by performing the median operation through sorting-based systolic arrays [4] and non-sorting-based techniques [5]-[8]. Systolic arrays are continuously optimized by researchers [9]-[11]. The architecture developed by Vega-Rodríguez MA et al. [4] exploits 32-bit data-width hardware for image data transfer; they have implemented the architecture as a systolic array which is reported in [9] and [10].

The architecture proposed in this letter handles image data in such a way that a pixel in the input image is read only twice, in comparison with architectures reported in the literature wherein a pixel is read three or more times. Reducing the number of read operations for filtering an image reduces the overall operating time. The proposed architecture also offers the advantage that it can be employed for different word-length realizations, programmable digital signal processing, ASICs, and reconfigurable architectures. In addition, the proposed architecture does not impose any constraint on the time required for reading and processing pixels.

II. MEDIAN FILTER MEETING REAL-TIME REQUIREMENTS

Fig. 1. Real-time median filter proposed by Vega-Rodríguez MA et al. [4]

Vega-Rodríguez MA et al. [4] have presented an architecture for the basic median filter for systolic array implementation [10]. The architecture employs pipelining and parallelism. It is implemented with an FPGA as the target device. The FPGA is interfaced with a computer through a 32-bit PCI port for real-time interfacing and better human interaction. Every read instruction on a 32-bit system can read
4 pixels, each 8 bits wide. Reading multiple pixels in a single read cycle, together with parallelism in the systolic array, leads to the simultaneous production of 4 filtered pixels. The parallel and pipelined median filter architecture is shown in Fig. 1. Smith JL's network [10] introduces parallelism and pipelining by splitting the nine-level systolic array into two stages. The first stage in Fig. 1 is called the elementary sorting stage, a.k.a. E-stage. The next stage is called the network sorting stage, a.k.a. N-stage, with 6 levels of comparators as shown in Fig. 2. It is also denoted as Network Node in the figure.

Fig. 2. Part of systolic array optimized by Smith JL [10]

The output pixel value is calculated by considering the 8 neighbors of a center pixel under a 3x3 moving mask. In order to read 4 pixels of the input image in a machine cycle, the first three rows are considered at the start. Within these three rows, the first 4 columns yield three row vectors, each vector containing 4 pixels. Each row vector, a set of four 8-bit pixels, forms a 32-bit word. Before the start of processing, three words are available for transfer to the architecture. Three consecutive read operations result in the transfer of a 3x4 matrix to the architecture. The last two columns of every previously processed 3x4 matrix are sorted and stored. These stored elements are combined with the incoming 3x4 matrix to form a 3x6 matrix. Consequently, four 3x3 masks can be placed simultaneously with respect to four center pixels.

In the architecture, the arrangement of 4 parallel network nodes produces 4 filtered pixels at a time. A set of three read operations, one cycle of filtering, and one cycle of write operation is the sequence of operations required for completing one cycle. The parallelism uses only 52 comparators for producing 4 filtered output pixels. This obviates the need for the 76 comparator operations employed in the approaches reported in the literature.

III. ARCHITECTURE WITH REDUCED DATA HANDLING

A fast median filtering scheme and a new architecture for hardware implementation on a 32-bit system are proposed. The proposed scheme is common to both reconfigurable architectures and general-purpose computing platforms.

A. The median filtering scheme for effective data handling and parallel filtering

Fig. 3. Proposed two stage pipelining scheme

The proposed scheme is shown in Fig. 3. The objective is to handle pixels effectively so as to reduce repeated read operations. Beginning with the input image, the first four rows of pixels are selected; the pixels in the selected rows are grouped as column vectors of 4 pixels each. Each column vector is read as one 32-bit data word. Four column vectors are presented as a single 4x4 matrix in which 3x3 masks can be applied with respect to four center pixels. Only two words are transferred, instead of three or more, for completing one processing cycle. Only a single processing cycle is required for generating four filtered outputs; these four outputs are generated by four filters, namely, F1 to F4. The two words can be transferred in one or two machine cycles depending on the capability of the system. For simplicity, reading two words per machine cycle is considered. The first two column vectors are transferred in cycle 1 and the words are stored in elements S1 and S2, each of size 4x1. In the next machine cycle, the third and fourth column vectors of the selected rows are transferred and combined with the previously stored columns to form a 4x4 matrix. In this process, the elements of the 4x4 matrix are grouped in such a way that four 3x3 sets are available with respect to the pixels encircled in Fig. 3. Each encircled pixel is the center pixel of an 8-neighborhood. By median filtering each set, 4 output pixels, namely, P1 to P4, are generated. Once the filtering process is over, the latest two column vectors are stored in S1 and S2, ready for concatenation with the new incoming set of vectors in the subsequent processing cycles. This process repeats until the last column vector is processed and the first set of two rows is made available in


the output image. For the selection of the next set of rows, only rows 3 to 6 need be considered; the above process is repeated for finding the next two rows of the output image.

B. Column-vectors Processing Median filter Architecture

Fig. 4. Proposed CPMA

The column-vectors processing median filtering architecture (CPMA) is based on the proposed scheme for effective data handling and parallel filtering. CPMA requires 8 pixels, I(0-1, 0-3), at the input. These 8 pixels are transferred as two 32-bit words. These are the column vectors of input pixels, I(0, 0-3) and I(1, 0-3). Input pixels representing the first and second columns are sorted in ascending order and stored; the sorting is carried out by the E-stage. After the E-stage, 24 values in 8 different sorted sets are available. These sets provide inputs to four network nodes operating in parallel in the N-stage. N1 receives values from I(-1, 0-2), I(0, 0-2) and I(1, 0-2), similar to F1 in Fig. 3. N2, N3 and N4 also receive values from the input pixel matrix as shown in Fig. 4, similar to filters F2, F3 and F4. The E-stage and the N-stage are each allocated one machine cycle, although the E-stage has only three stages of compare-and-swap operations in comparison with 6 stages of compare-and-swap operations in the N-stage; this is a trade-off between storage elements and speed. However, the speed depends on the user's decision whether or not to split the N-stage. Pipelining between the E-stage and the N-stage requires 12 storage elements for holding the sorted set of values produced by the E-stage. All four filtered output pixels are organized as a word; the word is subsequently loaded to the memory for the write operation. CPMA, therefore, needs four machine cycles to complete the read, E-stage, N-stage and write operations. It is to be noted that the machine cycles of this section are different from those of the scheme for effective data handling described in the previous section.

IV. HARDWARE IMPLEMENTATION AND PERFORMANCE ANALYSIS

For the purpose of verifying the functionality of CPMA, the design is implemented in RTL Verilog HDL using Xilinx ISE Design Suite 14.7 and synthesized for implementation in a Xilinx Virtex 4 XC4VSX25 FPGA. This FPGA is chosen for straightforward comparison with the state-of-the-art non-sorting methods [5], [8], sorting methods [10], [11] and the sorting-based parallel architecture [4] available in the literature. The comparison is in terms of resource utilization and speed. The results are presented in Table I. The schemes in Table I employ 3x3 masks. Input sample width is 8 bits. The focus of the methods presented in [5], [8], [10], [11] is on the design of new median-finding methods with improved resource and time metrics and their pipelined architectures. For each cycle, these methods require 9 input values and use the resources optimally. Although the proposed CPMA is based on the existing sorting method in [4], parallelism and pipelining reduce repeated comparator operations on the input pixels. The proposed CPMA also reduces the number of times a pixel is read and transferred for processing.

TABLE I
3X3 MEDIAN FILTER IN XILINX FPGA VIRTEX 4 XC4VSX25 (N=9, INPUT SAMPLE WIDTH=8 BITS)

Performance Metrics | LCBP [5] | Cadenas's [8] | Smith's [10] | FM-WCA [11] | Vega's (*) | CPMA (*)
CLB | 459 | 254 / 284 | 1552 | 1552 | 3790 (947.5) | 3770 (942.5)
DFF | 516 | 507 / 478 | 368 | 344 | 200 (50) | 224 (56)
LUT | 632 | 336 / 567 | 152 | 152 | 832 (208) | 832 (208)
fmax (MHz) | 327 | 332 / 286 / 335 | 454 | 454 | 454 (1816) | 454 (1816)
Latency (clock cycles) | 8 | 7 | 9 | 8 | 9 | 9
Throughput (median outputs per clock cycle) | 1 | 1 | 1 | 1 | 4 (1) | 4 (1)
No. of times a pixel in the input image is read | 9 | 9 | 9 | 9 | 3 | 2
Pipelined (P) / Pipelined & Parallel (P&P) | P | P | P | P | P&P | P&P
* Resources required per output pixel

The architecture in [4] needs 12 pixel values as input. CPMA needs only 8 pixels as input. Reading two words per cycle requires only one machine cycle for reading input pixels from memory, i.e., 8 pixels are transferred in two 32-bit words. In comparison, the scheme in [4] needs two reading cycles before an execution cycle. The extra read operation adds one more machine cycle, which is a 25% extra burden on the overall processing time in comparison with the proposed architecture.

Consider an image with m x n pixels. In the scheme proposed in [4], every pixel is read and transferred to the median-finding architecture three times, i.e., 3 x m x n


pixel read operations are executed for producing the filtered output image. In CPMA, an input image pixel is read and transferred to the architecture only twice, which means only 2 x m x n pixel read operations need to be executed for producing the filtered image. The proposed CPMA handles only two-thirds of the data in comparison with the best schemes available in the literature. The implementations presented in [5], [8], [10] and [11] require that each pixel of the input image be handled nine times.

Table I lists the latency cycles in the pipelined designs. Throughput is displayed next. Throughput is defined as the number of median outputs generated per clock cycle. The fast median-finding word comparator array (FM-WCA) [11] is the latest median-finding word comparator array and is reported to be the fastest among the pipelined architectures, for example, the low hardware complexity pipelined rank filter (LCBP) [5] and Cadenas's method [8]. However, Cadenas's method has fewer latency cycles. FM-WCA uses 7% fewer DFFs and has a smaller latency than Smith's. LCBP, Cadenas's, Smith's and FM-WCA are implemented as pipelined architectures. Throughput is the same for all these methods and is not increased at the same clock speed. Parallelism with pipelining is introduced in Vega's architecture and CPMA for the improvement of throughput at 454 MHz. The introduction of parallelism has reduced the amount of image data to be handled by these architectures. Vega's architecture handles only 33.3% of the image data handled by the pipelined methods, and CPMA handles only about two-thirds (66.7%) of the image data handled by Vega's architecture. It is evident from Table I that a pixel in the input image is read only twice in the complete filtering process. Per-throughput resources of CPMA are comparatively fewer than those of Vega's architecture, and the effective speed of 1816 MHz (four outputs per cycle at 454 MHz) is significantly higher than that of FM-WCA and other methods. Table II presents the implementation of sorting-based methods for evaluating resource utilization and speed on a state-of-the-art prototyping platform. The performance of CPMA is further confirmed by the Xilinx Virtex 7 FPGA implementation. Per output pixel, CPMA uses at least 30% less hardware resources than FM-WCA and offers four times the throughput of FM-WCA. Similarly, CPMA uses 2.5% less logic resources and handles 33.3% less image data for the same throughput than the architecture in [4].

V. CONCLUSION

Computationally intensive median filtering algorithms are a challenge in the context of real-time processing. For the basic median filter, optimization of the amount of data handled at the architecture level, with pipelining and parallelism of an existing systolic array, is considered. While the number of filter processing cycles is fixed by the size of the image to be filtered, median filtering by the proposed architecture requires reduced data handling. The proposed architecture saves a significant amount of the resources and time required for median filtering of images. Further optimization of pipelining and parallelism may lead to further improvement in the context of real-time processing.

ACKNOWLEDGMENT

This work is partially supported by the project "Capacity building in the areas of EPDPT" of MeitY, Govt. of India, implemented by NIELIT Chennai Centre, and the scholarship scheme "Visvesvaraya PhD Scheme for Electronics and IT" of MeitY, Govt. of India, granted through VIT University Chennai Campus, Chennai, Tamil Nadu, India.

REFERENCES

[1] Baxes, Gregory A. Digital image processing: principles and applications. New York: Wiley, 1994.
[2] Smith, Michael John Sebastian. Application-specific integrated circuits. Addison-Wesley Professional, 2008.
[3] Hauck, Scott. "The roles of FPGAs in reprogrammable systems." Proceedings of the IEEE 86.4 (1998): 615-638.
[4] Vega-Rodríguez, Miguel A., Juan M. Sánchez-Pérez, and Juan A. Gómez-Pulido. "An FPGA-based implementation for median filter meeting the real-time requirements of automated visual inspection systems." Proc. 10th Mediterranean Conf. Control and Automation. 2002.
[5] Prokin, Dragana, and Milan Prokin. "Low hardware complexity pipelined rank filter." IEEE Transactions on Circuits and Systems II: Express Briefs 57.6 (2010): 446-450.
[6] Cadenas, J., et al. "Fast median calculation method." Electronics Letters 48.10 (2012): 558-560.
[7] Cadenas, J. "Pipelined median architecture." Electronics Letters 51.24 (2015): 1999-2001.
[8] Cadenas, José O., Graham M. Megson, and Robert Simon Sherratt. "Median filter architecture by accumulative parallel counters." IEEE Transactions on Circuits and Systems II: Express Briefs 62.7 (2015): 661-665.
[9] Morcego, B., J. Frau, and A. Català. "Suavizado de Imágenes en Tiempo Real mediante Filtrado por Mediana Utilizando Arrays Sistólicos." Proc. of VII DCIS (1992): 545-546.
[10] Smith, John L. "Implementing median filters in xc4000e fpgas." Xilinx Xcell 23 (1996): 16.
[11] Subramaniam, Janarthanam, Jagadeesh Kannan Raju, and David Ebenezer. "Fast median-finding word comparator array." Electronics Letters (2017).
TABLE II
3X3 MEDIAN FILTER IN XILINX FPGA VIRTEX 7 XC7VX330T

Performance Metrics | Smith's | FM-WCA | Vega's (*) | CPMA (*)
Slices | 360 | 349 | 802 (200.5) | 782 (195.5)
DFF | 96 | 80 | 200 (50) | 224 (56)
LUT | 326 | 330 | 865 (216.25) | 851 (212.75)
fmax (MHz) | 631 | 631 | 631 (2524) | 631 (2524)
* Resources required per output pixel
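
The parenthesized entries in the starred columns of Tables I and II appear to follow from the table totals: each resource count is divided by the four outputs produced per clock cycle, while the parenthesized fmax is the clock rate multiplied by four. The short Python sketch below is not part of the original letter; it only recomputes those entries, and the dictionary layout and the names `per_output` and `saving_percent` are illustrative assumptions.

```python
OUTPUTS_PER_CYCLE = 4  # Vega's architecture and CPMA both emit four medians per clock

# Totals copied from Table I (Virtex 4) and Table II (Virtex 7).
TOTALS = {
    "Virtex 4": {
        "Vega": {"CLB": 3790, "DFF": 200, "LUT": 832, "fmax_MHz": 454},
        "CPMA": {"CLB": 3770, "DFF": 224, "LUT": 832, "fmax_MHz": 454},
    },
    "Virtex 7": {
        "Vega": {"Slices": 802, "DFF": 200, "LUT": 865, "fmax_MHz": 631},
        "CPMA": {"Slices": 782, "DFF": 224, "LUT": 851, "fmax_MHz": 631},
    },
}


def per_output(totals, outputs=OUTPUTS_PER_CYCLE):
    """Recompute the parenthesized table entries for one design."""
    scaled = {}
    for metric, value in totals.items():
        if metric == "fmax_MHz":
            scaled[metric] = value * outputs   # effective output rate
        else:
            scaled[metric] = value / outputs   # resources per output pixel
    return scaled


def saving_percent(reference, candidate):
    """Percentage by which `candidate` is smaller than `reference`."""
    return 100.0 * (reference - candidate) / reference


if __name__ == "__main__":
    for device, designs in TOTALS.items():
        for name, totals in designs.items():
            print(device, name, per_output(totals))
    # Total slices on Virtex 7: CPMA versus Vega's architecture.
    print(round(saving_percent(802, 782), 1))
    # DFFs per output pixel on Virtex 7: CPMA (56) versus FM-WCA (80).
    print(saving_percent(80, 56))
```

Under this reading, the 2.5% figure quoted against the architecture in [4] corresponds to total slices in Table II, and the "at least 30%" figure against FM-WCA corresponds to the per-output DFF count, the smallest of the three per-output savings.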

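
The column-vector scheme of Section III-A and the E-stage/N-stage split of Section III-B can be imitated in software to check the "read only twice" property. The following is a minimal Python sketch under stated assumptions, not the authors' RTL: image dimensions are assumed even, border pixels are copied unchanged (the letter does not discuss borders), and the N-stage uses the well-known three-column reduction for a median of nine (largest of the column minima, median of the column medians, smallest of the column maxima), which is assumed here to play the role of the network node. The names `sort3`, `network_node` and `cpma_sketch` are hypothetical.

```python
def sort3(a, b, c):
    """Sort three values in ascending order with three compare-and-swap steps."""
    if a > b:
        a, b = b, a
    if b > c:
        b, c = c, b
    if a > b:
        a, b = b, a
    return a, b, c


def network_node(t0, t1, t2):
    """Median of nine values given as three ascending triples (one per column)."""
    largest_min = max(t0[0], t1[0], t2[0])
    median_mid = sort3(t0[1], t1[1], t2[1])[1]
    smallest_max = min(t0[2], t1[2], t2[2])
    return sort3(largest_min, median_mid, smallest_max)[1]


def cpma_sketch(img):
    """Filter the interior of img (m x n list of lists, m and n even, >= 4).

    Returns (filtered image, per-pixel read counts). Border pixels are copied
    unchanged, which is an assumption of this sketch.
    """
    m, n = len(img), len(img[0])
    out = [row[:] for row in img]
    reads = [[0] * n for _ in range(m)]

    def read_word(r0, c):
        # One 32-bit word: the 4x1 column vector at column c, rows r0..r0+3.
        for r in range(r0, r0 + 4):
            reads[r][c] += 1
        return [img[r][c] for r in range(r0, r0 + 4)]

    def e_stage(col):
        # Sorted upper triple (rows 0-2) and lower triple (rows 1-3).
        return sort3(*col[0:3]), sort3(*col[1:4])

    for r0 in range(0, m - 3, 2):              # four-row bands, advancing two rows
        s1 = e_stage(read_word(r0, 0))         # cycle 1 only fills S1 and S2
        s2 = e_stage(read_word(r0, 1))
        for c in range(2, n - 1, 2):           # two new words per later cycle
            a = e_stage(read_word(r0, c))
            b = e_stage(read_word(r0, c + 1))
            # Four filters: centers at rows r0+1, r0+2 and columns c-1, c.
            for k, cols in enumerate(((s1, s2, a), (s2, a, b))):
                for half in (0, 1):
                    out[r0 + 1 + half][c - 1 + k] = network_node(
                        cols[0][half], cols[1][half], cols[2][half])
            s1, s2 = a, b                      # keep the latest two columns
    return out, reads
```

In this sketch, rows shared by two consecutive four-row bands are fetched twice and no pixel is fetched more often, so the total number of pixel reads stays below 2 x m x n; the top two and bottom two rows are fetched only once. Storing the sorted triples of the two retained columns takes 12 values, which matches the 12 storage elements mentioned for the pipeline between the E-stage and the N-stage.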