Parallel and Pipelined 2-D Median Filter Architecture


IEEE EMBEDDED SYSTEMS LETTERS, VOL. 10, NO. 3, SEPTEMBER 2018

Janarthanam Subramaniam, Raju Jagadeesh Kannan, and David Ebenezer, Senior Member, IEEE

Abstract—The existing 2-D median filters in the literature are computationally intensive. It is proposed to optimally reduce the amount of data handled at the architecture level realization of the basic median filtering operation on images. The proposed architecture reads 4 pixels at a time in the input image, 4 pixels forming a word on a 32-bit hardware processing system; the subsequent processing is carried out by parallel and pipelined median filter architecture. Two read operations process eight input pixels which results in the generation of four output pixels with an initial latency. The proposed architecture offers reduced number of read operations and increased speed.

Index Terms—Median filter, parallel median filter, pipelined median filter, systolic arrays.

I. INTRODUCTION

Image processing is integral to deciphering the intelligence associated with it. Generally, image processing involves huge data handling, and modeling human intelligence demands heavy amount of computations on automated processes. Handling images involves loss of original information at various levels through environment and process stages. One such level is the processing of image for feature extraction [1]. Prior to this, preprocessing an image for the retrieval of the original information from non-Gaussian noise corruption is carried out mostly by nonlinear digital filters. Dominant among these are median-based filters.

Median filter provides robustness to impulse noise; however, the development of median filtering algorithms does not include the requirements of real-time intelligence systems. Generic processing ICs do not provide cost effective solution for image processing because of predefined architectural limitations [2]. In addition, there is always a tradeoff between the quality of information contained in images and the resources required to handle the images. FPGAs are sufficiently flexible and cost effective for prototyping and reconfiguring the applications [3] and, therefore, provide sufficient opportunity for the development of application specific architectures which cater to real-time requirements. Basic median filter can be implemented on specific architectures by performing median operation through sorting-based systolic arrays [4] and nonsorting-based techniques [5]–[8]. Systolic arrays are continuously optimized by researchers [9]–[11]. The architecture developed by Vega-Rodríguez et al. [4] exploits 32-bit data-width hardware for image data transfer; they have implemented the architecture as a systolic array which is reported in [10].

The architecture proposed in this letter handles image data effectively in such a way that a pixel in the input image is read only twice in comparison with architectures reported in the literature, wherein a pixel is read three or more times. Reduction in the number of read operations for filtering an image ensures reduction in the overall operating time. The proposed architecture also offers the advantage that it can be employed for different word length realizations, programmable digital signal processing, ASICs, and reconfigurable architectures. In addition, the proposed architecture does not impose any constraint on the time required for reading and processing pixels.
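
The read-count claim in the Introduction can be made concrete with a little arithmetic. The sketch below is only an illustration: the symbols m and n for the image size, the function name total_pixel_reads, and the 512 × 512 example size are assumptions introduced here, and image borders are ignored. It compares two reads per pixel with three reads per pixel.

```python
from fractions import Fraction


def total_pixel_reads(m, n, reads_per_pixel):
    """Total pixel read operations for an m x n image, borders ignored."""
    if m <= 0 or n <= 0 or reads_per_pixel <= 0:
        raise ValueError("m, n and reads_per_pixel must be positive")
    return reads_per_pixel * m * n


# Hypothetical example size; any m and n give the same ratio.
m, n = 512, 512
reads_proposed = total_pixel_reads(m, n, 2)   # each pixel read twice
reads_reference = total_pixel_reads(m, n, 3)  # each pixel read three times

# Fraction of the reference read traffic that remains.
ratio = Fraction(reads_proposed, reads_reference)
print(reads_proposed, reads_reference, ratio)  # 524288 786432 2/3
```

Under these assumptions the ratio is independent of the image size, so reading each pixel twice instead of three times leaves two-thirds of the read traffic.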

Manuscript received September 22, 2017; revised November 3, 2017; accepted November 5, 2017. Date of publication November 7, 2017; date of current version September 7, 2018. This work was supported in part by the Project, “Capacity Building in the Areas of EPDPT” of Ministry of Electronics and Information Technology, Government of India, Implemented by NIELIT Chennai Centre, and in part by the Scholarship Scheme, “Visvesvaraya Ph.D. Scheme for Electronics and IT” of Ministry of Electronics and Information Technology, Government of India, through VIT University Chennai Campus, Chennai, India. This manuscript was recommended for publication by D. Sciuto. (Corresponding author: Janarthanam Subramaniam.)
J. Subramaniam is with SENSE, VIT University Chennai Campus, Chennai 600127, India, and also with VLSI and Embedded Systems Group, NIELIT Chennai, Chennai 600025, India (e-mail: [email protected]).
R. J. Kannan is with SCSE, VIT University Chennai Campus, Chennai 600127, India (e-mail: [email protected]).
D. Ebenezer is with the Department of ECE, Anna University, Chennai 600025, India (e-mail: [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/LES.2017.2771453
1943-0663 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

II. MEDIAN FILTER MEETING REAL-TIME REQUIREMENTS

Vega-Rodríguez et al. [4] have presented an architecture for the basic median filter for systolic array implementation [10]. The architecture employs pipelining and parallelism. It is finally implemented with an FPGA as the target device. The FPGA is interfaced with a computer through a 32 bit PCI port for real-time interfacing and better human interactions. Every read instruction on a 32 bit system can read 4 pixels, each 8 bit wide. Multiple pixels on a single read cycle and parallelism on the systolic array lead to simultaneous production of four filtered pixels. The parallel and pipelined median filter architecture is shown in Fig. 1. Smith’s network [10] introduced parallelism and pipelining by splitting the nine level systolic arrays into two stages. The first stage in Fig. 1 is called the elementary sorting stage, also known as the E-stage. The next stage is called the network sorting stage, also known as the N-stage, with six levels of comparators.
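
As a rough software analogue of the elementary sorting stage named in Section II, the sketch below orders three pixels with three compare-and-swap levels. It assumes, as Section III-B of this letter indicates, that the E-stage sorts sets of three values in three compare-and-swap stages. The particular comparator order and the names compare_and_swap and sort3 are illustrative assumptions, not the circuit of Fig. 1.

```python
def compare_and_swap(values, i, j):
    """Order positions i < j in place so that values[i] <= values[j]."""
    if values[i] > values[j]:
        values[i], values[j] = values[j], values[i]


def sort3(a, b, c):
    """Sort three pixels in ascending order with three compare-and-swap levels."""
    v = [a, b, c]
    compare_and_swap(v, 0, 1)  # level 1: order the first pair
    compare_and_swap(v, 1, 2)  # level 2: move the maximum to the end
    compare_and_swap(v, 0, 1)  # level 3: order the remaining pair
    return tuple(v)


print(sort3(200, 17, 96))  # (17, 96, 200)
```

A fixed sequence of comparators such as this one maps naturally to hardware because the comparison pattern does not depend on the data.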

The N-stage comparators are shown in Fig. 2; the N-stage is also denoted as network node in the figure.

The output pixel value is calculated by considering eight neighbors with respect to a center pixel and a 3 × 3 moving mask. In order to read 4 pixels of the input image in a machine cycle, the first three rows are considered at the start. With respect to these three rows, the first four columns capture three row vectors, each vector containing 4 pixels. Each row vector containing a set of four 8 bit pixels forms a 32 bit word. Before the start of the processing, three words are available for transfer to the architecture. Three consecutive read operations result in the transfer of a 3 × 4 matrix to the architecture. The last two columns of every 3 × 4 matrix processed previously are sorted and stored. These stored elements are combined with the incoming 3 × 4 matrix for the formation of a 3 × 6 matrix. Consequently, this provides for placing four 3 × 3 masks simultaneously with respect to four center pixels.

In the architecture, the arrangement of four parallel network nodes results in four filtered pixels at a time. A set of three read operations, one cycle of filtering operation, and one cycle of write operation is the sequence of operations required for completing one cycle. The parallelism uses only 52 comparators for producing four filtered output pixels. This obviates the need for 76 comparator operations as employed in the approaches reported in the literature.

Fig. 1. Real-time median filter proposed by Vega-Rodríguez et al. [4].

Fig. 2. Part of systolic array optimized by Smith [10].

III. ARCHITECTURE WITH REDUCED DATA HANDLING

A fast median filtering scheme and a new architecture for hardware implementation on a 32 bit system are proposed. The proposed scheme is common to both reconfigurable architectures and general purpose computing platforms.

Fig. 3. Proposed two stage pipelining scheme.

A. Median Filtering Scheme for Effective Data Handling and Parallel Filtering

The proposed scheme is shown in Fig. 3. The objective is to effectively handle pixels for reducing repeated read operations. Beginning with the input image, the first four rows of the pixels are selected; the pixels in the selected rows are grouped as column vectors of 4 pixels each. Each column vector is read as one 32 bit data word. Four column vectors are presented as a single 4 × 4 matrix, where 3 × 3 masks can be applied with respect to four center pixels. Only two words are transferred instead of three or more words for completing one processing cycle. Only a single processing cycle is required for generating four filtered outputs; these four outputs are generated by four filters, namely, F1–F4. The two words can be transferred in one or two machine cycles depending on the capability of the system. For simplicity, reading two words per machine cycle is considered. The first two column vectors are transferred in cycle 1 and the words are stored in elements S1 and S2, each of size 4 × 1. In the next machine cycle, the third and fourth column vectors of the selected rows are transferred and combined with the previously stored columns to form a 4 × 4 matrix. In this process, the elements of the 4 × 4 matrix are grouped in such a way that four 3 × 3 sets are available with respect to the pixels encircled in Fig. 3. Each encircled pixel happens to be the center pixel of an eight neighborhood.
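
The grouping described in Section III-A can be sketched in software as follows. This is a behavioral illustration only, not the hardware: the function names, the byte order inside a 32 bit word (top pixel in the least significant byte), and the order in which the four results are returned are assumptions made here, and the median is taken with a plain software sort rather than with the E-stage and N-stage comparators.

```python
def pack_column(pixels):
    """Pack a column vector of four 8-bit pixels into one 32-bit word (assumed byte order)."""
    word = 0
    for i, p in enumerate(pixels):
        word |= (p & 0xFF) << (8 * i)
    return word


def unpack_column(word):
    """Split a 32-bit word back into its four 8-bit pixels."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]


def build_window(s1, s2, w3, w4):
    """Form the 4 x 4 window (as rows) from two stored and two incoming column words."""
    columns = [unpack_column(w) for w in (s1, s2, w3, w4)]
    return [[columns[c][r] for c in range(4)] for r in range(4)]


def four_medians(window):
    """Medians of the four 3 x 3 sets centred on the inner 2 x 2 pixels, row by row."""
    medians = []
    for r in (1, 2):
        for c in (1, 2):
            neighborhood = [window[i][j]
                            for i in range(r - 1, r + 2)
                            for j in range(c - 1, c + 2)]
            medians.append(sorted(neighborhood)[4])  # middle of nine values
    return medians


# Hypothetical test data: four column vectors of a 4 x 4 block holding 0..15 row by row.
cols = [[0, 4, 8, 12], [1, 5, 9, 13], [2, 6, 10, 14], [3, 7, 11, 15]]
words = [pack_column(c) for c in cols]
window = build_window(*words)
print(four_medians(window))  # [5, 6, 9, 10]
```

In this sketch the first two words play the role of the stored elements S1 and S2, and the last two are the newly read column vectors, so two word reads yield four filtered values.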

Median filtering each set generates four output pixels, namely, P1–P4. Once the execution of the filtering process is over, the latest two column vectors are stored in S1 and S2; the latest two column vectors are now ready for concatenation with the new incoming set of vectors in the subsequent processing cycles. This process repeats until the last column vector is processed and the first set of two rows is made available in the output image. For the selection of the next set of rows, only rows 3 to 6 need to be considered; the above process is repeated for finding the next two rows of the output image.

B. Column-Vectors Processing Median Filter Architecture

The column-vectors processing median filtering architecture (CPMA) is based on the proposed scheme for effective data handling and parallel filtering. CPMA requires 8 pixels, I(0−1,0−3), at the input. These 8 pixels are transferred as two 32 bit words. These are the column vectors of input pixels, I(0,0−3) and I(1,0−3). Input pixels representing the first and second columns are sorted in the ascending order and stored; the sorting is carried out by the E-stage. After the E-stage, 24 values in eight different sorted sets are available. These sets provide inputs to four network nodes operating in parallel in the N-stage. N1 receives values from I(−1,0−2), I(0,0−2), and I(1,0−2) similar to F1 in Fig. 3. N2, N3, and N4 also receive values from the input pixel matrix as shown in Fig. 4, similar to filters F2, F3, and F4. The E-stage and the N-stage are each allocated one machine cycle. Although the E-stage has only three stages of compare-and-swap operations in comparison with six stages of compare-and-swap operations at the N-stage, this allocation is a tradeoff between storage elements and speed. However, the speed depends on the user’s decision whether or not to split the N-stage. Pipelining between the E-stage and the N-stage requires 12 storage elements for holding the sorted set of values produced by the E-stage. All four filtered output pixels are organized as a word; the word is subsequently loaded to the memory for write operation. CPMA, therefore, needs four machine cycles to complete read, E-stage, N-stage, and write operations. It is to be noted that the machine cycles of this section are different from those of the scheme for effective data handling described in the previous section.

Fig. 4. Proposed CPMA.

IV. HARDWARE IMPLEMENTATION AND PERFORMANCE ANALYSIS

For the purpose of verifying the functionality of CPMA, the design is implemented in RTL Verilog HDL using XILINX ISE Design Suite 14.7 and synthesized for implementation in Xilinx FPGA Virtex 4 XC4VSX25. This FPGA is chosen for straightforward comparison with the state-of-the-art nonsorting methods [5], [8], sorting methods [10], [11], and sorting-based parallel architecture [4] available in the literature. The comparison is in terms of resource utilization and speed. The results are presented in Table I for the purpose of comparison of the performance metrics. The schemes in Table I employ 3 × 3 masks. Input sample width is 8 bits.

The focus of the methods presented in [5], [8], [10], and [11] is on the design of new median finding methods with improved resource and time metrics and their pipelined architectures. For each cycle, these methods require nine input values and use the resources optimally. Although the proposed CPMA is based on the existing sorting method in [4], parallelism and pipelining reduce repeated comparator operations of the input pixels. The proposed CPMA also reduces the number of times a pixel is read and transferred for processing.

The architecture in [4] needs 12 pixel values as input. CPMA needs only 8 pixels as input. Reading of two words per cycle requires only one machine cycle for reading input pixels from memory, i.e., 8 pixels are transferred in two 32 bit words. In comparison, the scheme in [4] needs two reading cycles before an execution cycle. The extra read operation results in the addition of one more machine cycle, which is a 25% extra burden on overall processing time in comparison with the proposed architecture.

Consider an image with m × n number of pixels. In the scheme proposed in [4], every pixel is read and transferred three times to the median finding architecture, i.e., 3 × m × n number of pixel read operations are executed for producing the filtered output image. In CPMA, an input image pixel is read and transferred to the architecture only twice, which means only 2 × m × n number of pixel read operations need to be executed for producing the filtered image. The proposed CPMA handles only two-thirds of the data in comparison with the best schemes available in the literature. Implementations presented in [5], [8], [10], and [11] require that each pixel of the input image be handled nine times.

Table I lists the latency cycles in the pipelined designs. Throughput is displayed next. Throughput is defined as the number of median outputs generated per clock cycle. Fast median-finding word comparator array (FM-WCA) [11] is the latest median finding word comparator array and is reported to be the fastest in the pipelined architectures, for example,
low hardware complexity pipelined rank filter (LCBP) [5] and Cadenas’s method [8]. However, Cadenas’s method has fewer latency cycles. FM-WCA uses 7% fewer DFFs and has smaller latency than Smith’s. LCBP, Cadenas’s, Smith’s, and FM-WCA are implemented as pipelined architectures. Throughput for all these methods is the same and is not modified for improvements at the same speed. Parallelism with pipelining is introduced in Vega’s architecture and CPMA for the improvement of throughput at 454 MHz. The introduction of parallelism has reduced the amount of image data to be handled by these architectures. Vega’s architecture handles only 33.3% of the image data handled by pipelined methods. But CPMA handles only 66.6% of the image data handled by Vega’s architecture. It is evident from Table I that a pixel in the input image is read only twice in the complete filtering process of the input image. Per-throughput resources of CPMA are comparatively fewer than those of Vega’s architecture, and the speed of 1816 MHz is significantly higher than that of FM-WCA and other methods. Table II presents the implementation of sorting-based methods for evaluating resource utilization and speed on the state-of-the-art prototyping platform. Performance of CPMA is further confirmed by the Xilinx FPGA Virtex 7 implementation. CPMA uses at least 30% less hardware resources than FM-WCA and offers increased throughput, four times that of FM-WCA. Similarly, CPMA uses 2.5% less logical resources and handles 33.3% less image data for the same throughput than the architecture in [4].

TABLE I
3 × 3 MEDIAN FILTER IN XILINX FPGA VIRTEX 4 XC4VSX25 (N = 9, INPUT SAMPLE WIDTH = 8 BITS)

TABLE II
3 × 3 MEDIAN FILTER IN XILINX FPGA VIRTEX 7 XC7VX330T

V. CONCLUSION

Computationally intensive median filtering algorithms are a challenge in the context of real-time processing. For the basic median filter, optimization of the amount of data handled at the architecture level with pipelining and parallelism of the existing systolic array is considered. While the number of filter processing cycles is fixed by the size of the image to be filtered, median filtering by the proposed architecture requires reduced data handling. The proposed architecture saves a significant amount of resources and the time required to handle median filtering of images. Further optimization on pipelining and parallelism may lead to further improvement in the context of real-time processing.

REFERENCES

[1] G. A. Baxes, Digital Image Processing: Principles and Applications. New York, NY, USA: Wiley, 1994.
[2] M. J. S. Smith, Application-Specific Integrated Circuits. Reading, MA, USA: Addison-Wesley, 2008.
[3] S. Hauck, “The roles of FPGAs in reprogrammable systems,” Proc. IEEE, vol. 86, no. 4, pp. 615–638, Apr. 1998.
[4] M. A. Vega-Rodríguez, J. M. Sánchez-Pérez, and J. A. Gómez-Pulido, “An FPGA-based implementation for median filter meeting the real-time requirements of automated visual inspection systems,” in Proc. 10th Mediterr. Conf. Control Autom., 2002.
[5] D. Prokin and M. Prokin, “Low hardware complexity pipelined rank filter,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 57, no. 6, pp. 446–450, Jun. 2010.
[6] J. Cadenas, G. M. Megson, R. S. Sherratt, and P. Huerta, “Fast median calculation method,” Electron. Lett., vol. 48, no. 10, pp. 558–560, May 2012.
[7] J. Cadenas, “Pipelined median architecture,” Electron. Lett., vol. 51, no. 24, pp. 1999–2001, Nov. 2015.
[8] J. O. Cadenas, G. M. Megson, and R. S. Sherratt, “Median filter architecture by accumulative parallel counters,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 62, no. 7, pp. 661–665, Jul. 2015.
[9] B. Morcego, J. Frau, and A. Català, “Suavizado de imágenes en tiempo real mediante filtrado por mediana utilizando arrays sistólicos,” in Proc. VII DCIS, Toledo, Spain, 1992, pp. 545–546.
[10] J. L. Smith, “Implementing median filters in xc4000e FPGAs,” Xilinx Xcell, vol. 23, no. 1, p. 16, 1996.
[11] J. Subramaniam, J. K. Raju, and D. Ebenezer, “Fast median-finding word comparator array,” Electron. Lett., vol. 53, no. 21, pp. 1402–1404, Dec. 2017.
