An FPGA Based SIMD Architecture Implemented With 2D Systolic Architecture For Image Processing
Pankaj Kumar
Lecturer, Department of Computer Application, Sahara Arts & Management Academy, Bakshi Ka Talab, Lucknow [email protected]
Abstract: Image processing is widely used in many applications, including medical imaging, industrial manufacturing, and security systems. It is gaining importance in a variety of application areas, e.g. autonomous vehicles, where it requires substantial computational power in order to operate in real time. Recent advances in image processing have made the use of images popular in many branches of human activity; robotics, biomedical applications, industrial process control and environmental control are among them. Each procedure in this environment demands a variety of processes, methods and hardware. Parallel computer architectures have been developed to pipeline successive image processing functions. Such an architecture allows many image processing operations to run in parallel, which gives a much higher data throughput than traditional computing systems. The parallel computer architectures suggested for highly efficient image processing include parallel processors of the SIMD and MIMD type, multiprocessor systems, and pipelined processors. The main objective of this paper is to present the implementation of an image processor, which is a SIMD processor built with large FPGAs. The paper describes why the systolic architecture is well suited to this design and, finally, how image processing takes place on this parallel processor.
Keyword: FPGA, SIMD, MIMD, 2D Systolic Array, Pipelined processor
General Word: Image Processing, Parallel Processor, Multiprocessor
1 Introduction
Image processing may be defined as the science of modifying and analysing continuous-tone pictures. Its purpose is to improve pictorial information for human interpretation and to process scene data for analysis by computers. Image processing is sometimes confused with computer graphics because the methods used in the two fields overlap, but the two are different [13]. In computer graphics, a computer is used to create a picture. Image processing, on the other hand, applies techniques to modify or interpret existing pictures, such as photographs and TV scans.
The benefit of parallel image processing depends on the degree of parallelism, which is why the PIPT (Parallel Image Processing Toolkit) was developed [12]. The details of parallelization are hidden from users of the PIPT through a registration/call-back mechanism, which provides an opaque transport mechanism for moving parameterized data between nodes. The main work of the PIPT is data distribution and message passing. The PIPT uses a manager/worker scheme in which a manager process reads an image file from disk, partitions it into equally sized slices, and sends the pieces to worker processes. The worker programs invoke a specified image processing routine to process their slices and then send the processed slices back to the manager. The manager re-assembles the processed slices to create the final processed output image. Applications use the PIPT by calling image processing routines, which are in turn built up from abstract computation kernel functions. The kernel functions interface to an opaque transport layer which transparently effects
parallel execution. The transport mechanism makes calls to an MPI (Message Passing Interface) library for its parallel communication operations [12]. Although the PIPT provides for parallel execution of image processing routines, parallelism is encapsulated at a low level of the system so that users of the PIPT do not need to be concerned with parallel programming. Additionally, MPI functions allow the PIPT to create its own message passing space that is guaranteed not to conflict with any other messages from the user's application.
2 Cellular Arrays for Pixel Parallelism
Over the last few years, advances in programmable logic devices have resulted in the commercialization of field programmable gate arrays (FPGAs), which allow large numbers of programmable logic elements to be put on a single chip [10]. The size and speed of those circuits improve at the same rate as the size and speed of microprocessors, since they rely on the same technology. FPGAs offer the possibility of constructing re-programmable, reconfigurable arrays that compute certain problems efficiently. Nowadays an FPGA can implement a programmable, maximally parallel cellular array.
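As an illustration of the cellular-array idea, the sketch below emulates in software one synchronous update of a small array in which every cell holds one pixel and reads only its local neighborhood. It is a sequential emulation, not the FPGA design: the function name `cellular_step`, the 3x3 neighborhood and the clipped borders are assumptions made for the example.

```python
def cellular_step(image, rule):
    """Apply `rule` to the 3x3 neighborhood of every cell and return a new image.

    All outputs are computed from the unmodified input, which mimics cells
    that update simultaneously. Border cells see a clipped neighborhood
    (an assumption; the source does not say how borders are handled).
    """
    rows, cols = len(image), len(image[0])

    def neighborhood(r, c):
        return [image[rr][cc]
                for rr in range(max(r - 1, 0), min(r + 2, rows))
                for cc in range(max(c - 1, 0), min(c + 2, cols))]

    return [[rule(neighborhood(r, c)) for c in range(cols)]
            for r in range(rows)]


# One bright pixel spreads to its eight neighbors in a single step (dilation).
image = [[0, 0, 0, 0],
         [0, 9, 0, 0],
         [0, 0, 0, 0]]
dilated = cellular_step(image, max)
```

Passing a different `rule` (for example `min`, or a function that averages the list) gives another neighborhood function without changing the update loop, which is the property that makes the model attractive for hardware.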
Cellular arrays are a natural model for parallel image processing [6]. They consist of an array of cells in two, three or more dimensions. Each cell is associated with an image pixel and has dedicated connections to its local neighborhood. This high-bandwidth local communication is ideal for implementing neighborhood functions: all pixels are processed in parallel, and the entire image is updated in one instruction cycle. An FPGA can implement a programmable, maximally parallel cellular array, but it can efficiently implement only a small number of cells.
Fig 1
3 Image Pipeline for Instruction Level Parallelism
This is similar to the cellular architecture. Data is provided to the cell through a continuous stream of pixels [6]. The pixels are often supplied to the cell one sample at a time, usually in raster scan order. This arrangement is often suitable for real-time systems where data arrives directly from a serial I/O sensor. Since pixels are processed sequentially, the main way to achieve speed-up for an image pipeline is to execute multiple instructions in parallel. As shown in Figure 2, instructions can be implemented either in parallel (increasing the pipeline width) or in series (increasing the pipeline depth). Unlike cellular architectures, accessing a local neighborhood within an image pipeline must be carefully considered. All instructions in a pipeline are executed at the same time, and therefore it may be difficult to provide data to all instructions at the right time.
Fig 2
4 SIMD Based Architecture
To implement image processing on a parallel computer architecture, I have chosen a SIMD (Single Instruction, Multiple Data) architecture with a 2D torus connection topology, which includes the 2D systolic architecture (Fig 3). The processing elements (PEs) are interconnected in a 2D torus, and the same address and control signals are used by every PE. The interconnection of all PEs is called the processing matrix [11].
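The combination of a single instruction stream with torus links can be sketched in a few lines. The example below is only a software model under stated assumptions: each PE holds one value, `shift` and `four_neighbor_sum` are hypothetical names, and the wrap-around indexing stands in for the links that join opposite edges of the torus. Every PE performs the same shift or addition in each step, which is the SIMD property described above.

```python
def shift(grid, dr, dc):
    """Return the grid after every PE passes its value dr rows down and dc columns right.

    Indices wrap around, modelling the torus links between opposite edges.
    """
    rows, cols = len(grid), len(grid[0])
    return [[grid[(r - dr) % rows][(c - dc) % cols] for c in range(cols)]
            for r in range(rows)]


def four_neighbor_sum(grid):
    """Each PE adds the values received from its North, South, West and East neighbors."""
    from_north = shift(grid, 1, 0)
    from_south = shift(grid, -1, 0)
    from_west = shift(grid, 0, 1)
    from_east = shift(grid, 0, -1)
    return [[n + s + w + e for n, s, w, e in zip(*lanes)]
            for lanes in zip(from_north, from_south, from_west, from_east)]


pixels = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]
sums = four_neighbor_sum(pixels)
```

Because of the wrap-around, the PE in the top-left corner also receives four values (here 7, 4, 3 and 2), so no PE needs a special case at the border.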
Fig 3
Three distinct memories are used in this architecture: instruction memory, data memory and PE memory (register file). The combination of instruction memory and data memory is called stream memory [1]. The SIMD controller is responsible for reading data (e.g. pixel values) from the stream memory and transferring it to the register files, and vice versa. The architecture is composed of three basic components: the SIMD controller, the processing matrix and the I/O controller. These three components are connected by one shared global bus and two control buses, i.e. program control, and PE memory control and address signals. All processing elements can communicate with each other by means of inter-PE communication.
4.1 I/O Controller
The I/O controller manages off-board communication and initiates memory transfers. It is responsible for the following operations:
Communicating with the host.
Exchanging status information with the SIMD controller.
Managing data transfer between the host and the board.
Data transfers between the host and the board use the global bus to send addresses and data to the PEs and the SIMD controller.
4.2 SIMD Controller
The SIMD controller is the control unit of the image processor. It reads a program from its instruction memory and uses its data memory for storing global information. Once an instruction is decoded, data and control signals are sent to the PEs through the global bus and a dedicated control bus. The global bus may be used to send both control and data signals.
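The role of the SIMD controller, which reads a program, decodes each instruction and drives every PE with the same operation, can be modelled as follows. This is a minimal sketch and not the controller of the described board: the class `SimdController`, the two opcodes `BROADCAST` and `CMP_GE`, and the register names are all hypothetical and chosen only to show a global value travelling to every PE followed by one operation executed in all PEs.

```python
class SimdController:
    """Hypothetical controller: one instruction stream, many register files."""

    def __init__(self, program, data_memory, register_files):
        self.program = program                # instruction memory
        self.data_memory = data_memory        # global values
        self.register_files = register_files  # one dict per PE

    def run(self):
        for opcode, *operands in self.program:
            if opcode == "BROADCAST":
                dst, name = operands
                value = self.data_memory[name]    # global data sent to all PEs
                for regs in self.register_files:
                    regs[dst] = value
            elif opcode == "CMP_GE":
                dst, a, b = operands
                for regs in self.register_files:  # same operation in every PE
                    regs[dst] = 1 if regs[a] >= regs[b] else 0
            else:
                raise ValueError(f"unknown opcode: {opcode}")


# Threshold four pixels, one per PE, against a global value of 128.
pes = [{"pix": p} for p in (12, 200, 90, 130)]
program = [("BROADCAST", "thr", "threshold"),
           ("CMP_GE", "mask", "pix", "thr")]
SimdController(program, {"threshold": 128}, pes).run()
mask = [regs["mask"] for regs in pes]
```

The inner loops over `register_files` run one PE after another only because this is a sequential model; in hardware all PEs would respond to the decoded instruction in the same cycle.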
Fig 4 SIMD Controller
The SIMD processor includes an image sensor, which senses the image to be processed (Fig 4) [2, 11]. The SIMD controller also provides addresses and control signals to every memory during both program execution and I/O memory transfers. If configured accordingly, it exchanges status information with the I/O controller.
4.3 Stream Memory
The stream memory unit is the connection between external memory and I/O on one side and the PEs on the other. It takes data from external memory and sends it to the PEs via the SIMD controller. It is the combination of data memory and instruction memory.
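The movement of pixel data from stream memory to the PE register files and back can be pictured with a small scatter/gather model. The source states only that data moves in both directions; the block distribution used here (one rectangular tile per PE) and the names `scatter` and `gather` are assumptions made for the sketch.

```python
def scatter(stream, width, pe_rows, pe_cols):
    """Split a raster-ordered pixel list into one tile per PE (assumed block layout)."""
    height = len(stream) // width
    if height % pe_rows or width % pe_cols:
        raise ValueError("image size must be a multiple of the PE matrix size")
    tile_h, tile_w = height // pe_rows, width // pe_cols
    tiles = {}
    for i in range(pe_rows):
        for j in range(pe_cols):
            tiles[(i, j)] = [
                stream[(i * tile_h + y) * width + j * tile_w:
                       (i * tile_h + y) * width + (j + 1) * tile_w]
                for y in range(tile_h)
            ]
    return tiles


def gather(tiles, pe_rows, pe_cols):
    """Rebuild the raster-ordered pixel list from the per-PE tiles."""
    tile_h = len(tiles[(0, 0)])
    stream = []
    for i in range(pe_rows):
        for y in range(tile_h):
            for j in range(pe_cols):
                stream.extend(tiles[(i, j)][y])
    return stream


# A 4x4 image distributed over a 2x2 PE matrix and collected again.
stream = list(range(16))
tiles = scatter(stream, width=4, pe_rows=2, pe_cols=2)
restored = gather(tiles, pe_rows=2, pe_cols=2)
```

Other layouts, such as one pixel per PE or interleaved rows, would change only the index arithmetic inside `scatter` and `gather`.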
4.4 Processing Matrix
The PE matrix is a set of identical PEs interconnected in a 2D grid topology. Each PE is connected to its four neighbors to the North, South, East and West (Figure 5).
Fig 5 Detailed view of PE
Each PE has a local memory addressed by the SIMD controller. The processing matrix is implemented by a 2*2 FPGA matrix. Conceptually, an FPGA represents a sub-matrix of the global PE matrix [11]. The data and control signals of an FPGA are shared among all PEs of its sub-matrix.
5 Implementation
The architecture has a 2*2 FPGA matrix, and each of the memory blocks is an SRAM module of 64K*32 bits. The processing elements as well as the SIMD controller are implemented with SRAM. The I/O controller is implemented with EPROM. The SIMD controller is implemented by an FPGA, which implies that the decoding and execution of instructions may differ from one application to another. Once an instruction is decoded, data and control signals are sent to the FPGAs through the global bus and a dedicated control bus. Each FPGA is connected to its North, South, East and West neighbors and to a local memory. Each North, South, East and West connection has 32 bits (Table 1).
Table 1 Width of each Bus
Bus
Global Shared Bus
Control Bus between I/O and SIMD Controller
Three Address Buses
Control Bus between SIMD controller and PE matrix
2D Grid connection
Configuration Signal
The PCI local bus standard is used to implement the I/O interface. The design of the interface is simplified by the PCI controller, a powerful and flexible controller supporting several levels of interface sophistication. Two clock signals are available on board: the first is provided by the PCI controller, and the second is an on-board crystal clock. It is possible to connect two or more boards of this architecture by means of 124-pin connectors; in that case the north and south connections of the processing matrix are routed to the other boards.
6 Result
The parallel computer based on the SIMD architecture is dedicated entirely to image processing, but this architecture has also been found well suited to pattern recognition and neural networks. The image processor has a SIMD architecture with a 2D interconnection network that is well suited for implementing 2D systolic networks. The described architecture can be reconfigured: two or more boards can be connected by means of 124-pin connectors, with the north and south connections of the processing matrix routed to the other boards. Because the processing matrix plays a central role in this architecture, a new architecture can be obtained by changing the design of the processing matrix.
7 Discussion
Nowadays parallel processing concepts are used in a variety of applications where speed and accuracy matter. Parallel processing improves performance by executing more than one task at a time, which is why a parallel processing architecture is used here. In this architecture pixel-level parallelism is achieved, which is why only one instruction is issued at a
time. Hence a SIMD (Single Instruction, Multiple Data stream) parallel computer architecture is used. Since this architecture is dedicated to image processing, it can be called an image processor. The speed as well as the accuracy of the described architecture is found to be better than that of architectures based on serial computing.
8 Conclusion
The demand for image processing in applications such as robotics, biomedical applications and industrial processes has grown rapidly in recent years, so it has become necessary to design an architecture that is well suited to these types of application. In this paper, I have presented a suitable SIMD parallel computer architecture for image processing. The parameters considered are the number of ALUs, the number of PEs, the I/O controller and the SIMD controller. A parallel approach is used in the design so that data throughput can be maximized. The architecture is implemented on FPGAs, which allow large numbers of programmable logic elements to be put on a single chip.
Acknowledgement
I would like to express my gratitude to Prof. D. N. Kakkar, Director, Sahara Arts & Management Academy, for his sincere support and motivation. I cannot forget the cooperation of Mr. Brijesh Khendelwal, HOD, Computer Science. I would also like to thank all my colleagues, who gave me a lot of encouragement in writing this paper.
Reference
[1] Hamed Fatemi, Henk Corporaal, Twan Basten, Richard Kleihorst, and Pieter Jonker, Designing Area and Performance Constrained SIMD/VLIW Image Processing Architectures.
[2] Antonio Gentile, José L. Cruz-Rivera, D. Scott Wills, Leugim Bustelo, José J. Figueroa, Javier E. Fonseca-Camacho, Wilfredo E. Lugo-Beauchamp, Ricardo Olivieri, Marlyn Quiñones-Cerpa, Alexis H. Rivera-Ríos, Iomar Vargas-González, Michelle Viera-Vera, Real-Time Image Processing on a Focal Plane SIMD Array.
[3] Alok Choudhary and Sanjay Ranka, Syracuse University, A Parallel Processing for Computer Vision and Image Understanding.
[4] F.J. Seinstra, D. Koelma, J.M. Geusebroek, Bridging the Gap between Computing and Imaging: Towards Effortless Parallel Image Processing, Intelligent Sensory Information Systems.
[5] Andriy Lutsyk, Oleksiy Lutsyk, Olexandr Pelenskyy, Parallel Image Processing on Configurable Computing Architecture.
[6] Reid B. Porter, Image Processing, pp. 4-5.
[7] Winner E. Alexander, Parallel image processing with block data Parallel architecture, IBM J. Res. Develop., vol. 44, no. 5, Sept. 2000.
[8] Mirosław Gajer, Parallel Image Processing on the Texas Instruments Multiprocessor System Systems, Pro Dialog 11 (2000), 13-29, NAKOM Publishers, Poznań, Poland.
[9] Thomas Bräunl, Tutorial in Data Parallel Image Processing, Australian Journal of Information Processing Systems (AJIIPS), vol. 6, no. 3, 2001, pp. 164-174 (11).
[10] James Greco, Parallel Image Processing and Computer Vision Architecture, University of Florida, 2005, pp. 15-16.
[11] Jocelyn Cloutier, Eric Cosatto, Steven Pigeon, Francois R. Boyer and Patrice Y. Simard, An FPGA based image processor for image processing and neural networks.
[12] J. M. Squyres, A. Lumsdaine, R. L. Stevenson, A toolkit for parallel image processing, pp. 1-2.
[13] Baker, A survey of computer Science, pp. 32-33.
[14] Syeda