A PROGRAMMABLE VLSI ARCHITECTURE BASED ON MULTILAYER CNN PARADIGMS FOR REAL-TIME VISUAL PROCESSING†

International Journal of Circuit Theory and Applications, 24, 357-367 (1996)

LUIGI RAFFO, SILVIO P. SABATINI AND GIACOMO M. BISIO

Department of Biophysical and Electronic Engineering, University of Genova, Via all'Opera Pia 11A, I-16145 Genova, Italy
SUMMARY
A new digital VLSI architecture is presented for the implementation of discrete-time multilayer CNNs. At the functional level, the architecture is organized as 12 layers of 64 × 64 cells which interact as specified by a set of 3D generalized templates. At the structural level, the application of cloning templates occurs in a set of processing units programmed by instruction masks, generated on the basis of the algorithm to be emulated. It is demonstrated that this architecture is applicable to multilayer algorithms for visual processing and also to standard CNNs, including those that use sequences of templates or that work in parallel. Simulations evidence the high efficiency of this implementation.
1. INTRODUCTION
Analogue CNNs have proved to be very effective in various image-processing tasks that can be related to local interactions among processing units arranged in a two-dimensional grid.1,2 However, for machine vision processing and other related intelligent assignments it is necessary to combine several different elementary tasks defined by a series of templates to be used in sequence or concurrently.3 Hence the need for programmable architectures emerges strongly.
Even though CNNs are especially tailored for analogue processing, since they can be directly mapped onto a grid of simple analogue processors, and a number of papers have been devoted to the study of specific analogue building blocks for CNN implementation and to programmable solutions based on operational amplifiers, several practical constraints on the efficacy of analogue CNN implementations are posed by area occupation, input-output interfacing, status memorization and flexibility. Moreover, these solutions offer only limited programmability of system parameters, thus preventing a global reconfiguration of the structure of the elaboration (i.e. the network structure for CNNs). Hence architectural solutions able to map several CNN paradigms onto the same computational substrate are strongly desirable to fully exploit the potentialities of CNNs in real applications.
In this paper, starting from a generalized reformulation of cell dynamics for multilayer CNNs, we present a reconfigurable digital VLSI architecture able to fulfil both these demands for programmability and the requirements of higher efficiency with respect to commercial DSPs or hardware accelerators. The architecture will be motivated in relation both to standard CNN templates and to a specific algorithm4 based on a multilayer cortical-like computational model of preattentive visual processing.
† Part of this research has been reported in the Proceedings of the 1994 IEEE International Workshop on Cellular Neural Networks and Their Applications held in Rome.
2. MULTILAYER CNN PARADIGMS

The dynamics of a delay-type discrete-time CNN can be written as

$$x_{ij}(n+1) = \sum_{(k,l)\in N_r(i,j)} \left[ A(i,j;k,l)\, y_{kl}(n) + A'(i,j;k,l)\, y_{kl}(n-\tau) + B(i,j;k,l)\, u_{kl}(n) + B'(i,j;k,l)\, u_{kl}(n-\tau) \right] + I \qquad (1)$$

$$y_{ij}(n) = f(x_{ij}(n)), \qquad x_{ij}(0) = x_0$$
where x, y, u and I denote cell state, output, input and bias respectively; N_r is the r-neighbourhood of the cell (i, j); A, A', B, B' are the cloning templates; τ is the memory duration time; and f is the non-linear output function. It is noteworthy that if the input u is constant during the iteration, the delayed B-template (B') is null.
N-dimensional generalizations (multilayer CNNs) of discrete-time CNNs can also be formulated on the basis of the multilayer generalization introduced by Chua and Yang.5 In multilayer CNNs each cell is characterized by several variables instead of only one state variable as in the single-layer case. We can observe that, whatever its level of complexity, a discrete-time multilayer CNN can be computationally described by an ensemble of nodes interacting locally to reach the prescribed computation.
The set of L nodes associated with each cell can be viewed as the components of a column vector v_ij representing the whole set of inputs, outputs and delayed outputs related to the corresponding location (i, j) on the cell layer. Each set of nodes interacts only with neighbouring sets through a generalized 3D template D that spans the L layers:

$$v^{\alpha}_{ij}(n+1) = g^{\alpha}\!\left(\sum_{\beta=1}^{L}\ \sum_{(k,l)\in N_r(i,j)} D^{\alpha\beta}(i,j;k,l)\, v^{\beta}_{kl}(n)\right) \qquad (2)$$

where α and β index the components of the column vector (i.e. the layer), (i, j) indexes the elements of a layer and g^α is the output function associated with layer α.
For each component of a vector v_ij one can recognize in the corresponding section of the 3D template the conventional control operators B and B' and the feedback operators A and A'. For example, the CNN described in equation (1) with null delayed templates can be implemented assuming L = 3: v^1 = I, v^2 = u, v^3 = y; D^{11} = D^{22} = D^{31} = 1 for (i, j) = (k, l) and null otherwise; D^{32} = B, D^{33} = A; D^{12} = D^{13} = D^{21} = D^{23} = 0; g^1(x) = x, g^2(x) = x, g^3(·) = f(·). If delay templates are present, more components of v should be considered, one for each previous output (and/or input) present in the algorithm. For example, for τ = 1: v^1 = I, v^2 = u, v^3 = y, v^4 = y(t − 1); D^{11} = D^{22} = D^{31} = D^{43} = 1 for (i, j) = (k, l) and null otherwise; D^{32} = B, D^{33} = A, D^{34} = A'; D^{12} = D^{13} = D^{21} = D^{23} = D^{14} = D^{24} = D^{41} = D^{42} = D^{44} = 0; g^1(x) = g^2(x) = x, g^3(·) = f(·), g^4(x) = x.
It is worth noting that in this way D specifies not only the strength of connections among the cells of the CNN but also the interconnection structure of the CNN itself, thus allowing us to achieve the higher degree of programmability required. This is the form of computation to which we shall refer for devising architectural solutions.
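The following sketch (ours; the data layout is hypothetical) makes the generalized update of equation (2) explicit: the L node values of each cell are stored as a field v[α, i, j] and driven by a 3D template D of 2D masks, so that a network is programmed purely by filling D.

```python
import numpy as np

def multilayer_step(v, D, g):
    """One synchronous update of all L layers (equation (2)).

    v : array (L, N, N)     node values of every layer at every location
    D : array (L, L, 3, 3)  generalized 3D template: D[a, b] is the 2D
                            mask with which layer b drives layer a
    g : list of L per-layer output functions g^alpha
    """
    L = v.shape[0]
    v_new = np.empty_like(v)
    for a in range(L):
        acc = np.zeros_like(v[0])
        for b in range(L):
            mask = D[a, b]
            if not mask.any():        # sparse template: skip null sections
                continue
            for di in (-1, 0, 1):     # r = 1 neighbourhood
                for dj in (-1, 0, 1):
                    w = mask[1 + di, 1 + dj]
                    if w:
                        acc += w * np.roll(np.roll(v[b], di, axis=0),
                                           dj, axis=1)
        v_new[a] = g[a](acc)
    return v_new
```

With the L = 3 mapping above (0-based indices: v[0] = I, v[1] = u, v[2] = y), setting D[2, 2] = A, D[2, 1] = B and centre-only identities for D[0, 0], D[1, 1] and D[2, 0] reproduces equation (1); note also how skipping all-null masks anticipates the projection of the sparse 3D template onto a reduced set of 2D masks exploited in Section 3.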
3. ARCHITECTURAL SPECIFICATION
3.1. Derivation

The architectural specification has been derived through a high-level transformation of the original specification: unrolling of the innermost loop of the elaboration, to extract the implicit parallelism contained in the original specification, and subsequent loop folding,6 to exploit this parallelism by means of pipelining. This leads to an architectural specification based on a limited set (one per layer) of processing units able to evaluate the new state of a cell through a few iterations without reloading already processed input data. Two main blocks characterize the architecture: the storage block, in which the vectors v are stored, and the processing block, which updates each vector according to the cloning templates D. In this respect each template can be viewed as a 3D array. Since many elements of a cloning template are null, in order to make the implementation more efficient, the 3D template can be projected onto a reduced number of 2D masks (see Figure 1).
3.2. Implementation
Limits on VLSI technology, power consumption and speed of computation pose some constraints on the number of layers and cells, on the dimension of the instruction masks and on the number of bits used to represent weights. The trade-off between performance and available resources depends on the target application domain, specified later in Section 4.2. On this basis we consider 12 layers of 64 × 64 cells, interactions among first neighbours only (i.e. a 3 × 3 × 12 cloning template) and weight magnitudes specified with 3 bits as powers of 2 (the successive non-linear block takes care of scaling). With this choice a compact memorization of weights is achieved and weight multiplications occur through arithmetic shifts.
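Since weight magnitudes are powers of two, each multiplication reduces to an arithmetic shift plus, for negative weights, the invert-and-increment negation performed in the data processors (Figure 4, described below). A minimal behavioural model of this operation follows; the shift direction is our assumption, since the hardware only requires consistency with the rescaling applied in the non-linear block.

```python
def apply_weight(datum8, negative, shift):
    """Multiply an 8 bit two's-complement datum by a weight of
    magnitude 2**shift (3 bit exponent), sign applied by bitwise
    inversion plus an increment, as in the data processor.
    Left shift is our assumption about the shift direction."""
    v = datum8 - 256 if datum8 & 0x80 else datum8   # decode 8 bit datum
    if negative:
        v = ~v + 1                                  # invert, then add one
    return v << shift                               # arithmetic shift

# A weight of -4 applied to the datum 0x05 (= 5): expect -20.
assert apply_weight(0x05, negative=True, shift=2) == -20
```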
The architectural schema of our system is illustrated in Figure 2.
The storage block is based on a single-port RAM. The current/previous outputs of cellular neural network elements are stored in 64 × 64 locations of 96 bits, functionally subdivided into 12 groups of 8 bits to implement 12 layers (L1, ..., L12).
The processing block is composed of 12 processing units (one for each layer), 12 pairs of 16 bit row buffers and 12 instruction mask sets that play the role of cloning templates. The behaviour of each processing unit is controlled by its set of masks, whose elements determine the weight sign and magnitude and the number that identifies the layer in which to read the output. Specifically, each element of a mask is composed of two fields: the first specifies the weight (null flag, sign and magnitude); the second addresses the layer in which the mask has to act.

Figure 1. A pictorial view of the generalized 3D CNN. On the left side, all the nodes contributing to the output of the marked cell in layer L3 are evidenced; on the right side the positive weights of the corresponding 3D cloning template are represented, while below the related 2D projection masks are shown

Figure 3. The first and last processing units are depicted. From the bus they receive the same datum referred to a complete column of cells. The data processor extracts and manipulates the data according to the actual mask element (see text). When a datum becomes available at the end of the adder cascade, it is stored in the buffer and then moved to the RAM
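A possible software model of one mask element is sketched below; the paper fixes the 3 bit magnitude field and the 12 selectable layers, while the exact bit packing shown here is our assumption.

```python
from dataclasses import dataclass

@dataclass
class MaskElement:
    """One element of an instruction mask."""
    null:  bool   # True -> this element gives no contribution
    sign:  bool   # True -> negative weight
    mag:   int    # 3 bit weight magnitude, as a power of two
    layer: int    # which of the 12 layers supplies the datum (4 bits)

def pack(e: MaskElement) -> int:
    """Pack an element into a 9 bit word [null | sign | mag(3) | layer(4)];
    the field ordering and overall width are our assumption."""
    return (e.null << 8) | (e.sign << 7) | ((e.mag & 0x7) << 4) | (e.layer & 0xF)
```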
At each iteration the actual value of the vector v is moved from the RAM (scanned row by row) towards all the processing units (see Figure 3). In each processing unit: (i) each data processor (see Figure 4) extracts a portion of the datum according to the content of the element of its masks, then shifts and complements the result if requested; (ii) a cascade of adders adds it to the partial sum coming from the preceding rows, stored in the buffer.
At each iteration three rows of data need to be available (the preceding, the actual and the next). The data belonging to these rows are sent to the processing unit, row by row, element by element. When all the data of a row have been transferred to the processing unit, the convolution between the row considered and the first row of the mask is available in the buffer. The content of the buffer is the starting value for the convolution of the second row of the mask with the actual row, and so on for all the masks. When a row is completely processed, it cannot be moved directly to the RAM, because its values are still needed for the next row. Hence a second buffer is needed to store it while the next row is processed. When the processing of the next row is completed, the content of the second buffer is moved to the RAM through a non-linear block implemented by a clipping function with a programmable slope (Figure 5).
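A behavioural model of this non-linear block might look as follows; the ±1 saturation levels are our reading of Figure 5, and a power-of-two slope would again reduce the multiplication to an arithmetic shift in hardware.

```python
import numpy as np

def clip_nonlinearity(x, slope):
    """Clipping function with programmable slope (Figure 5): linear
    with gain `slope` around zero, saturating beyond that. The +/-1
    saturation levels are assumed from the figure."""
    return np.clip(slope * x, -1.0, 1.0)
```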
Figure 4. A data processor is depicted. The datum from the 96 bit bus is subdivided into 12 blocks connected to an 8 bit bus through three-state buffers. The 8 bits are inverted or buffered (according to the sign of the weight) and arithmetically shifted. To complete the two's complementation, the resulting value is incremented by one if the weight is negative
Figure 5. The clipping function with programmable slope applied by the non-linear block (input axis from -4 to 4)
This schema limits the number of transfers to the RAM, storing only the values useful for the next iteration and avoiding the physical duplication of the storage block.
Thanks to the horizontal pipeline schema and the parallelization of fetches from memory with mask computation, the number of clock periods needed for an iteration update is nine, since a convolution mask needs three buffer updates, each lasting three clock periods (for data processing and sum).
4. APPLICATIONS
Figure 6. Possible utilizations of the architecture for CNN algorithms: (a) different CNNs performing different computations on the same input; (b) several CNNs working in parallel on different inputs; (c) a delay-type CNN
4.1. Standard CNN algorithms

Our architecture is able to implement both several CNNs working in parallel and delay-type CNNs, as sketched in Figure 6 and detailed in the following two examples. A delay-template CNN can be implemented using a layer for each previous state of the elaboration we are interested in; with 12 layers available, this architecture can implement a CNN with τ ≤ 10.
Edge detection. Many cloning templates for edge detection have been presented.7 Figure 7 shows the result of the implementation of a 3 × 3 cloning template A with circular symmetry (2 in the middle, −0.25 for the neighbours). This operator is mapped on the architecture according to the example of Section 2.
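For illustration, this template can be built and iterated with the dtcnn_step sketch of Section 2; the zero control template, zero bias, random initial state and iteration count are our assumptions, not values taken from Reference 7.

```python
import numpy as np

# Feedback template with circular symmetry: 2 in the centre,
# -0.25 for each of the eight neighbours.
A_edge = np.full((3, 3), -0.25)
A_edge[1, 1] = 2.0
B_zero = np.zeros((3, 3))

# Hypothetical run, reusing dtcnn_step from the Section 2 sketch; the
# random binary initial state stands in for a test image in {-1, 1}.
x = np.random.choice([-1.0, 1.0], size=(64, 64))
u = np.zeros_like(x)
for _ in range(20):                     # iterate until the output settles
    x = dtcnn_step(x, u, A_edge, B_zero, I=0.0)
edges = np.clip(x, -1, 1)               # resulting edge map
```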
Connected component detection. We present in Figure 8 the results of the simulation of the CNN
proposed in Reference 8 for connected component detection (see caption).
4.2. A cortical-like algorithm for preattentive visual processing

Problem description. Many machine vision tasks are based on the recurrent application of simple and uniform operators on a large set of data representing the image. These applications usually require real-time performances that cannot be achieved by software implementations. In particular, solving visual tasks requires one (i) to extract elementary information from the image data (e.g. contrast, contrast differences, etc.) and (ii) to merge such information into a global unifying percept. Both operations resort to point and local interactions within restricted portions of the image. For an efficient hardware design it is important to have a structure based on simple modules locally connected so as to limit communication overhead. To this end, by studying biological solutions for vision processes, and especially those evolved in the visual cortex,9,10 one can derive the following set of computational paradigms.4
1. Local feature extraction. Each cell analyses the input image by performing a weighted sum over the
portion of the image around the current pixel.
2. Topology preservation. Adjacent locations on the visual cortex (i.e. the output port) correspond to
adjacent locations in the image, thus preserving the topographic organization of the image.
3. 3D mapping of local information. The 3D structure representing the cortex is composed of layers, organized hierarchically. Each cell in a layer gains its properties both through feedforward connections from cells in the previous layer and through horizontal and vertical, locally confined recurrent paths. These computations, together with topology preservation, ensure a direct correspondence between the morphology of connections and the detection of spatial relations among featural elements.

Figure 7. (a) Test image. (b) Output of the edge-detection CNN using the template of Reference 7

Figure 8. (a) Test image. (b) Output of a connected component detection CNN with the template of Reference 8. (c) Same as (b) using a delay-type template A' with τ = 3. A' is mapped on the architecture by considering three additional layers in which the outputs of the previous ones are copied at each step, realizing a memory of the last four output values
Algorithm specification. The fundamental module of the model is a 'column', i.e. an ensemble of orientation-selective cells present in the simple, complex and hypercomplex layers at the same location (see Figure 9(a)). Each layer can be described as being composed of a number of (e.g. four) sublayers, each of which can be described as a 2D regular grid of cells selective to the same oriented featural element. The simple layer is the input layer and provides computational primitives to the complex layer to extract oriented featural elements: the excitation e_s(i, j, θ) reflects the dominant featural element among those detected by a convolution with different kernels. The excitation of a neuron in the complex layer belonging to column (i, j), with orientation preference θ, is the result of four contributions: direct excitation z_s = g(e_s) from the corresponding position in the simple layer, where g(·) is a sigmoidal transfer function; feedforward inhibition from a set M(θ) of simple cells; recurrent inhibition from a set N_c(i, j, θ) of complex neurons; positive feedback z_h from the corresponding neuron in the hypercomplex layer. The excitation of neurons in the hypercomplex layer results from two contributions: the feedforward actions from a set L(i, j, θ) of neurons in the complex layer and the cross-orientation inhibition from a set N_h(i, j, θ) of neurons in the hypercomplex layer (see Figure 9(b)).
Summarizing, the algorithm can be described by the following system of equations:
$$e_s(i,j,\theta) = \max_{p=1,\dots,4}\left|\sum_{m,n} w_p(m,n,i,j,\theta)\, I(m,n)\right| \qquad (3)$$

$$z_c^{k+1}(i,j,\theta) = g\!\left(z_s(i,j,\theta) - \sum_{M(\theta)} w_{sc}\, z_s - \sum_{N_c(i,j,\theta)} w_{cc}\, z_c^{k} + w_{hc}\, z_h^{k}(i,j,\theta)\right) \qquad (4)$$

$$z_h^{k+1}(i,j,\theta) = g\!\left(\sum_{L(i,j,\theta)} w_{ch}\, z_c^{k} - \sum_{N_h(i,j,\theta)} w_{hh}\, z_h^{k}\right) \qquad (5)$$

with z_s = g(e_s),
Figure 9. (a) Artistic view of columns: the fundamental module of the neural computational model for visual processing (s = simple; c = complex; h = hypercomplex). The arrows evidence feedforward and recurrent interactions occurring among layers. (b) Feedforward, inhibitory and recurrent connection schemata among cells
where I(m, n) denotes the intensity of a pixel at point (m, n) in the image plane; w_p(m, n, i, j, θ) with p = 1, 2, 3, 4 are the kernels of different contrast selectivity that describe the receptive field profile of the neuron belonging to column (i, j); w_sc, w_cc, w_hc, w_hh and w_ch denote the weights of connection from simple to complex (feedforward), from complex to complex (intralayer), from hypercomplex to complex (feedback), from hypercomplex to hypercomplex and from complex to hypercomplex respectively; and k is the iteration index.
It is worth noting that the feedforward inhibition schema M(θ) does not depend on the position of the neuron considered in the layer: a complex neuron selective to θ is inhibited by the two simple neurons (with similar orientation preferences) belonging to the same column. N_c(i, j, θ) depends on the orientation preference θ of the target neuron; more precisely, a neuron selective to θ receives inhibitory inputs from two complex neurons (selective to θ + π/2) that belong to the two closest columns lying along an axis orthogonal to θ.
The set L(i, j, θ) depends on the orientation preference of the target neuron. More specifically, the connection schema can be defined as follows: if the target neuron is selective to θ, then the complex neurons that provide the input are selective to θ and belong to neighbouring columns that lie on an axis oriented along θ. Typical values for the number of columns involved in the interaction range from three to seven, but three is sufficient for most applications.
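To fix ideas, one iteration of equations (4) and (5) could be simulated as below; the fixed index shifts standing in for the interaction sets M(θ), N_c, N_h and L(i, j, θ), and the scalar weights, are our simplifications of the connection schemata just described.

```python
import numpy as np

def sigmoid(x):
    """Sigmoidal transfer function g(.)."""
    return 1.0 / (1.0 + np.exp(-x))

def cortical_step(z_s, z_c, z_h, w_sc, w_cc, w_hc, w_hh, w_ch):
    """One iteration of equations (4) and (5).

    z_s, z_c, z_h : arrays (4, N, N), one slice per orientation
                    (0, 45, 90, 135 degrees).
    """
    n_th = z_s.shape[0]
    z_c_new = np.empty_like(z_c)
    z_h_new = np.empty_like(z_h)
    for t in range(n_th):
        orth = (t + 2) % n_th            # orientation rotated by 90 deg
        # M(theta): same-column simple cells with the two adjacent
        # orientation preferences.
        m_inh = z_s[(t - 1) % n_th] + z_s[(t + 1) % n_th]
        # N_c: cross-oriented complex cells of the two closest columns
        # along the axis orthogonal to theta (vertical shifts stand in
        # for the true orientation-dependent geometry).
        n_c = np.roll(z_c[orth], 1, axis=0) + np.roll(z_c[orth], -1, axis=0)
        z_c_new[t] = sigmoid(z_s[t] - w_sc * m_inh
                             - w_cc * n_c + w_hc * z_h[t])
        # L(i, j, theta): iso-oriented complex cells of the target column
        # and its two neighbours along theta (horizontal shifts again a
        # stand-in for the oriented axis).
        l_ff = (z_c[t] + np.roll(z_c[t], 1, axis=1)
                + np.roll(z_c[t], -1, axis=1))
        # N_h: cross-orientation inhibition within the hypercomplex layer.
        z_h_new[t] = sigmoid(w_ch * l_ff - w_hh * z_h[orth])
    return z_c_new, z_h_new
```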
Architectural mapping. The functionality of the complex and hypercomplex layers can be mapped on the architecture presented here, while the functionality of the simple cells has to be implemented by a specific convolution block.11 This block performs convolutions with four pairs of orthogonal filters (oriented along the 0°, 45°, 90° and 135° directions), by four-pixel steps, and provides as output for each orientation the maximum of the absolute value of the convolution pairs (see equation (3)) on an array of 64 × 64 elements. In this way, with an input image of 256 × 256 pixels, the resulting convolution is an array of 64 × 64 elements for each orientation. It is noteworthy that the frequency selectivity of the masks of the convolution blocks will determine the capability of the whole network to be sensitive to particular textures.
The outputs of the convolution stage z_s are stored as excitation inputs in the four sublayers of the simple layer. The statuses of complex and hypercomplex cells are stored in the corresponding quartets of layers. The values in the complex and hypercomplex layers are updated according to the programmed rules and the values stored in all the layers. This occurs by setting the generalized template D of equation (2) according to the explicit algorithmic formulation of equations (4) and (5).
Simulation results, performance and implementation perspectives. We have tested this implementation on a natural textured image.12 The simulations presented here concern texture segregation on natural images. In Figure 10 the test image and the contents of the four hypercomplex layers of the architecture at convergence are presented. The image is subdivided into four square areas that represent the resulting images for the four types of orientation-selective cells along 0°, 45°, 90° and 135°. The luminous intensity of a pixel codes the activity of the corresponding neuron: if the pixel is light, the neuron is active; if the pixel is dark, the neuron is inhibited; if the pixel has an intermediate value, the corresponding neuron is silent (i.e. the neuron is not selective to the stimulus present in its receptive field). Taking into account the number of elements per layer (64 × 64), the number of masks per layer (four) and the number of iterations (10), and assuming a clock frequency of 50 MHz, a complete texture segregation of 256 × 256 pixel images is estimated to be obtained in about 30 ms, allowing one to process images at a commercial camera frame rate (25 images/second). The VLSI design of this architecture is being pursued using a standard cell approach with an appropriate customized memory module generator. On the basis of a similar implementation13 it is estimated that 15 mm × 15 mm of silicon in a 0.5 µm technology will be necessary.

Figure 10. (a) 256 × 256 pixel test image. (b) Outputs of the four hypercomplex layers for the four angles (0°, 45°, 90° and 135°), evidencing the presence of textural features of the corresponding orientation
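As a back-of-the-envelope check of the 30 ms estimate (our own reading of the cycle counts, not a calculation from the paper):

```python
cells      = 64 * 64    # elements per layer
masks      = 4          # instruction masks per layer
cycles     = 9          # clock periods per update (Section 3.2)
iterations = 10
f_clk      = 50e6       # Hz

t = cells * masks * cycles * iterations / f_clk
print(f"{t * 1e3:.1f} ms per frame")    # -> 29.5 ms, consistent with ~30 ms
```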
5. CONCLUSIONS
We have considered a digital VLSI architecture for the implementation of multilayer CNNs. This
architecture combines programmability with high efficiency. This has been achieved with the following
strategy: (i) an elementary recursive algorithm has been defined as the building block of every multilayer CNN by introducing 3D generalized templates that fit well a direct VLSI mapping; (ii) such sparse 3D templates are projected onto a small set of 2D templates; (iii) the recursive operations of the whole algorithm are sequenced with high efficiency using programmable dedicated architectural resources.
In comparison with general-purpose CNN implementations such as the CNN universal machine,14 the following major differences can be evidenced: (i) the issue of programmability for this reconfigurable digital architecture has been explored in a specific application context, though this architectural approach could be extended to other domains of application; (ii) a fully digital solution has been pursued.
ACKNOWLEDGEMENTS
This work was supported in part by CEC ESPRIT-BRA Project CORMORANT. The authors wish to thank
Dr. Paolo Faraboschi and Dr. Giovanni Nateri for useful suggestions.
REFERENCES
1. J. Vandewalle and T. Roska, 'Guest editorial: special issue on cellular neural networks', Int. J. Cir. Theor. Appl., 20, 449-451 (1992).
2. Proc. 1994 IEEE Int. Workshop on Cellular Neural Networks and Their Applications (CNNA-94), IEEE, New York, 1994.
3. K. Halonen, V. Porra and T. Roska, 'Programmable analogue VLSI CNN with local digital logic', Int. J. Cir. Theor. Appl., 20, 573-582 (1992).
4. G. Indiveri, L. Raffo, S. Sabatini and G. Bisio, 'A neuromorphic architecture for cortical multi-layer integration of early visual tasks', Machine Vision Appl., in press.
5. L. Chua and L. Yang, 'Cellular neural networks: theory', IEEE Trans. Circuits and Systems, CAS-35, 1257-1272 (1988).
6. G. Goossens, J. Rabaey, J. Vandewalle and H. De Man, 'Loop optimization in register transfer scheduling for DSP systems', Proc. 26th ACM/IEEE Design Automation Conf., IEEE, New York, 1989.
7. L. Chua and C. Wu, 'On the universe of stable cellular neural networks', Int. J. Cir. Theor. Appl., 20, 497-518 (1992).
8. T. Roska and L. Chua, 'Cellular neural networks with non-linear and delay-type template elements and non-uniform grids', Int. J. Cir. Theor. Appl., 20, 469-482 (1992).
9. D. Van Essen, C. Anderson and D. Felleman, 'Information processing in the primate visual system: an integrated systems perspective', Science, 255, 419-423 (1992).
10. S. Grossberg, E. Mingolla and D. Todorovic, 'A neural network architecture for preattentive vision', IEEE Trans. Biomed. Eng., BE-36, 65-83 (1989).
11. L. Raffo, S. Sabatini, G. Indiveri, G. Nateri and G. Bisio, 'A memory-based recurrent neural architecture for chips emulating cortical visual processing', IEICE Trans. Electron., E77-C (1994).
12. P. Brodatz, Textures: A Photographic Album for Artists and Designers, Dover, New York, 1966.
13. M. Valle, G. Nateri, D. Caviglia, G. Bisio and L. Briozzo, 'An ASIC design for real-time image processing in industrial applications', Proc. EDTC'95, 1995, pp. 385-390.
14. T. Roska and L. Chua, 'The CNN universal machine: an analogic array computer', IEEE Trans. Circuits and Systems II, CAS-40, 163-173 (1993).