0% found this document useful (0 votes)
11 views12 pages

Development and Implementation of Parameterized FPGA-Based General Purpose Neural Networks For Online Applications

paper 5

Uploaded by

Yeshudas Muttu
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
11 views12 pages

Development and Implementation of Parameterized FPGA-Based General Purpose Neural Networks For Online Applications

paper 5

Uploaded by

Yeshudas Muttu
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 12

78 IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, VOL. 7, NO.

1, FEBRUARY 2011

Development and Implementation of Parameterized


FPGA-Based General Purpose Neural Networks for
Online Applications
Alexander Gomperts, Abhisek Ukil, Senior Member, IEEE, and Franz Zurfluh

Abstract—This paper presents the development and implemen- tics include the number of layers in a network, the number of
tation of a generalized backpropagation multilayer perceptron neurons per layer, and the activation functions of those neurons,
(MLP) architecture described in VLSI hardware description etc. There remains a lack of a reliable means for determining the
language (VHDL). The development of hardware platforms has
been complicated by the high hardware cost and quantity of the
optimal set of network characteristics for a given application.
arithmetic operations required in online artificial neural networks Numerous implementations of ANNs already exist [5]–[8],
(ANNs), i.e., general purpose ANNs with learning capability. but most of them being in software on sequential processors
Besides, there remains a dearth of hardware platforms for design [2]. Software implementations can be quickly constructed,
space exploration, fast prototyping, and testing of these networks. adapted, and tested for a wide range of applications. However,
Our general purpose architecture seeks to fill that gap and at the in some cases, the use of hardware architectures matching the
same time serve as a tool to gain a better understanding of issues
unique to ANNs implemented in hardware, particularly using parallel structure of ANNs is desirable to optimize performance
field programmable gate array (FPGA). The challenge is thus to or reduce the cost of the implementation, particularly for appli-
find an architecture that minimizes hardware costs, while maxi- cations demanding high performance [9], [10]. Unfortunately,
mizing performance, accuracy, and parameterization. This work hardware platforms suffer from several unique disadvantages
describes a platform that offers a high degree of parameterization, such as difficulties in achieving high data precision with rela-
while maintaining generalized network design with performance
comparable to other hardware-based MLP implementations.
tion to hardware cost, the high hardware cost of the necessary
Application of the hardware implementation of ANN with back- calculations, and the inflexibility of the platform as compared
propagation learning algorithm for a realistic application is also to software.
presented. In our work, we aimed to address some of these disadvan-
Index Terms—Backpropagation, field programmable gate array tages by developing and implementing a field programmable
(FPGA), hardware implementation, multilayer perceptron, neural gate array (FPGA)-based architecture of a parameterized neural
network, NIR spectra calibration, spectroscopy, VHDL, Xilinx network with learning capability. Exploiting the reconfigura-
FPGA. bility of FPGAs, we are able to perform fast prototyping of hard-
ware-based ANNs to find optimal application specific config-
urations. In particular, the ability to quickly generate a range
I. INTRODUCTION
of hardware configurations gives us the ability to perform a
RTIFICIAL NEURAL NETWORKs (ANNs) present
A an unconventional computational model characterized
by densely interconnected simple adaptive nodes. From this
rapid design space exploration navigating the cost/speed/accu-
racy tradeoffs affecting hardware-based ANNs.
The remainder of this paper will begin by more precisely de-
model stem, several desirable traits uncommon in traditional scribing the motivation of our work and the current state-of-the-
computational models; most notably, an ANN’s ability to learn art in the field in Section II. Section III will provide the basics of
and generalize upon being provided examples. Given these ANNs and the backpropagation learning algorithm. Section IV
traits, an ANN is well suited for a range of problems that will cover the system’s hardware design and implementation de-
are challenging for other computational models like pattern tails of interest. In Section V, we will report the results of our ex-
recognition, prediction, or optimization [1]–[4]. perimentation using the selected sample application. Following
An ANN’s ability to learn and solve problems relies in part on this, we will discuss the results as they relate to the system
the structural characteristics of that network. Those characteris- implementation, and consider areas for further improvement in
Section VI, followed by conclusions in Section VII.

Manuscript received May 11, 2010; revised July 22, 2010; accepted II. MOTIVATION
September 15, 2010. Date of publication October 21, 2010; date of current
version February 04, 2011. This work was supported by ABB Corporate In the past, the size constraints and the high cost of FPGAs
Research, Switzerland. Paper no. TII-10-05-0116. when confronted with the high computational and interconnect
A. Gomperts is with Satellite Services B.V., 2201 DK, Noordwijk, The
Netherlands (e-mail: [email protected]).
complexity inherent in ANNs have prevented the practical use
A. Ukil and F. Zurfluh are with ABB Corporate Research, Segelhofstrasse of the FPGA as a platform for ANNs [11], [12]. Instead, the
1K, Baden 5 Daettwil, Switzerland (e-mail: [email protected]; franz. focus has been on development of microprocessor-based soft-
[email protected]). ware implementations for real world applications, while FPGA
Color versions of one or more of the figures in this paper are available online
at http://ieeexplore.ieee.org. platforms largely remained as a topic for further research.
Digital Object Identifier 10.1109/TII.2010.2085006 Despite the prevalence of software-based ANN implemen-
1551-3203/$26.00 © 2010 IEEE
Authorized licensed use limited to: Indian Institute of Information Technology Design & Manufacturing. Downloaded on March 16,2024 at 12:29:47 UTC from IEEE Xplore. Restrictions apply.
GOMPERTS et al.: DEVELOPMENT AND IMPLEMENTATION OF PARAMETERIZED FPGA-BASED GENERAL PURPOSE NEURAL NETWORKS 79

tations, FPGAs and similarly, application specific integrated online backpropagation in [18] and using Xilinx Virtex XCV400
circuits (ASICs) have attracted much interest as platforms for for implementation of pipelined backpropagation ANN in [19].
ANNs because of the perception that their natural potential for Ferreira et al. discussed about optimized algorithm for activa-
parallelism and entirely hardware-based computation imple- tion functions for ANN implementations in FPGA [20]. Girau
mentation provide better performance than their predominantly described FPGA implementation of 2-D multilayer NN [21].
sequential software-based counterparts. As a consequence, Stochastic network implementation was reported by Bade and
hardware-based implementations came to be preferred for Hutchings [22]. Hardware implementation of backpropagation
high performance ANN applications [9]. While it is broadly algorithm was described by Elredge and Hutchings [23]. Later
assumed, it should be noted that an empirical study has yet on, we will compare our implementation with some of these.
to confirm that hardware-based platforms for ANNs provide From the application side, Alizadeh et al. used ANN in
higher levels of performance than software in all the cases [10]. FPGA to predict cetane number in diesel fuel [24]. Tatikonda
Currently, no well defined methodology exists to determine and Agarwal used FPGA-based ANN for motion control and
the optimal architectural properties (i.e., number of neurons, fault diagnosis of induction motor drive [25]. Mellit et al. used
number of layers, type of squashing function, etc.) of a neural ANN in Xilinx Virtex-II XC2V1000 FPGA for modelling and
network for a given application. The only method currently simulation of standalone photovoltaic systems [26]. Rahna-
available to us is a systematic approach of educated trial and maei et al. reported FPGA implementation of ANN to detect
error. Software tools like MATLAB Neural Network Toolbox anthelmintics resistant nematodes in sheep flocks [27].
[13] make it relatively easy for us to quickly simulate and
evaluate various ANN configurations to find an optimal archi- B. Platform
tecture for software implementations. In hardware, there are Our development platform is the Xilinx Virtex-5 SX50T
more network characteristics to consider, many dealing with FPGA [28]. While our design is not directed exclusively at
precision related issues like data and computational precision. this platform and is designed to be portable across multiple
Similar simulation or fast prototyping tools for hardware are FPGA platforms, we will mention some of the characteristics
not well developed. of the Virtex-5 important to the design and performance of our
Consequently, our primary interest in FPGAs lies in their re- system.
configurability. By exploiting the reconfigurability of FPGAs, This model of the Virtex-5 contains 4080 configurable logic
we aim to transfer the flexibility of parameterized software- blocks (CLBs), the basic logical units in Xilinx FPGAs. Each
based ANNs and ANN simulators to hardware platforms. Doing CLB holds eight logic function generators (in lookup tables),
this, we will give the user the same ability to efficiently explore eight storage elements, a number of multiplexers, and carry
the design space and prototype in hardware as is now possible in logic. Relative to the time in which this paper is written, this is
software. Additionally, with such a tool we will be able to gain considered a large FPGA; large enough to test a range of online
some insight into hardware specific issues such as the effect of neural networks of varying size, and likely too large and costly
hardware implementation and design decisions on performance, to be considered for most commercial applications.
accuracy, and design size. Arithmetic is handled using CLBs containing DSP48E slices.
Of particular note is that a single DSP48E slice can be used to
A. Previous Works implement one of two of the most common and costly oper-
Many ANNs have already been implemented on FPGAs. The ations in ANNs: either two’s complement multiplication or a
vast majority are static implementations for specific offline ap- single multiply accumulate (MACC) stage. Our model of the
plications without learning capability [14]. In these cases, the Virtex-5 holds 288 DSP48E slices.
purpose of using an FPGA is generally to gain performance ad-
vantages through dedicated hardware and parallelism. Far fewer III. ARTIFICIAL NEURAL NETWORKS (ANNS)
are examples of FPGA-based ANNs that make use of the recon- Artificial neural networks (ANN’s, or simply NN’s) are in-
figurability of FPGAs. spired by biological nervous systems and consist of simple pro-
Flexible Adaptable Size Topology (FAST) [15] is an FPGA- cessing elements (PE, artificial neurons) that are interconnected
based ANN that utilizes runtime reconfiguration to dynamically by weighted connections. The predominantly used structure is
change its size. In this way, FAST is able to skirt the problem of a multilayered feed-forward network (multilayer perceptron),
determining a valid network topology for the given application a i.e., the nodes (neurons) are arranged in several layers (input
priori. Runtime reconfiguration is achieved by initially mapping layer, hidden layers, output layer), and the information flow
all possible connections and components on the FPGA, then is only between adjacent layers [4]. An artificial neuron is a
only activating the necessary connections and components once very simple processing unit. It calculates the weighted sum of
they are needed. FAST is an adaptation of a Kohonen type neural its inputs and passes it through a nonlinear transfer function
network and has a significantly different architecture than our to produce its output signal. The predominantly used transfer
multilayer perceptron (MLP) network. functions are so-called “sigmoid” or “squashing” functions that
Interesting FPGA implementation schemes, specially using compress an infinite input range to a finite output range, e.g.,
Xilinx FPGAs, are described in the book edited by Ormondi , see [4].
and Rajapakse [16]. Izeboudjen et al. presented an implemen- Neural networks can be “trained” to solve problems that are
tation of an FPGA-based MLP with backpropagation in [17]. difficult to solve by conventional computer algorithms. Training
Gadea et al. reported comparative implementation of pipelined refers to an adjustment of the connection weights, based on a
Authorized licensed use limited to: Indian Institute of Information Technology Design & Manufacturing. Downloaded on March 16,2024 at 12:29:47 UTC from IEEE Xplore. Restrictions apply.
80 IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, VOL. 7, NO. 1, FEBRUARY 2011

Backpropagation minimizes the overall network error by cal-


culating an error gradient for each neuron from which a weight
change is computed for each synapse of the neuron. The
error gradient is then recalculated and propagated backwards to
the previous layer until weight changes have been calculated for
Fig. 1. MLP model.
all layers from the output to the first hidden layer.
The weight correction for a synaptic weight connecting
neuron to neuron mandated by backpropagation is defined
by the delta rule

(3)

where is the learning rate parameter, is the local gradient of


Fig. 2. Processing element. neuron , and is the output of neuron in the previous layer.
Calculation of the error gradient can be divided into two
cases: for neurons in the output layer and for neurons in the
number of training examples that consist of specified inputs and hidden layers. This is an important distinction because we must
corresponding target outputs. Training is an incremental process be careful to account for the effect that changing the output
where after each presentation of a training example, the weights of one neuron will have on subsequent neurons. For output
are adjusted to reduce the discrepancy between the network and neurons, the standard definition of the local gradient applies
the target output. Popular learning algorithms are variants of
gradient descent (e.g., error-backpropagation) [29], radial basis (4)
function adjustments [4], etc. Neural networks are well suited
to a variety of nonlinear problem solving tasks. For example, For neurons in a hidden layer, we must account for the local
tasks related to the organization, classification, and recognition gradients already computed for neurons in the following layers
of large sets of inputs. up to the output layer. The new term will replace the calculated
error since, because hidden neurons are not visible from out-
A. Multilayer Perceptrons (MLPs) side of the network, it is impossible to calculate an error for
MLPs (Fig. 1) are layered fully connected feed-forward net- them. So, we add a term that accounts for the previously calcu-
works. That is, all PEs (Fig. 2) in two consecutive layers are lated local gradients
connected to one another in the forward direction.
During the network’s forward pass each PE computes its (5)
output from the input it receives from each PE in the
preceding layer as shown here where is the hidden neuron whose new weight we are calcu-
(1) lating, and is an index for each neuron in the next layer con-
nected to .
where is the squashing function of PE whose role is to As we can see from (4) and (5), we are required to differen-
constrain the value of the local field tiate the activation function with respect to its own argument,
the induced local field . In order for this to be possible, the
(2) activation function must of course be differentiable. This means
that we cannot use noncontinuous activation functions in a back-
is the weight of the synapse connecting neuron to neuron propagation-based network. Two continuous, nonlinear activa-
in the previous layer, and is the bias of neuron . Equation tion functions commonly used in backpropagation networks are
(1) is computed sequentially by layer from the first hidden layer the sigmoid function
which receives its input from the input layer to the output layer,
producing one output vector corresponding to one input vector. (6)
The network’s behavior is defined by the values of its weights
and bias. It follows that in network training the weights and and the hyperbolic tangent function
biases are the subjects of that training. Training is performed
using the backpropagation algorithm after every forward pass (7)
of the network.
Training is performed multiple times over all input vectors in
B. Backpropagation Algorithm the training set. Weights may be updated incrementally after
The backpropagation learning algorithm [29] allows us to each input vector is presented or cumulatively after the training
compute the error of a network at the output, then propagate that set in its entirety has been presented (one training epoch). This
error backwards to the hidden layers of the network adjusting second approach, called batch learning, is an optimization of the
the weights of the neurons responsible for the error. The net- backpropagation algorithm designed to improve convergence by
work uses the error to adjust the weights in an effort to let the preventing individual input vectors from causing the computed
output approach the desired output . error gradient to proceed in the incorrect direction.
Authorized licensed use limited to: Indian Institute of Information Technology Design & Manufacturing. Downloaded on March 16,2024 at 12:29:47 UTC from IEEE Xplore. Restrictions apply.
GOMPERTS et al.: DEVELOPMENT AND IMPLEMENTATION OF PARAMETERIZED FPGA-BASED GENERAL PURPOSE NEURAL NETWORKS 81

Fig. 3. Block view of the hardware architecture. Solid arrows show which com-
ponents are always generated. Dashed arrows show components that may or may
not be generated depending on the given parameters.

IV. HARDWARE IMPLEMENTATION

A. Design Architecture
Our design approach is characterized by the separation of
simple modular functional components and more complex in-
Fig. 4. Graphical user interface to generate networks in MATLAB Neural
telligent control oriented components. The functional units con- Network toolbox [13].
sist of signal processing operations (e.g., multipliers, adders,
squashing function realizations, etc.) and storage components
(e.g., RAM containing weights values, input buffers, etc.). Con- data precision, network size, etc., in PC-based ANN designs.
trol components consist of state machines [16] generated to Nevertheless, we wanted to have the same flexible design
match the needs of the network as configured. During design philosophy for the FPGA-based design. Please note, this study
elaboration, functional components matching the provided pa- is limited to implementation of feed-forward MLP with back-
rameters are automatically generated and connected, and the propagation learning algorithm. Also, design parameters like
state machines of control components are tuned to match the number of layers, number of neurons, input-output sizes, types
given architecture. of interconnection, etc., are limited by the hardware resources
Network components are generated in a top-down hierar- available, depending on the type of FPGA used. There are no
chical fashion, as shown in Fig. 3. Each parent is responsible for general rules. One has to derive the limits by experiments.
generating its children to match the parameters entered by the Later on in this paper, we would present some statistics on the
user prior to elaboration and synthesis. Using VHDL generated variations of these design parameters over hardware resource
statements and a set of constants whose values are given by utilizations. Similar statistics for different platforms are also
the user configuration, the top level ANN block generates a reported in [18] and [19].
number of layer blocks as required by the configuration. Each
layer subsequently generates a teacher if learning is enabled B. Data Representation
along with a number of PEs as configured for that layer. Each
Network data is represented using a signed fixed point no-
PE generates a number of MACC blocks equal to the width
tation. This is implemented in VLSI hardware description lan-
of the previous layer as well as a squashing function block.
guage (VHDL), with the IEEE proposed fixed point package
Fig. 3 shows that there is a one-to-one relationship between
[30]. Fixed point notation serves as a compromise between tra-
the logical blocks generated (e.g., Layers, PEs, MACCs) and
ditional integer math and floating point arithmetic, the latter
the functions and architecture of the conceptual ANN model.
being prohibitively costly in terms of hardware [31].
This approach has been chosen over one in which logical
Our network has a selectable data width which is subdivided
blocks are time-multiplexed to allow for a high-performance
into integer and fractional portions. For example, our implemen-
fully pipelined implementation at the cost of greater resource
tation incorporates one sign bit , one integer place , and
demands.
three fractional places
In this context, it is interesting to note how it is done at
PC-based system. Fig. 4 shows the graphical user interface
to generate different networks using the MATLAB Neural
Network toolbox [13]. As shown in Fig. 4, one can choose The network data precision thus becomes (0.125 in our
network type (feed-forward backpropagation, MLP, radial example), where is the number of fraction bits and gives the
basis, etc.), number of layers, number of neurons per layer, the data set a maximum range of
transfer functions, etc. With these inputs, the network structure
is generated. Of course, one gets higher flexibility in terms of (8)
Authorized licensed use limited to: Indian Institute of Information Technology Design & Manufacturing. Downloaded on March 16,2024 at 12:29:47 UTC from IEEE Xplore. Restrictions apply.
82 IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, VOL. 7, NO. 1, FEBRUARY 2011

[33]. More accurate approximations will result in faster, better


convergence, hence more accurate results. There has been a
significant amount of research into how a sigmoid function can
be efficiently approximated in hardware, while maintaining
an acceptable degree of accuracy and the continuity required
for reliable convergence in backpropagation learning [32]. To
create a generalized design we must add one additional require-
ment for sigmoid function approximation, that the method of
approximation must be valid for a variety of sigmoid functions.
Based on size and accuracy requirements to be met by the net-
work we are generating, we may select one of two implemen-
tation styles for a sigmoid function that we have implemented
in our generalized design: a uniform Lookup Table (LUT) or
a LUT with linear interpolation, described below. Compara-
tive analysis of the hyperbolic tangent squashing function im-
plementation, using LUT and piecewise linear approximation
techniques are discussed in [20]. Gadea et al. compared perfor-
mances of different types of implementation of squashing func-
Fig. 5. Functional blocks of the PE component. tion in [18].
LUT-based approach works much faster than piece-wise
linear approximation, though LUT consumes memory. So, if
( in our example) where is there is not much concern about memory, LUT is preferred, as
the total data width, , for our implementation. in our implementations, and real-time applications, e.g., motion
control and fault diagnosis of induction motor drive [25].
C. Processing Element 1) Uniform Lookup Table (LUT) Implementation: A uniform
LUT implemented in block RAM may be used to approximate
An effort was made to keep the realization of a single PE
a function of any shape.
as simple as possible. The motivation for this was that in our
The LUT is addressed using the local field. The address is
parallel hardware design many copies of this component would
formed by taking the inverse of the sign bit of the local field
be generated for every network configuration. So, keeping this
and concatenating the most significant bits required to represent
component small helps to minimize the size of the overall de-
the highest input value of the function mapped onto the LUT
sign. The control scheme was centralized external to the PE
down to the number of address bits of the LUT. Altogether, the
component to prevent the unnecessary duplication of function-
computation requires one cycle and minimal hardware to hold
ality and complexity.
the table itself.
In paring the design of the PE component down to its essential
The uniform LUT implementation, despite being popular in
features, we are left with a multiply accumulate function with
FPGA-based ANNs and while efficient in terms of speed and
a width equal to the number of neurons in the previous layer,
size, presents a problem in terms of its size versus accuracy
the selected squashing function implementation, and memory
tradeoff when it comes to modeling functions with steep slopes
elements such as registers contained the synaptic weights and
like a sigmoid function. As the slope increases so does the quan-
input buffers.
tization error between LUT entries (Fig. 6). A common solution
The PE is described by [(1) and (2)]. The full calculation of
for this problem is the use of a LUT with variable resolution
the PE is shown in Fig. 5. In Fig. 5, the MACC units multiply and
[Fig. 6(c)]. That is, a LUT with higher resolution for portions of
sum the inputs with the weights and the bias , to implement
the function with steeper slopes. However, this is a solution that
as in [(1) and (2)]. After that, the final quantity
must be custom crafted for every given sigmoid and thus is not
from the MACC is passed through the squashing function (see
easily generalized as we would like.
top right corner) to produce the network output . As indicated
2) Lookup Table With Linear Interpolation Implementation:
in Fig. 5, the squashing function utilizes lookup table (LUT)
To address the diminished accuracy of a uniform LUT while
(via “lut_en” instruction), described below.
maintaining generalizations over a variety of functions, we in-
corporate linear interpolation alongside the LUT. This solution
D. Squashing Function effectively draws a line between each point of a uniform LUT
The direct implementation of the preferred nonlinear as in Fig. 7. The resulting activation function is
squashing, e.g., the sigmoid function [see (6) and (7)], presents
a problem in hardware since both the division and exponen-
tiation operations require an inordinate amount of time and (9)
hardware resources to compute. The only practical approach in
hardware is to approximate the function [20], [32]. However, where is the value returned by a uniform LUT rep-
in order for training to converge or for us to obtain accurate resenting the target function for index (done as described in
offline results, a minimum level of accuracy must be reached Section IV-D1), is the bit widths of the local field, is the bit
Authorized licensed use limited to: Indian Institute of Information Technology Design & Manufacturing. Downloaded on March 16,2024 at 12:29:47 UTC from IEEE Xplore. Restrictions apply.
GOMPERTS et al.: DEVELOPMENT AND IMPLEMENTATION OF PARAMETERIZED FPGA-BASED GENERAL PURPOSE NEURAL NETWORKS 83

Fig. 8. Backpropagation implementation.


Fig. 6. (a) Sigmoid function implemented on a uniform LUT. (b) Uniform LUT
error distribution for a sigmoid function. (c) Example of partitioning for a vari-
able resolution LUT.
its own backpropagation teacher component which is respon-
sible for the control flow of the algorithm for its layer. Since
the backpropagation algorithm is only being executed for one
layer at a time we need only one set of the necessary arithmetic
components.
Since the number of multipliers and adders, and the size of the
MACC are dependent on the size of a given layer and its prede-
cessor, we must compute and generate the worst case number
of arithmetic components that are needed during elaboration
of the given network design. The set of arithmetic components
used in the backpropagation calculation are then packaged into
the BP ALU component.
Fig. 7. The hyperbolic tangent function is approximated by finding the nearest The BP ALU is accessed by each backpropagation teacher via
indices of the LUT corresponding to the local field, then drawing a line between an automatically generated multiplexer (Fig. 8) which is con-
the values referenced by those indices. The quantization error is then used to
find the corresponding point on the line segment LUT(d)LUT(d + 1). trolled by the network controller.
To optimize the performance of the training algorithm, we
begin execution during the forward pass of the network. In
width of the LUT address bus, and is the quantization error the forward pass, we are able to compute two elements of
described by the algorithm: once a given layer receives its input, and
once the layer’s output has been computed. In the hidden
layers, the results of these preprocessing steps are then saved
(10) until the error gradient reaches them in the backward pass. The
output layer teacher continues immediately by calculating the
The algorithm flow is controlled by a state machine inside output error and the local error gradient (4) for every neuron
the squashing function component enabled by the network con- in the output. Once the error gradient has been calculated at
troller. After receiving input, the result is registered at the output the output layer, the final hidden layer may calculate its own
after five clock cycles. An analysis of the tradeoff between the error gradient (5) and pass that back. Synthesis example of BP
hardware cost, performance, and accuracy related to this method algorithm for a realistic application case would be presented in
is given in Section V-A. Section VI-B.
E. Backpropagation Computation F. Network Controller
The backpropagation algorithm (see Section III-B) relies on The design of the network controller was strongly guided by
the calculation of the network error at the output layer to esti- the highly generalized design of the network and the initial de-
mate the error of neurons in the previous layers. The error is cision to separate functional and control units. The decision to
estimated and propagated backwards layer by layer until the centralize control of the network was based on the goal of mini-
first hidden layer is reached. It follows that the error estimation mizing the size and complexity of components that must be gen-
of any given layer except the output layer is dependent on the erated multiple times. This is contrary to a distributed control
error calculation of its successor. Because of this, the training mechanism made up of independent components capable of de-
algorithm must be computed sequentially, layer by layer lim- termining their own state and communicating that state to one
iting parallelism to the neuron level. another. This would be the more modular solution, but would
The hardware implementation once again seeks to separate also inflict a significant time and hardware penalty caused by
control and functional-based components. Each layer contains the control overhead in the network, since control mechanisms
Authorized licensed use limited to: Indian Institute of Information Technology Design & Manufacturing. Downloaded on March 16,2024 at 12:29:47 UTC from IEEE Xplore. Restrictions apply.
84 IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, VOL. 7, NO. 1, FEBRUARY 2011

TABLE I
WORST CASE ERROR OF APPROXIMATED HYPERBOLIC TANGENT FUNCTION
USING A LUT WITH LINEAR INTERPOLATION AND A UNIFORM LUT

Fig. 9. State machines for network controller.


V. EXPERIMENTAL RESULTS
To evaluate our system, we implement a basic sample
would be repeated many times through the network. The cen- application in a varied set of configurations. To speed the
tralized control scheme on the other hand, relies on the pre- process, we carried out testing using Mentor Graphics’ Mod-
dictability of the timing and behavior of any generated network elSim simulator. A similar network implementation built using
configuration. the MATLAB Neural Network Toolbox [13] is used as a basis
Depending on the network to be generated, the network con- for comparison.
troller is created as either an online or offline network controller. For comparison we take as metrics the root mean square error
Here, offline indicates that a network is pretrained, and then
(RMSE) of the ANNs when applied to a test data set after 50,
the network object is used on test data, without any training
100, 200, 400, and 800 epochs. We define the RMSE as
capability. This type of network is often used in hardware
implementations due to less implementation complexity. On
the other hand, online indicates that the network has dynamic (11)
training capability. For different applications, the network with
different architecture, would train itself following particular
training methods, e.g., backpropagation, before acting on test where is the calculated output of the network, is the desired
data. This provides a generalized flexibility, however, it is output, and is the number of input sets presented.
complex to implement. A speed comparison is not meaningful due to the dubious
Different implementations are necessary since in offline nature of any resulting measure considering the external factors
mode a pipelined network is generated and the online controller affecting execution time on a PC.
must include control for the computation of the backpropaga- We will also remark here on the results of our squashing func-
tion algorithm. Despite this, both controllers are implemented tion approximation method using a LUT combined with linear
in the same manner. interpolation.
The network controller is a Mealy state machine (see Fig. 9)
based on a counter indicating the number of clock cycles that A. LUT With Linear Interpolation
have passed in the current iteration (in the case of an online net-
work) or the total number of clock cycles passed (in the case To judge the accuracy of our approximation technique for sig-
of an offline network). As mentioned above, for online applica- moid function using a LUT with linear interpolation we calcu-
tion, we would need network with learning capability. It is to lated the average error and worst case error of the technique for
be noted that, for backpropagation learning algorithm, the error a range of network data precisions and LUT sizes. Using a uni-
needs to be fed back. Therefore, Mealy state machine is suit- form LUT as a control, we approximated the hyperbolic tangent
able, as it is a finite-state transducer that generates an output with input ranging [ 4:4]. Table I shows the worst case error
based on its current state and input. For the value of the counter using the two techniques and Table II the average error. From
to have any meaning we must be able to precalculate the latency the results we see that the LUT with linear interpolation pro-
to reach milestones in the forward and back passes of the net- vides a significant improvement in accuracy. For example, in a
work. These milestones are calculated during elaboration of the case of a network that uses 15-bit fractional precision, an 8192
design. Based on these milestones, the state machine outputs a element uniform LUT can be replaced by a 128 element LUT
set of enable signals to control the flow of the network. In com- with linear interpolation and achieve a slightly better quality ap-
parison, for offline applications, one would not require training, proximation on average.
hence the back passes in the network controller state machine Using the LUT with linear interpolation it becomes possible
(see Fig. 9). to reduce the error such that the magnitude of the error falls
Authorized licensed use limited to: Indian Institute of Information Technology Design & Manufacturing. Downloaded on March 16,2024 at 12:29:47 UTC from IEEE Xplore. Restrictions apply.
GOMPERTS et al.: DEVELOPMENT AND IMPLEMENTATION OF PARAMETERIZED FPGA-BASED GENERAL PURPOSE NEURAL NETWORKS 85

TABLE II TABLE IV
AVERAGE ERROR OF APPROXIMATED HYPERBOLIC TANGENT FUNCTION USING RMSE OVER THE TEST SET
A LUT WITH LINEAR INTERPOLATION AND A UNIFORM LUT

which have a worst case error of , the bit resolution as


per Table III, must be at least 12 . Based
on this, we can say that a lin-LUT of size is the smallest
size LUT that gives a maximally accurate result for a network
fractional precision of 10 or 11.
TABLE III
BIT RESOLUTION REQUIRED TO RESOLVE THE
WORST CASE APPROXIMATION ERROR
B. Spectrometry Application
Our test application attempts to draw out conclusions from
the frequency spectrum returned by a spectrometer. In this case,
we are analyzing several cuts of meat to determine the fat con-
centration of each. The network is provided with the magni-
tudes of ten principle components of detected frequencies and
is trained at the output using the measured fat concentration of
the cut [34]. NNs are increasingly used as a nonlinear modeling
tool for spectra calibration [35], at PC-level.
Here, we compare the accuracy of the FPGA-based imple-
mentation with that of a similar MATLAB simulation. Both
use a 10-3-1 layer configuration trained using backpropagation.
That is, ten noncomputing neurons in the input layer corre-
sponding to the ten principle components, three neurons in
the hidden layer, and one output neuron corresponding to the
under the resolution of the network data precision. At this point, fat concentration. The MATLAB simulation employs batch
we have reached a maximally precise approximation for the learning with a learning rate of 0.05, while our system uses in-
given network. Table III shows the resolution necessary to reg- cremental learning with a learning rate of 0.1. The networks are
ister the approximation error in the worst case for each set up in trained over 200 input/target pairings, the final 40 pairings are
our tests. The boxed entries show the minimum LUT size for a used for testing to evaluate the network’s ability to generalize.
given network data resolution to reach a maximally precise ap- Input values were normalized between 2 and 2 and target
proximation. values between 1 and 1. Both networks use the same initial
The logarithmic scale on the axis of Table III is with base set of weights randomized between 2 and 2.
2. This is done because number of elements in optimal LUTs Table IV compares the RMSE of the test cases (not included
in hardware always have a base 2 logarithm and because only in training) for the MATLAB and FPGA implementations over
these values are used, the step size is defined this way. This different number of training epochs. Comparison of perfor-
means that 10 corresponds to elements (the reason mances for unseen test cases demonstrate the generalization
this is optimal is that to address 1024 elements, exactly 10 bits capability of the networks. FPGA implementations use data
are used). Likewise, a bit resolution of 10 is equivalent to a widths with 1 sign bit, 2 integer bits, and fractional data preci-
resolution of . 10 represents the binary sions of varying size (the total data width of the network data
place, i.e., 10 places after the decimal point. is given in the left column). The FPGA-based implementations
The purpose of Table III is to show for a given data precision utilize LUTs with linear interpolation of the size required for
in the network, what the optimal size for a LUT implementing maximally precise approximations as indicated in Table III.
a maximally accurate standard sigmoid function is. Maximally From Table IV, we can notice, the best RMSE using the
accurate is defined by a worst case error that is less than the MATLAB implementation for the testset is 2.63 (obtained using
precision of data propagating through the network. The contents 800 training epochs). In comparison, the FPGA implementation
of Table III are based on Table I, which describes the minimum achieves the best testset RMSE of 3.05 (for 18-bit data width,
bit precision needed to resolve the worst case error. The boxed using 400 training epochs). Expectedly, FPGA implementation
elements in Table III give the minimum size LUT necessary to results in less accuracy compared to PC-based implementation
achieve a worst case error less than the resolution of the network. using MATLAB, possibly due to factors like limitation in data
For example, if we consider the top left boxed entries in Table I, width, fixed point data precision, incremental learning instead
Authorized licensed use limited to: Indian Institute of Information Technology Design & Manufacturing. Downloaded on March 16,2024 at 12:29:47 UTC from IEEE Xplore. Restrictions apply.
86 IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, VOL. 7, NO. 1, FEBRUARY 2011

TABLE V TABLE VI
RMSE OF AN FPGA-BASED NETWORK WITH SQUASHING FUNCTIONS IN THE A COMPARISON OF ANN PLATFORMS AND IMPLEMENTATIONS
HIDDEN LAYER IMPLEMENTED USING A LUT WITH LINEAR INTERPOLATION
AND THE SAME NETWORK IMPLEMENTED USING UNIFORM LUTS

of batch learning, etc. However, the best performance of the


FPGA implementation is reasonably acceptable compared to
that of MATLAB. Also, in the FPGA implementation of the
online network, we could flexibly parameterize data widths,
number of training epochs, etc., a central aim of this study.
A maximally precise approximation, however, is not always
necessary for a functional network. In [32], Tommiska regards
a maximum error of 0.0039 to be the functional limit on the
accuracy of the squashing function for network training, and a
maximum error of 0.0078 as the limit for the forward pass.
Table V compares the RMSE over the test set of two FPGA higher learning rate. The quality of the FPGA-based network’s
implementations of networks with 12-bit data precision, one im- convergence however, is not as good. The FPGA-based network
plemented with a 64-element LUT with linear interpolation and must utilize a higher learning rate to reduce the increased chance
one with a uniform 64-element LUT. We see that indeed the net- of the training algorithm becoming trapped in a local minimum
work is still functional as Tommiska predicts. However, training as a result of using incremental learning. This in turn causes
becomes noticeably erratic and the accuracy of the network is the overall network error gradient to become unstable earlier in
significantly diminished as expected. training; a symptom of overshooting the target value in training.
The MATLAB simulation on the other hand, is less likely to
VI. DISCUSSION overshoot a target in training thanks to its low learning rate and
Ideally, we would like to make accuracy and performance is at the same time less likely to become trapped in local minima
comparisons only against other hardware-based systems. This using batch learning.
is for two reasons. First, hardware-based ANNs allow for more The extremes of the target output are other contributors to
meaningful measures of computation time. In software—par- the error in the FPGA implementation. The range of the cal-
ticularly software in a PC environment—computation time is culated output does not reach the extreme ends of the target
affected by a series of external factors such as the characteris- range. This is attributed to the eventual averaging of the weight
tics of the processor and the current load on the system. Second, changes caused by changes to the error gradient for each itera-
software-based ANNs are able to employ a number of features tion in training.
that may not be practical or feasible in hardware such us floating This sample application proved to be a somewhat special
point notation or advanced training techniques. Software and case. In testing, we found that we could train the network for
hardware-based ANNs are two very different creatures. Hard- an unusually large number of epochs without experiencing any
ware in general—but not in all cases—tends to lend itself to overfitting in the test set. We found that the error in the test set
faster performance. However, it is less capable of utilizing more bore a strong relation to the training set error in both implemen-
complex techniques to achieve a higher degree of accuracy as tations. In this sample application, we never saw the expected
software is able to. This disparity reduces the value of such divergence caused by over training the network. This occurs
comparisons since in reality we are comparing two significantly when data sets contain few if any outliers, the training set is very
different networks. However, considering the scarcity of back- representative of the relation, and the test set conforms well to
propagation ANN networks in hardware, and without access to that relation. Given a larger, more diverse test set or a smaller
training and test data for those few ANNs, the MATLAB Neural training set, we expect to observe a divergence between error of
Network Toolbox [13] remains a viable alternative benchmark the training set and that of the test set.
for accuracy when care is taken to note the inequities inherent From Table IV, it is clear that the data precision of the ANN
in the comparison. On the other hand, we are able to make per- plays an important role in the network’s ability to converge in
formance comparisons against other hardware platforms using training and its resulting accuracy. Our own testing seems to
a set of standard performance metrics. This comparison will be show that Holt and Hwang’s assertion [33], that 13 bits is the
presented in Section VI-B. minimum required precision for network weights and biases to
reliably obtain convergence using backpropagation learning, is
A. System Accuracy
not a firm rule. We found that our 12-bit implementation can still
We found that in our sample application, the FPGA-based yield a successful convergence, while indeed, lower bit widths
network converges over fewer iterations than the MATLAB sim- like that of our 9-bit implementation results in rather poor con-
ulation. This comes as a result of incremental learning and a vergence. The unexpected success of the 12-bit implementation
Authorized licensed use limited to: Indian Institute of Information Technology Design & Manufacturing. Downloaded on March 16,2024 at 12:29:47 UTC from IEEE Xplore. Restrictions apply.
GOMPERTS et al.: DEVELOPMENT AND IMPLEMENTATION OF PARAMETERIZED FPGA-BASED GENERAL PURPOSE NEURAL NETWORKS 87

TABLE VII
HARDWARE RESOURCE UTILIZATION IN 12-BIT SPECTROMETRY APPLICATION USING XILINX XC5VSX50T-2FF1136

TABLE VIII
RESOURCES USED IN DIFFERENT NN IMPLEMENTATION AND APPLICATIONS

is likely due in part to the expansion of the data path for inter- potential to have its own squashing function, implementation
mediate results in our network. type, and level of accuracy. As a result of choices to maximize
flexibility, network designs generated on our platform are likely
B. Hardware Implementation to be somewhat larger than any application specific designs.
To measure the performance of ANNs across the wide range Parallelism in the network is implemented at the synaptic
of platforms that they have been implemented on, there are two level, this represents the maximum level of parallelism for
common criteria: connections per second (CPS) and connection neural networks. Using synaptic level parallelism, all neurons
updates per second (CUPS). CPS is defined as the number of and synaptic connections in a network layer operate in parallel
multiply–add operations that occur in the forward pass of a net- with the goal of maximizing performance. Performance in the
work per second. CUPS is the number of weight updates per offline network is further improved with the use of a fully
second in the network. Two additional derivations of these met- pipelined design.
rics exist. First, to account for the number of connections in the As an example of the hardware resource demands of a
network that must be computed or updated we may calculate network generated using our system, we take the online im-
connections per second per weight (CPSPW) or connection up- plementations of the 12-bit spectrometry application network.
dates per second per weight (CUPSPW). Second, we may take Table VII shows the quantities of selected components gener-
into account the precision of the network by calculating the con- ated during synthesis to implement the networks of different
nection primitives per second (CPPS) or the connection primi- sizes as well as the total percentage of area consumed against
tives per second per weight (CPPSPW) to account for both the the available resources on the Virtex-5 xc5vsx50t-2ff1136
number of connections in the network and the precision of the platform [28].
network data [38]. The amount of hardware resources consumed by a network
Table VI presents a performance comparison between our implemented using our platform is strongly influenced by
network and other implementations using these metrics. The many factors. The two most influential of these factors are
performance data of our network reflects the 12-bit imple- the dimensions (governed by network structure) and the data
mentation of the spectrometry application presented earlier in width of the network. Naturally, as shown in Table VII, net-
Section V-B. works of greater size result in greater hardware requirements.
Due to the parallel implementation of FPGA-based neural The structure 10-3-1 has been used for the results shown in
networks like ours and the implementation presented in [17], Tables IV and V. For brevity purpose, networks of lower sizes
the CPS metric increases as a function of the number of synaptic are not shown.
connections. This is because as more neurons are implemented Besides, parallel calculation of the backpropagation makes
in parallel, more MACC operations can occur simultaneously, consecutive wide layers especially costly, requiring a number
thereby increasing the number of operations per second. The of MACCs equal to the product of the width of the two widest
CPSPW metric is a means for normalizing this behavior by di- consecutive layers. When choosing the data width of the net-
viding by the number of weights. However, in parallel imple- work, careful consideration should be made to match the data
mentations, this metric punishes networks with wide input vec- width to the width of hardware arithmetic units on the target
tors by including weights in the normalization that are not con- FPGA. Wider data widths than the width of hardware arithmetic
nected to multiply–add operations on the input side. units necessitate additional sets of those units to implement the
The implementation of our system maximizes flexibility and design, thereby significantly increasing the final hardware re-
parallelism over hardware cost. This is seen, for example, in that quirements of the implementation.
when a squashing function implementation for a layer of PEs Examples of hardware resource consumptions in other neural
requires a LUT, each PE is given a dedicated LUT as opposed network implementation and applications, reported by the syn-
to sharing it with other PEs. We do this to allow each PE the thesis results of various works, are briefly shown in Table VIII.
Authorized licensed use limited to: Indian Institute of Information Technology Design & Manufacturing. Downloaded on March 16,2024 at 12:29:47 UTC from IEEE Xplore. Restrictions apply.
88 IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, VOL. 7, NO. 1, FEBRUARY 2011

It is to be noted that Table VIII employs many different network lion online. These speeds represent comparable performance to
structures for various applications. Hence, it should not be used other hardware-based MLP implementations. We have also con-
as a direct comparison to Table VII. firmed that our system is capable of producing accurate con-
Nevertheless, there are probably no parametric rules to vergence in training on a par close to MATLAB simulations.
estimate how network implementation in FPGA scales up. Statistics on hardware resource utilizations from the synthesis
Therefore, the statistics on hardware resource utilizations from reports of different network sizes are presented. These could be
the synthesis reports, as shown in Tables VII and VIII, should used to estimate hardware demands (for specific platforms) with
be used to estimate hardware demands (for specific platforms) growing network structure.
with growing network structure. Of course, for a particular Also presented was a new method for approximation of a
platform, there would be resource limitation, putting constraints sigmoid function in hardware. We showed that by applying a
on the upper limit of the possible network size. However, a linear interpolation technique to a uniform LUT, we could sig-
rough overview on the possible hardware requirements, e.g., in nificantly reduce the size of the necessary LUT, while main-
Tables VII and VIII, could be a rough guide in choosing the taining the same degree of accuracy at the cost of implementing
appropriate FPGA platform for any particular application and one adder and one multiplier.
network structure requirements.
ACKNOWLEDGMENT
C. Future Directions
The authors would like to thank the anonymous reviewers
Some interesting additions could bring the accuracy of for the constructive review, helping to upgrade the paper. This
this hardware-based network implementation closer to that paper is based on work carried out by Alexander Gomperts at
of software-based ones. The first step in this direction would ABB Corporate Research, Switzerland from September 2008
include the implementation of a momentum factor in the to February 2009 under the supervision of A. Ukil and F. Zur-
backpropagation training algorithm. The momentum factor fluh in partial fulfillment of the requirements of a Master’s of
speeds up learning and brings us to a better convergence by Electrical Engineering degree from the Technische Universiteit
rewarding training that consistently reduces error and punishes Eindhoven, The Netherlands. Supervision on the side of the Uni-
any increases in error. This could not be covered in this work versity was provided by L. Jóźwiak. An internship preceded this
mainly because of hard time-limits for the thesis. thesis work.
A second more complex step is the implementation of batch
learning. Batch learning is a common approach in software- REFERENCES
[1] I. A. Basheer and M. Hajmeer, “Artificial neural networks: Fundamen-
based implementations, and as we saw in Section V-B, effec- tals, computing, design, and application,” J. Microbio. Methods, vol.
tive in finding good convergence. When combined with a mo- 43, pp. 3–31, Dec. 2000.
mentum factor, batch learning is particularly adept at fast and [2] M. Paliwal and U. A. Kumar, “Neural networks and statistical tech-
niques: A review of applications,” Expert Systems With Applications,
accurate convergence. The implementation of batch learning on vol. 36, pp. 2–17, 2009.
an FPGA platform faces challenges in either calculating inter- [3] B. Widrow, D. E. Rumelhart, and M. A. Lehr, “Neural networks: Ap-
mediate weights quickly enough in order not to cause buffering plications in industry, business and science,” Commun. ACM, vol. 37,
no. 3, pp. 93–105, 1994.
of the weight array and network output, or, implementing ex- [4] A. Ukil, Intelligent Systems and Signal Processing in Power Engi-
ternal memory to buffer the necessary data without slowing the neering, 1st ed. New York: Springer, 2007.
[5] B. Schrauwen, M. D’Haene, D. Verstraeten, and J. V. Campenhout,
system to the point that a software-based solution becomes a “Compact hardware liquid state machines on FPGA for real-time
better choice. speech recognition,” Neural Networks, vol. 21, no. 2–3, pp. 511–523,
Because of our primary interest to implement and test ANN 2008.
[6] C. Mead and M. Mahowald, “A silicon model of early visual pro-
with learning on FPGA, we skipped real-time input data com- cessing,” Neural Networks, vol. 1, pp. 91–97, 1988.
munication to the FPGA, which would have required consid- [7] J. B. Theeten, M. Duranton, N. Mauduit, and J. A. Sirat, “The LNeuro
erable effort to design such communication module. Such real- chip: A digital VLSI with on-chip learning mechanism,” in Proc. Int.
Conf. Neural Networks, 1990, vol. 1, pp. 593–596.
time online input data feeding is important to check how fast the [8] J. Liu and D. Liang, “A survey of FPGA-based hardware implementa-
FPGA can process new data. However, that is somewhat depen- tion of ANNs,” in Proc. Int. Conf. Neural Networks Brain, 2005, vol.
2, pp. 915–918.
dent on the clock frequency of the FPGA platform, not the direct [9] P. Ienne, T. Cornu, and G. Kuhn, “Special-purpose digital hardware
focus of this work. This would be part of the future work, where for neural networks: An architectural survey,” J. VLSI Signal Process.,
we plan to test the ANN implemented in FPGA with real-time vol. 13, no. 1, pp. 5–25, 1996.
[10] A. R. Ormondi and J. Rajapakse, “Neural networks in FPGAs,” in Proc.
input data communication. Int. Conf. Neural Inform. Process., 2002, vol. 2, pp. 954–959.
[11] B. J. A. Kroese and P. van der Smagt, An Introduction to Neural Net-
VII. CONCLUSION works, 4th ed. Amsterdam, the Netherlands: The University of Ams-
terdam, Sep. 1991.
In this paper, we have presented the development and im- [12] J. Zhu and P. Sutton, “FPGA implementations of neural networks—A
plementation of a parameterized FPGA-based architecture for survey of a decade of progress,” Lecture Notes in Computer Science,
vol. 2778/2003, pp. 1062–1066, 2003.
feed-forward MLPs with backpropagation learning algorithm. [13] “MATLAB Neural Network Toolbox User Guide,” ver. 5.1, The Math-
Our architecture makes native prototyping and design space ex- Works Inc., Natick, MA, 2006.
ploration in hardware possible. Testing of the system using the [14] A. Rosado-Munoz, E. Soria-Olivas, L. Gomez-Chova, and J. V.
Frances, “An IP core and GUI for implementing multilayer perceptron
spectrometry sample application showed that the system can with a fuzzy activation function on configurable logic devices,” J.
reach 530 million connections per second offline and 140 mil- Universal Comput. Sci., vol. 14, no. 10, pp. 1678–1694, 2008.
Authorized licensed use limited to: Indian Institute of Information Technology Design & Manufacturing. Downloaded on March 16,2024 at 12:29:47 UTC from IEEE Xplore. Restrictions apply.
GOMPERTS et al.: DEVELOPMENT AND IMPLEMENTATION OF PARAMETERIZED FPGA-BASED GENERAL PURPOSE NEURAL NETWORKS 89

[15] E. Sanchez, “FPGA implementation of an adaptable-size neural net- [36] M. Holler, S. Tam, H. Castro, and R. Benson, An Electrically Train-
work,” in Proc. Int. Conf. ANN, 1996, vol. 1112, pp. 383–388. able Artificial Neural Network (ETANN) With 10240 “Floating Gate”
[16] A. R. Ormondi and J. Rajapakse, FPGA Implementations of Neural Synapses. Piscataway, NJ: IEEE Press, 1990, pp. 50–55.
Networks. New York: Springer, 2006. [37] D. Hammerstrom, “A VLSI architecture for high-performance, low-
[17] N. Izeboudjen, A. Farah, H. Bessalah, A. Bouridene, and N. Chikhi, cost, on-chip learning,” in Proc. Int. Joint Conf. NN, 1990, vol. 2, pp.
“Towards a platform for FPGA implementation of the MLP based back 537–544.
propagation algorithm,” Lecture Notes in Computer Science, vol. 4507/ [38] E. van Keulen, S. Colak, H. Withagen, and H. Hegt, “Neural network
2007, pp. 497–505, 2007. hardware performance criteria,” in Proc. IEEE World Congr. Comput.
[18] R. Gadea, R. C. Palero, J. C. Boluda, and A. S. Cortes, “FPGA im- Intell., 1994, vol. 3, pp. 1955–1958.
plementation of a pipelined on-line backpropagation,” J. VLSI Signal
Process., vol. 40, pp. 189–213, 2005.
[19] R. Gadea, J. Cerda, F. Ballester, and A. Mocholi, “Artificial neural net-
work implementation on a single FPGA of a pipeline on-line backprop-
agation,” in Proc. Int. Symp. Syst. Synthesis, 2000, pp. 225–230.
[20] P. Ferreiraa, P. Ribeiroa, A. Antunes, and F. M. Dias, “A high bit res-
olution FPGA implementation of a FNN with a new algorithm for the
activation function,” Neurocomputing, vol. 71, pp. 71–77, 2007. Alexander Gomperts received the B.Sc. degree
[21] B. Girau, “Building a 2D-compatible multilayer neural network,” in in electrical engineering from the Worcester Poly-
Proc. Int. Joint Conf. Neural Networks, 2000, pp. 59–64. technic Institute, Worcester, MA, in 2004 and the
[22] S. L. Bade and B. L. Hutchings, “FPGA-based stochastic neural net- M.Sc. degree in electrical engineering from the
work implementation,” in Proc. IEEE Workshop FPGAs for Custom Technische Universiteit Eindhoven, Eindhoven, The
Comput. Mach., 1994, pp. 189–198. Netherlands, in 2009.
[23] J. G. Elredge and B. L. Hutchings, “RRANN A hardware implemen- Currently, he is a Hardware Engineer at Satellite
tation of the backpropagation algorithm using reconfigurable FPGAs,” Services B.V., Noordwijk, The Netherlands.
in Proc. IEEE World Conf. Comput. Intell., 1994, pp. 77–80.
[24] G. Alizadeh, J. Frounchi, M. B. Nia, M. H. Zariff, and S. Asgarifar,
“An FPGA implementation of an artificial neural network for predic-
tion of cetane number,” in Proc. Int. Conf. Comp. Comm. Eng., 2008,
pp. 605–608.
[25] S. Tatikonda and P. Agarwal, “Field programmable gate array (FPGA)
based neural network implementation of motion control and fault diag- Abhisek Ukil (S’05–M’06–SM’10) received the
nosis of induction motor drive,” in Proc. IEEE Conf. Ind. Tech., 2008, B.E. degree in electrical engineering from Jadavpur
pp. 1–6. University, Calcutta, India, in 2000, the M.Sc. degree
[26] A. Mellit, H. Mekki, A. Messai, and H. Salhi, “FPGA-based implemen- in electronic systems and engineering management
tation of an intelligent simulator for stand-alone photovoltaic system,” from the University of Bolton, Bolton, U.K., and
Expert Systems With Applications, vol. 37, pp. 6036–6051, 2010. the Ph.D. degree from the Tshwane University of
[27] A. Rahnamaei, N. Pariz, and A. Akbarimajd, “FPGA implementation Technology, Pretoria, South Africa, in 2006.
of an ANN for detection of anthelmintics resistant nematodes in sheep After joining in 2006, he is currently a Prin-
flocks,” in Proc. IEEE Conf. Ind. Electronics Applications, 2009, pp. cipal Scientist at the Integrated Sensor Systems
1899–1904. Group, ABB Corporate Research, Baden-Daettwil,
[28] “Virtex-5 FPGA User Guide,” Xilinx, 2007. Switzerland. He is author/coauthor of more than
[29] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal 35 published scientific papers including a monograph Intelligent Systems and
representations by error propagation,” Nature, vol. 323, pp. 533–536, Signal Processing in Power Engineering (Springer: Heidelberg, Germany,
1986, newblock MIT Press. 2007), and inventor/co-inventor of six patents. His research interests include
[30] D. Bishop, “Fixed point package.” [Online]. Available: http://www. signal processing, machine learning, power systems, and embedded systems.
eda.org/vhdl-200x/vhdl-200x-ft/packages/fixed_pkg.vhd
[31] K. Nichols, M. Moussa, and S. Areibi, “Feasibility of floating-point
arithmetic in FPGA based artificial neural networks,” in Proc. Int. Conf.
Computer Applications in Industry and Engineering, 2002, pp. 8–13. Franz Zurfluh received the B.Sc. degree in electrical
[32] M. Tommiska, “Efficient digital implementation of the sigmoid func- engineering from the University of Applied Sciences
tion for reprogrammable logic,” in Proc. IEE Comput. Digital Tech., Northwestern Switzerland, Windisch, Switzerland, in
2003, vol. 150, pp. 403–411. 1985.
[33] J. Holt and J. Hwang, “Finite precision error analysis of neural net- He has worked in the research and development
work hardware implementations,” IEEE Trans. Computers, vol. 42, pp. areas for different companies mainly in the em-
281–290, 1993. bedded electronics sector. After joining in 2000, he
[34] “Tecator data set” Carnegie Mellon University. [Online]. Available: is currently a Principal Scientist at the Integrated
http://lib.stat.cmu.edu/datasets/tecator Sensor Systems Group, ABB Corporate Research
[35] A. Ukil, J. Bernasconi, H. Braendle, H. Buijs, and S. Bonenfant, Center, Baden-Daettwil, Switzerland. His current
“Improved calibration of near-infrared spectra by using ensembles of research interests include VLSI signal processing
neural network models,” IEEE Sensors J., vol. 10, no. 3, pp. 578–584, and control, design of FPGA-based systems for power control and automation
Mar. 2010. applications.

Authorized licensed use limited to: Indian Institute of Information Technology Design & Manufacturing. Downloaded on March 16,2024 at 12:29:47 UTC from IEEE Xplore. Restrictions apply.

You might also like