Volume II, Issue IV, April 2015

Implementation of 32 Bit Floating Point MAC Unit to Feed Weighted Inputs to Neural Networks
Yadagiri Karri#, Prof. Rajesh Misra*
Department of ECE, CUTM, JITM

Abstract- This paper describes an FPGA implementation of an IEEE-754 single precision floating point MAC unit that is used in artificial neural networks to feed the weighted inputs to the neurons. Floating point numbers extend the range of representable data from very small to very large values, which is why they are commonly recommended for Artificial Neural Networks.

Keywords- FPGA, IEEE-754, floating point MAC, weighted inputs, Artificial Neural Networks.

I. INTRODUCTION

The main goal of this paper is the design of a floating point MAC unit. Representing real numbers in binary format requires floating point numbers. In this paper floating point numbers are represented according to the IEEE 754 standard format. In single precision, a floating point number consists of a 32-bit word divided into 1 bit for the sign, 8 bits for the exponent, which is stored with a bias of 127, and 23 bits for the significand. The standard supports two types of formats: binary interchange formats and decimal interchange formats. In many applications computation is done using floating point arithmetic. Earlier, floating point operations were mainly implemented in software, while for mainstream general purpose processors a hardware implementation was only an option because the cost of the hardware was not reasonable. Today most microprocessors include dedicated hardware for handling floating point operations. In artificial neural network applications a floating point MAC unit is required in order to achieve the desired performance. Because of the advancements in reconfigurable logic, these MAC units can now be implemented on FPGAs. The goal of this project is the FPGA implementation of a floating point MAC unit for ANN applications. A floating point number can be given by equation (1):

$Z = (-1)^{s} \times 2^{(exp - bias)} \times (1.M)$        (1)

Equation (1) represents the IEEE 32 bit single precision floating point format. One very important requirement of the IEEE-754 representation is that a number should be represented by its closest equivalent for the precision chosen, which means that every operation is assumed to be performed with infinite precision before the result is rounded. Any floating point number is first converted into the format of equation (1), and further operations are then performed. Floating-point (FP) addition is based on a sequence of mantissa operations: swap, shift, add, normalize, and round. A floating point adder first compares the exponents of the two input operands, then swaps and shifts the mantissa of the smaller number to get them aligned. The number has to be adjusted if the incoming number is negative. Finally, the sum is renormalized, the exponent is adjusted accordingly, and the resulting mantissa is reduced to its final width by an appropriate rounding scheme. If extra speed is required, FP adders use leading-zero anticipatory (LZA) logic to carry out pre-decoding for normalization shifts in parallel with the mantissa addition. Floating point multiplication basically involves XORing the signs, multiplying the significands, and adding the exponents of both numbers. After the addition, the exponent is called the tentative result exponent; the bias then has to be subtracted from the added exponents. The result is a normalized number if the MSB is 1. In this paper a floating point multiplier, a floating point adder/subtractor, and an accumulator are designed. Then the floating point MAC unit is designed. The MAC basically consists of an adder, a multiplier, and an accumulator.

Certain values are used for special number representation, as follows. If the exponent is 0 and the mantissa is 0, the number is zero. If the exponent is 0 and the mantissa is greater than 0, the number is a subnormal number. If the exponent lies between 0 and 255, the number is a normal number, whatever the value of the mantissa. If the exponent is 255 and the mantissa is 0, the number is infinity. If the exponent is 255 and the mantissa is greater than 0, the value is not a number (NaN).

Table 1. Special Numbers

S.No   Exponent    Mantissa   Output
1      =0          =0         Zero
2      =0          >0         Subnormal
3      0<E<255     >=0        Normal
4      =255        =0         Infinity
5      =255        >0         NaN

The concept of the ANN is borrowed from biology, where the neural network plays an important and key role in the human body. In the human body work is done with the help of the neural network. A neural network is a web of interconnected neurons, which number in the millions. With the help of these interconnected
neurons all the parallel processing is done in the human body, and the human body is the best example of parallel processing. An artificial neuron is basically an engineering approach to the biological neuron. It is a device with many inputs and one output. An ANN consists of a large number of simple processing elements that are interconnected with each other and also layered. Similar to biological neurons, an Artificial Neural Network also has neurons, which are artificial; they receive inputs from other elements or other artificial neurons, and after the inputs are weighted and added, the result is transformed by a transfer function into the output. The transfer function may be, for example, a sigmoid or a hyperbolic tangent function.

Fig 1 Artificial Neuron

Fig 2 Functions of an Artificial Neuron

II. RELATED WORK

Guillermo Marcus presents a multiplier and an adder/subtractor for single precision floating point numbers in IEEE format. They have a pipelined architecture and are implemented in VHDL. Mohamed Al-Ashrafy presents a floating point multiplier in IEEE single precision floating point format. The multiplier does not implement rounding and just presents the significand multiplication result. Carlos Minchola has presented an FPGA implementation of a Decimal Floating Point (DFP) adder/subtractor. Lamiaa S. A. Hamid [10] has presented a high speed generic Floating Point Unit (FPU) consisting of a multiplier and adder/subtractor units; a novel multiplication algorithm is proposed and used in the multiplier implementation.

III. ORGANIZATION OF WORK

In this section the IEEE 754 single precision format, the floating point adder, the floating point multiplier, and the floating point MAC unit are explained. A MAC consists of a multiplier and an accumulator unit. The multiplier multiplies two numbers, and the result is added to the number already stored in the accumulator.

The standard binary floating point format was issued by IEEE in 1985 [6]. It covers different types of floating-point formats (e.g. single, double), special coding representations (e.g. 0, +∞, −∞), rounding mechanisms, arithmetic operations, etc. The standard radix-2 binary floating-point representation can be written as in equation (1), with s as the sign bit, exp as the biased exponent, and M as the mantissa or fraction.

Sign (1 bit)   Exponent (8 bits)   Mantissa (23 bits)
bit 31         bits 30-23          bits 22-0
Figure 3. IEEE single precision floating point format

Floating point addition is done in four major steps:

1. Sorting: puts the number with the larger magnitude on the top and the number with the smaller magnitude on the bottom.
2. Alignment: aligns the two numbers so they have the same exponent. This is done by adjusting the exponent of the smaller number to match the exponent of the bigger number. The significand of the smaller number is shifted to the right by the difference in exponents.
3. Addition/subtraction: adds or subtracts the significands of the two aligned numbers.
4. Normalization: adjusts the result to normalized format. Three types of normalization procedures may be needed:
i) After a subtraction, the result may contain leading zeros in front.
ii) After a subtraction, the result may be too small to be normalized and thus needs to be converted to zero.
iii) After an addition, the result may generate a carry-out bit.

During alignment and normalization, the lower bits of the significand are discarded when shifted out. The design is divided into four stages, each corresponding to a step in the foregoing algorithm. The circuit in the first stage compares the magnitudes and finds the larger number and the smaller number. The comparison is done between
exp1 & frac1 and exp2 & frac2. It implies that the exponents are compared first, and if they are the same, the significands are compared.

Figure 4 Block Diagram for floating point addition

The circuit in the second stage performs alignment. It first calculates the difference between the two exponents, which is expb - exps, and then shifts the significand of the smaller number, fracs, to the right by this amount. The circuit in the third stage performs sign-magnitude addition. Note that the operands are extended by 1 bit to accommodate the carry-out bit. The circuit in the fourth stage performs normalization, which adjusts the result to make the final output conform to the normalized format. The normalization circuit is constructed in three segments. The first segment counts the number of leading zeros. It is somewhat like a priority encoder. The second segment shifts the significand to the left by the amount specified by the leading-zero counting circuit. The last segment checks the carry-out and zero conditions and generates the final normalized number.

3.3. FLOATING POINT MULTIPLIER

Given two floating point numbers n1 and n2 with signs s1, s2, significands p1, p2 and exponents e1, e2, the result of their multiplication is n:

$n = n_1 \times n_2$
$= (-1)^{s_1} \cdot p_1 \cdot 2^{e_1} \times (-1)^{s_2} \cdot p_2 \cdot 2^{e_2}$
$= (-1)^{s_1 + s_2} \cdot p_1 \cdot p_2 \cdot 2^{e_1 + e_2}$

In Figure 5 we present a general multiplier block diagram. The sign, exponent and mantissa are extracted from both numbers. Pipelining has been used in designing the multiplier. The sign bits of both numbers are XORed. The 8 bit exponents are added and then the bias is subtracted from the sum. Because subtracting 127 is the same as adding 1 and subtracting 128, the subtraction is easily achieved by adding a carry-in to the sum and then subtracting 128 from it by complementing the most significant bit. For multiplying the significands a multiplier with a 48 bit product is used. The upper 28 bits of the product are kept as the most significant bits, of which 24 bits are the mantissa bits, 3 bits are for proper rounding, and 1 bit is for range overflow. The result is then normalized for proper approximation to the closest value. The normalization consists of a possible single bit right shift, with the exponent incremented correspondingly, depending on the b1 bit. The resultant sign, exponent and mantissa are then obtained. The figure shown below is a simple floating point multiplier. In this paper three floating point multipliers have been designed using carry save, carry look ahead, and ripple carry adders. The same flow is used for all of them; only the adder used for the addition of exponents differs.

Figure 5 Block Diagram for floating point multiplier

3.4. FLOATING POINT MAC UNIT

The MAC unit is basically composed of adders, a multiplier and an accumulator. The inputs given to the MAC are fetched from a memory location and fed to the multiplier block of the MAC, which performs the multiplication and gives the result to the adder, which accumulates the result and stores it in a memory location. The complete process is achieved in a single cycle. The design consists of a 32 bit floating point adder and 1 register for the memory location. A typical MAC unit consists of a multiplier, an adder and an accumulator. The most important feature that differentiates a digital signal processor from a general processor is its multiply and accumulate unit. Each DSP algorithm requires some form of multiplication and accumulation system, which makes this the most important block in DSP systems. Usually the adders used are carry save, carry select, or ripple carry adders because of their speed. The inputs of the MAC are supposed to be fetched from a memory location and then they are fed to
the multiplier. The multiplier multiplies the inputs and gives the results to the adder, where they are added to the previously accumulated results. The most important computation in DSP, the sum of the products b(n) x(n-k), is easily carried out by this operation.

A complete MAC unit is then designed using the floating point adder and multiplier, and its FPGA implementation is done. This MAC unit is used to feed the inputs to a neuron in Artificial Neural Networks.
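
The multiply-accumulate operation of the MAC unit can be illustrated with a small software model. The Python sketch below is only an approximation under stated assumptions: it rounds every product and every running sum to single precision with round-to-nearest (the hardware may truncate instead), it models a separate multiplier and adder rather than a fused unit, and the names `to_f32`, `mac_step` and `mac` as well as the sample weights and inputs are illustrative and not taken from the VHDL design.

```python
import struct

def to_f32(x):
    """Round a Python float (double precision) to the nearest IEEE-754 single."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def mac_step(acc, weight, value):
    """One multiply-accumulate: acc + weight * value, rounded after each operation."""
    product = to_f32(to_f32(weight) * to_f32(value))   # multiplier output
    return to_f32(acc + product)                       # adder output, stored back

def mac(weights, values):
    """Accumulate the sum of weight * value pairs, starting from a cleared accumulator."""
    acc = 0.0
    for w, x in zip(weights, values):
        acc = mac_step(acc, w, x)
    return acc

# Hypothetical weights and inputs of a three-input neuron (assumed values).
weights = [0.5, -1.25, 2.0]
inputs = [4.0, 2.0, 0.75]
print(mac(weights, inputs))  # 1.0
```

The value returned by `mac` is the weighted sum that would be passed on to the neuron's transfer function.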

Figure 7. Simulation of floating point MAC unit.


An FP adder and an FP multiplier are presented in this paper. Both are available in pipelined architectures, are implemented in VHDL, and are fully synthesizable.
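
The four adder stages of Section III (sorting, alignment, addition/subtraction, normalization) can be sketched in software as follows. This is a minimal Python model under stated assumptions, not the VHDL implementation: it accepts normal, nonzero operands only, discards shifted-out bits instead of rounding, and ignores exponent overflow and underflow. The signal names `expb`, `exps` and `fracs` follow the text; the helper names `fields` and `fp_add` are hypothetical.

```python
import struct

def fields(x):
    """Return (sign, biased exponent, 23-bit fraction) of x encoded as a single."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def fp_add(a, b):
    """Add two normal, nonzero singles; shifted-out bits are discarded."""
    sa, ea, fa = fields(a)
    sb, eb, fb = fields(b)
    # Stage 1, sorting: compare exponent & fraction as one unsigned value.
    if ((ea << 23) | fa) >= ((eb << 23) | fb):
        signb, expb, fracb, signs, exps, fracs = sa, ea, fa, sb, eb, fb
    else:
        signb, expb, fracb, signs, exps, fracs = sb, eb, fb, sa, ea, fa
    sig_big = (1 << 23) | fracb        # restore the hidden leading 1
    sig_small = (1 << 23) | fracs
    # Stage 2, alignment: shift the smaller significand right by expb - exps.
    sig_small >>= expb - exps
    # Stage 3, sign-magnitude addition or subtraction.
    total = sig_big + sig_small if signb == signs else sig_big - sig_small
    # Stage 4, normalization.
    if total == 0:                     # result cancelled to zero
        return 0.0
    exp = expb
    if total >> 24:                    # carry-out after an addition
        total >>= 1
        exp += 1
    while not (total >> 23):           # leading zeros after a subtraction
        total <<= 1
        exp -= 1
    bits = (signb << 31) | (exp << 23) | (total & 0x7FFFFF)
    return struct.unpack("<f", struct.pack("<I", bits))[0]

print(fp_add(1.5, 2.25))   # 3.75
print(fp_add(5.0, -3.0))   # 2.0
```

In this model `fp_add(1.0, 2.0 ** -24)` returns 1.0, because the smaller significand is shifted out completely during alignment.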


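
The multiplier flow of Section 3.3 (XOR of the sign bits, addition of the biased exponents with the bias removed, product of the two 24 bit significands, and a possible one bit right shift) can be sketched in the same way. The Python model below is a rough illustration under stated assumptions: normal, nonzero operands only, truncation of the low product bits instead of rounding, and no handling of exponent overflow or underflow. The helper names `fields` and `fp_mul` are hypothetical and do not come from the VHDL design.

```python
import struct

BIAS = 127  # single-precision exponent bias

def fields(x):
    """Return (sign, biased exponent, 23-bit fraction) of x encoded as a single."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def fp_mul(a, b):
    """Multiply two normal, nonzero singles; the low product bits are truncated."""
    s1, e1, f1 = fields(a)
    s2, e2, f2 = fields(b)
    sign = s1 ^ s2                       # XOR of the sign bits
    exp = e1 + e2 - BIAS                 # tentative result exponent
    # Restore the hidden leading 1 of each significand; the product has 48 bits.
    prod = ((1 << 23) | f1) * ((1 << 23) | f2)
    if prod >> 47:                       # product in [2, 4): shift right once
        prod >>= 1
        exp += 1
    frac = (prod >> 23) & 0x7FFFFF       # keep 23 fraction bits, drop the rest
    bits = (sign << 31) | (exp << 23) | frac
    return struct.unpack("<f", struct.pack("<I", bits))[0]

print(fp_mul(1.5, -2.5))   # -3.75
print(fp_mul(1.5, 1.5))    # 2.25
```

The second call exercises the normalization shift, since 1.5 x 1.5 = 2.25 lies in the range [2, 4).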