MACIo T

This document describes a proposed low-power multiply-accumulate (MAC) unit for Internet of Things (IoT) processors. The MAC unit is capable of performing 16-bit, dual 16-bit, and 32-bit MAC operations on signed and unsigned numbers with up to three operands. It uses multiplexers and array multipliers to efficiently reuse hardware and minimize area and power consumption. The MAC unit was designed and implemented in VHDL, simulated using Vivado, and tested on an FPGA development board.


2018 2nd European Conference on Electrical Engineering and Computer Science (EECS)

Implementation of Low-Power Multiply-Accumulate (MAC) Unit for IoT Processors
Kareem Mansour
Department of Microsystems Engineering - IMTEK
Albert-Ludwigs-Universität Freiburg
Freiburg, Germany
[email protected]

Ahmed Saeed
Electrical Engineering Department
Future University in Egypt
Cairo, Egypt
[email protected]

Abstract—Embedded processors are key building blocks for IoT platforms. Multiply-Accumulate (MAC) units are vital arithmetic circuits in several applications performed by these processors, including digital signal processing (DSP). It is necessary to reduce the power consumed by the processor. In this paper, the design and implementation of a 32-bit MAC unit optimized for a low power budget and targeting IoT processors is introduced. The proposed MAC unit is capable of performing several 16-bit, dual 16-bit, and 32-bit MAC operations that can be carried out on signed and unsigned numbers with up to three operands involved. The performance of the MAC unit is analyzed in terms of delay and power. The unit is described in VHDL, implemented and simulated in Vivado, and tested using a Nexys 4 DDR board featuring Xilinx’s Artix-7 FPGA.
Index Terms—FPGA, IoT, MAC, Multiplier, Vedic, VHDL

I. INTRODUCTION
The Internet of Things (IoT) refers to a giant network that extends to everyday objects, namely ’Things’. These things, while not considered computers, can be sensors, actuators, wearables, mechanical machines, home appliances or even persons that are able to communicate with other objects and computers through the Internet via embedded systems without human intervention. Fig. 1 depicts the components of an IoT device. Sensors in each IoT device are used to monitor and collect data from the surrounding environment; a local processor or microcontroller (MCU) is used to process these data and interface to a wireless device for connectivity [1].

Fig. 1: Block diagram of IoT device.

Typically, mobile IoT devices transfer a small amount of data and are powered through rechargeable batteries and/or ambient energy sources such as solar energy, thermal energy, wind energy, electromagnetic energy from radio transmitters, vibrations or physical motion. This is the reason that power dissipation has become an important concern: devices must use minimal power while still providing good performance. Power consumption depends on the type of sensors, microcontroller and radio transceiver within the device.
Unlike traditional embedded devices, which would contain two separate processors, most cutting-edge devices can handle the interface and run DSP applications by means of one single-core microcontroller. Single-core microcontrollers can reduce power consumption and have the computational power to process real-time signals. Fig. 2 abstracts the components that make up an ARM core, one of the most pervasive processors in the world, embedded in a wide range of products from cell phones to vehicles [2].
DSP applications are typically performed by a Multiply-Accumulate (MAC) unit, which multiplies two numbers and accumulates the result onto an accumulator. The MAC unit is a fundamental block that maximizes the performance of the processor.
This paper introduces a new design for a low-power MAC unit capable of performing several 16-bit, dual 16-bit, and 32-bit operations in which up to three operands are involved; two operands are the multiplicand and multiplier while the third operand is used for optional accumulation and subtraction purposes. The proposed MAC operations can be carried out on signed and unsigned numbers and the result can be 32-bit or 64-bit according to the type of operation. In order to maximize the performance, all the MAC operations are executed in one cycle.
The rest of this paper is organized as follows: Section II introduces the architecture and the design of the proposed MAC unit in detail, and the simulation and results are discussed in Section III. Finally, Section IV concludes the paper.
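To make the operation described in this introduction concrete, the following is a minimal behavioral sketch in Python (not the authors' VHDL) of one multiply-accumulate step: two operands are multiplied and the product is optionally added to, or subtracted from, a third operand. The names `to_signed` and `mac`, the keyword parameters, and the wrap-around treatment of results wider than the output are assumptions made for illustration; the paper does not state how overflow is handled.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of `value` as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def mac(a, b, r=0, *, subtract=False, signed=True,
        operand_bits=32, result_bits=64):
    """Behavioral model of Y = R +/- A * B on raw bit patterns.

    a, b : operand bit patterns, `operand_bits` wide.
    r    : accumulator bit pattern, `result_bits` wide.
    The result is returned as a `result_bits`-wide bit pattern.
    Wrap-around on overflow is an assumption, not taken from the paper.
    """
    op_mask = (1 << operand_bits) - 1
    res_mask = (1 << result_bits) - 1
    a &= op_mask
    b &= op_mask
    r &= res_mask
    if signed:
        # Reinterpret the bit patterns as two's-complement values.
        a = to_signed(a, operand_bits)
        b = to_signed(b, operand_bits)
        r = to_signed(r, result_bits)
    product = a * b
    y = r - product if subtract else r + product
    return y & res_mask
```

For example, `mac(0xFFFFFFFF, 5)` treats the first operand as -1 and returns the 64-bit pattern `0xFFFFFFFFFFFFFFFB`, whereas the same call with `signed=False` returns `0x4FFFFFFFB`. Passing `operand_bits=16, result_bits=32` models a halfword operation with a 32-bit result.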

978-1-7281-1929-8/18/$31.00 ©2018 IEEE
DOI 10.1109/EECS.2018.00072
Fig. 2: ARM core functional units and dataflow model.

II. ARCHITECTURE AND DESIGN
In this section, the architecture and the design of the proposed MAC unit are explained in detail. The multiplicand and multiplier operands are denoted by A and B, respectively, while the third operand is denoted by R, and the result is denoted by Y. The main operations of the proposed MAC unit are listed below:

−Multiply Words and Accumulate
Y = R ± A × B
−Multiply Halfwords and Accumulate
Y = R ± A(halfword) × B(halfword)
−Multiply Word by Halfword and Accumulate
Y = R ± A × B(halfword)
−Dual Multiply Halfwords, Add/Subtract, Accumulate
Y = R ± A(halfword) × B(halfword) ± A(other halfword) × B(other halfword)

A. Architecture
The proposed MAC unit consists of four 16x16 bit array multipliers whose inputs and outputs are connected to adders and multiplexers. Fig. 3 shows the block diagram of the proposed MAC unit and demonstrates its dataflow architecture. The design of each block will be discussed in the next subsection.
For 16-bit MAC operations, only one multiplier is required to multiply two halfwords. In case of dual 16-bit operations, two multipliers are involved in multiplying two pairs of standalone halfwords and their results are then added or subtracted. The four multipliers are connected together along with adders to form a 32x32 bit vedic multiplier in 32-bit operations. The result of multiplication is optionally accumulated on another operand by means of a 32-bit and/or a 64-bit adder.
Multiplexers are responsible for choosing the desired inputs and outputs of the multipliers and adders according to the operation. They are used to select the desired words and halfwords of the input and resulting operands. In addition, they are used to select the type of multiplication, whether signed or unsigned. In this way, the same hardware components can be reused to perform a wide range of operations while consuming less area and power by avoiding component replication.
The multipliers used are unsigned by default, and hence extra hardware is required to perform signed multiplication. For unsigned multiplication, the operands are passed to the multipliers without any change. However, for signed multiplication, the absolute values of the operands are fed into the multipliers, unsigned multiplication is performed, and finally the sign is calculated separately and applied to the result.
B. Design
The multiplier blocks, denoted by ‘M0’, ‘M1’, ‘M2’ and ‘M3’, are 16x16 bit unsigned array multipliers that use digital combinational circuits to perform parallel multiplication. Array multipliers outperform serial multiplication schemes in terms of speed. The design of an array multiplier is based upon partial product generation, shifting and addition. Each partial product is generated by multiplying the multiplicand by one multiplier bit and is shifted according to its bit position. Finally, the result is obtained by adding the shifted partial products. Fig. 4 demonstrates the multiplication method of the array multiplier.
In order to maximize the performance while maintaining minimum power and area and enabling hardware reuse, the vedic scheme is used to construct a 32x32 bit multiplier. Fig. 5 shows an example of a 4x4 bit vedic multiplier, where the first row represents the multiplicand (B = b3 b2 b1 b0) while the second row is the multiplier (A = a3 a2 a1 a0). In step 1, the least significant bits are multiplied, giving the least significant bit of the multiplication result. In the subsequent steps, the cross products are added, i.e., a0 × b1 + a1 × b0 in step 2, a0 × b2 + a2 × b0 + a1 × b1 in step 3, etc. Any carry generated from the addition process should be added to the next step of addition. The same procedure is followed through the final step. The same methodology can be extended to construct a 32x32-bit vedic multiplier [3].
In the proposed design, ‘M0’ is used to multiply the lower halfwords of the input operands. ‘M3’ is used to multiply the upper halfwords. The other two multipliers ‘M1’ and ‘M2’ are used to multiply the lower halfword of the first operand by the upper halfword of the second operand and vice-versa. In this way, the results of multiplying the different halfwords are obtained at the same time and can be used in operations where two standalone multiplication results are required. The result selection depends on the operation and is done by means of multiplexers. In order to use the same hardware to perform 32x32 bit multiplication, the concept of vedic multiplication was used. The results of the four multipliers are connected to
adders to extend the multiplication process. Fig. 6 depicts the connection of adders in the vedic block.

[Fig. 3 block diagram: inputs R[63:0], B[31:0] and A[31:0]; ‘ABS’ blocks and input multiplexers producing MUL_B[31:0] and MUL_A[31:0]; four 16x16 multipliers with input pairs MUL_B[31:16] | MUL_A[31:16], MUL_B[31:16] | MUL_A[15:0], MUL_B[15:0] | MUL_A[31:16] and MUL_B[15:0] | MUL_A[15:0], producing the 32-bit outputs M3, M2, M1 and M0; ‘Sign’ and ‘Vedic’ blocks producing M3_SIGN, Y_VEDIC, Y_VEDIC_SIGN and M0_SIGN; a multiplexer selecting the 64-bit MULTIPLICATION_RESULT; ADDER64 and ADDER32; a multiplexer selecting the 64-bit ACCUMULATOR_RESULT; output Y[63:0].]
Fig. 3: The block diagram of the proposed MAC unit.

Fig. 4: The multiplication method of array multiplier.

The ‘ABS’ block is used to obtain the absolute value of the signed operands and is required only in signed multiplication operations. To obtain the absolute value, the most-significant bit of the operand is inspected. If it is ‘1’, then the operand is negative and the absolute value is the two’s complement of the operand. Otherwise, the operand is positive and the absolute value is equal to the operand itself. The obtained absolute value is ready to be used with the unsigned multipliers. Fig. 7 depicts the design of the ‘ABS’ block.
The ‘Sign’ block is required to calculate the sign of the result in signed multiplication. The resulting operand should be negative only if A and B are of different signs. In order to achieve this, the most-significant bits of A and B are inspected. The signed result is the two’s complement of
the unsigned multiplication result when the most-significant bits are different. Otherwise, the signed result is equal to the unsigned multiplication result. Fig. 8 depicts the design of the ‘Sign’ block.
Regarding low power consumption, a MAC operation does not use all the components at a time. Only some specific components are necessary when executing a certain operation and the rest can be switched off. Therefore, the guard evaluation low-power technique [4] is used to block changes in the inputs to these blocks, hence saving the dynamic power due to transitions. In the next section, the improvement in power consumption after using this technique will be reviewed.

Fig. 5: Vedic scheme for 4x4-bit multiplier.

[Fig. 6 diagram: one adder sums M3 & X"0000" and X"0000" & M2 (48 bits each) to give Y2; a second adder sums M1 and X"0000" & M0[31:16] (32 bits each) to give Y1; a third adder sums Y2 and X"0000" & Y1 (48 bits each) to give Y_VEDIC[63:16]; Y_VEDIC[15:0] is taken from M0[15:0].]
Fig. 6: The design of the ‘Vedic’ block.

[Fig. 7 diagram: 32-bit input A/B, a constant 1, an ADDER, the sign bit A/B(31) and the 32-bit output ABS(A/B).]
Fig. 7: The design of the ABS block.

[Fig. 8 diagram: 32-bit input Y, sign bits A(31) and B(31), an ADDER and the 32-bit output SIGN(Y).]
Fig. 8: The design of the ‘Sign’ block.

Fig. 9: The power report of the MAC unit: (a) With guard evaluation. (b) Without guard evaluation.

III. SIMULATION AND RESULTS
The proposed MAC unit was implemented in VHDL using the Vivado software tool by Xilinx. The power reports of the MAC unit obtained from Xilinx Vivado are shown in Fig. 9. The power consumption is shown with and without the guard evaluation low-power technique in Fig. 9a and Fig. 9b, respectively. The dynamic power consumption is reduced from 29 mW to 22 mW, a reduction of about 24%, by using this technique. It is worth noting that the power is expected to be reduced further after integrating the proposed MAC unit into the whole processor design because of resource sharing.
The design was simulated using the simulation tool ISim, integrated with Vivado, to test the functionality of the design. The simulation results showed that the MAC unit was able to perform the designed MAC operations correctly. Fig. 10 shows the simulation results of the MAC operations. It is worth mentioning that the test-bench covers more operations than those listed in Fig. 10.

Fig. 10: The simulation results of the MAC unit.

IV. CONCLUSIONS
An efficient hardware architecture for a low-power 32x32 bit MAC unit for IoT processors has been designed and implemented in this work. The proposed MAC unit is implemented using VHDL on a Nexys 4 development board featuring Xilinx’s FPGA. The implementation results obtained from simulation show that the power consumption is low, about 22 mW, and that the delay is small. Although the implemented MAC unit already has low power consumption, its power is expected to be reduced further after it is integrated with the other components of the processor. Future work will focus on improving the power results for the whole processor to fit the IoT power budget.

REFERENCES
[1] Pallavi Sethi and Smruti R. Sarangi. Internet of things: Architectures, protocols, and applications. J. Electrical and Computer Engineering, 2017:9324035:1–9324035:25, 2017.
[2] Andrew Sloss, Dominic Symes, and Chris Wright. ARM System Developer’s Guide: Designing and Optimizing System Software. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2004.
[3] V. Kulkarni, L. Kulkarni, and V. Kulkarni. High speed and area efficient vedic multiplier. In 2012 International Conference on Devices, Circuits and Systems, pages 360-364, March 2012.
[4] C. Ravishankar, J. H. Anderson, and A. Kennings. FPGA power reduction by guarded evaluation considering logic architecture. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 31(9):1305-1318, Sept. 2012.
