HISPE: High-Speed Configurable Floating-Point Multi-Precision Processing Element

Tejas B N, Rakshit Bhatia, Madhav Rao
IIIT-Bangalore
[email protected], [email protected], [email protected]
2024 25th International Symposium on Quality Electronic Design (ISQED) | 979-8-3503-0927-0/24/$31.00 ©2024 IEEE | DOI: 10.1109/ISQED60706.2024.10528733

Abstract—Multiple precision modes are needed for a floating-point processing element (PE) because they provide flexibility in handling different types of numerical data with varying levels of precision and performance metrics. Performing high-precision floating-point operations has the benefits of producing highly precise and accurate results while allowing for a greater range of numerical representation. Conversely, low-precision operations offer faster computation speeds and lower power consumption. In this paper, we propose a configurable multi-precision processing element (PE) which supports Half Precision, Single Precision, Double Precision, BrainFloat-16 (BF-16) and TensorFloat-32 (TF-32). The design is realized using GPDK 45 nm technology and operates at a 281.9 MHz clock frequency. The design was also implemented on a Xilinx ZCU104 FPGA evaluation board. Compared with previous state-of-the-art (SOTA) multi-precision PEs, the proposed design supports two more floating-point data formats, namely BF-16 and TF-32. It achieves the best energy performance with 2368.91 GFLOPS/W and offers a 63% improvement in operating frequency with comparable footprint and power metrics.

Keywords—Floating Point (FP), Processing Element (PE), TensorFloat-32 (TF32), BrainFloat-16 (BF16), High-Performance Computing (HPC), Multiply-Accumulate (MAC).

I. INTRODUCTION

As the Internet of Things (IoT) and Artificial Intelligence (AI) continue to advance quickly, there is a huge computing need to carry out billions of Multiply-Accumulate (MAC) operations per second [1], [2]. High-precision floating-point operations use a greater number of bits to represent the values, resulting in more accurate results. This is especially important in applications where accuracy is critical, such as scientific research, engineering, and financial analysis. In addition to increased accuracy, high-precision floating-point operations can handle a wider range of numerical values, allowing for larger representation ranges. This is useful in applications where data can vary significantly in magnitude, such as astrophysics. However, this increased precision comes at the cost of slower computation speed and higher power consumption. In contrast, low-precision floating-point operations use fewer bits to represent the values, which results in faster computation speed and lower power consumption [2], [3]. However, the lower precision can lead to decreased accuracy in the final results. Reduced precision has nonetheless demonstrated large benefits, paving the way towards computing in mobile devices and IoT nodes [1]. Consequently, many HPC and AI computing applications adopt collaborative completion of multiple-precision FP data, PEs and algorithms to meet accuracy requirements and accelerate the computing process [1], [2], [4], [5], [6].

Either high-precision-split (HPS) or low-precision-combination (LPC) approaches are used to construct a configurable FP PE. With the HPS method, high-precision computing blocks are divided into smaller pieces that can be used for operations with lower precision. Only a small portion of the hardware is needed for configuration, and speed is maintained. However, this method struggles with a low utilization rate of the multiplier. With the LPC method, low-precision computing blocks are duplicated and combined to support multiple-precision operations with additional shifters and adders [2]. An LPC-based configurable FP PE is proposed in [7] to perform a number of high-precision operations by grouping FP16 unit multipliers. The LPC method achieves a better utilization rate of the multiplier array, but the multiterm multiplication and accumulation operations result in a long processing period. Multi-precision PEs also support a wide variety of data formats to suit the mixed-precision computing needs of various applications.

This article proposes a three-stage pipelined configurable multi-precision floating-point processing element. The processing element can operate in five modes of precision, Half Precision (FP16), Single Precision (FP32), Double Precision (FP64), BrainFloat-16 (BF16) and Nvidia's TensorFloat-32 (TF32), based on the mode selected.

The main contributions of this article are summarized as follows:

1) A fast 7:2 Carry Save Adder (CSA) along with a hybrid Carry Look Ahead (CLA) adder structure, integrated with a 12b×12b radix-4 Booth multiplier, is introduced.

2) A fast 2's complement block is proposed to reduce the critical path delay of stage-2 and stage-3 of the pipelined architecture.

3) A fast Leading Zero Detector (LZD), optimized for area and delay, is introduced to reduce the critical path delay of stage-3 of the pipelined architecture.

4) The configurable floating-point multi-precision element is retimed at the RTL level to ensure the critical path delays of all the pipelined stages are comparable, thus ensuring maximum operating frequency.

To the best of the authors' knowledge, this is the first time a multi-precision processing element incorporating five different data formats, including BF16 and TF32 along with FP16, FP32, and FP64, has been implemented and characterized. The processing element designs are made freely available in [8] for further usage by the researcher and designer community. The paper not only improves the PE design but also accommodates two more data formats that are highly relevant for modern AI computing applications. It is a step towards designing hardware-efficient AI systems on chip.
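As an illustrative aside on contribution 1 (a behavioural Python sketch, not the authors' RTL; the function names here are ours), radix-4 Booth recoding of a 12-bit unsigned multiplier yields exactly seven signed digits, and hence the seven partial products that the 7:2 CSA tree compresses:

```python
def booth_radix4_digits(b, width=12):
    """Radix-4 Booth recoding of an unsigned `width`-bit multiplier:
    scan overlapping 3-bit groups (with an implicit 0 below the LSB)
    and map each group to a signed digit in {-2, -1, 0, +1, +2}."""
    x = b << 1  # append the implicit 0 below the LSB
    table = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
             0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}
    # width/2 + 1 digits cover an unsigned width-bit operand
    return [table[(x >> i) & 0b111] for i in range(0, width + 2, 2)]

def booth_multiply(a, b, width=12):
    """Multiply via Booth partial products d_i * a << 2i (the hardware
    sums these with a CSA tree and a hybrid CLA; here we simply add)."""
    return sum(d * a << (2 * i)
               for i, d in enumerate(booth_radix4_digits(b, width)))
```

For width = 12 the recoder emits seven digits, matching the seven partial products referred to in the contributions above.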

Authorized licensed use limited to: NATIONAL INSTITUTE OF TECHNOLOGY SILCHAR. Downloaded on August 26,2024 at 11:23:11 UTC from IEEE Xplore. Restrictions apply.
II. TRADITIONAL IEEE-754 DESIGN APPROACH

A. Standard IEEE-754 Format

The standard IEEE 754 floating-point representation contains three parts: sign (S), exponent (E) and mantissa (M). The sign part of the floating-point representation is 1 bit, where '0' represents a positive number and '1' represents a negative number. Fig. 1 shows the representation of Half Precision (FP16), Single Precision (FP32) and Double Precision (FP64) floating-point formats.

Fig. 1. IEEE 754 format of Half Precision (FP16: 1 sign, 5 exponent, 10 mantissa bits; 16 bits total), Single Precision (FP32: 1, 8, 23; 32 bits total) and Double Precision (FP64: 1, 11, 52; 64 bits total).

Half precision (FP16) format is characterized by 1 sign bit, 5 exponent bits and 10 mantissa bits; single precision (FP32) has 1 sign bit, 8 exponent bits and 23 mantissa bits; double precision (FP64) has 1 sign bit, 11 exponent bits and 52 mantissa bits. The mantissa of the IEEE 754 format is prepended with an implicit leading one and a binary point, "1.Mantissa". So, the numbers of significant bits of FP16, FP32 and FP64 are 11, 24 and 53, respectively. The generalized formula to extract the value encoded in the IEEE 754 floating-point representation is stated in Equation (1).

Value = (-1)^S × (1.M) × 2^(E - bias)    (1)

The bias values of FP16, FP32 and FP64 are 15, 127 and 1023, respectively. The reason for having biased exponents instead of a 2's complement representation is that a simple comparator can be employed instead of complicated logic for comparing signed exponents.

B. Emerging Floating Point Data Format

The IEEE 754 format has several drawbacks, such as inefficient representation of small numbers (it allocates considerable memory to a very small number), limited precision (it cannot represent all decimal fractions exactly), limited range, etc., thus impacting the storage and performance of the system. To overcome these limitations, numerous variations of the IEEE 754 FP16, FP32 and FP64 formats have been developed. The BFloat16 format [9] is a truncated FP32 representation. As the name indicates, it is a 16-bit floating-point representation with 1 sign bit, 8 exponent bits (same as FP32) and 7 mantissa bits. Due to the reduction in the number of mantissa bits, BF16 requires less memory and computing power; the same reduction, however, lowers the precision. Nvidia's TensorFloat-32 (TF32) is similar to BFloat16, with 3 more mantissa bits than BFloat16 [10], [11]. Fig. 2 shows the structures of BFloat16 and TensorFloat-32.

Fig. 2. Format of BFloat16 (1 sign, 8 exponent, 7 mantissa bits; 16 bits total) and TensorFloat-32 (1 sign, 8 exponent, 10 mantissa bits; 19 bits total).

Posit is a Type-III universal number, composed of sign, regime, exponent and mantissa [12], [13]. Posit numbers are similar to floating-point numbers in that they use a sign bit, an exponent, and a mantissa to represent a number. However, unlike floating-point numbers, posits use a different encoding scheme for the exponent and the mantissa, which allows for greater precision and range of values. However, posit is not widely used due to increased complexity and potential performance issues.

C. Low Precision Split Architecture based PE

Low precision combination and high precision split are two different techniques used in numerical computing. Low precision combination involves performing operations on numbers with reduced precision, while high precision split involves breaking down numbers into several components to perform operations with higher precision.

Advantages of low precision combination over high precision split include:

1) Reduced Memory Footprint: Low precision combination requires less memory than high precision split because it uses fewer bits to represent each number. This is especially important when dealing with large datasets and computations that require a lot of memory. A study by Nvidia found that using low-precision computation in deep learning can reduce memory usage and increase training speed without significantly impacting model accuracy [14].

2) Improved Performance: Low precision combination can perform calculations much faster than high precision split because it requires fewer computations. This is particularly important in applications such as machine learning and scientific simulations, where large amounts of data need to be processed quickly. M. Courbariaux et al. [15] showed that binarized neural networks, which use low-precision weights and activations constrained to +1 or -1, can achieve state-of-the-art performance on several image classification datasets, while requiring fewer computations and less memory than traditional neural networks.

3) Energy Efficiency: Low precision combination can be more energy-efficient than high precision split because it requires fewer computations, which in turn consume less energy. This is especially important for applications that are battery-powered or rely on energy-efficient computing. Intel Corporation found that using low-precision computation in neural networks can reduce energy consumption by up to 3 times compared to high-precision computation, while still achieving similar accuracy [16].

Low precision combination is a more efficient and cost-effective option in many numerical computing applications, especially those that require processing large datasets or running on resource-limited hardware. In a multiplier array designed with the LPC philosophy, Wei et al. [2] achieve 100% multiplier array utilization. However, the multiterm multiplication and accumulation of the LPC multiplier architecture result in a long processing time, affecting the critical path of the design.
III. PROPOSED ARCHITECTURE

A. Data Path Flow

The configurable multi-precision processing element supports 5 modes of precision, FP16, FP32, FP64, BF16 and TF32, using a 3-stage pipelined Single Instruction Multiple Data (SIMD) architecture. The structure is built using the LPC method, and the bit width of the unit multiplier is determined to minimize the redundancy cost of the all-precision implementation. The size of the unit multiplier is selected such that it can perform a single operation of FP16, BF16 and TF32.

In Fig. 3, the 3-bit signal MODE is used to select the precision of operation of the PE. The Input Pre-Processing stage bifurcates the data according to the selected mode so that it can be stored appropriately in the input registers (Input-Regs). The Exponent Comparison block computes the maximum exponent and the exponent differences of the product terms. A multiplier array designed with the LPC methodology is used to compute the partial products. The multiplier array is composed of 10 multipliers: 6 conventional multipliers and 4 fused multipliers. Sign Processing determines the sign of the product terms by performing XOR operations on the input operands. A fast 2's complement block is employed to find the 2's complement of the partial products based on the sign generated by the Sign Processing block. An adder tree composed of 3:2 CSAs and 4:2 CSAs is used to find the sum of the 10 partial products in each clock cycle. The Sum and Carry generated by the adder tree are accumulated by the Carry Select Adder (CSLA). In the final stage, a fast 2's complement block is employed to find the 2's complement of the adder result if it is a negative number. The results of the exponent and accumulation paths are adjusted through normalization and rounding to obtain the final result in the applicable standard formats.

Fig. 3. Architecture of the proposed 3-stage pipelined multi-precision PE. A 3-bit MODE signal selects the precision (3'b000: FP16, 3'b001: FP32, 3'b010: FP64, 3'b011: BFloat-16, 3'b100: TensorFloat-32). The datapath comprises input pre-processing and input registers; sign processing, exponent comparison, the multiplier array and alignment shifters (stage 1); the fast 2's complement block, adder tree and CSLA (stage 2); and the fast 2's complement block, modified LZD, normalization and output selection (stage 3), producing the 64-bit output.

B. Modified Leading Zero Detector (LZD)

A traditional Leading Zero Detector is designed as a cascaded multiplexer-like structure, where the number of cascaded multiplexer phases is one less than the number of bits in the LZD input. The number of multiplexers in each phase is equal to ceil(log2(Number_of_Bit_Input)). The critical path of this architecture is characterized by the delay of one 2:1 multiplexer multiplied by the number of multiplexer phases.

The proposed architecture segments the input and achieves parallelism between the segments, ensuring a significant decrease in the delay compared with the traditional LZD implementation. Here, we consider the 66-bit unsigned sum from the fast 2's complement block of stage-3 of the pipelined architecture as the input to the LZD. Let N denote the number of segments. The modified LZD has three stages: Stage 1 consists of N blocks of zero comparators of width nw, as stated in Equation (2), where all N comparator blocks operate in parallel. Stage 2 consists of N blocks of (nw - 1) cascaded 2:1 multiplexer phases (of bit-width 7); all N blocks of cascaded multiplexers operate in parallel. Also, Stage 1 and Stage 2 operate in parallel. Stage 3 consists of a chain of (N - 1) cascaded 2:1 multiplexer phases. For a 66-bit input, Table I summarizes the area and delay performance of the LZD for different numbers of segments.

nw = ceil(66 / N)    (2)

Fig. 4. Architecture of the proposed LZD: six 11-bit zero comparators (on in[65:55], in[54:44], in[43:33], in[32:22], in[21:11] and in[10:0]) operate in parallel with six chains of 10 cascaded 2:1 multiplexer phases of width 7; a final chain of 5 cascaded 2:1 multiplexer phases selects the 7-bit output.
TABLE I: AREA AND DELAY PERFORMANCE ANALYSIS OF LZD FOR DIFFERENT NUMBERS OF SEGMENTS

| Number of segments (N) | Width of each segment (nw) | 2x1 multiplexers per phase | Number of 2x1 multiplexers | Number of zero comparators | Width of zero comparator used | Weighted area | Critical path: comparators | Critical path: multiplexer phases | Weighted delay |
| 1 (traditional design) | 66 | 7 | 455 | 0 | 0 | 455 | 0 | 66 | 65 |
| 4 | 17 | 7 | 469 | 4 | 17 | 537 | 1 | 19 | 20 |
| 5 | 14 | 7 | 483 | 5 | 14 | 553 | 1 | 17 | 18 |
| 6 | 11 | 7 | 455 | 6 | 11 | 521 | 1 | 15 | 16 |
| 7 | 10 | 7 | 483 | 7 | 10 | 553 | 1 | 15 | 16 |
| 8 | 9 | 7 | 497 | 8 | 9 | 569 | 1 | 15 | 16 |
| 9 | 8 | 7 | 497 | 9 | 8 | 569 | 1 | 15 | 16 |
| 10 | 7 | 7 | 483 | 10 | 7 | 553 | 1 | 15 | 16 |
| 11 | 6 | 7 | 455 | 11 | 6 | 521 | 1 | 15 | 16 |

Here, the weighted area is calculated by adding the product of the number of comparators and the width of each comparator to the total number of 2:1 multiplexers. Similarly, the weighted delay is calculated by adding the number of comparators in the critical path to the number of multiplexer phases. From Table I, it is evident that the optimum number of segments is 6. In Fig. 4, the circuit-level implementation of the proposed LZD architecture is shown.

C. Fast 2's Complement Block

In the proposed design, a 2's complement block is employed in stage 1 and stage 2 of the pipeline architecture. In stage 1, the 2's complement block is used to find the 2's complement of the 60-bit partial products from the alignment shifter, based on the sign computed by the sign processing unit. In stage 2, it is used to find the 2's complement of the 66-bit adder result from the CSLA if it is a negative number. A traditional 2's complement block for an M-bit number computes the complement of the number and then adds '1' at the LSB; its critical path has (M-1) 2-input AND gates and (M-1) 3-input OR gates followed by one 2-input XOR gate.

The proposed Fast 2's Complement Block reduces delay and area by completely eliminating the need for adder circuitry. An M-bit Fast 2's Complement Block is built with (M-1) 2:1 multiplexers, (M-1) 2-input OR gates and (M-1) inverters. The critical path consists of (M-1) 2-input OR gates and one 2:1 multiplexer. The structure of the Fast 2's Complement Block for a 62-bit input is shown in Fig. 5.

Fig. 5. Architecture of the proposed Fast 2's Complement Block: each output bit out[i] is produced by a 2:1 multiplexer driven by the chain of OR gates over the lower-order input bits in[i-1:0].

IV. CIRCUIT IMPLEMENTATION

The proposed multi-precision PE has a 3-stage pipeline structure. Its operation can be explained in 4 stages, as follows:

A. Stage-0: Input Preprocessing

In Fig. 6, the 160-bit input accommodates ten FP16 operands, ten BF16 operands, eight TF32 operands, five FP32 operands or one FP64 operand. In input pre-processing, based on the mode selected, the sign, exponent and mantissa bits of the operands are split and stored in the input registers. For example, if the mode is 3'b001, the selected mode of operation is FP32. So, in input pre-processing, the two 160-bit inputs A and B are each split into five groups of sign bits (1 bit), exponent bits (8 bits) and mantissa bits (23 bits).

Fig. 6. Unified formats for the 160-bit input in FP16, FP32, FP64, BF16 and TF32 precisions: ten 16-bit FP16 or BF16 operands, eight TF32 operands, five 32-bit FP32 operands, or one 64-bit FP64 operand.

B. Stage 1: Configurable Multiterm Multiplication, Sign Processing, Exponent Comparison and Alignment Shifter

In stage 1 of the proposed multi-precision PE, the operations of mantissa multiplication, sign processing, exponent comparison and alignment shifting are carried out simultaneously. To reduce the number of multipliers and achieve a high utilization rate of the multipliers, the width of the multipliers is set to 12 bits [2]. To reduce the number of multiplication operations in FP64 mode from 25 to 20, fused multipliers are designed. Three fused multiplier architectures are used. In Fig. 7, the first multiplier architecture performs a 12b×12b multiplication. In Fig. 8, the second multiplier architecture performs either a 12b×12b multiplication or two 12b×5b multiplications in parallel, based on the mode of operation selected. In Fig. 9, the third multiplier architecture performs a 12b×12b multiplication, or a 12b×5b multiplication in parallel with a 17b×5b multiplication, based on the mode selected [17], [18]. In the multiplier array, a total of ten multipliers are used. Of the ten multipliers, six follow the first architecture, three follow the second architecture and one follows the third architecture. All the ten
multipliers follow a radix-4 Booth structure with a 7:2 Carry Save Adder (CSA) followed by a hybrid Carry Look Ahead adder to add the seven partial products generated. In Table II, the type and number of multipliers required are summarized. The hybrid Carry Look Ahead Adder is implemented as six 4-bit carry look ahead adders connected in a ripple-carry structure. This implementation is able to optimize both area and delay. The resulting partial products are shifted according to Equation (3), where a_i and b_j denote the i-th and j-th 12-bit segments of the two operands:

Product = Σ over i, j of (a_i × b_j) ≪ ((i + j) × 12)    (3)

TABLE II: MULTIPLIER TYPES USED AND NUMBER OF MULTIPLIERS REQUIRED

| Data type | Bit width | Bit ratio for segments | Multiplication types | Number of 12bx12b multipliers | Number of 12bx12b / parallel 12bx5b fused multipliers | Number of 12bx12b / parallel 17bx5b fused multipliers |
| FP16 | 11 | 12 | 1 time 12b x 12b | 1 | 0 | 0 |
| FP32 | 24 | 12:12 | 4 times 12b x 12b | 4 | 0 | 0 |
| FP64 | 53 | 5:12:12:12:12 | 16 times 12b x 12b, 8 times 12b x 5b, 1 time 5b x 5b | 16 | 3 | 1 |
| BF16 | 8 | 12 | 1 time 12b x 12b | 1 | 0 | 0 |
| TF32 | 11 | 12 | 1 time 12b x 12b | 1 | 0 | 0 |

Fig. 7. 12b × 12b multiplication operation: seven Booth partial products PP1-PP7 are reduced by a 7:2 compressor into Sum and Carry, which are then added by the hybrid CLA.

Fig. 8. Two 12b×5b parallel multiplication operations within one multiplier, selected by the mode of operation.

Fig. 9. 12b×5b multiplication in parallel with a 17b×5b multiplication, selected by the mode of operation.

In the proposed PE, we use the Cascaded Exponent Comparator (CEC) method proposed in [2] to reduce the hardware cost while meeting the timing requirement of the conventional EC design [18]. The CEC block used in the proposed PE employs comparators to generate the results of the input comparisons covering all possible combinations. Then, all the results are analysed by the control logic block with embedded look-up tables (LUTs) to find the maximum exponent. The most appropriate number of stage levels S is calculated using the system clock period and the delay of one CEC block with embedded comparators and LUTs. Then, in order to reduce the hardware cost, each stage makes use of CEC blocks with nearly the same number of inputs. The proposed PE employs a two-stage CEC design for the comparison of ten operands: Stage 1 includes two three-input and one four-input Exponent Comparator (EC) blocks, and Stage 2 includes one three-input EC block.

Once the ten multiplication results are obtained in each clock cycle, two cascaded 61-bit barrel shifters are used as an alignment shifter to adjust the partial products. The first barrel shifter shifts left according to (3). The second barrel shifter shifts right by the exponent difference generated by the CEC block.

C. Stage 2: Accumulation

Once the ten partial products are received from the stage-1 pipeline registers, the 2's complement representation is computed for those that are negative. In each clock cycle, ten partial products are generated, so an adder tree capable of adding ten numbers, shown in Fig. 10, is employed [2].

Fig. 10. Structure of the ten-input adder tree: the ten 2's complement inputs from the Fast 2's Complement Block are reduced through 4:2 and 3:2 CSA levels, merged with the accumulator input acc_in, and summed by the CSLA into the 66-bit result stored in the stage-2 pipeline registers.
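The adder-free 2's complement conversion applied to the negative partial products (Section III-C) can be modeled behaviourally as follows. This is an illustrative sketch of the OR-chain-plus-multiplexer structure, not the RTL, and the function name is ours:

```python
def fast_twos_complement(x, width=66):
    """Adder-free 2's complement: bits up to and including the lowest
    set '1' are copied unchanged, and all higher bits are inverted.
    `seen_one` models the running OR chain over lower-order bits; the
    XOR models each 2:1 multiplexer choosing bit vs. inverted bit."""
    out = 0
    seen_one = 0  # OR of all lower-order input bits
    for i in range(width):
        bit = (x >> i) & 1
        out |= (bit ^ seen_one) << i  # copy if no lower '1' seen, else invert
        seen_one |= bit
    return out
```

For any width-bit x the result equals (-x) mod 2^width, with no carry-propagating adder anywhere in the model, which is the point of the hardware scheme.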

This PE employs a high-speed carry select adder (CSLA), as suggested in [2], [20]. A dual-output (DO) RCA is employed, intended to replace the two-channel RCA used in a conventional CSLA. The parallel full adder (PFA) and parallel half adder (PHA) that make up the proposed DO RCA allow the parallel generation of carry and sum values under the two complementary carry-input cases. When the carry input is 0, the sum and carry outputs are equal to P1 ⊕ P2 and P1 ⋂ P2, respectively. When the carry input is 1, the sum and carry outputs are equal to NOT(P1 ⊕ P2) and P1 ⋃ P2, respectively. The proposed DO RCA uses multiplexers to select between the two results for carry 1 and 0 instead of using two RCAs. Additionally, to achieve the same logic functions as traditional CSLAs, a PFA and a PHA replace two FAs and two HAs. The numbers of PFAs and PHAs in the n-bit proposed CSLA are n-1 and 1, respectively. Within each PFA and PHA, the numbers of CMOS transistors utilized are 72 and 24, respectively. The total number of transistors is therefore 72n - 48. The proposed approach, as compared to the conventional architecture, achieves a 5.3%-25% hardware-saving ratio, with the bit-width spanning from one upwards [2]. In Fig. 12, the splitting of the mantissa for FP64 multiplication is shown. The FP64 accumulation operation is given in Fig. 13.

Fig. 11. Structure of the CSLA: four 4-bit DO-RCA blocks operate on A[15:12]/B[15:12] down to A[3:0]/B[3:0], with 2:1 multiplexers selecting each 4-bit sum Sum[15:12] ... Sum[3:0].

Fig. 12. Splitting of FP64 operands: the 53-bit mantissa of each operand is segmented as a4/b4 (5-bit) and a3-a0/b3-b0 (12-bit each).

Fig. 13. Computing diagram of segmented multiplication and accumulation of FP64 operands with the detailed shifting value for each partial product: stage-1 partial products PP1-PP10 are shifted by 0 to 36 bits and reduced to a 16-bit Carry and a 48-bit Sum1; stage-2 partial products PP11-PP20 are shifted by 48 to 84 bits and reduced, together with {5'b0, Carry}, to a 58-bit Sum2.

D. Stage 3: Normalization and Output Selection

This stage includes four steps: computing the 2's complement if the sum is negative, finding the normalization factor using the leading zero detector, mantissa normalization and exponent normalization. The mantissa normalizer is implemented as a barrel shifter that shifts left based on the input given by the Leading Zero Detector. The exponent normalization operation considers the maximum exponent generated by the exponent comparator module of stage 1, adds or subtracts the normalization factor generated by the LZD, and finally subtracts the bias (15 for FP16; 127 for FP32, BF16 and TF32; and 1023 for FP64) to generate the normalized exponent. In Fig. 14, the unified output format of the proposed multi-precision PE is given.

Fig. 14. Unified output format: the FP16, FP32, BF16 and TF32 results are zero-extended within the 64-bit output, while the FP64 result occupies all 64 bits.

V. RESULTS

The performance evaluation of the proposed design is compared with the recent SOTA work [2], which follows a similar architecture. The proposed design and the published design [2] were implemented and verified on an FPGA platform, the ZCU104 Evaluation Board. The proposed PE supports five modes of precision: FP16, FP32, FP64, BF16 and TF32. The published SOTA PE [2] has three modes of precision: FP16, FP32 and FP64. For a proper benchmark, the proposed PE was also modified to support three modes: FP16, FP32 and FP64. The proposed PE supporting five modes of precision, the proposed PE modified to support three modes of precision, and the SOTA design [2] were implemented in a GPDK 45 nm process. The designs were synthesized with the Cadence Genus tool, and the power, performance and area estimates were obtained from Cadence Genus as well.

A. FPGA Implementation

The proposed design was implemented on the Xilinx ZCU-104 Evaluation FPGA board. In Table III, the performance of the proposed work on FPGA is compared with the SOTA [2]. In [2], the pipeline stage delays of the three stages are unequal, with stage 2 having the highest delay. Due to the large delay of stage 2, [2] operates at a maximum frequency of
126.88MHz. The FPGA resource utilization of [2] is 8682 LUTs and 851 FFs, and its maximum energy efficiency is 3.176 GFLOPS/W, obtained in FP16 precision.

TABLE III: PERFORMANCE SUMMARY OF THE PROPOSED WORK ON FPGA

Design                          [2]       Proposed PE
Frequency (MHz)                 126.88    146.19
Power (W)                       0.799     0.854
Resource           LUT          8628      8708
Utilization        FF           851       1057
Energy             FP16         3.176     3.424
Efficiency         FP32         0.794     0.855
(GFLOPS/W)         FP64         0.159     0.171
                   BF16         -         3.424
                   TF32         -         2.739

In the proposed PE, the design ensures that all three pipeline stages have comparable delay. Consequently, the FPGA implementation of the proposed PE operates at a maximum frequency of 146.19MHz, which is 15.22% better than [2]. Its resource utilization of 8708 LUTs and 1057 FFs is comparable to the SOTA design. The maximum energy efficiency of the proposed PE is 3.424 GFLOPS/W in FP16, which is 7.80% more efficient than the SOTA design, and a similar energy efficiency is observed for BF16. The energy efficiency of the proposed design is consistently higher for all three common data-format types, and the design additionally incorporates two new data formats, BF16 and TF32.

B. ASIC Implementation

In this section, the power, performance and area of the proposed multi-precision PE are compared with various FP PEs from recent years, as tabulated in Table IV. In [5], the dot-product PE, implemented in single, double and quadruple precision, consumes high power and area owing to its FP128 precision support. In [6], the PE supports multiple-precision operations by breaking down high-precision multipliers; the considerable bit width of these multipliers contributes to a large delay, leading to suboptimal performance and energy efficiency. In [19] and [21], multiple-term FP PEs achieve impressive performance through exponent comparison and an innovative four-input LZA for predicting leading zeros; nonetheless, these designs are limited to executing four-term operations solely in single precision.

The proposed PE supporting five modes of precision, the variant modified to support three modes, and the SOTA design [2] were implemented in GPDK 45nm technology. For a fair comparison, the area and energy efficiency values of the compared works are scaled to obtain a normalized area and a normalized energy efficiency: the normalized area is assumed to scale with the square of the process feature size (F²) and the normalized energy efficiency with F, as elaborated in [2]. The implementation results, including performance, area, power, throughput and efficiency, are listed in Table IV. As evident from Table IV, the proposed PE supporting three modes of precision achieves the highest operating frequency of 289.6MHz, which is 68.27% better than the SOTA. It also consumes the least power, i.e., 2.23mW, a 3.87% power saving compared with the SOTA, and achieves a maximum energy efficiency of 2597.31 GFLOPS/W in FP16 precision. The proposed PE architecture supporting five modes of operation achieves a maximum frequency of 281.9MHz. The five-mode PE consumes slightly more area because of the additional control-path circuitry required to handle the two extra modes of operation, which also leads to a slight decrease in the maximum frequency and a slight increase in power. Fig. 15 shows the power, area, performance and energy efficiency comparison of the proposed PEs with the SOTA [2]. The proposed PE supporting five modes of precision achieves a maximum energy efficiency of 2368.91 GFLOPS/W, which is 59.67% more than the SOTA design accommodating three modes. In summary, the proposed PE supporting three modes of operation achieves a 68% improvement in maximum operating frequency, a 3.87% improvement in power consumption and a 75.06% improvement in maximum energy efficiency compared with the SOTA [2], whereas the proposed PE supporting five modes of operation achieves a 63.80% improvement in maximum operating frequency and a 59.67% improvement in maximum energy efficiency. Also, as evident from Table IV, owing to the balanced pipeline design and high throughput, the proposed PE achieves the highest normalized energy efficiency in comparison with [2], [5], [6], [19] and [21].

[Fig. 15: bar chart of the percentage improvement (-20% to 80%) in frequency, area, power and energy efficiency of the 5-mode and 3-mode proposed PEs over [2].]
Fig. 15. Percentage Improvement of the Proposed PE with SOTA [2] for ASIC synthesized results.

VI. CONCLUSION

This research proposed a configurable multi-precision dot-product PE that supports FP16, FP32, FP64, BF16 and TF32 precisions. Through balanced pipelining and the implementation of a modified LZD and fast 2's complement units, the proposed PE supporting three modes of operation achieves a 68% improvement in maximum operating frequency, a 75% improvement in maximum energy efficiency and 3.8% lower power consumption in comparison with [2], whereas the proposed PE supporting five modes of precision achieves a 63.8% improvement in maximum operating frequency and a 59.6% improvement in maximum energy efficiency. The proposed PE supporting five modes of precision also delivers the best energy efficiency, 2368.91 GFLOPS/W, in FP16 and BF16 precisions. The FPGA implementation of the proposed design likewise reports a 15.22% improvement in operating frequency and 7.80% higher energy efficiency in comparison with [2]. Finally, the proposed PE offers the highest normalized energy efficiency in comparison with [5], [6], [19] and [21].
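The technology normalization behind the "normalized" rows of Table IV (area scaling with F², energy efficiency with F, relative to the 45nm reference, per the rule elaborated in [2]) can be reproduced with a short sketch. The function names below are illustrative, not from the paper's scripts; the input values for design [5] are taken directly from Table IV.

```python
# Sketch of the technology normalization used for Table IV, following the
# rule from [2]: normalized area scales with the square of the process
# feature size F, and normalized energy efficiency scales linearly with F.
F_REF = 45.0  # nm, the reference process of the proposed PE

def normalized_area(area_um2: float, process_nm: float) -> float:
    """Scale a design's area from its native process down to 45 nm (prop. to F^2)."""
    return area_um2 * (F_REF / process_nm) ** 2

def normalized_energy_efficiency(ee_gflops_w: float, process_nm: float) -> float:
    """Scale a design's energy efficiency to the 45 nm reference (prop. to F)."""
    return ee_gflops_w * (process_nm / F_REF)

# Design [5] is reported at 90 nm: 795000 um^2 and 121.77 GFLOPS/W (FP16).
print(normalized_area(795_000, 90))              # 198750.0, as in Table IV
print(normalized_energy_efficiency(121.77, 90))  # 243.54, as in Table IV
```

Note that both rules leave 45nm designs unchanged (scale factor 1), which is why the [19], [21], [2] and proposed-PE entries are identical in the raw and normalized rows.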

TABLE IV: PERFORMANCE COMPARISON OF THE PROPOSED PE WITH PREVIOUS DESIGNS THROUGH ASIC IMPLEMENTATION.

Design                              [5]      [6]     [19]    [21]    [2]      Proposed PE  Proposed PE
                                                                             (3 modes)    (5 modes)
Process (nm)                        90       90      45      45      45       45           45
Frequency (MHz)                     667      357     1563    1493    172.1    289.6        281.9
Clock (ns)                          1.5      2.8     0.64    0.67    5.8      3.45         3.54
Area (µm²)                          795000   82000   63000   33000   37646    38304.34     40335.14
Normalized Area (µm²)               198750   20500   63000   33000   37646    38304.34     40335.14
Power (mW)                          43.8     250.5   32.74   16.94   2.32     2.23         2.38
No. of Multiprecision    FP16       4        4       -       -       10       10           10
Operations Each Cycle    FP32       2        4       4       2       2.5      2.5          2.5
                         FP64       1        -       -       -       0.5      0.5          0.5
                         BF16       -        -       -       -       -        -            10
                         TF32       -        -       -       -       -        -            8
Throughput               FP16       2.67     1.43    -       -       1.721    2.896        2.819
(GFLOPS)                 FP32       1.33     1.43    6.25    2.99    0.4303   0.724        0.704
                         FP64       0.67     -       -       -       0.0861   0.1448       0.141
                         BF16       -        -       -       -       -        -            2.819
                         TF32       -        -       -       -       -        -            2.255
Energy Efficiency        FP16       121.77   11.41   -       -       1483.6   2597.31      2368.91
(GFLOPS/W)               FP32       60.88    11.41   381.8   352.43  370.95   649.33       591.59
                         FP64       30.44    -       -       -       74.2     129.86       118.48
                         BF16       -        -       -       -       -        -            2368.91
                         TF32       -        -       -       -       -        -            1894.95
Normalized Energy        FP16       243.54   22.82   -       -       1483.6   2597.31      2368.91
Efficiency (GFLOPS/W)    FP32       121.76   22.82   381.8   352.43  370.95   649.33       591.59
                         FP64       60.88    -       -       -       74.2     129.86       118.48
                         BF16       -        -       -       -       -        -            2368.91
                         TF32       -        -       -       -       -        -            1894.95
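The throughput rows of Table IV follow directly from the clock frequency and the number of operations issued per cycle. The snippet below is a cross-check only, with all inputs taken from the table; it is not tooling from the paper.

```python
# Throughput in GFLOPS = frequency in GHz x operations completed per cycle.
# Frequencies (MHz) and per-cycle operation counts are Table IV entries
# for the proposed PEs.
def throughput_gflops(freq_mhz: float, ops_per_cycle: float) -> float:
    return freq_mhz / 1000.0 * ops_per_cycle  # MHz -> GHz

# Proposed PE, 3 modes, 289.6 MHz:
print(round(throughput_gflops(289.6, 10), 3))   # 2.896  (FP16)
print(round(throughput_gflops(289.6, 2.5), 3))  # 0.724  (FP32)
print(round(throughput_gflops(289.6, 0.5), 4))  # 0.1448 (FP64)

# Proposed PE, 5 modes, 281.9 MHz:
print(round(throughput_gflops(281.9, 10), 3))   # 2.819  (FP16 / BF16)
print(round(throughput_gflops(281.9, 8), 3))    # 2.255  (TF32)
```

The energy-efficiency rows appear consistent with counting each multiply-accumulate as two FLOPs: for the 3-mode PE in FP16, 2 × 2.896 GFLOPS / 2.23 mW ≈ 2597.3 GFLOPS/W, matching the table entry.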

REFERENCES

[1] V. Camus, L. Mei, C. Enz and M. Verhelst, "Review and Benchmarking of Precision-Scalable Multiply-Accumulate Unit Architectures for Embedded Neural-Network Processing," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 9, no. 4, pp. 697-711, Dec. 2019, doi: 10.1109/JETCAS.2019.2950386.
[2] W. Mao et al., "A Configurable Floating-Point Multiple-Precision Processing Element for HPC and AI Converged Computing," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 30, no. 2, pp. 213-226, Feb. 2022, doi: 10.1109/TVLSI.2021.3128435.
[3] I. Sourdis, D. A. Khan, A. Malek, S. Tzilis, G. Smaragdos and C. Strydis, "Resilient Chip Multiprocessors with Mixed-Grained Reconfigurability," IEEE Micro, vol. 36, no. 1, pp. 35-45, Jan.-Feb. 2016, doi: 10.1109/MM.2015.7.
[4] V. Sze, Y.-H. Chen, T.-J. Yang and J. S. Emer, "Efficient processing of deep neural networks: A tutorial and survey," Proc. IEEE, vol. 105, no. 12, pp. 2295-2329, Dec. 2017.
[5] H. Zhang, D. Chen and S.-B. Ko, "Efficient Multiple-Precision Floating-Point Fused Multiply-Add with Mixed-Precision Support," IEEE Transactions on Computers, vol. 68, no. 7, pp. 1035-1048, July 2019, doi: 10.1109/TC.2019.2895031.
[6] S.-R. Kuang, C.-Y. Liang and M.-F. Chang, "A Multi-functional Multi-precision 4D Dot Product Unit with SIMD Architecture," Arab J Sci Eng, vol. 41, pp. 3139-3151, 2016, doi: 10.1007/s13369-016-2117-3.
[7] H. Tan, G. Tong, L. Huang, L. Xiao and N. Xiao, "Multiple-Mode-Supporting Floating-Point FMA Unit for Deep Learning Processors," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 31, no. 2, pp. 253-266, 2023, doi: 10.1109/TVLSI.2022.3226185.
[8] HISPE source repository. [Online]. Available: https://anonymous.4open.science/r/HISPE-High-Speed-Configurable-Floating-Point-Multi-Precision-Processing-Element--71D9/README.md
[9] "BFLOAT16: The secret to high performance on Cloud TPUs," Google Cloud Blog. [Online]. Available: https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus. [Accessed: 19-Apr-2023].
[10] U. Köster, T. J. Webb, X. Wang, M. Nassar and N. Rao, TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Accessed: Oct. 26, 2021. [Online]. Available: https://tensorflow.org
[11] P. Kharya, "What is the TensorFloat-32 Precision Format?," NVIDIA Blog, May 18, 2020. [Online]. Available: https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/
[12] J. L. Gustafson and I. T. Yonemoto, "Beating floating point at its own game: Posit arithmetic," Supercomput. Frontiers Innov., vol. 4, no. 2, pp. 71-86, 2017.
[13] J. Lu, C. Fang, M. Xu, J. Lin and Z. Wang, "Evaluations on deep neural networks training using posit number system," IEEE Trans. Comput., vol. 70, no. 2, pp. 174-187, Feb. 2021.
[14] Nvidia Corporation, "Mixed-Precision Training of Deep Neural Networks," whitepaper. Accessed 19 April 2023. [Online]. Available: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-mixed-precision-training-whitepaper.pdf
[15] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv and Y. Bengio, "Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1," arXiv preprint arXiv:1602.02830, 2016.
[16] Intel Corporation, "Deep Learning Inference Efficiency: Mixed-Precision vs. FP32," whitepaper. Accessed 19 April 2023. [Online]. Available: https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/deep-learning-inference-efficiency-mixed-precision-fp32-paper.pdf
[17] S.-R. Kuang, J.-P. Wang and C.-Y. Guo, "Modified Booth Multipliers With a Regular Partial Product Array," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 56, no. 5, pp. 404-408, May 2009, doi: 10.1109/TCSII.2009.2019334.
[18] H. Kaul et al., "A 1.45GHz 52-to-162GFLOPS/W variable-precision floating-point fused multiply-add unit with certainty tracking in 32nm CMOS," 2012 IEEE International Solid-State Circuits Conference, San Francisco, CA, USA, 2012, pp. 182-184, doi: 10.1109/ISSCC.2012.6176987.
[19] J. Sohn and E. E. Swartzlander, "A Fused Floating-Point Four-Term Dot Product Unit," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 63, no. 3, pp. 370-378, March 2016, doi: 10.1109/TCSI.2016.2525042.
[20] B. Ramkumar and H. M. Kittur, "Low-Power and Area-Efficient Carry Select Adder," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 20, no. 2, pp. 371-375, Feb. 2012, doi: 10.1109/TVLSI.2010.2101621.
[21] J. Sohn and E. E. Swartzlander, "Improved architectures for a floating point fused dot product unit," in Proc. IEEE 21st Symp. Comput. Arithmetic, Apr. 2013, pp. 41-48.

