HISPE: High-Speed Configurable Floating-Point Multi-Precision Processing Element
Abstract—Multiple precision modes are needed for a configurable floating-point processing element (PE) because they provide flexibility in handling different types of numerical data with varying levels of precision and performance metrics. Performing high-precision floating-point operations has the benefit of producing highly precise and accurate results while allowing for a greater range of numerical representation. Conversely, low-precision operations offer faster computation speeds and lower power consumption. In this paper, we propose a configurable multi-precision processing element (PE) which supports Half Precision, Single Precision, Double Precision, BrainFloat-16 (BF16) and TensorFloat-32 (TF32). The design is realized using GPDK 45 nm technology and operates at a 281.9 MHz clock frequency. The design was also implemented on the Xilinx ZCU104 FPGA evaluation board. Compared with previous state-of-the-art (SOTA) multi-precision PEs, the proposed design supports two more floating-point data formats, namely BF16 and TF32. It achieves the best energy performance with 2368.91 GFLOPS/W and offers a 63% improvement in operating frequency with comparable footprint and power metrics.

Keywords—Floating Point (FP), Processing Element (PE), TensorFloat-32 (TF32), BrainFloat-16 (BF16), High-Performance Computing (HPC), Multiply-Accumulate (MAC).

I. INTRODUCTION

As the Internet of Things (IoT) and Artificial Intelligence (AI) continue to advance quickly, there is a huge computing need to carry out billions of Multiply-Accumulate (MAC) operations per second [1], [2]. High-precision floating-point operations use a greater number of bits to represent the values, resulting in more accurate results. This is especially important in applications where accuracy is critical, such as scientific research, engineering, and financial analysis. In addition to increased accuracy, high-precision floating-point operations can handle a wider range of numerical values, allowing for larger representation ranges. This is useful in applications where data can vary significantly in magnitude, such as astrophysics. However, this increased precision comes at the cost of slower computation speed and higher power consumption. In contrast, low-precision floating-point operations use fewer bits to represent the values, which results in faster computation speed and lower power consumption [2], [3]. However, the lower precision can lead to decreased accuracy in the final results. Reduced precision has demonstrated large benefits, paving the way towards computing in mobile devices and IoT nodes [1]. So, many HPC and AI computing applications adopt the collaborative use of multiple-precision FP data, PEs and algorithms to meet the accuracy requirements and accelerate the computing process [1], [2], [4], [5], [6].

Either high-precision-split (HPS) or low-precision-combination (LPC) approaches are used to construct a configurable FP PE. With the HPS method, high-precision computing blocks are divided into smaller pieces that can be used for operations with lower precision. Only a small portion of the hardware is needed for configuration, and speed is maintained. However, this method suffers from a low utilization rate of the multiplication array. In the LPC method, low-precision computing blocks are duplicated and combined to support multiple-precision operations with additional shifters and adders [2]. An LPC-based configurable FP PE is proposed in [7] to perform a number of high-precision operations by grouping FP16 unit multipliers. The LPC method has a better utilization rate of the multiplication array, but the multiterm multiplication and accumulation operations result in a long processing period. Multi-precision PEs also support a wide variety of data formats to suit the mixed-precision computing needs of various applications.

This article proposes a three-stage pipelined configurable multi-precision floating-point processing element. The processing element can operate in five precision modes, Half Precision (FP16), Single Precision (FP32), Double Precision (FP64), BrainFloat-16 (BF16) and Nvidia's TensorFloat-32 (TF32), based on the mode selected. The main contributions of this article are summarized as follows:

1) A fast 7:2 Carry Save Adder (CSA) along with a hybrid Carry Look Ahead (CLA) adder structure, integrated with a 12b×12b radix-4 Booth multiplier, is introduced.

2) A fast 2's complement block is proposed to reduce the critical path delay of stage 2 and stage 3 of the pipelined architecture.

3) A fast Leading Zero Detector (LZD) optimized for area and delay is introduced to reduce the critical path delay of stage 3 of the pipelined architecture.

4) The configurable floating-point multi-precision element is retimed at the RTL level to ensure that the critical path delays of all the pipelined stages are comparable, thus ensuring maximum operating frequency.

To the authors' knowledge, this is the first time that a multi-precision processing element incorporating five different data formats, including BF16 and TF32 along with FP16, FP32 and FP64, has been implemented and characterized. The processing element designs are made freely available in [8] for further use by the researchers' and designers' community. The paper not only improves the PE design but also accommodates two more data formats, which is relevant for modern-day AI computing applications. It is a step towards designing hardware-efficient AI systems on chip.
II. TRADITIONAL IEEE-754 DESIGN APPROACH

A. Standard IEEE-754 Format

The standard IEEE 754 format of floating-point number representation contains three parts: sign (S), exponent (E) and mantissa (M). The sign part of the floating-point representation is 1 bit, where '0' represents a positive number and '1' represents a negative number. Fig. 1 shows the representation of Half Precision (FP16), Single Precision (FP32) and Double Precision (FP64) in the standard IEEE 754 format.

The posit format, for example, uses a different encoding scheme for the exponent and the mantissa, which allows for greater precision and range of values. However, posit is not widely used due to increased complexity and potential performance issues.

[Figure: custom floating-point formats: Bfloat16 (1 sign bit, 8 exponent bits, 7 mantissa bits) and Nvidia's TensorFloat-32 (1 sign bit, 8 exponent bits, 10 mantissa bits).]
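To make the field layout concrete, the short Python sketch below (illustrative, not from the paper) splits a raw encoding into its sign, exponent and mantissa fields using the widths listed above; the TF32 entry assumes the 19-bit layout shown in the figure.

# Exponent and mantissa widths for the formats discussed in this paper.
FORMATS = {"FP16": (5, 10), "FP32": (8, 23), "FP64": (11, 52),
           "BF16": (8, 7), "TF32": (8, 10)}

def decode(bits, fmt):
    """Split a raw encoding into (sign, exponent, mantissa) fields."""
    e_bits, m_bits = FORMATS[fmt]
    sign = (bits >> (e_bits + m_bits)) & 1
    exponent = (bits >> m_bits) & ((1 << e_bits) - 1)
    mantissa = bits & ((1 << m_bits) - 1)
    return sign, exponent, mantissa

# 0xC000 in FP16 encodes -2.0: sign = 1, biased exponent = 16, mantissa = 0.
assert decode(0xC000, "FP16") == (1, 16, 0)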
III. PROPOSED ARCHITECTURE

A. Data Path Flow

The configurable multi-precision processing element supports five precision modes (FP16, FP32, FP64, BF16 and TF32) using a 3-stage pipelined Single Instruction Multiple Data (SIMD) architecture. The structure is built using the LPC method, and the bit width of the unit multiplier is determined to minimize the redundancy cost of the all-precision implementation. The size of the unit multiplier is selected such that it can perform a single operation of FP16, BF16 and TF32.

[Figure: the 3-bit MODE input (3'b000: FP16, 3'b001: FP32, 3'b010: FP64, 3'b011: Bfloat-16, 3'b100: Tensor-Float) and the 160-bit inputs A and B feed the Input Pre-Processing block and input registers, which separate the sign, exponent, mantissa and control fields. Pipeline stage 1 holds Sign Processing, Exponent Comparison, the multiplier array producing 10 partial products, the exponent-difference logic and the alignment shifter, delivering ten shifted 60-bit partial products to the stage-1 registers. Stage 2 holds the Adder Tree (Carry and Sum) and the CSLA producing the Adder Result. Stage 3 holds the modified Leading Zero Detector and output selection, driving the 64-bit output register.]

Fig. 3. Architecture of the proposed 3-stage pipelined multi-precision PE.

In Fig. 3, the 3-bit signal MODE is used to select the precision of operation of the PE. The Input Pre-Processing stage is used to bifurcate the data according to the mode selected so that they can be stored appropriately in the input registers (Input-Regs). The Exponent Comparison block computes the maximum exponent and the exponent differences of the product terms. A multiplier array designed with the LPC methodology is used to compute the partial products. The multiplier array is composed of 10 multipliers: 6 conventional multipliers and 4 fused multipliers. Sign Processing determines the sign of the product terms by performing XOR operations on the input operands. The fast 2's complement block is employed to find the 2's complement of the partial products based on the sign generated by the Sign Processing block. An Adder Tree composed of 3:2 CSAs and 4:2 CSAs is used to find the sum of the 10 partial products in each clock cycle. The Sum and Carry generated by the Adder Tree are accumulated by the Carry Select Adder (CSLA). In the final stage, the fast 2's complement block is employed to find the 2's complement of the Adder Result if it is a negative number. The results of the exponent and accumulation are adjusted through normalization and rounding to obtain the final result in the applicable standard formats.
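To tie these blocks together, the following simplified behavioral sketch (an illustration, not the RTL, and not bit-exact; 24-bit mantissas are assumed) walks a set of operand pairs through the same flow: per-term sign and exponent handling with mantissa multiplication, alignment to the maximum exponent, signed accumulation, and a final renormalization.

import math

def mac_datapath(pairs):
    # Stage 1: per-term sign, exponent and integer mantissa product, then alignment.
    products = []
    for a, b in pairs:
        ma, ea = math.frexp(abs(a))            # mantissa in [0.5, 1), exponent
        mb, eb = math.frexp(abs(b))
        sign = -1 if (a < 0) ^ (b < 0) else 1
        mant = int(ma * (1 << 24)) * int(mb * (1 << 24))   # 24-bit mantissas
        products.append((sign, ea + eb, mant))
    e_max = max(e for _, e, _ in products)
    aligned = [s * (m >> (e_max - e)) for s, e, m in products]  # right-shift to align
    # Stage 2: accumulate the aligned, signed terms (adder tree + CSLA in hardware).
    acc = sum(aligned)
    # Stage 3: renormalize (LZD + shift in hardware) and rebuild the result.
    return math.ldexp(acc, e_max - 48)         # 48 = 2 x 24 fractional bits

# Example: approximates 1.5*2.0 + (-0.25)*4.0 = 2.0
print(mac_datapath([(1.5, 2.0), (-0.25, 4.0)]))

In the PE, these three phases map onto the three pipeline stages, with the 2's complement, adder tree and LZD blocks detailed in the following subsections.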
B. Modified Leading Zero Detector (LZD)

The traditional Leading Zero Detector is designed as a cascaded multiplexer-like structure where the number of cascaded multiplexer phases is equal to one less than the number of bits in the input taken by the LZD. The number of multiplexers in each phase is equal to ceil(log2(Number_of_Bit_Input)). The critical path of this architecture is characterized by the delay of one 2:1 multiplexer multiplied by the number of multiplexer phases.

[Fig. 4 (circuit-level implementation of the proposed LZD): the 66-bit input is split into six 11-bit segments, in[65:55] down to in[10:0]; each segment feeds an 11-bit zero comparator and, in parallel, a chain of 10 cascaded 2:1 multiplexer phases (7 bits wide), and a final chain of 5 cascaded 2:1 multiplexer phases performs the output selection.]

The proposed architecture segments the input and achieves parallelism between the segments, ensuring a significant decrease in delay compared with the traditional LZD implementation. Here, we consider the 66-bit unsigned sum from the fast 2's complement block of stage 3 of the pipelined architecture as the input to the LZD. Let N denote the number of segments. The modified LZD has three stages. Stage 1 consists of N blocks of zero comparators of width nw, as stated in Equation (2), where all the N blocks of comparators operate in parallel. Stage 2 consists of N blocks of (nw - 1) cascaded multiplexer phases (of bit width 7); all the N blocks of cascaded 2:1 multiplexers operate in parallel, and Stage 1 and Stage 2 also operate in parallel. Stage 3 consists of a chain of (N - 1) cascaded 2:1 multiplexer phases. For a 66-bit input, Table I summarizes the area and delay performance of the LZD for different numbers of segments.

nw = ceil(66 / N)    (2)
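A minimal software model of the segmented detector is given below, assuming the exact six-way split used here (66 = 6 x 11); the per-segment zero tests and local counts correspond to Stages 1 and 2, and the final selection loop to the Stage 3 multiplexer chain.

def modified_lzd(value, width=66, n_segments=6):
    nw = width // n_segments                    # Equation (2); exact split assumed
    segments = [(value >> (width - nw * (i + 1))) & ((1 << nw) - 1)
                for i in range(n_segments)]     # most-significant segment first
    is_zero = [seg == 0 for seg in segments]                  # Stage 1: zero comparators
    local_lz = [nw - seg.bit_length() for seg in segments]    # Stage 2: per-segment count
    count = 0                                   # Stage 3: selection chain
    for zero, lz in zip(is_zero, local_lz):
        if not zero:
            return count + lz
        count += nw
    return width                                # all-zero input

assert modified_lzd(1 << 40) == 25              # 66 - 41 leading zeros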
TABLE I: AREA AND DELAY PERFORMANCE ANALYSIS OF LZD FOR DIFFERENT NUMBER OF SEGMENTS

Number of segments (N) | Width of each segment (nw) | 2:1 multiplexers per phase | Number of 2:1 multiplexers | Zero comparators used | Width of zero comparator | Weighted area | Critical path: comparators | Critical path: 2:1 multiplexer phases | Weighted delay
1 (traditional design) | 66 | 7 | 455 | 0 | 0 | 455 | 0 | 66 | 65
4 | 17 | 7 | 469 | 4 | 17 | 537 | 1 | 19 | 20
5 | 14 | 7 | 483 | 5 | 14 | 553 | 1 | 17 | 18
6 | 11 | 7 | 455 | 6 | 11 | 521 | 1 | 15 | 16
7 | 10 | 7 | 483 | 7 | 10 | 553 | 1 | 15 | 16
8 | 9 | 7 | 497 | 8 | 9 | 569 | 1 | 15 | 16
9 | 8 | 7 | 497 | 9 | 8 | 569 | 1 | 15 | 16
10 | 7 | 7 | 483 | 10 | 7 | 553 | 1 | 15 | 16
11 | 6 | 7 | 455 | 11 | 6 | 521 | 1 | 15 | 16
Here, the weighted area is calculated by adding the product of the number of comparators and the width of each comparator to the total number of 2:1 multiplexers. Similarly, the weighted delay is calculated by adding the number of comparators in the critical path to the number of multiplexer phases. From Table I, it is evident that the optimum number of segments is 6. In Fig. 4, the circuit-level implementation of the proposed LZD architecture is shown.
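The weighting rule is easy to cross-check; the sketch below (illustrative only) recomputes the Table I entries from N, nw and the 7-wide multiplexer phases.

from math import ceil

def lzd_metrics(n, width=66, mux_per_phase=7):
    """Recompute a Table I row for n segments of the 66-bit LZD input."""
    if n == 1:                                        # traditional cascaded design
        nw, comparators = width, 0
        total_mux = (width - 1) * mux_per_phase
        phases = width - 1
    else:
        nw, comparators = ceil(width / n), n
        total_mux = (n * (nw - 1) + (n - 1)) * mux_per_phase
        phases = (nw - 1) + (n - 1)                   # parallel chains, then selection
    weighted_area = comparators * nw + total_mux
    weighted_delay = min(comparators, 1) + phases     # at most one comparator in the path
    return nw, total_mux, weighted_area, weighted_delay

for n in (1, 4, 5, 6, 7, 8, 9, 10, 11):
    print(n, lzd_metrics(n))      # n = 6 reproduces the 455 / 521 / 16 row of Table I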
C. Fast 2's Complement Block

In the proposed design, the 2's complement block is employed in stage 1 and stage 2 of the pipelined architecture. In stage 1, the 2's complement block is used to find the 2's complement of the 60-bit partial products from the alignment shifter, based on the sign computed by the sign processing unit. In stage 2, it is used to find the 2's complement of the 66-bit adder result from the CSLA if it is a negative number. The traditional design of a 2's complement block for an M-bit number involves computing the complement of the number followed by adding '1' to the LSB. In that architecture, the critical path has (M-1) 2-input AND gates and (M-1) 3-input OR gates followed by one 2-input XOR gate.

The proposed fast 2's complement block aims at reducing the delay and area by completely eliminating the need for adder circuitry. An M-bit fast 2's complement block is built with (M-1) 2:1 multiplexers, (M-1) 2-input OR gates and (M-1) inverters. The critical path has (M-1) 2-input OR gates and one 2:1 multiplexer. The structure of the fast 2's complement block for a 62-bit input is shown in Fig. 5.

[Figure: each input bit in[61] down to in[0] drives a 2:1 multiplexer (with an inverter on one input), producing the output bits out[61] down to out[0].]

Fig. 5. Architecture of the proposed fast 2's complement block.
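Functionally, the block keeps every bit up to and including the least-significant '1' and inverts all higher bits, with the running OR of the lower-order bits acting as each multiplexer's select. A behavioural sketch of that reading (not the authors' gate-level design):

def fast_twos_complement(x, width):
    result = x & 1                       # bit 0 passes through unchanged
    lower_or = x & 1                     # running OR of the lower-order bits
    for i in range(1, width):
        bit = (x >> i) & 1
        result |= ((bit ^ 1) if lower_or else bit) << i   # 2:1 mux per bit
        lower_or |= bit
    return result

assert fast_twos_complement(6, 8) == (-6) & 0xFF   # 0b0000_0110 -> 0b1111_1010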
IV. CIRCUIT IMPLEMENTATION

The proposed multi-precision PE has a 3-stage pipeline structure. The working of the proposed multi-precision PE can be explained in four stages, as follows.

A. Stage-0: Input Preprocessing

In Fig. 6, the 160-bit input accommodates ten FP16 operands, ten BF16 operands, eight TF32 operands, five FP32 operands or one FP64 operand. In Input Pre-Processing, based on the mode selected, the sign, exponent and mantissa bits of the operands are split and stored in the input registers. For example, if the mode is 3'b001, the selected mode of operation is FP32. So, in input pre-processing, each of the two 160-bit inputs A and B would be split into five sign bits (1 bit each), five exponent fields (8 bits each) and five mantissa fields (23 bits each).

[Figure: packing of the 160-bit input word: ten 16-bit FP16 operands (FP16hp9..FP16hp0), five 32-bit FP32 operands (FP32sp4..FP32sp0), one FP64 operand (FP64dp0) in bits [63:0] with the upper bits zeroed, ten 16-bit BF16 operands (BF16b9..BF16b0), or eight TF32 operands (TF32t7..TF32t0) in 18-bit slots covering bits [143:0] with the upper bits zeroed.]

Fig. 6. Unified formats for the 160-bit input in FP16, FP32, FP64, BF16 and TF32 precisions.
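A sketch of this slicing step is shown below (illustrative; operand 0 is taken from the least-significant slot, and the 18-bit TF32 packing of Fig. 6 is omitted for brevity).

MODES = {                    # mode: (operand width, operand count, exponent bits, mantissa bits)
    0b000: (16, 10, 5, 10),  # FP16
    0b001: (32, 5, 8, 23),   # FP32
    0b010: (64, 1, 11, 52),  # FP64 (upper 96 input bits unused)
    0b011: (16, 10, 8, 7),   # BF16
}

def preprocess(word160, mode):
    width, count, e_bits, m_bits = MODES[mode]
    fields = []
    for i in range(count):
        op = (word160 >> (i * width)) & ((1 << width) - 1)
        sign = (op >> (e_bits + m_bits)) & 1
        exponent = (op >> m_bits) & ((1 << e_bits) - 1)
        mantissa = op & ((1 << m_bits) - 1)
        fields.append((sign, exponent, mantissa))
    return fields

# FP32 mode (3'b001): five (sign, 8-bit exponent, 23-bit mantissa) triples per input.
print(len(preprocess(0, 0b001)))   # 5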
B. Stage 1: Configurable Multiterm Multiplication, Sign Processing, Exponent Comparison and Alignment Shifter

In stage 1 of the proposed multi-precision PE, the operations of mantissa multiplication, sign processing, exponent comparison and alignment shifting are carried out simultaneously. To reduce the number of multipliers and achieve a high utilization rate of the multipliers, the width of the multipliers is set to 12 bits [2]. To reduce the number of multiplication operations in FP64 mode from 25 to 20, fused multipliers are designed. Three multiplier architectures are used. In Fig. 7, the first multiplier architecture performs a 12b×12b multiplication. In Fig. 8, the second multiplier architecture performs a 12b×12b multiplication or two 12b×5b multiplications in parallel, based on the mode of operation selected. In Fig. 9, the third multiplier architecture performs a 12b×12b multiplication, or a 12b×5b multiplication in parallel with a 17b×5b multiplication, based on the mode selected [17], [18]. In the multiplier array, a total of ten multipliers are used: six follow the first architecture, three follow the second architecture and one follows the third architecture. All ten multipliers follow a radix-4 Booth structure with a 7:2 Carry Save Adder (CSA) followed by a hybrid Carry Look Ahead adder to add the seven partial products generated. In Table II, the type and number of multipliers required are summarized.

TABLE II: MULTIPLIER TYPES USED AND NUMBER OF MULTIPLIERS REQUIRED

Data type | Bit width | Bit ratio for segments | Multiplication types | 12b×12b multipliers | 12b×12b / parallel 12b×5b fused multipliers | 12b×12b / parallel 17b×5b fused multipliers
FP16 | 11 | 12 | 1 × (12b×12b) | 1 | 0 | 0
FP32 | 24 | 12:12 | 4 × (12b×12b) | 4 | 0 | 0
FP64 | 53 | 5:12:12:12:12 | 16 × (12b×12b), 8 × (12b×5b), 1 × (5b×5b) | 16 | 3 | 1
BF16 | 8 | 12 | 1 × (12b×12b) | 1 | 0 | 0
TF32 | 11 | 12 | 1 × (12b×12b) | 1 | 0 | 0

The hybrid Carry Look Ahead Adder is implemented as six 4-bit carry look ahead adders connected in a ripple carry structure. This implementation is able to optimize both area and delay. The resulting partial products are shifted according to formula (3), where the operands are split into 12-bit segments indexed 0 to N-1:

PP(i, j) = (a_i × b_j) << ((i + j) × 12)    (3)
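Formula (3) is the schoolbook decomposition of a wide product into 12-bit limbs; the sketch below (illustrative, shown for a 24-bit FP32-style mantissa split into two segments) confirms that summing the shifted sub-products reproduces the full product.

def segmented_multiply(a, b, seg_bits=12, n_segments=2):
    mask = (1 << seg_bits) - 1
    a_seg = [(a >> (seg_bits * i)) & mask for i in range(n_segments)]
    b_seg = [(b >> (seg_bits * j)) & mask for j in range(n_segments)]
    acc = 0
    for i in range(n_segments):
        for j in range(n_segments):
            acc += (a_seg[i] * b_seg[j]) << ((i + j) * seg_bits)   # formula (3)
    return acc

# Four 12b x 12b sub-products rebuild one 24b x 24b mantissa product.
assert segmented_multiply(0xABCDEF, 0x123456) == 0xABCDEF * 0x123456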
[Figure: operands A (a11..a0) and B (b11..b0) generate seven partial products PP1-PP7, which a 7:2 compressor reduces to Sum and Carry; the hybrid CLA then produces the output.]

Fig. 7. 12b × 12b multiplication operation.

[Fig. 8 (second, fused multiplier architecture): operands A (a11..a0), B (b11..b0) and C (two 5-bit fields, c10..c6 and c4..c0, separated by zeros) generate seven partial products that pass through the same 7:2 compressor and hybrid-CLA path, supporting either one 12b×12b product or two 12b×5b products in parallel.]

In the proposed PE, we use the Cascaded Exponent Comparator (CEC) method proposed in [2] to reduce the hardware cost while meeting the timing requirement of the conventional EC design [18]. The CEC block used in the proposed PE employs comparators to generate the results of the input comparisons covering all possible combinations. Then, all the results are analysed by the control logic block with embedded look-up tables (LUTs) to find the maximum exponent. The most appropriate number of stage levels S is calculated using the system clock period and the delay of one CEC block with embedded comparators and LUTs. Then, in order to reduce the hardware cost, each stage makes use of CEC blocks with nearly the same number of inputs. The proposed PE employs a two-stage CEC design for the comparison of ten operands. Stage 1 includes two three-input and one four-input Exponent Comparator (EC) blocks, and Stage 2 includes one three-input EC block.

Once we get the ten multiplication results in each clock cycle, two cascaded 61-bit barrel shifters are used as an alignment shifter to adjust the partial products. The first barrel shifter shifts left according to (3). The second barrel shifter shifts right by the exponent difference generated by the CEC block.
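The two-stage comparison is easy to model; in the sketch below (illustrative; the hardware uses comparator blocks with LUT-based control rather than a software max), Stage 1 reduces groups of 3, 3 and 4 exponents in parallel and Stage 2 reduces the three partial maxima.

def cascaded_exponent_max(exponents):
    assert len(exponents) == 10
    stage1 = [max(exponents[0:3]),      # three-input EC block
              max(exponents[3:6]),      # three-input EC block
              max(exponents[6:10])]     # four-input EC block
    e_max = max(stage1)                 # Stage 2: one three-input EC block
    return e_max, [e_max - e for e in exponents]   # differences drive the right shifts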
C. Stage 2: Accumulation

Once we get the ten partial products from the stage 1 pipeline registers, if they are negative, their 2's complement representation is computed. In each clock cycle, ten partial products are generated, so an adder tree capable of adding ten numbers, shown in Fig. 10, is employed [2].
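The adder tree is a carry-save reduction: 3:2 (and 4:2) compressors collapse the ten signed, aligned terms into one sum/carry pair, and a single carry-propagate addition (the CSLA) resolves it. A sketch of that idea, assuming the 66-bit two's-complement accumulator width used above:

WIDTH = 66
MASK = (1 << WIDTH) - 1

def csa_3to2(a, b, c):
    """3:2 compressor: a + b + c == sum + carry (carry-save form)."""
    return a ^ b ^ c, ((a & b) | (b & c) | (a & c)) << 1

def adder_tree(terms):
    vals = [t & MASK for t in terms]          # two's-complement encode each term
    while len(vals) > 2:
        s, cy = csa_3to2(*vals[:3])
        vals = [s & MASK, cy & MASK] + vals[3:]
    return sum(vals) & MASK                   # final carry-propagate add (CSLA)

# Ten signed terms reduce to the same 66-bit result as a direct summation.
terms = [7, -3, 12, -5, 1, 0, -9, 4, 2, -1]
assert adder_tree(terms) == sum(terms) & MASK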
This PE employs a high-speed carry select adder (CSLA), as suggested in [2], [20]. A dual-output (DO) RCA is employed, which is intended to replace the two-channel RCA used in the conventional CSLA. The parallel full adder (PFA) and parallel half adder (PHA) that make up the proposed DO-RCA structure allow the parallel generation of carry and sum values under complementary carry input cases. When the carry input is 0, the sum and carry outputs are equal to P1 ⊕ P2 and P1 ⋂ P2, respectively. When the carry input is 1, the sum and carry outputs are equal to NOT(P1 ⊕ P2) and P1 ∪ P2, respectively. Instead of duplicating the adder chains as traditional CSLAs do, PFA and PHA replace two FAs and two HAs. The numbers of PFAs and PHAs in the n-bit proposed CSLA are n-1 and 1, respectively. Within each PFA and PHA, the numbers of CMOS transistors utilized are 72 and 24, respectively. The total number of transistors is therefore reduced, with the saving ratio varying with bit width and spanning from one to positive infinity [2].

[Figure: a 16-bit CSLA built from four 4-bit dual-output RCA blocks operating on A[15:12]/B[15:12] down to A[3:0]/B[3:0], with 2:1 multiplexers selecting the group sums Sum[15:12], Sum[11:8], Sum[7:4] and Sum[3:0].]
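Under the stated per-bit behaviour, both carry-in channels are produced in one pass and the group carry simply selects between them. The sketch below (my illustration of the idea, not the transistor-level design) models a 4-bit dual-output RCA and a 16-bit CSLA built from four such groups.

def dual_output_rca(a, b, width=4):
    def ripple(carry):
        total = 0
        for i in range(width):
            p1, p2 = (a >> i) & 1, (b >> i) & 1
            if carry == 0:
                s, carry = p1 ^ p2, p1 & p2            # PFA outputs for cin = 0
            else:
                s, carry = 1 - (p1 ^ p2), p1 | p2      # PFA outputs for cin = 1
            total |= s << i
        return total, carry
    return ripple(0), ripple(1)      # both channels computed in parallel in hardware

def csla_16(a, b):
    result, carry = 0, 0
    for g in range(4):                                  # four 4-bit DO-RCA groups
        (s0, c0), (s1, c1) = dual_output_rca((a >> 4 * g) & 0xF, (b >> 4 * g) & 0xF)
        s, carry = (s1, c1) if carry else (s0, c0)      # 2:1 mux per group
        result |= s << 4 * g
    return result, carry

assert csla_16(0xABCD, 0x1234) == ((0xABCD + 0x1234) & 0xFFFF, 0)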
In Fig. 12, the splitting of the mantissa for FP64 multiplication is shown. The FP64 accumulation operation is illustrated in Fig. 13, where each partial product a_i × b_j is shifted left by (i + j) × 12 bits, as in (3), before being accumulated over two stages.

Fig. 13. Computing diagram of segmented multiplication and accumulation of FP64 operands with detailed shifting value for each partial product.

A. FPGA Implementation

The proposed design and the published design [2] were implemented and verified on the FPGA platform, the Xilinx ZCU104 evaluation board.
The design in [2] operates at a maximum frequency of 126.88 MHz. The FPGA resource utilization of [2] for LUTs and FFs is 8682 and 851, respectively. For [2], the maximum energy efficiency is 3.176 GFLOPS/W, obtained in FP16 precision.

TABLE III: PERFORMANCE SUMMARY OF THE PROPOSED WORK ON FPGA

Design | [2] | Proposed PE
Frequency (MHz) | 126.88 | 146.19
Power (W) | 0.799 | 0.854
Resource utilization: LUT | 8628 | 8708
Resource utilization: FF | 851 | 1057
Energy efficiency (GFLOPS/W): FP16 | 3.176 | 3.424
Energy efficiency (GFLOPS/W): FP32 | 0.794 | 0.855
Energy efficiency (GFLOPS/W): FP64 | 0.159 | 0.171
Energy efficiency (GFLOPS/W): BF16 | - | 3.424
Energy efficiency (GFLOPS/W): TF32 | - | 2.739
In the proposed PE, the design ensured that all three pipeline stages have comparable delay. So, the FPGA implementation of the proposed PE operates at a maximum frequency of 146.19 MHz, which is 15.22% better than [2]. The FPGA resource utilization of the proposed PE for LUTs and FFs is 8708 and 1057, respectively, which is comparable to the SOTA design. For the proposed PE, the maximum energy efficiency is 3.424 GFLOPS/W for FP16, which is 7.80% more efficient than the SOTA design. A similar energy efficiency metric is observed for BF16 precision. The energy efficiency of the proposed system is consistently higher for all three common data formats; besides, the proposed design incorporates two new data formats, BF16 and TF32.
B. ASIC Implementation

In this section, the power, performance and area of the proposed PE are compared with the SOTA design [2] and other previous designs through ASIC implementation. The proposed PE supporting three modes of precision consumes the least amount of power, i.e., 2.23 mW, which is a 3.87% power saving in comparison with the SOTA design. It achieves a maximum energy efficiency of 2597.31 GFLOPS/W in FP16 precision. On the other hand, the proposed PE architecture supporting five modes of operation achieves a maximum frequency of 281.9 MHz. The proposed PE supporting five modes of precision consumes slightly more area because of the additional circuitry in the control path needed to handle the two extra modes of operation. This also leads to a slight decrease in the maximum frequency and a slight increase in power. Fig. 15 shows the power, area, performance and energy efficiency comparison of the proposed PEs with the SOTA design [2]. The proposed PE supporting five modes of precision achieves a maximum energy efficiency of 2368.91 GFLOPS/W, which is 59.67% more than the SOTA design accommodating three modes. The proposed PE supporting three modes of operation achieves a 68% improvement in the maximum frequency of operation and a 3.87% improvement in power consumption compared to the SOTA design [2]. Also, it achieves a 75.06% improvement in maximum energy efficiency in comparison with [2]. The proposed PE supporting five modes of operation achieves a 63.80% improvement in the maximum operating frequency and a 59.67% improvement in the maximum energy efficiency in comparison with [2]. Also, as is evident from Table IV, due to the balanced pipeline design and high throughput, the proposed PE achieves the highest normalized energy efficiency in comparison with [2], [5], [6], [19] and [21].
TABLE IV: PERFORMANCE COMPARISON OF THE PROPOSED PE WITH PREVIOUS DESIGNS THROUGH ASIC IMPLEMENTATION

Design | [5] | [6] | [19] | [21] | [2] | Proposed PE (3 modes) | Proposed PE (5 modes)
Process (nm) | 90 | 90 | 45 | 45 | 45 | 45 | 45
Frequency (MHz) | 667 | 357 | 1563 | 1493 | 172.1 | 289.6 | 281.9
Clock (ns) | 1.5 | 2.8 | 0.64 | 0.67 | 5.8 | 3.45 | 3.54
Area (µm²) | 795000 | 82000 | 63000 | 33000 | 37646 | 38304.34 | 40335.14
Normalized area (µm²) | 198750 | 20500 | 63000 | 33000 | 37646 | 38304.34 | 40335.14
Power (mW) | 43.8 | 250.5 | 32.74 | 16.94 | 2.32 | 2.23 | 2.38
No. of multi-precision operations each cycle: FP16 | 4 | 4 | - | - | 10 | 10 | 10
No. of multi-precision operations each cycle: FP32 | 2 | 4 | 4 | 2 | 2.5 | 2.5 | 2.5
No. of multi-precision operations each cycle: FP64 | 1 | - | - | - | 0.5 | 0.5 | 0.5
No. of multi-precision operations each cycle: BF16 | - | - | - | - | - | - | 10
No. of multi-precision operations each cycle: TF32 | - | - | - | - | - | - | 8
Throughput (GFLOPS): FP16 | 2.67 | 1.43 | - | - | 1.721 | 2.896 | 2.819
Throughput (GFLOPS): FP32 | 1.33 | 1.43 | 6.25 | 2.99 | 0.4303 | 0.724 | 0.704
Throughput (GFLOPS): FP64 | 0.67 | - | - | - | 0.0861 | 0.1448 | 0.141
Throughput (GFLOPS): BF16 | - | - | - | - | - | - | 2.819
Throughput (GFLOPS): TF32 | - | - | - | - | - | - | 2.255
Energy efficiency (GFLOPS/W): FP16 | 121.77 | 11.41 | - | - | 1483.6 | 2597.31 | 2368.91
Energy efficiency (GFLOPS/W): FP32 | 60.88 | 11.41 | 381.8 | 352.43 | 370.95 | 649.33 | 591.59
Energy efficiency (GFLOPS/W): FP64 | 30.44 | - | - | - | 74.2 | 129.86 | 118.48
Energy efficiency (GFLOPS/W): BF16 | - | - | - | - | - | - | 2368.91
Energy efficiency (GFLOPS/W): TF32 | - | - | - | - | - | - | 1894.95
Normalized energy efficiency (GFLOPS/W): FP16 | 243.54 | 22.82 | - | - | 1483.6 | 2597.31 | 2368.91
Normalized energy efficiency (GFLOPS/W): FP32 | 121.76 | 22.82 | 381.8 | 352.43 | 370.95 | 649.33 | 591.59
Normalized energy efficiency (GFLOPS/W): FP64 | 60.88 | - | - | - | 74.2 | 129.86 | 118.48
Normalized energy efficiency (GFLOPS/W): BF16 | - | - | - | - | - | - | 2368.91
Normalized energy efficiency (GFLOPS/W): TF32 | - | - | - | - | - | - | 1894.95
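The throughput and energy-efficiency rows of Table IV follow directly from the operations per cycle, the clock frequency and the power; the sketch below recomputes the proposed 5-mode FP16 entries (it assumes, as the numbers suggest, that one MAC is counted as a single operation for throughput and as two FLOPs for energy efficiency).

def throughput_gflops(ops_per_cycle, freq_mhz):
    return ops_per_cycle * freq_mhz / 1000.0

def energy_efficiency(ops_per_cycle, freq_mhz, power_mw, flops_per_op=2):
    return flops_per_op * throughput_gflops(ops_per_cycle, freq_mhz) / (power_mw / 1000.0)

# Proposed 5-mode PE, FP16: 10 MACs per cycle at 281.9 MHz and 2.38 mW.
print(throughput_gflops(10, 281.9))            # ~2.819 GFLOPS
print(energy_efficiency(10, 281.9, 2.38))      # ~2368.9 GFLOPS/W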