Floating Point Unit Implementation and Verification For Machine Learning and AI Applications
Implementation
By
Under supervision of
Giza, Egypt
July 2021
ABSTRACT
This document briefly describes the flow of design, verification and physical realization of a
floating point unit (FPU) that performs the following operations: addition, subtraction and
multiplication on two 32-bit representations (single precision and the decimal representation,
decimal format). SIMD instructions are also supported for any of the supported operations, with a
maximum of 16 similar operations per SIMD instruction.
The project was distributed among four teams:
• Design team
o Developing an algorithm to perform the required operations on the specified
representations using MATLAB
o Implementing the algorithm using synthesizable RTL code
o Performing behavioural simulation, synthesis and post-synthesis simulation
o Improving the design working frequency using pipelining
• System team
o Developing the host controller interface (HCI) specifications and writing the RTL code
of the top level design
o Building PULPino platform and adding the FPU as a peripheral to it
o Developing an application and testing it using C coding
• Verification team
o Building UVM testing environments to perform functional verification on each
combinational module separately
o Building UVM testing environment to test the clocked integrated modules with the host
controller interface (HCI)
• Physical design team
o Synthesizing the RTL with DC, generating the gate-level netlist, checking STA and performing
formal verification
o Placing and routing the generated gate-level netlist
o Generating a layout for the FPU with a speed of 35.7 MHz
o Generating a layout with an area of 154122.5275 µm²
Contents
1 Chapter One: Introduction .......................................................................................... 8
2 Chapter Two: Floating point representation .............................................................. 9
2.1 Floating point representations ..........................................................................9
2.1.1 Binary interchange format encoding ......................................................................... 9
2.1.2 Decimal interchange representation:....................................................................... 10
2.1.3 Special cases ........................................................................................................... 13
2.1.4 Exceptions ............................................................................................................... 14
2.1.5 Rounding ................................................................................................................. 14
3 Chapter Three: RTL design ...................................................................................... 15
3.1 High Level Design .........................................................................................15
3.1.1 Decimal-32 representation – Decimal format ......................................................... 15
3.1.2 Decimal-32 representation – Binary format ........................................................... 19
3.1.3 Single precision....................................................................................................... 21
3.2 Register Transfer Level (Verilog) ..................................................................23
3.2.1 Decimal-32 representation – Decimal format ......................................................... 23
3.2.2 Single precision....................................................................................................... 32
3.2.3 Frequency of the design .......................................................................................... 41
4 Chapter four: System ................................................................................................. 43
4.1 HCI (Host Controller Interface) specification .................................................43
4.1.1 Abbreviations .......................................................................................................... 43
4.1.2 Memory-mapped FPU Host Controller Registers................................................... 43
4.1.3 FPU Command register........................................................................................... 43
4.1.4 FPU Status register ................................................................................................. 45
4.1.5 Operands A & B and Output registers: ................................................................... 47
4.1.6 Interrupt signal: ....................................................................................................... 47
4.2 Top level design .............................................................................................48
4.2.1 Design without SIMD support ................................................................................ 48
4.2.2 Design with SIMD support ..................................................................................... 51
4.2.3 Top level simulations .............................................................................................. 52
4.3 RISC-V ..........................................................................................................57
4.3.1 RISC-V features ...................................................................................................... 57
4.3.2 RISC-V processors.................................................................................................. 57
4.4 PULPino .........................................................................................................59
4.4.1 Features ................................................................................................................... 59
4.4.2 PULPino Architecture............................................................................................. 59
4.4.3 Memory map ........................................................................................................... 60
4.4.4 PULPino environment ............................................................................................ 60
4.5 FPU-RISC V Integration................................................................................63
4.5.1 Integration methods ................................................................................................ 63
4.5.2 Integrating the FPU with RISC-V via a Bridge ...................................................... 64
4.6 FPU application and testing ...........................................................................66
4.6.1 FPU application ...................................................................................................... 66
4.6.2 FPU testing.............................................................................................................. 69
5 Chapter five: Verification .......................................................................................... 72
5.1 Design under test specifications.....................................................................72
5.2 Work flow ......................................................................................................72
5.3 Verification environments .............................................................................72
5.3.1 Decimal representation testing environment .......................................................... 73
5.3.2 Single precision representation testing environment .............................................. 82
5.3.3 The integrated environment .................................................................................... 87
5.4 Testing ranges distribution .............................................................................91
5.4.1 Single precision representation ............................................................................... 91
5.4.2 Decimal encoding representation ............................................................................ 92
5.5 Bugs ...............................................................................................................93
6 Chapter Six: Synthesis and Formal Verification ...................................................... 95
6.1 Synthesis ........................................................................................................95
6.1.1 Flow Chart of the Synthesis Process....................................................................... 96
6.1.2 Setting the Libraries ................................................................................................ 96
6.1.3 Reading in the Design ............................................................................................. 97
6.1.4 Optimization Constraints ......................................................................... 97
6.1.5 Compiling the Design ............................................................................................. 99
6.1.6 Report Analysis..................................................................................................... 100
6.1.7 Design challenges ................................................................................................. 103
6.2 Formal Verification......................................................................................104
6.2.1 Basic Definitions:.................................................................................................. 104
7 Chapter Seven: Physical Design, Placement and Routing Stages .......................... 106
7.1 Basic Physical Design Flow Using IC Compiler .........................................106
7.2 Floorplanning ...............................................................................................108
7.3 Placement .....................................................................................................108
7.4 Clock Tree Synthesis (CTS) ........................................................................110
7.5 Routing.........................................................................................................112
8 Projects Code Links ................................................................................................. 114
Integrated FPU RTL code ..........................................................................................114
System code ...............................................................................................................114
Testing environment link on EDA playground ..........................................................114
9 Bibliography ............................................................................................................. 115
List of Figures
Figure 1: Binary format .................................................................................................................. 9
Figure 2: Decimal format .............................................................................................................. 11
Figure 3: Addition MATLAB results – Decimal format .............................................................. 16
Figure 4: Subtraction MATLAB results – Decimal format .......................................................... 17
Figure 5: Mantissa calculation in MATLAB ................................................................................ 18
Figure 6: Multiplication MATLAB results – Decimal format...................................................... 18
Figure 7: Addition MATLAB result - Binary format ................................................................... 19
Figure 8: Subtraction MATLAB results - Binary format ............................................................. 20
Figure 9: Multiplication MATLAB results – Binary format ........................................................ 21
Figure 10: Addition MATLAB result - Single precision .............................................................. 22
Figure 11: Multiplication MATLAB result - Single precision ..................................................... 22
Figure 12: architecture of decimal adder ...................................................................................... 24
Figure 13: simulation result - decimal adder ................................................................................ 25
Figure 14: synthesis result - decimal adder................................................................................... 25
Figure 15: minimum clock allowed - decimal adder .................................................................... 25
Figure 16: Architecture of subtraction in decimal format ............................................................ 27
Figure 17: Behavioral simulation result - Decimal subtraction .................................................... 28
Figure 18: Area utilization - Decimal subtraction ....................................................... 28
Figure 19: Timing report - Decimal subtraction ........................................................................... 28
Figure 20: Architecture of Multiplication in decimal representation - Decimal format ............... 30
Figure 21: Behavioral simulation result - Decimal multiplication ............................................... 31
Figure 22: Area utilization - Decimal multiplication ..................................................... 31
Figure 23: Timing report - Decimal multiplication ....................................................... 31
Figure 24: Architecture of addition in Single precision................................................................ 33
Figure 25: Behavioral simulation result – addition Single precision ........................................... 34
Figure 26: Area utilization – addition Single precision ................................................................ 34
Figure 27: Timing report - addition Single precision ................................................................... 34
Figure 28: Architecture of Subtractor in Single precision ............................................................ 37
Figure 29: Behavioral simulation result – subtraction Single precision ...................................... 37
Figure 30: Area utilization – subtraction Single precision ........................................................... 37
Figure 31: Timing report - Subtraction Single precision .............................................................. 37
Figure 32: Architecture of multiplication in Single precision ...................................................... 39
Figure 33: Behavioral simulation result – multiplication Single precision ................... 40
Figure 34: Area utilization – multiplication Single precision ........................................ 40
Figure 35: Timing report - multiplication Single precision ........................................... 40
Figure 36: Critical path of decimal multiplication ........................................................................ 41
Figure 37: Timing report - After pipelining.................................................................................. 41
Figure 38: Architecture of the decimal multiplication after pipelining ........................................ 42
Figure 39: Top level block diagram without SIMD...................................................................... 48
Figure 40: HCI connections .......................................................................................................... 49
Figure 41: SIMD connections ....................................................................................................... 51
Figure 42: Single instruction simulation (A) ................................................................................ 52
Figure 43: Single instruction simulation (B) ................................................................................ 53
Figure 44: FPU reset bit simulation .............................................................................................. 54
Figure 45: FPU enable bit simulation ........................................................................................... 54
Figure 46: FPU interrupt enable bit simulation ............................................................................ 55
Figure 47: SIMD instruction simulation (A) ................................................................................ 55
Figure 48: SIMD instruction simulation (B) ................................................................................. 56
Figure 49: SIMD instruction simulation (C) ................................................................................. 56
Figure 50: PULPino block diagram .............................................................................................. 59
Figure 51: PULPino memory map ................................................................................................ 60
Figure 52: Commands for making Modelsim work ....................................................... 61
Figure 53: Script for installing and making Cmake ...................................................................... 62
Figure 54: Script for running hello world example ...................................................................... 62
Figure 55: Output of hello world example .................................................................................... 63
Figure 56: Pulpino memory map after replacement of I2C by FPU ............................................. 64
Figure 57: apb_fpu_bridge input/output signals ........................................................................... 64
Figure 58: FPU_Single_Instruction header .................................................................................. 66
Figure 59: FPU_Single_Instruction output ................................................................................... 67
Figure 60: FPU_SIMD_Instruction header ................................................................................... 67
Figure 61: FPU_SIMD_Instruction output ................................................................................... 68
Figure 62: compare header............................................................................................................ 68
Figure 63: FPU_get_status header ................................................................................. 69
Figure 64: Integration test cases ................................................................................................... 70
Figure 65: Single instruction operation/representation test case .................................................. 70
Figure 66: Single instruction flags test case ................................................................................. 71
Figure 67: SIMD instruction test case .......................................................................................... 71
Figure 68: Test cases output ......................................................................................................... 71
Figure 69: Decimal representation testing environment ............................................................... 73
Figure 70: DE function ("random") .............................................................................................. 74
Figure 71: DE function ("gen_num") ........................................................................................... 74
Figure 72: DE function ("dec") ..................................................................................................... 75
Figure 73: DE generating random transactions ............................................................................ 75
Figure 74: DE driver run phase ..................................................................................................... 76
Figure 75: DE task ("send_op") .................................................................................................... 77
Figure 76: DE BFM task("write_to_monitor") ............................................................................. 77
Figure 77: DE command monitor function ("write_to_monitor") ................................................ 77
Figure 78: DE result monitor function ("write_to_monitor") ....................................................... 78
Figure 79: DE function ("predict_result") for decimal addition ................................................... 79
Figure 80: DE overflow condition ................................................................................................ 79
Figure 81: DE underflow condition and special case ................................................................... 80
Figure 82: DE rounding according to 9th digit ............................................................................. 80
Figure 83: DE result check algorithm ........................................................................................... 81
Figure 84: DE env ("connect_phase") function ............................................................................ 81
Figure 85: DE top module............................................................................................................. 82
Figure 86: Single precision representation testing environment................................................... 82
Figure 87: SP task ("body") of ("random_sequence") .................................................................. 83
Figure 88: SP driver (“run_phase”) .............................................................................................. 84
Figure 89: SP function ("predict_result") ..................................................................................... 84
Figure 90: SP predicted overflow flag .......................................................................................... 85
Figure 91: SP predicted underflow flag ........................................................................................ 85
Figure 92: SP predicted inexact flag ............................................................................................. 86
Figure 93: SP env connect phase .................................................................................................. 86
Figure 94: SP class base_teste ...................................................................................................... 86
Figure 95: SP ("random_test") class ............................................................................................. 87
Figure 96: integrated sequence_item data members ..................................................................... 88
Figure 97: integrated sequence_item control signals .................................................................... 88
Figure 98: integrated environment BFM data members ............................................................... 89
Figure 99: writing operands to the BFM....................................................................................... 89
Figure 100: DE predicted result .................................................................................................... 90
Figure 101: SP predicted result ..................................................................................................... 91
Figure 102: Single precision ranges.............................................................................................. 92
Figure 103: Single precision constraints ....................................................................................... 92
Figure 104: decimal encoding representation constraints ............................................................. 93
Figure 105: Design Flow Block Diagram ..................................................................................... 95
Figure 106: Synthesis process flow chart. .................................................................................... 96
Figure 107: Part of Timing report example ................................................................................ 101
Figure 108: Path Slack histogram ............................................................................................... 102
Figure 109: Area Report example ............................................................................................... 102
Figure 110: Summary of QoR Report ......................................................................................... 103
Figure 111: Unmatched Points.................................................................................................... 105
Figure 112: Verification Report .................................................................................................. 105
Figure 113: Basic Physical Design Flow .................................................................................... 106
Figure 114: Floorplan and power rings Placement of FPU ........................................................ 109
Figure 115: Zoomed in view of power rings and floorplanning placement ............................... 110
Figure 116: Balancing of Clock Skews ....................................................................... 111
Figure 117: Handling Insertion Delay ......................................................................... 111
Figure 118: Layout after CTS ..................................................................................................... 111
Figure 119: Zoomed in view after CTS ...................................................................................... 112
Figure 120: Summary of final area ............................................................................................. 112
Figure 121: Final Layout of FPU ................................................................................................ 113
List of Tables
Table 1: comparison between different precisions in Binary representation ................................ 10
Table 2: encoding of the combinational field ............................................................................... 11
Table 3: encoding of the trailing field........................................................................................... 12
Table 4: comparison between different precisions in Decimal representation-decimal format ... 12
Table 5: Special cases in single precision ..................................................................................... 13
Table 6: Special cases in Decimal representation ......................................................................... 13
Table 7: Exceptions....................................................................................................................... 14
Table 8: Memory-mapped FPU Host Controller Registers .......................................................... 43
Table 9: FPU Command register .................................................................................................. 45
Table 10: Register (0x110) ........................................................................................................... 46
Table 11: Register (0x114) ........................................................................................................... 46
Table 12: Register (0x118) ........................................................................................................... 46
Table 13: Register (0x11C) ........................................................................................................... 47
Table 14: Modified operation block function ............................................................................... 49
Table 15: Peripherals of SweRVolf, PULPino and PULPissimo ................................................. 58
Table 16: APB used signals description ....................................................................................... 65
Table 17: FPU signals description ................................................................................................ 65
Table 18: BUGS ............................................................................................................................ 94
Table 19: Floorplanning Parameters ........................................................................................... 108
1 CHAPTER ONE: INTRODUCTION
A floating-point unit (FPU) is a part of a computer system specially designed to carry out
operations on floating-point numbers. Typical operations are addition, subtraction,
multiplication, division, and square root.
The advantage of floating-point representation over fixed-point representation is that it
can support a much wider dynamic range (the largest and smallest numbers that can be
represented). The floating-point format needs slightly more storage (to encode the position of the
radix point), and floating-point numbers achieve their greater range at the expense of slightly less
precision. Floating-point numbers also offer more flexibility than fixed-point numbers, which have
limited or no flexibility. The internal representations of data in floating-point hardware are more
exact than in fixed-point hardware, ensuring greater accuracy in the results.
It is also important to consider fixed- and floating-point formats in the context of precision,
that is, the size of the gaps between numbers. Every time a digital signal processor (DSP) generates a
new number via a mathematical calculation, that number must be rounded to the nearest value
that can be stored in the format in use. Rounding and/or truncating numbers during signal
processing naturally yields quantization error, or "noise": the deviation between the actual analog
values and the quantized digital values. Since the gaps between adjacent numbers can be much
larger with fixed-point processing than with floating-point processing, round-off error
can be much more pronounced. As such, floating-point processing yields much greater precision
than fixed-point processing, which makes floating-point processors the ideal DSPs when
computational accuracy is a critical requirement.
The applications of the floating-point format can be readily seen by contrasting the
data set requirements of video and audio applications. Floating-point units are used in high-speed
object recognition systems and in high-performance computer systems, as well as in embedded
systems and mobile applications. In medical image recognition, greater accuracy supports the
many levels of signal input from light, x-rays, ultrasound and other sources that must be defined
and processed to create output images with useful diagnostic information. By contrast with these
applications, the enormous communications market is better served by fixed-point devices.
FPUs also execute dedicated trigonometric calculations used extensively in real-time applications
such as motor control, power management, and communications data management. Graphics
processing units (GPUs) today perform most arithmetic operations in their programmable
processor cores using IEEE 754-compatible single precision 32-bit floating-point operations, and
newer GPUs such as the Tesla T10P also support IEEE 754 64-bit double-precision operations in
hardware.
The designed floating-point unit (FPU) supports two representations of floating-point
numbers according to the IEEE 754-2019 standard, binary32 (single precision) and the decimal
representation, decimal format. The following arithmetic operations are supported for each of the
two representations: addition, subtraction and multiplication between operand A and operand B.
SIMD instructions are also supported for any of the supported operations, with a maximum
of 16 similar operations per SIMD instruction.
2 CHAPTER TWO: FLOATING POINT REPRESENTATION
Floating point format, according to the IEEE 754-2019 standard, is a way of representing real numbers
with a string of digits. It maps the infinite range of real numbers to a finite subset with limited
precision. A floating point number can be characterized by the following:
• Sign: the polarity of the number, either positive (+) or negative (-).
• Radix: the base used for scaling, usually two (binary) or ten (decimal).
• Exponent range: the interval between the maximum and minimum power of the radix.
• Significand: also called the precision or mantissa, a fixed number of significant digits expressed
in the base of the format.
In general, any floating point number is represented by the following equation:
value = (−1)^sign × significand × radix^exponent
To put a number in one of the binary representations, the number must first be transformed to binary,
so it can be written as, for example:
111001 → 1.11001 × 2^5
So e = 5, m = 11001 and s = 0.
Comparison between the binary interchange formats with different numbers of bits:

Parameter                              | Binary16 | Binary32 | Binary64 | Binary128
Storage width in bits k                | 16       | 32       | 64       | 128
Precision in bits p                    | 11       | 24       | 53       | 113
Maximum exponent emax                  | 15       | 127      | 1023     | 16383
Bias E                                 | 15       | 127      | 1023     | 16383
Sign bit                               | 1        | 1        | 1        | 1
Exponent width in bits w               | 5        | 8        | 11       | 15
Significand field width in bits t      | 10       | 23       | 52       | 112
Table 1: comparison between different precisions in Binary representation
2.1.2.2 Decimal interchange format:
Representations of floating-point data in the decimal interchange formats are encoded in k bits in
the following three fields, whose detailed layouts and canonical (preferred) encodings are
described below.
a) 1-bit sign S.
b) A w+5 bit combination field G encoding classification and, if the encoded datum is a finite
number, the exponent q and four significand bits (1 or 3 of which are implied). The biased
exponent E is a w+2 bit quantity q+bias, where the value of the first two bits of the biased
exponent taken together is either 0, 1, or 2.
c) A t-bit trailing significand field T that contains J ×10 bits and contains the bulk of the
significand. When this field is combined with the leading significand bits from the combination
field, the format encodes a total of p = 3×J+1 decimal digits (1).
The decimal interchange format supports two ways of encoding the significand: decimal encoding
and binary encoding.
Trailing significand field:
When the decimal encoding is used, every three digits of the mantissa are encoded in 10 bits, as in
the following table (a decoding sketch follows the table):
b9 b8 b7 | b6 b5 b4 | b3 b2 b1 b0 | d2   d1   d0   | Values encoded
a  b  c  | d  e  f  | 0  g  h  i  | 0abc 0def 0ghi | (0-7)(0-7)(0-7)
a  b  c  | d  e  f  | 1  0  0  i  | 0abc 0def 100i | (0-7)(0-7)(8-9)
a  b  c  | g  h  f  | 1  0  1  i  | 0abc 100f 0ghi | (0-7)(8-9)(0-7)
g  h  c  | d  e  f  | 1  1  0  i  | 100c 0def 0ghi | (8-9)(0-7)(0-7)
g  h  c  | 0  0  f  | 1  1  1  i  | 100c 100f 0ghi | (8-9)(8-9)(0-7)
d  e  c  | 0  1  f  | 1  1  1  i  | 100c 0def 100i | (8-9)(0-7)(8-9)
a  b  c  | 1  0  f  | 1  1  1  i  | 0abc 100f 100i | (0-7)(8-9)(8-9)
x  x  c  | 1  1  f  | 1  1  1  i  | 100c 100f 100i | (8-9)(8-9)(8-9)
Table 3: encoding of the trailing field
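As an illustration of Table 3, the following C sketch decodes one 10-bit group of the trailing significand field into its three decimal digits, following the rows of the table. It is written for readability rather than speed, and the function and variable names are illustrative; they are not taken from the project's RTL.

```c
#include <stdint.h>
#include <stdio.h>

/* Decode one 10-bit group of the trailing significand field into three
 * decimal digits, following the rows of Table 3. Illustrative names only. */
static void dpd_decode(unsigned group, unsigned digit[3])
{
    unsigned b[10];
    for (int i = 0; i < 10; i++)
        b[i] = (group >> i) & 1;           /* b[0] = b0 ... b[9] = b9 */

    if (b[3] == 0) {                       /* row 1: all digits 0-7 */
        digit[2] = (b[9] << 2) | (b[8] << 1) | b[7];
        digit[1] = (b[6] << 2) | (b[5] << 1) | b[4];
        digit[0] = (b[2] << 2) | (b[1] << 1) | b[0];
    } else if (b[2] == 0 && b[1] == 0) {   /* row 2: only d0 is 8-9 */
        digit[2] = (b[9] << 2) | (b[8] << 1) | b[7];
        digit[1] = (b[6] << 2) | (b[5] << 1) | b[4];
        digit[0] = 8 | b[0];
    } else if (b[2] == 0 && b[1] == 1) {   /* row 3: only d1 is 8-9 */
        digit[2] = (b[9] << 2) | (b[8] << 1) | b[7];
        digit[1] = 8 | b[4];
        digit[0] = (b[6] << 2) | (b[5] << 1) | b[0];
    } else if (b[2] == 1 && b[1] == 0) {   /* row 4: only d2 is 8-9 */
        digit[2] = 8 | b[7];
        digit[1] = (b[6] << 2) | (b[5] << 1) | b[4];
        digit[0] = (b[9] << 2) | (b[8] << 1) | b[0];
    } else {                               /* rows 5-8: b3 b2 b1 = 111 */
        unsigned sel = (b[6] << 1) | b[5];
        digit[2] = 8 | b[7];
        digit[1] = 8 | b[4];
        digit[0] = 8 | b[0];
        if (sel == 0)      digit[0] = (b[9] << 2) | (b[8] << 1) | b[0];  /* row 5 */
        else if (sel == 1) digit[1] = (b[9] << 2) | (b[8] << 1) | b[4];  /* row 6 */
        else if (sel == 2) digit[2] = (b[9] << 2) | (b[8] << 1) | b[7];  /* row 7 */
        /* sel == 3 is row 8: all three digits are 8-9 */
    }
}

int main(void)
{
    unsigned d[3];
    dpd_decode(0x3FF, d);                  /* the all-ones group encodes 999 */
    printf("%u%u%u\n", d[2], d[1], d[0]);  /* prints 999 */
    return 0;
}
```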
Comparison between decimal representation, decimal formats, with different numbers of bits:

Parameter                           | Decimal32 | Decimal64 | Decimal128
Storage width in bits k             | 32        | 64        | 128
Precision in digits p               | 7         | 16        | 34
Maximum exponent emax               | 96        | 384       | 6144
Minimum exponent emin               | -95       | -383      | -6143
Bias E                              | 101       | 398       | 6176
Sign bit                            | 1         | 1         | 1
Combination field in bits           | 5         | 5         | 5
Exponent continuation field in bits | 6         | 8         | 12
Trailing significand field in bits  | 20        | 50        | 110
Table 4: comparison between different precisions in Decimal representation-decimal format
2.1.2.2.2 Decimal interchange binary encodings:
If the binary encoding is used for the significand, then (a decoding sketch for decimal32 follows the list):
• if G0 G1 is 00, 01, or 10, then E is made up of the bits G0 to Gw+1, and the binary
encoding of the significand C is obtained by prefixing the last 3 bits of G (i.e., Gw+2 Gw+3
Gw+4) to T.
• If G0 G1 is 11 and G2 G3 is 00, 01 or 10, then E is made up of the bits G2 to Gw+3, and the
binary encoding of the significand C is obtained by prefixing 100Gw+4 to T (1).
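The following C sketch illustrates these two rules for decimal32, assuming w = 6 so that G occupies bits 30-20 and T occupies bits 19-0. It is only a sketch under those assumptions: the NaN and infinity combinations are not handled, and the names are illustrative rather than part of the design.

```c
#include <stdint.h>
#include <stdio.h>

/* Extract the biased exponent E and the binary-encoded significand C from a
 * decimal32 word (binary encoding of the significand), per the two rules
 * above. NaN/infinity combinations are not handled. Illustrative names. */
static void decimal32_bid_fields(uint32_t x, uint32_t *E, uint32_t *C)
{
    uint32_t G = (x >> 20) & 0x7FF;        /* 11-bit combination field, G0 = MSB */
    uint32_t T = x & 0xFFFFF;              /* 20-bit trailing significand field  */

    if ((G >> 9) != 3) {                   /* G0 G1 is 00, 01 or 10 */
        *E = G >> 3;                       /* E = G0 ... G7 */
        *C = ((G & 0x7) << 20) | T;        /* prefix G8 G9 G10 to T */
    } else {                               /* G0 G1 is 11 */
        *E = (G >> 1) & 0xFF;              /* E = G2 ... G9 */
        *C = ((0x8u | (G & 1)) << 20) | T; /* prefix 1 0 0 G10 to T */
    }
}

int main(void)
{
    /* Biased exponent 101 (true exponent 0, bias 101) and significand
     * 1234567; the significand fits in 23 bits so the first rule applies. */
    uint32_t x = (101u << 23) | 1234567u;
    uint32_t E, C;
    decimal32_bid_fields(x, &E, &C);
    printf("E=%u C=%u\n", E, C);           /* prints E=101 C=1234567 */
    return 0;
}
```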
2.1.4 Exceptions
Exceptions are represented in the RTL using flags that indicate something abnormal happened to the
resulting number:

Overflow          | The result of addition or multiplication is bigger than the highest representable number.
Underflow         | The result of subtraction or division is lower than the smallest representable number.
Inexact           | Rounding was applied to the result, or underflow or overflow occurred.
Invalid operation | The chosen operation cannot be performed on the given inputs.
Table 7: Exceptions
2.1.5 Rounding
Rounding is very important after each operation: all operations produce an intermediate result
with infinite precision, and this result must be rounded to a finite precision suitable for the
destination format. The IEEE 754-2019 standard defines five rounding modes for arithmetic
operations, as follows (1) (a short sketch of the tie-to-even case follows the list):
• RoundTiesToEven: the absolute result is rounded to the nearest number. If a tie occurs,
the absolute result is rounded to the nearest even value.
• RoundTiesToAway: the absolute result is rounded to the nearest number. If a tie occurs,
the absolute result is rounded to the larger number.
• RoundTowardPositive: the result is rounded towards positive infinity (if the final result
sign is positive then the result is rounded up, otherwise the extra digits are truncated).
• RoundTowardNegative: the result is rounded towards negative infinity (if the final result
sign is negative then the absolute result is rounded away from zero, otherwise the extra digits
are truncated).
• RoundTowardZero: the absolute result is rounded towards zero (all extra digits are
truncated).
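The following C sketch illustrates roundTiesToEven on a truncated binary significand, using the common guard/round/sticky summary of the discarded bits (guard = first discarded bit, sticky = OR of everything after the round bit). It is a generic illustration of the mode, not the project's RTL.

```c
#include <stdint.h>
#include <stdio.h>

/* Apply roundTiesToEven to a kept (truncated) significand, given the
 * guard, round and sticky bits of the discarded part. Illustrative only. */
static uint32_t round_ties_to_even(uint32_t kept, unsigned guard,
                                   unsigned round, unsigned sticky)
{
    if (!guard)
        return kept;                /* discarded part < 1/2 ulp: truncate */
    if (round || sticky)
        return kept + 1;            /* discarded part > 1/2 ulp: round up */
    return kept + (kept & 1);       /* exact tie: round to the even value */
}

int main(void)
{
    printf("%u\n", round_ties_to_even(12, 1, 0, 0));  /* tie, 12 is even -> 12 */
    printf("%u\n", round_ties_to_even(13, 1, 0, 0));  /* tie, 13 is odd  -> 14 */
    printf("%u\n", round_ties_to_even(13, 1, 0, 1));  /* above the tie   -> 14 */
    return 0;
}
```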
3 CHAPTER THREE: RTL DESIGN
3.1.1.1 Addition
To perform the addition operation, the two operands must have the same exponent. So first,
significand alignment is done by comparing the two exponents to determine which operand has
the larger one, calculating the difference between the two exponents, and padding the mantissa of
the smaller operand from the left with a number of zeros equal to four times the calculated
difference; since the radix of the exponent is ten, a difference of one is equivalent to shifting the
mantissa by one digit (4 bits). The exponent of the final result is therefore equal to the exponent
of the larger operand.
Second, each 4 bits (one digit) of the first operand, starting from the right, is added to the
corresponding digit of the second operand. If the result is greater than or equal to ten, ten is
subtracted to obtain the corresponding digit of the mantissa of the final result and a carry of one
is added to the addition of the next two digits; otherwise the result is put directly in the mantissa.
Third, normalization is needed when the addition of the most significant digits produces a carry;
normalization is done by incrementing the exponent by one and keeping the most significant
seven digits as the mantissa.
Finally, to check the correctness of this logic, different ranges of real numbers were used: the
result was calculated directly in MATLAB and also by applying the previous logic after putting
the real numbers in the decimal-32 representation.
As shown in Figure 3 below, the results are approximately equal in both cases; a small software
sketch of this flow is given after the figure.
Figure 3: Addition MATLAB results – Decimal format
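The following C sketch restates the addition flow above (alignment, digit-wise addition with a carry, and normalization on a final carry) on 7-digit decimal significands. Rounding, overflow and special cases are omitted, and all names are illustrative; it only mirrors the described logic, not the MATLAB code itself.

```c
#include <stdio.h>

#define DIGITS 7   /* decimal-32 significand width in digits */

/* Add two decimal significands (most significant digit first) with the
 * alignment, digit-wise carry and normalization steps described above. */
static void dec_add_sketch(int ea, const int ma[DIGITS],
                           int eb, const int mb[DIGITS],
                           int *er, int mr[DIGITS])
{
    int big[DIGITS], pad[DIGITS] = {0};
    int diff, i, carry = 0;

    /* result exponent = larger exponent; shift the other significand right
     * (i.e. pad it with zeros from the left) by the exponent difference   */
    if (ea >= eb) {
        diff = ea - eb; *er = ea;
        for (i = 0; i < DIGITS; i++) {
            big[i] = ma[i];
            if (i + diff < DIGITS) pad[i + diff] = mb[i];
        }
    } else {
        diff = eb - ea; *er = eb;
        for (i = 0; i < DIGITS; i++) {
            big[i] = mb[i];
            if (i + diff < DIGITS) pad[i + diff] = ma[i];
        }
    }
    /* digit-wise addition from the right: subtract ten and carry on >= 10 */
    for (i = DIGITS - 1; i >= 0; i--) {
        int s = big[i] + pad[i] + carry;
        carry = (s >= 10);
        mr[i] = carry ? s - 10 : s;
    }
    /* normalization: a final carry means eight digits, so keep the most
     * significant seven and increment the exponent                        */
    if (carry) {
        for (i = DIGITS - 1; i > 0; i--) mr[i] = mr[i - 1];
        mr[0] = 1;
        (*er)++;
    }
}

int main(void)
{
    int a[DIGITS] = {9,5,0,0,0,0,0};   /* significand 9500000, exponent 0 */
    int b[DIGITS] = {6,0,0,0,0,0,0};   /* significand 6000000, exponent 0 */
    int r[DIGITS], er;
    dec_add_sketch(0, a, 0, b, &er, r);
    for (int i = 0; i < DIGITS; i++) printf("%d", r[i]);
    printf(" x 10^%d\n", er);          /* prints 1550000 x 10^1 */
    return 0;
}
```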
3.1.1.2 Subtraction
First step: the larger exponent is chosen to be the exponent of the final result; then zeros are
padded to the mantissa of the number with the smaller exponent according to the difference
between the two exponents.
Second step: if there is no difference in the exponents, the larger mantissa must be identified,
because the borrow method used here requires subtracting the smaller value from the larger one
to get a correct result; if the second mantissa is the larger one, the final result is negative.
Third step: the smaller mantissa is subtracted from the larger one, taking the corresponding digits
from each mantissa and subtracting them; if the digit result is negative, ten is added to it and one
is borrowed from the next digit, and this operation continues over the seven digits.
Finally, the results, shown in Figure 4, were checked as in addition.
Figure 4: Subtraction MATLAB results – Decimal format
3.1.1.3 Multiplication
First, the exponent of the result is calculated by adding the exponents of the two operands, and
the sign bit is calculated by XOR-ing the sign bits of the two operands.
Second, the mantissa is calculated as shown in Figure 5.
Third, the resulting mantissa is up to 14 digits long, so normalization is done by keeping the most
significant non-zero digits and adding the number of remaining (dropped) digits to the exponent.
Finally, the results, shown in Figure 6, were checked as in addition.
Figure 5: Mantissa calculation in MATLAB
3.1.2 Decimal-32 representation – Binary format
A function is used to extract the sign, exponent and mantissa from the represented number, the
same as in the decimal format.
3.1.2.1 Addition
As in the decimal format, significand alignment is done first; here it is done by dividing the
mantissa of the operand with the smaller exponent by ten, because the radix of the exponent is
ten while the mantissa is represented in binary format, which means two different bases. The
number of division operations is equal to the difference between the two exponents.
Second, the mantissas of the two operands are added directly after alignment, as a normal binary
addition.
Finally, different ranges of real numbers were used to check this logic, by comparing the
calculated result with the predicted result as shown in Figure 7.
3.1.2.2 Subtraction
Subtraction in the binary format is straightforward once significand alignment is done; as
explained for addition, the alignment is done using division.
After this step, a normal binary subtraction is performed between the aligned mantissas of the
two numbers.
Finally, the results, shown in Figure 8, were checked as before.
Figure 8: Subtraction MATLAB results - Binary format
3.1.2.3 Multiplication
First, the exponent is calculated using binary addition of the exponents of the two operands, and
the sign is calculated the same way as in the decimal format.
Second, the mantissas of the two operands are multiplied together; the resulting mantissa is up to
48 bits wide, so normalization is required.
To normalize the mantissa, the number of digits it contains must be determined (a software
sketch of this step is given after Figure 9). This is done by comparing the mantissa with the
largest number composed of 14 digits: if it is greater than this number, seven is added to the
exponent and the mantissa is divided by ten seven times; if it is smaller, it is compared with the
largest number composed of 13 digits, and if it is greater, six is added to the exponent and the
mantissa is divided by ten six times; if it is smaller, the comparison process continues in the same
way until the mantissa is normalized to its most significant seven non-zero digits, and the number
of remaining (dropped) digits is added to the exponent.
Finally, the results, shown in Figure 9, were checked as before.
Figure 9: Multiplication MATLAB results – Binary format
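The following C sketch restates the normalization step above in software: it determines how many digits beyond seven the 48-bit product carries, divides them away, and adds their count to the exponent. It is a bottom-up formulation of the comparison chain described above, with illustrative names, not the MATLAB code itself.

```c
#include <stdint.h>
#include <stdio.h>

/* Normalize a product of two 7-digit mantissas (at most 14 digits) back to
 * 7 digits by successive comparison against the largest n-digit numbers,
 * dividing by ten once per extra digit and adding that count to the
 * exponent. Illustrative names only. */
static void normalize_to_7_digits(uint64_t mant, int exp,
                                  uint64_t *mant_out, int *exp_out)
{
    uint64_t limit = 9999999ULL;      /* largest 7-digit number */
    int extra = 0;

    while (mant > limit && extra < 7) {
        limit = limit * 10 + 9;       /* largest 8-, 9-, ... digit number */
        extra++;
    }
    for (int i = 0; i < extra; i++)   /* drop the extra digits */
        mant /= 10;
    *mant_out = mant;
    *exp_out  = exp + extra;          /* dropped digits go into the exponent */
}

int main(void)
{
    uint64_t m;
    int e;
    /* 9999999 * 9999999 = 99999980000001, a 14-digit product */
    normalize_to_7_digits(9999999ULL * 9999999ULL, 0, &m, &e);
    printf("%llu x 10^%d\n", (unsigned long long)m, e);  /* 9999998 x 10^7 */
    return 0;
}
```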
3.1.3.1 Addition
First, significand alignment is done by calculating the difference between the exponents of the
two operands, then shifting the mantissa of the operand with the smaller exponent to the right by
a number of bits equal to the calculated difference; the exponent of the final result is equal to the
larger exponent.
Second, the aligned mantissas of the two operands are added as binary numbers; if there is a
carry, the exponent is incremented by one and the mantissa becomes the carry followed by the
most significant 23 bits of the resulting mantissa.
Finally, different ranges of real numbers were used to check this logic, by comparing the
calculated result with the predicted result as shown in Figure 10; a small software sketch of this
flow is given after the figure.
Figure 10: Addition MATLAB result - Single precision
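The following C sketch restates the single precision addition flow above for two positive, normal numbers: extract the fields, align the significand of the smaller-exponent operand, add, and renormalize on a carry. It truncates the discarded alignment bits and ignores signs, subnormals, rounding and overflow; names are illustrative.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Add two positive, normal binary32 values given as raw bit patterns,
 * truncating the bits shifted out during alignment. Illustrative only. */
static uint32_t fp32_add_sketch(uint32_t a, uint32_t b)
{
    uint32_t ea = (a >> 23) & 0xFF, eb = (b >> 23) & 0xFF;
    uint32_t ma = (a & 0x7FFFFF) | 0x800000;    /* prepend the implicit 1 */
    uint32_t mb = (b & 0x7FFFFF) | 0x800000;

    if (ea < eb) {                              /* make operand a the larger one */
        uint32_t t;
        t = ea; ea = eb; eb = t;
        t = ma; ma = mb; mb = t;
    }
    uint32_t diff = ea - eb;                    /* significand alignment */
    mb = (diff >= 24) ? 0 : (mb >> diff);

    uint32_t sum = ma + mb;                     /* up to 25 bits */
    if (sum & 0x1000000) {                      /* carry out: renormalize */
        sum >>= 1;
        ea  += 1;
    }
    return (ea << 23) | (sum & 0x7FFFFF);       /* sign assumed to be 0 */
}

int main(void)
{
    float x = 1.5f, y = 2.25f, r;
    uint32_t xb, yb, rb;
    memcpy(&xb, &x, 4); memcpy(&yb, &y, 4);
    rb = fp32_add_sketch(xb, yb);
    memcpy(&r, &rb, 4);
    printf("%f\n", r);                          /* prints 3.750000 */
    return 0;
}
```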
3.1.3.2 Multiplication
First, to calculate the exponent of the result, the exponents of the two operands are added.
Second, the mantissas of the two operands are multiplied as binary numbers; the resulting
mantissa is 48 bits wide. If its most significant bit equals one, one is added to the exponent and
the most significant 24 bits of the resulting mantissa are taken; otherwise the mantissa is
normalized by decrementing the exponent until the leading one is reached.
Finally, the results, shown in Figure 11, were checked as before.
3.2 REGISTER TRANSFER LEVEL (VERILOG)
This section describes how the high level design is translated into RTL code. In this project, only
two representations were chosen for the design phase: the decimal-32 representation (decimal
format) and the single precision representation.
3.2.1.1 Addition
Addition is done using the architecture in Figure 12; each block has its role as described below:
1. Conversion from IEEE-754 to sign, exponent and mantissa:
This block acts as a decoder that decodes the number to extract the sign bit, the mantissa
represented as a BCD number on a 28-bit bus (each digit represented in four bits) and the
exponent on an 8-bit bus.
2. Remove leading zeros:
The function of this block is to remove the leading zeros of the entered representation so that
no precision or digits are lost in the normalization and rounding steps. This is done by checking
the number of leading zeros in the mantissa, subtracting this number from the exponent and
removing these zeros from the mantissa.
3. Binary subtractor:
This block is used to determine the exponent of the final result, to calculate the difference
between the two exponents for the significand alignment in the next block, and to send a
signal called greater that indicates which mantissa needs significand alignment.
4. Significand alignment:
This block pads the mantissa of the smaller number with a number of zeros equal to the
difference between the exponents multiplied by four (as each digit is represented in four bits),
the same as in MATLAB, but it keeps the last three removed digits in a 12-bit bus as the guard
digit, the round digit and a sticky digit, which is the OR-ing of all the digits removed during
significand alignment; the round digit is used later in the rounding module.
5. BCD adder:
The BCD adder is constructed of seven 4-bit binary adders; if the result of one of these adders
is greater than nine, six is added to it, the least significant four bits are taken as the result
digit, and the carry is added into the next adder (a small software sketch of this digit stage is
given after Figure 12).
6. Rounding:
This module adds one to the mantissa if the rounding digit is greater than five.
7. Normalization:
Normalization is used in case of a carry resulting from the addition, which means that the
result is composed of eight digits and cannot be represented, so the normalization module
keeps the most significant seven digits as the result and adds one to the exponent of the final
result. If the exponent of the final result exceeds 192 = 8'b1100_0000 (the maximum exponent
of the representation), the overflow flag and the inexact flag are raised.
8. Conversion to IEEE-754 standard:
This module takes the final exponent, the sign bit and the mantissa, encodes them and puts
them in the final representation.
Figure 12: architecture of decimal adder
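As referenced in the BCD adder description above, the following C sketch models one digit stage of that adder: a plain binary addition of two BCD digits plus a carry-in, followed by the add-six correction when the result exceeds nine. It is a software illustration of the technique, not the RTL itself, and the names are illustrative.

```c
#include <stdint.h>
#include <stdio.h>

/* One BCD digit-adder stage: binary add, then add six and carry out when
 * the 4-bit result exceeds nine. Illustrative names only. */
static uint8_t bcd_digit_add(uint8_t a, uint8_t b, uint8_t cin, uint8_t *cout)
{
    uint8_t sum = a + b + cin;      /* binary sum of two digits, 0..19 */
    if (sum > 9) {
        sum += 6;                   /* decimal adjust: skip codes 10..15 */
        *cout = 1;
        return sum & 0xF;           /* keep the least significant four bits */
    }
    *cout = 0;
    return sum;
}

int main(void)
{
    /* add the 7-digit BCD mantissas 0004999 and 0000003, digit by digit */
    uint8_t a[7] = {9, 9, 9, 4, 0, 0, 0};   /* least significant digit first */
    uint8_t b[7] = {3, 0, 0, 0, 0, 0, 0};
    uint8_t r[7], carry = 0;
    for (int i = 0; i < 7; i++)
        r[i] = bcd_digit_add(a[i], b[i], carry, &carry);
    for (int i = 6; i >= 0; i--)
        printf("%u", r[i]);
    printf(" carry=%u\n", carry);           /* prints 0005002 carry=0 */
    return 0;
}
```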
Behavioral simulation and synthesis results are shown in Figures 13 to 15.
3.2.1.2 Subtraction
Subtraction is done using the architecture in Figure 16; each block has its role as described
below, repeated blocks are explained above:
1. Conversion from IEEE-754 standard to sign, exponent and mantissa:
As described above in addition.
2. Remove leading zeros:
As described above in addition.
3. Binary subtractor:
As described above in addition.
4. Significand alignment:
The same as in addition, but the GRS digits are not extracted from the mantissa; they remain
part of the mantissa because they enter the subtraction process, so the output of the
significand alignment is two buses of forty bits (ten digits each).
5. BCD subtractor:
This block subtracts the mantissas of the two operands from each other using the ten's
complement: after taking the ten's complement of the second operand, the mantissa of the
first operand and the ten's complement of the mantissa of the second operand enter the BCD
adder, the same as in addition; the result of the BCD adder is then checked, and if there is a
carry digit the result is positive and equal to the digits after the carry digit, and if not, the
result is negative and equal to the ten's complement of the adder output (a small software
sketch of this step is given after Figure 16).
The resulting mantissa is the first 28 bits of the result only, as they represent seven digits,
and the next twelve bits are taken as the GRS digits.
6. Normalization:
This block removes the leading zeros resulting from the subtraction, shifts the GRS digits in
instead of these zeros, adds zeros to the right of the number to complete the seven digits,
and subtracts the number of leading zeros from the exponent of the result. If the exponent is
less than the number of leading zeros, only a number of zeros equal to the exponent is
removed.
If the resulting mantissa and exponent are equal to zero, the underflow flag is raised.
7. Rounding:
As described above in addition.
8. Conversion to IEEE-754 format:
As described above in addition.
Figure 16: Architecture of subtraction in decimal format
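As referenced in the BCD subtractor description above, the following C sketch illustrates the ten's-complement step: the digits of the subtrahend are complemented to nine, one is added, the digit strings are added, and the final carry decides the sign. It is a software illustration with illustrative names, not the RTL.

```c
#include <stdio.h>

#define NDIGITS 7

/* Subtract b from a (both 7-digit significands, least significant digit
 * first) using the ten's complement; returns 1 if the result is negative. */
static int bcd_sub_sketch(const int a[NDIGITS], const int b[NDIGITS],
                          int r[NDIGITS])
{
    int carry = 1;                       /* the +1 of the ten's complement */
    for (int i = 0; i < NDIGITS; i++) {
        int s = a[i] + (9 - b[i]) + carry;
        carry = (s >= 10);
        r[i] = carry ? s - 10 : s;
    }
    if (carry)                           /* carry out: result is positive */
        return 0;
    carry = 1;                           /* no carry: the result is negative, */
    for (int i = 0; i < NDIGITS; i++) {  /* so take its ten's complement      */
        int s = (9 - r[i]) + carry;
        carry = (s >= 10);
        r[i] = carry ? s - 10 : s;
    }
    return 1;
}

int main(void)
{
    int a[NDIGITS] = {0,0,0,0,0,0,3};    /* 3000000 */
    int b[NDIGITS] = {0,0,0,0,0,0,5};    /* 5000000 */
    int r[NDIGITS];
    int neg = bcd_sub_sketch(a, b, r);
    if (neg) printf("-");
    for (int i = NDIGITS - 1; i >= 0; i--)
        printf("%d", r[i]);
    printf("\n");                        /* prints -2000000 */
    return 0;
}
```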
Behavioral simulation and synthesis results are shown in Figures 17 to 19.
3.2.1.3 Multiplication
Multiplication has been done using the architecture shown in Figure 20, each block has its role as
described below
1. Conversion from IEEE-754 standard to sign, exponent and mantissa
As described above in addition.
2. Xor gate
This gate is used to determine the sign of the result.
3. Removing leading zeros
As described above in addition.
4. Binary adder
This block adds the exponents of the two operands and then subtracts the bias to calculate the
resulting exponent; it can also raise the underflow or overflow signal if the sum is smaller or
greater than the available exponent range.
5. BCD multiplier
The multiplication process is done as explained in the MATLAB work, but instead of the "for"
loop, eight different cases are used to calculate the value of the carry in the next step.
6. Normalization
A switch case is used to determine the number of digits resulting from the multiplication
operation; then the first seven non-zero digits are taken, the following three digits are kept to
be used in rounding, and the number of remaining digits is added to the exponent. The
underflow cases are also checked here, because sometimes, although the sum of the two
exponents is less than the available exponent, the underflow flag can be lowered after
normalization.
7. Rounding
If the round digit is greater than five, one is added to the mantissa and the inexact flag is
raised; then it is checked whether overflow has occurred.
8. Conversion to IEEE-754 format
As described above in addition.
Figure 20: Architecture of Multiplication in decimal representation - Decimal format
Behavioral simulation results: Figure 21 shows an underflow case, a normal case, an overflow
case and an invalid-operation case where one of the two operands equals infinity.
Synthesis results are shown in Figures 22 and 23.
3.2.2 Single precision
3.2.2.1 Addition
Addition is performed using the architecture shown in Figure 24; the role of each block is
described below:
1. Conversion from IEEE-754 to sign, exponent and mantissa
This block is used to extract the sign, exponent and mantissa from the represented numbers
using bit selection, and then to concatenate the implicit bit (1 in the case of normal numbers
and 0 in the case of subnormal numbers).
2. Binary subtractor
This block compares the two operands, sets the exponent of the final result to the larger
exponent and calculates the difference between the two exponents; it also produces a greater
signal that indicates which operand is greater.
3. Significand alignment
This block shifts the mantissa of the smaller operand (recognized using the greater signal); the
number of shifts is equal to the difference signal received from the binary subtractor in the
case of two normal numbers, and equal to (difference − 1) in the case of a normal and a
subnormal number. It also keeps the shifted bits to be used in rounding: the last two shifted
bits become the guard and round bits respectively, and the OR-ing of the remaining shifted
bits becomes the sticky bit.
4. Binary adder
This block adds the mantissas of the two operands.
5. Normalization
This block receives the sum of the two mantissas and checks whether there is a carry; if so, it
increments the exponent and shifts the resulting mantissa, the shifted-out bit goes into the
guard bit, the guard bit into the round bit, and the sticky bit becomes the OR-ing of the round
bit and the old sticky bit.
6. Rounding
This block checks whether the guard bit and the last bit of the mantissa are both equal to one,
or whether the guard bit, round bit and sticky bit are all equal to one; if one of the two cases
holds, the mantissa is incremented by one and the inexact flag is raised; finally it checks
whether an overflow exists.
7. Conversion to IEEE-754 format
This block takes the sign of the first operand and the final result of the exponent and the
mantissa, and then puts the result in the single precision representation format.
Figure 24: Architecture of addition in Single precision
Behavioral simulation and synthesis results are shown in Figures 25 to 27.
3.2.2.2 Subtraction
Subtraction is done using the architecture in Figure 28; each block has its role as described
below, repeated blocks are explained above:
1. Conversion from IEEE-754 to sign, exponent and mantissa
As described above in addition.
2. Binary subtractor
As described above in addition
3. Significand alignment
Same as in addition, but it keeps the GRS bits within the shifted mantissa because they enter
the subtraction process.
4. Binary subtractor
This block subtracts the two aligned mantissas, together with their GRS bits, using the two's
complement: after taking the two's complement of the second mantissa, a normal binary
addition is done, and if there is a carry bit the result is positive; otherwise the result is
negative and equals the two's complement of the output of the adder.
5. Normalization
This block receives the result of the subtraction and removes all the leading zeros resulting
from the subtraction, shifting the GRS bits into the representable bits in their place, but only
if the exponent of the result is bigger than the number of leading zeros, in which case that
number is subtracted from the exponent to normalize the number. If not, a number of leading
zeros equal to (exponent − 1) is removed and the exponent is set to zero, so the number
becomes a subnormal number.
If the resulting mantissa and exponent after normalization equal zero, the underflow flag is
raised.
6. Rounding
As described above in addition.
7. Conversion to IEEE-754 format
As described above in addition.
Figure 28: Architecture of Subtractor in Single precision
Synthesis results are shown in Figures 30 and 31.
3.2.2.3 Multiplication
Multiplication is performed using the architecture shown in Figure 32; the role of each block is
described below:
1. Conversion from IEEE-754 standard to sign, exponent and mantissa
As described above in addition.
2. Xor gate
This gate is used to determine the sign of the result.
3. Binary adder
This block calculates the result exponent by adding the exponents of the two operands and
then subtracting the bias from the sum; it can also raise the underflow or overflow signal if the
result is smaller or greater than the available exponent range (a software sketch of this
multiplication flow is given after Figure 32).
4. Binary multiplier
This block multiplies the mantissa of the two operands.
5. Normalization
First, using a switch case, this block determines the state of the result according to the number
of leading zeros, assigns the mantissa to be the first 24 bits starting from the leading one,
assigns the following two bits to be the guard and round bits respectively, and assigns the
OR-ing of the remaining bits to be the sticky bit.
Second, in each state it checks the underflow signal: if it is raised, it decides whether the
underflow can be resolved by representing the number as a subnormal number or not, and if
the underflow signal is not raised, it treats the number normally by subtracting the number of
leading zeros from the exponent.
Finally, it checks whether an overflow occurs or not.
6. Rounding
As described above in addition.
7. Conversion to IEEE-754 format
As described above in addition.
Figure 32: Architecture of multiplication in Single precision
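As referenced in the binary adder description above, the following C sketch restates the single precision multiplication flow for two normal numbers: XOR the signs, add the exponents and remove one bias, multiply the 24-bit significands, and renormalize the 48-bit product. It truncates instead of rounding and ignores subnormals, overflow and underflow; names are illustrative.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Multiply two normal binary32 values given as raw bit patterns,
 * truncating the low product bits instead of rounding. Illustrative only. */
static uint32_t fp32_mul_sketch(uint32_t a, uint32_t b)
{
    uint32_t sign = (a ^ b) & 0x80000000u;              /* XOR of the sign bits */
    int32_t  e    = (int32_t)((a >> 23) & 0xFF)
                  + (int32_t)((b >> 23) & 0xFF) - 127;  /* remove one bias      */
    uint64_t ma   = (a & 0x7FFFFF) | 0x800000;          /* 24-bit significands  */
    uint64_t mb   = (b & 0x7FFFFF) | 0x800000;
    uint64_t p    = ma * mb;                            /* 48-bit product       */

    if (p & (1ULL << 47)) {       /* product in [2,4): shift once more, bump e */
        p >>= 24;
        e  += 1;
    } else {                      /* product in [1,2): drop the low 23 bits    */
        p >>= 23;
    }
    return sign | ((uint32_t)e << 23) | ((uint32_t)p & 0x7FFFFF);
}

int main(void)
{
    float x = 1.5f, y = -2.5f, r;
    uint32_t xb, yb, rb;
    memcpy(&xb, &x, 4); memcpy(&yb, &y, 4);
    rb = fp32_mul_sketch(xb, yb);
    memcpy(&r, &rb, 4);
    printf("%f\n", r);            /* prints -3.750000 */
    return 0;
}
```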
Behavioral simulation and synthesis results are shown in Figures 33 to 35.
3.2.3 Frequency of the design
After designing the six modules, a top-level module is created to integrate them. The whole
design can only work at the frequency of the slowest path, which lies in the decimal
multiplication; as seen in Figure 36, it is the path through the BCD multiplier.
So the design can work at the frequency of this critical path, which is equal to 13.8 MHz, and
pipelining is used to increase the frequency of the whole design.
Pipelining is done by dividing the BCD multiplier into two parts, as seen in Figure 38; the design
can now work at a frequency of 25.8 MHz, which is the frequency of the critical path in the
decimal subtractor.
Figure 38: Architecture of the decimal multiplication after pipelining
4 CHAPTER FOUR: SYSTEM
4.1.1 Abbreviations
Software cannot terminate the reset process early by writing a zero to this
register.
Table 9: FPU Command register
clear status bit of the FPU Status register (register 0x110-bit 1) to one.
Table 10: Register (0x110)
Note: The five flags (bits 2-6) in this register are the output flags in the case of a
single-operation instruction; in the case of a SIMD instruction they are the output flags
of the first SIMD operation.
instruction, they are the inexact output flags, as explained for bit 5 of the FPU status
register (0x110), of the second to the 16th SIMD operations, in order.
Table 13: Register (0x11C)
4.2 TOP LEVEL DESIGN
HCI:
This block represents the interface between the software on one side and the FPU blocks on
the other side. From the software side it only sends and receives 32-bit data at certain
addresses; from the FPU blocks' side, control signals and status signals are sent and
received, as well as the operands and the FPU output, as shown in Figure 40. The HCI block
extracts the control signals that are sent to the FPU blocks from the data written by the
software, and uses the status signals to update some bits in the data sent back to the
software. The HCI block contains some registers, namely the FPU Command register, the
Operand A register (0x010), the Operand B register (0x050), the FPU Status register (0x110)
and a data-out register; it also contains a multiplexer and a demultiplexer to read from
and write into these registers.
Figure 40: HCI connections
Modified operation:
This block uses three inputs, the operation and the sign bits (bit 31) of the two operands,
to determine the modified operation as shown in Table 14 (a behavioural sketch is given
after the table).

Operation       Sign bits of the operands   Modified operation
Addition        Same sign                   Addition
Addition        Different signs             Subtraction
Subtraction     Same sign                   Subtraction
Subtraction     Different signs             Addition
Multiplication  -                           Multiplication

Table 14: Modified operation block function
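A behavioural sketch of Table 14 is given below; the operation encoding and port names are assumptions, and only the sign-XOR behaviour is taken from the table.

// Modified-operation logic: swap add/sub when the operand signs differ.
// The op encoding here is illustrative: 2'b00 = add, 2'b01 = sub, 2'b10 = mul.
module modified_op (
    input  logic [1:0] op,
    input  logic       sign_a, sign_b,   // bit 31 of each operand
    output logic [1:0] mod_op
);
    always_comb begin
        unique case (op)
            2'b00:   mod_op = (sign_a ^ sign_b) ? 2'b01 : 2'b00; // add -> sub if signs differ
            2'b01:   mod_op = (sign_a ^ sign_b) ? 2'b00 : 2'b01; // sub -> add if signs differ
            default: mod_op = op;                                // multiplication unchanged
        endcase
    end
endmodule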
Enable decoder:
This block uses the modified operation together with the representation to enable only one
of the input registers of the different CLBs (combinational logic blocks) when there is a
new single instruction or new SIMD data. In the SIMD case the enable decoder operates on
each new data word: even though the operation and representation do not change, the signs
of the operands change with different data, which changes the modified operation and
therefore which register is enabled.
Input registers:
There are six input registers, each connected to one of the CLBs. They register and output
the operands to the CLBs, and they accept new operands when enabled by the enable decoder.
When the FPU reset bit is set to one they reset all operands to zero, and when enabled they
output a signal to the output register of the same CLB to read the outputs and flags after
one clock cycle.
Output registers:
There are six output registers, each connected to one of the CLBs. They register and output
the outputs and flags of the CLBs, they accept new outputs and flags when enabled by the
signal from the input register connected to the same CLB, and they also output a ready
signal that is set to one when they receive the new outputs and flags.
Output multiplexer:
The output multiplexer takes the outputs, flags and ready signals of the six output
registers and outputs the desired one according to the modified operation and
representation.
Other logic:
The doorbell_r signal is fed back to the HCI to reset the Doorbell bit of the command
register to zero, indicating that the operands have been passed to the CLB.
The rst_r signal is fed back to the HCI to reset the FPU Reset bit of the command register
to zero, in case it was set to one, indicating that the input and output registers have all
been reset.
4.2.2 Design with SIMD support
In order to support SIMD instructions, some blocks were added and some modifications were
made, as explained here.
SIMD:
This block is the main block supporting SIMD. It interfaces with the software to directly
read and register the operands, it interfaces with the HCI through some control signals,
and it interfaces with the other FPU blocks as shown in Figure 41. It contains 16 registers
for each of the operands, the outputs and the flags; it also contains a finite state
machine, counters, multiplexers and demultiplexers that control which operands are output
and which outputs and flags are read, and it also implements pipelining.
Operands multiplexer:
This block chooses which operands are sent to the input registers, either those from the
HCI or those from the SIMD block, according to whether it is a SIMD operation (the SIMD bit
of the command register is set to one) or not.
Interrupt multiplexer:
This block's name is misleading; its function is to choose which output, flags and ready
signal are sent to the HCI: either those from the output multiplexer, in the case of a
single instruction, or the output and flags of the first register in the SIMD block
together with the simd_ready signal, in the case of a SIMD instruction.
This block interfaces with the software to output sw_dataout instead of the HCI, according
to the required register: if it is the command register, status register 0x110 or output
0x130, the data is read from the HCI output; otherwise, for the other output and status
registers, it is read from the SIMD output, as their registers are located there.
Figure 43: Single instruction simulation (B)
FPU reset bit:
Figure 44 shows the case where the FPU reset bit of the FPU command register (0x0) is set
to one: the operands registered in the input register are set to zero after two clock
cycles, and the outputs registered in the output register are set to zero after another
clock cycle. The software can read the FPU command register (0x0) to find that this bit is
reset to 0 three clock cycles after the negative edge of the sw_write_en signal.
Figure 45 shows the case where the FPU enable bit is kept at zero: the fpu_doorbell_w
signal is not set to one and therefore no operation is carried out; reading the output
gives zero, due to a preceding reset.
FPU interrupt enable bit:
Figure 46 shows the case where the FPU interrupt enable bit is set to one: the
fpu_interrupt signal rises to one with the status bit of the FPU status register (0x110)
and drops back to zero when the clear status bit is set to one.
Figure 48: SIMD instruction simulation (B)
4.3 RISC-V
The RISC-V is an open-source ISA that was originally developed in the Computer Science
Division of the EECS Department at the University of California, Berkeley.
• RV32IMC
• Debugger: RISC-V debug specification 0.13
2. RI5CY
• Language: SystemVerilog
• RV32IM[F]C 32-bit
• Optional full support for RV32F Single Precision Floating Point
Extensions (Floating-point support in the form of IEEE-754 single
precision)
• Debugger: RISC-V debug specification 0.13
4.4 PULPino
4.4.1 Features
• Processor (Open-source RISC-V ISA processor).
• Ultra-low-power and ultra-low-area constraints.
Most PULPino blocks are clock-gated (to turn off any unused block during operation and save
power).
The peripherals are connected to the APB bus, which consumes less power than the AXI bus.
• RI5CY or zero-riscy core.
The two cores have the same external interfaces and are thus plug-compatible.
The difference between RI5CY and zero-riscy is that the RI5CY core supports more ALU
ISA extensions and more complex operations than zero-riscy.
We are working with the RI5CY core which is enabled by default.
• Contains a broad set of peripherals: I2C, SPI and UART.
• Available for FPGA (Synthesizable written in System Verilog)
The SoC uses AXI as its main interconnect with a bridge to APB; both the AXI and APB buses
feature a 32-bit-wide data channel. All peripherals in PULPino are connected to the APB bus
except the SPI slave, which is a very special peripheral and is not intended to be used
from the core itself. (3)
The core uses a very simple data and instruction interface to talk to data and instruction
memories directly.
4.4.3 Memory map
Figure 52: Commands for making Modelsim work
Figure 53: Script for installing and making Cmake
4. riscv-toolchain
The ri5cy_gnu_toolchain was used. Errors arose at first while running make; after some
searching, a workaround was found that involved making some changes in a few files and
rerunning make, after which the build finished successfully, and the installation bin path
was added to the ".bashrc" PATH. (5)
Figure 55: Output of hello world example
1. To connect it through the UART as an intermediate interface between both RISC-V and
FPU.
2. To replace one of the peripherals with the FPU to be directly connected to the APB.
3. To connect the FPU directly to the APB as a new peripheral.
The first method is not preferred because it requires unnecessary time and more complex
applications to handle the data between two different peripherals.
The second method was the one used; the I2C peripheral was chosen to be replaced since:
• The size of its memory region suits that specified in the HCI memory map specification,
as shown in Table 8; the PULPino memory map after this replacement is shown in Figure 56.
• The FPU registers slightly resemble those specified in the HCI specifications.
The third method is more practical, since in practice one would want to extend or add new
peripherals to the existing ones rather than replace an existing one. However, since it
needs more modifications in the PULPino files than the second method and there would be no
difference in functionality, the second method was chosen, even though the third remains
the better practice.
Figure 56: Pulpino memory map after replacement of I2C by FPU
4.5.2.1 Bridge-APB interface
The descriptions of the APB signals used, according to the AMBA APB Protocol (Version 2.0),
are shown in Table 16. (6)
Signal Description
PCLK Clock. The rising edge of PCLK times all transfers on the APB.
PRESETn Reset. The APB reset signal is active LOW. This signal is normally
connected directly to the system bus reset signal.
PADDR Address. This is the APB address bus. It can be up to 32 bits wide (here
12 bits wide) and is driven by the peripheral bus bridge unit
PWDATA Write data. This bus is driven by the peripheral bus bridge unit during
write cycles when PWRITE is HIGH. This bus can be up to 32 bits
wide (here 32 bits wide)
PRDATA Read Data. The selected slave drives this bus during read cycles when
PWRITE is LOW. This bus can be up to 32-bits wide (here 32 bits
wide)
PSELx Select. The APB bridge unit generates this signal to each peripheral bus
slave. It indicates that the slave device is selected and that a data
transfer is required. There is a PSELx signal for each slave.
PWRITE Direction. This signal indicates an APB write access when HIGH and
an APB read access when LOW.
PENABLE Enable. This signal indicates the second and subsequent cycles of an
APB transfer
PREADY Ready. The slave uses this signal to extend an APB transfer.
PSLVERR This signal indicates a transfer failure. APB peripherals are not required
to support the PSLVERR pin. This is true for both existing and new
APB peripheral designs. Where a peripheral does not include this pin
then the appropriate input to the APB bridge is tied LOW
Table 16: APB used signals description
4.5.2.3 Bridge-Event unit interface
The fpu_int signal is connected to PULPino lightweight event and interrupt unit.
Function Inputs:
• Two operands
• Operation
• Representation
Function Outputs:
• FPU output
• Flags
Function code flow:
1. Wait until status bit is set to zero by reading the command register in a loop.
2. Write operands.
3. Write command register
• FPU Enable and Doorbell are set to one, Interrupt enable is set to zero.
• Floating-point format and operation are set according to the floating-point
representation and operation sent.
4. Wait until status bit is set to one by reading the command register in a loop.
5. Reset the status bit to zero by writing to the command register the same value read from
it (so as not to affect the flags, as they will be read later) but with the clear bit set
to one.
6. Read output.
7. Read flags.
Function Inputs:
• Two arrays for operands
• Operation
• Representation
• Number of SIMD operations
Function Outputs:
• Array of FPU outputs
• Array of flags
Function code flow:
1. Wait until status bit is set to zero by reading the command register in a loop.
2. Write operands in a loop according to number of SIMD operations.
3. Write command register
• FPU Enable and Doorbell are set to one, Interrupt enable is set to zero.
• Floating-point format and operation are set according to the floating-point
representation and operation sent.
• SIMD bit is set to one.
• Number of SIMD operations bits are set according to the required number of
operations.
4. Wait until status bit is set to one by reading the command register in a loop.
5. Reset the status bit to zero by writing to the command register the same value read from
it (so as not to affect the flags, as they will be read later) but with the clear bit set
to one.
6. Read the four status registers, then extract from them the flags of each operation and
insert them in the array of flags.
7. Read outputs in a loop and insert them in the array of outputs.
4.6.1.3 Compare
Function Inputs:
• FPU output.
• FPU flags.
• Reference output.
• Reference flags.
Function Outputs:
• Boolean.
Function code:
The output is true if:
1. the FPU output is equal to the reference output, and
2. all FPU flags are equal to the reference flags.
Otherwise it is false.
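For consistency with the other sketches in this document, the same check can be expressed as a SystemVerilog function (the project's application code implements it in C); the function and argument names are illustrative assumptions.

// Equality check between the FPU outputs/flags and the reference outputs/flags.
function automatic bit compare_outputs(input logic [31:0] fpu_out,  ref_out,
                                        input logic [3:0]  fpu_flags, ref_flags);
    return (fpu_out == ref_out) && (fpu_flags == ref_flags);
endfunction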
4.6.1.4 FPU get status
This function reads the status register and returns its value; it is called in the
FPU_Single_Instruction and FPU_SIMD_Instruction functions.
Figure 64: Integration test cases
Figure 66: Single instruction flags test case
5 CHAPTER FIVE: VERIFICATION
In this chapter we discuss the verification phase of this project. We were required to
build testing environments to perform functional verification on the RTL code: first we
built separate testing environments for each combinational module, then an environment to
test the integrated floating point unit modules with the host controller interface (HCI).
The designed FPU performs 3 operations (addition, subtraction and multiplication) on two
different 32-bit representations (single precision, and decimal format with decimal
encoding). The designed FPU also supports single instruction multiple data (SIMD)
operations, so it can perform the same operation on different operands, up to 16 operations
per instruction.
The output of the FPU is a 32-bit result with the same representation as the two input
operands and 4 flag bits, where the four flags are:
• Invalid operation: raised when the input operation is not one of the three specified
operations.
• Overflow flag: raised when the result is greater than the maximum representable number.
• Underflow flag: raised when the result is smaller than the minimum representable number.
• Inexact flag: raised when the result is rounded and is therefore not exact.
All the environments are built using the UVM methodology, which has more features than
plain OOP environments and makes the testing environment more reusable and easier to
modify.
UVM allows us to use:
• Dynamically-generated objects that allow you to specify tests and test bench
architecture without recompiling
• A hierarchical testbench organization that includes Drivers, Monitors, and Bus
Functional Models
• Transaction-level communication between objects
• Testbench stimulus (UVM Sequences) separated from the testbench structure
5.3.1.2.1 Command transaction
The command transaction class contains four members which are:
• 32-bit (a_rep) for operand(1) represented in the decimal format
• 32-bit (b_rep) for operand(2) represented in the decimal format
• Real (a_dec) for operand(1) in real format
• Real (b_dec) for operand(2) in real format
This function is used to randomize the two operands of the transaction. First we create an
object of the class ("rand_num"); this class contains 3 data members:
• A sign bit which represents the sign of the operand (0 = positive number, 1 = negative
number)
• An unsigned integer ("C") which represents the combination field of the number
• An integer ("exp") which represents the exponent of the number
and one member function ("gen_num"), which returns a real number computed from the class
members using the formula shown in Figure 71 (a class sketch is given below).
These class members are then randomized with certain constraints that will be discussed
later. After the randomization, in order to store the randomized operands in the 32-bit
decimal format, we use the function ("represent"), which takes an object of the class
("rand_num") as an argument and returns the 32 bits in the decimal format; this process is
repeated to generate the second operand.
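A sketch of the ("rand_num") class is given below. The value formula is an assumption based on the decimal interchange format, value = (-1)^sign × C × 10^exp, with ("C") treated as the coefficient value; the exact formula used by the project is the one shown in Figure 71.

// Random decimal operand generator (member names follow the text, formula assumed).
class rand_num;
    rand bit          sign;   // 0 = positive, 1 = negative
    rand int unsigned C;      // combination-field / coefficient value
    rand int          exp;    // decimal exponent

    function real gen_num();
        real val;
        val = real'(C) * (10.0 ** exp);    // assumed value formula
        return sign ? -val : val;
    endfunction
endclass

The ("represent") function then packs a randomized ("rand_num") object into the 32-bit decimal interchange encoding, as described above.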
Function (“dec”)
This function is used to store the two randomized numbers in real format in the two
transaction members ("a_dec") and ("b_dec") using the function ("decode"), which takes the
32-bit number represented in decimal format and returns the corresponding real value using
the member function ("gen_num") of the class ("rand_num").
There is also a function ("dec") which takes the 32-bit represented result as an argument
and returns the corresponding real value.
5.3.1.3.1 Tester
The tester block is responsible for generating the test cases which will propagate through
the environment, the test cases are generated using constrained randomization in the form of
transactions, the tester has a UVM_put_port which takes the data of transaction type to deliver
the test cases to the driver.
The tester generates the UVM command transactions, then calls the function (".random") to
randomize the transaction components; the transaction is then decoded using the function
(".dec") to obtain the random operands in real form, which are needed to perform further
operations. The transaction is then put in the tester port, and this process is repeated to
generate another test case, and so on.
5.3.1.3.2 Command_f
It is a UVM analysis FIFO of type transaction which delivers the test-case transactions
from the tester to the driver. The FIFO takes the transaction through its put port in the
tester, then blocks the tester from putting new transactions into the FIFO until the
previously added transaction is taken by the driver.
5.3.1.3.3 Driver
The driver block extends UVM_component and has a UVM_get_port; we also instantiate a
virtual BFM in the driver so it can communicate with the DUT through the BFM.
The UVM_get_port is used to get the test-case transactions from the tester through the
FIFO; the driver then calls the task (".send_op"), which takes the two operands represented
in the 32-bit decimal format to be sent to the DUT through the BFM. The other task is
(".write_to_monitor"), which takes two arguments, the two operands in real format, to be
sent to the command monitor through the BFM.
The run phase of the driver is inside a forever loop; however, the test ends when the loop
used in the tester to generate the test cases finishes, because the phase objection is
dropped after generating the test cases and it was the only objection raised in the whole
testing environment.
It also has instances of the command monitor and the result monitor, which are used to send
the two operands and the DUT outputs to the scoreboard, and it has two member tasks
("send_op") and ("write_to_monitor").
Task (“send_op”)
This task is called in the driver and takes two arguments which are the two operands in
the decimal format representation then they are assigned to the BFM members (“operand1”) and
(“operand2”) which are connected to the DUT, the task has a delay of 10 ns to model the
propagation delay of the operands through the combinational DUT
Task (“write_to_monitor”)
This task is called in the driver and takes two arguments, the two operands in real format,
which are then passed to the command monitor by calling the member function of the command
monitor instance ("write_to_monitor"); the outputs of the DUT, which are the 32-bit result
and the 4-bit flags, are passed to the result monitor, also by calling the member function
of the result monitor instance ("write_to_monitor").
5.3.1.3.6 Result monitor
The result monitor block extends UVM_component; it has a virtual instance of the BFM and an
analysis port of type ("result_transaction") called ("ap_port"), which is used to send the
result and flags to the scoreboard. In the build phase, this result monitor is connected to
the result monitor instance in the BFM.
The result monitor has a member function ("write_to_monitor") (which is called in the BFM);
the function has two arguments, a 32-bit ("Result") and a 4-bit ("Flags"). First we create
an object of ("result_transaction"), then we assign the two arguments to the corresponding
members of the result transaction; we call the member function of the result transaction
("dec") to put the real format of the result in the member ("result_dec"), and the
transaction is then sent through the port to the scoreboard.
5.3.1.3.7 Scoreboard
UVM scoreboard is a verification component that contains checkers and verifies the
functionality of a design. It usually receives transaction level objects captured from the interfaces
of a DUT via TLM Analysis Ports.
In our environment the scoreboard extends uvm_subscriber for the result monitor to be
connected to the analysis port of the result monitor and receive the result transaction, it has a
UVM_tlm_analysis_fifo of type command transaction which is connected to the command
monitor analysis port, it has two member functions (“predict_result”) and (“write”)
Function (“predict_result”)
This function has one argument of type command transaction and returns a result transaction
with the predicted result and flags.
First we create an object of the result transaction, then we store the predicted
real-format value of the result in the ("result_dec") member of the result transaction by
carrying out the required operation on the two operands in real format, ("a_dec") and
("b_dec"), which are members of the command transaction passed through the argument.
The operation applied to the two operands is changed according to the design under test, so
that the same environment can be used for the addition, subtraction and multiplication
designs for the decimal format representation just by changing the operator in the
scoreboard.
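A hedged sketch of this function for the decimal addition environment is shown below; the class and member names follow the text, only the operator changes for the subtraction and multiplication environments, and flag prediction is handled separately in the ("write") function.

// Predict the real-valued result from the command transaction operands.
function result_transaction predict_result(command_transaction cmd);
    result_transaction predicted;
    predicted = new();
    predicted.result_dec = cmd.a_dec + cmd.b_dec;   // '+' becomes '-' or '*' per DUT
    return predicted;
endfunction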
Figure 79: DE function ("predict_result") for decimal addition
Function (“write”)
This function is called automatically when the result monitor writes data to its analysis
port; it takes one argument of type result transaction ("t"), which holds the same data
written by the result monitor to the port.
We create two objects, a command transaction ("cmd") and a result transaction
("predicted"), then use the function ("try_get") to get the data from the FIFO and store it
in the command transaction ("cmd"). This transaction is passed as an argument to the
function ("predict_result"), which returns a result transaction holding the predicted
result for the given operands; the predicted flags are then calculated, as discussed below,
to be compared with the DUT flags.
To calculate the overflow flag, the predicted result is compared to the maximum
representable numbers (positive or negative): if the result is greater than the maximum
positive number or smaller than the maximum negative number, the overflow flag and the
inexact flag are raised. The predicted flags are then compared with the DUT flags to decide
whether the test case passes or fails, without comparing the results themselves, as they
are not checked in the overflow case.
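A small sketch of this overflow check is given below; it assumes the maximum decimal-format magnitude 9.999999 × 10^96 (a 7-digit coefficient at the largest exponent), and the function name is illustrative.

// Overflow if the predicted real result lies outside the representable decimal range.
function automatic bit is_overflow(real predicted);
    return (predicted > 9.999999e96) || (predicted < -9.999999e96);
endfunction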
The minimum representable number is 1 × 10^-101, so if the absolute value of the result is
less than that number, the number is not representable and the underflow flag is raised. If
the exponent is between -95 and -101, the result may or may not underflow, depending on the
precision of the number. For example, if the exponent is -96 and the result is
1.23456 × 10^-96, the number is representable in the form 123456 × 10^-101, but
1.234567 × 10^-96 = 1234567 × 10^-102 is an underflow case. A special case is therefore
made for each exponent between -95 and -101 to calculate the precision of the result and
decide whether it underflows or not.
The inexact flag is raised when there is rounding up in the last digit of the result (the
7th digit) or when it is an underflow or overflow case. This rounding is decided by the
rounding digit, which is the 9th digit of the result: if this digit is greater than or
equal to 5 the result is rounded up, otherwise no rounding occurs; this is done by the
algorithm shown in Figure 82.
There is a special case in the addition and subtraction operations: if the exponent
difference between the two operands is 14 or more, the result is the larger operand and the
inexact flag is never raised; this does not apply to the multiplication case.
The result is compared by calculating the difference between the predicted result and the
result from the DUT; if this difference is more than a certain threshold (needed because
the operations done on the real type in the testing environment have more precision than
the decimal format representation), or the predicted flags are not equal to the DUT flags,
the test case is considered a failure.
Figure 85: DE top module
5.3.2.2 Environment’s sequence item
The sequence item class contains six members which are the fields of the single precision
representation for each operand:
• 23-bit (“mantissa_1”) represents the significand of operand(1)
• 8-bit (“exp_1”) represents the exponent of operand(1)
• bit (“sign_1”) represents the sign bit of operand(1)
• 23-bit (“mantissa_2”) represents the significand of operand(2)
• 8-bit (“exp_2”) represents the exponent of operand(2)
• bit (“sign_2”) represents the sign bit of operand(2)
These members will be randomized with certain constraints which will be discussed later.
5.3.2.3.1 Sequence
In our environment this class is called ("random_sequence") and extends UVM_sequence of
type ("sequence_item"). This sequence contains the testing scenario: it creates a sequence
item object, randomizes it, then sends it to the sequencer using the functions
("start_item") and ("finish_item").
5.3.2.3.2 Sequencer
The sequencer automatically delivers the sequence_item from the sequence to the driver. It
has no special functions, so it is defined in the ("env_pkg") using ("typedef") and is
instantiated in the ("env") class; the sequencer has a built-in port which is connected to
the driver.
5.3.2.3.3 Driver
The driver class extends UVM_driver of type ("sequence_item"). The UVM_driver has a
built-in port called ("seq_item_port") which is connected to the sequencer. The driver
creates an object of type sequence item to store the data obtained using the function
("get_next_item"), then sends this data to the BFM using the function ("send_op") and calls
the function ("write_to_monitor"), which has been discussed previously. The function
("item_done") is called after sending the data, which declares that the driver is ready to
get another sequence item.
5.3.2.3.4 Scoreboard
The scoreboard extends UVM_subscriber of type ("result_transaction"), which is the same
class as in the decimal encoding environment previously discussed. It has a
("UVM_tlm_analysis_fifo") of type sequence item to get the command data from the command
monitor, and it has two member functions, ("write") and ("predict_result").
Function (“predict_result”)
It has one argument of type sequence_item; a result_transaction object is created to store
the predicted result and flags. Each operand is converted to the shortreal type using the
built-in function ("$bitstoshortreal"), the operation is carried out, and the result is
converted back to the single precision representation using the built-in function
("$shortrealtobits"). We use the shortreal type because it is stored as bits in the single
precision format, so it is easy to switch between the two representations.
The same function is used for all the operations by changing only the operator.
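A hedged sketch of this function for the addition environment is shown below; the sequence-item field names follow the text, while the result member name of the result transaction is an assumption.

// Predict the single-precision result using the built-in bit/shortreal conversions.
function result_transaction predict_result(sequence_item cmd);
    shortreal a, b, res;
    result_transaction predicted;
    predicted = new();
    a   = $bitstoshortreal({cmd.sign_1, cmd.exp_1, cmd.mantissa_1});
    b   = $bitstoshortreal({cmd.sign_2, cmd.exp_2, cmd.mantissa_2});
    res = a + b;                                // operator changes per environment
    predicted.result = $shortrealtobits(res);   // back to IEEE-754 single precision bits
    return predicted;
endfunction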
Function (“write”)
In this function the predicted flags are calculated, then the DUT result and flags are
compared to the predicted ones to decide whether the test case passes or fails.
Since the result is stored in the shortreal type, which has the same range as the single
precision representation, in the overflow case the result will be infinity, which is
represented with all ones in the exponent field and all zeros in the mantissa field
(32'h7F800000 for positive infinity).
To calculate the predicted inexact flag, the operation that produces the predicted result
is carried out again, but with the result stored in a real data type so that it has a
higher precision; then, to decide whether rounding occurred, we take the difference between
the mantissa of the predicted result stored in shortreal and that stored in real.
In the case of a subnormal number, the mantissa of the higher-precision value needs to be
normalized so that it can be subtracted from the single precision mantissa. This
normalization is done by adding the bias of the higher precision (1024) to the
higher-precision exponent and subtracting the bias of the lower precision format (128); the
mantissa of the higher precision format is then shifted right by the result of this
operation, after which it is ready to be subtracted from the single precision mantissa.
Figure 92: SP predicted inexact flag
5.3.2.3.7 Class (“random_test”)
This class extends the ("base_test") class. First it creates an object of the ("sequence")
class; then the built-in function start is called after raising the uvm_phase objection.
This function takes one argument, which is the sequencer object, and the objection is
dropped after the test scenario is done.
Figure 96: integrated sequence_item data members
5.3.3.2.2 BFM
The BFM class is connected to the DUT whose input and output data members are shown
in Figure 98
Figure 98: integrated environment BFM data members
The inputs and outputs are registered: we can give the address of the required register
through ("sw_address") and write the data through ("sw_datain") or read the output from
("sw_dataout").
The BFM has three tasks, ("reset"), ("send_op") and ("write_to_monitor"), and an initial
block for clock generation.
Task (“send_op”)
The operands are sent to the DUT following a certain procedure corresponding to the
specifications of the HCI design. For synchronization we use the negative clock edge in the
environment to write the data to the DUT, while the DUT samples the data at the next
positive edge; with this method the testing environment is immune to the clock skew between
the clock generated in the BFM and the DUT clock, both when driving and when monitoring the
result. The flow of writing is as follows (a sketch of the task is given after the list):
1. The DUT is reset at the start of the testing sequence.
2. The operands are written to the registers at their reserved addresses.
3. The control data (which contains the operation, representation, number of SIMD
operations and FPU enable) is written to the command register.
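A hedged sketch of the ("send_op") flow is given below, written as a fragment inside the BFM (which is assumed to declare clk, sw_address, sw_datain and sw_write_en). The register addresses 0x010, 0x050 and 0x000 come from the HCI description; the command-word encoding follows Table 9 and is passed in pre-assembled here, since its bit positions are not reproduced in this text.

// Drive operands and the command word on negative clock edges; the DUT samples on
// the following positive edges.
task send_op(input logic [31:0] op_a, op_b, cmd_word);
    @(negedge clk);
    sw_write_en = 1'b1;
    sw_address  = 12'h010;          // Operand A register
    sw_datain   = op_a;
    @(negedge clk);
    sw_address  = 12'h050;          // Operand B register
    sw_datain   = op_b;
    @(negedge clk);
    sw_address  = 12'h000;          // FPU command register (operation, representation,
    sw_datain   = cmd_word;         // SIMD count, FPU enable, doorbell)
    @(negedge clk);
    sw_write_en = 1'b0;
endtask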
Task (“write_to_monitor”)
This task reads the results from the DUT and creates the result transaction that will be
sent by the result monitor to the scoreboard, while at the same time the command monitor
sends to the scoreboard the corresponding sequence item, which includes the test-case data.
The flow is as follows (a sketch of the task is given after the list):
1. First, we wait on the status bit (which is raised when the FPU finishes the
required operations).
2. Then the result is read from the output registers.
3. Then the flags are read from the flag registers.
4. Then the result is sent to the result monitor and the command is sent to the
command monitor using the function ("write_to_monitor").
5. Then the clear bit is raised, which makes the DUT ready to accept a new test case
and sets the status bit to zero.
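A hedged sketch of the ("write_to_monitor") flow follows, again as a BFM fragment. The addresses 0x000, 0x110 and 0x130 come from the text; the status and clear bit positions, the flag bit slice and the monitor handle name are assumptions.

// Assumed bit positions; the real ones are defined by the FPU command/status registers.
localparam int STATUS_BIT = 1;
localparam int CLEAR_BIT  = 2;

task write_to_monitor();
    logic [31:0] cmd_reg, result, status_reg;
    // 1. wait until the status bit is raised by the FPU (poll the command register)
    do begin
        @(negedge clk);
        sw_address = 12'h000;
        @(posedge clk);
        cmd_reg = sw_dataout;
    end while (!cmd_reg[STATUS_BIT]);
    // 2. read the result from the output register
    @(negedge clk); sw_address = 12'h130;
    @(posedge clk); result     = sw_dataout;
    // 3. read the flags from the status register
    @(negedge clk); sw_address = 12'h110;
    @(posedge clk); status_reg = sw_dataout;
    // 4. forward the result and flags to the result monitor (flag slice assumed)
    result_monitor_h.write_to_monitor(result, status_reg[5:2]);
    // 5. raise the clear bit so the status bit is reset and a new test case can start
    @(negedge clk);
    sw_address  = 12'h000;
    sw_datain   = cmd_reg | (32'h1 << CLEAR_BIT);
    sw_write_en = 1'b1;
    @(negedge clk);
    sw_write_en = 1'b0;
endtask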
5.3.3.2.3 Scoreboard
The scoreboard extends UVM_subscriber with the type ("result_transaction"); as in the
previous testing environments, it has two tasks, ("predict_result") and ("write").
Task (“predict_result”)
• In the case of the single precision format (the rep bit is one), the predicted result is
calculated by performing the required operation on the operands after converting them using
the built-in function ("$bitstoshortreal"); the result is then converted back to bits using
the built-in function ("$shortrealtobits").
Figure 101: SP predicted result
Task (“write”)
The write task is responsible for reading the result transaction (which contains the result
and flags calculated by the DUT) from the result monitor; the predicted flags are then
calculated, and the result and flags are compared to decide whether the test case passes or
fails.
The predicted flags are calculated according to the representation and operation bits in
the sequence item read from the command monitor. The flags are calculated inside a for loop
that iterates over the number of SIMD operations and performs the same logic as the
previous environments according to the test-case representation.
This range distribution is done by applying constraints on the data members to be
randomized.
The test is carried out with a random seed for each run; each run generates about 200,000
test cases for the combinational modules and about 20,000 test cases for the integrated
DUT, where SIMD is also randomized and each test case can contain up to 16 operations.
Figure 102: Single precision ranges (negative normal, negative subnormal, zero, positive
subnormal, positive normal)
Since there are edges between the positive and negative subnormal numbers (around zero),
between the subnormal and normal numbers (in both the positive and negative cases), and at
the boundaries of the range, the corner cases are near these edges and boundaries. We
therefore distributed the weights of the range over:
1) Large positive normal
2) Small positive normal
3) Large positive subnormal
4) Small positive subnormal
in addition to the ordinary cases of the range between these numbers, and the same for the
negative numbers (a constraint sketch is given below).
The generated cases, where the randomized operands are different combinations of these
ranges, have a higher probability of crossing the boundaries.
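An illustrative constraint sketch of this weighting, applied to the exponent field of one single-precision operand, is shown below; the actual weights and cut-off values used in the project are not shown here and the class name is an assumption.

// Bias randomization toward the boundary regions described above.
class sp_operand;
    rand bit        sign_1;
    rand bit [7:0]  exp_1;
    rand bit [22:0] mantissa_1;

    constraint ranges_c {
        exp_1 dist { 8'd0              := 20,   // subnormals (mantissa picks small/large)
                     [8'd1   : 8'd3  ] := 20,   // small normals, near the subnormal edge
                     [8'd250 : 8'd254] := 20,   // large normals, near the overflow boundary
                     [8'd4   : 8'd249] := 40    // ordinary normal range
                   };                           // exp = 255 (Inf/NaN) is excluded
    }
endclass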
Figure 104: decimal encoding representation constraints
5.5 BUGS
Module | Bug | Fix | Status
SINGLE PRECISION SUBTRACTOR | Inexact flag isn't raised correctly | Change the length of the binary subtractor | Fixed
SINGLE PRECISION SUBTRACTOR | Incorrect result and inexact flag in case of operands with a small difference in exponent | During normalization use the GRS bits; switch the order between rounding and normalization | Fixed
6 CHAPTER SIX: SYNTHESIS AND FORMAL
VERIFICATION
[Flow diagram: RTL coding -> RTL code verification (simulation) -> synthesis -> formal
verification (RTL vs. netlist) -> placement and routing]
6.1 Synthesis
When synthesizing any design, we have certain considerations to take into account. We need
to set the libraries to be used according to the allowable fabrication technology and
design techniques, and then we need to analyze our designs and sub-designs. Moreover, we
need to define performance figures like speed and power optimization constraints according
to our needs. Furthermore, we need to specify technology constraints like size and space
(area). Finally, we need to compile the design according to the specified constraints, in a
top-down or bottom-up strategy; afterwards we obtain reports about whether the constraints
we set were satisfied or not, and about the area, speed and power consumption of our
design (9).
The steps of the synthesis process are done in a sequential manner as seen below in Figure 106.
This sequence may be modified slightly to suit the design process of each designer.
We will refer to the synthesis flow using the Synopsys Design Compiler tool in the following
discussion because that is the tool we use in our practice.
It is vital to define the technology library to which the design will be mapped so that the
synthesis tool knows how to map the design. There are multiple library types, each
containing specific information about the cells and the technology itself; they are as
follows (9):
• Target library: contains all the logic cells that should be used for mapping during
synthesis. In other words, during synthesis the tool maps a design to the logic cells
present in this library.
• Link library: contains information on the logic gates in the synthesis technology library. The
tool uses this library solely for reference but does not use the cells present in it for mapping as in
the case of target_library.
In order to specify the technology library we set both the target library and the link
library in our design to "NangateOpenCellLibrary_ss0p95v125c.db", as it has the highest
temperature, lowest voltage and the slow-slow process corner. Also notice that the link
library setting is a list that contains the technology library as well as an asterisk,
which indicates that DC should resolve references by searching memory (designs that have
been analysed prior to this design); if it cannot find the reference in memory it looks in
the technology library, and if DC does not find the reference in either, it looks in the
search path. The search_path is just a variable that tells DC where to look in order to
resolve references that have not been found in the link library.
After specifying our libraries, we need to allow DC to read in the design. This phase consists of
checking and analyzing the RTL for syntax errors, resolving references, mapping the design to
technology-independent implementation (GTECH) before building the generic logic for the
design. DC offers us two options for accomplishing this. The read_file technique is the
first, while the analyze and elaborate approach is the second. The analyze command also
stores the result of the translation in the specified design library so that it may be used
later; a design analyzed once need not be analyzed again and can be merely elaborated, thus
saving time. Conversely, the read command performs the function of the analyze and
elaborate commands but does not store the analyzed results, making the process slow by
comparison, so we use analyze and elaborate.
In order to properly optimize our design to give minimum area and highest speed, we must
provide DC with constraints. This involves setting drive characteristics for input ports
and setting loads on input and output ports.
6.1.4.1 Clock Characteristics
1. In the pre-layout phase, where the clock tree is incomplete, it is important to specify
the transition time at register clock pins, as the default estimate might be pessimistic.
We can specify this using the set_clock_transition command.
3. The clock network latency is the time it takes a clock signal to propagate from the
clock definition point to a register clock pin. Design Compiler assumes ideal clocking, but
specifying the clock network latency provides an estimate of the clock tree for pre-layout,
so we used the set_clock_latency command with a value of 2.
DC must check that the capacitive load of driven nets and interconnects is less than the
max_capacitance attribute of the driving pin.
6.1.4.3 Speed
DC deals with timing constraints for speed optimization in a very specific way. Generally,
the tool classifies timing paths into 3 categories as follows:
o Path category 1: from input to register (this path is constrained according to the input
delay using the set_input_delay command), and we set it to 0.4 of our clock period.
6.1.4.4 Area
In order to constrain the area of the design we use the set_max_area command and
provide DC with the maximum area constraint. Setting max_area attribute to a value of
zero means we want DC to optimize the area to the smallest possible size.
The design is mapped onto technology-specific gates at this step, and the design is also optimised
at this time. We ask DC to map and optimise the design based on these limitations and
environment settings after we've defined all of our constraints and environment variables. This is
accomplished on three levels: the architectural, logic, and gate levels. The compile_ultra
command was used.
In the logic-level phase, DC is still working on GTECH implementation; here is where DC deals
with the hierarchy in the design.
DC works on the netlist created by logic-level synthesis to create a technology-specific
implementation in the gate-level phase. The actual technology-specific mapping, as well as delay
and area optimization (according to restrictions) and any design rule constraint violations are
completed in this step.
For large hierarchical designs consisting of many sub-circuits there are several strategies that we
may use to compile the design, but we use Top-Down Strategy.
Top-Down Strategy
In this strategy, we set the constraints for the top level module only. We read all lower level
modules, but we do not compile them separately. After all the modules have been read, and the
top level constraints have been defined we compile the top level only, and DC infers the
constraints required for lower level modules in order to satisfy the top level constraints, and thus
it maps all modules accordingly.
Design Compiler makes it easy for us to generate a range of reports to check the accuracy and
quality of our implementation. The time report, the area report, the quality of outcomes report,
and the constraint report are the most significant reports. We'll go over each of them in detail
below (9).
6.1.6.1 Timing Reports
Design Compiler has a built-in static timing analyzer called DesignTime. Static Timing Analysis
can determine if a circuit meets timing constraints without dynamic simulation which is an
advantage when it comes to saving time.
DesignTime works by breaking down our design into a set of timing paths, each has a
startpoint and an endpoint.
An example timing report is shown in Figure 107; note that the time units are those defined
in the technology library, which is nanoseconds in our case. The report lists, for each
timing path, the incremental contribution of each individual cell, the total path delay,
and the maximum required path delay.
As shown in the following histogram, the worst slack is 0.268 ns, which means that there is
no setup-time violation after synthesis.
Figure 108: Path Slack histogram
This report gives us the total area of the design. It calculates this area by adding the
area attributes of the gates from the technology library. The area units are usually
specified in the technology library as well, or in an associated document; in our case the
area unit is µm². Figure 109 shows an example of the area report. Notice that the
interconnect area and the total area are reported as "undefined" because, pre-layout, we
have no estimate of the interconnect area.
Figure 110: Summary of QoR Report
At first the hold-time violation was -1.88 ns, as we used compile_ultra only. We tried to
fix this violation with:
3. Opening the worst path and inserting more buffers in it, or using larger driving buffers
(this method works but there was a large number of them).
6.2 FORMAL VERIFICATION
Formal verification is a method of verifying a design without running simulations, thus
saving simulation time. It works by comparing the "implementation" design against a
"reference" (golden model) design that has already been simulated (or proven by formal
verification against a previous reference design).
• Verification: checking that each matched compare point in the implementation design is
logically equivalent to its peer in the reference design.
We used the Formality tool from Synopsys and got 4 unmatched points, as shown in Figure
111; after checking with the RTL and system teams we found that those 4 bits are redundant,
as they always evaluate to logic 1 or logic 0, and when we verified the design all compare
points passed.
7 CHAPTER SEVEN: PHYSICAL DESIGN, PLACEMENT
AND ROUTING STAGES
After the completion of the synthesis phase of a design, we can then move on to the next step,
which is placement and routing of the netlist. There are several stages in the placement and
routing (PnR) process.
The goal of physical design is to convert the synthesized netlist into a GDSII file that is
manufacturable (9). The main steps of PnR flow can be seen in the figure below.
• Libraries and Files Used During the Physical Design Process:
1- Logical Libraries
a. Provide timing and functionality information for all standard cells
b. Provide timing information for hard macros
c. Define drive/load design rules:
Max fanout
Max transition
Max/Min capacitance
d. usually the same ones used by Design Compiler during synthesis
e. Are specified with variables:
target_library
link_library
2- Physical Libraries
a. Contain physical information of standard, macro and pad cells, necessary for
placement and routing.
b. Define the placement unit tile, such as the height of placement rows, minimum width
resolution, preferred routing directions and pitch of routing tracks
c. Are specified with the command:
create_mw_lib -mw_reference_library
3- Technology Files
A technology file is provided by the technology vendor. The technology file is unique for
each technology and contains information related to metals/vias, such as:
a. Units & precision for electrical units (V, I and power)
b. Define colors and patterns of layers for displays
c. Number & name designations for each metal/vias
d. Physical & electrical characteristics of each metal/via
e. Define design rules such as min. wire width & min. wire to wire spacing
f. Contains ERC rules, Extraction rules, LVS rules
g. Provide parameterized cells for MOS capacitance
h. Create menus and commands
set tech2itf "$sc_dir/tech/rcxt/FreePDK45_10m.map"
7.2 FLOORPLANNING
Floorplanning is one of the critical and important steps in physical design; the quality of
the chip/design implementation depends on how good the floorplan is.
7.3 PLACEMENT
After the floorplanning and power-planning stage, we need to begin placing the standard cells in
uniform rows inside the core area and fix the obtained placement of the macros as well. This
stage can greatly influence the timing parameters of our design, as it specifies the finalized
placement of blocks and standard cells, thus providing a more accurate estimate of interconnect
lengths and thus delays.
Keeping the above in mind, we need to be vigilant during our checks in this stage to ensure that
the rest of the flow will go as smoothly as possible. Here we start to fix hold violations
using "set_buffer_opt_strategy -effort high", which introduces buffers and inverters to fix
timing.
The placement stage is done using the place_opt command and it has several sub-steps.
There are several options for configuring the flow of this stage according to the needs of
our design. For example, we may invoke place_opt with -congestion to encourage the tool to
place cells with the goal of minimizing congestion, or with -area_recovery, which enables
buffer removal and cell downsizing on non-critical paths; in our design we use both, with
-effort high.
At the end of this stage we run check_legality to make sure that all the cells are placed
in rows with no overlaps.
After placement we use "report_timing -delay max -max_paths 20 >
output/top_place.setup.rpt" and "report_timing -delay min -max_paths 20 >
output/top_place.hold.rpt" to check for violations; there was a hold violation of -2.01 ns,
which will be fixed in the next steps.
Figure 115: Zoomed in view of power rings and floorplanning placement
We can go on to the Clock Tree Synthesis (CTS) stage after completing the placement stage
with acceptable timing and estimates of congestion/power usage. In this stage we deal with
the clock nets that were previously treated as ideal. The parts that follow describe what
is done during CTS, as well as the inputs to the stage and the desired outputs or goals.
CTS is essentially the insertion of buffers along the clock paths in the design in order to
balance the skew (the difference in clock signal delay between clock inputs) and satisfy
the required insertion delay (the time taken by the clock signal to traverse from the clock
definition point to the sink of the clock). The balancing of clock skew is done by building
a buffer tree, as illustrated in the following figure, and the handling of insertion delay
is done by adding delay lines, as also illustrated.
Figure 116: Balancing of Clock Skews
Figure 117: Handling Insertion Delay
Figure 119: Zoomed in view after CTS
7.5 ROUTING
After the CTS stage is completed with satisfactory skew-balancing and no hold (or setup) timing
violations, we may proceed to the routing stage. In this stage the design undergoes detailed
routing, where the actual path of interconnects across different metal layers and in different
geometric configurations is determined; as expected, the area increased considerably, as
shown in Figure 120.
Figure 121: Final Layout of FPU
8 PROJECTS CODE LINKS
https://drive.google.com/drive/u/0/folders/1JWLysGlZydS-aQwMg3v_szmyj6i19_vM
SYSTEM CODE
https://drive.google.com/drive/u/0/folders/1NannWNHSFAEaqcV2o_RvE4jhI5tWiKcX
https://www.edaplayground.com/x/JvFm
https://drive.google.com/drive/folders/1-FJvMNPa0Mv9naBD6y11cY8nzqXDsSQ6
9 BIBLIOGRAPHY