Research and Analysis of Floating-Point Adder Principle
DOI: 10.54254/2755-2721/8/20230092
Fengyuan Yang
School of Materials Science and Engineering, Northeastern University, Shenyang,
Liaoning Province, China, 110819
Abstract. As computers are used ever more widely, the adder, as the most basic arithmetic unit, is central to progress in the computer field. This paper analyzes the principles of the one-bit adder and the floating-point adder through literature analysis. The one-bit adder is the most basic type of traditional adder; other types include the ripple-carry adder, the carry-lookahead adder, and so on. The purpose of this paper is to explain the basic principles of adders, among which the IEEE-754 binary floating-point standard is especially important. Because the traditional fixed-point adder is the basis of the floating-point adder, understanding it can suggest new directions for future floating-point adder optimization. This paper finds that the floating-point adder is one of the most widely used components in today's signal processing systems, and that improving the floating-point adder is therefore necessary.
1. Introduction
Nowadays, human society has entered the information age, and information computing and
storage technologies are the basis of its development. Computers, microelectronics, and
communication technologies are already the core technologies that drive social progress.
In a microelectronic processing system, the four basic arithmetic operations (addition,
subtraction, multiplication, division) can all be reduced to addition, so the adder is a very
important arithmetic unit in computer logic. Beyond this, the adder can also perform program
counting and calculate effective addresses [1]. During operation, data are processed and stored
as 0s and 1s. Data types are divided into fixed-point and floating-point, and the floating-point
representation is widely used. According to Stuart F. Oberman, more than 55% of basic
floating-point operations are additions, so the floating-point adder is one of the most
significant components of a microprocessor [2]. Fixed-point adders are the most basic and most
commonly used part of various digital systems and are also fully used in floating-point
operations: the mantissa module in floating-point addition is essentially a fixed-point addition.
Therefore, understanding the fixed-point adder and designing a high-speed fixed-point adder is
essential to improving the performance of floating-point adders.
© 2023 The Authors. This is an open access article distributed under the terms of the Creative Commons Attribution License 4.0
(https://creativecommons.org/licenses/by/4.0/).
Proceedings of the 2023 International Conference on Software Engineering and Machine Learning
This paper introduces the most basic fixed-point adder, the one-bit adder, using literature
review and summarization, in order to explain its internal principle and give a preliminary
understanding of how adders work. The third part introduces the floating-point adder, focusing
on the IEEE-754 binary floating-point standard. This research can help beginners understand
floating-point adders and support subsequent research on high-performance floating-point
adders.
2. One-bit adder
The one-bit adder is the most basic type of adder, and other, higher-performance adders are
built on it. One-bit adders include the half adder and the full adder.
Truth table of the full adder (carry-in Ri, addends Ai and Bi, carry-out Ri+1, sum Si):

Ri  Ai  Bi  Ri+1  Si
0   0   0    0    0
0   0   1    0    1
0   1   0    0    1
0   1   1    1    0
1   0   0    0    1
1   0   1    1    0
1   1   0    1    0
1   1   1    1    1
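The truth table above can be reproduced with a short Python sketch, assuming the standard decomposition of a full adder into two half adders (the function names are ours, not the paper's):

```python
def half_adder(a: int, b: int):
    """Add two one-bit inputs; return (sum, carry)."""
    return a ^ b, a & b

def full_adder(a: int, b: int, carry_in: int):
    """Add two one-bit inputs plus a carry-in; return (sum, carry_out)."""
    s1, c1 = half_adder(a, b)        # first half adder: a + b
    s2, c2 = half_adder(s1, carry_in)  # second half adder: partial sum + carry-in
    return s2, c1 | c2               # a carry from either stage propagates

# Reproduce the truth table: columns Ri, Ai, Bi, Ri+1, Si
for r in (0, 1):
    for a in (0, 1):
        for b in (0, 1):
            s, c = full_adder(a, b, r)
            print(r, a, b, c, s)
```

Chaining n such full adders, with each carry-out feeding the next stage's carry-in, yields the ripple-carry adder mentioned in the abstract.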
The true value of a floating-point number consists of the sign S, the base R (implicitly 2),
the exponent E, and the mantissa M. The sign indicates whether the number is positive or
negative. The exponent E is a fixed-point integer, represented in complement or biased (shift)
code, whose number of bits determines the range of representable values. The mantissa M is a
fixed-point fraction, represented in true form or complement code, whose number of bits
determines the precision of the number.
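The relationship between these parts can be written as value = (-1)^S × M × R^E. A minimal sketch, assuming base R = 2 and an already-decoded mantissa and exponent (the names are illustrative, not from the paper):

```python
def float_value(sign_bit: int, exponent: int, mantissa: float, base: int = 2) -> float:
    """Reconstruct a floating-point true value from its sign, exponent, and mantissa."""
    return (-1) ** sign_bit * mantissa * base ** exponent

# Example: S = 0, E = 3, M = 1.25 gives 1.25 * 2^3 = 10.0
print(float_value(0, 3, 1.25))
# Example: S = 1, E = -1, M = 1.5 gives -(1.5 * 2^-1) = -0.75
print(float_value(1, -1, 1.5))
```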
The IEEE-754 binary floating-point standard was developed by the Institute of Electrical and
Electronics Engineers in 1985 and has been the industry standard for floating-point arithmetic
ever since [6].
3.2.1. Four formats. The IEEE-754 standard defines four formats: single-precision
floating-point, double-precision floating-point, extended double-precision floating-point
(SPARC), and extended double-precision floating-point (x86).
Table 3. Bit distribution of the four floating-point formats [5]

Format                                     Total bits   Sign bit S   Exponent E   Mantissa M
Single-precision floating-point            32 bit       1 bit        8 bit        23 bit
Double-precision floating-point            64 bit       1 bit        11 bit       52 bit
Extended double-precision (SPARC)          128 bit      1 bit        15 bit       112 bit
Extended double-precision (x86)            80 bit       1 bit        15 bit       63 bit + 1 bit (explicit leading significant bit)
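The widths in Table 3 for the two native formats can be checked with a short sketch using Python's standard-library struct module (the two extended formats are not directly accessible from portable Python):

```python
import struct

# Total widths of the native formats, in bits ('>f' / '>d' are the
# standard-size big-endian codes for IEEE-754 single and double).
print(struct.calcsize('>f') * 8)  # 32 (single precision)
print(struct.calcsize('>d') * 8)  # 64 (double precision)

# Single precision splits 32 bits as 1 sign + 8 exponent + 23 mantissa.
# Example: -1.5 = -1.1(binary) * 2^0, so sign = 1, biased exponent = 127,
# and the stored mantissa is 100...0 (the leading 1 is implicit).
bits = struct.unpack('>I', struct.pack('>f', -1.5))[0]
sign = bits >> 31
exponent = (bits >> 23) & 0xFF       # 8-bit biased exponent (bias 127)
mantissa = bits & ((1 << 23) - 1)    # 23 explicit mantissa bits
print(sign, exponent, mantissa)
```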
3.2.3. Double-precision floating-point. A double-precision floating-point number has 64 bits:
52 explicit bits for the mantissa M (53 significant bits counting the implied leading bit), 11
bits for the exponent E, and one bit for the sign S. Bits 0 through 51 are the mantissa M, with
bit 0 the least significant bit of the mantissa. Bits 52 to 62 are the 11-bit exponent E, with
bit 52 the least significant bit of the exponent. The 11 exponent bits can hold field values
between 0 and 2047; a bias (Bias = 1023) is added to the true exponent and the sum is stored in
the exponent field, which allows negative exponents to be represented without a separate sign.
Bit 63 is the sign bit, where positive numbers are expressed as 0 and negative numbers as 1.
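This bit layout can be demonstrated with a small sketch using Python's standard-library struct module (an illustration of the format, not code from the paper):

```python
import struct

def double_fields(x: float):
    """Split an IEEE-754 double into (sign, biased exponent field, mantissa field)."""
    bits = struct.unpack('>Q', struct.pack('>d', x))[0]  # 64-bit integer view
    sign = bits >> 63                     # bit 63: sign
    exponent = (bits >> 52) & 0x7FF       # bits 52-62: 11-bit biased exponent (bias 1023)
    mantissa = bits & ((1 << 52) - 1)     # bits 0-51: 52 explicit mantissa bits
    return sign, exponent, mantissa

# 1.0 = +1.0 * 2^0: sign 0, stored exponent 0 + 1023 = 1023, mantissa 0.
print(double_fields(1.0))   # (0, 1023, 0)
# -2.0 = -1.0 * 2^1: sign 1, stored exponent 1 + 1023 = 1024, mantissa 0.
print(double_fields(-2.0))  # (1, 1024, 0)
```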
3.2.4. Extended double-precision floating-point (SPARC). The SPARC floating-point format is a
quadruple-precision format that occupies four 32-bit fields: 112 mantissa bits, 15 exponent
bits, and 1 sign bit. Bits 0 through 111 are the mantissa M, bits 112 through 126 are the
exponent E, and bit 127 is the sign bit S.
3.2.5. Extended double-precision floating-point (x86). The x86 extended double-precision
format occupies three consecutive 32-bit fields (96 bits) in memory. Bits 0 through 63 store
the 64-bit mantissa, the 15-bit exponent is stored in bits 64 to 78, and the sign bit is stored
in bit 79. The format actually uses only 80 bits: the upper 16 bits of the highest 32-bit field
are unused.
4. Conclusion
This paper has discussed the principle of the one-bit adder and the IEEE-754 binary
floating-point standard, and has analyzed the traditional floating-point addition algorithm,
especially the four floating-point formats, their layouts, and the differences between them.
However, this thesis only introduces the basics of these standards through a brief summary of
other literature and books; due to time constraints, the actual research design of a
floating-point adder has not been addressed. Future research can specifically explore improving
existing floating-point adders and pursue high-performance, low-power floating-point adder
designs.
References
[1] Wang Dong, Li Zhentao, Mao Erkun, Li Baofeng. CMOS VLSI Design (3rd Edition). Beijing:
    China Electric Power Press, 2008.
[2] Stuart F. Oberman. Design Issues in High Performance Floating Point Arithmetic Units
    [D]. Stanford University, Degree of Doctor of Philosophy, 1996.
[3] Ji Chao, Li Tuo, Zou Xiaofeng & Zhang Lu. (2022). Design of combinatorial logic circuit
    based on memristor. Semiconductor Technology (08), 649-659.
    doi:10.13290/j.cnki.bdtjs.2022.08.010.
[4] Dai Guangzhen, Zhao Zhenyu, Song Xingwen, Han Mingjun & Ni Tianming. (2023). Memristor
    hybrid logic circuit design and its application. Science in China: Information Science
    (01), 178-190.
[5] Wang Dayu. (2012). Research and Design of High-Performance Floating-Point Adders
    (Master's thesis, Nanjing University of Aeronautics and Astronautics).
    https://kns.cnki.net/KCMS/detail/detail.aspx?dbname=CMFD201301&filename=1012041598.nh
[6] IEEE Std 754-1985: IEEE Standard for Binary Floating-Point Arithmetic. IEEE, 1985.
[7] Feng Wei. (2009). Optimization Design of a Fast Floating-Point Adder (Master's thesis,
    University of Science and Technology of China).
    https://kns.cnki.net/KCMS/detail/detail.aspx?dbname=CMFD2010&filename=2010018994.nh