Residue Number Systems (RNS)
Residue Number Systems: A New Paradigm to Datapath Optimization for Low-Power and High-Performance Digital Signal Processing Applications
Chip-Hong Chang, Amir Sabbagh Molahosseini, Azadeh Alsadat Emrani Zarandi, and Thian Fatt Tay
The last decade has witnessed the movement of application-specific digital signal processors (DSPs) [1] from a niche market to the mainstream. Almost every electronic appliance and gadget is embedded with one or more application-specific DSPs, thanks to the densification of integrated circuit (IC) technology enabled by ever-shrinking device geometry. To sustain the economy of scale by continuing this device miniaturization trend, new nanoelectronic devices such as the carbon nanotube (CNT) [2], spin transistor [3] and quantum-dot cellular automata (QCA) [4] are now sought to replace complementary metal oxide semiconductor (CMOS) technology. Before these emerging devices reach the maturity required for mass manufacturing, advancements in DSP applications have to be derived largely from architectural innovation, particularly in domain-specific computing [5]. There is room for enhancing
Chip-Hong Chang and Thian Fatt Tay are with the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore. Amir Sabbagh
Molahosseini is with the Department of Computer Engineering, Kerman Branch, Islamic Azad University, Kerman, Iran, and Azadeh Alsadat Emrani Zarandi is
with the Department of Computer Engineering, Shahid Bahonar University of Kerman, Kerman, Iran. Corresponding author e-mail: [email protected].
emerging VLSI technology for the optimization of essential hardware attributes.
It is well understood that the way numbers are represented in a digital system has an impact on all levels of design abstraction, from algorithm and architecture to circuit topology and layout. The choice of number system for the hardware implementation of an application influences its workload by dictating the number and complexity of operations required to accomplish a specific task. Since data activities depend on the circuit topologies and the stochastic properties of the inputs, the representation of data has a direct effect on the operator strength and the performance predictability. For example, although ripple carry architecture dissipates less power, it has more variations and hence greater unpredictability in its timing and power estimation, particularly in the nanometer technology nodes. Ironically, after more than forty years of enormous investment into renewing almost every relevant technology for IC design and manufacturing, the fundamental arithmetic operations and algebraic structures used in the prevalent DSPs are still based on the same conventional weighted binary number representation inherited from the earliest microprocessor design.
Residue Number System (RNS) offers an opportunity to bring energy-efficient and fast arithmetic operations into DSP systems. Representing data in RNS can limit inter-digit carry propagation. The inherently higher parallelism and sparser inter-digit communication make it amenable to voltage-frequency scaling for speed enhancement and power reduction, particularly in systems requiring a large number of arithmetic computations. It remains advantageous to layout and routing for the existing 2D and emerging 3D stacked IC technology [6] as well as field programmable logic (FPL) devices. To truly leverage the potential of RNS, applications that are uniquely suited to the characteristics of computations in the residue domain should be explored as a complete system to hide, mitigate or trade the transcoding overheads for larger benefits, instead of treating residue computations single-mindedly as drop-in replacements for the ordinary computing units in the accustomed weighted binary number system. Advantages of RNS will be exemplified from this perspective by new applications such as reliability enhancement in wireless sensor networks, packet processing and routing for mobile ad hoc networks, privacy protection in cloud computing, hardware acceleration for digital signal processing algorithms, leakage resistant arithmetic for cryptographic systems and fault-tolerant hybrid memory design.
In the next section of the paper, the fundamental concepts of RNS, including the common notations, definitions and general architecture and transcoding overheads, are introduced. The applications and effects of using RNS are described in Section III. The aim is to present these applications in a way that will stimulate the ingenious use of RNS for new domain specific computing. Section IV discusses the influences and opportunities of technology evolution of implementation platforms on RNS-based computations. Finally, the paper is concluded in Section V with the envisioned future of RNS in the context of emerging applications and technologies.

II. RNS Background

A. Motivation and History
RNS is based on a puzzle introduced by the Chinese mathematician Sun-Tzu, which was later named the Chinese Remainder Theorem (CRT) [7]. Based on CRT, Harvey Garner [8] invented RNS in 1959. It has several interesting number theoretic properties and unique features that can be used to boost the speed of certain electronic computations [7]. The carry-propagation chain of the conventional binary number system was then the main bottleneck of fast arithmetic operation and became the key motivation driving researchers to venture into this alternative number system, for which the residue arithmetic operation in each modulus channel is independent and carry-free. Cheney in 1961 [9] used these features of RNS to design a digital correlator ten times faster than one based on the conventional binary number system. This correlator was the first system-level design based on RNS. A year later, Guffin designed a special-purpose digital computer for solving simultaneous equations using RNS with a great speed advantage [10]. To broaden its applications, researchers were motivated to solve the difficult RNS operations in order to achieve overall performance improvement for general digital computing systems [11]. Therefore, division, overflow detection, sign detection and magnitude comparison have also come into the limelight of RNS research since 1962 [12], [13]. Meanwhile, further improvements of the essential RNS arithmetic units,
[Figure 1 block labels: Arithmetic Channels (Modulo), with "R" markers; Inter-Modular Operations (Scaling, Sign Detection, Magnitude Comparison); Control Unit; shared memory; Reverse Converter (Residues-to-Integer Conversion); Weighted Integer Output.]

B. RNS System Components
An RNS is characterized by a set of $N$ pairwise relatively prime numbers known as moduli $m_i$ for $i = 1, 2, \ldots, N$. The dynamic range $M$ of data representable in RNS is determined by the product of all the moduli. An unsigned integer $X$ within $M$ can be uniquely represented using residue digits, which are computed by taking the least non-negative remainder of the division of $X$ by $m_i$. To represent a signed integer, $M$ is divided into two sub-ranges. The lower half and upper half ranges are used to represent positive and negative integers, respectively [7].
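The definitions above can be sketched in a few lines of Python. This is only an illustrative software model, not the hardware forward converter discussed in the article; the function names `dynamic_range`, `residues` and `encode_signed` are hypothetical, and the exact split of the signed range (negative values occupying the upper half $[\lceil M/2 \rceil, M-1]$) is an assumption about how the two sub-ranges are assigned.

```python
from math import gcd, prod


def dynamic_range(moduli):
    """Return M, the product of the moduli, after checking they are pairwise coprime."""
    for i, a in enumerate(moduli):
        for b in moduli[i + 1:]:
            if gcd(a, b) != 1:
                raise ValueError(f"moduli {a} and {b} are not relatively prime")
    return prod(moduli)


def residues(x, moduli):
    """Residue digits of an unsigned integer 0 <= x < M (least non-negative remainders)."""
    return tuple(x % m for m in moduli)


def encode_signed(x, moduli):
    """Residue digits of a signed integer.

    Assumed convention: non-negative values use the lower half of [0, M),
    negative values are mapped to the upper half as x + M.
    """
    M = dynamic_range(moduli)
    if not -(M // 2) <= x <= (M - 1) // 2:
        raise OverflowError("value outside the signed dynamic range")
    return residues(x % M, moduli)


if __name__ == "__main__":
    moduli = (3, 4, 5)                   # M = 60
    print(dynamic_range(moduli))
    print(residues(15, moduli))          # residue digits of 15
    print(encode_signed(-14, moduli))    # same digits as the unsigned value 46
```

With the moduli set {3, 4, 5}, the signed value -14 and the unsigned value 46 share one residue representation, which is why the sign has to be recovered from the position of the value within the dynamic range.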
Figure 1. Overview of Residue Number System.

The hardware implementation of an RNS based application is greatly dependent on the chosen moduli set. Generally, there are two types of moduli sets: i) sets with arbitrary moduli [19]–[22]; ii) sets with specific power-of-two related moduli of the forms $2^n$ and $2^n \pm 1$ [23]–[28]. These two types of moduli sets have their own advantages. A moduli set with arbitrary moduli leads to a more flexible and balanced RNS system due to the abundance of coprime integers of comparable word-lengths. On the other hand, a moduli set with specific power-of-two moduli offers attractive mathematical properties for manipulation to simplify the arithmetic units and converter designs. Fig. 1 shows the typical components used for building an RNS based application. The role of the forward converter is to compute the residue digits of the inputs represented in the weighted binary number system. In each modulus channel, modular arithmetic operations are performed on the corresponding residue digits independently and their carry outputs do not propagate across modulus channels. Therefore, the smaller modular arithmetic operations can be carried out in parallel and at a faster speed than in the two's complement number system (TCS) for the identical dynamic range. The role of the reverse converter is to reconstruct the integer from its residue representation. It serves as an interface to transfer the computation results of the modular arithmetic units to other TCS based systems. Among these main building blocks, the reverse converter has the greatest complexity. Other operations such as sign detection [29], [30], magnitude comparison [31], [32], overflow detection [33], division [7] and scaling [34]–[36] are also non-trivial to implement in hardware. Of these, sign detection can be considered a requisite step for magnitude comparison after the modular subtraction of the two residue representations being compared. The sign of a residue representation can be determined by checking whether the reverse-converted integer falls into the lower or upper half of the dynamic range. Unlike forward conversion, modular addition, subtraction and multiplication, these operations involve inter-modular computations that require more than one residue and the product of several moduli to compute. Due to the lack of correlation among residues to resolve the data dependency in their composite moduli, inter-modular operations cannot be carried out in parallel and independently in each modulus channel. It should be noted that division is also a difficult operation that cannot be easily parallelized in TCS. It is usually avoided in DSP algorithms, and if there is a need for its execution in the residue domain, methods such as subtractive and multiplicative division can be considered [7].
A Redundant RNS (RRNS) [37], [38] with error detection and correction capability can be formed by adding redundant moduli into an existing moduli set to extend the legitimate range of the original information moduli. The extended range is called the illegitimate range. The redundant modulus channels in Fig. 1 are annotated by "R". In RRNS, residue errors can be detected from the recovered magnitude of the received residue digits. If the magnitude falls into the illegitimate range, it can be concluded that there exist one or more residue digit errors, provided that the number of residue digit errors is not more than twice the number of redundant moduli. If the number of residue errors does not exceed the number of redundant moduli, they can be located and corrected by subtracting the error digits from the received residue digits. As modular arithmetic operations are performed on the operands in residue representation, RRNS can correct arithmetical processing errors. This is a unique and powerful capability that is missing in other error correction codes used for the reliable delivery or storage of digital data. Furthermore, the residue digits in RRNS are independent of each other, which prevents the residue error in one modulus channel from propagating to another channel. Therefore, errors introduced into the residue digits have only a localized effect. The erroneous modulus channels can be easily removed without affecting the other modulus channels, provided that the dynamic range of the remaining information moduli after the removal of the erroneous modulus channels is sufficient for further arithmetic operation.
Fig. 2 shows a numerical example of the frequently encountered operations in an RNS and an RRNS defined by the moduli sets {3, 4, 5} and {3, 4, 5, 7}, respectively.

[Figure 2, values recovered from the figure (the last column is the redundant modulus channel):
Modulus          3     4     5     7
First operand    0     3     0     1
Second operand   1     2     1     0
Sum              1→2   1     1     1
Difference       2     1     4     1
Product          0     2     0     0
Subtraction per channel: |0 − 1|_3 = 2, |3 − 2|_4 = 1, |0 − 1|_5 = 4, |1 − 0|_7 = 1.]
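The channel-wise operations and the range-based error check described for the moduli sets {3, 4, 5} and {3, 4, 5, 7} can be reproduced with the following minimal Python sketch. The residue digits are the ones that appear in the Fig. 2 example; the helper names (`channelwise`, `crt`, `is_legitimate`) are hypothetical, the reverse conversion uses a plain textbook CRT rather than any converter architecture from the article, and the legitimacy test assumes an unsigned legitimate range [0, 60), so only the sum is decoded here.

```python
from math import prod
from operator import add, mul, sub

INFO_MODULI = (3, 4, 5)        # information moduli, legitimate range 60
RRNS_MODULI = (3, 4, 5, 7)     # one redundant modulus appended, total range 420


def channelwise(op, a, b, moduli):
    """Apply op independently in every modulus channel (no inter-channel carries)."""
    return tuple(op(ra, rb) % m for ra, rb, m in zip(a, b, moduli))


def crt(residues, moduli):
    """Reverse conversion by the Chinese Remainder Theorem (textbook form)."""
    total = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        partial = total // m
        x += r * partial * pow(partial, -1, m)   # modular inverse, Python 3.8+
    return x % total


def is_legitimate(residues):
    """Assumed unsigned check: the decoded RRNS value must lie below prod(INFO_MODULI)."""
    return crt(residues, RRNS_MODULI) < prod(INFO_MODULI)


# Residue digits of the two operands for moduli (3, 4, 5, 7), as listed in Fig. 2.
a = (0, 3, 0, 1)
b = (1, 2, 1, 0)

total = channelwise(add, a, b, RRNS_MODULI)        # (1, 1, 1, 1)
difference = channelwise(sub, a, b, RRNS_MODULI)   # (2, 1, 4, 1)
product = channelwise(mul, a, b, RRNS_MODULI)      # (0, 2, 0, 0)

# Inject the single-digit error shown in Fig. 2 (first digit of the sum, 1 -> 2).
corrupted = (2,) + total[1:]

if __name__ == "__main__":
    print(total, difference, product)
    print(crt(total, RRNS_MODULI), is_legitimate(total))
    print(crt(corrupted, RRNS_MODULI), is_legitimate(corrupted))
```

The error-free sum decodes to 1, inside the legitimate range, whereas the corrupted digits decode to 281, which lies in the illegitimate range and therefore reveals the error. Locating and correcting the faulty digit is not attempted in this sketch.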
scaling factor is one of the moduli [34]–[36].
of input sequences because multiplication by zero is undefined in the index domain. The total
[Figure residue: channel blocks (Channel C_{k/2-1}) with adders and a multiplier; a node graph; blocks labeled "Data", "Conversion to Residues" and "8 Moduli {m_1, m_2, ..., m_8}".]
of small consecutive primes that satisfy the condition $M \ge 2^w$ can be selected as the moduli without increasing the number of forwarders, and yet the message can still be reconstructed if one or more remainders are lost. Given N and f, an ordered list of prime numbers stored in each sensor node's memory and a set of lookup-tables (one for each possible w) can be used to retrieve the unique minimum prime set with f admissible faults in a distributed manner to provide an optimal tradeoff between reliability and energy consumption, by taking into account erasure channels, physical layer overhead, and actual computational resources of all nodes in a real WSN.
Another RNS application in network systems is the technique proposed in [70], which utilizes the modular nature of residue code to reduce the number of dropped messages in ad hoc networks caused by malicious nodes, buffer overflows, node movement and collision. This technique incorporates RRNS code into a modified version of the Ad hoc On-demand Multipath Distance Vector (AOMDV) routing protocol, where a message is split into N parts and sent via multiple routes to the destination. At the receiver side, the message can be fully recovered as long as the number of parts reaching the destination is more than N/2, with the condition that all the parts do not travel via the same route. The performance of the proposed technique was measured by counting the number of messages successfully delivered via nodes which were randomly distributed and moved to random waypoints. The results showed that the RNS-based approach outperformed the conventional AOMDV method in different simulated scenarios.

C. Cloud Storage
The concerns of unexpected termination of services and breach of data confidentiality by current cloud storage providers can be addressed by RRNS [71]. In order to store a file over the cloud with greater reliability in terms of long-term availability and privacy, it is split into p + r residue-segments based on RRNS, where r is the number of redundant moduli. Each chunk is encoded by BASE-64 before it is encrypted with a symmetric algorithm to encapsulate the binary data in the payload of an XML wrapper file. Finally, each encrypted chunk is sent to a different cloud storage provider. An XML metadata map file describing the locations and retrieval method of the different chunks is created and safeguarded by the client. This approach is illustrated in Fig. 6. In the event that any of the storage devices breaks down temporarily or permanently, the original file can still be easily reconstructed by the client using only p chunks, including the redundant residue-segments. This storage approach not only protects the stored files from system failures, but also prevents the cloud provider from accessing the stored files because only the owner knows the chunks' storage locations and their access method. Furthermore, a parallel download
of distinct chunks from different cloud storage providers also results in efficient bandwidth utilization. The storage size of a file is a function of r and the minimum number of moduli required to reconstruct the file. On equal error tolerance, the ratio of the storage required by the traditional redundancy approach that stores multiple copies to that required by this approach was found to be about 1.75 [71].

D. Fault-Tolerant Computing
The reliability of electronic circuits is greatly hampered by aggressive device scaling. To minimize the yield losses and product failures every year, fault tolerance has emerged as a new design dimension of utmost significance to the reliable operation of nano-electronic circuits. Several techniques, including self-checking logic, module replication, error correction code and reconfiguration [72], have been developed to enhance the dependability of electronic circuits designed out of fallible devices, but none has its error isolation capacity inherent in the arithmetic operations like RNS. The lack of ordered significance among the residues of an RNS implies that errors due to processing noise in one residue digit will not contaminate other residue digits, and a faulty modulus channel can be shut down if the surviving channels have adequate dynamic range. Otherwise, additional moduli can be added to provide the desirable fault tolerance.
One of the applications that incorporates RRNS based error detection and correction code is hybrid memory [73]. In hybrid memories, non-CMOS devices are used as memory cells together with CMOS-based peripheral circuits. Compared to conventional CMOS memory cells, hybrid memory offers bigger data storage capacity but has a higher defect rate of 10% or more due to the high manufacturing process variability of emerging nano-devices. The first RRNS code designed for defect-tolerant memory systems consists of six moduli of the forms $\{2^n + 1, 2^n, 2^{n-1} - 1, 2^{n-2} - 1, 2^{n-3} - 1, 2^{n-4} + 1\}$, where $2^n + 1$ and $2^n$ are the information moduli and $2^{n-1} - 1$, $2^{n-2} - 1$, $2^{n-3} - 1$ and $2^{n-4} + 1$ are the redundant moduli. Contrary to conventional RRNS code, the redundant moduli are smaller than the information moduli, which causes ambiguity in error correction. The ambiguity is eliminated using a maximum likelihood decoding technique. As a result, more data can be stored using this scheme, as its codeword length is shorter than the Reed-Solomon (RS) code and conventional RRNS (C-RRNS) code for 16-bit, 32-bit and 64-bit memories. As shown in Fig. 7, the input data is first converted to a set of residues by the RRNS encoder. The residues are then concatenated to create the RRNS codeword before
Figure 10. An embedded RNS RISC pipeline processor [103]. [Block labels recovered from the figure: Channels, Adder, Multiplier, Memory, Write Back.]

have better performance if they are implemented in the same way on FPGA. The same goes for the advantages of implementing isomorphic multipliers in the early generation of general purpose FPGA resources. Therefore, new ways of utilizing the latest internal structure of FPGA should be exploited, as exemplified by the fast modulo $2^n - 1$ and $2^n + 1$ adders that take advantage of the internal carry propagate structure of modern FPGAs [98]. To better exploit the new features of the latest FPGAs, modular additions and multiplications of RNS were implemented using a ROM based approach as opposed to the classical MUX based adders [99]. Different optimization techniques are proposed for the design of basic building blocks, including the forward and reverse converters, based on moduli sets selected to optimally use the new 6-input lookup tables of the complex logic blocks. High speed and low resource utilization rate were demonstrated by applying these techniques to the design of different orders of RNS filters. Specifically, 40% saving on resource utilization over the TCS implementation was reported for a 256-tap FIR filter implemented by the moduli set {64, 31, 29, 23, 19, 17, 13} on the same Xilinx FPGA [99].
Due to the hot-spot problem, it is impossible to increase the throughput of a microprocessor by increasing the clock rate. Thus the general purpose processor is evolving towards the multicore architecture. The trend

algorithms implemented in RNS can be generated automatically and transparently to the system designer.
Although RNS is generally conceived to be a poor fit for the implementation of programmable processors due to its transcoding overheads, a patent was filed for an RNS general-purpose arithmetic and logic unit (ALU) capable of performing both integer and fractional operations on very large values [102]. This RNS-based ALU is used as a co-processor to a conventional CPU to accelerate computations, as shown in Fig. 9.
Recently, a multi-tier approach was adopted to design a 32-bit RNS extension to the embedded reduced instruction set computer (RISC) processor based on the moduli set $\{2^n - 1, 2^{n+k}, 2^n + 1\}$ [103]. By balancing the modular multiplier delay across the three modulus channels, the values of n and k were fixed at 9 and 15, respectively. The RNS adder was designed for three-operand addition given that its speed was still significantly faster than the two-operand TCS adder. This adder was used for two-operand subtraction with additional input logic to conditionally negate-and-correct the second operand. The RNS adder/multiplier, a fully carry-save forward converter and a reverse converter were embedded in the execute stage of a RISC instruction pipeline. Together with the regular binary ALUs, they constituted a hybrid RNS processor shown in Fig. 10. To allow the conversion operations to be scheduled in parallel with some other computation, separate instructions for RNS addition (RADD), subtraction (RSUB) and multiplication (RMUL), and for converting operands from TCS to RNS (FC) and vice versa (RC), were added into an existing RISC instruction set architecture. A compiler was developed to analyze the application data dependency graph for RNS profitability and map the potential sub-graphs to RNS instructions. The instruction scheduling was then performed to hide the conversion latency and minimize the runtime. This RNS-based embedded processor was able to achieve more than 50% of power saving in comparison with regular TCS for various DSP benchmark kernels.
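To make the division of labour between the conversion instructions (FC, RC) and the channel-wise instructions (RADD, RSUB, RMUL) concrete, the following Python sketch models them purely in software. It is not the processor of [103]: the function bodies, the signed-range convention and the two-operand form of `radd` are assumptions made for illustration, and the middle modulus is read as $2^{n+k}$ with n = 9 and k = 15, which is one interpretation of the moduli set given in the text.

```python
from math import prod

N, K = 9, 15                                   # values of n and k quoted in the text
MODULI = (2**N - 1, 2**(N + K), 2**N + 1)      # assumed reading: (511, 2**24, 513)
M = prod(MODULI)                               # dynamic range of the three channels


def fc(x):
    """FC model: signed integer -> residues (negatives mapped to the upper half of [0, M))."""
    if not -(M // 2) <= x <= (M - 1) // 2:
        raise OverflowError("operand outside the signed dynamic range")
    return tuple(x % m for m in MODULI)


def radd(a, b):
    """RADD model: independent modular addition in each channel."""
    return tuple((ra + rb) % m for ra, rb, m in zip(a, b, MODULI))


def rsub(a, b):
    """RSUB model: independent modular subtraction in each channel."""
    return tuple((ra - rb) % m for ra, rb, m in zip(a, b, MODULI))


def rmul(a, b):
    """RMUL model: independent modular multiplication in each channel."""
    return tuple((ra * rb) % m for ra, rb, m in zip(a, b, MODULI))


def rc(residues):
    """RC model: residues -> signed integer, using a textbook CRT."""
    x = 0
    for r, m in zip(residues, MODULI):
        partial = M // m
        x += r * partial * pow(partial, -1, m)   # modular inverse, Python 3.8+
    x %= M
    return x - M if x > (M - 1) // 2 else x


if __name__ == "__main__":
    a, b, c, d = 1234, -567, 89, 1000
    # Convert once, chain several residue-domain operations, convert back once.
    result = rc(rsub(radd(rmul(fc(a), fc(b)), fc(c)), fc(d)))
    print(result, result == a * b + c - d)
```

The usage at the bottom reflects the scheduling idea in the text: the conversion cost is paid once on entry and once on exit, while the intermediate multiply, add and subtract stay in the residue domain. The result is only valid while every intermediate value stays within the signed dynamic range.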