Residue Number Systems (RNS)


Feature

Residue Number Systems: A New Paradigm to Datapath Optimization for Low-Power and High-Performance Digital Signal Processing Applications

Chip-Hong Chang, Amir Sabbagh Molahosseini, Azadeh Alsadat Emrani Zarandi, and Thian Fatt Tay

Abstract

Residue Number System (RNS) is a non-weighted number system which was proposed by Garner back in 1959 to achieve fast implementation of addition, subtraction and multiplication operations in special-purpose computations. Unfortunately, RNS did not turn out to be a popular alternative to the two's complement number system in those days. The rigidity of the instruction set architectures of the market-dominant computers and microprocessors of the time was the main barrier to sustaining the development of RNS-based applications. In recent years, advances in semiconductor technology have revived interest in reconsidering RNS for application-specific computing. There are at least two unique motivations which make RNS computations more attractive and applicable in modern digital signal processing applications. Firstly, the modular and distributive properties of RNS are used to achieve performance improvements, especially in the emerging distributed and ubiquitous computing platforms such as cloud, wireless ad hoc networks, and applications which require tolerance against soft errors. Secondly, energy efficiency has become a key driver in the continual densification of complementary metal oxide semiconductor (CMOS) digital integrated circuits. The high degree of computational parallelism in RNS offers a new degree of freedom to optimize energy performance, particularly for very long word length arithmetic such as that involved in the hardware implementation of cryptographic algorithms. Our aim in this paper is to show this revolution by discussing interesting developments in RNS and to foster the innovative use of RNS for more applications. Different applications of RNS are investigated to demonstrate how this unconventional number system can be leveraged to benefit their implementation.

Digital Object Identifier 10.1109/MCAS.2015.2484118
Date of publication: 19 November 2015

26 IEEE circuits and systems magazine 1531-636X/15©2015IEEE fourth QUARTER 2015

Chip-Hong Chang and Thian Fatt Tay are with the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore. Amir Sabbagh Molahosseini is with the Department of Computer Engineering, Kerman Branch, Islamic Azad University, Kerman, Iran, and Azadeh Alsadat Emrani Zarandi is with the Department of Computer Engineering, Shahid Bahonar University of Kerman, Kerman, Iran. Corresponding author e-mail: [email protected].

I. Introduction

The last decade has witnessed the movement of application-specific digital signal processors (DSPs) [1] from a niche market to the mainstream. Almost every electronic appliance and gadget is embedded with one or more application-specific DSPs, thanks to the densification of integrated circuit (IC) technology enabled by the ever shrinking device geometry. To sustain the economy of scale by the continuity of this device miniaturization trend, new nanoelectronic devices such as carbon nanotube (CNT) [2], spin transistor [3] and quantum-dot cellular automata (QCA) [4] are now sought to replace the complementary metal oxide semiconductor (CMOS) technology. Before these emerging devices reach the maturity for mass manufacturability, advancements in DSP applications have to be derived largely from architectural innovation, particularly in domain-specific computing [5]. There is room to enhance the performance of domain-specific computing within the energy and power budget by optimizing its data paths. Leap changes and improvements are likely to come from a remake of the arithmetic operations used to implement the DSP algorithms. This is because the number representation adopted in the accustomed DSPs has the lowest base and the bits are order-dependent. DSPs implemented by algorithms based on such a weighted binary number system suffer from the curse of dimensionality due to the inevitable long chain of carry propagation. It has limited "parallelism" and "modularity" to fully utilize the


emerging VLSI technology for the optimization of essential hardware attributes.

It is well understood that the way numbers are represented in a digital system has an impact on all levels of design abstraction, from algorithm and architecture to circuit topology and layout. The choice of number system for the hardware implementation of an application influences its workload by dictating the number and complexity of operations required to accomplish a specific task. Since data activities depend on the circuit topologies and the stochastic properties of the inputs, the representation of data has a direct effect on the operator strength and the performance predictability. For example, although ripple carry architecture dissipates less power, it has more variations and hence greater unpredictability in its timing and power estimation, particularly in the nanometer technology nodes. Ironically, after more than forty years of enormous investment into renewing almost every relevant technology for IC design and manufacturing, the fundamental arithmetic operations and algebraic structures used in the prevalent DSPs are still based on the same conventional weighted binary number representation inherited from the earliest microprocessor design.

Residue Number System (RNS) offers an opportunity to bring energy-efficient and fast arithmetic operations into DSP systems. Representing data in RNS can limit inter-digit carry propagation. The inherently higher parallelism and sparser inter-digit communication make it amenable to voltage-frequency scaling for speed enhancement and power reduction, particularly in systems requiring a large number of arithmetic computations. It remains advantageous to layout and routing for the existing 2D and emerging 3D stacked IC technology [6] as well as field programmable logic (FPL) devices. To truly leverage the potential of RNS, applications that are uniquely suited to the characteristics of computations in the residue domain should be explored as a complete system to hide, mitigate or trade the transcoding overheads for larger benefits, instead of treating residue computations single-mindedly as drop-in replacements for the ordinary computing units in the accustomed weighted binary number system. Advantages of RNS will be exemplified from this perspective by new applications such as reliability enhancement in wireless sensor networks, packet processing and routing for mobile ad hoc networks, privacy protection in cloud computing, hardware acceleration for digital signal processing algorithms, leakage resistant arithmetic for cryptographic systems and fault-tolerant hybrid memory design.

In the next section of the paper, the fundamental concepts of RNS, including the common notations, definitions, general architecture and transcoding overheads, are introduced. The applications and effects of using RNS are described in Section III. The aim is to present these applications in a way that will stimulate the ingenious use of RNS for new domain-specific computing. Section IV discusses the influences and opportunities of technology evolution of implementation platforms on RNS-based computations. Finally, the paper is concluded in Section V with the envisioned future of RNS in the context of emerging applications and technologies.

II. RNS Background

A. Motivation and History
RNS is based on a puzzle introduced by the Chinese mathematician Sun-Tzu, which was later named the Chinese Remainder Theorem (CRT) [7]. Based on CRT, Harvey Garner [8] invented RNS in 1959. It has several interesting number theoretic properties and unique features that can be used to boost the speed of certain electronic computations [7]. The carry-propagation chain of the conventional binary number system was then the main bottleneck of fast arithmetic operation and became the key motivation driving researchers to venture into this alternative number system, for which the residue arithmetic operation in each modulus channel is independent and carry-free. Cheney in 1961 [9] used these features of RNS to design a digital correlator ten times faster than one based on the conventional binary number system. This correlator was the first system-level design based on RNS. A year later, Guffin designed a special-purpose digital computer for solving simultaneous equations using RNS with a great speed advantage [10]. To broaden its applications, researchers were motivated to solve the difficult RNS operations in order to achieve overall performance improvement for general digital computing systems [11]. Therefore, division, overflow detection, sign detection and magnitude comparison have also come into the limelight of RNS research since 1962 [12], [13]. Meanwhile, further improvements of the essential RNS arithmetic units,



such as modulo adder, multiplier, forward converter and reverse converter [14], [15], remain a hot pursuit. Their revitalization is mainly driven by new applications such as error detection and correction, advanced DSP and cryptography [16]–[18]. Recently, the demand for parallel and low-power computations has led to the adaptation of RNS in emerging applications such as wireless sensor networks (WSNs), cloud computing and shared memory.

B. RNS System Components
An RNS is characterized by a set of N pairwise relatively prime numbers known as moduli $m_i$ for $i = 1, 2, \ldots, N$. The dynamic range M of data representable in RNS is determined by the product of all the moduli. An unsigned integer X within M can be uniquely represented using residue digits, which are computed by taking the least non-negative remainder of the division of X by $m_i$. To represent a signed integer, M is divided into two sub-ranges. The lower half and upper half ranges are used to represent positive and negative integers, respectively [7].

The hardware implementation of an RNS based application is greatly dependent on the chosen moduli set. Generally, there are two types of moduli sets: i) sets with arbitrary moduli [19]–[22]; ii) sets with specific power of two related moduli in the forms of $2^n$ and $2^n \pm 1$ [23]–[28]. These two types of moduli sets have their own advantages. A moduli set with arbitrary moduli leads to a more flexible and balanced RNS system due to the abundance of coprime integers of comparable word-lengths. On the other hand, a moduli set with specific power of two moduli offers attractive mathematical properties for manipulation to simplify the arithmetic units and converter designs. Fig. 1 shows the typical components used for building an RNS based application. The role of the forward converter is to compute the residue digits of the inputs represented in the weighted binary number system. In each modulus channel, modular arithmetic operations are performed on the corresponding residue digits independently and their carry outputs do not propagate across modulus channels. Therefore, the smaller modular arithmetic operations can be carried out in parallel and at a faster speed than in the two's complement number system (TCS) for the identical dynamic range. The role of the reverse converter is to reconstruct the integer from its residue representation. It serves as an interface to transfer the computation results of the modular arithmetic units to other TCS based systems. Among these main building blocks, the reverse converter has the greatest complexity. Other operations such as sign detection [29], [30], magnitude comparison [31], [32], overflow detection [33], division [7] and scaling [34]–[36] are also non-trivial to implement in hardware. Of these, sign detection can be considered a requisite step for magnitude comparison after the modular subtraction of the two residue representations being compared. The sign of a residue representation can be determined by checking if the reverse-converted integer falls into the lower or upper half of the dynamic range. Unlike forward conversion, modular addition, subtraction and multiplication, these operations involve inter-modular computations that require more than one residue and the product of several moduli to compute. Due to the lack of correlation among residues to resolve the data dependency in their composite moduli, inter-modular operations cannot be carried out in parallel and independently in each modulus channel. It should be noted that division is also a difficult operation that cannot be easily parallelized in TCS. It is usually avoided in DSP algorithms and, if there is a need for its execution in the residue domain, methods such as subtractive and multiplicative division can be considered [7].

Figure 1. Overview of Residue Number System. [Block diagram: weighted integer operands; forward converter (integer-to-residues conversion); RNS processing units made of parallel modulo arithmetic channels, with redundant channels marked "R"; control unit for inter-modular operations (scaling, sign detection, magnitude comparison) producing control signals; reverse converter (residues-to-integer conversion); weighted integer output.]

A Redundant RNS (RRNS) [37], [38] with error detection and correction capability can be formed by adding redundant moduli into an existing moduli set to extend the legitimate range of the original information moduli. The extended range is called the illegitimate range. The redundant modulus channels in Fig. 1 are annotated by "R". In RRNS, residue errors can be detected from the recovered magnitude of the received residue digits. If the magnitude falls into the illegitimate range, it can be concluded that there exist one or more residue digit errors, provided that the number of residue digit errors does not exceed the number of redundant moduli. If the number of residue errors does not exceed half the number of redundant moduli, they can be located and corrected by subtracting the error digits from the received residue digits. As modular arithmetic is performed on the operands in residue representation, RRNS can correct arithmetical processing errors. This is a unique and powerful capability that is missing in other error correction codes used for the reliable delivery or storage of digital data. Furthermore, the residue digits in RRNS are independent of each other, which prevents the residue error in one modulus channel from propagating to another channel. Therefore, errors introduced into the residue digits have only a localized effect. The erroneous modulus channels can be easily removed without affecting the other modulus channels, provided that the dynamic range of the remaining information moduli after the removal of the erroneous modulus channels is sufficient for further arithmetic operation.

Fig. 2 shows a numerical example of the frequently encountered operations in an RNS and an RRNS defined by the moduli sets {3, 4, 5} and {3, 4, 5, 7}, respectively.

Figure 2. Example on computations in RNS and RRNS.

Weighted number input: X = 15, Y = -14. Channels mod 3, mod 4 and mod 5 are the information channels; mod 7 is the redundant modulus channel.

Forward conversion:
  x1 = |15|_3 = 0     x2 = |15|_4 = 3     x3 = |15|_5 = 0     x4 = |15|_7 = 1
  y1 = |-14|_3 = 1    y2 = |-14|_4 = 2    y3 = |-14|_5 = 1    y4 = |-14|_7 = 0

Modular arithmetic operations:
  Operation        mod 3           mod 4           mod 5           mod 7 (redundant)
  Addition         |0 + 1|_3 = 1   |3 + 2|_4 = 1   |0 + 1|_5 = 1   |1 + 0|_7 = 1
  Subtraction      |0 - 1|_3 = 2   |3 - 2|_4 = 1   |0 - 1|_5 = 4   |1 - 0|_7 = 1
  Multiplication   |0 × 1|_3 = 0   |3 × 2|_4 = 2   |0 × 1|_5 = 0   |1 × 0|_7 = 0

  Sum:         1 (1→2 under the injected fault), 1, 1, 1
  Difference:  2, 1, 4, 1
  Product:     0, 2, 0, 0

Reverse conversion (RNS): M = 3 × 4 × 5 = 60; M1 = 20, M2 = 15, M3 = 12; k1 = 2, k2 = 3, k3 = 3.
  Addition result:        X = |(20×2×1) + (15×3×1) + (12×3×1)|_60 = 1 = 15 + (-14)
  Subtraction result:     X = |(20×2×2) + (15×3×1) + (12×3×4)|_60 = 29 = 15 - (-14)
  Multiplication result:  X = |(20×2×0) + (15×3×2) + (12×3×0)|_60 = -30 ≠ 15 × (-14)
  Wrong result due to overflow: 15 × (-14) = -210 falls beyond the dynamic range (-30, 29).

Error detection (RRNS): MT = 3 × 4 × 5 × 7 = 420; M1 = 140, M2 = 105, M3 = 84, M4 = 60; k1 = 2, k2 = 1, k3 = 4, k4 = 2.
  If no fault in channel m1:      X = |(140×2×1) + (105×1×1) + (84×4×1) + (60×2×1)|_420 = 1 < (M = 60) => No error
  If fault occurs in channel m1:  X = |(140×2×2) + (105×1×1) + (84×4×1) + (60×2×1)|_420 = 281 > (M = 60) => Error
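The numbers in Fig. 2 can be reproduced with a short software sketch. It is only an illustration of the arithmetic, not of any hardware architecture discussed in this paper: the function names (`to_residues`, `channelwise`, `crt`, `to_signed`) are hypothetical, the channels are processed sequentially rather than in parallel, and the reverse conversion uses the Chinese Remainder Theorem with Python's built-in modular inverse (Python 3.8 or later is assumed).

```python
from math import prod

def to_residues(x, moduli):
    """Forward conversion: least non-negative residue of x in every channel."""
    return tuple(x % m for m in moduli)

def channelwise(op, xs, ys, moduli):
    """Apply a two-operand operation independently in each modulus channel."""
    return tuple(op(a, b) % m for a, b, m in zip(xs, ys, moduli))

def crt(residues, moduli):
    """Reverse conversion with the CRT: |sum(M_i * k_i * x_i)|_M."""
    M = prod(moduli)
    acc = 0
    for x_i, m_i in zip(residues, moduli):
        M_i = M // m_i
        k_i = pow(M_i, -1, m_i)  # multiplicative inverse of M_i modulo m_i
        acc += M_i * k_i * x_i
    return acc % M

def to_signed(value, M):
    """Interpret the upper half of [0, M) as negative integers."""
    return value - M if value >= (M + 1) // 2 else value

INFO = (3, 4, 5)       # information moduli, M = 60
FULL = (3, 4, 5, 7)    # information moduli plus the redundant modulus 7

x, y = to_residues(15, FULL), to_residues(-14, FULL)
total = channelwise(lambda a, b: a + b, x, y, FULL)
diff = channelwise(lambda a, b: a - b, x, y, FULL)
product = channelwise(lambda a, b: a * b, x, y, FULL)

M = prod(INFO)
print(to_signed(crt(total[:3], INFO), M))    # 1
print(to_signed(crt(diff[:3], INFO), M))     # 29
print(to_signed(crt(product[:3], INFO), M))  # -30: overflow, the true product is -210

# RRNS error detection: a fault changes the mod-3 residue of the sum from 1 to 2.
faulty = (2,) + total[1:]
print(crt(total, FULL) < M)   # True: the value lies in the legitimate range
print(crt(faulty, FULL))      # 281: the value lies in the illegitimate range
```

The check against M = 60 only detects the fault; locating and correcting the faulty channel would need more redundancy than the single redundant modulus used here.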



In the RRNS moduli set {3, 4, 5, 7}, 3, 4 and 5 are the information moduli and 7 is a redundant modulus. Let 15 and -14 be the two integer operands. Their residues are computed by the forward conversion as (0, 3, 0, 1) and (1, 2, 1, 0), respectively. Arithmetic operations are performed on the residues of these operands in each modulus channel independently. The results of these operations can be converted back to integers by the reverse conversion algorithm. The Chinese Remainder Theorem (CRT) [7] is applied for the reverse conversion in this example, which states that an integer X within the dynamic range M can be recovered from its residues $(x_1, x_2, \ldots, x_N)$ by:

$$X = \left| \sum_{i=1}^{N} M_i \times \left| k_i \times x_i \right|_{m_i} \right|_M \qquad (1)$$

where $|a|_b$ is a shorthand notation for a modulo b, $M_i = M / m_i$ and $k_i = |M_i^{-1}|_{m_i}$ is the multiplicative inverse of $|M_i|_{m_i}$, i.e., $|M_i \times M_i^{-1}|_{m_i} = 1$.

Computational steps for the reverse conversion of the results of addition, subtraction and multiplication are illustrated for the RNS in Fig. 2. The redundant residue in RRNS can be used for error detection. For the purpose of illustration, assume that a fault occurs in the addition and changes the residue of the sum in the modulus 3 channel from 1 to 2. By incorporating the redundant modulus 7 into the CRT, the integer 281 is obtained upon reverse conversion, which falls outside the legitimate dynamic range M = 60 of the information moduli. Thus, the error can be detected. As this RRNS has only one redundant modulus, it can only detect one residue digit error. To correct this error, the RRNS must have at least two redundant moduli. In general, an RRNS with r redundant moduli can detect up to r residue digit errors and correct up to $\lfloor r/2 \rfloor$ residue digit errors.

Overflow occurs when the computation results exceed the defined dynamic range. Overflow will result in an incorrect residue representation, as illustrated by the multiplication result in Fig. 2. Overflow detection is another difficult operation in RNS. It can be avoided by scaling the operand with some sacrifice on output precision. Scaling can be performed more efficiently in the residue domain for special moduli sets, especially if the scaling factor is one of the moduli [34]–[36].

Figure 3. RNS-based FIR filter. [Block diagram: analog input signal; ADC as forward converter; one FIR filter per modulus channel (Mod P1, Mod P2, ..., Mod Pn), each built from adders and multipliers with coefficients C0 to Ck/2-1; reverse converter; binary digital output signal.]

C. Notes on RNS Transcoders
Modular adders [21]–[24] and multipliers [25]–[28] are the keystone of RNS computations. Due to the very large number of publications on efficient implementations of different modular arithmetic circuits, they will not be discussed here. As RNS is a non-conventional number system, most system peripheral communications are not built for it. So the advantage of being able to perform short modular addition and multiplication independently on different modulus channels will be undermined by the coding overheads required to convert the data into and out of the residue domain.

The forward converter, also known as residue generator, is typically implemented by a multi-operand modular adder (MOMA) making use of the periodic property of $|2^j|_m$ for each modulus m of the RNS [39]. MOMAs designed for $\{2^n \pm k\}$ moduli with restricted values of k [19], [40] are simpler than those designed for arbitrary moduli [20]. In addition to serving as the forward converter in RNS, an efficient residue generator for a generic modulus is also a basic component instrumental to the error decoding of RRNS and modulo reduction. However, most of the residue generators are limited in terms of the number of bits per channel due to the long and irregular periodicity of a large modulus. Recently, a novel conversion structure was proposed to overcome this dependency by exploiting the distributive property of modular arithmetic [20]. As read-only memories (ROMs) are difficult to scale for larger dynamic range, the latest forward conversion structure targeted completely memoryless implementation for moduli $\{2^n \pm k\}$ with restricted and unrestricted values of k [40]. It computes the respective residue value of each subset of input bits using modular additions and constant multiplications without the exponential area increase of ROM-based topologies. To automatically create parameterized forward converters for generic and optimized moduli sets, a web tool was recently developed [41]. It comes with a heuristic to select a good moduli set based on the user specified maximum input range.
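The periodic property of $|2^j|_m$ that residue generators rely on can be illustrated in software. The sketch below is not the MOMA circuit of [39] or the memoryless structure of [40]; it is a minimal model, with hypothetical function names, that sums the weights $|2^j|_m$ of the set input bits and reports the period of those weights for an odd modulus.

```python
def period_of_two(m):
    """Smallest P > 0 with 2**P congruent to 1 modulo m (m odd, m >= 3)."""
    if m < 3 or m % 2 == 0:
        raise ValueError("the period is only defined here for odd m >= 3")
    period, value = 1, 2 % m
    while value != 1:
        value = (value * 2) % m
        period += 1
    return period

def residue_by_bit_weights(x, m, width):
    """Compute |x|_m for an unsigned width-bit x by adding |2^j|_m over the set bits."""
    weights = [pow(2, j, m) for j in range(width)]  # periodic sequence |2^j|_m
    acc = 0
    for j in range(width):
        if (x >> j) & 1:
            acc = (acc + weights[j]) % m  # software stand-in for multi-operand modular addition
    return acc

print(period_of_two(7))                          # 3, since |2^j|_7 cycles through 1, 2, 4
print(residue_by_bit_weights(0b11010110, 7, 8))  # same value as 0b11010110 % 7
```

A short period means many input bits share the same weight, which is what allows them to be grouped before the modular addition; a large modulus with a long, irregular period loses that advantage, as noted above.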



As for the reverse converter design, three main classes of conversion algorithms have been used: those directly based on the CRT [42]–[44], those that use an intermediate Mixed Radix representation [45], [46] and their hybrid combination [47]–[49]. While the MOMA for each residue can be independently constructed and executed in parallel for the forward converter, the reverse converter has a more complex architecture due to the unwieldy composition of large modular functions of residues and multiplicative inverses. Limited in number theoretic properties and lacking closed-form expressions for the multiplicative inverses required by CRT and mixed radix conversion, lookup tables are indispensable for the implementation of the reverse converters for arbitrary moduli sets. Unfortunately, ROM-based designs cannot be easily pipelined to increase the throughput, and large ROMs are area and time consuming. For special moduli sets, the rich and coherent number theoretic properties of their moduli can be exploited for the derivation of multiplicative inverses and reduction of large modulo operations to greatly simplify the reverse converter design. The final simplified formula for the reverse conversion can be realized more efficiently by bit selection, reshuffling and modular carry-save adders and different structures of two-operand vector merged modular adders [42]–[49]. Owing to the limited number of coprime moduli that are modular arithmetic friendly and have comparable word length, by relaxing the balanced word length requirement, more space can be explored for finding the high cardinality special moduli sets with better performance and more optimized reverse converters [42], [43], [45]. Nonetheless, the required number and sizes of the modular adders will also increase with the cardinality of the moduli set.

III. RNS Applications
RNS has been widely known as an alternative number system that can accelerate the speed of applications dominated by a large number of additions, subtractions and multiplications. The examples highlighted in this section include the well-known applications of RNS in DSP, cryptography, and memory modules as well as its lesser-known applications in network and cloud computing, which use the features of RNS to satisfy certain critical constraints in security, power consumption and performance.

A. Digital Signal Processing
The finite impulse response (FIR) filter is one of the most frequently used functions in DSP systems. It is an application that is well suited to demonstrate the benefits of RNS. For this reason, many RNS-based FIR filter implementations have been reported in the last half-century [50]–[61]. Fig. 3 shows the block diagram of a typical fully RNS-based implementation of an FIR filter. The analog input signal can be directly converted into residues using an RNS-based folding analog-to-digital converter (ADC) [54]. Discrete convolution is then performed on the converted residues in their respective modulus channels before the final residues are converted back to an integer in binary representation. A similar implementation has been applied to infinite impulse response (IIR) filters that are widely used in control applications [55].

One of the earliest comparisons of hardware area and delay between the RNS and TCS based implementations of FIR filters was made based on their transpose stages [52]. The RNS based architecture was implemented using special moduli in the forms of $2^n - 1$, $2^n$ and $2^n + 1$ but with no restriction on the number of moduli used to compose the moduli set. The TCS-equivalent transpose filter stage in comparison was implemented using a Booth encoded multiplier with optimized adder tree. The synthesis results of a 16-tap filter using the 0.7 μm CMOS technology of the time showed gains of 1.35 to 1.71, measured in terms of the ratio of the area-delay product of the TCS stage to that of the RNS stage, for M ranging from 20 to 40 bits. Recently, the power efficiency of partially serialized RNS and TCS based high order and large dynamic range FIR filters implemented on resource constrained platforms was also studied [56]. The results showed that the ASIC implementation of a 120-tap fully parallel FIR filter clocked at 20 MHz consumed 844 pJ per cycle in RNS but 1196 pJ per cycle in TCS. With clock gating, the serial/parallel FIR filter implemented in RNS consumed 225 pJ per cycle, which was significantly lower than the 364 pJ per cycle of the TCS implementation. Approaches to leverage RNS features for single-constant multiplication (SCM) [58] and multiple-constant multiplication (MCM) [59] were proposed and they outperformed conventional MCM implementations of fixed-coefficient FIR filters in speed and power efficiency at maximum throughput [59]. Last but not least, fully fault-tolerant RRNS-based FIR filters that use fault masking to provide additional fault-immunity for the RNS reverse converter were also developed and investigated against the conventional triple modular redundancy (TMR) approach [57].

An innovative use of RNS in DSP systems was reported on the implementation of the one-dimensional discrete wavelet transform (DWT) [62]. It uses enhanced index-transformation over Galois Fields and retiming techniques to reduce the number of ROMs required to support runtime programmable wavelet filter coefficients to a single $2^{n_i} \times n_i$ ROM for each coefficient



multiplication, where $n_i$ is the wordlength of the modulus $m_i$. Small prime moduli are selected to meet the dynamic range requirement as only prime moduli are admissible in the index arithmetic system, and efficient circuitry is designed to detect zero values of input sequences because multiplication by zero is undefined in the index domain. The total area and maximum sampling rate obtained for the ASIC implementation of 8-tap RNS and TCS DWT filter banks revealed that the RNS solution with efficient implementation of modulo multiplications by index transformations and the TCS solution using pipelined Booth-encoded multipliers had comparable complexities, but the former had higher speed and better routability. DWT filter banks enhanced by RNS arithmetic with 21-, 23-, 27- and 29-bit outputs were found to be about 59%, 46%, 76% and 131% faster than the corresponding TCS designs in Chip Express 0.35 μm triple-level metal CX3003 CMOS technology. The total resources required and maximum sampling rate obtained for a 4-tap DWT filter bank implemented on an Altera FLEX10KE Field Programmable Logic (FPL) device also indicated that the RNS-enhanced DWT filter banks, with 19-, 20-, 21- and 22-bit outputs, were about two times faster than their TCS counterparts. This dramatic increase in the system performance is due to the efficient mappings of the index transformation multipliers to the embedded array blocks and the 5-bit and 6-bit modular adders to the 8-bit logic elements to take advantage of the faster carry-propagation paths within the logic array block of an FPL device.

Figure 4. KeyFlow: RNS network switching strategy [68]. [Diagram: a fabric network between an ingress edge node and an egress edge node; core switches with keys 4, 7, 3 and 5; numbered output ports; links labeled 25.]

Convolutions and correlations are also essential functions in video enhancement, restoration, filtering and compression, but they are very expensive to perform in real-time. Number theoretic transforms such as the Fermat Number Transform (FNT) can be used to replace the complex domain of the Fast Fourier Transform (FFT) with a finite residue ring to avoid scaling and quantization errors in image and video convolutions without having to use floating point units. Unfortunately, the strict relation between the modulus $q = 2^p + 1$ and the maximum transform length has severely limited long length FNTs to only a few particular choices as p must be an integer power of two. This limitation is relaxed by using multiple short moduli of RNS to extend the wordlength of small modulus FNTs [63]. Several combinations of moduli to support the required transform length and dynamic range were investigated, resulting in the selection of $2^{16} + 1$ and $2^8 + 1$ as the moduli to obtain a transform length of up to 256 points with approximately 24 bits of dynamic range. Compared with an FNT using the single modulus $2^{24} + 1$, the arithmetic units of the proposed method with moduli $2^8 + 1$ and $2^{16} + 1$ are smaller and faster. Prior to this work, several methods were proposed based on Quadratic RNS (QRNS), which require large storage for the computation of FNT in the complex number system. Other key arithmetic-intensive operations frequently encountered in digital image processing can also be accelerated by RNS in their hardware implementation [64]–[66]. One example is the design of Gaussian smoothing and Sobel's edge detector based on the popular three moduli set $\{2^n - 1, 2^n, 2^n + 1\}$ [65]. Its implementation in 32 nm CMOS standard cell technology consumed 112332 μm² of area and 30.23 mW of power at 1.05 V and 250 MHz. The circuit was capable of processing a 256 × 256 image at double this frequency as the actual critical path delay was less than 2 ns.

In general, most DSP systems exploit the shorter critical path with parallel carry-free arithmetic of RNS to achieve higher energy efficiency with lower supply voltage under the same computation precision as TCS. For some error-tolerant systems, additional power savings can be gained by scaling the supply voltage below the critical voltage. However, voltage overscaling (VOS) causes soft errors due to timing violation. The soft errors introduced into the RNS-based DSP systems can be mitigated by the reduced precision redundancy (RPR) technique to achieve a high error recovery probability with low hardware complexity [67]. A case study of a VOS-based FIR filter implemented in 0.25 μm, 2.5 V CMOS technology showed that the RNS-RPR design saved 62% more energy than the conventional TCS filter with less than 2 dB degradation in signal-to-noise ratio (SNR).
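The channel-wise structure of the RNS-based FIR filter of Fig. 3 can be sketched in a few lines. This is a behavioural model only, under several assumptions that are not from the cited designs: the moduli set {31, 32, 33} (the form $\{2^n - 1, 2^n, 2^n + 1\}$ with n = 5), the sample and coefficient values, and all function names are chosen for illustration; the channels run one after another in software, whereas in hardware they would operate in parallel; and the reverse conversion uses a plain CRT rather than an optimized converter.

```python
from math import prod

MODULI = (31, 32, 33)  # assumed example set {2^n - 1, 2^n, 2^n + 1} with n = 5

def crt_signed(residues, moduli):
    """Reverse conversion by CRT, mapping the upper half of [0, M) to negatives."""
    M = prod(moduli)
    value = sum((M // m) * pow(M // m, -1, m) * r for r, m in zip(residues, moduli)) % M
    return value - M if value >= (M + 1) // 2 else value

def fir_direct(samples, coeffs):
    """Reference FIR filter in ordinary integer arithmetic."""
    return [sum(c * samples[n - k] for k, c in enumerate(coeffs) if n >= k)
            for n in range(len(samples))]

def fir_rns(samples, coeffs, moduli=MODULI):
    """FIR filter computed independently in each modulus channel, then reverse converted."""
    channel_outputs = []
    for m in moduli:
        xs = [s % m for s in samples]   # forward conversion of the input samples
        cs = [c % m for c in coeffs]    # forward conversion of the coefficients
        ys = [sum(c * xs[n - k] for k, c in enumerate(cs) if n >= k) % m
              for n in range(len(xs))]  # discrete convolution inside one channel
        channel_outputs.append(ys)
    return [crt_signed(res, moduli) for res in zip(*channel_outputs)]

samples = [3, -1, 4, 1, -5, 9, 2, -6]
coeffs = [2, -3, 1, 4]
print(fir_rns(samples, coeffs) == fir_direct(samples, coeffs))  # True while outputs stay within the dynamic range
```

The two results agree only as long as every filter output fits in the signed dynamic range of the moduli set (here M = 32736); larger inputs or more taps would need a larger moduli set or scaling.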



B. Communications and Networking

Software-defined networking (SDN) is an approach to simplify computer networking through abstraction of lower-level functionality. The network services are managed by decoupling the control plane that decides the traffic flow from the data plane that forwards traffic to the selected destination. OpenFlow is a well-known protocol to realize SDN. However, an OpenFlow-enabled switching network has difficulty meeting the core network demands in its reactive mode because the total number of end-to-end active flows in a given route is severely constrained by the hardware flow table implementation in a multi-vendor core network. Firstly, the complex rule insertion creates a bottleneck for the maximum flow rate. At most only a few hundred flows per second can be set up. Besides, the delay for flow insertion and flow modification can vary significantly. Secondly, the time taken for the flow to be fully functional is severely affected by the control channel latency. Emulated experiments show that a control channel latency of 300 ms will result in 20 s to reach full flow rate across 32 switches. The use of expensive and power-hungry fast content addressable memories to address the stateful requirement for the active flows imposes scalability, responsiveness, cost and power consumption problems for the application of SDN in core networks.

RNS can unlock the SDN usage on core networks. Such an alternative to OpenFlow is demonstrated by KeyFlow [68] for building simpler core network elements for packet forwarding. With the core network switches' IDs of a route represented by the moduli of an RNS, the output ports of the core switches used by the route can be identified by the residues of the large integer route path ID. In this way, the table lookup bottleneck in the data plane of the SDN core network can be overcome by a highly scalable KeyFlow fabric based on RNS computations. The explicit ID path coding at the edges and stateless routing/forwarding in the core enable extended decoupling between the control and data planes and shorter switching latency. KeyFlow achieved more than 50% round trip time reduction, especially for networks with populated flow tables. In addition, it also achieved more than 30% reduction in keeping active flow state in the network.

Fig. 4 shows an example of a KeyFlow-enabled fabric network for the communication between two autonomous systems when there is a rule miss for the first packet at the ingress edge node. Assume that the coprime moduli $\{m_1, m_2, m_3, m_4\} = \{4, 3, 5, 7\}$ are assigned to the local switches at network initialization to meet the mandatory requirement that the modulus assigned to each switch is larger than its number of output ports. If the KeyFlow controller selects an end-to-end path through the nodes {4, 3, 5} via their respective output interfaces with port IDs $(x_1, x_2, x_3) = (1, 1, 0)$, then the route ID can be calculated directly in the controller without any further interaction with the KeyFlow-enabled nodes. According to CRT in (1), since $M = 4 \times 3 \times 5 = 60$, $M_1 = 15$, $M_2 = 20$ and $M_3 = 12$,

$$X = \left| 15 \times \left| 15^{-1} \right|_4 \times 1 + 20 \times \left| 20^{-1} \right|_3 \times 1 + 12 \times \left| 12^{-1} \right|_5 \times 0 \right|_{60} = \left| 15 \times 3 \times 1 + 20 \times 2 \times 1 + 12 \times 3 \times 0 \right|_{60} = 25 \qquad (2)$$

In each hop, the output port can be determined without table lookup by a modulo reduction of this unique path ID in the packet header by the local switch key.

Figure 5. RNS-based packet forwarding in WSN [69]. (Diagram labels: nodes A, B, W, X, Y and Z; w-bit packet M; residues M mod m1, M mod m2 and M mod m3.)

RNS can also be used to enhance the reliability of WSN by reducing the mean energy consumption of each sensor node [69]. The main idea is to distribute the network loads among all nodes so that the maximum number of transmitted bits per node is substantially reduced. The network is organized into clusters to minimize the number of hops needed to reach the sink. Through exchange of initialization messages, each node will know its own next-hops, its predecessor nodes in a multi-hop transmission, and the number of chunks N into which a received packet can be split. Packets to be forwarded from the sensor nodes will be split whenever there are at least N next-hop nodes, and the packet has not been previously split. The splitting is done by dividing the message by the selected moduli, and the remainders will be sent separately by the next-hop nodes. The original message will be reconstructed after all the chunks have been received by the sink, which is assumed to be computationally and energetically more equipped than other sensor nodes. In Fig. 5, a scenario of splitting a w-bit packet transmitted from node A to the sink at the second hop is depicted. To reduce the number of bits needed to be forwarded by each next-hop node, a set
of small consecutive primes that satisfy the condition $M \geq 2^w$ can be selected as the moduli without increasing the number of forwarders and yet the message can still be reconstructed if one or more remainders are lost. Given N and f, an ordered list of prime numbers stored in each sensor node's memory and a set of lookup-tables (one for each possible w) can be used to retrieve the unique minimum prime set with f admissible faults in a distributed manner to provide an optimal tradeoff between reliability and energy consumption, by taking into account erasure channels, physical layer overhead, and actual computational resources of all nodes in a real WSN.

Another RNS application in network systems is the technique proposed in [70], which utilizes the modular nature of residue code to reduce the number of dropped messages in ad hoc networks caused by malicious nodes, buffer overflows, node movement and collision. This technique incorporates RRNS code into a modified version of the Ad hoc On-demand Multipath Distance Vector (AOMDV) routing protocol where a message is split into N parts and sent via multiple routes to the destinations. At the receiver side, the message can be fully recovered as long as the number of parts reaching the destination is more than N/2 with the condition that all the parts do not travel via the same route. The performance of the proposed technique was measured by counting the number of messages successfully delivered via nodes which were randomly distributed and moved to random waypoints. The results showed that the RNS-based approach outperformed the conventional AOMDV method in different simulated scenarios.

C. Cloud Storage

Figure 6. File storage over cloud using RNS [71]. (Diagram labels: data converted to residues with 8 moduli {m1, m2, ..., m8}; chunks eW1 to eW8 stored over the cloud; original data retrieved with 5 moduli.)

The concerns of unexpected termination of services and breach of data confidentiality by current cloud storage providers can be addressed by RRNS [71]. In order to store a file over the cloud with greater reliability in terms of long-term availability and privacy, it is split into p + r residue-segments based on RRNS, where r is the number of redundant moduli. Each chunk is encoded by BASE-64 before it is encrypted with a symmetric algorithm to encapsulate the binary data in the payload of an XML wrapper file. Finally, each encrypted chunk is sent to a different cloud storage provider. An XML metadata map file describing the locations and retrieval method of the different chunks is created and safeguarded by the client. This approach is illustrated in Fig. 6. In the event that any of the storage devices breaks down temporarily or permanently, the original file can still be easily reconstructed by the client using only p chunks including the redundant residue-segments. This storage approach not only protects the stored files from system failures, but also prevents the cloud provider from accessing the stored files because only the owner knows the chunks' storage locations and their access method. Furthermore, a parallel download
of distinct chunks from different cloud storage providers also results in efficient bandwidth utilization. The storage size of a file is a function of r and the minimum number of moduli required to reconstruct the file. For equal error tolerance, the ratio of the storage required by the traditional redundancy approach that stores multiple copies to that required by this approach was found to be about 1.75 [71].

D. Fault-Tolerant Computing

The reliability of electronic circuits is greatly hampered by aggressive device scaling. To minimize the yield losses and product failures every year, fault tolerance has emerged as a new design dimension of utmost significance to the reliable operation of nano-electronic circuits. Several techniques, which include self-checking logic, module replication, error correction code, and reconfiguration, etc. [72], have been developed to enhance the dependability of electronic circuits designed out of fallible devices, but none has its error isolation capacity inherent in the arithmetic operations like RNS. The lack of ordered significance among the residues of an RNS implies that errors due to processing noise in one residue digit will not contaminate other residue digits, and a faulty modulus channel can be shut down if the surviving channels have adequate dynamic range. Otherwise, additional moduli can be added to provide the desirable fault tolerance.

One of the applications that incorporates RRNS based error detection and correction code is hybrid memory [73]. In hybrid memories, non-CMOS devices are used as memory cells together with CMOS-based peripheral circuits. Compared to conventional CMOS memory cells, hybrid memory offers bigger data storage capacity but has a higher defect rate of 10% or more due to the high manufacturing process variability of emerging nano-devices. The first RRNS code designed for defect-tolerant memory systems consists of six moduli of the forms $\{2^n + 1, 2^n, 2^{n-1} - 1, 2^{n-2} - 1, 2^{n-3} - 1, 2^{n-4} + 1\}$, where $2^n + 1$ and $2^n$ are the information moduli and $2^{n-1} - 1$, $2^{n-2} - 1$, $2^{n-3} - 1$ and $2^{n-4} + 1$ are the redundant moduli. Contrary to conventional RRNS code, the redundant moduli are smaller than the information moduli, which causes ambiguity in error correction. The ambiguity is eliminated using a maximum likelihood decoding technique. As a result, more data can be stored using this scheme as its codeword length is shorter than the Reed-Solomon (RS) code and conventional RRNS (C-RRNS) code for 16-bit, 32-bit and 64-bit memories.

Figure 7. Error detection and correction in memories by RRNS [73]. (Diagram labels: d-bit data; RRNS encoder with Modulo 1, Modulo 2, ..., Modulo n - 1, Modulo n; b-bit blocks holding Residue 1 to Residue n in the memory cell array; n-bit address and address decoder; RRNS decoder with checker and converter; correct d-bit data.)

As shown in Fig. 7, the input data is first converted to a set of residues by the RRNS encoder. The residues are then concatenated to create the RRNS codeword before
it is stored in the memory cells. To retrieve the data, a weighted number is reconstructed from the residues. If the weighted number falls within the legitimate range of the information moduli, the correct data is retrieved. Otherwise, the data contains an error and the erroneous residue digit can be located by iteratively computing the magnitude of the residue representation after excluding one residue at a time until the computed magnitude falls within the legitimate range. The excluded residue that brings the magnitude back to the legitimate range is erroneous.

In addition, RRNS was also used to mitigate errors in FIR filters [60], [61]. In comparison with TMR, the RRNS-based approach consumed less hardware area for the same error correction capability [57]. The RRNS-based approach can eliminate soft errors produced by single-event upset (SEU) in FIR filters with zero fault missing rate and used less hardware area than traditional SEU mitigation schemes [60].

Another application of RRNS based code is the multicarrier modulation scheme [74]. It is adaptively coded in RRNS with three information and six redundant residues. Each residue has 8 bits. Compared to other RRNS codes with different numbers of information and redundant residues, the RRNS (9, 3) code provides the strongest error correction capability with a code rate of 0.33. This scheme can effectively counter the frequency-selective fading effects caused by dispersive wideband channels. For the target bit error rate of $10^{-4}$, it outperformed all other convolutional constituent code based schemes in terms of bit per symbol throughput when the SNR was above 15 dB [74]. To support higher system diversity and coding gain, space-time block codes (STBCs) have recently been formulated based on RRNS [75]. The residues are mapped directly to complex constellations using direct, distance-aware and indirect mapping schemes. The mapped symbols are constructed as space-time block codes and transmitted over multiple antennas. At the receiver, the original information is recovered using CRT. The RRNS code structure enables the soft decision maximum likelihood decoding complexity to be reduced by an adaptive M-ary demapping scheme with distance-aware mapping. More importantly, the highly modular systematic RRNS makes encoding and decoding codes with different parameters by the same hardware possible. The short block lengths of RRNS-STBCs are also well suited to low latency communication systems and multiple-input multiple-output implementations of wireless standards.

E. Cryptography

Public key cryptography is fundamentally used to assure confidentiality, authenticity and non-repudiation in electronic communications without requiring a covert channel for key exchange between the communication parties. Encryption and decryption in public key algorithms involve computationally costly large modular multiplications and modular exponentiations due to the large keys required for security reasons. This has important implications for their practical use in mobile devices and lightweight applications.

Figure 8. HORNS for Cloud Computing [86]. (Diagram labels: simple homomorphic encryption, example y′ = 2 × y, z′ = 2 × z; simple calculation on cipher text y′ + z′ without knowing original data y, z; simple homomorphic decryption, example (y′ + z′)/2 = y + z.)

RNS is useful in speeding up the hardware implementation of modular multiplication and modular exponentiation in Rivest, Shamir and Adleman (RSA) [76] and Elliptic Curve Cryptography (ECC) [77] algorithms. As a core in RSA modular exponentiation and ECC point multiplication, the RNS based Montgomery multiplier is faster than the binary Montgomery multiplier [78], [79]. The use of RNS in RSA dates back to two well-known RNS-based Montgomery multiplications in 1992 and 1995 [18], [80]. These were later extended to a full-fledged RNS implementation of RSA [76], where all computations are performed in the RNS domain without requiring any forward and reverse conversions. To achieve this, the sender and the receiver have to agree on a set of RNS parameters beforehand. On the other hand, the most important and computationally intensive arithmetic operation of ECC is point multiplication. Any speedup on point multiplication will result in a noteworthy improvement in ECC's performance. RNS Montgomery modular multiplication has proven to be the fastest design approach to speed up the elliptic curve point
multiplications using less than half the area of previous design efforts [77]. It performs increasingly better as the key length of the ECC increases. The speed and area of RNS Montgomery multiplication were further improved by using arithmetic-friendly moduli with pipelined architectures [78].

Fully homomorphic encryption (FHE) marks an important milestone in the advance of modern cryptography [81], [82]. With FHE, it is possible to perform computations directly on the cipher texts without disclosing the secret key. This can be advantageous for cloud computing to chain different services together without exposing the data to each of those services. However, the implementation complexity of FHE has impeded its use in practice. To show the magnitude of the problem, the first implementation of a lattice based FHE variant took 17 megabytes of public keys, more than one second to encrypt one bit and nearly half a minute to recrypt a primitive on a high end Intel Xeon based server. Dedicated software and hardware methods have been explored to make FHE more efficient [83], [84]. Lately, a more efficient leveled FHE scheme without bootstrapping has emerged, but a large integer matrix-vector multiplication is required for its encryption. RNS can be used to decompose each large integer element of the matrix into smaller residues [85]. The vector operations involving modular multiplications and additions are then performed on these smaller residues. The final result is reconstructed from the residues by CRT. As the vector operation process consumed most of the central processing unit (CPU) time, it was accelerated by a graphics processing unit (GPU). The CPU and GPU implementations of this RNS-based method were respectively 7.8 times and 273 times faster than the regular Number Theory Library on CPU [85].

In the homomorphic encryption scheme of the form $c = pq + m$, where c is the ciphertext, m is the plaintext message, p is the key and q is a random number, the encryption function happens to be the inverse of the residue operation, i.e., m is the residue of c with respect to modulus p. This is the principle underpinning the security of HORNS (HOmomorphic encryption with RNS) [86] to use an untrusted cloud for the computation of a client's data. HORNS generates multiple residues of the data, and the operations on them are done by the cloud using homomorphic encryption. To assure data confidentiality, the data X and the dynamic range M should not be inferred by the cloud from the residues $x_i$ and their moduli $m_i$. To confuse the cloud, random noise $r_x \in R_d$ is added to the data, and the operations are performed on $X' = X + R_X \cdot r_x$ instead of X. As the range of the data is reduced to $M / R_X$, the result of any computation from the cloud can still be converted to the original data domain by finding the residue of $X'$ with respect to $R_X$. M can be similarly obfuscated by multiplying the modulus $m_i$ by a random noise $r_{m_i} \in R_M$, and the results from the cloud computed using the transformed modulus $m'_i = m_i \cdot r_{m_i}$ can be converted to the original modulus by $(x_i \bmod m'_i) \bmod m_i$. To reduce the chance for any cloud to successfully infer the data and the dynamic range by collusion, the computations can be distributed to different clouds. The use of HORNS to perform computations on user data based on the cloud framework is shown in Fig. 8.

Side-channel attacks (SCAs) are critical threats to cryptosystems [87], which are capable of revealing the secret by analyzing different sources of leakage such as time and power consumption during the execution of cryptographic operations on hardware devices. One possible strategy to curb this problem at the arithmetic level is to use RNS for masking the data and internal computations in conjunction with SCA-protected methods. This is the foundation of the Leak Resistant Arithmetic concept proposed in [88]. Moreover, base randomizations in RNS-based RSA can provide high robustness against SCAs without incurring any overhead in comparison to the unprotected regular implementation [89]. Besides, the AES algorithm can also be protected against fault attacks by polynomial RNS [90]. In this method, three parallel RNS-based AES cores, including one redundant core, are used to process the RNS-representation of the original data. Upon completion of the computations, the residues are converted back to the weighted equivalent followed by an overflow detection mechanism, which can detect up to 4 bits of errors.

Power analysis has been proven to be the most dreadful SCA on embedded systems. To protect public key ciphers from both simple and differential power analysis attacks, it is possible to take advantage of the parallelization in multi-modulo RNS to deform the secure information and select the moduli randomly for each key bit operation [91]. Besides the enhanced security, it has all the benefits of RNS to process a large number of inputs faster and consume less power.

IV. Technology Evolution on RNS Implementation

It is clear from our literature review that the technology revolution of hardware implementation platforms and design automation tools can be a game changer on the competitiveness of RNS. What used to be the most critical figures of merit for a winning design in the previous decade may no longer be valid today due to the revaluation of cost factors in memory, wiring, power dissipation, thermal, yield and reliability as technologies advance. In the early days, RNS was exploited mainly for maximizing the performance and minimizing the
footprint for ASIC implementation of DSP algorithms. Today, power consumption and heat dissipation have become more important design constraints due to the proliferation of battery-operated and portable devices. This has resulted in a paradigm shift in the state-of-the-art synthesis and placement-and-route tools, which put more emphasis on reducing long interconnections on critical paths and load balancing to equalize path delays. The speed advantage of RNS by virtue of its locality property and shorter datapath is thus diminishing. Instead of exploiting the locally bound and shorter datapath for speed, these intrinsic features of RNS can be used to redistribute the logic on fewer low driving strength cells to equalize the average delay mismatch at the gates' inputs [92]. Since only a small fraction of circuits in the critical modulus channel with the longest delay has a tight timing constraint, the timing of a large majority of datapaths can be relaxed to reduce the amount of spurious switching. The reduced glitch percentage was shown to produce significant power savings in the RNS-based FIR filters compared with their TCS equivalents [56]. The locality property of RNS can also be leveraged to reduce the leakage power by the technology mapper using multithreshold voltage cell libraries. The differential delay of input signals can be more readily compensated than in TCS without impacting the timing of other paths by replacing standard-$V_t$ cells with the slower but low leakage high-$V_t$ cells. Moreover, the modular operations of RNS, being non-linear and independent of the input distribution, have a natural propensity to cut out the input correlations and restore the bit probability distribution uniformly over the entire dynamic range of each modulus channel. The immunity of power consumption to input correlations and changes in input distribution is beneficial as many power reduction strategies, such as voltage-frequency scaling, work best with designs that have better power consumption predictability.

Even if CMOS devices can continue to scale at the current pace before reaching the quantum physical limits, many issues are beginning to arise and are becoming increasingly difficult and costly to solve. In recent years, research on the non-CMOS paradigm has borne fruit that proves the feasibility of implementing arithmetic circuits by using emerging nanoelectronic devices [2], [4], [93], [94]. Unfortunately, the devices in these promising nanoscale technologies are even more prone to manufacturing process variability and defects. One interesting study on residue arithmetic addition of different word lengths for several practical cases of inter- and intra-die variations showed that residue adders were significantly more robust to timing variations than the traditional adders [95]. The ability of RNS to mitigate excessive delay variations of nanoscale devices was further validated by quantitative digital filter design space exploration taking into account the filter order and output signal quality [96]. By comparing the three-moduli RNS $\{2^8, 2^{10} - 1, 2^{12} + 1\}$ against the equivalent TCS implementations of a four-tap FIR filter, savings in normalized delay variation of up to 58% and timing yield of up to 100% were demonstrated.

Figure 9. RNS ALU as a co-processor [102]. (Diagram labels: main application program; graphic display system; CPU; instruction decode and DMA; I/O devices (printer, keyboard, ...); RNS ALU as co-processor with modular integer unit, modular fractional unit, magnitude comparison, sign detection, digit extension, ALU status, ALU control and converters.)

Field-programmable gate array (FPGA) is another popular implementation platform which is preferred to ASIC for its lower cost, programmability and shorter turnaround time, and to DSP and microprocessor for its faster speed and lower power consumption. Like ASIC technology, FPGAs have also evolved rapidly in density and architectural heterogeneity. The newest generation of FPGAs has migrated towards a hybrid architecture that has a more complex interconnect structure and logic element topology, and is embedded with full custom processing elements such as multipliers, hardware processor cores, multiply-add units and very high speed serial I/O blocks. These major changes have caused FPGA implementations of fast VLSI adder structures, such as various parallel-prefix adders, to have worse delay and area than the ripple carry adder [97]. This means that efficient implementation of modular arithmetic units in ASIC may not necessarily
have better performance if they are implemented in the same way on FPGA. The same goes for the advantages of implementing isomorphic multipliers in the early generation of general purpose FPGA resources. Therefore, new ways of utilizing the latest internal structure of FPGA should be exploited, as exemplified by the fast modulo $2^n - 1$ and $2^n + 1$ adders that take advantage of the internal carry propagate structure of modern FPGAs [98]. To better exploit the new features of the latest FPGAs, modular additions and multiplications of RNS were implemented using a ROM based approach as opposed to the classical MUX based adders [99]. Different optimization techniques are proposed for the design of basic building blocks, including the forward and reverse converters, based on moduli sets selected to optimally use the new 6-input lookup tables of the complex logic blocks. High speed and low resource utilization rate were demonstrated by applying these techniques to the design of different orders of RNS filters. Specifically, 40% saving on resource utilization over the TCS implementation was reported for a 256-tap FIR filter implemented by the moduli set {64, 31, 29, 23, 19, 17, 13} on the same Xilinx FPGA [99].

Due to the hot-spot problem, it is impossible to increase the throughput of microprocessors by increasing the clock rate. Thus general purpose processors are evolving towards the multicore architecture. The trend towards parallelism can benefit from RNS by partitioning sequential operations in computation-intensive applications to parallelize their execution on multiple cores. One good example is the acceleration of matrix-vector multiplication for the leveled FHE algorithm [85] mentioned in Section III.E. Another example is the efficient parallel implementation of the highly data-dependent elliptic-curve point multiplication on GPUs [100]. The parallel execution of point multiplications on GPU cores with RNS partitioning can achieve up to 122% throughput improvement. A general framework for automatically generating RNS-based design and implementation of the computation algorithms on GPUs was proposed in [101]. Under this framework, the parallel programs of algorithms implemented in RNS can be generated automatically and transparently to the system designer.

Although RNS is generally conceived to be a poor fit for the implementation of programmable processors due to its transcoding overheads, a patent was filed for an RNS general-purpose arithmetic and logic unit (ALU) capable of performing both integer and fractional operations on very large values [102]. This RNS-based ALU is used as a co-processor to a conventional CPU to accelerate computations as shown in Fig. 9.

Figure 10. An embedded RNS RISC pipeline processor [103]. (Diagram labels: instruction fetch; instruction decode; execute stage with RNS register file, integer register file, forward converter, modular arithmetic channels, reverse converter, regular adder and multiplier; memory; write back.)

Recently, a multi-tier approach was adopted to design a 32-bit RNS extension to the embedded reduced instruction set computer (RISC) processor based on the moduli set $\{2^n - 1, 2^{n+k}, 2^n + 1\}$ [103]. By balancing the modular multiplier delay across the three modulus channels, the values of n and k were fixed at 9 and 15, respectively. The RNS adder was designed for three operand addition given that its speed was still significantly faster than the two-operand TCS adder. This adder was used for two-operand subtraction with additional input logic to conditionally negate-and-correct the second operand. The RNS adder/multiplier, a fully carry-save forward converter and a reverse converter were embedded in the execute stage of a RISC instruction pipeline. Together with the regular binary ALUs, they constituted a hybrid RNS processor shown in Fig. 10. To allow the conversion operations to be scheduled in parallel with some other computation, separate instructions for RNS addition (RADD), subtraction (RSUB) and multiplication (RMUL), and for converting operands from TCS to RNS (FC) and vice versa (RC) were added into an existing RISC instruction set architecture. A compiler was developed to analyze the application data dependency graph for RNS profitability and map the potential sub-graphs to RNS instructions. The instruction scheduling was then performed to hide the conversion latency and minimize the runtime. This RNS-based embedded processor was able to achieve more than 50% power saving in comparison with regular TCS for various DSP benchmark kernels.
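
The five RNS instructions named above can be modelled in software as follows. This is only a behavioural sketch in Python, assuming the moduli set is read as $\{2^n - 1, 2^{n+k}, 2^n + 1\}$ with n = 9 and k = 15. The function names simply mirror the mnemonics FC, RADD, RSUB, RMUL and RC, and the CRT-based reverse conversion is a textbook choice for illustration, not the converter architecture of [103].

```python
from math import prod  # Python 3.8+


def moduli_set(n=9, k=15):
    """Assumed reading of the moduli set: (2^n - 1, 2^(n+k), 2^n + 1)."""
    return (2**n - 1, 2**(n + k), 2**n + 1)


MODULI = moduli_set()
M = prod(MODULI)  # dynamic range of this hypothetical configuration


def fc(x):
    """Forward conversion (FC): binary integer -> tuple of residues."""
    return tuple(x % m for m in MODULI)


def radd(a, b):
    """RADD: each modulus channel adds independently, with no carry between channels."""
    return tuple((ai + bi) % m for ai, bi, m in zip(a, b, MODULI))


def rsub(a, b):
    """RSUB: channel-wise subtraction, kept non-negative by the modulo."""
    return tuple((ai - bi) % m for ai, bi, m in zip(a, b, MODULI))


def rmul(a, b):
    """RMUL: channel-wise multiplication."""
    return tuple((ai * bi) % m for ai, bi, m in zip(a, b, MODULI))


def rc(r):
    """Reverse conversion (RC) by the Chinese Remainder Theorem."""
    total = 0
    for ri, mi in zip(r, MODULI):
        Mi = M // mi
        total += ri * Mi * pow(Mi, -1, mi)  # modular inverse of Mi modulo mi
    return total % M


# A multiply-accumulate step kept entirely in the residue domain:
# only the operands are converted in and the final result converted out.
x, y, z = 1234, 5678, 91
acc = radd(rmul(fc(x), fc(y)), fc(z))
print(rc(acc) == x * y + z)
```

Keeping intermediate values in residue form between FC and RC is what lets the conversion cost be amortized over several RADD and RMUL operations, which is the situation the compiler described above is meant to detect.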

V. Conclusion and Future of RNS

RNS, with its ability to perform parallel and fast computations, adds value to modern DSPs that place energy efficiency as a major concern. This paper reviews the advantages of RNS and its potential to distinctively enhance digital systems in many fields. It is shown that RNS features can be applied as a tool in a variety of applications to either augment their functions or accelerate their performance. Not only is RNS useful in the prevalent DSP computations, but also in prolonging the lifetime of WSN, easing message routing of SDN, and making shared memory more error-resilient and cloud computing more secure. We have discussed how the evolution of implementation technologies can influence the performance and design of RNS, and demonstrated how new platforms such as GPUs can benefit from software parallelism by means of RNS. Through a survey of practical findings from its contemporary and emerging applications, this paper attempts to show that RNS is more than a data representation of only academic interest. As a catalyst for future research, we present the following visions for the future of RNS.

With the increase in complexity of DSP algorithms, there is a trend towards calculations that involve numbers with disparate magnitudes. For example, the radar system used for navigation and guidance may require a wider dynamic range for tracking but need only a small subset of that range for target acquisition and identification. This puts RNS at a disadvantage, as changing the dynamic range in the residue domain incurs a cost that is at least comparable to the overhead of a reverse converter. With a single base and a fixed scaling factor, there is limited flexibility to trade accuracy for performance when frequent and abrupt changes in dynamic range are encountered at runtime. The dilemma is that a moduli set chosen to accommodate the full precision will result in a design overkill, but the output quality and robustness will suffer if the chosen moduli set does not provide enough dynamic range. The way forward is to consider multi-base RNS for dynamic range adaptation, supported by efficient base transformers [104], programmable scalers [35], and scalable multi-function residue operators [28], [79].

With Systems-on-Chip (SoC) integration, the impact of the DSP core on die size has been greatly reduced, giving rise to the multiprocessor SoC (MPSoC). To cope with the heterogeneous and challenging on-chip communication requirements between the cores of MPSoC, new Network-on-Chip (NoC) technology replaces traditional bus-based global wirings with routers and switches to adapt advanced data communication concepts for better bandwidth utilization. Motivated by KeyFlow [68], which uses RNS for packet forwarding in SDN, we envision that RNS partitioning can also rescale the connection establishment flits and optimize packet routing in NoC communications, to overcome the drawbacks of the unnecessary delay increase due to the setup of a path from source to destination in circuit switching techniques and the inefficient link utilization in round robin arbitration.

The fact that power consumption increases with decreasing supply voltage for analog circuits is the fundamental reason why they cannot benefit from technology scaling like their digital counterparts. Thus we see the use of digital techniques to complement analog circuit design in mixed-signal ASICs. Digitally assisted analog circuits can offload the accuracy constraints, noise and nonlinear distortions, and process calibration problems to a digital processor. There is great promise for RNS, with its locality and fault-resilience, to play an important role in digitally assisted analog design and likewise to benefit from the elimination of parallel interconnections between cores by analog or current-mode signaling. Analog-to-residue (ARC) and residue-to-analog (RAC) converter designs are watersheds between the two domains, and they can be efficiently designed to eliminate the substantial overheads of ADC/DAC with RNS forward/reverse converters.

Last but not least, modular partitioning of RNS can ease functional partitioning with optimal use of interposers to create 3D-specific designs that do more than recasting a 2D optimal design into the third dimension. We believe that the design space offered by RNS has not been well explored by the design tools to benefit from the newly available 3D IC technology. Because of the stronger influence of temperature on floorplanning decisions, 3D ICs are also more severely affected by process variations. There are new opportunities for variability mitigation that are unique to RNS-based computations in 3D integration. The flexibility to do post-manufacturing sorting and to change the die stacking order for each 3D chip after the bare dies have been tested can be augmented by the error detection, isolation and correction of RRNS to significantly improve the performance and thermal yield. Similar potential may also exist in RNS implementation by other emerging technologies. For example, the short spin diffusion length of spintronic devices [3] may favor RNS over TCS implementation, and the nonvolatile logic-in-memory architecture, where nonvolatile memory elements are distributed over a logic-circuit plane, may resolve some deficiencies in using arbitrary moduli sets for RNS-based computation.

Chip Hong Chang received the B.Eng. (Hons.) degree from the National University of Singapore in 1989, and the M.Eng. and Ph.D. degrees from Nanyang Technological University (NTU) of Singapore in 1993 and 1998, respectively. He served as a Technical Consultant in industry prior to joining the School of Electrical and Electronic Engineering (EEE) of NTU in 1999, where he is currently an Associate Professor. He has held joint appointments with the university as Assistant Chair of Alumni of the School of EEE from 2008 to 2014, Deputy Director of the Center for High Performance Embedded Systems from 2000 to 2011, and Program Director of the Center for Integrated Circuits and Systems from 2003 to 2009. He has coedited two books, published eight book chapters and more than 200 refereed international journal and conference papers (mostly in IEEE). His current research interests include residue number system, hardware-oriented security, low power arithmetic circuits and digital filter design. He is the co-recipient of the Gold Leaf and Silver Leaf certificates of the 2010 Asia Pacific Conference on Postgraduate Research in Microelectronics and Electronics, and the recipient of the 2007 Collaboration Development Award for Microelectronics and Embedded Systems and the 2007 Microsystems Strategic Alliance of Quebec Research Collaboration Award. He is the co-author of the finalist of the best paper award of the 1995 IFIP International Conference on Very Large Scale Integration and the finalist of the best student paper competition award of the 2015 IEEE International Symposium on Circuits and Systems.

Dr. Chang has served as an Associate Editor of IEEE Access from 2013 to 2015, IEEE Transactions on Circuits and Systems-I from 2010 to 2013, IEEE Transactions on Very Large Scale Integration (VLSI) Systems since 2011, Integration, the VLSI Journal from 2013 to 2015, and Microelectronics Journal since May 2014. He was an editorial advisory board member of Open Electrical and Electronic Engineering Journal from 2007 to 2013 and Journal of Electrical and Computer Engineering from 2008 to 2014, as well as the guest editor of special issues for IEEE Transactions on Circuits and Systems-I, Journal of Circuits, Systems and Computers and Journal of Electrical and Computer Engineering. He also served in more than 40 international conference advisory and technical program committees. He is a senior member of IEEE and a Fellow of the Institution of Engineering and Technology (IET).

Amir Sabbagh Molahosseini received the B.Sc. degree from Shahid Bahonar University of Kerman, Iran in 2005, and the M.Sc. and Ph.D. degrees, both with the highest honors, from the Science and Research Branch of Islamic Azad University, Tehran, Iran, in 2007 and 2010, all in computer engineering. He is an Assistant Professor in the Department of Computer Engineering, Kerman Branch, Islamic Azad University, Kerman, Iran, where he has led the High-Performance Computer Arithmetic Group since 2010. He was also selected as his university’s best researcher in 2014. He is currently a visiting researcher in the Signal Processing Systems Group, INESC-ID, IST, University of Lisbon, Lisbon, Portugal. His current research interest is Computer Arithmetic with special emphasis on Residue Number Systems architectures and applications.

Azadeh Alsadat Emrani Zarandi received the B.S. degree from Shahid Bahonar University of Kerman (SBUK), Iran in 2006 and the M.S. degree from Isfahan University of Technology, Iran in 2008 with the highest honors. She started her Ph.D. in computer engineering at Science and Research Branch, Islamic Azad University, Tehran, Iran. She is also a lecturer in the computer engineering department of SBUK. She is currently a visiting researcher in the Signal Processing Systems Group, INESC-ID, IST, University of Lisbon, Lisbon, Portugal. Her main research interests are Residue Number Systems and Wireless Sensor Networks.

Thian Fatt Tay received the B.Eng. (Hons.) degree in Electrical and Electronic Engineering from Nanyang Technological University (NTU), Singapore, in 2011. He is currently working towards the Ph.D. degree in the Division of Circuits and Systems, School of Electrical and Electronic Engineering, NTU Singapore. His area of research is high-speed, low-power, and fault-tolerant computer arithmetic circuit designs.

References
[1] R. Schneiderman, “DSPs evolving in consumer electronics applications,” IEEE Signal Process. Mag., vol. 27, no. 3, pp. 6–10, May 2010.
[2] S. Lin, Y. B. Kim, and F. Lombardi, “CNTFET-based design of ternary logic gates and arithmetic circuits,” IEEE Trans. Nanotechnol., vol. 10, no. 2, pp. 217–225, Mar. 2011.
[3] S. Sugahara and J. Nitta, “Spin-transistor electronics: An overview and outlook,” Proc. IEEE, vol. 98, no. 12, pp. 2124–2154, Dec. 2010.
[4] E. E. Swartzlander, H. Cho, I. Kong, and S. W. Kim, “Computer arithmetic implemented with QCA: A progress report,” in Proc. Asilomar Conf. Signals, Systems and Computers, Pacific Grove, CA, Nov. 2010, pp. 1392–1398.
[5] J. Cong, V. Sarkar, G. Reinman, and A. Bui, “Customizable domain-specific computing,” IEEE Des. Test Comput., vol. 28, no. 2, pp. 6–15, Mar. 2011.
[6] J. Ouyang, G. Sun, Y. Chen, and L. Duan, “Arithmetic unit design using 180nm TSV-based 3D stacking technology,” in Proc. IEEE Int. Conf. 3-D System Integration, San Francisco, CA, Sept. 2009, pp. 1–4.
[7] A. Omondi and B. Premkumar, Residue Number Systems: Theory and Implementations. London: Imperial College Press, 2007.
[8] H. L. Garner, “The residue number system,” IRE Trans. Electron. Comput., vol. 8, no. 2, pp. 140–147, June 1959.
[9] P. W. Cheney, “A digital correlator based on the residue number system,” IRE Trans. Electron. Comput., vol. 10, no. 1, pp. 63–70, Mar. 1961.
[10] R. M. Guffin, “A computer for solving linear simultaneous equations using the residue number system,” IRE Trans. Electron. Comput., vol. 11, no. 2, pp. 164–173, Apr. 1962.



[11] R. D. Merrill, “Improving digital computer performance using residue number theory,” IRE Trans. Electron. Comput., vol. 13, no. 2, pp. 93–101, Apr. 1964.
[12] N. Szabo, “Sign detection in nonredundant residue systems,” IRE Trans. Electron. Comput., vol. 11, no. 4, pp. 494–500, Aug. 1962.
[13] Y. A. Keir, P. W. Cheney, and M. Tannenbaum, “Division and overflow detection in residue number systems,” IRE Trans. Electron. Comput., vol. 11, no. 4, pp. 501–507, Aug. 1962.
[14] A. Sasaki, “Addition and subtraction in the residue number system,” IEEE Trans. Electron. Comput., vol. 16, no. 2, pp. 157–164, Apr. 1967.
[15] D. K. Banerji and J. A. Brzozowski, “On translation algorithms in residue number systems,” IEEE Trans. Comput., vol. 21, no. 12, pp. 1281–1285, Dec. 1972.
[16] R. W. Watson and C. W. Hastings, “Self-checked computation using residue arithmetic,” Proc. IEEE, vol. 54, no. 12, pp. 1920–1931, Dec. 1966.
[17] G. A. Jullien, W. C. Miller, J. J. Soltis, A. Baraniecka, and B. Tseng, “Hardware realization of digital signal processing elements using the residue number system,” in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Hartford, CT, May 1977, vol. 2, pp. 506–510.
[18] K. C. Posch and R. Posch, “Residue number systems: a key to parallelism in public key cryptography,” in Proc. IEEE Symp. Parallel and Distributed Processing, Arlington, TX, Dec. 1992, pp. 432–435.
[19] A. B. Premkumar, E. L. Ang, and E. M. Lai, “Improved memoryless RNS forward converter based on the periodicity of residues,” IEEE Trans. Circuits Syst. II, vol. 53, no. 2, pp. 133–137, Feb. 2006.
[20] J. Y. S. Low and C. H. Chang, “A new approach to the design of efficient residue generators for arbitrary moduli,” IEEE Trans. Circuits Syst. I, vol. 60, no. 9, pp. 2366–2374, Sept. 2013.
[21] R. A. Patel, M. Benaissa, N. Powell, and S. Boussakta, “Novel power-delay-area-efficient approach to generic modular addition,” IEEE Trans. Circuits Syst. I, vol. 54, pp. 1279–1292, June 2007.
[22] S. Ma, J. H. Hu, and C. H. Wang, “A novel modulo 2^n-2^k-1 adder for residue number system,” IEEE Trans. Circuits Syst. I, vol. 60, no. 11, pp. 1166–1178, Nov. 2013.
[23] R. A. Patel, M. Benaissa, and S. Boussakta, “Fast parallel-prefix architectures for modulo 2^n-1 addition with a single representation of zero,” IEEE Trans. Comput., vol. 56, no. 11, pp. 1484–1492, Nov. 2007.
[24] H. T. Vergos and G. Dimitrakopoulos, “On modulo 2^n+1 adder design,” IEEE Trans. Comput., vol. 61, no. 2, pp. 173–186, Feb. 2012.
[25] L. Sousa and R. Chaves, “A universal architecture for designing efficient modulo 2^n+1 multipliers,” IEEE Trans. Circuits Syst. I, vol. 52, no. 6, pp. 1166–1178, June 2005.
[26] R. Muralidharan and C. H. Chang, “Radix-8 Booth encoded modulo 2^n-1 multipliers with adaptive delay for high dynamic range residue number system,” IEEE Trans. Circuits Syst. I, vol. 58, no. 5, pp. 982–993, May 2011.
[27] R. Muralidharan and C. H. Chang, “Area-power efficient modulo 2^n-1 and modulo 2^n+1 multipliers for {2^n-1, 2^n, 2^n+1} based RNS,” IEEE Trans. Circuits Syst. I, vol. 59, no. 10, pp. 2263–2274, Oct. 2012.
[28] R. Muralidharan and C. H. Chang, “Radix-4 and radix-8 Booth encoded multi-modulus multipliers,” IEEE Trans. Circuits Syst. I, vol. 60, no. 11, pp. 2940–2952, Nov. 2013.
[29] T. Tomczak, “Fast sign detection for RNS (2^n-1, 2^n, 2^n+1),” IEEE Trans. Circuits Syst. I, vol. 55, no. 6, pp. 1502–1511, July 2008.
[30] C. H. Chang and S. Kumar, “Area-efficient and fast sign detection for four-moduli set RNS {2^n-1, 2^n, 2^n+1, 2^{2n}+1},” in Proc. IEEE Int. Symp. Circuits and Systems (ISCAS), Melbourne, June 2014, pp. 1–5.
[31] S. Bi and W. J. Gross, “The mixed-radix Chinese remainder theorem and its applications to residue comparison,” IEEE Trans. Comput., vol. 57, no. 12, pp. 1624–1632, Dec. 2008.
[32] T. F. Tay and C. H. Chang, “New algorithm for signed integer comparison in four-moduli superset {2^n, 2^n-1, 2^n+1, 2^{n+1}-1},” in Proc. IEEE Asia Pacific Conf. Circuits and Systems (APCCAS), Ishigaki, Nov. 2014, pp. 519–522.
[33] H. Siewobr and K. A. Gbolagade, “RNS overflow detection by operands examination,” Int. J. Comput. Appl., vol. 85, no. 18, pp. 1–5, Jan. 2014.
[34] C. H. Chang and J. Y. S. Low, “Simple, fast and exact RNS scaler for the three-moduli set {2^n-1, 2^n, 2^n+1},” IEEE Trans. Circuits Syst. I, vol. 58, no. 11, pp. 2686–2697, Nov. 2011.
[35] J. Y. S. Low and C. H. Chang, “A VLSI efficient programmable power-of-two scaler for {2^n-1, 2^n, 2^n+1} RNS,” IEEE Trans. Circuits Syst. I, vol. 59, no. 12, pp. 2911–2919, Dec. 2012.
[36] L. Sousa, “2^n RNS scalers for extended 4-moduli sets,” IEEE Trans. Comput., to be published.
[37] T. F. Tay and C. H. Chang, “A non-iterative multiple residue digit error detection and correction algorithm in RRNS,” IEEE Trans. Comput., to be published.
[38] V. T. Goh and M. U. Siddiqi, “Multiple error detection and correction based on redundant residue number systems,” IEEE Trans. Commun., vol. 56, no. 3, pp. 325–330, Mar. 2008.
[39] S. J. Piestrak, “Design of residue generators and multioperand modular adders using carry-save adders,” IEEE Trans. Comput., vol. 43, no. 1, pp. 68–77, Jan. 1994.
[40] P. M. F. M. Matutino, R. Chaves, and L. Sousa, “Arithmetic-based binary-to-RNS converter modulo {2^n±k} for jn-bit dynamic range,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 23, no. 3, pp. 603–607, Mar. 2015.
[41] G. Petrousov and M. Dasygenis, “A unique network EDA tool to create optimized ad hoc binary to residue number system converters,” in Proc. 24th Int. Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS), Palma de Mallorca, Sept. 2014, pp. 1–8.
[42] B. Cao, C. H. Chang, and T. Srikanthan, “An efficient reverse converter for the 4-moduli set {2^n-1, 2^n, 2^n+1, 2^{2n}+1} based on the New Chinese Remainder Theorem,” IEEE Trans. Circuits Syst. I, vol. 50, no. 10, pp. 1296–1303, Oct. 2003.
[43] A. S. Molahosseini, K. Navi, C. Dadkhah, O. Kavehei, and S. Timarchi, “Efficient reverse converter designs for the new 4-moduli sets {2^n-1, 2^n, 2^n+1, 2^{2n+1}-1} and {2^n-1, 2^n+1, 2^{2n}, 2^{2n}+1} based on new CRTs,” IEEE Trans. Circuits Syst. I, vol. 57, no. 4, pp. 823–835, Apr. 2010.
[44] H. Pettenghi, R. Chaves, and L. Sousa, “Method to design general RNS reverse converters for extended moduli sets,” IEEE Trans. Circuits Syst. II, vol. 60, no. 12, Dec. 2013.
[45] L. Sousa and S. Antao, “MRC-based RNS reverse converters for the four-moduli sets {2^n-1, 2^n, 2^n+1, 2^{2n+1}-1} and {2^n-1, 2^n+1, 2^{2n}, 2^{2n+1}-1},” IEEE Trans. Circuits Syst. II, vol. 59, no. 4, Apr. 2012.
[46] L. Sousa and S. Antao, “On the design of RNS reverse converters for the four-moduli set {2^n, 2^n-1, 2^n+1, 2^{n+1}-1},” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 21, no. 10, pp. 1945–1949, Oct. 2013.
[47] B. Cao, C. H. Chang, and T. Srikanthan, “A residue-to-binary converter for a new 5-moduli set,” IEEE Trans. Circuits Syst. I, vol. 54, no. 5, pp. 1041–1049, May 2007.
[48] P. Patronik and S. J. Piestrak, “Design of reverse converters for general RNS moduli sets {2^k, 2^n-1, 2^n+1, 2^{n+1}-1} and {2^k, 2^n-1, 2^n+1, 2^{n-1}-1} (n even),” IEEE Trans. Circuits Syst. I, vol. 61, no. 6, pp. 1687–1700, June 2014.
[49] A. A. E. Zarandi, A. S. Molahosseini, M. Hosseinzadeh, S. Sorouri, S. Antao, and L. Sousa, “Reverse converter design via parallel-prefix adders: novel components, methodology and implementations,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 23, no. 2, pp. 374–378, Feb. 2015.
[50] W. K. Jenkins and B. J. Leon, “The use of residue number systems in the design of finite impulse response digital filters,” IEEE Trans. Circuits Syst., vol. 24, no. 4, pp. 191–201, Apr. 1977.
[51] T. Stouraitis and V. Paliouras, “Considering the alternatives in low-power design,” IEEE Circuits Devices, vol. 7, no. 4, pp. 22–29, July 2001.
[52] R. Conway and J. Nelson, “Improved RNS FIR filter architectures,” IEEE Trans. Circuits Syst. II, vol. 51, no. 1, pp. 26–28, Jan. 2004.
[53] G. C. Cardarilli, A. Nannarelli, and M. Re, “Residue number system for low-power DSP applications,” in Proc. 41st Asilomar Conf. Signals, Systems, and Computers, Pacific Grove, CA, Nov. 2007, pp. 1412–1416.
[54] C. H. Vun, A. B. Premkumar, and W. Zhang, “A new RNS based DA approach for inner product computation,” IEEE Trans. Circuits Syst. I, vol. 60, no. 8, pp. 2139–2152, Jan. 2013.
[55] J. C. Bajard, L. S. Didier, and T. Hilair, “p-Direct form transposed and residue number systems for filter implementations,” in Proc. 54th IEEE Int. Midwest Symp. Circuits and Systems, Seoul, Aug. 2011, pp. 1–4.
[56] M. Petricca, P. Albicocco, G. C. Cardarilli, A. Nannarelli, and M. Re, “Power efficient design of parallel/serial FIR filters in RNS,” in Proc. 46th Asilomar Conf. Signals, Systems and Computers, Pacific Grove, CA, Nov. 2012, pp. 1015–1019.
[57] S. Pontarelli, G. C. Cardarilli, M. Re, and A. Salsano, “Totally fault tolerant RNS based FIR filters,” in Proc. Int. Symp. On-Line Testing, Rhodes, July 2008, pp. 192–194.
[58] E. Vassalos and D. Bakalis, “CSD-RNS-based single constant multipliers,” J. Signal Process. Syst., vol. 67, no. 3, pp. 255–268, June 2012.
[59] I. Shuli, M. Petricca, G. C. Cardarilli, A. Nannarelli, and M. Re, “Multiple constant multiplication through residue number system,” in Proc. 43rd Asilomar Conf. Signals, Systems and Computers, Pacific Grove, CA, Nov. 2009, pp. 736–739.
[60] Z. Gao, P. Reviriego, W. Pan, Z. Xu, M. Zhao, et al., “Efficient arithmetic-residue-based SEU-tolerant FIR filter design,” IEEE Trans. Circuits Syst. II, vol. 60, no. 8, pp. 497–501, Aug. 2013.
[61] Z. Luan, X. Chen, N. Ge, and Z. Wang, “Simplified fault tolerant FIR filter architecture based on redundant residue number system,” Electron. Lett., vol. 50, no. 23, pp. 1768–1770, Nov. 2014.
[62] J. Ramirez, U. Meyer-Base, F. Taylor, A. Garcia, and A. Lloris, “Design and implementation of high-performance RNS wavelet processors for custom IC technologies,” J. VLSI Signal Process., vol. 34, no. 3, pp. 227–237, July 2003.
[63] T. Toivonen and J. Heikkilä, “Video filtering with Fermat number theoretic transforms using residue number system,” IEEE Trans. Circuits Syst. Video Technol., vol. 16, no. 1, pp. 92–101, Jan. 2006.
[64] W. Wang, M. N. S. Swamy, and M. O. Ahmad, “RNS application for digital image processing,” in Proc. IEEE Int. Workshop on System-on-Chip for Real Time Applications, Banff, Alberta, July 2004, pp. 19–21.
[65] E. Vassalos, D. Bakalis, and H. T. Vergos, “RNS assisted image filtering and edge detection,” in Proc. IEEE Int. Conf. Digital Signal Processing (DSP), Fira, July 2013, pp. 1–6.
[66] N. I. Chervyakov, P. A. Lyakhov, and M. G. Babenko, “Digital filtering of images in a residue number system using finite-field wavelets,” J. Autom. Control Comput. Sci., vol. 48, no. 3, pp. 180–189, May 2014.
[67] J. Chen and J. Hu, “Energy-efficient digital signal processing via voltage-overscaling-based residue number system,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 21, no. 7, pp. 1322–1332, July 2013.
[68] M. Martinello, M. R. N. Ribeiro, R. E. Z. D. Oliveira, and R. D. A. Vitoi, “KeyFlow: a prototype for evolving SDN toward core network fabrics,” IEEE Netw., vol. 28, no. 2, pp. 12–19, Mar. 2014.
[69] G. Campobello, A. Leonardi, and S. Palazzo, “Improving energy saving and reliability in wireless sensor networks using a simple CRT-based packet-forwarding solution,” IEEE/ACM Trans. Network., vol. 20, no. 1, pp. 191–205, Feb. 2012.
[70] J. Alves, Jr., L. F. L. Nascimento, and L. C. P. Albini, “Using the redundant residue number system to increase routing dependability on mobile ad hoc networks,” Cyber J.: J. Sel. Areas Telecommun., vol. 1, no. 1, pp. 67–73, Jan. 2011.
[71] A. Celesti, M. Fazio, M. Villari, and A. Puliafito, “Adding long-term availability, obfuscation, and encryption to multi-cloud storage systems,” Elsevier J. Netw. Comput. Appl., to be published.
[72] V. P. Nelson, “Fault-tolerant computing: fundamental concepts,” Computer, vol. 23, no. 7, pp. 19–25, July 1990.
[73] N. Z. Haron and S. Hamdioui, “Redundant residue number system code for fault-tolerant hybrid memories,” ACM J. Emerg. Technol. Comput. Syst., vol. 7, no. 1, Jan. 2011.
[74] T. Keller, T.-H. Liew, and L. Hanzo, “Adaptive redundant residue number system coded multicarrier modulation,” IEEE J. Sel. Areas Commun., vol. 18, no. 11, pp. 2292–2301, Nov. 2000.
[75] S. Avik and N. Balasubramaniam, “Performance of systematic RRNS based space-time block codes with probability-aware adaptive demapping,” IEEE Trans. Wireless Commun., vol. 12, no. 5, pp. 2458–2469, May 2013.
[76] J. C. Bajard and L. Imbert, “A full RNS implementation of RSA,” IEEE Trans. Comput., vol. 53, no. 6, pp. 769–774, June 2004.
[77] D. M. Schinianakis, A. P. Fournaris, H. E. Michail, A. P. Kakarountas, and T. Stouraitis, “An RNS implementation of an Fp elliptic curve point multiplier,” IEEE Trans. Circuits Syst. I, vol. 56, no. 6, pp. 1202–1213, June 2009.
[78] M. Esmaeildoust, D. Schinianakis, H. Javashi, T. Stouraitis, and K. Navi, “Efficient RNS implementation of elliptic curve point multiplication over GF(p),” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 21, no. 8, pp. 1545–1549, Aug. 2013.
[79] D. M. Schinianakis and T. Stouraitis, “Multifunction residue architectures for cryptography,” IEEE Trans. Circuits Syst. I, vol. 61, no. 4, pp. 1156–1169, Apr. 2014.
[80] K. C. Posch and R. Posch, “Modulo reduction in residue number systems,” IEEE Trans. Parallel Distrib. Syst., vol. 6, no. 5, pp. 449–454, May 1995.
[81] C. Gentry, “A fully homomorphic encryption scheme,” Ph.D. thesis, Stanford Univ., 2009.
[82] D. Archer and K. Rohloff, “Computing with data privacy: steps toward realization,” IEEE Security Privacy, vol. 13, no. 1, pp. 22–29, Jan. 2015.
[83] W. Wang, Y. Hu, L. Chen, X. Huang, and B. Sunar, “Exploring the feasibility of fully homomorphic encryption,” IEEE Trans. Comput., vol. 64, no. 3, pp. 698–706, Mar. 2015.
[84] Y. Doroz, E. Ozturk, and B. Sunar, “Accelerating fully homomorphic encryption in hardware,” IEEE Trans. Comput., vol. 64, no. 6, pp. 1509–1521, June 2015.
[85] W. Wang, Z. Chen, and X. Huang, “Accelerating leveled fully homomorphic encryption using GPU,” in Proc. IEEE Int. Symp. Circuits and Systems, Melbourne, June 2014, pp. 2800–2803.
[86] M. Gomathisankaran, A. Tyagi, and K. Namuduri, “HORNS: a homomorphic encryption scheme for cloud computing using residue number system,” in Proc. 45th Annu. Conf. Information Sciences and Systems (CISS), Baltimore, MD, Mar. 2011, pp. 1–5.
[87] V. X. Standaert, “Introduction to side-channel attacks,” in Secure Integrated Circuits and Systems, edited by I. M. R. Verbauwhede. New York: Springer, 2010, pp. 27–42.
[88] J.-C. Bajard, L. Imbert, P.-Y. Liardet, and Y. Teglia, “Leak resistant arithmetic,” in Cryptographic Hardware and Embedded Systems, Lecture Notes in Computer Science, vol. 3156. New York: Springer, 2004, pp. 62–75.
[89] G. Perin, L. Imbert, L. Torres, and P. Maurine, “Electromagnetic analysis on RSA algorithm based on RNS,” in Proc. Euromicro Conf. Digital System Design (DSD), Los Alamitos, CA, Sept. 2013, pp. 345–352.
[90] J. Chu and M. Benaissa, “Error detecting AES using polynomial residue number systems,” Microprocessors Microsyst., vol. 37, no. 2, pp. 228–234, Mar. 2013.
[91] J. A. Ambrose, H. Pettenghi, D. Jayasinghe, and L. Sousa, “Randomised multi-modulo residue number system architecture for double-and-add to prevent power analysis side channel attacks,” IET Circuits, Devices Syst., vol. 7, no. 5, pp. 283–293, Sept. 2013.
[92] P. Albicocco, G. C. Cardarilli, A. Nannarelli, and M. Re, “Twenty years of research on RNS for DSP: lessons learned and future perspectives,” in Proc. 14th Int. Symp. Integrated Circuits (ISIC), Singapore, Dec. 2014, pp. 436–439.
[93] V. P. Shmerko, Computer Arithmetics for Nanoelectronics. Boca Raton, FL: CRC Press, 2009.
[94] C. Gamrat, “Challenges and perspectives of computer architecture at the nano scale,” in Proc. IEEE Computer Society Annual Symp. VLSI (ISVLSI), Lixouri, Kefalonia, July 2010, pp. 8–10.
[95] I. Kouretas and V. Paliouras, “Variation-tolerant design using residue number system,” in Proc. 12th Euromicro Conf. Digital System Design, Architectures, Methods and Tools, Patras, Aug. 2009, pp. 157–163.
[96] I. Kouretas and V. Paliouras, “Delay-variation-tolerant FIR filter architectures based on the residue number system,” in Proc. IEEE Int. Symp. Circuits and Systems (ISCAS), Beijing, May 2013, pp. 2223–2226.
[97] D. H. K. Hoe, C. Martinez, and S. J. Vundavalli, “Design and characterization of parallel prefix adders using FPGAs,” in Proc. IEEE Southeastern Symposium on System Theory, Auburn, AL, Mar. 2011, pp. 168–172.
[98] L. S. Didier and L. Jaulmes, “Fast modulo 2^n-1 and 2^n+1 adder using carry-chain on FPGA,” in Proc. Asilomar Conf. Signals, Systems and Computers, Pacific Grove, CA, Nov. 2013, pp. 1155–1159.
[99] S. Pontarelli, G. C. Cardarilli, M. Re, and A. Salsano, “Optimized implementation of RNS FIR filters based on FPGAs,” J. Signal Process. Syst., vol. 67, no. 3, pp. 201–212, June 2012.
[100] S. Antao, J. C. Bajard, and L. Sousa, “RNS based elliptic curve point multiplication for massive parallel architectures,” Comput. J., vol. 55, no. 5, pp. 629–647, May 2012.
[101] S. Antao and L. Sousa, “The CRNS framework and its application to programmable and reconfigurable cryptography,” ACM Trans. Architect. Code Optim., vol. 9, no. 4, pp. 33:1–33:25, Jan. 2013.
[102] E. B. Olsen, “Residue number arithmetic logic unit,” U.S. patent US20130311532A1, Nov. 2013.
[103] R. Chokshi, K. S. Berezowski, A. Shrivastava, and S. J. Piestrak, “Exploiting residue number system for power-efficient digital signal processing in embedded processors,” in Proc. Int. Conf. Compilers, Architecture, and Synthesis for Embedded Systems, Grenoble, Oct. 2009, pp. 19–28.
[104] T. F. Tay, C. H. Chang, and L. Sousa, “Base transformation with injective residue mapping for dynamic range reduction in RNS,” IEEE Trans. Circuits Syst. I, to be published.
