Design Techniques For Fault Tolerance
Designing Fault-Tolerant Techniques for SRAM-Based FPGAs
Fernanda Gusmão de Lima Kastensmidt
State University of Rio Grande do Sul
Gustavo Neuberger, Renato Fernandes Hentschke, Luigi Carro, and Ricardo Reis
Federal University of Rio Grande do Sul
0740-7475/04/$20.00 © 2004 IEEE. Copublished by the IEEE CS and the IEEE CASS. IEEE Design & Test of Computers
[Figure 1 schematic: block RAM, lookup table with inputs F1 through F4, flip-flop, and configuration memory cells M.]
Figure 1. Bits sensitive to single-event upsets (SEUs) in the configurable-logic-block tile schematic. Inputs F1
through F4 are the four 1-bit input signals of the lookup table. M is the configuration memory cell.
transistor’s gate. The effect can invert the stored value, that is, produce a bit flip in the memory cell. This effect is called a single-event upset (SEU) or soft error, and it’s a major concern in digital circuits. When a charged particle hits the combinational logic block, it also generates a transient current pulse. This phenomenon is called a single-event transient (SET).

In FPGAs, an upset has a peculiar effect when it hits the combinational and sequential logic mapped into the programmable architecture. For example, consider SRAM-based FPGAs such as those from Xilinx’s Virtex series, one of the most popular series of programmable devices on the market. Virtex devices include a flexible, regular architecture comprising an array of configurable logic blocks (CLBs) surrounded by programmable I/O blocks, all interconnected by a hierarchy of fast and versatile routing resources.2 The CLBs provide the functional elements for constructing logic; the I/O blocks provide the interface between the package pins and the CLBs. A general routing matrix interconnects the CLBs. This matrix includes an array of routing switches located at the intersections of horizontal and vertical routing channels. Virtex devices also provide dedicated 4,096-bit memory blocks called block-select RAMs, clock delay-locked loops (DLLs) for clock-distribution delay compensation and clock domain control, and two tristate buffers associated with each CLB.

Users can quickly program a Virtex device by loading a configuration bitstream (a collection of configuration bits) into it. They can change device functionality at any time by loading in a new bitstream. The bitstream contains all the information to configure the programmable storage elements in the matrix located in the lookup tables (LUTs), flip-flops, CLB configuration cells, and interconnections, as Figure 1 shows. All these configuration bits are potentially sensitive to SEUs; hence, we targeted them in our investigation.

In an ASIC, the effect of a particle hitting either the combinational or the sequential logic is transient; the only variation is how long the fault lasts. A fault in the combinational logic is a transient logic pulse in a node that can disappear according to the logic delay and topology. In other words, a storage cell might or might not latch a transient fault from the combinational logic. Faults in the sequential logic manifest themselves as bit flips, which remain in the storage cell until the next load.

In an SRAM-based FPGA, customizable memory cells (the SRAM cells in Figure 1) implement both the user’s combinational and sequential logic. When an upset occurs in the combinational logic synthesized in the FPGA, it corresponds to a bit flip in one of the LUT’s cells or in the cells that control the routing. An upset in an LUT memory cell modifies the implemented combinational logic, as Figure 2a shows. This upset has a permanent effect, and is correctable only at the next load of the configuration bitstream. This effect is similar to a stuck-at fault at 1 or 0 in the combinational logic defined by that LUT. Thus, a storage cell latches the upset from the FPGA’s combinational logic, unless the FPGA uses some detection technique. An upset in the routing can connect or disconnect a wire in the matrix, as Figure 2b shows. It also has a permanent effect, which can manifest as an open or a short circuit in the combinational logic implemented by the FPGA. The configuration bitstream’s next load corrects this fault.
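To make the LUT case concrete, the following minimal sketch models a four-input LUT as a 16-bit configuration word and flips one cell. The function names (`lut_eval`, `build_config`, `inject_seu`) and the bit ordering of inputs F1 through F4 are assumptions chosen for illustration; they do not describe the actual Virtex bitstream layout.

```python
# Hypothetical model: a 4-input LUT is a 16-entry truth table held in
# 16 configuration memory cells (one bit per input combination).

def lut_eval(config: int, f1: int, f2: int, f3: int, f4: int) -> int:
    """Return the LUT output for inputs F1..F4 given a 16-bit configuration."""
    address = f1 | (f2 << 1) | (f3 << 2) | (f4 << 3)  # assumed bit ordering
    return (config >> address) & 1

def build_config(func) -> int:
    """Build the 16-bit configuration word implementing func(f1, f2, f3, f4)."""
    config = 0
    for address in range(16):
        bits = [(address >> i) & 1 for i in range(4)]
        config |= (func(*bits) & 1) << address
    return config

def inject_seu(config: int, cell: int) -> int:
    """Flip one configuration memory cell, as an SEU would."""
    return config ^ (1 << cell)

and4 = build_config(lambda a, b, c, d: a & b & c & d)
upset = inject_seu(and4, cell=0)  # the cell addressed by F1=F2=F3=F4=0

# The modified function persists for every evaluation until the original
# configuration word is loaded again.
print(lut_eval(and4, 0, 0, 0, 0), lut_eval(upset, 0, 0, 0, 0))  # prints: 0 1
```

In this model the upset changes the output for exactly one input combination, and it keeps doing so on every evaluation, which is why the text compares it to a stuck-at fault rather than to a transient pulse.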
November–December 2004
Figure 3. Triple modular redundancy for Xilinx FPGAs. The throughput logic is triplicated, represented by TMR
combinational modules tr0, tr1, and tr2. The registers are also triplicated and are voted on by majority voters; they
also have a mechanism to correct upsets in the multiplexers.
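The voted register feedback in Figure 3 can be sketched in software as follows. This is a behavioral illustration only, assuming an 8-bit counter as the state machine and one voter per redundant domain; the names `majority` and `tmr_step` are hypothetical, and the sketch models an upset in a stored register value rather than in the configuration memory.

```python
WIDTH_MASK = 0xFF  # assumed 8-bit register width

def majority(a: int, b: int, c: int) -> int:
    """Bitwise two-out-of-three majority vote."""
    return (a & b) | (a & c) | (b & c)

def tmr_step(regs):
    """One clock cycle of a triplicated counter with voted feedback.

    regs holds the three redundant register values. Each domain has its
    own voter, so no single voter is a single point of failure.
    """
    voted = [majority(*regs) for _ in range(3)]
    return [(v + 1) & WIDTH_MASK for v in voted]

regs = [5, 5, 5]
regs[1] ^= 0b100       # upset one redundant register: [5, 1, 5]
regs = tmr_step(regs)  # the voters mask the upset and all domains resynchronize
print(regs)            # prints: [6, 6, 6]
```

Because every domain computes its next state from the voted value, the corrupted register is overwritten on the next clock edge instead of locking on the wrong value.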
structure you need to mitigate. Logic falls into four different structure types: throughput, state machine, I/O, and special features (select-RAM blocks, DLLs, and so on).

Throughput logic is a logic module of any size or functionality, synchronous or asynchronous, where all logic paths flow from the module’s inputs to its outputs without forming a logic loop. In this case, all that’s necessary is to triplicate the logic, creating three redundant logic parts (0, 1, and 2). No voters are required, because the FPGA output will be voted on later by default. State-machine logic is any structure where a registered output, at any register stage in the module, feeds back into any prior stage in the module, forming a registered logic loop. This structure is common in accumulators, counters, and any custom state machine or state sequencer in which each internal register’s state depends on its own previous state. In this case, it’s necessary to triplicate the logic and to have majority voters in the outputs. To ensure that a register doesn’t lock on the wrong value, each redundant logic part in the feedback path has a voter so that the system can recover itself. One LUT can easily implement the majority voter. For designs constrained by available logic resources, you can implement the majority voter using Virtex tristate buffers rather than LUTs.

The primary purpose of using a TMR design methodology is to remove all single points of failure from the design. Therefore, each redundant part that uses FPGA inputs should have its own set of inputs. Thus, if an input suffers a failure, it affects only one of the redundant logic parts. The outputs are the key to the overall TMR strategy. Because full TMR generates every logic path in triplicate, it’s necessary to bring these three logic paths back to a single path that doesn’t create a single point of failure. You can do this by placing TMR output voters inside the output logic block. Figure 3 illustrates the TMR technique.

Scrubbing lets a system repair SEUs in the configuration memory without disrupting operations. The Virtex Select-MAP interface performs this scrubbing. When an FPGA is in this mode, an external oscillator generates a configuration clock that drives the programmable ROM (PROM) and the FPGA. At each clock cycle, new data is available on the PROM data pins. One example is the Flash PROM XQR18V04, which provides a parallel throughput of up to 264 Mbps at 33 MHz. The scrubbing cycle time depends on the configuration clock frequency and the readback bitstream size.

Previous results based on fault injection and radiation ground testing show the Virtex TMR design techniques’ reliability.8,11 However, the TMR technique has limitations, such as high area overhead, three times more input and output pins, and a significant increase in power dissipation. Many applications can accept these limitations, but some cannot.

Reducing TMR overheads by combining hardware and time redundancy

To reduce the number of pins used by the TMR approach and to deal with permanent upset effects, we present a new technique based on time and hardware
[Figure 5 diagram: (a) Cycle I, normal operation; (b) Cycle II, detection operation. Each panel shows operands A and B selected by ST into blocks dr0 and dr1 (clock 0 and clock 1), the decode blocks, the comparators tc0, Hc, and tc1, and the voter, along time axis t.]
Figure 5. DWC combined with CED technique for SRAM-based FPGAs: normal (a) and fault detection (b) operation.
ST is a state signal from the voter block that puts the system in detection operation (ST = 1); dr0 and dr1 are the
combinational logic blocks; Tc0 and Tc1 are time comparisons; and Hc is the hardware comparison. The voter block
also generates a state error signal (ST_error) and signals to enable the fault-free block (enable_dr0 and enable_dr1).
[Figure 6 diagram: panels (a) and (b), showing enable signals en0, en1, and en2, blocks dr0 and dr1, three majority voters, and the output pads.]
Figure 6. Example implementations when the combinational output is registered (a) or in the pad (b). Each
majority voter block receives the signal from the tr0, tr1, and tr2 registers. The enable_dr0 and enable_dr1 signals
decide which fault-free blocks should pass through the logic output to the registers or to the pads. To improve
reliability in routing, there are three enable signals (en1, en2, and en3), each a 1-bit signal with a logic value of 0
or 1. The outputs from the majority voter blocks are trv0, trv1, and trv2.
manent faults. These include bitwise inversion, recomputing with shifted operands (RESO), and recomputing with swapped operands (REWSO). We implement the CED block using Patel and Fung’s RESO technique.12 This RESO method includes encoding and decoding blocks and a register.

During normal operation (time t0), dr0 and dr1 work simultaneously; the CED block stores the outputs in sample registers for further comparison, and the voter block continually compares the dr0 and dr1 outputs, as Figure 5a shows. If a mismatch occurs between these outputs, the output registers hold their original value for an extra clock cycle, while the CED block’s RESO detects the fault. During this second clock cycle, the operands shift prior to use such that errors from permanent faults in the combinational logic are different in the first calculation than in the second. Comparing the results can identify these different errors, as Figure 5b shows. The encoding blocks are simple multiplexers, and the decoding blocks are simple connections.

For registered outputs, each output goes directly to the input of the user’s TMR register. Figure 6a shows the logic scheme. Block dr0 connects to TMR combinational module tr0, and block dr1 connects to module tr1. While the circuit searches for faults, the user’s TMR register holds its previous value. When the circuit finds the fault-free module, tr2 receives that module’s output, and continues receiving it until the next chip reconfiguration (fault correction). By default, the circuit starts passing the output of dr0 to tr2. For unregistered outputs, the circuit can drive the signals directly to the next combinational module or to the I/O pads, as Figure 6b shows.

The important characteristic of our method is that it doesn’t incur a high performance penalty when the system has no faults or only a single fault. This method needs only one clock cycle in a hold operation to detect the faulty module; then it operates normally again without performance penalties. The final clock period is the original clock period plus the propagation delay of the encoders, decoders, and output comparator.

The voter block contains comparators and a small state machine to identify the operation’s fault-free state or to signal an error. Figure 7 shows this logic’s state diagram. The state machine’s inputs are hardware comparison Hc and time comparisons Tc0 and Tc1, represented by the 2-bit signal, Tc. The state machine’s outputs constitute a 4-bit vector (shown in Figure 7 after the slash) indicating the detection state (ST), the error state, enable_dr0, and enable_dr1. Signals enable_dr0 and enable_dr1 are used for the unregistered outputs (Figure 6b); when the output is registered, only enable_dr0 is used (Figure 6a).
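The two-cycle detection scheme described above can be illustrated with a small behavioral model. This is a sketch under stated assumptions, not the state machine of Figure 7: the combinational block is assumed to be an adder, the permanent fault is modeled as one output bit stuck at 0 or 1, the datapath is assumed to have one spare bit so that a left shift does not overflow, and the names `make_adder`, `reso_check`, and `dwc_ced_cycle` are hypothetical.

```python
WIDTH = 9                 # assumed datapath width, including one spare bit for the shift
MASK = (1 << WIDTH) - 1

def make_adder(stuck_bit=None, stuck_value=0):
    """Return an adder; optionally force one output bit (permanent fault model)."""
    def adder(a, b):
        result = (a + b) & MASK
        if stuck_bit is not None:
            if stuck_value:
                result |= 1 << stuck_bit
            else:
                result &= ~(1 << stuck_bit)
        return result
    return adder

def reso_check(module, a, b):
    """Time comparison (Tc): recompute with shifted operands and compare.

    Encoding shifts both operands left by one bit; decoding shifts the
    result right by one bit. A fault-free adder gives the same value twice.
    """
    first = module(a, b)
    second = module(a << 1, b << 1) >> 1
    return first == second

def dwc_ced_cycle(dr0, dr1, a, b):
    """Return (output, status) for one operation on 7-bit operands."""
    out0, out1 = dr0(a, b), dr1(a, b)
    if out0 == out1:                      # Hc: hardware comparison matches
        return out0, "no mismatch"
    tc0, tc1 = reso_check(dr0, a, b), reso_check(dr1, a, b)  # detection cycle
    if tc0 and not tc1:
        return out0, "dr1 faulty; pass dr0"
    if tc1 and not tc0:
        return out1, "dr0 faulty; pass dr1"
    return None, "error"

good = make_adder()
bad = make_adder(stuck_bit=1, stuck_value=0)   # output bit 1 stuck at 0
print(dwc_ced_cycle(good, bad, 3, 4))          # prints: (7, 'dr1 faulty; pass dr0')
```

The shift moves the stuck bit to a different position of the result, so the faulty module disagrees with itself across the two computations while the fault-free module does not. That asymmetry is what lets two copies, rather than three, identify which output to trust.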
Table 2. Results for a 16-bit multiplier with a register in the output implemented in an XCV300-PQ240 FPGA.
Fault tolerance  Maximum     No. of    No. of four-  No. of      Estimated power dissipation (mW)
technique        delay (ns)  I/O pads  input LUTs    flip-flops  Clock  Nets  Logic  Inputs  Outputs  Total
None             54          67        495           32          7      88    186    2       29       312
TMR              56          201       1,709         96          22     305   718    7       88       1,140
DWC-CED          62          169       1,706         162         22     282   542    5       83       934
[Figure 8 diagram: sequential logic chain from the input pads through IN_tr0 and registers R2_tr0 through R11_tr0; combinational logic with constant multipliers C1_dr0 through C6_dr0 and back down to C1_dr0, followed by an adder chain producing OUT_dr0 at the output pads.]
Figure 8. Digital, low-pass filter with 11 taps and 9 bits. The figure represents only one redundant block (dr0) out
of two for the combinational logic, and one redundant block (tr0) out of three for the sequential logic. IN_tr0 is the
input to TMR combinational module tr0; R2_tr0 through R11_tr0 are the registers of tr0; C1_dr0 through C6_dr0 are
constants of the filter. In these labels, dr0 indicates that DWC protects the combinational logic such that only dr0
and dr1 (not shown) are necessary, and OUT_dr0 is the output of dr0.
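As an illustration of the structure in Figure 8, the sketch below implements an 11-tap FIR filter with symmetric integer coefficients. It assumes that the six multiplier coefficients listed in the text (1, -1, -9, 6, 73, and 120) correspond to C1 through C6 in that order, which the figure suggests but the text does not state explicitly; the function names are hypothetical, and no fault tolerance is modeled here.

```python
HALF = [1, -1, -9, 6, 73, 120]   # assumed to be C1..C6, in order
COEFFS = HALF + HALF[-2::-1]     # C1..C6, C5..C1 gives 11 symmetric taps

def fir_step(delay_line, sample):
    """Shift one sample into the delay line and return (new_line, output).

    delay_line holds the 11 most recent samples, newest first, playing the
    role of IN and registers R2..R11 in the figure.
    """
    delay_line = [sample] + delay_line[:-1]
    acc = sum(c * x for c, x in zip(COEFFS, delay_line))
    return delay_line, acc

def fir(samples):
    """Filter a sequence of integer samples, starting from a cleared delay line."""
    line = [0] * len(COEFFS)
    outputs = []
    for s in samples:
        line, y = fir_step(line, s)
        outputs.append(y)
    return outputs

# An impulse at the input reproduces the coefficients at the output.
print(fir([1] + [0] * 10))
```

If the scale factor of 512 mentioned in the text applies to these integers, dividing the accumulated sum by 512 would recover the unscaled filter output; that interpretation is an assumption.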
Table 3. Results for a digital, 11-tap, 9-bit FIR filter implemented in the XCV300-PQ240 FPGA.
Fault tolerance  Maximum     No. of    No. of four-  No. of      Estimated power dissipation (mW)
technique*       delay (ns)  I/O pads  input LUTs    flip-flops  Clock  Nets  Logic  Inputs  Outputs  Total
None             48          27        508           90          8      85    145    1       748      987
TMR              58          93        1,779         270         32     350   504    2       823      1,711
DWC-CED          63          75        1,738         308         25     324   530    2       19       900
* DWC-CED stores the output in registers, whereas the standard (no fault tolerance) technique and TMR do not.
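The relative figures quoted for this filter can be recomputed from Table 3 with a few lines of arithmetic. The dictionary layout and the helper `pct_change` below are illustrative assumptions; the numbers are copied from the table.

```python
# Values copied from Table 3 (delay in ns, power in mW).
table3 = {
    "None":    {"delay_ns": 48, "io_pads": 27, "luts": 508,  "power_mw": 987},
    "TMR":     {"delay_ns": 58, "io_pads": 93, "luts": 1779, "power_mw": 1711},
    "DWC-CED": {"delay_ns": 63, "io_pads": 75, "luts": 1738, "power_mw": 900},
}

def pct_change(new, ref):
    """Percentage change of new relative to ref (positive means larger)."""
    return 100.0 * (new - ref) / ref

none, tmr, dwc = table3["None"], table3["TMR"], table3["DWC-CED"]

pins_fewer = -pct_change(dwc["io_pads"], tmr["io_pads"])          # DWC-CED vs. TMR
delay_tmr_vs_none = pct_change(tmr["delay_ns"], none["delay_ns"])
delay_dwc_vs_tmr = pct_change(dwc["delay_ns"], tmr["delay_ns"])
power_dwc_vs_tmr = pct_change(dwc["power_mw"], tmr["power_mw"])

print(f"pins saved vs. TMR:   {pins_fewer:.1f}%")
print(f"TMR delay vs. none:   {delay_tmr_vs_none:+.1f}%")
print(f"DWC-CED delay vs TMR: {delay_dwc_vs_tmr:+.1f}%")
print(f"DWC-CED power vs TMR: {power_dwc_vs_tmr:+.1f}%")
```

The pin and delay ratios come out at roughly 19.4%, 20.8%, and 8.6%, consistent with the 19%, 20%, and 8% figures quoted in the text when the fractional part is dropped.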
coefficients calculated using Matlab (http://www.matlab.com) by a constant of 512. The final multiplier coefficients were 1, -1, -9, 6, 73, and 120.

Table 3 compares the results in terms of area, performance, and power dissipation for this digital filter implemented with no fault tolerance, TMR, and our DWC-CED technique. In this case, TMR also protected the registers, whereas the DWC-CED using RESO protected the combinational logic (multipliers and adders). The CED block resides at the outputs, where it votes on the correct pad output from dr0 or dr1. Results show that the FIR filter occupies slightly less area in the FPGA when DWC-CED rather than TMR protects it. The results also show that our method uses 19% fewer pins than TMR. In terms of performance, TMR had a maximum delay of 58 ns for this test application, 20% higher than the standard (no fault tolerance) approach. Our DWC-CED technique had a maximum delay of 63 ns (8% higher than TMR) for this application.

The DWC-CED technique’s power dissipation was considerably less than with TMR. DWC-CED’s power dissipation was also less than that of the standard (no fault tolerance) approach because our technique uses fewer input and output pins compared to TMR, uses less logic, and stores the output in a register, whereas the standard approach has the combinational logic going directly to the output pads. The DWC-CED technique also saves power because the output voter passes only one of the logic-registered outputs to the pads while the other waits to be used in case of a fault. TMR does not register the outputs but rather votes on them in the output pads, consuming more power.

WE’VE DISCUSSED only SEUs occurring in the SRAM programmable cells that are permanent until the next reconfiguration. However, a circuit operating in outer space can suffer from total ionizing dose and other effects that can provoke permanent physical damage in the circuit. We hope to explore these areas in the future.

References
1. A.H. Johnston, “Scaling and Technology Issues for Soft Error Rates,” Proc. 4th Ann. Research Conf. Reliability,
Ricardo Reis is a professor at the Institute of Informatics of the Federal University of Rio Grande do Sul, and the Latin America liaison for IEEE Design & Test. His research interests include VLSI design, CAD, physical design, design methodologies, and fault-tolerant techniques. Reis has a BSc in electrical engineering from the Federal University of Rio Grande do Sul, and a PhD in computer science and microelectronics from the Institut National Polytechnique de Grenoble, France. He is a vice president of the International Federation for Information Processing and a member of the IEEE.

Direct questions and comments about this article to Fernanda Gusmão de Lima Kastensmidt, PO Box 15064, Porto Alegre – RS – Brasil, 91501-970; [email protected].