A Survey of Fault Tolerance in FPGAs
By C. Mohamed Yousuf
• Fault Tolerance
– Ability of a system to continue error-free operation in the
presence of an unexpected fault
– Also refers to a design that is able to continue operation,
possibly at a reduced level, rather than failing completely
when some part of the system fails
• Important in mission-critical & safety-critical
applications
– E.g., medical, aviation, banking, etc.
– Errors are very costly
• A number of choices have to be examined to determine
which components should be fault-tolerant:
• How critical is the component? In a car, the radio is not
critical, so this component has less need for fault-tolerance.
• How likely is the component to fail? Some components,
like the drive shaft in a car, are not likely to fail, so no fault-
tolerance is needed.
• How expensive is it to make the component fault-
tolerant? Requiring a redundant car engine, for example,
would likely be too expensive both economically and in
terms of weight and space, to be considered.
Faults
• Permanent Faults (Hard faults)
– Due to manufacturing defects, early life failures, wearout
failures
– Wearout failures due to various mechanisms
• e.g., electromigration, hot carrier degradation, dielectric
breakdown, etc.
• Temporary Faults (Soft faults)
– Only present for short period of time
– Caused by external disturbance or marginal design
parameters
Fault Tolerance
• Redundancy
– Static Redundancy
– Dynamic Redundancy
– Hybrid redundancy
• Self Checking and Testing
• Reconfigurable Architecture
Redundancy
• Hardware redundancy
– Addition of replicated modules
– Use of extra circuits for fault detection
• Information redundancy
– Addition of extra information to data to allow error
detection and correction, and to build self-checking circuits
– Self checking
• Only a fault-free circuit will produce a valid code word
[Figure: self-checking circuit. Blocks: Functional Logic (m inputs, k outputs), Check Bit Generator (c check bits), Checker (error indication).]
[Figure: memory with error correction. Write path: Data Word In, Generate Check Bits, Memory. Read path: Read Data Word, Generate Check Bits (calculated check bits), Syndrome, Correct Data, Data Word Out.]
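The check-bit generation, syndrome calculation, and correction steps can be illustrated with a small single-error-correcting code. The sketch below uses a Hamming(7,4) code as an assumed example; the slides do not name a specific code, and the function names are hypothetical.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    # Codeword layout, positions 1..7: p1 p2 d1 p4 d2 d3 d4
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_syndrome(codeword):
    """Recompute the parity checks; the result is the 1-based error position (0 = no error)."""
    c = codeword
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    return s1 + 2 * s2 + 4 * s4


def hamming74_decode(codeword):
    """Correct at most one flipped bit and return (data_bits, syndrome)."""
    syndrome = hamming74_syndrome(codeword)
    corrected = list(codeword)
    if syndrome:
        corrected[syndrome - 1] ^= 1  # flip the bit the syndrome points at
    data = [corrected[2], corrected[4], corrected[5], corrected[6]]
    return data, syndrome


# Usage: store a word, flip one bit to model a soft error, then read it back.
stored = hamming74_encode([1, 0, 1, 1])
stored[4] ^= 1  # single-bit upset at codeword position 5
recovered, syndrome = hamming74_decode(stored)
```

This code corrects one flipped bit per codeword; two simultaneous bit flips would be miscorrected, which is why memories often add an extra overall parity bit.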
• Software redundancy
– Programming
• Time redundancy
– Hardware and information redundancy require
extra hardware. This can be avoided by performing
the operation several times in the same module and
checking the results, instead of performing it in
parallel on several modules and comparing the
outputs. This reduces the amount of hardware at
the expense of additional time.
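Time redundancy can be sketched in software as repeating one operation on the same module and voting on the results. The example below is a minimal illustration under that assumption; `run_with_time_redundancy` and `FlakyAdder` are hypothetical names, and the injected bit flip stands in for a temporary fault.

```python
from collections import Counter


def run_with_time_redundancy(operation, *args, repeats=3):
    """Run `operation` several times in sequence and majority-vote the results.

    Returns (value, all_agreed). Raises if no result has a strict majority.
    """
    results = [operation(*args) for _ in range(repeats)]
    value, count = Counter(results).most_common(1)[0]
    if count <= repeats // 2:
        raise RuntimeError("no majority among repeated executions")
    return value, count == repeats


class FlakyAdder:
    """Hypothetical module that suffers one transient fault on a chosen call."""

    def __init__(self, faulty_call):
        self.calls = 0
        self.faulty_call = faulty_call

    def __call__(self, a, b):
        self.calls += 1
        result = a + b
        if self.calls == self.faulty_call:
            result ^= 1  # model a transient upset of the least significant bit
        return result


# Usage: the second of three executions is disturbed, and the vote masks it.
adder = FlakyAdder(faulty_call=2)
value, all_agreed = run_with_time_redundancy(adder, 2, 3)
```

Plain re-execution of this kind only helps against temporary faults: a permanent fault in the module produces the same wrong result on every run, so the runs still agree.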
At the circuit level, redundancy is used as:
• Duplication, where complementary logic
structures that produce the same responses
are compared
• Self checking
• Reconfigurable structures
– Location of faults within replaceable units
– Reconfiguration of the FPGA structure in accordance
with the diagnostic data obtained
– Two types: (1) fully reconfigurable, (2) partially reconfigurable
On the system level:
• Replication (Triple Modular Redundancy, TMR)
– Masks a fault in one module by voting; the faulty module
itself is not recovered
– A 2-out-of-3 majority voter determines the correct
result
• Built-In Self-Test (BIST)
– Online BIST (during normal operation)
• Error detection and correction
– Offline testing (normal function suspended)
• Functional test: based on information about the circuit
under test (CUT), ensures that the logic behaves properly
• Structural test: ensures the circuit is free from physical faults
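The 2-out-of-3 majority vote used in TMR can be sketched as a bitwise function over three module outputs. This is a minimal software model that assumes at most one module is faulty at a time; `majority_vote` and `tmr` are hypothetical names, and the three modules are plain Python callables standing in for replicated logic.

```python
def majority_vote(a, b, c):
    """Bitwise 2-out-of-3 majority: each output bit follows at least two inputs."""
    return (a & b) | (b & c) | (a & c)


def tmr(module_a, module_b, module_c, x):
    """Run three replicas on the same input and vote on their outputs.

    Returns (voted_output, indices_of_disagreeing_modules).
    """
    outputs = (module_a(x), module_b(x), module_c(x))
    voted = majority_vote(*outputs)
    disagreeing = [i for i, out in enumerate(outputs) if out != voted]
    return voted, disagreeing


# Usage: replica 1 has a fault that flips output bit 4.
def good_module(x):
    return (x * 3) & 0xFF


def faulty_module(x):
    return good_module(x) ^ 0x10


voted, disagreeing = tmr(good_module, faulty_module, good_module, 7)
```

The list of disagreeing replicas is extra diagnostic information that a plain voter does not provide; it is included here only to show how a fault could be located.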
[Figure: BIST architecture. Blocks: Test Pattern Generation (TPG), Circuit Under Test (CUT), Test Response Analysis (TRA), BIST Control Unit.]
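The TPG, CUT, and TRA blocks can be modelled in a few lines to show how an offline test flows. The sketch below assumes a 4-bit LFSR as the pattern generator and a simple shift-and-XOR signature register as the response analyser (real designs typically use a MISR); the circuit under test, a 2-bit adder, and all names are hypothetical.

```python
def lfsr_patterns(seed=0b0001, width=4, taps=(3, 2), count=15):
    """TPG: yield `count` patterns from a Fibonacci LFSR (taps are bit indices)."""
    mask = (1 << width) - 1
    state = seed & mask
    for _ in range(count):
        yield state
        feedback = 0
        for tap in taps:
            feedback ^= (state >> tap) & 1
        state = ((state << 1) | feedback) & mask


def signature(cut, patterns):
    """TRA: compact all CUT responses into one 32-bit signature."""
    sig = 0
    for pattern in patterns:
        sig = ((sig << 1) ^ cut(pattern)) & 0xFFFFFFFF
    return sig


def run_bist(cut, golden_signature):
    """Control unit: apply all patterns and compare against the golden signature."""
    return signature(cut, lfsr_patterns()) == golden_signature


def good_adder(x):
    """Hypothetical CUT: adds the low and high 2-bit halves of a 4-bit input."""
    return (x & 0b11) + (x >> 2)


def faulty_adder(x):
    """Same CUT with output bit 0 stuck at 0."""
    return good_adder(x) & ~1


# The golden signature is recorded once from a known-good circuit.
GOLDEN = signature(good_adder, lfsr_patterns())
passed = run_bist(good_adder, GOLDEN)
failed = run_bist(faulty_adder, GOLDEN)
```

With these taps the LFSR steps through all 15 nonzero 4-bit patterns before repeating, so every input except all-zeros is applied to the circuit under test.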
Fault-Tolerant Techniques: Merits, Issues, Applications
• Redundancy
– Merits: maximizes MTTF; suits long-life applications; easy to implement
– Issues: more area; more power; costly
– Applications: satellites; spacecraft; implanted biomedical devices
• Self-checking circuits
– Merits: high speed; reduced chip area; reduced power; reduced cost
– Issues: complex logic; problem if the self-checking circuit itself fails
– Applications: reliable real-time systems; satellites; spacecraft; implanted biomedical devices
• Reconfiguration
– Merits: no extra hardware circuits; performance is improved; reduced area, etc.
– Issues: synchronization issues after reconfiguration; may need further floor planning
– Applications: mainstream low-cost systems; consumer electronics; personal computers
Conclusion
The parameters that need to be considered for an
efficient fault-tolerant circuit are:
• Area overhead
• Cost effectiveness
• Power consumption
• Speed
By weighing the advantages of the different
methods and combining them, one can design a
fault-tolerant circuit that meets the parameters
considered above.