
2004 Long Survival of FPGAsystems Through Evolutionary Self Repair

This document proposes a technique called Jiggling to mitigate transient and permanent faults in FPGA systems through evolutionary self-repair. Jiggling extends the commonly used Triple Module Redundancy (TMR) and Scrubbing approaches by using two healthy FPGA modules to repair the third faulty module. A minimal (1+1) evolutionary algorithm is used during the repair process to allow blind variation and selection to find a configuration that repairs the fault. Analysis shows this Jiggling approach can provide high reliability, allowing a system to survive 17 times the mean time to local permanent fault arrival with 99% probability.


Scrubbing away transients and Jiggling around the permanent: Long survival of FPGA systems through evolutionary self-repair

Miguel Garvie and Adrian Thompson
CCNR, Dept. of Informatics, University of Sussex, Brighton BN1 9QH, UK.
{m.m.garvie, adrianth}@sussex.ac.uk, +44 (0)1273 872945

Errata: The value of H used for repair simulations was 32, not 8. In equation 2, k=i-1.

Abstract

The Jiggling architecture extending TMR+Scrubbing is shown to mitigate FPGA transient and permanent faults using low overhead. Mission operation is never interrupted. The repair circuitry is sufficiently small that a pair could mutually repair each other. A minimal evolutionary algorithm is used during permanent fault self-repair. Reliability analysis of the studied case shows the system has a 0.99 probability of surviving 17 times the mean time to local permanent fault arrival. Such a system would be 0.99 probable to survive 100 years with one fault every 6 years.

1 Introduction

Reconfigurable hardware devices such as Field Programmable Gate Arrays (FPGA) are being increasingly used in space applications because they allow cheap and fast production of prototypes and final designs in low volume. Buggy designs can be fixed post-deployment, and the same hardware used to perform various tasks – some possibly unforeseen – over the duration of a mission. The use of Commercial Off-The-Shelf (COTS) components such as FPGAs is becoming common-place in space applications and other industries.

SRAM based FPGAs deployed in space are susceptible to radiation hazards, most commonly [14, 3, 10] Single Event Upsets (SEU) not causing permanent damage. Total dose exposure may cause catastrophic damage [10]. The above research does not focus on local permanent damage (LPD). LPD has not been commonly observed in radiation tested FPGAs, but there are several reasons why it should not be ignored. Cases of Single Event Latch-up, which may cause LPD by inducing high current density, have been reported [9]. Some SEUs cannot be mitigated without a full chip reset which may not be possible for a mission-critical module, thus manifesting themselves as LPD. Radiation testing on earth is not 100% faithful to space conditions and does not last as long as a mission. In fact, no FPGA has been tested for more than 15 years, while NASA plans 100 year missions for deep space exploration. Long device usage could lead to LPD through electromigration or other aging effects. Other dormant faults may only manifest themselves a considerable time after deployment. It would be unwise engineering to assume that LPD to FPGA cells would not occur during long space missions exposed to extreme environmental conditions and radiation. Space missions are not the only deployments that can benefit from strategies dealing with LPD, although they are particularly needy of autonomous onboard repair since communication with earth is low bandwidth and high latency. Radiation and aging effects are also encountered at sea-level and may be a problem for inaccessible systems where component replacement is not feasible.

Triple Module Redundancy (TMR) is currently widely used to mitigate faults and is considered to have “saved” several space missions. A TMR system has three copies of a module and uses a voting system on their outputs so that the final output is an agreement between at least two modules. A TMR/Simplex defaults to a single module once one module fails, thereby increasing reliability. TMR+Scrubbing [3, 10] provides fault tolerance as above and wipes out SEUs in FPGA configuration data by regular reprogramming. Configuration readback [3] is able to locate configuration errors and fix them by partial reconfiguration. These schemes are only as good as a TMR system in the presence of permanent faults and rely on a golden (unbreakable) memory to store configuration data. Latched user data in sequential designs can be protected with state recovery schemes [15].

Lach et al. [4] proposed a tile based approach for reconfiguring an FPGA design to avoid LPD. This approach tolerates limited faults per tile, requires a golden memory holding precompiled configurations and a golden fault location mechanism. The repair mechanism requires tiles to be off-line during reconfiguration, which may rule out repair of mission-critical modules. A similar approach [16] likewise requires a set of golden pre-compiled configurations and a golden fault diagnosis system hosted on an extra FPGA. The Roving STARS [1] approach proposes a self-testing column and row to shift itself across the FPGA. Fault detection latency is of around 4000 cycles, and it requires constant reconfiguration of the FPGA and is therefore a constant power drain. It relies on a golden micro-processor to perform timing analysis, fault location, place and route, and more than 420K of golden memory for storage of designs and faults. A final cost is that the system clock is stopped regularly.

Embryonics [7] is a biologically inspired approach with an architecture requiring large amounts of overhead including golden copies of chip configuration. Zebulum et al. [17] have used a Genetic Algorithm (GA) to repair analog designs on a Field Programmable Transistor Array. They provide results for a restricted fault set and assume a golden ‘SABLES’ system composed of a DSP, memory for a full population, and a fitness evaluation mechanism in some cases requiring a healthy copy of the circuit being repaired. [11, 13, 6] apply GAs to repair FPGA designs assuming a golden GA module with a full population and fitness evaluation mechanism.

Most FPGA LPD mitigation techniques mentioned so far suffer from the Repairing the Repairer dilemma in which a new single point of failure assumed unbreakable is introduced in the mitigation mechanism. This is especially awkward when the mitigation mechanism assumed unbreakable is larger than the breakable system itself. A repair module small and simple enough to be itself repaired would offer an obvious advantage. This paper will describe such a low overhead fault mitigation mechanism sufficiently small
that a pair could mutually repair each other, thus relying on no golden single point of failure. Section 2 will introduce the TMR+Lazy Scrubbing+Jiggling architecture for transient and permanent fault mitigation, section 3 will lay out a reliability analysis and section 4 will provide some conclusions.

2 TMR + Lazy Scrubbing + Jiggling

The proposed mechanism extends TMR + Scrubbing. TMR provides fault tolerance keeping the system on-line at all times and also votes out SEUs affecting user data. ‘Lazy Scrubbing’ mitigates SEUs to configuration data, and ‘Jiggling’ repairs LPD by using the two healthy modules to repair the faulty one. Once a Jiggling repair is complete three healthy modules are again available (Fig. 4). Permanent faults can be repaired until spare resources are exhausted.

After 2.1, the paper deals entirely with permanent fault mitigation of combinational circuits through Jiggling; sequential ones are left for future work.

2.1 Transient SEU Mitigation through Lazy Scrubbing

Traditional Scrubbing reconfigures a whole TMR system on an FPGA regularly from an external memory. To reduce power consumption, a module will only be reconfigured when its output is different to the other two. Instead of requiring a golden memory holding module configuration, the configuration is read from all three modules and a majority vote is taken of this data (taking offsets into account) to reconfigure the faulty module. Lazy Scrubbing requires less power, less overhead and fewer single points of failure than traditional Scrubbing. Lazy Scrubbing cannot be used after the first Jiggling repair. Jiggling will still recover from SEUs in FPGA configuration, although with a higher latency.

2.2 Jiggling Architecture Overview

Fig. 1 shows the simplest setup of a Jiggling system with three copies of module $M$ and a repair module containing the voter and a minimal implementation of a GA. TMR can be applied at many levels. Based on experiments to date, $M$ should be under a thousand gate equivalents to make the repair process feasible. One GA circuit could service many TMR systems, all residing on a single FPGA.

Fig. 1: Jiggling: TMR + minimal GA for repair.

A module is considered to have LPD if it fails to repair after Lazy Scrubbing. At this point, Jiggling is initiated. Jiggling uses the remaining functional modules as a template of desired behaviour to guide a (1+1) Evolutionary Strategy (ES) [8] – the minimal expression of a GA – towards a functional configuration which avoids or exploits the faulty element. The system is kept on-line during this repair process by the two healthy modules driving the majority voter. Spare resources are allocated in each module. Mutations inserted by the GA are single bit flips in the FPGA configuration stream of the faulty module.

Given that permanent faults in FPGAs are not very frequent (indeed completely ignored by Xilinx [5]) we can trade overhead and the knowledge contained therein for time and allow blind variation and selection to find the knowledge required to complete the repair.

Section 2.3 will describe the Jiggling mechanism in greater detail and section 2.4 will cover its hardware implementation. Section 2.5 will describe the simulator used to collect repair time statistics while section 2.6 will present the probability model used for availability analysis.

2.3 Specification

2.3.1 Evolutionary Algorithm used during Repair

A (1+1) ES has one elite individual and one mutated version of it. If the mutant is equal or superior it will replace the elite. Otherwise, another mutant is generated. This strategy has been applied successfully to hardware evolution [12] and has been considered [2] to be an effective strategy for exploring fitness landscapes with high neutrality, such as those of digital circuits. It was mainly chosen here for its simplicity and because it allows a small hardware implementation.

There may not be a single mutation restoring healthy behaviour to a damaged module, so an exhaustive search would blindly iterate over all $2^n$ configurations, where $n$ is their length. This is not practical even for the small circuit tackled in this work where $n = 1058$. A (1+1) ES is a hill-climbing random walk within configuration space moving to a fitter (see 2.3.3) or equivalent configuration at each step. This is not guaranteed to find a solution in time. But faced with a stochastic fault source, no system is capable of guaranteeing survival. The probability of survival for a Jiggling system is studied in 3.

A (1+1) ES is brittle in the presence of noise because if a bad mutant gets a lucky high fitness it could replace a superior elite which would then be lost. Noise is present in the evaluation of circuits on an FPGA when they are not constrained to be combinational. Sequential circuits behave differently under changes in input pattern ordering and gate delays which may vary with environmental conditions. To discourage evolution from discarding a good elite, a History Window method is used: the last $H$ accepted mutations are stored so that if the current elite’s fitness is lower than the previous one’s, all $H$ mutations are reverted. By rolling back after encountering individuals with noisy fitness, evolution is discouraged from exploring areas of the fitness landscape encoding sequential circuits. If $p$ is the probability of a lucky high noisy evaluation, then the probability a sequential circuit has been lucky $H$ times varies as $p^H$. Thus, the larger $H$ is, the higher the chance that the circuit reverted to is stable.

2.3.2 Reconfiguration: Hardware Nature of Mutations

Mutations are single bit flips in the configuration stream of the faulty module $M_f$. Modules are always allocated $2^a$ addresses of configurable logic blocks (CLB) so that the address of a mutated block can be simply a randomly generated $a$ bit number. If a circuit only requires a fraction of these CLBs, the rest are allocated as spare resources for repair. A mutation may affect any part of the CLB such as a Look-up Table (LUT) or the routing. Ease of partial self-reconfiguration and robustness to random configuration bit flips make some FPGA architectures more suitable than others.
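
As a rough illustration of Sections 2.3.1 and 2.3.2, a (1+1) ES with a History Window over single bit flips might be sketched in Python as follows. This is a minimal software sketch under stated assumptions, not the hardware Jiggler: the configuration is a plain list of bits, `fitness` is a caller-supplied function, and the names `jiggle`, `matches` and `target` are hypothetical. The toy fitness counts bits that match a known target, whereas the paper scores module outputs against the voter. Resetting the stored fitness to zero after a rollback is also an assumption of this sketch.

```python
import random
from collections import deque


def jiggle(config, fitness, max_fitness, history_size=8,
           max_generations=100_000, rng=None):
    """Hill-climb `config` (a list of 0/1 bits) with a (1+1) ES.

    Returns (repaired_config, generations) or (None, max_generations).
    """
    rng = rng or random.Random()
    config = list(config)
    history = deque(maxlen=history_size)  # addresses of the last H accepted flips
    prev_fit = 0                          # fitness of the previously accepted elite

    for generation in range(max_generations):
        elite_fit = fitness(config)       # the elite is re-evaluated every generation
        if elite_fit == max_fitness:
            return config, generation
        if elite_fit < prev_fit:
            # The elite now scores worse than before, so its earlier score was
            # probably a lucky noisy one: undo the whole history window.
            for addr in history:
                config[addr] ^= 1
            history.clear()
            prev_fit = 0                  # assumption: restart the comparison
            continue
        prev_fit = elite_fit

        addr = rng.randrange(len(config))  # mutation = one random bit flip
        config[addr] ^= 1
        mutant_fit = fitness(config)
        if mutant_fit < prev_fit:
            config[addr] ^= 1             # worse mutant: revert the flip
        else:
            prev_fit = mutant_fit         # equal or better: mutant becomes the elite
            history.append(addr)
    return None, max_generations


# Toy usage (hypothetical): "repair" a 16-bit configuration towards a target.
target = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1]


def matches(bits):
    """Toy fitness: number of bits equal to the target pattern."""
    return sum(b == t for b, t in zip(bits, target))


repaired, generations = jiggle([0] * 16, matches, max_fitness=16,
                               rng=random.Random(0))
```

With a deterministic fitness such as `matches` the rollback branch never fires; it only matters when repeated evaluations of the same configuration can disagree, which is the noisy case the History Window is meant to handle.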
2.3.3 Fitness Evaluation

Since normal circuit operation is not interrupted during repair of $M_f$, circuit fitness evaluation is done on the fly using the inputs being applied during normal mission operation. A score $e_{qv}$ is collected for each output $q$ under every input vector $v$ until all vectors have been encountered. If output $q$ of $M_f$ was ever incorrect under the application of $v$, then $e_{qv} = 0$; otherwise $e_{qv} = 1$. Output is considered incorrect if different to the corresponding output at the voter. Fitness score for the current configuration is simply the sum of scores. This fitness score will guide evolution towards configurations whose behaviour is increasingly similar to the healthy modules’.

2.3.4 Repairing the Repairer

Since the repair module is small, it would be feasible for it to be repaired with the same strategy, once this method has been adapted to sequential circuits. Each repair module could be itself tripled as in Fig. 2 and repaired by another repair module. Both repair modules would then be in charge of repairing multiple systems.

Fig. 2: Jigglers containing voters and a minimal GA are able to service multiple systems including other Jigglers.

2.4 Hardware Implementation

2.4.1 Jiggling Repair Cycle

The (1+1) ES with History Window control loop for repairing faulty module $M_f$ is described below in pseudocode. $F_c$ and $F_p$ are registers holding the current and previous fitness values. $E$ is a bitwise storage holding $e_{qv}$ scores.

1. Set register $F_p = 0$.
2. Evaluate Elite: collect $e_{qv}$ scores into $E$ until all input vectors have been encountered. If all scores are 1 then stop repair process.
3. Count number of 1s in $E$ and store it in the $F_c$ register.
4. If $F_c < F_p$ revert all mutations in history shift register $SR_H$ and go to step 1.
5. Shift the value of $F_c$ up into $F_p$.
6. Insert new random mutation $m$ in $M_f$.
7. Evaluate Mutant: collect scores as in step 2. If all scores are 1 then stop repair.
8. Count number of 1s in $E$ and store it in the $F_c$ register.
9. If $F_c < F_p$ revert mutation $m$.
10. Else shift the value of $F_c$ up into $F_p$ and push $m$ onto $SR_H$.
11. Go to step 2.

Control logic can be implemented as a Finite State Machine (FSM) with eight states. If an incorrect configuration gets a lucky full score, repair will be resumed as soon as its behaviour is different from the voter.

2.4.2 Structure

Figure 3 shows a possible hardware implementation for the Jiggler. The voter provides system fault tolerant output $y$ as the majority of the module outputs $y_0, y_1, y_2$. It also provides faulty module index $f$ and output $y$ to the minimal GA. A shift register chain $SR_F$ of size two will hold $F_c$ and $F_p$. A Counter module will sum vector scores from $E$ into $F_c$. A Comparator will check if $F_c < F_p$. A shift register chain $SR_H$ of size $H$ will store the mutation history window with the addresses of the last $H$ mutated configuration stream bits. A Random Number Generator (RNG) will generate the random mutation address $m$. A reconfiguration unit will flip the configuration stream at a particular address and initiate the partial reconfiguration procedure. Its operation will depend on the reconfiguration mechanism of the FPGA architecture chosen.

Fig. 3: Possible Jiggler GA implementation showing data flow directed by the Control FSM as in the algorithm in 2.4.1.

2.4.3 Overhead Analysis

Given a circuit has $I$ inputs and $Q$ outputs, we need $2^I \times (1 + Q)$ bits storage for $E$ and $2 \times I$ bits for $SR_F$. If the address offset of a configuration bit within a module requires $A$ bits, we need $H \times A$ bits storage for $SR_H$. The control module can be implemented as an 8 state FSM and might require 3 latches and 10 four-input LUTs. The $A$ bit RNG could be implemented as a linear feedback shift register with a small number of LUTs (one in the example that follows) and $A$ latches. The reconfiguration module’s size would depend on the FPGA architecture and could vary between 3 and 30 LUTs and some latches. For this analysis it is assumed it requires 15 LUTs and 15 latches. The voter could be implemented in roughly $Q$ LUTs. Given a CLB offset address within a module needs $a$ bits we need $Q \times a$ bits for the reconfigurable module output addresses.

If $I = 5$, $Q = 10$, $H = 8$, $a = 5$, $A = 11$ the overhead is 352+10+88+50=500 storage bits plus roughly 10+1+15+10=36 LUTs and 3+11+15=29 latches. 2.3.4 describes a scheme for mitigating faults in this logic.

For larger circuits, the size of $E$ will dominate the sum because it grows exponentially with the number of inputs. It should be noted that $E$ will usually be smaller than a whole extra module and one Jiggler unit is capable of servicing multiple subsystems thus reducing overhead per subsystem.

2.5 Simulated Jiggling

The Jiggling method was evaluated by collecting repair time statistics from a simulated model.

2.5.1 FPGA model

Various FPGA architectures, some of which have been deployed in space missions, can be simplified to a model where each CLB holds one LUT and one D-Latch. Routing between such CLBs is limited yet can be assumed universal [6] for small circuits such as those dealt with in this work. This first study of the Jiggling approach tackles combinational circuits only, so the FPGA model adopted uses four-input LUTs and no latches. We assume it is not complex to turn off all latch functionality for a given area of an FPGA.

2.5.2 Simulator Characteristics

The simulator used is a simple version of an event driven digital logic simulator in which each logic unit is in charge of its own behaviour when given discrete time-slices and the state of its inputs. Routing is considered unlimited so any unit can be connected to any other allowing recurrent connections inducing sequential behaviour, so care must be taken to update
all units ‘simultaneously’. This is achieved by sending the time-slices to the logic units in two waves: the first to read their inputs and the second to update their outputs. During each evaluation, circuit inputs were kept stable for 30 time-slices and the outputs were read during the 5 last time slices. Gate delays are simulated in time-slice units and are randomized at the start of each evaluation with a Gaussian distribution with $\sigma = 0.1$ and a $\mu$ varying between 3 and 6 thus simulating a probe subjected to a changing environment. $\mu$ increments, decrements or keeps its value within the $[3, 6]$ range every $g$ simulated generations where $g$ is itself taken from a Gaussian distribution. For the scenario studied in this paper these statistics would translate to the mean of the frequently randomized gate delays changing roughly every minute.

The Stuck-At (SA) fault model was chosen as an industry standard providing a fairly accurate model of the effects of radiation hazards at the gate level. SA faults can be introduced at any of the logic units of the simulator simply by setting its output always to 0 or 1.

2.5.3 FPGA configuration stream encoding

As mentioned earlier, mutations are performed on the section of the FPGA configuration stream which encodes the faulty module. During simulated evolutionary repair, the GA deals with linear bit string genotypes which are equivalent to the simulated FPGA configuration stream. As mentioned in 2.3.2, there are $2^a$ addresses available. The last $I$ addresses are assigned to circuit inputs while the remaining $2^a - I$ refer to LUT units within the module. $a$ bits are required per address. The first $Q \times a$ configuration bits – where $Q$ is the number of circuit outputs – encode the addresses from which module outputs will be read by the voter. This simulates mutations to the configuration memory controlling routing to the module outputs. The rest of the stream is divided into $2^a - I$ sections, one for each LUT. Each of these sections contains 16 bits encoding the LUT and $4 \times a$ bits encoding where the inputs of this LUT are routed from. LUT addresses are assigned in the order they appear in the configuration stream.

2.5.4 Evaluation Procedure

In order to mimic normal mission operation during circuit evaluation, each test vector applied for the 30 simulated time step presentation is randomized. All analysis in this paper assumes the availability of a fresh input on every clock cycle. Evaluation ends when every test vector has been encountered. For each input vector, the number of circuit output bits that were always correct during its application is added to the total fitness score.

2.5.5 Collecting Repair Time Statistics

Repair time information is required to analyse the reliability of the Jiggling approach. Since all modules are equal, repair statistics from a single module convey the information required. The time taken to repair the first fault, when all spares are available, may be different to that of repairing the $i$-th, when $i$ faults are present in the module and $i - 1$ successful repairs have already taken place, probably allocating at least $i - 1$ spares.

To collect repair time information, $N$ random fault sequences of length $S$ – the initial number of spare LUTs – are generated making sure no LUT is failed twice. For each of these $N$ sequences, faults are inserted into the simulated module in order and the number of generations of evolution required to arrive at a fully functional configuration is recorded. $H$ consecutive generations with a fully functional elite must elapse before the module is considered repaired to avoid lucky solutions. The repair time of the $i$-th fault in the $j$-th sequence will be referred to as $r_{ij}$. If $r_{ij}$ exceeds 4.32 million (M) simulated generations the fault sequence is aborted and the repair time of all unrepaired faults in the sequence is set to $\infty$, so $r_{kj} = \infty$ for all $k \ge i$. This limits the amount of CPU time used for simulation and will significantly skew the statistics pessimistically since long-term missions may have months to repair a permanent fault.

2.6 Probability Model

Figure 4 shows a Markov model with the three states the Jiggling system can be in. It begins its life fault-free in the healthy state. When a fault arrives at module $M_f$ it moves to the repair state. If during repair another fault arrives in $M_f$ it will stay in this state. If during repair a fault arrives in one of the other modules it reaches the fail state. The system will only be considered to go back to the healthy state when $M_f$ is repaired before a fault arrives at $M_{j \neq f}$.

Fig. 4: Markov model of the Jiggling life cycle.

The probability of moving from healthy to repair is dictated by the permanent fault rate which may be affected by such factors as usage, age and environmental stress. The probability of moving from repair back to healthy will depend on the permanent fault rate and on time to repair which is likely to increase as faults accumulate and spares get used. Faults affecting the voter and GA module could be dealt with as mentioned in 2.3.4.

2.6.1 Availability Analysis

The Jiggling system is considered to fail when, during repair of $M_f$, a fault arrives at $M_{j \neq f}$. Consider $P_i$ the probability of the $i$-th repair within the whole system (after the $i$-th fault at any module) succeeding. This is the probability that, given $i - 1$ system repairs were successful, the latest fault at $M_f$ will be repaired before a fault arrives at $M_{j \neq f}$. Given a fault rate $\lambda$ per unit area and a total area $A$ for the three modules, the Poisson distribution of permanent local fault arrival during a time period $t$ has parameter $\lambda A t$. The probability that the system has not failed before time $t$, i.e. its reliability, is:

$$R(t) = \sum_{n=0}^{\infty} \frac{(\lambda A t)^n e^{-\lambda A t}}{n!} \prod_{i=1}^{n} P_i \qquad (1)$$

The outer sum considers all possible fault counts $n$ during $t$. For each $n$ the probability of $n$ faults occurring during $t$ is multiplied by the probability of all system repairs succeeding. If $p_i$ is the probability of the $i$-th repair on a single module succeeding, the probability $P_i$ of the $i$-th system repair succeeding can be calculated:

$$P_i = \sum_{k=0}^{i-1} p_{k+1} \binom{i-1}{k} \left(\frac{1}{3}\right)^{k} \left(\frac{2}{3}\right)^{i-1-k} \qquad (2)$$
where $P_0 = 1$. The sum considers all possible previous fault counts $k$ at $M_f$. For each $k$ the probability of the $(k+1)$-th single module repair succeeding is multiplied by the probability of $k$ faults occurring at $M_f$ out of $i - 1$ system faults.

2.6.2 Calculating $p_i$

Recall $p_i$ is the probability of the $i$-th repair on a single module finishing before a fault arrives at one of the other two modules. Given that fault arrival is modelled as a Poisson process, fault interarrival time $\tau$ follows an exponential distribution, such that the probability of the next fault in any of two modules occurring later than time $t$ is $P(\tau > t) = e^{-2 \lambda A_m t}$ where $A_m = A/3$ is module area. Given a set of $N$ single module repair times for the $i$-th repair $r_{ij}$, the probability $p_{ij}$ of each of these repairs succeeding is $P(\tau > r_{ij})$. Provided with a limited sample of $N$ values of $r_{ij}$ for each $i$ the best estimate of $p_i$ using the frequency interpretation of probability will be:

$$p_i = \frac{1}{N} \sum_{j=1}^{N} p_{ij}$$

Given the equations above, the experimental data $r_{ij}$ collected as in 2.5.5 can be used to calculate reliability $R(t)$.

2.6.3 TMR/Simplex

The Jiggling system reliability under permanent faults will be compared to TMR/Simplex. In this paper’s experiments 180% spares were allocated over mission LUTs so the Jiggling module area $A_m$ is 2.8 times as large as the TMR/Simplex module area $A_{TS}$. The reliability of a TMR/Simplex system is:

$$R_{TS}(t) = \frac{3}{2} e^{-\lambda A_{TS} t} - \frac{1}{2} e^{-3 \lambda A_{TS} t}$$

This ignores faults at the voter (a single point of failure), making the comparison conservative.

3 Results

Repair time statistics were collected as in 2.5.5 for the cm42a combinational benchmark from the MCNC ’91 test suite. This circuit has four inputs and ten outputs requiring 10 four-input LUTs. The number of bits per address $a$ was chosen to be 5 so 32 addresses were available, four of which were allocated to circuit inputs (2.5.3). Thus 18 spare LUTs were available in the module being repaired. History window size $H$ was set to 8. The number of generated fault sequences $N$ was chosen to be 10. An average 53.6 random input test vectors must be applied during evaluation until all 16 possible four-input vectors have been encountered. 20 clock cycles would be sufficient between evaluations to count fitness, swap a bit in the configuration stream and perform the required control procedures. At a very modest clock speed of 1MHz both module configurations (individuals) in a generation could be evaluated in $2(53.6 + 20)/10^{6} = 1.472 \times 10^{-4}$ seconds and 6793 generations could be completed per second. With a timeout of 4.32M generations the simulator gave less than 11 minutes for circuits to repair from permanent damage.

Once $r_{ij}$ were collected (Table 1), $p_{ij}$, $p_i$ and $P_i$ were calculated, providing $R(t)$ through the equations in 2.6. Repair times were short, requiring on average under two minutes of simulated time. 94 of the 104 evolutionary searches starting from a misfunctional configuration arrived at a healthy one within the simulated 11 minutes. In the 10 cases of unsuccessful repair there were on average 14 faults present in the module under repair. Under the stochastic fault source and tight time constraints, the repair mechanism exploited on average up to 78% of spare resources.

Table 1: Time $r_{ij}$ in thousands of generations taken to repair the $i$-th fault for each of the $j = 1 \ldots 10$ fault sequences.

j\i    1     2    3    4     5    6     7     8     9    10    11    12    13    14   15
 1     0     0    0    0     0    0   229   197   942  1413     0     ∞     ∞     ∞    ∞
 2     0     0    0   11    43   60  2059    97     2   387   606  1463  4226     ∞    ∞
 3     0     0    0  316  1922  409   320   319    62   223  2968     ∞     ∞     ∞    ∞
 4     0  1616    1   37    79    0  1376  1160   657  3097     0   681   394     ∞    ∞
 5     0     0    5    2    25    3     0    79   147  1136    80   935  2908  1661    ∞
 6    15    78  139  319  1246  131   151    53  1009   124  1986  1456  2751     ∞    ∞
 7     0     0  324   30   328  473    41  1153   228   355    77    59     ∞     ∞    ∞
 8     0   752   25  322     0  329   779    29   132    96     0   298    15     ∞    ∞
 9     0     0    0    0     0  193   228   319   451   185   770   424  3689  4107    ∞
10   684     0  303   78    58    0  1311   417   199   193  2137     ∞     ∞     ∞    ∞

Figure 5 compares the resulting reliability of the Jiggling system against a TMR/Simplex approach whose modules are almost 3 times smaller. Since repair times are either necessarily low or $\infty$ due to the 4.32M timeout, the repair clock speed does not influence reliability and the analysis only depends on the fault rate. The time scale is in terms of $\lambda A$ which is the local permanent fault arrival rate at any module in the Jiggling system. The fault rate at any module in the TMR/Simplex system is $\lambda A / 2.8$. The mean time between fault arrivals $T = (\lambda A)^{-1}$ is marked together with the 0.75, 0.9 and 0.99 $R(t)$ levels. With a $\lambda A$ of 1 fault a day the TMR/Simplex system has a reliability lower than 0.75 after a week yet the Jiggling system still has 0.99 reliability after 17 days and is still above 0.75 up to a 31 day month. With a fault rate of $1/T$ (mean time between arrivals $T$) the Jiggling system will be 0.99 reliable after $17T$, so a fault rate of 1 every 6 years would be tolerated for 100 years with 0.99 probability.

Fig. 5: Reliability vs. time for TMR/Simplex (dashed), Jiggling (solid) and Jiggling with every repair successful (dotted). Axes: $R(t)$ against $\lambda A t$.

3.1 Discussion

The reliability of the Jiggling system studied in this paper has been skewed negatively by the following factors. A fault in one of the modules not being repaired does not necessarily lead to system failure. Firstly, and especially relevant when spare count exceeds mission logic count, faults may arrive at spares of these modules. Secondly, two modules each carrying a fault may produce erroneous outputs for disjoint sets of inputs (i.e. they never produce a false output at the same time). In this case there is enough information at the voter output to repair both modules, returning to the healthy state. Finally the 4.32M generation limit equates to under 11 minutes of repair time at 1MHz and under 0.66 seconds at 1GHz and thus ignores a huge amount of possibilities for repair. Since permanent damage is most likely to happen every 6 months at worst, the timeout heavily skewed statistics negatively. With an extremely high permanent fault rate of one per day there would be time to complete 587M generations at 1MHz within the mean fault interarrival time. Informal investigations suggest that with such a limit, repair would
succeed at nearly every opportunity and the system would repair up to the number of spares available, making reliability more like the dotted line in fig. 5. Thus each spare added to a module would lengthen its life by roughly the fault interarrival time. Adding spares gives diminishing returns because of the increased area, but it remains possible to compute the number of spares required to provide a desired reliability at a specific point in the mission.

This work assumes the circuit being repaired is being used for normal operation and its input is effectively randomised at every clock cycle. If the circuit were not in use then it could be supplied inputs artificially, possibly from the RNG or the counter. This is also necessary if the full set of input vectors is not frequently encountered during normal operation. If the normal mission input pattern is not uniformly random and its characteristics are known, these could be used during simulated repair to generate the appropriate reliability figures.

Once the first permanent local fault is repaired through Jiggling, Lazy Scrubbing (§2.1) must be disabled since one of the module configurations has changed. However, transient faults to configuration bits could still be repaired through Jiggling by a mutation hitting the configuration stream bit affected by the SEU. This could be accelerated with minimum extra hardware by using a counter to flip each bit in the configuration stream in turn, until the SEU was undone. Further study is required to analyze the repair statistics under transient faults combined with permanent faults, potentially doing away with the need for Scrubbing altogether.

This preliminary case study evaluated the Jiggling system repairing a design smaller than itself and increasing its reliability. The high probability of repair before spares are exhausted suggests the method could be applied to larger benchmarks. Unpublished related experiments suggest that evolutionary algorithms are capable of finding solutions in spaces as large as 2^[exponent illegible in source], although with a significantly longer search time. Given that the average repair time for the studied benchmark is under two minutes and that permanent local fault interarrival time is likely to be over 6 months, a larger benchmark could have repair times several orders of magnitude larger and still achieve similar reliability figures. Each of the fault sequences tested for repairing the cm42a benchmark took about 1 day on a 1.4 GHz processor. The amount of processing power required will vary exponentially with benchmark size due to increased simulation time per configuration and longer evolutionary search.

As Jiggling is evaluated for larger benchmarks and as the fault model used is made more realistic, the possibility of using real hardware is more attractive. The Jiggling system could be implemented on a Xilinx FPGA subjected to radiation. This would give us accurate overhead measures as well as more accurate reliability figures since all cases mentioned above would be taken into account in the real system and generations could be truly performed at millions a second.

4 Conclusion

The TMR+Lazy Scrubbing+Jiggling approach to FPGA transient and permanent fault mitigation has been introduced. Lazy Scrubbing mitigates transients with less power and overhead than traditional Scrubbing and does not depend on a golden memory with configuration data. Jiggling self-repairs a system subjected to permanent faults requiring no golden memory for holding pre-compiled configurations nor an external golden microprocessor. The Jiggling repair module is implemented on the same FPGA and for the case studied requires 500 storage bits, 36 LUTs and 29 latches. This is small enough to be itself repairable by another Jiggler thus removing all single points of failure, provided the architecture is extended to sequential circuits. This repair module would also service multiple systems, amortising the overhead.

Reliability analysis for a small benchmark shows the Jiggling system using 2.8 times the overhead per module can survive with 0.99 probability: 17 times longer than a TMR/Simplex approach. This analysis takes into account both the stochastic natures of evolution and of the fault source. It is shown how the number of spares in such a system can be adapted to reach desired reliability guarantees for specific times during a mission.

A more thorough evaluation of this architecture would involve larger benchmarks and more accurate statistics. The latter may be achieved by a less pessimistic probability model and by allowing more time for simulated repairs. As the size of benchmarks renders simulation cost prohibitive, the Jiggling system may be implemented onto a real FPGA allowing a more accurate overhead analysis and its true evaluation in the presence of radiation. The reliability of Jiggling to mitigate transient as well as permanent faults should also be studied with the possibility of doing away with Scrubbing.

Further developments to the architecture are required to allow repair of sequential modules. The fitness evaluation procedure for sequential systems is necessarily more complex, and remains the key area for future work. For larger benchmarks, routing restrictions may be introduced as well as a more comprehensive fault model.
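The throughput and timeout figures quoted in Section 3 can be cross-checked with a short calculation. The sketch below is only an illustration: the function and constant names are hypothetical, the numeric inputs are the ones stated in the text, and the coupon-collector formula is an idealised model that assumes uniformly random input vectors. That closed form gives roughly 54.1 vectors for 16 possible inputs, which is close to, though not identical with, the 53.6 quoted in the text.

```python
from fractions import Fraction

# Figures taken from the text of Section 3; the names are illustrative only.
TEST_VECTORS_PER_EVAL = 53.6   # average random vectors applied per evaluation
OVERHEAD_CYCLES = 20           # clock cycles needed between evaluations
INDIVIDUALS_PER_GENERATION = 2 # both module configurations are evaluated
TIMEOUT_GENERATIONS = 4.32e6   # repair timeout used in the simulations


def generations_per_second(clock_hz: float) -> float:
    """Generations completed per second at the given clock frequency."""
    cycles = INDIVIDUALS_PER_GENERATION * (TEST_VECTORS_PER_EVAL + OVERHEAD_CYCLES)
    return clock_hz / cycles


def timeout_seconds(clock_hz: float) -> float:
    """Wall-clock time represented by the generation timeout."""
    return TIMEOUT_GENERATIONS / generations_per_second(clock_hz)


def expected_draws_to_see_all(n: int) -> float:
    """Coupon-collector expectation n * H_n, assuming uniform random draws."""
    return float(n * sum(Fraction(1, k) for k in range(1, n + 1)))


if __name__ == "__main__":
    rate = generations_per_second(1e6)
    print(f"generations per second at 1 MHz: {rate:.0f}")
    print(f"timeout at 1 MHz: {timeout_seconds(1e6) / 60:.1f} minutes")
    print(f"timeout at 1 GHz: {timeout_seconds(1e9):.3f} seconds")
    print(f"generations per day at 1 MHz: {rate * 86400 / 1e6:.0f}M")
    print(f"expected vectors to cover 16 inputs: {expected_draws_to_see_all(16):.1f}")
```

Under these assumptions the script is consistent with the 6793 generations per second, the sub-11-minute and sub-0.66-second timeouts, and the 587M generations per day stated in the text.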