2004 Long Survival of FPGA Systems Through Evolutionary Self Repair
Errata: The value of H used for repair simulations was 32, not 8. In equation 2, k=i-1.
that a pair could mutually repair each other, thus relying on no golden single point of failure.
Section 2 will introduce the TMR + Lazy Scrubbing + Jiggling architecture for transient and
permanent fault mitigation, section 3 will lay out a reliability analysis and section 4 will
provide some conclusions.

2 TMR + Lazy Scrubbing + Jiggling

The proposed mechanism extends TMR + Scrubbing. TMR provides fault tolerance, keeping the system
on-line at all times, and also votes out SEUs affecting user data. ‘Lazy Scrubbing’ mitigates SEUs
to configuration data, and ‘Jiggling’ repairs LPD by using the two healthy modules to repair the
faulty one. Once a Jiggling repair is complete three healthy modules are again available (Fig. 4).
Permanent faults can be repaired until spare resources are exhausted.

After 2.1, the paper deals entirely with permanent fault mitigation of combinational circuits
through Jiggling; sequential ones are left for future work.

2.1 Transient SEU Mitigation through Lazy Scrubbing

Traditional Scrubbing reconfigures a whole TMR system on an FPGA regularly from an external
memory. To reduce power consumption, Lazy Scrubbing reconfigures a module only when its output
differs from the other two. Instead of requiring a golden memory holding the module configuration,
the configuration is read from all three modules and a majority vote is taken of this data (taking
offsets into account) to reconfigure the faulty module. Lazy Scrubbing requires less power and
overhead, and has fewer single points of failure, than traditional Scrubbing. Lazy Scrubbing cannot
be used after the first Jiggling repair. Jiggling will still recover from SEUs in FPGA
configuration, although with a higher latency.

2.2 Jiggling Architecture Overview

Fig. 1 shows the simplest setup of a Jiggling system with three copies of module $M$ and a repair
module containing the voter and a minimal implementation of a GA. TMR can be applied at many
levels. Based on experiments to date, $M$ should be under a thousand gate equivalents to make the
repair process feasible. One GA circuit could service many TMR systems, all residing on a single
FPGA.

Fig. 1: Jiggling: TMR + minimal GA for repair.

A module is considered to have LPD if it fails to repair after Lazy Scrubbing. At this point,
Jiggling is initiated. Jiggling uses the remaining functional modules as a template of desired
behaviour to guide a (1+1) Evolutionary Strategy (ES) [8], the minimal expression of a GA, towards
a functional configuration which avoids or exploits the faulty element. The system is kept on-line
during this repair process by the two healthy modules driving the majority voter. Spare resources
are allocated in each module. Mutations inserted by the GA are single bit flips in the FPGA
configuration stream of the faulty module.

Given that permanent faults in FPGAs are not very frequent (indeed completely ignored by Xilinx
[5]) we can trade overhead and the knowledge contained therein for time and allow blind variation
and selection to find the knowledge required to complete the repair.

Section 2.3 will describe the Jiggling mechanism in greater detail and section 2.4 will cover its
hardware implementation. Section 2.5 will describe the simulator used to collect repair time
statistics while section 2.6 will present the probability model used for availability analysis.

2.3 Specification

2.3.1 Evolutionary Algorithm used during Repair

A (1+1) ES has one elite individual and one mutated version of it. If the mutant is equal or
superior it will replace the elite. Otherwise, another mutant is generated. This strategy has been
applied successfully to hardware evolution [12] and has been considered [2] to be an effective
strategy for exploring fitness landscapes with high neutrality, such as those of digital circuits.
It was mainly chosen here for its simplicity and because it allows a small hardware
implementation.

There may not be a single mutation restoring healthy behaviour to a damaged module, so an
exhaustive search would blindly iterate over all $2^L$ configurations, where $L$ is their length.
This is not practical even for the small circuit tackled in this work where $L = 1058$. A (1+1) ES
is a hill-climbing random walk within configuration space moving to a fitter (see 2.3.3) or
equivalent configuration at each step. This is not guaranteed to find a solution in time. But faced
with a stochastic fault source, no system is capable of guaranteeing survival. The probability of
survival for a Jiggling system is studied in 3.

A (1+1) ES is brittle in the presence of noise because if a bad mutant gets a lucky high fitness it
could replace a superior elite which would then be lost. Noise is present in the evaluation of
circuits on an FPGA when they are not constrained to be combinational. Sequential circuits behave
differently under changes in input pattern ordering and gate delays, which may vary with
environmental conditions. To discourage evolution from discarding a good elite, a History Window
method is used: the last $H$ accepted mutations are stored so that if the current elite’s fitness
is lower than the previous one’s, all $H$ mutations are reverted. By rolling back after
encountering individuals with noisy fitness, evolution is discouraged from exploring areas of the
fitness landscape encoding sequential circuits. If $p$ is the probability of a lucky high noisy
evaluation, then the probability a sequential circuit has been lucky $H$ times varies as $p^H$.
Thus, the larger $H$ is, the higher the chance that the circuit reverted to is stable.

2.3.2 Reconfiguration: Hardware Nature of Mutations

Mutations are single bit flips in the configuration stream of the faulty module $M_f$. Modules are
always allocated $2^A$ addresses of configurable logic blocks (CLB) so that the address of a
mutated block can be simply a randomly generated $A$-bit number. If a circuit only requires a
fraction of these CLBs, the rest are allocated as spare resources for repair. A mutation may affect
any part of the CLB such as a Look-up Table (LUT) or the routing. Ease of partial
self-reconfiguration and robustness to random configuration bit flips make some FPGA architectures
more suitable than others.
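The (1+1) ES with a History Window of 2.3.1, using the single bit flip mutations of 2.3.2, can be sketched in software as follows. This is only an illustrative sketch that loosely follows the repair cycle listed in 2.4.1: the names `jiggle` and `matching_bits`, the list-of-bits configuration and the toy fitness function are assumptions made for the example, and clearing the history after a rollback is also an assumption. In the real system fitness comes from comparing the faulty module's outputs with the voter (2.3.3), not from a known target configuration.

```python
import random


def jiggle(config, fitness, max_fitness, history_len=8,
           max_generations=100_000, rng=None):
    """(1+1) ES with a History Window over a list of configuration bits.

    Returns (repaired_config, generations), or (None, max_generations)
    if no fully fit configuration was found in time.
    """
    rng = rng or random.Random()
    config = list(config)
    history = []       # addresses of the last accepted mutations
    prev_fitness = 0   # step 1: previous-fitness register set to 0
    for generation in range(max_generations):
        cur_fitness = fitness(config)          # steps 2-3: evaluate elite
        if cur_fitness == max_fitness:
            return config, generation
        if cur_fitness < prev_fitness:         # step 4: elite got worse (noise)
            for addr in history:               # roll back the whole window
                config[addr] ^= 1
            history.clear()                    # assumption: window is emptied
            prev_fitness = 0
            continue
        prev_fitness = cur_fitness             # step 5
        addr = rng.randrange(len(config))      # step 6: random single bit flip
        config[addr] ^= 1
        mutant_fitness = fitness(config)       # steps 7-8: evaluate mutant
        if mutant_fitness == max_fitness:
            return config, generation + 1
        if mutant_fitness < prev_fitness:      # step 9: reject worse mutant
            config[addr] ^= 1
        else:                                  # step 10: accept equal or better
            prev_fitness = mutant_fitness
            history.append(addr)
            if len(history) > history_len:
                history.pop(0)
    return None, max_generations


def matching_bits(target):
    """Toy noise-free fitness: number of bits equal to a known-good target."""
    return lambda cfg: sum(a == b for a, b in zip(cfg, target))
```

With a noise-free fitness such as `matching_bits`, the rollback branch is never taken and the loop reduces to a plain hill-climbing random walk; the History Window only matters when evaluations can be "lucky".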
Fig. 2: Jigglers containing voters and a minimal GA are able to service multiple systems including
other Jigglers.

Fig. 3: Possible Jiggler GA implementation showing data flow directed by the Control FSM as in the
algorithm in 2.4.1.

2.3.3 Fitness Evaluation

Since normal circuit operation is not interrupted during repair of $M_f$, circuit fitness
evaluation is done on the fly using the inputs being applied during normal mission operation. A
score $s_{q,v}$ is collected for each output $q$ under every input vector $v$ until all vectors
have been encountered. If output $q$ of $M_f$ was ever incorrect under the application of $v$, then
$s_{q,v} = 0$; otherwise $s_{q,v} = 1$. An output is considered incorrect if it differs from the
corresponding output at the voter. The fitness score for the current configuration is simply the
sum of scores. This fitness score will guide evolution towards configurations whose behaviour is
increasingly similar to the healthy modules’.

2.3.4 Repairing the Repairer

Since the repair module is small, it would be feasible to repair it with the same strategy, once
this method has been adapted to sequential circuits. Each repair module could be itself tripled as
in Fig. 2 and repaired by another repair module. Both repair modules would then be in charge of
repairing multiple systems.

2.4 Hardware Implementation

2.4.1 Jiggling Repair Cycle

The (1+1) ES with History Window control loop for repairing faulty module $M_f$ is described below
in pseudocode. $F_c$ and $F_p$ are registers holding the current and previous fitness values. $E$
is a bitwise storage holding $s_{q,v}$ scores.

1. Set register $F_p = 0$.
2. Evaluate Elite: collect $s_{q,v}$ scores into $E$ until all input vectors have been
   encountered. If all scores are 1 then stop repair process.
3. Count number of 1s in $E$ and store it in the $F_c$ register.
4. If $F_c < F_p$ revert all mutations in history shift register $SR_H$ and go to step 1.
5. Shift the value of $F_c$ up into $F_p$.
6. Insert new random mutation $m$ in $M_f$.
7. Evaluate Mutant: collect scores as in step 2. If all scores are 1 then stop repair.
8. Count number of 1s in $E$ and store it in the $F_c$ register.
9. If $F_c < F_p$ revert mutation $m$.
10. Else shift the value of $F_c$ up into $F_p$ and push $m$ onto $SR_H$.
11. Go to step 2.

Control logic can be implemented as a Finite State Machine (FSM) with eight states. If an incorrect
configuration gets a lucky full score, repair will be resumed as soon as its behaviour is different
from the voter.

2.4.2 Structure

Figure 3 shows a possible hardware implementation for the Jiggler. The voter provides the system
fault tolerant output $y$ as the majority of the module outputs $y_0, y_1, y_2$. It also provides
the faulty module index $f$ and output $y$ to the minimal GA. A shift register chain $SR_F$ of size
two will hold $F_c$ and $F_p$. A Counter module will sum vector scores from $E$ into $F_c$. A
Comparator will check if $F_c < F_p$. A shift register chain $SR_H$ of size $H$ will store the
mutation history window with the addresses of the last $H$ mutated configuration stream bits. A
Random Number Generator (RNG) will generate the random mutation address $m$. A reconfiguration unit
will flip the configuration stream at a particular address and initiate the partial reconfiguration
procedure. Its operation will depend on the reconfiguration mechanism of the FPGA architecture
chosen.

2.4.3 Overhead Analysis

Given a circuit has $I$ inputs and $Q$ outputs, we need $2^I \times (1 + Q)$ bits storage for $E$
and $2 \times I$ bits for $SR_F$. If the address offset of a configuration bit within a module
requires $c$ bits, we need $H \times c$ bits storage for $SR_H$. The control module can be
implemented as an 8 state FSM and might require 3 latches and 10 four-input LUTs. The $c$-bit RNG
could be implemented as a linear feedback shift register with a small number of LUTs (roughly one
for the $c = 11$ case below) and $c$ latches. The reconfiguration module’s size would depend on the
FPGA architecture and could vary between 3 and 30 LUTs and some latches. For this analysis it is
assumed it requires 15 LUTs and 15 latches. The voter could be implemented in roughly $Q$ LUTs.
Given a CLB offset address within a module needs $A$ bits we need $Q \times A$ bits for the
reconfigurable module output addresses.

If $I = 5$, $Q = 10$, $H = 8$, $A = 5$, $c = 11$ the overhead is 352 + 10 + 88 + 50 = 500 storage
bits plus roughly 10 + 1 + 15 + 10 = 36 LUTs and 3 + 11 + 15 = 29 latches. 2.3.4 describes a scheme
for mitigating faults in this logic.

For larger circuits, the size of $E$ will dominate the sum because it grows exponentially with the
number of inputs. It should be noted that $E$ will usually be smaller than a whole extra module and
one Jiggler unit is capable of servicing multiple subsystems, thus reducing overhead per subsystem.

2.5 Simulated Jiggling

The Jiggling method was evaluated by collecting repair time statistics from a simulated model.

2.5.1 FPGA model

Various FPGA architectures, some of which have been deployed in space missions, can be simplified
to a model where each CLB holds one LUT and one D-Latch. Routing between such CLBs is limited yet
can be assumed universal [6] for small circuits such as those dealt with in this work. This first
study of the Jiggling approach tackles combinational circuits only, so the FPGA model adopted uses
four-input LUTs and no latches. We assume it is not complex to turn off all latch functionality for
a given area of an FPGA.

2.5.2 Simulator Characteristics

The simulator used is a simple version of an event driven digital logic simulator in which each
logic unit is in charge of its own behaviour when given discrete time-slices and the state of its
inputs. Routing is considered unlimited so any unit can be connected to any other, allowing
recurrent connections inducing sequential behaviour, so care must be taken to update
all units ‘simultaneously’. This is achieved by sending the time-slices to the logic units in two
waves: the first to read their inputs and the second to update their outputs. During each
evaluation, circuit inputs were kept stable for 30 time-slices and the outputs were read during the
last 5 time-slices.

Gate delays are simulated in time-slice units and are randomized at the start of each evaluation
with a Gaussian distribution with $\sigma = 0.1$ and a $\mu$ varying between 3 and 6, thus
simulating a probe subjected to a changing environment. $\mu$ increments, decrements or keeps its
value within the $[3, 6]$ range after an interval of simulated generations which is itself taken
from a Gaussian distribution. For the scenario studied in this paper these statistics would
translate to the mean of the frequently randomized gate delays changing roughly every minute.

The Stuck-At (SA) fault model was chosen as an industry standard providing a fairly accurate model
of the effects of radiation hazards at the gate level. SA faults can be introduced at any of the
logic units of the simulator simply by setting its output always to 0 or 1.

2.5.3 FPGA configuration stream encoding

As mentioned earlier, mutations are performed on the section of the FPGA configuration stream which
encodes the faulty module. During simulated evolutionary repair, the GA deals with linear bit
string genotypes which are equivalent to the simulated FPGA configuration stream. As mentioned in
2.3.2, there are $2^A$ addresses available. The last $I$ addresses are assigned to circuit inputs
while the remaining $2^A - I$ refer to LUT units within the module. $A$ bits are required per
address. The first $Q \times A$ configuration bits, where $Q$ is the number of circuit outputs,
encode the addresses from which module outputs will be read by the voter. This simulates mutations
to the configuration memory controlling routing to the module outputs. The rest of the stream is
divided into $2^A - I$ sections, one for each LUT. Each of these sections contains 16 bits encoding
the LUT and $4 \times A$ bits encoding where the inputs of this LUT are routed from. LUT addresses
are assigned in the order they appear in the configuration stream.

2.5.4 Evaluation Procedure

In order to mimic normal mission operation during circuit evaluation, the test vector applied
during each 30 time-slice presentation is chosen at random. All analysis in this paper assumes the
availability of a fresh input on every clock cycle. Evaluation ends when every test vector has been
encountered. For each input vector, the number of circuit output bits that were always correct
during its application is added to the total fitness score.

2.5.5 Collecting Repair Time Statistics

Repair time information is required to analyse the reliability of the Jiggling approach. Since all
modules are equal, repair statistics from a single module convey the information required. The time
taken to repair the first fault, when all spares are available, may be different to that of
repairing the $i$-th, when $i$ faults are present in the module and $i - 1$ successful repairs have
already taken place, probably allocating at least $i - 1$ spares.

To collect repair time information, random fault sequences with a length equal to the initial
number of spare LUTs are generated, making sure no LUT is failed twice. For each of these
sequences, faults are inserted into the simulated module in order and the number of generations of
evolution required to arrive at a fully functional configuration is recorded. $H$ consecutive
generations with a fully functional elite must elapse before the module is considered repaired, to
avoid lucky solutions. The repair time of the $i$-th fault in the $j$-th sequence will be referred
to as $t_{ij}$. If $t_{ij}$ exceeds 4.32 million (M) simulated generations the fault sequence is
aborted and the repair time of all unrepaired faults in the sequence is set to $\infty$, so
$\forall k \ge i,\ t_{kj} = \infty$. This limits the amount of CPU time used for simulation and
will significantly skew the statistics pessimistically since long-term missions may have months to
repair a permanent fault.

Fig. 4: Markov model of the Jiggling life cycle.

2.6 Probability Model

Figure 4 shows a Markov model with the three states the Jiggling system can be in. It begins its
life fault-free in the healthy state. When a fault arrives at module $M_f$ it moves to the repair
state. If during repair another fault arrives in $M_f$ it will stay in this state. If during repair
a fault arrives in one of the other modules it reaches the fail state. The system will only be
considered to go back to the healthy state when $M_f$ is repaired before a fault arrives at
$M_{\ne f}$.

The probability of moving from healthy to repair is dictated by the permanent fault rate, which may
be affected by such factors as usage, age and environmental stress. The probability of moving from
repair back to healthy will depend on the permanent fault rate and on time to repair, which is
likely to increase as faults accumulate and spares get used. Faults affecting the voter and GA
module could be dealt with as mentioned in 2.3.4.

2.6.1 Availability Analysis

The Jiggling system is considered to fail when, during repair of $M_f$, a fault arrives at
$M_{\ne f}$. Consider $P_i$ the probability of the $i$-th repair within the whole system (after the
$i$-th fault at any module) succeeding. This is the probability that, given $i - 1$ system repairs
were successful, the latest fault at $M_f$ will be repaired before a fault arrives at $M_{\ne f}$.
Given a fault rate $\lambda$ per unit area and a total area $a$ for the three modules, the Poisson
distribution of permanent local fault arrival during a time period $t$ has parameter
$\lambda a t$. The probability that the system has not failed before time $t$, i.e. its
reliability, is:

\[ R(t) = \sum_{n=0}^{\infty} \frac{(\lambda a t)^{n} e^{-\lambda a t}}{n!} \prod_{i=1}^{n} P_i \qquad (1) \]

The outer sum considers all possible fault counts $n$ during $t$. For each $n$ the probability of
$n$ faults occurring during $t$ is multiplied by the probability of all $n$ system repairs
succeeding. If $p_i$ is the probability of the $i$-th repair on a single module succeeding, the
probability $P_i$ of the $i$-th system repair succeeding can be calculated:

\[ P_i = \sum_{m=0}^{k} \binom{k}{m} \left(\frac{1}{3}\right)^{m} \left(\frac{2}{3}\right)^{k-m} p_{m+1}, \qquad k = i - 1 \qquad (2) \]
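Equations (1) and (2) can be evaluated numerically once per-module repair probabilities are available. The sketch below is a hedged illustration, not the authors' code: the function names, the truncation of the infinite sum at the number of supplied $P_i$ values, and the convention that a repair beyond the supplied $p_i$ list fails (no spares left) are all assumptions. Equation (2) is implemented in an assumed binomial reading, in which each of the $k = i - 1$ earlier faults lands on the currently faulty module with probability 1/3.

```python
import math


def system_repair_probs(p, n_max):
    """P_i for i = 1..n_max, assuming a binomial reading of Eq. (2).

    p[m] holds p_{m+1}, the probability that the (m+1)-th repair on a
    single module succeeds. Repairs beyond len(p) are assumed to fail.
    """
    probs = []
    for i in range(1, n_max + 1):
        k = i - 1                      # earlier faults in the whole system
        total = 0.0
        for m in range(k + 1):         # m of them hit the faulty module
            p_next = p[m] if m < len(p) else 0.0
            weight = math.comb(k, m) * (1 / 3) ** m * (2 / 3) ** (k - m)
            total += weight * p_next
        probs.append(total)
    return probs


def reliability(t, fault_rate, area, P):
    """Eq. (1), truncated at n = len(P) faults (a pessimistic truncation)."""
    mean = fault_rate * area * t       # Poisson parameter lambda * a * t
    result = 0.0
    all_repaired = 1.0                 # running product P_1 * ... * P_n
    for n in range(len(P) + 1):
        if n > 0:
            all_repaired *= P[n - 1]
        poisson = math.exp(-mean) * mean ** n / math.factorial(n)
        result += poisson * all_repaired
    return result
```

For example, `reliability(t, fault_rate, area, system_repair_probs(p, len(p)))` gives a lower bound on $R(t)$ for a measured list `p`, because every term with more faults than supplied probabilities is dropped from the sum.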
the $i$-th fault for each of the $j$ = ... fault sequences.
j\i 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
`
1 0 0 0 0 0 0 229 197 942 1413 0 ` ` `
4 `
5 0 0 5 2 25 3 0 79 147 1136 80 935 2908 1661 ` `
6 15 78 139 319 1246 131 151 53 1009 124 1986 1456 2751
7 0 0 324 30 328 473 41 1153 228 355 77 59 ` ` `
0 298 15 ` `
10
[Figure: plot of reliability R(t)]