Bulletproof: A Defect-Tolerant CMP Switch Architecture
Abstract

As silicon technologies move into the nanometer regime, transistor reliability is expected to wane as devices become subject to extreme process variation, particle-induced transient errors, and transistor wear-out. Unless these challenges are addressed, computer vendors can expect low yields and short mean-times-to-failure. In this paper, we examine the challenges of designing complex computing systems in the presence of transient and permanent faults. We select one small aspect of a typical chip multiprocessor (CMP) system to study in detail, a single CMP router switch. To start, we develop a unified model of faults, based on the time-tested bathtub curve. Using this convenient abstraction, we analyze the reliability versus area tradeoff across a wide spectrum of CMP switch designs, ranging from unprotected designs to fully protected designs with online repair and recovery capabilities. Protection is considered at multiple levels from the entire system down through arbitrary partitions of the design. To better understand the impact of these faults, we evaluate our CMP switch designs using circuit-level timing on detailed physical layouts. Our experimental results are quite illuminating. We find that designs are attainable that can tolerate a larger number of defects with less overhead than naïve triple-modular redundancy, using domain-specific techniques such as end-to-end error detection, resource sparing, automatic circuit decomposition, and iterative diagnosis and reconfiguration.

1. Introduction

A critical aspect of any computer design is its reliability. Users expect a system to operate without failure when asked to perform a task. In reality, it is impossible to build a completely reliable system; consequently, vendors target design failure rates that are imperceptibly small [23]. Moreover, a population of parts in the field must exhibit a failure rate that does not prove too costly to service. The reliability of a system can be expressed as the mean-time-to-failure (MTTF). Computing system reliability targets are typically expressed as failures-in-time, or FIT rates, where one FIT represents one failure in a billion hours of operation.

In many systems today, reliability targets are achieved by employing a fault-avoidance design strategy. The sources of possible computing failures are assessed, and the necessary margins and guards are placed into the design to ensure it will meet the intended level of reliability. For example, most transistor failures (e.g., gate-oxide breakdown) can be reduced by limiting voltage, temperature and frequency [8]. While these approaches have served manufacturers well for many technology generations, many device experts agree that transistor reliability will begin to wane in the nanometer regime. As devices become subject to extreme process variation, particle-induced transient errors, and transistor wearout, it will likely no longer be possible to avoid these faults. Instead, computer designers will have to begin to directly address system reliability through fault-tolerant design techniques.

Figure 1 illustrates the fault-tolerant design space we focus on in this paper. The horizontal axis lists the type of device-level faults that systems might experience. The sources of failure are widespread, ranging from transient faults due to energetic particle strikes [32] and electrical noise [28], to permanent wearout faults caused by electro-migration [13], stress-migration [8], and dielectric breakdown [10]. The vertical axis of Figure 1 lists design solutions to deal with faults. Design solutions range from ignoring any possible faults (as is done in many systems today), to detecting and reporting faults, to detecting and correcting faults, and finally fault correction with repair capabilities. The final two design solutions are the only solutions that can address permanent faults, with the final solution being the only approach that maintains efficient operation after encountering a silicon defect.

In recent years, industry designers and academics have paid significant attention to building resistance to transient faults into their designs. A number of recent publications have suggested that transient faults, due to energetic particles in particular, will grow in future technologies [5, 16]. A variety of techniques have emerged to provide a capability to detect and correct these types of faults in storage, including parity or error correction codes (ECC) [23], and logic, including dual or triple-modular spatial redundancy [23] or time-redundant computation [24] or checkers [30]. Additional work has focused on the extent to which circuit timing, logic, architecture, and software are able to mask out the effects of transient faults, a process referred to as “derating” a design [7, 17, 29].

In contrast, little attention has been paid to incorporating design tolerance for permanent faults, such as silicon defects and transistor wearout. The typical approach used today is to reduce the likelihood of encountering silicon faults through post-manufacturing burn-in, a process that accelerates the aging process as devices are subjected to elevated temperature and voltage [10]. The burn-in process accelerates the failure of weak transistors, ensuring that, after burn-in, devices still working are composed of robust transistors. Additionally, many computer vendors provide the ability to repair faulty memory and cache cells, via the in-
[Figure 1 (diagram not recovered). Axis labels: type of defect (manufacturing defect, wear-out defect, transient error) and design feature.]
plus a comparison to traditional fault tolerant techniques, such as ECC and triple-modular redundancy (TMR). Fi-
[Figure 3 (diagram not recovered). Surviving label: Monte Carlo simulation loop, 1000x.]
Figure 3: Simulation infrastructure for permanent faults. The defect infrastructure uses two models of the system, simulated in parallel. Defects are uniformly distributed in time and space and the input stimulus is a full coverage test that activates each internal circuit node of the system. A fault analyzer classifies defects based on the system response.

Figure 3 shows our simulation framework for evaluating the impact of silicon defects on a digital design. The framework consists of an event-driven simulator that simulates two copies of the structural, gate-level description of the design in parallel. Of these two designs, one copy is kept intact (golden model), while the other is subject to fault injection (defect-exposed model). The structural specification of our design was synthesized from a Verilog description using the Synopsys Design Compiler.

Our silicon defect model distributes defects in the design uniformly in time of occurrence and spatial location. Once a permanent failure occurs, the design may or may not continue to function depending on the circuit’s internal structure and the system architecture. The defect analyzer classifies each defect as exposed, protected, or unprotected but masked. In the context of defect evaluation, faults accumulate over time until the design fails to operate correctly. The defect that brings the system to failure is the last injected defect in each experiment and it is classified as exposed. A defect may be protected if, for instance, it is the first one to occur in a triple-modular-redundant design. An unprotected but masked defect is a defect that is masked because it occurs in a portion of the design that has already failed, for

3.2 Reliability of the Baseline CMP Switch Design

The first experiment is an evaluation of the reliability of the baseline CMP switch design. In Figure 4, we used the bathtub curve fitted for the post-65nm technology node as derived in Section 2. The FIT rate of this curve is 55000 during the grace period, which corresponds to a mean time to failure (MTTF) of 2 years. We used this failure rate in our simulation framework for permanent failures and we plotted the results.

The baseline CMP design does not deploy any protection technique against defects, and one defect is sufficient to bring down the system. Consequently, the graph of Figure 4 shows that in a large parts population, 50% of the parts will be defective by the end of the second year after shipment, and by the fourth year almost all parts will have failed. In this experiment, we have also analyzed a design variant which deploys triple-modular redundancy (TMR) at the full-system level (i.e., three CMP switches with voting gates at their outputs; designs with TMR applied at different granularities are evaluated in Section 5) to present better defect tolerance.

The TMR model used in this analysis is the classical TMR model which assumes that when a module fails it starts producing incorrect outputs, and if two or more modules fail, the output of the TMR voter will be incorrect. This model is conservative in its reliability analysis because it does not take into account compensating faults. For example, if two faults affect two independent output bits, then the voter circuit should be able to correctly mask both faults. However, the benefit gained from accounting for compensating faults rapidly diminishes with a moderate number of defects because the probabilities of fault independence are multiplied.
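As a quick arithmetic check on the two models compared above, the sketch below converts the quoted FIT rate into an MTTF and computes the mean number of defects a classical system-level TMR design absorbs before it fails. It is an illustration only: the function names are invented here, a year is taken as 8760 hours, and the TMR estimate assumes three equal-area modules, defects landing uniformly by area, and a voter of negligible area. None of these assumptions is taken from the simulation framework itself.

```python
from fractions import Fraction

HOURS_PER_YEAR = 24 * 365  # assumption: 8760-hour year


def fit_to_mttf_years(fit_rate):
    """One FIT is one failure per 10^9 hours, so MTTF = 10^9 / FIT hours."""
    mttf_hours = 1e9 / fit_rate
    return mttf_hours / HOURS_PER_YEAR


def mean_defects_to_failure_classical_tmr(modules=3):
    """Mean defect count at failure under the classical TMR model.

    Assumptions (not from the paper): equal-area modules, uniform defect
    placement, voter area ignored. The first defect breaks one module.
    Each later defect lands in that same module with probability
    1/modules (harmless); otherwise it breaks a second module and the
    voter output becomes incorrect.
    """
    p_second_module = 1 - Fraction(1, modules)
    return 1 + 1 / p_second_module  # geometric waiting time for the fatal hit


print(round(fit_to_mttf_years(55000), 2))       # MTTF in years for 55000 FIT
print(mean_defects_to_failure_classical_tmr())  # the unprotected baseline fails at 1
```

Under these simplifying assumptions, system-level TMR survives only a couple of defects on average while exposing roughly three times the area to defects, which is consistent with the small benefit the text attributes to it.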
Further, though the switch itself demonstrated a moderate number of independent fault sites, submodules within the design tended to exhibit very little independence. Also, in [21], it is demonstrated that even when TMR is applied on diversified designs (i.e. three modules with the same func-

From Figure 4, the simulation-based analysis finds that TMR provides very little reliability improvement over the baseline designs, due to the small number of defects that can be tolerated by system-level TMR. Furthermore, the area of the TMR protected design is more than three times the area of the baseline design. The increase in area raises the probability of a defect being manifested in the design, which significantly affects the design’s reliability. In the rest of the paper, we propose and evaluate defect-tolerant techniques that are significantly more robust and less costly than traditional defect-tolerant techniques.

4. Self-repairing CMP Switch Design

The goal of the Bulletproof project is to design a defect-tolerant chip-multiprocessor capable of tolerating significant levels of various types of defects. In this work, we address the design of one aspect of the system, a defect tolerant CMP switch. The CMP switch is much less complex than a modern microprocessor, enabling us to understand the entire design and explore a large solution space. Further, this switch design contains many representative components of larger designs including finite state machines, buffers, control logic, and buses.

4.1 Baseline Design

The baseline design consists of a CMP switch similar to the one described in [19]. This CMP switch provides wormhole routing pipelined at the flit level and implements credit-based flow control functionality for a two-dimensional torus network. In the switch pipeline, head flits will proceed through routing and virtual channel allocation stages, while all flits proceed through switch allocation and switch traversal stages. A high-level block diagram of the router architecture is depicted in Figure 5.

The implemented router is composed of four functional modules: the input controller, the switch arbiter, the cross-bar controller, and the crossbar. The input controller is responsible for selecting the appropriate output virtual channel for each packet, maintaining virtual channel state information, and buffering flits as they arrive and await virtual channel allocation. Each input controller is enhanced with an 8-entry 32-bit buffer. The switch arbiter allocates virtual channels to the input controllers, using a priority matrix to ensure that starvation does not occur. The switch arbiter also implements flow control by tallying credit information used to determine the amount of available buffer space at downstream nodes. The crossbar controller is responsible for determining and setting the appropriate control signals so that allocated flits can pass from the input controllers to the appropriate output virtual channels through the interconnect provided by the crossbar.

The router design is specified in Verilog and was synthesized using the Synopsys Design Compiler to create a gate-level netlist, which consists of approximately 10k gates. This router design consists of five input controllers which dominate the design’s area (86%). Also, the design is heavily dominated by combinational logic, which represents 84% of the total area, making it critical to choose protection techniques that can tolerate errors in logic effectively.

[Figure 5 (diagram not recovered). Surviving block labels: input buffers, VC state, routing logic, cross-bar.]
Figure 5: Baseline CMP switch design. A high level block diagram for a wormhole interconnection switch is presented. It consists of 5 input controllers, a cross-bar, a switch arbiter and a cross-bar controller.

4.2 Protection Mechanisms

A design that is tolerant to permanent defects must provide mechanisms that perform four central actions related to faults: detection, diagnosis, repair, and recovery. Fault detection identifies that a defect has manifested as an error in some signal. Normal operation cannot continue after fault detection as the hardware is not operating properly. Often fault detection occurs at a macro-level, thus it is followed by a diagnosis process to identify the specific location of the defect. Following diagnosis, the faulty portion of the design must be repaired to enable proper system functionality. Repair can be handled in many ways, including disabling, ignoring, or replacing the faulty component. Finally, the system must recover from the fault, purging any incorrect data and recomputing corrupted values. Recovery essentially makes the defect’s manifestation transparent to the application’s execution. In this section, we discuss a range of techniques that can be applied to the baseline switch to make it tolerant of permanent defects. The techniques differ in their approach and the level at which they are applied to the design.

In [9] the authors present the Reliable Router (RR), a switching element design for improved performance and reliability within a mesh interconnect. The design relies on an adaptive routing algorithm coupled with a link level retransmission protocol in order to maintain service in the presence of a single node or link failure within the network. Our design differs from the RR in that our target domain involves a much higher fault rate and focuses on maintaining switch service in the face of faults rather than simply routing around faulty nodes or links. However, the two techniques can be combined to provide a higher-reliability multiprocessor interconnection network.

4.2.1 General Techniques

The most commonly used protection mechanisms are dual and triple modular redundancy, or DMR and TMR [23]. These techniques employ spatial redundancy combined with a majority voter. With permanent faults, DMR provides only fault detection. Hence, a single fault in either of the redundant components will bring the system down. TMR
is more effective as it provides solutions to detection, recovery, and repair. In TMR, the majority voter identifies a malfunctioning hardware component and masks its effects on the primary outputs. Hence, repair is trivial since the defective component is simply outvoted whenever it computes an incorrect value. Due to this restriction, TMR is inherently limited to tolerating a single permanent fault. Faults that manifest in either of the other two copies cannot be handled. DMR/TMR are applicable to both state and logic elements and thus are broadly applicable to our baseline switch design.

[Figure 6 (diagram not recovered). Legend: a: correctly routed flit; b, c: in the switch pipeline; d: next flit to be routed; e: last flit buffered. Surviving block labels: input buffers, CRC checker, recovery logic, interconnect switch, error detection signal, routed flit (head, tail).]
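The voting behaviour described above can be sketched in a few lines. This is a minimal illustration of a bitwise 2-of-3 majority voter, not the switch's actual voter circuit; the function names and the 4-bit example word are hypothetical.

```python
def majority(a, b, c):
    """Bitwise 2-of-3 majority vote over three equally wide output words."""
    return (a & b) | (a & c) | (b & c)


def outvoted(a, b, c):
    """Indices of the copies that disagree with the voted result."""
    voted = majority(a, b, c)
    return [i for i, word in enumerate((a, b, c)) if word != voted]


GOOD = 0b1010  # hypothetical fault-free output word

# One faulty copy is masked, and the voter result identifies it.
assert majority(GOOD, GOOD, 0b0110) == GOOD
assert outvoted(GOOD, GOOD, 0b0110) == [2]

# Two copies that are wrong on the same bit defeat the voter.
assert majority(GOOD, 0b1000, 0b1000) != GOOD

# Two copies that are wrong on different bits are still masked
# (the "compensating faults" case that the classical TMR model ignores).
assert majority(GOOD, 0b1011, 0b1000) == GOOD
```

The last case shows why the classical model is conservative: a bitwise voter can survive two faulty copies when their errors never coincide on the same output bit.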
figuration procedures. By assuring that a correct copy of the flit is allocated into the input buffers and that the input buffer’s head/tail pointers are maintained correctly, we guarantee that each flit entering the switch will correctly reach the head of the queue and be routed through the switch’s pipeline.

To guarantee that a flit will get routed to the correct output channel, the flit is split into two parts, as shown in Figure 6(b). Each part will get its output channel requests from a different routing logic block, and access the requested output channel through a different switch arbiter. Finally, each part is routed through the cross-bar independently. To accomplish this, we add an extra routing logic unit and an extra switch arbiter. The status bits in the input controllers that store the output channel reserved by the head flit are duplicated as well. Since the cross-bar routes the flits at the bit-level, the only difference is that the responses to the cross-bar controller from the switch arbiter will not be the same for all the flit bits; instead, the responses for the first and the second parts of the flit come from the first and second switch arbiters, respectively. If a defect causes a flit to be misrouted, it follows that a single defect can impact only one of the two parts of the flit, and the error will be caught later at the CRC check.

The area overhead of the proposed error detection and recovery mechanism is limited to only 10% of the switch’s area. The area overhead of the CRC checkers, the Recovery Logic and the Buffer Checker units is almost negligible. More specifically, the area of a single CRC checker is 0.1% of the switch’s area and the area for the Buffer Checker and the Recovery Logic is much less significant. The area overhead of the proposed mechanism is dominated by the extra Switch Arbiter (5.7%), the extra Routing Logic units (5 x 0.5% = 2.5%), and the additional CRC bits (1.5%). As we can see, the proposed error detection and recovery mechanism has 10X less area overhead than a naïve DMR implementation.

Resource Sparing. To provide defect tolerance to the switch design, we use resource sparing for selected partitions of the switch. During the switch operation only one spare is active for each distinct partition of the switch. For each spare added in the design, there is an additional overhead for the interconnection and the required logic for enabling and disabling the spare. For resource sparing, we study two different techniques, dedicated sparing and shared sparing. In the dedicated sparing technique, each spare is owned by a single partition and can be used only when the specific partition fails. When shared sparing is applied, one spare can be used to replace a set of partitions. In order for the shared sparing technique to be applied, it requires multiple identical partitions, such as the input controllers for the switch design. Furthermore, each shared spare requires additional interconnect and logic overhead because it must be able to replace any one of several possibly defective partitions.

System Diagnosis and Reconfiguration. As a system diagnosis mechanism, we propose an iterative trial-and-error method which recovers to the last correct state of the switch, reconfigures the system, and replays the execution until no error is detected. The general concept is to iterate through each spared partition of the switch and swap in the spare for the current copy. For each swap, the error detection and recovery mechanism performs a system replay. Eventually, the partition that happens to possess the current error will be disabled and its corresponding spare enabled. When this occurs, the system diagnosis mechanism will detect correct system behavior and terminate the replay mode. Using this approach, the faulty piece of logic is identified and correctly disabled.

In order for the system diagnosis to operate, it maintains a set of bit vectors as follows:
• Configuration Vector: It indicates which spare partitions are enabled.
• Reconfiguration Vector: It keeps track of which configurations have been tried and indicates the next configuration to be tried. It gets updated at each iteration of system diagnosis.
• Defects Vector: It keeps track of which spare partitions are defective.
• Trial Vector: It indicates which spare partitions are enabled for a specific system diagnosis iteration.

[Figure 7 (diagram not recovered). Surviving labels: Partition D, Partition C, Partition B, Partition A.]

Attempt#  Configuration  Reconfiguration  Defects  Trial  Comment
1         0010           0001             0010     0011   Error Detected
2         0010           0010             0010     0010   Iteration Skipped
3         0010           0100             0010     0110   Error Detected
4         0010           1000             0010     1010   Correct Execution

Trial = (Defects + Reconfiguration) ⊕ Configuration ⊕ Defects
New Configuration = Trial Configuration = 1010
New Defects = Defects + (Configuration ⊕ Trial) = 1010

Figure 7: Example system diagnosis and reconfiguration. This example shows the system with four partitions and one spare for each partition. The first spare of partition B contains a previously detected and corrected defect, thus the latest error in execution is caused by the defect in the first spare of partition D.

Figure 7 demonstrates an example where system diagnosis is applied on a system with four partitions and two copies (one spare) for each partition. The first copy of partition B has a detected defect (mapped at the Defects Vector). The defect in the first copy of partition D is a recently manifested defect and is the one that caused erroneous execution. Once the error is detected, the system recovers to the last correct state using the mechanism described in the previous section (see error detection and recovery mechanism), and it initializes the Reconfiguration Vector. Next, the Trial Vector is computed using the Configuration, Reconfiguration, and Defects vectors. If the Trial Vector is the same as the Configuration Vector (attempt 2), due to a defective spare, the iteration is skipped. Otherwise, the Trial Vector is used as the current Configuration Vector, indicating which spare partitions will be enabled for the current trial. The execution is then replayed from the recovery point until the error detection point. If the error is detected, a new trial is initiated by updating the Reconfiguration Vector and recomputing the Trial Vector. If no error is detected, meaning that the trial configuration is a working configuration, the Trial Vector is copied to the Configuration Vector and the Defects Vector is updated with the located defective copy. If all the trial configurations, which are equal in number to the partitions, are exhausted and no working configuration was found, then the defect was a fatal defect and the
system won’t be able to recover. The example implementation of the system diagnosis mechanism demonstrated in Figure 7 can be adapted accordingly for designs with more partitions and more spares.

[Figure 8 (diagram not recovered). Panel annotations: (b) 1 partition, 2 partition outputs, 8 hyperedges; (c) 2 partitions, 3 partition outputs, 7 hyperedges, 1 cut edge; (d) 3 partitions, 5 partition outputs, 5 hyperedges, 3 cut edges. Arrow labels: Generate Hypergraph, Bisection Algorithm, Bisection Algorithm.]
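The Figure 7 walkthrough can be replayed with a short sketch of the iterative diagnosis loop. Several things here are assumptions rather than the paper's implementation: the function and parameter names are invented, `replay_ok` is a stand-in for the recover-and-replay step, and the vector update rules take `+` as bitwise OR and the combining operator as XOR, a reading checked only against the four rows of the Figure 7 table.

```python
def diagnose(configuration, defects, n_partitions, replay_ok):
    """Try one spare swap per partition until a replay runs without error.

    Bit i of each vector refers to partition i (bit 0 = partition A).
    `replay_ok(trial)` must return True when no error is detected under
    the trial configuration. Returns (new_configuration, new_defects, log);
    the first two are None when every trial fails (fatal defect).
    """
    log = []
    for attempt in range(1, n_partitions + 1):
        reconfiguration = 1 << (attempt - 1)  # one-hot: next partition to swap
        # Assumed operators: '+' is bitwise OR, the combining operator is XOR.
        trial = ((defects | reconfiguration) ^ configuration) ^ defects
        if trial == configuration:
            log.append((attempt, trial, "iteration skipped"))
            continue
        if replay_ok(trial):
            log.append((attempt, trial, "correct execution"))
            return trial, defects | (configuration ^ trial), log
        log.append((attempt, trial, "error detected"))
    return None, None, log


# Figure 7 scenario: first copies of partitions B and D are bad, spares are good.
faulty_first_copies = 0b1010
new_config, new_defects, log = diagnose(
    configuration=0b0010,
    defects=0b0010,
    n_partitions=4,
    replay_ok=lambda trial: (trial & faulty_first_copies) == faulty_first_copies,
)
for attempt, trial, outcome in log:
    print(attempt, format(trial, "04b"), outcome)
```

With these inputs the loop visits the same four trial vectors as the table and ends with both the Configuration and Defects vectors equal to 1010. Extending it to more spares per partition would need wider per-partition fields, which this sketch does not attempt.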
We also consider the Built-In-Self-Test (BIST) technique [...] distinct partition in the design we store in ROM automatically generated test vectors. During system diagnosis with BIST, these test vectors are applied to each partition of the system through scan chains to check its functional correctness and locate the defective partition.

Both the iterative replay and BIST techniques can be implemented as a separate module from the switch, and the area overhead for their implementation can be shared by a large number of switches in a possible chip multiprocessor design.

4.2.3 Level of Protection

The error resiliency achieved by implementing one of the protection techniques (e.g., TMR or sparing) is highly dependent on the granularity of the partitions. In general, the larger the granularity of the partitions, the less robust the design. However, as the granularity of the partition becomes smaller, more logic is required. For TMR, each output for a given partition requires a MAJORITY gate. Since each added MAJORITY gate is unshielded from permanent defects, poorly constructed small partitions can make a design less error resilient than designs with larger partitions.

To illustrate these trade-offs, consider the baseline switch again in Figure 5. Sparing and TMR can be done on the system-level where the whole switch is replicated and each output requires some extra logic like a MUX or MAJORITY gate. A single permanent error makes one copy of the switch completely broken. However, the area overhead beyond the spares is limited to only a gate for each primary output. A slightly more resilient design considers partitioning based on the components that make up the switch. For instance, each of the five input controllers can have a spare along with the arbiter, cross-bar, and the cross-bar controllers. This partitioning approach leaves the design more protected as a permanent defect in the input controller would make only that small partition broken and not the other four input controllers. There is a small area penalty for this approach as the sum of the outputs for each partition is greater than the number of outputs of the switch as a whole. However, the added unprotected logic is still insufficient to worsen the error resiliency of the design. Finally, consider partitioning at the gate-level. In this approach, each gate is in its own partition. In this scheme, the error resiliency for each partition is extremely high because the target is very small. However, the overhead of this approach requires an extra gate for each gate in the switch design. Thus for TMR, the area would be four times that of the original design. In addition, because each added gate is unprotected, the susceptibility of this design to errors is actually greater than that of the larger partitions used in the component-based partition.

The previous analysis shows that the level of partitioning affects the error resiliency and the area overheads of the design. In this paper, we introduce a technique called Automatic Cluster Decomposition that generates partitions that minimize area overhead while maximizing error resiliency.

4.2.4 Automatic Cluster Decomposition

Automatic Cluster Decomposition takes a netlist and creates partitions with the end goal that each partition is approximately the same size and that there is a minimal number of outputs required for each partition generated. Generating these partitions requires that the netlist be converted into a graph that can then be partitioned using a balanced-recursive min-cut algorithm [15, 12] that has found use in fields like VLSI [2].

Figure 8: The process of automatic cluster decomposition. In part (a) a sample netlist is shown with 2 primary outputs, along with its corresponding hypergraph in part (b). Part (c) shows the hypergraph after a min-cut bisection creating two unbalanced partitions. Part (d) shows the final 3-way partition resulting from a bisection of the largest partition.

Figure 8 shows how these partitions are generated from the netlist of a design. First, the netlist pictured in part (a) is used to generate a hypergraph shown in part (b). A hypergraph is an extension of a normal graph where one edge can connect multiple vertices. In the figure, each vertex represents a separate net in the design. A hyperedge is drawn around each net and its corresponding fanout. If that net is placed in a different partition than one of its fanouts, that net becomes an output for its partition, thus increasing the overhead of the partition. Thus the goal of the partitioning algorithm is to minimize the number of hyperedges that are cut. For this example, we show a 3-way partitioning of the circuit. The algorithm in [12] performs a recursive min-cut operation where the original circuit is bisected and then one of these partitions is bisected again. In Figure 8(c), the hypergraph is bisected and the number of hyperedges cut is reported. Notice that one of those pieces is twice the size of the other one. Because 3-way partitioning is desired, one piece is slightly larger so that the final partitions are fairly balanced where each partition has about the same number of vertices/nets. Figure 8(d) shows the final partitioning assignment of the hypergraph along with the number of hyperedges cut, which corresponds to the number of total outputs for all the partitions not including the original outputs of the system. In practice, several iterations of [12] are run as the algorithm is heuristically-based and requires many runs for optimal partitions. Also, some imbalance in partition sizes is tolerated if the number of hyperedges cut is significantly smaller as a result.

Once the hypergraph is partitioned, several different replication strategies can be used such as sparing and TMR. In Figure 9, an example of performing 3-way partitioning over an arbitrary piece of logic using one spare per partition is shown. In part (a), sparing is performed at the system-level.
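The cost that the partitioner tries to minimize, the number of cut hyperedges, is easy to compute for a given assignment. The sketch below does only that counting step; it is not the min-cut algorithm of [12], and the ten-net netlist, the partition assignments, and all names are hypothetical values chosen for illustration.

```python
def cut_hyperedges(fanout, partition_of):
    """Nets whose hyperedge (the net plus its fanout) spans several partitions.

    Each such net must become an output of its partition, so the length of
    the returned list is the number of extra MUX or MAJORITY gates needed.
    """
    return sorted(
        net
        for net, sinks in fanout.items()
        if any(partition_of[sink] != partition_of[net] for sink in sinks)
    )


# Hypothetical netlist: each net maps to the nets it drives; I and J are
# the primary outputs and drive nothing inside the design.
fanout = {
    "A": ["F"], "B": ["F", "G"], "C": ["G"], "D": ["H"], "E": ["H"],
    "F": ["I"], "G": ["I", "J"], "H": ["J"], "I": [], "J": [],
}
primary_outputs = ["I", "J"]

# A bisection, followed by a second bisection of the larger piece.
two_way = {n: 0 for n in "ABCFGI"} | {n: 1 for n in "DEHJ"}
three_way = {n: 0 for n in "ABF"} | {n: 1 for n in "CGI"} | {n: 2 for n in "DEHJ"}

for name, assignment in (("2-way", two_way), ("3-way", three_way)):
    cut = cut_hyperedges(fanout, assignment)
    print(name, "cut nets:", cut, "total outputs:", len(primary_outputs) + len(cut))
```

A partitioner would evaluate this cost for many candidate assignments and keep the one with the fewest cut nets, subject to the balance constraint on partition sizes. The dictionary-union operator used above requires Python 3.9 or later.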
[Figure 9 (diagram not recovered). Panels (a) and (b); surviving labels: full system, Part. 1, Part. 2, Part. 3, MUX, Select.]
Figure 9: Examples of one spare systems. In part (a) sparing without any cluster decomposition is shown. In part (b) sparing is applied to a three-way partition. Cluster decomposition increases MUX interconnect overhead, but provides higher protection due to the smaller granularity of the sparing.

Table 1: Mnemonic table for design configurations. For each portion of the naming convention, we show the possible mnemonics with the related description. The last portion provides some example design configurations.

Mnemonic group | Mnemonic | Description
Level of applying defect tolerance technique | S | System level
Level of applying defect tolerance technique | C | Component level
Level of applying defect tolerance technique | G | Gate level
Level of applying defect tolerance technique | S+CL | System level clusters
Level of applying defect tolerance technique | C+CL | Component level clusters
Defect tolerance techniques (can be applied in combinations) | TMR | Triple Modular Redundancy
Defect tolerance techniques (can be applied in combinations) | #SP | # dedicated spares for each partition
Defect tolerance techniques (can be applied in combinations) | #SH(X) | # shared spares for partition of type X
Defect tolerance techniques (can be applied in combinations) | ECC | Error Correction Codes applied at state
System diagnosis technique | IR | Iterative replay
System diagnosis technique | BIST | Built-In-Self-Test
Example configurations | S+CL_1SP_IR | System level clusters with 1 spare for each partition and iterative replay.
Example configurations | C_2SH(IC)+1SP_BIST | Component level with 2 shared input controllers and one dedicated spare for the rest of the components. BIST for system diagnosis.
Example configurations | C+CL_TMR+ECC | Component level clusters TMR with ECC protected state.
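The configuration names in Table 1 can be decoded mechanically. The sketch below is one possible parser, assuming (from the example names only) that the three fields are separated by underscores and that combined techniques are joined with `+`; the function name and the returned dictionary layout are hypothetical.

```python
LEVELS = {
    "S": "system level",
    "C": "component level",
    "G": "gate level",
    "S+CL": "system level clusters",
    "C+CL": "component level clusters",
}
DIAGNOSIS = {"IR": "iterative replay", "BIST": "built-in self-test"}


def parse_configuration(name):
    """Split a name such as 'S+CL_1SP_IR' into level, techniques, diagnosis.

    The diagnosis field is optional because the TMR example in Table 1
    carries no system diagnosis mnemonic.
    """
    fields = name.split("_")
    if len(fields) not in (2, 3) or fields[0] not in LEVELS:
        raise ValueError(f"unrecognized configuration name: {name!r}")
    diagnosis = fields[2] if len(fields) == 3 else None
    if diagnosis is not None and diagnosis not in DIAGNOSIS:
        raise ValueError(f"unrecognized diagnosis mnemonic: {diagnosis!r}")
    return {
        "level": LEVELS[fields[0]],
        "techniques": fields[1].split("+"),  # kept as raw mnemonics
        "diagnosis": DIAGNOSIS[diagnosis] if diagnosis else None,
    }


for example in ("S+CL_1SP_IR", "C_2SH(IC)+1SP_BIST", "C+CL_TMR+ECC"):
    print(example, "->", parse_configuration(example))
```

Splitting on the underscore first matters because the level mnemonic itself may contain a `+` (as in `S+CL`), so the `+` separator is only meaningful inside the technique field.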
The outputs of both identical units are fed into a MUX, where a register value determines which unit is active and which one is inactive. In part (b), the circuit is partitioned into three pieces. Notice that each output at a partition boundary must now have a MUX associated with it. Also, each partition requires a register to determine which copy is active. Thus, 3-way partitioning requires a total of three registers, and the number of MUXes corresponds to the number of outputs generated by the partitioning.

4.3 Switch Designs

Each configuration providing a defect tolerant switch design is characterized by three parameters: level of protection, techniques applied, and system diagnosis method. Each configuration is named using the convention level_technique_diagnosis. The configurations using TMR as the defect tolerance technique do not use the end-to-end error detection, recovery and system diagnosis techniques, since TMR inherently provides error detection, diagnosis and recovery. All other configurations use the end-to-end error detection and recovery technique, along with either iterative replay or BIST for system diagnosis. Table 1 describes the choices that we considered in our simulated configurations for the three parameters, and it gives some example configurations along with their names.

5. Experimental Results

To evaluate the effectiveness of our domain specific defect tolerance techniques in protecting the switch design, we simulated various design configurations with both traditional and domain specific techniques. To assess each configuration, we take into account the area and time overheads of the design along with the mean number of defects that the design can tolerate. We also introduce a new metric, the Silicon Protection Factor (SPF), which gives a more representative notion of the amount of protection that a given defect tolerance technique offers to the system. Specifically, the SPF is computed by dividing the mean number of defects needed to cause a switch failure by the area overhead of the protection techniques. In other words, the higher the SPF, the more resilient each transistor is to defects. Since the number of defects in a design is proportional to the area of the design, this metric is more appropriate for assessing the effectiveness of the silicon protection techniques.

In Table 2, we list the design configurations that we simulated. The naming convention followed for representing each configuration is described in Table 1. For each simulated design configuration, we provide the area overhead needed for implementing the specific design. This area overhead includes the extra area needed for the spare units, the majority gates, the logic for enabling and disabling spare units, and the logic for the end-to-end error detection, recovery and system diagnosis (different configurations have different requirements for the extra logic added). We notice that the design configurations with the higher area overheads are the ones applying BIST for system diagnosis. This is due to the extra area needed for storing the test vectors necessary for self-testing each distinct partition in the design, along with the additional interconnection and logic needed for the scan chains. Even though the area overhead for the test vectors can be shared over the total number of switches per chip, the area overhead of the BIST technique is still rather large. Another design configuration with high area overhead is the one where TMR is applied at the gate level, due to the extra voting gate needed for each gate in the baseline switch design. On the other hand, designs with shared spares achieve low area overhead (under two), since not every part of the switch is duplicated. The area overhead for the rest of the design configurations depends on the number of spares per partition.

In the fourth column of the table, we provide the mean number of defects to failure for each design configuration. The design configurations providing a high mean number of defects to failure are the ones employing the ACD (Automatic Cluster Decomposition) technique. Another point of interest is that techniques employing ECC perform poorly, even when coupled with automatic cluster decomposition. Although state is traditionally protected by ECC, when a design is primarily combinational logic, like our switch, the cost of separating the state from the logic exceeds the protection given to the state elements. In other words, if the state is not considered in the ACD analysis and is therefore not part of any of the spared partitions, the boundary between the state and the spared partitions must have some unprotected interconnection logic. This added logic, coupled with the unprotected logic required by ECC, makes ECC undesirable in a logic dominated design.

The SPF values for each design are presented in the fifth column of Table 2. The highest SPFs are given by the design configurations that employ automatic cluster decomposition, with the highest being design S+CL_2SP_IR at 11.11. Even though design S+CL_2SP_BIST uses the same sparing strategies, the area overhead added by BIST decreases the design's SPF significantly. It is interesting that two design configurations have SPFs lower than 1. The first one is TMR applied at the system level, which can tolerate 2.5 defects but has an area overhead of more than triple, thus making the new design less defect tolerant than the baseline switch design by 18%. The second one is where the state is [...]

Table 2: Results of the evaluated designs. For each design configuration we report the mnemonic, the area factor over the baseline design, the number of defects that can be tolerated, the SPF, the number of partitions and an estimate of the impact on the system delay.

Key  Design Configuration  Area O.head  Defects  SPF    #Part.  %Dly
1    S_TMR                 3.02         2.49     0.82   1       0.00
2    S+CL_TMR              3.08         16.78    5.45   241     22.22
3    S+CL_TMR+ECC          3.07         6.92     2.25   185     27.78
4    C_TMR                 3.04         4.68     1.54   12      0.00
5    C+CL_TMR              3.09         15.86    5.13   223     18.75
6    C+CL_TMR+ECC          3.11         6.25     2.01   298     25.00
7    G_TMR                 4.00         4.00     1.00   10540   100.00
8    S_1SP_IR              2.22         3.27     1.47   1       0.00
9    S+CL_1SP_IR           2.30         17.53    7.63   206     22.22
10   S+CL_1SP_BIST         3.16         17.53    5.54   206     22.22
11   S+CL_1SP+ECC_IR       2.48         5.96     2.41   183     27.78
12   S+CL_1SP+ECC_BIST     3.34         5.96     1.78   183     27.78
13   C_1SP_IR              2.24         5.87     2.62   12      0.00
14   C_1SP_BIST            2.79         5.87     2.62   12      0.00
15   C+CL_1SP_IR           2.33         16.04    6.88   223     18.75
16   C+CL_1SP+ECC_IR       2.51         5.34     2.13   138     25.00
17   S_2SP_IR              3.32         5.95     1.79   1       0.00
18   S+CL_2SP_IR           3.42         37.99    11.11  206     22.22
19   S+CL_2SP_BIST         4.29         37.99    8.86   206     22.22
20   S+CL_2SP+ECC_IR       3.39         8.64     2.55   118     22.22
21   C_2SP_IR              3.36         13.07    3.90   12      0.00
22   C_2SP_BIST            3.90         13.07    3.35   12      0.00
23   C+CL_2SP_IR           3.44         32.33    9.39   208     18.75
24   C+CL_2SP_BIST         4.31         32.33    7.50   208     18.75
25   C+CL_2SP+ECC_IR       3.41         7.49     2.20   103     25.00
26   C_2SH(IC)_IR          1.52         3.15     2.07   12      0.00
27   C_3SH(IC)_IR          1.71         4.14     2.43   12      0.00
28   C_4SH(IC)_IR          1.89         5.02     2.65   12      0.00
29   C_5SH(IC)_IR          2.08         5.90     2.84   12      0.00
30   C_2SH(IC)+1SP_IR      1.74         4.40     2.53   12      0.00
31   C_3SH(IC)+1SP_IR      1.93         5.79     3.01   12      0.00
32   C_4SH(IC)+1SP_IR      2.12         7.10     3.34   12      0.00
33   C_5SH(IC)+1SP_IR      2.41         8.39     3.48   12      0.00
34   C_2SH(IC)+2SP_IR      1.93         5.01     2.60   12      0.00
35   C_3SH(IC)+2SP_IR      2.12         6.57     3.09   12      0.00
36   C_4SH(IC)+2SP_IR      2.30         8.10     3.52   12      0.00
37   C_5SH(IC)+2SP_IR      2.50         9.58     3.84   12      0.00
38   S_ECC                 1.18         1.16     0.98   12      0.00
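
As a quick check of the SPF definition, the following minimal sketch recomputes the SPF from the area overhead and mean-defects-to-failure values of a few Table 2 rows. The function and variable names are illustrative assumptions and are not part of the original evaluation flow; only the numbers are taken from Table 2.

```python
from typing import Dict, Tuple


def silicon_protection_factor(mean_defects_to_failure: float,
                              area_overhead: float) -> float:
    """SPF = mean number of defects to failure / area factor over the baseline."""
    if area_overhead <= 0:
        raise ValueError("area overhead must be positive")
    return mean_defects_to_failure / area_overhead


# (area overhead, mean defects to failure) for a few rows copied from Table 2.
TABLE2_ROWS: Dict[str, Tuple[float, float]] = {
    "S_TMR": (3.02, 2.49),
    "G_TMR": (4.00, 4.00),
    "S+CL_2SP_IR": (3.42, 37.99),
    "S+CL_2SP_BIST": (4.29, 37.99),
    "S_ECC": (1.18, 1.16),
}


def spf_table(rows: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Return the SPF of every row, rounded to two decimals as in Table 2."""
    return {
        name: round(silicon_protection_factor(defects, area), 2)
        for name, (area, defects) in rows.items()
    }


if __name__ == "__main__":
    for name, value in spf_table(TABLE2_ROWS).items():
        print(f"{name:15s} SPF = {value:.2f}")
```

The recomputed values match the SPF column for these rows. The 0.82 obtained for S_TMR is consistent with the statement that system-level TMR is 18% less defect tolerant than the baseline, assuming the unprotected baseline corresponds to an SPF of 1.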
[Figure 11 plot: Area Overhead (vertical axis, 0 to 4) versus Normalized Defect Resiliency - Silicon Protection Factor (SPF) (horizontal axis, 0 to 12). Each evaluated design is plotted as a point labeled with its key from Table 2; the chart is annotated with the directions "Cheaper designs" and "More robust designs".]
Figure 11: Pareto chart of the explored solutions. The evaluated designs are plotted on an area vs. SPF chart. The line across the chart connects the set of optimal solutions. See Table 1 for explanations of design points.
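
One way to read the set of optimal solutions in Figure 11 is as a Pareto front: a design is kept if no other design has both lower-or-equal area overhead and higher-or-equal SPF. The sketch below illustrates that reading on a subset of Table 2 rows. It is an assumption about how such a front can be computed, not the procedure used to draw the figure, and the function name `pareto_front` is hypothetical.

```python
from typing import List, Tuple

# (design mnemonic, area overhead, SPF) for a subset of rows from Table 2.
DESIGNS: List[Tuple[str, float, float]] = [
    ("S_TMR", 3.02, 0.82),
    ("G_TMR", 4.00, 1.00),
    ("S_1SP_IR", 2.22, 1.47),
    ("S+CL_1SP_IR", 2.30, 7.63),
    ("C+CL_1SP_IR", 2.33, 6.88),
    ("S+CL_2SP_IR", 3.42, 11.11),
    ("C_2SH(IC)_IR", 1.52, 2.07),
    ("C_3SH(IC)_IR", 1.71, 2.43),
    ("C_2SH(IC)+1SP_IR", 1.74, 2.53),
    ("S_ECC", 1.18, 0.98),
]


def pareto_front(designs: List[Tuple[str, float, float]]) -> List[str]:
    """Return the non-dominated designs (minimize area, maximize SPF).

    Designs are visited in order of increasing area overhead; a design is on
    the front only if its SPF beats every cheaper (or equally cheap) design.
    """
    front: List[str] = []
    best_spf = float("-inf")
    for name, _area, spf in sorted(designs, key=lambda d: (d[1], -d[2])):
        if spf > best_spf:
            front.append(name)
            best_spf = spf
    return front


if __name__ == "__main__":
    for name in pareto_front(DESIGNS):
        print(name)
```

On this subset, the front runs from the cheapest design (S_ECC) through the shared-spare configurations to the two system-level cluster designs S+CL_1SP_IR and S+CL_2SP_IR. Adding the remaining Table 2 rows would insert further points on the front, so the output should not be read as the exact line drawn in Figure 11.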
protection. Design configurations with moderate SPFs but with much less cost in area overhead are: C_3SH(IC)_IR, C_2SH(IC)+1SP_IR, C_3SH(IC)+1SP_IR, and C_2SH(IC)+2SP_IR. These design configurations use shared spares of [...] less than 2X, but offering SPFs of 2.5-3 at the same time. Such designs are interesting, since they keep the implementation cost at low levels and provide an attractive solution for defect tolerance.

Two other interesting design configurations are C+CL_1SP_IR and S+CL_1SP_IR. These two designs use the same techniques, but design S+CL_1SP_IR applies the ACD technique at the system level, and design C+CL_1SP_IR at the component level. The area cost of the two designs is almost the same, but S+CL_1SP_IR provides [...] S+CL_2SP_IR and C+CL_2SP_IR. This suggests that applying the ACD technique at the system level can offer more effective defect tolerance at the same cost in area.

[Figure 12 plots, parts (a) and (b): Defected Parts (%) on the left vertical axis and Failure Rate (FIT) on the right vertical axis, versus Time (Years); part (a) spans 0 to 30 years and part (b) spans 0 to 12 years. Curves are shown for designs 9. S+CL_1SP_IR (SPF=7.63), 4. C_TMR (SPF=1.54), 18. S+CL_2SP_IR (SPF=11.11), 31. C_3SH(IC)+1SP_IR (SPF=3.01), and 13. C_1SP_IR (SPF=2.62).]
Figure 12: Fault tolerance of some interesting design configurations. Part (a) superimposes the FIT rate of the bathtub model with the percentage of defective parts over time. In addition, part (b) takes into account the breakdown period.

In Figure 12, we present how some of the design configurations affect the lifetime of the switch design for a future post-65nm technology where the mean time between failures manifested on a switch is 2 years (a failure rate of 55000 FITs). The graph's horizontal axis represents the years that the switch design is operating. The vertical axes represent the percentage of defective parts over a population of switches (left axis) and the baseline switch's failure rate (right axis). The baseline switch's lifetime failure rate for the given technology is presented by the darker thick line, forming the bathtub curve. In Figure 12(a), it only forms a part of the bathtub curve, since for this graph we assume that the design's breakdown occurs after 30 years. For each design configuration presented in the graph, there is a line showing the failing rate of switch parts over time. This line starts from year 1, since we assume that the first year of a part's lifetime is consumed during the accelerated testing (burn-in) procedure, and that shipped parts are already at their first year of lifetime with a constant failure rate.

From the graph in Figure 12(a), we can observe that when applying TMR at the component level, 25% of the shipped parts will be defective by the first year and 75% after the first three years. On the other hand, in a design configuration where automatic cluster decomposition is applied at the system level with 2 spares for each partition, 25% of the shipped parts will be defective after 16 years and 75% after 29 years. If we define the lifetime of a manufactured product as the period of time until 10% of the manufactured parts become defective, then the clustering design configuration S+CL_2SP_IR increases the switch's lifetime by 26X over the TMR design configuration C_TMR.

System designers can choose a defect tolerance technique that best matches their design's specifications. For
example, the design configuration S+CL_1SP_IR, where automatic cluster decomposition is applied at the system level with one dedicated spare for each partition, might be a more attractive solution: 10% of the parts become defective after 7 years, but at 48% less cost in area than design configuration S+CL_2SP_IR.

The same data as in Figure 12(a) is presented in Figure 12(b), with the difference that here we assume that the breakdown for the switch design starts 10 years after being shipped. For the first three design configurations, there is no difference, since by that time all of the parts have become defective. For the other two design configurations, it is interesting to observe that even after the breakdown point, where the failure rate increases exponentially, most of the parts will be able to provide the user a warning time window of a month before failure. This is a very important feature for a design configuration, especially for critical, high-dependability applications.

6. Conclusions and Future Directions

As silicon technologies continue to scale, transistor reliability is becoming an increasingly important issue. Devices are becoming subject to extreme process variation, transistor wearout, and manufacturing defects. As a result, it will likely no longer be possible to create fault-free designs. In this paper, we investigate the design of a defect-tolerant CMP network switch. To accomplish this design, we first develop a high-level, architect-friendly model of silicon failures based on the time-tested bathtub curve. Based on this model, we explore the design space of defect-tolerant CMP switch designs and the resulting tradeoff between defect tolerance and area overhead. We find that traditional mechanisms, such as triple modular redundancy and error correction codes, are insufficient for tolerating moderate numbers of defects. Rather, domain-specific techniques that include end-to-end error detection, resource sparing, and iterative diagnosis/reconfiguration are more effective. Further, decomposing the netlist of the switch into modest-sized clusters is the most effective granularity to apply the protection techniques.

This work provides a solid foundation for future exploration in the area of defect-tolerant design. We plan to investigate the use of spare components based on wearout profiles to provide more sparing for the most vulnerable components. Further, a CMP switch is only a first step towards the overarching goal of designing a defect-tolerant CMP system.

Acknowledgments: This work is supported by grants from NSF and Gigascale Systems Research Center. We would also like to acknowledge Li-Shiuan Peh for providing us access to CMP Switch models, and the anonymous reviewers for providing useful comments on this paper.

7. References

[1] H. Al-Asaad and J. P. Hayes. Logic design validation via simulation and automatic test pattern generation. J. Electron. Test., 16(6):575–589, 2000.
[2] C. J. Alpert and A. B. Kahng. Recent directions in netlist partitioning: a survey. Integr. VLSI J., 19(1-2):1–81, 1995.
[3] T. S. Barnett and A. D. Singh. Relating yield models to burn-in fall-out in time. In Proc. of International Test Conference (ITC), pages 77–84, 2003.
[4] E. Bohl, et al. The fail-stop controller AE11. In Proc. of International Test Conference (ITC), pages 567–577, 1997.
[5] S. Borkar, et al. Design and reliability challenges in nanometer technologies. In Proc. of the Design Automation Conf., 2004.
[6] F. A. Bower, et al. Tolerating hard faults in microprocessor array structures. In Proc. of International Conference on Dependable Systems and Networks (DSN), 2004.
[7] K. Constantinides, et al. Assessing SEU Vulnerability via Circuit-Level Timing Analysis. In Proc. of 1st Workshop on Architectural Reliability (WAR), 2005.
[8] J. E. D. E. Council. Failure mechanisms and models for semiconductor devices. JEDEC Publication JEP122-A, 2002.
[9] W. J. Dally, et al. The reliable router: A reliable and high-performance communication substrate for parallel computers. In Proc. International Workshop on Parallel Computer Routing and Communication (PCRCW), 1994.
[10] E. Wu, et al. Interplay of voltage and temperature acceleration of oxide breakdown for ultra-thin gate dioxides. Solid-state Electronics Journal, 2002.
[11] P. Gupta and A. B. Kahng. Manufacturing-aware physical design. In Proc. of International Conference on Computer-Aided Design (ICCAD), 2003.
[12] hMETIS. http://www.cs.umn.edu/~karypis.
[13] C. K. Hu, et al. Scaling effect on electromigration in on-chip Cu wiring. International Electron Devices Meeting, 1999.
[14] A. M. Ionescu, M. J. Declercq, S. Mahapatra, K. Banerjee, and J. Gautier. Few electron devices: towards hybrid CMOS-SET integrated circuits. In Proc. of the Design Automation Conference, pages 88–93, 2002.
[15] G. Karypis, et al. Multilevel hypergraph partitioning: Applications in VLSI domain. In Proc. of the Design Automation Conference, pages 526–529, 1997.
[16] S. Mukherjee, et al. The soft error problem: An architectural perspective. In Proc. of the International Symposium on High-Performance Computer Architecture, 2005.
[17] S. Mukherjee, et al. A systematic methodology to compute the architectural vulnerability factors for a high-performance microprocessor. In Proc. International Symposium on Microarchitecture (MICRO), pages 29–42, 2003.
[18] B. T. Murray and J. P. Hayes. Testing ICs: Getting to the core of the problem. IEEE Computer, 29(11):32–38, 1996.
[19] L.-S. Peh. Flow Control and Micro-Architectural Mechanisms for Extending the Performance of Interconnection Networks. PhD thesis, Stanford University, 2001.
[20] R. Rao, et al. Statistical estimation of leakage current considering inter- and intra-die process variation. In Proc. of the International Symposium on Low Power Electronics and Design (ISLPED), pages 84–89, 2003.
[21] N. R. Saxena and E. J. McCluskey. Dependable Adaptive Computing Systems. IEEE Systems, Man, and Cybernetics Conf., 1998.
[22] P. Shivakumar, et al. Exploiting microarchitectural redundancy for defect tolerance. In Proc. of International Conference on Computer Design (ICCD), 2003.
[23] D. P. Siewiorek, et al. Reliable computer systems: Design and evaluation, 3rd edition. AK Peters, Ltd Publisher, 1998.
[24] J. Smolens, et al. Fingerprinting: Bounding the soft-error detection latency and bandwidth. In Proc. of the Symposium on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2004.
[25] L. Spainhower and T. A. Gregg. G4: A fault-tolerant CMOS mainframe. In Proc. of International Symposium on Fault-Tolerant Computing (FTCS), 1998.
[26] J. Srinivasan, et al. The impact of technology scaling on lifetime reliability. In Proc. of International Conference on Dependable Systems and Networks (DSN), 2004.
[27] J. H. Stathis. Reliability limits for the gate insulator in CMOS technology. IBM Journal of Research and Development, 2002.
[28] S. B. K. Vrudhula, D. Blaauw, and S. Sirichotiyakul. Estimation of the likelihood of capacitive coupling noise. In Proc. of the Design Automation Conference, 2002.
[29] N. J. Wang, et al. Characterizing the effects of transient faults on a high-performance processor pipeline. In Proc. of International Conference on Dependable Systems and Networks (DSN), pages 61–70, 2004.
[30] C. Weaver and T. Austin. A fault tolerant approach to microprocessor design. In Proc. of International Conference on Dependable Systems and Networks (DSN), 2001.
[31] C. Weaver, et al. Techniques to reduce the soft error rate of a high-performance microprocessor. In Annual International Symposium on Computer Architecture, 2004.
[32] J. F. Ziegler. Terrestrial cosmic rays. IBM Journal of Research and Development, 40(1), 1996.