BulletProof: A Defect-Tolerant CMP Switch Architecture

Kypros Constantinides‡, Stephen Plaza‡, Jason Blome‡, Valeria Bertacco‡, Scott Mahlke‡, Todd Austin‡, Bin Zhang†, Michael Orshansky†

‡Advanced Computer Architecture Lab, University of Michigan, Ann Arbor, MI 48109
†Department of Electrical and Computer Engineering, University of Texas at Austin, Austin, TX 78712

{kypros, splaza, jblome, valeria, mahlke, austin}@umich.edu, [email protected], [email protected]

Abstract

As silicon technologies move into the nanometer regime, transistor reliability is expected to wane as devices become subject to extreme process variation, particle-induced transient errors, and transistor wear-out. Unless these challenges are addressed, computer vendors can expect low yields and short mean-times-to-failure. In this paper, we examine the challenges of designing complex computing systems in the presence of transient and permanent faults. We select one small aspect of a typical chip multiprocessor (CMP) system to study in detail: a single CMP router switch. To start, we develop a unified model of faults, based on the time-tested bathtub curve. Using this convenient abstraction, we analyze the reliability versus area tradeoff across a wide spectrum of CMP switch designs, ranging from unprotected designs to fully protected designs with online repair and recovery capabilities. Protection is considered at multiple levels, from the entire system down through arbitrary partitions of the design. To better understand the impact of these faults, we evaluate our CMP switch designs using circuit-level timing on detailed physical layouts. Our experimental results are quite illuminating. We find that designs are attainable that can tolerate a larger number of defects with less overhead than naïve triple-modular redundancy, using domain-specific techniques such as end-to-end error detection, resource sparing, automatic circuit decomposition, and iterative diagnosis and reconfiguration.

1. Introduction

A critical aspect of any computer design is its reliability. Users expect a system to operate without failure when asked to perform a task. In reality, it is impossible to build a completely reliable system; consequently, vendors target design failure rates that are imperceptibly small [23]. Moreover, a population of parts in the field must exhibit a failure rate that does not prove too costly to service. The reliability of a system can be expressed as the mean-time-to-failure (MTTF). Computing system reliability targets are typically expressed as failures-in-time, or FIT rates, where one FIT represents one failure in a billion hours of operation.

In many systems today, reliability targets are achieved by employing a fault-avoidance design strategy. The sources of possible computing failures are assessed, and the necessary margins and guards are placed into the design to ensure it will meet the intended level of reliability. For example, most transistor failures (e.g., gate-oxide breakdown) can be reduced by limiting voltage, temperature, and frequency [8]. While these approaches have served manufacturers well for many technology generations, many device experts agree that transistor reliability will begin to wane in the nanometer regime. As devices become subject to extreme process variation, particle-induced transient errors, and transistor wearout, it will likely no longer be possible to avoid these faults. Instead, computer designers will have to begin to directly address system reliability through fault-tolerant design techniques.

Figure 1 illustrates the fault-tolerant design space we focus on in this paper. The horizontal axis lists the types of device-level faults that systems might experience. The sources of failure are widespread, ranging from transient faults due to energetic particle strikes [32] and electrical noise [28], to permanent wearout faults caused by electromigration [13], stress-migration [8], and dielectric breakdown [10]. The vertical axis of Figure 1 lists design solutions to deal with faults, ranging from ignoring any possible faults (as is done in many systems today), to detecting and reporting faults, to detecting and correcting faults, and finally to fault correction with repair capabilities. The final two design solutions are the only ones that can address permanent faults, with the final solution being the only approach that maintains efficient operation after encountering a silicon defect.

Figure 1: Reliable System Design Space. The diagram shows a map of the types of device-level faults in a digital system (horizontal axis: manufacturing defects, wear-out defects, and transient errors) vs. protection techniques against these faults (vertical axis: no detection; detection; detection and correction; detection, correction, and repair), with representative techniques in each cell, ranging from mainstream solutions (e.g., DMR, ECC, memory-array spares) to research-stage solutions (e.g., Diva, Razor, BulletProof). This work addresses the problems/solutions in the dark shaded area of the map.

In recent years, industry designers and academics have paid significant attention to building resistance to transient faults into their designs. A number of recent publications have suggested that transient faults, due to energetic particles in particular, will grow in future technologies [5, 16]. A variety of techniques have emerged to provide the capability to detect and correct these types of faults in storage, including parity and error correction codes (ECC) [23], and in logic, including dual- or triple-modular spatial redundancy [23], time-redundant computation [24], and checkers [30]. Additional work has focused on the extent to which circuit timing, logic, architecture, and software are able to mask out the effects of transient faults, a process referred to as "derating" a design [7, 17, 29].

In contrast, little attention has been paid to incorporating design tolerance for permanent faults, such as silicon defects and transistor wearout. The typical approach used today is to reduce the likelihood of encountering silicon faults through post-manufacturing burn-in, a process that accelerates aging by subjecting devices to elevated temperature and voltage [10]. The burn-in process accelerates the failure of weak transistors, ensuring that, after burn-in, devices still working are composed of robust transistors. Additionally, many computer vendors provide the ability to repair faulty memory and cache cells via the inclusion of spare storage cells [25]. Recently, academics have begun to extend these techniques to support sparing for additional on-chip memory resources such as branch predictors [6] and registers [22].

1.1 Contributions of This Paper

In this paper, we push forward the understanding of reliable microarchitecture design by performing a comprehensive design study of the effects of permanent faults on a chip-multiprocessor switch design. The goal is to better understand the nature of faults, and to build into our designs a cost-effective means to tolerate them. Specifically, we make the following contributions:

• We develop a high-level, architect-friendly model of silicon failures, based on the time-tested bathtub curve. The bathtub curve models the early-life failures of devices during burn-in, the infrequent failure of devices during the part's lifetime, and the breakdown of devices at the end of their normal operating lifetime. From this bathtub-curve model, we define the design space of interest, and we fit previously published device-level reliability data to the model.

• We introduce a low-cost chip-multiprocessor (CMP) switch router architecture that incorporates system-level checking and recovery, component-level fault diagnosis, and spare-part reconfiguration. Our design, called BulletProof, is capable of tolerating silicon defects, transient faults, and transistor wearout. We evaluate a variety of BulletProof switch designs, and compare them to designs that utilize traditional fault tolerance techniques, such as ECC and triple-modular redundancy. We find that our domain-specific fault-tolerance techniques are significantly more robust and less costly than traditional generic fault tolerance techniques.

The remainder of this paper is organized as follows. Section 2 gives additional background on the faults of interest in this study and introduces our architect-friendly fault model based on the bathtub curve. Section 3 presents our fault simulation infrastructure and examines the exposure of the baseline design to permanent faults. Section 4 introduces the techniques we have employed in our CMP switch designs to provide cost-effective tolerance of transient and permanent faults. In Section 5, we present a detailed trade-off analysis of the resilience and cost of our CMP switch designs, plus a comparison to traditional fault-tolerant techniques, such as ECC and triple-modular redundancy (TMR). Finally, Section 6 gives conclusions and suggestions for future research directions.

2. An Analysis of the Fault Landscape

As silicon technologies progress into the 65nm regime and below, a number of failure factors rise in importance. In this section, we highlight these failure mechanisms and discuss the relevant trends for future process technologies.

Single-Event Upset (SEU). There is growing concern about providing protection from soft errors caused by energetic particles (such as neutrons and alpha particles) that strike the bulk silicon portion of a die [32]. The effect of an SEU is a logic glitch that can potentially corrupt combinational logic computation or state bits. While a variety of studies have demonstrated the unlikeliness of such events [31, 29], concerns remain in the architecture and circuit communities. This concern is fueled by the trends of reduced supply voltage and increased transistor budgets, both of which exacerbate a design's vulnerability to SEU.

Process Variation. Another reliability challenge designers face is the design uncertainty created by increasing process variations. Process variations result from the device dimension and doping concentration variations that occur during silicon fabrication. These variations are of particular concern because their effects on devices are amplified as device dimensions shrink [20], resulting in structurally weak and poorly performing devices. Designers are forced to deal with these variations by assuming worst-case device characteristics (usually, a 3-sigma variation from typical conditions), which leads to overly conservative designs.

Manufacturing Defects. Deep sub-micron technologies are increasingly vulnerable to several fabrication-related failure mechanisms. For example, step coverage problems that occur during the metalization process may cause open circuits. Post-manufacturing test [18] and built-in self-test (BIST) [1] are two techniques to impress test vectors onto circuits in order to identify manufacturing defects. A more global approach to testing for defects is taken by IDDQ testing, which uses on-board current monitoring to detect short circuits in the manufactured part. During IDDQ testing, any abnormally high current spikes found during functional testing are indicative of short-circuit defects [4].

Gate Oxide Wearout. Technology scaling has adverse effects on the lifetime of transistor devices, due to time-dependent wearout. There are three major failure modes for time-dependent wearout: electromigration, hot carrier degradation (HCD), and time-dependent oxide breakdown. Electromigration results from the mass transport of metal atoms in chip interconnects. The trend toward higher current density in future technologies increases the severity of electromigration, leading to a higher probability of observing open- and short-circuit nodes over time [11]. HCD is the result of carriers being heated by strong electric fields and subsequently being injected into the gate oxide. The trapped carriers cause the threshold voltage to shift, eventually leading to device failure. HCD is predicted to worsen for thinner oxides and shorter channel lengths [14]. Time-dependent oxide breakdown is due to the extensive use of ultra-thin oxides for high performance. The rate of defect generation in the oxide is proportional to the current density flowing through it, and is therefore increasing drastically as a result of relentless down-scaling [27].

Transistor Infant Mortality. Scaling has also had adverse effects on the early failures of transistor devices. Traditionally, early transistor failures have been reduced through the use of burn-in. The burn-in process utilizes high voltage and temperature to accelerate the failure of weak devices, thereby ensuring that parts that survive burn-in only possess robust transistors. Unfortunately, burn-in is becoming less effective in the nanometer regime, as deep sub-micron devices are subject to thermal run-away effects, where increased temperature leads to increased leakage current, and increased leakage current leads to yet higher temperatures. The end result is that aggressive burn-in will destroy even robust transistors. Consequently, vendors may soon have to relax the burn-in process, which will ultimately lead to more early failures for transistors in the field.

Figure 2: Simple bathtub curve model of device defect exposure. The curve plots failure rate (FIT) over time, divided into an infant period (ending at tA, shortened by burn-in), a grace period, and a breakdown period (starting at tB), and indicates the qualitative trend of failure rates for a silicon part over time. The initial operational phase and the "aged-silicon" phase are characterized by much higher failure rates. Model parameters: FG: grace period wear-out rate; λL: average latent manufacturing defects; m: maturing rate; b: breakdown rate; tB: breakdown start point.

2.1 The Bathtub: A Generic Model for Semiconductor Hard Failures

To derive a simple architect-friendly model of failures, we step back and return to the basics. In the semiconductor industry, it is widely accepted that the failure rate of many systems follows what is known as the bathtub curve, as illustrated in Figure 2. We adopt this time-tested failure model for our research. Our goal with the bathtub-curve model is not to predict its exact shape and magnitude for the future (although we will fit published data to it to create "design scenarios"), but rather to utilize the bathtub curve to illuminate the potential design space for future fault-tolerant designs. The bathtub curve represents device failure rates over the entire lifetime of transistors, and it is characterized by three distinct regions.

• Infant Period: In this phase, failures occur very soon, and thus the failure rate declines rapidly over time. These infant mortality failures are caused by latent manufacturing defects that surface quickly if a temperature or voltage stress is applied.

• Grace Period: When early failures are eliminated, the failure rate falls to a small constant value, where failures occur sporadically due to the occasional breakdown of weak transistors or interconnect.

• Breakdown Period: During this period, failures occur with increasing frequency over time due to age-related wearout. Many devices will enter this period at roughly the same time, creating an avalanche effect and a quick rise in device failure rates.

With respect to Figure 2, the model is represented by the following equations (t is measured in hours):

F(t) = FG + (λL · 10^9 / t) · (1 − (t + 1)^(−m)),  if 0 ≤ t < tA
F(t) = FG,                                         if tA ≤ t < tB
F(t) = FG + (t − tB)^b,                            if tB ≤ t

where the parameters of the model are as follows:

• λL: average number of latent manufacturing defects per chip
• m: infant period maturing factor
• FG: grace period failure rate
• tB: breakdown period start point
• b: breakdown factor

The maturing factor during the infant mortality period and the breakdown factor during the breakdown period used are m = 0.02 and b = 2.5, respectively.

In an effort to base our experiments on published empirical fault data, we developed a baseline bathtub model based on published literature. Unfortunately, we were unable to locate a single technology failure model that fully captured the lifetime of a silicon device, so for each period of the bathtub curve we use reference values from different sources.

Latent Manufacturing Defects per Chip (λL): Previous work [3] showed that the rate of latent manufacturing defects is determined by the formula λL = γλK, where λK is the average number of "killer" defects per chip, and γ is an empirically estimated parameter with typical values between 0.01 and 0.02. The same work provides formulas for deriving the maximum number of latent manufacturing defects that may surface during burn-in test. Based on these models, the average number of latent manufacturing defects per chip (140mm²) for current technologies (λL) is approximately 0.005. In the literature, there are no clear trends for how this value changes with technology scaling, so we use the same rate for projections of future technologies.

Grace Period Failure Rate (FG): For the grace period failure rate, we use reference data from [26], where a microarchitecture-level model was used to estimate workload-dependent processor hard failure rates at different technologies. The model supports four main intrinsic failure mechanisms experienced by processors: electromigration, stress migration, time-dependent dielectric breakdown, and thermal cycling. For a predicted post-65nm fabrication technology, we adopt their worst-case failure rate (FG) of 55,000 FITs.

Breakdown Period Start Point (tB): Previous work [27] estimates the time to dielectric breakdown by extrapolating from the measurement conditions (under stress) to normal operating conditions. We estimate the breakdown period start point (tB) to be approximately 12 years for 65nm CMOS at a 1.0V supply voltage. We were unable to find any predictions as to how this value will trend for fabrication technologies beyond 65nm, but we conservatively assume that the breakdown period will be held to periods beyond the expected lifetime of the product. Thus, we need not address reliable operation in this period, other than to provide a limited amount of resilience to breakdown for the purpose of allowing the part to remain in operation until it can be replaced.
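To make the fitted model concrete, the piecewise failure-rate function and the FIT-to-MTTF conversion can be sketched in Python. This is an illustrative helper, not part of the paper's infrastructure; in particular, the infant-period end tA is an assumed value, since the text does not fix it numerically.

```python
# Sketch of the piecewise bathtub failure-rate model F(t), t in hours,
# with the parameters fitted in the text. T_A is an assumption.

F_G = 55_000        # grace-period failure rate (FIT), worst case from [26]
LAMBDA_L = 0.005    # avg. latent manufacturing defects per 140mm^2 chip [3]
M = 0.02            # infant-period maturing factor
B = 2.5             # breakdown factor
HOURS_PER_YEAR = 24 * 365
T_A = 1 * HOURS_PER_YEAR     # assumed end of the infant period
T_B = 12 * HOURS_PER_YEAR    # breakdown starts at ~12 years for 65nm

def failure_rate(t):
    """Failure rate in FITs at device age t (hours)."""
    if t < T_A:
        # Infant period; guard t=0, where the 1/t term is singular.
        return F_G + (LAMBDA_L * 1e9 / max(t, 1)) * (1 - (t + 1) ** -M)
    if t < T_B:
        return F_G                  # grace period: constant rate
    return F_G + (t - T_B) ** B     # breakdown period: rapid rise

def mttf_years(fit):
    """Convert a FIT rate (failures per 10^9 device-hours) to MTTF in years."""
    return 1e9 / fit / HOURS_PER_YEAR
```

During the grace period, `mttf_years(55_000)` evaluates to roughly 2.1 years, consistent with the ~2-year MTTF used for the baseline switch in Section 3.2.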

3. A Fault Impact Evaluation Infrastructure

In [7], we introduced a high-fidelity simulation infrastructure for quantifying various derating effects on a design's overall soft-error rate. This infrastructure takes into account circuit-level phenomena, such as timing-related, logic-related and microarchitecture-related fault masking. Since many tolerance techniques for permanent errors can be adapted to also provide soft-error tolerance, the remainder of this work concentrates on the exploration of various defect tolerance techniques.

3.1 Simulation Methodology for Permanent Faults

In this section, we present our simulation infrastructure for evaluating the impact of silicon defects. We create a model for permanent fault parameters, and develop the infrastructure necessary to evaluate the effects on a given design.

Figure 3: Simulation infrastructure for permanent faults. The defect infrastructure uses two models of the system, simulated in parallel: a defect-exposed model, driven by the defect model (defect time and location) and a full-coverage function test, and a golden model with no defect injected. Defects are uniformly distributed in time and space, and the input stimuli form a full-coverage test that activates each internal circuit node of the system. A fault analyzer classifies defects as exposed, protected, or unprotected but masked based on the system response, within a Monte Carlo simulation loop (1000 runs).

Figure 3 shows our simulation framework for evaluating the impact of silicon defects on a digital design. The framework consists of an event-driven simulator that simulates two copies of the structural, gate-level description of the design in parallel. Of these two designs, one copy is kept intact (the golden model), while the other is subject to fault injection (the defect-exposed model). The structural specification of our design was synthesized from a Verilog description using the Synopsys Design Compiler.

Our silicon defect model distributes defects in the design uniformly in time of occurrence and spatial location. Once a permanent failure occurs, the design may or may not continue to function, depending on the circuit's internal structure and the system architecture. The defect analyzer classifies each defect as exposed, protected, or unprotected but masked. In the context of defect evaluation, faults accumulate over time until the design fails to operate correctly. The defect that brings the system to failure is the last injected defect in each experiment, and it is classified as exposed. A defect may be protected if, for instance, it is the first one to occur in a triple-module-redundant design. An unprotected but masked defect is one that is masked because it occurs in a portion of the design that has already failed, for example a defect hitting an already failed module of a TMR design.

In the context of defects, we are concerned with studying the potential of a defect to impact the design outputs in any possible future execution. Thus, the input stimuli form a full-coverage test, crafted to excite all internal nodes of the design while observing the outputs. If any of the stimuli impact the output correctness, the implication is that there is at least one execution that can expose the defect, and thus such a defect is considered exposed.

Finally, to gain statistical confidence in the results, we run the simulations described above many times in a Monte Carlo modeling framework.

3.2 Reliability of the Baseline CMP Switch Design

The first experiment is an evaluation of the reliability of the baseline CMP switch design. In Figure 4, we used the bathtub curve fitted for the post-65nm technology node as derived in Section 2. The FIT rate of this curve is 55,000 during the grace period, which corresponds to a mean time to failure (MTTF) of 2 years. We used this failure rate in our simulation framework for permanent failures and plotted the results.

Figure 4: Baseline design reliability. The graph superimposes the FIT rates of the bathtub model with the fault tolerance (percentage of defective parts over six years) of two variants of the CMP switch design: a baseline unprotected version and a variant with a traditional system-level TMR technique.

The baseline CMP design does not deploy any protection technique against defects, and one defect is sufficient to bring down the system. Consequently, the graph of Figure 4 shows that in a large parts population, 50% of the parts will be defective by the end of the second year after shipment, and by the fourth year almost all parts will have failed. In this experiment, we have also analyzed a design variant which deploys triple-modular redundancy (TMR) at the full-system level (i.e., three CMP switches with voting gates at their outputs; designs with TMR applied at different granularities are evaluated in Section 5) to provide better defect tolerance.

The TMR model used in this analysis is the classical TMR model, which assumes that when a module fails it starts producing incorrect outputs, and that if two or more modules fail, the output of the TMR voter will be incorrect. This model is conservative in its reliability analysis because it does not take into account compensating faults. For example, if two faults affect two independent output bits, then the voter circuit should be able to correctly mask both faults. However, the benefit gained from accounting for compensating faults rapidly diminishes with a moderate number of defects, because the probabilities of fault independence are multiplied.
Further, though the switch itself demonstrated a moderate number of independent fault sites, submodules within the design tended to exhibit very little independence. Also, in [21], it is demonstrated that even when TMR is applied to diversified designs (i.e., three modules with the same functionality but different implementations), the probability of independence is small. Therefore, in our reliability analysis, we choose to implement the classical TMR model, and for the rest of the paper, whenever TMR is applied, the classical TMR model is assumed.
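The classical TMR failure condition can be illustrated with a small Monte Carlo sketch in the spirit of the methodology above (hypothetical code, not the paper's simulator): defects are placed uniformly across the three replicated modules, the unprotected design dies at the first defect, and the TMR design dies as soon as two distinct modules have been hit.

```python
import random

def survives_tmr(defect_modules):
    """Classical TMR: the voter masks faults as long as at most one
    of the three modules has been hit by any defect."""
    return len(set(defect_modules)) <= 1

def defects_to_failure_tmr(rng):
    """Count uniformly placed defects absorbed by the TMR design
    before a second distinct module is corrupted."""
    hit = set()
    count = 0
    while True:
        hit.add(rng.randrange(3))   # uniform spatial placement over 3 modules
        count += 1
        if len(hit) >= 2:           # two corrupted modules -> voter output wrong
            return count

rng = random.Random(0)
trials = 10_000
avg = sum(defects_to_failure_tmr(rng) for _ in range(trials)) / trials
# The unprotected design fails at the very first defect; classical TMR
# only buys the defects that land before a second module is hit.
print(f"unprotected fails after 1 defect; TMR after ~{avg:.2f} on average")
```

Under this model, system-level TMR tolerates only about 2.5 uniformly placed defects on average before failing, and since the triplicated design has more than 3x the area (and hence a proportionally higher defect arrival rate), the net reliability gain is modest, consistent with Figure 4.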

From Figure 4, the simulation-based analysis finds that Figure 5: Baseline CMP switch design. A high
TMR provides very little reliability improvements over the level block diagram for a wormhole interconnection
baseline designs, due to the few number of defects that can switch is presented. It consists of 5 input controllers,
be tolerated by system-level TMR. Furthermore, the area a cross-bar, a switch arbiter and a cross-bar con-
of the TMR protected design is more than three times the troller.
area of the baseline design. The increase in area raises the level netlist, which consists of approximately 10k gates. This
probability of a defect being manifested in the design, which
router design consists of five input controllers which dom-
significantly affects the design’s reliability. In the rest of the
inate the design’s area (86%). Also, the design is heavily
paper, we propose and evaluate defect-tolerant techniques
dominated by combinational logic, which represents 84% of
that are significantly more robust and less costly than tra-
the total area, making it critical to choose protection tech-
ditional defect-tolerant techniques.
niques that can tolerate errors in logic effectively.
4. Self-repairing CMP Switch Design 4.2 Protection Mechanisms
The goal of the Bulletproof project is to design a defect- A design that is tolerant to permanent defects must pro-
tolerant chip-multiprocessor capable of tolerating significant vide mechanisms that perform four central actions related
levels of various types of defects. In this work, we address to faults: detection, diagnosis, repair, and recovery. Fault
the design of one aspect of the system, a defect tolerant detection identifies that a defect has manifested as an error
CMP switch. The CMP switch is much less complex than a in some signal. Normal operation cannot continue after fault
modern microprocessor, enabling us to understand the en- detection as the hardware is not operating properly. Often
tire design and explore a large solution space. Further, this fault detection occurs at a macro-level, thus it is followed
switch design contains many representative components of by a diagnosis process to identify the specific location of the
larger designs including finite state machines, buffers, con- defect. Following diagnosis, the faulty portion of the de-
trol logic, and buses. sign must be repaired to enable proper system functionality.
Repair can be handled in many ways, including disabling,
4.1 Baseline Design ignoring, or replacing the faulty component. Finally, the
The baseline design, consists of a CMP switch similar system must recover from the fault, purging any incorrect
to the one described in [19]. This CMP switch provides wormhole routing pipelined at the flit level and implements credit-based flow control functionality for a two-dimensional torus network. In the switch pipeline, head flits proceed through the routing and virtual channel allocation stages, while all flits proceed through the switch allocation and switch traversal stages. A high-level block diagram of the router architecture is depicted in Figure 5.

The implemented router is composed of four functional modules: the input controller, the switch arbiter, the crossbar controller, and the crossbar. The input controller is responsible for selecting the appropriate output virtual channel for each packet, maintaining virtual channel state information, and buffering flits as they arrive and await virtual channel allocation. Each input controller is enhanced with an 8-entry 32-bit buffer. The switch arbiter allocates virtual channels to the input controllers, using a priority matrix to ensure that starvation does not occur. The switch arbiter also implements flow control by tallying the credit information used to determine the amount of available buffer space at downstream nodes. The crossbar controller is responsible for determining and setting the appropriate control signals so that allocated flits can pass from the input controllers to the appropriate output virtual channels through the interconnect provided by the crossbar. The router design is specified in Verilog and was synthesized using the Synopsys Design Compiler to create a gate-level netlist.

data and recomputing corrupted values. Recovery essentially makes the defect's manifestation transparent to the application's execution. In this section, we discuss a range of techniques that can be applied to the baseline switch to make it tolerant of permanent defects. The techniques differ in their approach and the level at which they are applied to the design.

In [9] the authors present the Reliable Router (RR), a switching element design for improved performance and reliability within a mesh interconnect. The design relies on an adaptive routing algorithm coupled with a link-level retransmission protocol in order to maintain service in the presence of a single node or link failure within the network. Our design differs from the RR in that our target domain involves a much higher fault rate and focuses on maintaining switch service in the face of faults rather than simply routing around faulty nodes or links. However, the two techniques can be combined to provide a higher-reliability multiprocessor interconnection network.

4.2.1 General Techniques
The most commonly used protection mechanisms are dual and triple modular redundancy, or DMR and TMR [23]. These techniques employ spatial redundancy combined with a majority voter. With permanent faults, DMR provides only fault detection; hence, a single fault in either of the redundant components will bring the system down. TMR
is more effective as it provides solutions to detection, recovery, and repair. In TMR, the majority voter identifies a malfunctioning hardware component and masks its effects on the primary outputs. Hence, repair is trivial, since the defective component is simply outvoted when it computes an incorrect value. Due to this restriction, TMR is inherently limited to tolerating a single permanent fault: faults that manifest in either of the other two copies cannot be handled. DMR/TMR are applicable to both state and logic elements and thus are broadly applicable to our baseline switch design.

Storage or state elements are often protected by parity or error correction codes (ECC) [23]. ECC provides a lower-overhead solution for state elements than TMR. Like TMR, ECC provides a unified solution to detection and recovery. Repair is again trivial, as the parity computation masks the effects of permanent faults. In addition to the storage overhead of the actual parity bits, the computation of parity or ECC bits generally requires a tree of exclusive-ORs. This hardware has moderate overhead, but more importantly, it can often operate in parallel, thus not affecting latency. For our defect-tolerant switch, the application of ECC is limited due to the small fraction of area that holds state.

Figure 6: End-to-End error detection and recovery mechanism. In part (a) the interconnection switch is enhanced with Cyclic Redundancy Checkers (CRC) and recovery logic to provide detection of data-corrupting errors. The input buffers are enhanced with an extra recovery head pointer to mark the last correctly checked flit. In part (b) a more detailed view of the switch with End-to-End error detection is shown. Flits are split into two parts, which are independently routed through the switch pipeline. (Legend: a: correctly routed flit; b, c: in the switch pipeline; d: next flit to be routed; e: last flit buffered.)

4.2.2 Domain-specific Techniques
The properties of the wormhole router can be exploited to create domain-specific protection mechanisms. Here, we focus on one efficient design that employs end-to-end error detection, resource sparing, system diagnosis, and reconfiguration.

End-to-End Error Detection and Recovery Mechanism. Within our router design, errors can be separated into two major classes. The first class is comprised of data-corrupting errors, for example a defect that alters the data of a routed flit so that the routed flit is permanently corrupted. The second class is comprised of errors that cause functional incorrectness, for example a defect that causes a flit to be misrouted to a wrong output channel, or to get lost and never reach any of the switch's output channels.

The first class of errors, the data-corrupting errors, can be addressed by adding Cyclic Redundancy Checkers (CRC) at each one of the switch's five output channels, as shown in Figure 6(a). When an error is detected by a CRC checker, all CRC checkers are notified about the error detection and block any further flit routing. The same error detection signal used to notify the CRC checkers also notifies the switch's recovery logic, which logs the error occurrence by incrementing an error counter. In case the error counter surpasses a predefined threshold, the recovery logic signals the need for system diagnosis and reconfiguration.

In case the error counter is still below the predefined threshold, the switch recovers its operation from the last "checkpointed" state by squashing all in-flight flits and rerouting the corrupted flit and all following flits. This is accomplished by maintaining an extra recovery head pointer at the input buffers. As shown in Figure 6(a), each input buffer maintains an extra head pointer which indicates the last flit stored in the buffer that has not yet been checked by a CRC checker. The recovery head pointer is automatically incremented four cycles after the associated input controller grants access to the requested output channel, which is the latency needed to route the flit through the switch once access to the destination channel is granted. In case of a switch recovery, the recovery head pointer is assigned to the head pointer for all five input buffers, and the switch recovers operation by rerouting the flits pointed to by the head pointers. Further, the switch's credit backflow mechanism needs to be adjusted accordingly, since an input buffer is now considered full when the tail pointer reaches the recovery head pointer. In order for the switch's recovery logic to be able to distinguish soft from hard errors, the error counter is reset to zero at regular intervals.

The detection of errors causing functional incorrectness is considerably more complicated because of the need to detect misrouted and lost flits. A critical issue for the recovery of the system is to assure that there is at least one uncorrupted copy of each flit in flight in the switch's pipeline. This uncorrupted flit can then be used during recovery. To accomplish this, we add a Buffer Checker unit to each input buffer. As shown in Figure 6(b), the Buffer Checker unit compares the CRC-checked incoming flit with the last flit allocated into the input buffers (tail flit). Further, to guarantee the input buffer's correct functionality, the Buffer Checker also maintains a copy of the head and tail pointers, which are compared with the input buffer's pointers whenever a new flit is allocated. In the case that the comparison fails, the Buffer Checker signals an allocation retry, to cover the case of a soft error. If the error persists, this means that there is a potential permanent error
in the design, and it signals the system diagnosis and reconfiguration procedures. By assuring that a correct copy of the flit is allocated into the input buffers and that the input buffer's head/tail pointers are maintained correctly, we guarantee that each flit entering the switch will correctly reach the head of the queue and be routed through the switch's pipeline.

To guarantee that a flit will get routed to the correct output channel, the flit is split into two parts, as shown in Figure 6(b). Each part gets its output channel requests from a different routing logic block, and accesses the requested output channel through a different switch arbiter. Finally, each part is routed through the cross-bar independently. To accomplish this, we add an extra routing logic unit and an extra switch arbiter. The status bits in the input controllers that store the output channel reserved by the head flit are duplicated as well. Since the cross-bar routes the flits at the bit level, the only difference is that the responses to the cross-bar controller from the switch arbiter will not be the same for all the flit bits; rather, the responses for the first and second parts of the flit come from the first and second switch arbiters, respectively. If a defect causes a flit to be misrouted, it follows that a single defect can impact only one of the two parts of the flit, and the error will be caught later at the CRC check.

The area overhead of the proposed error detection and recovery mechanism is limited to only 10% of the switch's area. The area overhead of the CRC checkers, the Recovery Logic, and the Buffer Checker units is almost negligible. More specifically, the area of a single CRC checker is 0.1% of the switch's area, and the area of the Buffer Checker and the Recovery Logic is even less significant. The area overhead of the proposed mechanism is dominated by the extra Switch Arbiter (5.7%), the extra Routing Logic units (5x0.5% = 2.5%), and the additional CRC bits (1.5%). As we can see, the proposed error detection and recovery mechanism has 10X less area overhead than a naïve DMR implementation.

Resource Sparing. To provide defect tolerance for the switch design, we use resource sparing for selected partitions of the switch. During switch operation, only one spare is active for each distinct partition of the switch. For each spare added to the design, there is additional overhead for the interconnection and the logic required to enable and disable the spare. For resource sparing, we study two different techniques: dedicated sparing and shared sparing. In the dedicated sparing technique, each spare is owned by a single partition and can be used only when that specific partition fails. When shared sparing is applied, one spare can be used to replace any one of a set of partitions. For the shared sparing technique to be applied, it requires multiple identical partitions, such as the input controllers of the switch design. Furthermore, each shared spare requires additional interconnect and logic overhead because of its need to be able to replace more than one possible defective partition.

System Diagnosis and Reconfiguration. As a system diagnosis mechanism, we propose an iterative trial-and-error method which recovers to the last correct state of the switch, reconfigures the system, and replays the execution until no error is detected. The general concept is to iterate through each spared partition of the switch and swap in the spare for the current copy. For each swap, the error detection and recovery mechanism performs a system replay. Eventually, the partition that happens to possess the current error will be disabled and its corresponding spare enabled. When this occurs, the system diagnosis mechanism will detect correct system behavior and terminate the replay mode. Using this approach, the faulty piece of logic is identified and correctly disabled.

In order for the system diagnosis to operate, it maintains a set of bit vectors as follows:
• Configuration Vector: Indicates which spare partitions are enabled.
• Reconfiguration Vector: Keeps track of which configurations have been tried and indicates the next configuration to be tried. It gets updated at each iteration of system diagnosis.
• Defects Vector: Keeps track of which spare partitions are defective.
• Trial Vector: Indicates which spare partitions are enabled for a specific system diagnosis iteration.

Figure 7: Example of system diagnosis and reconfiguration. This example shows a system with four partitions (A-D) and one spare for each partition. The first spare of partition B contains a previously detected and corrected defect, thus the latest error in the execution is caused by the defect in the first spare of partition D. Trial = Defects + Reconfiguration; Defects = 0010 throughout:
Attempt 1: Configuration 0010, Reconfiguration 0001, Trial 0011 — Error Detected
Attempt 2: Configuration 0010, Reconfiguration 0010, Trial 0010 — Iteration Skipped
Attempt 3: Configuration 0010, Reconfiguration 0100, Trial 0110 — Error Detected
Attempt 4: Configuration 0010, Reconfiguration 1000, Trial 1010 — Correct Execution
New Configuration = Trial = 1010; New Defects = Defects + (Configuration ⊕ Trial) = 1010.

Figure 7 demonstrates an example where system diagnosis is applied to a system with four partitions and two copies (one spare) of each partition. The first copy of partition B has a detected defect (mapped in the Defects Vector). The defect in the first copy of partition D is a recently manifested defect and is the one that caused erroneous execution. Once the error is detected, the system recovers to the last correct state using the mechanism described in the previous section (see error detection and recovery mechanism), and it initializes the Reconfiguration Vector. Next, the Trial Vector is computed using the Configuration, Reconfiguration, and Defects vectors. In case the Trial Vector is the same as the Configuration Vector (attempt 2), due to a defective spare, the iteration is skipped. Otherwise, the Trial Vector is used as the current Configuration Vector, indicating which spare partitions will be enabled for the current trial. The execution is then replayed from the recovery point until the error detection point. In case the error is detected, a new trial is initiated by updating the Reconfiguration Vector and recomputing the Trial Vector. In case no error is detected, meaning that the trial configuration is a working configuration, the Trial Vector is copied to the Configuration Vector and the Defects Vector is updated with the located defective copy. If all the trial configurations are exhausted, which are equal in number to the number of partitions, and no working configuration was found, then the defect was a fatal defect and the system won't be able to recover. The example implementation of the system diagnosis mechanism demonstrated in Figure 7 can be adapted accordingly for designs with more partitions and more spares.

We also consider the Built-In-Self-Test (BIST) technique as an alternative for providing system diagnosis. For each distinct partition in the design, we store automatically generated test vectors in ROM. During system diagnosis with BIST, these test vectors are applied to each partition of the system through scan chains to check its functional correctness and locate the defective partition.

Both the iterative replay and BIST techniques can be implemented as a module separate from the switch, and the area overhead of their implementation can be shared by a wide number of switches in a possible chip multiprocessor design.

Figure 8: The process of automatic cluster decomposition. In part (a) a sample netlist is shown with 2 primary outputs, along with its corresponding hypergraph in part (b). Part (c) shows the hypergraph after a min-cut bisection creating two unbalanced partitions. Part (d) shows the final 3-way partition resulting from a bisection of the largest partition. (The figure annotates each stage with #partitions: 1/2/3, #part. outputs: 2/3/5, and #cut edges: 1 and 3 for the two bisections.)
4.2.3 Level of Protection
The error resiliency achieved by implementing one of the protection techniques (e.g., TMR or sparing) is highly dependent on the granularity of the partitions. In general, the larger the granularity of the partitions, the less robust the design. However, as the granularity of the partitions becomes smaller, more logic is required. For TMR, each output of a given partition requires a MAJORITY gate. Since each added MAJORITY gate is unshielded from permanent defects, poorly constructed small partitions can make a design less error resilient than designs with larger partitions.

To illustrate these trade-offs, consider the baseline switch again in Figure 5. Sparing and TMR can be done at the system level, where the whole switch is replicated and each output requires some extra logic such as a MUX or MAJORITY gate. A single permanent error makes one copy of the switch completely broken; however, the area overhead beyond the spares is limited to only a gate for each primary output. A slightly more resilient design considers partitioning based on the components that make up the switch. For instance, each of the five input controllers can have a spare, along with the arbiter, the cross-bar, and the cross-bar controller. This partitioning approach leaves the design better protected, as a permanent defect in an input controller would break only that small partition and not the other four input controllers. There is a small area penalty for this approach, as the sum of the outputs over all partitions is greater than for the switch as a whole; however, the added unprotected logic is still insufficient to worsen the error resiliency of the design. Finally, consider partitioning at the gate level, where each gate is in its own partition. In this scheme, the error resiliency of each partition is extremely high because the target is very small. However, this approach requires an extra gate for each gate in the switch design; thus for TMR, the area would be four times the original design. In addition, because each added gate is unprotected, the susceptibility of this design to errors is actually greater than with the larger partitions used in the component-based partitioning.

The previous analysis shows that the level of partitioning affects both the error resiliency and the area overhead of the design. In this paper, we introduce a technique, called Automatic Cluster Decomposition, that generates partitions that minimize area overhead while maximizing error resiliency.

4.2.4 Automatic Cluster Decomposition
Automatic Cluster Decomposition takes a netlist and creates partitions with the end goal that each partition is approximately the same size and that a minimal number of outputs is required for each partition generated. Generating these partitions requires that the netlist be converted into a graph that can then be partitioned using a balanced recursive min-cut algorithm [15, 12] that has found use in fields like VLSI [2].

Figure 8 shows how these partitions are generated from the netlist of a design. First, the netlist pictured in part (a) is used to generate the hypergraph shown in part (b). A hypergraph is an extension of a normal graph where one edge can connect multiple vertices. In the figure, each vertex represents a separate net in the design. A hyperedge is drawn around each net and its corresponding fanout. If a net is placed in a different partition than one of its fanouts, that net becomes an output of its partition, thus increasing the overhead of the partition. Thus the goal of the partitioning algorithm is to minimize the number of hyperedges that are cut. For this example, we show a 3-way partitioning of the circuit. The algorithm in [12] performs a recursive min-cut operation where the original circuit is bisected and then one of the resulting partitions is bisected again. In Figure 8(c), the hypergraph is bisected and the number of hyperedges cut is reported. Notice that one of those pieces is twice the size of the other. Because a 3-way partitioning is desired, one piece is slightly larger so that the final partitions are fairly balanced, with each partition having about the same number of vertices/nets. Figure 8(d) shows the final partitioning assignment of the hypergraph, along with the number of hyperedges cut, which corresponds to the total number of outputs over all the partitions, not including the original outputs of the system. In practice, several iterations of [12] are run, as the algorithm is heuristic and requires many runs to find optimal partitions. Also, some imbalance in partition sizes is tolerated if the number of hyperedges cut is significantly smaller as a result.

Once the hypergraph is partitioned, several different replication strategies can be used, such as sparing and TMR. In Figure 9, an example of performing 3-way partitioning over an arbitrary piece of logic using one spare per partition is shown. In part (a), sparing is performed at the system level.
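The partitioning cost function above can be made concrete with a small sketch. The netlist, names, and exhaustive search below are illustrative stand-ins (the paper uses the recursive min-cut heuristic of [12, 15], not brute force); the sketch only shows how cut hyperedges correspond to extra partition outputs, as in Figure 8.

```python
def cut_hyperedges(fanout, assignment):
    """Count hyperedges cut by a partition assignment.

    fanout:     dict mapping each net to the nets it drives; a net plus
                its fanout forms one hyperedge, as in Figure 8(b)
    assignment: dict mapping each net to a partition id
    A hyperedge is cut when a net and any of its fanout land in
    different partitions, making that net an extra partition output.
    """
    return sum(1 for net, sinks in fanout.items()
               if any(assignment[s] != assignment[net] for s in sinks))

def best_bisection(fanout):
    """Exhaustive balanced bisection minimizing cut hyperedges: a toy
    stand-in for the recursive min-cut heuristic, usable only on tiny
    examples."""
    nets = sorted(fanout)
    best = None
    for bits in range(1 << len(nets)):
        part = {n: (bits >> i) & 1 for i, n in enumerate(nets)}
        if abs(2 * sum(part.values()) - len(nets)) > 1:
            continue  # keep the two halves (nearly) balanced
        cost = cut_hyperedges(fanout, part)
        if best is None or cost < best[0]:
            best = (cost, part)
    return best

# Tiny illustrative netlist (not the paper's switch): a drives b and c,
# while b and c both drive d.
fanout = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(cut_hyperedges(fanout, {"a": 0, "b": 0, "c": 1, "d": 1}))  # -> 2
print(best_bisection(fanout)[0])                                 # -> 2
```

For a full 3-way partitioning, the larger of the two resulting halves would then be bisected again, exactly as in Figure 8(c)-(d).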
Figure 9: Examples of one-spare systems. In part (a) sparing without any cluster decomposition is shown. In part (b) sparing is applied to a three-way partition. Cluster decomposition increases MUX interconnect overhead, but provides higher protection due to the smaller granularity of the sparing.

The outputs of both identical units are fed into a MUX, where a register value determines which unit is active and which one is inactive. In part (b), the circuit is partitioned into three pieces. Notice that the outputs at each partition boundary must now have a MUX associated with them. Also, each partition requires a register to determine which copy is active. Thus, 3-way partitioning requires a total of three registers, plus a number of MUXes corresponding to the number of outputs generated by the partitioning.

4.3 Switch Designs
Each configuration providing a defect-tolerant switch design is characterized by three parameters: the level of protection, the techniques applied, and the system diagnosis method. For each configuration, we use a naming convention of the form level_technique_diagnosis. The configurations using TMR as the defect tolerance technique do not use the end-to-end error detection, recovery, and system diagnosis techniques, since TMR inherently provides error detection, diagnosis, and recovery. All other configurations use the end-to-end error detection and recovery technique, along with either iterative replay or BIST for system diagnosis. Table 1 describes the choices that we considered in our simulated configurations for the three parameters, and it gives some example configurations along with their names.

Table 1: Mnemonic table for design configurations. For each portion of the naming convention, we show the possible mnemonics with the related description. The last portion provides some example design configurations.
Level of applying defect tolerance technique: S — System level; C — Component level; G — Gate level; S+CL — System level clusters; C+CL — Component level clusters.
Defect tolerance techniques (can be applied in combinations): TMR — Triple Modular Redundancy; #SP — # dedicated spares for each partition; #SH(X) — # shared spares for partitions of type X; ECC — Error Correction Codes applied to state.
System diagnosis technique: IR — Iterative replay; BIST — Built-In-Self-Test.
Example configurations: S+CL_1SP_IR — System level clusters with 1 spare for each partition and iterative replay; C_2SH(IC)+1SP_BIST — Component level with 2 shared input controller spares and one dedicated spare for the rest of the components, with BIST for system diagnosis; C+CL_TMR+ECC — Component level clusters with TMR and ECC-protected state.

5. Experimental Results
To evaluate the effectiveness of our domain-specific defect tolerance techniques in protecting the switch design, we simulated various design configurations with both traditional and domain-specific techniques. To assess the effectiveness of the various design configurations in protecting the switch design, we take into account the area and time overheads of the design, along with the mean number of defects that the design can tolerate. We also introduce a new metric, the Silicon Protection Factor (SPF), which gives a more representative notion of the amount of protection offered to the system by a given defect tolerance technique. Specifically, the SPF is computed by dividing the mean number of defects needed to cause a switch failure by the area overhead of the protection technique. In other words, the higher the SPF, the more resilient each transistor is to defects. Since the number of defects in a design is proportional to the area of the design, this metric is more appropriate for assessing the effectiveness of silicon protection techniques.

In Table 2, we list the design configurations that we simulated. The naming convention used to represent each configuration is described in Table 1. For each simulated design configuration, we provide the area overhead needed to implement the specific design. This area overhead includes the extra area needed for the spare units, the majority gates, the logic for enabling and disabling spare units, and the logic for end-to-end error detection, recovery, and system diagnosis (different configurations have different requirements for the extra logic added). We notice that the design configurations with the highest area overheads are the ones applying BIST for system diagnosis. This is due to the extra area needed to store the test vectors necessary for self-testing each distinct partition in the design, along with the additional interconnection and logic needed for the scan chains. Even though the area overhead of the test vectors can be shared over the total number of switches per chip, the area overhead of the BIST technique is still rather large. Another design configuration with high area overhead is the one where TMR is applied at the gate level, due to the extra voting gate needed for each gate in the baseline switch design. On the other hand, designs with shared spares achieve low area overhead (under 2X), since not every part of the switch is duplicated. The area overhead for the rest of the design configurations depends on the number of spares per partition.

In the fourth column of the table, we provide the mean number of defects to failure for each design configuration. The design configurations providing a high mean number of defects to failure are the ones employing the ACD (Automatic Cluster Decomposition) technique. Another point of interest is that techniques employing ECC perform poorly, even when coupled with automatic cluster decomposition. Although state is traditionally protected by ECC, when a design is primarily combinational logic, like our switch, the cost of separating the state from the logic exceeds the protection given to the state elements. In other words, if the state is not considered in the ACD analysis, and is therefore not part of any of the spared partitions, the boundary between the state and the spared partitions must have some unprotected interconnection logic. This added logic, coupled with the unprotected logic required by ECC, makes ECC in a logic-dominated design undesirable.

The SPF values for each design are presented in the fifth column of Table 2. The highest SPFs are given by the design configurations that employ automatic cluster decomposition, with the highest being design S+CL_2SP_IR at 11.11. Even though design S+CL_2SP_BIST uses the same sparing strategies, the area overhead added by BIST decreases the design's SPF significantly. It is interesting that two design configurations have SPFs lower than 1. The first
Table 2: Results of the evaluated designs. For each design configuration we report the mnemonic, the area factor over the baseline design, the number of defects that can be tolerated, the SPF, the number of partitions, and an estimate of the impact on the system delay.

Key  Design Configuration   Area O.head  Defects  SPF    #Part.  %Dly
1    S_TMR                  3.02         2.49     0.82   1       0.00
2    S+CL_TMR               3.08         16.78    5.45   241     22.22
3    S+CL_TMR+ECC           3.07         6.92     2.25   185     27.78
4    C_TMR                  3.04         4.68     1.54   12      0.00
5    C+CL_TMR               3.09         15.86    5.13   223     18.75
6    C+CL_TMR+ECC           3.11         6.25     2.01   298     25.00
7    G_TMR                  4.00         4.00     1.00   10540   100.00
8    S_1SP_IR               2.22         3.27     1.47   1       0.00
9    S+CL_1SP_IR            2.30         17.53    7.63   206     22.22
10   S+CL_1SP_BIST          3.16         17.53    5.54   206     22.22
11   S+CL_1SP+ECC_IR        2.48         5.96     2.41   183     27.78
12   S+CL_1SP+ECC_BIST      3.34         5.96     1.78   183     27.78
13   C_1SP_IR               2.24         5.87     2.62   12      0.00
14   C_1SP_BIST             2.79         5.87     2.62   12      0.00
15   C+CL_1SP_IR            2.33         16.04    6.88   223     18.75
16   C+CL_1SP+ECC_IR        2.51         5.34     2.13   138     25.00
17   S_2SP_IR               3.32         5.95     1.79   1       0.00
18   S+CL_2SP_IR            3.42         37.99    11.11  206     22.22
19   S+CL_2SP_BIST          4.29         37.99    8.86   206     22.22
20   S+CL_2SP+ECC_IR        3.39         8.64     2.55   118     22.22
21   C_2SP_IR               3.36         13.07    3.90   12      0.00
22   C_2SP_BIST             3.90         13.07    3.35   12      0.00
23   C+CL_2SP_IR            3.44         32.33    9.39   208     18.75
24   C+CL_2SP_BIST          4.31         32.33    7.50   208     18.75
25   C+CL_2SP+ECC_IR        3.41         7.49     2.20   103     25.00
26   C_2SH(IC)_IR           1.52         3.15     2.07   12      0.00
27   C_3SH(IC)_IR           1.71         4.14     2.43   12      0.00
28   C_4SH(IC)_IR           1.89         5.02     2.65   12      0.00
29   C_5SH(IC)_IR           2.08         5.90     2.84   12      0.00
30   C_2SH(IC)+1SP_IR       1.74         4.40     2.53   12      0.00
31   C_3SH(IC)+1SP_IR       1.93         5.79     3.01   12      0.00
32   C_4SH(IC)+1SP_IR       2.12         7.10     3.34   12      0.00
33   C_5SH(IC)+1SP_IR       2.41         8.39     3.48   12      0.00
34   C_2SH(IC)+2SP_IR       1.93         5.01     2.60   12      0.00
35   C_3SH(IC)+2SP_IR       2.12         6.57     3.09   12      0.00
36   C_4SH(IC)+2SP_IR       2.30         8.10     3.52   12      0.00
37   C_5SH(IC)+2SP_IR       2.50         9.58     3.84   12      0.00
38   S_ECC                  1.18         1.16     0.98   12      0.00
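The SPF definition from Section 5 can be spot-checked directly against Table 2. A minimal sketch (the function name is ours; the row values are transcribed from the table, which rounds to two decimals):

```python
def spf(mean_defects_to_failure, area_overhead):
    """Silicon Protection Factor: mean number of tolerated defects per
    unit of area overhead (higher = more protection per transistor)."""
    return mean_defects_to_failure / area_overhead

# Spot-check a few Table 2 rows: (config, area overhead, defects, reported SPF).
rows = [("S_TMR",        3.02, 2.49,  0.82),
        ("S+CL_2SP_IR",  3.42, 37.99, 11.11),
        ("G_TMR",        4.00, 4.00,  1.00)]
for name, area, defects, reported in rows:
    # allow a small tolerance, since the table rounds intermediate values
    assert abs(spf(defects, area) - reported) < 0.02, name
print(round(spf(37.99, 3.42), 2))  # -> 11.11
```

This makes the trade-off explicit: S+CL_2SP_BIST tolerates the same 37.99 defects as S+CL_2SP_IR, but its larger 4.29X area overhead drags its SPF down to 8.86.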

18
one is TMR applied at the system level, which can tolerate
16
2.5 defects but the area overhead is more than triple, thus Mean Defects to Failure
making the new design less defect tolerant than the baseline 14 SPF-Defect Tolerance

k
switch design by 18%. The second one is where the state is 12

D efect R esilie ncy


protected by ECC. Since our design is logic dominated and 10
the protected fraction of the design is very small, the ex-
8
tra logic required for applying ECC (which is unprotected),
6
is larger than the actual protected area. Thus, this tech-
nique makes the specific design less defect tolerant than the 4

baseline unprotected design by 2%. 2


The sixth column in the table shows the number of distinct partitions for each design configuration. This parameter is very important for the configurations employing ACD: the SPF of a given design configuration depends strongly on the number of partitions in the decomposed design. Figure 10 shows the dependency of the SPF on the number of decomposed partitions for the design configuration S+CL_1SP_IR. We can see that for this configuration the peak SPF occurs at around 200 partitions. As the per-partition size decreases, the SPF value increases, and as the number of cut edges per partition increases, the SPF value decreases. Therefore, the initial rise of the SPF occurs because the area per partition shrinks as the number of decomposed partitions grows. After the optimal point of 200 partitions, the overhead of the extra unprotected logic required for each cut edge between partitions causes the SPF to decline. For each design configuration employing automatic cluster decomposition, we ran several simulations with varying numbers of partitions to find the optimal SPF.

Figure 10: Defect resiliency as a function of the number of partitions. As an example we plot the SPF defect tolerance of configuration S+CL_1SP_IR for a varying number of partitions generated by the ACD algorithm.

The final column in Table 2, %Delay, gives the percentage increase of the critical path delay in the switch. We produce a coarse, technology-independent approximation of the delay overhead by making the delay increase proportional to the number of interconnection gates added to the critical path. Thus, TMR-based designs achieve the same delay increase as spare-based designs, since multiplexers and majority gates are treated the same in the analysis. Our results show that the best designs always incur a delay increase of less than 25%. The designs that involve ACD exhibit the greatest increase in delay, because the partitions generated frequently split up the critical paths. Designs with a minimal amount of clustering, such as C_2SH(IC)_IR, incur no overhead, since no interconnection logic is added to any of the critical paths. In general, our results indicate that achieving a high SPF requires a slight delay penalty; however, in principle the ACD strategy could be used to minimize the number of critical paths that are partitioned.

The graph in Figure 11 shows the trade-off between defect tolerance and area overhead. The horizontal axis represents the defect tolerance provided by a design configuration in SPFs, and the vertical axis the area overhead of the configuration. The further to the right a design configuration lies, the higher the defect tolerance it provides, while the lower it lies, the lower the implementation cost.

At the lower left corner is the design configuration S_ECC, which provides ECC protection for the state. This is the cheapest design configuration, but it does not provide any considerable defect tolerance to the switch. The rightmost design configuration, S+CL_2SP_IR, provides a defect tolerance of 11.11 SPF by employing automatic cluster decomposition at the system level with 200 partitions and two extra spares for each partition, along with iterative replay for system diagnosis. The area overhead of this configuration is 3.42X, and it provides the best trade-off between area required and offered defect protection.
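The choice of 200 partitions for S+CL_2SP_IR reflects the rise-and-fall trade-off shown in Figure 10, which a toy Monte Carlo model can reproduce. This is our illustrative sketch, not the paper's evaluation flow: each partition gets one same-sized spare and fails on its second defect, while the glue logic added on cut edges is unprotected, so a single defect there is fatal. The glue-area constant and trial count are hypothetical.

```python
import random

def spf_estimate(num_parts, trials=3000, edge_area_per_part=0.0005, seed=1):
    """Toy SPF model: mean defects absorbed before failure, normalized by
    area overhead. Base protected logic has area 1.0; each partition has one
    same-sized spare; unprotected cut-edge glue grows linearly with the
    number of partitions."""
    rng = random.Random(seed)
    base_area = 1.0
    spare_area = base_area                       # one spare per partition, same size
    edge_area = edge_area_per_part * num_parts   # unprotected glue logic
    total_area = base_area + spare_area + edge_area
    glue_prob = edge_area / total_area           # chance a defect lands in glue
    mean_defects = 0.0
    for _ in range(trials):
        hits = [0] * num_parts
        n = 0
        while True:
            n += 1
            if rng.random() < glue_prob:
                break                            # fatal: unprotected glue hit
            p = rng.randrange(num_parts)         # defect lands in some partition
            hits[p] += 1
            if hits[p] == 2:                     # partition and its spare both hit
                break
        mean_defects += n
    return (mean_defects / trials) / total_area  # normalize by area overhead

# Sweep: SPF rises with finer partitioning, peaks, then falls as glue dominates.
for p in (10, 50, 200, 1000, 5000):
    print(p, round(spf_estimate(p), 2))
```

With these constants the peak lands at a few hundred partitions for the same reason the paper gives: finer partitions make each spare cover less area (more defects absorbed), until the per-edge glue overhead dominates.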
Figure 11: Pareto chart of the explored solutions. The designs evaluated are plotted on an area-overhead vs. SPF chart. The line across the chart connects the set of optimal solutions. See Table 1 for explanations of design points.
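The optimal-solution line in Figure 11 is the set of non-dominated points. The sketch below illustrates that computation on a handful of configurations; the SPF values and the 3.42X area figure come from the text and figure legend, while the remaining area numbers (and the S_ECC pair) are rough guesses for illustration only.

```python
# Hypothetical (SPF, area overhead) per configuration. SPFs and the 3.42X
# figure follow the text/legend; other area values are illustrative guesses.
designs = {
    "S_ECC":            (0.5,  1.1),   # assumed: cheapest, little tolerance
    "C_TMR":            (1.54, 3.1),   # area ~3X assumed for TMR
    "C_3SH(IC)+1SP_IR": (3.01, 1.9),   # "less than 2X" per the text
    "S+CL_1SP_IR":      (7.63, 2.4),
    "S+CL_2SP_IR":      (11.11, 3.42),
}

def pareto_front(points):
    """Keep designs not dominated by another point offering >= SPF at <= area."""
    front = [name for name, (spf, area) in points.items()
             if not any(s >= spf and a <= area and (s, a) != (spf, area)
                        for s, a in points.values())]
    return sorted(front, key=lambda n: points[n][0])
```

Even with guessed areas, C_TMR falls off the frontier: cheaper configurations deliver higher SPF, matching the paper's observation that plain TMR is not cost-effective.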
Design configurations with moderate SPFs but much lower area overhead are C_3SH(IC)_IR, C_2SH(IC)+1SP_IR, C_3SH(IC)+1SP_IR, and C_2SH(IC)+2SP_IR. These design configurations use shared spares for the input controllers along with dedicated spares for the other components in the switch design, keeping the area overhead below 2X while offering SPFs of 2.5-3. Such designs are interesting, since they keep the implementation cost low and still provide an attractive level of defect tolerance.

Two other interesting design configurations are C+CL_1SP_IR and S+CL_1SP_IR. These two designs use the same technique, automatic cluster decomposition with one spare for each partition, with the difference that S+CL_1SP_IR applies the ACD technique at the system level while C+CL_1SP_IR applies it at the component level. The area cost of the two designs is almost the same, but S+CL_1SP_IR provides 11% more SPF. The same argument also holds for designs S+CL_2SP_IR and C+CL_2SP_IR. This suggests that applying the ACD technique at the system level can offer more effective defect tolerance at the same area cost.

In Figure 12, we present how some of the design configurations affect the lifetime of the switch design for a future post-65nm technology in which the mean time between failures manifested on a switch is 2 years (a failure rate of 55,000 FITs). The graph's horizontal axis represents the years the switch design has been operating. The vertical axis represents the percentage of defective parts over a population of switches (left axis) and the baseline switch's failure rate (right axis). The baseline switch's lifetime failure rate for the given technology is shown by the darker thick line, forming the bathtub curve. In Figure 12(a) only part of the bathtub curve appears, since for this graph we assume that the design's breakdown occurs after 30 years. For each design configuration presented in the graph, there is a line showing the failure rate of switch parts over time. This line starts from year 1, since we assume that the first year of a part's lifetime is consumed by the accelerated testing (burn-in) procedure, and that shipped parts are already in their first year of lifetime, with a constant failure rate.

Figure 12: Fault tolerance of some interesting design configurations. Part (a) superimposes the FIT rate of the bathtub model with the percentage of defective parts over time. In addition, part (b) takes into account the breakdown period.

From the graph in Figure 12(a), we can observe that when applying TMR at the component level, 25% of the shipped parts will be defective within the first year and 75% within the first three years. On the other hand, for the design configuration where automatic cluster decomposition is applied at the system level with two spares for each partition, 25% of the shipped parts will be defective after 16 years and 75% after 29 years. If we define the lifetime of a manufactured product as the period of time until 10% of the manufactured parts become defective, then the clustering design configuration S+CL_2SP_IR increases the switch's lifetime by 26X over the TMR design configuration C_TMR.

System designers can choose the defect tolerance technique that best matches their design's specifications.
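The lifetime comparisons above follow from simple failure-rate arithmetic: 55,000 FIT converts to roughly one defect per two switch-years, and a configuration that absorbs several defects pushes out the time at which a given fraction of the population has failed. The sketch below is a constant-rate Poisson model, not the paper's actual simulation, and mapping a configuration to a whole number of tolerated defects is our simplification for illustration.

```python
import math

FIT = 55_000                          # failures per 10^9 device-hours (paper's assumption)
LAMBDA = FIT * 1e-9 * 24 * 365        # ~0.48 defects per switch-year (MTBF ~2 years)

def frac_failed(t_years, defects_tolerated=0):
    """Fraction of parts failed by time t, if each part absorbs
    `defects_tolerated` defects and dies on the next one (defects arrive
    as a Poisson process with constant rate LAMBDA)."""
    mu = LAMBDA * t_years
    k = defects_tolerated + 1         # defect count that kills a part
    survive = sum(math.exp(-mu) * mu**i / math.factorial(i) for i in range(k))
    return 1.0 - survive

def lifetime_years(defects_tolerated, threshold=0.10, step=0.1):
    """Years until `threshold` of the population has failed
    (the paper's 10%-defective lifetime definition)."""
    t = 0.0
    while frac_failed(t, defects_tolerated) < threshold:
        t += step
    return t
```

Under this model an unprotected part crosses the 10% line within months, while a hypothetical design absorbing eight defects lasts over a decade, the same qualitative stretching of the bathtub's useful-life region seen in Figure 12.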
For example, the design configuration S+CL_1SP_IR, where automatic cluster decomposition is applied at the system level with one dedicated spare for each partition, might be a more attractive solution: 10% of its parts become defective only after 7 years, at 48% less area cost than design configuration S+CL_2SP_IR.

The same data as in Figure 12(a) are presented in Figure 12(b), with the difference that here we assume that the breakdown of the switch design starts after 10 years of deployment. For the first three design configurations there is no difference, since by that time all of the parts have already become defective. For the other two design configurations, it is interesting to observe that even after the breakdown point, where the failure rate increases exponentially, most parts can still give the user a warning window of a month before failure. This is a very important feature, especially for highly dependable, critical applications.

6. Conclusions and Future Directions

As silicon technologies continue to scale, transistor reliability is becoming an increasingly important issue. Devices are becoming subject to extreme process variation, transistor wearout, and manufacturing defects. As a result, it will likely no longer be possible to create fault-free designs. In this paper, we investigate the design of a defect-tolerant CMP network switch. To accomplish this design, we first develop a high-level, architect-friendly model of silicon failures based on the time-tested bathtub curve. Based on this model, we explore the design space of defect-tolerant CMP switch designs and the resulting trade-off between defect tolerance and area overhead. We find that traditional mechanisms, such as triple modular redundancy and error correction codes, are insufficient for tolerating moderate numbers of defects. Rather, domain-specific techniques that include end-to-end error detection, resource sparing, and iterative diagnosis/reconfiguration are more effective. Further, decomposing the netlist of the switch into modest-sized clusters is the most effective granularity at which to apply the protection techniques.

This work provides a solid foundation for future exploration in the area of defect-tolerant design. We plan to investigate the use of spare components based on wearout profiles to provide more sparing for the most vulnerable components. Further, a CMP switch is only a first step towards the overarching goal of designing a defect-tolerant CMP system.

Acknowledgments: This work is supported by grants from NSF and the Gigascale Systems Research Center. We would also like to acknowledge Li-Shiuan Peh for providing us access to CMP switch models, and the anonymous reviewers for their useful comments on this paper.
