
IEEE TRANSACTIONS ON COMPUTERS, VOL. 58, NO. 8, AUGUST 2009

A Flexible Software-Based Framework for Online Detection of Hardware Defects
Kypros Constantinides, Student Member, IEEE, Onur Mutlu, Member, IEEE,
Todd Austin, Member, IEEE, and Valeria Bertacco, Member, IEEE

Abstract—This work proposes a new, software-based, defect detection and diagnosis technique. We introduce a novel set of instructions, called Access-Control Extensions (ACE), that can access and control the microprocessor's internal state. Special firmware periodically suspends microprocessor execution and uses the ACE instructions to run directed tests on the hardware. When a hardware defect is present, these tests can diagnose and locate it, and then activate system repair through resource reconfiguration. The software nature of our framework makes it flexible: testing techniques can be modified/upgraded in the field to trade off performance with reliability without requiring any change to the hardware. We describe and evaluate different execution models for using the ACE framework. We also describe how the proposed ACE framework can be extended and utilized to improve the quality of post-silicon debugging and manufacturing testing of modern processors. We evaluated our technique on a commercial chip-multiprocessor based on Sun's Niagara and found that it can provide very high coverage, with 99.22 percent of all silicon defects detected. Moreover, our results show that the average performance overhead of software-based testing is only 5.5 percent. Based on a detailed register transfer level (RTL) implementation of our technique, we find its area and power consumption overheads to be modest, with a 5.8 percent increase in total chip area and a 4 percent increase in the chip's overall power consumption.

Index Terms—Reliability, hardware defects, online defect detection, testing, online self-test, post-silicon debugging, manufacturing test.

1 INTRODUCTION

The impressive growth of the semiconductor industry over the last few decades is fueled by continuous silicon scaling, which offers smaller, faster, and cheaper transistors with each new technology generation. However, challenges in producing reliable components in these extremely dense technologies are growing, with many device experts warning that continued scaling will inevitably lead to future generations of silicon technology being much less reliable than present ones [4], [53]. Processors manufactured in future technologies will likely experience failures in the field due to silicon defects occurring during system operation. In the absence of any viable alternative technology, the success of the semiconductor industry in the future will depend on the creation of cost-effective mechanisms to tolerate silicon defects in the field (i.e., during operation).

The challenge—tolerating hardware defects. To tolerate permanent hardware faults (i.e., silicon defects) encountered during operation, a reliable system requires the inclusion of three critical capabilities: 1) mechanisms for detection and diagnosis of defects, 2) recovery techniques to restore correct system state after a fault is detected, and 3) repair mechanisms to restore correct system functionality for future computation. Fortunately, research in chip-multiprocessor (CMP) architectures already provides for the latter two requirements. Researchers have pursued the development of global checkpoint and recovery mechanisms; examples of these include SafetyNet [52] and ReVive [42], [39]. These low-cost checkpointing mechanisms provide the capabilities necessary to implement system recovery. Additionally, the highly redundant nature of future CMPs will allow low-cost repair through the disabling of defective processing elements [48]. With a sufficient number of processing resources, the performance of a future parallel system will gracefully degrade as manifested defects increase.

Given the existence of low-cost mechanisms for system recovery and repair, the remaining major challenge in the design of a defect-tolerant CMP is the development of low-cost defect detection techniques. Existing online hardware-based defect detection and diagnosis techniques can be classified into two broad categories: 1) continuous: those that continuously check for execution errors and 2) periodic: those that periodically check the processor's logic.

Existing defect tolerance techniques and their shortcomings. Examples of continuous techniques are Dual Modular Redundancy (DMR) [51], lockstep systems [27], and DIVA [2]. These techniques detect silicon defects by validating the execution through independent redundant computation. However, independent redundant computation requires significant hardware cost in terms of silicon area (100 percent extra hardware in the case of DMR and lockstep systems). Furthermore, continuous checking consumes significant energy and requires part of the power envelope to be dedicated to it. In contrast, periodic techniques check the integrity of the hardware periodically without requiring redundant execution [50]. These techniques rely on checkpointing and recovery mechanisms that provide computational epochs and a substrate for speculative unchecked execution. At the end of each computational epoch, the hardware is checked by on-chip testers. If

K. Constantinides, T. Austin, and V. Bertacco are with the University of Michigan, Ann Arbor, 2260 Hayward, 2773 CSE, MI 48109. E-mail: {kypros, austin, valeria}@umich.edu.
O. Mutlu is with Carnegie Mellon University, 5000 Forbes Avenue, ECE-HH-A305, Pittsburgh, PA 15213. E-mail: [email protected].

Manuscript received 18 Feb. 2008; revised 30 Aug. 2008; accepted 20 Nov. 2008; published online 20 Mar. 2009. Recommended for acceptance by C. Bolchini. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TC-2008-02-0078. Digital Object Identifier no. 10.1109/TC.2009.52.

0018-9340/09/$25.00 © 2009 IEEE. Published by the IEEE Computer Society.
Authorized licensed use limited to: Carnegie Mellon Libraries. Downloaded on September 11, 2009 at 22:02 from IEEE Xplore. Restrictions apply.
the hardware tests succeed, the results produced during the epoch are committed and execution proceeds to the next computational epoch. Otherwise, the system is deemed defective and system repair and recovery are required.

The on-chip testers employed by periodic defect tolerance techniques rely on the same Built-In-Self-Test (BIST) techniques that are used predominantly during manufacturing test [7]. BIST techniques use specialized circuitry to generate test patterns and validate the responses generated by the hardware. There are two main ways to generate test patterns on chip: 1) by using pseudorandom test pattern generators and 2) by storing on-chip previously generated test vectors that are based on a specific fault model. Unfortunately, both of these approaches have significant drawbacks. The first approach does not follow any specific testing strategy (targeted fault model), and therefore, requires extended testing times to achieve good fault coverage [7]. The second approach not only requires significant hardware overhead [10] to store the test patterns on chip but also binds a specific testing approach (i.e., fault model) into silicon. On the other hand, as the nature of wearout-related silicon defects and the techniques to detect them are under continuous exploration [17], binding specific testing approaches into silicon might be premature, and therefore, undesirable.

As of today, hardware-based defect tolerance techniques have one or both of the following two major disadvantages:

1. Cost: They require significant additional hardware to implement a specific testing strategy.
2. Inflexibility: They bind specific test patterns and a specific testing approach (e.g., based on a specific fault model) into silicon. Thus, it is impossible to change the testing strategy and test patterns after the processor is deployed in the field. Flexible defect tolerance solutions that can be upgraded in the field are very desirable.

High-level overview of our approach. Our goal in this work is to develop a low-cost, flexible defect tolerance technique that can be modified and upgraded in the field. To this end, we propose to implement hardware defect detection and diagnosis in software. In our approach, the hardware provides the necessary substrate to facilitate testing and the software makes use of this substrate to perform the testing. We introduce specialized Access-Control Extension (ACE) instructions that are capable of accessing and controlling virtually any portion of the microprocessor's internal state. Special firmware periodically suspends microprocessor execution and uses the ACE instructions to run directed tests on the hardware and detect if any component has become defective.

Fig. 1 shows how the ACE framework fits in the hardware/software stack below the operating system layer. Our approach provides particularly wide coverage, as it not only tests the internal processor control and instruction sequencing mechanisms through software functional testing, but it can also check all datapaths, routers, interconnect, and microarchitectural components by issuing ACE instruction test sequences.

Fig. 1. The ACE framework fits in the hardware/software stack below the operating system.

2 WHY DOES SILICON FAIL? A BRIEF OVERVIEW OF SILICON FAILURE MECHANISMS

We first provide a brief overview of the silicon failure mechanisms that motivate the solution we propose in this work. The interested reader can refer to [14], [44], [49], [54], [23] for a detailed treatment of these mechanisms.

Time-dependent wearout:

• Electromigration: Due to the momentum transfer between the current-carrying electrons and the host metal lattice, ions in a conductor can move in the direction of the electron current. This ion movement is called electromigration [14]. Gradually, this ion movement can cause clustered vacancies that can grow into voids. These voids can eventually grow until they block the current flow in the conductor. This leads to increased resistance and propagation delay, which, in turn, leads to possible device failure. Other effects of electromigration are fractures and shorts in the interconnect. The trend of increasing current densities in future technologies increases the severity of electromigration, leading to a higher probability of observing open and short-circuit nodes over time [18].

• Gate Oxide Wearout: Thin gate oxides lead to additional failure modes as devices become subject to gate oxide wearout (or Time-Dependent Dielectric Breakdown, TDDB) [14]. Over time, gate oxides can break down and become conductive. If enough material in the gate breaks down, a conduction path can form from the transistor gate to the substrate, essentially shorting the transistor and rendering it useless [18], [23]. Fast clocks, high temperatures, and voltage scaling limitations are well-established architectural trends that aggravate this failure mode [54].

• Hot Carrier Degradation (HCD): As carriers move along the channel of an MOSFET and experience impact ionization near the drain end of the device, it is possible that they gain sufficient kinetic energy to be injected into the gate oxide [14]. This phenomenon is called Hot Carrier Injection. Hot carriers can degrade the gate dielectric, causing shifts in threshold voltage and eventually device failure. HCD is predicted to worsen for future thinner oxide and shorter channel lengths [23].

Transistor infant mortality. Extreme device scaling also exacerbates early transistor failures. Early transistor failures are caused by weak transistors that escape postmanufacturing validation tests. These weak transistors work initially, but they have dimensional and doping deficiencies that subject them to much higher stress than robust transistors. Quickly (within days to months), they will break down from stress and render the device unusable. Traditionally, early transistor failures have been reduced through aggressive burn-in testing, where, before being placed in the field, devices are subjected to high voltage and temperature testing to accelerate the failure of weak
transistors [7]. Those that survive the burn-in testing are likely to be robust devices, thereby ensuring a long product lifetime. However, in the deep-submicron regime, burn-in becomes less effective as devices are subject to thermal runaway effects, where increased temperature leads to increased leakage current, which, in turn, leads to even higher temperatures [37]. The end result is that aggressive burn-in of deep-submicron silicon can destroy even robust devices. Manufacturers are forced to either sacrifice yield by deploying aggressive burn-in testing or experience more frequent early failures in the field by using less aggressive burn-in testing.

Manufacturing defects that escape testing. Optical proximity effects, airborne impurities, and processing material defects can all lead to the manufacturing of faulty transistors and interconnect [44]. Moreover, deep-submicron gate oxides have become so thin that manufacturing variation can lead to currents penetrating the gate, rendering it unusable [49]. Even small amounts of manufacturing variation in the gate oxide could render the device unusable. The problem of manufacturing defects is compounded by the immense complexity of current designs. Design complexity makes it more difficult to test for defects during manufacturing. Vendors are forced to either spend more time with parts on the tester, which reduces profits by increasing time-to-market, or risk the possibility of untested defects escaping to the field. Moreover, in highly complex designs, many defects are not testable without additional hardware support. As a result, even in today's manufacturing environment, untestable defects can escape testing and manifest themselves later on in the field.

Our goal. To overcome the possible errors caused by the aforementioned silicon failure mechanisms, our goal in this work is to develop a flexible, low-cost silicon defect detection and diagnosis technique. We next describe our technique in detail.

Fig. 2. A typical scan flip-flop (adapted from [38]).

Fig. 3. The ACE Architecture: (a) the chip is logically partitioned into multiple ACE domains. Each ACE domain includes several ACE segments. The union of all ACE segments comprises the full chip's state (excluding SRAM structures). (b) Data are transferred from/to the register file to/from an ACE segment through the bidirectional ACE tree.

3 SOFTWARE-BASED DEFECT DETECTION AND DIAGNOSIS

A key challenge in implementing a software-based defect detection and diagnosis technique is the development of effective software routines to check the underlying hardware. Commonly, software routines for this task suffer from the inherent inability of the software layer to observe and control the underlying hardware, resulting in either excessively long test sequences or poor defect coverage. Current microprocessor designs allow only minimal access to their internal state by the software layer; often all that software can access consists of the register file and a few control registers (such as the program counter (PC), status registers, etc.). Although this separation provides protection from malicious software, it also largely limits the degree to which stock hardware can utilize software to test for silicon defects.

To overcome this limited accessibility, we propose architectural support through an extension to the processor's ISA. Our extension adds a set of special instructions enabling full observability and control of the hardware's internal state. These ACE instructions are capable of reading/writing from/to any part of the microprocessor's internal state. ACE instructions make it possible to probe the underlying hardware and systematically and efficiently assess if any hardware component is defective.

3.1 An ACE-Enhanced Architecture

A microprocessor's state can be partitioned into two parts: accessible from the software layer (e.g., register file, PC, etc.) or not accessible (e.g., reorder buffer, load/store queues, etc.). An ACE-enhanced microarchitecture allows the software layer to access and control (almost) all of the microprocessor's state. This is done by using ACE instructions that copy a value from an architectural register to any other part of the microprocessor's state and vice versa.

This approach inherently requires the architecture to access the underlying microarchitectural state. To provide this accessibility without a large hardware overhead, we leverage the existing scan chain infrastructure. Most modern processor designs employ full hold-scan techniques to aid and automate the manufacturing testing process [30], [62]. Fig. 2 shows a typical scan flip-flop design [38], [30]. The system flip-flop is used during the normal operating mode, while the scan portion is used during testing to load the system with test patterns and to read out the test responses. Our approach extends the existing scan chain using a hierarchical, tree-structured organization to provide fast software access to different microarchitectural components.

ACE domains and segments. In our ACE extension implementation, the microprocessor design is logically partitioned into several ACE domains. An ACE domain consists of the state elements and combinational logic associated with a specific part of the microprocessor. Each ACE domain is further subdivided into ACE segments as shown in Fig. 3a. Each ACE segment includes only a fixed number of storage bits, which is the same as the width of an architectural register (64 bits in our design).

ACE instructions. Using this hierarchical structure, ACE instructions can read or write any part of the microprocessor's state. Table 1 shows a description of the ACE instruction set extensions.
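To make this organization concrete, the domain/segment partitioning and the scan-state accesses that the ACE instructions provide can be sketched as a small software model. This is a minimal illustrative Python sketch of our own; the class and function names, and the dictionary-free layout, are assumptions for exposition and do not correspond to the paper's actual RTL or ISA encoding.

```python
# Minimal software model of the ACE state organization: a chip is split
# into ACE domains, each holding 64-bit ACE segments, and ACE-style
# operations move values between an architectural register and a
# segment's scan state. All names here are illustrative assumptions.

SEGMENT_WIDTH = 64                   # bits per ACE segment == register width
MASK = (1 << SEGMENT_WIDTH) - 1

class AceSegment:
    """One ACE segment: 64 scan flip-flops (system + scan portions)."""
    def __init__(self):
        self.system = 0              # processor state (system flip-flops)
        self.scan = 0                # scan state (scan portions)

class AceDomain:
    """State elements and logic of one part of the microprocessor."""
    def __init__(self, n_segments):
        self.segments = [AceSegment() for _ in range(n_segments)]

def ace_set(domain, idx, reg_value):
    """ACE_set: copy an architectural register into a segment's scan state."""
    domain.segments[idx].scan = reg_value & MASK

def ace_get(domain, idx):
    """ACE_get: load a segment's scan state into an architectural register."""
    return domain.segments[idx].scan

def ace_swap(domain, idx):
    """ACE_swap: exchange a segment's scan state with its processor state."""
    seg = domain.segments[idx]
    seg.system, seg.scan = seg.scan, seg.system
```

In this model, loading a 64-bit test pattern into, say, segment 3 of a decoder domain is simply `ace_set(decoder_domain, 3, pattern)`; in the real design, the same transfer is pipelined through the ACE tree rather than performed by a direct assignment.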

TABLE 1
The ACE Instruction Set Extensions

ACE_set copies a value from an architectural register to the scan state (scan portion in Fig. 2) of the specified ACE segment at-speed (i.e., at the processor's clock frequency). Similarly, ACE_get loads a value from the scan state of the specified ACE segment to an architectural register at-speed. These two instructions can be used for manipulating the scan state through software-accessible architectural state. The ACE_swap instruction is used for swapping the scan state with the processor state (system flip-flops) of the ACE segment by asserting both the UPDATE and the CAPTURE signals (see Fig. 2). Finally, ACE_test is a test-specific instruction that performs a three-cycle atomic operation for orchestrating the actual testing of the underlying hardware (see Section 3.2 for an example).

In order to avoid any malicious use of the ACE infrastructure, ACE instructions are privileged instructions that can be used only by ACE firmware. ACE firmware routines are special applications running between the operating system layer and the hardware in a trusted mode, similarly to other firmware, such as device drivers.

ACE tree. During the execution of an ACE instruction, data need to be transferred from the register file to any part of the chip that contains microarchitectural state. In order to avoid long interconnect, which would require extra repeaters and buffering circuitry, the data transfer between the register file and the ACE segments is pipelined through the ACE tree as shown in Fig. 3b. At the root of the ACE tree is the register file, while the ACE segments are its leaves. At each intermediate tree level, there is an ACE node that is responsible for buffering and routing the data based on the executed operation. The ACE tree is a bidirectional tree allowing data transfers from the register file to the ACE segments and back.

Design complexity. We believe that since the ACE tree is a regular structure that routes data from the register file to the scan chains and vice versa, its implementation and insertion into the microprocessor implementation can be automated by CAD tools, similar to the way that scan chains are automatically implemented and inserted in current microprocessors today. The main intrusive portion of the ACE tree that needs interaction with existing processor components is the additional read/write ports needed to connect the root of the ACE tree to the processor register file. Similarly, the ACE instruction set extensions are likely not intrusive to the microarchitecture since their operations are relatively simple and their implementation does not affect the implementation of other instructions in the ISA.

Fig. 4. ACE firmware: Pseudocode for 1) loading a test pattern, 2) testing, and 3) validating the test response.

3.2 ACE-Based Online Testing

ACE instruction set extensions make it possible to craft programs that can efficiently and accurately detect the underlying hardware defects. The approach taken in building test programs, however, must have high coverage, even in the presence of defects that might affect the correctness of ACE instruction execution and test programs. This section describes how test programs are designed.

ACE testing and diagnosis. Special firmware periodically suspends normal processor execution and uses the ACE infrastructure to perform high-quality testing of the underlying hardware. A test program exercises the underlying hardware with previously generated test patterns and validates the test responses. Both the test patterns and the associated test responses are stored in physical memory. The pseudocode of a firmware code segment that applies a test pattern and validates the test response is shown in Fig. 4. First, the test program stops normal execution and uses the ACE_set instruction to load the scan state with a test pattern (Step 1). Once the test pattern is loaded into the scan state, a three-cycle atomic ACE_test instruction is executed (Step 2). In the first cycle, the processor state is loaded with the test pattern by swapping the processor state with the scan state. The next cycle is the actual test cycle, where the combinational logic generates the test response. In the third cycle, by swapping again the processor state with the scan state, the processor state is restored while the test response is copied to the scan state for further validation. The final phase (Step 3) of the test routine uses the ACE_get instruction to read and validate the test response from the scan state. If a test pattern fails to produce the correct response at the end of Step 3, the test program indicates which part of the hardware is defective^1 and disables it through system reconfiguration [48], [13].

Given this software-based testing approach, the firmware designer can easily change the level of defect coverage by varying the number of test patterns. As a test program executes more patterns, coverage increases. We use automatic test pattern generation (ATPG) tools [7] to generate compact test pattern sets adhering to specific fault models.

Basic core functional testing. When performing ACE testing, there is one initial challenge to overcome: ACE testing firmware relies on the correctness of a set of basic core functionalities that load test patterns, execute ACE instructions, and validate the test response. If the core has a defect that prevents the correct execution of the ACE firmware, then ACE testing cannot be performed reliably. To bypass this problem, we craft specific programs to test the basic functionalities of a core before running any ACE testing firmware. If these programs do not report success in a timely manner to an independent auditor (e.g., the operating system running on the other cores), then we assume that an irrecoverable defect has occurred on the core and we permanently disable it. If the basic core functionalities are found to be intact, finer-grained ACE testing can begin.

1. By interpreting the correspondence between erroneous response bits and ACE domains.
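The three-step routine of Fig. 4 can be sketched in software as follows. This is an illustrative Python model under our own naming assumptions: the `logic` callable stands in for the combinational logic under test, and the dictionary stands in for one ACE segment; a real implementation would instead execute the privileged ACE_set/ACE_test/ACE_get instructions.

```python
# Illustrative model of the Fig. 4 test routine: 1) ACE_set loads a test
# pattern into the scan state, 2) the three-cycle atomic ACE_test applies
# it and captures the response, 3) ACE_get reads and validates it.
# 'segment' is a dict with 'system' (processor state) and 'scan' (scan
# state); 'logic' is a stand-in for the combinational logic under test.

def run_test_pattern(segment, pattern, logic, expected_response):
    saved_state = segment['system']          # application's processor state

    # Step 1: ACE_set -- load the test pattern into the scan state.
    segment['scan'] = pattern

    # Step 2: ACE_test (three-cycle atomic operation).
    # Cycle 1: swap scan and processor state; the pattern is now applied.
    segment['system'], segment['scan'] = segment['scan'], segment['system']
    # Cycle 2: the combinational logic evaluates; the response is captured
    # into the system flip-flops.
    segment['system'] = logic(segment['system'])
    # Cycle 3: swap back; processor state is restored and the response
    # now sits in the scan state.
    segment['system'], segment['scan'] = segment['scan'], segment['system']
    assert segment['system'] == saved_state  # execution resumes seamlessly

    # Step 3: ACE_get -- read the response and compare with the expected one.
    return segment['scan'] == expected_response
```

A `False` result at Step 3 flags the segment; interpreting which response bits are erroneous then maps the failure back to an ACE domain for diagnosis and reconfiguration, as described above.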
TABLE 2
Algorithmic Flow of ACE-Based Testing in a Checkpoint/Recovery Environment

Fig. 5. Different execution models of ACE testing: (a) Illustrates ACE testing in a single-threaded sequential execution model, where the ACE testing thread is run exclusively after application execution. (b) The ACE testing thread runs simultaneously with the application in a 2-way SMT execution environment. (c) ACE testing is interleaved with application execution and run in the shadow of L2 cache misses.

3.3 ACE Testing in a Checkpointing and Recovery Environment

We incorporate the ACE testing framework within a multiprocessor checkpointing and recovery mechanism (e.g., SafetyNet [52] or ReVive [42]) to provide support for system-level recovery. When a defect is detected, the system state is recovered to the last checkpoint (i.e., correct state) after the system is repaired.

In a checkpoint/recovery system, the release of a checkpoint is an irreversible action. Therefore, the system must execute the ACE testing firmware at the end of each checkpoint interval to test the integrity of the whole chip. A checkpoint is released only if ACE testing finds no defects. With this policy, the performance overhead induced by running the ACE testing firmware depends directly on the length of the checkpoint interval, that is, longer intervals lead to lower performance overhead. We explore the trade-off between checkpoint interval size and ACE testing performance overhead in Section 5.4.

3.4 Algorithmic Flow of ACE-Based Online Testing

Table 2 shows the flow of ACE-based online testing in a checkpointing and recovery environment with single-threaded execution. Other execution models are examined in the next section. Two points are worth noting in the algorithm. First, a lightweight context switch is performed from the application thread to the ACE testing thread at the beginning of the test and vice versa at the end of the test. Lightweight context switching [1], [28] in a single cycle is supported by many simultaneously multithreaded processors today, including Sun's UltraSPARC T1. If lightweight context switch support is not available, then a pipeline flush is required. Our results show that the context switch penalty, even if it is hundreds of cycles, only negligibly increases the overhead of ACE testing. Second, if the basic core functional test fails, the core is disabled and execution traps to the system software. If the ACE firmware test fails, the system software performs defect diagnosis to localize the defect. To do so, the system software maps the ACE segments that fail to match the expected test response to specific hardware components (i.e., the combinational logic driving the flip-flops of the ACE segments). If reconfigurability support is provided within those hardware components, the ACE firmware can pinpoint these components to be disabled. Since our focus is on flexible defect detection, we leave fault analysis and recovery to future work.

3.5 ACE Testing Execution Models

Single-threaded sequential ACE testing. The simplest execution model for ACE testing is to implement the ACE testing process at the end of each checkpoint interval. In this execution model, the application runs normally on the processor until the buffering resources dedicated to the checkpoint are full and a new checkpoint needs to be taken. At this point, a context switch between the application process and the ACE testing process happens. If the ACE testing routine deems the underlying hardware defect free, a new checkpoint of the processor state is taken and the execution of the application process is resumed. Otherwise, system repair and recovery are triggered. Fig. 5a illustrates this single-threaded sequential execution model.

SMT-based ACE testing. In processors that support simultaneous multithreading (SMT) execution [47], [21], [59], it is possible for the ACE firmware to run simultaneously with the application threads running on separate execution contexts. This execution model is illustrated in Fig. 5b and could be higher performance since it overlaps the latency of ACE testing with actual application execution.

Fortunately, the majority of the instructions used by the ACE testing firmware do not entail any synchronization requirements between the ACE testing thread and the other threads running on the processor. For example, the ACE instructions used to load a test pattern into the scan state (ACE_set) or read and validate a test response (ACE_get) do not affect the execution of other threads running on the processor. The work performed by these instructions can be fully overlapped with application execution.

However, the ACE_test instruction momentarily changes the microarchitectural state of the entire processor, and thus, affects the normal execution of all running threads. To avoid the incorrect execution of other running threads when an ACE_test instruction is executed by the ACE testing thread, all other threads need to pause execution. This is implemented by using simple synchronization hardware that pauses execution of all other threads (i.e., stalls their pipelines) when an ACE_test instruction starts execution and resumes their execution once the test instruction is completed. Note that during testing, the processor's microarchitectural state is stored in the scan state. The microarchitectural state gets restored right after the test cycle (see Section 3.1), enabling the seamless resumption of normal processor execution.

The advantage of the SMT-based ACE testing model is its lower performance overhead compared to single-threaded sequential ACE testing. The disadvantage is that this model requires a separate SMT context to be present in
1068 IEEE TRANSACTIONS ON COMPUTERS, VOL. 58, NO. 8, AUGUST 2009

the underlying processor. Note that to guarantee correct recovery with this execution model, the recovery mechanism needs to buffer the last two checkpoints.
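The pause/save/restore sequence that the ACE_test instruction triggers, as described above, can be summarized in a minimal sketch. This is purely illustrative simulation code, not the actual hardware or the ACE firmware: the dict-based processor model and all names here are hypothetical.

```python
# Hypothetical sketch of the ACE_test pause/save/restore semantics.
# The "processor" model and all names are illustrative only.

def ace_test(processor, test_pattern, logic_under_test):
    # 1. Synchronization hardware stalls the pipelines of all other
    #    running threads for the duration of the test instruction.
    paused = [t for t in processor["threads"] if t["running"]]
    for t in paused:
        t["running"] = False

    # 2. The processor's microarchitectural state is saved into the
    #    scan state before the test pattern is applied.
    saved_state = dict(processor["uarch_state"])

    # 3. One test cycle: the pattern drives the combinational logic
    #    under test and the response is captured.
    response = logic_under_test(test_pattern)

    # 4. The saved microarchitectural state is restored and the other
    #    threads resume, so the test is invisible to normal execution.
    processor["uarch_state"] = saved_state
    for t in paused:
        t["running"] = True
    return response

# Example: a toy "logic block" that inverts a 4-bit test pattern.
proc = {"threads": [{"running": True}, {"running": False}],
        "uarch_state": {"pc": 0x40}}
resp = ace_test(proc, 0b1010, lambda p: p ^ 0b1111)
# The architectural view is unchanged: proc's state and thread
# status are exactly as they were before the test.
```

The key property the sketch illustrates is that the test is atomic with respect to the other threads: they are stalled before any state is disturbed and resumed only after the pre-test state has been restored from the scan state.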
Interleaved ACE testing in the shadow of L2 misses. When the ACE testing thread is sharing the processor resources with other critical applications, it is important to avoid penalizing the performance of these critical applications due to hardware testing. Performance penalties can be reduced by allowing the ACE testing thread to execute only when the processor resources are unutilized by the performance-critical threads. An example scenario is to execute the ACE testing thread when the processor is stalled waiting for an L2 cache miss to complete, i.e., in the shadow of L2 cache misses. This execution scenario is illustrated in Fig. 5c.

In this execution model, the processor suspends the execution of the application and context switches into the ACE testing thread when the application incurs an L2 cache miss due to its oldest instruction. The context switch is similar to the lightweight context switches used in switch-on-event multithreading [1], [28]. When the L2 miss is fully serviced, the processor context switches back to the application and suspends the execution of the ACE thread. Under this execution policy, the ACE testing thread utilizes resources that would otherwise be unutilized. However, it is possible that the full ACE testing might not be completed in the shadow of L2 misses because the application might not incur enough L2 cache misses. If that is the case, the remaining portion of the ACE testing thread is executed at the end of the checkpoint interval.

The advantage of this ACE testing model is that it does not require a separate SMT context and can possibly provide lower performance overhead than sequential ACE testing. On the other hand, if L2 misses are not common in an application, the model can degenerate into single-threaded sequential ACE testing. As with the SMT-based model, to guarantee correct recovery with this execution model, the recovery mechanism needs to buffer the last two checkpoints.

4 EXPERIMENTAL METHODOLOGY

To evaluate our software-based defect detection technique, we used the OpenSPARC T1 architecture, the open source version of the commercial UltraSPARC T1 (Niagara) processor from Sun [55], as our experimental testbed. First, using the processor's RTL code, we divided the processor into ACE domains. We made the partition based on functionality, where each domain comprises a basic functionality module in the RTL code. When dividing the processor into ACE domains, we excluded modules that are dominated by SRAM structures (such as caches) because such modules are already protected with error-coding techniques such as ECC. Fig. 6 shows the processor modules covered by the ACE framework (note that the L1 caches within each core are also excluded).

Fig. 6. ACE coverage of the OpenSPARC T1 processor: Modules that are dominated by SRAM structures, such as on-chip caches, are not covered by ACE testing since they are already protected by ECC.

Next, we used the Synopsys Design Compiler to synthesize each ACE domain using the Artisan IBM 0.13 μm standard cell library. We used the Synopsys TetraMAX ATPG tool to generate the test patterns.

Fault models. In our studies, we explored several single-fault models: stuck-at, N-detect, and path-delay. The stuck-at fault model is the industry-standard model for test pattern generation. It assumes that a circuit defect behaves as a node stuck at 0 or 1. However, previous research has shown that the test pattern sets generated using the N-detect fault model are more effective for both timing and hard failures, and present higher correlation to actual circuit defects [36], [17]. In the N-detect test pattern sets, each single stuck-at fault is detected by at least N different test patterns. In addition to the stuck-at and N-detect fault models, we also generate test pattern sets using the path-delay fault model [7]. The path-delay fault model we use is the built-in path-delay fault model in the Synopsys TetraMAX commercial ATPG tool [56].

Benchmarks. We used a set of benchmarks from the SPEC CPU2000 suite to evaluate the performance overhead and memory logging requirements of ACE testing. All benchmarks were run with the reference input set.

Microarchitectural simulation. To evaluate the performance overhead of ACE testing, we modified the SESC simulator [45] to simulate a SPARC core enhanced with the ACE framework. The simulated SPARC core is a six-stage in-order core (with 16 KB IL1 and 8 KB DL1 caches) running at 1 GHz [55]. For each simulation run, we skipped the first billion instructions and then performed cycle-accurate simulation for different checkpoint interval lengths (10 M, 100 M, and 1 B dynamic instructions). To obtain the number of clock cycles needed for ACE testing, we simulated a process that emulated the ACE testing functionality. For the SMT experiments, we use a separate thread that runs the ACE testing software and we use a round-robin thread fetch policy. For these experiments, the simulation terminates when the ACE thread finishes testing and at least one of the other threads executes 100 M instructions. The thread combinations simulated for these experiments were determined randomly. Unless otherwise stated, we evaluate the single-threaded sequential execution model for ACE testing in our experiments.

Experiments to determine memory logging requirements. To evaluate the memory logging storage requirements of coarse-grained checkpointing, we used the Pin x86 binary instrumentation tool [35]. We wrote a Pin tool that measures the amount of storage needed to buffer the cache lines written back from the L2 cache to main memory during a checkpoint interval, based on the ReVive checkpointing scheme [42]. Benchmarks were run to completion for these experiments. Section 5.4 presents the memory logging overhead of our technique.

Performance overhead of I/O-intensive applications. An irreversible I/O operation (e.g., sending a packet to a network interface) requires the termination of a checkpoint before it is executed. If such operations occur frequently, they can lead to consistently short checkpoint intervals, and


Fig. 7. Fault coverage of basic core functional testing: The pie chart on the right shows the distribution of the outcomes of a fault injection campaign
on a five-stage in-order core running the purely software-based preliminary functional tests.

therefore, high performance overhead for our proposal. To investigate the performance overhead due to such frequent I/O operations, we simulated some I/O-intensive filesystem and network processing benchmarks. We evaluated the microbenchmarks Bonnie and IOzone to exercise the filesystem by performing frequent disk read/write operations. We also used the NetPerf benchmarks [20] to exercise the network interface by performing very frequent packet send/receive operations. In addition to the Netperf suite, we evaluated three other benchmarks, NetIO, NetPIPE, and ttcp, which are commonly used to measure network performance. In these experiments, the execution of an irrecoverable I/O operation is preceded by a checkpoint termination and the new checkpoint interval begins right after the execution of the I/O operation. Section 5.5 presents our results.

RTL implementation. We implemented the ACE tree structure in RTL using Verilog in order to obtain a detailed and accurate estimate of the area and power consumption overheads of the ACE framework. We synthesized our design of the ACE tree using the same tools, cell library, and methodology that we used for synthesizing the OpenSPARC T1 modules, as described earlier in this section. Section 5.6 evaluates and quantifies the area overhead of the ACE framework, while Section 5.7 evaluates its power consumption.

5 EXPERIMENTAL EVALUATION

5.1 Basic Core Functional Testing

Before running the ACE testing firmware, we first run a software functional test to check the core for defects that would prevent the correct execution of the testing firmware. If this test does not report success in a timely manner to an independent auditor (i.e., the OS running on other cores), the test is repeated to verify that the failing cause was not transient. If the test fails again, then an irrecoverable core defect is assumed, the core is disabled, and the targeted tests are canceled.

The software functional test we used to check the core consists of three self-validating phases. The total size of the software functional test is approximately 700 dynamic instructions. To evaluate the effectiveness of the basic core test, we performed a stuck-at fault injection campaign on the gate-level netlist of a synthesized five-stage in-order core (similar to the SPARC core with the exception of multithreading support). Fig. 7 shows the distribution of the outcomes of the fault injection campaign. Overall, the basic core test successfully detected 62.14 percent of the injected faults. The remaining 37.86 percent of the injected faults lie in parts of the core's logic that do not affect the core's capability of executing simple programs such as the basic core test and the ACE testing firmware. The ACE testing firmware will subsequently test these untested areas of the design to provide full core coverage.

5.2 ACE Testing Latency, Coverage, and Storage Requirements

An important metric for measuring the efficiency of our technique is how long it takes to fully check the underlying hardware for defects. The latency of testing an ACE domain depends on 1) the number of ACE segments it consists of and 2) the number of test patterns that need to be applied. In this experiment, we generate test patterns for each individual ACE domain in the design using three different fault models (stuck-at, path-delay, and N-detect) and the methodology described in Section 4. Table 3 lists the number of test instructions needed to test each of the major modules in the design (based on the ACE firmware code shown in Fig. 4).

TABLE 3
Number of Test Instructions Needed to Test Each of the Major Modules in the Design

For the stuck-at fault model, the most demanding module is the SPARC core, requiring about 150 K dynamic test instructions to complete the test. Modules dominated by combinational logic, such as the SPARC core, the DRAM controller, the FPU, and the I/O bridge, are more demanding in terms of test instructions. On the other hand, the CPU-cache crossbar, which consists mainly of buffer queues and interconnect, requires far fewer instructions to complete the tests.

For the path-delay fault model, we generate test pattern sets for the critical paths that are within 5 percent of the clock period. The required number of test instructions to

complete the path-delay tests is usually less than or similar to that required by the stuck-at model.

For the N-detect fault model, the number of test instructions is significantly more than that needed for the stuck-at model. This is because many more test patterns are needed to satisfy the N-detect requirement. For values of N higher than four, we observed that the number of test patterns generated increases almost linearly with N, an observation that is aligned with previous studies [36], [17].

Full test coverage. The overall chip test coverage for the stuck-at fault model is 99.22 percent (shown in Table 3). The test coverage for the two considered N-detect fault models is slightly less than that of the stuck-at model, at 98.88 percent and 98.65 percent, respectively (not shown in Table 3 for simplicity).

Storage requirements for ATPG test patterns/responses. Table 4 shows the storage requirements for the ATPG test patterns and the associated test responses. The storage requirements are shown separately for each major module in the OpenSPARC T1 chip and for each fault model considered in this work. Note that since there is resource replication in the OpenSPARC T1 chip (e.g., there are eight SPARC cores and four DRAM controllers on the chip), only one set of test patterns/responses is required to be stored per resource. The least amount of test pattern storage is required by the path-delay fault model (1.34 MB), while the most demanding fault model is N-detect, where N = 4, which requires about 5 MB. The overall test pattern/response storage requirement for all modules and all fault models is 11.11 MB, which is similar to what is reported in previous work [34]. In our scheme, the test patterns and responses are stored in physical memory and loaded into the register file during the testing phase. Therefore, for physical memories of several gigabytes in modern processors, a storage requirement of 11 MB is considered negligible.

TABLE 4
Test Pattern/Response Storage Requirements per Fault Model and Design Module

5.3 Full-Chip Distributed Testing

In the OpenSPARC T1 architecture, the hardware testing process can be distributed over the chip's eight SPARC cores. Each core has an ACE tree that spans the core's resources and parts of the surrounding noncore modules (e.g., the CPU-cache crossbar, the DRAM controllers, etc.). Therefore, each core is assigned to test its resources and some parts of the surrounding noncore modules.

We distributed the testing responsibilities of the noncore modules to the eight SPARC cores based on the physical location of the modules on the chip (shown in Fig. 6). Table 5 shows the resulting distribution. The most heavily loaded pair of cores are cores two and four. Each of these two cores is responsible for testing its own resources, one-eighth of the CPU-cache crossbar, one-half of the DRAM controller, and one-half of the I/O bridge, for a total of 468 K dynamic test instructions (for both stuck-at and path-delay testing). The overall latency required to complete the testing of the entire chip is driven by these 468 K dynamic test instructions, since all the other cores have shorter test sequences and will therefore complete their tests sooner.

TABLE 5
Number of Test Instructions Needed by Each Core Pair in Full-Chip Distributed Testing: The Testing Process Is Distributed over the Chip's Eight SPARC Cores. Each core is assigned to test its resources and some parts of the surrounding noncore modules as shown in this table.

5.4 Performance Overhead of ACE Testing

In this section, we evaluate the performance overhead of ACE testing for the execution models described in Section 3.5. For all experiments, we set the checkpoint interval to 100 M instructions.

Single-threaded sequential ACE testing. With this execution model, at the end of each checkpoint interval, normal execution is suspended and ACE testing is performed. In these experiments, the ACE testing firmware executes until it reaches the maximum test coverage. The four bars in the graph of Fig. 8 show the performance overhead when the fault model used in ACE testing is

1. stuck-at,
2. stuck-at and path-delay,
3. N-detect (N = 2) and path-delay, and
4. N-detect (N = 4) and path-delay.

Fig. 8. Performance overhead of ACE testing for a 100 M instruction checkpoint interval.

The minimum average performance overhead of ACE testing is 5.5 percent and is observed when only the industry-standard stuck-at fault model is used. When the stuck-at fault model is combined with the path-delay fault model to achieve higher testing quality, the average performance overhead increases to 9.8 percent. As expected, when test pattern sets are generated using the higher quality N-detect fault model, the average performance overhead increases to 15.2 and 25.4 percent for N = 2 and N = 4, respectively.

Table 6 shows the trade-off between memory logging storage requirements and performance overhead for checkpoint intervals of 10 M, 100 M, and 1 B dynamic instructions. Both log size and performance overhead are averaged over all evaluated benchmarks. As the checkpoint interval size increases, the required log size increases, but the performance

overhead of ACE testing decreases. From this experiment, we conclude that checkpoint intervals on the order of hundreds of millions of instructions are sustainable with reasonable storage overhead, while providing an efficient substrate to perform ACE testing with low performance overhead.

TABLE 6
Memory Log Size and ACE Testing Performance Overhead for Different Checkpoint Intervals

SMT-based ACE testing. Fig. 9 shows the performance overhead when ACE testing is used in a 2-way SMT processor with several SPEC CPU2000 benchmarks. The ACE testing thread runs concurrently, on a separate SMT context, with the benchmark that is evaluated. In this execution model, when ACE testing checks for stuck-at failures, the average performance overhead is 2.6 percent, which is 53 percent lower than the 5.5 percent overhead observed when testing is performed in a single-threaded sequential execution environment. For other fault models, the observed results follow a similar trend: the performance overhead of SMT-based ACE testing is lower than the performance overhead of single-threaded sequential ACE testing. The performance overhead reduction observed under the SMT-based execution model stems from better processor resource utilization between the ACE testing thread and the running application. This is a consequence of the ACE testing thread simultaneously sharing the processor resources instead of executing sequentially and exclusively on the processor. The latency of major portions of ACE testing (loading and checking of test patterns) is hidden by application execution.

Fig. 9. Performance overhead of SMT-based ACE testing.

In SMT-based ACE testing, the testing thread occupies an SMT context. Although performing ACE-based testing in an SMT environment can reduce the potential performance overhead of testing, it is also important to evaluate the system throughput loss due to the testing thread, since the extra SMT context utilized by the testing thread could otherwise be utilized by another application thread. Fig. 10 shows the reduction in system throughput when the testing thread competes for processor resources with other threads in a 2-way and a 4-way SMT configuration. We define system throughput as the number of instructions per cycle executed by application threads (excluding the testing thread).

Fig. 10. System throughput reduction due to SMT-based ACE testing.

We observe that for stuck-at testing, the system throughput reduction in a 2-way SMT configuration is limited to 3 percent. The highest throughput reduction, 24 percent, is observed in a 2-way SMT configuration when high-quality testing is performed (N-detect, N = 4, in combination with the path-delay fault model). We also observe that when the number of SMT contexts increases to 4, the throughput reduction due to software-based testing reduces significantly. This is because ACE testing occupies only a single thread context in the SMT processor and other thread contexts can still contribute to system throughput by executing application threads.

Fig. 11. Performance overhead of interleaved ACE testing in the shadow of L2 cache misses.

Interleaved ACE testing in the shadow of L2 misses. Fig. 11 shows the performance overhead when ACE testing is run in the shadow of L2 cache misses. With this execution model, whenever there is an L2 cache miss on the

application thread, there is a lightweight context switch with the ACE testing thread. The application thread resumes execution after the L2 cache miss is served. In the case that the checkpoint buffering resources are full (signaling the end of the checkpoint interval) and ACE testing is not completed, the ACE testing thread starts running exclusively on the processor resources and executes the remainder of the ACE testing routine to completion. The dark part of each bar in Fig. 11 shows the fraction of ACE testing overhead that is due to testing performed in the shadow of L2 cache misses, while the gray part shows the fraction of ACE testing overhead that is due to testing performed at the end of the checkpoint interval. The overhead of testing that is performed in the shadow of L2 cache misses is caused by the additional time taken to switch between the application thread and the ACE testing thread, and vice versa.

We observe that for some memory-intensive benchmarks that exhibit a high L2 cache miss rate, such as ammp and mcf, the ACE testing routine was able to run in its entirety in the shadow of L2 cache misses. For these benchmarks, we observe an average performance overhead reduction of 57 and 43 percent, respectively, compared to single-threaded sequential ACE testing. However, for the rest of the benchmarks, we noticed that due to the low L2 cache miss rate, there were very few opportunities to execute the ACE testing thread in the shadow of L2 cache misses. These benchmarks, depending on the amount of ACE testing performed in the shadow of L2 cache misses, exhibit the same or slightly lower performance overhead when compared to single-threaded sequential ACE testing.

Based on these experimental results, we conclude that the interleaved ACE testing execution model benefits only benchmarks that exhibit a high enough L2 cache miss rate and provide enough opportunities for interleaved ACE testing to utilize the processor resources more efficiently. Different thread interleaving criteria other than L2 cache misses could lead to higher benefits and affect all benchmarks more uniformly. However, the overhead of switching between the application thread and the ACE testing thread should be kept low. We leave the design and investigation of such criteria and low-overhead context switching to future work.

5.5 Overhead of ACE Testing in I/O-Intensive Applications

In I/O-intensive applications, frequent I/O operations significantly affect the performance overhead of checkpoint-based system rollback and recovery. Several system I/O operations are not reversible (e.g., sending a packet to a network interface, writing to the display, or writing to the disk), and thus cause early checkpoint termination. Consequently, frequent I/O operations lead to shorter checkpoint intervals and more frequent hardware testing that can have a negative impact on system performance. This section evaluates the performance overhead of ACE testing under a heavy I/O usage environment using I/O-intensive filesystem and network processing benchmarks.

Fig. 12 shows the execution time overhead of ACE testing for the stuck-at fault model and the stuck-at combined with the path-delay fault model. Except for three of the Netperf benchmarks, all benchmarks exhibit an execution time overhead that ranges from 4 to 10 percent for the stuck-at fault model and from 6 to 17 percent when combined with the path-delay fault model. Note that the overheads are very high (greater than 25 percent) in some Netperf benchmarks because these benchmarks are intentionally designed to stress-test the network interface by executing a very tight loop that continuously sends and receives packets to/from the network interface. Even with these adversarial benchmarks, the performance overhead of ACE testing is at most 27 percent with the stuck-at fault model and 48 percent with the combined stuck-at and path-delay fault models.

Fig. 12. Execution time overhead of ACE testing on I/O-intensive filesystem and networking applications.

In this experiment, a checkpoint terminates whenever there is a write operation to the filesystem or a send/receive operation to the network interface (i.e., an irrecoverable I/O operation). This assumption is pessimistic. The execution time overhead observed in this experiment can be significantly reduced with more aggressive and intelligent I/O handling techniques like I/O buffering [39] or I/O speculation [40], which we do not consider in this work. Furthermore, we note that heavily I/O-intensive applications, such as the Netperf benchmarks, constitute an unfavorable running environment for the ACE testing technique for two reasons. First, if high performance is desired when running such I/O-intensive applications, the system can alternatively reduce the test quality requirements of ACE testing (or even switch it off completely) and trade off testing quality with performance. Second, we note that such I/O-intensive applications have very low CPU utilization; therefore, there might be little need for high-quality, high-coverage ACE testing of the CPU during their execution.

5.6 ACE Tree Implementation and Area Overhead

The area overhead of the ACE framework is dominated by the ACE tree. In order to evaluate this overhead, we implemented the ACE tree for the OpenSPARC T1 architecture in Verilog and synthesized it with the Synopsys Design Compiler. Our ACE tree implementation consists of data movement nodes that transfer data from the tree root (the register file) to the tree leaves (ACE segments) and vice versa. In our implementation, each node has four children; therefore, in an ACE tree that accesses 32 kilobits (about 1/8 of the OpenSPARC T1 architecture), there are 42 internal tree nodes and 128 leaf nodes, where each leaf node has four 64-bit ACE segments as children. Fig. 13a shows the topology of this ACE tree configuration, which has the ability to directly access any of the 32 kilobits. To cover the whole OpenSPARC T1 chip with the ACE framework, we used eight such ACE trees, one for each SPARC core. The overall area overhead of this ACE framework configuration (for all eight trees) is 18.7 percent of the chip area.
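The node counts quoted above follow directly from the 4-ary tree geometry. The short sketch below reproduces them; it is an illustrative reading of the numbers that assumes the tree root (the register file) is not itself counted as an internal tree node.

```python
# Illustrative check of the ACE tree geometry described above: a 4-ary
# tree over 32 kilobits, with 64-bit ACE segments and leaf nodes that
# each drive four segments. Assumes the root (the register file) is
# not counted as a tree node.

def ace_tree_geometry(total_bits=32 * 1024, segment_bits=64,
                      segments_per_leaf=4, fanout=4):
    segments = total_bits // segment_bits        # 512 ACE segments
    leaves = segments // segments_per_leaf       # 128 leaf nodes
    internal, level = 0, leaves
    # Walk up the tree one level at a time until the remaining nodes
    # attach directly to the register-file root.
    while level > fanout:
        level //= fanout
        internal += level                        # 32, then 8, then 2
    return segments, leaves, internal

print(ace_tree_geometry())   # (512, 128, 42)
```

Under this reading, the 42 internal nodes come from the three levels of 32, 8, and 2 data movement nodes between the 128 leaves and the register-file root, and eight such trees cover the whole chip.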


resulting gate-level netlist is subsequently analyzed by the


Power Compiler to estimate the module’s power consump-
tion. To perform the synthesis and power consumption
analysis, we used the Artisan IBM 130 nm standard cell
library, characterized at typical conditions of 1.2 V (Vdd)
and 25 C average temperature. The average transistor
switching activity factor was set to 0.5.
For modules dominated by SRAM structures, such as the
on-chip caches, where logic synthesis and power analysis
Fig. 13. ACE tree implementation: (a) Topology of a direct access ACE using the RTL code is inefficient,3 we used existing tools
tree. (b) Topology of a hybrid (partial direct access, partial scan chain) designed specifically to characterize SRAM modules. To
ACE tree. estimate the power consumption of the L1 and L2 caches,
we used the CACTI 4.2 tool [57], a tool with integrated
cache performance, area, and power models.
This methodology is sufficient enough to estimate the
power consumption of most of the chip’s logic modules.
However, there are parts of the design whose power
consumption cannot be accurately estimated with these
tools. These include 1) numerous buses, wires, and repeaters
distributed all over the design, which are very hard to model
accurately using the Design and Power Compilers, unless
the design is fully placed and routed, 2) I/O pads of the chip.
In order to estimate the power consumption of these two
Fig. 14. Power consumption overhead of the ACE framework: (a) All parts, we used values from the reported power envelope of
major design components and the methodology/tools used to estimate the commercial Sun UltraSPARC T1 design [32].
the associated power consumption. (b) The power envelope of the Results. The estimated power envelope for the whole
OpenSPARC T1 design enhanced with the ACE framework.
OpenSPARC T1 chip without the addition of the ACE
framework is 56.3 W.4 Fig. 14b shows the power consump-
In order to contain the area overhead of the ACE tion for our enhanced OpenSPARC T1 design including the
framework, we propose a hybrid ACE tree implementation ACE framework. The power envelope of the ACE-en-
that combines the direct processor state accessibility of the hanced design is 58.5 W, where the power consumption of
previous implementation with the existing scan-chain the ACE framework is estimated to be 2.2 W. Thus, the
structure. In this hybrid approach, we divide the 32 K ACE framework consumes 4 percent of the design’s total
ACE-accessible bits into sixty-four 512-bit scan chains. Each scan chain has 64 bits that can be directly accessed through the ACE tree. The reading/writing to the rest of the bits in the scan chain is done by shifting the bits to/from the 64 directly accessible bits. Fig. 13b shows the topology of the hybrid ACE tree configuration. The overall area overhead of the ACE framework when using the hybrid ACE tree configuration is 5.8 percent of the chip area.2

5.7 Power Consumption Overhead of the ACE Framework

An important consideration in evaluating the ACE framework is the degree to which the extra hardware increases the baseline design's power consumption envelope. To evaluate this power consumption overhead for our design on Sun's OpenSPARC T1 chip multiprocessor, we first estimated the power consumption of the baseline design that lacks the ACE framework capabilities. We calibrated the estimated power consumption with actual power consumption numbers provided by Sun for each module of the chip [32]. After we validated our power estimates for the baseline OpenSPARC T1 design, we estimated the additional power required by the ACE framework.

Power estimation methodology. Fig. 14a shows the major design components of the OpenSPARC T1 and the methodology/tools we used to estimate their power consumption. We estimated the power consumption of the majority of OpenSPARC T1 modules using the Synopsys Power Compiler (part of the Synopsys Design Compiler package) and the available RTL code for the design. Each module's RTL code is synthesized using the Design Compiler. The [...] power. Our estimation assumes that the ACE framework is enabled all the time while the chip is in operation. However, as illustrated in the previous sections, the ACE framework is actually used during very short testing periods at the end of each checkpoint interval. Therefore, we expect the actual power consumption and power envelope overhead of the ACE framework to be significantly lower than 4 percent, depending on the frequency and length of testing (i.e., checkpoint interval size and time spent in testing).

6 OTHER APPLICATIONS OF THE ACE FRAMEWORK

We believe that the ACE framework is a general framework that can be used in several other applications to amortize its hardware cost. We have recently shown that the ACE framework can be utilized for the flexible detection of hardware design bugs during online operation [11]. In this section, we describe how the ACE framework can be used in two other possible applications: post-silicon debugging and manufacturing testing.

6.1 ACE Framework for Post-silicon Debugging

Post-silicon debugging is an essential and highly resource-demanding phase that is on the critical path of the microprocessor development cycle. Following product tape-out (i.e., fabrication of the microprocessor design into a silicon die), the post-silicon debugging phase checks if the

2. We found that the ACE tree's impact on the processor's clock cycle time is negligible in both direct access and hybrid implementations.
3. In logic synthesis, memory elements are synthesized into either latches or flip-flops. Therefore, SRAM macrocells are implemented using memory compilers instead of using the conventional logic synthesis flow.
4. Our estimate of the OpenSPARC T1 power is within 12 percent of the reported power consumption of the commercial Sun Niagara design [32].
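The duty-cycle argument above can be made concrete with a short calculation. The sketch below uses the 4 percent envelope figure quoted in this section, but the checkpoint-interval and test-length values are hypothetical stand-ins, not measurements from this work:

```python
# Back-of-the-envelope illustration of why duty-cycled testing shrinks the
# effective power overhead. The 4 percent figure is the envelope estimate
# quoted above; the interval and test lengths are hypothetical assumptions.

def effective_overhead(envelope_overhead, test_time, interval):
    """Time-averaged overhead when testing runs test_time out of every interval."""
    return envelope_overhead * (test_time / interval)

# E.g., 30 ms of ACE testing at the end of each 1,000 ms checkpoint interval:
avg = effective_overhead(0.04, 30.0, 1000.0)
print(f"{avg * 100:.2f}%")  # prints "0.12%", far below the 4% envelope
```

With testing active for only a few percent of each checkpoint interval, the time-averaged overhead drops to a small fraction of the 4 percent worst case, which is the point the section makes.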

Authorized licensed use limited to: Carnegie Mellon Libraries. Downloaded on September 11, 2009 at 22:02 from IEEE Xplore. Restrictions apply.
1074 IEEE TRANSACTIONS ON COMPUTERS, VOL. 58, NO. 8, AUGUST 2009

actual physical design of the product meets all the performance and functionality specifications as they were defined in the design phase. The goal of post-silicon debugging is to find all design errors, also known as design bugs, and to eliminate them through design changes or other means before selling the product to the customer [24], [22], [25].

The first phase of post-silicon debugging is to run extended tests to validate the functional and electrical operation of the design. The validation content commonly consists of focused software test programs written to exercise specific functionalities of the design, or randomly generated tests that exercise different parts of the design. We refer to these test programs as the validation test suite. These tests are applied under different operating conditions (i.e., voltage, clock frequency, and temperature) in order to electrically characterize the product. When the observed behavior diverges from the expected prespecified correct behavior (i.e., when a failure is found), further investigation is required by the post-silicon debugging team. During a failure investigation, the post-silicon debug engineer tries to 1) isolate the failure, 2) find the root cause of the failure, and 3) fix the failure, using features hardwired into the design to support debugging as well as tools external to the design [24].

Motivation. The trends of higher device integration into a single chip and the high complexity of modern processor designs make the post-silicon debugging phase a significantly costly process, both in terms of resources and time. For modern processors, the post-silicon debugging phase can easily cost $15-20 million and take six months to complete [16]. The post-silicon debugging phase is estimated to take up to 35 percent of the chip design cycle [8], resulting in a lengthy time-to-market. As the level of device integration continues to rise and the complexity of modern processor designs increases [15], this problem will be exacerbated, leading to either 1) very expensive and long post-silicon debugging phases, which would adversely affect the processor cost and/or time-to-market, or 2) more buggy designs being released to the customers due to poor post-silicon debugging [61], [46], which would likely increase the fraction of chips that fail in the field.

There are two major challenges in the post-silicon debugging of modern highly integrated processors. First, because the internal signals of the microarchitecture have limited observability to the testing software, it is difficult to isolate a failure and find its root cause. Second, because the hardware design is not easily or flexibly alterable by the post-silicon debug engineer, it is difficult to evaluate whether or not a potential fix to the design eliminates the cause of the failure [25]. Existing techniques that are used to address these two challenges are not adequate, as briefly explained below.

Traditional techniques used to address the limited signal observability problem are built-in scan chains [62], [25] and optical probing tools [63]. Unfortunately, both have significant shortcomings. The use of built-in scan chains to monitor internal signals is very slow due to the serial nature of external scan testing [19]. The effectiveness of optical probing tools reduces with each technology generation, as direct probing becomes very difficult, if not impossible, with more metal layers and smaller devices [60]. Furthermore, it is very hard to integrate these two techniques into an automated post-silicon debugging environment [60].

The traditional technique used to evaluate design fixes is the Focused Ion Beam (FIB) [24] technique, which temporarily alters the design by physically changing the metal layers of the chip. Unfortunately, FIB is limited in two ways. First, FIB typically can only change the metal layers of the chip and cannot create any new transistors. Therefore, some potential design fixes are not possible to make or evaluate using this technology. Second, FIB's effectiveness is projected to diminish with further technology scaling, as access to the lower metal layers is becoming increasingly difficult due to the introduction of more metal layers in modern designs [8], [24].

Recently proposed mechanisms try to address the limitations of these traditional techniques. Specifically, recently proposed solutions suggest the use of reconfigurable programmable logic cores and flexible on-chip networks that improve both signal observability and the ability to temporarily alter the design [43]. However, these solutions have considerable area overheads [43] and still do not provide complete accessibility to all of the processor's internal state [43].

Solution—ACE framework for post-silicon debugging. The ACE framework can be an effective low-overhead framework that provides the post-silicon debug engineers with full accessibility and controllability of the processor's internal microarchitectural state at runtime. This capability can be helpful to post-silicon debug engineers in isolating design bugs and finding their root causes. Furthermore, once a design bug is isolated and its causes have been identified, the ACE framework can be used to dynamically overwrite the microarchitectural state and thus emulate a potential hardware fix. This allows the debug engineer to quickly observe the effects of a potential design fix and verify its correctness without any physical hardware modification.

Specifically, the event that triggers a failure investigation by a post-silicon debug engineer is an incorrect design output during the execution of the validation test suite. However, by just observing the incorrect output, it is very hard to pinpoint the root cause of the failure. Therefore, further debugging of the failure is required. The first step in this process is the reproduction of the conditions under which the failure occurred. Once the failure is reproduced, debugging tools can be used to analyze the design's internal state and pinpoint the design bug. This is where the ACE firmware could be very useful to a post-silicon debug engineer. The debug engineer can run the ACE firmware as an independent thread (called the ACE debugging thread) that runs in conjunction with the validation test thread to identify the root cause of the failure and evaluate a potential design fix. We first describe the required extensions to the ACE framework to support post-silicon debugging using the ACE firmware, then provide a detailed example of how the debug engineer uses the ACE framework.

ACE instructions for post-silicon debugging. Table 7 shows the ACE instruction set extensions that enable the synchronization between the validation test thread and the ACE debugging thread.

The ACE_pause instruction pauses the execution of the running validation test thread after it is executed for a given number of clock cycles and switches execution to the ACE debugging thread. The execution switch between the validation test thread and the ACE debugging thread is scheduled by setting an interrupt counter to the parameter value of the ACE_pause instruction. This interrupt counter decrements every clock cycle during the execution of the validation test thread. Once the counter becomes zero, the processor state and scan state get swapped, thus taking a snapshot of the running microarchitectural state of the


validation testing thread into the scan state. In the same clock cycle, execution is switched to the ACE debugging thread.

The ACE_return instruction returns execution from the ACE debugging thread to the validation testing thread and swaps the scan state with the processor state in order to restore the microarchitectural state of the validation test thread.

Post-silicon debugging example using the ACE framework. Fig. 15 shows an example of possible ACE firmware written to perform post-silicon debugging. Suppose that the debug engineer runs a validation test program that fails after 10,000 cycles of execution, and the validation engineer suspects that the bug is in the third ACE domain of the core. Fig. 15 shows the pseudocode of the ACE firmware written to analyze such a failure. The first portion of the code (Fig. 15-left) pauses the execution of the validation test program at the desired clock cycle; the second portion (Fig. 15-middle) allows the debug engineer to single-step the execution by one cycle to observe state changes. Based on the information obtained by running these portions of the code, the engineer devises a possible fix. The third portion of the code (Fig. 15-right) is used by the engineer to evaluate whether or not the design fix would result in correct execution. We describe each code portion of the ACE firmware in detail below.

The debugging process starts with the execution of the ACE debugging firmware thread (Fig. 15-left). In this thread, the first instruction is an ACE_pause instruction that sets the interrupt counter to the clock cycle in which detailed debugging is desired by the post-silicon debug engineer. In the example shown in Fig. 15, the validation test is set to be interrupted at clock cycle 10,000 (assuming that this is the phase of the validation test where the post-silicon debug engineer suspects that the first error occurs). The ACE_pause instruction is followed by an ACE_return instruction. ACE_return switches execution from the ACE debugging thread to the validation test thread, and thus, the validation test program's execution begins.

After 10,000 cycles into the execution of the validation test thread, the validation test thread is interrupted. At this point, 1) processor state is swapped with the scan state and 2) execution is switched from the validation test thread to the ACE debugging thread. Once execution is transferred to the ACE debugging thread, the post-silicon engineer uses the ACE framework to investigate the microarchitectural state of the validation test thread during clock cycle 10,000 (which is stored in the scan state). The example scenario in Fig. 15 assumes that the suspected bug is in the third ACE domain of the core. The ACE_get instruction reads the third ACE domain's microarchitectural state and prints it to the debugging console. We assume that the domain's microarchitectural state is checked by the debug engineer and is found to be error free. Therefore, the debug engineer decides to check the domain's state in the next clock cycle. In order to step the execution of the validation test thread for one clock cycle, the interrupt counter is set to one using the ACE_pause instruction, and the validation test thread's execution is resumed with the execution of the ACE_return instruction (Fig. 15-middle).

After one clock cycle of validation test execution, control is transferred again to the ACE debugging thread and the domain's new microarchitectural state is checked by the debug engineer. After inspecting the domain's microarchitectural state, the debug engineer finds that the third bit of the domain's sixth segment is a control signal that should be a zero, but instead, it has the value of one. Thus, the engineer pinpoints the root cause of the failure.

In order to verify that this is the only design bug that affects the execution of the validation test thread, and that fixing the specific control signal does not cause any other erroneous side effects, the debug engineer modifies the domain's microarchitectural state and sets the control signal to its correct value using the ACE_set instruction (Fig. 15-right). Assuming that the whole validation test takes 100,000 clock cycles to execute, the debug engineer sets the next debugging interrupt to occur after 90,000 clock cycles, which is right after the completion of the validation test. At this point, the execution is transferred to the validation test thread, which runs uninterrupted to completion. After completion, the debug engineer checks the final output to verify that the potential design bug fix led to the correct output and there were not any erroneous side effects due to the introduction of the bug fix. In the case that the final output is incorrect, a new failure investigation starts from the beginning and the debug engineer writes another piece of firmware to investigate the failure.

We would like to note the analogy between ACE framework-based post-silicon debugging and conventional software debugging. The ACE_pause instruction is analogous to setting a breakpoint in software debugging. ACE_return is analogous to the low-level mechanism that allows switching from the debugger to the main program code. Examining the state of the processor and stepping hardware execution for one cycle are analogous to examining the state of program variables and single stepping in software debugging. Finally, the ACE framework's ability to modify the state of the processor while the test program is running is analogous to a software debugger's ability to modify memory state during the execution of a software program that is debugged. We note that, similar to a software debugging program, a graphical interface can be designed to encapsulate the post-silicon debugging commands to ease the use of ACE firmware for post-silicon debugging.

Advantages. The results of the detailed debugging process, demonstrated by the above example, are sometimes

TABLE 7
Additional ACE Instruction Set Extensions for Post-silicon Debugging

Fig. 15. Example of ACE firmware pseudocode used for post-silicon debugging.


achievable using traditional post-silicon debugging techniques that were described previously. However, the use of the ACE framework provides a promising post-silicon debugging tool that can ease, shorten, and reduce the cost of the post-silicon design process. The main advantages of ACE framework-based post-silicon debugging are the following:

1. It eases the debugging process: ACE framework-based debugging is closer to software, very similar to the software debugging process, and therefore, is trivial to understand and use by the debug engineer. This ease in debugging is achieved by providing complete accessibility and controllability of the hardware state to the debug engineer.
2. It can test potential design bug fixes without physically and permanently modifying the underlying hardware. This reduces both the cost and difficulty of post-silicon debugging by reducing the manual labor involved in fixing the design bugs.
3. It can accelerate the post-silicon debugging process because it does not require very slow procedures such as scan-out of the whole microarchitectural state, or manual modification of the underlying hardware using the aforementioned FIB technique to evaluate potential design fixes.

6.2 ACE Framework for Manufacturing Testing

Manufacturing testing is the phase that follows chip fabrication and screens out parts with defective or weak devices. Today, most complex microprocessor designs use scan chains as the fundamental design for test (DFT) methodology. During the manufacturing testing phase, the design's scan chains are driven by external automatic test equipment (ATE) that applies pregenerated test patterns to check the chip under test [7]. During this phase, every single chip has to go through the testing process multiple times at different voltage, temperature, and frequency levels. Therefore, the manufacturing testing cost for each chip can be as high as 25-30 percent of the total manufacturing cost [19].

Motivation. Although this testing methodology has served the semiconductor industry well for the last few decades, it has started to face an increasing number of challenges due to the exponential increase in the complexity of modern microprocessors [15], a product of continuous silicon process technology scaling.

Specifically, the external ATE testers have a limited number of channels to drive the design's scan chains due to package pin limitations [19]. Furthermore, the speed of test pattern loading is limited by the maximum scan frequency, which is usually much lower than the chip's operating frequency [19], [7]. The limited throughput of the scan interface between the external tester and the design under test constitutes the main bottleneck. These limitations, in combination with the larger set of test patterns required for testing modern multimillion-gate designs, lead to longer time spent on the tester per chip. Even today, the amount of time a chip spends on a tester can be several seconds [19]. Considering that the amortized testing cost of high-end test equipment is estimated at thousands of dollars per hour [5], [19], the conventional manufacturing testing process can be very cost-ineffective for microprocessor vendors.

Alternative solutions. Logic BIST is a testing methodology based on pseudorandom test pattern generation and test response compaction. To speed up manufacturing testing, logic BIST techniques use the scan infrastructure to apply the on-chip pseudorandomly generated test patterns and employ specialized hardware to compact the test responses [7]. Furthermore, the control signals used for testing are driven by an on-chip test controller. Therefore, a clear advantage of logic BIST over the traditional manufacturing testing methodology is that it significantly reduces the amount of data that is communicated between the tester and the chip. This leads to shorter testing times and, as a result, lower testing cost. Logic BIST also allows the manufacturing test to be performed at-speed (i.e., at the chip's normal operating frequency rather than the frequency of the automatic test equipment), which improves both the speed and quality of testing.

Although logic BIST addresses major challenges of the traditional manufacturing testing methodology, it also imposes some new challenges. First, logic BIST requires the on-chip storage of a very large amount of pseudorandomly generated test patterns. Second, because logic BIST uses pseudorandomly generated test patterns, it often provides significantly lower fault coverage than that provided by a much smaller number of high-quality, ATPG-pregenerated test patterns [7]. Third, the use of the logic BIST methodology requires significantly more stringent design rules than conventional manufacturing testing [19]. For example, bus conflicts must be eliminated and the circuit must be made random-pattern testable [19]. Therefore, logic BIST techniques significantly increase both the hardware cost and the design complexity, while resulting in lower test coverage.

Proposed solution—use of the ACE framework for manufacturing testing. The ACE infrastructure incorporates the advantages of both the scan-based and logic BIST testing methodologies, while it can also effectively address their limitations. Specifically, the ACE infrastructure provides two capabilities that are not together present in previous manufacturing testing techniques. First, the ACE framework is a built-in solution for fast loading of high-quality pregenerated ATPG test patterns into the scan-chain structures through software. This capability can eliminate the need for the expensive and slow external equipment currently needed for test pattern loading. Second, the ACE framework allows the test patterns to be loaded and applied at-speed, at the chip's normal operating frequency rather than the much slower operating frequency of the automatic test equipment, which results in higher quality testing.

With these two capabilities, the ACE framework provides the best of both existing manufacturing testing techniques: 1) fast loading of test patterns to reduce testing time, 2) at-speed testing of the chip to improve testing quality as well as to reduce testing time, and 3) testing with ATPG-pregenerated test patterns, rather than pseudorandomly generated test patterns, to improve testing quality. Thus, if employed by future integrated circuit manufacturing testing methodologies, it can greatly improve the speed, cost, and test coverage of the costly manufacturing testing phase of the microprocessor development cycle.

7 RELATED WORK

Hardware-based reliability techniques. The previous work most closely related to this work is [50]. In [50], we proposed a hardware-based technique that utilizes microarchitectural checkpointing to create epochs of execution during which on-chip distributed BIST-like testers validate the integrity of the underlying hardware.


To lower silicon cost, the testers were customized to the tested modules. However, this leads to increased design complexity because a specialized tester needs to be designed for each module.

A traditional defect detection technique that is predominantly used for manufacturing testing is logic BIST [7]. Logic BIST incorporates pseudorandom pattern generation and response validation circuitry on the chip. Although on-chip pseudorandom pattern generation removes any need for pattern storage, such designs require a large number of random patterns and often provide lower fault coverage than ATPG patterns [7].

This work improves on these previous works for the following major reasons:

1. It effectively removes the need for on-chip test pattern generation and validation circuitry and moves this functionality to software;
2. It is not hardwired in the design, and therefore, has ample flexibility to be modified/upgraded in the field;
3. It has higher test coverage and shorter testing time because it uses ATPG instead of pseudorandomly generated patterns;
4. In contrast to [50], it can uniformly be applied to any microprocessor module with low design complexity because it does not require module-specific customizations; and
5. It provides wider coverage across the whole chip, including noncore modules.

Software-based reliability techniques. A very recent approach proposes the detection of silicon defects by employing low-overhead detection strategies that monitor for simple software symptoms at the operating system level [33]. These software-based detection techniques rely on the premise that silicon defects manifested in some microarchitectural structures have a high probability (95 percent) of propagating detectable symptoms through the software stack to the operating system [33].

The main differences between [33] and our work are: 1) unlike the probabilistic software symptom-based defect detection, our technique checks the underlying hardware in a deterministic process through a structured high-quality test methodology with very high fault coverage (99 percent) and can be executed on demand; 2) software symptom-based defect detection techniques can flag the possible existence of a hardware failure, but they do not have the capability to diagnose which part of the underlying hardware is defective. In our technique, by employing ATPG test patterns, it is trivial to diagnose the defective device at a very fine granularity.

Instruction-based functional testing. A large amount of work has been performed in functional testing [6], [26], [31] of microprocessors. The most relevant of these to our approach are the instruction-based functional self-test techniques. In general, these techniques apply randomly generated or automatically selected instruction sequences and/or combinations of instruction sequences and randomly or automatically generated operands to test for hardware defects. If the result of the test sequence does not match the expected output of the instruction sequence, then a hardware fault is declared.

We briefly describe the state-of-the-art approaches that work in this manner. In [58], a self-test program written in processor assembly language and the expected results of the program are stored in on-chip ROM memory. When invoked, the self-test program performs at-speed functional testing of the processor. The proposed scheme requires very little additional hardware cost. It requires an LFSR for generating randomized operands for test instructions and an MISR for generating the result signature. Also, a minor modification of the ISA is required for the test instructions to read/write from the LFSR/MISR. Similarly, Kranitis et al. [29] use the knowledge of the ISA and the RTL-level model of a processor to select high fault coverage instructions and their operands to include in self-test software routines. Batcher and Papachristiou [3] employ instruction randomization hardware to generate randomized instructions to be used in self-test software routines for functional testing. Brahme and Abraham [6] describe how to generate randomized instruction sequences to be used in self-test software routines. Building upon these works, Chen and Dey [9] propose a mechanism that generates instruction sequences to exercise structural test patterns designed to test processor components and applies such instruction sequences in the software-based self-test routines to achieve higher coverage than other approaches that randomly generate instruction sequences.

Our technique is fundamentally different from these instruction-based functional testing techniques in that it is a structural testing approach that uses software routines to apply test patterns. We introduce new instructions that are capable of applying high-quality, ATPG-generated structural test patterns to every processor segment by exposing the scan chain to the instruction set architecture. Software self-test routines that use these instructions can therefore directly apply test patterns to processor structures and read test responses, which results in the fast and high-coverage structural testing of each processor component. In contrast, none of the previously proposed instruction-based functional testing techniques are capable of directly applying test patterns to processor components. Instead, they execute existing ISA instruction sequences to indirectly (functionally) test the hardware for faults. As such, previous instruction-based functional test approaches, in general, lead to higher testing times or lower fault coverage since they rely on (randomized) functional testing.

One recent previous work [41] employed purely software-based functional testing techniques during the manufacturing testing of the Intel Pentium 4 processor. In our approach, we use a similar functional testing technique (our "basic core functional test" program) to check the basic core functionality before running the ACE firmware to perform directed, high-quality testing. In fact, any of the previously proposed instruction-based functional testing approaches can be used as the basic core functional test within the ACE framework.

8 SUMMARY AND CONCLUSIONS

We introduced a novel, flexible software-based technique, ISA extensions, and microarchitecture support to detect and diagnose hardware defects during online operation of a chip multiprocessor. Our technique uses the Access-Control Extension (ACE) framework that allows special ISA instructions to access and control virtually any part of the processor's internal state. Based on this framework, we proposed the use of special firmware that periodically suspends the processor's execution and performs high-quality testing of the underlying hardware to detect defects. We described several execution models for the interaction of the special testing firmware with the applications running on the processor for both single-threaded and multithreaded processing cores.
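The LFSR/MISR pair described for [58] in the related-work discussion above is the standard pseudorandom-generation/response-compaction arrangement, and a minimal software model of it is sketched below. The 16-bit width, the 0xB400 feedback taps, and the toy "unit under test" are illustrative choices, not details of [58] or of this work:

```python
# Software model of the self-test structures used by schemes such as [58]:
# an LFSR supplies pseudorandom operands for test instructions, and a MISR
# compacts the stream of results into a single signature. The width and
# feedback taps below are illustrative, not taken from any real design.

WIDTH = 16
TAPS = 0xB400  # a maximal-length Galois-LFSR polynomial for 16 bits
MASK = (1 << WIDTH) - 1

def lfsr_step(state):
    """One Galois-LFSR step: shift right; XOR in the taps if a 1 fell out."""
    lsb = state & 1
    state >>= 1
    if lsb:
        state ^= TAPS
    return state & MASK

def misr_step(signature, value):
    """Fold one test result into the running signature."""
    return (lfsr_step(signature) ^ value) & MASK

def run_selftest(unit, n, seed=0xACE1):
    """Drive `unit` with n LFSR operands and return the compacted signature."""
    state, signature = seed, 0
    for _ in range(n):
        state = lfsr_step(state)
        signature = misr_step(signature, unit(state))
    return signature

good = lambda x: (x * 3) & MASK   # a stand-in functional unit
bad = lambda x: good(x) ^ 1       # same unit with an inverted output line
print(hex(run_selftest(good, 100)), hex(run_selftest(bad, 100)))
```

Because the MISR compacts many results into one word, a faulty unit can in principle alias to the fault-free signature; this compaction/coverage trade-off is one reason the text contrasts such schemes with directly applied ATPG patterns.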


Using a commercial ATPG tool and three different fault models, we experimentally evaluated our ACE testing technique on a commercial chip multiprocessor design based on Sun's Niagara. Our experimental results showed that ACE testing is capable of performing high-quality hardware testing for 99.22 percent of the chip area. Based on our detailed RTL implementation, implementing the ACE framework requires a 5.8 percent increase in Sun Niagara's chip area and a 4 percent increase in its power consumption envelope.

We demonstrated how ACE testing can seamlessly be coupled with a coarse-grained checkpointing and recovery mechanism to provide a complete defect tolerance solution. Our evaluation shows that, with coarse-grained checkpoint intervals, the average performance overhead of ACE testing is only 5.5 percent. Our results also show that the software-based nature of ACE testing provides ample flexibility to dynamically tune the performance-reliability trade-off at runtime based on system requirements.

We also described how the ACE framework can be used to improve the quality and reduce the cost of two critical phases of microprocessor development: post-silicon debugging and manufacturing testing. Our descriptions showed that the flexibility provided by the ACE framework can significantly ease and accelerate the post-silicon debugging process by making the microarchitecture state easily accessible and controllable by the post-silicon debug engineers. Similarly, the flexibility of the ACE framework can eliminate the need for expensive automatic test equipment or costly yet lower coverage hardware changes (e.g., logic BIST) needed for manufacturing testing. We conclude that the ACE framework is a general framework that can be used for multiple purposes to enhance the

[9] L. Chen and S. Dey, "Software-Based Self-Testing Methodology for Processor Cores," IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, vol. 20, no. 3, pp. 369-380, Mar. 2001.
[10] K. Constantinides, J. Blome, S. Plaza, B. Zhang, V. Bertacco, S. Mahlke, T. Austin, and M. Orshansky, "BulletProof: A Defect-Tolerant CMP Switch Architecture," Proc. 12th Int'l Symp. High Performance Computer Architecture (HPCA-12), 2006.
[11] K. Constantinides, O. Mutlu, and T. Austin, "Online Design Bug Detection: RTL Analysis, Flexible Mechanisms, and Evaluation," Proc. 41st Ann. Int'l Symp. Microarchitecture (MICRO-41), 2008.
[12] K. Constantinides, O. Mutlu, T. Austin, and V. Bertacco, "Software-Based Online Detection of Hardware Defects: Mechanisms, Architectural Support, and Evaluation," Proc. 40th Ann. Int'l Symp. Microarchitecture (MICRO-40), 2007.
[13] W.J. Dally, L.R. Dennison, D. Harris, K. Kan, and T. Xanthopoulos, "The Reliable Router: A Reliable and High-Performance Communication Substrate for Parallel Computers," Proc. Parallel Computer Routing and Comm. Workshop (PCRCW), 1994.
[14] N. Durrant and R. Blish, "Semiconductor Device Reliability Failure Models," http://www.sematech.org/, 2000.
[15] M.J. Flynn and P. Hung, "Microprocessor Design Issues: Thoughts on the Road Ahead," IEEE Micro, vol. 25, no. 3, pp. 16-31, May/June 2005.
[16] R. Goering, "Post-Silicon Debugging Worth a Second Look," Electronic Eng. Times, Feb. 2007.
[17] R. Guo, S. Mitra, E. Amyeen, J. Lee, S. Sivaraj, and S. Venkataraman, "Evaluation of Test Metrics: Stuck-At, Bridge Coverage Estimate and Gate Exhaustive," Proc. Very Large Scale Integration (VLSI) Test Symp. (VTS), 2006.
[18] P. Gupta and A.B. Kahng, "Manufacturing-Aware Physical Design," Proc. Int'l Conf. Computer-Aided Design (ICCAD), 2003.
[19] G. Hetherington, T. Fryars, N. Tamarapalli, M. Kassab, A. Hassan, and J. Rajski, "Logic BIST for Large Industrial Designs: Real Issues and Case Studies," Proc. Int'l Test Conf. (ITC), Sept. 1999.
[20] NetPerf: A Network Performance Benchmark. Hewlett-Packard Company, 1995.
[21] H. Hirata, K. Kimura, S. Nagamine, Y. Mochizuki, A. Nishimura, Y. Nakase, and T. Nishizawa, "An Elementary Processor Architecture with Simultaneous Instruction Issuing from Multiple Threads,"
reliability and to reduce the design/testing cost of modern Proc. 19th Int’l Symp. Computer Architecture (ISCA-19), 1992.
microprocessors. [22] H. Holzapfel and P. Levin, “Advanced Post-Silicon Verification
and Debug,” EDA Tech Forum, vol. 3, no. 3, Sept. 2006.
[23] A.M. Ionescu, M.J. Declercq, S. Mahapatra, K. Banerjee, and J.
ACKNOWLEDGMENTS Gautier, “Few Electron Devices: Towards Hybrid CMOS-SET
Integrated Circuits,” Proc. Design Automation Conf. (DAC), 2002.
The authors thank the anonymous reviewers for their [24] D. Josephson, “The Good, the Bad, and the Ugly of Silicon Debug,”
feedback. This work was supported by grants from the Proc. 43rd Design Automation Conf. (DAC-43), pp. 3-6, 2006.
National Science Foundation (NSF), SRC, and GSRC, and is [25] D. Josephson and B. Gottlieb, “The Crazy Mixed up World of
an extended and revised version of [12]. Silicon Debug,” Proc. IEEE Custom Integrated Circuits Conf.
(IEEE-CICC), 2004.
[26] H. Klug, “Microprocessor Testing by Instruction Sequences
REFERENCES Derived from Random Patterns,” Proc. Int’l Test Conf. (ITC), 1988.
[27] C. Kong, “A Hardware Overview of the NonStop Himalaya
[1] A. Agarwal, B.-H. Lim, D.A. Kranz, and J. Kubiatowicz, “April: A (K10000),” Tandem Systems Overview, vol. 10, no. 1, pp. 4-11, 1994.
Processor Architecture for Multiprocessing,” Proc. 17th Ann. Int’l [28] P. Kongetira, K. Aingaran, and K. Olukotun, “Niagara: A 32-Way
Symp. Computer Architecture (ISCA-17), 1990. Multithreaded SPARC Processor,” IEEE Micro, vol. 25, no. 2,
[2] T.M. Austin, “DIVA: A Reliable Substrate for Deep Submicron pp. 21-29, Mar./Apr. 2005.
Microarchitecture Design,” Proc. 32nd Ann. Int’l Symp. Micro-
[29] N. Kranitis, A. Paschalis, D. Gizopoulos, and Y. Zorian, “Instruc-
architecture (MICRO-32), 1999.
tion-Based Self-Test of Processor Cores,” Proc. Very Large Scale
[3] K. Batcher and C. Papachristiou, “Instruction Randomization Self
Integration (VLSI) Test Symp. (VTS), 2002.
Test for Processor Cores,” Proc. Very Large Scale Integration (VLSI)
Test Symp. (VTS), 1999. [30] R. Kuppuswamy, P. DesRosier, D. Feltham, R. Sheikh, and P.
[4] S. Borkar, T. Karnik, and V. De, “Design and Reliability Thadikaran, “Full Hold-Scan Systems in Microprocessors: Cost/
Challenges in Nanometer Technologies,” Proc. 41st Ann. Conf. Benefit Analysis,” Intel Technology J., vol. 8, no. 1, pp. 63-72, Feb.
Design Automation (DAC-41), 2004. 2004.
[5] B. Bottoms, “The Third Millennium’s Test Dilemma,” IEEE Design [31] J. Lee and J.H. Patel, “An Instruction Sequence Assembling
and Test of Computers, vol. 15, no. 4, pp. 7-11, Oct.-Dec. 1998. Methodology for Testing Microprocessors,” Proc. Int’l (r) Test Conf.
[6] D. Brahme and J.A. Abraham, “Functional Testing of Micro- (ITC), Sept. 1992.
processors,” IEEE Trans. Computers, vol. 33, no. 6, pp. 475-485, [32] A.S. Leon, K.W. Tam, J.L. Shin, D. Weisner, and F. Schumacher,
June 1984. “A Power-Efficient High-Throughput 32-Thread SPARC Proces-
[7] M.L. Bushnell and V.D. Agrawal, Essentials of Electronic Testing for sor,” IEEE J. Solid-State Circuits, vol. 42, no. 1, pp. 7-16, Jan. 2007.
Digital, Memory and Mixed-Signal VLSI Circuits. Kluwer Academic [33] M.-L. Li, P. Ramachandran, S.K. Sahoo, S.V. Adve, V.S. Adve, and
Publishers, 2000. Y. Zhou, “Understanding the Propagation of Hard Errors to
[8] K.-H. Chang, I.L. Markov, and V. Bertacco, “Automating Post- Software and Implications for Resilient System Design,” Proc. 13th
Silicon Debugging and Repair,” Proc. Int’l Conf. Computer-Aided Int’l Conf. Architectural Support for Programming Languages and
Design (ICCAD), Nov. 2007. Operating Systems (ASPLOS-XIII), 2008.


[34] Y. Li, S. Makar, and S. Mitra, “CASP: Concurrent Autonomous Chip Self-Test Using Stored Test Patterns,” Proc. Conf. Design, Automation and Test in Europe (DATE), 2008.
[35] C.-K. Luk, R.S. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V.J. Reddi, and K. Hazelwood, “Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation,” Proc. Conf. Programming Language Design and Implementation (PLDI), 2005.
[36] E.J. McCluskey and C.-W. Tseng, “Stuck-Fault Tests vs. Actual Defects,” Proc. Int’l Test Conf. (ITC), pp. 336-343, Oct. 2000.
[37] M. Meterelliyoz, H. Mahmoodi, and K. Roy, “A Leakage Control System for Thermal Stability during Burn-In Test,” Proc. Int’l Test Conf. (ITC), 2005.
[38] S. Mitra, N. Seifert, M. Zhang, Q. Shi, and K.S. Kim, “Robust System Design with Built-In Soft-Error Resilience,” Computer, vol. 38, no. 2, pp. 43-52, Feb. 2005.
[39] J. Nakano, P. Montesinos, K. Gharachorloo, and J. Torrellas, “ReViveI/O: Efficient Handling of I/O in Highly-Available Rollback-Recovery Servers,” Proc. Int’l Symp. High-Performance Computer Architecture (HPCA), 2006.
[40] E.B. Nightingale, P.M. Chen, and J. Flinn, “Speculative Execution in a Distributed File System,” ACM Trans. Computer Systems, vol. 24, no. 4, pp. 361-392, Nov. 2006.
[41] P. Parvathala, K. Maneparambil, and W. Lindsay, “FRITS—A Microprocessor Functional BIST Method,” Proc. Int’l Test Conf. (ITC), 2002.
[42] M. Prvulovic, Z. Zhang, and J. Torrellas, “ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors,” Proc. 29th Int’l Symp. Computer Architecture (ISCA-29), 2002.
[43] B.R. Quinton and S.J.E. Wilton, “Post-Silicon Debug Using Programmable Logic Cores,” Proc. Conf. Field-Programmable Technology (FPT), pp. 241-248, 2005.
[44] J.M. Rabaey, Digital Integrated Circuits: A Design Perspective. Prentice-Hall, Inc., 1996.
[45] J. Renau, B. Fraguela, J. Tuck, W. Liu, M. Prvulovic, L. Ceze, S. Sarangi, P. Sack, K. Strauss, and P. Montesinos, “SESC Simulator,” http://sesc.sourceforge.net, 2002.
[46] S. Sarangi, S. Narayanasamy, B. Carneal, A. Tiwari, B. Calder, and J. Torrellas, “Patching Processor Design Errors with Programmable Hardware,” IEEE Micro, vol. 27, no. 1, pp. 12-25, Jan./Feb. 2007.
[47] M.J. Serrano, W. Yamamoto, R.C. Wood, and M. Nemirovsky, “A Model for Performance Estimation in a Multistreamed, Superscalar Processor,” Proc. Seventh Int’l Conf. Modeling Techniques and Tools for Computer Performance Evaluation, 1994.
[48] P. Shivakumar, S.W. Keckler, C.R. Moore, and D. Burger, “Exploiting Microarchitectural Redundancy for Defect Tolerance,” Proc. Int’l Conf. Computer Design (ICCD), 2003.
[49] M. Schulz, “The End of the Road for Silicon,” Nature, June 1999.
[50] S. Shyam, K. Constantinides, S. Phadke, V. Bertacco, and T. Austin, “Ultra Low-Cost Defect Protection for Microprocessor Pipelines,” Proc. 12th Int’l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS-12), pp. 73-82, 2006.
[51] D.P. Siewiorek and R.S. Swarz, Reliable Computer Systems: Design and Evaluation, third ed. AK Peters, Ltd., 1998.
[52] D.J. Sorin, M.M.K. Martin, M.D. Hill, and D.A. Wood, “SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery,” Proc. 29th Int’l Symp. Computer Architecture (ISCA-29), 2002.
[53] J. Srinivasan, S.V. Adve, P. Bose, and J.A. Rivers, “The Impact of Technology Scaling on Lifetime Reliability,” Proc. Int’l Conf. Dependable Systems and Networks (DSN-34), 2004.
[54] J.H. Stathis, “Reliability Limits for the Gate Insulator in CMOS Technology,” IBM J. Research and Development, vol. 46, nos. 2/3, pp. 265-286, 2002.
[55] OpenSPARC T1 Microarchitecture Specification. Sun Microsystems, Inc., Aug. 2006.
[56] TetraMAX ATPG User Guide, version 2002.05. Synopsys, http://www.synopsys.com, 2002.
[57] D. Tarjan, S. Thoziyoor, and N.P. Jouppi, “CACTI 4.0,” Technical Report HPL-2006-86, Hewlett-Packard, 2006.
[58] M.H. Tehranipour, S. Fakhraie, Z. Navabi, and M. Movahedin, “A Low-Cost At-Speed BIST Architecture for Embedded Processor and SRAM Cores,” J. Electronic Testing: Theory and Applications, vol. 20, no. 2, pp. 155-168, 2004.
[59] D. Tullsen, S. Eggers, and H. Levy, “Simultaneous Multithreading: Maximizing On-Chip Parallelism,” Proc. 22nd Int’l Symp. Computer Architecture (ISCA-22), June 1995.
[60] D.P. Vallett, “Future Challenges in IC Testing and Fault Isolation,” Proc. IEEE Ann. Meeting of Lasers and Electro-Optics Society (LEOS), vol. 2, pp. 539-540, Oct. 2003.
[61] I. Wagner, V. Bertacco, and T. Austin, “Shielding against Design Flaws with Field Repairable Control Logic,” Proc. 43rd Design Automation Conf. (DAC-43), 2006.
[62] T.J. Wood, “The Test and Debug Features of the AMD-K7 Microprocessor,” Proc. Int’l Test Conf. (ITC), pp. 130-136, 1999.
[63] W.M. Yee, M. Paniccia, T. Eiles, and V. Rao, “Laser Voltage Probe (LVP): A Novel Optical Probing Technology for Flip-Chip Packaged Microprocessors,” Proc. Int’l Symp. Physical and Failure Analysis of Integrated Circuits (IPFA-7), 1999.

Kypros Constantinides received the BS degree in computer science from the University of Cyprus, in 2004, and the MS degree in electrical engineering and computer science from the University of Michigan, Ann Arbor, in 2006. He is currently working toward the PhD degree in electrical engineering and computer science at the University of Michigan, Ann Arbor. He is interested in computer architecture research with a focus on reliable system design. He previously worked at Microsoft Research and Intel Corporation. He received the Intel Foundation PhD Fellowship in 2008. He is a student member of the IEEE.

Onur Mutlu received the BS degrees in computer engineering and psychology from the University of Michigan, Ann Arbor, and the MS and PhD degrees in ECE from the University of Texas at Austin. He is currently an assistant professor of ECE at Carnegie Mellon University. Prior to Carnegie Mellon, he worked at Microsoft Research, Intel Corporation, and Advanced Micro Devices. He is interested in computer architecture and systems research, especially in the interactions between languages, operating systems, compilers, and microarchitecture. He was a recipient of the Intel PhD Fellowship in 2004, the University of Texas George H. Mitchell Award for Excellence in Graduate Research in 2005, the Microsoft Gold Star Award in 2008, and five “Computer Architecture Top Pick” Paper Awards by the IEEE Micro Magazine. He is a member of the IEEE.

Todd Austin received the PhD degree in computer science from the University of Wisconsin, Madison, in 1996. He is an associate professor of electrical engineering and computer science at the University of Michigan, Ann Arbor. Prior to joining academia, he was a senior computer architect at Intel’s Microprocessor Research Labs, a product-oriented research laboratory in Hillsboro, Oregon. His research interests include computer architecture, compilers, computer system verification, and performance analysis tools and techniques. He is a member of the IEEE.

Valeria Bertacco received the Laurea degree in computer engineering from the University of Padova, Italy, and the MS and PhD degrees in electrical engineering from Stanford University in 2003. She is an assistant professor of electrical engineering and computer science at the University of Michigan. She joined the faculty at Michigan after being at Synopsys for four years. Her research interests are in the areas of formal and semiformal design verification with emphasis on full design validation and digital system reliability. She is an associate editor of the IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems and has served on the program committees for DAC and ICCAD. She is a recipient of the US National Science Foundation (NSF) CAREER Award and the University of Michigan’s Outstanding Achievement Award. She is a member of the IEEE.

