ARCS 2017, April, 3 – 6, 2017, Wien, Austria

A Survey of Fault Tolerance Approaches at Different Architecture Levels
Lukas Osinski, Tobias Langer and Jürgen Mottok
Laboratory for Safe and Secure Systems - LaS3
University of Applied Sciences Regensburg, Germany,
{lukas.osinski, tobias.langer, juergen.mottok}@oth-regensburg.de

Abstract: In recent years, the development trends for computing platforms moved to multicore systems. Associated with this trend, feature sizes decreased with each new hardware generation and consequently led to a rise of transient and permanent error frequency in memory and CPUs. In this context, researchers presented several approaches which exploit the inherent redundancy of multicore platforms to provide fault tolerance. We present a discussion of fault tolerance approaches based on redundancy at different levels of architecture regarding their sphere of replication, performance as well as error detection and recovery capability.

The authors gratefully acknowledge the financial funding from the Bayerische Forschungsstiftung (BayFor), research initiative FORMUS3IC "Multi-Core safe and software-intensive Systems Improvement Community" under the funding code AZ-1165-15.

I. INTRODUCTION

Development trends for computing platforms moved from increasing the frequency of a single core to increasing the parallelism with multiple cores on the same die [1]. Although chip multiprocessors (CMPs) present new development challenges, they have strong potential to support cost-efficient fault tolerance due to their inherent spatial redundancy, which can counteract the rising frequency of transient and permanent errors in memories [2] and CPUs [3].

Fault tolerance requires at least error detection and recovery [4]. The detection of random hardware faults can be realized by combining information redundancy (e.g. ECC), temporal redundancy (e.g. rollback) and spatial redundancy (e.g. dual/triple modular redundancy) [5]. Several fault-tolerance techniques use fully replicated hardware components which are cycle-by-cycle synchronized in order to detect random hardware faults [6]. During fault-free operation each component performs the same operation on the same inputs, producing the same outputs (lockstepping). However, hardware-based approaches introduce higher hardware costs and cannot be used on off-the-shelf processors. Furthermore, these approaches do not allow a flexible program execution environment where legacy binary code and the redundant code can co-exist depending on the required level of reliability. Therefore, research on software techniques at different architecture levels such as instruction, thread, process and virtual machine level became more and more attractive. We focus on the discussion of fault detection mechanisms for random hardware faults and more specifically mechanisms which are implemented through instruction-level redundancy (ILR) and thread-level redundancy (TLR). ILR is represented by software-only techniques which operate by duplicating program instructions and interleaving them with the original program code in such a way that they can be scheduled along with the original ones by utilizing instruction-level parallelism (ILP) [7]. The basic idea of TLR approaches is that copies of the same thread are executed independently, either on the same processor by exploiting a hardware feature called Simultaneous Multi-Threading (SMT) [8] or on separate processors. In both cases the redundant execution is used to detect errors by comparing the results of the execution.

II. BASIC CONCEPTS OF DEPENDABILITY

A. Fault-Error-Failure

A system failure is defined as the deviation of the system's external state from the correct (specified) state. The cause for a system failure is an internal fault (e.g. random wire break, hardware erratum) or an external fault (e.g. cosmic radiation). Faults can either be permanent or temporary [9]. Permanent faults are continuous in time and remain in the system until an explicit repair action takes place which removes the fault. Temporary faults are faults whose presence is bounded in time and which disappear after a given time interval without an explicit repair action having taken place. Temporary faults can be either transient or intermittent [9]. Transient faults are often classified as temporary external faults which originate from the physical environment, whereas intermittent faults are temporary internal faults which originate from the inside of the system and produce errors only under certain operating conditions (e.g. component wear-out, component overload) [4]. With respect to the effect of a fault, a fault can be active or dormant. When a dormant fault becomes active, it causes the total state of one or more components of the system to deviate. This is known as an error. When an error affects the external state of the system and the external state deviates from the correct (specified) service, it is called a failure [4]. Failures can be classified into two different classes based on their domain and consistency. With respect to domain, a failure can be timing related or content related. Timing failures imply that the system either responds too early or too late, but the content is correct; content failures imply that the content delivered by the system is corrupted, but the timing is correct.

ISBN 978-3-8007-4395-7 117 © 2017 VDE VERLAG GMBH ∙ Berlin ∙ Offenbach


Furthermore, failures can be content and timing related at the same time. These types of failures can be categorized as halt and erratic failures. In terms of consistency, there can be byzantine and consistent failures. When a byzantine failure [10] occurs, some or all users of the system will perceive a different service. When a consistent failure occurs, all users will perceive identical service.

B. Fault Tolerance

Fault-tolerance techniques are used to tolerate errors occurring during system operation and include masking, error detection and recovery [4]. A system is referred to as fault tolerant if faults do not affect the external state of the system. However, it can allow its components to fail as long as the external state is not corrupted. Temporary errors are generally detected by concurrent error detection techniques and recovered by schemes like re-execution, rollback recovery, rollforward recovery and checkpointing [4].

C. Fault Tolerance by Redundancy

A key mechanism to achieve fault tolerance, i.e. error detection and recovery of a system, is redundancy, i.e. the replication of components in hardware (e.g. processors, memory) or software (e.g. entire programs or parts of them) [5]. A component is considered redundant if the system is fully functional without it, i.e. redundancy includes all resources which are not necessary for the functionality of a system [5]. These additional components are specifically used in a coordinated way to detect errors, mask faults or recover the system.

A widely used paradigm for error detection and/or recovery is represented by the N-modular redundancy (NMR) pattern, where N characterizes the number of replicated identical processing components which process the same data [5]. A popular type of NMR is dual modular redundancy (DMR). DMR uses two identical elements, the original and the replicated component, connected with a comparator component to detect errors. Errors are detected by the comparator if the results of the two elements are dissimilar. However, it is impossible for the comparator to decide which result is the correct one and which one is the result of a faulty component [5]. Therefore, DMR is only suitable for error detection, e.g. in a fail-silent system design. An example for DMR in hardware is represented by the lockstep configuration of two processing cores on a multicore system [6]. Extending the DMR approach by another replica leads to triple modular redundancy (TMR). TMR uses three identical elements which perform the same operation. In comparison to the DMR approach the comparator is replaced by a majority voter [5]. The voting element compares the three results and selects the correct result by majority, i.e. if at least two results are equal, the voter considers this value correct. In case the three results are all different, another strategy must be applied, such as re-executing the complete operation. However, two components very rarely fail at the same time [5]. Compared to DMR, TMR is, despite its higher costs in terms of resources (e.g. performance), a more powerful approach because it not only detects that a fault occurred but is also able to select the correct result and to identify the one faulty component by majority voting [5]. Furthermore, it contributes to a higher system availability, since the system can continue the execution by masking the faulty element. The weak spot of TMR is the majority voter, which represents a single point of failure and therefore has to be highly reliable. In general, the number of replicas in the NMR approach is not limited. In order to achieve fault tolerance with higher availability the number of replicas can be increased as long as the necessary resources are available [5].

Usually, redundancy can be achieved by four main strategies [5]: spatial, temporal, information and functional redundancy. Spatial redundancy means the expansion of a system with additional components which are dispensable for the functionality of the system [5]. Referring to the described TMR approach, the three identical components work in parallel, performing the same operation on distinct hardware components. While spatial redundancy performs the same operation on distinct hardware components, temporal redundancy indicates that the same operation in an NMR approach is independently performed N times sequentially on the same hardware in different periods of time [5]. Comparison (DMR) or voting (TMR) is performed at the end of the sequential execution of the replicas. A widely used temporal redundancy recovery technique is implemented through checkpointing and the process of rollback recovery; checkpoints are created during program execution at defined points in time to store the current system state. In case of a detected error, rollback recovery takes place in order to restore the system state to the last correct state, i.e. to the checkpoint [5]. In comparison to spatial redundancy, temporal redundancy adds execution time to a function or algorithm in order to detect and overcome errors or faults.

Information redundancy describes another way of redundancy in order to achieve error detection (and recovery). Information redundancy includes all additional data used in a program. The simplest way is, e.g., to provide different memory spaces in order to store the replicated data redundantly. A more advanced technique is to add extra data to the original data, instead of replicating it, by using error detection codes (EDC) or error correction codes (ECC). Error detection codes such as parity bits allow checking whether the final data still has the expected number of bits with a defined value (1 or 0) [11]. Error correction codes such as the Hamming code [11], where a code word is partitioned into groups and each group has its own parity bit, allow the correction of a defined number of errors in the code word. A further class of error detection codes are AN-BD codes, which are used by several error detection techniques [12], [13], [14], [15].

Spatial, temporal and information redundancy are able to detect random hardware faults. In order to detect systematic faults, functional redundancy can be used. Functional redundancy is the extension of a system by additional functions which are only used for fault tolerance operation [5]. Functional redundancy can either be achieved by additional functions
which are specified differently from already implemented functions, or by diverse functions which implement the same specified function in a diverse way [5].

A key concept regarding redundant execution is called the sphere of replication (SoR) [16]. Basically, the SoR describes which resources are replicated for fault tolerance and therefore enjoy fault coverage. Components outside the sphere are not covered by fault tolerance and therefore must be protected via other means. Values entering the SoR are inputs that must be replicated; values leaving the SoR are outputs that must be compared.

Most fault tolerance mechanisms use a combination of different redundancy strategies for error detection and recovery. However, the basic scheme of replication is always similar. First, the inputs for the replicas are replicated; second, all replicas process the input; and finally the outputs of all replicas are compared in order to detect a possible error. In order to perform the output comparison of the replicas correctly, redundancy requires replica determinism [17], i.e. all replicas must produce the same output for a given input.

Based on the different redundancy strategies presented in this section, the following sections compare a selection of error detection and recovery approaches at different architecture levels.

III. REDUNDANCY-BASED FAULT TOLERANCE APPROACHES

A. Instruction-level redundancy (ILR)

This section examines different software-only fault tolerance approaches based on instruction-level redundancy.

Error Detection by Duplicated Instruction (EDDI) [7] is a software-only approach which operates by duplicating program instructions and using redundant execution to detect transient errors. EDDI does not assume any fault-free operations and targets inter-block and intra-block control flow errors, data or code changes in memory as well as transient errors in functional units. The program instructions are duplicated and interleaved with the original program instructions. EDDI interleaves the duplicated instructions in such a way that most of the control flow errors are detected. The duplicated instructions are scheduled among the original ones in the same execution thread by utilizing instruction-level parallelism (ILP), thus minimizing the transformation's performance penalty. Furthermore, each copy of the program uses different registers and different memory locations (original and shadow locations) to prevent interference between the copies and to detect memory operation errors. At certain synchronization points in the combined program flow, validation instructions are inserted by the compiler to ensure that the computed values of the original instructions and their redundant copies are equal. In case of an inequality an error is detected and an error detection subroutine is called. Error Detection by Diverse Data and Duplicated Instructions (ED4I) [20] extends the EDDI concept by adding data diversity in order to enable permanent error detection. ED4I was a theoretical attempt and was not evaluated by simulation.

Software Implemented Fault Tolerance (SWIFT) [18] is a compiler-based transient error detection approach based on EDDI [7] with several refinements. Whereas EDDI includes the memory in the sphere of replication, SWIFT assumes that the memory subsystem is protected by ECC. The transformation with EDDI incurs a significant memory overhead, because each location in memory needs to have a corresponding shadow location in memory for the redundant duplicates. This memory duplication incurs a significant hardware cost and significant performance costs, since cache sizes are effectively halved and additional memory traffic is created. SWIFT proposes to eliminate the use of two distinct memory locations for all memory values and consequently to eliminate the duplicated store instruction. It is stated that this modification will not reduce the fault detection coverage, due to the ECC-protected memory, but will make the protected code execute more efficiently and require less memory [18]. EDDI suffers from incomplete protection against control flow faults, because faulty branch instructions could lead to a misdirected control flow without detection. SWIFT proposes to eliminate this vulnerability by the use of control-flow checks with software signatures [21] and run-time adjusting signatures [18].

Software Implemented Fault Tolerance with Recovery (SWIFT-R) [19] is a transient error detection approach which extends SWIFT with the ability to recover from detected errors. SWIFT-R achieves error detection with recovery by using triple modular redundancy. Therefore, instead of duplicating instructions, the transformation triplicates them. In case of a fault which corrupts any one version of the computation, the two other versions will still hold the correct values. A simple majority voting scheme can identify the correct value and mask a single-bit fault.

Triple Redundancy Using Multiplication Protection (TRUMP) [19] is similar to SWIFT except that the duplicated copy is AN-encoded [15]. In particular, the AN-encoded code word is built by multiplying the original value by the constant factor A. The recovery subroutine is called whenever the original value times the factor A does not match the AN-encoded copy. The recovery itself is more complex than the majority voting scheme introduced in SWIFT-R. Instead of voting, a comparison is performed. If the AN-encoded copy is divisible by A, it can be surmised that the fault struck the original copy. If it is not divisible by A, the AN-encoded copy was struck. Encoding the duplicated code word allows a more compact representation of redundancy, because TRUMP holds SWIFT-R's redundant data in two instead of three registers. However, TRUMP is unable to protect certain parts of programs because of the encoding (e.g. bit-shift operations) and might be costly because of the division and modulo operations.

Δ-encoding [12] is a software-only approach which combines AN codes [15] and duplicated instructions [7] to harden programs against transient and permanent hardware faults. The original program data flow is duplicated and AN-encoded at compile-time. The transformations are performed by a source-to-source C transformer [22] by encoding the original program at the level of an Abstract Syntax Tree.


TABLE I
OVERVIEW OF INSTRUCTION-LEVEL REDUNDANCY APPROACHES

| | EDDI [7] | SWIFT [18] | SWIFT-R [19] | TRUMP [19] | Δ-encoding [12] |
|---|---|---|---|---|---|
| Replication | Assembly | Assembly | Assembly | Assembly | High level |
| Sphere of Replication | CPU & Memory | CPU ² | CPU ² | CPU | CPU & Memory |
| Error recovery | No | No | Yes | Partially | Partially |
| Control flow error | Partially | Yes | Partially | Partially | No |
| Fault types | Transient | Trans. & Perm. | Transient | Transient | Trans. & Perm. |
| Fault model | Single | Single | Single | Single | Multiple |
| Fault coverage (AVG) | 98.5% ³ | 100.0% ³ | 99.2% ³ | 95.1% ³ | 99.997% ³ |
| Performance ¹ | 61% | 41% | 99% | 37% | 408% |

¹ Average overhead compared to single execution
² Requires ECC protected memory
³ Determined by fault injection experiments
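The two basic patterns behind the approaches in Table I, duplicate-and-compare (EDDI, SWIFT) and triplicate-and-vote (SWIFT-R), can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the real techniques transform assembly code and keep the copies in separate registers, and the names `ErrorDetected`, `duplicated_add`, `triplicated_add` and the `inject` fault-injection hook are hypothetical.

```python
from collections import Counter


class ErrorDetected(Exception):
    """Raised when the redundant copies disagree and cannot be corrected."""


def duplicated_add(a, b, inject=None):
    """Duplicate-and-compare in the spirit of EDDI/SWIFT (detection only)."""
    result = a + b           # original instruction
    shadow = a + b           # duplicated instruction on the shadow copy
    if inject is not None:   # hypothetical fault-injection hook for the demo
        result = inject(result)
    if result != shadow:     # validation at a synchronization point
        raise ErrorDetected("original and shadow results differ")
    return result


def triplicated_add(a, b, inject=None):
    """Triplicate-and-vote in the spirit of SWIFT-R (detection and recovery)."""
    copies = [a + b, a + b, a + b]
    if inject is not None:
        copies[0] = inject(copies[0])
    value, votes = Counter(copies).most_common(1)[0]
    if votes < 2:            # all three copies differ, no majority exists
        raise ErrorDetected("no majority among the three copies")
    return value


def flip_lowest_bit(value):
    """Simulated single-bit transient fault."""
    return value ^ 1


masked = triplicated_add(2, 3, inject=flip_lowest_bit)  # fault is outvoted
try:
    duplicated_add(2, 3, inject=flip_lowest_bit)
    detected = False
except ErrorDetected:
    detected = True                                     # fault is only detected
```

The sketch mirrors the "Error recovery" row of the table: with two copies a mismatch can only be reported, whereas the third copy lets the majority mask a single corrupted value.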

In the first transformation all data is AN-encoded and all original operations are substituted by AN-encoded operations. The second transformation duplicates all encoded data and operations and inserts checks at synchronization points. At run-time, the program effectively works on two copies of the data, encoded in two different ways (completely encoded data flow). Data diversity is achieved by using different encoding constants (A) for the two copies of the data. With Δ-encoding, if a hard CPU fault is triggered by some specific input, it will corrupt only one copy of the data, but not the other. Because periodic checks would lead to a tremendous slowdown, since each operation would then be accompanied by heavy-weight checks, the authors introduce accumulators. Accumulators substitute the heavy-weight periodic checks with a simple addition of the intermediate results. Heavy-weight checks are only performed right before the output of the computational result.

B. Thread-level redundancy (TLR)

Research on thread-level redundancy (TLR) became popular with the introduction of simultaneous multithreading (SMT) [8]. SMT is a technique that allows fine-grained resource sharing among multiple independent threads in a dynamically scheduled superscalar processor [16]. This section provides an overview of several redundancy approaches based on simultaneous multithreading (SMT) in software and with hardware support.

Active-stream/Redundant-stream Simultaneous Multithreading (AR-SMT) [23] is a temporal redundancy fault-tolerant approach which combines the full processor coverage of program-level redundancy with the performance advantages of instruction re-execution. In AR-SMT, two copies of the same program are executed in two threads called A(ctive)-thread and R(edundant)-thread. However, there is a delay of tens of cycles between them. This slack was introduced in order to localize the effect of intermittent faults to one thread. The A-thread (leading thread) performs its computation and commits the results to the program state and a delay buffer. After the R-thread (trailing thread) has finished its computation, it compares the results to the values in the delay buffer. The execution distance between the two threads cannot be more than the delay buffer length. If the R-thread detects a deviation between the results, both threads perform a rollback to the last saved state (checkpoint), which is the last committed state of the R-thread. Instructions are only committed if the results are equal. Both threads independently read from and write to memory and have separate address spaces. Consequently, there is no explicit sharing of values.

Simultaneously and Redundantly Threaded processor (SRT) [16] extends the AR-SMT idea of leading and trailing threads by introducing modifications to reduce the performance overhead. Furthermore, the authors aim to mitigate the design challenges of cycle-by-cycle output comparison and input replication (determinism). In order to avoid memory latencies and mispredictions during computations, SRT passes branch outcomes between the threads to speed up the trailing checker thread. Additionally, the leading thread effectively pre-fetches memory values for the trailing thread, which reduces latency. To address deterministic input replication the authors introduce two design alternatives: an Active Load Address Buffer (ALAB) and a Load Value Queue (LVQ). The ALAB allows corresponding cached loads from both replicated threads to receive the same value in the presence of out-of-order and speculative execution (thread, cache replacements and cache invalidation). To this end, it stores information about all active load lines, i.e. lines that have been loaded by the leading thread but not yet by the trailing thread. Any updates or invalidations to an active line are suppressed until the line becomes inactive. The LVQ uses a single cache access strategy to satisfy both threads by forwarding the (pre-designated) leading thread's committed load addresses and values to the trailing thread. The trailing thread derives all its load values from the ECC-protected LVQ instead of the data cache.

Simultaneously and Redundantly Threaded processor with Recovery (SRTR) [24] extends the SRT [16] approach to enable transient-fault recovery. The authors recognized that in the SRT approach, the leading thread is allowed to commit a non-store instruction before verification. This alters the state of the system regardless of whether the instruction executed incorrectly or not. As a solution, they proposed


TABLE II
OVERVIEW OF THREAD-LEVEL REDUNDANCY APPROACHES - SINGLECORE

| | AR-SMT [23] | SRT [16] | SRTR [24] | SRMT ⁶ [25] |
|---|---|---|---|---|
| Category | HW | HW | HW | SW |
| HW-Overhead | Delay buffer | Add. buffers/queue ¹ | Add. queue ² | - |
| Sphere of Replication | CPU ⁴ | CPU ³ | CPU ³ | CPU (Instructions) ⁵ |
| Error recovery | Yes | No | Yes | No |
| Fault types | Transient | Transient | Transient | Transient |
| Fault model | Single | Single | Single | Single |
| Fault coverage | Full ⁸ | Full ⁸ | Full ⁸ | 99.98% |
| Performance | 27% ⁷ | 21% ⁷ | 30% ⁷ | 19% ⁷ |

¹ Check Store Buffer (CSB), Load Value Queue (LVQ), Branch Outcome Queue (BOQ)
² Register Value Queue (RVQ)
³ Included: Pipeline, (Registers); Excluded: instruction and data cache, or register file
⁴ Excluded: Register file
⁵ Excluded: System call for I/O operation and shared memory access
⁶ Also applicable to CMPs
⁷ Average overhead compared to single execution
⁸ Theoretical assumption by authors of approach

checking each instruction of the leading thread against the trailing thread before it commits. Therefore, SRTR does not allow the commit before checking occurs, since a faulty instruction cannot be undone once it commits. However, the verification of the outputs involves comparing the values of registers. This increases pressure on the register file, which may degrade performance. As a solution to this, the authors propose maintaining all unverified results of the leading thread in a Register Value Queue (RVQ). The trailing thread compares its results with the values stored in the RVQ. To reduce the bandwidth pressure on the RVQ itself, SRTR employs dependence-based check elision (DBCE). Recovery is achieved by utilizing the rollback ability of pipelines, i.e. after an error is detected, a rollback is performed to the offending instruction and re-execution is performed.

With the emergence of multicore technology, the application of redundant multithreading (RMT) to CMPs was the research topic of several publications. Different approaches found that performing RMT on CMPs introduces fewer overheads than RMT on single and lockstepped cores. Also, due to the inherent redundancy of CMPs, most RMTs provide both hard and soft fault detection.

Chip-level Redundant Threading (CRT) [26] extends the SRT technique for a single SMT processor to CMP architectures. The basic idea of CRT is to generate logically redundant threads (as in SRT) but to run the leading and trailing threads on separate processor cores of the CMP. Similar to SRT, CRT uses loosely synchronized redundant threads in order to reduce the checker overhead and eliminate cache miss penalties on the trailing thread. The forwarding of inputs to the load value queue, branch prediction queue and store comparator requires a dedicated bus between the cores.

Chip-level Redundant Threading processor with Recovery (CRTR) [27] extends the CRT approach with fault recovery for CMPs, similar to SRTR [24]. CRTR uses a long slack enabled by asymmetric commit to hide the inter-processor latency incurred on a CMP. CRTR commits the leading thread before checking and the trailing thread after checking, so that the trailing thread state may be used for recovery.

Reunion [28] proposes a CRT-based architecture that relaxes input replication while preserving the existing memory system, including the coherence protocol and consistency model, and reduces comparison bandwidth by compressing the results. The redundant execution of the same thread is performed on a logical pair of cores. Each logical processor pair consists of one vocal and one mute core. The stores of the vocal core are allowed to propagate to the rest of the memory system, while those of the mute core are not. Furthermore, the mute core does not participate in coherence protocol actions. In order to detect faults, a fingerprint (hash of instruction results) is created after a core executes a pre-determined number of instructions. The cores exchange their fingerprints and compare them with each other. Upon detection of differences between the vocal and the mute core, the processor pair starts the re-execution protocol.

Dynamic Core Coupling (DCC) [29] proposes a dynamic coupling approach for CMPs, similar to Reunion, which allows arbitrary processor cores to verify each other without requiring dedicated communication hardware. Each thread has a redundant copy running on another core. Unlike Reunion, DCC introduces a slack between the two cores. With increasing slack, the probability of different forms of input incoherence events increases. DCC proposes to solve this problem on a per-address basis by introducing write windows. When the leading thread executes a load, it opens a read window for the address, and when it executes a store, it opens a write window. When both leading and trailing threads commit the load/store, the read/write window is closed. Two read windows on the same address may overlap, but a read and a write window, or two write windows, may not overlap. Enforcing this constraint ensures that shared memory operations in the leading and trailing thread behave in the same way.

In contrast to SRT, the Software-based Redundant Multi-


TABLE III
OVERVIEW OF THREAD-LEVEL REDUNDANCY APPROACHES - MULTICORE

| | CRT [26] | CRTR [27] | Reunion [28] | DCC [29] |
|---|---|---|---|---|
| Category | HW | HW | HW | HW |
| HW-Overhead | Add. core + buffers ¹ | Add. core + buffers ¹ | Add. core + Sign generator | Add. core |
| Sphere of Replication | CPU ² | CPU ² | CPU ³ | CPU ³ |
| Error recovery | No | Yes | Yes | Yes |
| Fault types | Trans. (& Perm.) | Trans. (& Perm.) | Trans. (& Perm.) | Trans. (& Perm.) |
| Fault model | Single | Single | Single | Single |
| Fault coverage | Full ⁶ | Full ⁶ | Full ⁶ | Full ⁶ |
| Performance | 13% ⁴ | 30% ⁴ | 5-6% | 3-20% ⁵ |

¹ Check Store Buffer (CSB), Load Value Queue (LVQ), Branch Outcome Queue (BOQ), Dedicated communication channels
² Included: Pipeline, Registers; Excluded: Caches, Data path between processors
³ Included: Pipeline, Registers, Caches
⁴ Average improvement compared to lockstepping
⁵ Average overhead compared to single execution
⁶ Theoretical assumption by authors of approach

Threading (SRMT) [25] does not require hardware support; however, performance overheads can be reduced by minimal hardware support. SRMT uses the compiler to automatically create redundant threads. Like hardware-based RMT (HRMT) approaches, SRMT performs computations in two threads, a leading thread backed up by a trailing thread for error detection. The leading thread performs all operations in the original program with additional operations to communicate with the trailing thread. The trailing thread transparently replicates the computations of the leading thread and compares its results with those from the leading thread to detect transient faults. For correctness, the compiler treats the leading thread as the original thread in the program and the trailing thread as a helper thread which only helps to detect transient faults.

IV. DISCUSSION

This section discusses the approaches of the authors and compares them by the numbers stated in their publications. The definite comparison of the numbers is not applicable to all approaches due to minor deviations in the conducted

average (eight benchmark programs) performance overhead to approximately 61.5% [7].

SWIFT provides error detection by temporal redundancy. In order to determine the error coverage of SWIFT, it was applied to 29 benchmark programs (300 iterations) in which a fault injection forced a single bit flip in the general-purpose registers, floating-point registers or predicate registers. The simulation results show that 100% fault coverage is achieved by SWIFT. The average (29 benchmark programs) performance overhead compared to native execution is stated as 41% [18].

SWIFT-R provides error detection by temporal redundancy. In order to determine the error coverage of SWIFT-R, it was applied to 27 benchmark programs (250 iterations) in which a fault injection forced a single bit flip in the register file. The simulation results show that on average 99.2% fault coverage is achieved by SWIFT-R. The average performance overhead compared to native execution is stated as 99%. The SWIFT-R technique is more expensive than TRUMP in terms of redundancy because it requires two additional versions of the computation instead of one [19].
experiments. However, the greater number of experiments In order to determine the error coverage of TRUMP, it
were performed with similar benchmarks and prerequisites for was applied to 27 benchmark programs (250 iterations) in
performance or fault coverage evaluation. The following dis- which a fault injection forced a single bit flip in the register-
cussion with regard to instruction-level redundancy approaches file. The simulation results show that on average 95.1% fault
is summarized in Table 1. coverage is achieved by TRUMP. The average performance
EDDI provides error detection by temporal redundancy. The overhead compared to native execution is stated with 37%. For
SoR is the CPU and the memory directly used by the dupli- benchmarks that are dominated by arithmetic instruction that
cated program. In order to determine EDDIs error coverage, it TRUMP can protect it performs on par with SWIFT-R [19].
was applied to eight benchmark programs (500 iterations) in For benchmarks that are dominated by instructions TRUMP
which a fault injection forced a single bit-flip in the code seg- can not protect, such as logical operations, TRUMPs reliability
ment of executable machine code. The simulation results show is significantly lower than SWIFT-Rs and more expensive in
that approximately 98.5% (average) fault coverage is achieved terms of verification because it must convert the AN-encoded
by EDDI. Although more than 100% performance overhead is and original data to the same form for comparison.
expected due to instruction duplication, in most cases it is less ∆-Encoding is a software-only approach to detect 99,997%
than 100%. The reduced performance overhead is achieved by of hardware faults with performance slowdown of 408%
scheduling instructions that are added for detecting the errors (average) compared to native execution. ∆-encoding makes
such that Instruction Level Parallelism (ILP) within a single no assumptions on the rate (single-bit or multiple bit) and
(4-way) super-scalar processor is maximized. This reduces the type of fault (transient, intermittent or permanent fault). The

Authorized licensed use limited to: Louisiana Tech University. Downloaded on April 22,2025 at 18:06:26 UTC from IEEE Xplore. Restrictions apply.
ARCS 2017, April, 3 – 6, 2017, Wien, Austria

SoR assumed in the approach is the CPU and the memory directly used by the encoded program. ∆-encoding does not cover control flow errors [12]. Compared to other approaches, ∆-encoding provides a greater SoR and, moreover, an extended fault coverage including permanent faults. However, the average performance overhead is increased extensively.

The following discussion with regard to singlecore thread-level redundancy approaches is summarized in Table II. AR-SMT provides error detection and recovery of single transient faults by hardware redundant multithreading (HRMT) [23]. The SoR assumed for the approach is the CPU without the register file. AR-SMT requires an SMT-based machine with a delay buffer. Detailed simulations of five benchmarks in [30] showed that the execution of two redundant programs with AR-SMT introduced an average performance overhead of 27% compared to native execution (only a single version of the program). On the basis of AR-SMT, the SRT approach [16] was able to reduce the performance overhead to 21%. The researchers of SRT state that the performance could be improved even further by utilizing hardware features like slack fetch and the branch outcome queue. However, compared to AR-SMT, SRT requires additional hardware resources such as buffers and queues. Furthermore, SRT does not provide recovery after an error was detected. Since both streams in AR-SMT execute the same program, the active stream can act as a very effective pre-fetcher and branch predictor for the redundant stream during error-free operation. This design does not address the issue of managing non-determinism in parallel applications, which needs to be handled properly to ensure forward progress of a TLR system [16]. SRT provides this improvement of guaranteed input replication, because the trailing thread receives the same values for a load as used by the leading thread. A drawback of guaranteed input replication is introduced by the ALAB and LVQ structures, which add considerable complexity to the core logic. Furthermore, with the LVQ-based load mechanism, the memory controller logic remains unprotected by redundancy, and any error in the controller logic goes undetected as it is not independently verified. In addition, the memory storage has to be ECC protected. The fault model and fault coverage of both approaches are similar. SRT only addresses fault detection without discussing error recovery, as it only compares the store values (if register files are in the SoR).

SRTR enhances the SRT approach by recovery, thereby introducing a higher performance overhead of 30% compared to SRT [24]. The SoR assumed by SRTR is the CPU without the register file. SRTR and AR-SMT perform recovery in fundamentally different ways, with different costs: SRTR disallows the leading thread from committing until the trailing thread completes and is checked, and uses instruction squash to roll back to a committed state before the fault. AR-SMT allows the leading thread to commit potentially faulty states, and lets the trailing thread be checked upon completion of each instruction. Upon detection of a fault, AR-SMT uses the trailing thread's committed state to restore the leading thread's state.

The following discussion with regard to multicore thread-level redundancy approaches is summarized in Table III. In order to exploit the inherent spatial redundancy of CMPs, CRT introduces SMT to CMPs [26]. The advantage compared to previous approaches is better permanent fault coverage, as no resources are shared between the leading and the trailing thread. Compared to SRT, CRT introduces further hardware overheads such as the obvious additional core, extra queues (LVQ, BOQ) and dedicated communication channels between the cores. The SoR of CRT includes the CPU (incl. the register file of each processor) but excludes the memory. However, CRT's leading thread only commits stores after checking, so that memory is guaranteed to be correct. In addition, the data paths between the two cores and the cache hierarchy are excluded from the SoR. These parts must be protected with some form of information redundancy, e.g., ECC. Experiments showed that CRT achieved better results than simply lockstepping the two cores, because in lockstepping both copies of a computation are forced to waste resources on misspeculation and cache misses [26]. The evaluation showed that a CRT processor performs similarly to lockstepping for single-program runs, but outperforms lockstepping by 13% on average (with a maximum improvement of 22%) for multithreaded programs. As a drawback, CRT only provides fault detection.

In [27], CRTR was introduced in order to accomplish recovery. To hide inter-processor latency, CRTR uses a long slack enabled by asymmetric commit. As in CRT, CRTR commits memory updates only after checking, so that memory is guaranteed to be correct. Because stores are less frequent than register updates, CRTR can increase the slack without stalling leading thread commits. Furthermore, CRTR incurs negligible performance degradation compared to CRT. After the detection of a fault, CRTR uses the trailing thread state for recovery by copying the trailing state to the leading thread. CRTR is guaranteed to provide recovery from single transient faults, except for those cases that affect the register file (not protected by ECC) and the memory controller logic. For errors in the register files, CRTR guarantees error detection.

A major challenge in providing redundant execution support for parallel applications is maintaining identical instruction streams. The redundant cores operate independently but still need to receive the same shared-memory values to execute the same stream of instructions. Previous designs like CRT handle this issue through input replication by using an ALAB or LVQ [27]. However, structures like ALAB and LVQ add a considerable amount of complexity to the processor core design. Furthermore, they fail to protect the controller logic of memory subsystems through redundancy.

The authors of Reunion [28] observed that even without any special hardware the redundant threads would execute identical instruction streams most of the time. On this basis they relaxed the input replication and observed a performance overhead of only 5-6% for commercial and scientific workloads [28]. Reunion provides detection of and recovery from transient errors and input incoherence using a combination of light-weight error detection [31] and existing exception rollback mechanisms.
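
The idea behind the light-weight error detection cited above ([31], titled "Fingerprinting") can be illustrated with a minimal Python sketch. It is not the hardware mechanism of [28] or [31]. It only assumes that each redundant execution compresses the state updates of one checking interval into a short hash and that the two hashes are compared before the interval is committed. The function names, the (destination, value) update format and the choice of CRC-32 are assumptions made for this illustration.

```python
import zlib


def fingerprint(updates):
    """Compress a sequence of (destination, value) state updates into one CRC-32."""
    crc = 0
    for dest, value in updates:
        # Serialize each update and fold it into the running checksum.
        record = f"{dest}={value};".encode("ascii")
        crc = zlib.crc32(record, crc)
    return crc


def interval_matches(leading_updates, trailing_updates):
    """Return True if both executions produced the same fingerprint.

    A False result stands for the case in which the interval must not be
    committed and a rollback would be requested.
    """
    return fingerprint(leading_updates) == fingerprint(trailing_updates)


# Hypothetical updates of one checking interval.
fault_free = [("r1", 5), ("r2", 7), ("mem[0x10]", 12)]
# Same interval with a single bit flipped in the value written to r2.
faulty = [("r1", 5), ("r2", 7 ^ 4), ("mem[0x10]", 12)]

print(interval_matches(fault_free, list(fault_free)))
print(interval_matches(fault_free, faulty))
```

Only one word per interval has to be exchanged between the two executions in this sketch, instead of every individual update, which is the property that makes this style of comparison attractive when the redundant threads run on different cores.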


Dynamic Core Coupling (DCC) [29] allows the detection of and recovery from both hard and soft errors. Furthermore, it can provide on-demand triple modular redundancy at no additional cost by using hot spares. Performance evaluation of DCC shows, compared to single-core execution with no fault tolerance, overheads ranging from 3% to 20% depending on the checkpoint interval.

Software-based redundant multithreading (SRMT) [25] is, compared to previous techniques, a software-only approach for transient fault detection. The SoR covers all instructions except system calls for I/O operations and shared memory access operations. In order to determine SRMT's error coverage, it was applied to several benchmark programs in which a fault injection forced a random single bit-flip in one of the application registers. The evaluation showed that 99.98% of single-bit transient faults were successfully detected. The authors argue that 100% fault coverage cannot be reached because of remaining vulnerabilities, e.g., a value may be corrupted after it is sent to the trailing thread for checking. SRMT can be extended to perform both error detection and recovery by using two trailing threads and a majority voter to recover from a single error. Performance evaluation shows an overhead of 19% compared to non-redundant execution of the program, which can be reduced further with additional support in the instruction set architecture (ISA). Furthermore, compared to HRMT approaches, SRMT provides a flexible program execution environment where legacy binary code and the redundant code can co-exist depending on the desired level of reliability. Additionally, compiler analysis and optimization techniques can reduce data communication requirements by up to 88% compared to HRMT.

V. CONCLUSION

SRT [16] and SRTR [24] are proposals for transient fault detection and recovery based on single SMT processors which rely on special hardware extensions. Fault tolerance on CMPs is usually provided by tightly lockstepping two executions on redundant cores. Lockstepping is a purely hardware-based solution where both input duplication and output comparison are implemented in hardware. Lockstepping ensures that both processors observe identical load values, cache invalidations and external interrupts. This requirement must also be fulfilled by TLR approaches. TLR-based approaches tackle this problem by introducing additional hardware overhead in terms of additional buffers, queues and communication channels. Because device scaling continues, researchers proposed several alternatives to lockstepping. CRT [26], CRTR [27], Reunion [28] and DCC [29] represent TLR approaches for CMPs which show similar or improved performance compared to lockstepping and maintain similar fault coverage while introducing minor hardware overheads for input replication and result comparison. HRMT techniques, however, still suffer from several drawbacks. Due to the special hardware requirements, this approach cannot be used on off-the-shelf processors. Furthermore, the redundant execution introduces significant complexity to system design. Additionally, these approaches do not provide a dual-use capability by supporting both redundant and non-redundant execution depending on the required reliability. In this context, compiler-based SRMT [25] represents a promising strategy with outstanding fault coverage. SRMT is a software-only approach which requires no additional hardware features and provides a flexible program execution environment where legacy binary code and the redundant code can co-exist depending on the required level of reliability. Moreover, compiler analysis and optimization techniques can reduce the data communication requirements of HRMT by up to 88%. Although hardware features could be used to reduce performance overheads, SRMT can also be applied to COTS hardware. Other types of software-based fault tolerance techniques such as SWIFT or ∆-encoding provide fault tolerance by duplicated program execution at instruction level and source level, respectively. Instruction-level techniques are limited to single processors in order to exploit ILP for performance overhead reduction. Moreover, most of the techniques only provide transient fault detection, with the exception of ∆-encoding, which realizes permanent fault detection by applying the AN-coding technique [15]. However, ∆-encoding experiences a massive increase of performance overhead compared to single execution and other techniques. In order to reduce performance overhead, optimization techniques such as SDCTune [32] could be a promising approach. As an alternative to hardware-only and software-only fault detection techniques, which represent sharp trade-offs between hardware cost, reliability and performance, hybrid systems such as CRAFT [33] (a combination of SWIFT and RMT) could enhance characteristics such as reliability, performance and system design.

In summary, each specific technique at the different architecture levels provides its own benefits and drawbacks. Therefore, the application to a system must be decided on the particular set of given design constraints. Furthermore, fault detection and recovery can be realized by more generic approaches such as process-level redundancy (PLR) [34], [35] or redundant virtual machines [36], which show different characteristics regarding determinism, sphere of replication, reliability and performance and therefore constitute promising strategies for further investigations.

REFERENCES

[1] G. Macher, A. Höller, E. Armengaud, and C. Kreiner, “Automotive embedded software: Migration challenges to multi-core computing platforms,” in 2015 IEEE 13th International Conference on Industrial Informatics (INDIN), Jul. 2015, pp. 1386–1393.
[2] E. B. Nightingale, J. R. Douceur, and V. Orgovan, “Cycles, Cells and Platters: An Empirical Analysis of Hardware Failures on a Million Consumer PCs,” in Proceedings of the Sixth Conference on Computer Systems, ser. EuroSys ’11. New York, NY, USA: ACM, 2011, pp. 343–356.
[3] M.-L. Li, P. Ramachandran, S. K. Sahoo, S. V. Adve, V. S. Adve, and Y. Zhou, “Understanding the Propagation of Hard Errors to Software and Implications for Resilient System Design,” in Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS XIII. New York, NY, USA: ACM, 2008, pp. 265–276.
[4] A. Avizienis, J. C. Laprie, B. Randell, and C. Landwehr, “Basic concepts and taxonomy of dependable and secure computing,” IEEE Transactions on Dependable and Secure Computing, vol. 1, no. 1, pp. 11–33, Jan. 2004.
[5] K. Echtle, Klaus Echtle, 1990.
[6] N. Aggarwal, P. Ranganathan, N. P. Jouppi, and J. E. Smith, “Configurable Isolation: Building High Availability Systems with Commodity Multi-core Processors,” in Proceedings of the 34th Annual International Symposium on Computer Architecture, ser. ISCA ’07. New York, NY, USA: ACM, 2007, pp. 470–481.
[7] N. Oh, P. P. Shirvani, and E. J. McCluskey, “Error detection by duplicated instructions in super-scalar processors,” IEEE Transactions on Reliability, vol. 51, no. 1, pp. 63–75, Mar. 2002.
[8] D. M. Tullsen, S. J. Eggers, and H. M. Levy, “Simultaneous Multithreading: Maximizing On-chip Parallelism,” in 25 Years of the International Symposia on Computer Architecture (Selected Papers), ser. ISCA ’98. New York, NY, USA: ACM, 1998, pp. 533–544.
[9] J.-C. Laprie, Dependability: Basic Concepts and Terminology, 1992.
[10] L. Lamport, R. Shostak, and M. Pease, “The Byzantine Generals Problem,” ACM Trans. Program. Lang. Syst., vol. 4, no. 3, pp. 382–401, Jul. 1982.
[11] B. Friedrichs, Kanalcodierung: Grundlagen Und Anwendungen in Modernen Kommunikationssystemen, 1995.
[12] D. Kuvaiskii and C. Fetzer, “Delta-Encoding: Practical Encoded Processing,” in 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, Jun. 2015, pp. 13–24.
[13] J. Braun and J. Mottok, “The Myths of Coded Processing,” in 2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security, and 2015 IEEE 12th International Conference on Embedded Software and Systems, Aug. 2015, pp. 1637–1644.
[14] U. Wappler and M. Muller, “Software Protection Mechanisms for Dependable Systems,” in 2008 Design, Automation and Test in Europe, Mar. 2008, pp. 947–952.
[15] U. Wappler and C. Fetzer, “Hardware Failure Virtualization Via Software Encoded Processing,” in 2007 5th IEEE International Conference on Industrial Informatics, vol. 2, Jun. 2007, pp. 977–982.
[16] S. K. Reinhardt and S. S. Mukherjee, “Transient Fault Detection via Simultaneous Multithreading,” in Proceedings of the 27th Annual International Symposium on Computer Architecture, ser. ISCA ’00. New York, NY, USA: ACM, 2000, pp. 25–36.
[17] H. Kopetz, Real Time Systems - Design Principles for Distributed Embedded Applications, 2011.
[18] G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, and D. I. August, “SWIFT: Software Implemented Fault Tolerance,” in Proceedings of the International Symposium on Code Generation and Optimization, ser. CGO ’05. Washington, DC, USA: IEEE Computer Society, 2005, pp. 243–254.
[19] G. A. Reis, J. Chang, and D. I. August, “Automatic Instruction-Level Software-Only Recovery,” 2007, pp. 36–47.
[20] N. Oh, S. Mitra, and E. J. McCluskey, “ED4I: Error detection by diverse data and duplicated instructions,” IEEE Transactions on Computers, vol. 51, no. 2, pp. 180–199, Feb. 2002.
[21] N. Oh, P. P. Shirvani, and E. J. McCluskey, “Control-flow checking by software signatures,” IEEE Transactions on Reliability, vol. 51, no. 1, pp. 111–122, Mar. 2002.
[22] M. Rebaudengo, M. S. Reorda, M. Violante, and M. Torchiano, “A source-to-source compiler for generating dependable software,” in Proceedings First IEEE International Workshop on Source Code Analysis and Manipulation, 2001, pp. 33–42.
[23] E. Rotenberg, “AR-SMT: A microarchitectural approach to fault tolerance in microprocessors,” in Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing (Cat. No.99CB36352), Jun. 1999, pp. 84–91.
[24] T. N. Vijaykumar, I. Pomeranz, and K. Cheng, “Transient-fault recovery using simultaneous multithreading,” in Proceedings 29th Annual International Symposium on Computer Architecture, 2002, pp. 87–98.
[25] C. Wang, H.-s. Kim, Y. Wu, and V. Ying, “Compiler-Managed Software-based Redundant Multi-Threading for Transient Fault Detection,” in Proceedings of the International Symposium on Code Generation and Optimization, ser. CGO ’07. Washington, DC, USA: IEEE Computer Society, 2007, pp. 244–258.
[26] S. S. Mukherjee, M. Kontz, and S. K. Reinhardt, “Detailed design and evaluation of redundant multi-threading alternatives,” in Proceedings 29th Annual International Symposium on Computer Architecture, 2002, pp. 99–110.
[27] M. Gomaa, C. Scarbrough, T. N. Vijaykumar, and I. Pomeranz, “Transient-fault Recovery for Chip Multiprocessors,” in Proceedings of the 30th Annual International Symposium on Computer Architecture, ser. ISCA ’03. New York, NY, USA: ACM, 2003, pp. 98–109.
[28] J. C. Smolens, B. T. Gold, B. Falsafi, and J. C. Hoe, “Reunion: Complexity-Effective Multicore Redundancy,” in Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO 39. Washington, DC, USA: IEEE Computer Society, 2006, pp. 223–234.
[29] C. LaFrieda, E. Ipek, J. F. Martinez, and R. Manohar, “Utilizing Dynamically Coupled Cores to Form a Resilient Chip Multiprocessor,” in 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’07), Jun. 2007, pp. 317–326.
[30] A. Prodromou, A. Panteli, C. Nicopoulos, and Y. Sazeides, “NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures,” in 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, Dec. 2012, pp. 60–71.
[31] J. C. Smolens, B. T. Gold, J. Kim, B. Falsafi, J. C. Hoe, and A. G. Nowatzyk, “Fingerprinting: Bounding soft-error-detection latency and bandwidth,” IEEE Micro, vol. 24, no. 6, pp. 22–29, Nov. 2004.
[32] Q. Lu, K. Pattabiraman, M. S. Gupta, and J. A. Rivers, “SDCTune: A model for predicting the SDC proneness of an application for configurable protection,” in 2014 International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES), Oct. 2014, pp. 1–10.
[33] G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, D. I. August, and S. S. Mukherjee, “Design and Evaluation of Hybrid Fault-Detection Systems,” 2005.
[34] A. Shye, T. Moseley, V. J. Reddi, J. Blomstedt, and D. A. Connors, “Using Process-Level Redundancy to Exploit Multiple Cores for Transient Fault Tolerance,” in 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’07), Jun. 2007, pp. 297–306.
[35] P. Ulbrich, “Ganzheitliche Fehlertoleranz in eingebetteten Softwaresystemen,” Ph.D. dissertation, 2014.
[36] T. C. Bressoud and F. B. Schneider, “Hypervisor-based Fault Tolerance,” in Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles, ser. SOSP ’95. New York, NY, USA: ACM, 1995, pp. 1–11.

No ratings yet
Final - Study Guide
3 pages
Abstrak Jibran
No ratings yet
Abstrak Jibran
2 pages
Choi Lecture CH19
No ratings yet
Choi Lecture CH19
2 pages
Penetration Testing Fundamentals-2: Penetration Testing Study Guide To Breaking Into Systems
From Everand
Penetration Testing Fundamentals-2: Penetration Testing Study Guide To Breaking Into Systems
Devi Prasad
No ratings yet
Chaos Mesh for Resilient Kubernetes Deployments: The Complete Guide for Developers and Engineers
From Everand
Chaos Mesh for Resilient Kubernetes Deployments: The Complete Guide for Developers and Engineers
William Smith
No ratings yet

A Survey of Fault Tolerance Approaches On Different Architecture Levels

Uploaded by

A Survey of Fault Tolerance Approaches On Different Architecture Levels

Uploaded by

ISBN 978-3-8007-4395-7 117 © 2017 VDE VERLAG GMBH ∙ Berlin ∙ Offenbach

EDDI [7] SWIFT [18] SWIFT-R [19] TRUMP [19] ∆-encoding [12]

AR-SMT [23] SRT [16] SRTR [24] SRMT 6[25]

Error recovery Yes No Yes No

CRT [26] CRTR [27] Reunion [28] DCC [29]
