
1574 IEEE TRANSACTIONS ON NUCLEAR SCIENCE, VOL. 69, NO. 7, JULY 2022

Hybrid Lockstep Technique for Soft Error Mitigation

M. Peña-Fernández, A. Serrano-Cases, A. Lindoso, Member, IEEE, S. Cuenca-Asensi, L. Entrena, Member, IEEE, Y. Morilla, Pedro Martín-Holgado, Student Member, IEEE, and A. Martínez-Álvarez

Abstract: This work presents the evaluation of a new dual-core lockstep hybrid approach aimed at improving fault tolerance in microprocessors. Our approach takes advantage of modern multicore processor resources to combine software-based lockstep with a custom hardware observer. The former is used to duplicate data and instruction flows, while the latter is in charge of control-flow monitoring. The proposal has been implemented in a dual-core ARM microprocessor and validated with low-energy proton irradiation and emulated fault injection campaigns. The results show an improvement of one order of magnitude in the cross section of the benchmarks tested, even considering the worst-case scenario.

Index Terms: Dual cores, fault tolerance, lockstep, multithreading, proton irradiation, soft errors.

Manuscript received 20 December 2021; revised 30 January 2022; accepted 31 January 2022. Date of publication 7 February 2022; date of current version 18 July 2022. This work was supported in part by the Spanish Ministry of Science and Innovation under Project PID2019-106455GB-C22 and Project PID2019-106455GB-C21 and in part by the Community of Madrid under Grant IND2017/TIC-7776.
M. Peña-Fernández is with Arquimea ADS, 28918 Leganés, Spain (e-mail: [email protected]).
A. Serrano-Cases, S. Cuenca-Asensi, and A. Martínez-Álvarez are with the Computer Technology Department, University of Alicante, 03690 Alicante, Spain (e-mail: [email protected]; [email protected]; [email protected]).
A. Lindoso and L. Entrena are with the Department of Electronic Technology, Universidad Carlos III de Madrid, 28911 Leganés, Spain (e-mail: [email protected]; [email protected]).
Y. Morilla and Pedro Martín-Holgado are with the Centro Nacional de Aceleradores (CNA), Centro Nacional de Aceleradores, CSIC, JA, Universidad de Seville, 41092 Seville, Spain (e-mail: [email protected]; [email protected]).
Color versions of one or more figures in this article are available at https://doi.org/10.1109/TNS.2022.3149867.
Digital Object Identifier 10.1109/TNS.2022.3149867

I. INTRODUCTION

COMMERCIAL processors are becoming a commodity for the implementation of critical electronic systems in a multitude of industrial domains: from traditional aerospace and military sectors to emerging markets, such as high-performance computing, autonomous vehicles, or medical appliances. The superior flexibility and performance offered by their advanced multicore architectures make those devices a promising alternative to other specifically designed circuits. Unfortunately, the progressive miniaturization of the electronic components, jointly with the high clock frequencies demanded by new applications, is making the cores more vulnerable to radiation-induced faults. Therefore, providing some kind of fault tolerance to the microprocessors, i.e., the ability to continue the operation even in the presence of faults, is key to enabling these devices for safety-critical applications.

This work is focused on the enhancement of processor fault tolerance to soft errors, i.e., transient faults on memory cells that can eventually lead to system failures. Traditionally, the protection of microprocessors is addressed by means of spatial and/or temporal redundancy to detect and/or correct radiation-induced faults. Depending on how they are implemented, the techniques are usually categorized as hardware- or software-based. Hardware techniques are those based on the replication of processing by means of redundant hardware blocks: registers, memories, or even entire processing units. Dual- and triple-redundant core locksteps (DCLS/TCLS) [1], [2] replicate the whole processor and compare the system output every clock cycle to detect any mismatch during the execution of the code. TCLS, in addition, offers the ability to recover the system using the third core state.

Generally, only output data are checked for errors [2], [3]. However, control-flow errors may cause one of the processors to lose synchronization and eventually hang or get lost. Control-flow errors are not easy to detect as they may not have an immediate observable effect on the computed data. Moreover, it is common in dual cores that one of the processors acts as primary and the other as secondary. In such a case, a hang of the primary can lead to the crash of the entire system.

Software techniques are aimed at protecting code execution on unreliable hardware, mostly commercial off-the-shelf (COTS) devices. Similar to the hardware techniques, they introduce replication at different software levels: programs, functions, loops, instructions, and so on. Although their implementations have a lower impact on development costs compared with hardware techniques, their application presents relevant overheads, in terms of performance and memory footprint, which should be taken into consideration.

Unlike related works, this proposal tries to reduce the impact of the unavoidable unreliable software to achieve a reliable and efficient computation. In fact, neither an operating system (OS) nor external threading libraries are used at all. Our approach exploits the multithreading capability of modern microprocessors by means of multiple instances of the same program running in parallel on separate cores and without any communication between them, excluding a small piece of code for stall and synchronization purposes.
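As a rough illustration of the redundant-execution idea outlined above (two instances of the same code working on private data, compared at a checkpoint and reexecuted from a saved copy on mismatch, as detailed later for DWC-R), the following Python sketch simulates the two replicas sequentially. It is only a sketch under stated assumptions: the names `dwc_r`, `fault_hook`, `row_times_column`, and `flip_bit_once` are hypothetical, and the real technique runs bare-metal threads on two cores rather than a Python loop.

```python
import copy


def dwc_r(compute, inputs, fault_hook=None, max_retries=3):
    """Duplication with comparison and reexecution, simulated sequentially.

    compute    -- function run by both replicas on their private data copy
    inputs     -- data entering the protected region
    fault_hook -- hypothetical test hook: fault_hook(replica_id, attempt, data)
                  may corrupt a replica's private copy to emulate a bit-flip
    """
    checkpoint = copy.deepcopy(inputs)           # third copy, used only to restore
    for attempt in range(max_retries + 1):
        results = []
        for replica_id in (0, 1):                # primary and shadow
            private = copy.deepcopy(checkpoint)  # each replica gets its own copy
            if fault_hook is not None:
                fault_hook(replica_id, attempt, private)
            results.append(compute(private))
        if results[0] == results[1]:             # output checkpoint: compare replicas
            return results[0], attempt           # attempt == number of reexecutions
        # Mismatch: loop again, restoring both replicas from the checkpoint.
    raise RuntimeError("replicas still disagree after max_retries reexecutions")


def row_times_column(data):
    # Toy protected computation: one element of a matrix product.
    row, col = data
    return sum(a * b for a, b in zip(row, col))


def flip_bit_once(replica_id, attempt, data):
    # Corrupt the shadow replica on the first attempt only (single-fault model).
    if replica_id == 1 and attempt == 0:
        data[0][0] ^= 1 << 3


value, retries = dwc_r(row_times_column, ([1, 2, 3], [4, 5, 6]), flip_bit_once)
```

With the injected bit-flip, the first comparison fails and one reexecution from the checkpoint yields the fault-free result. Note that a fault corrupting both replicas identically would go undetected by this comparison alone.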

0018-9499 © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: FUDAN UNIVERSITY. Downloaded on November 10,2022 at 13:29:17 UTC from IEEE Xplore. Restrictions apply.

Usually, mitigation techniques are conceived assuming that memory chips are protected with some kind of error detection and correction (EDAC) mechanism. In our case, no assumptions are made and the proposal includes procedures to mitigate soft errors even for nonprotected memories. In addition, errors that affect the control flow in any core can be detected by a custom hardware observer (IP observer) connected to the trace subsystem. This IP core decodes the program traces on-the-fly and checks the obtained information.

Our hybrid and multithreaded DCLS lockstep-based approach has shown improvements in error mitigation, even for applications with high resilience to radiation. The technique has been validated with low-energy proton irradiation and tested by means of emulated injection campaigns.

The paper is organized as follows. Section II reviews works related to multithreading and lockstep mitigation techniques. Section III introduces the software and hardware mechanisms combined in our proposal. Section IV describes the fault injection campaigns performed and the previous analysis of the results to estimate the contribution of the technique to the system reliability. Section V reports the radiation experiments and their results. Finally, Section VI summarizes the conclusions of this work.

II. RELATED WORKS

The emergence of multicore processors has enabled the execution of multiple copies of the same instruction flow on separate execution units. The technique known as redundant multithreading (RMT), along with the concept of sphere of replication, was proposed in [4] for detection and recovery of soft errors. Basically, the sphere of replication (SoR) defines the set of resources, hardware or software, which are replicated. This way, the values entering the SoR must be replicated, and the values leaving the SoR must be checked to assure their integrity. Initial RMT approaches included the processor pipeline and the register file in the SoR boundaries, relying for the correctness of the execution on helper structures, such as store buffers, load, and branch queues, not present on real processors [5], [6]. Those proposals were tested on simulators with promising results but never evaluated on real devices.

Other approaches deal with soft errors considering their effect on the software running on the system. They address the problem from either the compiler, OS, or application level. Most of them make the assumption that memories are protected by an error detection mechanism. In [7] and [8], a custom compiler transforms the application code into two communicating threads: the leading thread performs all the load/store operations and sends the data to the trailing thread, which replicates the ALU operations and compares the results. The SoR only includes computations; therefore, they suffer from the vulnerable input replication and output comparison processes. Another approach [9] suggests duplicating the memory read/write operation values to solve this problem; however, it increases the synchronization and performance overheads up to 5× given the low granularity of the memory operations.

Other authors [10] propose specific OS services to support RMT execution providing error detection and recovery. In that work, the OS service transparently replicates the execution of applications at the binary level by creating redundant threads in separate address spaces. Some studies propose the use of standard libraries, such as OpenMP and Pthreads [11], to generate redundancy programmatically. Furthermore, authors in [12] and [13] propose the use of custom application programming interfaces (API) and language directives to allow the programmer to define redundant threads and selectively decide the code regions and variables to be protected. Also, software tools, such as Trikaya [14], have been proposed to automatically transform the source and produce a DMR version with rollback reexecution. All those high-level approaches suffer from two main problems. The first is the performance overhead produced by the additional software layers. The second is the increase in the susceptibility to errors due to the complexity introduced by the software stack and the OS itself. That is expressed by a higher number of both silent data corruption events and OS crashes, as pointed out in [15].

There are fewer approaches based on hardware redundancy, as they usually apply architectural modifications to real devices or include custom modules in programmable SoCs. In [16], authors add a hardware module to a standard implementation of a lockstep mechanism over two PowerPC processors hardwired on an FPGA. The module reduces the checkpoint overhead by comparing only the modified addresses and values. A different approach is presented in [17], where two FPGA-based boards are used to implement a rollback recovery method to protect periodic tasks running on two LEON processors. Softcore processors are also employed in [18]. In this case, Microblaze processors are configured in DMR to detect errors in the application outputs; meanwhile, a TMR Picoblaze continuously reads the configuration memory looking for errors. Finally, the work [2] implements a rollback/recovery mechanism using the programmable resources of an FPGA. The application is manually divided into several blocks delimited by verification points, and it is executed simultaneously in both cores of an ARM Cortex-A9 processor. Every time a verification point is reached, the context and data are saved in a dual-port private memory and compared by custom hardware to detect any mismatch.

In our approach, a single program multiple data (SPMD) scheme has been adopted for bare metal applications (without OS), which renders a reduced number of race conditions and lower control overhead compared to traditional solutions. This technique is combined with a custom IP that leverages the information provided by the on-chip debugging facilities to detect any control-flow error on-the-fly. It results in an efficient implementation with a very low area usage.

Among all the reviewed works, only a few were tested in accelerated radiation experiments [2], [11], [14]. More comprehensive reviews of multithreading and lockstep-based mitigation techniques can be found in the respective surveys [19], [20].

III. HYBRID LOCKSTEP APPROACH

Our approach combines hardware and software techniques that exploit common resources available in modern microprocessors. It is intended to be directly applied to them with


minimum additions. Fig. 1 shows the architecture resources that support the proposal. Two redundant threads run simultaneously in two separate cores of a multicore processor sharing the on-chip memory to store a single copy of the code and private copies of the data. A software infrastructure was developed to endow the dual-core system with the ability of running redundant threads on bare metal. It is composed of three elements: first, a modified board support package (BSP) able to boot up the processor in the SPMD mode; second, a memory map and the associated linker scripts to build separate memory sections for each core; finally, a thread support library that includes the synchronization and communication mechanisms, and the implementation of different macros and preprocessing directives to define the region of the code to be protected and the context that has to be restored in the case of error.

Fig. 1. Architecture overview.

The synchronization mechanism follows the spinlock method by means of exclusive load/store on shared variables (locks). Also, shared variables jointly with interrupts are used to implement minimal communication to notify events between threads.

The on-chip trace modules, called program trace macrocells (PTMs), are used to extract execution information, one for each core. The trace information, containing program counter (PC) values related to the executed application of each core, is sent to a custom IP implemented in the programmable logic of the SoC, which is in charge of decoding and monitoring the activity of the cores.

On the one hand, the software part of the technique is able to detect soft errors affecting the data during the computation. The detection of an error triggers a rollback recovery mechanism and can additionally be notified externally for further actions. The hardware IP, on the other hand, reports any anomalous behavior of the instruction flows. Using that information, specific recovery actions may be implemented.

A. Software Implemented Data Protection

Taking advantage of the parallel computing that multiple cores can offer to improve data protection, the multithreaded mitigation technique called duplication with comparison and reexecution (DWC-R), first presented in [21], is also proposed and demonstrated under radiation here. This technique is based on the concept of SoR [22], in which boundaries define the region of the code whose computation must be under protection. Within the SoR, the instruction flow and the involved variables are replicated using lightweight threads. The verification of the correctness of data is only needed when an instruction sends the data outside the SoR. In any other case, the replicated threads progress in parallel working on their own data replica.

The implementation of the aforementioned technique requires some instrumentation of the source code (C or C++). This is undertaken by means of custom C++ macros and preprocessing directives, which are used to annotate what data variables belong to the SoR and, thus, must be under protection, and the source code region (containing all read/write accesses to the SoR) where the mitigation will take place. The user just has to manually include the primitives in the original code, and the compiler automatically produces the multithread version of the executable. The annotation of each data resource distinguishes the functional context of each variable, which means that those variables belonging to the .rodata (read-only data), .bss (uninitialized variables), and .data (initialized variables) data sections are indicated and processed conveniently.

Fig. 2 shows an example of code instrumented with our technique. SoR boundaries are defined by the SYNC macro. It is also used to declare the context restoration and validation points by means of the variables that cross the SoR limits. All variables within the SoR (e.g., fooVar) are duplicated, each in its own core address space. Additional variables involved in the critical computation region (e.g., global variables) need to be explicitly replicated using the XHARD macro, which is overloaded depending on the memory section where the variables will be allocated. Each thread accesses its copy using a pointer created and initialized with the PTR and THREAD_REPLICA_VAR primitives, respectively.

Thread operation and synchronization can be seen in Fig. 3. In order to achieve detection and recovery capability, the SoR is triplicated by means of the replication of each data section. Two copies of each data section (.bss0, .bss1, .data0, .data1, .rodata0, and .rodata1) are automatically generated when spawning the two threads (primary and shadow). The corresponding addresses for each data section are automatically managed by a custom linker script. In addition, DWC-R allocates a third copy of data sections (.bss2, .data2, and .rodata2) to allow the restoration of the SoR variables in the case of error.

When the program starts, two bare-metal threads (primary and shadow) are spawned in parallel using both independent shared memory processing units (Core0 and Core1) to replicate the full program computation. In case the program enters a protected region, the threads are synchronized using a


spinlock barrier, and then the context is automatically updated by the primary thread (SoR-input checkpoint), which notifies the shadow to start the computation of the critical region. Next, a new synchronization is needed to guarantee that both threads have finished the computation before the SoR output checkpoint is reached. At this point, SoR computed variables are compared by the primary thread. In the case of a mismatch, both threads go back to the first checkpoint and perform a context restoration (using the third copy) to reexecute the protected section. Otherwise, the program continues the normal execution flow. It is worth mentioning that the minimal needed code instrumentation (green boxes) does not interfere with the original program flow.

Fig. 2. Snippet of C/C++ code instrumentation.

Fig. 3. Duplication with comparison and reexecution (DWC-R) with two threads. Code program (.text memory section) is shared among both threads, while .data, .bss, and .rodata memory sections are distributed.

B. Hardware Implemented Control-Flow Protection

To provide control-flow protection, we leverage the information available at the trace interface of the processor. The trace interface is a very common resource in modern microprocessors, enabling debug and profiling tasks during application development. However, the trace interface is usually left unused once the application is released, so it can be reused for other purposes. The trace interface is, by design, capable of providing relevant information about processor execution in a nonintrusive manner, without disturbing it, and with low latency. In a multicore system, the trace interface is typically shared between all cores, and it can be effectively used to gather execution information of all of them.

A custom observer IP core has been developed based on the trace interface specification and implemented in VHDL [23]. The IP can receive raw data packets from the trace interface, decode them, and use that information to check the correctness of the execution flow of one or more processors in the system. The decoding and checking processes are performed online along with processor execution, and the latency of the IP is less than 30 clock cycles to determine whether an error has occurred. The time that the trace takes to output the information about execution is also known to be low [24], so the overall detection latency can be as low as 500 ns in a typical implementation.

The program trace macrocells (PTM0 and PTM1) available at the trace subsystem of the SoC processing system have been configured to export trace information related to the PC value of each core during execution. The amount of information exported by the PTM modules is not exhaustive since it is compressed to optimize the bandwidth of the trace interface. However, it is enough to infer the execution flow of each core, as it includes information about all the branches taken by each processor. In the absence of branch information in the trace,


sequential execution is assumed. The observer IP leverages the trace information to gather the PC value of both processors in the system and continuously check that value against a set of allowed ranges configured by the user, where the application code is stored. If at any moment any of the processors presents a PC value outside the allowed ranges, then the processor control flow is wrong, and the application is about to fail, so the IP sets a signal to trigger a system reset with much lower latency than a common watchdog. In addition, the PC value is used to perform an additional check by the use of a PC-based watchdog. For the PC-based watchdog, the IP is continuously looking for a specific user-configurable PC value in the trace information. If the configured PC value does not appear within a configurable time because of a possible hang condition, the IP sets a signal indicating it. The PC-based watchdog has the advantage over a traditional watchdog that it does not rely at all on the application to reload it; instead, it is reloaded using the trace information every time the application reaches a particular point in execution, which can be commonly set as the first instruction of the main application loop.

IV. SOFT ERROR MITIGATION ANALYSIS

To assess our hybrid approach, two fault injection campaigns were carried out. The first campaign was designed to observe the impact of the software-based part of the technique (multithreading DWC-R) applied at different levels of granularity. The second was conceived to estimate the overall contribution of the complete hybrid technique to the applications' reliability.

The matrix multiplication code of the project BEEBS [25] was selected as the benchmark to operate on 20 × 20 32-bit integer matrices. The structure of the code, basically three nested loops, allowed us to analyze the tradeoffs between the granularity of the protection and the reliability obtained. The index of the first loop (the outer loop) runs through the rows of the first matrix. Defining here the boundaries of the SoR means a unique point of checking but a large amount of data to be verified (the whole matrix). The second index runs through the columns of the second matrix (inner loop). This boundary defines multiple checkpoints to verify the correctness of each resultant row. Finally, the innermost loop computes the multiplication of one row and one column to obtain the corresponding element of the result matrix. At this point, every element must be checked, increasing the number of synchronizations between threads but significantly reducing the amount of data to be verified. The output matrix is initialized to zero on each run, and a golden matrix is used to verify correctness. The output initialization ensures that the whole computation is performed and committed to memory, detecting possible masked errors due to intermediate cached calculations.

A. Simulated Fault Injection Campaign

In the multithreading DWC-R analysis, we employed an instruction accurate simulator, WindRiver Simics [26], configured to inject bit-flips in the register files and memory sections of an ARM Cortex-A9 dual-core processor. Besides the unprotected version of the code (Original), four hardened versions were built to run on the dual-core: StackH, DatH-Outer, DatH-Inner, and DatH-Acc. In the StackH version, the matrix multiplication is protected using two threads, one per processor core, and triplicating the automatic variables. This benchmark performs the same computation as the code without protection, except for the inner loop checks performed on the automatic variables. These checks aim to verify that the computation is being performed with the same data and are achieved by checking the indexes that control the loops. Once each value from the output matrix has been calculated, the result is verified with the other thread output before committing the result to memory and exiting the SoR.

Finally, in the DatH-x versions, the SoR is extended from the calculation of each element of the resulting matrix (DatH-Acc) to the data residing on memory, resulting in the triplication of all involved matrices (DatH-Outer). In this case, the verification is carried out once each core has calculated the output matrix, resulting in a coarse grain check. At this point, only two copies of the output matrix are completed. Therefore, to complete the data triplication, the calculated matrices are compared and saved into the third copy if no errors are found. Otherwise, the matrices are restored to the initial state to restart the calculation from an error-free state.

The fault injector was configured to perform 1800 injections per core at the register file (100 · 18 registers), 800 injections per memory section and core (200 injections per data replica at .rodata, .data, .bss, and .stack sections), and 200 injections at the .text section per core. Thus, 5200 faults have been injected in the single-core original version and 10 400 in the multithreaded DWC-R versions. Faults were labeled as unnecessary for architectural correct execution (unACE) when they do not affect the program's output; silent data corruption (SDC) when the result is not correct but the program ends; and HANG if the program does not end or exceeds a time limit. The programs have been evaluated having as a reference a faultless (ground truth) execution of themselves and adding a recovery time equal to the faultless execution duration. If this temporal restriction is exceeded, the program is considered not to meet the validity requirements, and the fault is labeled as HANG.

Raw event rates (SDC and HANG) from the simulated fault injection campaign are shown in Fig. 4. Results demonstrate that the unprotected version presents a high rate of SDC, above 30%, and a very low percentage of HANG. These results are in accordance with the data-intensive nature of the algorithm. This way, the successive multithreading versions clearly decrease the SDC occurrence depending on the amount of data protected and the granularity of the checkpoints. The StackH version only protects the automatic variables (the indexes of the loops), so it gets a modest improvement of 2× but at the cost of increasing the HANG rate up to 8×. The DatH-Acc version protects every result individually; therefore, it offers the best SDC rate (0.8%), improving the baseline rate by 38×. However, it involves a high number of checkpoints and thread synchronizations, which makes the code more prone to control-flow errors. As a consequence, the HANG rate is still increased by 5.7×. The DatH-Outer version presents


the most balanced results due to the low control overhead with a very reduced SDC percentage (1%, i.e., improvement of 29.7×) and the lower HANG rate at the same time (1.8%). In terms of unACE faults, it reaches up to 96.1%.

Fig. 4. Error rates for multithreading DWC-R technique.

TABLE I: EMULATED INJECTION RESULTS

TABLE II: TIME OVERHEADS AND RELATIVE IMPROVEMENT OF THE MWTF METRIC
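The outcome labels used in the simulated campaign (unACE, SDC, HANG) can be sketched as a small classifier. This is a minimal sketch, not the authors' tooling: the function names `label_run` and `rates` are hypothetical, and the time limit is assumed here to be twice the faultless duration, reading "a recovery time equal to the faultless execution duration" as an allowance added on top of the golden run time.

```python
from collections import Counter


def label_run(output, golden, runtime, golden_runtime, finished=True):
    """Label one injected run against a faultless (golden) reference run."""
    # Assumed time budget: faultless duration plus an equal recovery allowance.
    time_limit = 2 * golden_runtime
    if not finished or runtime > time_limit:
        return "HANG"      # never ended, or exceeded the time restriction
    if output != golden:
        return "SDC"       # ended in time but with a wrong result
    return "unACE"         # fault had no effect on the program output


def rates(labels):
    """Percentage of each outcome over all labeled injections."""
    counts = Counter(labels)
    total = len(labels)
    return {k: 100.0 * counts[k] / total for k in ("unACE", "SDC", "HANG")}
```

For instance, `rates(["unACE", "unACE", "SDC", "HANG"])` gives 50% unACE, 25% SDC, and 25% HANG. A run that recovers through reexecution but still finishes within the budget is counted as unACE under this labeling.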

B. Emulated Fault Injection Campaign

The analysis of the proposed technique (data and control-flow protection) was complemented with one additional fault injection campaign over those versions showing the most uneven behavior, i.e., StackH and DatH-Outer. The faults were emulated on a Zybo board [27] connected to an external host (single-board computer), which was in charge of generating random seeds for injection and retrieving test results. In the event of any error during fault emulation, the external host power cycled the Zybo board. The campaigns were focused on injection into the register file, with faults triggered randomly inside the processor itself by software emulated upsets [28] using timer interrupts. Upon each timer interrupt, the current state of the register file is saved on the processor stack before attending the interrupt service routine. Inside the routine, the software randomly introduces a bit-flip in a random bit of a random register of the saved stack state and returns. When the content of the stack is restored into the processor registers, the injected fault becomes effective. Only one fault is injected per benchmark execution, following a single error model, and in the case that reexecution is needed, no additional faults are injected.

The results obtained with this emulated fault injection campaign are shown in Table I. In addition to the recovery capability exposed by the multithread part of the technique, the control-flow protection is able to identify conditions that would lead the dual-core system to lose synchronization and get stuck, such as the triggering of a nonmanaged exception. The observer IP can detect such events, which commonly cannot be corrected, with low latency to trigger a system reset, thus increasing the overall availability of the system. A new category, named potential hang detected (PotHDetected), is used to classify those events. In addition, the label corrected was assigned to cases where data errors were detected and recovered. Columns of Table I show the number (N) and its relative percentage (%) of TotalEvents (total amount of faults that were able to be labeled), HANGs, PotHDetected, SDC, and Corrected events for the Original, StackH, and DatH-Outer versions of the software. In addition, the third column (% TotalInj) indicates the percentage of the labeled faults taking into account the total number of emulated faults.

The campaigns were run until a significant number of events (about 100) was obtained. As expected, the original benchmark presents the lowest reliability in terms of SDC, and in this case, no HANGs were produced. Similar to the simulated campaigns, the StackH version improves the SDC and presents an important number of corrected events, even higher than the DatH-Outer benchmark. Some specific registers are exclusively used to access the stack (where the automatic variables are stored), and the majority of the faults are detected and corrected by the technique. On the contrary, DatH-Outer makes massive use of the memory to operate with the data, which may explain the difference in terms of corrected errors, since the register file is the only target of the faults injected in this campaign. In summary, the hardware-implemented control-flow protection provides an important reduction in the error rate associated with HANG events without interfering with the recovery capability of the data protection.

Finally, in order to estimate the overall reliability improvement of our hybrid approach, the mean work to failure (MWTF) metric is provided. MWTF takes into account not only the fault coverage but also the period of time that the code is exposed to faults. Table II shows the execution time of each benchmark and the relative MWTF, taking the original unprotected code as the baseline. The protected codes present time overheads of 2.9× and 2.5× for StackH and DatH-Outer, respectively. In the case of protecting the stack, the technique produces a large number of checking points and thread synchronizations, which is the main cause of the shown overhead. On the contrary, the coarser granularity protection of DatH-Outer reduces notably the number of checkpoints but

introduces a large number of memory accesses to verify and restore the matrices, which explains the increase in the execution time. Despite the performance penalty incurred by our proposal, the gain in MWTF is remarkable: one order of magnitude when only the stack is hardened and up to 97.7× when all data are triplicated.
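As an illustration of how the relative MWTF values relate to fault coverage and execution time, the following Python sketch assumes the commonly used definition in which MWTF is inversely proportional to the product of the fraction of faults that end in failure and the execution time, with the raw fault rate identical for both versions. This definition is not spelled out in the text, and the function names and the sample inputs are hypothetical; only the 97.7× and 2.5× figures are taken from the discussion above, under the further assumption that both refer to DatH-Outer.

```python
def relative_mwtf(base_error_fraction, base_time, hard_error_fraction, hard_time):
    """Relative MWTF of a hardened version with respect to a baseline.

    Assumption: MWTF is proportional to 1 / (error_fraction * execution_time),
    and the raw fault rate is the same for both versions, so it cancels out.
    """
    values = (base_error_fraction, base_time, hard_error_fraction, hard_time)
    if min(values) <= 0:
        raise ValueError("all arguments must be positive")
    return (base_error_fraction * base_time) / (hard_error_fraction * hard_time)


def implied_error_reduction(mwtf_gain, time_overhead):
    """Error-fraction reduction needed to reach mwtf_gain at a given slowdown."""
    return mwtf_gain * time_overhead


# Hypothetical inputs, not taken from Table I or Table II.
gain = relative_mwtf(base_error_fraction=0.20, base_time=1.0,
                     hard_error_fraction=0.01, hard_time=2.5)
print(f"relative MWTF (hypothetical inputs): {gain:.2f}x")

# Figures quoted in the text: 97.7x MWTF gain at a 2.5x time overhead.
print(f"implied error reduction: {implied_error_reduction(97.7, 2.5):.2f}x")
```

Under these assumptions, a 97.7× MWTF gain obtained at a 2.5× slowdown would correspond to roughly a 244× reduction in the fraction of faults that lead to failure, which shows why a hardened version can be slower and still complete far more work per failure.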

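The register-file injection procedure used in the emulated campaign described above can be modeled in a few lines. The following Python sketch is only a host-side model under stated assumptions, not the bare-metal interrupt service routine: the 16-register, 32-bit context, the function names, and the seed handling are hypothetical.

```python
import random

NUM_REGISTERS = 16   # assumption: r0-r15 of a 32-bit ARM core
REGISTER_BITS = 32   # assumption: 32-bit general-purpose registers


def inject_single_bit_flip(saved_registers, rng):
    """Flip one random bit of one random register in the saved context.

    Mutates saved_registers in place and returns (register_index, bit_index).
    """
    reg = rng.randrange(len(saved_registers))
    bit = rng.randrange(REGISTER_BITS)
    saved_registers[reg] ^= 1 << bit
    return reg, bit


def run_with_injection(registers, seed):
    """Model one benchmark execution with exactly one injected fault."""
    rng = random.Random(seed)       # seed stands in for the one sent by the host
    stacked = list(registers)       # context saved on the stack at interrupt entry
    location = inject_single_bit_flip(stacked, rng)
    return stacked, location        # context restored on interrupt return


# Example: inject into an all-zero register file.
original = [0] * NUM_REGISTERS
restored, (reg, bit) = run_with_injection(original, seed=1234)
print(f"bit {bit} of r{reg} flipped, new value 0x{restored[reg]:08x}")
```

The fault only becomes effective when the modified context is restored, which the sketch mirrors by returning a new register list and leaving the original untouched.
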
V. RADIATION EXPERIMENTS
The original and DatH-Outer benchmarks have been tested in the external beamline of the 18/9 ion beam applications compact cyclotron located at the Centro Nacional de Aceleradores (CNA), Seville, Spain. The device under test (DUT) was a Xilinx Zynq-7010 System on a Chip (SoC) [29] that integrates a hard-core dual-core ARM Cortex-A9 processor [30] along with programmable logic, interconnections, and peripherals. The DUT was mounted on a commercial board (Zybo) [27] and irradiated in open air with 15.2-MeV protons. The energy of the protons in the active area of the silicon is about 10 MeV, which is considered enough to produce single event effects on the 28-nm technology device with no thinning [31].
The DUT configuration is stored in an SD card, which is inserted in the receptacle of the board. Upon power-on, the DUT loads the code from the SD card to on-chip memory (OCM) using a two-stage bootloader that initializes the programmable logic and boots the application. The two-stage scheme is used because the benchmarks run from OCM. It is important to mention that all the benchmarks are executed using only OCM, which is inside the SoC, so all computing hardware, including the memory, is irradiated.
The external observer IP has been implemented in the programmable logic of the device and connected to the trace interface over the extended multiplexed input–output (EMIO) interface available on the Zynq device. The IP leverages the information produced by the PTM of each core, which provides relevant PC values during execution. The processors are running at the nominal 650-MHz clock frequency.
An external host, placed outside the beam and connected to the DUT through a serial communication interface, was used to control the experiment. The benchmarks provide a periodical message if no error is present, and the code is instrumented to provide different codes depending on the observed error. In the case of any error, the host performs a power cycle to restart the DUT.
We distinguish the following error categories.
1) Exception Error: The processor execution flow has been abruptly interrupted by an unexpected exception, probably caused by forbidden memory access. If not handled, this type of error would become a timeout error.
2) Timeout Error: The processor has become unresponsive.
3) Communication Error: The serial communication with the processor has become corrupted, thus making it impossible to identify further errors.
4) Silent Data Corruption (SDC): The benchmark execution has finished with errors in the result matrix.
To test our technique under the worst case scenario, all program memory sections (.rodata, .data, .bss, and .text) were statically linked within the executable, loaded to the OCM, and therefore exposed to the beam. This way, the boundaries of the SoR were extended up to the OCM. To do this, it was necessary to insert routines to verify and restore the three copies of the input data. Also included in the .rodata section were three copies of the golden results and routines to periodically refresh their values to avoid the accumulation of errors between successive runs of the benchmark.
Fig. 5. Cross section (cm²) and 95% confidence intervals for DatH-Outer and unprotected Original benchmarks.
The results obtained from the irradiation campaign are presented in Fig. 5. It shows the cross section of the aforementioned error classification and the cross section of all observed errors (total errors) for the benchmarks tested under radiation. Note that even the algorithm without protection (original), which represents our starting point, presents high resilience to radiation (a low cross section). However, our technique is able to reduce the cross section and, therefore, the vulnerability to soft errors.
As expected, the original benchmark yields the worst results in terms of total errors and SDC. Note that our technique is able to detect the exception and timeout events before they become errors, so they are not accounted for in the total errors category for the DatH-Outer experiment. The resulting cross section is 4.29e-11 cm² for the unprotected benchmark and 3.31e-12 cm² for DatH-Outer, showing an improvement of one order of magnitude. It is worth noting that our proposal was tested under the worst case scenario; thus, presumably further improvements would be obtained using EDAC protected memories.
Most errors are tagged as SDC due to the corruption of the different matrices (input, calculated, and golden). It is remarkable that this protection can reduce the propagation of erroneous outcomes (SDC). More precisely, during the irradiation campaign, the technique was able to correct up to 144 errors, which is a good demonstration of the mitigation capabilities of the proposed technique.
Regarding the timeout errors, results show that the hardening approaches are more prone to hang the platform, which
can be caused by one of the threads missing synchronization. In such a case, the system can enter an abnormal waiting state where both cores are waiting for each other to finish and cannot synchronize. Although undesirable, it can be considered less critical than SDC since this situation may be detected with common processor mechanisms (e.g., watchdog).

VI. CONCLUSION
We have presented a new hybrid soft error mitigation technique for multicore processors based on multithreaded lockstep and a custom hardware IP that uses the trace port of the microprocessor to observe the control flow. The technique has been validated with low-energy proton irradiation and tested by means of emulated injection campaigns. Both campaigns show reliability improvements. On the one hand, fault injection campaigns have demonstrated the detection and recovery capabilities of the proposed approach. On the other hand, the irradiation campaign has validated the reliability improvements observed in the analysis of the fault injection results. Therefore, soft error mitigation is improved by using our hybrid multithreaded lockstep-based approach.

REFERENCES
[1] X. Iturbe, B. Venu, E. Ozer, and S. Das, “A triple core lock-step (TCLS) ARM Cortex-R5 processor for safety-critical and ultra-reliable applications,” in Proc. 46th Annu. IEEE/IFIP Int. Conf. Dependable Syst. Netw. Workshop (DSN-W), Jun. 2016, pp. 246–249.
[2] Á. B. de Oliveira et al., “Lockstep dual-core ARM A9: Implementation and resilience analysis under heavy ion-induced soft errors,” IEEE Trans. Nucl. Sci., vol. 65, no. 8, pp. 1783–1790, Aug. 2018.
[3] F. Abate, L. Sterpone, and M. Violante, “A new mitigation approach for soft errors in embedded processors,” IEEE Trans. Nucl. Sci., vol. 55, no. 4, pp. 2063–2069, Aug. 2008.
[4] S. K. Reinhardt and S. S. Mukherjee, “Transient fault detection via simultaneous multithreading,” ACM SIGARCH Comput. Archit. News, vol. 28, no. 2, pp. 25–36, 2000.
[5] J. Smolens, B. Gold, B. Falsafi, and J. Hoe, “Reunion: Complexity-effective multicore redundancy,” in Proc. 39th Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO), Dec. 2006, pp. 223–234.
[6] C. LaFrieda, E. Ipek, J. F. Martinez, and R. Manohar, “Utilizing dynamically coupled cores to form a resilient chip multiprocessor,” in Proc. 37th IEEE Int. Conf. Dependable Syst. Netw. (DSN), Edinburgh, Scotland, Apr. 2007, pp. 317–326.
[7] K. Mitropoulou, V. Porpodas, and T. M. Jones, “COMET: Communication-optimised multi-threaded error-detection technique,” in Proc. Int. Conf. Compil., Architectures Synth. Embedded Syst. (CASES), 2016, pp. 2.3.1–2.3.10.
[8] Y. Zhang, J. W. Lee, N. P. Johnson, and D. I. August, “DAFT: Decoupled acyclic fault tolerance,” in Proc. 19th Int. Conf. Parallel Archit. Compilation Techn. (PACT), 2010, pp. 87–97.
[9] H. So, M. Didehban, Y. Ko, A. Shrivastava, and K. Lee, “EXPERT: Effective and flexible error protection by redundant multithreading,” in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), Mar. 2018, pp. 533–538.
[10] B. Döbel and H. Härtig, “Can we put concurrency back into redundant multithreading?” in Proc. 14th Int. Conf. Embedded Softw. (EMSOFT), 2014, pp. 1–10.
[11] G. S. Rodrigues, F. Rosa, Á. B. de Oliveira, F. L. Kastensmidt, L. Ost, and R. Reis, “Analyzing the impact of fault-tolerance methods in ARM processors under soft errors running Linux and parallelization APIs,” IEEE Trans. Nucl. Sci., vol. 64, no. 8, pp. 2196–2203, Aug. 2017.
[12] S. Hukerikar, K. Teranishi, P. C. Diniz, and R. F. Lucas, “RedThreads: An interface for application-level fault detection/correction through adaptive redundant multithreading,” Int. J. Parallel Program., vol. 46, no. 2, pp. 225–251, Apr. 2018.
[13] Y.-S. Chen and P.-S. Chen, “A software-based redundant execution programming model for transient fault detection and correction,” in Proc. 45th Int. Conf. Parallel Process. Workshops (ICPPW), Aug. 2016, pp. 66–71.
[14] H. Quinn, Z. Baker, T. Fairbanks, J. L. Tripp, and G. Duran, “Software resilience and the effectiveness of software mitigation in microcontrollers,” IEEE Trans. Nucl. Sci., vol. 62, no. 6, pp. 2532–2538, Dec. 2015.
[15] J. S. Monson, M. Wirthlin, and B. Hutchings, “Fault injection results of Linux operating on an FPGA embedded platform,” in Proc. Int. Conf. Reconfigurable Comput. (FPGAs), Dec. 2010, pp. 37–42.
[16] F. Abate, L. Sterpone, C. A. Lisboa, L. Carro, and M. Violante, “New techniques for improving the performance of the lockstep architecture for SEEs mitigation in FPGA embedded processors,” IEEE Trans. Nucl. Sci., vol. 56, no. 4, pp. 1992–2000, Aug. 2009.
[17] M. Violante, C. Meinhardt, R. Reis, and M. S. Reorda, “A low-cost solution for deploying processor cores in harsh environments,” IEEE Trans. Ind. Electron., vol. 58, no. 7, pp. 2617–2626, Jul. 2011.
[18] H.-M. Pham, S. Pillement, and S. J. Piestrak, “Low-overhead fault-tolerance technique for a dynamically reconfigurable softcore processor,” IEEE Trans. Comput., vol. 62, no. 6, pp. 1179–1192, Jun. 2013.
[19] I. Oz and S. Arslan, “A survey on multithreading alternatives for soft error fault tolerance,” ACM Comput. Surveys, vol. 52, no. 2, pp. 1–38, Mar. 2020.
[20] E. W. Wachter, S. Kasap, X. Zhai, S. Ehsan, and K. McDonald-Maier, “Survey of lockstep based mitigation techniques for soft errors in embedded systems,” in Proc. 11th Comput. Sci. Electron. Eng. (CEEC), Sep. 2019, pp. 124–127.
[21] A. Serrano-Cases, F. Restrepo-Calle, S. Cuenca-Asensi, and A. Martínez-Álvarez, “Multi-threaded mitigation of radiation-induced soft errors in bare-metal embedded systems,” J. Electron. Test., vol. 36, no. 1, pp. 47–57, Dec. 2019.
[22] G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, and D. I. August, “SWIFT: Software implemented fault tolerance,” in Proc. Int. Symp. Code Gen. Optimiz., Mar. 2005, pp. 243–254.
[23] M. Pena-Fernandez, A. Lindoso, L. Entrena, and M. Garcia-Valderas, “The use of microprocessor trace infrastructures for radiation-induced fault diagnosis,” IEEE Trans. Nucl. Sci., vol. 67, no. 1, pp. 126–134, Jan. 2020.
[24] M. Peña-Fernandez, A. Lindoso, L. Entrena, M. Garcia-Valderas, Y. Morilla, and P. Martín-Holgado, “Online error detection through trace infrastructure in ARM microprocessors,” IEEE Trans. Nucl. Sci., vol. 66, no. 7, pp. 1457–1464, Jul. 2019.
[25] J. Pallister, S. J. Hollis, and J. Bennett, “BEEBS: Open benchmarks for energy measurements on embedded platforms,” 2013, arXiv:1308.5174.
[26] P. S. Magnusson et al., “Simics: A full system simulation platform,” IEEE Comput., vol. 35, no. 2, pp. 50–58, Feb. 2002.
[27] Zybo Reference Manual, Digilent, Pullman, WA, USA, 2014.
[28] M. Peña-Fernandez et al., “PTM-based hybrid error-detection architecture for ARM microprocessors,” Microelectron. Rel., vols. 88–90, pp. 925–930, Sep. 2018.
[29] Zynq-7000 All Programmable SoC: Technical Reference Manual, document UG585, Xilinx, San Jose, CA, USA, 2016.
[30] Cortex-A9 Technical Reference Manual r4p1, ARM, Cambridge, U.K., 2012.
[31] A. Lindoso, M. García-Valderas, L. Entrena, Y. Morilla, and P. Martín-Holgado, “Evaluation of the suitability of NEON SIMD microprocessor extensions under proton irradiation,” IEEE Trans. Nucl. Sci., vol. 65, no. 8, pp. 1835–1842, Aug. 2018.
