Hybrid Lockstep Technique For Soft Error Mitigation
IEEE TRANSACTIONS ON NUCLEAR SCIENCE, VOL. 69, NO. 7, JULY 2022
Abstract: This work presents the evaluation of a new dual-core lockstep
hybrid approach aimed at improving fault tolerance in microprocessors. Our
approach takes advantage of modern multicore processor resources to combine
software-based lockstep with a custom hardware observer. The former is used
to duplicate data and instruction flows, while the latter is in charge of
control-flow monitoring. The proposal has been implemented in a dual-core
ARM microprocessor and validated with low-energy proton irradiation and
emulated fault injection campaigns. The results show an improvement of one
order of magnitude in the cross section of the benchmarks tested, even
considering the worst case scenario.

Index Terms: Dual cores, fault tolerance, lockstep, multithreading, proton
irradiation, soft errors.

Manuscript received 20 December 2021; revised 30 January 2022; accepted
31 January 2022. Date of publication 7 February 2022; date of current version
18 July 2022. This work was supported in part by the Spanish Ministry of
Science and Innovation under Project PID2019-106455GB-C22 and Project
PID2019-106455GB-C21 and in part by the Community of Madrid under
Grant IND2017/TIC-7776.
M. Peña-Fernández is with Arquimea ADS, 28918 Leganés, Spain (e-mail:
[email protected]).
A. Serrano-Cases, S. Cuenca-Asensi, and A. Martínez-Álvarez are with the
Computer Technology Department, University of Alicante, 03690 Alicante,
Spain (e-mail: [email protected]; [email protected]; [email protected]).
A. Lindoso and L. Entrena are with the Department of Electronic Technology,
Universidad Carlos III de Madrid, 28911 Leganés, Spain (e-mail:
[email protected]; [email protected]).
Y. Morilla and Pedro Martín-Holgado are with the Centro Nacional
de Aceleradores (CNA), Centro Nacional de Aceleradores, CSIC, JA,
Universidad de Seville, 41092 Seville, Spain (e-mail: [email protected];
[email protected]).
Color versions of one or more figures in this article are available at
https://doi.org/10.1109/TNS.2022.3149867.
Digital Object Identifier 10.1109/TNS.2022.3149867
0018-9499 © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.

I. INTRODUCTION
Commercial processors are becoming a commodity for the implementation of
critical electronic systems in a multitude of industrial domains: from
traditional aerospace and military sectors to emerging markets, such as
high-performance computing, autonomous vehicles, or medical appliances. The
superior flexibility and performance offered by their advanced multicore
architectures make those devices a promising alternative to other
specifically designed circuits. Unfortunately, the progressive
miniaturization of the electronic components, jointly with the high clock
frequencies demanded by new applications, is making the cores more vulnerable
to radiation-induced faults. Therefore, providing some kind of fault
tolerance to the microprocessors, i.e., the ability to continue the operation
even in the presence of faults, is key to enabling these devices for
safety-critical applications.

This work is focused on enhancing the fault tolerance of processors to soft
errors, i.e., transient faults on memory cells that can eventually lead to
system failures. Traditionally, the protection of microprocessors is
addressed by means of spatial and/or temporal redundancy to detect and/or
correct radiation-induced faults. Depending on how they are implemented, the
techniques are usually categorized as hardware- or software-based. Hardware
techniques are those based on the replication of processing by means of
redundant hardware blocks: registers, memories, or even entire processing
units. Dual- and triple-redundant core locksteps (DCLS/TCLS) [1], [2]
replicate the whole processor and compare the system output every clock cycle
to detect any mismatch during the execution of the code. TCLS, in addition,
offers the ability to recover the system using the third core state.

Generally, only output data are checked for errors [2], [3]. However,
control-flow errors may cause one of the processors to lose synchronization
and eventually hang or get lost. Control-flow errors are not easy to detect
as they may not have an immediate observable effect on the computed data.
Moreover, it is common in dual cores that one of the processors acts as
primary and the other as secondary. In such a case, a hang of the primary
can lead to the crash of the entire system.

Software techniques are aimed at protecting the code execution on unreliable
hardware, mostly commercial off-the-shelf (COTS) devices. Similar to the
hardware techniques, they introduce replication at different software levels:
programs, functions, loops, instructions, and so on. Although their
implementations have a lower impact on development costs compared with
hardware techniques, their application presents relevant overheads, in terms
of performance and memory footprint, which should be taken into
consideration.

Unlike related works, this proposal tries to reduce the impact of the
unavoidable unreliable software to achieve a reliable and efficient
computation. In fact, neither an operating system (OS) nor external threading
libraries are used at all. Our approach exploits the multithreading
capability of modern microprocessors by means of multiple instances of the
same program running in parallel on separate cores and without any
communication between them, excluding a small piece of code for stall and
synchronization purposes.
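As a rough illustration of the scheme just outlined (several instances of the same program that interact only at a synchronization point), the following Python sketch runs one workload on two threads, each with a private copy of the input, and compares the results once both have reached a barrier. It is a minimal sketch under stated assumptions, not the bare-metal implementation evaluated in this work: the names `checksum_workload` and `lockstep_run` are hypothetical, and Python threads merely stand in for the two processor cores.

```python
import threading


def checksum_workload(data):
    """Hypothetical workload: a position-weighted sum over the input."""
    return sum((i + 1) * x for i, x in enumerate(data))


def lockstep_run(workload, data, n_replicas=2):
    """Run `workload` once per replica on private input copies, then compare.

    Returns (agree, value): `agree` is True when all replicas produced the
    same result, and `value` is that result (None on a mismatch).
    """
    results = [None] * n_replicas
    barrier = threading.Barrier(n_replicas)

    def run_replica(idx):
        private = list(data)            # each replica works on its own copy
        results[idx] = workload(private)
        barrier.wait()                  # only interaction: stall and synchronize

    threads = [threading.Thread(target=run_replica, args=(i,))
               for i in range(n_replicas)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    agree = all(r == results[0] for r in results)
    return agree, (results[0] if agree else None)


if __name__ == "__main__":
    print(lockstep_run(checksum_workload, [1, 2, 3]))
```

In this sketch a mismatch is only reported; what to do on a mismatch (for example, restoring the inputs and re-executing) is left to the caller.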
Usually, mitigation techniques are conceived assuming that memory chips are
protected with some kind of error detection and correction (EDAC) mechanism.
In our case, no such assumption is made, and the proposal includes procedures
to mitigate soft errors even for nonprotected memories. In addition, errors
that affect the control flow in any core can be detected by a custom hardware
observer (IP observer) connected to the trace subsystem. This IP core decodes
the program traces on the fly and checks the obtained information.

Our hybrid and multithreaded DCLS lockstep-based approach has shown
improvements in error mitigation, even for applications with high resilience
to radiation. The technique has been validated with low-energy proton
irradiation and tested by means of emulated injection campaigns.

The paper is organized as follows. Section II reviews works related to
multithreading and lockstep mitigation techniques. Section III introduces the
software and hardware mechanisms combined in our proposal. Section IV
describes the fault injection campaigns performed and the analysis of their
results to estimate the contribution of the technique to the system
reliability. Section V reports the radiation experiments and their results.
Finally, Section VI summarizes the conclusions of this work.

II. RELATED WORKS
The emergence of multicore processors has enabled the execution of multiple
copies of the same instruction flow on separate execution units. The
technique known as redundant multithreading (RMT), along with the concept of
sphere of replication, was proposed in [4] for detection and recovery of soft
errors. Basically, the sphere of replication (SoR) defines the set of
resources, hardware or software, which are replicated. This way, the values
entering the SoR must be replicated, and the values leaving the SoR must be
checked to assure their integrity. Initial RMT approaches included the
processor pipeline and the register file in the SoR boundaries, relying for
the correctness of the execution on helper structures, such as store buffers
and load and branch queues, that are not present on real processors [5], [6].
Those proposals were tested on simulators with promising results but never
evaluated on real devices.

Other approaches deal with soft errors considering their effect on the
software running on the system. They address the problem from either the
compiler, OS, or application level. Most of them make the assumption that
memories are protected by an error detection mechanism. In [7] and [8],
a custom compiler transforms the application code into two communicating
threads: the leading thread performs all the load/store operations and sends
the data to the trailing thread, which replicates the ALU operations and
compares the results. The SoR only includes computations; therefore, they
suffer from vulnerable input replication and output comparison processes.
Another approach [9] suggests duplicating the memory read/write operation
values to solve this problem; however, it increases the synchronization and
performance overheads up to 5× given the low granularity of the memory
operations.

Other authors [10] propose specific OS services to support RMT execution
providing error detection and recovery. In that work, the OS service
transparently replicates the execution of applications at the binary level by
creating redundant threads in separate address spaces. Some studies propose
the use of standard libraries, such as OpenMP and Pthreads [11], to generate
redundancy programmatically. Furthermore, the authors in [12] and [13] propose
the use of custom application programming interfaces (APIs) and language
directives to allow the programmer to define redundant threads and
selectively decide the code regions and variables to be protected. Also,
software tools, such as Trikaya [14], have been proposed to automatically
transform the source and produce a DMR version with rollback reexecution. All
those high-level approaches suffer from two main problems. The first is the
performance overhead produced by the additional software layers. The second
is the increase in the susceptibility to errors due to the complexity
introduced by the software stack and the OS itself. This is reflected in a
higher number of both silent data corruption events and OS crashes, as
pointed out in [15].

There are fewer approaches based on hardware redundancy, as they usually
apply architectural modifications to real devices or include custom modules
in programmable SoCs. In [16], the authors add a hardware module to a standard
implementation of a lockstep mechanism over two PowerPC processors hardwired
on an FPGA. The module reduces the checkpoint overhead by comparing only the
modified addresses and values. A different approach is presented in [17],
where two FPGA-based boards are used to implement a rollback recovery method
to protect periodic tasks running on two LEON processors. Softcore processors
are also employed in [18]. In this case, Microblaze processors are configured
in DMR to detect errors in the application outputs; meanwhile, a TMR
Picoblaze continuously reads the configuration memory looking for errors.
Finally, the work in [2] implements a rollback/recovery mechanism using the
programmable resources of an FPGA. The application is manually divided into
several blocks delimited by verification points, and it is executed
simultaneously in both cores of an ARM Cortex-A9 processor. Every time a
verification point is reached, the context and data are saved in a dual-port
private memory and compared by custom hardware to detect any mismatch.

In our approach, a single program multiple data (SPMD) scheme has been
adopted for bare metal applications (without OS), which yields a reduced
number of race conditions and lower control overhead compared to traditional
solutions. This technique is combined with a custom IP that leverages the
information provided by the on-chip debugging facilities to detect any
control flow error on the fly. It results in an efficient implementation with
a very low area usage.

Among all the reviewed works, only a few were tested in accelerated radiation
experiments [2], [11], [14]. More comprehensive reviews of multithreading and
lockstep-based mitigation techniques can be found in the respective
surveys [19], [20].

III. HYBRID LOCKSTEP APPROACH
Our approach combines hardware and software techniques that exploit common
resources available in modern microprocessors. It is intended to be directly
applied to them with
sequential execution is assumed. The observer IP leverages the trace
information to gather the PC value of both processors in the system and
continuously checks that value against a set of allowed ranges configured by
the user, where the application code is stored. If at any moment either
processor presents a PC value outside the allowed ranges, then the processor
control flow is wrong and the application is about to fail, so the IP sets
a signal to trigger a system reset with much lower latency than a common
watchdog. In addition, the PC value is used to perform an additional check by
means of a PC-based watchdog. For the PC-based watchdog, the IP continuously
looks for a specific user-configurable PC value in the trace information. If
the configured PC value does not appear within a configurable time because of
a possible hang condition, the IP sets a signal indicating it. The advantage
of the PC-based watchdog over a traditional watchdog is that it does not rely
on the application to reload it; instead, it is reloaded using the trace
information every time the application reaches a particular point in
execution, which can commonly be set as the first instruction of the main
application loop.

IV. SOFT ERROR MITIGATION ANALYSIS
To assess our hybrid approach, two fault injection campaigns were carried
out. The first campaign was designed to observe the impact of the
software-based part of the technique (multithreading DWC-R) applied at
different levels of granularity. The second was conceived to estimate the
overall contribution of the complete hybrid technique to the applications'
reliability.

The matrix multiplication code of the BEEBS project [25] was selected as the
benchmark to operate on 20 × 20 32-bit integer matrices. The structure of the
code, basically three nested loops, allowed us to analyze the tradeoffs
between the granularity of the protection and the reliability obtained. The
index of the first loop (the outer loop) runs through the rows of the first
matrix. Defining the boundaries of the SoR here means a unique point of
checking but a large amount of data to be verified (the whole matrix). The
second index runs through the columns of the second matrix (inner loop). This
boundary defines multiple checkpoints to verify the correctness of each
resultant row. Finally, the innermost loop computes the multiplication of one
row and one column to obtain the corresponding element of the result matrix.
At this point, every element must be checked, increasing the number of
synchronizations between threads but significantly reducing the amount of
data to be verified. The output matrix is initialized to zero on each run,
and a golden matrix is used to verify correctness. The output initialization
ensures that the whole computation is performed and committed to memory,
detecting possible masked errors due to intermediate cached calculations.

A. Simulated Fault Injection Campaign
In the multithreading DWC-R analysis, we employed an instruction-accurate
simulator, WindRiver Simics [26], configured to inject bit-flips in the
register files and memory sections of an ARM Cortex-A9 dual-core processor.
Besides the unprotected version of the code (Original), four hardened
versions were built to run on the dual-core: StackH, DatH-Outer, DatH-Inner,
and DatH-Acc. In the StackH version, the matrix multiplication is protected
using two threads, one per processor core, and triplicating the automatic
variables. This benchmark performs the same computation as the code without
protection, except for the inner loop checks performed on the automatic
variables. These checks aim to verify that the computation is being performed
with the same data and are achieved by checking the indexes that control the
loops. Once each value of the output matrix has been calculated, the result
is verified against the output of the other thread before committing the
result to memory and exiting the SoR.

Finally, in the DatH-x versions, the SoR is extended from the calculation of
each element of the resulting matrix (DatH-Acc) to the data residing in
memory, resulting in the triplication of all involved matrices (DatH-Outer).
In this case, the verification is carried out once each core has calculated
the output matrix, resulting in a coarse-grain check. At this point, only two
copies of the output matrix are completed. Therefore, to complete the data
triplication, the calculated matrices are compared and saved into the third
copy if no errors are found. Otherwise, the matrices are restored to the
initial state to restart the calculation from an error-free state.

The fault injector was configured to perform 1800 injections per core at the
register file (100 · 18 registers), 800 injections per memory section and
core (200 injections per data replica at the .rodata, .data, .bss, and .stack
sections), and 200 injections at the .text section per core. Thus, 5200
faults were injected into the single-core Original version and 10 400 into
the multithreaded DWC-R versions. Faults were labeled as unnecessary for
architectural correct execution (unACE) when the injection does not affect
the program's output; as silent data corruption (SDC) when the result is not
correct but the program ends; and as HANG if the program does not end or
exceeds a time limit. The programs have been evaluated taking as a reference
a faultless (ground truth) execution of themselves and adding a recovery time
equal to the faultless execution duration. If this time restriction is
exceeded, the program is considered not to meet the validity requirements,
and the fault is labeled as HANG.

Raw event rates (SDC and HANG) from the simulated fault injection campaign
are shown in Fig. 4. Results demonstrate that the unprotected version
presents a high rate of SDC, above 30%, and a very low percentage of HANG.
These results are in accordance with the data-intensive nature of the
algorithm. The successive multithreading versions clearly decrease the SDC
occurrence depending on the amount of data protected and the granularity of
the checkpoints. The StackH version only protects the automatic variables
(the indexes of the loops), so it gets a modest improvement of 2× but at the
cost of increasing the HANG rate up to 8×. The DatH-Acc version protects
every result individually; therefore, it offers the best SDC rate (0.8%),
improving the baseline rate by 38×. However, it involves a high number of
checkpoints and thread synchronizations, which makes the code more prone to
control flow errors. As a consequence, the HANG rate is still increased by
5.7×. The DatH-Outer version presents
TABLE I
EMULATED INJECTION RESULTS
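The control-flow checks performed by the observer IP described in Section III (a PC range check plus a PC-based watchdog) can be modeled in software as follows. This is only a behavioral sketch under stated assumptions, not the hardware IP itself: the class name `PcObserver`, the return strings, and the use of "number of observed trace samples" as the watchdog time base are all hypothetical choices made for illustration, and one instance would be needed per monitored core.

```python
class PcObserver:
    """Behavioral model of a PC range checker with a PC-based watchdog.

    allowed_ranges   -- list of (start, end) address pairs, both inclusive,
                        where application code is expected to execute
    watchdog_pc      -- PC value that reloads the watchdog when it is traced
    watchdog_timeout -- number of traced PC samples tolerated without seeing
                        watchdog_pc (assumed time base for this sketch)
    """

    def __init__(self, allowed_ranges, watchdog_pc, watchdog_timeout):
        self.allowed_ranges = list(allowed_ranges)
        self.watchdog_pc = watchdog_pc
        self.watchdog_timeout = watchdog_timeout
        self.samples_since_reload = 0

    def observe(self, pc):
        """Process one traced PC value and return 'ok', 'reset', or 'hang'."""
        # Range check: execution outside the configured code ranges means
        # the control flow is wrong, so a reset is requested immediately.
        if not any(lo <= pc <= hi for lo, hi in self.allowed_ranges):
            return "reset"
        # Watchdog reload: driven by the trace, not by application code.
        if pc == self.watchdog_pc:
            self.samples_since_reload = 0
            return "ok"
        # Watchdog expiry: the expected PC has not been seen for too long.
        self.samples_since_reload += 1
        if self.samples_since_reload >= self.watchdog_timeout:
            return "hang"
        return "ok"


if __name__ == "__main__":
    observer = PcObserver([(0x1000, 0x1FFF)], watchdog_pc=0x1000,
                          watchdog_timeout=3)
    for pc in (0x1000, 0x1004, 0x1008, 0x100C, 0x3000):
        print(hex(pc), observer.observe(pc))
```

Setting `watchdog_pc` to the first instruction of the main application loop mirrors the usage suggested in the text, since that address is traced once per iteration while the application is making progress.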
Fig. 4. Error rates for multithreading DWC-R technique.
TABLE II
TIME OVERHEADS AND RELATIVE IMPROVEMENT OF THE MWTF METRIC
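The injection budget and the fault labels used in the simulated campaign of Section IV-A can be summarized with the short Python sketch below. It rests on two readings of the text that should be treated as assumptions: the 800 injections are taken per memory section (four sections), which is the reading that reproduces the stated totals of 5200 and 10 400, and the time limit is taken as the faultless duration plus an equal recovery allowance. The names `injections_per_core` and `classify` are hypothetical.

```python
from enum import Enum


class Outcome(Enum):
    UNACE = "unACE"   # injection did not affect the program's output
    SDC = "SDC"       # program ended, but the result is wrong
    HANG = "HANG"     # program did not end or exceeded the time limit


MEMORY_SECTIONS = (".rodata", ".data", ".bss", ".stack")


def injections_per_core(per_register=100, registers=18,
                        per_memory_section=800, text_injections=200):
    """Injection count for one core, following the campaign configuration."""
    register_file = per_register * registers                # 100 * 18 = 1800
    memory = per_memory_section * len(MEMORY_SECTIONS)      # 800 * 4 = 3200
    return register_file + memory + text_injections


def classify(finished, output, golden, elapsed, faultless_time):
    """Label one injection run against the faultless (golden) execution."""
    # Assumed limit: faultless duration plus an equal recovery time.
    time_limit = 2 * faultless_time
    if not finished or elapsed > time_limit:
        return Outcome.HANG
    if output != golden:
        return Outcome.SDC
    return Outcome.UNACE


if __name__ == "__main__":
    per_core = injections_per_core()
    print(per_core, 2 * per_core)
    print(classify(True, [1, 2], [1, 2], elapsed=1.2, faultless_time=1.0))
```

Checking HANG before SDC matches the definitions in the text, where a run that overruns its time budget is labeled HANG regardless of the data it eventually produces.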
V. RADIATION EXPERIMENTS
The Original and DatH-Outer benchmarks have been tested in the external
beamline of the 18/9 ion beam applications compact cyclotron located at the
Centro Nacional de Aceleradores (CNA), Seville, Spain. The device under test
(DUT) was a Xilinx Zynq-7010 System on a Chip (SoC) [29] that integrates
a hard-core dual-core ARM Cortex-A9 processor [30] along with programmable
logic, interconnections, and peripherals. The DUT was mounted on a commercial
board (Zybo) [27] and irradiated in open air with 15.2-MeV protons. The
energy of the protons in the active area of the silicon is about 10 MeV,
which is considered enough to produce single event effects on the 28-nm
technology device with no thinning [31].

The DUT configuration is stored in an SD card, which is inserted in the
receptacle of the board. Upon power-on, the DUT loads the code from the SD
card to on-chip memory (OCM) using a two-stage bootloader that initializes
the programmable logic and boots the application. It is important to mention
that all the benchmarks are executed using only OCM memory, which is inside
the SoC, so all computing hardware, including the memory, is irradiated.

The external observer IP has been implemented in the programmable logic of
the device and connected to the trace interface over the extended multiplexed
input–output (EMIO) interface available on the Zynq device. The IP leverages
the information produced by the PTM of each core, which provides relevant PC
values during execution. The processors run at the nominal 650-MHz clock
frequency.

An external host, placed outside the beam and connected to the DUT through
a serial communication interface, was used to control the experiment. The
benchmarks provide a periodic message if no error is present, and the code is
instrumented to provide different codes depending on the observed error. In
the case of any error, the host performs a power cycle to restart the DUT.

We distinguish the following error categories.
1) Exception Error: The processor execution flow has been abruptly
   interrupted by an unexpected exception, probably caused by forbidden
   memory access. If not handled, this type of error would become a timeout
   error.
2) Timeout Error: The processor has become unresponsive.
3) Communication Error: The serial communication with the processor has
   become corrupted, thus making it impossible to identify further errors.
4) Silent Data Corruption (SDC): The benchmark execution has finished with
   errors in the result matrix.

To test our technique under the worst case scenario, all program memory
sections (.rodata, .data, .bss, and .text) were statically linked within the
executable, loaded to the OCM, and therefore exposed to the beam. This way,
the boundaries of the SoR were extended up to the OCM. To do this, it was
necessary to insert routines to verify and restore the three copies of the
input data. Also included in the .rodata section were three copies of the
golden results and routines to periodically refresh their values to avoid the
accumulation of errors between successive runs of the benchmark.

The results obtained from the irradiation campaign are presented in Fig. 5.
It shows the cross section for each of the aforementioned error categories
and the cross section of all observed errors (total errors) for the
benchmarks tested under radiation. Note that even the algorithm without
protection (Original), which represents our starting point, presents high
resilience to radiation (a low cross section). However, our technique is able
to reduce the cross section and, therefore, the vulnerability to soft errors.

Fig. 5. Cross section (cm²) and 95% confidence intervals for DatH-Outer
and unprotected Original benchmarks.

As expected, the Original benchmark yields the worst cross section in terms
of total errors and SDC. Note that our technique is able to detect the
exception and timeout events before they become errors, so they are not
accounted for in the total errors category for the DatH-Outer experiment. The
resulting cross section is 4.29e-11 cm² for the unprotected benchmark and
3.31e-12 cm² for DatH-Outer, showing an improvement of one order of magnitude
(roughly 13×). It is worth noting that our proposal was tested under the
worst case scenario; thus, presumably further improvements would be obtained
using EDAC-protected memories.

Most errors are tagged as SDC due to the corruption of the different matrices
(input, calculated, and golden). Remarkably, this protection can reduce the
propagation of erroneous outcomes (SDC). More precisely, during the
irradiation campaign, the technique was able to correct up to 144 errors,
which is a good demonstration of the mitigation capabilities of the proposed
technique.

Regarding the timeout errors, results show that the hardening approaches are
more prone to hang the platform, which
can be caused by one of the threads missing synchronization. In such a case,
the system can enter an abnormal waiting state where both cores are waiting
for each other to finish and cannot synchronize. Although undesirable, it can
be considered less critical than SDC since this situation may be detected
with common processor mechanisms (e.g., watchdog).

VI. CONCLUSION
We have presented a new hybrid soft error mitigation technique for multicore
processors based on multithreaded lockstep and a custom hardware IP that uses
the trace port of the microprocessor to observe the control flow. The
technique has been validated with low-energy proton irradiation and tested by
means of emulated injection campaigns. Both campaigns provide evidence of
reliability improvements. On the one hand, fault injection campaigns have
demonstrated the detection and recovery capabilities of the proposed
approach. On the other hand, the irradiation campaign has validated the
reliability improvements observed in the analysis of the fault injection
results. Therefore, our hybrid multithreaded lockstep-based approach improves
soft error mitigation.

REFERENCES
[1] X. Iturbe, B. Venu, E. Ozer, and S. Das, “A triple core lock-step
(TCLS) ARM Cortex-R5 processor for safety-critical and ultra-reliable
applications,” in Proc. 46th Annu. IEEE/IFIP Int. Conf. Dependable Syst.
Netw. Workshop (DSN-W), Jun. 2016, pp. 246–249.
[2] Á. B. de Oliveira et al., “Lockstep dual-core ARM A9: Implementation
and resilience analysis under heavy ion-induced soft errors,” IEEE Trans.
Nucl. Sci., vol. 65, no. 8, pp. 1783–1790, Aug. 2018.
[3] F. Abate, L. Sterpone, and M. Violante, “A new mitigation approach for
soft errors in embedded processors,” IEEE Trans. Nucl. Sci., vol. 55,
no. 4, pp. 2063–2069, Aug. 2008.
[4] S. K. Reinhardt and S. S. Mukherjee, “Transient fault detection via
simultaneous multithreading,” ACM SIGARCH Comput. Archit. News,
vol. 28, no. 2, pp. 25–36, 2000.
[5] J. Smolens, B. Gold, B. Falsafi, and J. Hoe, “Reunion: Complexity-effective
multicore redundancy,” in Proc. 39th Annu. IEEE/ACM Int.
Symp. Microarchitecture (MICRO), Dec. 2006, pp. 223–234.
[6] C. LaFrieda, E. Ipek, J. F. Martinez, and R. Manohar, “Utilizing
dynamically coupled cores to form a resilient chip multiprocessor,” in
Proc. 37th IEEE Int. Conf. Dependable Syst. Netw. (DSN), Edinburgh,
Scotland, Apr. 2007, pp. 317–326.
[7] K. Mitropoulou, V. Porpodas, and T. M. Jones, “COMET:
Communication-optimised multi-threaded error-detection technique,” in
Proc. Int. Conf. Compil., Architectures Synth. Embedded Syst. (CASES),
2016, pp. 2.3.1–2.3.10.
[8] Y. Zhang, J. W. Lee, N. P. Johnson, and D. I. August, “DAFT:
Decoupled acyclic fault tolerance,” in Proc. 19th Int. Conf. Parallel
Archit. Compilation Techn. (PACT), 2010, pp. 87–97.
[9] H. So, M. Didehban, Y. Ko, A. Shrivastava, and K. Lee, “EXPERT:
Effective and flexible error protection by redundant multithreading,”
in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), Mar. 2018,
pp. 533–538.
[10] B. Döbel and H. Härtig, “Can we put concurrency back into redundant
multithreading?” in Proc. 14th Int. Conf. Embedded Softw. (EMSOFT),
2014, pp. 1–10.
[11] G. S. Rodrigues, F. Rosa, Á. B. de Oliveira, F. L. Kastensmidt, L. Ost,
and R. Reis, “Analyzing the impact of fault-tolerance methods in ARM
processors under soft errors running Linux and parallelization APIs,”
IEEE Trans. Nucl. Sci., vol. 64, no. 8, pp. 2196–2203, Aug. 2017.
[12] S. Hukerikar, K. Teranishi, P. C. Diniz, and R. F. Lucas, “RedThreads:
An interface for application-level fault detection/correction through
adaptive redundant multithreading,” Int. J. Parallel Program., vol. 46,
no. 2, pp. 225–251, Apr. 2018.
[13] Y.-S. Chen and P.-S. Chen, “A software-based redundant execution
programming model for transient fault detection and correction,” in
Proc. 45th Int. Conf. Parallel Process. Workshops (ICPPW), Aug. 2016,
pp. 66–71.
[14] H. Quinn, Z. Baker, T. Fairbanks, J. L. Tripp, and G. Duran, “Software
resilience and the effectiveness of software mitigation in
microcontrollers,” IEEE Trans. Nucl. Sci., vol. 62, no. 6, pp. 2532–2538,
Dec. 2015.
[15] J. S. Monson, M. Wirthlin, and B. Hutchings, “Fault injection results of
Linux operating on an FPGA embedded platform,” in Proc. Int. Conf.
Reconfigurable Comput. (FPGAs), Dec. 2010, pp. 37–42.
[16] F. Abate, L. Sterpone, C. A. Lisboa, L. Carro, and M. Violante, “New
techniques for improving the performance of the lockstep architecture
for SEEs mitigation in FPGA embedded processors,” IEEE Trans. Nucl.
Sci., vol. 56, no. 4, pp. 1992–2000, Aug. 2009.
[17] M. Violante, C. Meinhardt, R. Reis, and M. S. Reorda, “A low-cost
solution for deploying processor cores in harsh environments,” IEEE
Trans. Ind. Electron., vol. 58, no. 7, pp. 2617–2626, Jul. 2011.
[18] H.-M. Pham, S. Pillement, and S. J. Piestrak, “Low-overhead
fault-tolerance technique for a dynamically reconfigurable softcore processor,”
IEEE Trans. Comput., vol. 62, no. 6, pp. 1179–1192, Jun. 2013.
[19] I. Oz and S. Arslan, “A survey on multithreading alternatives for soft
error fault tolerance,” ACM Comput. Surveys, vol. 52, no. 2, pp. 1–38,
Mar. 2020.
[20] E. W. Wachter, S. Kasap, X. Zhai, S. Ehsan, and K. McDonald-Maier,
“Survey of lockstep based mitigation techniques for soft errors in
embedded systems,” in Proc. 11th Comput. Sci. Electron. Eng. (CEEC),
Sep. 2019, pp. 124–127.
[21] A. Serrano-Cases, F. Restrepo-Calle, S. Cuenca-Asensi, and
A. Martínez-Álvarez, “Multi-threaded mitigation of radiation-induced
soft errors in bare-metal embedded systems,” J. Electron. Test., vol. 36,
no. 1, pp. 47–57, Dec. 2019.
[22] G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, and D. I. August,
“SWIFT: Software implemented fault tolerance,” in Proc. Int. Symp.
Code Gen. Optimiz., Mar. 2005, pp. 243–254.
[23] M. Pena-Fernandez, A. Lindoso, L. Entrena, and M. Garcia-Valderas,
“The use of microprocessor trace infrastructures for radiation-induced
fault diagnosis,” IEEE Trans. Nucl. Sci., vol. 67, no. 1, pp. 126–134,
Jan. 2020.
[24] M. Peña-Fernandez, A. Lindoso, L. Entrena, M. Garcia-Valderas,
Y. Morilla, and P. Martín-Holgado, “Online error detection through trace
infrastructure in ARM microprocessors,” IEEE Trans. Nucl. Sci., vol. 66,
no. 7, pp. 1457–1464, Jul. 2019.
[25] J. Pallister, S. J. Hollis, and J. Bennett, “BEEBS: Open benchmarks for
energy measurements on embedded platforms,” 2013, arXiv:1308.5174.
[26] P. S. Magnusson et al., “Simics: A full system simulation platform,”
IEEE Comput., vol. 35, no. 2, pp. 50–58, Feb. 2002.
[27] Zybo Reference Manual, Digilent, Pullman, WA, USA, 2014.
[28] M. Peña-Fernandez et al., “PTM-based hybrid error-detection
architecture for ARM microprocessors,” Microelectron. Rel., vols. 88–90,
pp. 925–930, Sep. 2018.
[29] Zynq-7000 All Programmable SoC: Technical Reference Manual,
document UG585, Xilinx, San Jose, CA, USA, 2016.
[30] Cortex-A9 Technical Reference Manual r4p1, ARM, Cambridge, U.K.,
2012.
[31] A. Lindoso, M. García-Valderas, L. Entrena, Y. Morilla, and
P. Martín-Holgado, “Evaluation of the suitability of NEON SIMD microprocessor
extensions under proton irradiation,” IEEE Trans. Nucl. Sci., vol. 65,
no. 8, pp. 1835–1842, Aug. 2018.