SOFIA An Automated Framework For Early Soft Error Assessment - Identification and Mitigation

The paper introduces SOFIA, an automated framework designed for early assessment, identification, and mitigation of radiation-induced soft errors in electronic computing systems. SOFIA integrates fault injection techniques, machine learning methods, and various mitigation strategies, including triple modular redundancy and a novel register allocation technique, validated through extensive fault injection campaigns. This framework aims to enhance the reliability of complex software stacks, particularly in safety-critical applications such as autonomous vehicles.


Journal of Systems Architecture 131 (2022) 102710

Contents lists available at ScienceDirect

Journal of Systems Architecture


journal homepage: www.elsevier.com/locate/sysarc

SOFIA: An automated framework for early soft error assessment, identification, and mitigation
Jonas Gava a, Vitor Bandeira a, Felipe Rosa a, Rafael Garibotti b, Ricardo Reis a, Luciano Ost c,∗
a PGMicro - Federal University of Rio Grande do Sul, UFRGS, Brazil
b School of Technology, PUCRS, Brazil
c Wolfson School, Loughborough University, UK

ARTICLE INFO

Keywords: Fault injection; Fault mitigation techniques; Soft error; Reliability

ABSTRACT

The occurrence of radiation-induced soft errors in electronic computing systems can either affect non-essential system functionalities or violate safety-critical conditions, which might incur life-threatening situations. To reach high safety standard levels, reliability engineers must be able to explore and identify efficient mitigation solutions to reduce the occurrence of soft errors at the initial design cycle. This paper presents SOFIA, a framework that integrates: (i) a set of fault injection techniques that enable bespoke inspections, (ii) machine learning methods to correlate soft error results and system architecture parameters, and (iii) mitigation techniques, including: full and partial triple modular redundancy (TMR) as well as a register allocation technique (RAT), which allocates the critical code (e.g., application's function, machine learning layer) to a pool of specific processor registers. The proposed framework and novel variations of the RAT are validated through more than 1739k fault injections considering a real Linux kernel, benchmarks from different domains and a multi-core Arm processor.

1. Introduction

Electronic computing systems are incorporating more functionalities and new technologies into their software stacks (e.g., kernels, drivers and heavy applications). Software stacks running on such architectures differ in terms of security, reliability, performance and power requirements. While supercomputer software development considers performance as the primary criterion, software stacks embedded in cars must comply with strict safety and reliability requirements, defined by specific standards such as the ISO 26262 Road vehicles Functional Safety [1]. Such standards will undoubtedly impose more restrictions, mainly due to the advance of autonomous vehicles, which will make decisions that can put human lives at risk. Such systems are expected to integrate artificial intelligence (AI) and machine learning (ML) techniques that will be just as complex as those found in today's data centres. With the constant growth of software stack code size and complexity, designing fast, flexible and cost-effective tools that enable in-depth soft error susceptibility analysis of complex software stacks has become of utmost importance.

With this in mind, researchers and market leaders are investigating new alternatives, such as the use of virtual platform fault injection (FI) frameworks [2–10]. Such frameworks allow enormous productivity gains over the hardware-based approaches, mainly at the cost of ignoring the impact of new technologies on soft error rates. The majority of such FI frameworks are reasonable environments on which to base detailed soft error assessment and identification. However, to ensure the fail-safe functionality of emerging electronic computing systems, designers should be able to assess, identify and promote efficient alternatives to mitigate the occurrence of soft errors.

This paper, therefore, mainly contributes by describing a pioneering fully automated framework, called SOFIA, which supports fast and early soft error assessment, diagnosis and susceptibility reduction evaluation through the application of software-based mitigation techniques. SOFIA offers a wide range of soft error evaluation opportunities, which go beyond the classical toolsets, thereby furthering potential advantages that fill the gap between the available tools and the industry requirements. Another novelty of this paper is the soft error assessment considering, for the first time, variations of the register allocation technique (RAT) and its joint use with more established mitigation techniques (e.g., triple modular redundancy, TMR) when varying the number of processor cores.

The other contributions of this work are as follows:

∗ Corresponding author.
E-mail addresses: [email protected] (J. Gava), [email protected] (V. Bandeira), [email protected] (F. Rosa), [email protected]
(R. Garibotti), [email protected] (R. Reis), [email protected] (L. Ost).

https://doi.org/10.1016/j.sysarc.2022.102710
Received 6 April 2022; Received in revised form 8 August 2022; Accepted 17 August 2022
Available online 23 August 2022
1383-7621/© 2022 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

• completely automated soft error analysis flow, which leverages the high simulation performance to acquire representative error/failure-related data efficiently;
• integration of soft error mitigation techniques from the literature (e.g., TMR), enabling susceptibility analyses of real software stacks considering different configurations;
• extensive framework validation through more than 1739k fault injections considering a Linux kernel, different benchmarks and a multicore processor.

The rest of this paper is organised as follows. Section 2 presents basic concepts of soft error reliability assessment and related works in virtual platform FI simulators. Section 3 describes the main features of SOFIA and introduces its new hardening module. Section 4 uses examples to validate SOFIA's flow. In Section 5, the efficiency of the adopted mitigation techniques is evaluated considering distinct software stacks. Furthermore, the RAT is evaluated in a specific case study analysing register criticality. Section 6 presents the final remarks and future works.

2. Basic concepts and related works

This Section presents basic concepts and terminologies related to the assessment and mitigation of soft errors. Further, a comparison between the proposed framework and those available in the literature is presented.

2.1. Basic concepts and adopted fault classification

This work considers the definitions from [11] for fault, error, and failure. A fault is an event that may cause a system's internal state to change, e.g., a radiation particle strike. When a fault affects the system's internal state, it becomes an error. If the error causes a deviation of at least one of the system's external states, it is considered a failure. These events are evaluated through fault injection campaigns, each of which comprises a set of simulations affected by faults with well-defined configurations.

This work uses the five-class classification of Cho et al. [12] to group the faults resulting from the fault injection campaigns. Application Output Memory Mismatch (OMM): the application terminates without any error indication; however, the resulting memory is affected (equivalent to a silent data corruption, SDC). Application Output Not Affected (ONA): the resulting memory is not modified; nevertheless, one or more remaining bits of the architectural state are incorrect. Unexpected Termination (UT): the application terminates abnormally with an error indication. Hang: the application does not finish, requiring a preemptive removal after a threshold execution time. Vanished: no fault traces are left.

2.2. Reliability metrics

The adoption of adequate reliability metrics is crucial to guide the analysis of the experiments. This work uses two metrics: the Mean Work To Failure (MWTF) [13] and the Architectural Vulnerability Factor (AVF). The MWTF is defined in Eq. (1) as the workload that a system can complete before failing.

$$\mathrm{MWTF} = \frac{1}{\mathrm{AVF}_{OMM} \times \text{execution time}} \tag{1}$$

This work uses the application's runtime gathered from the gem5 simulator [14] to calculate the MWTF and the fault injection results to measure the AVF. In general, the most critical vulnerability is presented by the occurrence of OMM. For example, in safety-critical applications, such as autonomous cars, an OMM error can alter the detection of an obstacle in front of the vehicle, which can lead to an accident. For this reason, this work employs the OMM to measure the AVF (AVF_OMM).

2.3. Soft error assessment capable VP frameworks

Table 1 summarises the related works in virtual platform FI simulators. The work described in [5] performs fault injections on the NAS Parallel Benchmarks; however, their instrumentation can add up to 30× on top of the tenfold overhead added by QEMU [5]. Another QEMU-based FI framework is presented in [10]. However, it only considers simple benchmarks and single-core processors.

Except for Hari et al. [3], who proposed a hybrid simulation framework (i.e., Simics [18] and GEMS [19] simulators), the remaining works rely on a single VP simulator. In [4], the authors present two FI tools, MaFIN (MARSSx86) and GeFIN (gem5), that support the injection of faults in micro-components such as cache memory control, load-store queue and other SRAM-based arrays. Nonetheless, the validation of both frameworks only considers ten applications [20] running on single-core processors. Another gem5-based FI framework is reported in [6], which fails to consider complex software stacks and OSs in its experiments.

While the authors in [8,21] employ machine learning to reduce the time needed for the fault injection campaigns, Rosa et al. [9] promote a module that uses ML algorithms to correlate large subsets of application profiles and architecture characteristics with fault injection results. The developed ML-based module reduces user intervention and enables the identification of relevant relationships or associations between application profiling and specific single or multicore platform parameters. The developed module was integrated into the first version of the SOFIA framework, which supports an extensive range of fault injection techniques as well as bespoke soft error classification and analysis. The authors in [8] report a 3× slowdown to simulate the application without FI and 35× or more depending on the metrics extracted with FI. In turn, SOFIA presents a worst case of 4× slowdown with FI. Although it does not use a virtual platform to operate, the well-known LLFI fault injector [22] uses the LLVM compiler to instrument the code with fault points, which are considered during its execution on a target architecture.

The authors in [15–17] presented CLEAR, ReDO, and SyRA. These frameworks focus on early cross-layer soft error reliability analysis. The CLEAR framework uses the BEE3 FPGA emulation and RTL simulation for the soft error assessment, while the other two use the GeFIN fault injector (microarchitectural level) and LLFI (software level). It is noteworthy that previous works do not present any feature to apply soft error mitigation techniques, except for Cheng et al. [15], which includes a resilience library with several software and hardware mitigation techniques. However, due to its level of detail, this work does not consider complex processors and software stacks in the conducted experiments. In turn, ReDO and SyRA do not implement any mitigation technique, but they exploit a Bayesian model to describe the target system and estimate the protection impact using the average reliability improvement reported by different works.

2.4. Fault mitigation techniques

The fault mitigation problem can be tackled both in hardware and software. Hardware-based techniques lead to area overhead, thus increasing the system cost. On the other hand, software-based techniques usually incur extra execution time and memory overhead. Software techniques can range from low-level (e.g., modifying or adding assembly code) to high-level approaches (e.g., C/C++ libraries, function wrappers). The former may lead to less overhead, but it is architecture and application dependent, while the latter is only application dependent. Considering the high-level software-based approach, the authors in [23,24] proposed tools that apply fault mitigation techniques in C/C++ applications. Supported transformations are architecture-independent, but the language is fixed, and the compiler may remove redundant code during the optimisation phases. Serrano et al. [25] use genetic


Table 1
Related works in fault injection frameworks developed on the basis of virtual platforms (VPs).
Works Year Virtual platforms Target architecture Features
Single-core Multicore Profiling ML Mitigation
Hari et al. [2,3] 2014 Simics + GEMS ✓
Kaliorakis et al. [4] 2015 MARSSx86 + gem5 ✓
Tanikella et al. [6] 2016 gem5 ✓
Guan et al. [5] 2016 QEMU ✓ ✓
Cheng et al. [15] 2016 BEE3 ✓ ✓
Khosrowjerdi et al. [8] 2018 QEMU ✓ ✓
Savino et al. [16] 2018 gem5 ✓ ✓ ✓ ✓a
Vallero et al. [17] 2018 gem5 ✓ ✓ ✓ ✓a
da Rosa et al. [9] 2019 M*DEV + gem5 ✓ ✓ ✓ ✓
Hauschild et al. [10] 2021 QEMU ✓
This work 2021 M*DEV + gem5 ✓ ✓ ✓ ✓ ✓
a Works that do not support the automatic application of mitigation techniques but rather estimate their impact based on results presented by other works.
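
As a worked illustration of the reliability metrics of Section 2.2, the following minimal sketch computes AVF_OMM and then the MWTF of Eq. (1). It assumes that AVF_OMM is estimated as the fraction of injected faults classified as OMM; the function names, the outcome counts and the execution time are hypothetical values chosen for illustration and are not results from this work.

```python
from collections import Counter

# Fault classes of Section 2.1 (classification of Cho et al.).
CLASSES = ("Vanished", "ONA", "OMM", "UT", "Hang")


def avf_omm(outcomes):
    """Fraction of injected faults classified as OMM (assumed AVF_OMM estimator)."""
    counts = Counter(outcomes)
    unknown = set(counts) - set(CLASSES)
    if unknown:
        raise ValueError(f"unknown fault class(es): {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("empty fault injection campaign")
    return counts["OMM"] / total


def mwtf(avf, execution_time):
    """Eq. (1): MWTF = 1 / (AVF_OMM * execution time)."""
    if avf <= 0 or execution_time <= 0:
        raise ValueError("AVF and execution time must be positive")
    return 1.0 / (avf * execution_time)


# Hypothetical campaign: 1000 injections, 50 of them classified as OMM.
campaign = (["Vanished"] * 700 + ["ONA"] * 150 + ["OMM"] * 50
            + ["UT"] * 80 + ["Hang"] * 20)
avf = avf_omm(campaign)        # 50 / 1000
print(avf, mwtf(avf, 2.0))     # the 2.0 s execution time is an assumed value
```

Under Eq. (1), a mitigation technique only improves the MWTF if it reduces AVF_OMM by a larger factor than it increases the execution time, which is why both quantities are tracked together.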

algorithms to find a combination of optimisation parameters (i.e., compilation flags) that increase the final binary's reliability and present a reasonable trade-off in terms of performance and memory size. Rodrigues et al. [26] implement two mitigation techniques in C code: the Triple Modular Redundancy (TMR) and the Conditional Modular Redundancy (CMR). Their results show that neither technique provides reasonable protection to a complex system executing a Linux kernel. According to the authors, the OS itself is an enormous source of errors and needs to be protected if employed in safety-critical systems. James et al. [27] extended the COAST (COmpiler Assisted Software fault Tolerance) tool to automatically insert redundancies (DMR and TMR) in the program's intermediate code through an LLVM plugin. The latest version added support for new processor platforms such as RISC-V and Xilinx SoC-based products. Their mitigation tool is similar to ours, but their soft error experiments only assess the impact of using redundancy-based mitigation techniques in two simple applications.

The main downside of the approaches mentioned above is that parts of the protected code may be wrongly removed during the compiler optimisations. A solution is to modify the assembly code, which can be done using an existing compiler infrastructure or by building a separate tool to manage the code. Regarding the techniques implemented inside compilers, the most relevant ones use either GCC [13,28] or LLVM [7,29–31]. Some focus on error detection by duplicating instructions (i.e., DMR) [13] or symptom-based mechanisms [29]. In contrast, other works focus on error recovery by instruction triplication with majority voters (i.e., TMR) [7,31] or duplication with checkpoints [30]. Moreover, a few works rely on their own tools to manage the application code [32,33]. The authors in [32] developed a generic intermediate code for the hardening process and a compilation infrastructure (frontend and backend) to transform the high-level code language to generic instructions (GI) and later to the target architecture code. In this direction, the authors in [33] proposed the CFT-tool, a framework that modifies assembly code by applying different data-flow and control-flow protection techniques. Compared to our work, the lower-level language tends to have higher efficiency in the reliability result. However, working at this level is not feasible when applied to complex software stacks, thus limiting its usability.

Hybrid mechanisms are complementary and try to achieve better efficiency by combining both hardware and software approaches. Kasap et al. [34] present a hybrid mitigation technique called triple-core lockstep with roll-back and roll-forward recovery. This technique works by running the same application simultaneously on three CPUs (2 Arm and 1 MicroBlaze) and a checker module that monitors the application execution to detect and correct inconsistencies. The monitor module and the MicroBlaze processor have TMR applied in hardware, and the application is partitioned in blocks and combined with redundant code and recovery routines. Although this technique seeks the best of both mechanisms, it still suffers from hardware limitations (e.g., area) and demands a bespoke development for each specific device, which reduces its usability.

Reviewed works either consider simple in-house applications, or do not present any feature to apply fault mitigation techniques (Section 2.3), or propose and evaluate them on an individual basis, i.e., they are standalone solutions not attached to an automated framework (e.g., [26,34]). In summary, SOFIA differs from the previous works in four main aspects:

1. SOFIA is a "one-stop-shop" to assess, identify, and protect applications with support for fast simulation, in-depth analysis, multiple architectures, and different programming languages.
2. During the soft error identification process, users can leverage machine learning techniques to identify the correlation between fault injection results and multiple parameters (i.e., application and platform characteristics), which is missing in previous works.
3. It includes an automated mitigation flow that can be used early in the design process to reduce the susceptibility of applications to soft errors by applying various system-level mitigation techniques.
4. It supports a lightweight technique (i.e., RAT), which can be tuned according to application needs and the available hardware resources.

3. SOFIA framework

The SOFIA framework¹ has many features that can work independently or as part of a complex workflow. The framework generates an extensive and detailed soft error analysis for a given application and protects it by applying mitigation techniques, such as TMR, P-TMR and RAT. This Section describes the modules currently integrated into our automated framework, as shown in Fig. 1.

¹ Available at: https://github.com/ManyCoreResearchTeam/SOFIA

3.1. Cross-compilation module

SOFIA integrates a cross-compilation module that supports both the GCC and the Clang/LLVM² compilers. While GCC is the default compiler available on most GNU/Linux distributions, LLVM has many valuable resources to enhance our framework, as LLVM aims to make program analysis and transformation available for arbitrary software in a manner that is transparent to programmers [35].

² Note that the Application Hardening Module only works with code compiled with LLVM.

3.2. Framework simulation module

The framework simulation module supports two well-known simulators. The gem5 [14] simulator provides a cycle-accurate simulation and many insightful metrics of the underlying architecture at the cost of simulation speed. In turn, the Multicore Developer (M*DEV) virtual platform [36]


Fig. 1. Automated modular framework, highlighting the new hardening module that includes three mitigation techniques: TMR, P-TMR and RAT.

sacrifices cycle accuracy in favour of speed by simulating only the behaviour of instructions. For best results, users can leverage the accuracy of gem5 to perform application and architecture profiling, while M*DEV is better suited for massive fault injection campaigns and soft error assessment. Note that the same application binary and Linux kernel can be used in both simulators.

The framework simulation module also presents two important features to speed up fault injection simulation: checkpoint and distributed simulation. The first technique is used to reduce the execution time of an FI simulation. During a fault-free execution, several checkpoints are created by collecting the software context of platform components. Then, before starting a new FI simulation, this module restores the software context by choosing the checkpoint closest to the fault injection time, avoiding re-executing the initial fault-free code.

The second technique reduces the time spent on large fault campaigns by distributing the FI simulations as jobs across computers with multiprocessing capabilities (e.g., data centres). For example, consider an FI campaign having a distributed fault injection process across hundreds or thousands of parallel executions. First, jobs responsible for the fault injection are submitted to the scheduler queue. According to the system resource availability, these jobs are assigned to the compute nodes independently. Note that the entire fault injection process and the classification analysis occur locally at each node. Finally, all generated individual reports are merged and analysed by other modules (i.e., Machine Learning Module and Analysis and Visualisation Module).

3.3. Profiling module

The tracing and profiling of SOFIA enable the collection of crucial information about the behaviour of applications executing in the presence of faults. For instance: (i) the number of executions of each instruction by its opcode (e.g., add, bneq) or its class (e.g., arithmetic, branch); (ii) the utilisation percentage of the processor registers, which allows the evaluation of the reliability of different instructions and of register criticality. To execute the target platform binary on the host machine in M*DEV, the simulator needs to translate the opcode from the target machine into host machine code. It uses a call-back to perform this translation. This call-back is responsible for fetching an instruction, decoding it and then calling the routines that describe its behaviour. During the call-back, a user-defined routine changes the instruction's behaviour, which enables the generation of a bespoke profile.

3.4. Fault injection module (FIM)

The developed fault injection module emulates the occurrence of single-bit-upsets (SBUs) and multiple-bit-upsets (MBUs) by injecting flipped bit(s) in registers or memory addresses during the execution of a given software stack. It is also possible to define whether the MBUs will affect sequential or random bits of the same word. In our setup, bit-flips target only storage elements due to their higher susceptibility to radiation events when compared with logic elements [37]. The fault injection configuration (e.g., bit location, injection time) relies on a random uniform function, which is a well-accepted fault injection technique since it covers the majority of possible faults on a system at a low computation cost. Fault injections occur during the target application lifespan, i.e., the OS startup is not subject to faults, but OS system calls and parallelisation API subroutines arising during this period are. This approach allows identifying unexpected application execution errors (e.g., segmentation fault), which correlate with adopted OS components or API libraries.

The FIM module is also responsible for generating the reference information obtained through the golden execution and for the configuration and execution of the fault injection campaigns. In addition, this module has support for distributed (several hosts within the same network) and parallel (multiple processors and cores in the same host) fault injection simulations [9]. The FIM module is available for gem5 and M*DEV, providing high fault injection controllability, considering real and complex software stacks including applications, operating system and API structure/functionality. All of this is done without any modification to the target software.

3.5. Machine Learning Module

Converting fault injection explorations into actual system reliability improvements is not straightforward. SOFIA includes an ML-based module [9], i.e., a cross-layer investigation toolset that uses supervised and unsupervised techniques and methods (e.g., linear regression) to recognise patterns in sizeable data sets obtained from large fault campaigns while guiding the target application development towards a more dependable execution. The underlying toolset can effectively boost the soft error assessment, allowing users to explore trends quickly, considering complex data sets while avoiding human bias.

SOFIA's ML module provides engineers with automatic and appropriate means to investigate the soft error reliability of complex software stacks, considering state-of-the-art ISAs and compilers. Conducting the soft error analysis of such complex systems is impractical at lower levels, and time-consuming and error-prone when manually conducted, mainly due to the complexity and size of the failure-related data sets that might be involved. The manual process involves multiple files/spreadsheets and manual report generation, and the trustworthiness of the results depends on how attentive and accurate the operators are. Finally, identifying correlations may require hours of manual data analysis and ploughing through large files of simulation records to show that the results are consistent. The automated ML data analysis may provide engineers with reasonable and adequate results that can guide them to select an existing fault mitigation technique (e.g., device hardening, redundant execution) or investigate a new one to improve the entire system reliability at early design phases.
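
To make the kind of correlation described in Section 3.5 concrete, the following minimal sketch fits a least-squares line between one profiling feature and one fault injection result and reports the Pearson correlation coefficient. It is not the ML module of SOFIA; the function name `fit_line`, the chosen feature (share of memory instructions) and all data points are hypothetical and serve only to illustrate linear regression, one of the methods mentioned above.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept, plus Pearson r."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally sized samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("constant sample: correlation is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r


# Hypothetical per-application data (not measured results):
# share of memory instructions (%) versus OMM rate (%).
mem_instr_pct = [10.0, 20.0, 30.0, 40.0]
omm_rate_pct = [1.0, 2.1, 2.9, 4.0]

slope, intercept, r = fit_line(mem_instr_pct, omm_rate_pct)
print(f"slope={slope:.3f} intercept={intercept:.3f} r={r:.3f}")
```

A coefficient close to 1 or -1 would flag the feature as a candidate for closer inspection; it does not by itself establish that the feature causes the observed failures.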

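The fault model of the fault injection module (Section 3.4) can be sketched as follows: a fault configuration is drawn from a uniform random function, and the selected bit(s) of one word are inverted. This is an illustrative sketch only, not the FIM implementation; the function names, the 32-bit word width, the register count and the cycle budget are assumptions made for the example.

```python
import random


def flip_bits(word, bit_positions, width=32):
    """Return `word` with the given bit positions inverted (XOR with a mask)."""
    mask = 0
    for pos in bit_positions:
        if not 0 <= pos < width:
            raise ValueError(f"bit {pos} is outside a {width}-bit word")
        mask |= 1 << pos
    return (word ^ mask) & ((1 << width) - 1)


def random_fault(rng, n_registers, total_cycles, n_bits=1,
                 sequential=True, width=32):
    """Draw one fault configuration uniformly at random.

    n_bits == 1 models an SBU; n_bits > 1 models an MBU within one word,
    affecting either sequential (adjacent) or random bit positions.
    """
    if not 1 <= n_bits <= width:
        raise ValueError("n_bits must be between 1 and the word width")
    register = rng.randrange(n_registers)   # target storage element
    cycle = rng.randrange(total_cycles)     # injection time
    if sequential:
        start = rng.randrange(width - n_bits + 1)
        bits = list(range(start, start + n_bits))
    else:
        bits = sorted(rng.sample(range(width), n_bits))
    return {"register": register, "cycle": cycle, "bits": bits}


rng = random.Random(2022)  # fixed seed so a campaign can be reproduced
fault = random_fault(rng, n_registers=16, total_cycles=1_000_000, n_bits=2)
corrupted = flip_bits(0x000000FF, fault["bits"])
print(fault, hex(corrupted))
```

Seeding the generator is a design choice for this sketch: it lets a given fault be replayed, which is convenient when a particular UT or Hang outcome needs to be inspected again.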

3.6. Analysis and visualisation module

SOFIA also integrates a module to analyse and visualise the data either gathered from the fault injection campaigns or resulting from the ML-based module. The framework can perform soft error analysis using well-established (Section 2.1) or user-customised classifications. The availability of this visualisation feature can increase the productivity of application/reliability engineers, making raw data easier to understand and patterns easier to find.

3.7. Application hardening module

A processor-based system can be affected by two main types of soft errors: control-flow and data-flow. While a control-flow error occurs when an error causes a deviation from the correct program flow (e.g., incorrect branch), a data-flow soft error refers to the occurrence of a bit-flip in a storage element, such as a register or memory block. The application hardening module integrates data-flow mitigation techniques because most soft errors affect the data-flow rather than the system's control-flow [38]. The underlying software-based mitigation techniques are architecture-independent, and their development is underpinned by a set of rules proposed in [39], which target data-flow strategies that aim to detect faults affecting values stored in register banks and memory elements.

The proposed hardening module relies on the LLVM compiler to implement the mitigation techniques without modifying the original application source code, as demonstrated by Bohman et al. [40]. First, the LLVM compiler receives the high-level language source code (e.g., C/C++) and outputs the LLVM IR instructions, which describe a program using an abstract RISC-like instruction set. Based on the LLVM IR, this module applies code optimisations (e.g., -O3) and implements the chosen mitigation technique. Lastly, it transforms the IR code into assembly code for a given target (e.g., x86, ARM, RISC-V). Note that modifications related to the mitigation techniques are performed before assembly code generation to ensure architecture independence.

The new hardening module incorporates three soft error mitigation techniques, which are described as follows.

3.7.1. Full and partial triple modular redundancy

SOFIA supports two selective TMR variants based on the VAR3+ technique [41]. These techniques were chosen due to their capability of increasing reliability while maintaining a low overhead w.r.t. other TMR-based solutions. The first version has the original VAR3+ definition, where all instructions, except for branches and stores, are replicated three times (rules G1, D2). In this regard, majority voters check the replicas before every load, store, or branch instruction (rules C3, C4, C5, C6). This approach usually incurs a non-negligible performance penalty, which might restrict its adoption on resource-constrained devices. In the second version, the VAR3+ technique can be applied to one or more critical functions instead of the entire application code. This approach can considerably reduce code size and execution time while maintaining similar reliability. In this work, these techniques are named TMR and P-TMR, respectively.

3.7.2. Register Allocation Technique (RAT)

SOFIA also incorporates a lightweight technique called Register Allocation Technique (RAT) [42], which restricts the number of avail-

will generate an error. Moreover, even when the application compiles correctly, restricting the pool's size causes the allocator to spill more register values into memory, significantly reducing the application's performance.

The software engineer can also set the list of application functions (e.g., the critical ones) that must be hardened. Functions are limited to those available in the program's source code; that is, functions from external libraries are not affected, as they are already in machine code format. After setting the parameters, the LLVM static compiler (LLC) produces the assembly code, and finally, the binary is generated by any compatible linker (e.g., LLD, GOLD, GNU LD). Note that the RAT mitigation technique does not work properly on modern out-of-order processors that use register renaming. However, RAT can increase the soft error reliability of applications running on resource-constrained devices that employ simple in-order microprocessors [43,44]. For instance, results presented in [44] show that RAT significantly reduces the occurrence of soft errors when applied to the MobileNet CNN model. Furthermore, the preliminary results from neutron radiation tests also suggest that RAT can improve the soft error resiliency of a CNN application running on an Arm Cortex-M4 processor [45].

4. SOFIA validation

This Section shows SOFIA's flow and presents a case study used to validate the proposed framework.

4.1. SOFIA's automated flow

Fig. 2 shows the automated flow supported by SOFIA, which comprises eleven steps distributed between the seven modules. The first step is the (1) Software Stack Configuration, where the software engineer configures the software stack and selects the cross-compiler based on the target platform. Then, the (2) Cross-Compilation step is responsible for compiling the application's source files to the target platform using the parameters defined in the previous step. This step generates the unprotected binary version of the application. If the LLVM compiler was previously selected, it also generates the LLVM intermediate representation (LLVM IR), which describes a program using an abstract RISC-like instruction set.

The (3) Function Profile step has two responsibilities. First, it simulates the unprotected binary using gem5 and M*DEV to verify its correct execution. Second, it runs the profiling tool with the trace flag enabled to capture the execution time of each function of the application, revealing the critical one. Note that some profile data might be used to feed the Machine Learning Module.

The (4) Hardening Configuration step is used to set whether the application or one of its functions must be protected. If so, this step will generate LLVM parameters to apply one or a combination of the chosen hardening techniques. If no protection is chosen, the flow goes to the fault injection steps. If the TMR mitigation technique is selected, the flow goes to the LLVM compilation step. Otherwise, the function to be protected must be chosen, and the flow goes to step (5). Note that step (6) will only be covered if the RAT mitigation technique is the user's option.

The (5) Critical Function Selection is the next step for P-TMR or RAT hardening protection. Here the software engineer can either
able registers for specific functions aiming to minimise the number manually determine the most critical application function (s) or use the
of vulnerable registers. RAT can be fine-tuned to achieve the highest default option, which selects the most executed one (i.e., the function
level of relative trade-off in terms of performance and reliability. For with the higher probability of being struck by a fault). For RAT, the (6)
instance, the software engineer can set the registers’ pool parame- Register Profile step is performed, and the register’s usage is obtained.
ter, aiming to restrict the application’s register allocation considering The (7) LLVM Compilation is a mandatory step for any of SOFIA’s
the minimum number of registers demanded by each application and hardening protection. First, the LLVM optimiser transforms the LLVM
their availability in the target architecture. For example, the standard IR code, generated in step (2), and applies a hardening technique to
instruction ADD R0, R1, R2 presented in most ISAs, needs three the target application. For example, in the case of RAT, the compila-
registers to work. If the registers’ pool is set to {R0, R1}, the compiler tion is performed considering the critical function and a register pool

(i.e., least used registers). Then, the LLVM static compiler transforms the IR code into an assembly code for the target architecture. Code optimisation is disabled in this step so that the inserted instructions are carried through to the final assembly code. Finally, the LLD linker is used to link the assembly code and generate the hardened binary code. However, if the software engineer decides on combined protection, the flow returns to step (5) once before generating the hardened binary code.

After generating the binaries, the software engineer configures the fault injection parameters, such as the number of fault injection campaigns, in the (8) Fault Injection Configuration step. Then, the (9) Fault Injection Framework executes the binary and emulates the occurrence of bit-flips in the registers. All fault injection reports are saved for later analysis. Furthermore, steps (8) and (9) are repeated for each generated binary.

Data from profile and FI reports feed into the Machine Learning Module in the (10) ML Applied Techniques step. This step uses machine learning techniques to perform multi-variable and statistical analysis using the provided data to find correlations that can help choose the correct hardening technique. In addition, this step also generates graphics to further assist software engineers in the analysis of the results.

Finally, the (11) Fault Injection Analysis step provides a soft error reliability assessment of the generated binaries. It includes the Analysis and Visualisation Module that helps software engineers to examine the data obtained from the fault injection campaign by facilitating the understanding of the raw data and the recognition of patterns.

Fig. 2. Proposed automated flow for early soft error assessment, identification and mitigation.

4.2. Case study

This Section uses the OpenMP version of the Integer Sort (IS) application from NAS Parallel Benchmarks (NPB) [46] to validate the efficiency of SOFIA's flow. For the first two steps, CMake configuration scripts are used to set compiler-related ISA flags. After this initial setup, the application is compiled using the template CMake script with optional specific compilation flags (e.g., -O2, -mcpu, -mfloat-abi), which are passed through the command line in step (1). After generating the unprotected binary for the target platform, the application is simulated without faults, and the profiling in step (3) is performed using gem5 and M*Dev simulators. Data from these initial simulations are then used by both Application Hardening and ML modules.

Fig. 3 shows the application's unhardened function profile when running on a quad-core Arm Cortex-A72 processor. The most executed function is .omp_outlined.6, created by the OpenMP library. There are dozens of functions in the "others" category, of which more than 90% belong to Linux routines. In addition, the thread management routines (e.g., kmp_fork_barrier, kmp_join_barrier, kmp_hyper_barrier) represent the majority of the total execution time. These thread synchronisation functions are more sensitive to register bit-flips.

Fig. 3. Sub-figures (a–d) show the function profile for each core on a quad-core processor for the unhardened program version. Function A: .omp_outlined.6, B: randlc, C: kmp_hyper_barrier, D: kmp_fork_barrier, E: kmp_join_barrier, F: kmp_barrier, G: others.

For an initial soft error assessment, three fault injection campaigns were performed varying only the number of processor cores (i.e., 1, 2, and 4). Each FI campaign injects 3.1k random bit-flips in the general-purpose registers of the Cortex-A72 processor, i.e., a 99% confidence level and a 2.3% error margin according to [47]. Table 2 shows the overall results of the FI campaigns. The more cores, the greater the weight of the operating system thread management routines, which leads to an increase in OMMs and a decrease in the occurrence of Vanishes. Consequently, the MWTF is also severely affected, with a 19.40× reduction from single to dual-core and a further 2.29× reduction from dual to quad-core.

Table 2
NAS parallel fault injection campaign results.
Core variant   Van.    ONA    OMM     UT      Hang   MWTF
Single         91.45   0.03   1.32    7.10    0.10   88.45
Dual           79.16   0      9.35    11.35   0.13   4.56
Quad           74.68   0.29   13.42   11.35   0.26   1.99
Soft error values are in %.
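
As a rough cross-check of the figures quoted in this subsection, the sketch below recomputes the error margin of a 3100-injection campaign and the MWTF reductions from the last column of Table 2. It assumes the large-population limit of the statistical sampling formula (no finite-population correction), the worst-case probability p = 0.5, and a normal quantile of 2.576 for 99% confidence. The function names are illustrative and are not part of SOFIA.

```python
import math

Z_99 = 2.576  # two-sided normal quantile for a 99% confidence level


def error_margin(n_injections, z=Z_99, p=0.5):
    """Error margin for n random injections (large-population approximation).

    Assumption: the fault space is much larger than the sample, so the
    finite-population correction is ignored; p = 0.5 is the worst case.
    """
    return z * math.sqrt(p * (1.0 - p) / n_injections)


def reduction_factor(mwtf_before, mwtf_after):
    """How many times the MWTF shrinks between two configurations."""
    return mwtf_before / mwtf_after


# MWTF column of Table 2
mwtf = {"single": 88.45, "dual": 4.56, "quad": 1.99}

margin = error_margin(3100)
single_to_dual = reduction_factor(mwtf["single"], mwtf["dual"])
dual_to_quad = reduction_factor(mwtf["dual"], mwtf["quad"])

print(f"error margin: {margin * 100:.1f}%")
print(f"single -> dual: {single_to_dual:.2f}x")
print(f"dual -> quad: {dual_to_quad:.2f}x")
```

Under these assumptions the three values come out as 2.3%, 19.40× and 2.29×, in agreement with the text.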
The ML module was used to speed up the analysis of the application's behaviour considering different configurations and parameters. Fig. 4 shows four ML correlations generated from the case study FI reports. Each circle indicates an FI campaign that considers different parameters such as the number of CPU cores and the compiler optimisation level, among others. Fig. 4(a) shows that the increase in branch instructions is directly related to the higher incidence of Hangs. Fig. 4(b) illustrates that the increase in the number of writes to the integer registers is related to the decrease in UTs. In Fig. 4(c), it is possible to see that the rise in the number of accesses to the ALU is associated with the reduction of ONA. Lastly, Fig. 4(d) shows a direct relationship between the number of reads from control registers and the increase in OMMs. Note that the proposed ML-based soft error correlation module can compare hundreds or even thousands of variable combinations automatically, providing the user with a list of possible solutions using multiple correlation indexes. Although correlation does not prove causality by itself, the software engineer can use the resulting correlations as guidance to understand where application vulnerabilities lie and take more appropriate decisions to mitigate the occurrence of soft errors.

Fig. 4. Sub-figures (a–d) show ML analysis examples generated using FI reports. The circles are FI campaigns, and the lines are distinct ML regression techniques: red, poly; green, linear; and, blue, rbf regression. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
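
To make the idea of ranking variable pairs by a correlation index more concrete, here is a minimal sketch that computes the Pearson coefficient between one profile counter and two outcome rates across several FI campaigns. The numbers are synthetic placeholders, not values from the paper, and the choice of Pearson as the index is an assumption; the text does not state which indexes SOFIA uses.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Synthetic placeholder data: one entry per FI campaign (NOT paper results).
branch_instr_millions = [10.0, 20.0, 30.0, 40.0, 50.0]
hang_rate_percent = [0.10, 0.14, 0.19, 0.25, 0.28]
vanish_rate_percent = [91.0, 88.0, 84.0, 80.0, 75.0]

r_hang = pearson(branch_instr_millions, hang_rate_percent)
r_vanish = pearson(branch_instr_millions, vanish_rate_percent)

# Rank outcome variables by absolute correlation with the counter.
ranking = sorted(
    [("hang", r_hang), ("vanish", r_vanish)],
    key=lambda item: abs(item[1]),
    reverse=True,
)
for name, r in ranking:
    print(f"{name}: r = {r:+.3f}")
```

A positive coefficient close to 1 would correspond to the kind of direct relationship described for Fig. 4(a), and a negative one to the inverse relationships of Fig. 4(b) and 4(c). A high coefficient alone still says nothing about causality.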
After the ML-based soft error analysis, the flow is used to protect the application by applying the TMR and the P-TMR mitigation techniques. With the data collected in step (9) from the unprotected binary and the two hardened versions (TMR and P-TMR), software engineers can use the Analysis and Visualisation Module to compare the reliability and performance trade-offs, aiming to define which protection strategy best suits the target application.

Fig. 5 shows the system reliability and the overhead for runtime and code size, with all values normalised by the unhardened version. The program code size increases by 44% when applying TMR compared to a 16% increase with P-TMR, which is a relevant finding for memory-constrained projects. On the other hand, considering the reliability metrics, Fig. 5 shows that the MWTF for TMR and P-TMR is reduced by 79% and 66%, respectively. The reduction shows that both mitigation techniques present inefficient protection for single-core due to the high natural fault-masking rate (above 90%). In addition, the two mitigation techniques are also inadequate to protect external functions and OS routines that use the most susceptible registers. However, the more cores, the lower the application's susceptibility to the occurrence of soft errors. This improvement is due to a significant increase in the execution of OpenMP functions, making the protection worthwhile in this case. For example, the MWTF gain is ∼16% for the dual-core protected with P-TMR and ∼13% for the quad-core protected with TMR.

Fig. 5. Comparison between mitigation techniques, values are normalised by unhardened version.

Like most applications, this case study uses a small fraction of the 31 general-purpose registers available in the ARMv8 ISA, while the operating system uses them more evenly. Thus, even if a register fault occurs during the execution of an application function, the effect of this fault can impact other functions of the OS (e.g., thread management), which are more likely to use long-lived registers. Note that in addition to this case study, initial versions of the SOFIA flow have already been validated with RISC-V [48] and different Arm processors [43,44].

Furthermore, the validation of SOFIA's flow also considers the joint use of mitigation techniques. In this regard, it investigates the application protection with the RAT alone and in conjunction with both TMR and P-TMR solutions.

Fig. 6 shows that the hardened RAT version increases the MWTF by 2%, 16%, and 8% for single-core, dual-core, and quad-core, respectively. Note that a minimal addition of runtime and code size overhead (∼1%) exists and that all scenarios are normalised by the unprotected version (Ref). In turn, the joint use of RAT with the TMR technique leads to a higher MWTF gain, i.e., an increase of around 2.54× and 2.75× for dual-core and quad-core, respectively. The higher MWTF is due to the larger number of registers used by the program, since all its functions have redundant code. In this regard, it is possible to distribute the use of registers more evenly. However, no significant effect on P-TMR (+RAT) has been identified in these experiments, meaning that the strategy of choosing the least used registers does not work in all cases.

Fig. 6. Results applying RAT for unhardened, TMR, and P-TMR versions. All data points are normalised with the unhardened version without RAT.

4.3. Benefits of SOFIA's design space exploration

While simulation-based soft error analysis is expected to produce accurate results at RTL or gate levels, such analyses are restricted to relatively small systems and experimental fault campaigns due to their low simulation performance and the limited availability of platform components. In contrast, the SOFIA framework enables the soft error analysis of complex systems comprising not only real software stacks but also instruction set architectures (ISAs) and state-of-the-art processor models, which are rarely available to users. The soft error analysis consistency of SOFIA's M*DEV module has been investigated against RTL-based simulations considering Arm Cortex-M0 and M3 processors [49]. Results show that the SOFIA M*DEV module presents a worst-case mismatch of 7.5% when faults are injected into general-purpose registers. The mismatch between fault injection approaches remains similar even when different ISAs, software stacks, and cross-compilers are employed. SOFIA's M*DEV module has some limitations due to the lack of micro-components, but the framework also integrates a gem5-based module that enables more detailed FI explorations, considering other components such as the processor pipeline and caches, among others.

SOFIA offers an unmatched performance (e.g., 10 to 100 thousand times faster than an RTL or gate-level simulator) and design flexibility, allowing the evaluation of realistic self-driven software stacks (i.e., billions of executed instructions for each simulation [9]), which was previously impossible when using cycle-accurate or RTL simulation engines. Concerning the simulation time, the fault injection process does not change the target application source; instead, the fault injector interrupts the simulator (i.e., gem5 or M*DEV) a single time to change a single bit in the application execution, having an unnoticeable impact on the overall simulation time compared to the unmodified application execution.

Finally, we believe that SOFIA provides engineers with automatic and appropriate means to investigate the soft error reliability of complex software stacks. Conducting the soft error analysis of such complex systems is time-consuming and error-prone when manually performed, mainly due to the complexity and size of the failure-related data sets that might be involved. The underlying process involves multiple files/spreadsheets and manual report generation, and trusting the results depends on how attentive and accurate operators are in the process. Thus, identifying the most unreliable software configurations and their correlations with the target platform may require hours of manual data analysis and plowing through large files of simulation records to show consistent results. In this regard, SOFIA enables software engineers to obtain initial results to eliminate unsuitable solutions at early design phases, thus reducing the number of future investigations that might be conducted at a lower level.
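
The single-interruption fault model mentioned in this subsection (one bit changed once per run) can be illustrated with a small, self-contained sketch. It is a toy model only: the register file is a Python list of 31 64-bit values standing in for X0-X30, and the names `Fault`, `draw_fault` and `inject` are hypothetical. It does not reflect the gem5 or M*DEV interfaces.

```python
import random
from dataclasses import dataclass

NUM_REGS = 31    # X0-X30 general-purpose registers (ARMv8)
REG_BITS = 64    # register width in bits


@dataclass(frozen=True)
class Fault:
    """One random single bit-flip: when, where, and which bit."""
    instruction: int  # dynamic instruction count at which to interrupt
    register: int     # index of the target register
    bit: int          # bit position to flip


def draw_fault(total_instructions, rng):
    """Draw a uniformly random fault over time, register and bit."""
    return Fault(
        instruction=rng.randrange(total_instructions),
        register=rng.randrange(NUM_REGS),
        bit=rng.randrange(REG_BITS),
    )


def inject(registers, fault):
    """Return a copy of the register file with the chosen bit flipped."""
    faulty = list(registers)
    faulty[fault.register] ^= 1 << fault.bit
    return faulty


rng = random.Random(2022)          # fixed seed so a campaign is repeatable
golden = [0] * NUM_REGS            # toy register file, all zeros
campaign = [draw_fault(1_000_000, rng) for _ in range(3100)]

first = campaign[0]
faulty = inject(golden, first)
changed = [i for i in range(NUM_REGS) if faulty[i] != golden[i]]
print(f"fault: {first}, changed registers: {changed}")
```

Because each run is perturbed exactly once, the cost of an injection in such a scheme is a single pause of the simulation, which is consistent with the negligible time overhead claimed above.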


5. Soft error assessment considering different mitigation techniques

This Section evaluates the effectiveness of the mitigation techniques for two distinct sets of benchmarks: WCET and Rodinia. The WCET benchmark suite considers bare metal applications running on a single-core processor. These applications are simple but comprise the basis for more complex ones; for example, matrix multiplication and linear equation solvers are the core of many machine learning routines. On the other hand, the Rodinia benchmark suite leverages external libraries to manage multicore processing in parallel, thus adding a new layer of complexity on top of the Linux kernel.

Table 3 shows the experimental setup, which includes 3.1k fault injections per campaign, resulting in a 99% confidence level and a 2.3% error margin, according to Leveugle et al. [47]. This work assumes that an error-correcting code protects the main memory; thus, the fault injection targets only the X0-X30 general-purpose registers.

Table 3
Experimental setup.
Framework setup
  Processor                      Arm Cortex A72
  Compiler                       Clang/LLVM 6.0
  Fault model                    Random Single Bit-Flip
  Fault injection per scenario   3100
WCET benchmark
  Core variants                  Single-Core
  OS                             None
  Mitigation techniques          TMR, P-TMR
  Number of applications         25
Rodinia benchmark
  Core variants                  Single, Dual and Quad-Core
  OS                             Linux (kernel 4.3)
  Mitigation techniques          RAT, TMR (+RAT), P-TMR (+RAT)
  Number of applications         13

5.1. Bare metal applications

The applications selected for this Section come from the real-time benchmark suite WCET [50]. Table 4 shows a brief description of each benchmark application. They are concise applications with distinct behaviours (e.g., loop, recursion, arrays) and no external library dependencies (e.g., string.h, math.h). Due to the lack of an operating system and external calls, SOFIA can trace almost 100% of the executed code, ensuring fine-grained control over the fault injection and the application of the mitigation techniques.

Table 4
Reference WCET benchmark suite.
#    Application        Description
1    adpcm              Adaptive pulse code modulation algorithm.
2    binary_search      Binary search for an array of integer elements.
3    bit_manipulation   Complex embedded code.
4    blowfish           Encryption algorithm.
5    bubble             Bubblesort program.
6    compress           Data compression program.
7    counts             Counts non-negative numbers in a matrix.
8    crc                Cyclic redundancy check computation.
9    edn                Finite Impulse Response (FIR) filter calculations.
10   expint             Series expansion for computing an integral function.
11   factorial          Calculates the factorial function.
12   fdct               Fast Discrete Cosine Transform.
13   fibonacci          Simple iterative Fibonacci calculation.
14   hanoi              Tower of Hanoi puzzle game.
15   harm               Harmonic calculations with recursive calls.
16   insert_sort        Insertion sort on a reversed array.
17   jfdct_int          Discrete-cosine transformation.
18   matrix_mult        Matrix Multiplication 20 × 20.
19   mdc                Calculates the greatest common divisor.
20   peakspeed          Memory bounded program.
21   petri_net          Simulate an extended Petri Net.
22   prime              Calculates whether numbers are prime.
23   switch_cases       Control bounded program.
24   ud                 Linear Equations by LU Decomposition.
25   usqrt              Sqrt calculation without multiplications or divisions.

Table 5 shows the soft error results for the P-TMR and TMR techniques when normalised by the unhardened binary. Although TMR generally offers a better MWTF, the difference to the P-TMR results is minimal in most cases, demonstrating the benefit of lighter TMR solutions.

The average normalised runtime, code size, and MWTF are about 2.76×, 1.24×, and 5.97×, respectively, for TMR, and 2.51×, 1.16×, and 4.76× for P-TMR. However, on a few applications (highlighted in blue), protecting only the critical function (P-TMR) has an even better reliability/performance trade-off than TMR. Also, applications 6 and 11 (highlighted in grey) and 14 (in blue) had worse results in terms of MWTF for both mitigation techniques. These results indicate that the additional code inherent to TMR protection can lead to new points of failure. The TMR majority voter implementation for the ARMv8-A ISA uses four additional instructions, making the application more vulnerable if there is a significant number of voters in the code. For instance, in the TMR-protected version of application 11 (worst MWTF), more than 30% of the dynamic instructions are used for checking. One way to solve this problem would be to use a protected majority voter implementation. This would significantly increase the performance penalty, but the reliability improvement can be worth it in some specific cases.

Table 5
WCET benchmark results for runtime, code-size, and MWTF, normalised with the reference application.
#     Runtime         CodeSize        MWTF
      P-TMR   TMR     P-TMR   TMR     P-TMR   TMR
1     2.33    2.93    1.02    1.21    5.08    3.63
2     2.67    2.78    1.10    1.12    4.66    6.90
3     2.28    2.33    1.12    1.21    0.56    0.74
4     2.02    2.70    1.03    1.16    1.47    1.95
5     2.31    2.31    1.09    1.12    12.97   14.43
6     2.50    2.88    1.03    1.30    0.33    0.33
7     1.83    2.50    1.08    1.13    5.04    1.97
8     3.00    3.85    1.09    1.24    0.60    1.02
9     1.67    2.67    1.02    1.40    2.50    2.36
10    3.41    3.41    1.29    1.32    7.15    8.55
11    2.55    2.65    1.03    1.09    0.18    0.18
12    2.75    2.75    1.45    1.46    0.60    0.67
13    3.69    3.69    1.08    1.08    5.82    5.82
14    1.62    1.62    1.01    1.07    0.31    0.21
15    2.22    2.28    1.03    1.10    2.38    1.78
16    2.64    2.64    1.09    1.09    5.29    5.12
17    3.00    3.00    1.43    1.47    0.57    0.66
18    2.24    2.35    1.06    1.11    2.76    3.10
19    2.00    2.20    1.03    1.12    3.26    2.64
20    4.00    4.00    1.19    1.19    1.95    2.11
21    3.42    3.42    1.99    1.99    26.66   29.13
22    2.28    2.94    1.09    1.17    19.26   33.71
23    1.08    1.13    1.27    1.46    1.33    0.89
24    2.16    2.68    1.28    1.37    2.48    2.00
25    3.15    3.31    1.09    1.13    5.79    16.43
Avg   2.51    2.76    1.16    1.24    4.76    5.97

5.2. Parallel applications on Linux

Rodinia [51] targets heterogeneous computing systems with support for multicore architectures, thus adding complexity w.r.t. WCET applications. This Section considers applications that use the OpenMP parallelisation library and are executed on top of a Linux kernel. Table 6 shows each application's domain.

Figs. 7 and 8 show the normalised runtime and the MWTF for the two mitigation techniques considering single-core, dual-core, and quad-core processors. A slight difference can be seen when comparing the average runtime (1.11× vs 1.04×), and the MWTF (1.08× vs 1.07×)
results between the TMR and the P-TMR techniques. Note that there are discrepancies in some isolated cases. For instance, application F running on a quad-core processor presented an MWTF of 1.82× for the TMR and 0.90× for the P-TMR protection. This behaviour shows that partial protection may be insufficient in some cases. In turn, when executing application K on a single-core processor, MWTFs of 1.36× and 1.67× were obtained for TMR and P-TMR, respectively. The above discrepancies demonstrate the importance of providing software engineers with appropriate means to identify the occurrence of soft errors in complex software stacks and to enhance the system reliability by applying a soft error mitigation technique that meets the requirements of each application.

Table 6
Rodinia applications domain.
#   Application      Domain
A   Backprop         Pattern Recognition
B   BFS              Graph Algorithms
C   HeartWall        Medical Imaging
D   HotSpot          Physics Simulation
E   HotSpot3D        Physics Simulation
F   Kmeans           Data Mining
G   LUD              Linear Algebra
H   Myocyte          Biological Simulation
I   NN               Data Mining
J   ParticleFilter   Medical Imaging
K   PathFinder       Grid Traversal
L   SradV1           Image Processing
M   SradV2           Image Processing

Fig. 7. TMR reliability comparison with the reference.

Fig. 8. P-TMR reliability comparison with the reference.

Note that in some cases (D, H), the protected program's execution time is less than that of the reference version. This occurs because most of the execution of these applications is spent handling I/O communication from external libraries. In this regard, the TMR protection is applied to a portion of the application that accounts for less than 1% of the total application execution time, making the protection ineffective. A possible solution would be the implementation of the mitigation technique in the compiler linking phase. However, this restricts the technique to single processors.

Table 7 shows the impact of TMR and P-TMR on the object code size for the Rodinia benchmark. For instance, application K has the highest reliability gain overall. The resulting code size increases by 14.15% when applying the P-TMR technique, while TMR presents a code size increase of 35.23% with a similar reliability gain in a single-core processor. Similar behaviour occurs for all other applications, with the P-TMR technique presenting a better trade-off between reliability and code size.

Table 7
Comparison between Rodinia's applications code size increase for each mitigation technique.
        A      B      C      D      E      F      G
TMR     1.36   1.35   1.85   1.36   1.43   1.36   1.41
P-TMR   1.03   1.11   1.47   1.21   1.05   1.01   1.07
        H      I      J      K      L      M
TMR     1.35   1.36   1.54   1.35   1.59   1.33
P-TMR   1.21   1.10   1.00   1.14   1.05   1.09

When comparing the WCET and Rodinia reliability results, a massive reduction can be seen in the average normalised MWTF gain from 5.97× (WCET) to 1.08× (Rodinia). This occurs due to a large number of external function calls (e.g., math/physics libraries). The applied mitigation techniques cover only the functions inside the application's scope, which restricts software stack hardening. Possible solutions to this problem are: (i) replicating the application's function calls, although, aside from the considerable performance overhead, this approach might lead to collateral damage (e.g., modifying the same data structure multiple times); or (ii) compiling the library using the LLVM extension integrated into SOFIA.

In addition, faults injected during OS routines can significantly impact the system reliability. Some Rodinia applications have a thread management overhead surpassing 50% of the total execution time in specific cores. Taking application L as an example: more than 99% of the execution time is used for application IO functions for single-core. When running on quad-core, the main core spends 95% of the execution time on application functions; the other three cores stay idle or run some thread barrier function for more than 95% of the time. This means that when a fault is injected into one of the cores that is not the main one, there is an almost 100% chance of affecting an OS routine. Thus, software protection is ineffective in these cases because it does not reach OS-specific functions and the external libraries that use IO operations.

5.3. RAT custom parameters reliability evaluation

A series of tests is conducted to investigate the impact of RAT's customisation parameters on the soft error reliability. In this regard, Table 8 defines six sets of registers. The first two (S1 and S2) are limited to the lower and upper half of the Arm Cortex-A72 processor's register file. Next, the registers are divided into four parts (i.e., S3-S6), aiming to explore how applications behave when the compiler forces a configuration rather than using the default calling convention. The question is: how do these parameters impact the application's soft error reliability?

Fig. 9 shows the average normalised MWTF results for each mitigation technique grouped by CPU core configuration. Each bar shows the average of the 13 applications of the Rodinia benchmark. The RAT-s2 mitigation technique is the one that stands out in Fig. 9, showing superior soft error reliability w.r.t. the other techniques. This behaviour is in line with our previous work [42], which indicates that the top registers (x15-x29) are, in general, more susceptible to the occurrence of soft errors. It is also possible to notice that the RAT-s3
generates results significantly inferior to the other techniques. In other words, allocating the X0-X6 registers to the most executed function is more likely to decrease the protection efficiency. One possible justification is that these registers are heavily used to handle the input and output data of functions. Most applications hardened with TMR present an overall lower resilience to soft errors due to their high use of external libraries, which are not protected.

Table 8
Defined register pools for RAT.
Name     Register pool
RAT-s1   X0-X14
RAT-s2   X15-X29
RAT-s3   X0-X6
RAT-s4   X7-X13
RAT-s5   X14-X20
RAT-s6   X21-X27

Fig. 9. Average normalised MWTF results per CPU core configuration for the Rodinia benchmark.

Figs. 10(a) and 10(b) show a more detailed and individual analysis of the two most resilient applications: Kmeans and Pathfinder. While the Kmeans application shows similar reliability results when running on single and quad-core processors, there are notable differences when it is executed on a dual-core processor. This behaviour comes from the OS thread management and the unbalanced workload distribution among the processor cores.

Fig. 10. Sub-figures a and b show the MWTF normalised with the unhardened version for 11 mitigation techniques for two specific Rodinia applications (Kmeans and Pathfinder) grouped by the number of cores.

Another point to be investigated is that the MWTF values for the TMR and TMR+RAT techniques differ significantly from the general average of the applications. For single-core, an improvement can be observed in the normalised MWTF of 1.87× and 1.44× for TMR+RAT and RAT-s2, respectively, and of 1.86× and 1.79× for the quad-core configuration. However, the results are quite different for the Pathfinder application. In this case, the number of cores does not significantly affect the system's reliability. Note that the P-TMR is more effective than the other techniques, showing an increase of 1.67×, 1.73×, and 1.57× in the normalised MWTF for single-core, dual-core, and quad-core, respectively. Another curiosity in this example is that it was not possible to compile the application using the RAT-s3 configuration. This problem may have occurred due to ISA restrictions, since some particular instructions can only use a specific set of registers. Therefore, as mentioned before, each application should be tested and evaluated individually, taking into account the different mitigation options as well as the target system requirements and constraints.

Regarding execution time and code size overhead, there is an increase in memory access instructions (load/store) when changing the register allocation according to the defined restriction level. However, the RAT is applied to one function of the target application, and as the applications are significantly large, the proportional impact is negligible (< 0.1%) compared to the reference version. In contrast, P-TMR and especially TMR greatly impact execution time and code size through instruction redundancy, which can easily exceed 30% overhead (Section 5.2).

6. Conclusions

To boost the design space exploration and reduce the development effort, this work proposes an automated framework, called SOFIA, which supports early soft error assessment, identification and susceptibility reduction evaluation by applying different soft error mitigation techniques. The proposed framework has been validated with 1.74 million fault injections, covering the equivalent of up to 9.66k hours of simulation, which has been reduced to 301 h due to the parallelisation capabilities provided by the SOFIA framework. SOFIA integrates well-established replication-like mitigation techniques such as TMR and partial-TMR. Results show that code replication techniques may not always give the best result in terms of soft error reliability and performance trade-off. In this regard, SOFIA also supports a novel lightweight mitigation technique, called RAT, which explores different ways to allocate the critical application function to a pool of registers. Results demonstrate that RAT increases the soft error reliability at lower performance and memory usage costs.

For future work, we plan to improve SOFIA with an automatic soft error mitigation technique selection that would consider specific application/platform characteristics and vulnerabilities extracted from the fault injection campaigns and correlation reports.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgements

The authors would like to thank Duncan Graham, Larry Lapides, Simon Davidmann, and Imperas Software Ltd. for their technical support and access to their models and simulator.


Jonas Gava received a bachelor's degree in computer engineering from the Federal University of Rio Grande do Sul (UFRGS) in 2019. In 2020, he started as an M.Sc. student in Microelectronics and continued with his Ph.D. in 2021 at the same institution. For the past three years, he has been researching and developing tools involving the implementation and evaluation of software-based soft error mitigation techniques.


Vitor Bandeira received a bachelor's degree in computer science from the Federal University of Rio Grande do Sul (UFRGS) in 2019 and is currently a Ph.D. Student in Microelectronics at the same university. For the past four years, he has been researching and developing tools for reliability and soft error analysis using virtual platforms. His current research involves applying machine learning techniques to improve the soft error investigations and the development of multiprocessor platforms.

Felipe Rocha da Rosa received the bachelor's degree in computer engineering and the Ph.D. degree in microelectronics from the Federal University of Rio Grande do Sul. For the past six years, he has been researching and developing tools for performance and reliability analysis of arm-based processors. His current research involves applying data science and machine learning techniques to improve the soft error investigations during early design space explorations process, simulation, system design, computer architecture, multi/many-core systems, network-on-chip, fault tolerance, and soft errors.

Rafael Garibotti (M'14–SM'22) is an Associate Professor at PUCRS University. Former Visiting Scholar at Université Grenoble Alpes, France. Former Postdoctoral Fellow at both the prestigious School of Engineering and Applied Sciences of Harvard University and UFRGS, Brazil. He received his Ph.D. and M.Sc. Degree in Microelectronics, respectively from the University of Montpellier and EMSE, France and his B.Sc. Degree in Computer Engineering from PUCRS University, Brazil. He is a distinguished Brazilian researcher with a CNPq PQ-2 grant. His research activity focuses on AI safety, robotics and autonomous systems, multicore architectures, hardware accelerator and robust deep learning.

Ricardo Reis (M'81–SM'06) received the Electrical Engineering degree from the Federal University of Rio Grande do Sul (UFRGS), Brazil, in 1978, and the Ph.D. degree in informatics, option microelectronics from the Institut National Polytechnique de Grenoble, France, in 1983. He received the Doctor Honoris Causa from University of Montpellier, France, in 2016. He has been a Full Professor with UFRGS since 1981. He is at research level 1A of the CNPq (Brazilian National Science Foundation), and the head of several research projects supported by government agencies and industry. He has published over 700 papers in journals and conference proceedings and authored or co-authored several books. His current research interests include physical design, physical design automation, design methodologies, digital design, EDA, circuits tolerant to radiation, and microelectronics education. Prof. Reis was a recipient of the IEEE Circuits and Systems Society (CASS) Meritorious Service Award 2015. He was the Vice President of the IEEE CASS representing Region 9 (Latin America) and president of the Brazilian Computer Society (SBC).

Luciano Ost is currently a Faculty Member with Loughborough University's Wolfson School - UK. He received his Ph.D. in Computer Science from PUCRS, Brazil, in 2010. During his Ph.D., Dr. Ost worked as an invited researcher at the Microelectronic Systems Institute of the Technische Universitaet Darmstadt (from 2007 to 2008) and at the University of York (October 2009). After completing his doctorate, he worked as a research assistant (2 years) and then as an assistant professor (2 years) at the University of Montpellier II in France. He has authored more than 90 papers, and his research is devoted to advancing hardware and software architectures to improve the performance, security, and reliability of machine learning and life-critical embedded systems.
