Characterization and Considerations For Upset in FPGA

This paper discusses the susceptibility of FPGAs to upsets caused by cosmic radiation, particularly in safety-critical applications, and explores various mitigation techniques such as Triple Modular Redundancy (TMR), Error Correcting Codes (ECC), and scrubbing. Practical tests on a Xilinx UltraScale+ MPSOC FPGA validate these techniques, revealing improvements in SEU performance and error correction capabilities. The findings indicate that newer FPGA technologies provide better immunity to radiation-induced errors while still requiring careful consideration of design and mitigation strategies.

Characterization and Considerations for Upset in FPGA
Christian Johansson and Torbjörn Månefjord
Saab AB, Huskvarna, Sweden
[email protected], [email protected]

Abstract— The increase in performance and the relatively low cost have made the FPGA an attractive technology for use in various product areas. When used in safety-critical applications, the susceptibility to upsets due to cosmic radiation requires special consideration. In this paper, a number of mitigation techniques against upsets are discussed together with the use of a COTS IP. Furthermore, practical tests are performed to validate the upset rates and mitigation techniques. The tests are performed on a Xilinx UltraScale+ MPSoC FPGA.

I. INTRODUCTION

The radiation in the atmosphere can induce undesired effects in circuits such as an FPGA. The radiation is present both at ground level and at elevated altitudes, and neutrons are the dominant contributor in the earth's atmosphere [1]. The neutron flux also varies with location and is largest at high altitudes and near the poles. An example of a relative neutron flux calculator is available [2].

The effect of changing the internal state of a device such as an FPGA due to a single energetic particle strike is referred to as a Single Event Effect (SEE). The SEE can be further divided into different types of failure modes. Among the most common modes are the Single Event Upset (SEU), a soft error in which a memory bit or latch changes state due to a single energetic particle strike; the Multi-Bit Upset (MBU), a soft error where more than one bit error occurs in the same word; the Single Event Transient (SET), a spurious signal or voltage caused by a single particle that propagates during one clock cycle; and the Single Event Functional Interrupt (SEFI), where a control path is corrupted and the functionality is altered. The term soft error means that the device is not damaged and that the error can be corrected by power cycling or by rewriting the correct data.

The SEU is the most common type of error. Although new generations of circuits have a higher density of integration, the susceptibility to SEU per bit has decreased due to innovative processes and layout mitigations. Without these mitigations at the device level, the scaling of the transistor would have caused a higher susceptibility of each circuit, since the charge required to change the state of each memory cell or latch is reduced.

The different types of FPGAs are mainly categorized based on the technology used for storing the configuration logic. The main technologies can be classified into three categories: antifuse, SRAM and flash-based FPGAs. The antifuse-based technology is radiation hardened but does not support re-programming and is also relatively expensive [3]. Genuine flash-based FPGAs are hardened against radiation and generally have several power-saving advantages compared to SRAM-based FPGAs [4], [5], [6]. However, the SRAM technology, despite its susceptibility to radiation, is used for the majority of FPGAs [7], [8]. This is mainly due to the ability to achieve a higher logic density, the maturity of the process technology and the cost compared to the other technologies.

The degree of vulnerability that can be accepted varies between applications and is an aspect of the system design. Safety considerations determine whether any mitigation method is necessary. One important parameter to characterize is the SEU failure rate of the FPGA used, in order to estimate the design's susceptibility. The failure rate is typically estimated by radiation-based testing but can also be evaluated through different SEU simulation techniques [9].

II. ELABORATION OF DIFFERENT MITIGATION TECHNIQUES

There are several mitigation techniques to decrease the influence of SEU in either the user logic or the configuration memory, such as fault masking techniques, e.g., Triple Modular Redundancy (TMR) and Error Correcting Codes (ECC), or recovery by scrubbing of the configuration memory.

A. TMR

One technique to mitigate SEU in the logic is Triple Modular Redundancy. TMR is a method where the logic resources are triplicated and majority voters are added. An error in one of the branches is masked by the voter and does not propagate to the outputs. Since the design uses redundancy, it also introduces a significant hardware overhead. There are several approaches to implementing TMR with respect to application and use of resources. The common types are Distributed TMR (DTMR), with triplication of flip-flops, combinatorial data paths and voters, where the voters are placed after the flip-flops, and Block TMR (BTMR), where the entire design is triplicated and voters are placed at the outputs. A number of TMR techniques have previously been evaluated in terms of reliability and performance. One example is the use of TMR for the Xilinx architecture [10].
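
As an illustration of the majority voting used in TMR, a minimal software model is sketched below. It assumes three redundant copies of a register value held as integers; the function names majority_vote and flip_bit are hypothetical and only serve the example. In an FPGA the voter is implemented in logic, so this is a behavioral sketch rather than an implementation.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority of three redundant copies."""
    return (a & b) | (a & c) | (b & c)


def flip_bit(value: int, bit: int) -> int:
    """Model an SEU by inverting one bit of a stored value."""
    return value ^ (1 << bit)


if __name__ == "__main__":
    golden = 0b1011_0010
    # One of the three branches is hit by an upset in bit 3.
    copies = [golden, flip_bit(golden, 3), golden]
    voted = majority_vote(*copies)
    print(f"voted = {voted:#010b}, masked = {voted == golden}")
```

A single upset in one branch is masked. If two branches are upset in the same bit position, the voter outputs the wrong value, so errors must not be allowed to accumulate in the redundant copies.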

978-1-5386-7656-1/18/$31.00 ©2018 IEEE

Authorized licensed use limited to: UNIVERSIDADE DE SAO PAULO. Downloaded on January 25,2022 at 03:35:48 UTC from IEEE Xplore. Restrictions apply.
TMR can generally be implemented manually [11]. However, tools such as Synopsys Synplify® Premier offer automated insertion of TMR during synthesis and support several device families [12]. TMR is often combined with scrubbing of the configuration memory for SRAM-based FPGAs in order to remove the accumulation of errors that otherwise could defeat the redundancy.

B. Error detection and correction code

There are several methods to implement error correcting codes (ECC), each with its own advantages. One of the most commonly used for single error correction in memories is the Hamming code [13]. During a write to the memory, the ECC logic adds a few extra bits that are stored with the data word. At the read operation, the stored bits are compared with the ECC value calculated from the data read, in order to correct any single-bit error or detect any double-bit error [13]. ECC mitigation can be utilized for both block RAM (BRAM) [13] and configuration RAM (CRAM) [14]. To further decrease the probability of a multi-bit upset in high-density memories, scattering techniques are implemented at the device level, e.g., large column multiplexing, which increases the distance between the memory cells belonging to one memory word.

C. Scrubbing

Scrubbing is a technique where specialized logic periodically scans the memory cells and rewrites the original content. This can be done either unconditionally or when a fault has been localized. Scrubbing is performed in order to prevent accumulation of faults in the configuration memory or BRAM. In the case of configuration memory, a reload of the bitstream, i.e., scrubbing, is necessary. This scrubbing allows the configuration to be repaired without disrupting operation, but it does not correct the state of the functional logic, e.g., it does not refresh the content of flip-flops or block RAM.

Scrubbing of BRAM with ECC is most easily performed by reading a memory cell and then writing it back. If there has been an SEU, the ECC logic corrects it upon the read and presents the original data to the computer or scrubbing logic. It is this corrected data that is written back to the memory. Scrubbing can successfully be combined with the previously mentioned TMR and ECC. Both TMR and ECC hide the effect of a fault, but the fault remains. Scrubbing heals the fault by rewriting the invalid data with a golden copy of the initial data in the case of a configuration fault and with the corrected data in the case of a data error.

There are several approaches to scrubbing. One solution is to periodically rewrite the content of the configuration memory with an update frequency related to the expected SEU rate. An improved method is partial reconfiguration, which allows a selected part of the configuration memory, i.e., the faulty frame, to be rewritten [15].

However, there are some important aspects to consider when scrubbing is used as the only mitigation against SEU. There is a time before the SEU is detected, and a time when the error is detected but not yet corrected. Both of these times have to be acceptable to the system design in order to use this mitigation technique. This disadvantage is eliminated when scrubbing is combined with ECC and/or TMR.

One way to implement scrubbing of the configuration memory on a Xilinx FPGA is to use the Soft Error Mitigation COTS IP developed by Xilinx. The Soft Error Mitigation IP (SEM IP) core provides a customizable interface with options for fault injection, mitigation, and classification. The core utilizes the Internal Configuration Access Port (ICAP) for internal readback to perform error detection with ECC and CRC codes [14]. A simplified overview of a generated system-level example SEM IP is shown in figure 1.

Figure 1. Overview of the system-level example SEM IP.

The interface of the SEM IP differs depending on its configuration options. If classification is enabled, an additional file with a description of the essential bits is used together with a fetch interface to further increase the system availability. In the figure, the mitigation and report interfaces are shown with classification disabled. The frame_ecc and ICAP primitives allow the signaling of errors in the read frame, and the SEM controller corrects the configuration through the ICAP. All errors and status can be monitored through the status signals or the UART interface. Emulation of errors by error injection is also supported, which provides a method to evaluate an SEU and the response of the SEM IP. The arbitration interface is necessary for implementations targeting Zynq devices where the logical access must be transferred to the ICAP.

The error mitigation time is directly related to the ICAP clock, which should be as high as possible; frequencies of up to 200 MHz can be used for some devices. The latency of error detection and correction is device and clock dependent and is typically less than 60 ms and 100 µs, respectively, for UltraScale devices [14]. The scrub rate should be set so that the time between the upset and the repair is as low as possible, e.g., preferably a scrub rate one order of magnitude (x10) above the SEU rate.

III. EXPERIMENTAL

In order to validate the system aspects of different mitigation techniques, tests were performed on devices with some of the previously mentioned mitigations implemented. The tests were conducted at the ISIS Neutron facility at Rutherford Appleton Laboratory, UK [16]. The dedicated irradiation beamline ChipIr is designed to give a neutron
spectrum as similar as possible to the atmospheric one. The neutron flux with energies > 10 MeV was measured to be $5 \times 10^{6}$ cm$^{-2}$ s$^{-1}$. The board containing the FPGA was placed perpendicular to the beam during the tests. The FPGA device was a Zynq UltraScale+ MPSoC XCZU9EG-FFVB1156-ES1 (pre-production engineering silicon) fabricated in a 16 nm FinFET process. The FPGA was loaded with a design containing the SEM IP as previously described. During each test, the FPGA was power cycled and configured before irradiation. The report from the SEM IP UART consists of a verbose log indicating its state and whether a correction or an uncorrectable error has occurred. The log was manually supervised together with the SEM IP status signals during the tests. If an uncorrectable error was indicated, a reconfiguration was made through a power cycle of the board. The BRAM was initialized with a known pattern before the start of the experiment. During the test, the different SEUs were recorded, and the correct pattern was rewritten.

The device for the test was mounted in the neutron beam room as shown in figure 2. Different types of boards were tested simultaneously and aligned with respect to the neutron beam (illustrated by the white arrow). The device communicates through an RS232 UART with a monitoring laptop and also through a supervising CPU board enabling logging of various IO signals. All information from the test setup was sent to the main computer and stored on its hard drive.

Figure 2. Setup for irradiation test by neutron beam.

The SEU cross section, σ, is calculated as σ = number of events / (flux × time), where the number of events is found from the SEM IP log, the flux is given in neutrons cm$^{-2}$ s$^{-1}$ and the time is how long the device was irradiated. To obtain the cross section per bit, the cross section is divided by the actual size of the specific memory resource used. The calculated failure in time (FIT) rate is based on the NYC sea level neutron flux of 12.9 neutrons/cm$^{2}$/h [17]. The percentage of correctable SEM IP errors was better than 99.6%, where the rest required a reconfiguration by, e.g., a power cycle. Previously reported values for the 20 nm UltraScale [18] have shown that less than 0.1% of all CRAM events are uncorrectable.

The total fluence was chosen to be larger than $10^{10}$ neutrons/cm$^{2}$ to give sufficient statistical significance of the SEU during the experiment. The measured SEU cross sections for both CRAM and BRAM are given in table 1 together with previously reported results for a Xilinx 20 nm Kintex UltraScale XCKU040 device [18]. The failure rate is also added for comparison by the use of the flux calculator [2].

Device | Source | CRAM SEU cross-section (cm$^{2}$/bit) | BRAM SEU cross-section (cm$^{2}$/bit) | CRAM failure rate (FIT/Mbit)
XCKU040 [18] | Neutron | $2.6 \times 10^{-15}$ | $4.5 \times 10^{-15}$ | 35 (NYC sea level); 10224 (NYC alt. 10 km)
XCZU9EG | Neutron > 10 MeV | $1.1 \times 10^{-16}$ (< ±15 %) | $4.1 \times 10^{-16}$ (< ±15 %) | 1.5 (NYC sea level); 433 (NYC alt. 10 km)

Table 1. SEU test results.

A comparison between the devices gives an improvement of more than x15 for the CRAM and x8 for the BRAM due to the refined build-up technology of the XCZU9EG. Further improvement for SEU could also be achieved by the available ECC for the BRAM [19]. The trend is that the latest technology provides better immunity compared to the previous 20 nm technology. Nevertheless, due to the higher integration (the number of bits per memory device has increased), the sensitivity per device has either remained the same or increased.

However, not all configuration bits in a load file are used for defining interconnect or logic. When calculating the number of expected upsets in FIT (failures in time, i.e., failures per $10^{9}$ hours of operation), the number of critical bits is used. There are several methods to estimate the critical bits. One simple approach is to use the Xilinx resource analysis, utilizing the essential bits reported during bitstream creation. The reported value implies that all of the essential bits may be critical. Since an upset can be masked by design properties, not all of them give rise to a functional failure. For example, it has been reported that a derating of 10% could be used as a typical factor for upsets that give rise to a functional error [20]. Using the essential bits is usually a more conservative approach.

IV. CONCLUSION

A number of different mitigation techniques are discussed, and scrubbing is validated by practical tests. The 16 nm UltraScale+ MPSoC FPGA was tested for SEU performance by neutron irradiation with an average flux of $5 \times 10^{6}$ cm$^{-2}$ s$^{-1}$. The SEU cross sections have been presented for both the configuration memory and block RAM. The SEM IP was used for the detection of configuration errors and was able to repair the majority of the errors reported. The values for the estimated upset rates can be used as a design input for analyzing specific mitigation needs. In particular, the implementation and test of the SEM IP validate its system performance in a radiation environment.

ACKNOWLEDGMENT

The authors would like to thank Thomas Granlund at Saab AB for his efforts in the setup of the testing strategy and practical neutron tests.

REFERENCES

[1] “Process management for avionics – Atmospheric radiation effects – Part 1: Accommodation of atmospheric radiation effects via single event effects within avionics electronic equipment”, IEC 62396-1:2012-05.
[2] Flux calculator, [Online]. Available: http://www.seutest.com/ [Accessed 1 October 2018].
[3] Microsemi, Antifuse FPGAs, [Online]. Available: https://www.microsemi.com/product-directory/fpga-soc/1641-antifuse-fpgas/ [Accessed 1 October 2018].
[4] S. C. Davis, R. Koga and J. S. George, “Proton and Heavy Ion Testing of the Microsemi Igloo2 FPGA”, IEEE Radiation Effects Data Workshop (REDW), pp. 1-6, 2017.
[5] P. R. C. Villa et al., “Analysis of single-event upsets in a Microsemi ProAsic3E FPGA”, 18th IEEE Latin American Test Symposium (LATS), pp. 1-4, 2017.
[6] T. Morin, “Flash FPGAs give designers more flexibility”, Embedded, 2018, [Online]. Available: https://www.embedded.com/electronics-blogs/industry-comment/4438457/Flash-FPGAs-give-designers-more-flexibility [Accessed 1 October 2018].
[7] Intel FPGA devices, [Online]. Available: https://www.intel.com/content/www/us/en/fpga/devices.html [Accessed 1 October 2018].
[8] Xilinx FPGA devices, [Online]. Available: https://www.xilinx.com/products/silicon-devices/fpga.html [Accessed 1 October 2018].
[9] D. Munteanu and J. L. Autran, “Modeling and Simulation of Single-Event Effects in Digital Devices and ICs”, IEEE Transactions on Nuclear Science, no. 4, pp. 1854-1878, 2008.
[10] Xilinx, “Triple Module Redundancy Design Techniques for Virtex FPGAs”, XAPP197, v1.0.1, 2006, [Online]. Available: https://www.xilinx.com/support/documentation/application_notes/xapp197.pdf [Accessed 1 October 2018].
[11] S. Habinc, “Functional Triple Modular Redundancy (FTMR), VHDL Design Methodology for Redundancy in Combinatorial and Sequential Logic”, Gaisler Research, Design and Assessment Report, FPGA-003-01, 2002, [Online]. Available: https://www.gaisler.com/doc/fpga_003_01-0-2.pdf [Accessed 1 October 2018].
[12] Synopsys, Synplify Pro and Premier Datasheet, “Fast, Reliable FPGA Implementation and Debug”, 2015, [Online]. Available: https://www.synopsys.com/content/dam/synopsys/implementation&signoff/datasheets/synplify-pro-premier.pdf [Accessed 1 October 2018].
[13] Xilinx, “UltraScale Architecture Memory Resources”, UG573, v1.9, 2018, [Online]. Available: https://www.xilinx.com/support/documentation/user_guides/ug573-ultrascale-memory-resources.pdf [Accessed 1 October 2018].
[14] Xilinx, “UltraScale Architecture Soft Error Mitigation Controller”, PG187, v3.1, 2018, [Online]. Available: https://www.xilinx.com/support/documentation/ip_documentation/sem_ultra/v3_1/pg187-ultrascale-sem.pdf [Accessed 1 October 2018].
[15] Xilinx, “Correcting Single Event Upset Through Virtex Partial Configuration”, XAPP216, v1.0, 2000, [Online]. Available: https://www.xilinx.com/support/documentation/application_notes/xapp216.pdf [Accessed 1 October 2018].
[16] ISIS ChipIr, Rutherford Appleton Laboratory, [Online]. Available: https://www.isis.stfc.ac.uk/Pages/Chipir.aspx [Accessed 1 October 2018].
[17] “Measurement and Reporting of Alpha Particle and Terrestrial Cosmic Ray-Induced Soft Errors in Semiconductor Devices”, JEDEC Test Standard 89A, Sep. 2006.
[18] P. Maillard et al., “Neutron, 64 MeV Proton, Thermal Neutron and Alpha Single-Event Upset Characterization of Xilinx 20nm UltraScale Kintex FPGA”, IEEE Radiation Effects Data Workshop (REDW), pp. 1-5, 2015.
[19] Xilinx, “Zynq UltraScale+ MPSoC Data Sheet”, DS891, v1.6, 2018, [Online]. Available: https://www.xilinx.com/support/documentation/data_sheets/ds891-zynq-ultrascale-plus-overview.pdf [Accessed 1 October 2018].
[20] Xilinx, “Continuing experiments of atmospheric neutron effects on deep submicron integrated circuits”, WP286, v2.0, 2016, [Online]. Available: https://www.xilinx.com/support/documentation/white_papers/wp286.pdf [Accessed 1 October 2018].
