Using An Innovative SoC-level FMEA Methodology To
Riccardo Mariani
YOGITECH SpA
Pisa, Italy
http://www.yogitech.com
[...] zone and so on), this fault is not considered as a hazard [...]
a) it is performed an exhaustive fault injection of [...]
[...] zone it is injected a certain number of faults. At the end [...]
[...] case of random fault injection [...]

[Figure 4 diagram; recoverable labels: Fault List, OP, List of Sensible zones, Randomizer, Collapser, Operational Profiler, Environment builder, Candidate fault list]
Figure 4: the fault injector

6. Example
To show how this methodology can be successfully applied to the design of safety-critical SoCs, a proof-of-concept example is described in the following. It is the [...]
[Figure: block diagram, caption not recovered; labels: Main Effect, ECC_SHELL, F-MEM, ALARMS, ERROR, CODER, DECODER, CTRL, MCE/MPU, MCE, MUX, DMA]
[...] memory access time due to the ECC.
At first, the sensible zones were extracted by using the previously described tool: about 170 sensible zones resulted, including the memory controller, the memory and the F-MEM/MCE blocks. The memory was modeled by using a proper fault model, as described, for instance, in [13-15]. Then, the FMEA spreadsheet was completed, including the S, D, F and DDF values, following the procedure described in sections 3 and 4.
The spreadsheet identified the critical zones. Besides the memory array itself, the most critical blocks were the BIST control logic, the registers involved in addresses
latching, most of the blocks of the decoder, the registers of the write buffer, some of the blocks of the MCE handling the interconnections with the bus and so forth.
With the initial implementation, the resulting SFF (around 95%) was not enough to reach SIL3. The architecture was then modified by adding the addresses to the coding (required as well by IEC 61508), by adding parity bits to the write buffer and by deeply modifying the decoder implementation. In particular, this last action was really important to increase the SFF:
i) an “error checker” was added immediately after the “code generator” section of the decoder, in order to cover also the errors in that coder;
ii) a double-redundant “error checker” was implemented after the intermediate decoder pipeline stage, to check the correctness of the code and data fields after the pipeline and, in case of no errors, to directly connect the decoder output with the memory data. The spreadsheet showed that this measure strongly decreased the error probability of the second part of the decoder architecture;
iii) a “distributed” syndrome checking architecture was implemented to allow a finer error detection (i.e. to discriminate whether an error is in the code field, in the data field, or whether it was an addressing error, etc.). As shown in the FMEA, this architecture also strongly decreased the error probability.
New alarms were generated by these checking architectures: as shown by the FMEA, by combining the alarms generated by the error checker after the decoder’s coder, the redundant error checkers after the pipeline and the final syndrome checks, it is possible to cover the possible error combinations in the decoder with a very high level of coverage.
Moreover, some SW start-up tests were identified for the memory controller parts not covered by the memory protection IP. The resulting SFF of this second implementation was 99.38% and it was very stable as well, i.e. changes to S, D, F and the fault models did not significantly change the result. The previously described validation flow was run, with different syntheses of the design, in order to have the highest confidence in the results and to cross-check the sensitivity to the final implementation.

7. Conclusions
In summary, the methodology proposed in this paper is a new way to extract useful information from a SoC, to take into consideration the IEC guidelines about fault models and failure modes, to compute (following the IEC 61508 norm) the Safe Failure Fraction and the Diagnostic Coverage, and to validate the results by means of a complete flow including a fault injector. It is an innovative and systematic approach to assess the safety of a circuit, delivering very detailed reports on sensible zones, fault effects, failure rates, etc. that can be used for SoC analysis. It also allows the identification of the critical parts of a circuit and the exploration of possible implementations for best safety.
The methodology has been developed under the supervision of TÜV-SÜD and it has been approved by TÜV as the flow to assess and validate the Safe Failure Fraction of a given SoC in adherence to IEC 61508.
The methodology has been used to certify the fRMEM product of YOGITECH SpA according to IEC 61508. It is currently in use for the final certification of the other IPs of the YOGITECH faultRobust technology and for the complete analysis of fault-robust microcontrollers for automotive applications [16,17].

References
[1] J.C. Laprie, “Dependable Computing and Fault Tolerance Concepts and Terminology”, IEEE Computer, 1985
[2] H. Tahne, “Safe and Reliable Computer Control: Systems Concepts and Methods”, Mech. Lab, Univ. Stock, 1996
[3] CEI International Standard IEC 61508, 1998-2000
[4] S. Brown, “Overview of IEC 61508 Design of electrical/electronic/programmable electronic safety-related systems”, Computing & Control Engineering Journal, February 2000, pages 6-12
[5] R.E. McDermott et al., “The Basics of FMEA”, Quality Resources Press, 1996
[6] R. Mariani, G. Boschi, “A System Level Approach for Embedded Memory Robustness”, Special Issue: Papers selected from the 1st International Conference on Memory Technology and Design - ICMTD’05.
[7] R. Mariani, M. Chiavacci, S. Motto, “Dependable microcontroller, method for designing a dependable microcontroller and computer program product therefor”, European Patent, EP1496435
[8] R. Mariani, P. Fuhrmann, B. Vittorelli, “Cost-effective Approach to Error Detection for an Embedded Automotive Platform”, SAE 2006 World Congress & Exhibition, April 2006, Detroit, MI, USA
[9] www.fr.yogitech.com
[10] http://www.cadence.com/products/functional_ver
[11] http://www.cadence.com/products/digital_ic/encountertest
[12] IEEE standard 1647, http://www.ieee1647.org/
[13] S. Mukherjee et al., “Cache scrubbing in Microprocessors: Myth or Necessity?”, 2004
[14] S. Mukherjee et al., “A Systematic Methodology to Compute the Architectural Vulnerability Factors for a High-Performance Microprocessor”, 2003
[15] M. Spica, “Do we need anything more than single bit error correction (ECC)?”, 2004
[16] R. Mariani, “A Platform-based Technology For Fault-robust SoC Design”, IP/SOC 2006 Conference, December 2006, Grenoble, France
[17] R. Mariani, P. Fuhrmann, B. Vittorelli, “Fault-Robust microcontrollers for automotive applications”, 12th IEEE International On-Line Testing Symposium, 12 July 2006, Como, Italy
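
The SFF figures quoted in the example (around 95% for the first implementation, 99.38% for the second) can be related to the SIL target with a short calculation. The sketch below is only an illustration, not the spreadsheet used in the paper. It uses the usual IEC 61508 definitions, SFF = (λS + λDD) / (λS + λD) and DC = λDD / λD, and it assumes the architectural-constraint thresholds for a Type B element with hardware fault tolerance 0 (60%, 90%, 99%). The paper does not state the hardware fault tolerance of the example, so that choice is an assumption. The `Zone` record, its field names and the numeric values are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class Zone:
    """Hypothetical per-zone record; not the paper's spreadsheet columns."""
    name: str
    failure_rate: float   # total failure rate of the zone (any consistent unit)
    safe_fraction: float  # fraction of failures that are safe, in [0, 1]
    ddf: float            # fraction of dangerous failures that are detected, in [0, 1]


def sff_and_dc(zones: Iterable[Zone]) -> Tuple[float, float]:
    """Return (SFF, DC) aggregated over all zones."""
    zones = list(zones)
    lam_s = sum(z.failure_rate * z.safe_fraction for z in zones)
    lam_d = sum(z.failure_rate * (1.0 - z.safe_fraction) for z in zones)
    lam_dd = sum(z.failure_rate * (1.0 - z.safe_fraction) * z.ddf for z in zones)
    total = lam_s + lam_d
    if total == 0:
        raise ValueError("total failure rate is zero")
    sff = (lam_s + lam_dd) / total
    # With no dangerous failures at all, coverage is taken as complete.
    dc = lam_dd / lam_d if lam_d > 0 else 1.0
    return sff, dc


def max_sil_type_b_hft0(sff: float) -> int:
    """Highest SIL claimable from SFF alone (assumed: Type B, HFT = 0).

    Returns 0 when the SFF is below 60%, i.e. no SIL can be claimed.
    """
    if sff >= 0.99:
        return 3
    if sff >= 0.90:
        return 2
    if sff >= 0.60:
        return 1
    return 0


if __name__ == "__main__":
    # Illustrative values only.
    zones = [
        Zone("decoder", failure_rate=100.0, safe_fraction=0.5, ddf=0.90),
        Zone("write_buffer", failure_rate=100.0, safe_fraction=0.5, ddf=0.98),
    ]
    sff, dc = sff_and_dc(zones)
    print(f"SFF={sff:.4f} DC={dc:.4f} max SIL={max_sil_type_b_hft0(sff)}")
```

Under these assumptions an SFF of about 95% falls in the SIL2 band and 99.38% falls in the SIL3 band, which is consistent with the statement that the first implementation was not enough to reach SIL3.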