Introduction To Designing For High-Reliability: Prepared by Adiuvo Engineering and Training, LTD
High-Reliability
14.10.2020
Depending upon the use case of the embedded system it may have to
address challenges such as
» Temperature
» Vibration
» Shock
» EMC / ESD
» Radiation – Terrestrial
» Radiation – Space
» Simplest: select parts which are appropriately rated, e.g. industrial or Mil
» Use margin – qualify the design at wider limits than operational, e.g. ±10 °C beyond the
operating temperatures.
» Work with the mechanical & thermal team to produce a design which uses conduction,
radiation and convection effectively to enable the derating to be achieved.
» Ensure timing analysis includes worst-case performance – not just the best or nominal.
Environment – EMC / ESD
The ability of an equipment, sub-system or system to share the electromagnetic
spectrum and perform its desired functions without unacceptable degradation
from or to the environment in which it exists.
It needs to function correctly when subjected to radiated and conducted
interference (radiated and conducted susceptibility)
» Complete corruption of digital words
» Offsets in analogue signals
» Crosstalk in communication systems
» Equipment switch off/reset
Radiation Effects Environment
Radiation falls into four types
The first three of these categories have the potential to affect electronic
components, while the fourth, displacement damage dose (DDD), is a more systematic effect.
Effect                       | Description                          | Devices affected               | Destructive
Hard Error – SHE             | Unalterable change of memory element | Memories, Latches, Registers   | Yes
Burn Out – SEB               | Destructive burn out                 | BJT, N-Channel Power MOSFET    | Yes
Upset – SEU                  | Register or memory corruption        | Memories, Registers, Latches   | No
Multi Bit Upset – MBU        | As SEU but multiple bits affected    | Memories, Registers, Latches   | No
Functional Interrupt – SEFI  | Loss of normal operation             | Complex device e.g. FPGA       | No
[Figure: Probability of Success for MTBF of 1 Year. X axis: Time (Hours); Y axis: Probability of Success.]
P(S) for MTBF of one year = 0.3678
MTBF
The MTBF is also only valid for the constant failure rate period of the
equipment and does not include infant mortality or wear out phase.
For this reason units are normally burnt in to weed out infant mortality
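The probability of success quoted for an MTBF of one year follows from the exponential reliability model that applies during the constant failure rate period, P(S) = exp(-t / MTBF). The sketch below is a minimal illustration of that formula; the function name and the 8760-hour year are assumptions made for the example, not part of the original material.

```python
import math

HOURS_PER_YEAR = 8760  # assumption: 365-day year


def probability_of_success(t_hours: float, mtbf_hours: float) -> float:
    """Reliability exp(-t / MTBF); only valid for a constant failure rate."""
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    return math.exp(-t_hours / mtbf_hours)


if __name__ == "__main__":
    mtbf = 1 * HOURS_PER_YEAR
    # Evaluate at start of life, half the MTBF, the MTBF and twice the MTBF.
    for t in (0, mtbf / 2, mtbf, 2 * mtbf):
        p = probability_of_success(t, mtbf)
        print(f"t = {t:7.0f} h  P(S) = {p:.4f}")
```

At t equal to the MTBF the result is exp(-1), roughly 0.37, so a unit operated for a period equal to its MTBF is more likely to have failed than not.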
Component Quality Level
Standard Commercial, Extended and Industrial
Differing standards for components
Avoidance
» How to AVOID (or to reduce the Probability of) the failure occurrence? No single point
failures, best Hi Rel products
Tolerance
» How to TOLERATE failures? Recognize that avoidance cannot always be achieved and include
a redundancy scheme as mitigation
Avoidance and Tolerance
Examples of avoidance include
» Can be permanent faults e.g. an input stuck at a level
» Functional analysis, derating, design rules, parts quality level, worst-case
analysis, quality standards.
Holistic Approach Required
Failure Rate Standards
There are a number of specifications for failure rate
» MIL-HDBK-217F Notice 2
» Telcordia SR-332
» Siemens Norm
» FIDES
» UTE C 80-810
While MIL-HDBK-217 is the most widely used, it has not been
updated for a number of years and as such is not accurate for
modern advanced technology
Two Approaches FMECA vs FTA
Failure Mode Effect and Criticality Analysis:
» Considers single events only (-)
» Exhaustive (+)
» Large volume of analysis (-)
» Easy to perform (+)
» Can be extended to SW errors and operator errors (+)
FAULT TREE:
» Considers combinations of events (+)
» Depends on analyst skill (-)
» Optimised (+)
» Difficult to perform (-)
» Can be extended to SW errors and operator errors (+)
Failure Mode Effect and Criticality Analysis
Purpose:
» To ASSESS the consequences of the elementary failures (parts, functional block...)
on the system
» To DESIGN (building) and DEMONSTRATE (verification) the efficiency of the FDIR
policy.
» To IDENTIFY Critical Items
» To SUPPORT contingency analysis
» To SUPPORT safety analysis
Phase:
» Preliminary: functional approach
» Detailed Design: detailed approach
Fault Tree Analysis
Purpose:
» To IDENTIFY the causes of hazardous events
» To IDENTIFY Critical Items
» To SUPPORT Safety analysis
Hardware Level – Derating
If the engineer designed the system such that the device was operating
with an electrical stress just below the absolute maximum the reliability
of the design would be considerably lowered as it would be operating
outside the recommended operating conditions
Reducing the electrical stress upon the design therefore enables the
engineer to produce reliable equipment.
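A derating check can be expressed as a comparison between the stress ratio (applied stress divided by rated value) and the derating factor required by the chosen standard. The sketch below assumes a hypothetical 60 % voltage derating factor and a hypothetical 50 V rated part purely for illustration; real factors must be taken from the applicable derating standard.

```python
def stress_ratio(applied: float, rated: float) -> float:
    """Fraction of the rated value that the part actually sees."""
    if rated <= 0:
        raise ValueError("rated value must be positive")
    return applied / rated


def is_derated(applied: float, rated: float, derating_factor: float) -> bool:
    """True when the applied stress is within the derated limit."""
    return stress_ratio(applied, rated) <= derating_factor


if __name__ == "__main__":
    RATED_VOLTAGE = 50.0    # hypothetical part rating, in volts
    DERATING_FACTOR = 0.6   # hypothetical factor, not taken from any standard

    for applied in (25.0, 30.0, 48.0):
        ok = is_derated(applied, RATED_VOLTAGE, DERATING_FACTOR)
        ratio = stress_ratio(applied, RATED_VOLTAGE)
        print(f"{applied:5.1f} V -> ratio {ratio:.2f}, within derating: {ok}")
```

A part run at 48 V on a 50 V rating is still inside its absolute maximum, yet it fails this check, which is the situation the slide warns against.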
Hardware Level - Derating Standards
• Depending upon end user / application
» MIL-STD-975 – NASA
» MIL-STD-1547 – DoD
» AS4613 – US Navy
» NAVSEA TE000-AB-GTP-010 – US Navy
» ECSS-Q-30-11A – ESA
» MSFC-STD-3012 – NASA Marshall Space Flight Center
Cold Spared – Redundant part powered down, takes time to get up and running
» Clocks
» Resets
How can we achieve code quality?
• Stringent set of coding rules – Including
Bad coding practice example
How can we enforce code quality?
• Linting tools – able to check the code against language and other rules
» FSM Checking
Clocks
Very simple in principle, but often easy to get wrong
» Carefully plan the use of global and local clock resources
Metastability
Clock Domain Crossing
Several techniques can be used, depending upon what needs to be
transferred
• Two-stage synchroniser – ideal for single-bit data
• Gray code synchroniser – encodes the data bus in Gray code and transfers it
between domains – ideal for counters, as the value to be converted to Gray
code can only increment / decrement by one from its previous value
• Handshake synchroniser – transfers a data bus between two clock
domains using handshake signals
• Pulse synchroniser – transfers a pulse from one clock domain to another
• Asynchronous FIFO – transfers data from one domain to another, useful
for high throughput / burst transfers
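The reason Gray code suits counters crossing clock domains is that consecutive values differ in exactly one bit, so a sample taken mid-transition can only be the old or the new value. The following is a behavioural sketch of the encoding in Python, not synthesisable HDL; the function names are illustrative.

```python
def binary_to_gray(n: int) -> int:
    """Convert a non-negative binary value to its Gray code equivalent."""
    return n ^ (n >> 1)


def gray_to_binary(g: int) -> int:
    """Convert a Gray code value back to binary by folding in shifted copies."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def bits_changed(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    WIDTH = 4  # illustrative counter width
    for i in range(1 << WIDTH):
        nxt = (i + 1) % (1 << WIDTH)  # includes the wrap from 15 back to 0
        g_now, g_next = binary_to_gray(i), binary_to_gray(nxt)
        print(f"{i:2d} -> {g_now:0{WIDTH}b}, bits changed to next: "
              f"{bits_changed(g_now, g_next)}")
```

The property only holds while the source value changes by one per step, which is why the technique is recommended for counters rather than arbitrary data buses.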
CDC Design Analysis tools
• Detecting all CDC issues can be a challenge in large designs
» Very easy to associate signal with wrong domain e.g. FIFO empty and WR clock
• CDC issues can be very difficult to find, as the behaviour can change on each start-up and may be
intermittent
• Can be hard to find in simulation – timing simulation is required, and it takes a long time to
simulate the system
• Static analysis tools are better suited to find the CDC issues.
Blue Pearl - Demo
Need to ensure spatial separation in the implemented device
Large Grain TMR
• Triplicates all resources in the design including IOB, clocks and reset trees;
however, flip-flops are not voted upon
• Uses one voter prior to the output of the three modules.
• Unlike global TMR, the flip-flops are not resynchronised
» Can be used with partial reconfiguration to reconfigure an incorrect chain if required.
• It has minimal domain cross points unlike global TMR.
• Mitigates both configuration and user logic errors
• Like global TMR it has area and power penalties
Local TMR
• Triplicates Flip Flops and votes on the output
• Best used in slow designs to stop an SET being clocked in
• Offers an area advantage as combinatorial logic is not replicated
• Mitigates errors in both user and configuration memory
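Whichever TMR scheme is chosen, the voter is a bitwise majority function: each output bit takes the value held by at least two of the three copies. The sketch below is a behavioural Python model of such a voter, with illustrative names and register values; it is not a description of any vendor's TMR implementation.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority of three redundant copies."""
    return (a & b) | (b & c) | (a & c)


if __name__ == "__main__":
    golden = 0b1011            # illustrative register contents
    upset = golden ^ 0b0100    # the same value with one bit flipped

    # One corrupted copy is out-voted by the two good copies.
    print(f"{majority_vote(golden, upset, golden):04b}")

    # The same bit flipped in two copies defeats the voter.
    print(f"{majority_vote(upset, upset, golden):04b}")
```

The second case shows why the scheme tolerates only a single upset per voted bit.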
State Machine
State machines are logical constructs that transition between a finite number
of states. They are often at the heart of many of our FPGA designs.
However, there are several issues which can arise if not designed correctly
» Terminal states – a state from which, once entered, there is no path to leave
» Unused states – states which are unused in the design; a transition into one of these states
can also result in a terminal state.
» Safe state machine – adds additional logic, which could itself be affected and cause
unexpected behaviour
» Encoding – is the state machine encoding correct for the environmental challenges?
The Unmapped State Machine
State machine choices other than TMR
State Machines
• What does a Hamming distance of three mean?
• For each state encoded with n bits there are also n adjacent
states, each differing from it by a single bit
[Figure: Adjacent States. State 000 has the adjacent states 001, 010 and 100.]
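Adjacency can be made concrete by counting differing bits. If every pair of valid state encodings differs in at least three bits, any single-bit upset lands in an adjacent, unused code that is closer to the original state than to any other valid state, so the original state can be recovered. The sketch below uses a hypothetical four-state, five-bit encoding chosen only to illustrate this; it is not the encoding produced by any particular synthesis tool.

```python
from itertools import combinations


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def adjacent_states(state: int, n_bits: int) -> list:
    """All codes reachable from state by flipping exactly one bit."""
    return [state ^ (1 << i) for i in range(n_bits)]


def nearest_valid_state(word: int, encoding: list) -> int:
    """Valid state with the smallest Hamming distance to word."""
    return min(encoding, key=lambda s: hamming_distance(word, s))


# Hypothetical encoding with a minimum pairwise distance of three.
ENCODING = [0b00000, 0b00111, 0b11001, 0b11110]
N_BITS = 5

if __name__ == "__main__":
    print(sorted(f"{s:03b}" for s in adjacent_states(0b000, 3)))
    print(min(hamming_distance(a, b) for a, b in combinations(ENCODING, 2)))
    # Every single-bit upset of every valid state maps back to that state.
    for state in ENCODING:
        for upset in adjacent_states(state, N_BITS):
            assert nearest_valid_state(upset, ENCODING) == state
```

The cost is extra state bits: four states need five bits here rather than two.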
State Machine Analysis
• Complex state machines can be hard to analyse
• Are all states mapped?
• Are there terminal states?
• Static FSM analysis is important in providing understanding & correct behaviour
• Can be used for documentation as well
• Blue Pearl VVS is ideal
Counters
Many engineers look only for the terminal count of the counter, for example
IF count = 9 THEN
However, an SEU can push the counter to a value greater than 9, so the equality test is missed
In such a case the counter would fail, and the effects may propagate
Instead use
IF count >= 9 THEN
In the worst case the timer will trigger early (which can be mitigated at FPGA
level) rather than lock up the counter
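The difference between the two comparisons can be seen in a small behavioural model. The sketch below is Python rather than HDL and assumes a 4-bit counter, a terminal count of 9 and a single bit flip injected at a chosen cycle; all names and values are illustrative.

```python
from typing import Optional


def cycles_to_terminal(terminal: int, use_ge: bool,
                       upset_cycle: Optional[int] = None, upset_bit: int = 0,
                       width: int = 4, max_cycles: int = 100) -> Optional[int]:
    """Return the cycle at which the terminal test fires, or None if never."""
    mask = (1 << width) - 1
    count = 0
    for cycle in range(max_cycles):
        if cycle == upset_cycle:
            count ^= 1 << upset_bit  # model an SEU flipping one counter bit
        fired = count >= terminal if use_ge else count == terminal
        if fired:
            return cycle
        count = (count + 1) & mask   # free-running counter wraps at 2**width
    return None


if __name__ == "__main__":
    # No upset: both comparisons fire on the same cycle.
    print(cycles_to_terminal(9, use_ge=False), cycles_to_terminal(9, use_ge=True))
    # Bit 3 flipped at cycle 5 takes the count from 5 to 13, past the terminal value.
    print(cycles_to_terminal(9, use_ge=False, upset_cycle=5, upset_bit=3))
    print(cycles_to_terminal(9, use_ge=True, upset_cycle=5, upset_bit=3))
```

In this model the equality test only fires after the counter has wrapped all the way round, which for a wide counter is effectively a lock-up, while the `>=` test fires early on the cycle of the upset.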
Watchdog
• Commonly used in embedded systems
Questions ?
www.adiuvoengineering.com