Fault Tolerant CNN
Types of faults
● A transient fault is a fault of short duration, from which the system can recover.
● An intermittent fault, often called simply an
"intermittent", is a malfunction of a device
or system that occurs at intervals, usually
irregular.
● Permanent faults are failures that manifest
as stuck-at bits in the architecture, that is,
lines that always carry the logical signal “0”
or “1” as the result of a short or open circuit.
Soft Errors
● Soft Errors are transient faults that manifest as bit flips and result in data
corruption.
○ These are mainly caused by extrinsic sources, such as alpha particles emitted from impurities in packaging materials or neutrons from cosmic radiation striking the chip.
● Temperature is another factor that can result in an increase in the Soft Error Rate
(SER).
Aging
Aging of electronic circuits occurs due to various physical phenomena such as Bias
Temperature Instability (BTI), Hot Carrier Injection (HCI), Time-Dependent Dielectric
Breakdown (TDDB) and Electromigration (EM).
● It typically results in circuits becoming slower over time, e.g., by increasing the threshold voltage (VTH) of the circuits or by breakdown of dielectrics and wires.
● Aging faults manifest as timing errors in the early stages and can later transform into permanent faults. Alongside various other factors, the rate of aging usually increases with temperature.
Process Variations
● Recent literature indicates that soft errors are inevitable in modern systems, from edge computing devices to supercomputers, due to:
○ High-energy cosmic radiation
○ Aging and wear of devices
● CNN inference applications often call for power-efficient and cost-efficient machine learning accelerators, which may adopt overclocking with voltage underscaling and thus incur more soft errors than standard hardware does.
● Resilient convolutional neural networks are essential for guaranteeing the correctness
of inference applications.
● A single bit flip occurring during CNN image classification can result in silent data corruption (SDC) rates as high as 40% in the datapath and 70% in memory.
● Such high SDC rates would degrade the CNN prediction accuracy dramatically (see the sketch below).
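As an illustration of how a single flipped bit becomes silent data corruption, here is a minimal Python sketch; the weight value and the bit positions are arbitrary choices for contrast, not taken from the paper.

# Flip one bit of an IEEE-754 float32 "weight" and observe the corruption.
import struct

def flip_bit(x: float, bit: int) -> float:
    """Flip the given bit in the 32-bit binary representation of x."""
    (as_int,) = struct.unpack("<I", struct.pack("<f", x))
    as_int ^= 1 << bit
    (corrupted,) = struct.unpack("<f", struct.pack("<I", as_int))
    return corrupted

w = 0.5                    # a typical CNN weight magnitude (arbitrary)
print(flip_bit(w, 3))      # low mantissa bit: ~0.5000005, likely harmless
print(flip_bit(w, 30))     # high exponent bit: ~1.7e38, catastrophic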
Convolution (CONV) Layer
[Figure: element-wise multiplication and partial sum (psum) accumulation]
Convolution (CONV) Layer
[Figure: sliding window processing; a filter (weights) slides over the input fmap to produce output activations]
Convolution (CONV) Layer
[Figure: many input channels (C)]
Convolution (CONV) Layer
[Figure: many filters (M) producing many output channels (M)]
Convolution (CONV) Layer
[Figure: many input fmaps (N) and many output fmaps (N)]
CNN Decoder Ring
• N – Number of input fmaps/output fmaps (batch size)
• C – Number of 2-D input fmaps/filters (channels)
• H – Height of input fmap (activations)
• W – Width of input fmap (activations)
• R – Height of 2-D filter (weights)
• S – Width of 2-D filter (weights)
• M – Number of 2-D output fmaps (channels)
• E – Height of output fmap (activations)
• F – Width of output fmap (activations)
CONV Layer Tensor Computation
Given output fmaps (O), input fmaps (I), biases (B), and filter weights (W), each output activation is computed as (unit stride assumed):

O[n][m][e][f] = B[m] + Σ_{c=0}^{C−1} Σ_{r=0}^{R−1} Σ_{s=0}^{S−1} I[n][c][e+r][f+s] · W[m][c][r][s],

for 0 ≤ n < N, 0 ≤ m < M, 0 ≤ e < E, 0 ≤ f < F.
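As a concrete reading of this equation, the following is a minimal NumPy sketch of the direct loop-nest convolution in the decoder-ring notation, assuming unit stride and no padding; it is for illustration only, not an efficient implementation.

# Direct CONV layer: O[n][m][e][f] = B[m] + sum over (c, r, s).
import numpy as np

def conv_layer(I, Wt, B):
    """I: (N, C, H, W) input fmaps; Wt: (M, C, R, S) filters; B: (M,) biases."""
    N, C, H, W = I.shape
    M, _, R, S = Wt.shape
    E, F = H - R + 1, W - S + 1                # output fmap size (unit stride)
    O = np.empty((N, M, E, F), dtype=I.dtype)
    for n in range(N):
        for m in range(M):
            for e in range(E):
                for f in range(F):
                    # element-wise multiplication and psum accumulation
                    window = I[n, :, e:e + R, f:f + S]
                    O[n, m, e, f] = B[m] + np.sum(window * Wt[m])
    return O

rng = np.random.default_rng(0)
I = rng.standard_normal((2, 3, 8, 8)).astype(np.float32)
Wt = rng.standard_normal((4, 3, 3, 3)).astype(np.float32)
B = rng.standard_normal(4).astype(np.float32)
print(conv_layer(I, Wt, B).shape)              # (2, 4, 6, 6)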
Apply Checksums
● A fault represents a malfunction event; we denote its corresponding symptom as a soft error.
Proof: the full checksum scheme uses both the row checksum Co1 and the column checksum Co2, so it can correct soft errors in both directions.
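These checksums rest on the linearity of convolution: convolving summed inputs (or summed filters) equals summing the corresponding outputs. Below is a minimal NumPy sketch verifying that identity; the conv2d helper, the shapes, and the mapping of Co1/Co2 onto the m and n directions are illustrative assumptions, not the paper's implementation.

# Verify the convolution checksum identities on random data (bias omitted).
import numpy as np

def conv2d(I, Wt):
    """I: (C, H, W) fmap; Wt: (C, R, S) filter -> (E, F); unit stride, no padding."""
    C, H, W = I.shape
    _, R, S = Wt.shape
    E, F = H - R + 1, W - S + 1
    return np.array([[np.sum(I[:, e:e + R, f:f + S] * Wt)
                      for f in range(F)] for e in range(E)])

rng = np.random.default_rng(1)
N, M = 3, 4
I = rng.standard_normal((N, 3, 6, 6))
Wt = rng.standard_normal((M, 3, 3, 3))
O = np.array([[conv2d(I[n], Wt[m]) for m in range(M)] for n in range(N)])

# Row checksum Co1: convolve each input with the summed filter;
# equals the outputs summed over the output channels m.
Co1 = np.array([conv2d(I[n], Wt.sum(axis=0)) for n in range(N)])
assert np.allclose(Co1, O.sum(axis=1))

# Column checksum Co2: convolve the summed input with each filter;
# equals the outputs summed over the batch index n.
Co2 = np.array([conv2d(I.sum(axis=0), Wt[m]) for m in range(M)])
assert np.allclose(Co2, O.sum(axis=0))
print("checksum identities hold")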
Analysis of Protection Ability for Convolution Checksum Schemes
Ability of Row and Column Checksum Scheme
● The row checksum scheme and the column checksum scheme are symmetric.
● The row checksum scheme can detect and correct soft errors if they are in the same row.
● If the soft errors are in the same column, the row checksum scheme can only detect them; it has no correction ability.
● The column checksum scheme, on the contrary, can detect and correct errors located in the same column but fails to correct those appearing in the same row.
Revisit Row Checksum Scheme (RC)
Revisit Column Checksum Scheme (ClC)
Ability of Full Checksum Scheme
● The full checksum scheme has the highest ability to correct soft errors.
● The scheme uses both the row checksum Co1 and the column checksum Co2, so it can correct soft errors in both directions (see the sketch below).
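Building on those identities, here is a minimal self-contained NumPy sketch of the full checksum scheme's detect, locate, and correct flow; the conv2d helper, the shapes, and the injected error are illustrative assumptions.

# Locate and correct a single corrupted output fmap via row + column checksums.
import numpy as np

def conv2d(I, Wt):
    C, H, W = I.shape
    _, R, S = Wt.shape
    E, F = H - R + 1, W - S + 1
    return np.array([[np.sum(I[:, e:e + R, f:f + S] * Wt)
                      for f in range(F)] for e in range(E)])

rng = np.random.default_rng(2)
N, M = 3, 4
I = rng.standard_normal((N, 3, 6, 6))
Wt = rng.standard_normal((M, 3, 3, 3))
O = np.array([[conv2d(I[n], Wt[m]) for m in range(M)] for n in range(N)])
Co1 = np.array([conv2d(I[n], Wt.sum(axis=0)) for n in range(N)])    # row checksums
Co2 = np.array([conv2d(I.sum(axis=0), Wt[m]) for m in range(M)])    # column checksums

Ocorr = O.copy()
Ocorr[1, 2, 0, 0] += 7.0                       # inject one soft error (bit-flip symptom)

row_diff = Ocorr.sum(axis=1) - Co1             # nonzero only for the corrupted n
col_diff = Ocorr.sum(axis=0) - Co2             # nonzero only for the corrupted m
n_star = int(np.flatnonzero(np.abs(row_diff).sum(axis=(1, 2)) > 1e-8)[0])
m_star = int(np.flatnonzero(np.abs(col_diff).sum(axis=(1, 2)) > 1e-8)[0])

Ocorr[n_star, m_star] -= row_diff[n_star]      # the residual is the error pattern
assert np.allclose(Ocorr, O)
print(f"corrected output fmap at (n={n_star}, m={m_star})")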
Limitation in Fault Model
● Without loss of generality, the following analysis considers at most one fault per convolution.
● A convolutional neural network contains several or even tens of convolutional layers, and the total forward execution time of a CNN model is usually within seconds.
● Thus, given the short execution time of a single layer, we can reasonably assume that at most one fault strikes any one convolutional layer.
● Multiple faults per convolution can still be detected by our schemes and recovered from by recomputing the corrupted convolutional layer.
Checksum-of-Checksum Scheme (CoC/CoC-D)
A single soft error in O can be corrected by CoC using all checksums including Co5, Co6, and Co7,
as shown in Figure 2 (b).
Soft Error Protection Ability of CoC Scheme
● A single soft error in O can be corrected by CoC using all the checksums, including Co5, Co6, and Co7.
● Among Co5, Co6, and Co7, at most one will be corrupted, per the fault-model assumption of one fault per convolution execution.
Ability of the Four Schemes
● The CoC scheme has the lowest error correction ability, and the full checksum scheme has the best error correction ability.
● The abilities of the row checksum scheme and the column checksum scheme are higher than that of the CoC scheme but lower than that of the full checksum scheme.
● CoC-D can detect multiple soft errors but has no correction ability.
This analysis serves as the fundamental basis of our low-overhead, high-protection design.
Multischeme Workflow for Soft Error Protection
● The error detection module is executed on every run, whether or not a soft error occurs. Thus, any unnecessary computation should be avoided in order to reduce the overall overhead.
● For instance, both CoC-D and FC are able to detect all soft errors, but we adopt only CoC-D in the workflow for error detection because FC has a much higher overhead. RC and ClC cannot detect soft errors reliably if the checksum itself is corrupted.
● The error correction module is not executed until some soft errors are detected. The schemes in this module are invoked to fix soft errors according to the workflow. If a scheme fails to correct the errors due to inconsistency of checksum blocks or illegal error locations, the next-level scheme is invoked.
● Because checksums can be reused among the different schemes in the workflow, the runtime of the workflow is actually lower than the sum of all the schemes' runtimes (see the control-flow sketch below).
● For example, both CoC-D and CoC use Co5; if CoC-D detects soft errors and CoC is invoked to correct them, CoC can save the time of computing Co5 and its corresponding summation So5, since these have already been computed by CoC-D.
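The control flow described above can be condensed into a short Python sketch; the callables (run_layer, detect_coc_d, and the ordered corrector list) are hypothetical placeholders standing in for the paper's schemes, not an actual API.

# Multischeme workflow: always-on detection, escalating correction, recompute last.
from typing import Any, Callable, List, Optional

def protected_conv(run_layer: Callable[[], Any],
                   detect_coc_d: Callable[[Any], bool],
                   correctors: List[Callable[[Any], Optional[Any]]]) -> Any:
    """Run one convolutional layer under the multischeme workflow."""
    out = run_layer()
    if not detect_coc_d(out):          # detection executes on every run
        return out                     # no soft error detected
    for correct in correctors:         # e.g. CoC first, then RC/ClC, then FC
        fixed = correct(out)           # None signals the scheme failed to correct
        if fixed is not None and not detect_coc_d(fixed):
            return fixed
    return run_layer()                 # last resort: recompute the whole layer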
Code Ref [Optional]
https://oneapi-src.github.io/oneDNN/v0/index.html
https://github.com/oneapi-src/oneDNN/blob/master/examples/cnn_inference_f32.cpp
https://github.com/oneapi-src/oneDNN/blob/master/examples/cnn_inference_int8.cpp
Ref:
FT-CNN: Algorithm-Based Fault Tolerance for Convolutional Neural Networks. https://www.researchgate.net/publication/348144782_FT-CNN_Algorithm-Based_Fault_Tolerance_for_Convolutional_Neural_Networks