Fault Tolerant CNN

This document discusses fault tolerance in convolutional neural networks (CNNs). It describes different types of faults, including transient, intermittent, and permanent faults, and sources of faults such as soft errors from radiation, temperature increases, aging, process variations, and overclocking. Soft errors are common in CNN inference applications and can cause high error rates. The document proposes applying checksums to the CNN filter weights and input/output tensors to detect soft errors during computation: checksums can detect but not correct errors in the input, while errors in the weights can be corrected by reloading them from the model.

Types of faults
● A transient fault is a short-lived fault; normal operation can be restored once the fault disappears.
● An intermittent fault, often called simply an "intermittent", is a malfunction of a device or system that occurs at intervals, usually irregular.
● Permanent faults are failures that manifest as stuck-at bits in the architecture, that is, lines that always carry the logical signal "0" or "1" as the result of a short or open circuit.
Soft Errors

● Soft errors are transient faults that manifest as bit flips and result in data corruption.
○ They are mainly caused by extrinsic sources such as alpha particles emitted from impurities in packaging materials, or by neutrons from cosmic radiation striking the chip.
● Temperature is another factor that can increase the Soft Error Rate (SER).
Aging

Aging of electronic circuits occurs due to various physical phenomena such as Bias Temperature Instability (BTI), Hot Carrier Injection (HCI), Time-Dependent Dielectric Breakdown (TDDB), and Electromigration (EM).
● It typically makes circuits slower over time, e.g., by increasing the threshold voltage (VTH), or causes breakdown of dielectrics and wires.
● Aging faults manifest as timing errors during the early stages and can later transform into permanent faults. Alongside various other factors, the rate of aging usually increases with temperature.
Process Variations

● Process variation is an issue caused by variations introduced during the manufacturing process.
● It arises because it is difficult to manufacture transistors with the same properties, such as channel length, oxide thickness, and doping levels, at the nano-scale.
● It degrades performance (e.g., in the form of reduced operating frequency) and power efficiency, as well as the yield of the manufacturing process (in the case of permanent faults).
Error Prone CNN Application

● Recent literature indicates that soft errors are inevitable in modern systems, from edge computing devices to supercomputers:
○ high-energy cosmic radiation
○ aging and wear of devices
● CNN inference applications often call for power-efficient and cost-efficient machine learning accelerators, which may adopt overclocking with voltage underscaling, incurring more soft errors than common hardware does.
● Resilient convolutional neural networks are essential for guaranteeing the correctness of inference applications.
● A single bit flip during CNN image classification can result in an SDC rate as high as 40% in the datapath and 70% in memory.
● Such high SDC rates would degrade the CNN prediction accuracy dramatically.

[SDC: Silent Data Corruption]

● Existing resilience solutions such as ECC are insufficient for protecting CNN inference applications against multiple soft errors.
Pruning in hardware: A solution to Permanent Faults

● In hardware, pruning is achieved by introducing a separate bypass path for faulty MAC units.
● With the bypass path enabled, the faulty MAC unit's contribution to the column sum is skipped, which is equivalent to setting the faulty MAC's weight to zero.
● The area overhead due to the new bypass path is only about 9%.
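
At the model level the bypass amounts to masking weights; a minimal NumPy sketch, where faulty_mask is a hypothetical boolean map of faulty MAC positions:

```python
import numpy as np

def prune_faulty_macs(weights, faulty_mask):
    """Skipping a faulty MAC's contribution to the column sum is
    equivalent to setting the corresponding weight to zero."""
    return np.where(faulty_mask, 0.0, weights)

# Example: one faulty MAC position is zeroed out.
w = np.array([[0.5, -1.2], [0.3, 0.8]])
mask = np.array([[False, True], [False, False]])
print(prune_faulty_macs(w, mask))   # [[0.5, 0.0], [0.3, 0.8]]
```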

How to handle soft errors?


Convolution (CONV) Layer

[Figure series: a 2-D filter (weights) of size R×S slides over an H×W input fmap to produce an E×F output fmap, one output activation at a time.]
● Element-wise multiplication and partial sum (psum) accumulation produce each output activation.
● Sliding window processing moves the filter across the input fmap.
● Many input channels (C): the filter and the input fmap each have C channels.
● Many output channels (M): M filters produce M output channels.
● Many input fmaps (N): a batch of N input fmaps produces N output fmaps.
CNN Decoder Ring
• N – Number of input fmaps/output fmaps (batch size)
• C – Number of 2-D input fmaps/filters (channels)
• H – Height of input fmap (activations)
• W – Width of input fmap (activations)
• R – Height of 2-D filter (weights)
• S – Width of 2-D filter (weights)
• M – Number of 2-D output fmaps (channels)
• E – Height of output fmap (activations)
• F – Width of output fmap (activations)
CONV Layer Tensor Computation

The output fmaps (O) are computed from the input fmaps (I), biases (B), and filter weights (W). Assuming unit stride and no padding:

O[n][m][e][f] = B[m] + Σ_{c=0..C−1} Σ_{r=0..R−1} Σ_{s=0..S−1} I[n][c][e+r][f+s] × W[m][c][r][s],

for 0 ≤ n < N, 0 ≤ m < M, 0 ≤ e < E, 0 ≤ f < F, with E = H − R + 1 and F = W − S + 1.
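
To make the index mapping concrete, here is a minimal NumPy sketch of the direct (loop-based) computation; the stride-1/no-padding convention and shapes are illustrative assumptions, not from the slides:

```python
import numpy as np

def conv_layer(I, W, B):
    """Direct CONV layer. I: (N, C, H, Wi) input fmaps, W: (M, C, R, S)
    filter weights, B: (M,) biases. Returns O: (N, M, E, F)."""
    N, C, H, Wi = I.shape
    M, _, R, S = W.shape
    E, F = H - R + 1, Wi - S + 1          # stride 1, no padding
    O = np.zeros((N, M, E, F))
    for n in range(N):                    # batch
        for m in range(M):                # output channels
            for e in range(E):            # sliding window over height...
                for f in range(F):        # ...and width
                    # element-wise multiply, psum-accumulate over C, R, S
                    O[n, m, e, f] = B[m] + np.sum(I[n, :, e:e+R, f:f+S] * W[m])
    return O
```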
Apply Checksums

● Cd1 and Cw1 are calculated before the convolution operation.
● Therefore, any memory error striking D or W during the convolution would not affect Cd1 or Cw1.
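
A minimal sketch of this step, assuming (for illustration only) that the blocks of D lie along the batch dimension and each filter is one block of W:

```python
import numpy as np

def input_checksum(D):
    """Cd1: element-wise sum of the blocks of D (assumed block layout)."""
    return D.sum(axis=0)

def weight_checksum(W):
    """Cw1: element-wise sum of the blocks of W (one block per filter)."""
    return W.sum(axis=0)

# Both checksums are computed BEFORE the convolution starts, so a memory
# error striking D or W during the convolution cannot affect Cd1 or Cw1.
# Cd1 = input_checksum(D); Cw1 = weight_checksum(W)
```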
Original Checksum

● In the checksum error detection scheme, the data is divided into k segments of m bits each.
● At the sender's end, the segments are added using 1's complement arithmetic to get the sum. The sum is complemented to get the checksum.
● The checksum segment is sent along with the data segments.
● At the receiver's end, all received segments are added using 1's complement arithmetic to get the sum. The sum is complemented.
● If the result is zero, the received data is accepted; otherwise it is discarded.
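
A short Python sketch of this scheme; the 16-bit segment width is the conventional choice, not mandated above:

```python
def ones_complement_checksum(segments, m=16):
    """Sender: add the k m-bit segments with 1's complement arithmetic
    (end-around carry), then complement the sum to get the checksum."""
    mask = (1 << m) - 1
    total = 0
    for seg in segments:
        total += seg
        total = (total & mask) + (total >> m)   # fold the carry back in
    return (~total) & mask

def accept(segments_and_checksum, m=16):
    """Receiver: add everything (data + checksum) the same way, then
    complement; zero means accept, anything else means discard."""
    mask = (1 << m) - 1
    total = 0
    for seg in segments_and_checksum:
        total += seg
        total = (total & mask) + (total >> m)
    return ((~total) & mask) == 0

data = [0x4E20, 0x0F12, 0xB3C4]
packet = data + [ones_complement_checksum(data)]
assert accept(packet)               # clean data is accepted
```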
Fault Model
Assumptions:
○ transient faults in computational units
○ data corruption faults (both transient and persistent) in memory (including cache)

● A fault represents a malfunction event; its corresponding symptom is denoted a soft error.

● Soft error protection includes error detection and error correction.

● Error detection means that the scheme can detect soft errors without knowing their exact locations. Error correction means that the scheme can locate the soft errors and recover the incorrect result.
Analysis of Soft Error in D and W
● Soft errors in W can be detected by comparing the checksum of W with Cw1 and corrected by reloading the weights from the CNN model (see the sketch after this list).
● Soft errors in D do not need correction because D will be discarded after the convolution computation.
● The resulting errors in the output can be detected and corrected by the checksums of the output, as demonstrated below.
● One fault during the convolution execution can result in corruption of one block row or column of O.
● By definition, row i of O is computed from the ith block of D with W. Thus, one fault in D would result in at most one corrupted row.
● Column j of O is computed from D with the jth block of W. Thus, one fault in W would result in at most one corrupted column.
● Moreover, an intermediate result is reused only within the same row or column, so one fault in the computational units would corrupt only values in the same row or column.
● Accordingly, the soft error protection ability is analyzed in the context of at most one corrupted row or column of O.
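
A hedged sketch of the W-side handling described above; reload_weights is a hypothetical callback that fetches a clean copy of the weights from the stored CNN model:

```python
import numpy as np

def check_and_repair_weights(W, Cw1, reload_weights, atol=1e-5):
    """Detect soft errors in W by re-deriving its checksum and comparing
    it with the precomputed Cw1; correct by reloading from the model."""
    if not np.allclose(W.sum(axis=0), Cw1, atol=atol):
        W = reload_weights()    # soft error detected: restore clean copy
    return W
```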
CNN Checksums

● The output O is represented in the form of blocks.
● Elements inside the same block are independent with respect to checksums.
● The checksum comparison is performed independently for each element across blocks.
● Multiple soft errors in the same block can therefore be detected and corrected independently.
Distributive property of Convolution
● Convolution is linear, so it distributes over addition: (Σᵢ Dᵢ) ⊛ W = Σᵢ (Dᵢ ⊛ W), and likewise D ⊛ (Σⱼ Wⱼ) = Σⱼ (D ⊛ Wⱼ). A numerical check follows below.
Sum of Outputs and Output CheckSum
● By the distributive property, the output checksums can be derived from the input checksums: the row checksum Co1 = Cd1 ⊛ W must equal the block-wise sum of the rows of O, and the column checksum Co2 = D ⊛ Cw1 must equal the block-wise sum of the columns of O.
Row Checksum Scheme (RC)
● RC computes Co1 from Cd1 and compares it against the corresponding sum of O.
Column Checksum Scheme (ClC)
● ClC computes Co2 from Cw1 and compares it against the corresponding sum of O.
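
A quick numerical check of the distributive property, using SciPy's 2-D cross-correlation (the operation CNNs actually compute) on single-channel data:

```python
import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(0)
d1, d2 = rng.standard_normal((2, 8, 8))   # two input blocks
w = rng.standard_normal((3, 3))           # one 2-D filter

# (d1 + d2) ⊛ w  ==  d1 ⊛ w + d2 ⊛ w
lhs = correlate2d(d1 + d2, w, mode="valid")
rhs = correlate2d(d1, w, mode="valid") + correlate2d(d2, w, mode="valid")
assert np.allclose(lhs, rhs)
```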
Practice Assignment

Prove: the full checksum scheme uses both the row checksum Co1 and the column checksum Co2, so it can correct soft errors in both directions.
Analysis of Protection Ability for Convolution Checksum Schemes
Ability of Row and Column Checksum Scheme
● The row checksum scheme and the column checksum scheme are symmetric.
● The row checksum scheme can detect and correct soft errors if they are in the same row.
● If the soft errors are in the same column, the row checksum scheme can only detect them; it has no correction ability.
● The column checksum scheme, on the contrary, can detect and correct errors located in the same column but fails to correct those appearing in the same row.
Revisit Row Checksum Scheme (RC)
Revisit Column Checksum Scheme (ClC)
Ability of Full Checksum Scheme
● The full checksum scheme has the highest ability to correct soft errors.
● It uses both the row checksum Co1 and the column checksum Co2, so it can correct soft errors in both directions.
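
To make "both directions" concrete, the sketch below shows generic single-error ABFT on a plain matrix: the row checksums locate the corrupted row, the column checksums locate the corrupted column, and the mismatch magnitude corrects the element. This illustrates the principle, not the paper's exact block-level algorithm:

```python
import numpy as np

def correct_single_error(O, row_sums, col_sums, atol=1e-6):
    """row_sums/col_sums are checksums computed before any corruption.
    A single corrupted element lies at the intersection of the one
    mismatching row and the one mismatching column."""
    dr = O.sum(axis=1) - row_sums                   # per-row mismatch
    dc = O.sum(axis=0) - col_sums                   # per-column mismatch
    rows = np.flatnonzero(~np.isclose(dr, 0, atol=atol))
    cols = np.flatnonzero(~np.isclose(dc, 0, atol=atol))
    if len(rows) == 1 and len(cols) == 1:
        i, j = rows[0], cols[0]
        O[i, j] -= dr[i]                            # dr[i] == dc[j]: the error
    return O

# Usage: checksum first, corrupt one element, then repair.
M = np.arange(12, dtype=float).reshape(3, 4)
rs, cs = M.sum(axis=1), M.sum(axis=0)
M[1, 2] += 7.0                                      # inject a single error
assert np.allclose(correct_single_error(M, rs, cs),
                   np.arange(12).reshape(3, 4))
```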
Limitation in Fault Model
● Without loss of generality, the following analysis considers at most one fault per convolution.
● A convolutional neural network contains several or even tens of convolutional layers, and the total forward execution time of a CNN model is usually within seconds.
● Thus, we can reasonably assume that at most one fault strikes a given convolutional layer, considering the short execution time of a single layer.
● Multiple faults per convolution can still be detected by the schemes and recovered by recomputing the corrupted convolutional layer.
Checksum-of-Checksum Scheme (CoC/CoC-D)

● The scheme involves neither D nor W but only their checksums, so it is named the checksum-of-checksum scheme (or CoC scheme for short).
Soft Error Protection Ability of CoC Scheme

When soft errors strike the input or output data:

● A single soft error in O can be detected by CoC using Co5.
● A single soft error in O can be corrected by CoC using all checksums, including Co5, Co6, and Co7, as shown in Figure 2 (b).

[CoC cannot correct soft errors across multiple blocks in O.]


When checksums are corrupted

● Such soft errors cause inconsistency among the output checksums of CoC, which can be used for error detection.
● For example, if Cd1 is corrupted, it leads to corrupted Co5 and Co6 with a correct Co7.
● We can detect this abnormal pattern when comparing the checksums with the summation of O, and thereby detect the input checksum corruption.
● In this case the input D, the weights W, and the output O are clean, since the fault frequency is at most one per convolution. Thus, we can safely discard all the checksums and finish the convolution computation.
When checksums are corrupted

● Among Co5, Co6, and Co7, only one can be corrupted.

["One fault during the convolution execution": assumption from the fault model]
Ability of four schemes

● The CoC scheme has the lowest error correction ability, and the full checksum scheme has the best error correction ability.
● The abilities of the row checksum scheme and the column checksum scheme are higher than that of the CoC scheme but lower than that of the full checksum scheme.
● CoC-D can detect multiple soft errors but has no correction ability.

This analysis serves as the fundamental basis of our low-overhead, high-protection design.
Multischeme Workflow for Soft Error Protection

● To achieve the highest protection ability at the lowest overhead, the four schemes are integrated into a multischeme workflow.
● CoC-D is used to detect errors because it has the lowest overhead.
● For error correction, CoC is placed at the beginning because it is the most lightweight method.
● By comparison, FC has the highest correction ability but also the highest time overhead, so it is placed at the end of the workflow.
Multischeme Workflow for Soft Error Protection

● The error detection modules are executed on every run, whether there is a soft error or not. Thus, any unnecessary computation should be avoided in order to reduce the overall overhead.
● For instance, both CoC-D and FC are able to detect all soft errors, but only CoC-D is adopted in the workflow for error detection because FC has a much higher overhead. RC and ClC cannot detect soft errors reliably if a checksum is corrupted.
Multischeme Workflow for Soft Error Protection

● The error correction module is not executed until some soft errors are detected. The schemes in this module are invoked to fix soft errors according to the workflow. If a scheme fails to correct the errors, due to inconsistency of checksum blocks or illegal error locations, the next-level scheme is invoked.
Multischeme Workflow for Soft Error Protection

● Because checksums can be reused among the different schemes in the workflow, the runtime of the workflow is actually lower than the sum of all the schemes' runtimes.
● For example, both CoC-D and CoC use Co5; if CoC-D detects soft errors and CoC is invoked to correct them, CoC saves the time of computing Co5 and its corresponding summation So5, since they have already been computed by CoC-D.
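
A schematic of the workflow's control flow; all scheme functions here are hypothetical stand-ins, each returning True on success:

```python
def protect_convolution(run_conv, coc_d_detect, correctors):
    """Multischeme workflow sketch. `correctors` is ordered cheapest-first,
    e.g. [coc_correct, rc_correct, clc_correct, fc_correct]; each returns
    True when it manages to repair the output in place."""
    out = run_conv()
    if not coc_d_detect(out):       # detection runs on every execution
        return out                  # common case: no soft error
    for correct in correctors:      # escalate: CoC -> RC/ClC -> FC
        if correct(out):
            return out
    return run_conv()               # last resort: recompute the layer
```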
Code Ref [Optional]
https://oneapi-src.github.io/oneDNN/v0/index.html

https://github.com/oneapi-src/oneDNN/blob/master/examples/cnn_inference_f32.cpp

https://github.com/oneapi-src/oneDNN/blob/master/examples/cnn_inference_int8.cpp
Ref:
https://www.researchgate.net/publication/348144782_FT-CNN_Algorithm-Based_Fault_Tolerance_for_Convolutional_Neural_Networks
