
Deep Learning Overview

Foreword

 This chapter describes the basic knowledge related to deep learning, including the
development history, components and types of neural networks for deep learning,
and common problems in deep learning engineering.

1 Huawei Confidential
Objectives

 Upon completion of this course, you will be able to:


 Describe the definition and development of neural networks.
 Be familiar with the components of neural networks for deep learning.
 Be familiar with the training and optimization of neural networks.
 Describe common problems in deep learning.

2 Huawei Confidential
Contents

1. Deep Learning

2. Training Rules

3. Activation Functions

4. Regularization

5. Optimizers

6. Neural Network Types

3 Huawei Confidential
Traditional Machine Learning vs. Deep Learning
 Deep learning is a learning model based on unsupervised feature learning and hierarchical feature representation. It has great advantages in computer vision, speech recognition, and natural language processing (NLP).
Traditional Machine Learning vs. Deep Learning:
 Hardware: traditional machine learning has low hardware requirements because the amount of computation is limited and GPUs are not required for parallel computing; deep learning has certain hardware requirements because numerous matrix operations are needed, and it requires GPUs for parallel computing.
 Training data: traditional machine learning is suitable for training with a small amount of data, but its performance cannot improve as the amount increases; deep learning achieves high performance when high-dimensional weight parameters and massive training data are provided.
 Problem solving: in traditional machine learning, problems are located level by level; deep learning is an end-to-end learning method.
 Features: in traditional machine learning, features are manually selected; in deep learning, algorithms are used to automatically extract features.
 Interpretability: feature interpretability is strong in traditional machine learning and weak in deep learning.

4 Huawei Confidential
Traditional Machine Learning
(Figure: traditional machine learning pipeline — problem analysis & task locating → data cleansing → feature extraction → feature selection → model training → inference, prediction, and identification.)

Question: Can you use an algorithm to automatically execute the procedure?

5 Huawei Confidential
Deep Learning
 Generally, the deep learning architecture is a deep neural network. "Deep" in "deep learning" refers to the number of layers of the neural network.

(Figure: a human neural network, a perceptron, and a deep neural network, each shown with an input layer, hidden layer(s), and an output layer.)

6 Huawei Confidential
Neural Network
 Currently, definitions of the neural network are not unified. According to Hecht-Nielsen, an American neural network
scientist, a neural network is a computing system made up of a number of simple, highly interconnected processing
elements, which process information by their dynamic state response to external inputs.
 Based on the source, features, and explanations of the neural network, the artificial neural network (ANN) can be
simply expressed as an information processing system designed to imitate the human brain structure and
functions.
 ANN is a network formed by artificial neurons connected to each other. It extracts and simplifies the human brain's
microstructure and functions, and is an important approach to simulating human intelligence. It reflects several
basic features of human brain functions, such as concurrent information processing, learning, association, model
classification, and memory.

7 Huawei Confidential
Development History of Neural Networks

(Timeline: 1957 — perceptron (golden period); 1970 — XOR problem (AI winter); 1986 — multi-layer perceptron; 1995 — SVM; 2006 — deep network.)

8 Huawei Confidential
Single-layer Perceptron
 Input vector: X = [x0, x1, …, xn]^T.
 Weight vector: W = [w0, w1, …, wn]^T, where w0 indicates an offset (bias).
 Activation function: O = sign(W·X) = 1 if W·X > 0, and −1 otherwise.

 The preceding perceptron is equivalent to a classifier. It uses the high-dimensional vector X as the input and performs binary classification on input samples in the high-dimensional space. When W·X > 0, O = 1; in this case, the samples are classified into one type. Otherwise, O = −1; in this case, the samples are classified into the other type. The boundary between these two types is W·X = 0, which is a high-dimensional hyperplane.

 Classification point: w0 + w1·x1 = 0. Classification line: w0 + w1·x1 + w2·x2 = 0. Classification plane: w0 + w1·x1 + w2·x2 + w3·x3 = 0. Classification hyperplane: W·X = 0.
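As an illustration (not part of the original slide), a minimal NumPy sketch of such a sign-based perceptron might look as follows; the weight values are arbitrary examples.

```python
import numpy as np

def perceptron(x, w):
    """Single-layer perceptron: O = sign(W·X), mapped to {+1, -1}."""
    x = np.concatenate(([1.0], x))      # prepend a constant 1 so w[0] acts as the offset
    return 1 if np.dot(w, x) > 0 else -1

# Example: 2-D inputs classified against the line w0 + w1*x1 + w2*x2 = 0.
w = np.array([-1.0, 2.0, 1.0])                 # arbitrary example weights
print(perceptron(np.array([1.0, 0.5]), w))     # -> 1
print(perceptron(np.array([-1.0, -0.5]), w))   # -> -1
```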

9 Huawei Confidential
XOR Problem
 In 1969, Marvin Minsky, an American mathematician and AI pioneer, proved that perceptrons
are essentially a type of linear model capable of processing only linear classification.

(Figures: the AND, OR, and XOR problems — AND and OR are linearly separable, but XOR is not.)

10 Huawei Confidential
Solving the XOR Problem

(Figure: a multi-layer perceptron with weights w0–w5 whose hidden layer allows the XOR problem to be solved.)
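A minimal sketch (weights are assumed, not taken from the slide) showing how a two-layer network of threshold units can compute XOR, which a single linear unit cannot:

```python
def step(z):
    """Threshold activation: 1 if z > 0, else 0."""
    return 1 if z > 0 else 0

def xor(x1, x2):
    # Hidden layer: h1 behaves like OR(x1, x2), h2 like AND(x1, x2) (assumed weights).
    h1 = step(1.0 * x1 + 1.0 * x2 - 0.5)
    h2 = step(1.0 * x1 + 1.0 * x2 - 1.5)
    # Output layer: XOR = h1 AND (NOT h2).
    return step(1.0 * h1 - 1.0 * h2 - 0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor(a, b))   # 0 0 -> 0, 0 1 -> 1, 1 0 -> 1, 1 1 -> 0
```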

11 Huawei Confidential
Feedforward Neural Network (FNN)
 An FNN is a typical deep learning model. It has the following features:
 Input nodes perform no computation and are used only to represent the element values of an input vector.

 Each neuron is connected only to neurons at the previous layer, receives the output of the previous layer as its input, and passes its computing result to the next layer. A unidirectional multi-layer structure is used. There is no feedback in the entire network: signals are transmitted unidirectionally from the input layer to the output layer. The network can be represented by a directed acyclic graph (DAG).

(Figure: a feedforward network with an input layer, two fully-connected hidden layers, and an output layer. A node at layer l is connected to the nodes at layer l − 1.)

12 Huawei Confidential
Influence of Hidden Layers on Neural Networks

0 hidden layers 3 hidden layers 20 hidden layers

13 Huawei Confidential
Contents

1. Deep Learning

2. Training Rules

3. Activation Functions

4. Regularization

5. Optimizers

6. Neural Network Types

14 Huawei Confidential
Common Loss Functions in Deep Learning
 When training a deep learning network, we need to quantify the error of the target classification. A loss function (error function) is used, which reflects the error between the target output and the actual output of the perceptron.
 For regression tasks, the commonly used loss function is the quadratic cost function. For a single training sample d, the quadratic cost function can be written as follows:

C_d = (1/2) Σ_(k∈outputs) (t_k − o_k)²

k indicates a neuron at the output layer, outputs indicates the set of all neurons at the output layer, t_k indicates the target output, and o_k indicates the actual output.
 For classification tasks, the commonly used loss function is the cross-entropy cost function:

C = −(1/n) Σ_x [t ln o]

The cross-entropy cost function depicts the distance between two probability distributions, which is a widely used loss function for classification problems.

 Generally, the quadratic cost function is used for regression problems, and the cross-entropy cost function is used for classification
problems.
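A small NumPy sketch of the two cost functions above; the target and output values are assumed for illustration:

```python
import numpy as np

def quadratic_cost(t, o):
    """Quadratic cost for one sample: C = 1/2 * sum_k (t_k - o_k)^2."""
    return 0.5 * np.sum((t - o) ** 2)

def cross_entropy_cost(t, o, eps=1e-12):
    """Cross-entropy cost for one sample with probability outputs o: C = -sum_k t_k * ln(o_k)."""
    return -np.sum(t * np.log(o + eps))

t = np.array([0.0, 1.0, 0.0])    # one-hot target
o = np.array([0.1, 0.7, 0.2])    # actual output (e.g., softmax probabilities)
print(quadratic_cost(t, o))      # 0.07
print(cross_entropy_cost(t, o))  # ~0.357
```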

15 Huawei Confidential
Extremum of a Loss Function
 Purpose: The loss function C(w) can be minimized by iteratively searching for and updating the parameter w in the negative gradient direction.
 Limitation: There is no effective mathematical method for solving the extremum on the complex high-dimensional surface of C = (1/2) Σ_(d∈D) (t_d − o_d)².
 Approach: The core idea of the gradient descent method is as follows: the negative gradient direction is the fastest descent direction of the function. Therefore, the minimum point of C(w) is expected to be found along the −∇C(w) direction.

16 Huawei Confidential
Gradient Descent and Loss Function
 The gradient of a multivariable function C = f(w1, w2, …, wn) at W = [w1, w2, …, wn]^T is as follows:

∇C(w1, w2, …, wn) = [∂C/∂w1, ∂C/∂w2, …, ∂C/∂wn]^T

The direction of the gradient vector ∇C points to the direction in which the function grows fastest. Therefore, the direction of the negative gradient vector −∇C points to the direction in which the function decreases fastest.

 The learning rate (LR) is a coefficient for adjusting weights according to the error gradient, and is usually denoted as η.

w_(t+1) = w_t − η∇C(w_t)

Based on the learning rate and gradient value, updating all parameter values will reduce the network loss.
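A minimal sketch of the update rule w ← w − η∇C(w) on a one-dimensional toy loss (the loss function is an assumed example):

```python
# Gradient descent on the assumed toy loss C(w) = (w - 3)^2, whose gradient is 2*(w - 3).
eta = 0.1    # learning rate
w = 0.0      # initial parameter
for step in range(50):
    grad = 2 * (w - 3)       # dC/dw
    w = w - eta * grad       # move in the negative gradient direction
print(w)                     # approaches the minimum at w = 3
```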

17 Huawei Confidential
Batch Gradient Descent (BGD) Algorithm
 <X, t> indicates a sample in the training set, where X is an input vector and t is the target output. o is the actual output, η is the learning rate, and C is the loss function.
 Initialize each w_i to a random value with a small absolute value.

 Before the termination condition is met, do as follows:

 Initialize each Δw_i to zero.
 For each <X, t> in the training set, do as follows:

− Input X to this unit and compute the output o.

− For each w_i in this unit: Δw_i += −η ∂C(t, o)/∂w_i

 For each w_i in this unit: w_i += Δw_i

 The gradient descent algorithm of this version is not commonly used because it has the following defect:
 The convergence process is very slow because all training samples need to be computed every time the weight is updated.

18 Huawei Confidential
Stochastic Gradient Descent (SGD) Algorithm
 To address the defect of the BGD algorithm, a common variant is developed, which is called incremental gradient descent or
stochastic gradient descent. One implementation is called online learning, which updates the gradient based on each sample:

Δw = −η (1/n) Σ_(d∈D) ∂C(t_d, o_d)/∂w  ⟹  Δw = −η ∂C(t_d, o_d)/∂w (for a single sample d)

 ONLINE-GRADIENT-DESCENT
 Initialize each w_i to a random value with a small absolute value.

 Before the termination condition is met, do as follows:

 Randomly select a sample <X, t> from the training set:

 Input X to this unit and compute the output o.

 For each w_i in this unit: w_i += −η ∂C(t, o)/∂w_i

19 Huawei Confidential
Mini-batch Gradient Descent (MBGD) Algorithm
 To address the defects of the previous two gradient descent algorithms, the MBGD algorithm was proposed and has
been most widely used. It uses a small batch of samples with a fixed batch size to compute Δw and update weights.
 MINI-BATCH-GRADIENT-DESCENT
 Initialize each w_i to a random value with a small absolute value.

 Before the termination condition is met, do as follows:

 Initialize each Δw_i to zero.
 For each <X, t> in the mini-batch of samples obtained from the training set, do as follows:

− Input X to this unit and compute the output o.

− For each w_i in this unit: Δw_i += −η ∂C(t, o)/∂w_i

 For each w_i in this unit: w_i += Δw_i

 If it is the last batch, shuffle the training samples.
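A sketch of mini-batch gradient descent on a linear model with a quadratic cost; the data, model, and batch size are assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))                     # training inputs
true_w = np.array([1.5, -2.0, 0.5])
t = X @ true_w + 0.1 * rng.normal(size=256)       # target outputs

w = np.zeros(3)                                   # initialize weights
eta, batch_size = 0.1, 32

for epoch in range(100):
    perm = rng.permutation(len(X))                # shuffle the samples each epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        o = X[idx] @ w                            # forward pass on the mini-batch
        grad = X[idx].T @ (o - t[idx]) / len(idx) # gradient of 1/2 * mean((o - t)^2)
        w -= eta * grad                           # weight update
print(w)                                          # close to true_w
```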

20 Huawei Confidential
Network Training Process
 Forward propagation: C(w) = (1/2)(t − o)² = (1/2)(t − f(w·x))².

 If a parameter w needs to be updated according to w ← w − η ∂C/∂w, then ∂C/∂w must be computed.

(Figure: forward propagation alternates linear combinations and activations, z = w·x and a = f(z), until the output o is produced and the loss C = (1/2)(t − o)² is evaluated; backward propagation then sends gradients back through the same layers.)
21 Huawei Confidential
Backward Propagation
 Error backward propagation is an important algorithm in neural networks. It uses the chain rule to send
the errors of the output layer back to the network, so that a gradient with respect to the weights of the
neural network can be easily computed. The steps are as follows:
 Backward propagate the loss function values to each compute unit.
 Each compute unit updates the weight according to the obtained errors.
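A compact sketch of one forward/backward pass for a two-layer sigmoid network with the quadratic loss; layer sizes and data are assumed for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1))                 # one input sample (4 features)
t = np.array([[1.0]])                       # target output
W1, b1 = rng.normal(size=(3, 4)), np.zeros((3, 1))
W2, b2 = rng.normal(size=(1, 3)), np.zeros((1, 1))

# Forward propagation.
z1 = W1 @ x + b1;  a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; o = sigmoid(z2)
C = 0.5 * np.sum((t - o) ** 2)

# Backward propagation: apply the chain rule from the output layer back to the input layer.
delta2 = (o - t) * o * (1 - o)              # dC/dz2
dW2, db2 = delta2 @ a1.T, delta2
delta1 = (W2.T @ delta2) * a1 * (1 - a1)    # dC/dz1
dW1, db1 = delta1 @ x.T, delta1

# Each compute unit updates its weights with the obtained gradients.
eta = 0.5
W2 -= eta * dW2; b2 -= eta * db2
W1 -= eta * dW1; b1 -= eta * db1
```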

22 Huawei Confidential
Vanishing and Exploding Gradients (1)
 Vanishing gradient: As network layers increase, the derivative value of backward propagation
decreases and the gradient vanishes.
 Exploding gradient: As network layers increase, the derivative value of backward propagation
increases and the gradient explodes.
 Cause:
Each layer computes y = σ(z), z = w·x + b, where σ is a sigmoid function.

(Chain network: x → b1 → b2 → b3 → y, with weights w1, w2, w3, and w4.)

 Backward propagation can be deduced as follows:

∂C/∂b1 = ∂C/∂y · σ′(z4) · w4 · σ′(z3) · w3 · σ′(z2) · w2 · σ′(z1)
23 Huawei Confidential
Vanishing and Exploding Gradients (2)
 The maximum value of σ′(z) is 1/4.

 The network weight w is usually less than 1 in absolute value. Therefore, |σ′(z)·w| ≤ 1/4. When the chain rule is used, the product of such factors becomes smaller as the number of layers increases, resulting in the vanishing gradient problem.
 When the network weights are large, that is, |σ′(z)·w| > 1, the exploding gradient problem occurs.

 Solutions: The ReLU activation function and LSTM neural network are used to solve the vanishing gradient problem, and gradient
clipping is used to solve the exploding gradient problem.
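A small numerical sketch of how the per-layer factor σ′(z)·w shrinks the gradient as depth grows; the weight value 0.8 is an arbitrary assumption:

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1 - s)                # maximum value is 0.25, reached at z = 0

w = 0.8                               # a typical weight with |w| < 1
factor = sigmoid_grad(0.0) * w        # per-layer factor in the chain rule (0.2 here)
for layers in (1, 5, 10, 20):
    print(layers, factor ** layers)   # shrinks toward 0 as the number of layers grows
```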

24 Huawei Confidential
Contents

1. Deep Learning

2. Training Rules

3. Activation Functions

4. Regularization

5. Optimizers

6. Neural Network Types

25 Huawei Confidential
Concept of Activation Functions
 In a neural network, each neuron receives the outputs of neurons at the previous layer as its inputs and then
transmits the inputs to the next layer. Neurons at the input layer directly transmit input attribute values to the next
layer (a hidden or output layer). In a multi-layer neural network, there is a function between outputs of nodes at the
previous layer and inputs of nodes at the next layer. Such function is referred to as an activation function.

Single-layer Perceptron

26 Huawei Confidential
Purpose of Activation Functions (1)

Multiple perceptrons without


activation functions are equivalent
to linear functions.

27 Huawei Confidential
Purpose of Activation Functions (2)

The activation function is equivalent to adding


nonlinear factors to a neural network, which
transforms the neural network into a nonlinear
model.

28 Huawei Confidential
Sigmoid Function
f(x) = 1 / (1 + e^(−x))

29 Huawei Confidential
Tanh Function


tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))

30 Huawei Confidential
Softsign Function

f(x) = x / (|x| + 1)

31 Huawei Confidential
Rectified Linear Unit (ReLU) Function

f(x) = x, x ≥ 0
f(x) = 0, x < 0

32 Huawei Confidential
LeakyReLU Function

f(x) = x, x ≥ 0
f(x) = αx, x < 0, where α is a small positive constant
33 Huawei Confidential
Softplus Function

f(x) = ln(e^x + 1)

34 Huawei Confidential
Softmax Function
 Softmax function body:
σ(z)_j = e^(z_j) / Σ_(k=1..K) e^(z_k)

 The Softmax function is used to map a K-dimensional vector of arbitrary real values to another K-
dimensional vector of real values, where each vector element is in the interval (0, 1). All the elements
add up to 1.
 The softmax function is often used as the output layer of a multiclass classification task.
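A single NumPy sketch collecting the activation functions above; the LeakyReLU slope of 0.01 is an assumed value:

```python
import numpy as np

def sigmoid(x):    return 1.0 / (1.0 + np.exp(-x))
def tanh(x):       return np.tanh(x)
def softsign(x):   return x / (np.abs(x) + 1.0)
def relu(x):       return np.maximum(0.0, x)
def leaky_relu(x, alpha=0.01):  return np.where(x >= 0, x, alpha * x)
def softplus(x):   return np.log(np.exp(x) + 1.0)

def softmax(z):
    """Map a K-dimensional vector to probabilities in (0, 1) that sum to 1."""
    e = np.exp(z - np.max(z))          # subtract the max for numerical stability
    return e / np.sum(e)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))
print(softmax(x))                      # elements are in (0, 1) and sum to 1
```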

35 Huawei Confidential
Contents

1. Deep Learning

2. Training Rules

3. Activation Functions

4. Regularization

5. Optimizers

6. Neural Network Types

36 Huawei Confidential
Overfitting
 Problem description: The model performs well in the training set, but poorly in the test set.
 Root cause: There are too many feature dimensions, model assumptions, and parameters, too much noise, but too little training data. As a result, the fitting function almost perfectly predicts the training set, whereas its predictions on new test data are poor. The training data is overfitted without considering the generalization capability of the model.

37 Huawei Confidential
38 Huawei Confidential
Regularization
 Regularization is a very important and effective technique for reducing generalization error in machine learning. It is especially useful for deep learning models, which tend to overfit because of their large number of parameters. Researchers have therefore proposed many effective techniques to prevent overfitting, including:
 Adding constraints to parameters, such as L1 and L2 norms
 Expanding the training set, such as adding noise and transforming data
 Dropout
 Early stopping

39 Huawei Confidential
Parameter Penalty
 Many regularization methods limit the learning capacity of the model by adding a parameter penalty Ω(w) to the objective function C. Assume that the regularized objective function is C̃:

C̃(w; X, y) = C(w; X, y) + αΩ(w),

where α ∈ [0, ∞) is a hyperparameter that weighs the relative contribution of the norm penalty term Ω and the standard objective function C(w; X). If α is set to 0, no regularization is performed. A larger value of α indicates a larger regularization penalty.

40 Huawei Confidential
L1 Regularization
 Add an L1 norm constraint to the model parameters, that is:

C̃(w; X, y) = C(w; X, y) + α‖w‖₁,

 If a gradient method is used to solve this, the parameter gradient is:
∇C̃ = α·sign(w) + ∇C.
 The parameter update rule is:
w ← w − εα·sign(w) − ε∇C(w)

41 Huawei Confidential
L2 Regularization
 The L2 norm penalty term is added to the parameter constraint. This technique is used to prevent overfitting.

C̃(w; X, y) = C(w; X, y) + (α/2)‖w‖₂²,

 A parameter update rule can be derived using an optimization technique (such as a gradient method):

w ← (1 − εα)w − ε∇C(w),

where ε is the learning rate. Compared with the normal gradient update formula, the parameter is multiplied by a reduction factor.
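A sketch of both update rules on an assumed toy loss, showing the typical effect of each penalty (the loss, α, and learning rate are illustrative):

```python
import numpy as np

def grad_C(w):
    """Gradient of the assumed toy loss C(w) = 1/2 * ||w - [1, -2, 0]||^2."""
    return w - np.array([1.0, -2.0, 0.0])

eta, alpha = 0.1, 0.05
w_l1 = np.array([3.0, 3.0, 3.0])
w_l2 = np.array([3.0, 3.0, 3.0])

for _ in range(200):
    # L1: w <- w - eta*alpha*sign(w) - eta*grad_C(w)
    w_l1 = w_l1 - eta * alpha * np.sign(w_l1) - eta * grad_C(w_l1)
    # L2 (reduction factor): w <- (1 - eta*alpha)*w - eta*grad_C(w)
    w_l2 = (1 - eta * alpha) * w_l2 - eta * grad_C(w_l2)

print(w_l1)   # the coordinate whose optimum is 0 is driven to (near) exactly 0 -> sparsity
print(w_l2)   # every coordinate is shrunk toward 0, but rarely exactly 0
```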

42 Huawei Confidential
L1 vs. L2
 The differences between L1 and L2 are as follows:
 According to the preceding analysis, L1 can generate a sparser model than L2. When the value of a parameter is small, L1 regularization can directly reduce the parameter value to 0, which can be used for feature selection.
 From the perspective of probability, norm constraints are equivalent to adding a prior probability distribution to the parameters. In L2 regularization, the parameter values comply with a Gaussian distribution; in L1 regularization, the parameter values comply with a Laplace distribution.

43 Huawei Confidential
Data Augmentation
 The most effective way to prevent overfitting is to enlarge the training set: a larger training set has a smaller overfitting probability. Data augmentation is a time-saving and effective way to do so, but it is not applicable in every field.
 A common method in the object recognition field is to rotate or scale images. (The prerequisite for image transformation is that the class of the image must not be changed by the transformation. For example, in handwritten digit recognition, the digits 6 and 9 can easily be confused with each other after rotation.)
 Random noise is added to the input data in speech recognition.
 A common practice in natural language processing (NLP) is replacing words with their synonyms.
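A minimal sketch of two of the augmentations mentioned above, image transformation and adding random noise; the array shapes are assumed:

```python
import numpy as np

rng = np.random.default_rng(0)

image = rng.random((28, 28))               # an assumed grayscale image
rotated = np.rot90(image)                  # rotate 90 degrees (only if the class survives rotation)
flipped = np.fliplr(image)                 # horizontal flip

audio = rng.normal(size=16000)             # an assumed 1-second audio clip
noisy = audio + 0.01 * rng.normal(size=audio.shape)   # add small random noise
```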

44 Huawei Confidential
Early Stopping of Training
 Evaluation on the validation set can be inserted into the training process. When the loss on the validation set stops decreasing and starts to increase, training is stopped in advance.
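A runnable sketch of the early-stopping rule on an assumed toy regression task; the patience value and data are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy task: fit y = w * x on assumed training and validation splits.
x_tr, x_va = rng.normal(size=64), rng.normal(size=64)
y_tr, y_va = 2.0 * x_tr + 0.1 * rng.normal(size=64), 2.0 * x_va + 0.1 * rng.normal(size=64)

w, eta = 0.0, 0.01
best_loss, best_w, patience, bad_epochs = float("inf"), w, 5, 0

for epoch in range(1000):
    w -= eta * np.mean((w * x_tr - y_tr) * x_tr)   # one training step
    val_loss = np.mean((w * x_va - y_va) ** 2)     # loss on the validation set
    if val_loss < best_loss - 1e-6:
        best_loss, best_w, bad_epochs = val_loss, w, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                 # validation loss stopped improving
            break                                  # stop training in advance
print(epoch, best_w)
```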

45 Huawei Confidential
Dropout
 Dropout is a common and simple regularization method, which has been widely used since 2014. Simply put, dropout randomly discards some inputs during the training process, and the parameters corresponding to the discarded inputs are not updated. Dropout can be viewed as an ensemble method: it obtains sub-networks by randomly dropping inputs and combines the results of all these sub-networks. For example, see the sketch below:
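A sketch of (inverted) dropout applied to the activations of one layer during training; the drop probability of 0.5 is an assumed value:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(a, p=0.5, training=True):
    """Inverted dropout: randomly zero activations with probability p during training."""
    if not training:
        return a                                   # nothing is dropped or rescaled at inference
    mask = (rng.random(a.shape) >= p).astype(a.dtype)
    return a * mask / (1.0 - p)                    # rescale so the expected activation is unchanged

a = np.ones((2, 8))
print(dropout(a))       # roughly half of the entries become 0, the rest become 2.0
```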

46 Huawei Confidential
Data Imbalance (1)
 Problem description: In the dataset consisting of various task classes, the number of samples varies
greatly from one class to another. One or more classes in the predicted classes contain very few samples.
 For example, in an image recognition experiment with 4,251 training images, more than 2,000 classes contain just one image each. Some other classes contain 2 to 5 images.
 Impact:
 Due to the unbalanced number of samples, the model/algorithm never examines the under-represented classes adequately, so the optimal result cannot be obtained.
 It also creates a problem for acquiring validation and test samples, because it is difficult for these sets to be representative of classes that are rarely observed.

47 Huawei Confidential
Data Imbalance (2)
 Random undersampling
 Deleting redundant samples from the over-represented classes

 Random oversampling
 Copying samples from the under-represented classes

 Synthetic sampling
 Obtaining existing samples from the under-represented classes
 Synthesizing new samples from them

48 Huawei Confidential
49 Huawei Confidential
Contents

1. Deep Learning

2. Training Rules

3. Activation Functions

4. Regularization

5. Optimizers

6. Neural Network Types

50 Huawei Confidential
Optimizers
 There are various improved versions of gradient descent algorithms. In object-oriented
language implementation, different gradient descent algorithms are often encapsulated into
objects called optimizers.
 The purposes of the algorithm optimization include but are not limited to:
 Accelerating algorithm convergence.
 Preventing or jumping out of local extreme values.
 Simplifying manual parameter setting, especially the learning rate (LR).

 Common optimizers: common GD optimizer, momentum optimizer, Nesterov, AdaGrad,


AdaDelta, RMSProp, Adam, AdaMax, and Nadam.

51 Huawei Confidential
Momentum Optimizer
 The most basic improvement is to add a momentum term to Δw. Assume that the weight correction of the n-th iteration is Δw(n). The weight correction rule is:

Δw(n) = −η∇C(w) + αΔw(n − 1)

where α is a constant (0 ≤ α < 1) called the momentum coefficient, and αΔw(n − 1) is the momentum term.
 Imagine a small ball rolling down from a random point on the error surface. The introduction of the momentum term is equivalent to giving the small ball inertia as it moves along −∇C(w).
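A sketch of the momentum update rule on an assumed toy loss; the learning rate and momentum coefficient are illustrative:

```python
def grad_C(w):
    return 2 * (w - 3.0)              # gradient of the assumed toy loss C(w) = (w - 3)^2

eta, alpha = 0.05, 0.9                # learning rate and momentum coefficient
w, delta_w = 0.0, 0.0
for n in range(100):
    delta_w = -eta * grad_C(w) + alpha * delta_w   # Δw(n) = -η∇C(w) + αΔw(n-1)
    w = w + delta_w
print(w)                              # approaches the minimum at w = 3
```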

52 Huawei Confidential
Advantages and Disadvantages of Momentum Optimizer
 Advantages:
 Enhances the stability of the gradient correction direction and reduces mutations.
 In areas where the gradient direction is stable, the ball rolls faster and faster (there is an upper speed limit because α < 1), which helps it quickly cross flat regions and accelerates convergence.
 A small ball with inertia is more likely to roll over some narrow local extrema.

 Disadvantages:
 The learning rate and momentum need to be manually set, which often requires more
experiments to determine the appropriate value.

53 Huawei Confidential
AdaGrad Optimizer (1)
 The common feature of the stochastic gradient descent (SGD) algorithm, the mini-batch gradient descent (MBGD) algorithm, and the momentum optimizer is that each parameter is updated with the same LR.
 According to the approach of AdaGrad, different learning rates need to be set for different parameters.

g_t = ∂C(t, o)/∂w_t              (compute the gradient)
r_t = r_(t−1) + g_t²             (accumulate the squared gradient)
Δw_t = −(ε / (δ + √r_t)) · g_t   (compute the update)
w_(t+1) = w_t + Δw_t             (apply the update)

 g_t indicates the t-th gradient, r is a gradient accumulation variable whose initial value is 0 and which increases continuously, ε indicates the global LR, which needs to be set manually, and δ is a small constant (about 10⁻⁷) added for numerical stability.
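A sketch of the AdaGrad update on an assumed toy loss; ε and δ are illustrative values:

```python
import numpy as np

def grad_C(w):
    return 2 * (w - 3.0)              # gradient of the assumed toy loss

epsilon, delta = 0.5, 1e-7            # global learning rate and stability constant
w, r = 0.0, 0.0
for t in range(200):
    g = grad_C(w)
    r = r + g ** 2                                 # accumulate squared gradients
    w = w + (-epsilon / (delta + np.sqrt(r))) * g  # effective LR shrinks as r grows
print(w)                              # moves toward the minimum at w = 3, ever more slowly
```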
54 Huawei Confidential
AdaGrad Optimizer (2)
 It can be seen from the AdaGrad optimization algorithm that, as the algorithm iterates, the accumulation variable r becomes larger and the overall learning rate becomes smaller. This is because we hope the LR decays as the number of updates increases: in the initial learning phase, we are far away from the optimal solution to the loss function; as the number of updates increases, we get closer and closer to the optimal solution, and therefore the LR can decrease.
 Advantages:
 The learning rate is automatically updated. As the number of updates increases, the learning rate decreases.

 Disadvantages:
 The denominator keeps accumulating. As a result, the learning rate will eventually become very small and the algorithm will
become ineffective.

55 Huawei Confidential
RMSProp Optimizer
 The RMSProp optimizer is an improved AdaGrad optimizer. It introduces a decay coefficient so that r attenuates at a certain ratio in each round.
 The RMSProp optimizer solves the problem of the AdaGrad optimizer ending the optimization process too early. It is suitable for handling non-stationary objectives and works well for RNNs.

g_t = ∂C(t, o)/∂w_t               (compute the gradient)
r_t = ρ·r_(t−1) + (1 − ρ)·g_t²    (accumulate the squared gradient with decay)
Δw_t = −(ε / (δ + √r_t)) · g_t    (compute the update)
w_(t+1) = w_t + Δw_t              (apply the update)

 g_t indicates the t-th gradient, r is the gradient accumulation variable (initial value 0), ρ indicates the decay factor, ε indicates the global LR, which needs to be set manually, and δ is a small constant (about 10⁻⁷) added for numerical stability.
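A sketch of the RMSProp update on the same kind of assumed toy loss; ρ, ε, and δ are illustrative values:

```python
import numpy as np

def grad_C(w):
    return 2 * (w - 3.0)               # gradient of the assumed toy loss

epsilon, rho, delta = 0.1, 0.9, 1e-7   # global LR, decay factor, stability constant
w, r = 0.0, 0.0
for t in range(500):
    g = grad_C(w)
    r = rho * r + (1 - rho) * g ** 2                # decaying accumulation of squared gradients
    w = w + (-epsilon / (delta + np.sqrt(r))) * g
print(w)                               # settles near the minimum at w = 3; the LR does not decay to zero
```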
56 Huawei Confidential
Adam Optimizer (1)
 Adaptive moment estimation (Adam): Developed based on AdaGrad and AdaDelta, Adam maintains two additional variables, m_t and v_t, for each parameter to be trained:

m_t = β₁·m_(t−1) + (1 − β₁)·g_t
v_t = β₂·v_(t−1) + (1 − β₂)·g_t²

where t represents the t-th iteration and g_t is the computed gradient. m_t and v_t are moving averages of the gradient and the squared gradient. From the statistical perspective, m_t and v_t are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients, respectively, hence the name of the method.

57 Huawei Confidential
Adam Optimizer (2)
 If m_t and v_t are initialized as zero vectors, they are biased toward 0 during the initial iterations, especially when β₁ and β₂ are close to 1. To solve this problem, we use the bias-corrected m̂_t and v̂_t:

m̂_t = m_t / (1 − β₁^t)

v̂_t = v_t / (1 − β₂^t)
 The weight update rule of Adam is as follows:

w_(t+1) = w_t − η·m̂_t / (√v̂_t + ε)

 Although the rule involves manually setting β₁, β₂, ε, and η, the setting is much simpler than before. According to experiments, the default settings are β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸, and η = 0.001. In practice, Adam converges quickly. When convergence saturation is reached, the learning rate can be reduced; after several reductions, a satisfying local extremum will be obtained. Other parameters do not need to be adjusted.
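A sketch of the Adam update with the default settings quoted above, applied to an assumed toy loss:

```python
import numpy as np

def grad_C(w):
    return 2 * (w - 3.0)                         # gradient of the assumed toy loss

eta, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 5001):
    g = grad_C(w)
    m = beta1 * m + (1 - beta1) * g              # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * g ** 2         # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
print(w)                                         # close to the minimum at w = 3
```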
58 Huawei Confidential
Contents

1. Deep Learning

2. Training Rules

3. Activation Functions

4. Regularization

5. Optimizers

6. Neural Network Types

59 Huawei Confidential
Convolutional Neural Network (CNN)
 A CNN is a feedforward neural network. Its artificial neurons respond to units within their local coverage range (receptive field). CNNs excel at image processing. A CNN includes convolutional layers, pooling layers, and fully-connected layers.
 In the 1960s, Hubel and Wiesel studied the cortex neurons that cats use for local sensitivity and direction selection and found that their unique network structure could effectively simplify feedback neural networks. This finding inspired the CNN.
 Now, CNN has become one of the research hotspots in many scientific fields, especially in the pattern
classification field. The network is widely used because it can avoid complex pre-processing of images
and directly input original images.

60 Huawei Confidential
Main Concepts of CNN
 Local receptive field: It is generally considered that human perception of the outside world is from local
to global. Spatial correlations among local pixels of an image are closer than those among pixels that
are far away. Therefore, each neuron does not need to know the global image. It only needs to know the
local image and the local information is then combined at a higher level to generate global information.
 Parameter sharing: One or more convolution kernels may be used to scan input images. The parameters carried by the convolution kernels are weights. In a layer scanned by convolution kernels, each kernel uses the same parameters during weighted computation. Weight sharing means that when each convolution kernel scans the entire image, the parameters of the kernel are fixed.

61 Huawei Confidential
CNN Architecture
(Figure: CNN architecture — input image → convolutional layer (3 feature maps) → pooling layer (3 feature maps) → convolutional layer (5 feature maps) → pooling layer (5 feature maps) → fully-connected layer → output layer. Convolution + nonlinearity and max pooling extract features; vectorization and the fully-connected layer map them to class probabilities such as P_bird, P_fish, P_dog, and P_cat.)
62 Huawei Confidential
Computing a Convolution Kernel (1)
 Convolution computation
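A sketch of a single-channel 2-D convolution computation (implemented as cross-correlation, as in most deep learning frameworks); the input and the 3×3 kernel are assumed values:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid (no padding, stride 1) 2-D convolution of a single channel."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)   # weighted sum over the window
    return out

image = np.arange(25, dtype=float).reshape(5, 5)    # assumed 5x5 input
kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])                  # assumed 3x3 kernel
print(conv2d(image, kernel))                        # 3x3 output feature map
```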

63 Huawei Confidential
Computing a Convolution Kernel (2)
 Convolution computation result

hanbingtao, 2017, CNN

64 Huawei Confidential
Convolutional Layer
 The basic architecture of a CNN is a multi-channel convolution consisting of multiple single convolutions. The output of the previous layer (or the original image for the first layer) is used as the input of the current layer. It is then convolved with the convolution kernels of the layer and serves as the output of this layer. The convolution kernel of each layer contains the weights to be learned. Similar to a fully-connected layer, after the convolution is complete, a bias is added and the result is passed through an activation function before being input to the next layer.

(Figure: a multi-channel convolutional layer — the input tensor is convolved with kernels W1…Wn, biases b1…bn are added, and an activation produces the output feature maps F1…Fn of the output tensor.)
65 Huawei Confidential
Pooling Layer
 Pooling combines nearby units to reduce the size of the input to the next layer, thereby reducing dimensionality. Common pooling methods include max pooling and average pooling. When max pooling is used, the maximum value in a small square area is selected as the representative of this area, whereas the mean value is selected when average pooling is used. The side length of this small area is the pooling window size. The following shows a max pooling operation whose pooling window size is 2.

(Figure: a 2×2 max pooling window sliding across the input feature map.)
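A sketch of max pooling with a 2×2 window and stride 2; the input values are assumed:

```python
import numpy as np

def max_pool(x, size=2):
    """Non-overlapping max pooling with a size x size window."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

x = np.array([[1., 3., 2., 4.],
              [5., 6., 7., 8.],
              [3., 2., 1., 0.],
              [1., 2., 3., 4.]])
print(max_pool(x))    # [[6. 8.]
                      #  [3. 4.]]
```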

66 Huawei Confidential
Fully-connected Layer
 The fully-connected layer is essentially a classifier. The features extracted by the convolutional and pooling layers are flattened and fed to the fully-connected layer, which outputs the classification results.
 Generally, the softmax function is used as the activation function of the final fully-connected output layer to combine all local features into global features and compute the score of each class.

σ(z)_j = e^(z_j) / Σ_(k=1..K) e^(z_k)

67 Huawei Confidential
Recurrent Neural Network (RNN)
 An RNN is a neural network that captures dynamic information in sequential data through recurrent connections between hidden-layer nodes. It can classify sequential data.
 Unlike FNNs, an RNN can keep a context state and even store, learn, and express related information in context windows of any length. Different from traditional neural networks, it is not limited to spatial boundaries but also supports time sequences. In other words, there is an edge between the hidden layer at the current moment and the hidden layer at the next moment.
 The RNN is widely used in scenarios related to sequences, such as videos consisting of image
frames, audio consisting of clips, and sentences consisting of words.

68 Huawei Confidential
RNN Architecture (1)
 x_t is the input of the input sequence at time t.
 s_t is the memory cell of the sequence at time t and caches previous information:

s_t = f(U·x_t + W·s_(t−1)).

 s_t passes through multiple hidden layers, and then through the fully-connected layer V to obtain the final output o_t at time t:

o_t = g(V·s_t).
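A sketch of one RNN time step built from the two formulas above; dimensions and weights are assumed, f is taken to be tanh and g to be softmax:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 4, 8, 3
U = rng.normal(size=(hidden_dim, input_dim))    # input-to-hidden weights
W = rng.normal(size=(hidden_dim, hidden_dim))   # hidden-to-hidden (recurrent) weights
V = rng.normal(size=(output_dim, hidden_dim))   # hidden-to-output weights

def rnn_step(x_t, s_prev):
    s_t = np.tanh(U @ x_t + W @ s_prev)         # s_t = f(U·x_t + W·s_(t-1))
    o_t = softmax(V @ s_t)                      # o_t = g(V·s_t)
    return s_t, o_t

s = np.zeros(hidden_dim)
for x in rng.normal(size=(5, input_dim)):       # a sequence of 5 inputs
    s, o = rnn_step(x, s)
print(o)
```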

69 Huawei Confidential
RNN Architecture (2)

70 Huawei Confidential
RNN Types

One to one One to many Many to one Many to many Many to many

71 Huawei Confidential
Backward Propagation Through Time (BPTT)
 BPTT:
 It is an extension of traditional backward propagation based on time sequences.

 There are two error terms for a memory cell at time step t: 1. the partial derivative of the error at time step t with respect to the memory cell; 2. the partial derivative of the error at the next time step t+1 with respect to the memory cell at time step t. The two errors need to be added.

 The longer the sequence, the more likely it is that the gradient of the loss at the last time step with respect to the weight w at the first time step will vanish or explode.

 The total gradient of weight w is the accumulation of the gradient of the weights in all time sequences.

 Three steps of BPTT:


 Compute the output of each neuron (forward).

 Compute the error of each neuron (backward).

 Compute the gradient of each weight.

 Updating weights using the SGD algorithm

72 Huawei Confidential
RNN Problems
 = + is extended based on time sequences.

 =σ + + + …

 Despite that the standard RNN structure solves the problem of information memory, the information attenuates in
the case of long-term memory.
 Information needs to be saved for a long time in many tasks. For example, a hint at the beginning of a speculative
fiction may not be answered until the end.
 RNN may not be able to save information for long due to the limited memory cell capacity.
 We expect that memory cells can remember key information.

73 Huawei Confidential
Long Short-Term Memory (LSTM) Network

74 Huawei Confidential
Forget Gate in LSTM
 The first step in LSTM is to decide what information to discard from the cell state.

 The decision is made through the forget gate. This gate reads h_(t−1) and x_t and outputs a value between 0 and 1 for each element in the cell state C_(t−1). The value 1 indicates that the information is completely retained, while the value 0 indicates that the information is completely discarded.

f_t = σ(W_f·[h_(t−1), x_t] + b_f)

75 Huawei Confidential
Input Gate in LSTM
 This step is to determine what information is stored in the cell state.
 It consists of two parts:
 The sigmoid layer is called the input gate layer, which determines the value to be updated.
 A candidate value vector is created at the tanh layer and is added to the state.

i_t = σ(W_i·[h_(t−1), x_t] + b_i)

C̃_t = tanh(W_C·[h_(t−1), x_t] + b_C)

76 Huawei Confidential
Update in LSTM
 In this step, the old cell state is updated: C_(t−1) is updated to C_t.
 We multiply the old state by f_t, discarding the information we decided to discard, and then add i_t * C̃_t, the new candidate values scaled by how much we decided to update each state value:

C_t = f_t * C_(t−1) + i_t * C̃_t

77 Huawei Confidential
Output Gate in LSTM
 We run a sigmoid layer to determine which part of the cell state will be output.
 Then we process the cell state through tanh (a value between –1 and 1 is obtained) and multiply the value by the
output of the sigmoid gate. In the end, we output only the part we determine to output.

o_t = σ(W_o·[h_(t−1), x_t] + b_o)

h_t = o_t * tanh(C_t)
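A sketch combining the four gate equations above into a single LSTM step; the dimensions and random weights are assumed for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x_dim, h_dim = 4, 6
concat = x_dim + h_dim
Wf, bf = rng.normal(size=(h_dim, concat)), np.zeros(h_dim)   # forget gate
Wi, bi = rng.normal(size=(h_dim, concat)), np.zeros(h_dim)   # input gate
Wc, bc = rng.normal(size=(h_dim, concat)), np.zeros(h_dim)   # candidate values
Wo, bo = rng.normal(size=(h_dim, concat)), np.zeros(h_dim)   # output gate

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])           # [h_(t-1), x_t]
    f_t = sigmoid(Wf @ z + bf)                  # forget gate
    i_t = sigmoid(Wi @ z + bi)                  # input gate
    c_tilde = np.tanh(Wc @ z + bc)              # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde          # update the cell state
    o_t = sigmoid(Wo @ z + bo)                  # output gate
    h_t = o_t * np.tanh(c_t)                    # new hidden state
    return h_t, c_t

h, c = np.zeros(h_dim), np.zeros(h_dim)
for x in rng.normal(size=(3, x_dim)):           # a short input sequence
    h, c = lstm_step(x, h, c)
print(h)
```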

78 Huawei Confidential
Quiz

1. (Single-answer question) Which of the following is not a deep learning neural network? ( )

A. CNN

B. RNN

C. LSTM

D. Logistic

79 Huawei Confidential
Quiz

2. (Multiple-answer question) Which of the following are CNN "components"? ( )

A. Activation function

B. Convolutional kernel

C. Pooling

D. Fully-connected layer

80 Huawei Confidential
Quiz

3. (True or false) Compared with RNN, CNN is more suitable for image recognition. ( )

A. True

B. False

81 Huawei Confidential
Summary

 This chapter mainly introduces the definition and development of neural networks,
perceptrons and their training rules, and common types of neural networks.

82 Huawei Confidential
Recommendations

 Huawei Talent
 https://e.huawei.com/en/talent/#/home

 Huawei Support Knowledge Base


 https://support.huawei.com/enterprise/en/knowledge?lang=en

83 Huawei Confidential
Thank you. Bring digital to every person, home, and
organization for a fully connected,
intelligent world.

Copyright©2023 Huawei Technologies Co., Ltd.


All Rights Reserved.

The information in this document may contain predictive


statements including, without limitation, statements regarding
the future financial and operating results, future product
portfolio, new technology, etc. There are a number of factors that
could cause actual results and developments to differ materially
from those expressed or implied in the predictive statements.
Therefore, such information is provided for reference purpose
only and constitutes neither an offer nor an acceptance. Huawei
may change the information at any time without notice.
