
CHAPTER 6

Deep Learning Knowledge

Deep Learning (BSD4543)


DR. KU MUHAMMAD NA’IM KU KHALIF
Content
Chapter 6.1: Improvement of Deep Learning
Chapter 6.2: Example: ReLU and Dropout
Chapter 6.1: Improvement of Deep Learning
By the end of this topic, you should be able to:
▪ improve a deep learning model by adjusting its parameter settings.
Deep Learning
▪ Briefly, deep learning is a machine learning technique that employs a deep neural network. As you know, a deep neural network is a multi-layer neural network that contains two or more hidden layers.
▪ Although this may be disappointingly simple, it is the true essence of deep learning. Figure 6.1 illustrates the concept of deep learning and its relationship to machine learning.

Figure 6.1: The concept of deep learning and its relationship to machine learning.
▪ In Figure 6.1, the deep neural network takes the place of the final product of machine learning.
▪ The learning rule becomes the algorithm that generates the model (the deep neural network) from the training data.
Improvement of the Deep Neural Network

▪ Despite its outstanding achievements, deep learning is not built on any single critical new technology. The innovation of deep learning is the result of many small technical improvements. This section briefly introduces why the deep neural network originally yielded poor performance and how deep learning overcame this problem.
▪ The reason that the neural network with deeper layers yielded
poorer performance was that the network was not properly trained.
The backpropagation algorithm experiences the following three
primary difficulties in the training process of the deep neural
network:
1. Vanishing Gradient
2. Overfitting
3. Computational Load
Vanishing Gradient
▪ The gradient in this context can be thought of as a concept similar to the delta of the back-propagation algorithm. The vanishing gradient occurs in the training process with the back-propagation algorithm when the output error fails to reach the nodes farther from the output layer.
▪ The back-propagation algorithm trains the neural network as it propagates the output error backward to the hidden layers. However, as the error hardly reaches the first hidden layer, its weights cannot be adjusted. Therefore, the hidden layers that are close to the input layer are not properly trained. There is no point in adding hidden layers if they cannot be trained (see Figure 6.2). A short numerical sketch after the figure illustrates how the error shrinks layer by layer.
Figure 6.2: The vanishing gradient.
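▪ The following Python sketch is not from the slides; it is a hypothetical illustration of the effect. The sigmoid derivative is at most 0.25, so multiplying by it at every layer (as the chain rule requires) drives the propagated error toward zero; the weight factors that also appear in the chain rule are omitted for simplicity.

import numpy as np

# Hypothetical illustration (not from the slides): the back-propagated error is
# scaled by the activation derivative at every layer. The sigmoid derivative is
# at most 0.25, so the error shrinks roughly geometrically with depth.
def sigmoid_derivative(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

rng = np.random.default_rng(0)
error = 1.0                                   # error delta at the output layer
for layer, v in enumerate(rng.standard_normal(10), start=1):
    error *= sigmoid_derivative(v)            # one sigmoid node per layer
    print(f"layer {layer}: propagated error ~ {error:.2e}")
# After ten layers the propagated error is tiny, so the layers near the input barely learn.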

▪ The representative solution to the vanishing gradient is the use of the Rectified Linear Unit (ReLU) function as the activation function. It is known to transmit the error better than the sigmoid function. The ReLU function is defined as follows:

    ReLU(x) = x if x > 0, and 0 otherwise (equivalently, ReLU(x) = max(0, x))

▪ Figure 6.3 depicts the ReLU function. It produces zero for negative inputs and conveys the input unchanged for positive inputs. Its implementation is extremely easy as well.

Figure 6.3: The ReLU function.

▪ The sigmoid function limits the node's outputs to unity regardless of the input's magnitude. In contrast, the ReLU function does not exert such a limit.
▪ Isn't it interesting that such a simple change resulted in a drastic improvement of the learning performance of the deep neural network? Another element that we need for the back-propagation algorithm is the derivative of the ReLU function.
▪ By the definition of the ReLU function, its derivative is given as:

    ReLU'(x) = 1 if x > 0, and 0 otherwise
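▪ As a minimal Python sketch (not the course's code; the function names are illustrative), the ReLU function and its derivative defined above can be written as:

import numpy as np

def relu(x):
    # Passes positive inputs through unchanged and outputs zero otherwise.
    return np.maximum(0, x)

def relu_derivative(x):
    # Derivative of ReLU: 1 for positive inputs, 0 for non-positive inputs.
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))              # approximately [0. 0. 0. 0.5 2.]
print(relu_derivative(x))   # [0. 0. 0. 1. 1.]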

▪ In addition, cross entropy-driven learning rules may improve the performance. Furthermore, advanced gradient descent methods, which are numerical methods that better reach the optimum value, are also beneficial for the training of the deep neural network. A simple sketch of one such method follows.
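▪ The slides do not name a specific method, so the sketch below uses gradient descent with momentum as an illustrative assumption; it is one common improvement over the plain weight update.

import numpy as np

# Illustrative assumption: momentum as an example of "advanced" gradient descent.
# Plain gradient descent updates w by -alpha * gradient; momentum keeps a running
# velocity so updates build up along directions that stay consistent over time.
def momentum_update(w, gradient, velocity, alpha=0.01, beta=0.9):
    velocity = beta * velocity - alpha * gradient   # accumulate past update directions
    return w + velocity, velocity                   # move by the velocity, not the raw gradient

w = np.zeros((2, 2))
velocity = np.zeros_like(w)
gradient = np.array([[0.5, -0.2], [0.1, 0.3]])
for _ in range(3):                                  # repeated steps build momentum
    w, velocity = momentum_update(w, gradient, velocity)
print(w)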
Overfitting
▪ The reason that the deep neural network is especially vulnerable to overfitting is that the model becomes more complicated as it includes more hidden layers, and hence more weights. As addressed in Chapter 1, a complicated model is more vulnerable to overfitting. Here is the dilemma: deepening the layers for higher performance drives the neural network to face a core challenge of machine learning, overfitting.
▪ The most representative solution is dropout, which trains only some of the randomly selected nodes rather than the entire network. It is very effective, while its implementation is not very complex. Figure 6.4 explains the concept of dropout: some nodes are randomly selected at a certain percentage and their outputs are set to zero to deactivate them.
Figure 6.4: Dropout is where some nodes are randomly selected and their outputs are set to zero to deactivate the nodes.
▪ Dropout effectively prevents overfitting as it continuously alters the nodes and weights in the training process. Adequate dropout percentages are approximately 50% for hidden layers and 25% for the input layer.
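▪ As a minimal Python sketch (names are illustrative, not the course's code), a dropout mask can be generated and applied to a layer's output vector as follows. The surviving outputs are rescaled by 1/(1 - ratio), a common convention that keeps the expected output level unchanged; the slides only require the dropped outputs to be set to zero.

import numpy as np

def dropout_mask(layer_size, ratio, rng=np.random.default_rng()):
    # Returns a mask that zeroes roughly `ratio` of the nodes and scales the
    # survivors by 1/(1 - ratio) so the expected layer output is preserved.
    keep = (rng.random(layer_size) >= ratio).astype(float)
    return keep / (1.0 - ratio)

# Example: drop about 50% of a hidden layer's outputs during training
y_hidden = np.array([0.2, 0.9, 0.4, 0.7, 0.1, 0.6])
print(y_hidden * dropout_mask(y_hidden.size, ratio=0.5))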
▪ Another prevailing method used to prevent overfitting is adding regularization terms, which penalize the magnitude of the weights, to the cost function. This method works as it keeps the neural network's effective architecture as simple as possible, and hence reduces the possible onset of overfitting. Furthermore, the use of massive training data is also very helpful as the potential bias due to particular data is reduced.
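▪ A minimal sketch of the regularization idea, assuming the common L2 (sum of squared weights) form of the regularization term; the slides do not specify which form is used.

import numpy as np

def regularized_cost(cost, weight_matrices, lam=0.01):
    # Adds an L2 penalty on the weights to the original cost (assumed form;
    # other penalties, such as L1, are also used in practice).
    penalty = sum(np.sum(W ** 2) for W in weight_matrices)
    return cost + lam * penalty / 2.0

# Example: the penalty grows with the magnitude of the weights
W1 = np.array([[0.5, -1.0], [2.0, 0.1]])
W2 = np.array([[0.3], [-0.7]])
print(regularized_cost(cost=0.25, weight_matrices=[W1, W2]))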
Computational Load
▪ The last challenge is the time required to complete the training. The
number of weights increases geometrically with the number of
hidden layers, thus requiring more training data. This ultimately
requires more calculations to be made. The more computations the
neural network performs, the longer the training takes.
▪ This problem is a serious concern in the practical development of the neural network. If a deep neural network requires a month to train, it can only be modified about twelve times a year. A useful research study is hardly possible in this situation. This trouble has been relieved to a considerable extent by the introduction of high-performance hardware, such as GPUs, and algorithms, such as batch normalization.
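▪ The slides mention batch normalization only by name; the following Python sketch shows its forward step as an assumed illustration, not the course's code. Each mini-batch of activations is normalized per node and then rescaled with learnable parameters.

import numpy as np

def batch_norm_forward(x, gamma=1.0, beta=0.0, eps=1e-5):
    # x: mini-batch of activations for one layer, shape (batch_size, num_nodes).
    # gamma, beta: learnable scale and shift parameters.
    mean = x.mean(axis=0)                    # per-node mean over the mini-batch
    var = x.var(axis=0)                      # per-node variance over the mini-batch
    x_hat = (x - mean) / np.sqrt(var + eps)  # normalize to zero mean, unit variance
    return gamma * x_hat + beta              # rescale and shift

# Example: a mini-batch of 4 samples for a layer with 3 nodes
x = np.random.default_rng(0).standard_normal((4, 3)) * 10 + 5
print(batch_norm_forward(x).mean(axis=0))    # approximately zero per node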
▪ The minor improvements that this section introduced are the drivers that have made deep learning the star of machine learning.
▪ The three primary research areas of machine learning are usually said to be image recognition, speech recognition, and natural language processing.
▪ Each of these areas had been studied separately with its own specifically suited techniques. However, deep learning currently outperforms the previous techniques in all three areas.
Chapter 6.2: Example: ReLU and Dropout
By the end of this topic, you should be able to:
▪ implement the ReLU function
▪ implement dropout
ReLU Function
▪ This section introduces the ReLU function via an example. The function drelu trains the given deep neural network using the back-propagation algorithm. It takes the weights of the network and the training data and returns the trained weights:

    [W1, W2, W3, W4] = drelu(W1, W2, W3, W4, X, D)

▪ where W1, W2, W3, and W4 are the weight matrices of the input-hidden1, hidden1-hidden2, hidden2-hidden3, and hidden3-output layers, respectively. X and D are the input and correct output matrices of the training data.
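▪ The slides describe drelu only by its interface. The following Python/NumPy sketch shows what a function of that shape could look like; the ReLU hidden layers, sigmoid output, delta rule, and per-sample updates are assumed details, not the course's actual code.

import numpy as np

def relu(x):
    return np.maximum(0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def drelu_sketch(W1, W2, W3, W4, X, D, alpha=0.01):
    # Sketch of one training epoch for a network with three ReLU hidden layers.
    for x, d in zip(X, D):
        x = np.atleast_1d(x).reshape(-1, 1)          # column vector input
        d = np.atleast_1d(d).reshape(-1, 1)          # column vector target
        # Forward pass
        v1 = W1 @ x;  y1 = relu(v1)
        v2 = W2 @ y1; y2 = relu(v2)
        v3 = W3 @ y2; y3 = relu(v3)
        y = sigmoid(W4 @ y3)
        # Backward pass: propagate deltas through the layers
        delta4 = y * (1 - y) * (d - y)               # sigmoid derivative at the output
        delta3 = (v3 > 0) * (W4.T @ delta4)          # ReLU derivative is 1 where v > 0
        delta2 = (v2 > 0) * (W3.T @ delta3)
        delta1 = (v1 > 0) * (W2.T @ delta2)
        # Weight updates (stochastic gradient descent, one sample at a time)
        W4 += alpha * delta4 @ y3.T
        W3 += alpha * delta3 @ y2.T
        W2 += alpha * delta2 @ y1.T
        W1 += alpha * delta1 @ x.T
    return W1, W2, W3, W4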
Dropout
▪ This section presents the code that implements dropout. We use the sigmoid activation function for the hidden nodes.
▪ This code is mainly used to see how dropout is coded, as the training data may be too simple for us to perceive the actual improvement from reduced overfitting.
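▪ The slides reference the dropout code without reproducing it here. The Python sketch below shows one way a forward pass with a sigmoid hidden layer and dropout could look; the function and variable names are illustrative assumptions.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward_with_dropout(W1, W2, x, ratio=0.5, rng=np.random.default_rng()):
    # Illustrative sketch only. During training, roughly `ratio` of the hidden
    # outputs are zeroed and the survivors are rescaled by 1/(1 - ratio).
    x = x.reshape(-1, 1)
    y1 = sigmoid(W1 @ x)                                    # sigmoid hidden layer
    mask = (rng.random(y1.shape) >= ratio) / (1.0 - ratio)  # dropout mask
    y1 = y1 * mask                                          # deactivate dropped nodes
    return sigmoid(W2 @ y1)                                 # output layer

# Example usage with random weights
rng = np.random.default_rng(1)
W1 = rng.standard_normal((5, 3))
W2 = rng.standard_normal((1, 5))
print(forward_with_dropout(W1, W2, np.array([1.0, 0.5, -0.3])))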
