
CST414

DEEP LEARNING
MODULE 1-PART 2

SYLLABUS
Module-1 (Neural Networks): Introduction to neural networks - Single layer perceptrons, Multi Layer Perceptrons (MLPs),
Representation Power of MLPs, Activation functions - Sigmoid, Tanh, ReLU, Softmax, Risk minimization, Loss function,
Training MLPs with backpropagation, Practical issues in neural network training - The Problem of Overfitting,
Vanishing and exploding gradient problems, Difficulties in convergence, Local and spurious Optima,
Computational Challenges. Applications of neural networks.
REPRESENTATION POWER OF MLP

 Representation power is related to the ability of a neural network to assign proper labels to a particular instance and to create well-defined, accurate decision boundaries for that class.
 A neural network is a computational graph that performs compositions of simpler functions to provide a more complex function.
 The power of deep learning arises from the fact that repeated composition of multiple nonlinear functions has significant expressive power.
 The repeated composition of certain types of functions increases the representation power of the network, and therefore reduces the parameter space required for learning.
The Importance of Nonlinear Activation

 A neural network with only linear activations will never be able to classify the training data perfectly because the points are not linearly separable.
 On the other hand, consider a situation in which the hidden units have ReLU activation, and they learn two new features h1 and h2 from the inputs.
 Note that these goals can be achieved by using appropriate weights from the input to the hidden layer, and by applying a ReLU activation unit.
 The latter achieves the goal of thresholding negative values to 0.
 The coordinates of the three points in the 2-dimensional hidden layer are {(1, 0), (0, 1), (0, 0)}. It is immediately evident that the two classes become linearly separable in terms of the new hidden representation.
 In a sense, the task of the first layer was representation learning to enable the solution of the problem with a linear classifier.
 The key point is that the use of the nonlinear ReLU function is crucial in ensuring this linear separability.
 Activation functions enable nonlinear mappings of the data, so that the embedded points can become linearly separable.
 If both the weights from the hidden to output layer are set to 1 with a linear activation function, the output O will be defined as follows:
 O = h1 + h2
 This simple linear function separates the two classes because it always takes on the value of 1 for the two points labeled ‘*’ and takes on the value of 0 for the point labeled ‘+’.
 Therefore, much of the power of neural networks is hidden in the use of activation functions.
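 As a minimal NumPy sketch of this example, assume for illustration the one-dimensional data set x ∈ {−1, 0, +1} with the two outer points in class ‘*’ and the middle point in class ‘+’, and hidden features h1 = max(x, 0), h2 = max(−x, 0); these particular weights are an assumption chosen to reproduce the hidden coordinates quoted above:

import numpy as np

# Hypothetical 1-D data: the two outer points are class '*', the middle point is class '+'.
x = np.array([-1.0, 0.0, 1.0])
labels = np.array(['*', '+', '*'])

# Hidden layer with ReLU activation and assumed weights (+1 and -1 from input to the two hidden units).
h1 = np.maximum(x, 0)    # thresholds negative values to 0
h2 = np.maximum(-x, 0)

# Hidden representation: -1 -> (0, 1), 0 -> (0, 0), 1 -> (1, 0), now linearly separable.
hidden = np.stack([h1, h2], axis=1)
print(hidden)            # [[0. 1.] [0. 0.] [1. 0.]]

# Output with both hidden-to-output weights set to 1 and a linear activation.
o = h1 + h2
print(o)                 # [1. 0. 1.] -> value 1 for the '*' points, 0 for the '+' point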

 Activation functions enable nonlinear transformations of the data that become increasingly powerful with multiple layers.
 A sequence of nonlinear activations imposes a specific type of structure on the learned model, whose power increases with the depth of the sequence (i.e., the number of layers in the neural network).
 Advantages of sigmoid
 The main reason why we use the sigmoid function is that its output lies between 0 and 1. Therefore, it is especially used for models where we have to predict a probability as the output, since the probability of anything exists only in the range of 0 to 1.
 In binary classification, also called logistic regression, the sigmoid function is used to predict the probability of a binary variable.
 The sigmoid is written σ(x) = 1/(1 + e^(−x)), where e ≈ 2.718.
 Disadvantages of the sigmoid activation function
 One of the major disadvantages of using sigmoid is the problem of vanishing gradients.
 It saturates and kills gradients.
 It is computationally expensive because of the exponential term in it.
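 A short NumPy sketch of the sigmoid and its derivative, illustrating the saturation behind the vanishing-gradient issue (the input values are arbitrary):

import numpy as np

def sigmoid(x):
    """Sigmoid activation: maps any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid; its maximum value is 0.25 at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    print(f"x={x:6.1f}  sigmoid={sigmoid(x):.4f}  gradient={sigmoid_grad(x):.6f}")
# At |x| = 10 the gradient is about 4.5e-5: the unit is saturated and "kills" the gradient.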
 Advantages of ReLU
 A model trained with ReLU converges quickly and thus takes much less time compared to a model trained with the sigmoid function. Model performance is also significantly better when trained with ReLU.
 It does not allow all of the neurons to be activated at the same time: if any input is negative, ReLU converts it to zero and the neuron does not get activated. This means that only a few neurons are activated, making the network computationally efficient.
 Disadvantages of ReLU
 The drawback of ReLU is that such units cannot learn on examples for which their activation is zero. This usually happens if you initialize the entire neural network with zeros and place ReLU on the hidden layers.
 ReLU is faster to compute than the sigmoid function.
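 A comparable sketch for ReLU, showing sparse activation and the lack of gradient for inactive units (values chosen only for illustration):

import numpy as np

def relu(x):
    """ReLU activation: passes positive values, thresholds negatives to zero."""
    return np.maximum(x, 0.0)

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return (x > 0).astype(float)

pre_activations = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(pre_activations))       # [0. 0. 0. 0.5 3.] -> only a few neurons "fire"
print(relu_grad(pre_activations))  # [0. 0. 0. 1. 1.]  -> no gradient flows through inactive units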
Problem (worked example): binary sigmoid activation.
Assignment 1
 1. Suppose you have a 3-dimensional input x = (x1, x2, x3) = (2, 2, 1) fully connected to 1 neuron in the hidden layer with a sigmoid activation function. Calculate the output of the hidden layer neuron.
 2. Design a single-layer perceptron to compute the NAND (not-AND) function. This function receives two binary-valued inputs x1 and x2, and returns 0 if both inputs are 1, and returns 1 otherwise.
 3. Suppose we have a fully connected, feed-forward network with no hidden layer, and 5 input units connected directly to 3 output units. Briefly explain why adding a hidden layer with 8 linear units does not make the network any more powerful.
 4. Briefly explain one thing you would use a validation set for, and why you can’t just do it using the test set.
 Submit on or before 13th March 2023.


Loss Function

 The choice of the loss function is critical in defining the outputs in a way that is sensitive to the application at hand.
 For example, least-squares regression with numeric outputs requires a simple squared loss of the form (y − ŷ)² for a single training instance with target y and prediction ŷ.
 Other types of loss include the hinge loss for y ∈ {−1, +1} and real-valued prediction ŷ (with identity activation):
 L = max{0, 1 − y · ŷ}
 The hinge loss can be used to implement a learning method referred to as the support vector machine.
 For probabilistic predictions, two different types of loss functions are used, depending on whether the prediction is binary or multiway:
 1. Binary targets (logistic regression):
 In this case, it is assumed that the observed value y is drawn from {−1, +1}, and the prediction ŷ is an arbitrary numerical value obtained using the identity activation function.
 In such a case, the loss function for a single instance with observed value y and real-valued prediction ŷ (with identity activation) is defined as follows:
 L = log(1 + exp(−y · ŷ))
 This type of loss function implements a fundamental machine learning method referred to as logistic regression.
 Alternatively, one can use a sigmoid activation function to output ŷ ∈ (0, 1), which indicates the probability that the observed value y is 1.
 2. Categorical targets:
 In this case, if ŷ1 . . . ŷk are the probabilities of the k classes and the rth class is the ground-truth class, then the loss function for a single instance is defined as follows:
 L = −log(ŷr)
 This type of loss function implements multinomial logistic regression, and it is referred to as the cross-entropy loss.
 The key point to remember is that the nature of the output nodes, the activation function, and the loss function depend on the application at hand.
 For discrete-valued outputs, it is common to use softmax activation with cross-entropy loss. For real-valued outputs, it is common to use linear activation with squared loss. Generally, cross-entropy loss is easier to optimize than squared loss.
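 These losses can be written directly in NumPy; the sketch below assumes single training instances and the conventions used above (y ∈ {−1, +1} for the binary losses, class probabilities for the categorical case):

import numpy as np

def squared_loss(y, y_hat):
    """Least-squares regression loss for a numeric target."""
    return (y - y_hat) ** 2

def hinge_loss(y, y_hat):
    """Hinge loss for y in {-1, +1} and a real-valued prediction (support vector machine)."""
    return max(0.0, 1.0 - y * y_hat)

def logistic_loss(y, y_hat):
    """Logistic-regression loss for y in {-1, +1} and a real-valued prediction."""
    return np.log(1.0 + np.exp(-y * y_hat))

def cross_entropy_loss(probs, r):
    """Cross-entropy loss: negative log-probability assigned to the ground-truth class r."""
    return -np.log(probs[r])

print(squared_loss(2.0, 1.5))                             # 0.25
print(hinge_loss(+1, 0.3), logistic_loss(+1, 0.3))        # 0.7, ~0.554
print(cross_entropy_loss(np.array([0.1, 0.7, 0.2]), 1))   # ~0.357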
Risk minimization
 The goal of a machine learning algorithm is to reduce the expected generalization error, given by
 J*(θ) = E_(x,y)∼p_data [ L(f(x; θ), y) ]
 This quantity is known as the risk.
 The expectation is taken over the true underlying distribution p_data.
 To convert a machine learning problem back into an optimization problem, we minimize the expected loss on the training set. This means replacing the true distribution p(x, y) with the empirical distribution p̂(x, y) defined by the training set. We now minimize the empirical risk:
 E_(x,y)∼p̂(x,y) [ L(f(x; θ), y) ] = (1/m) Σ_{i=1..m} L(f(x^(i); θ), y^(i))
 where m is the number of training examples.
 The training process based on minimizing this average training error is known as empirical risk minimization.
 Empirical risk minimization is prone to overfitting.
 The most effective modern optimization algorithms are based on gradient descent.
 In the context of deep learning, we rarely use empirical risk minimization.
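 A small sketch of empirical risk minimization with gradient descent on squared loss, using a made-up linear model (all data and parameter values are hypothetical):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # m = 100 training examples, 3 features
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(3)
alpha = 0.1
for _ in range(200):
    y_hat = X @ w
    empirical_risk = np.mean((y - y_hat) ** 2)      # (1/m) * sum of per-example losses
    grad = -2.0 / len(y) * X.T @ (y - y_hat)        # gradient of the empirical risk
    w -= alpha * grad

print(w)  # close to the true weights: the average training loss has been minimized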
TRAINING MLPS WITH BACKPROPAGATION

 In a single-layer neural network, the training process is relatively straightforward because the error (or loss function) can be computed as a direct function of the weights.
 In the case of multi-layer networks, the problem is that the loss is a complicated composition function of the weights in earlier layers.
 The gradient of a composition function is computed using the backpropagation algorithm.
 The backpropagation algorithm leverages the chain rule of differential calculus, which computes the error gradients in terms of summations of local-gradient products over the various paths from a node to the output.
 The backpropagation algorithm is a direct application of dynamic programming.
 It contains two main phases, referred to as the forward and backward phases, respectively. The forward phase is required to compute the output values and the local derivatives at various nodes, and the backward phase is required to accumulate the products of these local values over all paths from the node to the output:
 1. Forward phase:
 In this phase, the inputs for a training instance are fed into the neural network. This results in a forward cascade of computations across the layers, using the current set of weights.
 The final predicted output can be compared to that of the training instance, and the derivative of the loss function with respect to the output is computed.
 2. Backward phase: The main goal of the backward phase is to learn the gradient of the loss function with respect to the different weights by using the chain rule of differential calculus.
 These gradients are used to update the weights. Since these gradients are learned in the backward direction, starting from the output node, this learning process is referred to as the backward phase.
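 A minimal sketch of the two phases for a single training instance, assuming a tiny one-hidden-layer network with sigmoid hidden units, a linear output, and squared loss (the weights and inputs are made up for illustration):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([0.5, -1.0])                   # training instance
y = 1.0                                     # target value
W1 = np.array([[0.2, 0.4], [-0.3, 0.1]])    # input -> hidden weights (assumed)
w2 = np.array([0.7, -0.5])                  # hidden -> output weights (assumed)

# Forward phase: cascade of computations using the current weights.
a1 = W1 @ x                                 # pre-activations of the hidden units
h = sigmoid(a1)                             # post-activations
o = w2 @ h                                  # linear output
loss = (y - o) ** 2
dL_do = -2.0 * (y - o)                      # derivative of the loss w.r.t. the output

# Backward phase: chain rule from the output back to the weights.
grad_w2 = dL_do * h
delta_h = dL_do * w2 * h * (1 - h)          # gradient w.r.t. hidden pre-activations (through the sigmoid)
grad_W1 = np.outer(delta_h, x)
print(grad_w2, grad_W1, sep="\n")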

Figure 1.13: Illustration of the chain rule in computational graphs: the products of node-specific partial derivatives along paths from weight w to output o are aggregated. The resulting value yields the derivative of output o with respect to weight w. Only two paths between input and output exist in this example.
 Consider a sequence of hidden units h1, h2, . . . , hk followed by output o, with respect to which the loss function L is computed.
 Assume that the weight of the connection from hidden unit hr to hr+1 is w(hr,hr+1).
 In the case that a single path exists from h1 to o, one can derive the gradient of the loss function with respect to any of these edge weights using the chain rule:
 ∂L/∂w(h(r−1),hr) = (∂L/∂o) · [ ∂o/∂hk · Π_{i=r..k−1} ∂h(i+1)/∂hi ] · ∂hr/∂w(h(r−1),hr)
 A generalized variant of the chain rule, referred to as the multivariable chain rule, computes the gradient in a computational graph where more than one path might exist.
 This is achieved by adding the composition along each of the paths from h1 to o.
 The path-aggregated term above [annotated by Δ(hr, o) = ∂L/∂hr] is aggregated over an exponentially increasing number of paths.
 A key point is that the computational graph of a neural network does not have cycles, so it is possible to compute such an aggregation in a principled way in the backwards direction: first compute Δ(hk, o) for nodes hk closest to o, and then recursively compute these values for nodes in earlier layers in terms of the nodes in later layers.
 Furthermore, the value of Δ(o, o) for each output node is initialized as follows:
 Δ(o, o) = ∂L/∂o
 The recursion for Δ(hr, o) can be derived using the multivariable chain rule:
 Δ(hr, o) = ∂L/∂hr = Σ_{h : hr ⇒ h} (∂L/∂h) · (∂h/∂hr) = Σ_{h : hr ⇒ h} (∂h/∂hr) · Δ(h, o)
 Consider a situation in which the edge joining hr to h has weight w(hr,h), and let ah be the value computed in hidden unit h just before applying the activation function Φ(·).
 We have h = Φ(ah), where ah is a linear combination of the inputs from earlier-layer units incident on h. Then, by the univariate chain rule, the following expression for ∂h/∂hr can be derived:
 ∂h/∂hr = (∂h/∂ah) · (∂ah/∂hr) = Φ′(ah) · w(hr,h)
 This expression is used recursively in the backwards direction, starting with the output node. The corresponding updates in the backwards direction are as follows:
 Δ(hr, o) = Σ_{h : hr ⇒ h} Φ′(ah) · w(hr,h) · Δ(h, o)
 This needs to be repeated for each incoming edge into the node to compute the gradient with respect to all edge weights.
 The key gradient that is backpropagated is the derivative with respect to layer activations; the gradient with respect to the weights is then easy to compute for any edge incident on the corresponding unit.
 Backpropagation uses the variables in the hidden layers as the “chain” variables for the dynamic programming recursion.
 The pre-activation variable in a neuron is obtained after applying the linear transform and serves as an intermediate variable: the pre-activation value of the hidden variable h = Φ(ah) is ah, and h itself is the post-activation value.
 As with the single-layer network, the process of updating the weights is repeated to convergence by repeatedly cycling through the training data in epochs. A neural network may sometimes require thousands of epochs through the training data to learn the weights at the different nodes.
Derivation of the Backpropagation Algorithm
 In gradient descent, all the training points are used to compute the loss and its derivative, while in stochastic gradient descent a single randomly chosen point is used for the loss and its derivative.
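 A sketch contrasting the two update rules on the same toy squared-loss linear model (the data and learning rate are hypothetical):

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = X @ np.array([2.0, -1.0])
w_gd, w_sgd, alpha = np.zeros(2), np.zeros(2), 0.05

for _ in range(100):
    # Gradient descent: the gradient uses all training points.
    w_gd += alpha * X.T @ (y - X @ w_gd) / len(y)

    # Stochastic gradient descent: one randomly chosen point per update.
    i = rng.integers(len(y))
    w_sgd += alpha * X[i] * (y[i] - X[i] @ w_sgd)

print(w_gd, w_sgd)   # both approach the true weights [2, -1], SGD more noisily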
Problem 1
(Worked example: network output.)
 Algorithm: Backpropagation. Neural network learning for classification or numeric prediction, using the backpropagation algorithm.
 Input:
 • D, a data set consisting of the training tuples and their associated target values;
 • l, the learning rate;
 • network, a multilayer feed-forward network.
 Output: A trained neural network.
 Method:
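 A minimal Python rendering of such a training loop, assuming a one-hidden-layer feed-forward network with sigmoid hidden units and squared loss; the layer size, initialization, and number of epochs are assumptions:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_backprop(D, l, hidden_units=4, epochs=1000, seed=0):
    """Train a small feed-forward network on data set D = (X, y) with learning rate l."""
    X, y = D
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(hidden_units, X.shape[1]))
    w2 = rng.normal(scale=0.5, size=hidden_units)
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):                     # cycle through the training tuples
            h = sigmoid(W1 @ x_i)                      # forward phase
            o = w2 @ h
            delta_o = y_i - o                          # error at the output
            delta_h = delta_o * w2 * h * (1 - h)       # backpropagated error at the hidden layer
            w2 += l * delta_o * h                      # weight updates (backward phase)
            W1 += l * np.outer(delta_h, x_i)
    return W1, w2

# Hypothetical usage on a tiny XOR-like data set:
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])
W1, w2 = train_backprop((X, y), l=0.5)
print(np.round([w2 @ sigmoid(W1 @ x) for x in X], 2))   # network outputs after training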
Problem 2
(Worked example.)
PRACTICAL ISSUES IN NEURAL NETWORK TRAINING
 The challenges are primarily related to several practical problems associated with training, the most important of which is overfitting.
 The Problem of Overfitting and Underfitting
 There is always a gap between the training and test data performance.
 In general, models with a larger number of parameters are said to have high capacity, and they require a larger amount of data in order to gain generalization power on unseen test data.
 How to Detect Overfitting
 The error on test data might be caused by several reasons.
 – Other reasons include bias (underfitting), noise, and poor convergence.
 Overfitting shows up as a large gap between in-sample and out-of-sample accuracy.
 The first solution is to collect more data.
 – More data might not always be available!
 High variance shows up as overfitting of the data, for example when:
 – the data is not cleaned, or
 – the model is not trained for a sufficient time.
 High bias leads to underfitting:
 – the model is assumed to be too simple.
Regularization
 Since a larger number of parameters causes overfitting, a natural approach is to constrain the model to use fewer non-zero parameters.
 Smaller absolute values of the parameters also tend to overfit less. Since it is hard to constrain the values of the parameters directly, the softer approach of adding a penalty λ Σ_i |w_i|^p to the loss function is used.
 The value of p is typically set to 2, which leads to Tikhonov (L2) regularization.
 In general, the squared value of each parameter (multiplied by the regularization parameter λ > 0) is added to the objective function.
 The practical effect of this change is that a quantity proportional to λ·w_i is subtracted from the update of the parameter w_i.
 An example of a regularized version of the gradient-descent update for mini-batch S and update step-size α > 0 is as follows:
 W ⇐ W(1 − α·λ) + α Σ_{X∈S} E(X) · X
 Here E(X) represents the current error (y − ŷ) between observed and predicted values of training instance X.
 Regularization is particularly important when the amount of available data is limited.
 A neat biological interpretation of regularization is that it corresponds to gradual forgetting, as a result of which “less important” (i.e., noisy) patterns are removed.
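 A minimal sketch of the effect of the L2 penalty on the gradient-descent update, assuming a simple linear model (the data and the value of λ are arbitrary):

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 5))
y = X @ np.array([1.0, 0.0, 0.0, 0.0, -1.0]) + 0.1 * rng.normal(size=30)

alpha, lam = 0.05, 0.1
w = np.zeros(5)
for _ in range(500):
    error = y - X @ w                          # E(X) for every instance in the mini-batch
    grad = X.T @ error / len(y)
    w = w * (1 - alpha * lam) + alpha * grad   # shrink weights, then apply the usual update

print(np.round(w, 3))   # the small "noisy" weights are pushed toward zero by the penalty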
 Neural Architecture and Parameter Sharing
 Many of the parameters might be shared. For example, a convolutional neural network uses the same set of parameters to learn the characteristics of a local block of the image.
 Recent advancements in the use of neural networks, such as recurrent neural networks and convolutional neural networks, are examples of this phenomenon.
Early Stopping
 Another common form of regularization is early stopping, in which gradient descent is ended after only a few iterations.
 One way to decide the stopping point is by holding out a part of the training data, and then testing the error of the model on the held-out set.
 The gradient-descent approach is terminated when the error on the held-out set begins to rise.
 Early stopping essentially reduces the size of the parameter space to a smaller neighborhood within the initial values of the parameters.
 From this point of view, early stopping acts as a regularizer because it effectively restricts the parameter space.
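 A sketch of early stopping with a held-out set on the same kind of toy linear model (the patience value is an assumption):

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + 0.5 * rng.normal(size=200)

# Hold out part of the training data to monitor generalization.
X_tr, y_tr, X_val, y_val = X[:150], y[:150], X[150:], y[150:]

w, alpha = np.zeros(10), 0.01
best_val, best_w, patience = np.inf, w.copy(), 10
for step in range(5000):
    w += alpha * X_tr.T @ (y_tr - X_tr @ w) / len(y_tr)
    val_error = np.mean((y_val - X_val @ w) ** 2)
    if val_error < best_val:
        best_val, best_w, patience = val_error, w.copy(), 10
    else:
        patience -= 1
        if patience == 0:          # stop when the held-out error stops improving
            break

print(step, np.round(best_val, 4))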
Trading Off Breadth for Depth
 Networks with more layers (i.e., greater depth) tend to require far fewer units per layer, because the composition functions created by successive layers make the neural network more powerful.
 Increased constraints reduce the capacity of the network, which is helpful when there are limitations on the amount of available data.
 Even though deep networks have fewer problems with respect to overfitting, they come with a different family of problems associated with ease of training.
Ensemble Methods
 Ensemble methods such as bagging are used in order to increase the generalization power of the model.
 A number of ensemble methods that are specifically focused on neural networks have also been proposed. Two such methods are Dropout and DropConnect.
 When training with Dropout, a randomly selected subset of activations is set to zero within each layer.
 DropConnect instead sets a randomly selected subset of weights within the network to zero. Each unit thus receives input from a random subset of units in the previous layer.
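 A sketch of the two masking schemes for one layer, assuming a drop probability of 0.5 (masks are re-sampled at every training step):

import numpy as np

rng = np.random.default_rng(4)
h_prev = rng.normal(size=6)              # activations from the previous layer
W = rng.normal(size=(4, 6))              # weights of the current layer
p_drop = 0.5

# Dropout: zero out a random subset of activations.
activation_mask = rng.random(6) > p_drop
h_dropout = W @ (h_prev * activation_mask)

# DropConnect: zero out a random subset of weights instead.
weight_mask = rng.random(W.shape) > p_drop
h_dropconnect = (W * weight_mask) @ h_prev

print(h_dropout, h_dropconnect, sep="\n")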
THE VANISHING AND EXPLODING GRADIENT PROBLEMS
 While increasing depth often reduces the number of parameters of the network, it leads to different types of practical issues.
 Propagating backwards using the chain rule has its drawbacks in networks with a large number of layers, in terms of the stability of the updates.
 The updates in earlier layers can either be negligibly small (vanishing gradient) or increasingly large (exploding gradient) in certain types of neural network architectures.
 This is caused by the chain-like product computation, which can either exponentially increase or decay over the length of the path.
 Consider a situation in which we have a multi-layer network with one neuron in each layer.
 Each local derivative along a path can be shown to be the product of the weight and the derivative of the activation function. The overall backpropagated derivative is the product of these values.
 If each such value is randomly distributed and has an expected value less than 1, the product of these derivatives will drop off exponentially fast with path length.
 The vanishing and exploding gradient problems are thus rather natural to deep networks, which makes their training process unstable.
 For example, a sigmoid activation often encourages the vanishing gradient problem, because its derivative is less than 0.25 at all values of its argument, and is extremely small at saturation.
 A ReLU activation unit is known to be less likely to create a vanishing gradient problem, because its derivative is always 1 for positive values of the argument.
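 A tiny sketch of why long chains of sigmoid units tend to vanish while ReLU chains need not, multiplying the per-layer derivative factors along a single path (the weights and depth are arbitrary):

import numpy as np

rng = np.random.default_rng(5)
depth = 50
weights = rng.normal(scale=1.0, size=depth)

# Sigmoid: each local derivative is at most 0.25, so the product decays exponentially.
sigmoid_factors = 0.25 * np.abs(weights)          # best case (no saturation)
print(np.prod(sigmoid_factors))                   # astronomically small for depth 50

# ReLU on the active path: the activation derivative is exactly 1.
relu_factors = 1.0 * np.abs(weights)
print(np.prod(relu_factors))                      # governed by the weights alone;
                                                  # with larger weights this product can instead explode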
 The use of adaptive learning rates and conjugate gradient methods can help in many cases.
 A more recent technique called batch normalization is helpful in addressing some of these issues.
 Batch normalization is a normalization technique applied between the layers of a neural network, instead of on the raw data. It is done along mini-batches instead of the full data set, and it serves to speed up training.
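 A minimal sketch of the batch-normalization computation for one layer's pre-activations over a mini-batch (the learnable scale γ and shift β are left at their initial values of 1 and 0):

import numpy as np

rng = np.random.default_rng(6)
a = rng.normal(loc=3.0, scale=5.0, size=(32, 8))   # pre-activations: batch of 32, 8 units

eps = 1e-5
mu = a.mean(axis=0)                     # per-unit mean over the mini-batch
var = a.var(axis=0)                     # per-unit variance over the mini-batch
a_hat = (a - mu) / np.sqrt(var + eps)   # normalized pre-activations

gamma, beta = np.ones(8), np.zeros(8)   # learnable scale and shift
out = gamma * a_hat + beta
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))  # ~0 mean, ~1 std per unit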
 In a network of n hidden layers, n derivatives will be multiplied together. If the derivatives are large, then the gradient will increase exponentially as we propagate down the model until they eventually explode; this is what we call the problem of exploding gradients.
DIFFICULTIES IN CONVERGENCE
 Sufficiently fast convergence of the optimization process is difficult to achieve with very deep networks, as depth leads to increased resistance to the training process in terms of letting the gradients smoothly flow through the network.
LOCAL AND SPURIOUS OPTIMA
 The optimization function of a neural network is highly nonlinear and has lots of local optima.
 When the parameter space is large and there are many local optima, it makes sense to spend some effort in picking good initialization points.
 One such method for improving neural network initialization is referred to as pretraining.
 The basic idea is to use either supervised or unsupervised training on shallow sub-networks of the original network in order to create the initial weights.
 This type of pretraining is done in a greedy, layer-wise fashion in which a single layer of the network is trained at a time in order to learn the initialization points of that layer.
 The basic idea here is that some of the minima in the loss function are spurious optima because they are exhibited only in the training data and not in the test data.
 Unsupervised pretraining often tends to avoid problems associated with overfitting.
 Using unsupervised pretraining tends to move the initialization point closer to the basin of “good” optima in the test data. This is an issue associated with model generalization.
 The notion of spurious optima is often viewed through the lens of model generalization in neural networks.
 In traditional optimization, one does not focus on the differences between the loss functions of the training and test data, but on the shape of the loss function in only the training data.
 The problem of local optima (from a traditional perspective) is a smaller issue in neural networks than one might normally expect for such a nonlinear function.
COMPUTATIONAL CHALLENGES
 A significant challenge in neural network design is the running time required to train the network.
 In recent years, advances in hardware technology such as Graphics Processing Units (GPUs) have helped to a significant extent. GPUs are specialized hardware processors that can significantly speed up the kinds of operations commonly used in neural networks.
 In this sense, some algorithmic frameworks like Torch are particularly convenient because they have GPU support tightly integrated into the platform.
 One convenient property of neural network models is that most of the computational heavy lifting is front-loaded during the training phase.
 The prediction phase is often computationally efficient, because it requires a small number of operations (depending on the number of layers).
 This is important because the prediction phase is often far more time-critical compared to the training phase.
APPLICATIONS OF NEURAL NETWORKS
 Neural networks are used in key sectors including finance, healthcare, and automotive. They can be used for image recognition, character recognition, stock market prediction, etc.
 1. Facial Recognition
 Facial recognition systems serve as robust systems of surveillance. A recognition system matches a human face and compares it with the digital images in its database.
 They are used in offices for selective entry. The systems thus authenticate a human face and match it up with the list of IDs present in the database.
 Convolutional Neural Networks (CNNs) are used for facial recognition and image processing.
 2. Stock Market Prediction
 Investments are subject to market risks.
 To make a successful stock prediction in real time, a Multilayer Perceptron (MLP, a class of feedforward artificial neural network) is employed.
 An MLP comprises multiple layers of nodes, each of which is fully connected to the succeeding layer. A stock’s past performance, annual returns, and non-profit ratios are considered for building the MLP model.
 3. Social Media
 Artificial Neural Networks are used to study the behaviour of social media users. Data shared every day via virtual conversations is collected and analyzed for competitive analysis.
 Neural networks model the behaviour of social media users. After analysing individuals’ behaviour via social media networks, the data can be linked to people’s spending habits. Multilayer Perceptron ANNs are used to mine data from social media applications.
 4. Aerospace
 Aerospace engineering is an expansive term that covers developments in spacecraft and aircraft.
 Fault diagnosis, high-performance autopiloting, securing aircraft control systems, and modeling key dynamic simulations are some of the key areas where neural networks have taken over. Time-delay neural networks can be employed for modelling nonlinear time-dynamic systems.
 Time-Delay Neural Networks are used for position-independent feature recognition. Algorithms built on time-delay neural networks can recognize patterns.
 5. Defence
 Neural networks also shape the defence operations of technologically advanced countries.
 Neural networks are used in logistics, armed-attack analysis, and object location. They are also used in air patrols, maritime patrol, and for controlling automated drones.
 Convolutional Neural Networks (CNNs) are employed for determining the presence of underwater mines, which are explosive devices placed under water to target ships and submarines.
 6. Healthcare
 Modern-day individuals are leveraging the advantages of technology in the healthcare sector. Convolutional Neural Networks are actively employed in the healthcare industry for X-ray, CT scan, and ultrasound analysis.
 As CNNs are used in image processing, the medical imaging data retrieved from the aforementioned tests is analyzed and assessed based on neural network models. Recurrent Neural Networks (RNNs) are also being employed for the development of voice recognition systems.
 7. Signature Verification and Handwriting Analysis
 Signature verification, as the self-explanatory term goes, is used for verifying an individual’s signature. Banks and other financial institutions use signature verification to cross-check the identity of an individual.
 Artificial Neural Networks are used for verifying signatures. ANNs are trained to recognize the difference between real and forged signatures, and can be used for the verification of both offline and online signatures.
 8. Weather Forecasting
 Forecasting is primarily undertaken to anticipate upcoming weather conditions beforehand. In the modern era, weather forecasts are even used to predict the possibility of natural disasters.
 Multilayer Perceptrons (MLP), Convolutional Neural Networks (CNN), and Recurrent Neural Networks (RNN) are used for weather forecasting. Traditional ANN multilayer models can also be used to predict climatic conditions 15 days in advance. A combination of different types of neural network architectures can be used to predict air temperatures.
---------------------------------END OF MODULE 1---------------------------------
