Unit 2
Neurons
Machine Learning
Deep Learning
Traditional ML Vs DL
Artificial neuron vs Biological neuron
• The most fundamental unit of a deep neural network is called an artificial neuron
• Why is it called a neuron? Where does the inspiration come from?
• The inspiration comes from biology (more specifically, from the brain)
• biological neurons = neural cells = neural processing units
• We will first see what a biological neuron looks like ...
[Figure: an artificial neuron with inputs x1, x2, x3, weights w1, w2, w3, aggregation σ, and output y]
• dendrite: receives signals from other neurons
• synapse: point of connection to other neurons
• soma: processes the information
• axon: transmits the output of this neuron
Biological Neurons∗
∗ Image adapted from https://cdn.vectorstock.com/i/composite/12,25/neuron-cell-vector-81225.jpg
• Of course, in reality, it is not just a single neuron which does all this
• There is a massively parallel interconnected network of neurons
• The sense organs relay information to the lowest layer of neurons
• Some of these neurons may fire (in red) in response to this information and in turn relay information to other neurons they are connected to
• These neurons may also fire (again, in red) and the process continues, eventually resulting in a response (laughter in this case)
• This massively parallel network also ensures that there is division of work
• Each neuron may perform a certain role or respond to a certain stimulus
A simplified illustration
• The neurons in the brain are arranged in a hierarchy
• We illustrate this with the help of the visual cortex (part of the brain) which deals with processing visual information
• Starting from the retina, the information is relayed to several layers (follow the arrows)
• We observe that the layers V1, V2 up to AIT form a hierarchy (from identifying simple visual forms to high level objects)
Sample illustration of hierarchical processing∗
Neurons
• ANNs are built upon simple signal processing elements (neurons) that are connected together into a large mesh.
• You might be surprised to see how simple the calculations inside a neuron actually are. We can identify three processing steps:
1. Each input gets scaled up or down
• When a signal comes in, it gets multiplied by a weight value that is assigned to this particular input. That is, if a neuron has three inputs, then it has three weights that can be adjusted individually. During the learning phase, the neural network can adjust the weights based on the error of the last test result.
2. All signals are summed up
• In the next step, the modified input signals are summed up to a single value. In this step, an offset is
also added to the sum. This offset is called bias. The neural network also adjusts the bias during the
learning phase.
• This is where the magic happens! At the start, all the neurons have random weights and random biases.
After each learning iteration, weights and biases are gradually shifted so that the next result is a bit
closer to the desired output. This way, the neural network gradually moves towards a state where the
desired patterns are “learned”.
3. Activation
• Finally, the result of the neuron’s calculation is turned into an output signal. This is done by feeding the
result to an activation function (also called transfer function).
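To make the three steps concrete, here is a minimal Python sketch of a single neuron; the input, weight, and bias values are made up for illustration, and a sigmoid (introduced in the next section) stands in for the activation function.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])   # three hypothetical input signals
w = np.array([0.8, 0.2, -0.5])   # one adjustable weight per input
b = 0.1                          # bias (the offset added to the sum)

scaled = w * x                   # step 1: each input gets scaled by its weight
z = scaled.sum() + b             # step 2: all signals are summed up, plus the bias
y = sigmoid(z)                   # step 3: activation turns the sum into the output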
Activation Functions
• The activation function is used as a decision making body at the output of a neuron. The neuron learns linear or non-linear decision boundaries based on the activation function. It also has a normalizing effect on the neuron output, which prevents the outputs of neurons after several layers from becoming very large due to the cascading effect. The three most widely used activation functions are:
• Sigmoid
It maps the input (x axis) to values between 0 and 1.
• Tanh
It is similar to the sigmoid function but maps the input to values between -1 and 1.
• Rectified Linear Unit (ReLU)
It allows only positive values to pass through it. The negative values are mapped to zero.
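All three functions are one-liners in NumPy; a minimal sketch:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # output always lies in (0, 1)

def tanh(z):
    return np.tanh(z)                 # output always lies in (-1, 1)

def relu(z):
    return np.maximum(0.0, z)         # positive values pass, negatives become 0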
Input Layer
• This is the first layer of a neural network. It is used to provide the input data or features to the network.
Output Layer
• This is the layer which gives out the predictions. The activation function to be used in this layer is different for different problems. For a binary classification problem, we want the output to be either 0 or 1, so a sigmoid activation function is used. For a multiclass classification problem, a softmax (think of it as a generalization of sigmoid to multiple classes) is used. For a regression problem, where the output is not a predefined category, we can simply use a linear unit.
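For the multiclass case, a minimal sketch of softmax: it exponentiates a vector of raw scores and normalizes them so the outputs are positive and sum to 1, i.e. they can be read as class probabilities.

import numpy as np

def softmax(z):
    z = z - np.max(z)      # shift by the max for numerical stability (result unchanged)
    e = np.exp(z)
    return e / e.sum()     # positive values that sum to 1

softmax(np.array([2.0, 1.0, 0.1]))   # -> approximately [0.659, 0.242, 0.099]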
Commonly used Terminologies in Neural Networks
The input vector: all the input values of each perceptron are collectively called the input vector of that perceptron.
The weight vector: similarly, all the weight values of each perceptron are collectively called the weight vector of that perceptron.
McCulloch-Pitts Neurons
• McCulloch (neuroscientist) and Pitts (logician) proposed a highly simplified computational model of the neuron (1943)
• g aggregates the inputs and the function f takes a decision based on this aggregation
• The inputs can be excitatory or inhibitory
• g(x1, x2, ..., xn) = g(x) = Σ (i = 1 to n) xi
• y = f(g(x)) = 1 if g(x) ≥ θ, and 0 if g(x) < θ
• y = 0 if any xi is inhibitory
• The inputs x1, x2, ..., xn ∈ {0, 1} and the output y ∈ {0, 1}
McCulloch-Pitts Neurons with Boolean functions
[Figure: McCulloch-Pitts neurons with binary inputs (x1, x2, x3 or x1, x2) and outputs y ∈ {0, 1}, using different thresholds θ (e.g. 3, 1, 0) to realize different Boolean functions]
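A minimal sketch of the McCulloch-Pitts neuron with purely excitatory binary inputs; choosing the threshold θ makes it compute different Boolean functions (θ = n gives AND over n inputs, θ = 1 gives OR):

def mp_neuron(x, theta):
    # fire (output 1) iff the number of active binary inputs reaches the threshold
    return 1 if sum(x) >= theta else 0

mp_neuron([1, 1, 1], theta=3)   # AND of three inputs -> 1
mp_neuron([1, 0, 1], theta=3)   # -> 0
mp_neuron([0, 0, 1], theta=1)   # OR of three inputs -> 1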
• The most basic form of an activation function is a simple binary function that has
only two possible results.
• This function returns 1 if the input is positive or zero, and 0 for any negative input. A
neuron whose activation function is a function like this is called a perceptron.
Threshold Logic Unit (TLU)
• In a Threshold Logic Unit (TLU) the output of the unit y in response to
a particular input pattern is calculated in two stages.
• First the activation is calculated.
• The activation a is the weighted sum of the inputs:
[Figure: a TLU with inputs x1 ... xn, weights w1 ... wn, threshold θ, and output y]
a = Σ (i = 1 to n) wi xi
y = 1 if a ≥ θ
y = 0 if a < θ
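In code, the two stages are a weighted sum followed by a threshold test; the input, weight, and threshold values below are hypothetical:

def tlu(x, w, theta):
    a = sum(wi * xi for wi, xi in zip(w, x))   # stage 1: the activation (weighted sum)
    return 1 if a >= theta else 0              # stage 2: compare against the threshold

# a = 0.5*1 + 0.5*0 + 0.5*1 = 1.0 >= 0.8, so the unit outputs 1
tlu([1, 0, 1], [0.5, 0.5, 0.5], theta=0.8)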
Linear Unit
• Perceptron is a machine learning algorithm for supervised learning of various binary classification tasks.
• Further, a perceptron is also understood as an artificial neuron or neural network unit that helps to detect certain input data computations in business intelligence.
• The perceptron model is also treated as one of the best and simplest types of artificial neural networks. However, it is a supervised learning algorithm of binary classifiers.
• Hence, we can consider it as a single-layer neural network with four main parameters, i.e., input values, weights and bias, net sum, and an activation function.
[Figure: a linear unit with inputs x1 ... xn, weights w1 ... wn, and output y]
a = Σ (i = 1 to n) wi xi
y = a = Σ (i = 1 to n) wi xi (the output is the activation itself; there is no threshold)
Training ANNs
• Training set S of examples {x,t}
• x is an input vector and
• t the desired target vector
• Example: logical AND
S = { ((0,0),0), ((0,1),0), ((1,0),0), ((1,1),1) }
• Iterative process
• Present a training example x, compute network output y, compare output y with target t, adjust weights and thresholds
• Learning rule
• Specifies how to change the weights w and thresholds θ of the network as a function of the inputs x, output y and target t.
Perceptron Learning Rule
• w' = w + α (t − y) x
Or in components
• w'i = wi + Δwi = wi + α (t − y) xi   (i = 1..n+1)
with wn+1 = θ and xn+1 = −1
• The parameter α is called the learning rate. It determines the magnitude of the weight updates Δwi.
• If the output is correct (t = y) the weights are not changed (Δwi = 0).
• If the output is incorrect (t ≠ y) the weights wi are changed such that the new weight vector w' moves closer to the input x (when t = 1) or further from it (when t = 0).
Perceptron Training Algorithm
Repeat
  for each training vector pair (x, t)
    evaluate the output y when x is the input
    if y ≠ t then
      form a new weight vector w' according to w' = w + α (t − y) x
    else
      do nothing
    end if
  end for
Until y = t for all training vector pairs
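A direct sketch of this algorithm in Python, learning the logical AND training set S from above; the learning rate and the zero initialization are arbitrary choices, and the threshold is folded into the weight vector via wn+1 = θ, xn+1 = −1:

import numpy as np

S = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]  # logical AND
alpha = 0.1                     # learning rate
w = np.zeros(3)                 # w[0], w[1] plus w[2] = theta

converged = False
while not converged:
    converged = True
    for x, t in S:
        xa = np.array([x[0], x[1], -1.0])   # augmented input, x_{n+1} = -1
        y = 1 if w @ xa >= 0 else 0         # TLU output (a >= theta)
        if y != t:
            w += alpha * (t - y) * xa       # perceptron learning rule
            converged = False               # at least one error: repeat the pass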
Perceptron Convergence Theorem
The algorithm converges to the correct classification
• if the training data is linearly separable
• and α is sufficiently small
• If two classes of vectors X1 and X2 are linearly separable, the application of the perceptron training algorithm will eventually result in a weight vector w0, such that w0 defines a TLU whose decision hyper-plane separates X1 and X2 (Rosenblatt 1962).
• The solution w0 is not unique, since if w0 · x = 0 defines a hyper-plane, so does w'0 = k w0.
• To regularize means to make things regular or acceptable
• Regularization refers to a set of different techniques that lower the complexity of a neural network model during training, and thus prevent overfitting
• Regularization penalizes the weight matrices of the nodes
What is Overfitting
• The training data contains information about the
regularities in the mapping from input to output. But it
also contains sampling error.
• There will be accidental regularities because of the particular training cases that were chosen.
• When we fit the model, it cannot tell which regularities are real and which are caused by sampling error.
• So it fits both kinds of regularity. If the model is very
flexible it can model the sampling error really well.
• This means the model will not generalize well to unseen data
Diagnosing Overfitting
Regularization Techniques
• L2 Regularization / Ridge Regularization
• L1 Regularization / Lasso Regularization
• Dropout
• Early Stopping
[Plot: Salary vs. Experience]
L1 vs L2 Regularization Methods
• L1 Regularization, also called a Lasso regression, adds the “absolute value of
magnitude” of the coefficient as a penalty term to the loss function.
• L2 Regularization, also called a Ridge regression, adds the “squared magnitude” of the
coefficient as the penalty term to the loss function.
• The key difference between these two is the penalty term.
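In code the difference is a single term added to the loss; a minimal sketch, where the coefficient values, the regularization strength lam, and the zero data loss are all hypothetical:

import numpy as np

w = np.array([1.3, -0.4])              # hypothetical model coefficients
lam = 1.0                              # regularization strength (lambda)
data_loss = 0.0                        # placeholder for the unregularized loss

l1_penalty = lam * np.abs(w).sum()     # Lasso: absolute value of magnitude
l2_penalty = lam * (w ** 2).sum()      # Ridge: squared magnitude
loss_l1 = data_loss + l1_penalty
loss_l2 = data_loss + l2_penalty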
Ridge Regularization
[Plot: Salary vs. Experience]
Steep slope
[Plot: Salary vs. Experience]
• Assume lambda = 1
• Slope = 1.3
• Then cost = 0 + 1·(1.3)² = 1.69
• Assume lambda = 1
• Slope = 1.1
• Then cost = 0 + 1·(1.1)² = 1.21
Lasso Regression
• This helps in feature selection too, since the L1 penalty can shrink some coefficients all the way to zero
Dropout
• This is one of the most interesting types of regularization techniques.
• It also produces very good results and is consequently the most frequently used regularization technique in the field of deep learning
• To understand dropout, consider the following neural network structure
• At every iteration, it randomly selects some nodes and removes them along with all of their incoming and outgoing connections
• So each iteration has a different set of nodes and this
results in a different set of outputs. It can also be
thought of as an ensemble technique in machine
learning.
• Ensemble models usually perform better than a single
model as they capture more randomness. Similarly,
dropout also performs better than a normal neural
network model
• Due to these reasons, dropout is usually preferred
when we have a large neural network structure in
order to introduce more randomness.
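A minimal sketch of what dropout does to one layer's activations during training ("inverted dropout", which rescales the surviving nodes so the expected activation stays the same); the drop probability is a hyperparameter:

import numpy as np

def dropout(activations, drop_prob=0.5):
    keep_prob = 1.0 - drop_prob
    # randomly select the nodes to keep at this iteration
    mask = np.random.rand(*activations.shape) < keep_prob
    # dropped nodes output 0; survivors are rescaled by 1/keep_prob
    return activations * mask / keep_prob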
Early stopping
• In the above image (a plot of training and validation error over epochs), we will stop training at the dotted line, since after that our model will start overfitting on the training data
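A minimal sketch of early stopping with a patience counter: training stops once the validation loss has not improved for a fixed number of consecutive epochs. The per-epoch losses here are made-up numbers standing in for a real training loop:

val_losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.63, 0.64]  # hypothetical validation losses
best_loss, wait, patience = float("inf"), 0, 3

for epoch, val_loss in enumerate(val_losses):
    if val_loss < best_loss:
        best_loss, wait = val_loss, 0   # improvement: remember it, reset the counter
    else:
        wait += 1                       # no improvement this epoch
        if wait >= patience:
            print(f"stopping early at epoch {epoch}")
            break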
Why Training a Neural Network Is Hard
• Fitting a neural network involves using a training dataset to update
the model weights to create a good mapping of inputs to outputs.
• This training process is solved using an optimization algorithm that
searches through a space of possible values for the neural network
model weights for a set of weights that results in good performance
on the training dataset.
• We will discover the challenge of training a neural network framed as an optimization problem.
Session Outcome
• Training a neural network involves using an optimization algorithm to
find a set of weights to best map inputs to outputs.
• The problem is hard, not least because the error surface is non-convex and contains local minima, flat spots, and is highly multidimensional.
• The stochastic gradient descent algorithm is the best general
algorithm to address this challenging problem.
Learning as Optimization
• Deep learning neural network models learn to map inputs to outputs given a
training dataset of examples.
• The training process involves finding a set of weights in the network that
proves to be good, or good enough, at solving the specific problem.
• This training process is iterative, meaning that it progresses step by step with
small updates to the model weights each iteration and, in turn, a change in the
performance of the model each iteration.
• The iterative training process of neural networks solves an optimization problem that searches for parameters (model weights) that result in a minimum error or loss when evaluating the examples in the training dataset.
• Optimization is a directed search procedure and the optimization problem that
we wish to solve when training a neural network model is very challenging.
Optimization problems
• The optimization algorithm iteratively steps across
this landscape, updating the weights and seeking out
good or low elevation areas.
• For simple optimization problems, the shape of the
landscape is a big bowl and finding the bottom is easy,
• So easy that very efficient algorithms can be designed
to find the best solution.
• These types of optimization problems are referred to
mathematically as convex.
Optimization problems
• The error surface we wish to navigate when
optimizing the weights of a neural network is not a
bowl shape.
• It is a landscape with many hills and valleys.
• These types of optimization problems are referred to mathematically as non-convex.
Local Minima
• As the number of hidden layers is increased, the amount of error information propagated back to earlier layers is dramatically reduced.
• Weights in hidden layers close to the output layer are
updated normally, whereas weights in hidden layers
close to the input layer are updated minimally or not
at all.
• Generally, this problem prevented the training of very
deep neural networks and was referred to as the
vanishing gradient problem
• Pretraining:
• Add a new hidden layer to a model.
• Allow the newly added layer to learn the inputs from the existing hidden layer, keeping the weights for the existing hidden layers fixed.
• This gives the technique the name "layer-wise", as the model is trained one layer at a time.
• Greedy algorithm:
• Breaks the problem into many components, then solves for the optimal version of each component in isolation
• Pretraining is based on the assumption that it is easier to train a shallow network instead of a deep network, and contrives a layer-wise training process in which we are always only ever fitting a shallow model
Pre-training and fine tuning
• Using dataset A, train model M
• Pre-training:
– You have a dataset B
– Before training on B, initialize some of the parameters of M with the model trained on A
• Fine-tuning:
– You train M on B
• This is one form of transfer learning
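One possible sketch of this recipe using Keras; the saved model file, the dataset arrays x_B and y_B, and the choice to freeze all but the last layer are placeholders for illustration:

import tensorflow as tf

# pre-training: M's parameters come from a model already trained on dataset A
M = tf.keras.models.load_model("model_trained_on_A.h5")

# optionally freeze the earlier layers so only the top of the network adapts
for layer in M.layers[:-1]:
    layer.trainable = False

# fine-tuning: train M on dataset B
M.compile(optimizer="adam", loss="binary_crossentropy")
M.fit(x_B, y_B, epochs=5)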
• Training a deep structure is difficult due to high dependencies across layers' parameters, i.e. the relation between parts of pictures and pixels.
• To resolve this problem, two things are suggested:
• Adapting lower layers to feed good input to the upper layers
• Adjusting upper layers to make use of that end setting of lower layers
Greedy Algorithm
• Greedy algorithms break a problem into many
components, then solve for the optimal version of
each component in isolation
Gradient Descent Algorithm
• A gradient measures how much the output of a function changes if you change the inputs a little bit.
• In mathematical terms, a gradient is a partial
derivative with respect to its inputs.
• Gradient Descent is an optimization algorithm for
finding a local minimum of a differentiable function.
• Gradient descent is simply used to find the values of
a function's parameters (coefficients) that minimize a
cost function as far as possible.
• The lowest point on the parabola occurs at x = 1.
• The objective of the gradient descent algorithm is to find the value of "x" such that "y" is minimum
• "y" here is termed the objective function that the gradient descent algorithm operates upon, to descend to the lowest point
• Find the slope of the objective function with respect to
each parameter/feature. In other words, compute the
gradient of the function.
• Pick a random initial value for the parameters.
• Update the gradient function by plugging in the
parameter values.
• Calculate the step sizes for each feature as : step size =
gradient * learning rate.
• delta = - learning_rate * gradient
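Putting these steps together on a concrete objective; a minimal sketch assuming the parabola from the earlier slide is y = (x − 1)², whose gradient is dy/dx = 2(x − 1) and whose minimum is at x = 1:

x = 3.0                                 # pick a random initial value for the parameter
learning_rate = 0.1
for _ in range(100):
    gradient = 2 * (x - 1)              # slope of the objective at the current x
    delta = -learning_rate * gradient   # step size, opposite the gradient
    x += delta                          # x converges toward the minimum at 1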
The learning rate should be neither too high nor too low: too high and the steps overshoot the minimum, too low and convergence takes far too many iterations.
Downsides of the gradient descent algorithm
• Consider we have 10,000 data points and 10 features.
• We need to compute the derivative with respect to each feature for every data point: 10,000 * 10 = 100,000 computations per iteration.
• It is common to take 1,000 iterations, so in effect we have 100,000 * 1,000 = 100,000,000 computations to complete the algorithm.
• Hence gradient descent is slow on huge data
Stochastic Gradient Descent (SGD)
• SGD randomly picks one data point from the whole data set at each iteration to reduce the computations enormously.
• Mini-batch gradient descent tries to strike a balance between the goodness of gradient descent and the speed of SGD
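The three variants differ only in how many examples each update averages the gradient over; a minimal mini-batch sketch on made-up linear-regression data (batch_size = 1 would give SGD, batch_size = len(X) full gradient descent):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 10))        # 10,000 hypothetical points, 10 features
y = rng.normal(size=10000)
w = np.zeros(10)
learning_rate, batch_size = 0.01, 32

def grad(Xb, yb, w):
    # gradient of the mean squared error on one mini-batch
    return 2 * Xb.T @ (Xb @ w - yb) / len(Xb)

idx = rng.permutation(len(X))           # one epoch: visit the data in random order
for start in range(0, len(X), batch_size):
    batch = idx[start:start + batch_size]
    w -= learning_rate * grad(X[batch], y[batch], w)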
Momentum
• Momentum does some additional processing of the gradients to be faster and better
• In addition to the regular gradient, it also adds on the movement from the previous step
• sum_of_gradient = gradient + previous_sum_of_gradient * decay_rate
• delta = -learning_rate * sum_of_gradient
• theta += delta
• Momentum simply moves faster
• Momentum has a shot at escaping local minima (because the momentum may propel it out of a local minimum)
Adaptive Gradient Descent (AdaGrad)
• One disadvantage of the optimizers above is that the learning rate is constant for all parameters and for each cycle.
• AdaGrad changes the learning rate 'η' for each parameter and at every time step 't'.
• Instead of keeping track of the sum of gradients, AdaGrad keeps track of the sum of gradients squared and uses that to adapt the gradient in different directions.
• sum_of_gradient_squared = previous_sum_of_gradient_squared + gradient²
• delta = -learning_rate * gradient / sqrt(sum_of_gradient_squared)
• theta += delta
Written out, the per-parameter update is
θ_{t+1} = θ_t − (η / sqrt(G_t + εI)) · g_t
where
• θ is the parameter to be updated,
• η is the initial learning rate,
• ε is some small quantity used to avoid division by zero,
• I is the identity matrix,
• G_t is the sum of the squares of the gradients up to time-step t, and
• g_t is the gradient estimate in time-step t, g_t = ∇_θ J(θ_t).
Root Mean Square Propagation (RMSProp)
• AdaGrad is incredibly slow, because the sum of gradients squared only grows and never shrinks.
• RMSProp fixes this by adding a decay factor.
• sum_of_gradient_squared = previous_sum_of_gradient_squared *
decay_rate+ gradient² * (1- decay_rate)
• delta = -learning_rate * gradient / sqrt(sum_of_gradient_squared)
• theta += delta
Adam (Adaptive Moment Estimation)
• In addition to adapting the parameter learning rates based on the average of the second moments of the gradients (the uncentered variance), as in RMSProp, Adam also makes use of the average first moment (the mean), as in momentum.
• Specifically, the algorithm calculates an exponential moving average of the gradient and the squared gradient, and the parameters beta1 and beta2 control the decay rates of these moving averages.
• sum_of_gradient = previous_sum_of_gradient * beta1 + gradient * (1 - beta1) [Momentum]
• sum_of_gradient_squared = previous_sum_of_gradient_squared * beta2 + gradient² * (1 - beta2) [RMSProp]
• delta = -learning_rate * sum_of_gradient / sqrt(sum_of_gradient_squared)
• theta += delta
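The same update in runnable form; note that the published Adam algorithm also bias-corrects the two moving averages in the early steps, which the pseudocode above leaves out. The quadratic objective in the driver loop is a made-up example:

import numpy as np

def adam_step(theta, gradient, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * gradient         # moving average of the gradient (momentum)
    v = beta2 * v + (1 - beta2) * gradient ** 2    # moving average of the squared gradient (RMSProp)
    m_hat = m / (1 - beta1 ** t)                   # bias correction for early time steps
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 201):                            # minimize (theta - 1)^2
    g = 2 * (theta - 1)
    theta, m, v = adam_step(theta, g, m, v, t, lr=0.1)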