
Optimizers for Neural Networks

“Generalization is the ultimate goal of any machine learning algorithm”


Introduction

Two major types of problems that machine learning algorithms try to solve are:

• Classification — Predict the class of a given data point.

• Regression — Predict a continuous value for a given data point.


Deep Learning and Linear Regression
The neural network equation is:

Z = Bias + W1X1 + W2X2 + … + WnXn

Z denotes the output (weighted sum) computed by a neuron in the ANN.

Wi's are the weights or beta coefficients
Xi's are the independent variables or inputs
Bias = W0

Each neuron in the hidden layer has an equation like the one above, connecting the layers through the respective weights and bias terms. This is how the neuron values are computed and then passed on to the next layer.
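As an illustration, here is a minimal NumPy sketch of how a single neuron computes Z from its inputs, weights and bias; the numeric values are made up for the example.

```python
import numpy as np

# Illustrative values: three inputs, three weights and a bias (W0)
x = np.array([0.5, 1.2, -0.7])   # inputs X1..X3
w = np.array([0.8, -0.3, 0.1])   # weights W1..W3
bias = 0.25                      # W0

# Z = Bias + W1*X1 + W2*X2 + ... + Wn*Xn
z = bias + np.dot(w, x)
print(z)
```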
Deep Learning and Linear Regression
Steps involved in a Neural Network:

1. Take the input equation Z = W0 + W1X1 + W2X2 + … + WnXn and predict the output Y (Ypred).
2. Calculate the error - it tells how much the model deviates from the actual observed values and is computed as Ypred − Yactual.
• The error measure changes depending on whether the problem is regression or classification.
• Regression - RMSE, MAPE, MAE
• Classification - Binary Cross-Entropy cost function (a sketch of both follows after this list)
3. What is the end goal of any model?

• To minimize this error term!

• And how is this done in Neural Networks?
• The computed loss value is taken back to each layer, so that the weights are updated in a way that minimizes the loss.
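For concreteness, a small sketch of the two kinds of error measures mentioned above (RMSE for regression, binary cross-entropy for classification); the sample arrays are invented for illustration.

```python
import numpy as np

def rmse(y_actual, y_pred):
    # Root Mean Squared Error for regression
    return np.sqrt(np.mean((y_pred - y_actual) ** 2))

def binary_cross_entropy(y_actual, y_pred, eps=1e-12):
    # Binary cross-entropy for classification; clip to avoid log(0)
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_actual * np.log(y_pred) + (1 - y_actual) * np.log(1 - y_pred))

# Illustrative values only
print(rmse(np.array([3.0, 5.0]), np.array([2.5, 5.5])))
print(binary_cross_entropy(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.7])))
```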
Linear Regression
• Linear Regression - tries to predict a continuous value for a given data point by generalizing over the data in hand. The "linear" part indicates that a linear approach is used in generalizing over the data.

• Example: Predict the price of a house by knowing its size.

• Data in hand: some house sizes and their corresponding prices.


Linear Regression
• Charting the data and fitting a line among the points will look something like this:
• To generalize, a straight line is drawn such that it passes through, or close to, as many points as possible. Once you have that line, for a house of any size you just project that data point onto the line, which gives you the house price.

• We are done! Wait…


Linear Regression
The Real Problem

• The problem was never finding the house price!

• The problem was to find the best-fit line which generalizes well over the data.

• The same old line equation is used, y = mx + c, given a slightly more statistical look, with the terminology below added specific to Linear Regression modeling.
Linear Regression

• y — The value that you want to predict
• β₀ — The y-intercept of the line, i.e. where the line intersects the Y-axis
• β₁ — The slope or gradient of the line, i.e. how steep the line is
• x — The value of the data point
• u — The residual or noise caused by unexplained factors

Putting these together, the regression equation is y = β₀ + β₁x + u.
Linear Regression
Mathematics to the rescue!

• There are possibly an infinite number of lines - but we are never sure whether a given line is the best fit or not.
• Saviour: The Cost Function (J)
• Aim: Achieve the best-fit regression line to predict the 'y' value such that the error difference between the predicted value and the true value is minimal.
• Goal: Reduce the Cost function, which in turn improves the accuracy.

• Naive solution: Take parameter values by trial and error and calculate the MSE for each combination of parameters (see the sketch after this list).
• Not at all efficient!
• It seems there is a calculus solution to this problem!
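To see why trial and error does not scale, here is a rough sketch (with made-up house data) that evaluates the MSE cost J over a grid of candidate intercept and slope values; the grid ranges are arbitrary.

```python
import numpy as np

# Made-up house sizes (x) and prices (y)
x = np.array([50.0, 80.0, 120.0, 160.0])
y = np.array([150.0, 220.0, 330.0, 410.0])

best = None
# Brute-force search over candidate intercepts (b0) and slopes (b1)
for b0 in np.linspace(-50, 50, 101):
    for b1 in np.linspace(0, 5, 101):
        mse = np.mean((y - (b0 + b1 * x)) ** 2)   # cost J for this line
        if best is None or mse < best[0]:
            best = (mse, b0, b1)

print("best MSE, intercept, slope:", best)
```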
Gradient Descent
• Gradient descent (GD) is an iterative first-order optimization
algorithm used to find a local minimum/maximum of a given function.

• Various Types of Critical Points:


Gradient Descent

Function Requirements

Gradient descent algorithm does not work for all functions. There are
two specific requirements. A function has to be:

• Differentiable

• Convex
Gradient Descent
What does it mean when a function is differentiable?

Answer: It has a derivative at each point in its domain.

Twist: Not all functions meet this criterion.

(Examples: differentiable functions vs. non-differentiable functions)


Gradient Descent
Next requirement — The function has to be convex.

For a univariate function, this means that the line segment connecting any two points on the function lies on or above its curve and does not cross it.
If it does cross, it means the function has a local minimum which is not a global one.
Gradient Descent
Another way to check mathematically if a univariate function is convex
is to calculate the second derivative and check if its value is always
greater than 0.

More Calculus Ahead!

A quadratic function:

Its first and second derivatives are:

Because the second derivative is greater than zero, the function above is
strictly convex.
Gradient Descent
• Gradient - slope of a curve at a given point in a specified direction.
• For a univariate function, it is the first derivative of the function at
a selected point.
• In the case of a multivariate function, it is a vector of (partial)
derivatives in each direction (along variable axes).
• The gradient of an n-dimensional function f(x) at a given point p is defined as the vector of partial derivatives:
∇f(p) = [∂f/∂x1(p), ∂f/∂x2(p), …, ∂f/∂xn(p)]
Gradient Descent
Gradient at point p(10,10):

What do the gradient values mean?

The slope is twice as steep along the y-axis!


Gradient Descent Algorithm
Let’s try an analogy!

• Visualise the function as a valley.

• You are standing at some random point.
• Your aim is to reach the bottommost point of the valley.
• What would you do?
Gradient Descent Algorithm
Here is how Gradient Descent does it!

• It has to know which way the valley slopes (and it doesn't have eyes as you do), so it takes the help of mathematics here.

• To know the slope of a function at any point, differentiate the function with respect to its parameters at that point. Thus Gradient Descent differentiates the Cost function (J) and learns the slope at that point.

• To get to the bottommost point it has to move in the direction opposite to the slope, i.e. in the direction in which the function decreases.
Gradient Descent Algorithm

• It has to take small steps to move towards the bottommost point. Here, the learning rate decides the length of the step that gradient descent will take.

• After every move it checks whether the current position is the minimum or not. This is validated by the slope at that point: if the slope is zero, the algorithm has reached the bottommost point.

• After every step, it updates the parameters (or weights). By doing the above steps repeatedly it reaches the bottommost point.
Gradient Descent Algorithm

• Once the bottommost point of the valley is reached, it means the parameters corresponding to the lowest MSE or Cost function value have been obtained.

• The Linear Regression model is now ready for use, to predict the dependent variable of any unseen data point with high accuracy.
Steps in the Gradient Descent Algorithm
• Choose a starting point (initialisation): p0
• Calculate the gradient at the current point: ∇f(pn)
• Make a scaled step in the direction opposite to the gradient (objective: minimise): pn+1 = pn − η ∇f(pn)

• Repeat steps 2 and 3 until one of the criteria is met (a minimal sketch follows below):

• maximum number of iterations reached;

• step size is smaller than the tolerance (due to scaling or a small gradient).
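A minimal Python sketch of these steps, using the simple convex function f(p) = p² whose gradient is 2p; the learning rate, tolerance and iteration cap are illustrative values.

```python
def gradient_descent(grad, start, learning_rate=0.1, max_iters=1000, tol=1e-6):
    p = start
    for _ in range(max_iters):                 # stop: max iterations reached
        step = learning_rate * grad(p)         # scale the gradient by the learning rate
        if abs(step) < tol:                    # stop: step smaller than tolerance
            break
        p = p - step                           # move opposite to the gradient
    return p

# Example: f(p) = p**2, so grad f(p) = 2*p; minimum at p = 0
print(gradient_descent(lambda p: 2 * p, start=10.0))
```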
Learning Rate & Step Size
• There's an important parameter η which scales the gradient and thus controls the step size.

• In machine learning, it is called the learning rate and has a strong influence on the learned model's performance.

• The smaller the learning rate, the longer the Gradient Descent algorithm takes to converge, or the maximum number of iterations may be reached before the optimum point is found.

• If the learning rate is too big, the algorithm may not converge to the optimal point (it jumps around) or may even diverge completely.
Learning Process
• Output response from neuron k is yk(n)
• Desired (target) output is dk(n)
• Error signal: ek(n) = dk(n) − yk(n)
• Step-by-step adjustments to the synaptic weights of neuron k are continued until the system reaches a steady state. At that point the learning process is terminated.
• The objective is to minimize the cost function expressed in terms of the error signal, which leads to the delta rule:
Δwkj(n) = η ek(n) xj(n)
where η is the learning rate and wkj(n) denotes the weight of neuron k excited by input element xj(n) at time step n.
The updated value of the weight wkj(n) is
wkj(n+1) = wkj(n) + Δwkj(n)
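A hedged one-neuron sketch of the delta rule above, assuming a linear neuron; the inputs, target and learning rate are made up for illustration.

```python
import numpy as np

eta = 0.1                               # learning rate
w = np.array([0.2, -0.4])               # weights w_kj(n) of neuron k
x = np.array([1.0, 0.5])                # input signals x_j(n)
d = 1.0                                 # desired output d_k(n)

for n in range(10):
    y = np.dot(w, x)                    # neuron output y_k(n) (linear neuron assumed)
    e = d - y                           # error signal e_k(n) = d_k(n) - y_k(n)
    delta_w = eta * e * x               # delta rule: delta_w_kj(n) = eta * e_k(n) * x_j(n)
    w = w + delta_w                     # w_kj(n+1) = w_kj(n) + delta_w_kj(n)

print(w, e)
```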
Back Propagation Algorithm
The back propagation algorithm applies a correction term Δwkj(n) to the synaptic weight wkj(n), which is proportional to the partial derivative of the loss with respect to that weight:

Δwkj(n) = η · δk(n) · xj(n)

The gradient is the derivative of the loss function with respect to the weights.

Δwkj(n) → Weight correction
η → Learning rate parameter
δk(n) → Local gradient
xj(n) → Input signal of neuron j


Gradient Descent
• It is an algorithm to minimize a function by optimizing its parameters:
New value = Old value − Step size
Step size = learning rate × slope
The slope gives the direction to move in order to reach the global minimum.




Learning Rate & Step Size
How to make sure the selected
learning rate works properly?

• A good way to make sure the gradient descent algorithm runs properly is to plot the cost function as the optimization runs.
• This lets you view the value of the cost function after each iteration of gradient descent and provides a way to easily spot how appropriate the selected learning rate is.
• If the gradient descent algorithm is working properly, the cost function should decrease after every iteration.
• When gradient descent can't decrease the cost function anymore and the value remains more or less the same, it has converged.
• The number of iterations gradient descent needs to converge can vary a lot. It can take 50 iterations, 60,000 or maybe even 3 million, making the number of iterations to convergence hard to estimate in advance.
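A rough sketch of this diagnostic: record the cost after each gradient descent iteration and plot it (matplotlib is assumed to be available; the quadratic cost is just a stand-in for any differentiable cost function).

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in cost J(w) = (w - 3)^2 and its gradient 2*(w - 3)
w, eta = 10.0, 0.1
history = []
for _ in range(50):
    history.append((w - 3) ** 2)     # cost at the current iteration
    w -= eta * 2 * (w - 3)           # gradient descent step

plt.plot(history)                    # cost should decrease after every iteration
plt.xlabel("iteration")
plt.ylabel("cost J")
plt.show()
```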
Shortfalls of Gradient Descent
• There can be numerous hidden layers and neurons in a neural network.
• Weights are attached to every neuron in every layer, and updates propagate from the output layer all the way back to the original input layer.
• In such cases, the weights that are updated via Gradient Descent falter at two important places:

• Gradient Descent gets stuck at local minima

• The learning rate does not change in Gradient Descent


Challenge 1: Gradient Descent gets stuck at
Local Minima
• Consider the function:

• The weights are updated as:

Once stuck at the local point, the parameters don't update. A push is required to get out of it and move further to reach the global minimum!
Challenge 1: Gradient Descent gets stuck at
Local Minima
Solution: Stochastic Gradient Descent with Momentum

To get the ball out of the local minimum, we need to accumulate momentum or speed.
In a Neural Network, the accumulated speed is equivalent to a weighted sum of the past gradients and is represented as:

mt = β · mt-1 + (1 − β) · (dL/dw)

where,
dL/dw = current gradient at time t
mt-1 = previously accumulated gradient up to time t−1
β gives how much weightage is given to the current gradient versus the previously accumulated gradient.
Generally, 10 percent weightage is given to the current gradient and 90 percent to the previously accumulated gradient (i.e. β ≈ 0.9).
Challenge 1: Gradient Descent gets stuck at Local Minima
• At the local minimum (at the position of the full red ball), the slope dL/dw will be zero and the equation becomes mt = β · mt-1.

• This mt gives the required push to come out of the local minimum.
• mt then updates the weights, minimizing the cost function for the Neural Network, as wt+1 = wt − η · mt.
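A hedged sketch of the momentum update described above, with β = 0.9 as suggested by the 90/10 weighting; the starting values are illustrative only.

```python
def sgd_momentum_step(w, grad_w, m_prev, eta=0.01, beta=0.9):
    # Accumulate a weighted sum of past gradients (the momentum term)
    m = beta * m_prev + (1 - beta) * grad_w
    # Even when grad_w is 0 at a local minimum, m = beta * m_prev still pushes w along
    w_new = w - eta * m
    return w_new, m

# Illustrative use: one step with a current gradient of 0 but accumulated momentum
w, m = 1.5, 2.0
w, m = sgd_momentum_step(w, grad_w=0.0, m_prev=m)
print(w, m)
```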
Challenge 2: The learning rate does not change in
Gradient Descent
• Solution: RMSProp (Root Mean Squared Propagation)
• During the training process, the slope or gradient dL/dw changes. It is the rate of change of the loss function with respect to the parameter weight, and by using it we can adapt the learning rate.
• We take an exponentially weighted sum of the squared past gradients, i.e. of the squares of the partial derivatives dL/dw, as below:

Vt = β · Vt-1 + (1 − β) · (dL/dw)²
Challenge 2: The learning rate does not change in
Gradient Descent
• The new weights are updated as:

wt+1 = wt − (η / √(Vt + ε)) · (dL/dw)

• ε is a small error term. It is added to Vt so that the denominator does not become zero, and it is generally very small in value.
• When the square of the slopes, (dL/dw)², is high, it increases the value of Vt, which reduces the effective learning rate.
• Similarly, the value of Vt will be low when the square of the slopes is low, and this increases the effective learning rate.
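A hedged sketch of the RMSProp update described above; the η, β and ε values are common defaults rather than values taken from the slides.

```python
def rmsprop_step(w, grad_w, v_prev, eta=0.001, beta=0.9, eps=1e-8):
    # Accumulate an exponentially weighted sum of squared gradients
    v = beta * v_prev + (1 - beta) * grad_w ** 2
    # Large v (steep region) shrinks the effective learning rate, small v grows it
    w_new = w - eta * grad_w / ((v + eps) ** 0.5)
    return w_new, v

# Illustrative single step
w, v = 0.5, 0.0
w, v = rmsprop_step(w, grad_w=4.0, v_prev=v)
print(w, v)
```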
Challenge 2: The learning rate does not change in
Gradient Descent
• In the left panel, calculating the gradient at the first, topmost point gives a slope of high magnitude. This increases the square of the slope, which reduces the learning rate, so small steps are taken to minimize the loss.

• On the right side, the point on the loss function has a gradient of low magnitude, so the square of this gradient is also small. This increases the learning rate and hence bigger steps are taken towards the optimal solution.

• This is how RMSProp scales the learning rate depending on the square of the gradients, which eventually leads to faster convergence to the optimal solution.
Types of Gradient Descent
There are three popular types of gradient descent that mainly differ in the
amount of data they use:

BATCH GRADIENT DESCENT


• Also called vanilla gradient descent; it calculates the error for each example within the training dataset.
• Only after all training examples have been evaluated is the model updated. This whole process is like a cycle and is called a training epoch.
• Advantages
• Computational efficiency
• It produces a stable error gradient and a stable convergence.
• Disadvantages
• The stable error gradient can sometimes result in a state of convergence that isn’t the
best the model can achieve.
• It requires the entire training dataset to be in memory and available to the algorithm.
Types of Gradient Descent
STOCHASTIC GRADIENT DESCENT
It updates the parameters for each training example one by one.
Advantages
• Depending on the problem, this can make SGD faster than batch gradient
descent.
• The frequent updates allow a pretty detailed rate of improvement.
Disadvantages
• The frequent updates, however, are more computationally expensive than the
batch gradient descent approach.
• The frequency of those updates can result in noisy gradients, which may cause
the error rate to jump around instead of slowly decreasing.
Types of Gradient Descent
MINI-BATCH GRADIENT DESCENT
It splits the training dataset into small batches and performs an update for
each of those batches.
This creates a balance between the robustness of stochastic gradient descent
and the efficiency of batch gradient descent.
Common mini-batch sizes range between 50 and 256, but like any other
machine learning technique, there is no clear rule because it varies for
different applications.

This is the go-to algorithm when training a neural network and it is the most
common type of gradient descent within deep learning.
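A minimal sketch showing how the batch size controls which variant you get: batch size 1 gives stochastic gradient descent, a batch size equal to the dataset size gives batch gradient descent, and anything in between is mini-batch. The data and hyperparameters are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                # made-up features
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

w, eta, batch_size = np.zeros(3), 0.05, 32   # batch_size=1 -> SGD, =len(X) -> batch GD
for epoch in range(20):
    idx = rng.permutation(len(X))            # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)   # MSE gradient on this mini-batch
        w -= eta * grad                      # one update per mini-batch

print(w)
```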
Applications of Gradient Descent
Sales Driver Analysis — Linear Regression can be used to predict the sale of products in the future based
on past buying behaviour.

Predict Economic Growth — Economists use Linear Regression to predict the economic growth of a
country or state.

Score Prediction — Sports analysts use linear regression to predict the number of runs or goals a player will score in coming matches, based on previous performances.

Salary Estimation — An organisation can use linear regression to figure out how much they would pay to
a new joiner based on the years of experience.

House Price Prediction — Linear regression analysis can help a builder predict how many houses they would sell in the coming months and at what price.

Oil Price Prediction — Petroleum prices can be predicted using Linear Regression
Errors in Machine Learning

• The ultimate goal of any supervised machine learning problem is to find a model or function that predicts a target or label and minimizes the expected error over all possible inputs and labels.
• Minimizing error over all possible inputs means the function must be able to generalize and make accurate predictions on unseen inputs.

• The fundamental goal of machine learning is for the algorithm to generalize beyond the training set.
Errors in Machine Learning
• Irreducible errors are errors which will
always be present in a machine learning
model, because of unknown variables,
and whose values cannot be reduced.
• Reducible errors are those errors whose
values can be further reduced to
improve a model. They are caused
because our model’s output function
does not match the desired output
function and can be optimized.
What is Bias?
• To make predictions, our model analyzes our data and finds patterns in it. Using these patterns, we can make generalizations about certain instances in our data. After training, our model learns these patterns and applies them to the test set to make predictions.
• Bias is the difference between our actual and predicted values. Bias reflects the simple assumptions that our model makes about our data in order to be able to predict new data.
What is Bias?
• When the Bias is high, the assumptions made by our model are too basic and the model can't capture the important features of our data. This means that our model hasn't captured patterns in the training data and hence cannot perform well on the testing data either. If this is the case, our model cannot perform on new data and cannot be sent into production.
• This situation, where the model cannot find patterns in our training set and hence fails for both seen and unseen data, is called Underfitting.
• The below figure shows an example of Underfitting.
As we can see, the model has found no patterns in our
data and the line of best fit is a straight line that does
not pass through any of the data points. The model has
failed to train properly on the data given and cannot
predict new data either.
What is Variance?
• Variance is the very opposite of Bias. During
training, it allows our model to ‘see’ the data
a certain number of times to find patterns in
it. If it does not work on the data for long
enough, it will not find patterns and bias
occurs. On the other hand, if our model is
allowed to view the data too many times, it
will learn very well for only that data. It will
capture most patterns in the data, but it will
also learn from the unnecessary data present,
or from the noise.
• We can define variance as the model’s
sensitivity to fluctuations in the data. Our
model may learn from noise. This will cause
our model to consider trivial features as
important.
What is Variance?
• Hence, our model will perform really well on the training data and get high accuracy there, but will fail to perform on new, unseen data. New data may not have exactly the same features, and the model won't be able to predict it very well. This is called Overfitting.
Mathematical Insight into Bias & Variance
• The bias of an estimator is the "expected" difference between its estimates and the true values in the data:

Bias[g(x₀)] = E[g(x₀)] − f(x₀), where f(x₀) is the true value at x₀

• which is literally the difference between the expected value of the estimator at that point and the true value at that same point.
• Simpler models have a higher bias compared to more sophisticated models.
Mathematical Insight into Bias & Variance
• The variance of an estimator is the "expected" value of the squared difference between the estimate of a model and the "expected" value of the estimate.
• Suppose we train ∞ models using different sample sets of the data. Then at a test point x₀, the expected value of the estimate over all those models is E[g(x₀)]. Also, for any individual model out of all the models, the estimate of that model at that point is g(x₀). The difference between these two can be written as g(x₀) − E[g(x₀)]. Variance is the expected value of the square of this difference over all the models. Hence, the variance of the estimator at a test point x₀ can be written mathematically as:

Var[g(x₀)] = E[(g(x₀) − E[g(x₀)])²]
Mathematical Insight into Bias & Variance
• Bias and variance of an estimator are complementary to each other, i.e. an estimator with high bias will vary less (have low variance) and an estimator with high variance will have less bias (as it can vary more to fit/explain/estimate the data points).
Mathematical Relation Between Bias and Variance
• We define the estimator's error at a test point as the "expected" squared difference between the true value and the estimator's estimate.
• For a test point x₀, this error decomposes as:

Error(x₀) = σ² (variance of the noise) + Bias²[g(x₀)] + Var[g(x₀)]
Mathematical Relation Between Bias and
Variance

1. The error (and hence the accuracy) of the estimator at a test data sample can be decomposed into the variance of the noise in the data, the bias and the variance of the estimator. This implies that both bias and variance are sources of error of an estimator (an illustrative simulation follows below).

2. Bias and variance of an estimator are complementary to each other: increasing one of them means a decrease in the other, and vice versa.
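An illustrative simulation of this decomposition: fit many models on different samples of the same noisy process and estimate the bias² and variance of the predictions at a test point x₀; the data-generating function, sample sizes and model are invented for the demo.

```python
import numpy as np

rng = np.random.default_rng(1)
true_f = lambda x: np.sin(x)                # assumed true function
x0, sigma = 1.0, 0.3                        # test point and noise standard deviation

preds = []
for _ in range(500):                        # "infinitely many" models, approximated by 500
    x = rng.uniform(0, 3, size=30)
    y = true_f(x) + sigma * rng.normal(size=30)
    coeffs = np.polyfit(x, y, deg=1)        # a simple (high-bias) linear model
    preds.append(np.polyval(coeffs, x0))    # this model's estimate g(x0)

preds = np.array(preds)
bias_sq = (preds.mean() - true_f(x0)) ** 2  # (E[g(x0)] - f(x0))^2
variance = preds.var()                      # E[(g(x0) - E[g(x0)])^2]
print(bias_sq, variance, sigma ** 2)        # error ~ bias^2 + variance + noise variance
```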
Bias Variance Trade-off
• An estimator will have a high error if it has very high bias and low variance - when it is not able to adapt at all to the data points in a sample set.

• An estimator will also have a high error if it has very high variance and low bias - when it adapts too well to all the data points in a sample set and fails to generalize to other, unseen samples.

Bull's eye diagram for the Bias-Variance Trade-off:
Regularization
• Regularization helps select a midpoint between high bias and high variance.
• Regularization performs feature selection by shrinking the contribution of each feature.
• Types of Regularization:
• Ridge Regression (L2 norm)
• Lasso Regression (L1 norm)
• Elastic Net Regression (combination of L1 and L2 norms)
Ridge Regression
• Uses L2 regularization to impose a penalty on the size of the coefficients.
• Minimizes the residual sum of squares after penalization.
• The objective is to minimize (in the usual L2-penalized form):

Σᵢ (yᵢ − ŷᵢ)² + α Σⱼ wⱼ²

• α – regularization parameter
• α controls the size of the coefficients and the amount of regularization.
Lasso Regression
• Uses the L1 regularization technique as a penalty on the size of the coefficients.
• Minimizes the absolute values of the weights after penalization.
• The objective is to minimize (in the usual L1-penalized form):

Σᵢ (yᵢ − ŷᵢ)² + α Σⱼ |wⱼ|
Elastic Net Regression
• Combines Ridge (L2 regularizer) and Lasso (L1 regularizer) in order to train the model.
• The objective is to minimize (in the usual combined form):

Σᵢ (yᵢ − ŷᵢ)² + α₁ Σⱼ |wⱼ| + α₂ Σⱼ wⱼ²

• To select the best value for α, use cross-validation combined with the different regression techniques that aim to regularize the model.
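A hedged sketch using scikit-learn's Ridge, Lasso and ElasticNet estimators; the alpha values and data are placeholders, and in practice α would be chosen by cross-validation as noted above.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                       # made-up design matrix
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + 0.1 * rng.normal(size=100)

for model in (Ridge(alpha=1.0),                     # L2 penalty shrinks coefficients
              Lasso(alpha=0.1),                     # L1 penalty can drive some weights to 0
              ElasticNet(alpha=0.1, l1_ratio=0.5)): # mix of L1 and L2 penalties
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 2))
```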
Activation Function
• An activation function decides whether a neuron should be activated or not, based on the weighted sum of its inputs plus a bias term.
• Its purpose is to introduce non-linearity into the output of a neuron.
• A neural network without an activation function is just a linear regression model.
• The activation function applies a non-linear transformation to the input, enabling the network to learn and perform more complex tasks.
• A neural network can then learn any complex non-linear relationship between inputs and outputs.
• Activation functions are applied to the hidden and output layers, not to the input layer.


Different Activation Function
Step Function
➢The binary step function is a threshold-based activation function: above a certain threshold the neuron is activated, and below that threshold the neuron is deactivated.
➢This activation function can be used in binary classification.



Different Activation Function
Linear Activation Function
• The output is directly proportional to the weighted sum of the neuron's inputs. A linear activation function can deal with multiple classes, unlike the binary step function.
• The drawback of the linear activation function is that no matter how deep the neural network is, the last layer will always be a linear function of the first layer. This limits the neural network's ability to deal with complex problems.



Non-Linear Activation Functions
Sigmoid function (Logistic function)
➢Takes a probabilistic approach; the output ranges between 0 and 1.
➢It normalizes the output of each neuron.



Non-Linear Activation Functions
Tanh function (Hyperbolic tangent)
It is almost like the sigmoid function, but slightly better, since its output ranges between -1 and 1, allowing negative outputs.



Non-Linear Activation Functions
• Sigmoid(z) yields a value (a probability) between 0 and 1. Also, as the sigmoid is a non-linear function, the output of this unit is a non-linear function of the weighted sum of its inputs.
• The multiclass extension of the sigmoid is called Softmax, which is used for multiclass classification problems.
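A short NumPy sketch of the activation functions discussed in this section (binary step, linear, sigmoid, tanh and softmax); the threshold and sample inputs are illustrative.

```python
import numpy as np

def step(z, threshold=0.0):
    return np.where(z >= threshold, 1.0, 0.0)    # binary step: on/off around a threshold

def linear(z):
    return z                                     # output proportional to the input

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))              # squashes values into (0, 1)

def tanh(z):
    return np.tanh(z)                            # squashes values into (-1, 1)

def softmax(z):
    e = np.exp(z - np.max(z))                    # stabilised multiclass extension of sigmoid
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
print(step(z), sigmoid(z), tanh(z), softmax(z))
```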



• Regression: consists of 1 neuron

• Binary Classification: consists of 1 neuron

• Multi-label Classification: consists of 1 neuron per label

• Multi-class Classification: consists of 1 neuron per class in the output layer.



Design Issues-Neural Network
 Initial weights
 Transfer function (How the inputs and the weights are
combined to produce output?)
 Error estimation
 Weights adjusting
 Number of neurons
 Data representation
 Size of training set
Thank You!
