PCA and Convex Optimization and Bias, Variance-2

Convex Optimization

Convex Optimization
• Convex optimization is a powerful tool for solving optimization problems in various fields such as finance,
engineering, and machine learning.

• In a convex optimization problem, the goal is to find a point that minimizes a convex objective
function over a convex feasible set.

• Linear functions are convex, so linear programming problems are convex problems.

• A convex function is a function whose graph curves upward, meaning that the line segment
connecting any two points on the graph lies on or above the graph itself.

• Convex optimization is critical in training machine learning models, which involves finding the optimal
parameters that minimize a given loss function. In machine learning, convex optimization underlies
problems such as linear regression, logistic regression, and support vector machines; neural-network
training, by contrast, is generally non-convex, although the same gradient-based machinery is used.
Convex Optimization
• A real-valued function is called convex if the line segment
between any two distinct points on the graph of the function
lies on or above the graph between the two points.
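The definition above can be checked numerically. The sketch below is an illustrative Python example (not from the slides): it tests the segment-above-graph condition for f(x) = x² at many points between a and b.

```python
# Numerically check the convexity definition for f(x) = x**2:
# f(t*a + (1-t)*b) <= t*f(a) + (1-t)*f(b) for all t in [0, 1].
def f(x):
    return x * x

def is_convex_on_segment(f, a, b, steps=100):
    for i in range(steps + 1):
        t = i / steps
        point = t * a + (1 - t) * b          # point between a and b
        chord = t * f(a) + (1 - t) * f(b)    # height of the line segment
        if f(point) > chord + 1e-12:         # graph must lie on or below the chord
            return False
    return True

print(is_convex_on_segment(f, -3.0, 5.0))            # True: x**2 is convex
print(is_convex_on_segment(lambda x: -x * x, -1, 1))  # False: -x**2 is not
```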
Convex Optimization

Optimization problem in standard form

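The slide's figure is not reproduced here; the standard form it most likely showed is the usual one, where f₀ and the fᵢ are convex functions and the hⱼ are affine:

  minimize    f₀(x)
  subject to  fᵢ(x) ≤ 0,   i = 1, …, m
              hⱼ(x) = 0,   j = 1, …, p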

Gradient-Based Optimization

Dr. Selva Kumar S (SCOPE)


Gradient-Based Optimization
• Optimization refers to the task of either minimizing or maximizing some function
f(x) by altering x.

• Optimization problems are usually stated in terms of minimizing f(x).

• Maximization may be accomplished via a minimization algorithm by minimizing −f (x)

• The function we want to minimize or maximize is called the objective function or
criterion. When we are minimizing it, we may also call it the cost function, loss
function, or error function.

• The value that minimizes or maximizes a function is denoted with a superscript ∗;

• for example, we might write x∗ = arg min f(x).
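As a small illustration of the arg min notation (a hypothetical example, not from the slides), a coarse grid search over f(x) = (x − 3)² recovers the minimizer x∗ = 3:

```python
# Hypothetical example: f(x) = (x - 3)**2 has its minimizer at x* = 3.
def f(x):
    return (x - 3) ** 2

# Coarse grid search for x* = arg min f(x) over [-10, 10].
xs = [i / 10 for i in range(-100, 101)]
x_star = min(xs, key=f)
print(x_star)  # 3.0
```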



Gradient Descent

• Gradient descent is one of the most commonly used optimization
algorithms for training machine learning models.

• Gradient descent is also used to train Neural Networks.

• It minimizes errors between actual and expected results.



Gradient Descent Cont’d



Gradient Descent Cont’d

The Formula of the Gradient Descent Algorithm:

• The gradient is a vector whose entries are the partial
derivatives of a function.
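The update rule can be sketched in a few lines of Python (an illustrative example; the function f(x) = (x − 3)², the step size, and the iteration count are assumptions chosen here):

```python
# Minimal gradient descent on f(x) = (x - 3)**2, whose gradient is 2*(x - 3).
def grad(x):
    return 2 * (x - 3)

x = 0.0            # initial guess
eta = 0.1          # learning rate
for _ in range(100):
    x = x - eta * grad(x)   # the gradient descent update rule

print(round(x, 4))  # close to the minimizer x = 3
```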
Understanding Gradient Descent



Learning rate Difference

(a) Large learning rate, (b) Small learning rate, (c) Optimum learning rate
Learning rate Difference
So the important points to remember are

•Positive derivative -> decrease x

•Negative derivative -> increase x

•Large absolute derivative -> large step

•Small absolute derivative -> small step
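The effect of the learning rate can be seen numerically. The sketch below (illustrative Python; the rates and step count are assumptions) runs gradient descent on f(x) = x², whose gradient is 2x, with three different learning rates:

```python
# Compare step behavior for different learning rates on f(x) = x**2 (grad = 2x).
def run(eta, steps=20, x=1.0):
    for _ in range(steps):
        x = x - eta * 2 * x
    return x

print(run(0.01))   # small eta: slow progress, x still far from 0
print(run(0.4))    # near-optimum eta: converges quickly
print(run(1.1))    # too-large eta: x overshoots and diverges
```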



Global Minimum
In the case of the linear regression model, there is only one minimum, and it is the global
minimum.

Which local minimum is reached depends on the initial coefficients taken into
consideration. Here, points A and B are local minima and point C is the
global minimum.



Different Types of Gradient Descent Algorithms
• Batch gradient descent: When the weight update is calculated based on all
examples in the training dataset, it is called batch gradient descent.

• Stochastic gradient descent: When the weight update is calculated
incrementally after each training example (or a small group of training
examples), it is called stochastic gradient descent.

• Mini-batch gradient descent is a gradient descent modification that divides
the training dataset into small batches that are used to compute model error
and update model coefficients.
Issues that might occur
• When training a deep neural network with gradient descent and
backpropagation, we calculate the partial derivatives by moving across
the network from the final output layer to the initial layer.

• With the chain rule, layers that are deeper in the network go through
continuous matrix multiplications to compute their derivatives.

• Due to this process, vanishing gradients, exploding gradients, and saddle
points occur.
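The effect of those repeated multiplications can be shown with a toy sketch (illustrative Python; using one constant per-layer derivative is an assumption made purely for illustration):

```python
# The chain rule multiplies per-layer derivatives; repeated factors below 1
# shrink the gradient (vanishing), factors above 1 blow it up (exploding).
def backprop_factor(per_layer_derivative, num_layers):
    product = 1.0
    for _ in range(num_layers):
        product *= per_layer_derivative
    return product

print(backprop_factor(0.5, 30))   # ~9e-10: vanishing gradient
print(backprop_factor(1.5, 30))   # ~1.9e5: exploding gradient
```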



Saddle point
• A saddle point injects confusion into the learning process.

• Learning of the model becomes slow near it.

• A saddle point is a critical point where the gradient is zero, yet it is neither a minimum nor a maximum of the cost.

• Saddle points become common when gradient descent operates in many dimensions.



Solutions
• Changing the architecture

• This solution can be used for both the exploding and vanishing gradient problems, but it requires a
good understanding of the likely outcomes of the change.

• For example, if we reduce the number of layers in our network, model complexity is reduced.

• Gradient Clipping for Exploding Gradients

• Carefully monitoring and limiting the size of the gradients whilst our model trains is yet another
solution. This requires some deep knowledge of how the changes could impact the overall
performance.

• Careful Weight Initialization

• A more careful initialization of the model parameters for our network is a partial solution since
it does not solve the problem completely.
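Gradient clipping by global norm can be sketched as follows (an illustrative Python example, not tied to any particular framework; the name `clip_by_norm` and the threshold are choices made here):

```python
import math

# Sketch of gradient clipping by global norm: if the gradient vector's norm
# exceeds max_norm, rescale it so the norm equals max_norm.
def clip_by_norm(grads, max_norm):
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grads]
    return grads

print(clip_by_norm([3.0, 4.0], 1.0))  # norm 5 -> rescaled to [0.6, 0.8]
print(clip_by_norm([0.3, 0.4], 1.0))  # norm 0.5 -> unchanged
```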



Limitations
• Good generalization requires a large training set, which comes
with a huge computational cost.

• That is, as the training set grows to billions of examples, the time taken to
compute a single gradient step becomes long.



Choosing a Gradient Descent Variant



Batch Gradient Descent



Batch Gradient Descent Cont’d
• In batch gradient descent, we use all our training data in a single
iteration of the algorithm.

• So, we first pass all the training data through the network and
compute the gradient of the loss function for each sample. Then,
we take the average of the gradients and update the parameters
using the computed average.
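The procedure above can be sketched for one-dimensional linear regression (illustrative Python; the dataset, learning rate, and iteration count are assumptions made for the example):

```python
# Batch gradient descent for 1-D linear regression y ~ w*x (no bias),
# with mean squared error; the gradient is averaged over ALL samples.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # true relationship: y = 2x

w, eta = 0.0, 0.05
for _ in range(200):
    # average gradient of (w*x - y)**2 over the whole dataset
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= eta * grad

print(round(w, 3))  # converges toward 2.0
```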



Stochastic Gradient Descent
• Stochastic gradient descent (SGD) is a variant of gradient descent that saves
both time and memory while still searching for an optimal solution.

• The process takes one randomly chosen training example, computes the gradient,
and updates the parameters before moving on to the next random example.

• However, because it takes and iterates one example at a time, it tends to
result in noisier updates than we would normally like.
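The one-example-at-a-time loop can be sketched on the same toy regression problem (illustrative Python; the dataset, learning rate, seed, and step count are assumptions):

```python
import random

# Stochastic gradient descent: update from ONE randomly chosen sample at a time.
random.seed(0)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # y = 2x

w, eta = 0.0, 0.02
for _ in range(500):
    i = random.randrange(len(xs))           # pick one random example
    grad = 2 * (w * xs[i] - ys[i]) * xs[i]  # gradient from that example only
    w -= eta * grad

print(round(w, 2))  # noisy path, but ends near 2.0
```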



Stochastic Gradient Descent



One sample will be used



Mini-Batch Gradient Descent
• Instead of going through the complete dataset or choosing one random
example, mini-batch gradient descent divides the entire dataset into
randomly picked batches and optimizes over each in turn.

• A mini-batch is a fixed number of training examples that is smaller than the
full dataset. So, in each iteration, we train the network on a different group
of samples until all samples of the dataset are used.
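The shuffle-and-batch loop can be sketched as follows (illustrative Python; the dataset, batch size, learning rate, and epoch count are assumptions):

```python
import random

# Mini-batch gradient descent: shuffle the data, split it into small batches,
# and update once per batch using the batch-average gradient.
random.seed(0)
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2 * x for x in xs]    # y = 2x
batch_size = 2

w, eta = 0.0, 0.03
for epoch in range(100):
    order = list(range(len(xs)))
    random.shuffle(order)                     # new random batches each epoch
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]
        grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
        w -= eta * grad

print(round(w, 3))  # approaches 2.0
```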



Mini Batch Gradient



An issue with gradient descent is accidentally getting stuck in a local minimum, where our loss
can still be HUGE.
Momentum
• Momentum adds to gradient descent by considering previous gradients
(the slope of the hill before the ball's current position).

• So in the previous case, instead of stopping when the gradient is 0 at the
first local minimum, momentum will continue to move the ball forward
because it takes into consideration how steep the slope before it was.



Momentum Cont’d
• Momentum is all about speeding up and smoothing the process of
gradient descent.

• Notice how the ball speeds up after steeper slopes.

• That's momentum taking previous steep gradients into consideration and
convincing itself to continue moving, regardless of the local minimum.

• Momentum is a good way to prevent getting stuck in local minima.

• Since momentum constantly considers previous gradients, we can say
that momentum computes a moving average of past gradients.
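A common form of the momentum update keeps a velocity term that accumulates past gradients (illustrative Python; the test function, learning rate, and momentum coefficient are assumptions chosen for the sketch):

```python
# Momentum sketch: the velocity accumulates a decaying sum of past gradients,
# letting the update keep moving through shallow or flat regions.
def grad(x):
    return 2 * (x - 3)   # gradient of f(x) = (x - 3)**2

x, v = 0.0, 0.0
eta, beta = 0.05, 0.9    # learning rate and momentum coefficient
for _ in range(200):
    v = beta * v + grad(x)   # moving average of gradients
    x = x - eta * v          # step along the accumulated velocity

print(round(x, 3))  # close to the minimizer x = 3
```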
