PCA and Convex Optimization and Bias, Variance-2
Convex Optimization
• Convex optimization is a powerful tool for solving optimization problems in various fields such as finance,
engineering, and machine learning.
• In a convex optimization problem, the goal is to find a point that minimizes a convex objective
function (or, equivalently, maximizes a concave one) over a convex feasible set.
• Linear functions are convex, so linear programming problems are convex problems.
• A convex function is a function whose graph never curves downward, which means that the line segment
connecting any two points on the graph lies on or above the graph itself.
• Convex optimization is critical in training machine learning models, which involves finding the optimal
parameters that minimize a given loss function (a minimal sketch follows below). In machine learning, convex
optimization is used to solve problems such as linear regression, logistic regression, and support vector
machines; training neural networks, by contrast, is generally a non-convex problem.
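As a small illustration (my own sketch, not code from the slides), ordinary least-squares regression is a convex problem, so gradient descent on its loss reaches the global minimum from any starting point; the data and learning rate below are assumptions chosen only for the example.

```python
# Minimal sketch (assumed data and learning rate): least squares is convex, so
# gradient descent converges to the global minimum from any initialization.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                 # synthetic features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=200)   # noisy targets

w = np.zeros(3)                               # any starting point works here
lr = 0.1
for _ in range(500):
    grad = (2 / len(y)) * X.T @ (X @ w - y)   # gradient of the mean squared error
    w -= lr * grad

print(w)                                      # close to true_w
```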
• A real-valued function is called convex if the line segment
between any two distinct points on the graph of the function
lies on or above the graph between the two points.
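Stated formally (the standard inequality form of this same definition), a function f is convex when

```latex
f\bigl(\lambda x + (1-\lambda)\,y\bigr) \;\le\; \lambda f(x) + (1-\lambda) f(y)
\qquad \text{for all } x, y \in \operatorname{dom} f \text{ and } \lambda \in [0,1].
```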
Figure: gradient descent behaviour with (a) a large learning rate, (b) a small learning rate, and (c) the optimum learning rate.
Learning rate Difference
So the important points to remember are: the local minimum reached depends on the initial coefficients
chosen. In the accompanying figure, points A and B are local minima and point C is the global minimum;
the sketch after this paragraph shows the same effect.
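The sketch below (a toy one-dimensional function of my own choosing, not one from the slides) shows that the same gradient-descent routine ends up in either a local or the global minimum depending only on where it starts.

```python
# Minimal sketch (assumed toy function): f(x) = x**4 - 3*x**2 + x has one local
# and one global minimum; the starting point decides which one we reach.
def f(x):
    return x**4 - 3 * x**2 + x

def grad(x):
    return 4 * x**3 - 6 * x + 1

def descend(x, lr=0.01, steps=2000):
    for _ in range(steps):
        x -= lr * grad(x)   # plain gradient descent step
    return x

print(descend(2.0))    # ~  1.13 -> stuck in the local minimum
print(descend(-2.0))   # ~ -1.30 -> reaches the global minimum
```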
• With the chain rule, layers that are deeper in the network have their derivatives computed through
repeated matrix multiplications; when those factors are consistently small or large, the gradients
shrink (vanish) or grow (explode) exponentially.
• A saddle point is a critical point where the cost is a maximum along some directions and a minimum
along others, so reaching it does not mean the cost has actually been minimized.
• Saddle points become the main concern when gradient descent works in many dimensions.
• Reducing the model's complexity is a solution that can be applied to both the exploding and the
vanishing gradient problems, but it requires a good understanding of how the change affects the outcome.
For example, if we reduce the number of layers in our network, model complexity is reduced.
• Carefully monitoring and limiting the size of the gradients while the model trains, commonly known as
gradient clipping, is yet another solution; it requires some knowledge of how the change could impact
overall performance (see the sketch after this list).
• A more careful initialization of the model parameters for our network is a partial solution since
it does not solve the problem completely.
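As an illustration of the gradient-limiting idea above, the sketch below clips a gradient by its L2 norm before the parameter update; the threshold and the toy gradient values are assumptions, not numbers from the slides.

```python
# Minimal sketch (assumed values): clip a gradient by its L2 norm before updating,
# a common way to keep exploding gradients under control.
import numpy as np

def clip_by_norm(grad, max_norm=1.0):
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)     # rescale so the norm equals max_norm
    return grad

params = np.zeros(3)
raw_grad = np.array([30.0, -40.0, 5.0])     # an "exploding" gradient (toy value)
lr = 0.1
params -= lr * clip_by_norm(raw_grad)       # the update stays bounded
print(params)
```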
• In batch gradient descent, as the training set grows to billions of examples, the time taken for a
single gradient step becomes very long.
• This is because we first pass all the training data through the network and
compute the gradient of the loss function for each sample; then we
take the average of the gradients and update the parameters
using that average.
• Stochastic gradient descent (SGD) instead takes one randomly chosen training example, computes its
gradient and updates the parameters, then moves on to the next random example.
• A mini-batch is a fixed number of training examples that is smaller than the full dataset. So, in each
iteration, we train the network on a different group of samples until every sample of the dataset has
been used (see the sketch below).
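The sketch below contrasts the three update rules on a toy linear-regression problem; the data, batch size, and learning rate are assumptions chosen only for illustration.

```python
# Minimal sketch (assumed data and linear model): the same squared-error gradient,
# computed over the full batch, a single random sample, or a mini-batch.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.05 * rng.normal(size=1000)

def gradient(w, Xb, yb):
    # gradient of the mean squared error on the batch (Xb, yb)
    return (2 / len(yb)) * Xb.T @ (Xb @ w - yb)

w = np.zeros(5)
lr = 0.05

# Batch gradient descent: one update per pass over all samples.
w -= lr * gradient(w, X, y)

# Stochastic gradient descent: one update per randomly chosen sample.
i = rng.integers(len(y))
w -= lr * gradient(w, X[i:i + 1], y[i:i + 1])

# Mini-batch gradient descent: one update per small random group of samples.
idx = rng.choice(len(y), size=32, replace=False)
w -= lr * gradient(w, X[idx], y[idx])
```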