
Module III
Gradient Descent (GD), Momentum Based GD, Nesterov Accelerated GD, Stochastic GD,
AdaGrad, RMSProp, Adam, Eigenvalues and eigenvectors, Eigenvalue Decomposition, Basis,
Principal Component Analysis and its interpretations, Singular Value Decomposition.
Autoencoders and relation to PCA, Regularization in autoencoders, Denoising autoencoders,
Sparse autoencoders, Contractive autoencoders
Gradient Descent:
• Optimization refers to the task of either minimizing or maximizing
some function f (x) by altering x.
• We usually phrase most optimization problems in terms of minimizing f(x).
• The function we want to minimize or maximize is called the objective
function or criterion.
• When we are minimizing it, we may also call it the cost function, loss
function, or error function.
• We often denote the value that minimizes or maximizes a function with a superscript ∗.
For example, we might write x∗ = argmin f(x).
Gradient Descent:

An illustration of how the gradient descent algorithm uses the derivatives of a function to follow the
function downhill to a minimum.
Gradient Descent
• When f′(x) = 0, the derivative provides no information about which direction to move.
• Points where f′(x) = 0 are known as critical points or stationary points.
• A local minimum is a point where f(x) is lower than at all neighboring points, so it is no longer possible to decrease f(x) by making infinitesimal steps.
• A local maximum is a point where f(x) is higher than at all neighboring points.
• Some critical points are neither maxima nor minima. These are known as saddle points.
Gradient Descent

Examples of each of the three types of critical points in 1-D.


A critical point is a point with zero slope. Such a point can either be a local minimum,
which is lower than the neighboring points,
a local maximum, which is higher than the neighboring points,
or a saddle point, which has neighbors that are both higher and lower than the point itself.
Gradient Descent
• A point that attains the absolute lowest value of f(x) is a global minimum.
• It is possible for there to be only one global minimum or multiple global minima of the function.
• It is also possible for there to be local minima that are not globally optimal.
• In the context of deep learning, we optimize functions that may have many local minima
that are not optimal, and many saddle points surrounded by very flat regions.
• All of this makes optimization very difficult, especially when the input to
the function is multidimensional.
Gradient Descent
Gradient Descent:
• Gradient Descent is an optimization algorithm used to minimize a
function by iteratively moving towards the steepest descent
direction, which is indicated by the negative of the gradient.
• Objective Function: The function you want to minimize (e.g., loss
function in machine learning).
• Gradient: The vector of partial derivatives of the function, indicating
the direction of steepest ascent.
• Learning Rate (α): A hyperparameter that determines the step size
for each iteration.
Gradient Descent:
• 1) initialize the parameters (e.g., 𝜃).
• 2) Compute the Gradient of the function at the current parameter
values.
• 3) Update the Parameters
• θ=θ−α∇J(θ)
• 4) Repeat steps 2-3 until convergence
• (i.e., when the change in the function value is smaller than a
threshold).
Gradient Descent:
• θ=θ−α∇J(θ)
• Example: minimizing the quadratic function
• J(θ) = θ² + 4θ + 4 = (θ + 2)²
• The minimum occurs at θ = −2.
• Compute the gradient:
• ∇J(θ) = d/dθ (θ² + 4θ + 4) = 2θ + 4
Gradient Descent:
• 2. Set initial parameters
• Initial parameter: θ = 5
• Learning rate (α): 0.1
• 3. Perform iterations
• Iteration 1: ∇J(5) = 2·5 + 4 = 14; θ = 5 − 0.1·14 = 5 − 1.4 = 3.6
• Iteration 2: ∇J(3.6) = 2·3.6 + 4 = 11.2; θ = 3.6 − 0.1·11.2 = 2.48
• Iteration 3: ∇J(2.48) = 2·2.48 + 4 = 8.96; θ = 2.48 − 0.1·8.96 = 1.584
• Iteration 4: ∇J(1.584) = 2·1.584 + 4 = 7.168; θ = 1.584 − 0.1·7.168 = 0.8672
• Iteration 5: ∇J(0.8672) = 2·0.8672 + 4 = 5.7344; θ = 0.8672 − 0.1·5.7344 = 0.29376

Gradient descent iteratively updates the parameter to minimize the cost function; the sequence of values θ = 3.6, 2.48, 1.584, 0.8672, 0.29376 converges toward the minimum at θ = −2 (reproduced in the sketch below).
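As a quick check, here is a minimal Python sketch (no external libraries; the values match the worked example above) that reproduces these five iterations:

```python
# Gradient descent on J(theta) = theta^2 + 4*theta + 4 = (theta + 2)^2
def grad_J(theta):
    return 2 * theta + 4            # dJ/dtheta

theta = 5.0                         # initial parameter
alpha = 0.1                         # learning rate

for t in range(1, 6):
    g = grad_J(theta)               # step 2: compute the gradient
    theta = theta - alpha * g       # step 3: update the parameter
    print(t, round(theta, 5))       # 3.6, 2.48, 1.584, 0.8672, 0.29376
```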
The Challenges with Gradient Descent
• The primary challenge in optimizing deep learning models is that we are
forced to use minimal local information to infer the global structure of the
error surface.
• This is a hard problem because there is usually very little correspondence
between local and global structure.
• For nonconvex surface, which means that even if we find a valley (a local
minimum), we have no idea if it’s the lowest valley on the map
(the global minimum).
The Challenges with Gradient Descent

Mini-batch gradient descent may aid in escaping shallow local minima, but often fails when dealing with
deep local minima, as shown
The Challenges with Gradient Descent
• One observation about deep neural networks is that their error surfaces are guaranteed
to have a large, and in some cases infinite, number of local minima.
• Local minima are only problematic when they are spurious.
• A spurious local minimum incurs a higher error than the configuration at the global minimum.
• If these kinds of local minima are common, we quickly run into significant problems while
using gradient-based optimization methods, because we can only take local structure into account.
The Challenges with Gradient Descent
• Loss Landscape in Deep Networks:
• Deep neural networks have complex loss landscapes due to their high
dimensionality and non-linearity.
• This complexity can result in many local minima.
• In high-dimensional spaces, the chance of encountering a local minimum is
higher, and many of these minima can be spurious.
• Characteristics:
• Spurious local minima are not necessarily the worst possible minima, but they
can be worse than the global minimum or other more optimal minima.
• These minima can trap optimization algorithms, preventing them from
finding the best solution.
• Impact on Training:
• Getting stuck in a spurious local minimum can affect the performance
of the model, leading to poorer generalization on unseen data.
• The model might converge to a suboptimal set of parameters,
reducing its effectiveness.
• Strategies to Mitigate Spurious Local Minima
• Advanced Optimization Algorithms:
• Stochastic Gradient Descent (SGD) and its variants (e.g., Adam, RMSprop)
introduce noise into the gradient updates, which can help escape local
minima.
• Momentum-based methods can help by providing a form of inertia that
might push the optimizer out of local minima.
• Regularization Techniques:
• Techniques like Dropout or weight decay can help generalize better
and avoid overfitting to local minima.
• Network Architecture:
• Batch Normalization (normalizes the outputs of each layer by
adjusting and scaling the activations.)
• and Residual Networks (ResNets) can help by smoothing the
optimization landscape.
• Skip connections and deeper architectures can also help in reducing
the chances of getting stuck in local minima.
• Learning Rate Strategies:
• Using a learning rate schedule that gradually reduces the learning rate
can help in fine-tuning the model and avoiding local minima.
• Initialization Strategies:
• Proper initialization of weights can help start the optimization process
in a region that is less likely to trap the model in poor local minima.
• Ensemble Methods:
• Training multiple models and combining their predictions can
sometimes mitigate the impact of spurious local minima, as different
models might end up in different minima.
• Hyperparameter Tuning:
• Careful tuning of hyperparameters (e.g., learning rate, batch size) can
impact the optimization process and potentially help in avoiding
spurious minima.
• Flat Regions: These are areas in the loss
landscape where the gradient (or change in
the loss function) is very small.
• This means the surface is relatively flat in
these regions.
• Impact of Flat Regions:
• Gradient Magnitude: The gradient magnitude is very small, which makes it difficult for
gradient-based optimization algorithms such as SGD to make significant progress.
• Slow Convergence: Because the gradient is small, updates are tiny, leading to slow
convergence or stagnation.
When Gradient Points in the Wrong Direction

• Local Minima or Maxima:


• With non-convex functions, the gradient might point toward a direction that leads to a
local minimum or maximum rather than the global one.
• This means that even though the gradient gives the locally best direction, following it
might not lead to the global optimum.
When Gradient Points in the Wrong Direction

• Gradient Descent Algorithm


Issues:
• A too-large step size might cause overshooting of the minimum, divergence, or
oscillation back and forth.
• A too-small step size might lead to very slow convergence.
• This can sometimes make it seem like the gradient is pointing in the wrong direction.
When Gradient Points in the Wrong Direction
• Scaling and Normalization:
• Make sure that the input features are
appropriately scaled or normalized.
• Sometimes gradients can behave
unexpectedly if the feature scales are very
different from each other.
• Stochasticity:
• In stochastic gradient descent, the cost
function is evaluated at random samples
from the data set.
• This introduces randomness into the
algorithm, making converging to a global
minimum more difficult.
Circular Contour
• J(x, y) = x² + y²
• Gradient Direction: At any point on a circular contour, the gradient
will always point radially outward from the center
• ∇J(x,y)=(2x,2y)
• The gradient points away from the origin (the local minimum at (0,0))
• Moving Towards the Minimum: If the contours are circular, the
gradient does not point towards the local minimum;
• instead, it points away from it.
• To find the local minimum, you would need to move in the opposite
direction of the gradient
Circular Contour
• Gradient Descent: Therefore, in the context of gradient descent, you
would update your parameters by moving against the gradient (i.e.,
towards the local minimum),
• θ=θ−α∇J(θ)

Example
• J(x, y) = x² + y²
• ∇J(x,y)=(2x,2y)
Circular Contour
• Initial point: (x, y) = (3, 4); learning rate (α): 0.1
• Iteration 1: ∇J(3, 4) = (2·3, 2·4) = (6, 8); x_new = 3 − 0.1·6 = 2.4, y_new = 4 − 0.1·8 = 3.2; new point: (2.4, 3.2)
• Iteration 2: ∇J(2.4, 3.2) = (4.8, 6.4); x_new = 2.4 − 0.48 = 1.92, y_new = 3.2 − 0.64 = 2.56; new point: (1.92, 2.56)
• Iteration 3: ∇J(1.92, 2.56) = (3.84, 5.12); x_new = 1.92 − 0.384 = 1.536, y_new = 2.56 − 0.512 = 2.048; new point: (1.536, 2.048)
• Iteration 4: ∇J(1.536, 2.048) = (3.072, 4.096); x_new = 1.536 − 0.3072 = 1.2288, y_new = 2.048 − 0.4096 = 1.6384; new point: (1.2288, 1.6384)

These iterations are reproduced in the sketch below.
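A short NumPy sketch of the same iterations, using the starting point (3, 4) and α = 0.1 from the example:

```python
import numpy as np

# Gradient descent on J(x, y) = x^2 + y^2 (circular contours)
def grad_J(p):
    return 2 * p                        # gradient (2x, 2y) points radially outward from the origin

p = np.array([3.0, 4.0])                # starting point
alpha = 0.1

for t in range(1, 5):
    p = p - alpha * grad_J(p)           # move against the gradient, toward the minimum at (0, 0)
    print(t, p)                         # (2.4, 3.2), (1.92, 2.56), (1.536, 2.048), (1.2288, 1.6384)
```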
Circular Contour
• Gradient Direction: In each iteration, the gradient points away from
the origin, indicating the direction of steepest ascent. We moved in
the opposite direction (toward the local minimum).
• Convergence: With each iteration, the coordinates (x,y)move closer
to the origin (0, 0), where the function has its minimum.
• Circular Contours: Since the contours are circular, the updates
effectively navigate towards the minimum efficiently without
oscillating.
• The gradient provides the direction of steepest ascent, and by moving
in the opposite direction, we effectively converge toward the local
minimum.
Circular Contour
• No Overshooting or Misalignment: There is no risk of the gradient
leading you away from the minimum or causing oscillation due to
misalignment, as is possible with highly elliptical contours.
• There are no issues of directional inaccuracies like those that can
occur with elliptical contours, ensuring efficient and effective
optimization.
When Gradient Points in Wrong Direction

Local information encoded by the gradient usually does not corroborate the global structure of the error surface
Ellipse
• J(x, y) = 4x² + y²
• The global minimum occurs at the origin (0, 0).
• Compute the gradient: ∇J(x, y) = (8x, 2y)
• Initial point: let's start at (x, y) = (2, 2); learning rate (α): 0.1
• Iteration 1: ∇J(2, 2) = (8·2, 2·2) = (16, 4); x_new = 2 − 0.1·16 = 0.4, y_new = 2 − 0.1·4 = 1.6; new point: (0.4, 1.6)
Ellipse
At each iteration, the gradient points towards the
steepest ascent, which means we move in the
opposite direction to find the minimum.
Convergence: Each update brings us closer to the
minimum at (0,0).
The gradient updates might cause larger
movements in one direction than the other due
to the aspect ratio of the ellipse
When dealing with functions that have
extremely elliptical contours
• Elliptical Contours: When contours are highly elongated (i.e., very
elliptical), the gradient at a point may not provide a good direction toward
the minimum.
• This is because the steepness of the gradient can vary significantly along
different axes.
• Gradient Direction: The gradient vector points in the direction of steepest
ascent.
• In cases of extreme ellipticity, the steepest ascent could be misaligned with
the shortest path to the minimum.
• This can make the update direction seem like it’s pointing away from the
local minimum.
When dealing with functions that have
extremely elliptical contours
When dealing with functions that have
extremely elliptical contours
• Actual Direction: The update moves from (2,1) to (0.4,0.8)
• Contour Analysis: The actual minimum is at (0,0).
• The update direction based on the gradient may point more towards
(0.4,0.8) rather than directly toward the minimum,
• Overshooting: If you continue to iterate, the gradient could potentially lead
you to oscillate or move in directions that seem correct based on local
gradient information but take you further away from the minimum.
• Gradient as 90 Degrees Off: In cases where the contours are extremely
elliptical, you might experience scenarios where the effective movement
toward the minimum is significantly misaligned.
• This could make the gradient seem like it's pointing almost perpendicular
(or 90 degrees off) from the correct path toward the local minimum.
When Gradient Points in Wrong Direction

We show how the direction of the gradient changes as we move along the direction of steepest descent
When Gradient Points in Wrong Direction

• The elliptical nature of the contours can be related to the condition number of the
Hessian matrix of the function.
• An elliptical contour indicates that the function's Hessian matrix has eigenvalues of
varying magnitudes.
• This often leads to slower convergence rates in gradient-based optimization methods
unless the problem is well-scaled.
When Gradient Points in Wrong Direction

• The Hessian matrix of a function f(x, y) is a square matrix of second-order partial
derivatives:
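The matrix itself did not survive extraction; the standard definition (not specific to these slides) is:

$$
H = \begin{bmatrix}
\frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x\,\partial y} \\
\frac{\partial^2 f}{\partial y\,\partial x} & \frac{\partial^2 f}{\partial y^2}
\end{bmatrix}
$$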
When Gradient Points in Wrong Direction

• The condition number of the Hessian matrix is defined as the ratio of the largest
eigenvalue to the smallest eigenvalue:
• Condition Number = λmax / λmin
• For example, with λmax = 2 and λmin = 1/5, Condition Number = 2 / (1/5) = 10.
• A high condition number (much greater than 1) indicates that the Hessian has a
significant difference between its maximum and minimum eigenvalues.
• This leads to highly elliptical contours, as the steepest descent direction is much more
pronounced in one direction than the other.
When Gradient Points in Wrong Direction

• If the condition number is close to 1 (indicating that the Hessian is well-conditioned),
• the contours will be more circular, which implies that the function is easier to optimize,
as all directions have similar steepness.
Momentum-Based Optimization
• Momentum-based optimization
is a powerful technique that
enhances the basic gradient
descent algorithm by
incorporating past gradient
information.
• This results in faster convergence
and improved stability,
particularly in challenging
optimization landscapes.
Momentum-Based Optimization
Momentum-Based Optimization
• Momentum-based optimization is a technique used in gradient descent
algorithms to accelerate convergence, particularly in scenarios with noisy
gradients or when dealing with ravines in the loss landscape.
• In a ravine, the loss function may drop sharply in one direction while
being relatively flat in the perpendicular direction. This creates a "valley"
effect
• The basic idea is to incorporate a "momentum" term that accumulates
past gradients, allowing the optimizer to gain speed in relevant directions
and dampen oscillations in others.
Momentum-Based Optimization
• how a ball rolls down a hilly surface.
• Driven by gravity, the ball eventually settles into a minimum on the surface,
but for some reason, it doesn’t suffer from the wild fluctuations and
divergences that happen during gradient descent.
• There are two major components that determine how a ball rolls down an error surface.
• The first is the gradient, as used in SGD, which we can think of as acceleration.
• But acceleration alone does not determine the ball's movements.
• Instead, its motion is more directly determined by its velocity.
• Acceleration only indirectly changes the ball's position by modifying its velocity.
Momentum-Based Optimization
• Velocity-driven motion is desirable because it counteracts the effects
of a wildly fluctuating gradient (not stable gradient ) by smoothing the
ball’s trajectory over its history.
• Velocity serves as a form of memory, and this allows us to more
effectively accumulate movement in the direction of the minimum
while canceling out oscillating accelerations in orthogonal
directions.
• every update is computed by combining the update in the last
iteration with the current gradient.
Momentum-Based Optimization
• Momentum introduces a velocity term that accumulates the past gradients to
smooth out updates.
• This helps to navigate along the relevant directions and dampen oscillations.
• Update Rule:
• Velocity update: v_t = β v_{t−1} + (1 − β) ∇J(θ)
• Parameter update: θ = θ − α v_t
• β is the momentum factor (usually between 0.5 and 0.9), α is the learning rate,
• and J(θ) is the loss function.
• It helps accelerate gradients in the right direction, leading to faster convergence.
Momentum-Based Optimization
• J(θ) = θ²
• The gradient of this function is:
• ∇J(θ) = 2θ
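A minimal Python sketch of momentum on this function, using the velocity and parameter updates given above (β = 0.9 and α = 0.1 are illustrative choices, not values from the slides):

```python
# Momentum-based gradient descent on J(theta) = theta^2, with the update rule
# from the slides: v = beta*v + (1 - beta)*grad;  theta = theta - alpha*v
def grad_J(theta):
    return 2 * theta

theta, v = 5.0, 0.0
alpha, beta = 0.1, 0.9              # learning rate and momentum factor (assumed values)

for t in range(20):
    g = grad_J(theta)
    v = beta * v + (1 - beta) * g   # velocity: exponential average of past gradients
    theta = theta - alpha * v       # parameter update driven by velocity, not raw gradient
```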
Momentum-Based Optimization
Momentum-Based Optimization
Momentum-Based Optimization
Momentum-Based Optimization
• The momentum term accumulates past gradients, allowing the
updates to incorporate information from previous steps.
• This can help navigate along the optimization landscape more
effectively.
• The updates to θ decrease in size as it approaches the minimum,
demonstrating how momentum can help smooth out the path toward
convergence while avoiding oscillation
• Momentum-based gradient descent can significantly speed up
convergence, particularly in high-dimensional spaces
• or when the loss surface has a lot of curvature.
Gradient descent with momentum
Gradient descent with momentum

• The intuition behind this is that if we are repeatedly asked to go in a particular
direction, we can take bigger steps in that direction.
• The weighted average of all the previous gradients is added to our update, and it acts as
momentum for our step.
• As we start to descend, the momentum increases, and even at gentle slopes where the
gradient is minimal, the actual movement is large due to the added momentum.
Gradient descent with momentum

• But this added momentum causes a different type of problem.
• We may actually cross the minimum point and have to take a U-turn to get back to it.
• Momentum-based gradient descent oscillates around the minimum point,
• and we have to take a lot of U-turns to reach the desired point.
• Despite these oscillations, momentum-based gradient descent is faster than conventional
gradient descent.
• To reduce these oscillations, we can use Nesterov Accelerated Gradient (NAG).
• the updates of NAG are similar to that of the momentum-based
gradient for the first three steps because the gradient at that point
and the look-ahead point are positive.
• But at step 4, the gradient of the look-ahead point is negative.
• In NAG, the first partial update 4a will be used to go to the look-
ahead point and then the gradient will be calculated at that point
without updating the parameters.
• Since the gradient at step 4b is negative, the overall update will be
smaller than the momentum-based gradient descent.
• We can see in the above example that the momentum-based gradient
descent takes six steps to reach the minimum point, while NAG takes
only five steps.
• This looking ahead helps NAG to converge to the minimum points in
fewer steps and reduce the chances of overshooting.
• In momentum-based gradient descent, the steps become larger and larger due to the
accumulated momentum, and then we overshoot at the 4th step.
• We then have to take steps in the opposite direction to reach the minimum point.
• However, the update in NAG happens in two steps:
• first, a partial step to reach the look-ahead point, and then the final update.
• We calculate the gradient at the look-ahead point and then use it to calculate the final update.
• If the gradient at the look-ahead point is negative, our final update will be smaller than
that of a regular momentum-based gradient.
Nesterov Accelerated Gradient Descent (NAG)
• NAG improves upon the standard momentum method by providing a
more accurate estimate of the future position of the parameters,
which can lead to faster convergence.
• The key idea behind NAG is to look ahead of the current position of
the parameters in the direction of the momentum.
• This approach can provide a more accurate gradient update because
it anticipates where the parameters will be in the next step,
• rather than just considering the current position
Nesterov Accelerated Gradient Descent (NAG)
• Here's a step-by-step explanation of how NAG operates:
• Compute the Lookahead Position:
• Before computing the gradient, move the parameters in the direction of the
current momentum. This is called the "lookahead" step.
• Compute the Gradient:
• Calculate the gradient of the loss function at this lookahead position.
• Update the Momentum:
• Update the momentum using the newly computed gradient.
• Update the Parameters:
• Finally, update the parameters using the updated momentum.
Nesterov Accelerated Gradient Descent (NAG)
• Let θ be the parameters of the model, and let v be the momentum term. Here’s
how you can update your parameters using NAG:
• 1. Initialize:
• θ₀ (initial parameters)
• v₀ = 0 (initial momentum)
• Learning rate η
• Momentum coefficient μ
• 2. Update Rule:
• For each iteration t:
• Compute the lookahead position:
Nesterov Accelerated Gradient Descent (NAG)
Nesterov Accelerated Gradient Descent (NAG)
Nesterov Accelerated Gradient Descent (NAG)
Nesterov Accelerated Gradient Descent (NAG)
Nesterov Accelerated Gradient Descent (NAG)
• Iteration 1 Result:
• ( 0.8 , 0.8 )
• Iteration 2 Result:
• ( 0.424 , 0.424 )
• Nesterov Accelerated Gradient combines the benefits of momentum with a
lookahead approach, leading to more efficient optimization.
• Its ability to predict the gradient direction based on anticipated future
positions makes it a powerful method for training machine learning models,
particularly in deep learning
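A sketch of one common NAG formulation in Python, following the four steps listed above (the objective J(θ) = θ² and the hyperparameter values are illustrative assumptions, so the printed numbers need not match the slide's 2-D example):

```python
# Nesterov Accelerated Gradient on J(theta) = theta^2 (sketch of one common formulation)
def grad_J(theta):
    return 2 * theta

theta, v = 1.0, 0.0
eta, mu = 0.1, 0.9                    # learning rate and momentum coefficient (assumed)

for t in range(20):
    lookahead = theta - mu * v        # 1. compute the lookahead position
    g = grad_J(lookahead)             # 2. gradient at the lookahead position
    v = mu * v + eta * g              # 3. update the momentum
    theta = theta - v                 # 4. update the parameters
```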
Nesterov Accelerated Gradient Descent (NAG)
• Advantages of NAG
• Faster Convergence:
• NAG often converges faster than standard gradient descent or vanilla momentum methods
because it provides a better estimate of where the parameters will be in the next step.
• Improved Stability:
• By incorporating a lookahead, NAG can reduce oscillations and provide a smoother
convergence path, especially in scenarios where the standard momentum method might
struggle.
• More Accurate Updates:
• The lookahead step helps in making more informed parameter updates, which can be
particularly beneficial in complex, high-dimensional optimization problems.
Nesterov Accelerated Gradient Descent (NAG)
• When to Use NAG
• Nesterov Accelerated Gradient Descent is useful in various scenarios:
• Large-scale Machine Learning: It’s particularly effective for large-scale problems
where faster convergence can save significant computational resources.
• Deep Learning: In deep neural networks, where training involves many layers and
parameters, NAG can help in speeding up convergence and achieving better
performance.
• Non-Convex Optimization: While not a cure-all for non-convex problems, NAG
can still provide better performance than basic gradient descent in many cases.
AdaGrad (Adaptive Gradient Algorithm)
• AdaGrad (Adaptive Gradient Algorithm) is an optimization algorithm
designed to adaptively adjust the learning rate for each parameter based
on the historical gradients.
• This allows AdaGrad to handle sparse data and gradients more effectively
than standard gradient descent.
• AdaGrad adjusts the learning rate for each parameter θi based on the
historical squared gradients.
• The core idea is to scale the learning rate inversely proportional to the
square root of the sum of past squared gradients for each parameter.
• This means that parameters with large gradients will have smaller
learning rates,
• while those with smaller gradients will have larger learning rates
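A minimal NumPy sketch of the AdaGrad update on the earlier elliptical objective J(x, y) = 4x² + y² (the learning rate and iteration count are illustrative assumptions):

```python
import numpy as np

# AdaGrad sketch: each parameter gets its own effective learning rate
theta = np.array([2.0, 2.0])
G = np.zeros_like(theta)                           # running sum of squared gradients
eta, eps = 0.5, 1e-8

for t in range(50):
    grad = np.array([8 * theta[0], 2 * theta[1]])  # gradient of J(x, y) = 4x^2 + y^2
    G += grad ** 2                                 # accumulate per-parameter gradient history
    theta -= eta * grad / (np.sqrt(G) + eps)       # larger past gradients -> smaller steps
```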
AdaGrad (Adaptive Gradient Algorithm)
AdaGrad (Adaptive Gradient Algorithm)
Advantages of AdaGrad
• Adaptive Learning Rates:
• AdaGrad automatically adapts the learning rates for each parameter based on
their historical gradient information, making it suitable for problems with
sparse features (e.g., text data).
• Effective for Sparse Data:
• Since parameters with infrequent updates receive larger learning rates,
AdaGrad can be particularly effective for sparse datasets, where some
features might only be seen a few times.
• Reduces Manual Tuning:
• The adaptive nature of AdaGrad reduces the need for manual tuning of the
learning rate, which can simplify the optimization process.
AdaGrad (Adaptive Gradient Algorithm)
• Disadvantages of AdaGrad
• Learning Rate Decay:
• One major drawback is that the learning rate accumulates over time, leading to a
progressive reduction in learning rates.
• This can cause the algorithm to make very small updates in later stages of training,
potentially causing slow convergence or premature stopping.
• No Momentum:
• AdaGrad does not incorporate momentum, which can be beneficial for escaping local
minima and accelerating convergence in certain scenarios.
• Accumulation of Squared Gradients:
• The accumulation of squared gradients can lead to very small learning rates in some cases,
making it less effective for problems requiring large or consistent updates.
RMSProp—Exponentially Weighted Moving
Average of Gradients
• RMSProp (Root Mean Square Propagation) is an adaptive learning rate
optimization algorithm designed to address some of the shortcomings of AdaGrad,
• particularly the issue of rapidly diminishing learning rates.
• RMSProp maintains a moving average of the squared gradients and uses this to
adjust the learning rate for each parameter, which helps to stabilize and accelerate
training.
• While AdaGrad works well for simple convex functions,
• it isn’t designed to navigate the complex error surfaces of deep networks.
• Flat regions may force AdaGrad to decrease the learning rate before it reaches a
minimum.
• The conclusion is that simply using a naive accumulation of gradients isn’t sufficient
RMSProp
• RMSProp modifies the learning rate of each parameter based on the
recent magnitude of its gradients.
• Instead of accumulating all past squared gradients as in AdaGrad,
RMSProp uses an exponentially decaying average,
• which helps in preventing the learning rate from becoming too small.
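A matching NumPy sketch of RMSProp on the same objective; the only change from AdaGrad is the exponentially decaying average s in place of the full sum (ρ = 0.9 and η = 0.01 are illustrative values):

```python
import numpy as np

# RMSProp sketch: exponentially weighted moving average of squared gradients
theta = np.array([2.0, 2.0])
s = np.zeros_like(theta)                           # moving average of squared gradients
eta, rho, eps = 0.01, 0.9, 1e-8                    # learning rate, decay factor, stability constant

for t in range(200):
    grad = np.array([8 * theta[0], 2 * theta[1]])  # gradient of J(x, y) = 4x^2 + y^2
    s = rho * s + (1 - rho) * grad ** 2            # decaying average keeps the scale from growing forever
    theta -= eta * grad / (np.sqrt(s) + eps)
```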
RMSProp
RMSProp -Advantages
• Stabilized Learning Rates:
• By using an exponentially decaying average of squared gradients, RMSProp prevents
the learning rate from diminishing too quickly, allowing for more stable and
consistent updates.
• Effective in Practice:
• Works well on practical problems like training deep neural networks.
• Handles Sparse Gradients:
• Like AdaGrad, RMSProp can handle sparse gradients effectively, which is useful in
scenarios with high-dimensional data.
• parameters associated with frequently occurring features can have a
smaller learning rate, while those associated with sparse features can
maintain a larger learning rate.
• This adaptability helps ensure that all parameters are updated effectively,
even when gradients are sparse.
RMSProp - Disadvantages
• Disadvantages of RMSProp
• No Momentum:
• RMSProp does not include momentum, which can be beneficial for
accelerating convergence and escaping local minima.
• However, this limitation can be addressed with algorithms like Adam, which
combines RMSProp with momentum.
• Choice of Hyperparameter
• The performance of RMSProp can be sensitive to the choice of
hyperparameters like the decay factor ρ and the learning rate η.
• Fine-tuning these parameters might be necessary for optimal
performance.
Adam—Combining Momentum and RMSProp
• Adam (short for Adaptive Moment Estimation)
• Adam can be viewed as a combination of RMSProp and momentum.
• It adapts learning rates based on both the first and second moments
of the gradients, leading to faster convergence and improved
performance across a variety of machine learning tasks.
• Its robustness and effectiveness make it a go-to choice for many
deep learning applications.
Adam—Combining Momentum and RMSProp
• Initialization:
• Parameters θ (initial values)
• Gradient moment estimates m and v (initialized to zeros for each parameter)
• Learning rate η (a fixed value, such as 0.001)
• Exponential decay rates β1 and β2​ (typically set to 0.9 and 0.999)
• Small constant ϵ (for numerical stability, e.g., 1e−8)
Adam—Combining Momentum and RMSProp
• Update Rule:
• For each parameter θi and each iteration t:
• Compute the Gradient:
• g_{i,t} = ∇_{θ_i} L(θ_t), where L is the loss function.
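A minimal NumPy sketch of Adam combining the two ideas, with the bias-corrected moment estimates described above (hyperparameters follow the defaults listed in the initialization slide; the objective is the same illustrative quadratic as before):

```python
import numpy as np

# Adam sketch: first moment (momentum) + second moment (RMSProp-style) with bias correction
theta = np.array([2.0, 2.0])
m = np.zeros_like(theta)                        # first moment estimate
v = np.zeros_like(theta)                        # second moment estimate
eta, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8

for t in range(1, 1001):
    g = np.array([8 * theta[0], 2 * theta[1]])  # gradient of J(x, y) = 4x^2 + y^2
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)                # bias correction: moments start at zero
    v_hat = v / (1 - beta2 ** t)
    theta -= eta * m_hat / (np.sqrt(v_hat) + eps)
```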
Adam—Combining Momentum and RMSProp
Adam—Advantages
• Adaptive Learning Rates:
• Adam adjusts the learning rate for each parameter individually based on the historical
gradient information, making it effective for a wide range of optimization problems.
• Combines Momentum and RMSProp:
• By incorporating both momentum (to accelerate gradients) and RMSProp (to adapt
learning rates), Adam benefits from the strengths of both approaches, leading to faster
convergence and improved performance.
• Bias Correction:
• The bias correction step helps to counteract the initial bias towards zero, providing more
accurate updates in the early stages of training.
• Robust to Hyperparameters:
• Adam is generally more robust to hyperparameter settings compared to other
optimizers, making it easier to use with default settings.
Adam
• Disadvantages of Adam
• Parameter Sensitivity:
• While Adam is robust, it can still be sensitive to the choice of hyperparameters,
especially the learning rate. Fine-tuning may be required for optimal
performance.
• Computational Overhead:
• Adam requires additional memory to store the moment estimates and can be
slightly more computationally intensive compared to simpler algorithms like
standard SGD.
Eigen Values
• We want “eigenvectors” x that don’t change direction when you
multiply by A.
Eigen Values
• Almost all vectors change direction, when they are multiplied by A. Certain
exceptional vectors x are in the same direction as Ax.
• Those are the “eigenvectors”.
• The basic equation is Ax = λx. The number λ is an eigenvalue of A.
• We may find λ = 2, or 1/2, or −1, or 1.
• The eigenvalue λ could be zero!
• Then Ax = 0x means that this eigenvector x is in the nullspace.
• If A is the identity matrix, every vector has Ax = x. All vectors are
eigenvectors of I.
• All eigenvalues “lambda” are λ = 1.
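A small NumPy check of the defining equation Ax = λx (the matrix A is an arbitrary symmetric example, not one from the slides):

```python
import numpy as np

# Eigenvectors x satisfy A x = lambda x: multiplying by A only rescales them
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigvals, eigvecs = np.linalg.eig(A)

for lam, x in zip(eigvals, eigvecs.T):      # columns of eigvecs are the eigenvectors
    print(np.allclose(A @ x, lam * x))      # True: the direction is unchanged, only scaled by lambda
```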
Dimensionality Reduction
• In high dimensions the center of the space is devoid of points, with most
of the points being scattered along the surface of the space or in the
corners.
• As a consequence high-dimensional data can cause problems for data
mining and analysis,
• although in some cases high-dimensionality can help, for example, for
nonlinear classification.
• it is important to check whether the dimensionality can be reduced
while preserving the essential properties of the full data matrix.
• This can aid data visualization as well as data mining
Let the data D consist of n points over d attributes; it is an n × d matrix.
• x_i = (x_{i1}, x_{i2}, x_{i3}, ..., x_{id})^T is a vector in the d-dimensional vector space spanned by
the d standard basis vectors e_1, e_2, ..., e_d, where e_i corresponds to the ith attribute X_i.
• The standard basis vectors are pairwise orthogonal, e_i^T e_j = 0 for i ≠ j,
• and have unit length, ||e_i|| = 1.
• Given any other set of d orthonormal vectors u_1, u_2, ..., u_d,
with u_i^T u_j = 0 for i ≠ j and ||u_i|| = 1,
• we can re-express each point x as the linear combination
• x = a_1 u_1 + a_2 u_2 + ··· + a_d u_d
• where a = (a_1, a_2, ..., a_d)^T represents the coordinates of x in the new basis.
• The above linear combination can also be expressed as a matrix multiplication:
• x = Ua
• where U is the d × d matrix whose ith column comprises the ith basis vector u_i.
• The matrix U is an orthogonal matrix, whose columns, the basis vectors,
are orthonormal, that is, they are pairwise orthogonal and have unit length.
• Since x = Ua, multiplying both sides by U^T (and using U^T U = I) yields the expression for computing
the coordinates of x in the new basis: a = U^T x.
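A NumPy sketch of this change of basis; the orthogonal matrix U here is a random stand-in for the basis used in the Iris example:

```python
import numpy as np

# Change of basis with an orthonormal basis U: x = U a  and  a = U^T x
U, _ = np.linalg.qr(np.random.randn(3, 3))   # random 3x3 orthogonal matrix (columns = new basis vectors)
x = np.array([-0.343, -0.754, 0.241])        # a point expressed in the standard basis

a = U.T @ x                                  # coordinates of x in the new basis
x_back = U @ a                               # reconstruct x from its new coordinates
print(np.allclose(x, x_back))                # True: U^T is the inverse of U when U is orthogonal
```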
Example
The centered Iris dataset, with n = 150 points, in the d = 3 dimensional
space comprising the sepal length (X1), sepal width (X2), and petal
length (X3) attributes. The space is spanned by the standard basis vectors.

the same points in the space comprising the new basis vectors
Example
the new coordinates of the centered point x = (−0.343,−0.754, 0.241)T
can be computed as
Example
Example
• Finding the optimal r-dimensional representation of D, with r ≪ d.
• Given a point x, and assuming the basis vectors have been sorted in decreasing
order of importance, we can truncate its linear expansion to just r terms:
• x′ = a_1 u_1 + ··· + a_r u_r = U_r a_r
• Here x′ is the projection of x onto the first r basis vectors, written in matrix notation,
• where U_r is the matrix comprising the first r basis vectors,
• and a_r is the vector comprising the first r coordinates.
• Further, because a_r = U_r^T x,
• the projection of x onto the first r basis vectors can be compactly written as
• x′ = U_r a_r = U_r U_r^T x = P_r x
• where P_r = U_r U_r^T is the orthogonal projection matrix for the subspace
spanned by the first r basis vectors.

The projection of x onto the remaining dimensions comprises the error vector ε = x − x′.
• It is worth noting that x′ and ε are orthogonal vectors.
• The subspace spanned by the first r basis vectors,
• S_r = span(u_1, ..., u_r),
• and the subspace spanned by the remaining basis vectors,
• S_{d−r} = span(u_{r+1}, ..., u_d), are orthogonal subspaces.
Example
• Approximating the centered point x = (−0.343, −0.754, 0.241)^T by using only the first basis
vector u_1 = (−0.390, 0.089, −0.916)^T, we have x′ = a_1 u_1.
• The projection of x on u_1 can also be obtained directly from the projection matrix P_1 = u_1 u_1^T.


PRINCIPAL COMPONENT ANALYSIS
• PCA seeks an r-dimensional basis that best captures the variance in the data.
• The direction with the largest projected variance is called the first
principal component.
• The orthogonal direction that captures the second largest projected
variance is called the second principal component, and so on.
• the direction that maximizes the variance is also the one that
minimizes the mean squared error
Best Line Approximation
• We will start with r = 1, that is, the one-dimensional subspace
• or line u that best approximates D in terms of the variance of projected points.
• This will lead to the general PCA technique for the best 1 ≤ r ≤ d dimensional
basis for D.
• We assume that u has unit magnitude, ||u||² = u^T u = 1,
• otherwise it is possible to keep on increasing the projected variance by simply
increasing the magnitude of u.
• We also assume that the data has been centered so that it has mean μ = 0.
Best Line Approximation
• The projection of x_i on the vector u is given as x_i′ = (u^T x_i) u,
• where a_i = u^T x_i gives the coordinate of x_i′ along u.
• Note that because the mean point is μ = 0, its coordinate along u is μ_u = 0.
• We have to choose the direction u such that the variance of the
projected points is maximized.
• The projected variance along u is given as σ_u² = u^T Σ u, where Σ is the covariance matrix of the centered data.
• To maximize the projected variance, we have to solve a constrained optimization problem,
namely to maximize σ_u² subject to the constraint that u^T u = 1.
• This can be solved by introducing a Lagrange multiplier α for the constraint, to obtain the
unconstrained maximization problem: max_u J(u) = u^T Σ u − α(u^T u − 1).
• Setting the derivative to zero gives Σu = αu, which implies that α is an eigenvalue of the
covariance matrix Σ, with the associated eigenvector u.
• Further, taking the dot product with u on both sides gives σ_u² = u^T Σ u = α.
• To maximize the projected variance σ_u², we should thus choose the largest eigenvalue of Σ.
• The dominant eigenvector u_1 specifies the direction of most variance, also called the
first principal component, that is, u = u_1.
• The largest eigenvalue λ_1 specifies the projected variance, that is, σ_u² = α = λ_1.
Minimum Squared Error Approach
• direction that maximizes the projected variance is also the one
that minimizes the average squared error.
• assume that the dataset D has been centered by subtracting the
mean from each point
Minimum Squared Error Approach
Minimum Squared Error Approach
• the total variance of the centered data (i.e., with μ = 0)


Minimum Squared Error Approach

• Thus, the principal component u_1, which is the direction that maximizes the projected variance,
• is also the direction that minimizes the mean squared error.


Example
Example
Best 2-dimensional Approximation
• assume that D has already been centered, so that μ = 0.
• We already computed the direction with the most variance, u1, which
is eigenvector corresponding to the largest eigenvalue λ1 of Σ .
• We now want to find another direction v, which also maximizes the
projected variance, but is orthogonal to u1.
Best 2-dimensional Approximation
• The optimization condition then becomes

• Taking the derivative of J(v) with respect to v, and setting it to the zero vector, gives
Best 2-dimensional Approximation
• In the derivation above we used the fact that Σu_1 = λ_1 u_1 and that v is orthogonal to u_1,
so that u_1^T Σ v = 0.
• Plugging β = 0 back into the optimality condition gives Σv = αv.
• This means that v is another eigenvector of Σ, and we have σ_v² = α.
• To maximize the variance along v, we should choose α = λ_2,
• the second largest eigenvalue of Σ,
• with the second principal component being given by the corresponding eigenvector, v = u_2.
Total Projected Variance
• Let U_2 = [u_1, u_2] be the matrix whose columns correspond to the two principal components.
• Assume that each point x_i ∈ R^d in D has been projected to obtain its coordinates
a_i = U_2^T x_i ∈ R², yielding the new dataset A.
• D is assumed to be centered, with μ = 0, so the coordinates of the projected mean are
also zero, because U_2^T μ = U_2^T 0 = 0.
Total Projected Variance

where P2 is the orthogonal projection matrix


Total Projected Variance

Because u1 and u2 are eigenvectors of Σ, we have Σ u1 = λ1u1 and Σ u2 = λ2u2, so that

Thus, the sum of the eigenvalues is the total variance of the projected points, and the first two principal components
maximize this variance.
Mean Squared Error
Example
• For the Iris dataset, the two largest eigenvalues are λ1 = 3.662, and λ2 =
0.239, with the corresponding eigenvectors
Example
• Thus, each point xi can be approximated by its projection onto the first
two principal components 𝑥𝑖′ =P2xi
• The total variance captured by the subspace is given as
• λ1 +λ2 = 3.662+0.239= 3.901 .
• The mean squared error is given as
• MSE = var(D)−λ1−λ2 = 3.96−3.662−0.239= 0.059
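A compact NumPy sketch of PCA as described above: eigendecomposition of the covariance matrix, projection onto the top-r eigenvectors, and the captured-variance / MSE split. It assumes the covariance is normalized by n, consistent with the Iris numbers above:

```python
import numpy as np

# PCA sketch: eigendecomposition of the covariance matrix of an n x d data matrix D
def pca(D, r):
    Dc = D - D.mean(axis=0)                      # center the data (mu = 0)
    Sigma = Dc.T @ Dc / Dc.shape[0]              # covariance matrix (d x d), normalized by n
    eigvals, eigvecs = np.linalg.eigh(Sigma)     # symmetric eigendecomposition
    order = np.argsort(eigvals)[::-1]            # sort eigenvalues in decreasing order
    U_r = eigvecs[:, order[:r]]                  # first r principal components
    A = Dc @ U_r                                 # projected coordinates a_i = U_r^T x_i (as rows)
    captured = eigvals[order[:r]].sum()          # total projected variance = sum of top-r eigenvalues
    mse = eigvals.sum() - captured               # MSE = total variance minus captured variance
    return A, U_r, captured, mse
```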
Principal Component Analysis
Example
Example
Example

Reduced dimensionality dataset: Iris principal components.


SINGULAR VALUE DECOMPOSITION
• PCA yields the following decomposition of the covariance matrix: Σ = U Λ U^T
• where the covariance matrix has been factorized into the orthogonal matrix U containing its eigenvectors,
• and a diagonal matrix Λ containing its eigenvalues (sorted in decreasing order).
• SVD generalizes the above factorization to any matrix.
• For an n × d data matrix D, SVD factorizes D as follows: D = L Δ R^T
SINGULAR VALUE DECOMPOSITION

• where L is a orthogonal n × n matrix, R is an orthogonal d × d matrix,


• Δ is an n×d “diagonal” matrix.
• The columns of L are called the left singular vectors, and the
• columns of R (or rows of Rt ) are called the right singular vectors.
• The matrix Δ is defined as.
• where i = 1, . . . ,n
• and j = 1, . . . ,d
SINGULAR VALUE DECOMPOSITION
• The entries Δ (i, i) = δi along the main diagonal of Δ are called the
singular values of D,
• and they are all non-negative.
• If the rank of D is r ≤ min(n,d), then there will be only r nonzero
singular values, which we assume are ordered as follows:
• δ1 ≥ δ2 ≥ ··· ≥ δr > 0
• One can discard those left and right singular vectors that correspond
to zero singular values, to obtain the reduced SVD as
SINGULAR VALUE DECOMPOSITION

• where L_r is the n × r matrix of the left singular vectors,
• R_r is the d × r matrix of the right singular vectors,
• and Δ_r is the r × r diagonal matrix containing the positive singular values.
The reduced SVD leads directly to the spectral decomposition of D, given as D = Σ_{i=1}^{r} δ_i l_i r_i^T.
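A NumPy sketch of the reduced SVD D = L_r Δ_r R_r^T; np.linalg.svd returns the three factors, and the data matrix here is a random stand-in:

```python
import numpy as np

# Reduced SVD sketch: keep only the r nonzero singular values
D = np.random.randn(150, 3)                    # stand-in for an n x d data matrix
L, delta, Rt = np.linalg.svd(D, full_matrices=False)

r = int(np.sum(delta > 1e-10))                 # numerical rank of D
L_r, delta_r, R_r = L[:, :r], delta[:r], Rt[:r, :].T
D_rebuilt = L_r @ np.diag(delta_r) @ R_r.T     # D = L_r Delta_r R_r^T
print(np.allclose(D, D_rebuilt))               # True when r equals the rank of D
```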
SINGULAR VALUE DECOMPOSITION
• The spectral decomposition represents D as a sum of rank-one matrices of the form δ_i l_i r_i^T.
• By selecting the q largest singular values δ_1, δ_2, ..., δ_q and the
corresponding left and right singular vectors,
• we obtain the best rank-q approximation to the original matrix D.
• That is, if D_q is the matrix defined as D_q = Σ_{i=1}^{q} δ_i l_i r_i^T, then D_q is the closest
rank-q matrix to D.
SVD
Geometry of SVD
• In general, any n×d matrix D represents a linear transformation,
• D: Rd →Rn, from the space of d-dimensional vectors to the space of n-
dimensional vectors
• because for any x ∈ Rd there exists y ∈ Rn such that
• Dx = y
• The set of all vectors y ∈ Rn such that Dx = y over all possible x ∈ Rd is called
the column space of D,
• and the set of all vectors x ∈ Rd , such that Dt y = x over all y ∈ Rn,
• is called the row space of D, which is equivalent to the column space of Dt.
Geometry of SVD
• Also note that
• the set of all vectors x ∈ Rd , such that Dx = 0 is called the null space
of D, and finally,
• the set of all vectors y ∈ Rn, such that Dt y = 0 is called the left null
space of D.
Geometry of SVD
• If D has rank r, it means that it has only r independent columns, and also only r
independent rows.
• Thus, the r left singular vectors l1, l2, . . . , lr corresponding to the r nonzero
singular values of D in Eq. represent a basis for the column space of D.
• The remaining n−r left singular vectors lr+1, . . . , ln represent a basis for the
left null space of D.
• For the row space, the r right singular vectors r1, r2, . . . , rr corresponding to
the r non-zero singular values, represent a basis for the row space of D,
• and the remaining d −r right singular vectors rj (j = r +1, . . . ,d), represent a
basis for the null space of D
Geometry of SVD
• Consider the reduced SVD expression D = L_r Δ_r R_r^T.
• Right multiplying both sides of the equation by R_r, and noting that R_r^T R_r = I_r,
where I_r is the r × r identity matrix,
• we have D R_r = L_r Δ_r, that is, D r_i = δ_i l_i for i = 1, ..., r.
Geometry of SVD
• From the above, we conclude that SVD is a special factorization of the matrix D,
• such that any basis vector ri for the row space is mapped to the
corresponding basis vector li in the column space,
• scaled by the singular value δi .
• SVD as a mapping from an orthonormal basis (r1, r2, . . . , rr ) in Rd (the row
space) to an orthonormal basis (l1, l2, . . . , lr ) in Rn (the column space),
with the corresponding axes scaled according to the singular values δ1,δ2,
. . . ,δr .
Connection between SVD and PCA
Connection between SVD and PCA

we conclude that the right singular vectors R are the same as the eigenvectors of Σ
corresponding singular values of D are related to the eigenvalues of Σ by expression
Connection between SVD and PCA
Connection between SVD and PCA
Example
• Let us consider the n×d centered Iris datamatrix D
• we computed the eigenvectors and eigenvalues of the covariance
matrix Σ as follows:
Example
• Computing the SVD of D yields the following nonzero singular values
and the corresponding right singular vectors
Example
• Notice also that the right singular vectors are equivalent to the
principal components or eigenvectors of Σ, up to isomorphism.
• That is, they may potentially be reversed in direction.
• For the Iris dataset, we have r1 = u1, r2 = −u2, and r3 = u3.
• Here the second right singular vector is reversed in sign when
compared to the second principal component.
Problem with PCA
• While PCA has been used for decades for dimensionality reduction, it
fails to capture important relationships that are piecewise linear or
nonlinear.
Problem with PCA
• The example shows data points selected at random from two
concentric circles.
• We hope that PCA will transform this dataset so that we can pick a
single new axis that allows us to easily separate the red and blue dots.
Unfortunately for us, there is no linear direction that contains more information here than
another (we have equal variance in all directions).
Motivating the Autoencoder Architecture
• In feed-forward networks, we saw how each layer learned progressively more relevant
representations of the input.
• For example, we can take the output of the final convolutional layer and use that as a
lower-dimensional representation of the input image.
• Here, we want to generate these low-dimensional representations in an unsupervised fashion.
Motivating the Autoencoder Architecture
• We first take the input and compress it into a low-dimensional vector. This
part of the network is called the encoder
• because it is responsible for producing the low-dimensional embedding or
code.
• The second part of the network, instead of mapping the embedding to an
arbitrary label as we would in a feed-forward network,
tries to invert the computation of the first half of the network and
reconstruct the original input. This piece is known as the decoder
Motivating the Autoencoder Architecture

• The autoencoder architecture compresses a high-dimensional input into a low-dimensional
embedding and then uses that low-dimensional embedding to reconstruct the input.
Autoencoder
• An autoencoder is a neural network
• that is trained to attempt to copy its input to its output.
• it has a hidden layer h that describes a code used to represent the input.
• The network may be viewed as consisting of two parts:
• An encoder function h = f (x)
• and a decoder that produces a reconstruction r = g(h).
• If an autoencoder succeeds in simply learning to set g(f (x)) = x everywhere,
then it is not especially useful.
• autoencoders are designed to be unable to learn to copy perfectly
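A minimal PyTorch sketch of the encoder/decoder structure and the reconstruction loss L(x, g(f(x))); the layer sizes (784 → 64) are illustrative assumptions, not values from the slides:

```python
import torch
import torch.nn as nn

# Minimal autoencoder sketch: encoder h = f(x), decoder r = g(h)
encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU())     # compress a 784-dim input to a 64-dim code
decoder = nn.Sequential(nn.Linear(64, 784), nn.Sigmoid())  # reconstruct the input from the code

x = torch.rand(32, 784)                  # a batch of inputs (e.g., flattened images)
h = encoder(x)                           # low-dimensional embedding / code
r = decoder(h)                           # reconstruction
loss = nn.functional.mse_loss(r, x)      # L(x, g(f(x))): penalize dissimilarity from the input
loss.backward()                          # gradients for training the encoder and decoder
```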
Autoencoder

Modern autoencoders have generalized the idea of an encoder and a decoder beyond
deterministic functions to stochastic mappings
p_encoder(h | x) and
p_decoder(x | h).

mapping an input x to an output (called reconstruction) r through an internal


representation or code h.
The autoencoder has two components:
the encoder f (mapping x to h) and the decoder g (mapping h to r).
Simple autoencoder
Simple autoencoder
Non Linear autoencoder
Training the autoencoder
Undercomplete, Overcomplete, and Need for
Regularization
Undercomplete, Overcomplete, and Need for
Regularization
• In both cases, it is important to control the capacity of the encoder and decoder.
• Undercomplete: imagine K = 1 and very powerful f and g.
We can achieve very small reconstruction error, but the learned code will not
capture any interesting properties of the data.
• Overcomplete: imagine K ≥ D and trivial (identity) functions f and g.
We can achieve even zero reconstruction error,
but the learned code will not capture any interesting properties of the data.
• It is therefore important to regularize the functions as well as the learned
code, and not just focus on minimizing the reconstruction error.
Regularized Autoencoders
• Several ways to regularize the model, e.g.:
• Make the learned code sparse (Sparse Autoencoders)
• Make the model robust against noisy/incomplete inputs (Denoising Autoencoders)
• Make the model robust against small changes in the input (Contractive Autoencoders)
Sparse Autoencoders
• Make the learned code sparse (Sparse Autoencoders). Done by adding a sparsity
penalty on h.
• Loss function: L(x, x̂) + Ω(h)
• where Ω(h) = Σ_{k=1}^{K} |h_k| is the L1 norm of h.
• The sparse autoencoder is learned by minimizing the above regularized loss function.
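A sketch of this sparse-autoencoder loss in PyTorch, adding the L1 penalty Ω(h) on the code to the reconstruction error (λ is a hypothetical weight):

```python
import torch
import torch.nn as nn

# Sparse autoencoder loss sketch: reconstruction error plus an L1 penalty on the code h
def sparse_ae_loss(x, h, r, lam=1e-3):
    recon = nn.functional.mse_loss(r, x)           # l(x_hat, x): reconstruction term
    sparsity = lam * h.abs().sum(dim=1).mean()     # Omega(h) = sum_k |h_k|, averaged over the batch
    return recon + sparsity
```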


Denoising Autoencoders
• First add some noise (e.g., Gaussian noise) to the original input x.
• Let x̃ denote the corrupted version of x.
• The encoder f operates on x̃, i.e., h = f(x̃).
• However, we still want the reconstruction x̂ to be close to the original uncorrupted input x.
• Since the corruption is stochastic, we minimize the expected loss.
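A sketch of a denoising-autoencoder training loss in PyTorch: Gaussian corruption of x, encoding of the corrupted copy, and reconstruction error measured against the clean x (the noise level is an illustrative assumption):

```python
import torch
import torch.nn as nn

# Denoising autoencoder loss sketch: corrupt x, encode the corrupted copy,
# but measure the reconstruction against the original, uncorrupted x
def dae_loss(encoder, decoder, x, noise_std=0.3):
    x_tilde = x + noise_std * torch.randn_like(x)   # corrupted version of x
    h = encoder(x_tilde)                            # h = f(x_tilde)
    r = decoder(h)                                  # reconstruction
    return nn.functional.mse_loss(r, x)             # loss is measured against the clean input x
```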
Deep/Stacked Autoencoders
• Most autoencoders can be extended to have more than one hidden layer
Stochastic Autoencoders
• Can also define the encoder and decoder functions using probability distributions:
• p_encoder(h | x)
• p_decoder(x | h)
• The choice of distributions depends on the type of data being modeled and on the encodings.
• This gives a probabilistic approach for designing autoencoders.
• The negative log-likelihood −log p_decoder(x | h) is equivalent to the reconstruction error.
• We can also use a prior distribution p(h) on the encodings (equivalent to a regularizer).
• Such ideas have been used to design generative models based on autoencoders.
• The Variational Autoencoder (VAE) is a popular example of such a model.
Applications of Autoencoders

• Autoencoders are used for dimensionality reduction and feature learning.
• Denoising and inpainting (filling in missing or damaged parts of an image
or video using information from the surrounding area)
• Pre-training of deep neural networks
• Recommender system applications
Feature learning and Dimensionality
Reduction
• Example: A deep AE for low-dim feature learning for 784-dimensional
MNIST images
Feature learning and Dimensionality Reduction
• Low-dim feature learning for 2000-dimensional bag-of-words
documents
Denoising and Inpainting
Denoising and Inpainting
Undercomplete Autoencoders
• An autoencoder whose code dimension is less than the input dimension
is called undercomplete.
• Learning an undercomplete representation forces the autoencoder to
capture the most salient features of the training data.
• The learning process is described simply as minimizing a loss function
• L(x, g(f(x)))
• where L is a loss function penalizing g(f (x)) for being dissimilar from x,
such as the mean squared error
Autoencoder
• When the decoder is linear and L is the mean squared error,
• an undercomplete autoencoder learns to span the same subspace as
PCA.
• Autoencoders with nonlinear encoder functions f and nonlinear
decoder functions g can thus learn a more powerful nonlinear
generalization of PCA.
Regularized Autoencoders
• Problems arise if the hidden code is allowed to have dimension equal to the input,
and in the overcomplete case in which the hidden code has dimension greater than the input.
• Overcomplete means that the number of features in the latent representation exceeds
the number of input features.
• Having more neurons than inputs allows the model to learn richer
representations, potentially capturing complicated relationships in the data.
• In these cases, even a linear encoder and linear decoder can learn to copy the
input to the output without learning anything useful about the data distribution.
Regularized Autoencoders
• Rather than limiting the model capacity by keeping the encoder and decoder
shallow and the code size small,
• regularized autoencoders use a loss function that encourages the model to have
other properties besides the ability to copy its input to its output.
• To prevent overfitting in overcomplete autoencoders, regularization techniques
(like L1 or L2 regularization, dropout, or sparsity constraints) are often applied.
This encourages the model to learn useful features
• These other properties include sparsity of the representation, smallness of the
derivatives of the representation (which relates to how sensitive the output, or
reconstruction, of the model is to changes in the input), and robustness to noise or to
missing inputs (so that the model can effectively handle noisy input).
Regularized Autoencoders
• In the case of overcomplete autoencoders, regularization prevent the model
from simply learning to output the training data without generalizing to
unseen examples.
• L1 Regularization: Encourages sparsity in the hidden layer, pushing many
activations toward zero.
• L2 Regularization: Helps prevent large weights, promoting smoother and
more generalized solutions.
• Sparsity Penalties:
enforce that only a small number of neurons are activated,
making the representation more interpretable and robust.
Regularized Autoencoders
• When the derivatives of the representation (the activations in the hidden layers)
are small,
• it means that small changes in the input will lead to small changes in the output
• A regularized autoencoder can be nonlinear and overcomplete but still learn
something useful about the data distribution even if the model capacity is great
enough to learn a trivial identity function.
• The identity function is a mapping where the output is the same as the input
• Most autoencoders are designed to learn complex representations of data using
nonlinear activation functions, such as ReLU, sigmoid, or tanh.
• Nonlinear autoencoders typically consist of multiple layers, where the encoding
and decoding processes use nonlinear transformations.
Regularized Autoencoders
• How Autoencoders Achieve Robustness?
• By compressing the data into a lower-dimensional representation,
autoencoders can filter out noise, focusing on the underlying structure of
the data rather than the noise itself.
• Sparse autoencoders enforce sparsity in the hidden layer, which encourages
the model to focus on the most important features.
• This can help in ignoring noisy inputs that do not contribute to the essential
structure of the data.
Regularized Autoencoders
• How Autoencoders Achieve Robustness?
• Techniques like L2 regularization or dropout can help prevent overfitting to
noisy data,
• making the model more generalizable to unseen examples.
• A specific variant designed to enhance robustness is denoising autoencoder.
• During training, random noise is added to the input data, and the model is
trained to reconstruct the original, noise-free input.
• This helps the autoencoder learn to ignore noise and capture meaningful
features.
Sparse Autoencoders
• designed to learn efficient representations of data by enforcing sparsity in
the hidden layer activations.
• This means that, for any given input, only a small number of neurons in
the hidden layer will be activated, leading to a more compact
representation.
• This is typically achieved using a sparsity penalty in the loss function,
which can be implemented through various methods:
• for example (L1 Regularization or KL Divergence or L2 Regularization)
Sparse Autoencoders
• Sparse autoencoders are powerful tools for learning efficient representations
from high-dimensional data.
• By enforcing sparsity in the hidden layer, they can extract meaningful
features,
• improve robustness to noise, and prevent overfitting,
• making them suitable for a wide range of applications in machine learning
and data analysis.
Sparse Autoencoders
• Regularized maximum likelihood is a powerful approach that combines
the strengths of maximum likelihood estimation with regularization
techniques.
• This combination enhances model robustness, improves generalization,
• and helps manage complexity in high-dimensional datasets,
• making it a fundamental tool in statistical modeling and machine learning.
• regularized maximum likelihood corresponds to maximizing p(θ | x),
which is equivalent to maximizing log p(x | θ) + log p(θ).
Sparse Autoencoders
• Suppose we have a model with visible variables x and latent variables h,
with an explicit joint distribution
• pmodel(x,h) = pmodel(h)pmodel(x | h).
• We refer to pmodel (h) as the model’s prior distribution over the latent
variables, representing the model’s beliefs prior to seeing x.
• The log-likelihood can be decomposed as log pmodel(x) = log Σ_h pmodel(h, x).
The autoencoder can be viewed as approximating this sum with a point estimate
for just one highly likely value of h. With this chosen h,
Sparse Autoencoders
• we are maximizing
log pmodel(h, x) = log pmodel(h) + log pmodel(x | h)
• The log pmodel(h) term can be sparsity-inducing. For example, the Laplace prior
pmodel(h_i) = (λ/2) e^(−λ|h_i|)
corresponds to an absolute value sparsity penalty.
• Expressing the log-prior as an absolute value penalty, we obtain
−log pmodel(h) = Σ_i (λ|h_i| − log(λ/2)) = Ω(h) + const,  where Ω(h) = λ Σ_i |h_i|
Sparse Autoencoders
• where the constant term depends only on λ and not h.
• We typically treat λ as a hyperparameter and discard the constant term
since it does not affect the parameter learning.
• From this point of view of sparsity as resulting from the effect of pmodel
(h) on approximate maximum likelihood learning,
• the sparsity penalty is not a regularization term at all.
• using a Laplace prior on the coefficients of a model encourages sparsity in
the estimates, meaning that many of the coefficients are pushed towards
zero.
• the Laplace prior is a powerful tool for achieving sparsity in high-
dimensional statistical problems
Denoising Autoencoders
• Rather than adding a penalty Ω to the cost function, we can obtain an
autoencoder that learns something useful by changing the reconstruction
error term of the cost function.
• Traditionally, autoencoders minimize some function
• L(x, g(f(x)))
• where L is a loss function penalizing g(f (x)) for being dissimilar from x,
such as the 𝐿2 norm of their difference.
• This encourages g ◦ f to learn to be merely an identity function if they have
the capacity to do so.
Denoising Autoencoders
• A denoising autoencoder or DAE instead minimizes
L(x, g(f(x̃))),
• where x̃ is a copy of x that has been corrupted by some form of noise.
• Denoising autoencoders must therefore undo this corruption rather
than simply copying their input.
• thus provide yet another example of how useful properties can emerge
as a byproduct of minimizing reconstruction error
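Two corruption processes commonly used to produce x̃ — additive Gaussian noise and random masking — can be sketched as follows (the noise level sigma and masking probability p are illustrative; PyTorch is assumed):

```python
# Sketch of two corruption processes C(x_tilde | x) for denoising
# autoencoders: additive Gaussian noise and random masking.
import torch

def gaussian_corruption(x, sigma=0.3):
    # x_tilde = x + Gaussian noise with standard deviation sigma
    return x + sigma * torch.randn_like(x)

def masking_corruption(x, p=0.3):
    # each input dimension is set to zero independently with probability p
    mask = (torch.rand_like(x) > p).float()
    return x * mask

x = torch.rand(8, 784)                  # dummy data
x_tilde = gaussian_corruption(x)        # corrupted copy fed to the encoder f
```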
Regularizing by Penalizing Derivatives
• Another strategy for regularizing an autoencoder is to use a penalty Ω, as in sparse
autoencoders, but with a different form:
Ω(h, x) = λ Σ_i ‖∇_x h_i‖²
Regularizing by Penalizing Derivatives
• This forces the model to learn a function that does not change much when x
changes slightly.
• Because this penalty is applied only at training examples,
• it forces the autoencoder to learn features that capture information about the
training distribution.
• An autoencoder regularized in this way is called a contractive autoencoder or CAE.
• This approach has theoretical connections to denoising autoencoders,
manifold learning (nonlinear dimensionality reduction), and probabilistic modeling.
• It helps make the learned features more invariant to small variations in the input data,
which is useful for tasks such as noise reduction and feature extraction.
Denoising Autoencoders (DAE)
• denoising autoencoder is an autoencoder that receives a corrupted
data point as input and is trained to predict the original, uncorrupted
data point as its output.
• corruption process C(x̃ | x), which represents a conditional distribution
over corrupted samples x̃, given a data sample x.
The computational graph of the cost function for a denoising
autoencoder, which is trained to reconstruct the clean data point x from
its corrupted version x̃.
This is accomplished by minimizing the
loss L = −log pdecoder(x | h = f(x̃)),
where x̃ is a corrupted version of the data example x, obtained through
a given corruption process C(x̃ | x).
The distribution pdecoder is a factorial distribution whose mean parameters
are emitted by a feedforward network g.
Denoising Autoencoders
• The autoencoder then learns a reconstruction distribution
preconstruct(x | x̃) estimated from training pairs (x, x̃), as follows:
• 1. Sample a training example x from the training data.
• 2. Sample a corrupted version x̃ from C(x̃ | x = x).
• 3. Use (x, x̃) as a training example for estimating the
autoencoder reconstruction distribution
• preconstruct(x | x̃) = pdecoder(x | h), with h the output of the
encoder f(x̃) and pdecoder typically defined by a decoder
g(h).
Denoising Autoencoders
• we can simply perform gradient-based approximate
minimization (such as minibatch gradient descent) on
the negative log-likelihood −log pdecoder(x | h).
• We can view the DAE as performing stochastic gradient descent on the
following expectation:
−E_{x∼p̂data(x)} E_{x̃∼C(x̃|x)} log pdecoder(x | h = f(x̃))
• where p̂data(x) is the training distribution, i.e., the empirical
distribution of the clean data.
Denoising Autoencoder
• The aim is to learn this distribution through training on noisy
examples,
• allowing the model to recover clean data effectively.
• In a denoising autoencoder, the loss function is typically
• the mean squared error or binary cross-entropy
• between the original clean input x and the reconstructed output x̂, e.g. L = ‖x − x̂‖².
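A minimal training sketch under these choices (PyTorch, Gaussian corruption, and a mean-squared-error loss are assumed; the architecture and hyperparameters are placeholders):

```python
# Minimal denoising-autoencoder training sketch: corrupt x into x_tilde,
# encode/decode the corrupted copy, and penalize the distance to the CLEAN x.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU())
decoder = nn.Sequential(nn.Linear(128, 784), nn.Sigmoid())
params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

def train_step(x, sigma=0.3):
    x_tilde = x + sigma * torch.randn_like(x)   # sample x_tilde ~ C(x_tilde | x)
    h = encoder(x_tilde)                        # h = f(x_tilde)
    x_hat = decoder(h)                          # g(f(x_tilde))
    loss = nn.functional.mse_loss(x_hat, x)     # compare to the uncorrupted x
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

for _ in range(10):                             # a few steps on dummy data
    batch = torch.rand(64, 784)
    train_step(batch)
```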
Denoising Autoencoders
A denoising autoencoder is trained
to map a corrupted data point x̃
back to the original data point x.
We illustrate training examples x as
red crosses lying near a low-
dimensional manifold illustrated
with the bold black line.
We illustrate the corruption
process C(x̃ | x) with a gray circle
of equiprobable corruptions.
Denoising Autoencoders

A gray arrow demonstrates how one training example is transformed into one sample from this
corruption process.
When the denoising autoencoder is trained to minimize the average of squared errors
‖g(f(x̃)) − x‖²,
the reconstruction g(f(x̃)) estimates E_{x,x̃∼pdata(x)C(x̃|x)}[x | x̃],
i.e., the expectation of the clean x taken over the distribution of data samples pdata(x) and the
corruption process C(x̃ | x).
Denoising Autoencoders


The vector g(f(x̃)) − x̃
points approximately towards the nearest point on the manifold,
since g(f(x̃)) estimates the center of mass of the clean points x
that could have given rise to x̃.
The autoencoder thus learns a vector field g(f(x)) − x, indicated by the green arrows.
This vector field estimates the score ∇x log pdata(x) up to a multiplicative factor that is
the average root-mean-square reconstruction error.
Estimating the Score
• the score is a particular gradient field:
• ∇x log p(x).
• regarding autoencoders, it is sufficient to understand that learning the gradient
field of log pdata is one way to learn the structure of pdata itself.
• It can guide the optimization process,
• ensuring that the model learns to generate outputs that are more likely according
to the true data distribution.
• A very important property of DAEs is that their training criterion (with
conditionally Gaussian p(x | h)) makes the autoencoder learn a vector field
(g(f(x)) − x) that estimates the score of the data distribution
Estimating the Score
• Training with the squared error criterion ‖g(f(x̃)) − x‖²
• and the corruption process C(x̃ | x) = N(x̃; μ = x, Σ = σ²I), with noise variance σ²,
yields such a score estimator.
• In general, however, there is no guarantee that the reconstruction g(f(x)) minus the input x
corresponds to the gradient of any function, let alone to the score.
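As an illustration of how a score estimate could be read off a trained model under the Gaussian-corruption, squared-error setting, here is a hypothetical sketch (the toy untrained model, sizes, and sigma value are assumptions; it only shows the form of the estimator, not a validated implementation):

```python
# Sketch: reading off a score estimate from a denoising autoencoder.
# For Gaussian corruption with variance sigma^2, the reconstruction residual
# (g(f(x)) - x) / sigma^2 approximates the score grad_x log p_data(x)
# for a well-trained DAE; the model below is untrained and only
# illustrates the computation of the vector field.
import torch
import torch.nn as nn

sigma = 0.3
encoder = nn.Sequential(nn.Linear(2, 64), nn.Tanh())
decoder = nn.Linear(64, 2)

def estimated_score(x):
    residual = decoder(encoder(x)) - x          # the vector field g(f(x)) - x
    return residual / sigma**2                  # scaled residual ~ score estimate

grid = torch.randn(100, 2)                      # points where the field is queried
score_field = estimated_score(grid)             # shape (100, 2)
```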
Contractive Autoencoders (CAEs)
• Contractive Autoencoders (CAEs) are a type of neural network designed to
learn robust representations of data by enforcing a regularization term
• encourages the model to be invariant to small changes in the input.
• useful for tasks where robustness to noise or slight variations is essential
• Like traditional autoencoders, CAEs consist of an encoder and a decoder.
• encoder compresses input data into a lower-dimensional representation,
• while the decoder reconstructs the input from this representation.
Contractive Autoencoders (CAEs)
• Contractive Loss:
• The main innovation in CAEs is the introduction of a contractive loss term in
addition to the reconstruction loss.
• The loss function for a CAE can be expressed as
• L(x, x̂) = L_reconstruction(x, x̂) + λ · L_contractive(h)
• where L_reconstruction(x, x̂) is typically the mean squared error or binary
cross-entropy between the input x and the reconstructed output x̂,
• L_contractive(h) encourages the hidden representation h to be robust to small
perturbations in the input,
• and λ is a hyperparameter that controls the trade-off between reconstruction
accuracy and contractive regularization.
Contractive Autoencoders (CAEs)
• Advantages
• Robust Representations:
• By enforcing stability, CAEs learn features that are less sensitive to noise
and perturbations in the input data.
• Improved Generalization:
• generalize better on unseen data due to the regularization effect.
• Feature Extraction:
• effectively extract meaningful features from the data,
• which can be useful for downstream tasks like classification or clustering.
• The contractive autoencoder has an explicit regularizer on h = f(x),
• encouraging the derivatives of f to be as small as possible:
Ω(h) = λ ‖∂f(x)/∂x‖²_F
• The overall training criterion is L(x, g(f(x))) + Ω(h).
• The penalty Ω(h) is the squared Frobenius norm (sum of squared
elements) of the Jacobian matrix of partial derivatives associated with
the encoder function.
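A sketch of how this penalty can be computed for a single sigmoid encoder layer, where the Jacobian has a closed form (PyTorch is assumed; the sizes and λ below are illustrative):

```python
# Sketch of a contractive autoencoder loss for a single sigmoid encoder layer.
# For h = sigmoid(W x + b), the Jacobian dh/dx is diag(h * (1 - h)) @ W, so its
# squared Frobenius norm is sum_i (h_i * (1 - h_i))^2 * sum_j W_ij^2.
import torch
import torch.nn as nn

W = nn.Parameter(torch.randn(64, 784) * 0.01)   # encoder weights (illustrative sizes)
b = nn.Parameter(torch.zeros(64))
decoder = nn.Linear(64, 784)
lam = 1e-4                                       # contractive penalty weight

def cae_loss(x):
    h = torch.sigmoid(x @ W.t() + b)             # h = f(x)
    x_hat = torch.sigmoid(decoder(h))            # reconstruction g(h)
    recon = nn.functional.mse_loss(x_hat, x)
    dh = h * (1 - h)                             # derivative of the sigmoid, per unit
    frob = torch.mean(dh**2 @ (W**2).sum(dim=1)) # average of ||J||_F^2 over the batch
    return recon + lam * frob

x = torch.rand(32, 784)
loss = cae_loss(x)
loss.backward()
```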
References
• Ian Goodfellow, Yoshua Bengio and Aaron Courville, Deep Learning (2016), An MIT Press book, http://www.deeplearningbook.org (T2)
• Skansi S., Introduction to Deep Learning - From Logical Calculus to Artificial Intelligence, 1st Edition, Springer International Publishing, 2018. (T2)
• Buduma N., Fundamentals of Deep Learning, 1st Edition, O'Reilly Media, 2016. (R1)
• Piyush Rai, Autoencoders, Extensions, and Applications, IIT Kanpur
• Online: https://cedar.buffalo.edu/~srihari/CSE676/14.3%20Learning%20Manifolds.pdf