Unit3 Rev3
Module III
Gradient Descent (GD), Momentum Based GD, Nesterov Accelerated GD, Stochastic GD,
AdaGrad, RMSProp, Adam, Eigenvalues and eigenvectors, Eigenvalue Decomposition, Basis,
Principal Component Analysis and its interpretations, Singular Value Decomposition.
Autoencoders and relation to PCA, Regularization in autoencoders, Denoising autoencoders,
Sparse autoencoders, Contractive autoencoders
Eigen Values
Gradient Descent:
• Optimization refers to the task of either minimizing or maximizing
some function f (x) by altering x.
• We usually phrase most optimization problems in terms of minimizing f(x). Maximization can be accomplished by minimizing −f(x).
• The function we want to minimize or maximize is called the objective
function or criterion.
• When we are minimizing it, we may also call it the cost function, loss
function, or error function.
• We denote the value that minimizes or maximizes a function with a superscript ∗.
For example, we might say x∗ = argmin f(x).
Gradient Descent:
An illustration of how the gradient descent algorithm uses the derivatives of a function to follow the
function downhill to a minimum.
Gradient Descent
• When f′(x) = 0, the derivative provides no information about which
direction to move.
• Points where f′(x) = 0 are known as critical points or stationary points.
• A local minimum is a point where f(x) is lower than at all neighboring
points, so it is no longer possible to decrease f(x) by making infinitesimal
steps.
• A local maximum is a point where f (x) is higher than at all neighboring
points.
• Some critical points are neither maxima nor minima. These are known as
saddle points
Gradient Descent
Mini-batch gradient descent may aid in escaping shallow local minima, but often fails when dealing with
deep local minima, as shown
The Challenges with Gradient Descent
• One observation about deep neural networks is that their error surfaces
are guaranteed to have a large (and in some cases, an infinite) number of
local minima.
• Local minima are only problematic when they are spurious.
• A spurious local minimum incurs a higher error than the configuration at
the global minimum.
• If these kinds of local minima are common, we quickly run into significant
problems while using gradient-based optimization methods because we can
only take into account local structure.
The Challenges with Gradient Descent
• Loss Landscape in Deep Networks:
• Deep neural networks have complex loss landscapes due to their high
dimensionality and non-linearity.
• This complexity can result in many local minima.
• In high-dimensional spaces, the chance of encountering a local minimum is
higher, and many of these minima can be spurious.
• Characteristics:
• Spurious local minima are not necessarily the worst possible minima, but they
can be worse than the global minimum or other more optimal minima.
• These minima can trap optimization algorithms, preventing them from
finding the best solution.
• Impact on Training:
• Getting stuck in a spurious local minimum can affect the performance
of the model, leading to poorer generalization on unseen data.
• The model might converge to a suboptimal set of parameters,
reducing its effectiveness.
• Strategies to Mitigate Spurious Local Minima
• Advanced Optimization Algorithms:
• Stochastic Gradient Descent (SGD) and its variants (e.g., Adam, RMSprop)
introduce noise into the gradient updates, which can help escape local
minima.
• Momentum-based methods can help by providing a form of inertia that
might push the optimizer out of local minima.
• Regularization Techniques:
• Techniques like Dropout or weight decay can help generalize better
and avoid overfitting to local minima.
• Network Architecture:
• Batch Normalization (which normalizes the outputs of each layer by
adjusting and scaling the activations) and Residual Networks (ResNets)
can help by smoothing the optimization landscape.
• Skip connections and deeper architectures can also help in reducing
the chances of getting stuck in local minima.
• Learning Rate Strategies:
• Using a learning rate schedule that gradually reduces the learning rate
can help in fine-tuning the model and avoiding local minima.
• Initialization Strategies:
• Proper initialization of weights can help start the optimization process
in a region that is less likely to trap the model in poor local minima.
• Ensemble Methods:
• Training multiple models and combining their predictions can
sometimes mitigate the impact of spurious local minima, as different
models might end up in different minima.
• Hyperparameter Tuning:
• Careful tuning of hyperparameters (e.g., learning rate, batch size) can
impact the optimization process and potentially help in avoiding
spurious minima.
• Flat Regions: These are areas in the loss
landscape where the gradient (or change in
the loss function) is very small.
• This means the surface is relatively flat in
these regions.
• Impact of Flat Regions:
• Gradient Magnitude: The gradient is close to zero, which makes it
difficult for gradient-based methods like SGD to make significant
progress.
• Slow Convergence: Because the gradient is small, each parameter update
is small, leading to slow convergence or stagnation.
When Gradient Points in the Wrong Direction
Example
• J(x, y) = x² + y²
• ∇J(x, y) = (2x, 2y)
Circular Contour
• Iteration 1
Compute Gradient: ∇J(3, 4) = (2⋅3, 2⋅4) = (6, 8)
Update Parameters:
x_new = 3 − 0.1⋅6 = 3 − 0.6 = 2.4
y_new = 4 − 0.1⋅8 = 4 − 0.8 = 3.2
New point: (2.4, 3.2)
• Iteration 2
Compute Gradient: ∇J(2.4, 3.2) = (2⋅2.4, 2⋅3.2) = (4.8, 6.4)
Update Parameters:
x_new = 2.4 − 0.1⋅4.8 = 2.4 − 0.48 = 1.92
y_new = 3.2 − 0.1⋅6.4 = 3.2 − 0.64 = 2.56
New point: (1.92, 2.56)
Circular Contour
• Iteration 3
Compute Gradient: ∇J(1.92, 2.56) = (2⋅1.92, 2⋅2.56) = (3.84, 5.12)
Update Parameters:
x_new = 1.92 − 0.1⋅3.84 = 1.92 − 0.384 = 1.536
y_new = 2.56 − 0.1⋅5.12 = 2.56 − 0.512 = 2.048
New point: (1.536, 2.048)
• Iteration 4
Compute Gradient: ∇J(1.536, 2.048) = (2⋅1.536, 2⋅2.048) = (3.072, 4.096)
Update Parameters:
x_new = 1.536 − 0.1⋅3.072 = 1.536 − 0.3072 = 1.2288
y_new = 2.048 − 0.1⋅4.096 = 2.048 − 0.4096 = 1.6384
New point: (1.2288, 1.6384)
Circular Contour
• Gradient Direction: In each iteration, the gradient points away from
the origin, indicating the direction of steepest ascent. We moved in
the opposite direction (toward the local minimum).
• Convergence: With each iteration, the coordinates (x, y) move closer
to the origin (0, 0), where the function has its minimum.
• Circular Contours: Since the contours are circular, the updates
effectively navigate towards the minimum efficiently without
oscillating.
• The gradient provides the direction of steepest ascent, and by moving
in the opposite direction, we effectively converge toward the local
minimum.
Circular Contour
• No Overshooting or Misalignment: There is no risk of the gradient
leading you away from the minimum or causing oscillation due to
misalignment, as is possible with highly elliptical contours.
• There are no issues of directional inaccuracies like those that can
occur with elliptical contours, ensuring efficient and effective
optimization.
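The four hand-computed iterations on J(x, y) = x² + y² can be reproduced with a few lines of code. This is a minimal sketch, not a general optimizer: the function names `grad_J` and `gradient_descent` are illustrative, and the starting point (3, 4) and learning rate 0.1 are the values used in the worked iterations.

```python
def grad_J(x, y):
    """Gradient of J(x, y) = x**2 + y**2."""
    return 2.0 * x, 2.0 * y


def gradient_descent(x, y, learning_rate=0.1, steps=4):
    """Return the list of points visited, starting from (x, y)."""
    path = [(x, y)]
    for _ in range(steps):
        gx, gy = grad_J(x, y)
        # Step against the gradient, i.e. in the direction of steepest descent.
        x, y = x - learning_rate * gx, y - learning_rate * gy
        path.append((x, y))
    return path


path = gradient_descent(3.0, 4.0)
for i, (x, y) in enumerate(path):
    print(f"iteration {i}: ({x:.4f}, {y:.4f})")
```

Each step multiplies both coordinates by the same factor (1 − 0.1⋅2 = 0.8), so the path is a straight line toward the origin, which is why no oscillation appears for circular contours.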
When Gradient Points in Wrong Direction
Local information encoded by the gradient usually does not corroborate the global structure of the error surface
Ellipse
• J(x, y) = 4x² + y²
• The global minimum occurs at the origin (0, 0).
• Compute the Gradient: ∇J(x, y) = (8x, 2y)
• Initial point: Let's start at (x, y) = (2, 2)
• Learning rate (α): 0.1
• Iteration 1
Compute Gradient: ∇J(2, 2) = (8⋅2, 2⋅2) = (16, 4)
Update Parameters:
x_new = 2 − 0.1⋅16 = 2 − 1.6 = 0.4
y_new = 2 − 0.1⋅4 = 2 − 0.4 = 1.6
New point: (0.4, 1.6)
Ellipse
At each iteration, the gradient points in the direction of
steepest ascent, which means we move in the
opposite direction to find the minimum.
Convergence: Each update brings us closer to the
minimum at (0,0).
The gradient updates might cause larger
movements in one direction than the other due
to the aspect ratio of the ellipse.
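To see how unevenly the two coordinates move on the elliptical example, the same update can be run for a few more steps. This is a small sketch under the example's own settings (J(x, y) = 4x² + y², start at (2, 2), learning rate 0.1); the function name `descend` and the choice of five steps are illustrative.

```python
def grad_J(x, y):
    """Gradient of J(x, y) = 4*x**2 + y**2."""
    return 8.0 * x, 2.0 * y


def descend(x, y, learning_rate=0.1, steps=5):
    """Plain gradient descent; returns every point visited."""
    path = [(x, y)]
    for _ in range(steps):
        gx, gy = grad_J(x, y)
        x, y = x - learning_rate * gx, y - learning_rate * gy
        path.append((x, y))
    return path


ellipse_path = descend(2.0, 2.0)
for i, (x, y) in enumerate(ellipse_path):
    print(f"iteration {i}: x = {x:.5f}, y = {y:.5f}")
```

With this learning rate, x is multiplied by 1 − 0.8 = 0.2 at every step while y is multiplied by 1 − 0.2 = 0.8, so the steep direction collapses quickly and the shallow direction lags behind.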
When dealing with functions that have
extremely elliptical contours
• Elliptical Contours: When contours are highly elongated (i.e., very
elliptical), the gradient at a point may not provide a good direction toward
the minimum.
• This is because the steepness of the gradient can vary significantly along
different axes.
• Gradient Direction: The gradient vector points in the direction of steepest
ascent.
• In cases of extreme ellipticity, the steepest ascent could be misaligned with
the shortest path to the minimum.
• This can make the update direction seem like it’s pointing away from the
local minimum.
When dealing with functions that have
extremely elliptical contours
• Actual Direction: For J(x, y) = 4x² + y² with learning rate 0.1, the gradient
at (2,1) is (16,2), so the update moves from (2,1) to (0.4,0.8).
• Contour Analysis: The actual minimum is at (0,0).
• The update direction (−1.6,−0.2) points mostly along the x-axis rather than
directly toward the minimum, which lies in the direction (−2,−1).
• Overshooting: If you continue to iterate, the gradient could potentially lead
you to oscillate or move in directions that seem correct based on local
gradient information but take you further away from the minimum.
• Gradient as 90 Degrees Off: In cases where the contours are extremely
elliptical, you might experience scenarios where the effective movement
toward the minimum is significantly misaligned.
• This could make the gradient seem like it's pointing almost perpendicular
(or 90 degrees off) from the correct path toward the local minimum.
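The misalignment can be quantified as the angle between the descent direction −∇J and the straight line from the current point to the minimum. The sketch below assumes the family J(x, y) = a⋅x² + y², where the parameter `a` (a name introduced here for illustration) controls how elongated the contours are, and evaluates the angle at the point (2, 1) from the slide as well as at a hypothetical point (0.1, 2) close to the long axis.

```python
import math


def misalignment_degrees(a, x, y):
    """Angle between -grad J and the direction to the minimum at (0, 0),
    for J(x, y) = a*x**2 + y**2."""
    step = (-2.0 * a * x, -2.0 * y)   # negative gradient
    to_min = (-x, -y)                 # points straight at the origin
    dot = step[0] * to_min[0] + step[1] * to_min[1]
    norm = math.hypot(*step) * math.hypot(*to_min)
    cos_angle = max(-1.0, min(1.0, dot / norm))  # guard against rounding
    return math.degrees(math.acos(cos_angle))


print("a = 4 at (2, 1):", round(misalignment_degrees(4, 2.0, 1.0), 2))
for a in (1, 4, 100, 10000):
    print("a =", a, "at (0.1, 2):", round(misalignment_degrees(a, 0.1, 2.0), 1))
```

For circular contours (a = 1) the angle is zero. As `a` grows, the angle at the second point climbs toward 90 degrees, which is the "almost perpendicular" situation described in the bullets.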
When Gradient Points in Wrong Direction
We show how the direction of the gradient changes as we move along the direction of steepest descent
When Gradient Points in Wrong Direction
• The condition number of the Hessian matrix is defined as the ratio of the
largest eigenvalue to the smallest eigenvalue:
• Condition Number = λmax/λmin
• For example, with λmax = 2 and λmin = 1/5: Condition Number = 2/(1/5) = 10
• A high condition number (much greater than 1) indicates that the Hessian
has a significant difference between its maximum and minimum eigenvalues.
• This leads to highly elliptical contours, as the curvature is
much larger in one direction than the other.
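The condition number is easy to compute for a 2 × 2 symmetric Hessian, whose eigenvalues have a closed form. The sketch below is illustrative only: the helper names are made up here, the Hessian with eigenvalues 2 and 1/5 is taken to be diagonal as an assumption, and the code assumes a positive definite matrix (smallest eigenvalue greater than zero).

```python
import math


def symmetric_2x2_eigenvalues(a, b, c):
    """Eigenvalues (largest first) of the symmetric matrix [[a, b], [b, c]]."""
    mean = (a + c) / 2.0
    radius = math.hypot((a - c) / 2.0, b)
    return mean + radius, mean - radius


def condition_number(a, b, c):
    """Ratio of largest to smallest eigenvalue (assumes both are positive)."""
    lam_max, lam_min = symmetric_2x2_eigenvalues(a, b, c)
    return lam_max / lam_min


print(condition_number(2.0, 0.0, 0.2))  # eigenvalues 2 and 1/5
print(condition_number(8.0, 0.0, 2.0))  # Hessian of 4x^2 + y^2
print(condition_number(2.0, 0.0, 2.0))  # Hessian of x^2 + y^2 (circular contours)
```

The circular bowl has condition number 1, the ellipse 4x² + y² has condition number 4, and the example with eigenvalues 2 and 1/5 gives 10, matching the ordering from "easy" to "hard" for plain gradient descent.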
When Gradient Points in Wrong Direction
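• x = Ua, where the columns of U are the new basis vectors and a holds the
coordinates of x in that basis.
• Multiplying this equation on both sides by Uᵀ (with U orthogonal, so UᵀU = I)
yields the expression for computing the coordinates of x in the new basis: a = Uᵀx.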
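A small numeric check makes the change of basis concrete. The sketch below assumes a hypothetical 2-dimensional orthonormal basis obtained by rotating the standard basis by 30 degrees; the helper names `to_new_basis` and `from_new_basis` are illustrative.

```python
import math

theta = math.radians(30.0)  # hypothetical rotation angle
# Columns of U are the new orthonormal basis vectors u1 and u2.
U = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta),  math.cos(theta)]]


def to_new_basis(U, x):
    """a = U^T x: coordinates of x with respect to the columns of U."""
    d = len(x)
    return [sum(U[i][j] * x[i] for i in range(d)) for j in range(d)]


def from_new_basis(U, a):
    """x = U a: recover the coordinates in the standard basis."""
    d = len(a)
    return [sum(U[i][j] * a[j] for j in range(d)) for i in range(d)]


x = [1.0, 2.0]
a = to_new_basis(U, x)
x_back = from_new_basis(U, a)
print("a =", a)
print("U a =", x_back)
```

Because U is orthogonal, the round trip returns the original point and the length of the vector is unchanged; only its coordinates differ.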
Example
Consider the centered Iris dataset, with n = 150 points, in the d = 3 dimensional
space comprising the sepal length (X1), sepal width (X2), and petal
length (X3) attributes. This space is spanned by the standard basis vectors.
The same points can also be plotted in the space comprising the new basis vectors.
Example
The new coordinates of the centered point x = (−0.343, −0.754, 0.241)ᵀ
can be computed as a = Uᵀx.
Example
• The goal is to find the optimal r-dimensional representation of D, with r ≪ d.
• Given a point x, and assuming the basis vectors have been sorted in decreasing
order of importance, we can truncate its linear expansion to just r terms.
• To maximize the projected variance σᵤ², we should thus choose the largest
eigenvalue of Σ.
• The dominant eigenvector u1 specifies the direction of most variance, also
called the first principal component, that is, u = u1.
• The largest eigenvalue λ1 specifies the projected variance, that is,
σᵤ² = α = λ1.
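The claim that the dominant eigenvector gives the direction of maximum projected variance can be checked numerically. The following is a sketch on synthetic data, not on the Iris dataset: the data, the random seed, and the 1/n convention for the covariance matrix are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 points in 3 dimensions with unequal spread.
X = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.3])
D = X - X.mean(axis=0)                     # center the data

n = D.shape[0]
Sigma = D.T @ D / n                        # covariance matrix (1/n convention)

eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]          # re-sort in decreasing order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

u1 = eigvecs[:, 0]                         # first principal component
projected_variance = np.var(D @ u1)        # variance of the 1-D projection
print("largest eigenvalue:", eigvals[0])
print("projected variance:", projected_variance)
```

The two printed numbers agree because, for centered data, the variance of the projection onto a unit vector u is uᵀΣu, which equals λ1 when u = u1.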
Minimum Squared Error Approach
• The direction that maximizes the projected variance is also the one
that minimizes the average squared error.
• Assume that the dataset D has been centered by subtracting the
mean from each point.
Minimum Squared Error Approach
• The total variance of the centered data (i.e., with μ = 0) is the average
squared norm of the points: var(D) = (1/n) Σᵢ ‖xᵢ‖².
Example
Best 2-dimensional Approximation
• Assume that D has already been centered, so that μ = 0.
• We already computed the direction with the most variance, u1, which
is the eigenvector corresponding to the largest eigenvalue λ1 of Σ.
• We now want to find another direction v, which also maximizes the
projected variance, but is orthogonal to u1.
Best 2-dimensional Approximation
• The optimization condition then becomes Σv = αv, so v must be another
eigenvector of Σ. The projected variance is maximized by the second largest
eigenvalue, α = λ2, that is, v = u2.
• Assume that each point xi ∈ Rᵈ in D has been projected to obtain its coordinates
ai = U2ᵀ xi ∈ R², yielding the new dataset A.
• Because D is assumed to be centered, with μ = 0, the coordinates of the projected
mean are also zero: U2ᵀ μ = U2ᵀ 0 = 0.
Total Projected Variance
Thus, the sum of the eigenvalues is the total variance of the projected points, and the first two principal components
maximize this variance.
Mean Squared Error
Example
• For the Iris dataset, the two largest eigenvalues are λ1 = 3.662, and λ2 =
0.239, with the corresponding eigenvectors
Example
• Thus, each point xi can be approximated by its projection onto the first
two principal components: x′i = P2 xi.
• The total variance captured by the subspace is given as
λ1 + λ2 = 3.662 + 0.239 = 3.901.
• The mean squared error is given as
MSE = var(D) − λ1 − λ2 = 3.96 − 3.662 − 0.239 = 0.059.
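The identity MSE = var(D) − λ1 − λ2 can be verified by projecting points onto the top two principal components and measuring the reconstruction error directly. The sketch below uses synthetic 3-dimensional data rather than the Iris measurements; the data, seed, and 1/n covariance convention are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: 150 points in 3 dimensions.
X = rng.normal(size=(150, 3)) * np.array([2.0, 0.5, 0.25])
D = X - X.mean(axis=0)
n = D.shape[0]

Sigma = D.T @ D / n
eigvals, eigvecs = np.linalg.eigh(Sigma)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # decreasing order

U2 = eigvecs[:, :2]          # basis of the 2-D principal subspace
P2 = U2 @ U2.T               # projection matrix onto that subspace
D_approx = D @ P2            # x_i' = P2 x_i for every row (P2 is symmetric)

total_var = np.sum(D ** 2) / n
captured = eigvals[0] + eigvals[1]
mse = np.sum((D - D_approx) ** 2) / n
print("var(D) - lambda1 - lambda2:", total_var - captured)
print("measured MSE:              ", mse)
```

In three dimensions the leftover error is exactly the smallest eigenvalue λ3, which is why keeping the two largest eigenvalues gives the best 2-dimensional approximation.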
Principal Component Analysis
Example
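• Eigen-decomposition writes the covariance matrix as Σ = UΛUᵀ, where the
covariance matrix has been factorized into the orthogonal matrix U containing
its eigenvectors and a diagonal matrix Λ containing its eigenvalues (sorted in
decreasing order).
• SVD generalizes the above factorization to any matrix.
• For an n × d data matrix D, SVD factorizes D into a matrix of left singular
vectors, a diagonal matrix of singular values, and the transpose of the matrix R
of right singular vectors.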
SINGULAR VALUE DECOMPOSITION
We conclude that the right singular vectors R are the same as the eigenvectors of Σ.
The corresponding singular values of D are related to the eigenvalues of Σ: with Σ = (1/n) DᵀD
for centered D, each eigenvalue of Σ equals the squared singular value divided by n.
Connection between SVD and PCA
Example
• Let us consider the n × d centered Iris data matrix D.
• We computed the eigenvectors and eigenvalues of the covariance
matrix Σ as follows:
Example
• Computing the SVD of D yields the following nonzero singular values
and the corresponding right singular vectors
Example
• Notice also that the right singular vectors are equivalent to the
principal components or eigenvectors of Σ, up to isomorphism.
• That is, they may potentially be reversed in direction.
• For the Iris dataset, we have r1 = u1, r2 = −u2, and r3 = u3.
• Here the second right singular vector is reversed in sign when
compared to the second principal component.
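The relationship between SVD and PCA, including the possible sign flips, can be checked with NumPy. This sketch uses synthetic data in place of the Iris matrix; the data, the seed, and the 1/n covariance convention are assumptions, so which vectors come out flipped will generally differ from the Iris result quoted above.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical centered data matrix with 150 rows and 3 columns.
X = rng.normal(size=(150, 3)) * np.array([2.0, 1.0, 0.5])
D = X - X.mean(axis=0)
n = D.shape[0]

# PCA route: eigen-decomposition of the covariance matrix.
eigvals, eigvecs = np.linalg.eigh(D.T @ D / n)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # decreasing order

# SVD route: singular values come back in decreasing order.
_, singular_values, Rt = np.linalg.svd(D, full_matrices=False)
R = Rt.T                                             # columns are r1, r2, r3

# Sign of each right singular vector relative to the matching eigenvector.
signs = np.sign(np.sum(R * eigvecs, axis=0))
print("squared singular values / n:", singular_values ** 2 / n)
print("eigenvalues of Sigma:       ", eigvals)
print("relative signs:             ", signs)
```

A sign of −1 in the last line marks a right singular vector that points in the opposite direction to the corresponding principal component, which does not change the subspace it spans.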
Problem with PCA
• While PCA has been used for decades for dimensionality reduction, it
fails to capture important relationships that are piecewise linear or
nonlinear.
Problem with PCA
• The example shows data points selected at random from two
concentric circles.
• We hope that PCA will transform this dataset so that we can pick a
single new axis that allows us to easily separate the red and blue dots.
Unfortunately for us, there is no linear direction that contains more
information here than another (we have equal variance in all
directions).
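The equal-variance claim can be confirmed numerically. The sketch below places points at evenly spaced angles on two circles instead of sampling them at random, an assumption made so that the result is exact; the radii 1 and 3 and the number of points are also illustrative.

```python
import numpy as np

angles = np.linspace(0.0, 2.0 * np.pi, 100, endpoint=False)


def circle(radius):
    """Points at evenly spaced angles on a circle centered at the origin."""
    return np.column_stack([radius * np.cos(angles), radius * np.sin(angles)])


inner, outer = circle(1.0), circle(3.0)   # the two classes
D = np.vstack([inner, outer])
D = D - D.mean(axis=0)

Sigma = D.T @ D / len(D)
eigvals = np.linalg.eigvalsh(Sigma)
print("eigenvalues of the covariance matrix:", eigvals)

# A nonlinear feature, the distance from the origin, does separate the classes.
print("largest inner radius: ", np.linalg.norm(inner, axis=1).max())
print("smallest outer radius:", np.linalg.norm(outer, axis=1).min())
```

Both eigenvalues are equal, so PCA has no preferred direction, while the radius (which no linear projection can compute) separates the two circles cleanly.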
Motivating the Autoencoder Architecture
• In feed-forward networks, we saw how each layer learned progressively more
relevant representations of the input.
• We could take the output of the final convolutional layer and use it as a
lower-dimensional representation of the input image.
• Here, we want to generate these low-dimensional representations in an
unsupervised fashion.
Motivating the Autoencoder Architecture
• We first take the input and compress it into a low-dimensional vector. This
part of the network is called the encoder because it is responsible for
producing the low-dimensional embedding or code.
• The second part of the network, instead of mapping the embedding to an
arbitrary label as we would in a feed-forward network, tries to invert the
computation of the first half of the network and reconstruct the original
input. This piece is known as the decoder.
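The encoder and decoder halves can be sketched as two small functions. This is only a forward pass with randomly initialized weights, not a trained model: the layer sizes, the tanh activation, the single-layer structure, and all names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, code_dim = 8, 2          # hypothetical sizes

# Randomly initialized weights; a real model would learn these by training.
W_enc = rng.normal(scale=0.1, size=(input_dim, code_dim))
b_enc = np.zeros(code_dim)
W_dec = rng.normal(scale=0.1, size=(code_dim, input_dim))
b_dec = np.zeros(input_dim)


def encoder(x):
    """Compress the input into a low-dimensional code."""
    return np.tanh(x @ W_enc + b_enc)


def decoder(h):
    """Map the code back to the input space."""
    return h @ W_dec + b_dec


def reconstruction_loss(x):
    """Mean squared error between the input and its reconstruction."""
    x_hat = decoder(encoder(x))
    return np.mean((x_hat - x) ** 2)


x = rng.normal(size=(5, input_dim))  # a batch of 5 hypothetical inputs
code = encoder(x)
x_hat = decoder(code)
print(code.shape, x_hat.shape, reconstruction_loss(x))
```

Training would adjust the four weight arrays to reduce `reconstruction_loss`, and no labels are needed because the input itself is the target.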
Motivating the Autoencoder Architecture
Make the model robust against small changes in the input (Contractive Autoencoders)
Sparse Autoencoders
• Make the learned code sparse (Sparse Autoencoders). This is done by adding a
sparsity penalty on h.
• Loss Function: l(x, x̂) + Ω(h)
• Where Ω(h) = Σ_{k=1}^{K} |h_k| is the l1 norm of h.
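The sparse autoencoder loss is simply a reconstruction term plus the l1 penalty on the code. The sketch below assumes a squared-error reconstruction term and adds a `weight` parameter on the penalty; both are illustrative choices not stated in the slide (with `weight=1.0` the loss has the form given above).

```python
def reconstruction_error(x, x_hat):
    """Squared error between the input x and the reconstruction x_hat."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


def l1_penalty(h):
    """Omega(h): sum of absolute values of the code units."""
    return sum(abs(h_k) for h_k in h)


def sparse_autoencoder_loss(x, x_hat, h, weight=1.0):
    """Reconstruction error plus the weighted sparsity penalty on h."""
    return reconstruction_error(x, x_hat) + weight * l1_penalty(h)


# Two hypothetical codes that reconstruct equally well: the sparser one wins.
x, x_hat = [1.0, 2.0], [1.0, 2.0]
print(sparse_autoencoder_loss(x, x_hat, h=[0.5, -0.5, 0.5]))
print(sparse_autoencoder_loss(x, x_hat, h=[0.0, 0.0, 1.0]))
```

The absolute value matters: without it, positive and negative activations would cancel and the penalty would not push individual units toward zero.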
A gray arrow demonstrates how one training example is transformed into one sample from this
corruption process.
When the denoising autoencoder is trained to minimize the average of squared errors
‖g(f(x̃)) − x‖², where x̃ denotes the corrupted input, the vector g(f(x̃)) − x̃
points approximately towards the nearest point on the manifold,
since g(f(x̃)) estimates the center of mass of the clean points x.
The autoencoder thus learns a vector field g(f (x)) − x indicated by the green arrows.
This vector field estimates the score ∇x log pdata (x) up to a multiplicative factor that is
the average root mean square reconstruction error.
Estimating the Score
• The score is a particular gradient field: ∇x log p(x).
• Regarding autoencoders, it is sufficient to understand that learning the gradient
field of log pdata is one way to learn the structure of pdata itself.
• It can guide the optimization process, ensuring that the model learns to generate
outputs that are more likely according to the true data distribution.
• A very important property of DAEs is that their training criterion (with
conditionally Gaussian p(x | h)) makes the autoencoder learn a vector field
(g(f(x)) − x) that estimates the score of the data distribution
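A one-dimensional Gaussian case shows the link between denoising and the score without training a network. The sketch below makes strong simplifying assumptions: the data are N(data_mean, data_var), the corruption adds Gaussian noise of variance noise_var, and the trained autoencoder is replaced by the closed-form minimizer of the squared error, the posterior mean E[x | x̃]. All names and numeric values are illustrative.

```python
def score(x, mean, variance):
    """d/dx log N(x; mean, variance) = -(x - mean) / variance."""
    return -(x - mean) / variance


def optimal_denoiser(x_noisy, data_mean, data_var, noise_var):
    """E[x | x_noisy] for x ~ N(data_mean, data_var) and
    x_noisy = x + Gaussian noise with variance noise_var."""
    shrink = data_var / (data_var + noise_var)
    return data_mean + shrink * (x_noisy - data_mean)


data_mean, data_var, noise_var = 1.0, 4.0, 0.25   # hypothetical values
for x_noisy in (-2.0, 0.0, 1.0, 3.5):
    displacement = optimal_denoiser(x_noisy, data_mean, data_var, noise_var) - x_noisy
    smoothed_score = score(x_noisy, data_mean, data_var + noise_var)
    print(x_noisy, displacement, noise_var * smoothed_score)
```

In this setting the displacement (reconstruction minus corrupted input) equals the noise variance times the score of the noise-smoothed density, so the learned vector field is proportional to a score. For small noise the smoothed density is close to the data density.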
Estimating the Score
• training with the squared error criterion
• Online :
https://cedar.buffalo.edu/~srihari/CSE676/14.3%20Learning%20Manifolds.pdf