
MODULE 3

Empirical Risk Minimization (ERM)

Empirical Risk Minimization (ERM) is a fundamental concept in machine learning for reducing prediction error. It involves optimizing a loss function that approximates the error on the training data. The expected generalization error, also known as the risk, is minimized under ERM by substituting the true data distribution p(x, y) with the empirical distribution p̂(x, y) derived from the training dataset.
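
In symbols, the empirical risk is the average loss over the m training examples, (1/m) Σ L(f(x(i); θ), y(i)), and ERM chooses the parameters θ that minimize this average. A minimal Python/NumPy sketch of that computation, using an illustrative linear model and squared-error loss (these particular choices are assumptions for the example, not part of the source):

import numpy as np

def empirical_risk(model, params, X, y, loss):
    # Average loss over the training set: (1/m) * sum_i loss(f(x_i), y_i)
    return np.mean(loss(model(params, X), y))

# Illustrative model and loss (assumed for the example).
def linear_model(params, X):
    return X @ params

def squared_error(pred, target):
    return (pred - target) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # 100 training examples, 3 features
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
print(empirical_risk(linear_model, np.zeros(3), X, y, squared_error))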

Challenges in ERM:

 Overfitting: Models with high capacity may memorize the training set, leading to poor generalization.

 Non-differentiable Loss Functions: Certain loss functions (e.g., the 0-1 loss) are not directly optimizable via gradient-based methods, so differentiable surrogate losses are minimized instead (a small sketch follows this list).

 Generalization vs. Empirical Risk: Minimizing empirical risk does not guarantee minimal generalization error.
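
A small sketch of why the 0-1 loss is problematic, using a binary label in {0, 1} and a single predicted probability (the numbers are made up for illustration): the 0-1 loss is piecewise constant, so its gradient is zero almost everywhere, whereas a differentiable surrogate such as cross-entropy provides a usable gradient signal.

import numpy as np

def zero_one_loss(p, y):
    # 1 if the thresholded prediction is wrong, else 0; flat almost everywhere, so no gradient signal.
    return float((p >= 0.5) != y)

def cross_entropy(p, y):
    # Differentiable surrogate; its gradient with respect to p is informative.
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

p, y = 0.4, 1
print(zero_one_loss(p, y))    # 1.0
print(cross_entropy(p, y))    # ~0.92, and it decreases smoothly as p moves toward 1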

02. Challenges in Neural Network Optimization

Neural network optimization involves finding parameters that minimize a cost function. Despite advances in optimization techniques, several challenges make this process complex, particularly for deep networks. Here’s a detailed explanation of the challenges:
1. Ill-Conditioning

 The Hessian matrix of the cost function may have a poor condition number, causing slow convergence.

 Even if the gradient is strong, only small learning steps are safe because of the high curvature, which slows down training.

 In practice this shows up as learning that crawls even though the gradient is still large, since larger steps would overshoot along the high-curvature directions (a toy example follows this list).
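
A toy illustration of ill-conditioning (the quadratic and its numbers are assumed for the example): with a Hessian whose condition number is 100, the step size must stay small to avoid diverging along the high-curvature direction, so progress along the low-curvature direction is very slow.

import numpy as np

H = np.diag([100.0, 1.0])      # Hessian with condition number 100
w = np.array([1.0, 1.0])
lr = 0.018                     # must stay below 2/100 or the first coordinate diverges

for _ in range(50):
    grad = H @ w               # gradient of the quadratic cost 0.5 * w^T H w
    w = w - lr * grad

print(w)                       # first coordinate is ~0, second is still around 0.4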

2. Local Minima

 Non-convex functions, like those in deep neural networks, often have numerous local minima.

 In practical neural networks, most local minima are found to have low cost, making global minima less critical.

 This challenge is less significant but still requires careful initialization and optimization strategies.

3. Saddle Points

 Saddle points, where the gradient is zero but the point is neither a minimum nor a maximum, are far more common than local minima in high-dimensional spaces.

 The presence of both positive and negative eigenvalues in the Hessian can cause gradients to oscillate or slow down progress.

 Escaping saddle points is vital for optimization efficiency (a toy example follows this list).
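
A standard toy example (not taken from the source) is f(x, y) = x^2 - y^2: the gradient vanishes at the origin, but the Hessian has one positive and one negative eigenvalue, so the origin is a saddle point rather than a minimum.

import numpy as np

def grad(x, y):
    # Gradient of f(x, y) = x^2 - y^2
    return np.array([2.0 * x, -2.0 * y])

hessian = np.array([[2.0, 0.0],
                    [0.0, -2.0]])

print(grad(0.0, 0.0))                  # [0. 0.]  -> a critical point
print(np.linalg.eigvals(hessian))      # [ 2. -2.] -> mixed signs, so a saddle, not a minimum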

4. Plateaus and Flat Regions

 These regions have low gradient magnitudes, causing optimization to stagnate.

 Traversing plateaus can take significant time, delaying convergence.

5. Cliffs and Exploding Gradients

 Sharp nonlinearities in deep networks can lead to "cliffs" where gradients are very large.

 Gradient updates in such regions can cause parameters to jump drastically, leading to instability.

 Gradient clipping is often employed to manage this issue (a sketch follows this list).
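
A minimal sketch of gradient clipping by global norm (an illustrative NumPy version; deep learning frameworks ship their own clipping utilities):

import numpy as np

def clip_by_global_norm(grads, max_norm):
    # Rescale a list of gradient arrays so their combined L2 norm is at most max_norm.
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads

grads = [np.array([300.0, -400.0])]              # norm 500, e.g. a step taken near a "cliff"
print(clip_by_global_norm(grads, max_norm=5.0))  # rescaled to [3, -4], norm 5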

6. Vanishing Gradients

 In deep networks, gradients can become exponentially small as they propagate backward.

 This prevents meaningful parameter updates, particularly in early layers.

 Proper weight initialization and activation functions like ReLU help mitigate this problem (a toy calculation follows this list).
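
A toy calculation (not a full backpropagation) showing why gradients shrink: with sigmoid activations, each layer contributes a local derivative of at most 0.25, so the product over 20 layers is tiny.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 0.0                                              # the point where sigmoid'(z) is largest
local_derivative = sigmoid(z) * (1.0 - sigmoid(z))   # = 0.25
print(local_derivative ** 20)                        # ~9e-13: early layers receive almost no gradient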

7. Poor Correspondence Between Local and Global Structures

 Local optimization steps may not align with the global cost structure.

 This can result in long, inefficient trajectories in the parameter space.

8. Inexact Gradients

 Gradient estimates are often noisy due to minibatch sampling.

 This noise can lead to suboptimal updates or instability during training.

9. Long-Term Dependencies

 In recurrent networks, repeated multiplication of weights causes gradients to either explode or vanish over time.

 This is especially problematic for learning dependencies over long sequences (a toy illustration follows this list).
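
A toy illustration (assumed for the example): in a linear recurrence h_t = w * h_(t-1), backpropagating through t steps multiplies the gradient by w^t, which vanishes if |w| < 1 and explodes if |w| > 1.

for w, label in [(0.9, "vanishes"), (1.1, "explodes")]:
    print(label, ":", w ** 100)   # 0.9**100 ~ 2.7e-5, 1.1**100 ~ 1.4e4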
03. Explanation of AdaGrad

AdaGrad (Adaptive Gradient Algorithm) is an adaptive learning rate optimization algorithm. It modifies the learning rate dynamically for each parameter during training by scaling each parameter's learning rate inversely proportionally to the square root of its accumulated sum of squared gradients (a minimal update-rule sketch appears at the end of this section).

Key Features:

1. Adaptive Learning Rates:

o Parameters with frequently large gradients see their learning rate reduced, so their updates become smaller.

o Parameters with small gradients retain higher learning rates, enabling faster updates.

2. Application:

o Especially useful for sparse datasets where some parameters update more frequently than others.

o Promotes faster convergence in such contexts.

3. Limitations:

o Gradual accumulation of squared gradients can make the effective learning rate approach zero prematurely.
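
A minimal NumPy sketch of the AdaGrad update rule described above (grad_fn, the quadratic test objective, and the default values here are assumptions for the example; delta is a small constant for numerical stability):

import numpy as np

def adagrad(grad_fn, theta, lr=0.01, delta=1e-8, steps=500):
    r = np.zeros_like(theta)                           # accumulated sum of squared gradients
    for _ in range(steps):
        g = grad_fn(theta)
        r += g * g                                     # accumulation never shrinks ...
        theta = theta - lr * g / (delta + np.sqrt(r))  # ... so the effective step keeps decreasing
    return theta

# Illustrative use: minimize f(theta) = 0.5 * ||theta||^2, whose gradient is theta.
# theta shrinks toward 0, but progress slows as the accumulated squared gradients grow.
print(adagrad(lambda th: th, np.array([1.0, -2.0])))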
04. Explanation of the Adam Algorithm

The Adam (Adaptive Moment Estimation) algorithm is a widely used optimization method in machine learning. It is an adaptive learning rate method combining the advantages of two popular techniques: Momentum and RMSProp. The name "Adam" is derived from "adaptive moments," reflecting its reliance on exponentially weighted moving averages of the gradients and their squares (a minimal update-rule sketch follows the hyperparameter list below).
Key Features of Adam:

1. Adaptive Learning Rates:

o Learning rates are dynamically adjusted for each parameter.

o Parameters whose gradients are consistently large receive smaller effective updates.

2. Incorporation of Momentum:

o Uses exponential weighting of past gradients, introducing momentum into the updates.

o Speeds up convergence and damps oscillations in the optimization path.

3. Second Moment Estimation:

o Tracks the uncentered variance of gradients (second moment).

o Helps scale updates more effectively.

4. Bias Correction:

o Corrects for the initialization bias in the moving averages of the first and second moments.
Hyperparameters:

 β1: Decay rate for the first-moment estimate (default is 0.9).

 β2: Decay rate for the second-moment estimate (default is 0.999).

 ϵ: Small constant added to the denominator for numerical stability (default is 10^-8).

 α: Global learning rate, i.e., the step size (default is 0.001).
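
A minimal NumPy sketch of the Adam update using these hyperparameters (grad_fn and the quadratic test objective are assumptions for the example):

import numpy as np

def adam(grad_fn, theta, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=3000):
    m = np.zeros_like(theta)   # first moment: exponentially weighted mean of gradients
    v = np.zeros_like(theta)   # second moment: exponentially weighted mean of squared gradients
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)   # bias correction for the first moment
        v_hat = v / (1 - beta2 ** t)   # bias correction for the second moment
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Illustrative use: minimize f(theta) = 0.5 * ||theta||^2, whose gradient is theta.
print(adam(lambda th: th, np.array([1.0, -2.0])))   # both coordinates are driven close to 0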

Advantages of Adam:

 Handles sparse gradients effectively.

 Combines the benefits of momentum and adaptive learning rates.

 Requires minimal hyperparameter tuning and performs well across diverse tasks.

Limitations:

 May generalize less effectively in some cases due to aggressive learning rate adjustments.

 Sensitive to initial learning rate choices in specific scenarios.
