
Milestone 1

Computational Intelligence
Names
Omar Emad EL-din Helmy 1900808

Aml Gamal Mohamed 1900339

Aya Ahmed Helmy 1900201

Under the supervision of
Dr. Hossam Eldin Hassan
Eng. Yehia Zakr
CONTENTS

Names
Problem definition and importance
Methods and Algorithms
    1- Stochastic Gradient Descent
    2- Gradient Descent with Momentum
    3- Adagrad optimizer
    4- Adam optimizer
    5- Gradient descent with adaptive learning rate
Experimental Results
    1- SGD
    2- Gradient descent with momentum
    3- Adagrad
    4- Adam
    5- Gradient descent with adaptive learning rate
Discussions
    Test accuracy with different optimizers
    Training loss
    Optimizer performance
    Learning rate adaptation
Appendix A
PROBLEM DEFINITION AND IMPORTANCE

The concept of optimization is integral to machine learning. Most machine learning


models use training data to learn the relationship between input and output data.
The models can then be used to make predictions about trends or classify new
input data. This training is a process of optimization, as each iteration aims to
improve the model’s accuracy and lower the margin of error.

The process of optimization aims to lower the risk of errors or loss from these
predictions and improve the accuracy of the model. Without the process of
optimization, there would be no learning and development of algorithms. So, the very
premise of machine learning relies on a form of function optimization.

Hyperparameters, such as the learning rate or the number of clusters in a classification task, are a way of refining a model to fit a specific dataset, and setting them well is vital to achieving an accurate model. The selection of the right model configuration has a direct impact on the accuracy of the model and its ability to achieve specific tasks.

METHODS AND ALGORITHMS
1-STOCHASTIC GRADIENT DESCENT

Stochastic Gradient Descent (SGD) is a variant of the Gradient Descent algorithm that is used for optimizing machine learning models. It addresses the computational inefficiency of traditional Gradient Descent methods when dealing with large datasets in machine learning projects.
Advantages: SGD is faster than other variants of Gradient Descent; it is memory-efficient and can handle large datasets.
Disadvantages: The updates in SGD are noisy and have high variance, so SGD may require more iterations to converge to the minimum.
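As a minimal illustration (not the report's code), a single SGD update can be sketched in NumPy as follows; the quadratic toy loss used here is purely an assumption for demonstration:

import numpy as np

def sgd_step(w, grad, lr=0.01):
    """One plain SGD update: move the weights a small step against the gradient."""
    return w - lr * grad

# Toy usage on f(w) = ||w||^2, whose gradient is 2w; the minimum is at the origin.
w = np.array([1.0, -2.0])
for _ in range(100):
    w = sgd_step(w, 2 * w, lr=0.1)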
2-GRADIENT DESCENT WITH MOMENTUM

Momentum-based gradient descent adds a momentum term to the update rule. The momentum term is computed as a moving average of the past gradients, which helps to accelerate the optimization process.
Advantages: Helps escape local minima and saddle points; smooths the noisy updates of plain SGD and speeds up convergence.
Disadvantage: If the momentum is too large, the updates can overshoot and oscillate back and forth around the minimum.
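A rough sketch of one common form of the momentum update (the velocity convention below is an assumption; libraries differ slightly in how the learning rate enters):

import numpy as np

def momentum_step(w, v, grad, lr=0.01, beta=0.9):
    """Momentum update: v is an exponentially weighted sum of past gradients."""
    v = beta * v + grad   # accumulate gradient history
    w = w - lr * v        # step along the smoothed direction
    return w, v

# Toy usage on f(w) = ||w||^2 (gradient 2w), starting with zero velocity.
w, v = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(100):
    w, v = momentum_step(w, v, 2 * w, lr=0.05)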
3-ADAGRAD OPTIMIZER

Adagrad (Adaptive Gradient Algorithm) is an optimization algorithm that adapts the learning rate of each parameter during training based on the historical gradient information.
Advantages: Adjusts the learning rate individually for each parameter.
Disadvantages: Tends to reduce the learning rates over time, potentially leading to slow convergence, and its performance can be sensitive to the choice of the initial learning rate.
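A minimal sketch of the per-parameter Adagrad update; the accumulator variable and toy loss are illustrative, not the report's implementation:

import numpy as np

def adagrad_step(w, acc, grad, lr=0.001, eps=1e-7):
    """Adagrad: scale each parameter's step by its accumulated squared gradients."""
    acc = acc + grad ** 2                      # per-parameter history of squared gradients
    w = w - lr * grad / (np.sqrt(acc) + eps)   # parameters with large past gradients get smaller steps
    return w, acc

# Toy usage on f(w) = ||w||^2 (gradient 2w).
w, acc = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(100):
    w, acc = adagrad_step(w, acc, 2 * w, lr=0.1)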

4-ADAM OPTIMIZER

The Adam optimizer, short for Adaptive Moment Estimation, is a powerful optimization algorithm that automatically adjusts the learning rate for each parameter individually.
Advantages: Faster convergence; well suited to large models.
Disadvantages: Higher memory usage, and the updates made by Adam are less interpretable compared to other optimizers.
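A compact sketch of the Adam update with bias correction; the variable names (m, v, t) follow the usual presentation of the algorithm rather than the report's notebook:

import numpy as np

def adam_step(w, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """Adam: bias-corrected moving averages of the gradient and its square."""
    m = beta1 * m + (1 - beta1) * grad        # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment (mean of squared gradients)
    m_hat = m / (1 - beta1 ** t)              # bias correction; t counts steps from 1
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Toy usage on f(w) = ||w||^2 (gradient 2w).
w, m, v = np.array([1.0, -2.0]), np.zeros(2), np.zeros(2)
for t in range(1, 101):
    w, m, v = adam_step(w, m, v, 2 * w, t, lr=0.1)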
5-GRADIENT DESCENT WITH ADAPTIVE LEARNING RATE.
Gradient descent is an iterative first-order optimization algorithm used to find a local minimum/maximum of a given function. This method is commonly used in machine learning (ML) and deep learning (DL) to minimize a cost/loss function (e.g. in a linear regression), due to its importance and ease of implementation. Gradient descent depends on several parameters, one of them being the learning rate. One way to update the learning rate throughout the training process is the adaptive learning rate:

It is an optimization technique for gradient descent that aims to reach a suitable learning rate. The whole idea is to shrink the learning rate whenever the loss starts to increase, so the iterates settle toward the minimum. It can be done with the following steps (a code sketch of this procedure follows the steps):

1- First, set a large initial value for the learning rate η and choose a decay factor β = 0.7.

2- Then compute the new gradient descent step:

Xₙ₊₁ = Xₙ − η ∇f(Xₙ)

3- Check whether the loss of the new step is larger than the previous one; if so, multiply the learning rate by β, update it, and go back to step 2, repeating until convergence:

if f(Xₙ₊₁) > f(Xₙ), then η = β · η and go to step 2 again.
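The steps above can be sketched as the loop below. This is one reading of the procedure (the step is always taken and only the learning rate is decayed), with a toy quadratic used purely for illustration:

import numpy as np

def adaptive_lr_gd(f, grad_f, x, lr=1.5, beta=0.7, steps=100):
    """Gradient descent that multiplies the learning rate by beta whenever the loss increases."""
    prev_loss = np.inf
    for _ in range(steps):
        x = x - lr * grad_f(x)      # step 2: gradient descent update
        loss = f(x)
        if loss > prev_loss:        # step 3: loss went up, so decay the learning rate
            lr = beta * lr
        prev_loss = loss
    return x, lr

# Toy usage: minimize f(x) = x^2, deliberately starting from an oversized learning rate.
x_opt, final_lr = adaptive_lr_gd(lambda x: x * x, lambda x: 2 * x, x=5.0)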

EXPERIMENTAL RESULTS
All models were trained with:
• num_epochs = 5
• batch_size = 16

1- SGD
Using the default parameters of the optimizer function, which are listed below (a sketch of the full training setup follows the list):
• learning_rate = 0.01
• momentum = 0.0
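The actual notebook is linked in Appendix A; as a stand-in, the sketch below shows how such a trial could be set up in Keras, which the parameter names above suggest was used. The MLP architecture, preprocessing, and MNIST loading here are illustrative assumptions, not the report's exact code:

import tensorflow as tf

# Illustrative data and model; the report's actual architecture may differ.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# SGD with the listed defaults; later trials swap in a different optimizer here.
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.0)

model.compile(optimizer=optimizer,
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, batch_size=16,
          validation_data=(x_test, y_test))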

TRIAL 1:

TRIAL 2:

RESULT:

2- GRADIENT DESCENT WITH MOMENTUM
Using the default learning rate of the optimizer function:
• learning_rate = 0.01
and changing the momentum to 0.9 (see the optimizer snippet below):
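Assuming the Keras setup sketched in the SGD section, this trial would only change the optimizer line, for example:

import tensorflow as tf

# Same SGD optimizer as before, but with the momentum term switched on.
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)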

TRIAL 1:

TRIAL 2:

RESULT:

3- ADAGRAD
Using the default parameters of the optimizer function, which are listed below (see the optimizer snippet after the list):
• learning_rate=0.001
• initial_accumulator_value=0.1
• epsilon=1e-7
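Under the same assumed Keras setup, these defaults correspond to an instantiation like:

import tensorflow as tf

# Adagrad with the listed default hyperparameters.
optimizer = tf.keras.optimizers.Adagrad(learning_rate=0.001,
                                        initial_accumulator_value=0.1,
                                        epsilon=1e-7)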

TRIAL 1:

TRIAL 2:

RESULT:

4- ADAM
Using the default parameters of the optimizer function, which are listed below (see the optimizer snippet after the list):
• learning_rate=0.001
• beta_1=0.9
• beta_2=0.999
• epsilon=1e-7
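Again assuming the Keras setup from the SGD section, the listed defaults map onto:

import tensorflow as tf

# Adam with the listed default hyperparameters.
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9,
                                     beta_2=0.999, epsilon=1e-7)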

TRIAL 1:

TRIAL 2:

RESULT:

5- GRADIENT DESCENT WITH ADAPTIVE LEARNING RATE


Setting the initial_learning_rate to 0.01, the previous loss to infinity, and beta to 0.7 (the training loop is sketched below) leads to the following results:
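One way this schedule could be wired into training is a manual loop like the sketch below. It reuses the model and data from the SGD setup sketch, and the batch-wise loss comparison is an assumption about how the notebook applies the rule:

import numpy as np
import tensorflow as tf

# Reuses `model`, `x_train`, `y_train` from the setup sketch in the SGD section.
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(16)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

lr, beta, prev_loss = 0.01, 0.7, np.inf
optimizer = tf.keras.optimizers.SGD(learning_rate=lr)

for epoch in range(5):
    for x_batch, y_batch in train_dataset:
        with tf.GradientTape() as tape:
            loss = loss_fn(y_batch, model(x_batch, training=True))
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))

        if float(loss) > prev_loss:              # loss increased: decay the learning rate
            lr *= beta
            optimizer.learning_rate.assign(lr)
        prev_loss = float(loss)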

TRIAL 1:

RESULT:

DISCUSSIONS
TEST ACCURACY WITH DIFFERENT OPTIMIZERS:
• SGD: 97.03%
• GD with momentum: 98.12%
• Adagrad: 93.86%
• Adam: 97.91%
• GD with adaptive learning rate: 97.28%

TRAINING LOSS:
• GD with momentum seems to have the lowest final loss.

OPTIMIZER PERFORMANCE:
• All optimizers achieved high accuracy on the MNIST dataset (>93%), indicating the
model's effectiveness.
• GD with momentum reached the highest accuracy (98.12%) and lowest final loss,
suggesting better optimization compared to others.

• Adam and GD with adaptive learning rate also achieved similar high accuracy.
• Adagrad showed slightly lower accuracy.

LEARNING RATE ADAPTATION:


• Implementing a custom learning rate scheduler with adaptive decay helped to improve
optimization for SGD.
• This resulted in faster training time and slightly better accuracy compared to standard
SGD.

APPENDIX A:
The code is in the notebook at the following link:
https://colab.research.google.com/drive/1PFgIRXACd-9X-O_oWbMPnNNSLqo9UFJX?usp=sharing

