Activations, Loss Functions & Optimizers in ML

The document discusses activations, loss functions, and optimizers used in machine learning models. It begins by explaining how activation functions introduce non-linearity in neural networks and discusses common activation functions like sigmoid, tanh, and ReLU. It then explains that loss functions measure the error between predictions and targets and discusses common loss functions for regression and classification. Finally, it provides an overview of optimization algorithms like gradient descent, stochastic gradient descent, and adaptive learning rate methods like Adam that are used to minimize loss functions during training.

Uploaded by

Aniket Dhar

Activations, Loss Functions & Optimizers in ML
Aniket Dhar
RWS DataLab
A Neural Network Model in Keras : An Example
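# Imports assume the Keras API bundled with TensorFlow.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Conv2D, MaxPooling2D, BatchNormalization,
                                     Dropout, Flatten, Dense)

# input_shape and num_classes are assumed to be defined for the dataset at hand,
# e.g. input_shape = (28, 28, 1) and num_classes = 10 (illustrative values only).

# Two convolutional blocks followed by a softmax classifier.
model = Sequential()
model.add(Conv2D(64, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(1, 1)))
model.add(BatchNormalization())
model.add(Dropout(0.2))
model.add(Conv2D(64, kernel_size=(3, 3), activation='relu', padding='same'))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(1, 1)))
model.add(BatchNormalization())
model.add(Flatten())
model.add(Dropout(0.7))
model.add(Dense(num_classes, activation='softmax'))

# Loss, optimizer and metric are chosen at compile time.
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])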

Activations
What are Activation Functions?

● The activation function of a node defines its output for a given input or set of inputs
● Also called a transfer function

Activation functions can be broadly divided into 2 types:

● Linear Activation Function (not very useful)

● Non-linear Activation Functions
Activations
Why use Activation functions in Neural Networks?

In artificial neural networks, we calculate the output of each layer as

Y = f(X * W + B)

Y = output, X = input
W = weights, B = bias
f = activation function

and feed it as input to the next layer.
The activation function determines whether, and how strongly, a given neuron fires.
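As a minimal sketch of the layer formula Y = f(X * W + B), the snippet below computes one layer's output in plain Python. The function name `dense_forward`, the choice of ReLU for f, and all numbers are assumptions made only for illustration.

```python
def dense_forward(x, W, b, f):
    """Compute f(x . W + b) for a single input vector x.

    x: list of n inputs, W: n rows by m columns, b: list of m biases.
    """
    z = [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
         for j in range(len(b))]
    return [f(v) for v in z]


def relu(v):
    return max(0.0, v)


# Hypothetical values, chosen only to make the arithmetic easy to follow.
x = [1.0, 2.0]
W = [[0.5, -1.0],
     [0.25, 0.5]]
b = [0.0, -0.5]

y = dense_forward(x, W, b, relu)
print(y)  # first unit fires (pre-activation 1.0), second is clipped to 0 (pre-activation -0.5)
```

The returned list `y` is what would be fed as input to the next layer.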
Activations
Linear or Identity Activation Function

A = cx

can be used for linear regression models

Drawbacks:

● has a constant gradient c, so the update in backpropagation (BP) does not depend on the input

● limited power: a stack of layers with linear activations collapses into a single linear layer
● does not perform well for complex problems
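The limited power of linear activations can be checked numerically: two stacked layers with A = cx behave exactly like one linear layer. This is a sketch with scalar layers; the weights and biases are hypothetical values picked so the algebra is exact.

```python
def linear_layer(x, w, b, c=1.0):
    """One scalar layer with the linear activation A = c * z, where z = w * x + b."""
    return c * (w * x + b)


def two_linear_layers(x):
    h = linear_layer(x, w=2.0, b=1.0)
    return linear_layer(h, w=3.0, b=-4.0)


def equivalent_single_layer(x):
    # 3 * (2x + 1) - 4 simplifies to 6x - 1, i.e. a single linear layer.
    return linear_layer(x, w=6.0, b=-1.0)


for x in (-2.0, 0.0, 1.5):
    print(x, two_linear_layers(x), equivalent_single_layer(x))
```

However many such layers are stacked, the result stays a straight line in x, which is why a non-linear f is needed between layers.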
Activations
Why do we need Non-Linearities?

Neural networks are considered universal function approximators

● Non-linearities give the network the ability to learn complex functions

● They let it represent arbitrary, complex, non-linear functional mappings between inputs and outputs

Non-Linear Activation Functions:

1. Sigmoid or Logistic
2. Tanh - Hyperbolic tangent
3. ReLU - Rectified linear units
Activations
Sigmoid Function

Drawbacks:

● output isn’t zero centered

● vanishing gradient problem
● slow convergence
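The slide's formula did not survive extraction, so the sketch below assumes the standard logistic definition, sigmoid(x) = 1 / (1 + e^(-x)), whose derivative is sigmoid(x) * (1 - sigmoid(x)). It illustrates the first two drawbacks: the output is always positive, and the gradient shrinks quickly as |x| grows.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)


for x in (0.0, 5.0, 10.0):
    # The gradient peaks at 0.25 for x = 0 and decays towards 0 for large |x|.
    print(x, sigmoid(x), sigmoid_grad(x))
```

Multiplying many such small gradients across layers during backpropagation is what makes the gradient vanish.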
Activations
Hyperbolic Tangent Function (Tanh)

solves the output range problem: the output lies in (-1, 1) and is zero-centered

Drawbacks:

● vanishing gradient problem

Activations
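ReLU: Rectified Linear Units

A(x) = max(0, x)

The range of ReLU is [0, inf), so activations are unbounded and can blow up.

ReLU mitigates the vanishing gradient problem, since its gradient is 1 for all positive inputs.

Drawbacks:

● Dead neurons: a neuron whose input is always negative outputs 0 and receives zero gradient, so it stops learning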

Activations
Modifications to solve the dead neuron problem

● Leaky ReLU
● Maxout
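A minimal sketch of ReLU next to Leaky ReLU, showing why the leaky variant avoids dead neurons: for negative inputs its gradient is a small non-zero slope instead of 0. The slope `alpha=0.01` is an assumed, commonly used value, not one given on the slides; Maxout is not covered here.

```python
def relu(x):
    return max(0.0, x)


def relu_grad(x):
    # Zero gradient for non-positive inputs: the source of "dead" neurons.
    return 1.0 if x > 0 else 0.0


def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x


def leaky_relu_grad(x, alpha=0.01):
    # A small but non-zero gradient keeps the neuron trainable.
    return 1.0 if x > 0 else alpha


for x in (-2.0, 3.0):
    print(x, relu(x), relu_grad(x), leaky_relu(x), leaky_relu_grad(x))
```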
Activations
Softmax Function

● Each calculated probability lies in the range 0 to 1

● The probabilities sum to 1
● Handy for multiple classes
● Gives the probability of belonging to a particular class
● Used in the output layer for multiclass classification
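The properties listed above can be verified with a short sketch. It assumes the standard definition softmax(z)_i = e^(z_i) / sum_j e^(z_j); subtracting the maximum score first is a common numerical-stability trick and does not change the result. The input scores are hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)  # shift by the max so exp() cannot overflow
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))  # each value is in (0, 1) and they sum to 1
```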
Loss Functions
What is a Loss Function?

● A method of evaluating how well your algorithm models your dataset

● Measures the difference between the target and the prediction
● Also known as Cost Function / Objective Function

An optimization problem seeks to minimize a loss function.

**For gradient-based optimization, a loss function must be differentiable
Loss Functions : Regression loss functions
Used when the target variable is continuous (e.g. regression problems).

The most widely used regression loss function is Mean Square Error.

Other loss functions are:

1. Absolute Error: measures the mean absolute value of the element-wise difference between prediction and target;

2. Smooth Absolute Error: a smooth version of the Absolute Error.
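The three regression losses can be sketched as follows. The "smooth" version is assumed here to be the usual smooth L1 form (quadratic for errors below 1, linear above), which may differ in detail from what the slides intended; the sample targets and predictions are made up.

```python
def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def smooth_l1(y_true, y_pred):
    def term(d):
        d = abs(d)
        # Quadratic near zero (smooth at d = 0), linear for large errors.
        return 0.5 * d * d if d < 1.0 else d - 0.5
    return sum(term(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [1.0, 2.0, 3.0]
y_pred = [1.5, 2.0, 5.0]
print(mse(y_true, y_pred), mae(y_true, y_pred), smooth_l1(y_true, y_pred))
```

Note how the single large error (3.0 vs 5.0) dominates the mean square error far more than the two absolute-error variants.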

Loss Functions : Classification loss functions
The model usually outputs a probability value.
The magnitude of the score represents the confidence of the prediction.
The target variable is a binary variable (1 for true and -1/0 for false).

Some classification loss functions are:

1. Binary/ Categorical Cross Entropy

2. Negative Log Likelihood

3. Margin Classifier

4. Soft Margin Classifier

Loss Functions : Embedding loss functions
Deal with problems where we have to measure whether two inputs are similar or dissimilar. Some examples are:

1. L1 Hinge Error: calculates the L1 distance between two inputs.

2. Cosine Error: cosine distance between two inputs.

Cross Entropy Loss / Log Loss:
● measures the performance of a classification model whose output is a probability value between 0 and 1
● increases as the predicted probability diverges from the actual label

H(y, p) = − ∑_i y_i log(p_i)

y = label, p = prediction
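A small sketch of the cross-entropy formula above for a one-hot label. The clipping constant `eps` is an added assumption to avoid log(0); the example probabilities are hypothetical.

```python
import math


def cross_entropy(y, p, eps=1e-12):
    """H(y, p) = -sum_i y_i * log(p_i), with p clipped away from 0."""
    return -sum(yi * math.log(max(pi, eps)) for yi, pi in zip(y, p))


y = [0.0, 1.0, 0.0]                         # true class is index 1
confident = cross_entropy(y, [0.1, 0.8, 0.1])
unsure = cross_entropy(y, [0.6, 0.2, 0.2])
print(confident, unsure)  # the loss grows as the prediction diverges from the label
```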
Optimization Process
Propagate backwards through the network, carrying the error terms and updating the weight values using optimizer algorithms.

Calculate the gradient of the error function (E) with respect to the weights (W), and update the weights in the opposite direction of the gradient.
Fig. Updating weights through gradient descent
Optimizers
What are Optimization Algorithms ?

Optimization algorithms help us minimize (or maximize) an objective function.

They update the internal parameters of a model, i.e. its weights and biases, to reduce the prediction error.

They can be divided into two categories:

● Constant Learning Rate Algorithms (SGD)

● Adaptive Learning Rate Algorithms (Adagrad, Adadelta, RMSprop, Adam)
Optimizers : Gradient Descent
Parameter(θ) update formula:

θ = θ − η⋅∇J(θ) ; η = learning rate, ∇J(θ) = gradient of loss function J(θ)

One of the most popular algorithms used in optimizing neural networks

Drawbacks:

● computes the gradient over the whole dataset to perform just one update
● very slow and hard to control for large datasets
● performs redundant computations for large datasets, since gradients for similar examples are recomputed before each update
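The update rule θ = θ − η⋅∇J(θ) can be sketched on a toy one-parameter loss. The loss J(θ) = (θ − 3)², the learning rate and the step count are assumptions chosen for illustration; a real model would compute the gradient over the whole training set at every step.

```python
def grad_J(theta):
    # Gradient of the toy loss J(theta) = (theta - 3)^2.
    return 2.0 * (theta - 3.0)


def gradient_descent(theta, lr=0.1, steps=100):
    for _ in range(steps):
        theta = theta - lr * grad_J(theta)  # step against the gradient
    return theta


theta_final = gradient_descent(0.0)
print(theta_final)  # approaches the minimiser theta = 3
```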
Optimizers : Stochastic Gradient Descent (SGD)
Parameter(θ) update formula:

θ = θ − η⋅∇J(θ, x(i) ,y(i)) ; where {x(i) ,y(i)} are the training examples

performs a parameter update for each training example, so it is usually much faster

Drawbacks:

● because of these frequent updates, the parameter updates have high variance

● causes the loss function to fluctuate heavily
● keeps overshooting because of the frequent fluctuations
Optimizers : Mini Batch Gradient Descent
● performs an update for every batch, with ‘n’ training examples in each batch
● reduces the variance in the parameter updates
● batch sizes can vary according to the problem

Drawbacks:

● choosing a proper learning rate (LR) is difficult; too low: slow convergence; too high: oscillations

● same learning rate applies to all parameter updates
● can get trapped in local minima and especially at ‘saddle points’
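Batch, stochastic and mini-batch gradient descent differ only in how many examples feed each update, which the sketch below exposes as a `batch_size` parameter. The model (y = w * x), the data and the hyperparameters are hypothetical; the data is noise-free, so all three settings reach the same answer here, whereas on noisy data the small-batch runs would fluctuate more.

```python
import random


def minibatch_gd(xs, ys, batch_size, lr=0.05, epochs=200, seed=0):
    """Fit y = w * x by minimising mean squared error.

    batch_size=1 is SGD; batch_size=len(xs) is batch gradient descent.
    """
    rng = random.Random(seed)
    w = 0.0
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            # Gradient of mean((w * x - y)^2) over the batch with respect to w.
            grad = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad
    return w


xs = [0.5, 1.0, 1.5, 2.0]
ys = [1.0, 2.0, 3.0, 4.0]  # generated by y = 2x

for bs in (1, 2, 4):
    print(bs, minibatch_gd(xs, ys, batch_size=bs))
```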
Optimizers : Momentum
Parameter(θ) update formula:

V(t) = γV(t−1) + η∇J(θ) ; V(t) = update vector at time t

θ = θ − V(t) ; γ = momentum term, usually set to 0.9

Faster and more stable convergence

Reduced oscillations in irrelevant directions

Drawbacks:

● may overshoot the minimum and climb back up the loss surface because of the accumulated momentum

● same learning rate applies to all parameter updates
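A minimal sketch of the momentum update V(t) = γV(t−1) + η∇J(θ), θ = θ − V(t), again on the assumed toy loss J(θ) = (θ − 3)². The learning rate and step count are illustrative choices; γ = 0.9 follows the slide.

```python
def momentum_gd(grad, theta, lr=0.1, gamma=0.9, steps=500):
    v = 0.0  # update vector V(t)
    for _ in range(steps):
        v = gamma * v + lr * grad(theta)
        theta = theta - v
    return theta


def grad_J(theta):
    # Gradient of the toy loss J(theta) = (theta - 3)^2.
    return 2.0 * (theta - 3.0)


theta_final = momentum_gd(grad_J, theta=0.0)
print(theta_final)  # overshoots 3 along the way, then settles on it
```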
Optimizers : Nesterov Accelerated Gradient
Parameter(θ) update formula:

V(t) = γV(t−1) + η∇J(θ − γV(t−1)) ; V(t) = update vector at time t

θ = θ − V(t) ; γ = momentum term, η = learning rate

First make a big jump based on the previous momentum, then calculate the gradient at that point and make a correction, which results in the parameter update.

θ − γV(t−1) gives an approximation of the next position of the parameters

Drawbacks:

● same learning rate applies to all parameter updates
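The only change from plain momentum is where the gradient is evaluated: at the look-ahead point θ − γV(t−1) rather than at θ. The sketch below uses the assumed toy loss J(θ) = (θ − 3)² and illustrative hyperparameters.

```python
def nesterov_gd(grad, theta, lr=0.1, gamma=0.9, steps=300):
    v = 0.0
    for _ in range(steps):
        lookahead = theta - gamma * v         # approximate next position
        v = gamma * v + lr * grad(lookahead)  # gradient taken at the look-ahead point
        theta = theta - v
    return theta


def grad_J(theta):
    # Gradient of the toy loss J(theta) = (theta - 3)^2.
    return 2.0 * (theta - 3.0)


theta_final = nesterov_gd(grad_J, theta=0.0)
print(theta_final)
```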

Optimizers : Adagrad
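Parameter(θ) update formula:

θ(t+1, i) = θ(t, i) − η / √(G(t, ii) + ϵ) ⋅ g(t, i) ; g(t, i) = gradient for parameter θ(i) at time step t, G(t, ii) = sum of the squares of the past gradients for θ(i)

Uses a different learning rate for every parameter θ(i) at every time step t: the general learning rate η is scaled based on the past gradients that have been computed for θ(i).

Drawbacks:

● decaying learning rate (η) problem: the accumulated squared gradients keep growing, so the effective learning rate shrinks and eventually becomes vanishingly small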
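The per-parameter scaling can be sketched as below, assuming the standard Adagrad rule in which each parameter's step is η divided by the square root of its own accumulated squared gradients (plus a small ϵ). The two-parameter toy loss and the hyperparameters are hypothetical.

```python
import math


def adagrad(grad, theta, lr=0.5, eps=1e-8, steps=200):
    theta = list(theta)
    G = [0.0] * len(theta)  # per-parameter sum of squared gradients
    for _ in range(steps):
        g = grad(theta)
        for i in range(len(theta)):
            G[i] += g[i] ** 2
            theta[i] -= lr / math.sqrt(G[i] + eps) * g[i]
    # Effective learning rate each parameter ends up with.
    rates = [lr / math.sqrt(Gi + eps) for Gi in G]
    return theta, rates


def grad(theta):
    # Gradient of the toy loss (theta0 - 3)^2 + 10 * (theta1 + 1)^2.
    return [2.0 * (theta[0] - 3.0), 20.0 * (theta[1] + 1.0)]


theta, rates = adagrad(grad, [0.0, 0.0])
print(theta, rates)  # both effective rates have decayed below lr, by different amounts
```

The steeper parameter accumulates larger squared gradients and therefore ends with the smaller effective rate, which is the decaying learning rate behaviour listed as a drawback.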

Optimizers : AdaDelta / RMSProp
Parameter(θ) update formula:

E[g²](t) = γ⋅E[g²](t−1) + (1−γ)⋅g²(t)

Instead of accumulating all past squared gradients, limits the window of accumulated past gradients to some fixed size w, implemented as an exponentially decaying running average.

The running average E[g²](t) at time step t then depends only on the previous average and the current gradient.

Drawbacks:

● does not keep a per-parameter momentum (first-moment) estimate of the gradients
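A sketch of the RMSProp variant only: the running average E[g²] from the formula above replaces Adagrad's ever-growing sum, and the step is η⋅g / √(E[g²] + ϵ). Adadelta, which additionally replaces η with a running average of past updates, is not implemented here. The toy loss and hyperparameters are assumptions.

```python
import math


def rmsprop(grad, theta, lr=0.05, gamma=0.9, eps=1e-8, steps=500):
    avg_sq = 0.0  # running average E[g^2]
    for _ in range(steps):
        g = grad(theta)
        avg_sq = gamma * avg_sq + (1.0 - gamma) * g * g
        theta = theta - lr * g / math.sqrt(avg_sq + eps)
    return theta


def grad_J(theta):
    # Gradient of the toy loss J(theta) = (theta - 3)^2.
    return 2.0 * (theta - 3.0)


theta_final = rmsprop(grad_J, theta=0.0)
print(theta_final)  # ends close to 3; with a constant lr it hovers near the minimum
```

Because the step size is normalised by the recent gradient magnitude, the iterate with a fixed learning rate tends to hover within roughly one step of the minimum rather than converge exactly.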

Optimizers : Adam(Adaptive Moment Estimation)
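M(t) and V(t) are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients, respectively:

M(t) = β1⋅M(t−1) + (1−β1)⋅g(t)
V(t) = β2⋅V(t−1) + (1−β2)⋅g²(t)

With the bias-corrected estimates M̂(t) = M(t) / (1−β1^t) and V̂(t) = V(t) / (1−β2^t), the final formula for the parameter update is:

θ = θ − η⋅M̂(t) / (√V̂(t) + ϵ)

The suggested default values are β1 = 0.9, β2 = 0.999, and ϵ = 10^(−8).

The Adam optimizer is currently the usual recommendation for most learning problems.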
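A minimal sketch of Adam, assuming the standard update with bias-corrected first and second moment estimates and the default β1, β2 and ϵ quoted above. The toy loss J(θ) = (θ − 3)², the learning rate and the step count are illustrative assumptions.

```python
import math


def adam(grad, theta, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    m = 0.0  # first moment estimate (mean of gradients)
    v = 0.0  # second moment estimate (uncentered variance of gradients)
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)  # bias correction for the zero initialisation
        v_hat = v / (1.0 - beta2 ** t)
        theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


def grad_J(theta):
    # Gradient of the toy loss J(theta) = (theta - 3)^2.
    return 2.0 * (theta - 3.0)


theta_final = adam(grad_J, theta=0.0)
print(theta_final)
```

In Keras the same optimizer is selected with `optimizer='adam'` in `model.compile`, as in the example model at the start of the deck.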
Optimizers : A Comparison
“Insofar, RMSprop, Adadelta, and Adam are very similar algorithms that do well in similar circumstances. […] its bias-correction helps Adam slightly outperform RMSprop towards the end of optimization as gradients become sparser. Insofar, Adam might be the best overall choice.”
- “An overview of gradient descent optimization algorithms”, 2016, Sebastian Ruder

“In practice Adam is currently recommended as the default algorithm to use, and often works slightly better than RMSProp. However, it is often also worth trying SGD+Nesterov Momentum as an alternative.”
- “CS231n: Convolutional Neural Networks for Visual Recognition”, Andrej Karpathy, et al.

Fig. Comparison of Adam to other optimization algorithms training a multilayer perceptron. Taken from “Adam: A Method for Stochastic Optimization”, 2015.

Optimizers : Visualisation

Fig. optimization on loss surface contours
Fig. optimization on saddle points

Thank You
