Activations, Loss Functions & Optimizers in ML

The document discusses activations, loss functions, and optimizers used in machine learning models. It begins by explaining how activation functions introduce non-linearity in neural networks and discusses common activation functions like sigmoid, tanh, and ReLU. It then explains that loss functions measure the error between predictions and targets and discusses common loss functions for regression and classification. Finally, it provides an overview of optimization algorithms like gradient descent, stochastic gradient descent, and adaptive learning rate methods like Adam that are used to minimize loss functions during training.

Uploaded by

Aniket Dhar
Activations, Loss Functions & Optimizers in ML
Aniket Dhar
RWS DataLab
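A Neural Network Model in Keras: An Example
# Assumes TensorFlow's bundled Keras; input_shape and num_classes must be defined for your dataset.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import BatchNormalization, Conv2D, Dense, Dropout, Flatten, MaxPooling2D

model = Sequential()
# Two convolution blocks: conv -> max-pool -> batch norm
model.add(Conv2D(64, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(1, 1)))
model.add(BatchNormalization())
model.add(Dropout(0.2))
model.add(Conv2D(64, kernel_size=(3, 3), activation='relu', padding='same'))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(1, 1)))
model.add(BatchNormalization())
model.add(Flatten())
model.add(Dropout(0.7))
# Softmax output gives one probability per class
model.add(Dense(num_classes, activation='softmax'))

# Categorical cross-entropy loss, minimized with the Adam optimizer
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])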


Activations
What are Activation Functions?

● The activation function of a node defines its output given an input or set of inputs
● Also called a transfer function

Activation functions can be broadly divided into 2 types:

● Linear activation functions (not very useful)
● Non-linear activation functions
Activations
Why use Activation Functions in Neural Networks?

In artificial neural networks, we calculate the output of each layer as

Y = f(X * W + B)

Y = output, X = input
W = weights, B = bias
f = activation function

and feed it as the input to the next layer.
The activation function determines whether, and how strongly, a given neuron fires.
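A minimal sketch of the layer computation Y = f(X * W + B) in plain Python, assuming a single input vector and ReLU as the activation f. The helper name dense_forward and all numbers are illustrative, not taken from the slides.

```python
def dense_forward(x, weights, bias, activation):
    """Compute f(x * W + b) for one input vector x.

    weights[i][j] connects input i to output unit j.
    """
    outputs = []
    for j in range(len(bias)):
        z = sum(x[i] * weights[i][j] for i in range(len(x))) + bias[j]
        outputs.append(activation(z))
    return outputs


def relu(z):
    return max(0.0, z)


x = [1.0, 2.0]            # input X
W = [[0.5, -1.0],         # weights W (2 inputs x 2 units)
     [0.25, 0.5]]
b = [0.0, -0.5]           # bias B
y = dense_forward(x, W, b, relu)
print(y)  # the second unit has a negative pre-activation, so it does not fire
```

The output y would then be passed as the input x of the next layer.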
Activations
Linear or Identity Activation Function

A = cx

Can be used for linear regression models.

Drawbacks:

● has a constant gradient c, so the backpropagated gradient does not depend on the input
● limited power: a stack of linear layers collapses into a single linear layer, so adding layers adds no expressiveness
● does not perform well for complex problems
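To illustrate the limited power of a linear activation, here is a small sketch with scalar "layers". The weights and biases are made-up values; the point is only that two stacked linear layers give the same output as one linear layer.

```python
import math


def linear_layer(x, w, b, c=1.0):
    """One scalar layer with the linear activation A = c * (w * x + b)."""
    return c * (w * x + b)


w1, b1 = 2.0, 1.0      # hypothetical first-layer parameters
w2, b2 = -3.0, 0.5     # hypothetical second-layer parameters


def two_layers(x):
    return linear_layer(linear_layer(x, w1, b1), w2, b2)


# The equivalent single layer: w = w2 * w1, b = w2 * b1 + b2
w_eq = w2 * w1
b_eq = w2 * b1 + b2


def one_layer(x):
    return linear_layer(x, w_eq, b_eq)


for x in (-2.0, 0.0, 3.5):
    assert math.isclose(two_layers(x), one_layer(x))
```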
Activations
Why do we need Non-Linearities?

Neural networks are considered universal function approximators. Non-linear activations:

● give the network the ability to learn complex patterns
● let it represent arbitrary, complex non-linear mappings between inputs and outputs

Non-Linear Activation Functions:

1. Sigmoid or Logistic
2. Tanh - Hyperbolic tangent
3. ReLU - Rectified linear unit
Activations
Sigmoid Function

Drawbacks:

● output is not zero-centered
● vanishing gradient problem
● slow convergence
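A short sketch of the sigmoid and its derivative, using the standard definition 1 / (1 + e^-x) (the formula itself is not shown on the slide). It illustrates the drawbacks above: the output is always positive, and the gradient shrinks towards 0 for large |x|.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, 1 / (1 + e^-x); the output lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid: s * (1 - s), at most 0.25 (at x = 0)."""
    s = sigmoid(x)
    return s * (1.0 - s)


for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"x={x:6.1f}  sigmoid={sigmoid(x):.5f}  grad={sigmoid_grad(x):.5f}")
```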
Activations
Hyperbolic Tangent Function - Tanh

Output is zero-centered, in the range (-1, 1), which solves the output range problem of sigmoid

Drawbacks:

● vanishing gradient problem
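A small sketch of tanh and its derivative, 1 - tanh(x)^2, using Python's math.tanh. The output is symmetric around 0, but the gradient still vanishes for large |x|.

```python
import math


def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2, equal to 1 at x = 0."""
    return 1.0 - math.tanh(x) ** 2


for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"x={x:5.1f}  tanh={math.tanh(x): .5f}  grad={tanh_grad(x):.5f}")
```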


Activations
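ReLU - Rectified Linear Unit

A(x) = max(0, x)

The range of ReLU is [0, inf), so the activation is unbounded and can blow up.

ReLU mitigates the vanishing gradient problem: its gradient is 1 for every positive input.

Drawbacks:

● Dead neurons: a unit whose input stays negative always outputs 0 and gets zero gradient, so it stops learning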
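A minimal sketch of A(x) = max(0, x) and its gradient. The value chosen for the gradient at exactly x = 0 is a convention (an assumption here), since the derivative is undefined at that point.

```python
def relu(x):
    """A(x) = max(0, x)."""
    return max(0.0, x)


def relu_grad(x):
    # The derivative is undefined at 0; using 0 there is a common convention.
    return 1.0 if x > 0 else 0.0


for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(f"x={x:5.1f}  relu={relu(x):.1f}  grad={relu_grad(x):.1f}")
```

For negative inputs both the output and the gradient are 0, which is the dead neuron case.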


Activations
Modifications to solve the dead neuron problem

● Leaky ReLU
● Maxout
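A sketch of both modifications for scalar inputs. The slope alpha = 0.01 is a commonly used default and an assumption here; the maxout pieces are hypothetical (w, b) pairs chosen for illustration.

```python
def leaky_relu(x, alpha=0.01):
    """Like ReLU, but negative inputs keep a small slope alpha."""
    return x if x > 0 else alpha * x


def leaky_relu_grad(x, alpha=0.01):
    """The gradient is never exactly 0, so the unit can keep learning."""
    return 1.0 if x > 0 else alpha


def maxout(x, pieces):
    """Maxout over (w, b) pairs: max_k (w_k * x + b_k)."""
    return max(w * x + b for w, b in pieces)


print(leaky_relu(-2.0), leaky_relu_grad(-2.0))
# With the pieces (1, 0) and (0, 0), maxout reduces to ReLU.
print(maxout(-3.0, [(1.0, 0.0), (0.0, 0.0)]), maxout(3.0, [(1.0, 0.0), (0.0, 0.0)]))
```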
Activations
Softmax Function

● Calculated probabilities are in the range 0 to 1
● The probabilities sum to 1
● Handy for multiple classes
● Gives the probability of being in a particular class
● Used in the output layer for multiclass classification
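A minimal softmax sketch, using the standard definition exp(s_i) / sum_j exp(s_j). Subtracting the maximum score first is a common numerical-stability trick and does not change the result; the input scores are made up.

```python
import math


def softmax(scores):
    """exp(s_i) / sum_j exp(s_j), with max-subtraction for numerical stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

The largest score gets the largest probability, and the outputs can be read as class probabilities.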
Loss Functions
What is a Loss Function?

● A method of evaluating how well your algorithm models your dataset
● Simply measures the difference between the target and the prediction
● Also known as a cost function or objective function

An optimization problem seeks to minimize a loss function.

Note: for gradient-based training, a loss function must be differentiable (at least almost everywhere).
Loss Functions : Regression loss functions
Used when the target variable is continuous (e.g. regression problems).

The most widely used regression loss function is Mean Squared Error.

Other loss functions are:

1. Absolute error: the mean absolute value of the element-wise difference between prediction and target;

2. Smooth absolute error: a smooth version of the absolute error.
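The three regression losses can be sketched as follows. The smooth variant is assumed here to be the usual smooth L1 form (quadratic for errors below 1, linear above); the slides do not give its formula, and the sample values are made up.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def smooth_l1(y_true, y_pred):
    """Smooth absolute error: 0.5 * d^2 if |d| < 1, else |d| - 0.5 (assumed form)."""
    def piece(d):
        d = abs(d)
        return 0.5 * d * d if d < 1.0 else d - 0.5
    return sum(piece(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [1.0, 2.0, 3.0]
y_pred = [1.5, 2.0, 5.0]   # the last prediction is an outlier
print(mse(y_true, y_pred), mae(y_true, y_pred), smooth_l1(y_true, y_pred))
```

The squared error penalizes the outlier much more heavily than the two absolute-error variants.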


Loss Functions : Classification loss functions
The model usually outputs a probability value or a score.
The magnitude of the score represents the confidence of the prediction.
The target variable is binary (1 for true and -1 or 0 for false).

Some classification loss functions are:

1. Binary / Categorical Cross Entropy

2. Negative Log Likelihood

3. Margin Classifier

4. Soft Margin Classifier
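A sketch of the two margin-based losses for labels in {-1, +1}. It assumes that "Margin Classifier" refers to the hinge loss and "Soft Margin Classifier" to the logistic (soft margin) loss; the slides do not give their formulas.

```python
import math


def margin_loss(y, score, margin=1.0):
    """Hinge loss for a label y in {-1, +1}: max(0, margin - y * score)."""
    return max(0.0, margin - y * score)


def soft_margin_loss(y, score):
    """Logistic loss for a label y in {-1, +1}: log(1 + exp(-y * score))."""
    return math.log1p(math.exp(-y * score))


for y, score in ((1, 2.0), (1, 0.5), (-1, 0.5)):
    print(y, score, margin_loss(y, score), round(soft_margin_loss(y, score), 4))
```

A confident correct score gives zero hinge loss, while the soft margin loss only approaches zero.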


Loss Functions : Embedding loss functions
Used for problems where we have to measure whether two inputs are similar or dissimilar. Some examples are:

1. L1 Hinge Error: calculates the L1 distance between two inputs.

2. Cosine Error: the cosine distance between two inputs.
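The two distances these losses are built on can be sketched as below. Only the distance computations are shown; how they are turned into a hinge-style loss is not specified in the slides, and the vectors are made up.

```python
import math


def l1_distance(a, b):
    """Sum of absolute element-wise differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity; 0 for parallel vectors, 1 for orthogonal ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


print(l1_distance([1.0, 2.0], [3.0, 0.0]))
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
print(cosine_distance([1.0, 2.0], [2.0, 4.0]))
```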


Cross Entropy Loss / Log Loss:
● measures the performance of a classification model whose output is a probability value between 0 and 1
● increases as the predicted probability diverges from the actual label

H(y, p) = − ∑_i y_i log(p_i)

y = label, p = prediction
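A small sketch of the cross-entropy formula for a one-hot label. The clipping constant eps is an assumption added to avoid log(0), and the predicted probabilities are made up.

```python
import math


def cross_entropy(y, p, eps=1e-12):
    """H(y, p) = -sum_i y_i * log(p_i) for a one-hot (or soft) label y."""
    return -sum(yi * math.log(max(pi, eps)) for yi, pi in zip(y, p))


label = [0.0, 1.0, 0.0]
confident = cross_entropy(label, [0.05, 0.90, 0.05])
unsure = cross_entropy(label, [0.30, 0.40, 0.30])
wrong = cross_entropy(label, [0.90, 0.05, 0.05])
print(confident, unsure, wrong)
```

The loss grows as the probability assigned to the true class drops.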
Optimization Process
Propagate backwards through the network, carrying the error terms and updating the weight values using an optimizer algorithm.

Calculate the gradient of the error function (E) with respect to the weights (W), and update them in the opposite direction of the gradient.
Fig. Updating weights through gradient descent
Optimizers
What are Optimization Algorithms?

Optimization algorithms help us minimize (or maximize) an objective function.

They update the weights and biases, i.e. the internal parameters of a model, to reduce the prediction error.

They can be divided into two categories:

● Constant learning rate algorithms (SGD)
● Adaptive learning rate algorithms (Adagrad, Adadelta, RMSprop, Adam)
Optimizers : Gradient Descent
Parameter (θ) update formula:

θ = θ − η⋅∇J(θ) ; η = learning rate, ∇J(θ) = gradient of the loss function J(θ)

One of the most popular algorithms for optimizing neural networks

Drawbacks:

● computes the gradient over the whole dataset to perform a single update
● very slow and hard to control for large datasets
● performs redundant computation for large datasets, recomputing gradients for similar examples before each update
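A minimal sketch of the update θ = θ − η⋅∇J(θ) on a toy problem: fitting a single weight θ in y = θx with mean squared error. The data, the learning rate and the step count are illustrative assumptions.

```python
def loss_grad(theta, xs, ys):
    """Gradient of J = mean((theta * x - y)^2) w.r.t. theta, over the whole dataset."""
    n = len(xs)
    return sum(2.0 * (theta * x - y) * x for x, y in zip(xs, ys)) / n


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # generated by y = 2x
theta, eta = 0.0, 0.05

for step in range(100):
    # One update needs a full pass over the dataset.
    theta = theta - eta * loss_grad(theta, xs, ys)

print(theta)  # approaches the true slope 2
```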
Optimizers : Stochastic Gradient Descent (SGD)
Parameter (θ) update formula:

θ = θ − η⋅∇J(θ; x(i), y(i)) ; where {x(i), y(i)} is a single training example

Performs a parameter update for each training example, so it is usually much faster

Drawbacks:

● because updates are so frequent, the parameter updates have high variance
● causes the loss function to fluctuate heavily
● keeps overshooting the minimum because of these fluctuations
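A sketch of SGD on a toy one-weight regression with slightly noisy data (all values are made up). Each update uses a single example, so θ keeps jumping between the values that individual examples prefer instead of settling exactly.

```python
import random

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]   # roughly y = 2x, with some noise


def example_grad(theta, x, y):
    """Gradient of the single-example loss (theta * x - y)^2 w.r.t. theta."""
    return 2.0 * (theta * x - y) * x


rng = random.Random(0)       # fixed seed so the run is repeatable
theta, eta = 0.0, 0.01
history = []

for epoch in range(50):
    order = list(range(len(xs)))
    rng.shuffle(order)
    for i in order:
        # One update per training example.
        theta = theta - eta * example_grad(theta, xs[i], ys[i])
        history.append(theta)

print(theta, min(history[-20:]), max(history[-20:]))
```

The spread of the last few values in history shows the fluctuation that remains even late in training.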
Optimizers : Mini Batch Gradient Descent
● performs an update for every batch of n training examples
● reduces the variance of the parameter updates
● batch size can vary according to the problem

Drawbacks:

● choosing a proper learning rate is difficult: too low gives slow convergence, too high gives oscillations
● the same learning rate applies to all parameter updates
● can get trapped in local minima, and especially at saddle points
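A sketch of mini-batch gradient descent on a toy one-weight regression. The batch size of 4, the learning rate and the data are illustrative assumptions.

```python
import random

xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [2.0 * x for x in xs]   # noise-free targets from y = 2x


def batch_grad(theta, batch):
    """Gradient of the mean squared error over one mini-batch of (x, y) pairs."""
    return sum(2.0 * (theta * x - y) * x for x, y in batch) / len(batch)


rng = random.Random(0)       # fixed seed so the run is repeatable
theta, eta, batch_size = 0.0, 0.02, 4
data = list(zip(xs, ys))

for epoch in range(100):
    rng.shuffle(data)
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # One update per mini-batch of n examples.
        theta = theta - eta * batch_grad(theta, batch)

print(theta)  # approaches the true slope 2
```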
Optimizers : Momentum
Parameter (θ) update formula:

V(t) = γV(t−1) + η∇J(θ) ; V(t) = update vector at time t
θ = θ − V(t) ; γ = momentum term, usually set to 0.9

Faster and more stable convergence

Reduced oscillations in irrelevant directions

Drawbacks:

● may overshoot the minimum and climb back up because of accumulated momentum
● the same learning rate applies to all parameter updates
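The two momentum formulas can be sketched directly on a toy loss J(θ) = θ². The loss, the starting point and η = 0.1 are illustrative assumptions; γ = 0.9 follows the slide.

```python
def grad(theta):
    """Gradient of the toy loss J(theta) = theta^2."""
    return 2.0 * theta


theta, velocity = 5.0, 0.0
eta, gamma = 0.1, 0.9
trajectory = [theta]

for step in range(500):
    velocity = gamma * velocity + eta * grad(theta)   # V(t) = γV(t−1) + η∇J(θ)
    theta = theta - velocity                          # θ = θ − V(t)
    trajectory.append(theta)

print(theta, min(trajectory))
```

The minimum of trajectory is negative: the accumulated momentum carries θ past the minimum at 0 before it settles.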
Optimizers : Nesterov Accelerated Gradient
Parameter (θ) update formula:

V(t) = γV(t−1) + η∇J(θ − γV(t−1)) ; V(t) = update vector at time t
θ = θ − V(t) ; γ = momentum term, η = learning rate

First make a big jump based on the previous momentum, then calculate the gradient and make a correction, which results in the parameter update.

θ − γV(t−1) gives an approximation of the next position of the parameters

Drawbacks:

● the same learning rate applies to all parameter updates
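A sketch of the Nesterov update on the toy loss J(θ) = θ². The only change from plain momentum is that the gradient is evaluated at the look-ahead point θ − γV(t−1); the loss and hyperparameter values are illustrative assumptions.

```python
def grad(theta):
    """Gradient of the toy loss J(theta) = theta^2."""
    return 2.0 * theta


theta, velocity = 5.0, 0.0
eta, gamma = 0.1, 0.9

for step in range(200):
    lookahead = theta - gamma * velocity                  # approximate next position
    velocity = gamma * velocity + eta * grad(lookahead)   # V(t) = γV(t−1) + η∇J(θ − γV(t−1))
    theta = theta - velocity                              # θ = θ − V(t)

print(theta)  # approaches the minimum at 0
```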


Optimizers : Adagrad
Parameter (θ) update formula:

θ(i) = θ(i) − η / √(G(i) + ϵ) ⋅ g(i) ; g(i) = gradient for θ(i), G(i) = sum of the squares of its past gradients

Adapts the general learning rate η at each time step t for every parameter θ(i), based on the past gradients that have been computed for θ(i).

Drawbacks:

● decaying learning rate (η) problem: the accumulated squared gradients keep growing, so the effective learning rate shrinks and eventually becomes vanishingly small
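A sketch of Adagrad on a toy loss with two parameters whose gradients differ in scale by a factor of 100. It uses the standard Adagrad rule (divide η by the square root of the accumulated squared gradients); the loss and hyperparameter values are illustrative assumptions.

```python
import math


def grad(theta):
    """Gradient of J = 0.5 * (100 * theta0^2 + theta1^2): very different scales."""
    return [100.0 * theta[0], theta[1]]


theta = [1.0, 1.0]
accum = [0.0, 0.0]          # per-parameter sum of squared gradients
eta, eps = 0.5, 1e-8

for step in range(50):
    g = grad(theta)
    for i in range(len(theta)):
        accum[i] += g[i] ** 2
        theta[i] -= eta / math.sqrt(accum[i] + eps) * g[i]

# Each parameter ends up with its own effective learning rate, both below eta.
effective_lr = [eta / math.sqrt(a + eps) for a in accum]
print(theta, effective_lr)
```

The parameter with the larger gradients gets the smaller effective learning rate, and both rates can only shrink over time.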


Optimizers : AdaDelta / RMSProp
Parameter (θ) update formula:

E[g²](t) = γ⋅E[g²](t−1) + (1−γ)⋅g²(t)

Restricts the window of accumulated past gradients to some fixed size w, instead of accumulating all past squared gradients as Adagrad does

The running average E[g²](t) at time step t then depends only on the previous average and the current gradient

Drawbacks:

● does not maintain a per-parameter momentum (first moment) estimate
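A sketch of the running average E[g²](t) in use, on the toy loss J(θ) = θ². The parameter step θ = θ − η / √(E[g²](t) + ϵ) ⋅ g(t) is the standard RMSprop rule and is not shown on the slide; the loss and hyperparameter values are illustrative assumptions.

```python
import math


def grad(theta):
    """Gradient of the toy loss J(theta) = theta^2."""
    return 2.0 * theta


theta, avg_sq = 1.0, 0.0
eta, gamma, eps = 0.01, 0.9, 1e-8

for step in range(500):
    g = grad(theta)
    avg_sq = gamma * avg_sq + (1.0 - gamma) * g * g      # E[g²](t)
    theta = theta - eta / math.sqrt(avg_sq + eps) * g    # assumed RMSprop step

print(theta)  # ends up close to the minimum at 0
```

Because the average forgets old gradients, the effective learning rate does not decay to zero as it does in Adagrad.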


Optimizers : Adam (Adaptive Moment Estimation)
M(t) and V(t) are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients, respectively.

The final formula for the parameter update is:

θ = θ − η ⋅ M̂(t) / (√V̂(t) + ϵ) ; M̂(t), V̂(t) = bias-corrected M(t), V(t)

The recommended default values are β1 = 0.9, β2 = 0.999, and ϵ = 10^−8.

Adam is usually recommended as the optimizer for most learning problems right now.
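A sketch of Adam on the toy loss J(θ) = θ², following the standard update with bias-corrected moment estimates and the default β1, β2 and ϵ quoted above. The loss, the learning rate η = 0.05 and the step count are illustrative assumptions.

```python
import math


def grad(theta):
    """Gradient of the toy loss J(theta) = theta^2."""
    return 2.0 * theta


theta = 1.0
m, v = 0.0, 0.0                       # first and second moment estimates
eta, beta1, beta2, eps = 0.05, 0.9, 0.999, 1e-8

for t in range(1, 1001):
    g = grad(theta)
    m = beta1 * m + (1.0 - beta1) * g          # running mean of gradients
    v = beta2 * v + (1.0 - beta2) * g * g      # running mean of squared gradients
    m_hat = m / (1.0 - beta1 ** t)             # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - eta * m_hat / (math.sqrt(v_hat) + eps)

print(theta)  # approaches the minimum at 0
```

In Keras, passing optimizer='adam' to model.compile selects this algorithm with its default settings.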
Optimizers : A Comparison
“Insofar, RMSprop, Adadelta, and Adam are very similar algorithms that do well in similar circumstances. […] its bias-correction helps Adam slightly outperform RMSprop towards the end of optimization as gradients become sparser. Insofar, Adam might be the best overall choice.”
- “An overview of gradient descent optimization algorithms”, 2016, Sebastian Ruder

“In practice Adam is currently recommended as the default algorithm to use, and often works slightly better than RMSProp. However, it is often also worth trying SGD+Nesterov Momentum as an alternative.”
- “CS231n: Convolutional Neural Networks for Visual Recognition”, Andrej Karpathy, et al.

Fig. Comparison of Adam to other optimization algorithms training a multilayer perceptron. Taken from “Adam: A Method for Stochastic Optimization”, 2015.


Optimizers : Visualisation

Fig. optimization on loss surface contours
Fig. optimization on saddle points


Thank You
