Activations, Loss Functions & Optimizers in ML
Aniket Dhar
RWS DataLab
A Neural Network Model in Keras : An Example
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, BatchNormalization, Dropout, Flatten, Dense

# input_shape and num_classes are assumed to be defined for the dataset at hand
model = Sequential()
model.add(Conv2D(64, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(1, 1)))
model.add(BatchNormalization())
model.add(Dropout(0.2))
model.add(Conv2D(64, kernel_size=(3, 3), activation='relu', padding='same'))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(1, 1)))
model.add(BatchNormalization())
model.add(Flatten())
model.add(Dropout(0.7))
model.add(Dense(num_classes, activation='softmax'))
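Before training, the model still needs a loss function and an optimizer, which are the topics of the rest of these slides. A minimal sketch of that step is shown below; 'categorical_crossentropy' and 'adam' are reasonable defaults here, and x_train / y_train stand for a prepared dataset that is not shown in the slides.

model.compile(loss='categorical_crossentropy',   # cross-entropy loss (see Loss Functions)
              optimizer='adam',                  # Adam optimizer (see Optimizers)
              metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=32, epochs=10, validation_split=0.1)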
● The activation function of a node defines the output of that node given an input or set of inputs
● Also called a transfer function
Y = f(W⋅X + B) ; where Y = output, X = input, W = weights, B = bias, f = activation function
Each node computes the weighted sum of its inputs, applies the activation function, and feeds the result as input to the next layer.
The activation function determines whether a given neuron fires or not!
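As a minimal sketch (using NumPy, with ReLU chosen here just for illustration), a single node computes the weighted sum of its inputs, applies the activation function, and passes the result to the next layer:

import numpy as np

def relu(z):                         # the activation function f
    return np.maximum(0.0, z)

X = np.array([0.5, -1.2, 3.0])       # inputs
W = np.array([0.4, 0.1, -0.6])       # weights
B = 0.2                              # bias

Y = relu(np.dot(W, X) + B)           # Y = f(W⋅X + B), fed as input to the next layer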
Activations
Linear or Identity Activation Function
A(x) = cx
Drawbacks:
● the gradient is a constant c, so it carries no information about the input x
● stacked linear layers collapse into a single linear layer, no matter how deep the network is (see the sketch below)
Non-linear activation functions:
1. Sigmoid or Logistic
2. Tanh - Hyperbolic tangent
3. ReLU - Rectified linear units
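The collapse drawback can be checked directly; the sketch below (with arbitrary random shapes chosen for illustration) shows that two stacked linear layers are equivalent to one.

import numpy as np

x = np.random.randn(4)              # an input vector
W1 = np.random.randn(3, 4)          # first linear layer
W2 = np.random.randn(2, 3)          # second linear layer

two_layers = W2 @ (W1 @ x)          # no non-linearity in between
one_layer = (W2 @ W1) @ x           # a single equivalent linear layer

print(np.allclose(two_layers, one_layer))   # True: extra depth adds no expressive power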
Activations
Sigmoid Function
A(x) = 1 / (1 + e^(−x))
Drawbacks:
● saturates for large |x|, causing vanishing gradients
● output is not zero-centered
Tanh - Hyperbolic Tangent Function
A(x) = tanh(x)
Drawbacks:
● zero-centered, but still saturates and suffers from vanishing gradients
ReLU Function
A(x) = max(0, x)
Variants that avoid "dead" (always-zero) units:
● Leaky ReLU
● Maxout
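A minimal NumPy sketch of the activations above; the leaky-ReLU slope alpha=0.01 is an assumed common default, not taken from the slides.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))          # squashes values into (0, 1)

def tanh(x):
    return np.tanh(x)                        # squashes values into (-1, 1), zero-centered

def relu(x):
    return np.maximum(0.0, x)                # A(x) = max(0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)     # small slope instead of zero for x < 0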
Activations
Softmax Function
softmax(xi) = e^(xi) / ∑j e^(xj)
● squashes a vector of raw scores into a probability distribution that sums to 1
● typically used in the output layer for multi-class classification (as in the Keras example above)
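A minimal NumPy sketch of softmax; subtracting the maximum before exponentiating is a standard numerical-stability trick, not something stated in the slides.

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))       # shift for numerical stability
    return e / np.sum(e)            # normalize into a probability distribution

print(softmax(np.array([2.0, 1.0, 0.1])))   # ≈ [0.659, 0.242, 0.099]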
Loss Functions
● Cross-Entropy: H(y, p) = − ∑i yi log(pi) ; where y = label, p = prediction
● Margin Classifier
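A minimal NumPy sketch of the cross-entropy loss for a one-hot label; the small eps guard against log(0) is an added assumption for robustness.

import numpy as np

def cross_entropy(y, p, eps=1e-12):
    # y: one-hot label, p: predicted probabilities (e.g. a softmax output)
    return -np.sum(y * np.log(p + eps))

y = np.array([0, 1, 0])              # true class is index 1
p = np.array([0.2, 0.7, 0.1])        # predicted distribution
print(cross_entropy(y, p))           # ≈ 0.357 (= -log(0.7))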
Optimization Process
● Propagate backwards through the network, carrying the error terms and updating the weight values using optimizer algorithms
Optimizers : Batch Gradient Descent
Parameter (θ) update formula:
θ = θ − η⋅∇J(θ) ; the gradient is computed over the entire training set
Drawbacks:
● calculates the gradient over the whole dataset and performs only one update per pass
● very slow and hard to control for large datasets
● computes redundant updates for large datasets
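A minimal sketch of batch gradient descent on a toy least-squares problem (the data, learning rate and epoch count are all assumed for illustration): the gradient is computed over the whole dataset before each single update.

import numpy as np

# toy regression data: y ≈ 3x
X = np.random.randn(1000, 1)
y = 3.0 * X[:, 0] + 0.1 * np.random.randn(1000)

theta, eta = 0.0, 0.1
for epoch in range(100):
    grad = np.mean((theta * X[:, 0] - y) * X[:, 0])   # gradient over the WHOLE dataset
    theta -= eta * grad                               # only one update per pass
print(theta)   # ≈ 3.0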
Optimizers : Stochastic Gradient Descent (SGD)
Parameter (θ) update formula:
θ = θ − η⋅∇J(θ, x(i), y(i)) ; where (x(i), y(i)) is a single training example
● performs a parameter update for each training example, and is usually much faster
Drawbacks:
● performs frequent updates with high variance, so the objective function fluctuates heavily
● the fluctuation can complicate convergence to the exact minimum
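The same toy problem with SGD (again an illustrative sketch with assumed values): one noisy but cheap parameter update per training example.

import numpy as np

# same toy regression data as in the batch gradient descent sketch above
X = np.random.randn(1000, 1)
y = 3.0 * X[:, 0] + 0.1 * np.random.randn(1000)

theta, eta = 0.0, 0.01
for i in range(len(X)):                      # one update per training example
    x_i, y_i = X[i, 0], y[i]
    grad = (theta * x_i - y_i) * x_i         # gradient of J(θ, x(i), y(i))
    theta -= eta * grad
print(theta)   # ≈ 3.0, reached along a noisier path than batch gradient descent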
Optimizers : Momentum & Nesterov Accelerated Gradient (NAG)
● Momentum adds a fraction γ of the previous update vector to the current one, accelerating SGD in the relevant direction and dampening oscillations
● NAG first makes a big jump based on the previous momentum, then calculates the gradient and makes a correction, which results in the parameter update (see the sketch below)
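A minimal sketch of the Nesterov update on a toy objective; γ = 0.9 is a common default and grad_J is a hypothetical gradient function defined here only for illustration. Classical momentum would evaluate the gradient at θ itself instead of the look-ahead point.

def grad_J(theta):
    return 2.0 * (theta - 3.0)          # gradient of a toy objective J(θ) = (θ − 3)²

eta, gamma = 0.1, 0.9
theta, v = 0.0, 0.0
for t in range(100):
    # big jump based on the previous momentum, then a gradient-based correction
    v = gamma * v + eta * grad_J(theta - gamma * v)
    theta -= v
print(theta)   # ≈ 3.0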
Optimizers : Adagrad
● modifies the general learning rate η at each time step t for every parameter θ(i), based on the past gradients that have been computed for θ(i)
Drawbacks:
● the squared gradients accumulated in the denominator keep growing, so the effective learning rate eventually shrinks towards zero
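A minimal sketch of the Adagrad update on the same toy objective (learning rate and iteration count are assumed for illustration):

import numpy as np

eta, eps = 0.5, 1e-8
theta, G = 0.0, 0.0                          # G accumulates all past squared gradients
for t in range(500):
    g = 2.0 * (theta - 3.0)                  # gradient of J(θ) = (θ − 3)²
    G += g ** 2                              # the accumulator only ever grows
    theta -= eta / (np.sqrt(G) + eps) * g    # per-parameter, shrinking effective step
print(theta)   # ≈ 3.0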
Optimizers : RMSprop
● replaces Adagrad's accumulated sum with a decaying average of past squared gradients:
E[g²](t) = γ⋅E[g²](t−1) + (1−γ)⋅g²(t)
Drawbacks:
● lacks the bias correction that Adam applies to its moment estimates
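A minimal sketch of the RMSprop update; γ = 0.9 is the commonly suggested default, while eta and the iteration count are chosen for this toy problem.

import numpy as np

eta, gamma, eps = 0.01, 0.9, 1e-8
theta, Eg2 = 0.0, 0.0
for t in range(2000):
    g = 2.0 * (theta - 3.0)                      # gradient of J(θ) = (θ − 3)²
    Eg2 = gamma * Eg2 + (1 - gamma) * g ** 2     # decaying average of squared gradients
    theta -= eta / (np.sqrt(Eg2) + eps) * g      # the average decays, so steps do not vanish
print(theta)   # ≈ 3.0 (hovers within roughly ±eta of the minimum)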
Optimizers : Adam (Adaptive Moment Estimation)
● keeps a decaying average of past squared gradients (like RMSprop) as well as a decaying average of past gradients (like momentum), with bias correction of both estimates
● the Adam optimizer is usually recommended for most learning problems right now
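A minimal sketch of the Adam update on the same toy objective; β1 = 0.9, β2 = 0.999 and eps = 1e-8 are the usual defaults, while eta and the iteration count are assumed for illustration.

import numpy as np

eta, beta1, beta2, eps = 0.02, 0.9, 0.999, 1e-8
theta, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    g = 2.0 * (theta - 3.0)                  # gradient of J(θ) = (θ − 3)²
    m = beta1 * m + (1 - beta1) * g          # decaying average of gradients (momentum)
    v = beta2 * v + (1 - beta2) * g ** 2     # decaying average of squared gradients (RMSprop)
    m_hat = m / (1 - beta1 ** t)             # bias correction of both estimates
    v_hat = v / (1 - beta2 ** t)
    theta -= eta * m_hat / (np.sqrt(v_hat) + eps)
print(theta)   # ≈ 3.0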
Optimizers : A Comparison
“Insofar, RMSprop, Adadelta, and Adam are very similar algorithms that do well in similar circumstances. […] its bias-correction helps Adam slightly outperform RMSprop towards the end of optimization as gradients become sparser. Insofar, Adam might be the best overall choice.”
- “An overview of gradient descent optimization algorithms”, Sebastian Ruder, 2016