Mod 2.3 - Activation Function, Loss Functions


Why Do We Need Activation Functions?

• An activation function Φ(v) in the output layer can control the nature of the output (e.g., probability value in [0, 1]).

• In multilayer neural networks, activation functions bring non-linearity into hidden layers, which increases the complexity of the model.

A neural network with any number of layers but only linear activations can be shown to be equivalent to a single-layer network.
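
As a quick illustrative sketch (not part of the original notes), the NumPy snippet below checks this claim for two stacked linear layers with no activation in between; the layer shapes are arbitrary:

import numpy as np

rng = np.random.default_rng(0)

# Two linear layers with no activation between them (arbitrary example shapes).
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Layer-by-layer: h = W1 x + b1, then y = W2 h + b2
y_stacked = W2 @ (W1 @ x + b1) + b2

# Equivalent single layer: W = W2 W1, b = W2 b1 + b2
y_single = (W2 @ W1) @ x + (W2 @ b1 + b2)

print(np.allclose(y_stacked, y_single))  # True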

Binary Step Function


The binary step function depends on a threshold value that decides whether a neuron should be activated or not.

The input fed to the activation function is compared to a certain threshold; if the input is greater than it, then the neuron is activated, else it is deactivated, meaning that its output is not passed on to the next hidden layer.

Mathematically it can be represented as: f(x) = 0 for x < 0, and f(x) = 1 for x ≥ 0 (taking the threshold to be 0)

[Figure: graph of the binary step function]


Here are some of the limitations of the binary step function:
• It cannot provide multi-value outputs; for example, it cannot be used for multi-class classification problems.
• The gradient of the step function is zero, which causes a hindrance in the backpropagation process.
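
A minimal sketch of the binary step function described above, assuming the conventional threshold of 0:

import numpy as np

def binary_step(x):
    # Output 1 once the input reaches the threshold (0 here), otherwise 0.
    return np.where(x >= 0, 1, 0)

print(binary_step(np.array([-2.0, -0.5, 0.0, 3.0])))  # [0 0 1 1]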

Linear Activation Function


The linear activation function, also known as the "no activation" or "identity" function, is where the activation is proportional to the input.

The function doesn't do anything to the weighted sum of the input; it simply returns the value it was given.

[Figure: linear activation function]

Mathematically it can be represented as: f(x)=x

However, a linear activation function has two major problems:

• Backpropagation cannot be used effectively, as the derivative of the function is a constant and has no relation to the input x.
• All layers of the neural network will collapse into one if a linear activation function is used. No matter the number of layers in the neural network, the last layer will still be a linear function of the first layer. So, essentially, a linear activation function turns the neural network into just one layer.

Non-Linear Activation Functions


A network with only linear activation functions is simply a linear regression model.

Because of its limited power, this does not allow the model to create
complex mappings between the network’s inputs and outputs. 

Non-linear activation functions solve the following limitations of linear activation functions:

• They allow backpropagation because now the derivative function would be related to the input, and it's possible to go back and understand which weights in the input neurons can provide a better prediction.
• They allow the stacking of multiple layers of neurons, as the output would now be a non-linear combination of the input passed through multiple layers. Any output can be represented as a functional computation in a neural network.

Non-Linear Neural Network Activation Functions


Sigmoid / Logistic Activation Function 

This function takes any real value as input and outputs values in the range
of 0 to 1. 

The larger the input (more positive), the closer the output value will be to
1.0, whereas the smaller the input (more negative), the closer the output will
be to 0.0, as shown below.
[Figure: sigmoid/logistic activation function curve]

Mathematically it can be represented as: f(x) = 1 / (1 + e^(-x))

The sigmoid/logistic activation function is one of the most widely used functions, for the following reasons:

• It is commonly used for models where we have to predict a probability as the output. Since the probability of anything exists only between the range of 0 and 1, sigmoid is the right choice because of its range.
• The function is differentiable and provides a smooth gradient, i.e., it prevents jumps in output values. This is represented by the S-shape of the sigmoid activation function (its gradient is sketched below).
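
A short sketch (not from the original notes) of the sigmoid and its well-known derivative s(x) * (1 - s(x)), which is what gives the smooth, jump-free gradient mentioned above:

import numpy as np

def sigmoid(x):
    # Squashes any real input into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Smooth gradient: largest near x = 0, saturating for large |x|.
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-5.0, 0.0, 5.0])
print(sigmoid(x))       # approx [0.0067, 0.5, 0.9933]
print(sigmoid_grad(x))  # approx [0.0066, 0.25, 0.0066]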

Tanh Function (Hyperbolic Tangent)


The tanh function is very similar to the sigmoid/logistic activation function and even has the same S-shape, the difference being its output range of -1 to 1. In tanh, the larger the input (more positive), the closer the output value will be to 1.0, whereas the smaller the input (more negative), the closer the output will be to -1.0.

[Figure: tanh activation function curve]

Mathematically it can be represented as: f(x) = (e^x - e^(-x)) / (e^x + e^(-x))

Advantages of using this activation function are:

• The output of the tanh activation function is zero-centered; hence we can easily map the output values as strongly negative, neutral, or strongly positive (see the sketch after this list).
• It is usually used in the hidden layers of a neural network, as its values lie between -1 and 1; therefore, the mean of the hidden layer's outputs comes out to be 0 or very close to it. This helps in centering the data and makes learning for the next layer much easier.
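
A short sketch (assumed, using NumPy's built-in np.tanh) showing the zero-centered output range:

import numpy as np

x = np.linspace(-4.0, 4.0, 9)
y = np.tanh(x)           # all values lie in (-1, 1)

print(y.min(), y.max())  # close to -1 and 1
print(y.mean())          # close to 0 for inputs symmetric around 0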

ReLU Function
ReLU stands for Rectified Linear Unit. 

Although it gives an impression of a linear function, ReLU has a derivative function and allows for backpropagation while simultaneously being computationally efficient.

The main catch here is that the ReLU function does not activate all the neurons
at the same time. 

The neurons will only be deactivated if the output of the linear transformation is
less than 0.
[Figure: ReLU activation function]

Mathematically it can be represented as : f(x)=max(0,x)


The advantages of using ReLU as an activation function are as follows:
• Since only a certain number of neurons are activated, the ReLU function is far more computationally efficient when compared to the sigmoid and tanh functions.
• ReLU accelerates the convergence of gradient descent towards the global minimum of the loss function due to its linear, non-saturating property.

The limitations faced by this function are:

• All the negative input values become zero immediately, which decreases the model's ability to fit or train from the data properly.
• The dying ReLU problem (addressed by an improved version named Leaky ReLU; see the sketch after this list).
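
A minimal sketch (not part of the original notes) of ReLU together with the Leaky ReLU variant mentioned above; the slope alpha for negative inputs is a free parameter, commonly a small value such as 0.01:

import numpy as np

def relu(x):
    # f(x) = max(0, x): negative inputs are zeroed out, i.e. those neurons are deactivated.
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # A small slope for x < 0 keeps a non-zero gradient and avoids "dying" neurons.
    return np.where(x >= 0, x, alpha * x)

x = np.array([-3.0, -0.5, 0.0, 2.0])
print(relu(x))        # [0. 0. 0. 2.]
print(leaky_relu(x))  # [-0.03 -0.005 0. 2.]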
Neural networks are a set of algorithms that are designed to
recognize trends/relationships in a given set of training data. These
algorithms are based on the way human neurons process
information.

An equation of the form ŷ = Φ(wᵀx + b) represents how a neural network processes the input data at each layer and eventually produces a predicted output value.

To train (the process by which the model maps the relationship between the training data and the outputs), the neural network updates its parameters, the weights wᵀ and biases b, to satisfy the equation above.

Each training input is loaded into the neural network in a process called forward propagation. Once the model has produced an output, this predicted output is compared against the given target output; in a process called backpropagation, the parameters of the model are then adjusted so that it outputs a result closer to the target output.

This is where loss functions come in. Loss functions are one of the most important aspects of neural networks, as they (along with the optimization functions) are directly responsible for fitting the model to the given training data.

Loss Functions Overview

A loss function is a function that compares the target and predicted output values; it measures how well the neural network models the training data. When training, we aim to minimize this loss between the predicted and target outputs.

The parameters are adjusted to minimize the average loss: we find the weights, wᵀ, and biases, b, that minimize the value of J (the average loss).
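
As an illustrative sketch only (assuming a single linear neuron, MSE as the average loss J, and plain gradient descent; none of these choices are prescribed by the notes), the loop below adjusts the weights and bias to reduce J:

import numpy as np

rng = np.random.default_rng(0)

# Toy data: targets follow a known linear rule plus a little noise.
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.3 + 0.01 * rng.normal(size=100)

w, b = np.zeros(3), 0.0
learning_rate = 0.1

for _ in range(200):
    y_hat = X @ w + b                    # forward propagation
    error = y_hat - y
    J = np.mean(error ** 2)              # average loss (MSE here)
    grad_w = 2.0 * X.T @ error / len(y)  # dJ/dw
    grad_b = 2.0 * error.mean()          # dJ/db
    w -= learning_rate * grad_w          # update the parameters
    b -= learning_rate * grad_b

print(np.round(w, 2), round(b, 2))       # recovers roughly [2. -1. 0.5] and 0.3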

Types of Loss Functions

In supervised learning, there are two main types of loss functions, corresponding to the two major types of neural networks: regression loss functions and classification loss functions.

1. Regression Loss Functions: used in regression neural networks; given an input value, the model predicts a corresponding output value (rather than pre-selected labels).

Ex. Mean Squared Error, Mean Absolute Error

2. Classification Loss Functions: used in classification neural networks; given an input, the neural network produces a vector of probabilities of the input belonging to various pre-set categories, and the category with the highest probability can then be selected as the output.

Ex. Binary Cross-Entropy, Categorical Cross-Entropy

Mean Squared Error (MSE)

One of the most popular loss functions, MSE finds the average of the squared differences between the target and the predicted outputs: MSE = (1/n) Σ (y − ŷ)².

This function has numerous properties that make it especially suited for calculating loss. The difference is squared, which means it does not matter whether the predicted value is above or below the target value; however, values with a large error are penalized more heavily. MSE is also a convex function with a clearly defined global minimum, which allows us to more easily utilize gradient descent optimization to set the weight values.
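
A minimal sketch (assumed, using NumPy arrays of targets and predictions) of the MSE computation:

import numpy as np

def mse(y_true, y_pred):
    # Average of squared differences; large errors are penalized quadratically.
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])
print(mse(y_true, y_pred))  # 0.375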

Mean Absolute Error (MAE)

MAE finds the average of the absolute differences between the target and the predicted outputs: MAE = (1/n) Σ |y − ŷ|.

This loss function is used as an alternative to MSE in some cases. As mentioned previously, MSE is highly sensitive to outliers, which can dramatically affect the loss because the distance is squared. MAE is used in cases when the training data has a large number of outliers, to mitigate this.

It also has some disadvantages: as the average distance approaches 0, gradient descent optimization struggles, because the derivative of the absolute value is undefined at 0 and the magnitude of the gradient does not shrink as the error gets small, making it hard to settle precisely on the minimum.
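
The corresponding sketch (assumed) for MAE:

import numpy as np

def mae(y_true, y_pred):
    # Average of absolute differences; outliers affect the loss only linearly.
    return np.mean(np.abs(y_true - y_pred))

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])
print(mae(y_true, y_pred))  # 0.5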

Because of this, a loss function called the Huber loss was developed, which has the advantages of both MSE and MAE.

If the absolute difference between the actual and predicted value is less than or equal to a threshold value, 𝛿, then MSE is applied. Otherwise, if the error is sufficiently large, MAE is applied.
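
A sketch (assumed) of the Huber loss, with delta as the threshold that switches between the squared and linear branches:

import numpy as np

def huber(y_true, y_pred, delta=1.0):
    error = np.abs(y_true - y_pred)
    quadratic = 0.5 * error ** 2               # MSE-like branch for small errors
    linear = delta * error - 0.5 * delta ** 2  # MAE-like branch for large errors
    return np.mean(np.where(error <= delta, quadratic, linear))

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 12.0])      # last point is an outlier
print(huber(y_true, y_pred, delta=1.0))        # ~1.19: the outlier is penalized linearly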

Binary Cross-Entropy/Log Loss

This is the loss function used in binary classification models, where the model takes in an input and has to classify it into one of two pre-set categories.
Classification neural networks work by outputting a vector of
probabilities — the probability that the given input fits into each of
the pre-set categories; then selecting the category with the highest
probability as the final output.

In binary classification, there are only two possible actual values of y: 0 or 1. Thus, to accurately determine the loss between the actual and predicted values, the loss function compares the actual value (0 or 1) with the probability that the input aligns with that category (p(i) = the probability that the category is 1; 1 - p(i) = the probability that the category is 0).
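
A minimal sketch (assumed) of binary cross-entropy, BCE = -(1/N) * Σ [y * log(p) + (1 - y) * log(1 - p)], with clipping to keep the logarithm finite:

import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    p = np.clip(p_pred, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = np.array([1, 0, 1, 1])
p_pred = np.array([0.9, 0.1, 0.8, 0.6])      # predicted probability that y = 1
print(binary_cross_entropy(y_true, p_pred))  # ~0.24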

Categorical Cross-Entropy Loss

In cases where the number of classes is greater than two, we utilize categorical cross-entropy; this follows a very similar process to binary cross-entropy.

Binary cross-entropy is a special case of categorical cross-entropy, where M = 2, i.e., the number of categories is 2.
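
A short sketch (assumed) of categorical cross-entropy for one-hot targets over M classes:

import numpy as np

def categorical_cross_entropy(y_true, p_pred, eps=1e-12):
    # y_true: one-hot targets (N x M); p_pred: predicted probabilities (N x M), rows summing to 1.
    p = np.clip(p_pred, eps, 1.0)
    return -np.mean(np.sum(y_true * np.log(p), axis=1))

y_true = np.array([[1, 0, 0],
                   [0, 0, 1]])
p_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.3, 0.6]])
print(categorical_cross_entropy(y_true, p_pred))  # ~0.43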
