ML Unit 4
Neural Networks
A neural network is a machine learning model that processes data in a way loosely inspired by the
human brain. It's made up of interconnected nodes, or neurons, that learn to recognize patterns
and relationships in data. Every neural network consists of layers of nodes, or artificial
neurons—an input layer, one or more hidden layers, and an output layer. Each node connects to
others, and has its own associated weight and threshold. If the output of any individual node is
above the specified threshold value, that node is activated, sending data to the next layer of the
network. Otherwise, no data is passed along to the next layer of the network.
A multilayer perceptron (MLP) is a type of feedforward artificial neural network with multiple
layers, including an input layer, one or more hidden layers, and an output layer. Each layer is
fully connected to the next: each neuron in a layer is connected to every neuron in the next layer,
and these connections have weights and biases that control how signals are passed forward.
Fig: Multilayer Perceptron
Input Layer: The input layer takes the raw data (like numbers, images, or text features), and
passes it to the first hidden layer.
Hidden Layer: Each neuron in a hidden layer receives inputs, multiplies them by certain
weights, adds a bias, and then passes the result through an activation function like ReLU or
sigmoid, which helps the network learn complex patterns. This processed information is passed
from one layer to the next in a step called forward propagation, until it reaches the output layer,
which gives the final prediction.
Output Layer: The output layer is the final layer in a neural network. It takes all the learned
information from the previous (hidden) layers and produces the final result (a prediction or
decision).
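The three layers described above can be sketched as a single forward pass. This is a minimal illustration, not a reference implementation: the layer sizes, the random weights, and the names `mlp_forward`, `W1`, `b1`, `W2`, and `b2` are all assumptions made for the example.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, W1, b1, W2, b2):
    """Forward pass through one hidden layer and one output layer."""
    hidden = relu(x @ W1 + b1)          # hidden layer: weights, bias, activation
    output = sigmoid(hidden @ W2 + b2)  # output layer: final prediction in (0, 1)
    return hidden, output

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))             # input layer: 5 samples, 4 raw features (assumed sizes)
W1 = rng.normal(scale=0.1, size=(4, 3)) # 4 inputs -> 3 hidden neurons
b1 = np.zeros(3)
W2 = rng.normal(scale=0.1, size=(3, 1)) # 3 hidden neurons -> 1 output
b2 = np.zeros(1)

hidden, output = mlp_forward(x, W1, b1, W2, b2)
print(hidden.shape, output.shape)       # (5, 3) (5, 1)
```

Every hidden neuron receives all four inputs and the output neuron receives all three hidden values, which is what "fully connected" means in practice.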
Advantages of MLP
Disadvantages of MLP
2. Activation Function
Activation functions are an integral building block of neural networks that enable them to
learn complex patterns in data. They transform the input signal of a node in a neural network into
an output signal that is then passed on to the next layer. Without activation functions, neural
networks would be restricted to modeling only linear relationships between inputs and outputs.
Choosing the right activation function is crucial for training neural networks that generalize well
and provide accurate predictions.
Without activation functions, a neural network would consist only of linear operations such as
matrix multiplication, and a stack of linear layers collapses into a single linear transformation
of the input. Most real-world data is non-linear. For example, the relationships between house
price and size, or between income and purchases, are non-linear. A network with no activation
functions would fail to learn the complex non-linear patterns that exist in real-world data.
Activation functions introduce the non-linearity that lets the network model these relationships,
which greatly increases the flexibility and power of neural networks to model complex data.
Neural networks leverage various types of activation functions to introduce non-linearities and
enable learning complex patterns. Each activation function has its own unique properties and is
suitable for certain use cases. Here are the most common types of activation functions:
Sigmoid Function: The sigmoid function takes any real-valued input and maps it to a value
between 0 and 1. This makes it especially useful for models where we want to predict
probabilities, such as in binary classification problems. The sigmoid function is defined as:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
Tanh (Hyperbolic Tangent) Function: The tanh function is similar to the sigmoid function,
but it maps input values to a range between -1 and 1, instead of 0 to 1. Its formula is:
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$
ReLU (Rectified Linear Unit): ReLU outputs the input value if it's positive, otherwise, it
outputs zero. The formula is:
$$f(x) = \max(0, x)$$
Leaky ReLU: Leaky ReLU is a slight variation of ReLU designed to fix the "dying neuron"
issue. Instead of outputting zero for negative inputs, it allows a small negative slope, so the
neuron can still learn. The formula is:
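$$f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{if } x \le 0 \end{cases}$$
where $\alpha$ is a small positive constant (commonly 0.01) that sets the slope for negative inputs.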
Softmax Activation Function: The softmax function is usually used in the output layer of a
neural network for multi-class classification problems. It converts the raw scores from the
network into probabilities that sum to 1. Each output value represents the probability of the
input belonging to a particular class. It's especially useful when your model needs to choose
one label from many possible options. The formula is:
$$\text{Softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$$
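The activation functions listed above are short enough to write out directly. The following is a sketch in NumPy. The default `alpha=0.01` for Leaky ReLU and the max-subtraction trick in `softmax` are common conventions assumed here, not something the notes prescribe.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # alpha is the small slope applied to non-positive inputs (assumed default)
    return np.where(x > 0, x, alpha * x)

def softmax(x):
    # Subtracting the maximum does not change the result but avoids overflow in exp
    exps = np.exp(x - np.max(x))
    return exps / np.sum(exps)

x = np.array([-2.0, 0.0, 3.0])
print(sigmoid(x))     # values strictly between 0 and 1
print(tanh(x))        # values strictly between -1 and 1
print(relu(x))        # negative inputs become 0
print(leaky_relu(x))  # negative inputs are scaled by alpha
print(softmax(x))     # non-negative values that sum to 1
```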
3. Network Training
Neural network training involves an iterative process where the network learns to make accurate
predictions by adjusting its internal parameters (weights and biases) using a training dataset and
a loss function, typically through algorithms like backpropagation and gradient descent.
Fig: Network Training
The first step is to collect and prepare your dataset. This includes cleaning the data, handling
missing values, converting categories to numbers if needed, and scaling features. The data is then
divided into three parts: training data (to train the model), validation data (to tune and monitor
the model), and test data (to evaluate final performance).
Next, design the structure of the neural network. This involves choosing the number of layers,
how many neurons each layer should have, and which activation functions to use. ReLU is
commonly used in hidden layers, while sigmoid or softmax is used in the output layer depending
on the type of task.
Before training begins, the model’s weights and biases are initialized. This is usually done with
small random values to start the learning process. Proper initialization helps the model converge
faster and more effectively.
The model’s prediction is compared to the true output using a loss function. The loss function
measures the difference or error between the predicted result and the actual label. Common loss
functions include cross-entropy for classification and mean squared error for regression.
Step 6: Backpropagation
Once the loss is calculated, the model uses backpropagation to understand how much each
weight contributed to the error. It does this by computing gradients — how much the loss
changes with respect to each weight.
Using the gradients from backpropagation, the model updates its weights to reduce the loss. This
is done using an optimizer like Stochastic Gradient Descent (SGD) or Adam. These optimizers
adjust the weights in small steps based on the learning rate.
Step 8: Repeat
Steps 4 to 7 are repeated many times over the full dataset. Each full pass through the training
data is called an epoch. With each epoch, the model improves its ability to make accurate
predictions by minimizing the loss.
During training, the model is evaluated on the validation data to monitor its performance. This
helps detect overfitting—when the model performs well on training data but poorly on new data.
Finally, after training is complete, the model is tested on the test dataset. This final evaluation
shows how well the model can generalize to new, unseen data.
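The training procedure described above can be sketched end to end on a toy problem. To keep the example short, it assumes a synthetic dataset and a single sigmoid neuron rather than a deep network. The dataset, the split sizes, the learning rate, and the number of epochs are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Prepare data: a synthetic binary classification set (assumed for illustration)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
X_train, y_train = X[:200], y[:200]        # training data
X_val, y_val = X[200:250], y[200:250]      # validation data
X_test, y_test = X[250:], y[250:]          # test data

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(p, y):
    p = np.clip(p, 1e-12, 1 - 1e-12)       # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Initialize weights with small random values and the bias with zero
w = rng.normal(scale=0.01, size=2)
b = 0.0
learning_rate = 0.5

initial_loss = cross_entropy(sigmoid(X_train @ w + b), y_train)
val_losses = []
for epoch in range(100):                   # each full pass over the data is one epoch
    p = sigmoid(X_train @ w + b)           # forward propagation
    loss = cross_entropy(p, y_train)       # loss calculation
    grad_w = X_train.T @ (p - y_train) / len(y_train)  # gradients of the loss
    grad_b = np.mean(p - y_train)
    w -= learning_rate * grad_w            # weight update (plain gradient descent)
    b -= learning_rate * grad_b
    val_losses.append(cross_entropy(sigmoid(X_val @ w + b), y_val))  # monitor validation

final_loss = cross_entropy(sigmoid(X_train @ w + b), y_train)
test_accuracy = np.mean((sigmoid(X_test @ w + b) > 0.5) == y_test)   # final evaluation
print(initial_loss, final_loss, test_accuracy)
```

If the validation loss starts rising while the training loss keeps falling, that is the overfitting signal the validation step is meant to catch.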
4. Gradient Descent Optimization
There are several variants of Gradient Descent, each designed to improve performance, speed, or
stability in different scenarios. The most common ones are:
Batch Gradient Descent (Vanilla Gradient Descent): In batch gradient descent, the model
uses the entire training dataset to compute the gradient and update the weights. Because the
gradients must be calculated over the whole dataset to perform just one update, batch gradient
descent can be very slow and is intractable for datasets that don't fit in memory.
Stochastic Gradient Descent (SGD): SGD updates the weights using one training sample at
a time. This makes each update much cheaper and suits large datasets, but the updates are noisy
and can jump around the optimal point. However, this noise can help the model escape local
minima.
Mini-Batch Gradient Descent: This is a hybrid of batch and stochastic gradient descent and a
popular choice for training deep learning models efficiently. Instead of using the entire dataset
as in batch gradient descent, or just one sample at a time as in SGD, it uses small groups of
samples, called mini-batches, to compute each update. This balances the stable gradients of
batch gradient descent against the cheap, frequent updates of SGD.
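The three variants differ only in how many samples are used per update, so one function with a `batch_size` argument can illustrate all of them. This is a sketch on an assumed noise-free linear regression problem. The function name `gradient_descent`, the learning rate, and the epoch count are illustrative.

```python
import numpy as np

def gradient_descent(X, y, batch_size, learning_rate=0.05, epochs=100, seed=0):
    """Minimize mean squared error for a linear model y = X @ w."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(epochs):
        order = rng.permutation(n_samples)             # reshuffle every epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]      # samples used for this update
            error = X[idx] @ w - y[idx]
            grad = 2.0 * X[idx].T @ error / len(idx)   # gradient of the MSE on this batch
            w -= learning_rate * grad
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w                                         # noise-free targets (assumed)

w_batch = gradient_descent(X, y, batch_size=len(X))    # batch: one update per epoch
w_sgd = gradient_descent(X, y, batch_size=1)           # SGD: one update per sample
w_mini = gradient_descent(X, y, batch_size=32)         # mini-batch: in between
print(w_batch, w_sgd, w_mini)
```

With the same number of epochs, the batch variant performs 100 updates while SGD performs 20,000, which is why the smaller batch sizes usually make faster progress per pass over the data.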
Advantages of Gradient Descent Algorithm
5. Backpropagation
Backpropagation (short for backward propagation of errors) is the core algorithm used to
train neural networks. It helps the model learn by adjusting its weights to reduce the difference
between the predicted output and the actual result (i.e., the error). When a neural network makes
a wrong prediction, we need to adjust the weights and biases inside the network so that it
performs better next time. But a neural network has many layers and thousands (or millions) of
parameters, so it is not obvious how much each weight contributed to the error. That’s where
backpropagation comes in. It calculates the gradient (or slope) of the error with respect to each
weight using the chain rule of calculus. This tells the model exactly how to change each weight
to reduce the total error. Without backpropagation, the model wouldn’t know what to change or
how much to change during training.
In shallow networks, which typically consist of only one hidden layer, backpropagation is
relatively straightforward. The error is calculated at the output layer and then propagated
backward through the single hidden layer to adjust weights. Because of the limited depth, the
gradient computations are simpler, and the risk of issues like vanishing gradients is low.
However, as we move to deep neural networks, which contain many hidden layers,
backpropagation becomes more complex and computationally intensive. The error must be
passed through multiple layers, which involves computing partial derivatives for each weight and
activation function along the way.
5.1 Steps for Backpropagation Algorithm
The input data passes through the network, layer by layer, until the model gives a prediction.
The prediction is compared to the actual value using a loss function (like Mean Squared Error or
Cross-Entropy). This gives the total error (loss).
Now, the algorithm works backwards through the network to calculate how much each weight
contributed to the error. This is done using a mathematical tool called the chain rule from
calculus, which helps find the partial derivatives of the loss with respect to each weight.
Once the gradients (slopes) are known, the weights are updated using an optimizer (like SGD or
Adam) to reduce the error. We subtract a small fraction of the gradient from each weight—a step
controlled by the learning rate.
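The four steps above can be traced in code for a small network with one sigmoid hidden layer and a linear output. This is a sketch under assumed shapes and random data. The function names are illustrative, biases are omitted for brevity, and the numerical check at the end is included only to confirm that the chain-rule gradient is computed correctly.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, W2):
    h = sigmoid(X @ W1)              # hidden activations
    y_hat = h @ W2                   # linear output layer
    return h, y_hat

def mse(y_hat, y):
    return np.mean((y_hat - y) ** 2)

def backward(X, y, W1, W2):
    """Apply the chain rule from the loss back to each weight matrix."""
    h, y_hat = forward(X, W1, W2)
    d_yhat = 2.0 * (y_hat - y) / y.size   # dLoss/dy_hat
    grad_W2 = h.T @ d_yhat                # dLoss/dW2
    d_h = d_yhat @ W2.T                   # dLoss/dh
    d_z = d_h * h * (1.0 - h)             # sigmoid'(z) = h * (1 - h)
    grad_W1 = X.T @ d_z                   # dLoss/dW1
    return grad_W1, grad_W2

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.normal(size=(8, 1))
W1 = rng.normal(scale=0.5, size=(3, 4))
W2 = rng.normal(scale=0.5, size=(4, 1))

grad_W1, grad_W2 = backward(X, y, W1, W2)

# Numerical check of one entry using a central difference
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (mse(forward(X, W1_plus, W2)[1], y)
           - mse(forward(X, W1_minus, W2)[1], y)) / (2 * eps)

# One weight update: subtract a small fraction of the gradient
learning_rate = 0.01
loss_before = mse(forward(X, W1, W2)[1], y)
loss_after = mse(forward(X, W1 - learning_rate * grad_W1,
                         W2 - learning_rate * grad_W2)[1], y)
print(grad_W1[0, 0], numeric, loss_before, loss_after)
```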
6. Vanishing Gradient Problem
The vanishing gradient problem is a major issue that arises when training deep neural
networks with many layers. It happens when the gradients used to update the weights during
backpropagation become vanishingly small as they are propagated backward. This makes it hard for
the network to learn because the weights of the earlier layers barely change, which slows down or
stops training altogether. Fixing the vanishing gradient problem is key to training deep neural
networks.
Activation functions have a direct impact on the occurrence of the vanishing gradient
problem in neural networks. The activation functions most relevant here are sigmoid, tanh, ReLU,
and Leaky ReLU. The sigmoid function was historically one of the most popular activation functions
for neural networks, but its use restricted the training of deep networks because it causes the
vanishing gradient problem: its derivative is at most 0.25, so gradients shrink every time they
pass back through a sigmoid layer. As a result, the network learns slowly or, in some cases, not at
all. The sigmoid is a logistic function with a characteristic S shape, and its output value lies
between 0 and 1. It is still used to activate the output layer in binary classification
problems. It is calculated as follows:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
The ReLU (Rectified Linear Unit) activation function plays a key role in addressing the
vanishing gradient problem in deep neural networks. The vanishing gradient problem occurs
when gradients become extremely small during backpropagation, especially in deep networks
using activation functions like sigmoid or tanh. These functions tend to saturate for large input
values, causing gradients to shrink toward zero and preventing effective learning in earlier
layers. ReLU helps solve this by outputting the input directly if it’s positive and zero otherwise,
which means that for positive values, the gradient is consistently 1. This prevents gradients from
vanishing as they pass through multiple layers, allowing the network to learn more efficiently
and converge faster.
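A rough way to see the difference is to multiply the activation derivatives that a gradient meets on its way back through the layers. The sketch below deliberately ignores the weight matrices, which also scale the gradient in a real network, and assumes a depth of 10 with the most favourable pre-activation for sigmoid (zero).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)              # largest at z = 0, where it equals 0.25

def relu_grad(z):
    return (z > 0).astype(float)      # 1 for positive inputs, 0 otherwise

depth = 10                            # number of layers (assumed)

# Best case for sigmoid: every pre-activation is exactly 0
sigmoid_factor = np.prod(sigmoid_grad(np.zeros(depth)))

# ReLU with positive pre-activations passes the gradient through unchanged
relu_factor = np.prod(relu_grad(np.ones(depth)))

print(sigmoid_factor)                 # 0.25 ** 10, roughly 9.5e-07
print(relu_factor)                    # 1.0
```

Even in its best case, ten sigmoid layers scale the gradient by less than one millionth, while ReLU units with positive inputs leave it unchanged.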
7. ReLU (Rectified Linear Unit)
The ReLU activation function is used to introduce nonlinearity in a neural network, helping
mitigate the vanishing gradient problem during machine learning model training and enabling
neural networks to learn more complex relationships in data. If a model input is positive, the
ReLU function outputs the same value. If a model input is negative, the ReLU function outputs
zero. The formula for ReLU is:
$$f(x) = \max(0, x)$$
Advantages of ReLU
Computationally Efficient
Solves the Vanishing Gradient Problem
Faster Convergence
Disadvantages of ReLU
Dying ReLU Problem: If a neuron receives inputs that are always negative, it will always
output zero. This means it stops learning completely, a phenomenon called "dying ReLU."
Exploding gradients: Because ReLU is unbounded for positive inputs, activations and gradients
can grow very large. This leads to extremely large weight updates, which can cause the model
to fail to learn or result in numerical instability.
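The dying ReLU problem can be reproduced with a single neuron whose pre-activation is negative for every input. In this sketch the weights and the large negative bias are hypothetical values chosen to force that situation. The Leaky ReLU gradient is shown alongside for comparison.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w = np.array([0.1, -0.2, 0.05])       # hypothetical weights
b = -10.0                             # hypothetical bias, large and negative

z = X @ w + b                         # pre-activation is negative for every sample
relu_out = np.maximum(0.0, z)         # the neuron always outputs 0
relu_grad = (z > 0).astype(float)     # so its gradient is always 0 and it cannot recover

alpha = 0.01                          # Leaky ReLU slope (assumed)
leaky_grad = np.where(z > 0, 1.0, alpha)  # small but non-zero gradient

print(relu_out.max(), relu_grad.sum(), leaky_grad.min())
```

Because the ReLU gradient is zero for every sample, no weight update can move this neuron back into the active region, whereas the Leaky ReLU gradient still carries a small signal.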
8. Hyperparameter Tuning
Grid Search: It is a method where we define a set of hyperparameter values and try every
possible combination of them to find the best one. It performs exhaustive search, meaning it
checks all options systematically. For each combination, it uses cross-validation to evaluate
performance and chooses the one with the highest accuracy (or lowest error). The following
code illustrates how to use GridSearchCV:
# Necessary imports
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
# Load data
X, y = load_iris(return_X_y=True)
# Define the model and the hyperparameter grid to search
model = SVC()
params = {'C': [1, 10]}
# Grid search: evaluate every combination with 5-fold cross-validation
grid = GridSearchCV(model, params, cv=5)
grid.fit(X, y)
# Best hyperparameters and their mean cross-validated accuracy
print(grid.best_params_)
print(grid.best_score_)
Random Search: It also uses cross-validation but instead of checking every combination, it
randomly samples combinations from the hyperparameter space. This means it tries a fixed
number of random combinations, not all of them.
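Random search can be run in much the same way with scikit-learn's `RandomizedSearchCV`. The sketch below assumes the same Iris data and `SVC` model as the grid search example. The sampling ranges for `C` and `gamma`, and the budget of 10 combinations, are illustrative choices.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Load data
X, y = load_iris(return_X_y=True)

# Distributions to sample from, instead of fixed lists of values
param_distributions = {
    "C": loguniform(1e-2, 1e2),
    "gamma": loguniform(1e-4, 1e0),
}

# Try a fixed number of random combinations, each scored with 5-fold cross-validation
search = RandomizedSearchCV(
    SVC(),
    param_distributions,
    n_iter=10,
    cv=5,
    random_state=0,
)
search.fit(X, y)

# Best sampled combination
print(search.best_params_)
```

The cost is set by `n_iter` rather than by the size of the grid, so adding another hyperparameter does not multiply the number of fits.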
9. Batch Normalization
Batch normalization is a deep learning technique that has been shown to significantly
improve the training speed and stability of neural network models. It is particularly useful for
training very deep networks, as it can help to reduce the internal covariate shift that can occur
during training. The term “internal covariate shift” describes how updating the parameters of the
preceding layers changes the distribution of inputs to the current layer during training. This can
make the optimization process more difficult and can slow down the convergence of the model.
Because normalization keeps activation values from becoming too high or too low, and lets each
layer learn more independently of the others, it permits higher learning rates and faster
training. Batch normalization also has a mild regularizing effect, which may reduce the need for
dropout, and it often improves the accuracy of the final model.
Let’s understand this through an example. We have a deep neural network, as shown in
the following image.
Initially, our inputs X1, X2, X3, and X4 are normalized as they come from the pre-
processing stage. When the input passes through the first layer, it transforms as a sigmoid
function applied over the dot product of input X and the weight matrix W. Similarly, this
transformation will take place for the second layer and continue until the last layer L, as shown
in the following image.
Although our input X was normalized, over time the output will no longer be on the same scale.
As the data passes through multiple layers of the neural network and L activation functions are
applied, the distribution of the values shifts, which is the internal covariate shift.
Normalization is the process of transforming the data to have a mean of zero and a standard
deviation of one. In this step we have our batch input from layer h. First, we need to calculate
the mean of these hidden activations:
$$\mu = \frac{1}{m} \sum_{i=1}^{m} h_i$$
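Here, m is the number of activations $h_i$ being averaged, which in batch normalization is the
number of examples in the mini-batch (the statistics are computed separately for each neuron).
Once we have the mean, the next step is to calculate the standard deviation of the hidden
activations: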
$$\sigma = \left[ \frac{1}{m} \sum_{i=1}^{m} (h_i - \mu)^2 \right]^{1/2}$$
Now that we have the mean and the standard deviation, we normalize the hidden activations using
these values. To do this, we subtract the mean from each input and divide the result by the sum
of the standard deviation and the smoothing term $\epsilon$. The smoothing term ensures numerical
stability by preventing division by zero.
$$h_{i(\text{norm})} = \frac{h_i - \mu}{\sigma + \epsilon}$$
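The normalization steps above translate into a few lines of NumPy. This sketch assumes `h` is a 2-D array in which each column holds the m values of one hidden activation, and it follows the formula as written here, dividing by the standard deviation plus the smoothing term. Many libraries instead divide by the square root of the variance plus the smoothing term, and also apply a learnable scale and shift, which are omitted here.

```python
import numpy as np

def batch_norm(h, eps=1e-5):
    """Normalize each column of h to zero mean and unit standard deviation."""
    mu = h.mean(axis=0)                              # mean of the m values
    sigma = np.sqrt(((h - mu) ** 2).mean(axis=0))    # standard deviation of the m values
    return (h - mu) / (sigma + eps)                  # eps guards against division by zero

rng = np.random.default_rng(0)
h = rng.normal(loc=5.0, scale=3.0, size=(64, 4))     # assumed: m = 64 values, 4 neurons
h_norm = batch_norm(h)

print(h_norm.mean(axis=0))   # each entry close to 0
print(h_norm.std(axis=0))    # each entry close to 1
```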
Additional Complexity
Overhead in Implementation
Depends on Batch Size
10. Regularization
The most common regularization techniques are L1 regularization (Lasso), which adds the
absolute values of the model weights to the loss function, and L2 regularization (Ridge), which
adds the squared values of the weights. By incorporating these penalty terms, regularization
strikes a balance between fitting the training data and keeping the model simple, ultimately
leading to better performance on new data.
To train our machine learning model, we provide it with data to learn from. The process
of plotting a series of data points and drawing a line of best fit to understand the relationship
between variables is called Data Fitting. Overfitting happens when a model
learns the training data too well, including its noise and outliers, which hurts its performance on
new, unseen data. Regularization helps reduce overfitting by adding a penalty to the model's
complexity, discouraging it from relying too heavily on specific features or fitting the data too
closely. This makes the model more general and better at predicting on test data. In the figure
below, we can see that the model fits every point in our data. If new data is provided, the
model's curve may not match the patterns in the new data, and the model may not predict very
well.
Underfitting occurs when a machine learning model fails to learn the relationship between
variables in the training data, and so cannot predict or classify new data points well. The image
below shows an underfitting model. We can see that it doesn’t fit the given data correctly. It has
not found the patterns in the data and ignores much of the data set, so it performs poorly on
both known and unknown data.
Ridge Regression (L2 Regularization): This technique adds the sum of the squared
coefficients to the loss function. It shrinks coefficients but doesn’t make them zero, meaning
it keeps all features in the model. It’s useful when many features have small to moderate
effects.
Lasso Regression (L1 Regularization): Lasso adds the sum of the absolute values of the
coefficients as a penalty. Unlike Ridge, it can reduce some coefficients to zero, effectively
performing feature selection by removing less important variables.
ElasticNet: ElasticNet combines both L1 and L2 penalties. It balances the strengths of Ridge
and Lasso, making it effective when there are multiple correlated features or when we want
both shrinkage and feature selection.
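The cost function for a regularized model is given below, where $\lambda$ is the penalty
coefficient that controls the strength of the regularization and $w_j$ are the model weights. For
L2 regularization the penalty is the sum of squared weights, and for L1 regularization it is the
sum of their absolute values:
$$J_{L2}(w) = \text{Loss} + \lambda \sum_{j} w_j^2 \qquad J_{L1}(w) = \text{Loss} + \lambda \sum_{j} |w_j|$$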
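The penalty terms described in this section are simple to compute. The sketch below assumes mean squared error as the base loss and uses the hypothetical names `lam` for the penalty coefficient and `l1_ratio` for the ElasticNet mix. The exact weighting of the two ElasticNet terms varies between libraries, so this parameterization is only one possibility.

```python
import numpy as np

def l1_penalty(w, lam):
    return lam * np.sum(np.abs(w))        # Lasso: sum of absolute weights

def l2_penalty(w, lam):
    return lam * np.sum(w ** 2)           # Ridge: sum of squared weights

def elastic_net_penalty(w, lam, l1_ratio=0.5):
    # Weighted mix of the L1 and L2 terms (assumed parameterization)
    return lam * (l1_ratio * np.sum(np.abs(w)) + (1 - l1_ratio) * np.sum(w ** 2))

def regularized_loss(y_pred, y_true, w, lam, penalty=l2_penalty):
    mse = np.mean((y_pred - y_true) ** 2) # base loss (assumed to be MSE)
    return mse + penalty(w, lam)

w = np.array([0.5, -1.0, 2.0])
y_true = np.array([1.0, 2.0])
y_pred = np.array([1.5, 1.5])

print(l1_penalty(w, 0.1))                 # 0.1 * 3.5
print(l2_penalty(w, 0.1))                 # 0.1 * 5.25
print(elastic_net_penalty(w, 0.1))        # 0.1 * (0.5 * 3.5 + 0.5 * 5.25)
print(regularized_loss(y_pred, y_true, w, 0.1))  # 0.25 + 0.525
```

Increasing `lam` makes large weights more expensive, which pushes the optimizer toward simpler models at the cost of a slightly worse fit to the training data.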
11. Dropout
Dropout is a technique that randomly disables (or “drops”) a fraction of neurons during each
training iteration. The term "dropout" refers to dropping out nodes (in the input and hidden
layers) of a neural network. All the forward and backward connections of a dropped node are
temporarily removed, thus creating a new network architecture out of the parent network. Each
node is dropped with a dropout probability p. This prevents the network from becoming too
dependent on certain nodes and encourages it to learn more generalized features, which helps it
perform better on new data.
a. During Training: At each training iteration, dropout randomly disables a fraction (e.g.,
50%) of neurons in the network. This effectively creates a new, smaller neural network with
fewer neurons for that iteration. As a result, the model learns to work without depending on
any one specific neuron, which reduces overfitting.
b. During Testing: During testing or inference (when the model is predicting on new data), all
neurons are active. However, the weights of the neurons are scaled down to account for the
neurons that were dropped during training. This scaling is typically by a factor of (1 - dropout
rate), the probability that a neuron was kept, so that the expected output matches what the next
layer saw during training.
For example, if you set a dropout rate of 0.5, during training, half of the neurons are disabled in
each iteration. In testing, all neurons are active, but each neuron’s output is multiplied by 0.5 to
balance the network’s response.
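The train and test behaviour described above can be sketched in a few lines. This example follows the scheme in these notes, where outputs are scaled by (1 - dropout rate) at test time. Many libraries instead use "inverted dropout", which scales the kept activations up during training and leaves test time untouched. The function names and the all-ones activations are assumptions for illustration.

```python
import numpy as np

def dropout_train(activations, rate, rng):
    """Zero out each activation independently with probability `rate`."""
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask

def dropout_test(activations, rate):
    """Keep every neuron but scale outputs by the keep probability."""
    return activations * (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.ones((1000, 100))        # assumed activations, all equal to 1
rate = 0.5

train_out = dropout_train(activations, rate, rng)
test_out = dropout_test(activations, rate)

print(train_out.mean())   # close to 0.5: about half of the neurons were dropped
print(test_out.mean())    # 0.5: all neurons active, each output multiplied by 0.5
```

The two averages agree, which is the point of the scaling: the next layer receives inputs of the same expected size in training and in testing.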
The main goal of training, with or without dropout, is to decrease the loss function given
all the units (neurons). When a network overfits, a unit may change in a way that fixes up the
mistakes of the other units. This leads to complex co-adaptations, which in turn lead to
overfitting because these co-adaptations fail to generalise to unseen data. Dropout prevents
units from fixing up the mistakes of other units, thus preventing co-adaptation, because in every
iteration the presence of any given unit is unreliable. By randomly dropping a few units (nodes),
it forces the layers to take more or less responsibility for the input by taking a probabilistic
approach. This makes the model generalise better and hence reduces the overfitting problem.
Fig: (a) Hidden layer features without dropout; (b) Hidden layer features with dropout
From the above figure, we can see that the hidden layer with dropout learns more
generalised features, while the layer without dropout shows co-adaptations. Dropout breaks such
inter-unit relations and focuses more on generalisation.