ML Unit 4

This document provides an overview of neural networks, focusing on the Multilayer Perceptron (MLP) architecture, activation functions, and the training process using gradient descent and backpropagation. It discusses the advantages and disadvantages of MLPs, various activation functions, and the challenges of training deep networks, including the vanishing gradient problem. The document also highlights the importance of optimizing network training and the role of different gradient descent variants.

Uploaded by

nish1997t

UNIT IV

Neural Networks

Multilayer Perceptron, activation functions, network training – gradient descent optimization –
stochastic gradient descent, error backpropagation, from shallow networks to deep networks –
Unit saturation (aka the vanishing gradient problem) – ReLU, hyperparameter tuning, batch
normalization, regularization, dropout.

1. Multilayer Perceptron (MLP)

A neural network is a machine learning model loosely inspired by the way the human brain
processes information. It is made up of interconnected nodes, or neurons, that learn to recognize
patterns and relationships in data. Every neural network consists of layers of nodes (artificial
neurons): an input layer, one or more hidden layers, and an output layer. Each node connects to
others and has its own associated weights and threshold. In the simplest picture, if the output of
an individual node is above the specified threshold value, that node is activated and sends data to
the next layer of the network; otherwise, no data is passed along. In practice, this hard threshold
is usually replaced by a smooth activation function (see Section 2).

An MLP is a type of feedforward artificial neural network with multiple layers: an input layer,
one or more hidden layers, and an output layer. Each layer is fully connected to the next, meaning
that every neuron in a layer is connected to every neuron in the next layer. These connections
have weights and biases that control how signals are passed forward.
Fig: Multilayer Perceptron

1.1 Layers of MLP

Input Layer: The input layer takes the raw data (like numbers, images, or text features), and
passes it to the first hidden layer.

Hidden Layer: Each neuron in a hidden layer receives inputs, multiplies them by certain
weights, adds a bias, and then passes the result through an activation function like ReLU or
sigmoid, which helps the network learn complex patterns. This processed information is passed
from one layer to the next in a step called forward propagation, until it reaches the output layer,
which gives the final prediction.

Output Layer: The output layer is the final layer in a neural network. It takes all the learned
information from the previous (hidden) layers and produces the final result (a prediction or
decision).
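The layer-by-layer computation described above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the layer sizes, the random weights, and the helper name `mlp_forward` are assumptions made for the example, not part of the original notes.

```python
import numpy as np

def relu(z):
    # ReLU activation: keeps positive values, zeroes out the rest
    return np.maximum(0.0, z)

def sigmoid(z):
    # Sigmoid activation: squashes any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, params):
    """Forward propagation through an MLP with one hidden layer."""
    W1, b1, W2, b2 = params
    hidden = relu(x @ W1 + b1)           # hidden layer: weights, bias, activation
    output = sigmoid(hidden @ W2 + b2)   # output layer: one value in (0, 1) per sample
    return output

rng = np.random.default_rng(0)
n_inputs, n_hidden, n_outputs = 4, 8, 1   # illustrative sizes (assumed)
params = (
    rng.normal(scale=0.1, size=(n_inputs, n_hidden)), np.zeros(n_hidden),
    rng.normal(scale=0.1, size=(n_hidden, n_outputs)), np.zeros(n_outputs),
)
batch = rng.normal(size=(5, n_inputs))    # 5 samples with 4 features each
predictions = mlp_forward(batch, params)
print(predictions.shape)
```

Each row of `predictions` is the network's output for one input sample; with untrained random weights the values are not meaningful yet.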

Advantages of MLP

• Can Learn Complex Patterns
• Works with Any Type of Data like tables, images, text
• Generalization

Disadvantages of MLP

• Requires a Lot of Data
• Computationally Expensive
• Risk of Overfitting

2. Activation Function

Activation functions are an integral building block of neural networks that enable them to
learn complex patterns in data. They transform the input signal of a node in a neural network into
an output signal that is then passed on to the next layer. Without activation functions, neural
networks would be restricted to modeling only linear relationships between inputs and outputs.
Choosing the right activation function is crucial for training neural networks that generalize well
and provide accurate predictions.
Without activation functions, a neural network would consist only of linear operations such as
matrix multiplication. A composition of linear transformations is itself a linear transformation,
so stacking more layers would add no expressive power. Most real-world data is non-linear; for
example, the relationships between house price and size, or between income and purchases, are
rarely straight lines. Activation functions introduce the non-linearity needed to learn such
patterns, which greatly increases the flexibility and power of neural networks to model complex
data.

2.1 Types of activation function

Neural networks leverage various types of activation functions to introduce non-linearities and
enable learning complex patterns. Each activation function has its own unique properties and is
suitable for certain use cases. Here are the most common types of activation functions:

• Sigmoid Function: The sigmoid function takes any real-valued input and maps it to a value
between 0 and 1. This makes it especially useful for models where we want to predict
probabilities, such as in binary classification problems. The sigmoid function is defined as:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

• Tanh (Hyperbolic Tangent) Function: The tanh function is similar to the sigmoid function,
but it maps input values to a range between -1 and 1, instead of 0 to 1. Its formula is:

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

• ReLU (Rectified Linear Unit): ReLU outputs the input value if it's positive; otherwise, it
outputs zero. The formula is:

$$f(x) = \max(0, x)$$

• Leaky ReLU: Leaky ReLU is a slight variation of ReLU designed to fix the "dying neuron"
issue. Instead of outputting zero for negative inputs, it allows a small negative slope, so the
neuron can still learn. The formula, where $\alpha$ is a small positive constant such as 0.01, is:

$$f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{if } x \le 0 \end{cases}$$

• Softmax Activation Function: The softmax function is usually used in the output layer of a
neural network for multi-class classification problems. It converts the raw scores from the
network into probabilities that sum to 1. Each output value represents the probability of the
input belonging to a particular class. It's especially useful when your model needs to choose
one label from many possible options. The formula is:

$$\text{SoftMax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$$
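The five activation functions listed above can be written directly in NumPy. The following is a minimal sketch; the default slope `alpha=0.01` for Leaky ReLU and the max-subtraction trick in `softmax` (a common way to avoid overflow that does not change the result) are choices made for this example rather than something stated in the notes.

```python
import numpy as np

def sigmoid(x):
    # Maps any real value to (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Maps any real value to (-1, 1)
    return np.tanh(x)

def relu(x):
    # Passes positive values through, outputs zero otherwise
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Like ReLU, but keeps a small slope alpha for non-positive inputs
    return np.where(x > 0, x, alpha * x)

def softmax(x):
    # Subtracting the maximum leaves the result unchanged but avoids overflow
    exps = np.exp(x - np.max(x))
    return exps / np.sum(exps)

scores = np.array([-2.0, 0.0, 3.0])
print("sigmoid   :", sigmoid(scores))
print("tanh      :", tanh(scores))
print("relu      :", relu(scores))
print("leaky_relu:", leaky_relu(scores))
print("softmax   :", softmax(scores), "sum =", softmax(scores).sum())
```

Note that `softmax` as written treats its argument as a single vector of class scores; applying it to a batch would require normalizing along the class axis.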

3. Network Training

Neural network training involves an iterative process where the network learns to make accurate
predictions by adjusting its internal parameters (weights and biases) using a training dataset and
a loss function, typically through algorithms like backpropagation and gradient descent.
Fig: Network Training

The steps for training a deep neural network are as follows:

Step 1: Data Preparation

The first step is to collect and prepare your dataset. This includes cleaning the data, handling
missing values, converting categories to numbers if needed, and scaling features. The data is then
divided into three parts: training data (to train the model), validation data (to tune and monitor
the model), and test data (to evaluate final performance).

Step 2: Build the Model

Next, design the structure of the neural network. This involves choosing the number of layers,
how many neurons each layer should have, and which activation functions to use. ReLU is
commonly used in hidden layers, while sigmoid or softmax is used in the output layer depending
on the type of task.

Step 3: Initialize Weights

Before training begins, the model’s weights and biases are initialized. This is usually done with
small random values to start the learning process. Proper initialization helps the model converge
faster and more effectively.

Step 4: Forward Propagation


In this step, input data is passed through the network from the input layer to the output layer.
Each neuron processes the input using its current weights and activation function to produce an
output. This gives the model’s prediction.

Step 5: Calculate Loss

The model’s prediction is compared to the true output using a loss function. The loss function
measures the difference or error between the predicted result and the actual label. Common loss
functions include cross-entropy for classification and mean squared error for regression.

Step 6: Backpropagation

Once the loss is calculated, the model uses backpropagation to understand how much each
weight contributed to the error. It does this by computing gradients — how much the loss
changes with respect to each weight.

Step 7: Update Weights

Using the gradients from backpropagation, the model updates its weights to reduce the loss. This
is done using an optimizer like Stochastic Gradient Descent (SGD) or Adam. These optimizers
adjust the weights in small steps based on the learning rate.

Step 8: Repeat

Steps 4 to 7 are repeated many times over the full dataset. Each full pass through the training
data is called an epoch. With each epoch, the model improves its ability to make accurate
predictions by minimizing the loss.

Step 9: Test the Model

During training, the model is evaluated on the validation data to monitor its performance. This
helps detect overfitting—when the model performs well on training data but poorly on new data.
Finally, after training is complete, the model is tested on the test dataset. This final evaluation
shows how well the model can generalize to new, unseen data.
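The nine steps above can be put together in a compact NumPy sketch. Everything specific here is an assumption for illustration: the synthetic two-feature dataset, the 60/20/20 split, the single tanh hidden layer of 8 units, the learning rate, and the use of plain full-batch gradient descent rather than SGD or Adam.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: data preparation (synthetic data, split into train/validation/test)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float).reshape(-1, 1)
X_train, y_train = X[:180], y[:180]
X_val, y_val = X[180:240], y[180:240]
X_test, y_test = X[240:], y[240:]

# Steps 2 and 3: one hidden layer (tanh), sigmoid output, small random weights
n_hidden = 8
W1 = rng.normal(scale=0.5, size=(2, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.5, size=(n_hidden, 1))
b2 = np.zeros(1)

def forward(inputs):
    hidden = np.tanh(inputs @ W1 + b1)
    prob = 1.0 / (1.0 + np.exp(-(hidden @ W2 + b2)))
    return hidden, prob

def cross_entropy(prob, target):
    eps = 1e-12  # avoids log(0)
    return float(-np.mean(target * np.log(prob + eps)
                          + (1 - target) * np.log(1 - prob + eps)))

learning_rate = 0.3
train_losses, val_losses = [], []
for epoch in range(500):                                # Step 8: repeat over epochs
    hidden, prob = forward(X_train)                     # Step 4: forward propagation
    train_losses.append(cross_entropy(prob, y_train))   # Step 5: calculate loss
    val_losses.append(cross_entropy(forward(X_val)[1], y_val))  # monitor validation

    # Step 6: backpropagation (chain rule, output layer first)
    d_logit = (prob - y_train) / len(X_train)
    dW2 = hidden.T @ d_logit
    db2 = d_logit.sum(axis=0)
    d_pre = (d_logit @ W2.T) * (1.0 - hidden ** 2)      # tanh'(z) = 1 - tanh(z)^2
    dW1 = X_train.T @ d_pre
    db1 = d_pre.sum(axis=0)

    # Step 7: update weights
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1
    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2

# Step 9: final evaluation on the held-out test set
test_accuracy = float(np.mean((forward(X_test)[1] > 0.5) == (y_test == 1.0)))
print(f"train loss {train_losses[0]:.3f} -> {train_losses[-1]:.3f}")
print(f"validation loss {val_losses[-1]:.3f}, test accuracy {test_accuracy:.2f}")
```

Comparing `train_losses` with `val_losses` over the epochs is the monitoring described in Step 9: a training loss that keeps falling while the validation loss rises would indicate overfitting.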
4. Gradient Descent Optimization

Gradient descent is an optimization algorithm used in machine learning to minimize the
cost function by iteratively adjusting parameters in the direction of the negative gradient. Its
goal is to find the values of the model's weights that reduce the error between predicted and
actual outputs. It works by calculating the gradient (or slope) of the loss function with respect to
each weight, which tells us the direction in which the loss increases most steeply; each weight is
then moved a small step in the opposite direction. The size of each step is controlled by a
parameter called the learning rate. If the learning rate is too high, the model might overshoot the
minimum; if it is too low, training will be very slow. Gradient descent repeats this process over
many iterations until the model converges to an optimal or near-optimal solution.

Fig: Gradient Descent Optimization
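The effect of the learning rate can be seen on a one-dimensional toy problem. The sketch below assumes a hypothetical loss L(w) = (w - 3)^2, whose gradient is 2(w - 3) and whose minimum is at w = 3; the function names and the three learning rates are chosen only for illustration.

```python
def loss_gradient(w):
    # Gradient of the toy loss L(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

def gradient_descent(grad, w0, learning_rate, n_steps):
    """Repeatedly step against the gradient: w <- w - learning_rate * grad(w)."""
    w = w0
    for _ in range(n_steps):
        w = w - learning_rate * grad(w)
    return w

# A reasonable learning rate converges to the minimum at w = 3
w_good = gradient_descent(loss_gradient, w0=0.0, learning_rate=0.1, n_steps=100)

# A very small learning rate moves in the right direction, but slowly
w_slow = gradient_descent(loss_gradient, w0=0.0, learning_rate=0.001, n_steps=100)

# A learning rate that is too large overshoots further on every step
w_diverged = gradient_descent(loss_gradient, w0=0.0, learning_rate=1.1, n_steps=100)

print("learning rate 0.1  :", w_good)
print("learning rate 0.001:", w_slow)
print("learning rate 1.1  :", w_diverged)
```

With the same number of steps, only the middle-sized learning rate ends close to 3; the small one is still far from the minimum and the large one has moved away from it.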

4.1 Gradient descent variants

There are several variants of Gradient Descent, each designed to improve performance, speed, or
stability in different scenarios. The most common ones are:

• Batch Gradient Descent (Vanilla Gradient Descent): In batch gradient descent, the model
uses the entire training dataset to compute the gradient for each weight update. Because the
gradients for the whole dataset must be calculated to perform just one update, batch gradient
descent can be very slow and is intractable for datasets that don't fit in memory.
• Stochastic Gradient Descent (SGD): SGD updates the weights using one training sample at
a time. Each update is much cheaper, which makes SGD suitable for large datasets, but the
updates are noisy and can jump around the optimal point. However, this noise can help the
model escape local minima.
• Mini-Batch Gradient Descent: This is a hybrid of batch and stochastic gradient descent and
a popular technique for training deep learning models efficiently. Instead of using the entire
dataset like in Batch Gradient Descent, or just one sample at a time like in Stochastic Gradient
Descent (SGD), it uses small groups of samples, called mini-batches, to update the model.
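The three variants differ only in how many samples are used per update, so a single routine with a `batch_size` parameter can illustrate all of them. This is a hedged sketch on a made-up linear regression problem; the function name `train_linear`, the data, and the hyperparameter values are assumptions for the example.

```python
import numpy as np

def train_linear(X, y, batch_size, learning_rate=0.05, epochs=100, seed=0):
    """Fit y ~ X @ w by gradient descent on the mean squared error.

    batch_size = len(X) gives batch gradient descent,
    batch_size = 1 gives stochastic gradient descent,
    anything in between gives mini-batch gradient descent.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(epochs):
        order = rng.permutation(n_samples)           # reshuffle every epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            error = X[idx] @ w - y[idx]
            grad = 2.0 * X[idx].T @ error / len(idx)  # gradient of the batch MSE
            w -= learning_rate * grad
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])                   # weights used to generate y
y = X @ true_w + 0.01 * rng.normal(size=200)

w_batch = train_linear(X, y, batch_size=len(X))       # 1 update per epoch
w_sgd = train_linear(X, y, batch_size=1)              # 200 updates per epoch
w_mini = train_linear(X, y, batch_size=32)            # 7 updates per epoch

print("batch     :", w_batch)
print("stochastic:", w_sgd)
print("mini-batch:", w_mini)
```

On this small, low-noise problem all three variants recover weights close to `true_w`; the difference lies in the number and cost of the updates performed per epoch.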
Advantages of Gradient Descent Algorithm

• Handles Large-Scale Problems
• Simple and Easy to Implement

Disadvantages of Gradient Descent Algorithm

• Choosing the Right Learning Rate is Critical
• Sensitive to Feature Scaling
• Slow Convergence for Large Datasets

5. Error Backpropagation, from Shallow Networks to Deep Networks

Backpropagation (short for backward propagation of error) is the core algorithm used to
train neural networks. It helps the model learn by adjusting its weights to reduce the difference
between the predicted output and the actual result (i.e., the error). When a neural network makes
a wrong prediction, we need to adjust the weights and biases inside the network so that it
performs better next time. But a neural network has many layers and thousands (or millions) of
parameters, so it is not obvious how much each weight contributed to the error. That is where
backpropagation comes in. It calculates the gradient (or slope) of the error with respect to each
weight using the chain rule of calculus. This tells the model in which direction, and by how
much, to change each weight to reduce the total error. Without backpropagation, the model
wouldn't know what to change or how much to change during training.

In shallow networks, which typically consist of only one hidden layer, backpropagation is
relatively straightforward. The error is calculated at the output layer and then propagated
backward through the single hidden layer to adjust weights. Because of the limited depth, the
gradient computations are simpler, and the risk of issues like vanishing gradients is low.
However, as we move to deep neural networks, which contain many hidden layers,
backpropagation becomes more complex and computationally intensive. The error must be
passed through multiple layers, which involves computing partial derivatives for each weight and
activation function along the way.
5.1 Steps for Backpropagation Algorithm

Step 1: Forward Pass

The input data passes through the network, layer by layer, until the model gives a prediction.

Step 2: Loss Calculation

The prediction is compared to the actual value using a loss function (like Mean Squared Error or
Cross-Entropy). This gives the total error (loss).

Step 3: Backward Pass (Backpropagation)

Now, the algorithm works backwards through the network to calculate how much each weight
contributed to the error. This is done using a mathematical tool called the chain rule from
calculus, which helps find the partial derivatives of the loss with respect to each weight.

Step 4: Weight Updates

Once the gradients (slopes) are known, the weights are updated using an optimizer (like SGD or
Adam) to reduce the error. We subtract a small fraction of the gradient from each weight—a step
controlled by the learning rate.
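The four steps can be traced by hand on a very small network. The sketch below assumes a hypothetical network with one tanh hidden layer, a linear output, a squared-error loss, and no biases; it then compares one backpropagated gradient against a finite-difference estimate as a sanity check. None of these specific choices come from the notes.

```python
import numpy as np

def forward(x, W1, W2):
    # Step 1: forward pass through a hidden tanh layer and a linear output
    a1 = np.tanh(W1 @ x)
    y_hat = W2 @ a1
    return a1, y_hat

def loss_fn(y_hat, y):
    # Step 2: squared-error loss
    return 0.5 * float(np.sum((y_hat - y) ** 2))

def backward(x, y, W1, W2):
    # Step 3: backward pass, applying the chain rule from the output to the input
    a1, y_hat = forward(x, W1, W2)
    d_y_hat = y_hat - y                    # dLoss/dy_hat
    dW2 = np.outer(d_y_hat, a1)            # dLoss/dW2
    d_a1 = W2.T @ d_y_hat                  # error passed back to the hidden layer
    d_z1 = d_a1 * (1.0 - a1 ** 2)          # through tanh: tanh'(z) = 1 - tanh(z)^2
    dW1 = np.outer(d_z1, x)                # dLoss/dW1
    return dW1, dW2

rng = np.random.default_rng(0)
x = rng.normal(size=3)
y = np.array([1.0])
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(1, 4))

dW1, dW2 = backward(x, y, W1, W2)

# Sanity check: compare one entry of dW1 with a central finite difference
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (loss_fn(forward(x, W1_plus, W2)[1], y)
           - loss_fn(forward(x, W1_minus, W2)[1], y)) / (2 * eps)
print("backprop:", dW1[0, 0], "numeric:", numeric)

# Step 4: one weight update with a small learning rate
learning_rate = 0.01
loss_before = loss_fn(forward(x, W1, W2)[1], y)
W1_new = W1 - learning_rate * dW1
W2_new = W2 - learning_rate * dW2
loss_after = loss_fn(forward(x, W1_new, W2_new)[1], y)
print("loss before:", loss_before, "loss after:", loss_after)
```

The finite-difference comparison is a common way to test a hand-written backward pass, since it only needs the forward computation.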

5.2 Types of Backpropagation


• Static backpropagation: It is the standard form of the backpropagation algorithm used in
feedforward neural networks, where data flows only in one direction from input to output
without any internal loops or memory. In this method, the error between the predicted and
actual output is calculated, and then gradients are propagated backward through the network
to update the weights. Static backpropagation is efficient and widely used for tasks with
fixed-size inputs and outputs, such as image classification, where there is no need to consider
time or sequence.
• Recurrent backpropagation: It is used in recurrent neural networks (RNNs), which are
designed to handle sequential data and time-dependent tasks. Unlike feedforward networks,
RNNs have loops that allow them to maintain a "memory" of previous inputs. During
training, the network is "unfolded" over time, and the error is propagated backward through
each time step to adjust the weights. This allows the model to learn patterns and
dependencies across time. The unfolded procedure is commonly called backpropagation
through time (BPTT).

6. Unit Saturation (Vanishing Gradient Problem)

The vanishing gradient problem is a major issue that arises when training deep neural
networks with many layers. It happens when the gradients used to update the weights during
backpropagation become vanishingly small. This makes it hard for the network to learn because
the weights of the earlier layers barely change, which slows down or stops training altogether.
Fixing the vanishing gradient problem is key to training deep neural networks.
Activation functions have a direct impact on the occurrence of vanishing gradient
problems in neural networks. The relevant ones here are the sigmoid and tanh activation
functions, ReLU, and Leaky ReLU. The sigmoid function is one of the most popular activation
functions used in neural networks. Its use restricted the training of deep neural networks because
it caused the vanishing gradient problem: the network learned at a slower pace or, in some
cases, did not learn at all. The sigmoid is a logistic function with a characteristic S shape, and its
output value lies between 0 and 1. It is used for activating the output layer in binary
classification problems. It is calculated as follows:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Its derivative, $\sigma(x)(1 - \sigma(x))$, is at most 0.25, so every sigmoid layer the error passes
through during backpropagation shrinks the gradient further.

The ReLU (Rectified Linear Unit) activation function plays a key role in addressing the
vanishing gradient problem in deep neural networks. The vanishing gradient problem occurs
when gradients become extremely small during backpropagation, especially in deep networks
using activation functions like sigmoid or tanh. These functions saturate for inputs of large
magnitude (positive or negative), causing gradients to shrink toward zero and preventing
effective learning in earlier layers. ReLU mitigates this by outputting the input directly if it's
positive and zero otherwise, which means that for positive values, the gradient is consistently 1.
This keeps gradients from shrinking as they pass through active units in multiple layers,
allowing the network to learn more efficiently and converge faster.
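A rough numerical illustration of why depth matters: during backpropagation the gradient is multiplied by one activation derivative per layer. The sketch below is a deliberate simplification that ignores the weight matrices and assumes every layer sees the same pre-activation value; the depth of 10 layers is likewise an arbitrary choice for the example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: s(x) * (1 - s(x)), largest at x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise
    return 1.0 if x > 0 else 0.0

n_layers = 10

# Best case for the sigmoid: every pre-activation sits at 0, where the slope is 0.25
sigmoid_chain = sigmoid_grad(0.0) ** n_layers

# A saturated sigmoid unit (large input) has a slope close to zero
saturated_slope = sigmoid_grad(10.0)

# An active ReLU unit passes the gradient through unchanged
relu_chain = relu_grad(1.0) ** n_layers

print("sigmoid, 10 layers, best case:", sigmoid_chain)
print("sigmoid slope at x = 10      :", saturated_slope)
print("ReLU, 10 active layers       :", relu_chain)
```

Even in the sigmoid's best case the product of ten derivatives is below one millionth, whereas the product stays at 1 along a path of active ReLU units.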
7. ReLU (Rectified Linear Unit)

The ReLU activation function is used to introduce nonlinearity in a neural network, helping
mitigate the vanishing gradient problem during machine learning model training and enabling
neural networks to learn more complex relationships in data. If a model input is positive, the
ReLU function outputs the same value. If a model input is negative, the ReLU function outputs
zero. The formula for ReLU is:

$$f(x) = \max(0, x)$$

Advantages of ReLU

• Computationally Efficient
• Helps Mitigate the Vanishing Gradient Problem
• Faster Convergence

Disadvantages of ReLU

• Dying ReLU Problem: If a neuron receives inputs that are always negative, it will always
output zero. Its gradient is then also zero, so it stops learning completely, a phenomenon
called "dying ReLU."
• Exploding gradients: Because ReLU is unbounded for positive inputs, activations and
gradients can grow very large. This leads to extremely large weight updates, which can cause
the model to fail to learn or result in numerical instability.

8. Hyperparameter Tuning

Hyperparameter tuning is the practice of identifying and selecting the optimal
hyperparameters for use in training a machine learning model. Hyperparameters are settings
chosen before training rather than learned from the data, such as the learning rate, the number of
layers, or the batch size. When performed well, hyperparameter tuning lowers the model's loss
on held-out data, so that the trained model is as accurate as possible. Tuning involves
experimenting with different combinations of these hyperparameters to find the configuration
that results in the best model performance, typically measured on a validation set. The goal is to
optimize the model's ability to generalize to new, unseen data.

8.1 Common Hyperparameter Tuning Methods

• Grid Search: It is a method where we define a set of hyperparameter values and try every
possible combination of them to find the best one. It performs exhaustive search, meaning it
checks all options systematically. For each combination, it uses cross-validation to evaluate
performance and chooses the one with the highest accuracy (or lowest error). The following
code illustrates how to use GridSearchCV:
#Necessary Imports
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
# Load data
X, y = load_iris(return_X_y=True)
# Define model and parameters
model = SVC()
params = {'C': [1, 10]}
# Grid Search
grid = GridSearchCV(model, params)
grid.fit(X, y)
# Best result
print(grid.best_params_)

• Random Search: It also uses cross-validation, but instead of checking every combination, it
randomly samples combinations from the hyperparameter space. This means it tries a fixed
number of random combinations, not all of them.

from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC
# Load data
X, y = load_iris(return_X_y=True)
# Define model and parameters
model = SVC()
params = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
# Random Search
random_search = RandomizedSearchCV(model, params, n_iter=2, random_state=42)
random_search.fit(X, y)
# Best result
print(random_search.best_params_)
9. Batch Normalization

Batch normalization is a deep learning technique that has been shown to significantly
improve the speed and stability of neural network training. It is particularly useful for
training very deep networks, as it can help to reduce the internal covariate shift that can occur
during training. The term "internal covariate shift" describes how updating the parameters of the
preceding layers changes the distribution of inputs to the current layer during training. This can
make the optimization process more difficult and can slow down the convergence of the model.
Because normalization keeps activation values from becoming too high or too low, and lets each
layer learn somewhat more independently of the others, it permits higher learning rates and
faster training. Batch normalization also has a mild regularizing effect, so the dropout rate (the
fraction of units randomly disabled during training) can often be reduced. Together, these effects
often improve accuracy.

Let’s understand this through an example. We have a deep neural network, as shown in
the following image.

Initially, our inputs X1, X2, X3, and X4 are normalized as they come from the pre-
processing stage. When the input passes through the first layer, it transforms as a sigmoid
function applied over the dot product of input X and the weight matrix W. Similarly, this
transformation will take place for the second layer and continue until the last layer L, as shown
in the following image.

Although our input X was normalized, the outputs of the later layers will no longer be on the same
scale. As the data passes through the L layers of the neural network and their activation functions
are applied, the distribution of each layer's inputs shifts; this is the internal covariate shift.

9.1 Normalization of the Input

Normalization is the process of transforming the data to have a mean of zero and a standard
deviation of one. In this step we have our batch input from layer h. First, we need to calculate
the mean of this hidden activation:

$$\mu = \frac{1}{m} \sum_{i=1}^{m} h_i$$

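Here, m is the number of examples in the mini-batch and h_i is the activation for the i-th example
(the statistics are computed separately for each neuron at layer h). Once we have the mean, the
next step is to calculate the standard deviation of the hidden activations:

$$\sigma = \left[ \frac{1}{m} \sum_{i=1}^{m} (h_i - \mu)^2 \right]^{1/2}$$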

Now that we have the mean and the standard deviation, we normalize the hidden activations using
these values. For this, we subtract the mean from each input and divide the result by the sum of
the standard deviation and the smoothing term ε. The smoothing term ε ensures numerical
stability by preventing division by zero.

$$h_{i(\text{norm})} = \frac{h_i - \mu}{\sigma + \epsilon}$$

Advantages of Batch Normalization

• Speed Up the Training
• Handles internal covariate shift
• Smoothens the Loss Function

Disadvantages of Batch Normalization

• Additional Complexity
• Overhead in Implementation
• Depends on Batch Size
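The normalization step of Section 9.1 (mean, standard deviation, then dividing by the standard deviation plus the smoothing term) can be sketched as follows. The sketch assumes the usual batch-normalization convention that the statistics are computed per neuron across the examples of a mini-batch; the function name `batch_normalize`, the value of `eps`, and the random activations are illustrative. The learnable scale and shift parameters that full batch normalization applies after this step are omitted here.

```python
import numpy as np

def batch_normalize(h, eps=1e-5):
    """Normalize activations h of shape (batch_size, n_neurons) per neuron."""
    mu = h.mean(axis=0)                              # mean over the mini-batch
    sigma = np.sqrt(((h - mu) ** 2).mean(axis=0))    # standard deviation over the batch
    return (h - mu) / (sigma + eps)                  # eps guards against division by zero

rng = np.random.default_rng(0)
# Hypothetical hidden activations: 64 examples, 4 neurons, far from zero mean / unit scale
h = rng.normal(loc=5.0, scale=3.0, size=(64, 4))
h_norm = batch_normalize(h)

print("mean before:", h.mean(axis=0))
print("mean after :", h_norm.mean(axis=0))
print("std after  :", h_norm.std(axis=0))
```

After normalization each neuron's activations have a mean of approximately zero and a standard deviation of approximately one across the batch, whatever their scale was beforehand.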

10. Regularization

Regularization is a technique used in machine learning to prevent overfitting and improve
the generalization performance of models. In essence, regularization adds a penalty term to the
loss function, discouraging the model from learning overly complex patterns that may not
generalize well to unseen data. The main benefits of regularization include:

a. Reducing overfitting: By constraining the model's complexity, regularization helps prevent
the model from memorizing noise or irrelevant patterns in the training data.
b. Improving generalization: Regularized models tend to perform better on new, unseen data
because they focus on capturing the underlying patterns rather than fitting the training data
perfectly.
c. Enhancing model stability: Regularization makes models less sensitive to small fluctuations
in the training data, leading to more stable and reliable predictions.
d. Enabling feature selection: Some regularization techniques, such as L1 regularization, can
automatically identify and discard irrelevant features, resulting in more interpretable models.
Fig: Regularization

The most common regularization techniques are L1 regularization (Lasso), which adds the
absolute values of the model weights to the loss function, and L2 regularization (Ridge), which
adds the squared values of the weights. By incorporating these penalty terms, regularization
strikes a balance between fitting the training data and keeping the model simple, ultimately
leading to better performance on new data.

10.1 Understanding Overfitting and Underfitting

To train our machine learning model, we provide it with data to learn from. The process
of plotting a series of data points and drawing a line of best fit to understand the relationship
between variables is called data fitting. Overfitting happens when a model learns the training
data too well, including its noise and outliers, which hurts its performance on new, unseen data.
Regularization helps reduce overfitting by adding a penalty on the model's complexity,
discouraging it from relying too heavily on specific features or fitting the data too closely. This
makes the model more general and better at predicting on test data. In the figure below, we can
see that the model fits every point in our data. If new data is provided, the model curve may not
match the patterns in the new data, and the model may not predict very well.
Underfitting occurs when a machine learning model fails to learn the relationship between
variables in the training data, and therefore cannot predict or classify new data points either. The
image below shows an underfitting model. We can see that it doesn't fit the given data correctly:
it has not captured the patterns in the data and ignores much of the data set. It performs poorly
on both known and unknown data.

10.2 Regularization Techniques in Machine Learning

Regularization techniques in machine learning are used to prevent overfitting by adding a
penalty to the loss function, which discourages overly complex models. The three main types
are:

• Ridge Regression (L2 Regularization): This technique adds the sum of the squared
coefficients to the loss function. It shrinks coefficients but doesn't make them zero, meaning
it keeps all features in the model. It's useful when many features have small to moderate
effects.
• Lasso Regression (L1 Regularization): Lasso adds the sum of the absolute values of the
coefficients as a penalty. Unlike Ridge, it can reduce some coefficients to zero, effectively
performing feature selection by removing less important variables.
• ElasticNet: ElasticNet combines both L1 and L2 penalties. It balances the strengths of Ridge
and Lasso, making it effective when there are multiple correlated features or when we want
both shrinkage and feature selection.
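The cost function for a regularization technique is given below (shown with the squared, L2-style
penalty), where λ is the regularization strength, i.e. the weight given to the penalty, and w is the
vector of model weights.

$$\text{Cost function} = \text{Loss} + \lambda \sum \lVert w \rVert^{2}$$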
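As a small worked example of the penalized cost, the sketch below adds an L2 (Ridge) or L1 (Lasso) penalty to a data loss. The weight vector, the data loss of 0.8, and the regularization strength of 0.1 are hypothetical numbers chosen so the arithmetic is easy to follow.

```python
import numpy as np

def l2_penalty(w):
    # Ridge-style penalty: sum of squared weights
    return float(np.sum(w ** 2))

def l1_penalty(w):
    # Lasso-style penalty: sum of absolute weights
    return float(np.sum(np.abs(w)))

def regularized_cost(loss, w, lam, penalty=l2_penalty):
    """Cost = Loss + lambda * penalty(w)."""
    return loss + lam * penalty(w)

w = np.array([0.5, -2.0, 0.0, 1.5])   # hypothetical model weights
data_loss = 0.8                        # hypothetical unregularized loss
lam = 0.1                              # regularization strength

# L2: 0.25 + 4.0 + 0.0 + 2.25 = 6.5, so cost = 0.8 + 0.1 * 6.5
cost_l2 = regularized_cost(data_loss, w, lam, penalty=l2_penalty)

# L1: 0.5 + 2.0 + 0.0 + 1.5 = 4.0, so cost = 0.8 + 0.1 * 4.0
cost_l1 = regularized_cost(data_loss, w, lam, penalty=l1_penalty)

print("cost with L2 penalty:", cost_l2)
print("cost with L1 penalty:", cost_l1)
```

The L2 penalty is dominated by the largest weight (the term 4.0 comes from the weight -2.0), which is why Ridge mainly shrinks large coefficients, while the L1 penalty grows linearly with each weight.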

11. Dropout

Dropout is a technique that randomly disables (or "drops") a fraction of neurons during each
training iteration; the term "dropout" refers to dropping out nodes (in the input and hidden
layers) of a neural network. All the forward and backward connections of a dropped node are
temporarily removed, thus creating a new, thinned network architecture out of the parent
network. Each node is dropped with a dropout probability p. This prevents the network from
becoming too dependent on certain nodes and encourages it to learn more generalized features,
which helps it perform better on new data.

11.1 Working Process of Dropout

a. During Training: At each training iteration, dropout randomly disables a fraction (e.g.,
50%) of neurons in the network. This effectively creates a new, smaller neural network with
fewer neurons for that iteration. As a result, the model learns to work without depending on
any one specific neuron, which reduces overfitting.
b. During Testing: During testing or inference (when the model is predicting on new data), all
neurons are active. However, the outputs (equivalently, the outgoing weights) of the neurons
are scaled down to account for the different training structure. The scaling factor is the
keep probability, (1 − dropout rate), so that the expected output matches what the next
layer received during training.

For example, if you set a dropout rate of 0.5, during training, half of the neurons are disabled in
each iteration. In testing, all neurons are active, but each neuron's output is multiplied by
1 − 0.5 = 0.5 to balance the network's response.
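The training-time masking and test-time scaling described above can be sketched in NumPy. The function names `dropout_train` and `dropout_test`, the all-ones activations, and the seed are assumptions made for illustration.

```python
import numpy as np

def dropout_train(activations, rate, rng):
    # Training: each unit is dropped (set to zero) with probability `rate`
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask

def dropout_test(activations, rate):
    # Testing: all units are active, outputs scaled by the keep probability
    return activations * (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.ones(10_000)     # hypothetical layer outputs, all equal to 1
rate = 0.5

train_out = dropout_train(activations, rate, rng)
test_out = dropout_test(activations, rate)

print("fraction of units kept in training:", train_out.mean())
print("average output during training    :", train_out.mean())
print("average output during testing     :", test_out.mean())
```

The average output is about 0.5 in both modes, which is the purpose of the test-time scaling. Many deep learning libraries implement the equivalent "inverted dropout" instead, dividing the kept activations by (1 − rate) during training so that no scaling is needed at test time.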

The main aim of training with dropout is still to decrease the loss function over all the
units (neurons). In an overfitting network, a unit may change in a way that fixes up the mistakes
of the other units. This leads to complex co-adaptations, which in turn lead to overfitting
because these co-adaptations fail to generalise to unseen data. Dropout prevents units from
fixing up the mistakes of other units, and thus prevents co-adaptation, because in every iteration
the presence of any given unit is unreliable. By randomly dropping a few units (nodes), it forces
each remaining unit to take more or less responsibility for the input in a probabilistic way. This
makes the model generalise better and hence reduces the overfitting problem.

Fig: (a) Hidden layer features without dropout; (b) Hidden layer features with dropout

From the above figure, we can easily make out that the hidden layer with dropout is
learning more of the generalised features than the co-adaptations in the layer without dropout. It
is quite apparent, that dropout breaks such inter-unit relations and focuses more on
generalisation.
