
INTRODUCTION TO DEEP LEARNING
What is Deep Learning?
Deep Learning vs Machine Learning
Why is Deep Learning getting famous?
Types of Deep Learning
History
Deep Learning
Subfield of AI and machine learning that is inspired by the structure of the human brain.
Deep learning attempts to draw similar conclusions as humans would by continually analyzing data with a given logical structure called a neural network.
Deep learning is part of a broader family of machine learning methods based on Artificial Neural Networks with representation learning.
E.g. Dog/Cat classifier

Deep Learning algorithms use multiple layers to progressively extract higher-level features from the raw input.
E.g. in image processing, lower layers may identify edges, while higher layers identify concepts relevant to humans, such as digits, letters or faces.
COMPARISON

Parameters | Machine Learning | Deep Learning
Data Dependency | Less data required | More data required to train; more data gives better performance
Hardware Dependency | CPU to train ML model | GPU to train DL model
Training Time | Comparatively less (minutes, hours) | Complex models, training time high (days, weeks, possibly months)
Feature Selection | Manual extraction of features | Not required; DL algorithm automatically extracts relevant features
Interpretability | High | DL model is not interpretable (works like a black box)
Hyperparameter Tuning | Limited tuning capabilities | Can be tuned in various ways
Model Complexity | Comparatively simple models are used (SVM, decision trees) | Uses complex architectures with multiple hidden layers (ANN)
Application Areas | Used for basic tasks like email filtering, fraud detection | Applied to advanced tasks like image recognition, speech processing and natural language processing
Types of Neural Networks
Multilayer Perceptron
Convolutional Neural Networks
Recurrent Neural Networks
Autoencoders
Generative Adversarial Networks
GAN-Generated Images
History – Deep Learning
Why is it getting famous?
Applicability
Performance
Labelled data
Powerful hardware and increased computational power
Advanced algorithms, techniques and frameworks
Application Areas
Self driving cars
Game playing agents
Virtual assistants
Image colourization/audio generator
Image caption generator
Text Translators
Pixel Restoration
Object Detection
Artificial Neural Network
Perceptron
Perceptron vs Neuron
Prediction in Perceptron
Training in Perceptron
Problem with Perceptron
Perceptron
Prediction in Perceptron
IQ | CGPA | Placed or Not
78 | 7.8 | 1
69 | 5.1 | 0
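A minimal sketch of how a trained perceptron would predict placement from IQ and CGPA. The weights and bias below are hypothetical, chosen purely for illustration; only the two data rows come from the slide.

```python
import numpy as np

def perceptron_predict(x, w, b):
    # Weighted sum of inputs plus bias, passed through a step function
    z = np.dot(w, x) + b
    return 1 if z >= 0 else 0

# Hypothetical weights/bias for the (IQ, CGPA) -> Placed example above
w = np.array([0.02, 0.5])   # assumed weights, not from the slide
b = -4.0                    # assumed bias, not from the slide

for x, label in [((78, 7.8), 1), ((69, 5.1), 0)]:
    pred = perceptron_predict(np.array(x), w, b)
    print(x, "predicted:", pred, "actual:", label)
```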
Perceptron Vs Neuron
Geometric Intuition (IQ, CGPA, 12th Marks)
Perceptron Training
Problem with Perceptron
MULTILAYER PERCEPTRON-NOTATION
MULTILAYER PERCEPTRON
FEED FORWARD NETWORK
Perceptron vs Feed Forward Back Propagation Network (draw diagram here for both)

Aspect | Perceptron | Feed Forward Back Propagation Network
Structure | Single-layer network (input and output layer only) | Multi-layer network (input layer, hidden layers, output layer)
Learning Process | Simple learning rule (error correction) | Backpropagation algorithm (forward + backward propagation)
Activation Function | Step function (binary threshold) | Nonlinear functions (sigmoid, ReLU, tanh, etc.)
Capability | Can solve linearly separable problems only (e.g., AND, OR) | Can solve both linear and nonlinear problems (e.g., XOR)
Training Time | Faster, simpler model | Slower, requires multiple passes (forward and backward)
Error Correction | Simple weight update based on misclassified data | Error is propagated backward to adjust weights
Application | Binary classification (linearly separable problems) | Complex tasks (image recognition, etc.)
Activation Function - Definition
In an artificial neural network, each neuron forms a weighted sum of its inputs and passes the resulting scalar value through a function referred to as an activation function or transfer function.
If a neuron has n inputs x1, x2, ..., xn with weights w1, w2, ..., wn and bias b, then the output (activation) of the neuron is
y = f(w1·x1 + w2·x2 + ... + wn·xn + b)
where f is the activation function.
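A minimal sketch of this weighted-sum-plus-activation computation, assuming a sigmoid activation; the inputs, weights and bias are arbitrary illustration values.

```python
import numpy as np

def neuron_output(x, w, b, activation):
    # Weighted sum of inputs plus bias, passed through the activation function
    return activation(np.dot(w, x) + b)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])   # example inputs (arbitrary)
w = np.array([0.4, 0.1, -0.2])   # example weights (arbitrary)
b = 0.1
print(neuron_output(x, w, b, sigmoid))
```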
Activation Functions - Sigmoid
The Sigmoid activation function is characterized by its 'S' shape.
It is mathematically defined as σ(x) = 1 / (1 + e^(−x)).
• It allows neural networks to handle and model complex patterns that linear equations cannot.
• The output ranges between 0 and 1, hence it is useful for binary classification.
• The function exhibits a steep gradient when x values are between -2 and 2. This sensitivity means that small changes in input x can cause significant changes in output y, which is critical during the training process.
Activation Functions - tanh
The tanh function (hyperbolic tangent function) is a scaled and shifted version of the sigmoid, stretching the output across the y-axis. It is defined as:
tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
• Value Range: outputs values from -1 to +1.
• Non-linear: enables modeling of complex data patterns.
• Use in Hidden Layers: commonly used in hidden layers due to its zero-centered output, facilitating easier learning for subsequent layers.
Activation Functions - ReLU
The ReLU activation is defined by f(x) = max(0, x):
if the input x is positive, ReLU returns x; if the input is negative, it returns 0.
• Value Range: [0, ∞), meaning the function only outputs non-negative values.
• Nature: it is a non-linear activation function, allowing neural networks to learn complex patterns and making backpropagation more efficient.
• Advantage over other activations: ReLU is less computationally expensive than tanh and sigmoid because it involves simpler mathematical operations. At any time only a few neurons are activated, making the network sparse and therefore efficient and easy to compute.
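A short numpy sketch of the three activation functions discussed above, evaluated on a few sample inputs.

```python
import numpy as np

def sigmoid(x):
    # S-shaped curve, output in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Zero-centered, output in (-1, 1)
    return np.tanh(x)

def relu(x):
    # Returns x for positive inputs, 0 otherwise
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print("sigmoid:", sigmoid(x))
print("tanh:   ", tanh(x))
print("relu:   ", relu(x))
```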
Loss Functions
A loss function is a method of evaluating how well your algorithm is modelling your dataset.
The goal of a loss function is to guide optimization algorithms in adjusting model parameters to reduce this loss over time.
High value of the loss function - poor performance
Low value of the loss function - good performance
Why is it important?
"You can't improve what you can't measure."
Loss functions in DL:
Regression: MSE, MAE, Huber Loss
Classification: Binary cross-entropy, categorical cross-entropy, Hinge Loss
Loss Functions in DL
1. The Mean Squared Error (MSE) loss is one of the most widely used loss functions for regression tasks. It calculates the average of the squared differences between the predicted values and the actual values:
MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)²
Disadvantage:
• Sensitive to outliers because the errors are squared, which can disproportionately affect the loss.
2. The Mean Absolute Error (MAE) loss is another commonly used loss function for regression. It calculates the average of the absolute differences between the predicted values and the actual values:
MAE = (1/n) Σᵢ |yᵢ − ŷᵢ|
Disadvantage:
• Not differentiable at zero, which can pose issues for some optimization algorithms.
3. Huber Loss combines the advantages of MSE and MAE. It is less sensitive to outliers than MSE and differentiable everywhere, unlike MAE. For a residual r = yᵢ − ŷᵢ and threshold δ, it is quadratic for small residuals and linear for large ones:
L_δ(r) = ½ r² if |r| ≤ δ, and δ(|r| − ½ δ) otherwise.
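A small numpy sketch of the three regression losses above; δ = 1.0 is an arbitrary illustration value for the Huber threshold.

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean of squared differences
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    # Mean of absolute differences
    return np.mean(np.abs(y_true - y_pred))

def huber(y_true, y_pred, delta=1.0):
    # Quadratic for small residuals, linear for large ones
    r = y_true - y_pred
    small = np.abs(r) <= delta
    return np.mean(np.where(small, 0.5 * r**2, delta * (np.abs(r) - 0.5 * delta)))

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])
print(mse(y_true, y_pred), mae(y_true, y_pred), huber(y_true, y_pred))
```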
4. Binary Cross-Entropy Loss, also known as Log Loss, is used for binary classification problems. It measures the performance of a classification model whose output is a probability value between 0 and 1:
BCE = −(1/n) Σᵢ [ yᵢ log(ŷᵢ) + (1 − yᵢ) log(1 − ŷᵢ) ]
where n is the number of data points, yᵢ is the actual binary label (0 or 1), and ŷᵢ is the predicted probability.
Suitable for binary classification.
5. Categorical Cross-Entropy Loss is used for multiclass classification problems. It measures the performance of a classification model whose output is a probability distribution over multiple classes:
CCE = −(1/n) Σᵢ Σⱼ yᵢⱼ log(ŷᵢⱼ)
where n is the number of data points, k is the number of classes, yᵢⱼ is the binary indicator (0 or 1) of whether class label j is the correct classification for data point i, and ŷᵢⱼ is the predicted probability for class j.
Suitable for multiclass classification.
6. Hinge Loss is used for training classifiers, especially support vector machines (SVMs). It is suitable for binary classification tasks:
Hinge = (1/n) Σᵢ max(0, 1 − yᵢ · ŷᵢ)
where yᵢ is the actual label (-1 or 1), and ŷᵢ is the predicted value.
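A numpy sketch of the classification losses above; the labels and predicted probabilities are arbitrary illustration values, and a small eps guards the logarithms.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    # Log loss for binary labels (0/1) and predicted probabilities
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

def categorical_cross_entropy(y_true_onehot, y_prob, eps=1e-12):
    # Log loss over a predicted probability distribution for each sample
    y_prob = np.clip(y_prob, eps, 1.0)
    return -np.mean(np.sum(y_true_onehot * np.log(y_prob), axis=1))

def hinge(y_true_pm1, y_pred):
    # y_true in {-1, +1}, y_pred is the raw (unsquashed) model output
    return np.mean(np.maximum(0.0, 1.0 - y_true_pm1 * y_pred))

print(binary_cross_entropy(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.7])))
print(categorical_cross_entropy(np.array([[1, 0, 0], [0, 1, 0]]),
                                np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])))
print(hinge(np.array([1, -1, 1]), np.array([0.8, -0.5, 1.2])))
```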
Chapter 5
Backpropagation algorithm with example
Hyperparameters
Backpropagation is a technique used in deep learning to train artificial neural networks, particularly feed-forward networks. It works iteratively to adjust weights and biases to minimize the cost function.
In each epoch the model adapts these parameters, reducing the loss by following the error gradient. Backpropagation often uses optimization algorithms like gradient descent or stochastic gradient descent.
Backpropagation Algorithm Example
The goal of backpropagation is to optimize the weights so that the neural network can learn how to correctly map arbitrary inputs to outputs.
For the rest of this tutorial we are going to work with a single training set: given inputs 0.05 and 0.10, we want the neural network to output 0.01 and 0.99.
The Forward Pass
In the forward pass, the input data is fed into the input layer.
These inputs, combined with their respective weights, are passed to the hidden layers.
For example, in a network with two hidden layers (h1 and h2, as shown in Fig. (a)), the output from h1 serves as the input to h2. Before applying an activation function, a bias is added to the weighted inputs.
Each hidden layer applies an activation function like ReLU (Rectified Linear Unit), which returns the input if it is positive and zero otherwise. This adds non-linearity, allowing the model to learn complex relationships in the data.
Finally, the outputs from the last hidden layer are passed to the output layer, where an activation function such as softmax converts the weighted outputs into probabilities for classification.
The Backward Pass
In the backward pass, the error (the difference between the predicted and actual output) is propagated back through the network to adjust the weights and biases.
One common method for error calculation is the Mean Squared Error (MSE), given by:
MSE = (Predicted Output − Actual Output)²
Once the error is calculated, the network adjusts the weights using gradients, which are computed with the chain rule.
These gradients indicate how much each weight and bias should be adjusted to minimize the error in the next iteration.
The backward pass continues layer by layer, ensuring that the network learns and improves its performance. The activation function, through its derivative, plays a crucial role in computing these gradients during backpropagation.
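A compact sketch of the forward pass plus repeated backpropagation updates for a 2-2-2 network with sigmoid activations and MSE loss, matching the running example above (inputs 0.05 and 0.10, targets 0.01 and 0.99). The initial weights, biases and learning rate are arbitrary illustration values, not taken from the slides.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

x = np.array([0.05, 0.10])        # inputs from the example
t = np.array([0.01, 0.99])        # target outputs from the example

# Arbitrary initial weights/biases (2 inputs -> 2 hidden -> 2 outputs)
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)
W2, b2 = rng.normal(size=(2, 2)), np.zeros(2)
lr = 0.5                          # assumed learning rate

for step in range(100):
    # Forward pass
    h = sigmoid(W1 @ x + b1)      # hidden activations
    y = sigmoid(W2 @ h + b2)      # output activations
    loss = np.mean((t - y) ** 2)  # MSE

    # Backward pass (chain rule)
    dy = (y - t) * y * (1 - y)    # error signal at the output layer
    dW2 = np.outer(dy, h)
    db2 = dy
    dh = (W2.T @ dy) * h * (1 - h)  # error propagated to the hidden layer
    dW1 = np.outer(dh, x)
    db1 = dh

    # Gradient descent update
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

print("final output:", y, "loss:", loss)
```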
Hyperparameters
Definition:
Hyperparameters are settings that are not learned during the
training process of a machine learning model, but are instead set
before training begins.
Purpose:
They control the overall learning process and the structure of the
model, influencing how the model learns and performs.
Learning Rate: Determines how much the model's parameters
(weights) adjust during each iteration of training.
Batch Size: Specifies the number of training examples used in each
iteration (or "batch").
Number of Hidden Layers: Determines the complexity of the model.
Number of Neurons per Layer: Affects the capacity of each layer to
learn patterns.
Activation Function: Introduces non-linearity into the model,
allowing it to learn complex relationships.
Optimizer: The algorithm used to update the model's parameters
based on the loss function.
Epochs: The number of times the entire training dataset is passed
through the model during training.
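A hedged sketch of where the hyperparameters listed above typically appear in code, assuming the Keras API (tf.keras); the specific values below are arbitrary illustration choices, not recommendations from the slides.

```python
import tensorflow as tf

# Number of hidden layers, neurons per layer and activation function
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Optimizer and learning rate, plus the loss function to minimize
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

# Batch size and number of epochs are set at training time, e.g.:
# model.fit(X_train, y_train, batch_size=32, epochs=10)
```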
Number of Neurons per Layer:
Currently, all layers share the same number of neurons, but customization is
possible. It’s crucial to adapt the number of neurons based on the complexity of
the solution. Tasks with higher complexity demand an increased number of
neurons. The specified range for the number of neurons spans from 10 to 100.
Activation Function:
Input data are fed to the input layer, followed by the hidden layers and the final output layer. The output layer contains the output value. The values moving from one layer to the next keep changing according to the activation function. The activation function introduces nonlinearity into a model, allowing it to handle more complex datasets. Nonlinear models can generalize and adapt to a greater variety of data.
Optimizer:
The layers of a neural network are compiled and an optimizer is assigned. The optimizer is responsible for changing the learning rate and the weights of the neurons in the neural network in order to reach the minimum loss. The optimizer is very important for achieving the highest possible accuracy or the minimum loss.
Learning Rate:
The learning rate sets the speed at which a model adjusts its parameters in each iteration. These adjustments are known as steps.
A high learning rate means that a model will adjust more quickly, but at the risk of unstable performance and data drift.
Meanwhile, a low learning rate is more time-consuming and requires more data.
Gradient descent optimization is an example of a training algorithm requiring a set learning rate.
Batch Size
Batch size sets the number of samples the model will compute before updating its parameters. It has a significant effect on both the compute efficiency and the accuracy of the training process. On its own, a higher batch size weakens overall performance, but adjusting the learning rate along with the batch size can mitigate this loss.
Number of hidden layers
The number of hidden layers in a neural network determines its
depth, which affects its complexity and learning ability. Fewer layers
make for a simpler and faster model, but more layers—such as
with deep learning networks—lead to better classification of input
data. Identifying the optimal hyperparameter value here from all
the possible combinations is a tradeoff between speed and accuracy.
Epochs
Epochs is a hyperparameter that sets the number of times that a model is exposed to its entire training dataset during the training process. Greater exposure can lead to improved performance but runs the risk of overfitting.
Methods to find hyperparameters
Choosing the right hyperparameters is crucial for achieving good performance and preventing overfitting or underfitting.
Grid Search
Random Search
Hyperparameter Tuning techniques
GridSearch
We fit the model using all possible combinations after creating a grid of potential discrete hyperparameter values.
We log each set's model performance and then choose the combination that produces the best results.
This approach is called GridSearchCV because it searches for the best set of hyperparameters from a grid of hyperparameter values.
For example, suppose we want to set two hyperparameters, C and Alpha, of a Logistic Regression classifier model with different sets of values. The grid search technique will construct many versions of the model with all possible combinations of hyperparameters and will return the best one.
As in the image, for C = [0.1, 0.2, 0.3, 0.4, 0.5] and Alpha = [0.1, 0.2, 0.3, 0.4], the combination C = 0.3 and Alpha = 0.2 gives the highest performance score of 0.726, therefore it is selected.
Drawback: GridSearchCV will go through all the intermediate combinations of hyperparameters, which makes grid search computationally very expensive.
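A minimal scikit-learn sketch of grid search over the C values mentioned above. A synthetic dataset stands in for real data, and only C is searched here (scikit-learn's LogisticRegression exposes C but not an "Alpha" parameter).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for a real dataset
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Grid of candidate hyperparameter values
param_grid = {"C": [0.1, 0.2, 0.3, 0.4, 0.5]}

# Fit one model per combination and keep the best (by cross-validated score)
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
grid.fit(X, y)

print("best params:", grid.best_params_)
print("best score: ", round(grid.best_score_, 3))
```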
Hyperparameter Tuning techniques
RandomizedSearch / Random Search
The random search method selects values at random, as opposed to the grid search method's use of a predetermined set of values.
Every iteration, random search attempts a different set of hyperparameters and logs the model's performance.
It returns the combination that provided the best outcome after several iterations. This approach reduces unnecessary computation.
RandomizedSearchCV solves the drawback of GridSearchCV, as it goes through only a fixed number of hyperparameter settings.
It moves within the grid in a random fashion to find the best set of hyperparameters.
The advantage is that, in most cases, a random search will produce a comparable result faster than a grid search.
Drawback: it is possible that the outcome is not the ideal hyperparameter combination.
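A matching scikit-learn sketch of randomized search; sampling C from scipy's loguniform distribution is an arbitrary illustration choice.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Sample C at random from a distribution instead of a fixed grid
param_distributions = {"C": loguniform(1e-2, 1e1)}

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions,
    n_iter=10,          # fixed number of random settings to try
    cv=5,
    random_state=0,
)
search.fit(X, y)

print("best params:", search.best_params_)
print("best score: ", round(search.best_score_, 3))
```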
Selecting Number of Neurons
The number of hidden neurons should be between the size of the input layer and the size of the output layer.
The number of hidden neurons should be 2/3 the size of the input layer, plus the size of the output layer.
The number of hidden neurons should be less than twice the size of the input layer.
The number of neurons and the number of layers required for the hidden layers also depend upon the complexity of the problem.
Selecting Number of Neurons – Using a single hidden layer
Most problems can be solved by using a single hidden layer with the number of neurons equal to the mean of the input and output layer sizes.
If too few neurons are chosen, it will lead to underfitting.
Whereas if we choose too many neurons, it may lead to overfitting.
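A tiny sketch turning the rules of thumb above into code; n_inputs and n_outputs are arbitrary example values.

```python
def hidden_neuron_guidelines(n_inputs, n_outputs):
    # Rules of thumb from the slides for sizing a single hidden layer
    return {
        "mean of input and output sizes": (n_inputs + n_outputs) / 2,
        "2/3 of input size plus output size": (2 / 3) * n_inputs + n_outputs,
        "upper bound (< 2x input size)": 2 * n_inputs,
    }

print(hidden_neuron_guidelines(n_inputs=20, n_outputs=3))
```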


Selecting Number of Neurons – Using Pruning
Pruning trims neurons during training by identifying those which have no impact on the performance of the network.
Such neurons can also be identified by checking their weights: weights that are close to zero have relatively little importance. In pruning, such nodes are removed.
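A rough numpy sketch of magnitude-based pruning on a single weight matrix: weights whose magnitude falls below a threshold are zeroed out. The matrix and threshold are arbitrary illustration values.

```python
import numpy as np

def prune_small_weights(W, threshold=0.05):
    # Zero out weights whose magnitude is below the threshold
    mask = np.abs(W) >= threshold
    return W * mask, mask

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4, 4))   # example weight matrix

W_pruned, mask = prune_small_weights(W)
print("weights kept:", int(mask.sum()), "of", mask.size)
```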
