
UNIT-1 INTRODUCTION

Model Improvement and Performance
MODULE-01
Noida Institute of Engineering and Technology

INTRODUCTION

Unit: 1

DEEP LEARNING (ACSML0602)

Dr. Raju
Assistant Professor
(B Tech VIth Sem)
(CSE - AIML)



Noida Institute of Engineering and Technology

Faculty Profile:

Name : Dr. Raju

Designation: Assistant Professor & HoD

Qualification: Ph.D

Experience: 11 Years in the field of Computer Science & Engineering



Content
• Introduction to Deep Learning Curse of Dimensionality
• Regression
• Classification
• Artificial Neural Network
• Neural network architecture
• Various learning techniques
• Multilayer perceptron
• Derivation of Backpropagation Algorithm.



Course Objective

To be able to learn unsupervised techniques and provide continuous improvement in accuracy and outcomes of various datasets with more reliable and concise analysis results.


Evaluation Scheme



Syllabus
UNIT-I: Model Improvement and Performance

Curse of Dimensionality, Bias and Variance Trade off, Overfitting and underfitting,
Regression - MAE, MSE, RMSE, R Squared, Adjusted R Squared, p-Value, Classification -
Precision, Recall, F1, Other topics, K-Fold Cross validation, RoC curve, Hyper-Parameter
Tuning Introduction – Grid search, random search, Introduction to Deep Learning.

Artificial Neural Network: Neuron, Nerve structure and synapse, Artificial Neuron and its
model, activation functions, Neural network architecture: Single layer and Multilayer feed
forward networks, recurrent networks. Various learning techniques; Perception and
Convergence rule, Hebb Learning. Perceptron’s, Multilayer perceptron, Gradient descent and
the Delta rule, Multilayer networks, Derivation of Backpropagation Algorithm.



Syllabus

UNIT-II: CONVOLUTION NEURAL NETWORK

What is computer vision? Why Convolutions (CNN)?

Introduction to CNN, Train a simple convolutional neural net, Explore the design space for convolutional nets,
Pooling layer motivation in CNN, Design a convolutional layered application, Understanding and visualizing a
CNN, Transfer learning and fine-tuning CNN, Image classification, Text classification, Image classification and
hyper-parameter tuning, Emerging NN architectures



Syllabus

UNIT-III: DETECTION & RECOGNITION

Padding & Edge Detection, Strided Convolutions, Networks in Networks and 1x1 Convolutions, Inception Network Motivation, Object Detection, YOLO Algorithm.


Syllabus
UNIT-IV: RECURRENT NEURAL NETWORKS

Why use sequence models? Recurrent Neural Network Model, Notation, Back-propagation through time (BPTT), Different types of RNNs, Language model and sequence generation, Sampling novel sequences, Vanishing gradients with RNNs, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Bidirectional RNN, Deep RNNs


Syllabus

UNIT-V: AUTO ENCODERS IN DEEP LEARNING

Auto-encoders and unsupervised learning, Stacked auto-encoders and semi-supervised learning, Regularization - Dropout and Batch normalization.


Course Outcome (CO)

Course Outcome (CO) | At the end of course, the student will be able to: | Bloom's Knowledge Level (KL)
CO1 | Analyze ANN model and understand the ways of accuracy measurement. | K4
CO2 | Develop a Convolutional neural network for multi-class classification in images. | K4
CO3 | Apply Deep Learning algorithm to detect and recognize an object. | K3
CO4 | Apply RNNs to Time Series Forecasting, NLP, Text and Image Classification. | K4
CO5 | Apply lower-dimensional representation over higher-dimensional data for dimensionality reduction and capture the important features of an object. | K3
Program Outcomes (POs)

Engineering Graduates will be able to:

PO1 : Engineering Knowledge

PO2 : Problem Analysis

PO3 : Design/Development of solutions

PO4 : Conduct Investigations of complex problems

PO5 : Modern tool usage

PO6 : The engineer and society


Program Outcomes (POs)

Engineering Graduates will be able to:


PO7 : Environment and sustainability

PO8 : Ethics

PO9 : Individual and teamwork

PO10 : Communication

PO11 : Project management and finance

PO12 : Life-long learning
CO-PO Mapping

CO.K PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12

CO1 3 3 3 3 2 2 1 - 1 - 2 2

CO2 3 3 3 3 2 2 1 - 1 1 2 2

CO3 3 3 3 3 3 2 2 - 2 1 2 3

CO4 3 3 3 3 3 2 2 1 2 1 2 3

CO5 3 3 3 3 3 2 2 1 2 1 2 2

AVG 3.0 3.0 3.0 3.0 2.6 2.0 1.6 0.4 1.6 0.8 2.0 2.4



Pattern of Online External Exam Question Paper (100 marks)


Unit I Content

Model Improvement and Performance:
• Curse of Dimensionality,
• Bias and Variance Trade off,
• Overfitting and underfitting,
• Regression - MAE, MSE, RMSE, R Squared, Adjusted R Squared, p-Value,
• Classification - Precision, Recall, F1, Other topics, K-Fold Cross validation,
• RoC curve,
• Hyper-Parameter Tuning Introduction – Grid search, random search,
• Introduction to Deep Learning.

Artificial Neural Network:
• Neuron, Nerve structure and synapse,
• Artificial Neuron and its model,
• Activation functions,
• Neural network architecture: Single layer and Multilayer feed forward networks, recurrent networks.
• Various learning techniques; Perception and Convergence rule, Hebb Learning, Perceptron's, Multilayer perceptron, Gradient descent and the Delta rule,
• Multilayer networks,
• Derivation of Backpropagation Algorithm.
Unit I Objective

Analyze ANN model and understand the ways of accuracy measurement.



Topic Prerequisites

• Python, Basic Modeling Concepts



Topic Objective

To be able to learn unsupervised techniques and provide continuous improvement in accuracy and outcomes of various datasets with more reliable and concise analysis results.

Analyze ANN model and understand the ways of accuracy measurement.


Lecture Plan
UNIT-1 INTRODUCTION

• Curse of Dimensionality,
• Bias and Variance Trade off,
• Overfitting and underfitting,
• Regression - MAE, MSE, RMSE, R Squared, Adjusted R Squared, p-
Value,
• Classification - Precision, Recall, F1, Other topics, K-Fold Cross
validation, RoC curve,
• Hyper-Parameter Tuning Introduction – Grid search, random search,
• Introduction to Deep Learning.
UNIT-1 Curse of Dimensionality,
• Curse of Dimensionality refers to a set of problems that arise when
working with high-dimensional data.
• The difficulties related to training machine learning models on high-dimensional data are referred to as the 'Curse of Dimensionality'.
• Two widely discussed aspects of the curse of dimensionality are 'data sparsity' and 'distance concentration'.
UNIT-1 Curse of Dimensionality,

• Data Sparsity
• Data sparsity refers to a situation where the available data is insufficient or
incomplete for a particular analysis or task.
• In the context of data analysis and machine learning, data sparsity occurs
when many of the potential variables or combinations of variables have little
or no observed data points.
• This lack of data can lead to challenges when trying to build accurate models,
make predictions, or draw meaningful insights.
UNIT-1 Curse of Dimensionality,

• Data sparsity can arise for various reasons, including:


• Rare Events: In certain situations, events or occurrences might be rare, leading
to sparse data. For example, in fraud detection, actual fraudulent transactions
are much less common than legitimate ones.
• High-Dimensional Data: When dealing with data that has a high number of
features or dimensions, it's more likely that some combinations of features will
have few or no data points, leading to sparsity.
• Cold-Start Problem: In recommendation systems, when a new item or user is
introduced to the system, there might be very little data available for making
accurate recommendations.
• Long-Tail Distribution: In scenarios where the distribution of data follows a
long-tail distribution, a few instances dominate, while the rest have minimal
representation, leading to sparsity in the tail.
• Sparse Sampling: In experimental or observational studies, certain conditions
or groups might be underrepresented due to limitations in data collection
methods.
UNIT-1 Curse of Dimensionality,

• Techniques to address data sparsity


• Feature Engineering: Creating new features or combining existing ones can help alleviate
sparsity by providing additional information.
• Data Imputation: Filling in missing or sparse data points using statistical methods or
imputation algorithms can help create a more complete dataset.
• Dimensionality Reduction: Reducing the number of dimensions or features can help
mitigate sparsity by reducing the number of combinations with limited data.
• Regularization: In machine learning models, regularization techniques like L1
regularization (Lasso) can help in feature selection and mitigate the impact of sparse
features.
• Ensemble Methods: Combining predictions from multiple models or algorithms can help
reduce the impact of sparsity on the overall predictive power.
• Matrix Factorization: In recommendation systems, matrix factorization techniques can
help fill in missing values by learning latent factors from the available data.
• Synthetic Data Generation: Generating synthetic data points can help balance out the
representation of different categories, especially in cases where certain categories are
severely underrepresented.
UNIT-1 Curse of Dimensionality,

• Distance concentration
• In the context of machine learning, "distance concentration" refers to the phenomenon where, as the number of dimensions grows, the pairwise distances between data points become nearly the same for all pairs of points.
• In other words, the difference between the nearest and the farthest neighbour of a point shrinks relative to the average distance, so distances lose their ability to discriminate between points.
• This has direct implications for distance-based methods such as clustering, k-nearest neighbours, and anomaly detection, which rely on meaningful differences in distance.
• For example, a k-nearest-neighbour classifier that works well on a 5-dimensional dataset may degrade badly on a 500-dimensional version of the same data, because every training point appears almost equally far from the query point.
• Dimensionality reduction or feature selection is therefore often applied before using distance-based algorithms on high-dimensional data, as illustrated in the sketch below.
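A small NumPy sketch (an illustrative addition to these notes, not part of the original slide material) that shows distance concentration numerically: as the dimensionality grows, the relative gap between the nearest and the farthest neighbour of a query point shrinks.

import numpy as np

rng = np.random.default_rng(0)

for d in [2, 10, 100, 1000]:
    X = rng.random((500, d))                 # 500 random points in d dimensions
    q = rng.random(d)                        # a random query point
    dist = np.linalg.norm(X - q, axis=1)     # Euclidean distance to every point
    # Relative contrast: how much farther the farthest point is than the nearest
    contrast = (dist.max() - dist.min()) / dist.min()
    print(f"d={d:5d}  relative contrast={contrast:.3f}")

# The printed contrast shrinks as d grows, i.e. all points look almost equally far away.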
UNIT-1 Curse of Dimensionality,

• Domains of curse of dimensionality


• Anomaly Detection
• Anomaly detection is used for finding unforeseen items or events in the
dataset. In high-dimensional data anomalies often show a remarkable
number of attributes which are irrelevant in nature; certain objects occur
more frequently in neighbour lists than others.
• Combinatorics
• Whenever there is an increase in the number of possible input combinations, the complexity increases rapidly and the curse of dimensionality occurs.
• Machine Learning
• In Machine Learning, even a marginal increase in dimensionality requires a large increase in the volume of data in order to maintain the same level of performance. The curse of dimensionality is a by-product of phenomena that appear with high-dimensional data.
Regularization: Bias and Variance

• An error is a measure of how accurately an algorithm can make predictions for a previously unknown dataset.
• Reducible errors: These errors can be reduced to improve the model accuracy. Such errors can further be classified into bias and variance.
• Irreducible errors: These errors will always be present in the model.
Regularization: Bias and Variance

• Bias: the difference between the values predicted by the model and the actual/expected values is known as bias error, or error due to bias.
• Low Bias: A low-bias model makes fewer assumptions about the form of the target function.
• High Bias: A high-bias model makes more assumptions, and the model becomes unable to capture the important features of the dataset. A high-bias model also cannot perform well on new data.
• Some examples of machine learning algorithms with low bias are Decision Trees, k-Nearest Neighbours and Support Vector Machines.
• Algorithms with high bias include Linear Regression, Linear Discriminant Analysis and Logistic Regression.
Regularization: Bias and Variance

• Ways to reduce High Bias:


• Increase the input features as the model is underfitted.
• Decrease the regularization term.
• Use more complex models, such as including some polynomial
features.

Regularization: Bias and Variance

• Variance
• The variance specifies the amount of variation in the prediction if different training data were used.
• It tells how much a random variable differs from its expected value.
• Ideally, a model should not vary too much from one training dataset to another, which means the algorithm should be good at understanding the hidden mapping between input and output variables.
• Variance errors are either of low variance or high variance.
• Low variance means there is a small variation in the prediction of the target function with changes in the training dataset.
• High variance shows a large variation in the prediction of the target function with changes in the training dataset.
Regularization: Bias and Variance

• A model with high variance has the following problems:
• A high variance model leads to overfitting.
• It increases model complexity.
• Ways to Reduce High Variance:
• Reduce the input features or number of parameters, as the model is overfitted.
• Do not use an overly complex model.
• Increase the training data.
• Increase the regularization term.
Regularization: Bias and Variance

• Different Combinations of Bias-Variance

Regularization: Bias and Variance
• Different Combinations of Bias-Variance
• Low-Bias, Low-Variance:
• The combination of low bias and low variance shows an ideal machine learning model.
However, it is not possible practically.
• Low-Bias, High-Variance:
• With low bias and high variance, model predictions are inconsistent but accurate on average. This case occurs when the model learns with a large number of parameters, which leads to overfitting.
• High-Bias, Low-Variance:
• With high bias and low variance, predictions are consistent but inaccurate on average. This case occurs when a model does not learn well from the training dataset or uses a small number of parameters. It leads to underfitting problems in the model.
• High-Bias, High-Variance:
• With high bias and high variance, predictions are inconsistent and also inaccurate on average.
Regularization

Regularization: Overfitting

• Overfitting
• Overfitting is an undesirable machine learning behavior that occurs when the
machine learning model gives accurate predictions for training data but not
for new data.
• Overfitting corresponds to high variance and low bias.
• Reasons for Overfitting
• The training data size is too small and does not contain enough data samples to accurately represent all possible input data values.
• The training data contains large amounts of irrelevant information, called noisy data.
• The model trains for too long on a single sample set of data.
• The model complexity is high, so it learns the noise within the training data.
Regularization: Overfitting

• Symptom of Overfitting:
• Low training error but high validation error.
• The model fits the training data very well but fails to generalize to new data.

Regularization: Underfitting

• Underfitting
• It occurs when a model is too simple to capture data complexities.
• It represents the inability of the model to learn the training data effectively, resulting in poor performance on both the training and testing data.
• In simple terms, an underfit model's predictions are inaccurate, especially when applied to new, unseen examples.
• It mainly happens when we use a very simple model with overly simplified assumptions.
• To address the underfitting problem, we need to use more complex models, with enhanced feature representation and less regularization.
• An underfitting model has high bias and low variance.
Regularization: Underfitting

• Reasons for Underfitting


• The model is too simple, so it may not be capable of representing the complexities in the data.
• The input features used to train the model are not adequate representations of the underlying factors influencing the target variable.
• The size of the training dataset is not large enough.
• Excessive regularization is used to prevent overfitting, which constrains the model from capturing the data well.
• Features are not scaled.
Regularization: Underfitting vs. Overfitting

UNIT-1 Bias and Variance Trade off,

• Bias
• Bias is the difference between the actual and predicted values of a model.
• Bias = Actual_Value - Predicted_Value
• High bias leads to large error, and low bias leads to low error.
• Low bias helps to avoid the underfitting problem, while high bias produces an overly simple (e.g., straight-line) fit that does not represent the data accurately.
• Characteristics of a high-bias model include:
• Failure to capture proper data trends
• Potential towards underfitting
• More generalized/overly simplified
• High error rate
UNIT-1 Bias and Variance Trade off,

• Variance
• Variance represents the spread of the given data in a predictive model.
• A high-variance model is very complex, fits the training data very closely, and does not work accurately on unseen data.
• High variance generates the overfitting problem, in which the model works properly on training data but gives a high error rate on test data.
• A high-variance model typically has the following qualities:
• Noise in the data set
• Potential towards overfitting
• Complex models
• Trying to fit all data points as closely as possible
UNIT-1 Bias and Variance Trade off,
• Bias-Variance Trade-off
• A simple algorithm may lead to high bias and low variance and thus be error-prone.
• A complex algorithm may lead to low bias and high variance.
• The bias-variance trade-off lies between bias and variance.
• The trade-off seeks the model complexity that minimizes the total error (a small empirical sketch follows below):

Total Error = Bias² + Variance + Irreducible Error
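The trade-off can be seen empirically by varying model complexity. The sketch below is an illustrative addition (it assumes NumPy and scikit-learn are available; the noisy sine-wave data is invented for the demonstration): it fits polynomials of increasing degree and compares training and validation MSE. Low degrees underfit (high bias); very high degrees overfit (high variance).

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 1, 60)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 60)   # noisy sine wave

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in [1, 4, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    tr_err = mean_squared_error(y_tr, model.predict(X_tr))
    va_err = mean_squared_error(y_va, model.predict(X_va))
    print(f"degree={degree:2d}  train MSE={tr_err:.3f}  validation MSE={va_err:.3f}")

# Degree 1: both errors high (underfitting / high bias). Degree 15: train error much
# lower than validation error (overfitting / high variance). Degree 4: a better trade-off.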


UNIT-1 Overfitting and underfitting

• Underfitting
• Underfitting occurs when a model is not able to make accurate
predictions based on training data and hence, doesn’t have the capacity
to generalize well on new data.
• Another case of underfitting is when a model is not able to learn enough
from training data (Figure 2), making it difficult to capture the dominating
trend (the model is unable to create a mapping between the input and
the target variable).
• Machine learning models with underfitting tend to have poor
performance both in training and testing sets.
• Underfitting models usually have high bias and low variance.
UNIT-1 Overfitting and underfitting
• Way of detecting underfitting
• Training and test loss: If the model is underfitting, the loss for
both training and validation will be considerably high.
• Over simplistic prediction graph: If a graph with the data points
and the fitted curve is plotted, and the classifier curve is
oversimplistic, then, most probably, your model is underfitting.
UNIT-1 Overfitting and underfitting
• Way of avoiding underfitting
• Train a more complex model – Lack of model complexity in terms of data
characteristics is the main reason behind underfitting models.
• Training a more complex model will help us solve the problem of
underfitting.
• More time for training - Early training termination may cause underfitting.
As a machine learning engineer, you can increase the number of epochs or
increase the duration of training to get better results.
• Eliminate noise from data – Another cause of underfitting is the existence
of outliers and incorrect values in the dataset.
• Data cleaning techniques can help deal with this problem.
• Adjust regularization parameters - the regularization coefficient can
cause both overfitting and underfitting models.
UNIT-1 Overfitting and underfitting

• Overfitting
• A model is considered overfitting when it does extremely well on training
data but fails to perform on the same level on the validation data.
• An overfitting model fails to generalize well, as it learns the noise and
patterns of the training data to the point where it negatively impacts the
performance of the model on new data.
• If the model is overfitting, even a slight change in the training data will cause the model to change significantly.
• Models that are overfitting usually have low bias and high variance.
UNIT-1 Overfitting and underfitting
• Way of detecting overfitting
• Use a resampling technique to estimate model accuracy. The most
popular resampling technique is k-fold cross-validation. It allows you to
train and test your model k-times on different subsets of training data
and build up an estimate of the performance of a machine learning
model on unseen data. The drawback here is that it is time-consuming
and cannot be applied to complex models, such as deep neural networks.
• Hold back a validation set. Once a model is trained on the training set,
you can evaluate it on the validation dataset, then compare the accuracy
of the model in the training dataset and the validation dataset. A
significant variance in these two results allows assuming that you have an
overfitted model.
UNIT-1 Overfitting and underfitting
• Way of preventing overfitting
• Adding more data – Most of the time, adding more data can help
machine learning models detect the “true” pattern of the model,
generalize better, and prevent overfitting. However, this is not always the
case, as adding more data that is inaccurate or has many missing values
can lead to even worse results.
• Early stopping – In iterative algorithms, it is possible to measure the model's performance after each iteration. Up until a certain number of iterations, new iterations improve the model. After that point, however, the model's ability to generalize can deteriorate as it begins to overfit the training data. Early stopping refers to stopping the training process before the learner passes that point.
UNIT-1 Overfitting and underfitting
• Way of preventing overfitting
• Data augmentation – In machine learning, data augmentation
techniques increase the amount of data by slightly changing previously
existing data and adding new data points or by producing synthetic data
from a previously existing dataset.
• Remove features – You can remove irrelevant aspects from data to
improve the model. Many characteristics in a dataset may not contribute
much to prediction. Removing non-essential characteristics can enhance
accuracy and decrease overfitting.
UNIT-1 Overfitting and underfitting
• Way of preventing overfitting
• Regularization – Regularization refers to a variety of techniques to push
your model to be simpler. The approach you choose will be determined
by the model you are training. For example, you can add a penalty
parameter for a regression (L1 and L2 regularization), prune a decision
tree or use dropout on a neural network.
• Ensembling – Ensembling methods merge predictions from numerous
different models. These methods not only deal with overfitting but also
assist in solving complex machine learning problems (like combining
pictures taken from different angles into the overall view of the
surroundings). The most popular ensembling methods are boosting and
bagging.
UNIT-1 Overfitting and underfitting
• Way of preventing overfitting
• Ensembling –
• Boosting – In the boosting method, you train a large number of weak learners (constrained models) in sequence, and each learner learns from the mistakes of the previous one. All the weak learners are then combined into a single strong learner.
• Bagging is another technique to reduce overfitting. It trains a large number of strong learners (unconstrained models) in parallel and then combines them in order to smooth out their predictions.
UNIT-1 Overfitting and underfitting
• Overfitting and underfitting using a dart analogy
UNIT-1 Regression
• Regression analysis is a set of statistical methods used for the
estimation of relationships between a dependent variable and
one or more independent variables.
• Regression analysis includes several variations, such as linear,
multiple linear, and nonlinear.
• The most common models are simple linear and multiple linear.
• Nonlinear regression analysis is commonly used for more
complicated data sets in which the dependent and independent
variables show a nonlinear relationship.
UNIT-1 Regression
• Regression Analysis
• Simple Linear Regression: A model that assesses the relationship
between a dependent variable and an independent variable
Y = mx + c + e
• Where:
• Y – Dependent variable
• x – Independent (explanatory) variable
• c – Intercept
• m – Slope
• e – Residual (error)
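A minimal sketch (an illustrative addition, assuming NumPy; the toy data is invented) that estimates the slope m and intercept c of Y = mx + c + e by ordinary least squares.

import numpy as np

# Toy data generated from Y = 2x + 1 plus noise
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(0, 1, x.size)

# np.polyfit with degree 1 returns [slope m, intercept c]
m, c = np.polyfit(x, y, 1)
residuals = y - (m * x + c)          # e, the residual errors
print(f"m = {m:.2f}, c = {c:.2f}, mean residual = {residuals.mean():.3f}")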
UNIT-1 Regression
• Multiple linear regression analysis is essentially similar to the
simple linear model, with the exception that multiple
independent variables are used in the model.
• The mathematical representation of multiple linear regression
is:
Y = a + bX1 + cX2 + dX3 + ϵ
• Where:
• Y – Dependent variable
• X1, X2, X3 – Independent (explanatory) variables
• a – Intercept
• b, c, d – Slopes
• ϵ – Residual (error)
UNIT-1 Regression
• Loss Function
• The loss function is a way to measure the performance of a model.
• A high loss indicates a badly trained model, and a low loss indicates a well trained model.
• The loss function should be as small as possible.
• The loss function is calculated over a single training example:

L = (Actual_Value - Predicted_Value)²

• The loss function is sometimes also known as the error function.
• Cost Function
• The cost function is calculated over a complete batch of n examples:

C = (1/n) Σ (Actual_Value_i - Predicted_Value_i)²
UNIT-1 Regression
• Example for Loss and Cost Function

Roll No. | CGPA | IQ  | Actual Package | Predicted Package | Loss (per row) | Cost (whole batch)
1        | 5.2  | 100 | 6.3            | 6.4               | 0.01           |
2        | 4.3  | 91  | 4.5            | 5.3               | 0.64           | 3.475
3        | 8.2  | 83  | 6.5            | 5.2               | 1.69           |
4        | 8.9  | 102 | 5.5            | 8.9               | 11.56          |

NOTE: The loss function is calculated for individual data points, while the cost function is calculated for the entire dataset (the sketch below reproduces these numbers).
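The short sketch below is an illustrative addition that reproduces the loss and cost values using the package figures from the table.

actual    = [6.3, 4.5, 6.5, 5.5]     # actual package (from the table)
predicted = [6.4, 5.3, 5.2, 8.9]     # predicted package (from the table)

losses = [(a - p) ** 2 for a, p in zip(actual, predicted)]
cost = sum(losses) / len(losses)     # mean of the per-example losses

print(losses)   # approximately [0.01, 0.64, 1.69, 11.56]
print(cost)     # 3.475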
UNIT-1 Regression
• MAE (Mean Absolute Error): MAE is a metric that measures the
average absolute difference between the predicted values and
the actual values. It gives an idea of how far off the predictions
are from the true values, regardless of the direction of the error.

L = |Actual_Value - Predicted_Value|
C = (1/n) Σ |Actual_Value_i - Predicted_Value_i|
UNIT-1 Regression
• Advantages
• Easy to understand.
• Has the same unit as the unit of Actual_Value.
• It is robust to outliers: a few outliers do not dominate the error, so if the dataset contains outliers it is better to use MAE instead of MSE.
• Disadvantages
• The graph is not differentiable at zero, due to which the Gradient Descent (GD) algorithm is not straightforward to implement.
• To implement GD we need to calculate a sub-gradient.
UNIT-1 Regression
• MSE (Mean Squared Error): MSE is a metric that calculates the
average squared difference between the predicted values and
the actual values. Squaring the errors gives more weight to
larger errors, making it useful for penalizing significant
deviations from the true values.

L = (Actual_Value - Predicted_Value)²
C = (1/n) Σ (Actual_Value_i - Predicted_Value_i)²
UNIT-1 Regression
• Advantages
• Easy to interpret.
• The loss function is differentiable, which allows GD to be implemented easily.
• One local minimum: the function has a single minimum value that we have to find.

• Disadvantages
• The unit of the error is squared, which can create confusion when interpreting it; to express the error in the original units we take the square root of MSE (i.e., RMSE).
• It is not robust to outliers: if the dataset contains outliers, then MSE is not useful.
UNIT-1 Regression
• Huber loss

• Huber loss is useful when a noticeable fraction of the data (say around 25%) consists of outliers. If we use MSE, the fit deviates towards the outliers and effectively misrepresents the 75% of the data that is correct; if we use MAE, the 25% of outlier data, which is still a significant amount, is largely ignored. Huber loss behaves like MSE for small errors and like MAE for large errors, giving a compromise between the two (see the sketch below).
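A short sketch of the Huber loss (an illustrative addition, assuming NumPy; delta is the threshold at which the loss switches from quadratic to linear behaviour, and the toy values are invented).

import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    # Quadratic (MSE-like) for small errors, linear (MAE-like) for large errors
    error = y_true - y_pred
    small = np.abs(error) <= delta
    squared = 0.5 * error ** 2
    linear = delta * (np.abs(error) - 0.5 * delta)
    return np.where(small, squared, linear).mean()

y_true = np.array([3.0, 5.0, 2.5, 7.0, 100.0])   # the last value is an outlier
y_pred = np.array([2.5, 5.0, 3.0, 8.0, 9.0])
print(huber_loss(y_true, y_pred))    # the outlier contributes linearly, not quadratically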
UNIT-1 Regression
• RMSE

• It quantifies the differences between predicted values and


actual values, squaring the errors, taking the mean, and then
finding the square root.
• RMSE provides a clear understanding of the model’s performance, with
lower values indicating better predictive accuracy.
• RMSE is computed by taking the square root of MSE.
• An RMSE value of zero indicates that the model has a perfect fit.
UNIT-1 Regression
• RMSE

• The lower the RMSE, the better the model and its predictions.
• A higher RMSE indicates that there is a large deviation from the
residual to the ground truth.
UNIT-1 Regression
• Pros of the RMSE Evaluation Metric:
• RMSE is easy to understand.
• It serves as a heuristic for training models.
• It is computationally simple and easily differentiable which many optimization
algorithms desire.
• RMSE does not penalize the errors as much as MSE does due to the square root.
• Cons of the RMSE metric:
• Like MSE, RMSE is dependent on the scale of the data. It increases in magnitude if
the scale of the error increases.
• One major drawback of RMSE is its sensitivity to outliers and the outliers have to
be removed for it to function properly.
• RMSE increases with an increase in the size of the test sample. This is an issue
when we calculate the results on different test samples.
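The three regression error metrics can be computed with scikit-learn; the sketch below is an illustrative addition (RMSE is obtained by taking the square root of MSE, as described above, and the sample values are the ones from the earlier loss/cost table).

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([6.3, 4.5, 6.5, 5.5])
y_pred = np.array([6.4, 5.3, 5.2, 8.9])

mae  = mean_absolute_error(y_true, y_pred)
mse  = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)

print(f"MAE = {mae:.3f}, MSE = {mse:.3f}, RMSE = {rmse:.3f}")
# MAE = 1.400, MSE = 3.475, RMSE = 1.864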
UNIT-1 Regression
• R Squared
• R-squared (Coefficient of Determination) is a statistical measure that quantifies the proportion of the variance in the dependent variable that is explained by the independent variables in a regression model.

R² = 1 - (SSR / SST)

• Where:
• SSR (Sum of Squares Residual) represents the sum of squared differences between the observed values and the predicted values by the model.
• SST (Total Sum of Squares) represents the sum of squared differences between the observed values and the mean of the dependent variable.
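A sketch (an illustrative addition; the toy numbers are invented) computing R-squared directly from SSR and SST and cross-checking the result with scikit-learn's r2_score.

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([6.3, 4.5, 6.5, 5.5])
y_pred = np.array([6.1, 4.8, 6.2, 5.6])

ssr = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
sst = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r2 = 1 - ssr / sst

print(r2, r2_score(y_true, y_pred))            # both are approximately 0.907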
UNIT-1 Regression
• R-squared ranges between 0 and 1, with the following interpretations:
• R² = 0: The model does not explain any of the variability in the dependent variable. It's a poor fit.
• 0 < R² < 1: The model explains a proportion of the variability. A higher R-squared indicates a better fit.
• R² = 1: The model perfectly predicts the dependent variable based on the independent variables; it explains all the variability.
UNIT-1 Regression
• R-squared evaluates regression model fit but has limitations:
• High R-squared doesn't always mean good fit; high value may
imply overfitting, lacking generalization.
• Including more predictors can inflate R-squared, even if they're
weak; adjusted R-squared adjusts for this.
• "Good" R-squared varies by field; lower values acceptable in
data-rich areas.
• R-squared may miss fit quality with nonlinearity or outliers.
UNIT-1 Regression
• Adjusted R Squared

Adjusted R² = 1 - [(1 - R²)(n - 1) / (n - k - 1)]

• Where:
• n = the number of points in your data sample.
• k = the number of independent regressors, i.e. the number of variables in your model, excluding the constant.
UNIT-1 Regression
• Adjusted R Squared
• Adjusted R-squared adjusts the statistic based on the number
of independent variables in the model
• Adjusted R2 also indicates how well terms fit a curve or line,
but adjusts for the number of terms in a model.
• If you add more and more useless variables to a model,
adjusted r-squared will decrease.
• If you add more useful variables, adjusted r-squared will
increase.
• Adjusted R2 will always be less than or equal to R2
UNIT-1 Regression
• Adjusted R Squared
• Problem Statement −
• A fund has a sample R-squared value close to 0.5 and it is doubtlessly offering higher risk-adjusted returns, with a sample size of 50 for 5 predictors. Find the Adjusted R-squared value.
• Sample size = 50, number of predictors = 5, sample R-squared = 0.5. Substitute the values in the equation (worked out below).
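Completing the substitution with the values stated above (a worked step added for clarity):

Adjusted R² = 1 - [(1 - R²)(n - 1) / (n - k - 1)]
            = 1 - [(1 - 0.5)(50 - 1) / (50 - 5 - 1)]
            = 1 - (0.5 × 49 / 44)
            ≈ 1 - 0.557
            ≈ 0.443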
UNIT-1 Regression
• p-Value,
UNIT-1 Regression
• RMSE (Root Mean Squared Error): RMSE is the square root of
the MSE and is commonly used to express the average
magnitude of the prediction errors in the same units as the
dependent variable. It provides a measure of the model's
accuracy, and lower values indicate better performance.
• R Squared (Coefficient of Determination): R-squared is a
statistical measure that represents the proportion of the
variance in the dependent variable that is explained by the
independent variables in the regression model. It ranges from 0
to 1, where 1 indicates that the model explains all the variance,
and 0 indicates that the model doesn't explain any of the
variance.
UNIT-1 Regression
• Adjusted R Squared: Adjusted R-squared is a modified version
of R-squared that takes into account the number of
independent variables in the model. It penalizes the addition of
irrelevant variables that might artificially inflate the R-squared
value.
• p-Value: The p-value is a measure of the evidence against a null
hypothesis in a statistical hypothesis test. In the context of
regression analysis, p-values are used to determine whether the
coefficients of the independent variables are statistically
significant. A low p-value (typically below a significance level
like 0.05) suggests that the variable has a significant impact on
the dependent variable.
UNIT-1 Classification
• A Fraud Detection Classifier
• Objective: To detect fraud claim
• Assumption:
• The output of your fraud detection model is the probability [0.0–1.0]
that a transaction is fraudulent.
• If this probability is below 0.5, you classify the transaction as non-
fraudulent; otherwise, you classify the transaction as fraudulent.
• Methodology
• Collect 10,000 manually classified transactions, with 300 fraudulent transactions and 9,700 non-fraudulent transactions.
• You run your classifier on every transaction, predict the class label
(fraudulent or non-fraudulent) and
• summarise the results in the following confusion matrix:
UNIT-1 Classification

UNIT-1 Classification
• A True Positive (TP=100) is an outcome where the model
correctly predicts the positive (fraudulent) class.
• A True Negative (TN=9,000) is an outcome where the model
correctly predicts the negative (non-fraudulent) class.
• A False Positive (FP=700) is an outcome where the model
incorrectly predicts the positive (fraudulent) class.
• A False Negative (FN=200) is an outcome where the model
incorrectly predicts the negative (non-fraudulent) class.
UNIT-1 Classification
• Accuracy: Correctly predicted values out of total given data.

• Accuracy = ?
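Using the confusion matrix values above (TP = 100, TN = 9,000, FP = 700, FN = 200), the accuracy works out as follows (a worked step added for clarity):

Accuracy = (TP + TN) / (TP + TN + FP + FN)
         = (100 + 9,000) / 10,000
         = 0.91 (91%)

Note that accuracy looks high even though the classifier catches only 100 of the 300 fraudulent transactions, which is why precision, recall and F1 are also needed.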
UNIT-1 Classification
• Area Under Curve
• Area Under Curve(AUC) is one of the most widely used metrics for
evaluation.
• It is used for binary classification problem.
• AUC of a classifier is equal to the probability that the classifier will rank a
randomly chosen positive example higher than a randomly chosen
negative example.
• Two basic terms used in AUC:
• True Positive Rate (Sensitivity)
• True Negative Rate (Specificity)
UNIT-1 Classification
• Area Under Curve
• Few basic terms used in AUC:
• True Positive Rate (Sensitivity) : True Positive Rate is defined as TP/ (FN+TP). True
Positive Rate corresponds to the proportion of positive data points that are
correctly considered as positive, with respect to all positive data points.

• True Negative Rate (Specificity): True Negative Rate is defined as TN / (FP+TN). True Negative Rate corresponds to the proportion of negative data points that are correctly considered as negative, with respect to all negative data points.
UNIT-1 Classification
• Area Under Curve
• False Positive Rate : False Positive Rate is defined as FP / (FP+TN). False Positive
Rate corresponds to the proportion of negative data points that are mistakenly
considered as positive, with respect to all negative data points.

• False Positive Rate and True Positive Rate both have values in the range
[0, 1].
• FPR and TPR both are computed at varying threshold values such as
(0.00, 0.02, 0.04, …., 1.00) and a graph is drawn.
• AUC is the area under the curve of plot False Positive Rate vs True
Positive Rate at different points in [0, 1].
UNIT-1 Classification
• Area Under Curve
• As evident, AUC has a range of [0, 1]. The greater the value, the better is
the performance of our model.
UNIT-1 Classification
• F1-Score:
• F1 Score is used to measure a test’s accuracy
• F1 Score is the Harmonic Mean between precision and recall.
• The range for F1 Score is [0, 1].
• It tells you how precise your classifier is (how many instances it classifies
correctly), as well as how robust it is (it does not miss a significant number of
instances).
• High precision but lower recall gives you an extremely accurate classifier, but one that misses a large number of instances that are difficult to classify.
• The greater the F1 Score, the better is the performance of our model.
• Mathematically, it can be expressed as:

F1 Score = 2 × (Precision × Recall) / (Precision + Recall)

• F1 Score tries to find the balance between precision and recall.
UNIT-1 Classification
• Precision: It is the number of correct positive results divided by the number of positive results predicted by the classifier.

Precision = TP / (TP + FP)

• Recall: It is the number of correct positive results divided by the number of all relevant samples (all samples that should have been identified as positive).

Recall = TP / (TP + FN)
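A short sketch (an illustrative addition) computing precision, recall, F1 and accuracy for the fraud-detection confusion matrix given earlier (TP = 100, FP = 700, FN = 200, TN = 9,000).

TP, FP, FN, TN = 100, 700, 200, 9000

precision = TP / (TP + FP)                     # 100 / 800  = 0.125
recall    = TP / (TP + FN)                     # 100 / 300  = 0.333
f1        = 2 * precision * recall / (precision + recall)
accuracy  = (TP + TN) / (TP + TN + FP + FN)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
# precision=0.125 recall=0.333 f1=0.182 accuracy=0.910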
UNIT-1 Classification
• Cross validation is a technique used in machine learning to evaluate the performance
of a model on unseen data.
• It involves dividing the available data into multiple folds or subsets, using one of
these folds as a validation set, and training the model on the remaining folds.
• This process is repeated multiple times, each time using a different fold as the
validation set.
• Finally, the results from each validation step are averaged to produce a more robust
estimate of the model’s performance.
• The main purpose of cross validation is to prevent overfitting, which occurs when a
model is trained too well on the training data and performs poorly on new, unseen
data
• Cross validation techniques: K-fold cross validation, leave-one-out cross validation,
and stratified cross validation.
UNIT-1 Classification
• Leave-One-Out Cross Validation (LOOCV)
• Training is performed on the whole dataset except for one data point, which is left out for testing; this is repeated (iterated) for each data point in turn.
• Advantage:
• Advantage of using this method is that we make use of all data points and hence it is low
bias.
• Drawback:
• It leads to higher variation in the testing model as we are testing against one data point. If the
data point is an outlier it can lead to higher variation.
• It takes a lot of execution time as it iterates over ‘the number of data points’ times.
UNIT-1 Classification
• K-Fold Cross Validation
• In this method, we split the dataset into k subsets (known as folds), then we perform training on k-1 of the subsets and leave one subset for the evaluation of the trained model.
• We iterate k times, with a different subset reserved for testing each time (a minimal sketch follows below).
• Note:
• A value of k = 10 is commonly suggested: a lower value of k moves the estimate towards a simple hold-out validation, while a higher value of k approaches the LOOCV method.
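A minimal scikit-learn sketch of k-fold cross validation (an illustrative addition; the iris dataset and logistic regression model are only placeholders).

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

kf = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=kf)    # one accuracy score per fold

print(scores.mean(), scores.std())              # average performance and its spread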
UNIT-1 Classification
• Advantages of cross-validation:
• Overcoming Overfitting: Cross validation helps to prevent overfitting
by providing a more robust estimate of the model’s performance on
unseen data.
• Model Selection: Cross validation can be used to compare different
models and select the one that performs the best on average.
• Hyperparameter tuning: Cross validation can be used to optimize the
hyperparameters of a model, such as the regularization parameter, by
selecting the values that result in the best performance on the
validation set.
• Data Efficient: Cross validation allows the use of all the available data
for both training and validation, making it a more data-efficient
method compared to traditional validation techniques.
UNIT-1 Classification
• Stratified k-fold cross-validation
• This technique is similar to k-fold cross-validation with some little
changes. This approach works on stratification concept, it is a process of
rearranging the data to ensure that each fold or group is a good
representative of the complete dataset. To deal with the bias and
variance, it is one of the best approaches.
UNIT-1 Classification
• Disadvantages of Cross Validation:
• Computationally Expensive: Cross validation can be computationally
expensive, especially when the number of folds is large or when the
model is complex and requires a long time to train.
• Time-Consuming: Cross validation can be time-consuming, especially
when there are many hyperparameters to tune or when multiple
models need to be compared.
• Bias-Variance Tradeoff: The choice of the number of folds in cross
validation can impact the bias-variance tradeoff, i.e., too few folds may
result in high variance, while too many folds may result in high bias.
UNIT-1 Classification
• RoC curve
UNIT-1 Classification
• Hyper-Parameter Tuning Introduction
• Hyperparameter tuning involves adjusting the parameters of a machine
learning algorithm that are not learned from data but set before the
learning process begins.

• It aims to find the best combination of hyperparameters that yields


optimal performance for a given problem.

• Effective hyperparameter tuning can significantly improve the


performance of machine learning models.
UNIT-1 Classification
• Techniques for hyperparameter tuning:
• Grid Search:
• Involves defining a grid of possible hyperparameter values.
• Systematically searches through all possible combinations.
• Useful for a smaller set of hyperparameters.
• Random Search:
• Randomly samples hyperparameters from predefined distributions.
• More efficient when searching over a large hyperparameter space.
UNIT-1 Classification
• Techniques for hyperparameter tuning:
• Grid Search:
• Grid Search helps identify the combination of hyperparameters that
results in the best performance for a given problem.
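A small sketch of both techniques with scikit-learn (an illustrative addition; the SVM classifier and its parameter grid are only an example, not part of the original slides).

from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_iris(return_X_y=True)

# Grid search: tries every combination in the grid
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}, cv=5)
grid.fit(X, y)
print("grid search best:", grid.best_params_, grid.best_score_)

# Random search: samples a fixed number of combinations from the same space
rand = RandomizedSearchCV(SVC(), {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]},
                          n_iter=5, cv=5, random_state=0)
rand.fit(X, y)
print("random search best:", rand.best_params_, rand.best_score_)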
UNIT-1 Introduction to Deep Learning
• Introduction to Deep Learning
• Deep learning is a subset of machine learning that employs artificial
neural networks to model and solve complex problems.
• It is inspired by the structure and function of the human brain, using
interconnected nodes (neurons) to process and learn from data.
• Neural networks can have multiple layers, enabling them to learn
hierarchical representations and capture intricate patterns in data.
• Activation functions introduce non-linearity to neural networks,
allowing them to model complex relationships in the data.
• Backpropagation is a key algorithm in deep learning, adjusting
network weights and biases to minimize prediction errors.
Deep Learning Applications (CO1)

Here are just a few examples of deep learning at work:


• A self-driving vehicle slows down as it approaches a
pedestrian crosswalk.
• An ATM rejects a counterfeit bank note.
• A smartphone app gives an instant translation of a foreign
street sign.

• Deep learning is especially well-suited to identification


applications such as face recognition, text translation, voice
recognition, and advanced driver assistance systems,
including, lane classification and traffic sign recognition.



Some other Applications (CO1)

• Used for speed of machines
• Digital imaging
• Fraud Detection
• Increasing phone efficiency


What Makes Deep Learning State-of-the-Art? (CO1)

In a word, accuracy. Advanced tools and techniques have dramatically improved deep learning algorithms
—to the point where they can outperform humans at classifying images, win against the world’s best GO
player, or enable a voice-controlled assistant like Amazon Echo® and Google Home to find and download
that new song you like.



What Makes Deep Learning State-of-the-Art? (CO1)

Three technology enablers make this degree of accuracy possible:


Easy access to massive sets of labeled data. Data sets such as ImageNet and PASCAL VOC are freely available, and are useful for training on many different types of objects.

Increased computing power High-performance GPUs accelerate


the training of the massive amounts of data needed for deep
learning, reducing training time from weeks to hours.

Pretrained models built by experts Models such as AlexNet can be retrained to perform new recognition
tasks using a technique called transfer learning. While AlexNet was trained on 1.3 million high-resolution
images to recognize 1000 different objects, accurate transfer learning can be achieved with much smaller
datasets.



Difference between AI, ML, DL (CO1)

(Diagram: AI ⊃ ML ⊃ DL; deep learning is a subset of machine learning, which is a subset of artificial intelligence.)


Why deep learning is needed (CO1)

1. Huge amount of data
(Initially we started with ML; its major drawback is that its efficiency degrades or plateaus as the amount of data grows.)

(Figure: x-axis - amount of data, y-axis - efficiency)

Deep learning provides the solution: it can handle huge amounts of data, which may be structured or unstructured.

2. Complex problems
These include real-time data analysis, medical diagnosis systems, etc., which are handled by deep learning.


UNIT-1 Introduction to Deep Learning
• Gradient descent is an optimization technique that guides weight
updates toward minimizing the loss function.
• Convolutional Neural Networks (CNNs) are tailored for tasks involving
grid-like data such as images, using convolutional layers to detect
features.
• Recurrent Neural Networks (RNNs) are used for sequential data,
capturing temporal dependencies in applications like text and speech.
• Long Short-Term Memory (LSTM) networks address the vanishing
gradient problem in RNNs, handling long-range dependencies.
• Generative Adversarial Networks (GANs) consist of two networks
competing to generate realistic data, used for image generation and
more.
UNIT-1 Introduction to Deep Learning

• Transfer learning involves leveraging pre-trained models for new tasks,


benefiting from existing learned representations.
• Attention mechanisms focus on relevant parts of input data, enhancing
sequence-to-sequence tasks like language translation.
• Autoencoders are used for unsupervised learning and data compression,
involving encoding and decoding data representations.
• Deep learning finds applications in computer vision, natural language
processing, healthcare, finance, gaming, and AI research.
• It continues to advance with innovations in neural network architectures,
optimization methods, and hardware acceleration.
UNIT-1 Introduction to Deep Learning
Aspect | Machine Learning | Deep Learning
Definition | Algorithms learn from data to make predictions or decisions without explicit programming. | Subset of machine learning using neural networks to model complex patterns from data.
Feature Engineering | Requires manual feature engineering and selection. | Can automatically learn features from raw data.
Data Requirements | Can perform well with smaller datasets. | Often requires larger datasets for meaningful learning.
Performance | May plateau in performance with complex problems or larger datasets. | Continues to improve with more data and complexity.
Interpretability | Models are often more interpretable, allowing better understanding of feature contributions. | Complex models can be black boxes, challenging to interpret.
Hardware Resources | Can be trained on standard hardware. | Requires powerful GPUs or specialized hardware for training due to computational demands.
Applications | Used in a wide range of applications, including simpler image analysis and recommendation systems. | Primarily used for complex tasks like image recognition, NLP, and speech recognition.
Training Complexity | Training is generally less computationally intensive and faster. | Training can be computationally intensive and time-consuming, especially for complex architectures.
Evolution and Trends | Developed over decades, still widely used. | Gained prominence more recently due to success in various domains.
Problem Types | Effective for a variety of problems, especially with structured data. | Suited for unstructured and complex data problems.
Performance on Data | Requires well-engineered features for optimal performance. | Can automatically learn hierarchical features and patterns.
UNIT-1 Artificial Neural Network
• Artificial Neural Network:
• Neuron, Nerve structure and synapse, Artificial Neuron and its model,
• activation functions,
• Neural network architecture:
• Single layer and Multilayer feed forward networks,
• recurrent networks.
• Various learning techniques;
• Perception and Convergence rule,
• Hebb Learning.
• Perceptron’s,
• Multilayer perceptron,
• Gradient descent and the Delta rule,
• Multilayer networks,
• Derivation of Backpropagation Algorithm.
UNIT-1 Artificial Neural Network
• Artificial Neural Network:
UNIT-1 Artificial Neural Network

• Artificial Neural Network:


• An Artificial Neural Network (ANN) is a computational model inspired by the
structure and functioning of biological neural networks in the human brain.
• It is a key component of the field of machine learning and artificial intelligence.
• ANNs are used to model complex relationships and patterns in data, and they excel
at tasks like classification, regression, and pattern recognition.
• An ANN is composed of interconnected nodes, also called neurons, organized into
layers.
• The neurons are designed to process and transmit information using a combination
of mathematical operations and activation functions.
• The basic concept of an ANN involves receiving input data, passing it through
multiple layers of interconnected neurons, and producing an output or prediction
based on the learned patterns.
UNIT-1 Artificial Neural Network
• Structure of ANN
• The structure of an ANN typically includes:
• Input Layer: The initial layer that receives the input data or features. Each neuron in this layer
corresponds to a specific feature in the input data.
• Hidden Layers: Intermediate layers between the input and output layers. These layers are
responsible for processing the input data through a series of weighted connections and
activation functions. Hidden layers allow ANNs to learn complex features and representations.
• Output Layer: The final layer that produces the prediction or output. The number of neurons in
this layer depends on the type of task the ANN is designed for. For example, in a binary
classification task, there might be one output neuron with a sigmoid activation function.
UNIT-1 Artificial Neural Network
• Structure of ANN
• The structure of an ANN typically includes:
• Neurons
• Neurons simulate biological neurons, processing inputs and producing outputs.
• Each neuron computes a weighted sum of its inputs, applies an activation function, and passes the
result to the next layer.
• Weights and Biases:
• Each connection between neurons has an associated weight, indicating its importance.
• Biases are added to the weighted sum to introduce flexibility.
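A sketch of a single artificial neuron (an illustrative addition, assuming NumPy; the input, weight and bias values are invented): a weighted sum of the inputs plus a bias, passed through an activation function.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])      # inputs (one per incoming connection)
w = np.array([0.4, 0.7, -0.2])      # weights (importance of each connection)
b = 0.1                             # bias term

z = np.dot(w, x) + b                # weighted sum plus bias
a = sigmoid(z)                      # activation: the neuron's output
print(z, a)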
UNIT-1 Artificial Neural Network
• Structure of ANN
• The structure of an ANN typically includes:
• Activation Functions
• Non-linear functions applied to the weighted sum to introduce non-linearity.
• Common activation functions include ReLU, sigmoid, and tanh.
UNIT-1 Artificial Neural Network
• Activation Functions(AF)
• Activation functions introduce non-linearity to the network.
• In the absence of an AF, ANNs would be limited to representing linear relationships, which restricts their ability to learn complex patterns and relationships in data.
UNIT-1 Artificial Neural Network
• Activation Functions(AF)
• ReLU (Rectified Linear Unit):
• A rectified linear unit (ReLU) is an activation function that introduces the property of non-linearity to a deep learning model and helps address the vanishing gradients issue.
• It outputs the positive part of its argument.
• It is one of the most popular activation functions in deep learning.
• The variants of ReLU include
• leaky ReLU,
• exponential linear unit (ELU) and
• sigmoid linear unit (SiLU)
• Mathematical Representation:

ReLU(x) = max(0, x)

• Derivative:

ReLU'(x) = 1 for x > 0 and 0 for x < 0 (the value at x = 0 is taken as 0 by convention)
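A minimal NumPy sketch of ReLU and its derivative (an illustrative addition).

import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    # 1 for positive inputs, 0 otherwise (the value at exactly 0 is a convention)
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))              # [0.  0.  0.  0.5 2. ]
print(relu_derivative(x))   # [0. 0. 0. 1. 1.]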
UNIT-1 Artificial Neural Network
• Activation Functions(AF)
• ReLU Advantages:
• Non-linearity: ReLU introduces non-linearity to the model, allowing it to learn complex patterns and
relationships in the data. This is crucial for neural networks to model a wide range of functions
effectively.
• Computationally Efficient: ReLU activation is computationally efficient to compute during both forward
and backward passes of training. The activation only involves simple thresholding of values, which leads
to faster training times compared to more complex activation functions like sigmoid or tanh.
• Sparse Activation: ReLU has a characteristic called "sparse activation." It means that only a subset of
neurons is activated for any given input. This sparsity can lead to more efficient learning and reduced
overfitting, as neurons are less likely to co-adapt.
• Mitigating Vanishing Gradient: Unlike sigmoid and tanh functions, ReLU does not saturate for positive
inputs, which helps mitigate the vanishing gradient problem. This makes training deeper networks more
feasible as gradients can flow more effectively through the network during backpropagation.
• Empirical Success: ReLU has shown remarkable empirical success in various deep learning applications,
including image and speech recognition, natural language processing, and more. It has contributed to
the rise of deep learning in recent years.
UNIT-1 Artificial Neural Network
• Activation Functions(AF)
• ReLU Disadvantages:
• Dying ReLU Problem: One significant issue with ReLU is the "dying ReLU" problem. During training,
some neurons can become inactive (output zero for all inputs) and stay that way. Once a large
gradient flows through a ReLU neuron and updates its weights such that it always produces
negative outputs, it will never activate again. This leads to dead neurons that do not contribute to
learning.
• Not Zero-Centered: ReLU is not zero-centered, meaning the output of ReLU is always positive or
zero. This can lead to issues in weight updates during training and can affect convergence.
• Unbounded Activation: ReLU does not have an upper bound, which means that if a large gradient
flows through a ReLU neuron, it can lead to "exploding gradients," causing training instability.
• Sensitivity to Initialization: ReLU neurons can be sensitive to weight initialization. If the initial
weights are too large, it's more likely for neurons to get stuck in the inactive state, contributing to
the dying ReLU problem.
• Leaky ReLU and Variants: To address some of the issues with standard ReLU, variations like Leaky
ReLU, Parametric ReLU, and Exponential Linear Units (ELUs) have been proposed. These variations
introduce controlled non-zero slopes for negative inputs or introduce other adaptive
characteristics to mitigate the disadvantages of ReLU.
UNIT-1 Artificial Neural Network
• Activation Functions(AF)
• Sigmoid:
• The sigmoid function squashes the input into a range between 0 and 1.
• It's often used in the output layer for binary classification tasks.
• Mathematical Representation: σ(x) = 1 / (1 + e^(−x))
• Derivative of the sigmoid function: σ'(x) = σ(x) · (1 − σ(x))
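• A short NumPy sketch of the sigmoid and its derivative (the function names are our own):

import numpy as np

def sigmoid(x):
    # Squashes any real input into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid(0.0))             # 0.5
print(sigmoid_derivative(0.0))  # 0.25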


UNIT-1 Artificial Neural Network
• Activation Functions(AF)
• Tanh (Hyperbolic Tangent):
• The tanh function also squashes the input, but between -1 and 1.
• It's often used in hidden layers to introduce non-linearity.
• The Tanh Activation function is widely used in neural networks due to its centered, zero-mean
output, which helps mitigate the vanishing gradient problem that occurs with the sigmoid
activation function.
• The zero-centered property facilitates faster convergence during the training process, as it avoids
biasing the updates in one particular direction.
• Mathematical Representation: tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
• Derivative: d/dx tanh(x) = 1 − tanh²(x)
To know more refer : https://www.nomidl.com/deep-learning/what-is-tanh-activation-function/
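• A brief NumPy sketch of tanh and its derivative, for comparison with sigmoid (names are our own):

import numpy as np

def tanh(x):
    return np.tanh(x)                 # zero-centered output in (-1, 1)

def tanh_derivative(x):
    return 1.0 - np.tanh(x) ** 2

print(tanh(0.0), tanh_derivative(0.0))  # 0.0 1.0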


UNIT-1 Artificial Neural Network
• Activation Functions(AF)
• Softmax:
• The softmax function is used in the output layer for multi-class classification tasks.
• It converts a vector of values into a probability distribution, making it suitable for selecting one
class from multiple options.
• The softmax function is sometimes called the softargmax function, or multi-class logistic
regression.
• The softmax function can be used in a classifier only when the classes are mutually exclusive.
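• A numerically stable softmax sketch (the max-subtraction trick is a common implementation detail, not part of the definition):

import numpy as np

def softmax(z):
    # Subtracting the maximum avoids overflow; the outputs still sum to 1
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))        # approximately [0.659 0.242 0.099]
print(softmax(scores).sum())  # 1.0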
UNIT-1 Artificial Neural Network
• Activation Functions(AF)
• Leaky ReLU:
• Similar to ReLU, but allows a small negative slope for inputs below zero.
• This mitigates the "dying ReLU" problem where neurons can get stuck during training.
• Leaky ReLU(x) = x if x > 0, and alpha * x if x <= 0 (where alpha is a small positive constant).
UNIT-1 Artificial Neural Network
• Activation Functions(AF)
• ELU (Exponential Linear Unit):
• ELU is similar to ReLU for positive inputs, but it has a non-zero gradient for negative inputs, which
can help with preventing dead neurons.
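• A small sketch covering both Leaky ReLU (previous slide) and ELU; the alpha values here are common defaults, not prescribed ones:

import numpy as np

def leaky_relu(x, alpha=0.01):
    # Small positive slope alpha for negative inputs instead of a hard zero
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # Smooth curve with a non-zero gradient for negative inputs
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-2.0, 0.5])
print(leaky_relu(x))  # [-0.02  0.5 ]
print(elu(x))         # approximately [-0.8647  0.5]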
UNIT-1 Artificial Neural Network
• Neural Network Architecture
UNIT-1 Artificial Neural Network
• Neural Network Architecture
• Feedforward Neural Networks (FNNs): These are the simplest type of
neural networks, consisting of an input layer, one or more hidden layers,
and an output layer. Neurons are connected only in one direction, from the
input layer to the output layer. FNNs are used for tasks like regression and
classification.
Figure: Multi-Layer Feedforward Neural Network
UNIT-1 Artificial Neural Network
• Neural Network Architecture
• Single Layer Feed Forward Neural Network
• A Single Layer Feedforward Neural Network, also known as a
Perceptron,
• It is the simplest type of neural network architecture.
• It consists of only one layer of neurons, which directly connects
the input to the output.
• This type of network is primarily used for binary classification
• tasks where the input features are linearly separable.
• Structure:
• Input Layer:
• The input layer consists of input nodes, each representing a feature of the input data.
• These nodes do not perform any computation; they simply pass the input values to the output layer.
• Output Layer:
• The output layer consists of a single node or neuron. The output of this neuron is determined by a weighted sum of the input
values along with a bias term.
• The output is then passed through an activation function to produce the final prediction or output of the network.
UNIT-1 Artificial Neural Network
• Single Layer Feed Forward Neural Network (Example)
• Problem: Classify whether a fruit is an apple or a banana based on two features: fruit diameter (in
centimeters) and fruit weight (in grams). We have a dataset with labeled examples of apples and
bananas.
• Network Architecture:
• A single-layer neural network for this problem consists of two input neurons (one for diameter and one for weight) and one output neuron for
binary classification.
• Activation Function:
• Typically, a step function or a sigmoid function is used as the activation function in single-layer perceptrons. In this example, we'll use a step
function.
• Weights and Bias:
• The network assigns weights to each input feature and has a bias term. These weights and bias are learned during training.
• Training:
• During training, the network adjusts the weights and bias to minimize the classification error. The weights and bias are updated using a learning
algorithm like the perceptron learning rule.
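• A sketch of the forward pass for this single-layer example; the weights, bias, and class labels below are hypothetical values chosen for illustration, not learned ones:

import numpy as np

def step(z):
    # Step activation: 1 if the weighted sum is non-negative, else 0
    return 1 if z >= 0 else 0

w = np.array([-0.9, 0.05])   # hypothetical weights for [diameter_cm, weight_g]
b = -1.0                     # hypothetical bias

def predict(features):
    # Label 0 = apple, label 1 = banana (arbitrary encoding)
    return step(np.dot(w, features) + b)

print(predict(np.array([7.5, 150.0])))  # 0 -> classified as apple
print(predict(np.array([3.5, 120.0])))  # 1 -> classified as banana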
UNIT-1 Artificial Neural Network
• Single Layer Feed Forward Neural Network (Example)
UNIT-1 Artificial Neural Network
• Neural Network Architecture
• Single Layer Feed Forward Neural Network
• Activation Function:
• The activation function is typically a step function or a sigmoid function.
• The step function returns a binary output based on whether the weighted sum is above a certain
threshold, while the sigmoid function produces a continuous output between 0 and 1.
• The choice of activation function depends on the problem at hand.
• Limitations:
• A single-layer feedforward neural network can only learn linear decision boundaries. It cannot capture complex
patterns in the data that are not linearly separable.
• It might struggle with problems that require more sophisticated feature extraction or involve non-linear
relationships.
UNIT-1 Artificial Neural Network
• Neural Network Architecture
• Multi Layer Feed Forward Neural Network
• Also referred to as a Multi-Layer Perceptron (MLP), it is a type of artificial neural network with a more complex architecture than its single-layer counterpart.
• It is designed to handle a wide range of machine learning tasks, especially those involving complex and
nonlinear relationships within data.
• Multi-layer neural networks have found extensive applications in areas such as image recognition, natural
language processing, speech recognition, and many other domains.
UNIT-1 Artificial Neural Network
• Multi Layer Feed Forward Neural Network
• Architecture:
• It consists of three or more layers:
• Input Layer: This layer receives the raw input data, and each neuron in this layer represents a feature or attribute of the
input.
• Hidden Layers: These are one or more layers placed between the input and output layers. Each hidden layer contains
multiple neurons, and these layers introduce nonlinearity into the network. The presence of hidden layers enables the
network to model complex, hierarchical patterns and relationships in the data.
• Output Layer: The output layer produces the final result of the network's computation. The number of neurons in this layer
depends on the specific task, with binary classification tasks having one neuron, multi-class classification having one neuron
per class, and regression tasks having one or more neurons for the output.
• Functionality:
• Multi-layer neural networks excel at approximating complex functions and capturing intricate patterns within data. They
can represent both linear and nonlinear transformations of input data, making them highly versatile for a wide range of
machine learning tasks. This versatility is achieved through the following key components:
• Activation Functions: Nonlinear activation functions, such as ReLU (Rectified Linear Unit), sigmoid, and tanh, are applied to
the neurons in the hidden layers. These functions introduce nonlinearity into the network, allowing it to model complex
relationships.
• Weighted Connections: Each connection between neurons in adjacent layers has an associated weight, which determines
the strength of the connection. These weights are learned during training to optimize the network's performance.
• Backpropagation: Multi-layer neural networks are typically trained using the backpropagation algorithm, combined with
gradient descent optimization. This involves iteratively adjusting the weights to minimize the error between the predicted
and actual output.
UNIT-1 Artificial Neural Network
• Multi Layer Feed Forward Neural Network
• Applications:
• Multi-layer neural networks have demonstrated remarkable success in various machine learning and artificial intelligence
tasks, including:

• Image Recognition: Convolutional neural networks (CNNs), a specialized type of multi-layer network, have achieved state-of-
the-art performance in image classification, object detection, and segmentation tasks.

• Natural Language Processing: Multi-layer neural networks, such as recurrent neural networks (RNNs) and long short-term
memory networks (LSTMs), are used for tasks like text generation, sentiment analysis, and machine translation.

• Speech Recognition: Multi-layer networks have been applied to automatic speech recognition (ASR) systems, converting
spoken language into text.

• Recommendation Systems: They are used in collaborative filtering and content-based recommendation systems to
personalize content recommendations for users.
UNIT-1 Artificial Neural Network
• Multi Layer Feed Forward Neural Network
• Problem: Classify whether a bank customer will churn (leave the bank) or stay based on features such as
their credit score, age, tenure, and balance.
• Data:
• We'll use a synthetic dataset with the following features:
• Credit Score
• Age
• Tenure
• Balance
• Network Architecture:
• For this example, we'll create an MLP with three hidden layers, each containing 64 neurons. The input layer will have four
neurons (one for each feature), and the output layer will have one neuron for binary classification.
• Activation Function:
• We'll use the Rectified Linear Unit (ReLU) activation function for hidden layers and a sigmoid activation function for the
output layer.

• https://colab.research.google.com/drive/1Vc50HdjexdN3B5TDGXsRxPJHyGin0ef6?usp=sharing
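• A hedged Keras sketch of the architecture described above (4 input features, three hidden layers of 64 ReLU units, one sigmoid output). The random arrays stand in for the real churn data, and the linked notebook may differ in detail:

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Synthetic stand-in data: 4 features (credit score, age, tenure, balance)
X = np.random.rand(1000, 4)
y = np.random.randint(0, 2, size=(1000,))

model = keras.Sequential([
    keras.Input(shape=(4,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),   # probability of churn
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, verbose=0)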
UNIT-1 Artificial Neural Network
• Differences between Single layer and Multi-Layer Neural Networks
UNIT-1 Artificial Neural Network
• Neural Network Architecture
• Differences between Single layer and Multi-Layer Neural Networks
• Architecture:
• Single-layer neural network: It consists of a single layer of computational neurons that connects the inputs directly to the outputs. There are no hidden layers between the input and output.
• Multi-layer neural network: It has more than one layer of neurons, typically including an input layer, one or more
hidden layers, and an output layer. The layers between the input and output are called hidden layers.
• Capabilities:
• Single-layer neural network: It can only model linearly separable functions. In other words, it can solve simple
problems where the data can be separated by a straight line or hyperplane.
• Multi-layer neural network: It can approximate complex, nonlinear functions. By adding hidden layers and using
nonlinear activation functions, MLPs can capture intricate patterns and relationships in data, making them capable
of handling a wide range of tasks, including classification, regression, and more.
UNIT-1 Artificial Neural Network
• Neural Network Architecture
• Differences between Single layer and Multi-Layer Neural Networks
• Representation:
• Single-layer neural network: These networks can represent linear transformations of the input data, making them
limited in their ability to capture complex relationships in the data.
• Multi-layer neural network: The presence of hidden layers allows MLPs to represent both linear and nonlinear
transformations of the input data. This enables them to model complex, hierarchical features and relationships
within the data.
• Activation functions:
• Single-layer neural network: Typically uses linear activation functions, which result in linear transformations of the
input.
• Multi-layer neural network: Uses nonlinear activation functions (e.g., sigmoid, ReLU, tanh) in the hidden layers to
introduce nonlinearity into the network, enabling it to learn and represent nonlinear relationships.
UNIT-1 Artificial Neural Network
• Neural Network Architecture
• Differences between Single layer and Multi-Layer Neural Networks
• Learning and Training:
• Single-layer neural network: Training is relatively straightforward and can often be done using simple algorithms
like the perceptron learning rule.
• Multi-layer neural network: Training is more complex and usually requires advanced optimization techniques like
backpropagation and gradient descent. These networks benefit from the use of gradient-based optimization
methods to adjust the weights and biases during training.
• Use Cases:
• Single-layer neural network: Suitable for simple tasks like binary classification or linear regression when the data
is linearly separable.
• Multi-layer neural network: Suited for a wide range of tasks, including image recognition, natural language
processing, speech recognition, and more, where the data has complex and nonlinear relationships.
UNIT-1 Artificial Neural Network
• Neural Network Architecture
• Comparison at a glance (Aspect: Single-Layer vs. Multi-Layer):
• Architecture: Single-layer consists of only one layer of computing neurons; multi-layer consists of multiple layers, including input, hidden, and output layers.
• Capabilities: Single-layer can model only linearly separable functions; multi-layer can approximate complex, nonlinear functions.
• Representation: Single-layer represents linear transformations of the input data; multi-layer represents both linear and nonlinear transformations.
• Activation Functions: Single-layer typically uses linear activation functions; multi-layer uses nonlinear activation functions in hidden layers.
• Learning and Training: Single-layer training is relatively simple, often using the perceptron learning rule; multi-layer training is more complex, usually involving backpropagation and gradient descent.
• Use Cases: Single-layer suits simple tasks like binary classification or linear regression with linearly separable data; multi-layer suits a wide range of tasks, including image recognition and natural language processing, involving complex, nonlinear data relationships.
• Hidden Layers: Single-layer does not have hidden layers; multi-layer includes one or more hidden layers.
• Nonlinearity: Single-layer lacks inherent nonlinearity, limiting its representation capabilities; multi-layer introduces nonlinearity through activation functions in hidden layers.
• Complex Data Relationships: Single-layer struggles to capture complex data relationships; multi-layer can model and represent them effectively.
UNIT-1 Artificial Neural Network
• Neural Network Architecture
• Recurrent Neural Networks (RNNs):
• RNNs are designed to handle sequences of data, making them suitable for tasks such as natural
language processing and time series analysis.
• Recurrent Neural Network(RNN) is a type of Neural Network where the output from the previous
step is fed as input to the current step.
• The hidden state, also known as the memory state, is the most important feature of an RNN: it remembers information about the earlier elements of a sequence.
UNIT-1 Artificial Neural Network

Types of Learning Rules in ANN (covered in the following slides: Hebbian learning rule, Perceptron learning rule, Delta learning rule)


UNIT-1 Artificial Neural Network
• Hebb Learning.
• Proposed by Donald O. Hebb.
• It is used for pattern classification.
• It is a single-layer neural network, i.e. it has one input layer and one output layer.
• The input layer can have many units, say n, but the output layer has only one unit.
• The Hebbian rule works by updating the weights between neurons in the network for each training sample.
• Hebbian Learning Rule Algorithm:
1. Set all weights to zero, wi = 0 for i = 1 to n, and set the bias to zero.
2. For each input vector and target output pair, S : t, repeat steps 3-5.
3. Set activations for the input units with the input vector: xi = si for i = 1 to n.
4. Set the corresponding output value to the output neuron, i.e. y = t.
5. Update the weights and bias by applying the Hebb rule for all i = 1 to n: wi(new) = wi(old) + xi · y, and b(new) = b(old) + y.
UNIT-1 Hebbian Learning Rule
• Principle: This rule is based on the biological concept that “neurons that fire
together, wire together.”
• Hebbian Learning Rule is an unsupervised learning algorithm used in neural
networks to adjust the weights between nodes.
• It is based on the principle that the connection strength between two neurons
should change depending on their activity patterns.
• The rule can be summarized as follows:
• When two neighboring neurons operate in the same phase at the same time, the weight between them
increases.

• If the neurons operate in opposite phases, the weight between them decreases.

• When there is no signal correlation between the neurons, the weight remains unchanged.

• The sign of the weight between two nodes is determined by the sign of their input signals:
• If both nodes receive inputs that are both positive or both negative, the resulting weight is strongly positive.
• If one node receives a positive input and the other a negative input, the resulting weight is strongly negative.
UNIT-1 Hebbian Learning Rule
UNIT-1 Artificial Neural Network
• Implementation of AND gate using Hebb Learning.
• Step 1 : Set weight and bias to zero, w = [ 0 0 0 ]T and b = 0.
• Step 2 : Set input vector Xi = Si for i = 1 to 4.
• X1 = [ -1 -1 1 ]T
• X2 = [ -1 1 1 ]T
• X3 = [ 1 -1 1 ]T
• X4 = [ 1 1 1 ]T
• Step 3 : Output value is set to y = t.
• Step 4 : Modifying weights using Hebbian Rule:
• First iteration –
• w(new) = w(old) + x1y1 = [ 0 0 0 ]T + [ -1 -1 1 ]T . [ -1 ] = [ 1 1 -1 ]T
• For the second iteration, the final weight of the first one will be used and so on.
• Second iteration –
• w(new) = [ 1 1 -1 ]T + [ -1 1 1 ]T . [ -1 ] = [ 2 0 -2 ]T
• Third iteration –
• w(new) = [ 2 0 -2]T + [ 1 -1 1 ]T . [ -1 ] = [ 1 1 -3 ]T
• Fourth iteration –
• w(new) = [ 1 1 -3]T + [ 1 1 1 ]T . [ 1 ] = [ 2 2 -2 ]T
• So, the final weight matrix is [ 2 2 -2 ]T
UNIT-1 Artificial Neural Network
• Testing the network :
• For x1 = -1, x2 = -1, b = 1, Y = (-1)(2) + (-1)(2) + (1)(-2) = -6
• For x1 = -1, x2 = 1, b = 1, Y = (-1)(2) + (1)(2) + (1)(-2) = -2
• For x1 = 1, x2 = -1, b = 1, Y = (1)(2) + (-1)(2) + (1)(-2) = -2
• For x1 = 1, x2 = 1, b = 1, Y = (1)(2) + (1)(2) + (1)(-2) = 2
• The results are all compatible with the original table.
• Decision Boundary :
• 2x1 + 2x2 – 2b = y
• Replacing y with 0, 2x1 + 2x2 – 2b = 0
• Since bias, b = 1, so 2x1 + 2x2 – 2(1) = 0
• 2( x1 + x2 ) = 2
• The final equation, x2 = -x1 + 1
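• A short script reproducing the worked AND-gate example above; it prints the weight vector after each sample and ends at [2, 2, -2]:

import numpy as np

# Bipolar inputs with a constant bias input of 1, and bipolar AND targets
X = np.array([[-1, -1, 1],
              [-1,  1, 1],
              [ 1, -1, 1],
              [ 1,  1, 1]])
t = np.array([-1, -1, -1, 1])

w = np.zeros(3)                 # [w1, w2, b] initialised to zero
for x, target in zip(X, t):
    w = w + x * target          # Hebb update: w(new) = w(old) + x * y, with y = t
    print(w)
# Final line printed: [ 2.  2. -2.], matching the hand-worked result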
UNIT-1 Perceptron Rules
• Perceptron Rules
• Supervised Learning Algorithm: The Perceptron Rule is a supervised learning
algorithm used for binary classification tasks.
• Developed by Frank Rosenblatt: It was developed by Frank Rosenblatt in the late
1950s.
• Objective: The main goal of the Perceptron Rule is to learn a linear decision
boundary that can separate two classes of data points in a feature space.
• Linear Separability: It is suitable for problems where the data is linearly
separable, meaning it can be separated by a straight line (or hyperplane in higher
dimensions).
• Components: The algorithm works with input features, weights, an activation
function (typically a step function), and a bias term (threshold).
• Training: It iteratively updates the weights based on misclassified data points until
a stopping criterion is met. Correctly classified points do not trigger weight
updates.
UNIT-1 Perceptron Rules
• Perceptron Rules
• Update Rule: When a data point is misclassified as class 0 when it should be class 1, the weights
are increased for the associated features. When misclassified as class 1 when it should be class 0,
the weights are decreased for the associated features.
• Limitations: The Perceptron Rule can only solve linearly separable problems and cannot handle
tasks with nonlinear decision boundaries.
• Historical Significance: It played a pivotal role in the history of machine learning and served as a
foundation for more complex neural network models like multi-layer perceptrons (MLPs).
• Activation Function: Typically uses a step function for making binary decisions based on the
weighted sum of inputs.
• Bias: A bias term (threshold) is used to shift the decision boundary.
• Supervised Learning: Requires labeled training data for learning and updating weights.
• Early Neural Network: Represents one of the earliest forms of artificial neural networks and
contributed to the development of the field.
UNIT-1 Perceptron Learning Rule
• Principle: This rule adjusts weights to minimize classification errors.

• The Perceptron Learning Rule is an error-correcting algorithm designed for single-layer feedforward networks.
• It is a supervised learning approach that adjusts weights based on the error calculated between the desired and actual outputs.
• Weight adjustments are made only when an error is present.
• The update is computed as follows: wi(new) = wi(old) + η · (t − y) · xi and b(new) = b(old) + η · (t − y), where t is the desired output, y is the actual output, and η is the learning rate.
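• A minimal sketch of the perceptron update loop on a small linearly separable problem (logical OR); the dataset, learning rate, and epoch count are our own illustrative choices:

import numpy as np

def step(z):
    return 1 if z >= 0 else 0

# Inputs with a constant bias term of 1, and OR targets
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
t = np.array([0, 1, 1, 1])

w = np.zeros(3)
eta = 0.1                                  # learning rate
for epoch in range(10):
    for x, target in zip(X, t):
        y = step(np.dot(w, x))
        w += eta * (target - y) * x        # weights change only when an error occurs
print(w)                                   # a weight vector that separates the two classes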


UNIT-1 Perceptron Learning Rule

https://colab.research.google.com/drive/1xHwz0NZhxJLg751sESGSjSlQMkR9m9vf?authuser=1#scrollTo=VPAQVhFVqpvD
UNIT-1 Delta Learning
• Delta Rule
• Delta Rule [also known as the Widrow & Hoff Learning rule or the Least Mean Square (LMS) rule] was invented by Widrow and
Hoff.
• The Delta Rule is primarily used for binary classification problems, where the goal is to separate two classes of data points by
finding an appropriate decision boundary.
• Working
1. Initialization: Initialize the weights of the neuron or node to small random values.
2. Forward Pass:
• For each input data point, compute the weighted sum of the inputs using the current weights:
net_input = ∑(weight_i * input_i)
• Apply an activation function (often a step function or sign function) to the net input to obtain the predicted output of the neuron:
predicted_output = activation_function(net_input)
3. Error Calculation:
• Compare the predicted output to the actual target value (ground truth) for the input data point to calculate the error:
error = target_output - predicted_output
4. Weight Update (Delta Rule):
• Adjust the weights of the neuron based on the error calculated in the previous step. The weight update formula is as follows:
weight_i_new = weight_i_old + learning_rate * error * input_i
UNIT-1 Delta Learning
5. Repeat:
• Repeat steps 2 to 4 for all training data points.
• Continue iterating through the entire dataset for multiple epochs (iterations) or until the error converges to a satisfactory
level.
6. Convergence:
• The Delta Rule iteration continues until the error decreases to an acceptable level or until a predefined stopping criterion is
met.
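• A sketch of the Delta (LMS) rule for a single linear unit; the target function y = 2*x1 + 3*x2, the learning rate, and the epoch count are illustrative assumptions:

import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [0.5, 1.5]])
t = X @ np.array([2.0, 3.0])              # noiseless targets from the assumed function

w = np.zeros(2)
eta = 0.05                                # learning rate
for epoch in range(200):
    for x, target in zip(X, t):
        y = np.dot(w, x)                  # linear (identity) activation
        error = target - y
        w += eta * error * x              # delta rule: delta_w = eta * (t - y) * x
print(w)                                  # approaches [2. 3.]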
UNIT-1 Derivation of Delta Learning Rule
• The Delta Learning Rule is derived by applying the gradient descent method to a supervised learning scenario in which we want to minimize the error between the predicted output and the actual output of a neuron. This error is typically quantified using a mean squared error (MSE) cost function.
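• A compact version of the standard derivation for a single linear output unit, using t for the target, y for the predicted output, and η for the learning rate (a sketch, not the full slide-by-slide derivation):

E(\mathbf{w}) = \tfrac{1}{2}\,(t - y)^2, \qquad y = \sum_i w_i x_i

\frac{\partial E}{\partial w_i} = (t - y)\,\frac{\partial (t - y)}{\partial w_i} = -(t - y)\,x_i

\Delta w_i = -\eta\,\frac{\partial E}{\partial w_i} = \eta\,(t - y)\,x_i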
UNIT-1 Delta Learning

https://colab.research.google.com/drive/1xHwz0NZhxJLg751sESGSjSlQMkR9m9vf?authuser=1#scrollTo=5CY8fC2a1X5m
UNIT-1 Artificial Neural Network
• Convergence Rules
• Convergence in Machine Learning: Convergence refers to the point where an
iterative algorithm reaches a stable solution. It's a critical concept in machine
learning and optimization.

• Objective: The goal of convergence is to find an optimal solution, often defined by minimizing an objective function or meeting specific criteria.

• Criteria for Convergence: The specific criteria for convergence can vary
depending on the algorithm and problem. Common criteria include:

• A small change in the objective function or loss function between iterations.


• Reaching a predefined maximum number of iterations.
• Achieving a specific level of accuracy or error tolerance.
• Satisfying certain mathematical conditions, like gradient descent reaching a stationary point.
UNIT-1 Artificial Neural Network
• Convergence Rules
• Iterative Algorithms: Many machine learning and optimization algorithms are
iterative in nature, meaning they repeatedly update their parameters or variables
to approach an optimal solution.
• Examples: Convergence rules apply to various algorithms, including gradient
descent, stochastic gradient descent, Newton's method, expectation-
maximization (EM), and more.
• Convergence in Neural Networks: In neural network training, convergence is
often measured by observing the decrease in the loss function (e.g., mean
squared error) over iterations or epochs. Training stops when the loss converges
to a stable value.
• Stopping Criteria: Determining appropriate stopping criteria is essential for
avoiding overfitting (continuing training beyond the optimal point) or underfitting
(stopping prematurely).
UNIT-1 Artificial Neural Network
• Convergence Rules
• Importance: Convergence is crucial for ensuring that machine learning models and optimization
algorithms find meaningful solutions efficiently and reliably.

• Convergence Rule in Gradient Descent: In gradient descent optimization, a common convergence rule is to stop when the change in the loss function between iterations falls below a certain threshold.

• Trade-Off: Balancing convergence with computational resources is essential. Too many iterations
may lead to slow training, while too few may result in suboptimal solutions.

• Monitoring Convergence: Practitioners often monitor convergence during training by plotting the
loss or objective function's values over time and observing when it stabilizes.

• Convergence in Unsupervised Learning: In unsupervised learning, such as clustering or dimensionality reduction, convergence may refer to the stability of cluster assignments or the convergence of the algorithm's objectives (e.g., variance reduction in PCA).
UNIT-1 Artificial Neural Network
• Convergence Rules
• Adaptive Methods: Some algorithms, like adaptive learning rate methods in optimization (e.g.,
Adam), dynamically adjust their behavior during training, which can affect the criteria for
convergence.

• Convergence and Hyperparameter Tuning: Selecting appropriate hyperparameters (e.g., learning rate, batch size) can impact convergence, and fine-tuning these hyperparameters is often part of model development.

• Early Stopping: A common practice in machine learning is early stopping, where training is halted
when no significant improvement in the objective function is observed for a predefined number
of iterations.

• Convergence in Reinforcement Learning: In reinforcement learning, convergence may refer to the stability of the policy or value function during training.
UNIT-1 Artificial Neural Network
• Gradient Descent
• Gradient descent was first proposed by Augustin-Louis Cauchy in 1847, in the middle of the 19th century.
• Gradient Descent is defined as one of the most commonly used iterative optimization
algorithms of machine learning to train the machine learning and deep learning models. It
helps in finding the local minimum of a function.
• Gradient Descent is used to optimize the weight and biases based on the cost function.
• The cost function evaluates the difference between the actual and predicted outputs.
• A gradient is nothing but a derivative that defines the effects on outputs of the function with a
little bit of variation in inputs.
• Gradient Descent (GD) is a widely used optimization algorithm in deep learning that is used to
minimize the cost function of a neural network model during training.
• It works by iteratively adjusting the weights or parameters of the model in the direction of the
negative gradient of the cost function until the minimum of the cost function is reached.
• The main objective of using a gradient descent algorithm is to minimize the cost function using
iteration.
UNIT-1 Artificial Neural Network

• Gradient Descent
• The best way to define the local minimum or local maximum of a function using
gradient descent is as follows:
• If we move towards a negative gradient or away from the gradient of the function at the
current point, it will give the local minimum of that function.
• Whenever we move towards a positive gradient or towards the gradient of the function at
the current point, we will get the local maximum of that function.

• Moving towards the positive gradient to reach a maximum is known as Gradient Ascent; moving towards the negative gradient to reach a minimum is Gradient Descent, also known as steepest descent.
UNIT-1 Artificial Neural Network
• Gradient Descent
• The main objective of using a gradient descent algorithm is to minimize the cost
function using iteration. To achieve this goal, it performs two steps iteratively:
• Calculates the first-order derivative of the function to compute the gradient or slope of that
function.
• Move away from the direction of the gradient, which means slope increased from the current
point by alpha times, where Alpha is defined as Learning Rate. It is a tuning parameter in the
optimization process which helps to decide the length of the steps.
UNIT-1 Artificial Neural Network
• Gradient Descent
UNIT-1 Artificial Neural Network
• Derivation of Gradient Descent
• Derivation of the gradient descent update rule for minimizing a cost function J(θ), where θ represents the parameters of the model:
1. Define the cost function:
• J(θ) is a function that measures how well the model's predictions match the actual values in the training data. The goal is to minimize this cost function.
2. Calculate the gradient:
• Compute the gradient (or derivative) of the cost function with respect to the parameters θ. This gradient represents the direction and magnitude of the steepest increase in the cost function:
• ∇J(θ) = [∂J/∂θ₁, ∂J/∂θ₂, ..., ∂J/∂θₙ]
• Each component of ∇J(θ) tells you how much the cost function will change if you make a small change to the corresponding parameter θᵢ.
3. Initialize parameters:
• Start with an initial guess for the parameters θ, denoted as θ₀.
4. Update parameters iteratively:
• The gradient descent update rule is: θₖ₊₁ = θₖ − α∇J(θₖ)
• Where:
• θₖ is the current estimate of the parameters at iteration k.
• α (alpha) is the learning rate, a hyperparameter that determines the step size of each update.
• ∇J(θₖ) is the gradient of the cost function at the current parameter values θₖ.
UNIT-1 Artificial Neural Network
• Derivation of Gradient Descent
• The update rule effectively adjusts the parameters in the direction opposite to the gradient,
scaled by the learning rate α.
• It repeats this process until a stopping criterion is met, typically when the change in the cost
function becomes very small or a fixed number of iterations is reached.
5. Convergence:
• With each iteration, the parameters θ move closer to the values that minimize the cost function. Gradient descent converges to a local minimum of the cost function, which represents the best parameter values for the model with respect to the training data.
6. Repeat until convergence:
• Continue the update process until the stopping criterion is met or until the algorithm converges to a minimum.
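• A minimal sketch of the update rule θₖ₊₁ = θₖ − α∇J(θₖ) on a toy one-parameter cost; the cost function and learning rate are our own choices:

# Toy cost: J(theta) = (theta - 3)^2, so dJ/dtheta = 2 * (theta - 3)
def grad(theta):
    return 2.0 * (theta - 3.0)

theta = 0.0          # initial guess theta_0
alpha = 0.1          # learning rate
for k in range(100):
    theta = theta - alpha * grad(theta)   # theta_{k+1} = theta_k - alpha * grad J(theta_k)
print(theta)         # converges towards the minimiser theta = 3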
UNIT-1 Artificial Neural Network
• Types of Gradient Descent

• Batch Gradient Descent (BGD), also called Vanilla Gradient Descent:


• BGD computes the gradient of the cost function with respect to the entire training dataset in each iteration.
• It can be computationally expensive for large datasets since it requires processing the entire dataset for each update.
• BGD often provides stable and deterministic convergence to the minimum.
• Stochastic Gradient Descent (SGD):
• In SGD, each iteration randomly selects a single data point (or a small random subset, called a mini-batch) to compute
the gradient and update the parameters.
• SGD is faster and can handle large datasets more efficiently than BGD.
• It introduces more noise in the parameter updates but can escape local minima and explore the parameter space
more effectively.
• Mini-Batch Gradient Descent:
• Mini-batch gradient descent is a compromise between BGD and SGD. It uses a small random subset (mini-batch) of
the training data in each iteration.
• It combines the benefits of both BGD (stable convergence) and SGD (efficiency and noise reduction).
• The mini-batch size is a hyperparameter that can be tuned.
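• A sketch showing how the three variants differ only in the batch size used for each update (batch_size = len(X) gives BGD, 1 gives SGD, anything in between gives mini-batch). The data and hyperparameters are illustrative:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

def gradient(w, Xb, yb):
    # Gradient of the MSE cost (1 / (2m)) * ||Xb @ w - yb||^2 on the given batch
    return Xb.T @ (Xb @ w - yb) / len(yb)

w = np.zeros(3)
alpha, batch_size = 0.1, 32
for epoch in range(50):
    idx = rng.permutation(len(X))         # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        w -= alpha * gradient(w, X[batch], y[batch])
print(w)                                  # close to [ 1.  -2.   0.5]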
UNIT-1 Assignment-01

• Describe methods like grid search and random search for hyperparameter tuning.
• Describe the backpropagation algorithm used to train MLPs. How does it update weights and biases
to minimize the loss function?
• Discuss common dimensionality reduction techniques, such as Principal Component Analysis (PCA)
and t-Distributed Stochastic Neighbor Embedding (t-SNE).
• Discuss techniques for detecting and mitigating overfitting and underfitting in machine learning
models.
• Explain how the ROC curve and AUC can be used to evaluate model performance.
UNIT-1 Assignment-01
What is the primary objective of deep learning?
a) Feature extraction
b) Dimensionality reduction
c) Automated feature learning
d) Clustering

Which of the following is NOT an activation function commonly used in deep learning?
a) Sigmoid
b) ReLU
c) Tanh
d) K-Means

Which type of neural network is well-suited for image classification tasks?


a) Convolutional Neural Network (CNN)
b) Recurrent Neural Network (RNN)
c) Multilayer Perceptron (MLP)
d) Radial Basis Function Network (RBFN)
UNIT-1 Assignment-01
What is the primary purpose of dropout in deep learning?
a) Enhancing model generalization
b) Reducing the number of layers
c) Increasing the learning rate
d) Eliminating outliers

In gradient descent, what is the learning rate responsible for?


a) The rate of convergence
b) The size of the neural network
c) The number of epochs
d) The choice of activation function

What problem in deep learning can occur when gradients become extremely small, causing the network to stop learning?
a) Vanishing gradients
b) Exploding gradients
c) Overfitting
d) Dropout
UNIT-1 Assignment-01
Which deep learning architecture is well-suited for sequential data, such as natural language processing?
a) CNN
b) RNN
c) GAN
d) MLP

What is transfer learning in deep learning?


a) Training a model from scratch
b) Using pre-trained models and fine-tuning them for a specific task
c) Transforming data before feeding it into a neural network
d) Transferring data between different devices

Which deep learning framework is developed by Google and widely used in research and industry?
a) PyTorch
b) Caffe
c) Keras
d) TensorFlow
UNIT-1 Assignment-01
What is the primary purpose of a loss function in deep learning?
a) To initialize network weights
b) To measure the accuracy of predictions
c) To visualize data
d) To preprocess input data
UNIT-1 References
• https://www.mygreatlearning.com/blog/understanding-curse-of-dimensionality/
• https://www.analyticsvidhya.com/blog/2021/10/evaluation-metric-for-regression-models/
• https://medium.com/swlh/recall-precision-f1-roc-auc-and-everything-542aedf322b9
• https://towardsdatascience.com/metrics-to-evaluate-your-machine-learning-algorithm-f10ba6e38234
• https://www.javatpoint.com/cross-validation-in-machine-learning