Model Improvement
and Performance
MODULE -01
Noida Institute of Engineering and Technology
INTRODUCTION
Unit: 1
DEEP LEARNING (ACSML0602)
Dr. Raju
Assistant Professor (CSE - AIML)
B Tech VIth Sem
Faculty Profile:
Qualification: Ph.D.
Curse of Dimensionality, Bias and Variance Trade-off, Overfitting and Underfitting,
Regression - MAE, MSE, RMSE, R Squared, Adjusted R Squared, p-Value, Classification -
Precision, Recall, F1, Other topics, K-Fold Cross Validation, ROC Curve, Hyper-Parameter
Tuning Introduction – Grid Search, Random Search, Introduction to Deep Learning.
Artificial Neural Network: Neuron, Nerve structure and synapse, Artificial Neuron and its
model, activation functions, Neural network architecture: Single layer and Multilayer feed
forward networks, recurrent networks. Various learning techniques: Perceptron and
Convergence rule, Hebb Learning. Perceptrons, Multilayer perceptron, Gradient descent and
the Delta rule, Multilayer networks, Derivation of Backpropagation Algorithm.
Introduction to CNN, Train a simple convolutional neural net, Explore the design space for convolutional nets,
Pooling layer motivation in CNN, Design a convolutional layered application, Understanding and visualizing a
CNN, Transfer learning and fine-tuning CNN, Image classification, Text classification, Image classification and
hyper-parameter tuning, Emerging NN architectures
Why use sequence models? Recurrent Neural Network Model, Notation, Back-propagation
through time (BPTT), Different types of RNNs, Language model and sequence generation,
Sampling novel sequences, Vanishing gradients with RNNs, Gated Recurrent Unit (GRU),
Long Short-Term Memory (LSTM), Bidirectional RNN, Deep RNNs
Course Outcome (CO) | At the end of the course, the student will be able to: | Bloom's Knowledge Level (KL)
CO1 | Analyze ANN models and understand the ways of accuracy measurement. | K4
CO2 | Develop a convolutional neural network for multi-class classification in images. | K4
CO3 | Apply deep learning algorithms to detect and recognize an object. | K3
CO4 | Apply RNNs to time series forecasting, NLP, text and image classification. | K4
CO5 | Apply lower-dimensional representation over higher-dimensional data for dimensionality reduction and capture the important features of an object. | K3
Program Outcomes (POs)
PO10: Communication
PO11: Project management and finance
PO12: Life-long learning
CO-PO Mapping
CO.K PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
CO1 3 3 3 3 2 2 1 - 1 - 2 2
CO2 3 3 3 3 2 2 1 - 1 1 2 2
CO3 3 3 3 3 3 2 2 - 2 1 2 3
CO4 3 3 3 3 3 2 2 1 2 1 2 3
CO5 3 3 3 3 3 2 2 1 2 1 2 2
AVG 3.0 3.0 3.0 3.0 2.6 2.0 1.6 0.4 1.6 0.8 2.0 2.4
• Curse of Dimensionality
• Bias and Variance Trade-off
• Overfitting and Underfitting
• Regression - MAE, MSE, RMSE, R Squared, Adjusted R Squared, p-Value
• Classification - Precision, Recall, F1, Other topics, K-Fold Cross Validation, ROC Curve
• Hyper-Parameter Tuning Introduction – Grid Search, Random Search
• Introduction to Deep Learning
UNIT-1 Curse of Dimensionality
• The Curse of Dimensionality refers to a set of problems that arise when working with high-dimensional data.
• The difficulties of training machine learning models on high-dimensional data are collectively referred to as the 'Curse of Dimensionality'.
• Its two most discussed aspects are 'data sparsity' and 'distance concentration'.
UNIT-1 Curse of Dimensionality
• Data Sparsity
• Data sparsity refers to a situation where the available data is insufficient or
incomplete for a particular analysis or task.
• In the context of data analysis and machine learning, data sparsity occurs
when many of the potential variables or combinations of variables have little
or no observed data points.
• This lack of data can lead to challenges when trying to build accurate models,
make predictions, or draw meaningful insights.
UNIT-1 Curse of Dimensionality
• Distance Concentration
• Distance concentration refers to the phenomenon where, as the number of dimensions grows, the pairwise distances between data points become nearly identical.
• In a high-dimensional space, the difference between the nearest and the farthest neighbor of a point shrinks relative to the distance itself, so distance values "concentrate" around a common value.
• This undermines methods that rely on distance comparisons, such as k-nearest neighbors, clustering, and anomaly detection, because all points begin to look almost equally far apart.
• For example, if every candidate neighbor of a query point lies at nearly the same distance, a nearest-neighbor classifier can no longer reliably distinguish genuinely similar points from dissimilar ones.
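A minimal sketch (NumPy assumed; not from the slides) that illustrates distance concentration empirically: as the dimensionality d grows, the relative gap between the nearest and farthest neighbor of a random query point shrinks.

```python
import numpy as np

rng = np.random.default_rng(0)

for d in [2, 10, 100, 1000]:
    points = rng.random((500, d))      # 500 points uniform in [0, 1]^d
    query = rng.random(d)              # a random query point
    dists = np.linalg.norm(points - query, axis=1)
    gap = (dists.max() - dists.min()) / dists.min()
    print(f"d = {d:4d}: relative gap (max - min) / min = {gap:.3f}")
```

As d increases, the printed gap shrinks towards zero, which is why nearest-neighbor comparisons lose their meaning in very high dimensions.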
UNIT-1 Bias and Variance Trade-off
• Variance
• Variance specifies how much the model's predictions would change if a different training dataset were used.
• It tells how much a random variable differs from its expected value.
• Ideally, a model should not vary too much from one training dataset to another, which means the algorithm should be good at capturing the hidden mapping between input and output variables.
• Variance errors are either low variance or high variance:
• Low variance means a small variation in the prediction of the target function with changes in the training dataset.
• High variance means a large variation in the prediction of the target function with changes in the training dataset.
• Overfitting
• Overfitting is an undesirable machine learning behavior that occurs when the model gives accurate predictions for training data but not for new data.
• An overfit model has high variance and low bias.
• Reasons for Overfitting:
• The training data size is too small and does not contain enough samples to represent all possible input values.
• The training data contains large amounts of irrelevant information, called noisy data.
• The model trains for too long on a single sample set of data.
• The model complexity is high, so it learns the noise within the training data.
Regularization: Overfitting
• Symptom of Overfitting:
• Low training error but high validation error.
• The model fits the training data very well but fails to generalize to new data.
• Underfitting
• Underfitting occurs when a model is too simple to capture the complexities of the data.
• It represents the inability of the model to learn the training data effectively, resulting in poor performance on both the training and the testing data.
• In simple terms, an underfit model is inaccurate, especially when applied to new, unseen examples.
• It mainly happens when we use a very simple model with overly simplified assumptions.
• To address underfitting, we need more complex models with enhanced feature representation and less regularization.
• An underfit model has high bias and low variance.
• Bias
• Bias is the difference between the actual and the predicted value of a model:
• Bias = Actual_Value - Predicted_Value
• High bias leads to large error, while low bias leads to low error.
• Low bias helps avoid the underfitting problem, while high bias produces predictions close to a straight line that does not fit the data accurately.
• Characteristics of a high-bias model include:
• Failure to capture proper data trends
• Potential towards underfitting
• More generalized/overly simplified
• High error rate
UNIT-1 Bias and Variance Trade-off
• Variance
• Variance represents the spread of the predictions of a model.
• A high-variance model is complex enough to fit the training data very closely, but it does not work accurately on unseen data.
• High variance causes the overfitting problem, in which the model works properly on training data but gives a high error rate on test data.
• A high-variance model typically has the following qualities:
• Fits noise in the dataset
• Potential towards overfitting
• Complex model
• Tries to fit all data points as closely as possible
UNIT-1 Bias and Variance Trade-off
• Bias-Variance Trade-off
• A simple algorithm may lead to high bias and low variance and is thus error-prone.
• A complex algorithm may lead to low bias and high variance.
• The trade-off lies between bias and variance: decreasing one tends to increase the other.
• The bias-variance trade-off seeks the model complexity that minimizes the total error.
UNIT-1 Regression
• Loss Function vs Cost Function (example: predicting a student's salary package)

Roll No. | CGPA | IQ  | Actual Package | Predicted Package | Loss
1        | 5.2  | 100 | 6.3            | 6.4               | 0.01
2        | 4.3  | 91  | 4.5            | 5.3               | 0.64
3        | 8.2  | 83  | 6.5            | 5.2               | 1.69
4        | 8.9  | 102 | 5.5            | 8.9               | 11.56

• The loss function L is computed per example; the cost function C averages the losses over the dataset. Here the per-row loss is the squared error (Actual_Value - Predicted_Value)², and the cost is their mean: C = (0.01 + 0.64 + 1.69 + 11.56) / 4 = 3.475.
• MAE (Mean Absolute Error) instead uses the absolute difference:
L = |Actual_Value - Predicted_Value|
C = (1/n) Σ |Actual_Value_i - Predicted_Value_i|
UNIT-1 Regression
• MAE Advantages
• Easy to understand.
• Has the same unit as the actual values.
• Robust to outliers: outliers do not dominate the error, so if a dataset contains outliers, MAE is a better choice than MSE.
• MAE Disadvantages
• The graph is not differentiable at zero, so the Gradient Descent (GD) algorithm is not straightforward to apply.
• To implement GD we need to compute a sub-gradient.
UNIT-1 Regression
• MSE (Mean Squared Error): MSE is a metric that calculates the
average squared difference between the predicted values and
the actual values. Squaring the errors gives more weight to
larger errors, making it useful for penalizing significant
deviations from the true values.
L = (Actual_Value - Predicted_Value)²
C = (1/n) Σ (Actual_Value_i - Predicted_Value_i)²
UNIT-1 Regression
• Advantages
• Easy to interpret.
• The loss function is differentiable, which allows GD to be implemented easily.
• One local minimum: the function has a single minimum value to find.
• Disadvantages
• The unit of the error is squared, which makes it harder to interpret; to express the error in the original unit we take the square root of MSE (i.e., RMSE).
• It is not robust to outliers: if the dataset contains outliers, MSE is heavily distorted by them.
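A minimal sketch (NumPy assumed) that computes MAE, MSE and RMSE for the package-prediction table above; the MSE reproduces the cost value 3.475.

```python
import numpy as np

actual    = np.array([6.3, 4.5, 6.5, 5.5])   # actual packages from the table
predicted = np.array([6.4, 5.3, 5.2, 8.9])   # predicted packages

mae  = np.mean(np.abs(actual - predicted))   # mean absolute error  -> 1.4
mse  = np.mean((actual - predicted) ** 2)    # mean squared error   -> 3.475
rmse = np.sqrt(mse)                          # root mean squared error

print(f"MAE = {mae:.3f}, MSE = {mse:.3f}, RMSE = {rmse:.3f}")
```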
UNIT-1 Regression
• Huber Loss
• Huber loss is useful when a significant fraction of the data (say, around 25%) consists of outliers.
• With that much outlier data, MSE is a poor choice: the fitted curve deviates towards the outliers and effectively ignores the 75% of the data that is correct.
• MAE is also a poor choice: it effectively ignores the 25% of outlier data, which is too large a fraction to discard.
• Huber loss combines both: it behaves like MSE for small errors and like MAE for large ones.
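A minimal sketch (NumPy assumed) of the Huber loss described above: quadratic like MSE for errors within a threshold delta, linear like MAE beyond it, so outliers are penalized but do not dominate.

```python
import numpy as np

def huber_loss(actual, predicted, delta=1.0):
    """Mean Huber loss: MSE-like inside |error| <= delta, MAE-like outside."""
    error = actual - predicted
    squared = 0.5 * error ** 2                        # MSE-like region
    linear  = delta * (np.abs(error) - 0.5 * delta)   # MAE-like region
    return np.where(np.abs(error) <= delta, squared, linear).mean()

actual    = np.array([6.3, 4.5, 6.5, 5.5])
predicted = np.array([6.4, 5.3, 5.2, 8.9])
print(f"Huber loss = {huber_loss(actual, predicted):.3f}")
```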
UNIT-1 Regression
• RMSE (Root Mean Squared Error)
• RMSE = √( (1/n) Σ (Actual_Value_i - Predicted_Value_i)² ), i.e. the square root of MSE.
• The lower the RMSE, the better the model and its predictions.
• A higher RMSE indicates a large deviation of the residuals from the ground truth.
UNIT-1 Regression
• Pros of the RMSE Evaluation Metric:
• RMSE is easy to understand.
• It serves as a heuristic for training models.
• It is computationally simple and easily differentiable which many optimization
algorithms desire.
• RMSE does not penalize the errors as much as MSE does due to the square root.
• Cons of the RMSE metric:
• Like MSE, RMSE is dependent on the scale of the data. It increases in magnitude if
the scale of the error increases.
• One major drawback of RMSE is its sensitivity to outliers and the outliers have to
be removed for it to function properly.
• RMSE increases with an increase in the size of the test sample. This is an issue
when we calculate the results on different test samples.
UNIT-1 Regression
• R Squared
• R-squared (Coefficient of Determination) is a statistical measure that quantifies the proportion of the variance in the dependent variable that is explained by the independent variables in a regression model:
• R² = 1 - (SSR / SST)
• Where:
• SSR (Sum of Squares of Residuals) is the sum of squared differences between the observed values and the values predicted by the model.
• SST (Total Sum of Squares) is the sum of squared differences between the observed values and the mean of the dependent variable.
UNIT-1 Regression
• R-squared ranges between 0 and 1, with the following interpretations:
• R² = 0: the model does not explain any of the variability in the dependent variable; it is a poor fit.
• 0 < R² < 1: the model explains that proportion of the variability; a higher R-squared indicates a better fit.
• R² = 1: the model perfectly predicts the dependent variable from the independent variables.
UNIT-1 Regression
• R-squared evaluates regression model fit but has limitations:
• High R-squared doesn't always mean good fit; high value may
imply overfitting, lacking generalization.
• Including more predictors can inflate R-squared, even if they're
weak; adjusted R-squared adjusts for this.
• "Good" R-squared varies by field; lower values acceptable in
data-rich areas.
• R-squared may miss fit quality with nonlinearity or outliers.
UNIT-1 Regression
• Adjusted R Squared
• Adjusted R² = 1 - [ (1 - R²)(n - 1) / (n - k - 1) ]
• Where:
• n = the number of points in your data sample.
• k = the number of independent regressors, i.e. the number of variables in your model, excluding the constant.
UNIT-1 Regression
• Adjusted R Squared
• Adjusted R-squared adjusts the statistic based on the number
of independent variables in the model
• Adjusted R2 also indicates how well terms fit a curve or line,
but adjusts for the number of terms in a model.
• If you add more and more useless variables to a model,
adjusted r-squared will decrease.
• If you add more useful variables, adjusted r-squared will
increase.
• Adjusted R2 will always be less than or equal to R2
UNIT-1 Regression
• Adjusted R Squared
• Problem Statement: A fund has a sample R-squared value of 0.5, and it is doubtlessly offering higher risk-adjusted returns, with a sample size of 50 and 5 predictors. Find the adjusted R-squared value.
• Sample size n = 50, number of predictors k = 5, sample R² = 0.5. Substituting these values into the equation:
• Adjusted R² = 1 - [ (1 - 0.5)(50 - 1) / (50 - 5 - 1) ] = 1 - (0.5 × 49 / 44) ≈ 0.443
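A minimal sketch checking the worked example in code (the function name is illustrative, not from the slides):

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 = 1 - (1 - R^2)(n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# n = 50 samples, k = 5 predictors, sample R-squared = 0.5
print(round(adjusted_r2(0.5, 50, 5), 3))   # 0.443
```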
UNIT-1 Regression
• RMSE (Root Mean Squared Error): RMSE is the square root of
the MSE and is commonly used to express the average
magnitude of the prediction errors in the same units as the
dependent variable. It provides a measure of the model's
accuracy, and lower values indicate better performance.
• R Squared (Coefficient of Determination): R-squared is a
statistical measure that represents the proportion of the
variance in the dependent variable that is explained by the
independent variables in the regression model. It ranges from 0
to 1, where 1 indicates that the model explains all the variance,
and 0 indicates that the model doesn't explain any of the
variance.
UNIT-1 Regression
• Adjusted R Squared: Adjusted R-squared is a modified version
of R-squared that takes into account the number of
independent variables in the model. It penalizes the addition of
irrelevant variables that might artificially inflate the R-squared
value.
• p-Value: The p-value is a measure of the evidence against a null
hypothesis in a statistical hypothesis test. In the context of
regression analysis, p-values are used to determine whether the
coefficients of the independent variables are statistically
significant. A low p-value (typically below a significance level
like 0.05) suggests that the variable has a significant impact on
the dependent variable.
UNIT-1 Classification
• A Fraud Detection Classifier
• Objective: To detect fraudulent transactions
• Assumption:
• The output of your fraud detection model is the probability [0.0–1.0]
that a transaction is fraudulent.
• If this probability is below 0.5, you classify the transaction as non-
fraudulent; otherwise, you classify the transaction as fraudulent.
• Methodology
• Collect 10,000 manually classified transactions, with 300 fraudulent
transactions and 9,700 non-fraudulent transactions.
• You run your classifier on every transaction, predict the class label
(fraudulent or non-fraudulent) and
• summarise the results in the following confusion matrix:
UNIT-1 Classification
                         Predicted: Fraudulent   Predicted: Non-Fraudulent
Actual: Fraudulent       TP = 100                FN = 200
Actual: Non-Fraudulent   FP = 700                TN = 9,000
UNIT-1 Classification
• A True Positive (TP=100) is an outcome where the model
correctly predicts the positive (fraudulent) class.
• A True Negative (TN=9,000) is an outcome where the model
correctly predicts the negative (non-fraudulent) class.
• A False Positive (FP=700) is an outcome where the model
incorrectly predicts the positive (fraudulent) class.
• A False Negative (FN=200) is an outcome where the model
incorrectly predicts the negative (non-fraudulent) class.
UNIT-1 Classification
• Accuracy: the proportion of correctly predicted values out of all predictions.
• Accuracy = (TP + TN) / (TP + TN + FP + FN) = (100 + 9,000) / 10,000 = 0.91 = 91%
• Note: 91% looks high, but the classes are imbalanced; a model that always predicts "non-fraudulent" would score 97% (9,700/10,000), so accuracy alone is misleading here.
UNIT-1 Classification
• Area Under Curve
• Area Under Curve(AUC) is one of the most widely used metrics for
evaluation.
• It is used for binary classification problem.
• AUC of a classifier is equal to the probability that the classifier will rank a
randomly chosen positive example higher than a randomly chosen
negative example.
• Two basic terms used in AUC:
• True Positive Rate (Sensitivity)
• True Negative Rate (Specificity)
UNIT-1 Classification
• Area Under Curve
• Basic terms used in AUC:
• True Positive Rate (Sensitivity): TPR = TP / (TP + FN). It is the proportion of positive data points that are correctly classified as positive, out of all positive data points.
• False Positive Rate: FPR = FP / (FP + TN). It is the proportion of negative data points that are incorrectly classified as positive, out of all negative data points.
• FPR and TPR both take values in the range [0, 1].
• FPR and TPR are computed at varying threshold values such as (0.00, 0.02, 0.04, ..., 1.00) and a graph is drawn.
• AUC is the area under the ROC curve, which plots TPR (y-axis) against FPR (x-axis) across these thresholds.
UNIT-1 Classification
• Area Under Curve
• As evident, AUC has a range of [0, 1]. The greater the value, the better is
the performance of our model.
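A minimal sketch (scikit-learn assumed; the labels and scores below are hypothetical, not from the fraud example) of computing the ROC curve points and the AUC from predicted probabilities:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1])                     # hypothetical labels
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9])   # hypothetical model scores

fpr, tpr, _ = roc_curve(y_true, y_score)   # FPR/TPR at each threshold
auc = roc_auc_score(y_true, y_score)
print(f"AUC = {auc:.3f}")
```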
UNIT-1 Classification
• F1-Score:
• F1 Score is used to measure a test's accuracy.
• F1 Score is the harmonic mean of precision and recall.
• The range of the F1 Score is [0, 1].
• It tells you how precise your classifier is (how many of its positive predictions are correct), as well as how robust it is (whether it misses a significant number of instances).
• High precision with lower recall gives an extremely accurate classifier, but one that misses a large number of instances that are difficult to classify.
• The greater the F1 Score, the better the performance of the model.
• Mathematically, it can be expressed as: F1 = 2 × (Precision × Recall) / (Precision + Recall)
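A minimal sketch computing accuracy, precision, recall and F1 from the fraud-detection confusion matrix above (TP = 100, TN = 9,000, FP = 700, FN = 200):

```python
TP, TN, FP, FN = 100, 9000, 700, 200

accuracy  = (TP + TN) / (TP + TN + FP + FN)          # 0.91
precision = TP / (TP + FP)                           # 100 / 800  = 0.125
recall    = TP / (TP + FN)                           # 100 / 300  ~ 0.333
f1 = 2 * precision * recall / (precision + recall)   # ~ 0.182

print(f"accuracy={accuracy:.3f}  precision={precision:.3f}  "
      f"recall={recall:.3f}  f1={f1:.3f}")
```

Note how the F1 score of roughly 0.18 exposes a weakness that the 91% accuracy hides on this imbalanced dataset.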
In a word, accuracy. Advanced tools and techniques have dramatically improved deep learning algorithms—to the point where they can outperform humans at classifying images, win against the world's best Go player, or enable a voice-controlled assistant like Amazon Echo® and Google Home to find and download that new song you like.
Pretrained models built by experts: models such as AlexNet can be retrained to perform new recognition tasks using a technique called transfer learning. While AlexNet was trained on 1.3 million high-resolution images to recognize 1000 different objects, accurate transfer learning can be achieved with much smaller datasets.
(Figure: nested Venn diagram of AI, ML and DL: Deep Learning is a subset of Machine Learning, which is a subset of Artificial Intelligence.)
UNIT-1 Artificial Neural Network
• Activation Functions(AF)
• ReLU Advantages:
• Non-linearity: ReLU introduces non-linearity to the model, allowing it to learn complex patterns and
relationships in the data. This is crucial for neural networks to model a wide range of functions
effectively.
• Computationally Efficient: ReLU activation is computationally efficient to compute during both forward
and backward passes of training. The activation only involves simple thresholding of values, which leads
to faster training times compared to more complex activation functions like sigmoid or tanh.
• Sparse Activation: ReLU has a characteristic called "sparse activation." It means that only a subset of
neurons is activated for any given input. This sparsity can lead to more efficient learning and reduced
overfitting, as neurons are less likely to co-adapt.
• Mitigating Vanishing Gradient: Unlike sigmoid and tanh functions, ReLU does not saturate for positive
inputs, which helps mitigate the vanishing gradient problem. This makes training deeper networks more
feasible as gradients can flow more effectively through the network during backpropagation.
• Empirical Success: ReLU has shown remarkable empirical success in various deep learning applications,
including image and speech recognition, natural language processing, and more. It has contributed to
the rise of deep learning in recent years.
UNIT-1 Artificial Neural Network
• Activation Functions(AF)
• ReLU Disadvantages:
• Dying ReLU Problem: One significant issue with ReLU is the "dying ReLU" problem. During training,
some neurons can become inactive (output zero for all inputs) and stay that way. Once a large
gradient flows through a ReLU neuron and updates its weights such that it always produces
negative outputs, it will never activate again. This leads to dead neurons that do not contribute to
learning.
• Not Zero-Centered: ReLU is not zero-centered, meaning the output of ReLU is always positive or
zero. This can lead to issues in weight updates during training and can affect convergence.
• Unbounded Activation: ReLU does not have an upper bound, which means that if a large gradient
flows through a ReLU neuron, it can lead to "exploding gradients," causing training instability.
• Sensitivity to Initialization: ReLU neurons can be sensitive to weight initialization. If the initial
weights are too large, it's more likely for neurons to get stuck in the inactive state, contributing to
the dying ReLU problem.
• Leaky ReLU and Variants: To address some of the issues with standard ReLU, variations like Leaky
ReLU, Parametric ReLU, and Exponential Linear Units (ELUs) have been proposed. These variations
introduce controlled non-zero slopes for negative inputs or introduce other adaptive
characteristics to mitigate the disadvantages of ReLU.
UNIT-1 Artificial Neural Network
• Activation Functions(AF)
• Sigmoid:
• The sigmoid function squashes the input into a range between 0 and 1.
• It's often used in the output layer for binary classification tasks.
• Mathematical representation: σ(x) = 1 / (1 + e⁻ˣ)
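A minimal sketch (NumPy assumed) of the activation functions discussed in this section:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))        # squashes input into (0, 1)

def relu(x):
    return np.maximum(0.0, x)              # zero for negatives, identity otherwise

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)   # small slope for negative inputs

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print("sigmoid:   ", sigmoid(x))
print("relu:      ", relu(x))
print("leaky_relu:", leaky_relu(x))
```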
• Image Recognition: Convolutional neural networks (CNNs), a specialized type of multi-layer network, have achieved state-of-
the-art performance in image classification, object detection, and segmentation tasks.
• Natural Language Processing: Multi-layer neural networks, such as recurrent neural networks (RNNs) and long short-term
memory networks (LSTMs), are used for tasks like text generation, sentiment analysis, and machine translation.
• Speech Recognition: Multi-layer networks have been applied to automatic speech recognition (ASR) systems, converting
spoken language into text.
• Recommendation Systems: They are used in collaborative filtering and content-based recommendation systems to
personalize content recommendations for users.
UNIT-1 Artificial Neural Network
• Multi Layer Feed Forward Neural Network
• Problem: Classify whether a bank customer will churn (leave the bank) or stay based on features such as
their credit score, age, tenure, and balance.
• Data:
• We'll use a synthetic dataset with the following features:
• Credit Score
• Age
• Tenure
• Balance
• Network Architecture:
• For this example, we'll create an MLP with three hidden layers, each containing 64 neurons. The input layer will have four
neurons (one for each feature), and the output layer will have one neuron for binary classification.
• Activation Function:
• We'll use the Rectified Linear Unit (ReLU) activation function for hidden layers and a sigmoid activation function for the
output layer.
• https://colab.research.google.com/drive/1Vc50HdjexdN3B5TDGXsRxPJHyGin0ef6?usp=sharing
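A minimal sketch (TensorFlow/Keras assumed) of the architecture described above: four input features, three hidden layers of 64 ReLU units each, and one sigmoid output for binary churn classification. X_train and y_train are placeholders for the synthetic dataset.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),                     # credit score, age, tenure, balance
    tf.keras.layers.Dense(64, activation="relu"),   # hidden layer 1
    tf.keras.layers.Dense(64, activation="relu"),   # hidden layer 2
    tf.keras.layers.Dense(64, activation="relu"),   # hidden layer 3
    tf.keras.layers.Dense(1, activation="sigmoid")  # churn probability
])

model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])

# model.fit(X_train, y_train, epochs=20, batch_size=32, validation_split=0.2)
```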
UNIT-1 Artificial Neural Network
• Neural Network Architecture
• Differences between Single layer and Multi-Layer Neural Networks
• Architecture:
• Single-layer neural network: It consists of only one layer of neurons, which is often called the input layer. There are
no hidden layers or multiple layers of neurons between the input and output.
• Multi-layer neural network: It has more than one layer of neurons, typically including an input layer, one or more
hidden layers, and an output layer. The layers between the input and output are called hidden layers.
• Capabilities:
• Single-layer neural network: It can only model linearly separable functions. In other words, it can solve simple
problems where the data can be separated by a straight line or hyperplane.
• Multi-layer neural network: It can approximate complex, nonlinear functions. By adding hidden layers and using
nonlinear activation functions, MLPs can capture intricate patterns and relationships in data, making them capable
of handling a wide range of tasks, including classification, regression, and more.
UNIT-1 Artificial Neural Network
• Neural Network Architecture
• Differences between Single layer and Multi-Layer Neural Networks
• Representation:
• Single-layer neural network: These networks can represent linear transformations of the input data, making them
limited in their ability to capture complex relationships in the data.
• Multi-layer neural network: The presence of hidden layers allows MLPs to represent both linear and nonlinear
transformations of the input data. This enables them to model complex, hierarchical features and relationships
within the data.
• Activation functions:
• Single-layer neural network: Typically uses linear activation functions, which result in linear transformations of the
input.
• Multi-layer neural network: Uses nonlinear activation functions (e.g., sigmoid, ReLU, tanh) in the hidden layers to
introduce nonlinearity into the network, enabling it to learn and represent nonlinear relationships.
UNIT-1 Artificial Neural Network
• Neural Network Architecture
• Differences between Single layer and Multi-Layer Neural Networks
• Learning and Training:
• Single-layer neural network: Training is relatively straightforward and can often be done using simple algorithms
like the perceptron learning rule.
• Multi-layer neural network: Training is more complex and usually requires advanced optimization techniques like
backpropagation and gradient descent. These networks benefit from the use of gradient-based optimization
methods to adjust the weights and biases during training.
• Use Cases:
• Single-layer neural network: Suitable for simple tasks like binary classification or linear regression when the data
is linearly separable.
• Multi-layer neural network: Suited for a wide range of tasks, including image recognition, natural language
processing, speech recognition, and more, where the data has complex and nonlinear relationships.
UNIT-1 Artificial Neural Network
• Neural Network Architecture
Aspect | Single-Layer Neural Network | Multi-Layer Neural Network
Architecture | Consists of only one layer (input layer). | Consists of multiple layers, including input, hidden, and output layers.
Capabilities | Can model linearly separable functions. | Can approximate complex, nonlinear functions.
Representation | Represents linear transformations of input data. | Represents both linear and nonlinear transformations of input data.
Activation Functions | Typically uses linear activation functions. | Uses nonlinear activation functions in hidden layers.
Learning and Training | Training is relatively simple, often using the perceptron learning rule. | Training is more complex, usually involving backpropagation and gradient descent.
Use Cases | Suitable for simple tasks like binary classification or linear regression with linearly separable data. | Suited for a wide range of tasks, including image recognition, natural language processing, and more, involving complex, nonlinear data relationships.
Hidden Layers | Does not have hidden layers. | Includes one or more hidden layers.
Nonlinearity | Lacks inherent nonlinearity, limiting its representation capabilities. | Introduces nonlinearity through activation functions in hidden layers.
Complex Data Relationships | Struggles to capture complex data relationships. | Can model and represent complex data relationships effectively.
UNIT-1 Artificial Neural Network
• Neural Network Architecture
• Recurrent Neural Networks (RNNs):
• RNNs are designed to handle sequences of data, making them suitable for tasks such as natural
language processing and time series analysis.
• Recurrent Neural Network (RNN) is a type of neural network where the output from the previous step is fed as input to the current step.
• The hidden state, also known as the memory state, is the most important feature of an RNN: it remembers information about the sequence seen so far.
UNIT-1 Hebbian Learning Rule
• If two neighboring neurons activate simultaneously, the weight connecting them increases.
• If the neurons operate in opposite phases, the weight between them decreases.
• When there is no signal correlation between the neurons, the weight remains unchanged.
• The sign of the weight between two nodes is determined by the sign of their input signals:
• If both nodes receive inputs of the same sign (both positive or both negative), the resulting weight is strongly positive.
• If the nodes receive inputs of opposite signs, the resulting weight is strongly negative.
• Weight update: w_i(new) = w_i(old) + x_i * y, where x_i is the i-th input and y is the (target) output.
UNIT-1 Artificial Neural Network
• Implementation of AND gate using Hebb Learning.
• Step 1 : Set weight and bias to zero, w = [ 0 0 0 ]T and b = 0.
• Step 2 : Set input vector Xi = Si for i = 1 to 4.
• X1 = [ -1 -1 1 ]T
• X2 = [ -1 1 1 ]T
• X3 = [ 1 -1 1 ]T
• X4 = [ 1 1 1 ]T
• Step 3 : Output value is set to y = t (bipolar AND targets: t = -1, -1, -1, 1 for X1 to X4).
• Step 4 : Modifying weights using Hebbian Rule:
• First iteration –
• w(new) = w(old) + x1y1 = [ 0 0 0 ]T + [ -1 -1 1 ]T . [ -1 ] = [ 1 1 -1 ]T
• For the second iteration, the final weight of the first one will be used and so on.
• Second iteration –
• w(new) = [ 1 1 -1 ]T + [ -1 1 1 ]T . [ -1 ] = [ 2 0 -2 ]T
• Third iteration –
• w(new) = [ 2 0 -2]T + [ 1 -1 1 ]T . [ -1 ] = [ 1 1 -3 ]T
• Fourth iteration –
• w(new) = [ 1 1 -3]T + [ 1 1 1 ]T . [ 1 ] = [ 2 2 -2 ]T
• So, the final weight matrix is [ 2 2 -2 ]T
UNIT-1 Artificial Neural Network
• Testing the network :
• For x1 = -1, x2 = -1, b = 1, Y = (-1)(2) + (-1)(2) + (1)(-2) = -6
• For x1 = -1, x2 = 1, b = 1, Y = (-1)(2) + (1)(2) + (1)(-2) = -2
• For x1 = 1, x2 = -1, b = 1, Y = (1)(2) + (-1)(2) + (1)(-2) = -2
• For x1 = 1, x2 = 1, b = 1, Y = (1)(2) + (1)(2) + (1)(-2) = 2
• The results are all compatible with the original table.
• Decision Boundary :
• 2x1 + 2x2 – 2b = y
• Replacing y with 0, 2x1 + 2x2 – 2b = 0
• Since bias, b = 1, so 2x1 + 2x2 – 2(1) = 0
• 2( x1 + x2 ) = 2
• The final equation, x2 = -x1 + 1
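A minimal sketch (NumPy assumed) reproducing the Hebbian AND-gate training above, with bipolar inputs (including the bias input) and bipolar targets:

```python
import numpy as np

# Inputs [x1, x2, bias] and bipolar AND targets
X = np.array([[-1, -1, 1],
              [-1,  1, 1],
              [ 1, -1, 1],
              [ 1,  1, 1]])
t = np.array([-1, -1, -1, 1])

w = np.zeros(3)
for x_i, y_i in zip(X, t):
    w = w + x_i * y_i              # Hebb update: w(new) = w(old) + x * y

print("final weights:", w)         # [ 2.  2. -2.]
print("net inputs:   ", X @ w)     # [-6. -2. -2.  2.] -- signs match the targets
```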
UNIT-1 Perceptron Rules
• Perceptron Rules
• Supervised Learning Algorithm: The Perceptron Rule is a supervised learning
algorithm used for binary classification tasks.
• Developed by Frank Rosenblatt: It was developed by Frank Rosenblatt in the late
1950s.
• Objective: The main goal of the Perceptron Rule is to learn a linear decision
boundary that can separate two classes of data points in a feature space.
• Linear Separability: It is suitable for problems where the data is linearly
separable, meaning it can be separated by a straight line (or hyperplane in higher
dimensions).
• Components: The algorithm works with input features, weights, an activation
function (typically a step function), and a bias term (threshold).
• Training: It iteratively updates the weights based on misclassified data points until
a stopping criterion is met. Correctly classified points do not trigger weight
updates.
UNIT-1 Perceptron Rules
• Perceptron Rules
• Update Rule: When a data point is misclassified as class 0 when it should be class 1, the weights
are increased for the associated features. When misclassified as class 1 when it should be class 0,
the weights are decreased for the associated features.
• Limitations: The Perceptron Rule can only solve linearly separable problems and cannot handle
tasks with nonlinear decision boundaries.
• Historical Significance: It played a pivotal role in the history of machine learning and served as a
foundation for more complex neural network models like multi-layer perceptrons (MLPs).
• Activation Function: Typically uses a step function for making binary decisions based on the
weighted sum of inputs.
• Bias: A bias term (threshold) is used to shift the decision boundary.
• Supervised Learning: Requires labeled training data for learning and updating weights.
• Early Neural Network: Represents one of the earliest forms of artificial neural networks and
contributed to the development of the field.
UNIT-1 Perceptron Learning Rule
• Principle: This rule adjusts weights to minimize classification errors.
• It is a supervised learning approach that adjusts weights based on the error calculated between the
desired and actual outputs.
https://colab.research.google.com/drive/1xHwz0NZhxJLg751sESGSjSlQMkR9m9vf?authuser=1#scrollTo=VPAQVhFVqpvD
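A minimal sketch (NumPy assumed; AND-gate data chosen purely for illustration) of the perceptron learning rule, where weights change only on misclassified points: w <- w + lr * (target - prediction) * x.

```python
import numpy as np

def step(z):
    return np.where(z >= 0, 1, 0)          # step activation

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # AND-gate inputs
t = np.array([0, 0, 0, 1])                       # AND-gate targets

w, b, lr = np.zeros(2), 0.0, 0.1
for epoch in range(20):                    # enough epochs for this separable problem
    for x_i, t_i in zip(X, t):
        y = step(x_i @ w + b)
        w += lr * (t_i - y) * x_i          # no change when t_i == y
        b += lr * (t_i - y)

print("weights:", w, "bias:", b)
print("predictions:", step(X @ w + b))     # [0 0 0 1]
```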
UNIT-1 Delta Learning
• Delta Rule
• Delta Rule [also known as the Widrow & Hoff Learning rule or the Least Mean Square (LMS) rule] was invented by Widrow and
Hoff.
• The Delta Rule is primarily used for binary classification problems, where the goal is to separate two classes of data points by
finding an appropriate decision boundary.
• Working
1. Initialization: Initialize the weights of the neuron or node to small random values.
2. Forward Pass:
• For each input data point, compute the weighted sum of the inputs using the current weights:
net_input = ∑(weight_i * input_i)
• Apply an activation function (often a step function or sign function) to the net input to obtain the predicted output of the neuron:
predicted_output = activation_function(net_input)
3. Error Calculation:
• Compare the predicted output to the actual target value (ground truth) for the input data point to calculate the error:
error = target_output - predicted_output
4. Weight Update (Delta Rule):
• Adjust the weights of the neuron based on the error calculated in the previous step. The weight update formula is as follows:
weight_i_new = weight_i_old + learning_rate * error * input_i
UNIT-1 Delta Learning
5. Repeat:
• Repeat steps 2 to 4 for all training data points.
• Continue iterating through the entire dataset for multiple epochs (iterations) or until the error converges to a satisfactory
level.
6. Convergence:
• The Delta Rule iteration continues until the error decreases to an acceptable level or until a predefined stopping criterion is
met.
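A minimal sketch (NumPy assumed) of the delta (Widrow-Hoff / LMS) rule with a linear output unit; the data is synthetic, generated from a known weight vector so convergence is easy to check.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 3))                 # 100 samples, 3 features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w                           # targets from a known linear map

w, lr = np.zeros(3), 0.05
for epoch in range(200):
    for x_i, y_i in zip(X, y):
        error = y_i - x_i @ w            # target - predicted output
        w += lr * error * x_i            # delta rule weight update

print("learned weights:", w)             # approaches [ 2.  -1.   0.5]
```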
UNIT-1 Derivation of Delta Learning Rule
https://colab.research.google.com/drive/1xHwz0NZhxJLg751sESGSjSlQMkR9m9vf?authuser=1#scrollTo=5CY8fC2a1X5m
UNIT-1 Artificial Neural Network
• Convergence Rules
• Convergence in Machine Learning: Convergence refers to the point where an
iterative algorithm reaches a stable solution. It's a critical concept in machine
learning and optimization.
• Criteria for Convergence: The specific criteria for convergence can vary depending on the algorithm and problem. Common criteria include:
• The change in the loss (objective) function between iterations falls below a small threshold.
• The gradient magnitude approaches zero.
• A maximum number of iterations (epochs) is reached.
• Trade-Off: Balancing convergence with computational resources is essential. Too many iterations
may lead to slow training, while too few may result in suboptimal solutions.
• Monitoring Convergence: Practitioners often monitor convergence during training by plotting the
loss or objective function's values over time and observing when it stabilizes.
• Early Stopping: A common practice in machine learning is early stopping, where training is halted
when no significant improvement in the objective function is observed for a predefined number
of iterations.
• Gradient Descent
• Gradients can be used to find a local minimum or a local maximum of a function:
• If we move towards the negative gradient, i.e. away from the gradient of the function at the current point, we approach a local minimum of that function. This procedure is called Gradient Descent, also known as steepest descent.
• If we move towards the positive gradient, i.e. towards the gradient of the function at the current point, we approach a local maximum of that function. This procedure is known as Gradient Ascent.
UNIT-1 Artificial Neural Network
• Gradient Descent
• The main objective of the gradient descent algorithm is to minimize the cost function through iteration. To achieve this goal, it performs two steps iteratively:
• Calculate the first-order derivative of the function to compute the gradient (slope) at the current point.
• Move in the direction opposite to the gradient, stepping from the current point by alpha times the gradient, where alpha is the Learning Rate: a tuning parameter in the optimization process that decides the length of the steps.
UNIT-1 Artificial Neural Network
• Derivation of Gradient Descent
1. Define the cost function:
• J(θ) measures how well the model's predictions match the actual values in the training data, where θ represents the parameters of the model. The goal is to minimize this cost function.
2. Calculate the gradient:
• Compute the gradient (derivative) of the cost function with respect to the parameters θ. The gradient gives the direction and magnitude of the steepest increase of the cost function:
• ∇J(θ) = [∂J/∂θ₁, ∂J/∂θ₂, ..., ∂J/∂θₙ]
• Each component of ∇J(θ) tells you how much the cost function changes for a small change in the corresponding parameter θᵢ.
3. Initialize parameters:
• Start with an initial guess for the parameters θ, denoted θ₀.
4. Update parameters iteratively using the gradient descent update rule:
• θₖ₊₁ = θₖ − α∇J(θₖ)
• Where:
• θₖ is the current estimate of the parameters at iteration k.
• α (alpha) is the learning rate, a hyperparameter that determines the step size of each update.
• ∇J(θₖ) is the gradient of the cost function at the current parameter values θₖ.
UNIT-1 Artificial Neural Network
• Derivation of Gradient Descent
• The update rule effectively adjusts the parameters in the direction opposite to the gradient, scaled by the learning rate α.
• The process repeats until a stopping criterion is met, typically when the change in the cost function becomes very small or a fixed number of iterations is reached.
5. Convergence:
• With each iteration, the parameters θ move closer to the values that minimize the cost function. Gradient descent converges to a local minimum of the cost function, which represents the best parameter values for the model with respect to the training data.
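A minimal sketch of the update rule above on a simple one-parameter cost J(θ) = (θ − 3)², whose gradient is 2(θ − 3):

```python
def grad_J(theta):
    return 2.0 * (theta - 3.0)        # gradient of J(theta) = (theta - 3)^2

theta, alpha = 0.0, 0.1               # initial guess and learning rate
for k in range(100):
    theta = theta - alpha * grad_J(theta)   # theta_{k+1} = theta_k - alpha * grad J

print(theta)                          # converges to the minimizer theta = 3
```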
• Describe methods like grid search and random search for hyperparameter tuning.
• Describe the backpropagation algorithm used to train MLPs. How does it update weights and biases
to minimize the loss function?
• Discuss common dimensionality reduction techniques, such as Principal Component Analysis (PCA)
and t-Distributed Stochastic Neighbor Embedding (t-SNE).
• Discuss techniques for detecting and mitigating overfitting and underfitting in machine learning
models.
• Explain how the ROC curve and AUC can be used to evaluate model performance.
UNIT-1 Assignment-01
What is the primary objective of deep learning?
a) Feature extraction
b) Dimensionality reduction
c) Automated feature learning
d) Clustering
Which of the following is NOT an activation function commonly used in deep learning?
a) Sigmoid
b) ReLU
c) Tanh
d) K-Means
What problem in deep learning can occur when gradients become extremely small, causing the network to stop learning?
a) Vanishing gradients
b) Exploding gradients
c) Overfitting
d) Dropout
UNIT-1 Assignment-01
Which deep learning architecture is well-suited for sequential data, such as natural language processing?
a) CNN
b) RNN
c) GAN
d) MLP
Which deep learning framework is developed by Google and widely used in research and industry?
a) PyTorch
b) Caffe
c) Keras
d) TensorFlow
UNIT-1 Assignment-01
What is the primary purpose of a loss function in deep learning?
a) To initialize network weights
b) To measure the accuracy of predictions
c) To visualize data
d) To preprocess input data
UNIT-1 References
• https://www.mygreatlearning.com/blog/understanding-curse-of-dimensionality/
• https://www.analyticsvidhya.com/blog/2021/10/evaluation-metric-for-regression-models/
• https://medium.com/swlh/recall-precision-f1-roc-auc-and-everything-542aedf322b9
• https://towardsdatascience.com/metrics-to-evaluate-your-machine-learning-algorithm-f10ba6e38234
• https://www.javatpoint.com/cross-validation-in-machine-learning