
Syllabus ...

Unit I: Model Selection and Feature Engineering
1.1 Introduction: Selecting a model.
1.2 Training a model for supervised learning, Features - understand your data better, Feature extraction and engineering.
1.3 Feature engineering on - numerical data, categorical data and text data.
1.4 Feature scaling, Feature selection.
Unit II: Naive Bayes, Decision Tree
2.1 Naive Bayes
• Bayes Theorem, Working of Naive Bayes.
• Bayes classifier, Applying Bayes Theorem.
• Advantages and Disadvantages of Naive Bayes classifier.
• Applications of Naive Bayes.
• Implementation of Naive Bayes classifier.
2.2 Decision Tree
• Decision tree diagram, Why use a decision tree?
• Working of the decision tree algorithm, Attribute Selection Measures (ASM).
• Advantages and Disadvantages of decision tree.
• Implementation of decision tree.
Unit III: Supervised Learning: Support Vector Machines, K Nearest Neighbours
3.1 Support Vector Machines
• Types of SVM.
• How does SVM work?
• Advantages and Disadvantages of SVM.
• Implementation of SVM.
3.2 K Nearest Neighbours
• Need of KNN algorithm.
• Working of KNN algorithm.
• Advantages and Disadvantages of KNN algorithm.
• Implementation of KNN algorithm.
Unit IV: Unsupervised Learning: Clustering Algorithms
4.1 K-Means Clustering
• What is K-means clustering?
• Working of K-means algorithm.
• Failure of K-means algorithm.
• Implementation of K-means algorithm.
4.2 Dimensionality Reduction
• Introduction to Dimensionality Reduction, Subset Selection.
• Introduction to Principal Component Analysis.
Unit V: Introduction to Deep Learning
5.1 Introduction
• Artificial Neural Network.
• Perceptron, EX-OR problem.
• Feed Forward and Back Propagation, Losses.
• Activation Function, GPU Training.
5.2 Basics of Hyperparameters
• Selecting number of neurons.
• Activation Functions.
• Layers using Greedy Search and Random Search.
Unit VI: Deep Learning for Sequential and Image Data
6.1 Sequential Data
• RNN, LSTM, LSTM-GRU.
• Introduction to Transformers, GPT.
6.2 Image Data
• CNN, Pre-trained networks (ResNet, VGG).
• Neural Networks, Transfer Learning.
• Fine Tuning.
Contents ...
1. Model Selection and Feature Engineering 1.1 - 1.11
1.1 Introduction: Selecting a Model 1.1
1.2 Training a Model for Supervised Learning, Features - Understand Your Data Better, Feature Extraction and Engineering 1.5
1.3 Feature Engineering on - Numerical Data, Categorical Data and Text Data 1.6
1.4 Feature Scaling, Feature Selection 1.8
Practice Questions 1.11
2. Naive Bayes, Decision Tree and Random Forest 2.1 - 2.41
2.1 Introduction 2.1
2.2 Working of Classification 2.1
2.3 Naive Bayes Classifier 2.2
2.4 Types of Naive Bayes Model 2.2
2.5 Naive Bayes 2.8
2.5.1 What is Naive Bayes Classifier Algorithm? 2.8
2.5.2 Why is it Called Naive Bayes? 2.9
2.5.3 What is Bayes' Theorem? 2.9
2.5.4 Explain Working of Naive Bayes' Classifier 2.9
2.5.5 Write Implementation of the Naive Bayes Algorithm 2.11
2.6 Decision Tree 2.15
2.6.1 Introduction: Decision Tree 2.15
2.6.2 The Decision Tree Algorithm 2.15
2.6.3 Advantages of Decision Tree 2.24
2.6.4 Disadvantages of Decision Tree 2.24
2.6.5 Python Implementation of Decision Tree 2.24
2.7 Random Forest Classification 2.29
2.7.1 An Overview of Random Forests 2.29
2.7.2 How Random Forest Classification Works? 2.29
2.7.3 Random Forest Algorithm 2.35
2.7.4 Working of Random Forest Algorithm 2.36
2.7.5 An Overview of Random Forests 2.36
2.7.6 Advantages of Random Forest 2.37
2.7.7 Disadvantages of Random Forest 2.37
2.7.8 Implementation of Random Forest Algorithm 2.37
Practice Questions 2.41
3. Supervised Learning: Support Vector Machine, K Nearest Neighbour 3.1 - 3.29
3.1 Support Vector Machines 3.1
3.1.1 Support Vectors 3.2
3.1.2 Dealing with Non-linear and Inseparable Planes 3.3
3.1.3 Support Vector Machine Kernels 3.3
3.1.4 Classifier Building in Scikit-Learn 3.3
3.2 Support Vector Machines (SVM)? 3.7
3.3 Types of SVM 3.8
3.4 Working of SVM Algorithm 3.9
3.5 Implementation of Support Vector Machine 3.12
3.6 Advantages and Disadvantages of SVM 3.16
3.7 K Nearest Neighbours 3.16
3.7.1 The Dataset 3.17
3.7.2 Visualize the Data 3.19
3.7.3 Normalizing and Splitting the Data 3.19
3.8 Need of K-NN Algorithm 3.22
3.9 Working of K-NN Algorithm 3.22
3.10 Advantages and Disadvantages of K-NN Algorithm 3.24
3.10.1 Advantages of K-NN Algorithm 3.24
3.10.2 Disadvantages of K-NN Algorithm 3.24
3.11 Implementation of the K-NN Algorithm 3.24
Practice Questions 3.29
4. Unsupervised Learning: Clustering Algorithm 4.1 - 4.30
4.1 K-Means Clustering 4.1
4.2 What is K-Means Clustering? 4.8
4.3 Working of K-Means Algorithm 4.9
4.4 Failure of K-Means 4.12
4.5 Implementation of K-Means Algorithm 4.13
4.6 Dimensionality Reduction 4.17
4.6.1 Dimension Reduction Techniques 4.18
4.6.2 Advantages of Applying Dimensionality Reduction 4.19
4.6.3 Disadvantages of Dimensionality Reduction 4.19
4.7 Principal Component Analysis (PCA) 4.21
4.7.1 Orthogonality 4.22
4.7.2 PCA Implementation 4.23
4.8 Singular Value Decomposition (SVD) 4.25
4.9 Subset Selection 4.28
Practice Questions 4.30
5. Introduction to Deep Learning 5.1 - 5.30
5.1 Introduction to Artificial Neural Network 5.1
5.1.1 Biological Neuron Structure and Functions 5.2
5.1.2 Structure and Functions of Artificial Neuron 5.3
5.1.3 State the Major Differences Between Biological and Artificial Neural Networks 5.4
5.1.4 Basic Building Blocks of Artificial Neural Networks 5.5
5.1.5 A Neural Network Activation Function 5.8
5.1.6 Applications of ANN 5.9
5.1.7 Advantages of ANN 5.10
5.1.8 Limitations of ANN 5.10
5.1.9 Multilayer Perceptron 5.11
5.1.10 Working of Back Propagation 5.12
5.1.11 Linear Separability of Points 5.15
5.1.12 Types of Artificial Neural Networks 5.18
5.1.13 Loss Function in Deep Learning 5.18
5.1.14 Activation Function 5.22
5.1.15 GPU Training 5.25
5.2 Basics of Hyperparameters 5.25
5.2.1 Hyperparameters 5.25
5.2.2 Different Ways of Hyperparameter Tuning 5.26
5.2.3 Hyperparameter Tuning Techniques 5.27
5.2.4 Advantages of Hyperparameter Tuning 5.30
5.2.5 Disadvantages of Hyperparameter Tuning 5.30
Practice Questions 5.30
6. Deep Learning for Sequential and Image Data 6.1 - 6.42
6.1 Sequential Data: RNN, LSTM, LSTM-GRU, Introduction to Transformers, GPT 6.1
6.1.1 Recurrent Neural Network (RNN) 6.1
6.1.2 Pros and Cons of Recurrent Neural Networks (RNN) 6.4
6.1.3 Long Short-Term Memory in Machine Learning 6.6
6.1.4 Long Short-Term Memory Networks 6.8
6.1.5 Gated Recurrent Unit (GRU) 6.9
6.2 Introduction to Transformers 6.11
6.2.1 Transformer's Model Architecture 6.11
6.2.2 Limitations of the Transformer 6.13
6.3 GPT - Generative Pre-trained Transformer 6.13
6.3.1 What is GPT? 6.14
6.3.2 Working of GPT 6.14
6.3.3 Training GPT 6.15
6.3.4 Applications of GPT 6.15
6.3.5 Limitations of GPT 6.16
6.4 Image Data: (ResNet, VGG) Pre-Trained Neural Networks, Transfer Learning, Fine Tuning 6.16
6.4.1 Image Classification Using CNN (Convolutional Neural Networks) 6.16
6.4.2 Convolutional Neural Networks (CNNs) 6.20
6.4.3 Residual Network ResNet 6.21
6.4.4 VGG Neural Network 6.24
6.4.5 VGG16 Architecture 6.24
6.4.6 VGG Neural Network Architecture 6.25
6.4.7 Transfer Learning for Deep Learning 6.27
6.4.8 Models That Have Been Pre-Trained 6.30
6.4.9 Fine-Tuning Large Language Models 6.34
6.4.10 Freezing Weights 6.41
Appendix A.1 - A.118
Chapter 1
Model Selection and Feature Engineering

Chapter Outcomes...
After reading this chapter, students will be able to understand:
• Introduction to selecting a model.
• Training a model for supervised learning features.
• Feature extraction and engineering on numerical data, categorical data and text data.
• The concept of feature scaling and feature selection.

Learning Objectives...
• Select a suitable model for the given data with justification.
• Describe the process of using supervised learning on the given data.
• Describe the process of feature extraction and engineering on the given data.
• Compare feature engineering for the given type of data.
• Select feature scaling, feature selection, dimensionality reduction in the given situation with justification.
1.1 INTRODUCTION: SELECTING A MODEL
What is Model Selection?
"The process of selecting the machine learning model most appropriate for a given issue is known as model selection."
Model selection is a procedure that may be used to compare models of the same type that have been set up with various model hyperparameters, as well as models of different types.
Why Model Selection?
Model selection is a procedure used by statisticians to examine the relative merits of different predictive methods and identify which one best fits the observed data. Evaluating a model with the data used for training is not accepted in data science because it easily generates overoptimistic and overfitted models.
You may have to check things like:
• Overfitting and underfitting
• Generalization error
• Validation for model selection
For certain algorithms, the best way to reveal the problem's structure to the learning algorithm is through specific data preparation. The next logical step is to define model selection as the process of choosing amongst model development workflows.
So, depending on your use case, you choose an ML model.

How to Choose the Best Model in Machine Learning?
The choice of model is influenced by many variables, including the dataset, the task, the model type, etc.
Generally, you need to consider two factors:
• The reason for choosing a model
• The model's performance
So let's explore the reasoning behind selecting a model. You can choose models based on the data and the task:
Type of Data:
• Images and videos: If your application mainly focuses on images and videos, for example image recognition, a Convolutional Neural Network (CNN) model works better with images and videos than other models.
• Text data or speech data: Similarly, Recurrent Neural Networks (RNN) are employed if your problem includes speech or text data.
• Numerical data: You may use Support Vector Machines (SVM), logistic regression, and decision trees if your data is numerical.
How to select a model based on the task?
• Classification tasks: SVM, logistic regression, and decision trees.
• Regression tasks: Linear regression, Random Forest, polynomial regression, etc.
• Clustering tasks: K-means clustering, hierarchical clustering.
Therefore, depending on the type of data you have and the task you do, you may use a variety of models.
Model Selection Techniques:
Fig. 1.1: Model selection techniques, divided into probabilistic measures (AIC, BIC, MDL) and resampling methods (cross-validation, bootstrap)
Resampling Methods:
As the name implies, resampling methods are straightforward ways of rearranging data samples to see how well the model performs on samples of data on which it has not been trained. Resampling, in other words, enables us to determine the model's generalizability.
There are two main types of resampling techniques:
Cross-validation:
It is a resampling procedure to evaluate models by splitting the data. Consider a situation where you have two models and want to determine which one is the most appropriate for a certain issue. In this case, we can use a cross-validation process.
So, let's say you are working on an SVM model and iterate over the dataset multiple times. We divide the dataset into a few groups, and in each iteration one group out of the five is used as test data. The machine learning model is evaluated on the test data after being trained on the training data.
Let's say you calculated the accuracy of each iteration; the figure below illustrates each iteration and its accuracy.

Fig. 1.2: Cross-Validation Example (iteration accuracies: 88%, 83%, 86%, 82%, 84%, 85%)


Now, let's calculate the mean accuracy of all the iterations, which comes to around 84.4%. You now use the same procedure once again for the logistic regression model.
You can now compare the mean accuracy of the logistic regression model with that of the SVM. So, according to accuracy, you might claim that a certain model is better for a given use case.
To implement cross-validation you can use sklearn.model_selection.cross_val_score, like this:
>>> from sklearn import datasets, linear_model
>>> from sklearn.model_selection import cross_val_score
>>> diabetes = datasets.load_diabetes()
>>> X = diabetes.data[:150]
>>> y = diabetes.target[:150]
>>> lasso = linear_model.Lasso()
>>> print(cross_val_score(lasso, X, y, cv=3))
[0.3315057 0.08022103 0.03531816]
Bootstrap:
Another resampling technique is called bootstrap, and it involves drawing random samples with replacement. It is used to sample a dataset with replacement in order to estimate statistics on a population.
• It is used with smaller datasets.
• The number of samples must be chosen.
• The size of all samples and of the test data should be the same.
• The sample with the most scores is therefore taken into account.
In simple terms, you start by:
• Randomly selecting an observation.
• Noting that value.
• Putting that value back.
You then repeat these steps N times, where N is the number of observations in the initial dataset. The final result is one bootstrap sample with N observations; a sketch follows below.
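A minimal sketch of one bootstrap draw, assuming scikit-learn's resample utility; the ten-observation array is invented for illustration.

# Draw one bootstrap sample of size N with replacement (illustrative data).
import numpy as np
from sklearn.utils import resample

data = np.arange(10)                      # toy dataset of N = 10 observations
boot = resample(data, replace=True, n_samples=len(data), random_state=1)
oob = np.setdiff1d(data, boot)            # observations never drawn ("out-of-bag")
print("Bootstrap sample:", boot)
print("Out-of-bag observations:", oob)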
Probabilistic Measures:
Information Criterion is a kind of probabilistic measure that can be used to evaluate the effectiveness of statistical procedures. Its methods include a scoring system that selects the most effective candidate models using a log-likelihood framework of Maximum Likelihood Estimation (MLE).

Resampling only focuses on model performance, whereas probabilistic modelling concentrates on both model performance and complexity.
• An information criterion (IC) is a statistical metric that yields a score. The model with the lowest score is the most effective.
• Performance is calculated using in-sample data; therefore a test set is unnecessary. Instead, the score is calculated using all the training data.
• Less complexity entails a straightforward model with fewer parameters that is simple to learn and maintain, but unable to detect fluctuations that affect a model's performance.
There are three statistical methods for calculating the degree of complexity and how well a particular model fits a dataset:
Akaike Information Criterion (AIC):
AIC is a single numerical score that may be used to distinguish, across many models, the one that is most likely to be the best fit for a given dataset. AIC scores are only helpful when compared to other scores for the same dataset.
Lower AIC scores are preferable: AIC measures the model's accuracy in fitting the training data set and includes a penalty term for model complexity.
AIC = (2K - 2 ln(L)) / N
where, K = the number of distinct variables or predictors,
L = the model's maximum likelihood,
N = the number of data points in the training set (especially helpful in the case of small datasets).
The drawback of AIC is that it struggles with generalizing models, since it favors intricate models that retain more training data. This implies that all tested models might still have a poor fit.
Minimum Description Length (MDL):
According to the MDL principle, the explanation that allows for the most data compression is the best, given a small collection of observed data. Simply put, it is a technique that forms the cornerstone of statistical modelling, pattern recognition, and machine learning.
MDL = L(h) + L(D | h)
where, h = the model and D = the model's predictions,
L(h) = the number of bits needed to express the model,
L(D | h) = the number of bits needed to describe the model's predictions.
Bayesian Information Criterion (BIC):
BIC was derived using the Bayesian probability idea and is appropriate for models that use maximum likelihood estimation during training.
BIC = k ln(n) - 2 ln(L̂)
where, L̂ is the maximized value of the likelihood function of the model,
n is the number of data points,
k is the number of free parameters to be estimated.
BIC is more commonly employed in time series and linear regression models. However, it may be applied broadly to any model based on maximum likelihood.
Structural Risk Minimization (SRM):
There are instances of overfitting when the model becomes biased toward the training data, which is its primary source of learning.
A generalized model must frequently be chosen from a limited data set in machine learning, which leads to the issue of overfitting: the model becomes too fitted to the specifics of the training set and performs poorly on new data. By weighing the model's complexity against how well it fits the training data, the SRM principle solves this issue:
R_srm(f) = (1/N) Σ_{i=1}^{N} L(y_i, f(x_i)) + J(f)
Here, J(f) is the complexity of the model, and the first term is the average loss over the N training points.

Metrics for Evaluating Regression Models:
Model evaluation is crucial in machine learning. It simplifies presenting your model to others and helps you understand how well it performs. Several evaluation metrics are available, but only a few can be employed with regression.
• Mean Absolute Error (MAE): The MAE adds up the absolute value of each error. It is an important metric to evaluate a model. You can simply calculate MAE by importing:
from sklearn.metrics import mean_absolute_error
• Mean Square Error (MSE): While MAE handles all errors equally, MSE is computed by summing the squares of the differences between the real output and the expected output, then dividing the result by the total number of data points. It provides an exact number indicating how much your findings differ from what you projected.
from sklearn.metrics import mean_squared_error
• Adjusted R Square: R Square quantifies how much of the variation in the dependent variable the model can account for. Its name, R Square, refers to the fact that it is the square of the correlation coefficient (R). A usage sketch follows below.
When comparing machine learning models, you must choose a tool or platform that can support your team's needs and your business goal.
With Censius, you can monitor each model's health in one place and use the user-friendly interface to comprehend models and analyze them for particular problems:
• Evaluate performance without ground truth.
• Compare the past performance of a model.
• Create personalized dashboards.
• Compare performance between model iterations.
1.2 TRAINING A MODEL FOR SUPERVISED LEARNING, FEATURES - UNDERSTAND YOUR DATA BETTER, FEATURE EXTRACTION AND ENGINEERING
Training a model for supervised learning involves several steps, and understanding your data is a crucial part of this process. Feature extraction and engineering are techniques that help you represent your data in a way that is conducive to learning for your machine learning model. Here's a step-by-step guide:
1. Understand Your Data:
(a) Exploratory Data Analysis (EDA):
• Examine the structure of your dataset.
• Check for missing values, outliers, and anomalies.
• Understand the distribution of your target variable.
(b) Statistical Summary:
• Use descriptive statistics to summarize key aspects of your data.
• Identify patterns, trends, and relationships.
(c) Visualization:
• Create visualizations (histograms, scatter plots, etc.) to gain insights.
• Identify potential correlations between features and the target variable.
2. Feature Extraction:
(a) Select Relevant Features:
• Identify features that are likely to have a significant impact on the target variable.
• Remove irrelevant or redundant features that may not contribute to the model's performance.
(b) Handling Categorical Data:
• Encode categorical variables using techniques like one-hot encoding or label encoding.
(c) Feature Scaling:
• Standardize or normalize numerical features to ensure they are on a similar scale.
• This is crucial for algorithms sensitive to feature scales, such as gradient-based optimization methods.

3. Feature Engineering:
(a) Create New Features:
• Derive new features that might capture important patterns or relationships.
• For example, extract date features from a timestamp, create interaction terms, or combine existing features.
(b) Polynomial Features:
• Introduce polynomial features to capture non-linear relationships.
• For instance, square or cube certain features to account for quadratic or cubic patterns.
(c) Dimensionality Reduction:
• Use techniques like Principal Component Analysis (PCA) to reduce the dimensionality of the dataset while retaining essential information.
4. Data Splitting:
(a) Training and Testing Sets:
• Split your dataset into training and testing sets to evaluate your model's performance on unseen data.
5. Model Training:
(a) Choose a Model:
• Select a suitable algorithm based on your problem (for example, regression or classification) and data characteristics.
(b) Train the Model:
• Feed the training data into the chosen model.
• Adjust model parameters using techniques like cross-validation.
6. Model Evaluation:
(a) Evaluate on Test Set:
• Assess the model's performance on the testing set to estimate its generalization capability.
(b) Fine-tuning:
• If needed, fine-tune hyperparameters to improve performance.
7. Iterative Process:
(a) Refinement:
• Based on model performance, go back to feature engineering or adjust the model architecture.
(b) Cross-Validation:
• Perform cross-validation to ensure robustness of your model.
8. Deployment:
• Once satisfied with the model, deploy it to make predictions on new, unseen data. A minimal sketch of these steps appears below.
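The sketch below strings the steps above into one minimal pipeline, assuming scikit-learn; the built-in breast cancer dataset is an illustrative stand-in for your own data.

# Minimal end-to-end sketch: split, scale, train, cross-validate, evaluate.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Step 4: data splitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Step 5: model training (scaler and model combined in one pipeline)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
print("CV accuracy:", cross_val_score(model, X_train, y_train, cv=5).mean())
model.fit(X_train, y_train)

# Step 6: model evaluation on the held-out test set
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))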
1.3 FEATURE ENGINEERING ON - NUMERICAL DATA, CATEGORICAL DATA AND TEXT DATA
What is Feature Engineering?
Fig. 1.3: Feature engineering: features are extracted from raw data to derive insights

Feature Engineering:
• Feature engineering refers to the process of using domain knowledge to select and transform the most relevant variables from raw data when creating a predictive model using machine learning or statistical modelling.
• The goal of feature engineering and selection is to improve the performance of machine learning (ML) algorithms.
Data Pre-Processing:
• Data preprocessing is an important step in the data mining process.
• It refers to the cleaning, transforming, and integrating of data to make it ready for analysis.
• The goal of data preprocessing is to improve the quality of the data and to make it more suitable for the specific data mining task.
Feature engineering techniques for numerical data, categorical data, and text data are described separately below:
1. Numerical Data:
(a) Scaling:
• Standardize or normalize numerical features to ensure they are on a similar scale. This is important for algorithms sensitive to feature scales.
(b) Binning:
• Convert numerical features into categorical features by binning or bucketing. This can help capture non-linear relationships.
(c) Polynomial Features:
• Introduce polynomial features to capture non-linear relationships in the data.
(d) Log Transform:
• Apply a log transformation to numerical features to handle skewed distributions.
(e) Interactions:
• Create interaction terms between two or more numerical features.
(f) Outlier Handling:
• Identify and handle outliers using techniques such as truncation, transformation, or imputation. A short sketch of these transformations follows this list.
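A brief sketch of binning, log transform, and interaction terms, assuming pandas and NumPy; the small age/income frame is made up for illustration.

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [18, 25, 40, 70],
                   "income": [20000, 35000, 90000, 400000]})

# Binning: turn a numerical feature into categories
df["age_bin"] = pd.cut(df["age"], bins=[0, 30, 60, 100],
                       labels=["young", "middle", "senior"])

# Log transform: compress a skewed distribution
df["log_income"] = np.log1p(df["income"])

# Interaction term between two numerical features
df["age_x_income"] = df["age"] * df["income"]
print(df)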
2. Categorical Data:
(a) One-Hot Encoding:
• Convert categorical variables into binary vectors using one-hot encoding.
(b) Label Encoding:
• Transform categorical labels into numerical values if the ordinal relationship is essential.
(c) Target Encoding:
• Encode categorical features based on the mean or median of the target variable for each category.
(d) Frequency Encoding:
• Encode categorical variables based on their frequency in the dataset.
(e) Embeddings:
• Use embeddings for categorical variables, especially useful in deep learning models.
(f) Dummy Variables:
• Create dummy variables for categorical features with multiple levels. See the encoding sketch after this list.
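The sketch below illustrates one-hot, label, and frequency encoding, assuming pandas and scikit-learn; the city column is an invented example.

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"city": ["Pune", "Mumbai", "Pune", "Delhi"]})

# One-hot encoding / dummy variables
one_hot = pd.get_dummies(df["city"], prefix="city")

# Label encoding (sensible only if an ordinal relationship exists)
df["city_label"] = LabelEncoder().fit_transform(df["city"])

# Frequency encoding: replace each category by its relative frequency
df["city_freq"] = df["city"].map(df["city"].value_counts(normalize=True))
print(pd.concat([df, one_hot], axis=1))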
3. Text Data:
(a) Tokenization:
• Break text into individual words or subwords (tokenization).
(b) TF-IDF (Term Frequency-Inverse Document Frequency):
• Convert text data into numerical vectors using TF-IDF to capture the importance of words in a document.
(c) Word Embeddings:
• Use pre-trained word embeddings like Word2Vec, GloVe, or FastText to represent words in a continuous vector space.
(d) Bag-of-Words:
• Represent text as a bag of words, counting the frequency of each word.

(e) Text Length:
• Create a feature representing the length of the text.
(f) Topic Modelling:
• Use techniques like Latent Dirichlet Allocation (LDA) for topic modelling and represent documents by their topic distribution.
(g) Sentiment Analysis:
• Analyze sentiment and use sentiment scores as features.
(h) N-grams:
• Consider including n-grams (sequences of n words) as features. A short vectorization sketch follows this list.
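A minimal sketch of bag-of-words, n-grams, and TF-IDF vectorization with scikit-learn; the two documents are invented.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the weather is sunny",
        "players play when the weather is sunny"]

# Bag-of-words counts, including unigrams and bigrams as n-grams
bow = CountVectorizer(ngram_range=(1, 2)).fit_transform(docs)

# TF-IDF weighting of the same documents
tfidf = TfidfVectorizer().fit_transform(docs)
print(bow.shape, tfidf.shape)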
1.4 FEATURE SCALING, FEATURE SELECTION
What is Feature Scaling?
Feature scaling is a method used to normalize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step.
For example, if you have multiple independent variables like age, salary, and height, with ranges of (18-100 years), (25,000-75,000 Euros), and (1-2 meters) respectively, feature scaling would bring them all into the same range, for example centered around 0 or in the range (0, 1), depending on the scaling technique.
To visualize the above, let us take the independent variables Alcohol and Malic Acid content from the "Wine Dataset" deposited in the UCI machine learning repository. Below you can see the impact of the two most common scaling techniques (normalization and standardization) on the dataset.
Fig. 1.4: Alcohol and Malic Acid content of the Wine dataset on the input scale, standardized [N(μ = 0, σ = 1)], and min-max scaled [min = 0, max = 1]: the impact of standardization and normalization on the Wine dataset
Methods for Scaling:
Now that you have an idea of what feature scaling is, let us explore what methods are available for doing feature scaling. Of all the methods available, the most common ones are:
Normalization:
Also known as min-max scaling or min-max normalization, it is the simplest method and consists of rescaling the range of features to [0, 1]. The general formula for normalization is given as:
X' = (X - min(x)) / (max(x) - min(x))
Here, max(x) and min(x) are the maximum and the minimum values of the feature respectively.

We can also do a normalization over different intervals, for example choosing to have the variable lie in any [a, b] interval, a and b being real numbers. To rescale a range to an arbitrary set of values [a, b], the formula becomes:
X' = a + ((X - min(x)) (b - a)) / (max(x) - min(x))
Standardization:
Feature standardization makes the values of each feature in the data have zero mean and unit variance. The general method of calculation is to determine the distribution mean and standard deviation for each feature and calculate the new data point by the following formula:
X' = (X - x̄) / σ
Here, σ is the standard deviation of the feature vector, and x̄ is the average of the feature vector.
Scaling to unit length: The aim of this method is to scale the components of a feature vector such that the complete vector has length one. This usually means dividing each component by the Euclidean length of the vector:
X' = X / ||X||, where ||X|| is the Euclidean length of the feature vector.
In addition to the above three widely used methods, there are some other methods to scale features, viz. Power Transformer, Quantile Transformer, Robust Scaler, etc. For the scope of this discussion, we are deliberately not diving into the details of these techniques.
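The three methods above map directly onto scikit-learn's MinMaxScaler, StandardScaler, and Normalizer; a minimal sketch with an invented two-feature matrix follows.

import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

X = np.array([[18.0, 25000.0],
              [40.0, 50000.0],
              [100.0, 75000.0]])           # illustrative age/salary values

print(MinMaxScaler().fit_transform(X))         # normalization to [0, 1]
print(StandardScaler().fit_transform(X))       # zero mean, unit variance
print(Normalizer(norm="l2").fit_transform(X))  # scale each row to unit length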
The million-dollar question: Normalization or Standardization?
If you have ever built a machine learning pipeline, you must have faced the question of whether to normalize or to standardize. While there is no obvious answer to this question, since it really depends on the application, there are still a few generalizations that can be drawn.
Normalization is good to use when the distribution of the data does not follow a Gaussian distribution. It can be useful in algorithms that do not assume any distribution of the data, like K-Nearest Neighbors.
In neural network algorithms that require data on a 0-1 scale, normalization is an essential pre-processing step. Another popular example of data normalization is image processing, where pixel intensities have to be normalized to fit within a certain range (i.e., 0 to 255 for the RGB color range).
Standardization can be helpful in cases where the data follows a Gaussian distribution, though this does not have to be necessarily true. Since standardization does not have a bounding range, even if there are outliers in the data, they will not affect standardization.
In clustering analyses, standardization comes in handy to compare similarities between features based on certain distance measures. Another prominent example is Principal Component Analysis, where we usually prefer standardization over min-max scaling since we are interested in the components that maximize the variance.
There are some points which can be considered while deciding whether we need standardization or normalization:
• Standardization may be used when data represent a Gaussian distribution, while normalization is great with a non-Gaussian distribution.
• The impact of outliers is very high in normalization.
To conclude, you can always start by fitting your model to raw, normalized, and standardized data and compare the performance for the best results.
The link between Data Scaling and Data Leakage:
To apply normalization or standardization, we can use the prebuilt functions in scikit-learn or can create our own custom function.
Data leakage mainly occurs when some information from the training data is revealed to the validation data. To prevent this, the key point is to fit the scaler on the training data only and then use it to transform the test data, as sketched below.
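A minimal sketch of the leakage-safe pattern, assuming scikit-learn; the random matrix stands in for real data.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 3)                      # illustrative data
X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics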
Define Feature Selection:
Feature Selection is defined as, "It is a process of automatically or manually selecting the subset of most appropriate and relevant features to be used in model building."
What is Feature Selection?
A feature is an attribute that has an impact on a problem or is useful for the problem, and choosing the important features for the model is known as feature selection. Each machine learning process depends on feature engineering, which mainly contains two processes: Feature Selection and Feature Extraction. Although feature selection and extraction processes may have the same objective, both are completely different from each other. The main difference between them is that feature selection is about selecting a subset of the original feature set, whereas feature extraction creates new features.
Feature selection is a way of reducing the input variables for the model by using only relevant data in order to reduce overfitting in the model.
So, we can define feature selection as, "It is a process of automatically or manually selecting the subset of most appropriate and relevant features to be used in model building." Feature selection is performed by either including the important features or excluding the irrelevant features in the dataset, without changing them.
Need for Feature Selection:
Before implementing any technique, it is important to understand the need for it, and the same holds for feature selection. As we know, in machine learning it is necessary to provide a pre-processed and good input dataset to get better outcomes. We collect a huge amount of data to train our model and help it learn better. Generally, the dataset consists of noisy data, irrelevant data, and some portion of useful data. Moreover, the huge amount of data also slows down the training process of the model, and with noise and irrelevant data, the model may not predict and perform well. So it is very necessary to remove such noise and less-important data from the dataset, and to do this, feature selection techniques are used.
Selecting the best features helps the model to perform well. For example, suppose we want to create a model that automatically decides which car should be crushed for spare parts, and to do this, we have a dataset. This dataset contains the Model of the car, Year, Owner's name, and Miles. In this dataset, the name of the owner does not contribute to the model performance, as it does not decide whether the car should be crushed or not, so we can remove this column and select the rest of the features (columns) for model building.
Below are some benefits of using feature selection in machine learning:
• It helps in avoiding the curse of dimensionality.
• It helps in the simplification of the model so that it can be easily interpreted by the researchers.
• It reduces the training time.
• It reduces overfitting, hence enhancing generalization.
Feature Selection Techniques:
There are mainly two types of feature selection techniques:
• Supervised Feature Selection technique: Supervised feature selection techniques consider the target variable and can be used for labelled datasets.
• Unsupervised Feature Selection technique: Unsupervised feature selection techniques ignore the target variable and can be used for unlabelled datasets.

Fig. 1.5: Feature selection techniques: supervised methods, comprising filter methods (missing value ratio, information gain, chi-square test, Fisher's score), embedded methods (regularization, random forest importance), and wrapper methods (forward feature selection, backward feature elimination, exhaustive feature selection, recursive feature elimination); and unsupervised methods

Practice Questions
1. What is Model Selection?
2. Why Model Selection?
3. How to Choose the Best Model in Machine Learning?
4. How to select a model based on the task?
5. Describe Model Selection Techniques.
6. Describe Feature Engineering.
7. Explain feature engineering techniques for numerical data, categorical data, and text data.
8. Describe Feature Scaling.
9. Explain Feature Selection.
Chapter 2
Supervised Learning: Naive Bayes, Decision Tree and Random Forest
Chapter Outcomes...
After reading this chapter, students will be able to understand:
• The concept of Bayes Theorem and the working of Bayes theorem.
• Applications, Advantages and Disadvantages of Bayes theorem.
• Implementation of Naive Bayes classifier.
• Concept and working of decision tree.
• Advantages and Disadvantages of decision tree.
• Implementation of decision tree.
• Concept and working of Random Forest.
• Advantages, Disadvantages and Applications of Random Forest.
• Implementation of Random Forest Algorithm.

Learning Objectives...
• Classify the given data using the Bayesian method with stepwise justification.
• Describe the working of the Decision Tree algorithm.
• Enlist applications of the Random Forest algorithm.

2.1 INTRODUCTION
Suppose you hold the position of a product manager; your objective is to categorize customer feedback into positive and negative groups.
Or, as a loan manager, your objective is to determine the creditworthiness of loan applicants, distinguishing between those who pose a low risk and those who pose a high risk.
Or, as a healthcare analyst, your objective is to forecast which people are susceptible to developing diabetes.
All of these cases exhibit a common issue: categorizing reviews, loan applications, and patients.
Naive Bayes is a very efficient and rapid classification technique that is well-suited for processing massive volumes of data.
The Naive Bayes classifier is effectively used in many applications, including spam filtering, text classification, sentiment analysis, and recommender systems.
The prediction of an unknown class is accomplished by the use of Bayes' theory of probability.
2.2 WORKING OF CLASSIFICATION
When engaging in categorization, the first phase is analysing the issue and identifying probable characteristics and labels.
Features are particular characteristics or properties that have an impact on the outcomes of the label.
For instance, before granting a loan, bank management assesses the customer's employment, income, age, location, past loan history, transaction history, and credit score.

® These attributes are referred to as features that aid the model in categorizing clients.
* The categorization process consists of two distinct phases: a learning phase and an assessment phase.
® During the learning phase, the classifier acquires knowledge by training its model using a specific dataset.
* In the evaluation phase, the classifier assesses its performance. Evaluation of performance is based on many metrics
including accuracy, error, precision, and recall.

Model
Training Development

) Se.
W Performance
Model measures :

Test Set [——| Evaluation [—*| J-AcCUY


3. Recall

Fig. 2.1
2.3 NAIVE BAYES CLASSIFIER
Naive Bayes Classifier:
• The Naive Bayes Classifier is a machine learning algorithm that is based on Bayes' theorem. It is used for classification tasks, where it predicts the probability of an input belonging to a certain class based on its features.
• Naive Bayes is a statistical classification method that relies on Bayes' Theorem.
• This technique is considered to be one of the most straightforward methods in supervised learning.
• The Naive Bayes classifier is an algorithm that is known for its speed, accuracy, and reliability.
• Naive Bayes classifiers have excellent accuracy and efficiency when applied to extensive datasets.
• The word "naive" means inexperienced or lacking in worldly wisdom.
• The Bayes classifier operates on the assumption that the impact of a specific feature on a class is not influenced by other features.
• For instance, the desirability of a loan applicant is contingent upon factors such as their income, previous loan and transaction history, age, and geography.
• Although these traits are interrelated, they are nonetheless regarded as separate entities. This assumption seems naive, but it simplifies calculation.
• This assumption is referred to as class conditional independence.
P(h|D) = P(D|h) P(h) / P(D)
P(h): The probability of hypothesis h being true (regardless of the data). This is known as the prior probability of h.
P(D): The probability of the data (regardless of the hypothesis). This is known as the prior probability of the data.
P(h|D): The probability of hypothesis h given the data D. This is known as the posterior probability.
P(D|h): The probability of data D given that hypothesis h is true. This is known as the likelihood.
2.4 TYPES OF NAIVE BAYES MODEL
Types of Naive Bayes Model:
There are five types of NB models under the scikit-learn library, sketched in code after this list:
• Gaussian Naive Bayes: Often known as GaussianNB, this classification algorithm operates on the assumption that the values of the features in the dataset are distributed according to a Gaussian distribution.
• Multinomial Naive Bayes: This is a classification algorithm specifically designed for discrete count data. Suppose we encounter a text categorization problem as an example. In this context, we count the frequency of a word appearing in a document rather than just determining whether it occurs. This may be seen as the number of times a certain outcome, denoted as x_i, is observed throughout a series of n trials.
• Bernoulli Naive Bayes: The binomial model is applicable when the feature vectors consist of binary values (i.e., zeros and ones). An example use case may include text categorization using a 'bag of words' model, where the binary values of 1 and 0 represent whether a word appears or does not appear in a given document, respectively.
• Complement Naive Bayes: This is a variant of Multinomial NB that uses the complement of each class to determine the model weights. This method is well-suited for unbalanced data sets and often achieves better results than the Multinomial Naive Bayes (MNB) algorithm in text classification tasks.
• Categorical Naive Bayes: This is a valuable method when the features are distributed in a categorical manner. In order to use this procedure, it is necessary to convert the categorical variables into a numerical format by using an ordinal encoder.
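As a sketch, the five variants described above correspond to the following scikit-learn classes; the alpha value shown is just the default smoothing parameter.

from sklearn.naive_bayes import (
    BernoulliNB,    # binary features (word present / absent)
    CategoricalNB,  # categorical, ordinal-encoded features
    ComplementNB,   # MultinomialNB variant for unbalanced classes
    GaussianNB,     # continuous, Gaussian-distributed features
    MultinomialNB,  # discrete count data such as word frequencies
)

model = MultinomialNB(alpha=1.0)  # alpha controls additive smoothing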
What is the working mechanism of the Naive Bayes Classifier?
Let us comprehend the workings of Naive Bayes with an illustrative case: a case study of weather conditions and participation in sports. You must compute the likelihood of engaging in athletic activities; that is, categorize whether players will participate or not, depending on the weather conditions.
First Approach (for a single feature):
The Naive Bayes classifier computes the probability of an event through the following sequential steps:
Step 1: Compute the prior probability for the given class labels.
Step 2: Calculate the likelihood of each feature for each class.
Step 3: Substitute these values into Bayes' formula and compute the posterior probability.
Step 4: Determine the class with the greater probability; the input belongs to that class.
For the convenience of the computation of prior and posterior probabilities, you may use two tables: the frequency table and the likelihood table. Both of these tables help with the computation of the prior and posterior probabilities. The frequency table records the frequency of labels for all features. Likelihood Table 1 displays the prior probabilities of labels, whereas Likelihood Table 2 presents the updated values known as posterior probabilities.
Frequency Table:
Weather  | No | Yes | Total
Overcast | 0  | 4   | 4
Sunny    | 2  | 3   | 5
Rainy    | 3  | 2   | 5
Total    | 5  | 9   | 14

Likelihood Table 1:
Weather  | No          | Yes         | Prior
Overcast | 0           | 4           | 4/14 = 0.29
Sunny    | 2           | 3           | 5/14 = 0.36
Rainy    | 3           | 2           | 5/14 = 0.36
Total    | 5/14 = 0.36 | 9/14 = 0.64 |

Likelihood Table 2:
Weather  | No | Yes | Posterior Probability for No | Posterior Probability for Yes
Overcast | 0  | 4   | 0/5 = 0                      | 4/9 = 0.44
Sunny    | 2  | 3   | 2/5 = 0.40                   | 3/9 = 0.33
Rainy    | 3  | 2   | 3/5 = 0.60                   | 2/9 = 0.22

Fig. 2.2: Frequency and likelihood tables for the weather dataset
Now suppose you want to calculate the probability of playing when the weather is overcast.
Probability of playing:
P(Yes | Overcast) = P(Overcast | Yes) P(Yes) / P(Overcast)    ... (2.1)
1. Calculate prior probabilities:
P(Overcast) = 4/14 = 0.29
P(Yes) = 9/14 = 0.64
2. Calculate posterior probabilities:
P(Overcast | Yes) = 4/9 = 0.44
3. Put the prior and posterior probabilities into equation (2.1):
P(Yes | Overcast) = (0.44 x 0.64) / 0.29 = 0.98 (higher)
Similarly, you can calculate the probability of not playing:
Probability of not playing:
P(No | Overcast) = P(Overcast | No) P(No) / P(Overcast)    ... (2.2)
1. Calculate prior probabilities:
P(Overcast) = 4/14 = 0.29
P(No) = 5/14 = 0.36
2. Calculate posterior probabilities:
P(Overcast | No) = 0/5 = 0
3. Put the prior and posterior probabilities into equation (2.2):
P(No | Overcast) = (0 x 0.36) / 0.29 = 0
The probability of the 'Yes' class is higher, so you can determine that if the weather is overcast, the players will play the sport.
Second Approach (in case of multiple features):
The dataset below records the weather, the temperature, and whether play took place:

Weather  | Temperature | Play
Sunny    | Hot  | No
Sunny    | Hot  | No
Overcast | Hot  | Yes
Rainy    | Mild | Yes
Rainy    | Cool | Yes
Rainy    | Cool | No
Overcast | Cool | Yes
Sunny    | Mild | No
Sunny    | Cool | Yes
Rainy    | Mild | Yes
Sunny    | Mild | Yes
Overcast | Mild | Yes
Overcast | Hot  | Yes
Rainy    | Mild | No

Fig. 2.3: How the Naive Bayes classifier works with multiple features:
1. Calculate the prior probability for the given class labels.
2. Calculate the conditional probability with each attribute for each class.
3. Multiply the same-class conditional probabilities.
4. Multiply the prior probability with the step-3 probability.
5. See which class has the higher probability; the input belongs to that class.
Now suppose you want to calculate the probability of playing when the weather is overcast and the temperature is mild.
Probability of playing:
P(Play = Yes | Weather = Overcast, Temp = Mild) = P(Weather = Overcast, Temp = Mild | Play = Yes) P(Play = Yes)    ... (2.3)
P(Weather = Overcast, Temp = Mild | Play = Yes) = P(Overcast | Yes) P(Mild | Yes)    ... (2.4)
1. Calculate prior probabilities:
P(Yes) = 9/14 = 0.64
2. Calculate posterior probabilities:
P(Overcast | Yes) = 4/9 = 0.44; P(Mild | Yes) = 4/9 = 0.44
3. Put the posterior probabilities into equation (2.4):
P(Weather = Overcast, Temp = Mild | Play = Yes) = 0.44 x 0.44 = 0.1936 (higher)
4. Put the prior and posterior probabilities into equation (2.3):
P(Play = Yes | Weather = Overcast, Temp = Mild) = 0.1936 x 0.64 = 0.124
Similarly, you can calculate the probability of not playing:
Probability of not playing:
P(Play = No | Weather = Overcast, Temp = Mild) = P(Weather = Overcast, Temp = Mild | Play = No) P(Play = No)    ... (2.5)
P(Weather = Overcast, Temp = Mild | Play = No) = P(Overcast | No) P(Mild | No)    ... (2.6)
1. Calculate prior probabilities:
P(No) = 5/14 = 0.36
2. Calculate posterior probabilities:
P(Overcast | No) = 0/5 = 0; P(Mild | No) = 2/5 = 0.4
3. Put the posterior probabilities into equation (2.6):
P(Weather = Overcast, Temp = Mild | Play = No) = 0 x 0.4 = 0
4. Put the prior and posterior probabilities into equation (2.5):
P(Play = No | Weather = Overcast, Temp = Mild) = 0 x 0.36 = 0
The probability of the 'Yes' class is higher, so you can say that if the weather is overcast, the players will play the sport.
Classifier Building in Scikit-Learn:
Naive Bayes Classifier with a Synthetic Dataset:
In the first example, we will generate synthetic data using scikit-learn and train and evaluate the Gaussian Naive Bayes algorithm.
Generating the Dataset:
Scikit-learn provides us with a machine learning ecosystem so that you can generate a dataset and evaluate various machine learning algorithms.
In our case, we are creating a dataset with six features, three classes, and 800 samples using the "make_classification" function.
from sklearn.datasets import make_classification

X, y = make_classification(
n_features=6,
n_classes=3,
n_samples=800,
n_informative=2,
random_state=1,
n_clusters_per_class=1,
)

We will use matplotlib.pyplot's "scatter" function to visualize the dataset.

import matplotlib.pyplot as plt

plt.scatter(X[:, 0], X[:, 1], c=y, marker="+");

Fig. 2.4: Scatter plot of the synthetic dataset, colored by target label
As we can observe, there are three types of target labels, and we will be training a multiclass classification model.
Train Test Split:
Before we start the training process, we need to split the dataset into training and testing sets for model evaluation.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=125
)
Model Building and Training:
Build a generic Gaussian Naive Bayes model and train it on the training dataset. After that, feed a random test sample to the model to get a predicted value.
from sklearn.naive_bayes import GaussianNB

# Build a Gaussian classifier
model = GaussianNB()

# Model training
model.fit(X_train, y_train)

# Predict output
predicted = model.predict([X_test[6]])

print("Actual Value:", y_test[6])
print("Predicted Value:", predicted[0])
Both actual and predicted values are the same.
Actual Value: 0
Predicted Value: 0

Model Evaluation:
We will now evaluate the model on the unseen test dataset. First, we will predict the values for the test dataset and use them to calculate accuracy and F1 score.
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    ConfusionMatrixDisplay,
    f1_score,
)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_pred, y_test)
f1 = f1_score(y_pred, y_test, average="weighted")

print("Accuracy:", accuracy)
print("F1 Score:", f1)
Our model has performed fairly well with default hyperparameters.
Accuracy: 0.8484848484848485
F1 Score: 0.8491119695890328
To visualize the confusion matrix, we will use "confusion_matrix" to calculate the true positives and true negatives and "ConfusionMatrixDisplay" to display the confusion matrix with the labels.
labels = [0, 1, 2]
cm = confusion_matrix(y_test, y_pred, labels=labels)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
disp.plot();

Fig. 2.5: Confusion matrix for the Gaussian Naive Bayes model (true label vs. predicted label)
Code available at: https://colab.research.google.com/drive/10yewmAEAkWusiQQcSJTdIZ35Gj90q8P1?usp=sharing
Advantages:
• It is not only a simple approach but also a fast and accurate method for prediction.
• Naive Bayes has a very low computation cost.
• It can efficiently work on a large dataset.
• It performs well in the case of discrete response variables compared to continuous variables.
• It can be used with multiple class prediction problems.
• It also performs well on text analytics problems.
• When the assumption of independence holds, a Naive Bayes classifier performs better compared to other models like logistic regression.
Disadvantages:
• The assumption of independent features. In practice, it is almost impossible that the model will get a set of predictors which are entirely independent.
• If there is no training tuple of a particular class, this causes zero posterior probability. In this case, the model is unable to make a prediction. This problem is known as the Zero Probability/Frequency Problem; a smoothing sketch follows below.
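The usual remedy for the zero-frequency problem is additive (Laplace) smoothing, which scikit-learn exposes as the alpha parameter; a one-line sketch:

from sklearn.naive_bayes import MultinomialNB

# alpha=1.0 applies Laplace (add-one) smoothing, so a feature value that never
# co-occurs with a class still receives a small non-zero probability.
model = MultinomialNB(alpha=1.0)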

2.5 NAIVE BAYES

Bayes Theorem:
• Bayes' Theorem, named after the 18th-century British mathematician Thomas Bayes, is a mathematical formula for determining conditional probability.
• Conditional probability is the likelihood of an outcome occurring based on a previous outcome having occurred in similar circumstances.
• Bayes' theorem provides a way to revise existing predictions or theories (update probabilities) given new or additional evidence:
P(A|B) = P(B|A) P(A) / P(B)
where, P(A|B) is the Posterior probability: probability of hypothesis A given the observed event B.
P(B|A) is the Likelihood probability: probability of the evidence given that the hypothesis is true.
P(A) is the Prior probability: probability of the hypothesis before observing the evidence.
P(B) is the Marginal probability: probability of the evidence.
Example 2.1: Bag I contains 4 white and 6 black balls, while Bag II contains 4 white and 3 black balls. One ball is drawn at random from one of the bags, and it is found to be black. Find the probability that it was drawn from Bag I.
Solution: Let E1 be the event of choosing Bag I, E2 the event of choosing Bag II, and A the event of drawing a black ball.
Then, P(E1) = P(E2) = 1/2
Also, P(A|E1) = P(drawing a black ball from Bag I) = 6/10 = 3/5
P(A|E2) = P(drawing a black ball from Bag II) = 3/7
By using Bayes' theorem, the probability that the black ball was drawn from Bag I out of the two bags is:
P(E1|A) = P(E1) P(A|E1) / (P(E1) P(A|E1) + P(E2) P(A|E2))
        = (1/2 x 3/5) / (1/2 x 3/5 + 1/2 x 3/7)
        = (3/10) / (3/10 + 3/14)
        = 7/12
What is the Naive Bayes Classifier Algorithm?
• The Naive Bayes algorithm is a supervised learning algorithm, which is based on Bayes' theorem and used for solving classification problems.
• It is mainly used in text classification that involves a high-dimensional training dataset.
• The Naive Bayes Classifier is one of the simplest and most effective classification algorithms, helping to build fast machine learning models that can make quick predictions.
• It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.
• Some popular examples of the Naive Bayes algorithm are spam filtration, sentiment analysis, and classifying articles.
Why is it called Naive Bayes?
The Naive Bayes algorithm is comprised of two words, Naive and Bayes, which can be described as:
• Naive: It is called naive because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For instance, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Hence each feature individually contributes to identifying it as an apple, without depending on the others.
• Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
What is Bayes' Theorem?
• Bayes' theorem is also known as Bayes' Rule or Bayes' law, and is used to determine the probability of a hypothesis with prior knowledge. It depends on conditional probability.
• The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) P(A) / P(B)
where, P(A|B) is the Posterior probability: probability of hypothesis A given the observed event B.
P(B|A) is the Likelihood probability: probability of the evidence given that the hypothesis is true.
P(A) is the Prior probability: probability of the hypothesis before observing the evidence.
P(B) is the Marginal probability: probability of the evidence.
Explain the Working of the Naive Bayes Classifier
The working of the Naive Bayes classifier can be understood with the help of the example below:
Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using this dataset, we need to decide whether we should play or not on a particular day according to the weather conditions. To solve this problem, we need to follow the steps below:
1. Convert the given dataset into frequency tables.
2. Generate the likelihood table by finding the probabilities of the given features.
3. Now, use Bayes' theorem to calculate the posterior probability.
Example 2.2: If the weather is sunny, should the player play or not?
Solution: To solve this, first consider the dataset below:
Outlook Play
0 Rainy Yes
1 Sunny Yes
2 Overcast Yes
3 Overcast Yes
4 Sunny No
5 Rainy Yes
6 Sunny Yes
7 Overcast Yes
8 Rainy No
9 Sunny No
10 Sunny Yes
11 Rainy No
12 Overcast Yes
13 Overcast Yes
Frequency table for the weather conditions:
Weather  | Yes | No
Overcast | 5   | 0
Rainy    | 2   | 2
Sunny    | 3   | 2
Total    | 10  | 4

Likelihood table for the weather conditions:
Weather  | No          | Yes          |
Overcast | 0           | 5            | 5/14 = 0.35
Rainy    | 2           | 2            | 4/14 = 0.29
Sunny    | 2           | 3            | 5/14 = 0.35
All      | 4/14 = 0.29 | 10/14 = 0.71 |
Applying Bayes' Theorem:
P(Yes|Sunny) = P(Sunny|Yes) × P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.3
P(Sunny) = 0.35
P(Yes) = 0.71
So P(Yes|Sunny) = 0.3 × 0.71 / 0.35 = 0.60

P(No|Sunny) = P(Sunny|No) × P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.5
P(No) = 0.29
P(Sunny) = 0.35
So P(No|Sunny) = 0.5 × 0.29 / 0.35 = 0.41
As we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny).
Hence, on a Sunny day, the Player can play the game.
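The same numbers can be reproduced with a few lines of Python. The sketch below is only illustrative (it is not part of the textbook code); it hard-codes the counts from the frequency table above:
# Counts from the weather dataset: 10 Yes, 4 No; Sunny occurs 5 times
# (3 times with Yes, 2 times with No).
p_sunny_given_yes = 3 / 10            # P(Sunny|Yes)
p_sunny_given_no = 2 / 4              # P(Sunny|No)
p_yes, p_no = 10 / 14, 4 / 14         # priors P(Yes), P(No)
p_sunny = 5 / 14                      # marginal P(Sunny)

print(p_sunny_given_yes * p_yes / p_sunny)  # P(Yes|Sunny) = 0.6
print(p_sunny_given_no * p_no / p_sunny)    # P(No|Sunny) = 0.4 (0.41 above comes from rounding)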
Advantages of the Naive Bayes Classifier:
* Naive Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
* It can be used for binary as well as multi-class classification.
* It performs well in multi-class predictions as compared to other algorithms.
* It is the most popular choice for text classification problems.
Disadvantages of the Naive Bayes Classifier:
* Naive Bayes assumes that all features are independent or unrelated, so it cannot learn the relationship between features.
Applications of the Naive Bayes Classifier:
* It is used for credit scoring.
* It is used in medical data classification.
* It can be used for real-time predictions because the Naive Bayes Classifier is an eager learner.
* It is used in text classification, such as spam filtering and sentiment analysis.
What are the Types of Naive Bayes Model?
There are three types of Naive Bayes model, which are given below:
* Gaussian: The Gaussian model assumes that features follow a normal distribution. This means that if predictors take continuous values instead of discrete ones, the model assumes these values are sampled from the Gaussian distribution.
* Multinomial: The Multinomial Naive Bayes classifier is used when the data is multinomially distributed. It is primarily used for document classification problems, i.e., deciding which category a particular document belongs to, such as Sports, Politics, Education, etc. The classifier uses the frequency of words as the predictors.
* Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor variables are independent Boolean variables, such as whether a particular word is present or not in a document. This model is also well known for document classification tasks.
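As a quick illustration (a sketch, not from the text), all three variants share the same scikit-learn interface and differ only in the distribution they assume for the features:
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

gnb = GaussianNB()     # continuous features, e.g. age and salary
mnb = MultinomialNB()  # count features, e.g. word frequencies in documents
bnb = BernoulliNB()    # binary features, e.g. word present / absent
# Each is trained and used the same way: model.fit(X, y); model.predict(X_new)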
Dataset for implementation: save the data shown below as 'user_data.csv' in MS Excel. The dataset has the columns User ID, Gender, Age, EstimatedSalary, and Purchased (0/1).
[Fig. 2.6: Sample rows of the user_data.csv dataset.]
Implementation of the Naive Bayes Algorithm
Now we will implement the Naive Bayes algorithm using Python. For this, we will use the "user_data" dataset, which we have used in our other classification models. Therefore, we can easily compare the Naive Bayes model with the other models.
Steps to Implement:
* Data Pre-processing step.
* Fitting Naive Bayes to the Training set.
* Predicting the test result.
* Test accuracy of the result (Creation of Confusion matrix).
* Visualizing the test set result.
1. Data Pre-processing step:
In this step, we will pre-process/prepare the data so that we can use it efficiently in our code. It is similar to what we did earlier for data pre-processing. The code for this is given below:
# Importing the libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('user_data.csv')
x = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
In the above code, we have loaded the dataset into our program using dataset = pd.read_csv('user_data.csv'). The loaded dataset is divided into training and test sets, and then we have scaled the feature variables.
The output for the dataset is given as:
[Fig. 2.7: The loaded user_data dataset.]
2. Fitting Naive Bayes to the Training Set:
After the pre-processing step, now we will fit the Naive Bayes model to the Training set. Below is the code for it:
# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(x_train, y_train)
In the above code, we have used the GaussianNB classifier to fit it to the training dataset. We can also use other classifiers as per our requirement.
Output:
Out[6]: GaussianNB(priors=None, var_smoothing=1e-09)
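Since Naive Bayes is a probabilistic classifier, we can also look at the class probabilities behind each prediction. A small optional check (a sketch, assuming the x_test array from the pre-processing step above):
# Posterior probabilities for the first five test samples;
# each row is [P(class 0), P(class 1)] and sums to 1.
print(classifier.predict_proba(x_test[:5]))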
3. Prediction of the Test Set Result:
Now we will predict the test set result. For this, we will create a new prediction vector y_pred, and will use the predict function to make the predictions.
# Predicting the Test set results
y_pred = classifier.predict(x_test)
Output:
[Fig. 2.8: The prediction vector y_pred alongside the real vector y_test.]
The above output shows the result for the prediction vector y_pred and the real vector y_test. We can see that some predictions differ from the real values; these are the incorrect predictions.
4. Creating Confusion Matrix:
Now we will check the accuracy of the Naive Bayes classifier using the Confusion matrix. Below is the code for it:
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
Output:
[Fig. 2.9: The confusion matrix cm.]
As we can see in the above confusion matrix output, there are 7 + 3 = 10 incorrect predictions, and 65 + 25 = 90 correct
predictions.
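As a quick sanity check (a sketch, not part of the original code), the same accuracy can be read off the confusion matrix directly or computed with accuracy_score:
from sklearn.metrics import accuracy_score

# Correct predictions lie on the diagonal of the confusion matrix.
print(cm.trace() / cm.sum())            # (65 + 25) / 100 = 0.9
print(accuracy_score(y_test, y_pred))   # same value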
5. Visualizing the training set result:
Next we will visualize the training set result using Naive Bayes Classifier. Below is the code for it:
# Visualising the Training set results
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
X1, X2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(X1, X2, classifier.predict(nm.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(X1.min(), X1.max())
mtp.ylim(X2.min(), X2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Naive Bayes (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output:
[Fig. 2.10: Naive Bayes (Training set) — decision regions plotted over Age vs. Estimated Salary.]
In the above output, we can see that the Naive Bayes classifier has segregated the data points with a fine boundary. The boundary is a Gaussian curve, as we have used the GaussianNB classifier in our code.
6. Visualizing the Test set result:
# Visualising the Test set results
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
X1, X2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(X1, X2, classifier.predict(nm.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(X1.min(), X1.max())
mtp.ylim(X2.min(), X2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Naive Bayes (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output:
[Fig. 2.11: Naive Bayes (Test set) — decision regions plotted over Age vs. Estimated Salary.]
DECISION TREE
* Decision Tree Diagram,
* Why use decision tree?
* Working of the decision tree algorithm,
* Attribute Selection Measures (ASM),
* Advantages and Disadvantages of decision tree,
* Implementation of Decision Tree
Introduction: Decision Tree
As a marketing manager, your objective is to identify a certain group of clients who have the highest probability of
buying your goods. Discovering your target demographic is a strategic approach to economize your marketing expenditure.
To attain a reduced loan default rate, it is necessary for loan managers to identify loan applications that carry a high level of
risk. The act of categorizing clients into groups based on their potential or lack thereof, or distinguishing between safe and
dangerous loan applications, is referred to as a classification challenge.
Classification is a two-step procedure, consisting of a learning phase and a prediction phase. During the learning phase,
the model is constructed using the provided training data. During the prediction stage, the model utilizes the available data to
forecast the expected response. A Decision tree is a straightforward and widely used categorization technique that facilitates
the comprehension and interpretation of data. It may be used for both categorization and estimation tasks.
The Decision Tree Algorithm
A decision tree is a hierarchical structure resembling a flowchart, in which each internal node represents a specific
characteristic or attribute, each branch represents a decision rule, and each leaf node represents a particular conclusion.
The highest-level node in a decision tree is referred to as the root node. It acquires the ability to divide data based on the
attribute value. The tree is divided into smaller parts using a recursive method known as recursive partitioning. This
diagrammatic framework facilitates the process of decision-making. It is a kind of visualization that resembles a flowchart
diagram and effectively replicates human-level reasoning. Hence, decision trees possess a high level of comprehensibility and
interpretability.
[Fig. 2.12: A decision tree with a root node, internal decision nodes, and leaf nodes (example: understanding the risks to prevent a heart attack).]
A decision tree is a transparent machine learning technique. The algorithm discloses its core decision-making logic,
unlike black box algorithms such as neural networks. The training time of this approach is much quicker than that of the
neural network technique.
The time complexity of decision trees is determined by the total number of records and attributes in the given data. The decision tree is a distribution-free or non-parametric technique that does not rely on assumptions about probability distributions. Decision trees are capable of accurately processing data with a large number of dimensions.
How does the Decision Tree Algorithm work?
The fundamental concept behind any decision tree algorithm is as follows:
Utilize Attribute Selection Measures (ASM) to determine the optimal attribute for splitting the data.
Transform the attribute into a decision node and partition the dataset into smaller subsets.
Construct the tree by repeating this procedure recursively for each child until one of the following conditions is met:
All the tuples belong to the same attribute value.
There are no more attributes left.
There are no more instances left.
[Fig. 2.13: Decision tree generation — an ASM (Information Gain, Gini Index, or Gain Ratio) breaks the dataset into smaller subsets; the process repeats recursively for each child; performance is evaluated with measures such as accuracy, precision, and recall.]
Attribute Selection Measures:
An attribute selection measure is a heuristic used to choose the splitting criterion that divides the data in the most optimal way. It is also referred to as a splitting rule because it determines breakpoints for tuples at a given node. An ASM assigns a numerical value to each feature (or attribute) based on the information in the dataset, and the attribute with the highest score is chosen as the splitting attribute. For a continuous-valued attribute, split points for the branches must also be defined. Information Gain, Gain Ratio, and Gini Index are the most often used selection measures.
Information Gain:
Information Gain is a concept introduced by Claude Shannon, who developed the notion of entropy as a measure of the impurity of the input set. In physics and mathematics, entropy is defined as the degree of unpredictability or impurity in a system. In information theory, "impurity" refers to the presence of non-homogeneous or mixed elements within a set of samples. Information gain is the reduction in entropy: it quantifies the difference between the entropy before the split and the weighted average entropy after splitting the dataset on the given attribute values. The ID3 (Iterative Dichotomiser) decision tree method employs information gain.

Info(D) = − Σᵢ pᵢ log₂(pᵢ)

Where pᵢ is the probability that an arbitrary tuple in D belongs to class Cᵢ.

Info_A(D) = Σⱼ₌₁..v (|Dⱼ| / |D|) × Info(Dⱼ)

Gain(A) = Info(D) − Info_A(D)

Where,
Info(D) is the average amount of information needed to identify the class label of a tuple in D.
|Dⱼ| / |D| acts as the weight of the j-th partition.
Info_A(D) is the expected information required to classify a tuple from D based on the partitioning by A.
The attribute A with the highest information gain, Gain(A), is chosen as the splitting attribute at node N.
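The sketch below (illustrative, not from the text) computes Info(D), Info_A(D), and Gain(A) for the weather dataset used earlier, where D has 10 "Yes" and 4 "No" tuples and A is the Outlook attribute:
import math

def info(counts):
    # Entropy of a class distribution given as a list of class counts.
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

info_d = info([10, 4])  # Info(D), about 0.863

# Outlook partitions: Overcast (5 Yes, 0 No), Rainy (2, 2), Sunny (3, 2)
partitions = [[5, 0], [2, 2], [3, 2]]
info_a = sum(sum(p) / 14 * info(p) for p in partitions)  # Info_Outlook(D)
print(info_d - info_a)  # Gain(Outlook), about 0.23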
Gain Ratio:
Gain Ratio is a measure used in data mining to evaluate the quality of a split in a decision tree. It is calculated by dividing the information gain by the intrinsic (split) information.
Information gain is biased toward attributes with a larger number of distinct values, i.e., it prefers attributes with a high degree of variability. For example, imagine an attribute called customer_ID: splitting on it produces a pure partition for every record, which maximizes the information gain while generating useless splits and conveying no real information.
C4.5, a refinement of ID3, employs an extension of information gain known as the gain ratio. The gain ratio addresses this bias by normalizing the information gain with the Split Info. The J48 algorithm in Java is a direct implementation of the C4.5 method and may be found in the WEKA data mining tool.
SplitInfo_A(D) = − Σⱼ₌₁..v (|Dⱼ| / |D|) × log₂(|Dⱼ| / |D|)

Where |Dⱼ| / |D| acts as the weight of the j-th partition, and v is the number of discrete values in attribute A.
The gain ratio can be defined as:

Gain Ratio(A) = Gain(A) / SplitInfo_A(D)
The attribute with the highest gain ratio is chosen as the splitting attribute (Source).
Gini Index:
Another decision tree algorithm, CART (Classification and Regression Tree), uses the Gini method to create split points.

Gini(D) = 1 − Σᵢ₌₁..m pᵢ²

Where pᵢ is the probability that a tuple in D belongs to class Cᵢ.

The Gini Index considers a binary split for each attribute. You can compute a weighted sum of the impurity of each partition. If a binary split on attribute A partitions data D into D₁ and D₂, the Gini index of D is:

Gini_A(D) = (|D₁| / |D|) × Gini(D₁) + (|D₂| / |D|) × Gini(D₂)

In the case of a discrete-valued attribute, the subset that gives the minimum Gini index for that attribute is selected as the splitting subset. In the case of continuous-valued attributes, the strategy is to consider each pair of adjacent values as a possible split point, and the point with the smaller Gini index is chosen as the splitting point.

ΔGini(A) = Gini(D) − Gini_A(D)

The attribute with the minimum Gini index is chosen as the splitting attribute.
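Following the same pattern, here is a minimal sketch (not from the text) of the Gini computation for a binary split of the weather data into D₁ = Sunny and D₂ = not Sunny:
def gini(counts):
    # Gini impurity of a class distribution given as a list of class counts.
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

gini_d = gini([10, 4])                               # Gini(D), about 0.41
gini_a = 5/14 * gini([3, 2]) + 9/14 * gini([7, 2])   # weighted Gini of the split
print(gini_d - gini_a)                               # Gini(D) - Gini_A(D), the reduction in impurity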
Decision Tree Classifier Building In Scikit-learn
Importing Required Libraries: Let's first load the required libraries.
# Load libraries
import pandas as pd
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation
Loading Data:
Let's first load the required Pima Indian Diabetes dataset using pandas' read_csv function. You can download the Kaggle dataset to follow along.
col_names = ['pregnant’, 'glucose’, 'bp’, 'skin', ‘insulin’, 'bmi', 'pedigree’, 'age’, 'label']
# load dataset
pima = pd.read_csv("diabetes.csv", header=None, names=col_names)
pima.head()
   pregnant  glucose  bp  skin  insulin   bmi  pedigree  age  label
0         6      148  72    35        0  33.6     0.627   50      1
1         1       85  66    29        0  26.6     0.351   31      0
2         8      183  64     0        0  23.3     0.672   32      1
3         1       89  66    23       94  28.1     0.167   21      0
4         0      137  40    35      168  43.1     2.288   33      1
Feature Selection:
Here, you need to divide the given columns into two types of variables: the dependent (target) variable and the independent (feature) variables.
# Split dataset in features and target variable
feature_cols = ['pregnant', 'insulin', 'bmi', 'age', 'glucose', 'bp', 'pedigree']
X = pima[feature_cols]  # Features
y = pima.label          # Target variable
Splitting Data:
To understand model performance, dividing the dataset into a training set and a test set is a good strategy.
Let's split the dataset by using the function train_test_split(). You need to pass three parameters: features, target, and test_size.
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)  # 70% training and 30% test
Building Decision Tree Model:
Let's create a decision tree model using Scikit-learn.
# Create Decision Tree classifier object
clf = DecisionTreeClassifier()

# Train Decision Tree classifier
clf = clf.fit(X_train, y_train)

# Predict the response for the test dataset
y_pred = clf.predict(X_test)
Evaluating the Model:
Let's estimate how accurately the classifier can predict the outcome (whether a patient has diabetes).
Accuracy can be computed by comparing the actual test set values with the predicted values.
# Model Accuracy, how often is the classifier correct?
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
Accuracy: 0.6753246753246753
We got a classification rate of 67.53%, which is considered good accuracy. You can improve this accuracy by tuning the parameters of the decision tree algorithm.
Visualizing Decision Trees:
You can use Scikit-learn's export_graphviz function to display the tree within a Jupyter notebook. For plotting the tree, you also need to install graphviz and pydotplus.
pip install graphviz
pip install pydotplus
The export_graphviz function converts the decision tree classifier into a dot file, and pydotplus converts this dot file to PNG or a displayable form on Jupyter.
from sklearn.tree import export_graphviz
from six import StringIO
from IPython.display import Image
import pydotplus

dot_data = StringIO()
export_graphviz(clf, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True, feature_names=feature_cols, class_names=['0', '1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('diabetes.png')
Image(graph.create_png())
Output:
[Fig. 2.14: The full (unpruned) decision tree for the diabetes data.]
In the decision tree chart, each internal node has a decision rule that splits the data. The gini value shown in each node measures its impurity; a node is pure when all of its records belong to the same class, and such nodes are known as leaf nodes.
Here, the resultant tree is unpruned. This unpruned tree is hard to explain and not easy to understand. In the next section, let's optimize it by pruning.
Optimizing Decision Tree Performance:
* criterion: optional (default="gini"), the attribute selection measure. This parameter allows us to use a different attribute selection measure. Supported criteria are "gini" for the Gini index and "entropy" for information gain.
* splitter: string, optional (default="best"), the split strategy. This parameter allows us to choose the split strategy. Supported strategies are "best" to choose the best split and "random" to choose the best random split.
* max_depth: int or None, optional (default=None), the maximum depth of the tree. If None, nodes are expanded until all the leaves contain fewer than min_samples_split samples. A higher value of maximum depth causes overfitting, and a lower value causes underfitting.
In Scikit-learn, optimization of the decision tree classifier is performed only by pre-pruning. The maximum depth of the tree can be used as a control variable for pre-pruning. In the following example, you can plot a decision tree on the same data with max_depth=3. Other than the pre-pruning parameters, you can also try another attribute selection measure, such as entropy.
# Create Decision Tree classifier object
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3)

# Train Decision Tree classifier
clf = clf.fit(X_train, y_train)

# Predict the response for the test dataset
y_pred = clf.predict(X_test)

# Model Accuracy, how often is the classifier correct?
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
Accuracy: 0.7705627705627706
Well, the classification rate increased to 77.05%, which is better accuracy than the previous model.
Visualizing Decision Trees:
Let's make our decision tree a little easier to understand using the following code:
from six import StringIO
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus

dot_data = StringIO()
export_graphviz(clf, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True, feature_names=feature_cols, class_names=['0', '1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('diabetes.png')
Image(graph.create_png())
Here, we have completed the following steps:
* Imported the required libraries.
* Created a StringIO object called dot_data to hold the text representation of the decision tree.
* Exported the decision tree to the dot format using the export_graphviz function and wrote the output to the dot_data buffer.
* Created a pydotplus graph object from the dot-format representation of the decision tree stored in the dot_data buffer.
* Wrote the generated graph to a PNG file named "diabetes.png".
* Displayed the generated PNG image of the decision tree using the Image object from the IPython.display module.
[Fig. 2.15: The pruned decision tree (max_depth=3), with the first split on glucose.]
As you can see, this pruned model is less complex, more explainable, and easier to understand than the previous decision tree model plot.
Decision Tree Pros:
* Decision trees are easy to interpret and visualize.
* They can easily capture non-linear patterns.
* They require less data preprocessing from the user; for example, there is no need to normalize columns.
* They can be used for feature engineering, such as predicting missing values, and are suitable for variable selection.
* The decision tree makes no assumptions about distributions because of the non-parametric nature of the algorithm.
Decision Tree Cons:
* Sensitive to noisy data; it can overfit noisy data.
* A small variation (or variance) in the data can result in a different decision tree. This can be reduced by bagging and boosting algorithms.
* Decision trees are biased with imbalanced datasets, so it is recommended to balance the dataset before creating the decision tree.
What is a Decision Tree?
e The Decision Tree serves as a supervised learning technique applicable to both classification and regression
problems, though it is predominantly favoured for addressing classification issues.
® This classifier adopts a tree structure, with internal nodes representing dataset features, branches embodying
decision rules, and each leaf node signifying an outcome.
* Within the Decision Tree, two types of nodes exist:
o Decision Nodes, facilitating decisions with multiple branches, and
o Leaf Nodes, which present the outcomes without additional branches.
® Decisions or tests are executed based on the dataset’s features.
© This method provides a graphical representation offering potential solutions to problems or decisions under specific
conditions.
* The construction of the tree employs the CART algorithm, denoting the Classification and Regression Tree
algorithm.
* The decision tree operates by posing questions and, contingent on the responses (Yes/No), progressively subdivides
the tree into subtrees.
Below diagram explains the general structure of a decision tree:

[Fig. 2.16: General structure of a decision tree — a root node splits into decision nodes, which terminate in leaf nodes.]
Why use Decision Trees?
Reasons for using the Decision Tree:
* Decision Trees usually mimic human thinking ability while making a decision, so they are easy to understand.
* The logic behind the decision tree can be easily understood because it shows a tree-like structure.
Decision Tree Terminologies:
Root Node: The root node is where the decision tree starts. It represents the entire dataset, which further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output nodes; the tree cannot be segregated further after reaching a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and the other nodes are called the child nodes.
Explain Working of Decision Tree Algorithm:


In a decision tree, the process of predicting the class for a given dataset initiates from the root node of the tree. The
algorithm commences by comparing the values of the root attribute with the corresponding attribute in the dataset. Based on
this comparison, it navigates through the branches, moving to the next node in the tree.
At each subsequent node, the algorithm repeats the process, comparing the attribute value with the conditions specified
in the sub-nodes, and progresses further accordingly. This iterative comparison and traversal persist until the algorithm
arrives at a leaf node within the tree.
The entire sequence can be elucidated through the algorithm outlined below, providing a step-by-step guide to the
decision-making process:
1. Start at the Root Node: Begin with the root node of the decision tree.
2. Compare Attribute Values: Compare the attribute value of the current node with the corresponding attribute value
in the dataset.
3. Follow Branches: Based on the result of the comparison, follow the appropriate branch to the next node.
4. Repeat Process: Repeat steps 2 and 3 for each subsequent node encountered.
5. Reach Leaf Node: Continue this process until a leaf node is reached, signifying the final prediction or classification.
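A minimal sketch of this traversal, using a hypothetical dictionary-based tree (an illustration only, not the scikit-learn representation):
# Internal nodes store an attribute and branches; leaves are plain labels.
tree = {
    "attribute": "salary",
    "branches": {
        "high": {"attribute": "distance",
                 "branches": {"near": "Accept", "far": "Decline"}},
        "low": "Decline",
    },
}

def predict(node, sample):
    # Follow the branch matching the sample's attribute value until a leaf.
    while isinstance(node, dict):
        node = node["branches"][sample[node["attribute"]]]
    return node

print(predict(tree, {"salary": "high", "distance": "near"}))  # Accept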
Example:
A candidate with a job offer faces a decision: whether to accept or decline.
The decision tree begins with the root node, focusing on the Salary attribute determined by ASM.
The root node branches into the next decision node, considering the distance from the office, leading to a leaf node with the corresponding labels.
The subsequent decision node splits into two branches: one decision node (Cab facility) and one leaf node.
The final decision node further splits into two leaf nodes: Accepted offer and Declined offer.
The overall decision-making process is illustrated in the provided diagram.

[Fig. 2.17: Job-offer decision tree — the root node (Salary between $50,000 and $80,000) leads through decision nodes (distance from office, cab facility) to Accepted/Declined leaf nodes.]
What is an Attribute Selection Measure (ASM)?
Decision tree implementation involves the challenge of selecting the best attributes for both the root node and the sub-nodes.
A technique called the Attribute Selection Measure (ASM) is employed to address this issue effectively.
ASM helps in identifying the most suitable attributes for the different nodes within the tree.
Two widely used techniques for ASM are:
* Information Gain
* Gini Index
1. Information Gain:
Information gain is the measurement of the change in entropy after segmenting a dataset based on an attribute.
* It calculates how much information a feature provides us about a class.
* According to the value of information gain, we split the node and build the decision tree.
* A decision tree algorithm always tries to maximize the value of information gain, and the node/attribute having the highest information gain is split first. It can be calculated using the below formula:

Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]

Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies the randomness in the data. Entropy can be calculated as:

Entropy(S) = −P(yes) log₂ P(yes) − P(no) log₂ P(no)

Where, S = total number of samples
P(yes) = probability of yes
P(no) = probability of no
2. Gini Index:
* The Gini index is a measure of impurity or purity used while creating a decision tree in the CART (Classification and Regression Tree) algorithm.
* An attribute with a low Gini index should be preferred over one with a high Gini index.
* It only creates binary splits, and the CART algorithm uses the Gini index to create these binary splits.
* The Gini index can be calculated using the below formula:

Gini Index = 1 − Σⱼ pⱼ²
Advantages of Decision Tree
© Simple to understand, mirroring human decision-making processes.
e Effective for decision-related problem-solving.
o Facilitates consideration of all possible outcomes for a given problem.
Requires less data cleaning compared to alternative algorithms.
Disadvantages of Decision Tree
* Complex structure with numerous layers.
* Potential overfitting issues, which can be addressed with the Random Forest algorithm.
* Computational complexity may increase with more class labels.
Python Implementation of Decision Tree
Now we will implement the Decision tree using Python. For this, we will use the dataset "user_data.csv", which we have used in previous classification models. By using the same dataset, we can compare the Decision tree classifier with other classification models such as KNN, SVM, Logistic Regression, etc.
The steps will also remain the same, as given below:
* Data Pre-processing step.
* Fitting a Decision Tree algorithm to the Training set.
* Predicting the test result.
* Test accuracy of the result (Creation of Confusion matrix).
* Visualizing the test set result.
1. Data Pre-Processing Step:
Below is the code for the pre-processing step:
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# importing datasets
data_set = pd.read_csv('user_data.csv')

# Extracting Independent and dependent Variable
x = data_set.iloc[:, [2, 3]].values
y = data_set.iloc[:, 4].values

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 0)

# feature Scaling
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)
In the above code, we have pre-processed the data and loaded the dataset, which is given as:
[Fig. 2.18: The loaded user_data dataset (data_set DataFrame).]
2. Fitting a Decision Tree Algorithm to the Training Set:
Now we will fit the model to the training set. For this, we will import the DecisionTreeClassifier class from the sklearn.tree library. Below is the code for it:
# Fitting Decision Tree classifier to the training set
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier.fit(x_train, y_train)
In the above code, we have created a classifier object, in which we have passed two main parameters:
* criterion='entropy': Criterion is used to measure the quality of a split, which is calculated by the information gain given by entropy.
* random_state=0: For generating the random states.
Below is the output for this:
Out[8]:
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=0, splitter='best')
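As an optional check on the fitted tree (a sketch; these accessors are available in recent scikit-learn versions), we can inspect how large the unpruned tree has grown:
print(classifier.get_depth())     # depth of the fitted tree
print(classifier.get_n_leaves())  # number of leaf nodes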
3. Predicting the Test Result:
Now we will predict the test set result. We will create a new prediction vector y_pred. Below is the code for it:
# Predicting the test set result
y_pred = classifier.predict(x_test)
Output:
In the below output image, the predicted output and the real test output are given. We can clearly see that some values in the prediction vector differ from the real vector values. These are prediction errors.
[Fig. 2.19: The prediction vector y_pred alongside the real vector y_test.]
4. Test Accuracy of the Result (Creation of Confusion Matrix):
In the above output, we have seen that there were some incorrect predictions. If we want to know the number of correct and incorrect predictions, we need to use the confusion matrix. Below is the code for it:
# Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
Output:
[Fig. 2.20: The confusion matrix cm.]
In the above output image, we can see the confusion matrix, which has 6 + 3 = 9 incorrect predictions and 62 + 29 = 91 correct predictions. Therefore, we can say that compared to other classification models, the Decision Tree classifier made a good prediction.
5. Visualizing the Training Set Result:
Here we will visualize the training set result. To do so, we will plot a graph for the decision tree classifier. The classifier will predict Yes or No for the users who have either Purchased or Not purchased the SUV car, as we did in Logistic Regression. Below is the code for it:
# Visualizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Decision Tree Algorithm (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output:
[Fig. 2.21: Decision Tree Algorithm (Training set) — decision regions over Age vs. Estimated Salary.]
The above output is completely different from the other classification models. It has both vertical and horizontal lines that split the dataset according to the age and estimated salary variables.
As we can see, the tree is trying to capture each data point, which is a case of overfitting.
6. Visualizing the Test Set Result:
Visualization of the test set result will be similar to the visualization of the training set, except that the training set will be replaced with the test set.
# Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Decision Tree Algorithm (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output:
[Fig. 2.22: Decision Tree Algorithm (Test set) — decision regions over Age vs. Estimated Salary.]

RANDOM FOREST CLASSIFICATION
An Overview of Random Forests:
* Random forests are a popular supervised machine learning algorithm.
* Random forests are used for supervised machine learning, where there is a labeled target variable.
* Random forests can be used for solving regression (numeric target variable) and classification (categorical target variable) problems.
* Random forests are an ensemble method, meaning they combine predictions from other models.
* Each of the smaller models in the random forest ensemble is a decision tree.
How Random Forest Classification Works?
Imagine you have a complex problem to solve, and you gather a group of experts from different fields to provide their
input. Each expert provides their opinion based on their expertise and experience. Then, the experts would vote to arrive at a
final decision.
In a random forest classification, multiple decision trees are created using different random subsets of the data and
features. Each decision tree is like an expert, providing its opinion on how to classify the data. Predictions are made by
calculating the prediction for each decision tree, then taking the most popular result. (For regression, predictions use an
averaging technique instead.)
In the diagram below, we have a random forest with n decision trees, and we've shown the first 5, along with their predictions (either "Dog" or "Cat"). Each tree is exposed to a different number of features and a different sample of the original dataset, and as such, every tree can be different. Each tree makes a prediction. Looking at the first 5 trees, we can see that 4/5 predicted the sample was a Cat. The green circles indicate a hypothetical path the tree took to reach its decision. The random forest would count the number of predictions from decision trees for Cat and for Dog, and choose the most popular prediction.
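The voting step itself is simple. Here is a minimal sketch with hypothetical per-tree predictions (not part of the tutorial code):
from collections import Counter

tree_predictions = ["Cat", "Dog", "Cat", "Cat", "Cat"]  # outputs of 5 trees
label, votes = Counter(tree_predictions).most_common(1)[0]
print(label, votes)  # Cat 4 -> the forest predicts "Cat"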
[Fig. 2.23: Five of the n decision trees in the forest — Tree 1: Cat, Tree 2: Dog, Tree 3: Cat, Tree 4: Cat, Tree 5: Cat.]
The Dataset:
This dataset consists of direct marketing campaigns by a Portuguese banking institution using phone calls. The campaigns aimed to sell subscriptions to a bank term deposit. We are going to store this dataset in a variable called bank_data.
The columns we will use are:
age: The age of the person who received the phone call.
default: Whether the person has credit in default.
cons.price.idx: Consumer price index score at the time of the call.
cons.conf.idx: Consumer confidence index score at the time of the call.
y: Whether the person subscribed (this is what we are trying to predict).
Importing Packages:
The following packages and functions are used in this tutorial:
# Data Processing
import pandas as pd
import numpy as np

# Modelling
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score,
ConfusionMatrixDisplay
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from scipy.stats import randint

# Tree Visualisation
from sklearn.tree import export_graphviz
from IPython.display import Image
import graphviz
Random Forests Workflow:
To fit and train this model, we’ll be following The Machine Learning Workflow infographic; however, as our data is
pretty clean, we won’t be carrying out every step. We will do the following:
Feature engineering
Split the data
Train the model
Hyperparameter tuning
Assess model performance
Preprocessing Data for Random Forests:
Tree-based models are much more robust to outliers than linear models, and they do not need variables to be normalized to work. As such, we need to do very little preprocessing on our data.
* We will map our 'default' column, which contains no and yes, to 0's and 1's, respectively. We will treat unknown values as no for this example.
* We will also map our target, y, to 1's and 0's.
bank_data['default'] = bank_data['default'].map({'no': 0, 'yes': 1, 'unknown': 0})
bank_data['y'] = bank_data['y'].map({'no': 0, 'yes': 1})
Splitting the Data:
When training any supervised learning model, it is important to split the data into training and test data. The training data is used to fit the model: the algorithm uses the training data to learn the relationship between the features and the target. The test data is used to evaluate the performance of the model.
The code below splits the data into separate variables for the features and target, then splits those into training and test data.
# Split the data into features (X) and target (y)
X = bank_data.drop('y', axis=1)
y = bank_data['y']

# Split the data into training and test sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
Fitting and Evaluating the Model:
We first create an instance of the Random Forest model with the default parameters. We then fit this to our training data. We pass both the features and the target variable, so the model can learn.
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
At this point, we have a trained Random Forest model, but we need to find out whether it is making accurate predictions.
y_pred = rf.predict(X_test)
The simplest way to evaluate this model is using accuracy; we check the predictions against the actual values in the test
set and count up how many the model got right.
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Output:
Accuracy: 0.888
This is a pretty good score! However, we may be able to do better by optimizing our hyperparameters.
Visualizing the Results:
We can use the following code to visualize our first 3 trees.
# Export the first three decision trees from the forest
for i in range(3):
tree = rf.estimators_[i]
dot_data = export_graphviz(tree,
feature_names=X_train.columns,
filled=True,
max_depth=2,
impurity=False,
proportion=True)
graph = graphviz.Source(dot_data)
display(graph)
[Fig. 2.24: The first decision tree from the forest, truncated to depth 2 (root split on cons.conf.idx <= -35.45).]
[Fig. 2.25: The second decision tree from the forest, truncated to depth 2.]
[Fig. 2.26: The third decision tree from the forest, truncated to depth 2.]
Each tree image is limited to showing only the first few nodes; these trees can get very large and difficult to visualize. The colors represent the majority class of each node (box), with red indicating majority 0 (no subscription) and blue indicating majority 1 (subscription). The colors get darker the closer the node gets to being fully 0 or 1. Each node also contains the following information:
The variable name and value used for splitting.
The % of total samples in each split.
The % split between classes in each split.
Hyperparameter Tuning:
The code below uses Scikit-Learn’s Randomized Search CV, which will randomly search parameters within a range per
hyperparameter. We define the hyperparameters to use and their ranges in the param_dist dictionary. In our case, we are
using:
n_estimators: the number of decision trees in the forest. Increasing this hyperparameter generally improves the
performance of the model but also increases the computational cost of training and predicting.
max_depth: the maximum depth of each decision tree in the forest. Setting a higher value for max_depth can lead to
overfitting while setting it too low can lead to underfitting.
param_dist = {'n_estimators': randint(50, 500),
              'max_depth': randint(1, 20)}

# Create a random forest classifier
rf = RandomForestClassifier()

# Use random search to find the best hyperparameters
rand_search = RandomizedSearchCV(rf,
                                 param_distributions=param_dist,
                                 n_iter=5,
                                 cv=5)

# Fit the random search object to the data
rand_search.fit(X_train, y_train)
RandomizedSearchCV will train many models (defined by n_iter) and save each one as a variable. The code below creates a variable for the best model and prints its hyperparameters. In this case, we haven't passed a scoring system to the function, so it defaults to accuracy. This function also uses cross validation, which means it splits the data into five equal-sized groups, using 4 to train and 1 to test the result. It will loop through each group, give an accuracy score for each, and average the scores to find the best model.
# Create a variable for the best model
best_rf = rand_search.best_estimator_

# Print the best hyperparameters


print('Best hyperparameters:', rand_search.best_params_)
Output:
Best hyperparameters: {'max_depth': 5, 'n_estimators': 260}
More Evaluation Metrics:
Let's look at the confusion matrix. This plots what the model predicted against what the correct prediction was. We can use it to understand the tradeoff between false positives (top right) and false negatives (bottom left). We can plot the confusion matrix using this code:
# Generate predictions with the best model
y_pred = best_rf.predict(X_test)

# Create the confusion matrix


cm = confusion_matrix(y_test, y_pred)

ConfusionMatrixDisplay(confusion_matrix=cm).plot();
[Fig. 2.27: Confusion matrix plot — true label vs. predicted label.]
We should also evaluate the best model with accuracy, precision, and recall (note: your results may differ due to randomization).
y_pred = best_rf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
Output:
Accuracy: 0.885
Precision: 0.578
Recall: 0.0873
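For reference (a sketch, assuming the binary confusion matrix cm computed above, which scikit-learn lays out as [[TN, FP], [FN, TP]]), these metrics follow directly from the matrix entries:
tn, fp, fn, tp = cm.ravel()
print((tp + tn) / (tp + tn + fp + fn))  # accuracy: correct / all
print(tp / (tp + fp))                   # precision: correct positives / predicted positives
print(tp / (tp + fn))                   # recall: correct positives / actual positives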
The below code plots the importance of each feature, using the model's internal score to find the best way to split the data within each decision tree.
# Create a series containing feature importances from the model and feature names from the training data
feature_importances = pd.Series(best_rf.feature_importances_, index=X_train.columns).sort_values(ascending=False)

# Plot a simple bar chart
feature_importances.plot.bar();

This tells us that the consumer confidence index at the time of the call was the biggest predictor of whether the person subscribed.

[Fig. 2.28: Bar chart of feature importances — cons.conf.idx, cons.price.idx, age, default.]
Random Forest Algorithm
* "Random Forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset."
* Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of predictions, predicts the final output.
* A greater number of trees in the forest leads to higher accuracy and prevents the problem of overfitting.
Why use Random Forest?
* It takes less training time as compared to other algorithms.
* It predicts output with high accuracy; even for a large dataset it runs efficiently.
* It can also maintain accuracy when a large proportion of the data is missing.
Working of Random Forest Algorithm:
The working process can be explained by the below steps and diagram:
Step 1: Select K random data points from the training set.
Step 2: Build the decision trees associated with the selected data points (subsets).
Step 3: Choose the number N of decision trees that you want to build.
Step 4: Repeat Steps 1 and 2.
Step 5: For new data points, find the prediction of each decision tree, and assign the new data points to the category that wins the majority vote.
The working of the algorithm can be better understood by the below example:
Example: Suppose there is a dataset that contains multiple fruit images. This dataset is given to the Random forest classifier. The dataset is divided into subsets and given to each decision tree. During the training phase, each decision tree produces a prediction result, and when a new data point occurs, the Random Forest classifier predicts the final decision based on the majority of results. Consider the below image:

[Fig. 2.30: Random forest — an instance is passed to Tree-1, Tree-2, ..., Tree-n; each tree votes for Class-A or Class-B, and the majority class is the final prediction.]
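The five steps above can be sketched directly in Python. The following is an illustration under stated assumptions (NumPy arrays x_train/y_train, scikit-learn trees, and a simple bootstrap sample), not the library's actual implementation:
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def random_forest_predict(x_train, y_train, x_new, n_trees=10, k=50):
    rng = np.random.default_rng(0)
    votes = []
    for _ in range(n_trees):                                   # Steps 3-4: build N trees
        idx = rng.choice(len(x_train), size=k, replace=True)   # Step 1: random subset
        tree = DecisionTreeClassifier().fit(x_train[idx], y_train[idx])  # Step 2
        votes.append(tree.predict(x_new))
    # Step 5: majority vote across trees for each new data point.
    return [Counter(col).most_common(1)[0][0] for col in zip(*votes)]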
Applications of Random Forest
There are mainly four sectors where Random Forest is mostly used:
1. Banking: The banking sector mostly uses this algorithm for the identification of loan risk.
2. Medicine: With the help of this algorithm, disease trends and risks of a disease can be identified.
3. Land Use: We can identify areas of similar land use with this algorithm.
4. Marketing: Marketing trends can be identified using this algorithm.
Advantages of Random Forest
* Random Forest is capable of performing both Classification and Regression tasks.
* It is capable of handling large datasets with high dimensionality.
* It enhances the accuracy of the model and prevents the overfitting issue.
Disadvantages of Random Forest
* Although random forest can be used for both classification and regression tasks, it is not as suitable for regression tasks.
Implementation of Random Forest Algorithm
* Data Pre-processing step.
* Fitting the Random forest algorithm to the Training set.
® Predicting the test result.
* Test accuracy of the result (Creation of Confusion matrix).

* Visualizing the test set result.

1. Data Pre-Processing Step:


Below is the code for the pre-processing step:
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# importing datasets
data_set = pd.read_csv('user_data.csv')

# Extracting Independent and dependent Variable
x = data_set.iloc[:, [2, 3]].values
y = data_set.iloc[:, 4].values

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 0)

# feature Scaling
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)
In the above code, we have pre-processed the data and loaded the dataset, which is given as:
[Fig. 2.31: The loaded user_data dataset (data_set DataFrame).]
2. Fitting the Random Forest Algorithm to the Training Set:
Now we will fit the Random forest algorithm to the training set. To fit it, we will import the RandomForestClassifier class from the sklearn.ensemble library. The code is given below:
# Fitting Random Forest classifier to the training set
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=10, criterion='entropy')
classifier.fit(x_train, y_train)
In the above code, the classifier object takes the below parameters:
* n_estimators: The required number of trees in the Random Forest. The default value is 10. We can choose any number, but we need to take care of the overfitting issue.
* criterion: It is a function to analyze the quality of the split. Here we have taken "entropy" for the information gain.
Output:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
3. Predicting the Test Set Result:
Since our model is fitted to the training set, we can now predict the test result. For the prediction, we will create a new prediction vector y_pred. Below is the code for it:
# Predicting the test set result
y_pred = classifier.predict(x_test)
Output:
The prediction vector is given as:
[Fig. 2.32: The prediction vector y_pred.]
By checking the above prediction vector against the test set real vector, we can determine the incorrect predictions made by the classifier.
4. Creating the Confusion Matrix:
Now we will create the confusion matrix to determine the correct and incorrect predictions. Below is the code for it:
# Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
Output:
[Fig. 2.33: The confusion matrix cm.]
As we can see in the above matrix, there are 4 + 4 = 8 incorrect predictions and 64 + 28 = 92 correct predictions.
5. Visualizing the Training Set Result:
Here we will visualize the training set result. To visualize the training set result we will plot a graph for the Random
forest classifier. The classifier will predict yes or No for the users who have either Purchased or Not purchased the SUV car
as we did in Logistic Regression. Below is the code for it:
# Visualizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Random Forest Algorithm (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output:
Fig. 2.34: Random Forest Algorithm (Training set)
The above image is the visualization result for the Random Forest classifier working with the training set. It is very similar to the Decision Tree classifier. Each data point corresponds to a user from user_data, and the purple and green regions are the prediction regions: the purple region is classified for the users who did not purchase the SUV car, and the green region is for the users who purchased it.
So, in the Random Forest classifier, we have taken 10 trees that predicted Yes or No for the Purchased variable. The classifier took the majority of the predictions and provided the result, as the sketch below makes concrete.
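To make the voting concrete, scikit-learn exposes the individual fitted trees through the classifier's estimators_ attribute. The following is our own minimal sketch, not part of the original walkthrough; it assumes the classifier and x_test objects created above:
# A sketch of the majority vote across the 10 trees (illustrative only).
# Note: scikit-learn averages the trees' predicted probabilities, which for
# fully grown trees coincides with a hard majority vote.
import numpy as np
sample = x_test[0].reshape(1, -1)  # one scaled test sample
tree_votes = np.array([int(tree.predict(sample)[0]) for tree in classifier.estimators_])
print("Individual tree votes:", tree_votes)
print("Majority decision:", np.bincount(tree_votes).argmax())
print("Classifier's output:", classifier.predict(sample)[0])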
6. Visualizing the Test Set Result:
Now we will visualize the test set result. Below is the code for it:
#Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Random Forest Algorithm (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output:
Fig.: Random Forest Algorithm (Test set), Age vs. Estimated Salary

1. What is Naive Bayes Classifier Algorithm?
2. Why is it called Naive Bayes?
3. Describe working of Naive Bayes' Classifier.
4. Describe applying Bayes' theorem.
5. Write advantages and disadvantages of Naive Bayes Classifier.
6. What are the applications of Naive Bayes Classifier?
7. What are the types of Naive Bayes model?
8. Write Python implementation of the Naive Bayes algorithm.
9. What is Decision Tree?
10. Why use Decision Trees?
11. Write any four Decision Tree terminologies.
12. Explain working of Decision Tree algorithm.
13. What is Attribute Selection Measures (ASM)?
14. Write advantages and disadvantages of Decision Tree.
15. Write Python implementation of Decision Tree.
16. What is Random Forest Algorithm?
17. Why use Random Forest? Or explain the need of Random Forest.
18. Explain working of Random Forest Algorithm.
19. Write applications of Random Forest Algorithm.
20. Write advantages and disadvantages of Random Forest algorithm.
21. Write Python implementation of Random Forest algorithm.
Supervised Learning:
Support Vector Machines,
K Nearest Neighbour
Chapter Outcomes...
After reading this chapter, students will be able to understand :
The concept of Support Vector Machines.

Types and working of SVM.


Advantages and disadvantages of SVM.
Implementation of SVM.
Concept of K nearest neighbours.
Need and working of KNN algorithm.
Advantages and disadvantages of KNN algorithm.
Implementation of KNN algorithm.

Learning Objectives...
* Describe Support Vector Machines.
* Enlist advantages and disadvantages of KNN algorithm.

SUPPORT VECTOR MACHINES


SVM often gives more accurate results in comparison to other classifiers such as logistic regression and decision trees.
The kernel technique is known for its ability to handle non-linear input spaces.
It is used in many applications like facial identification, intrusion detection, categorization of emails, news articles, and
web pages, classification of genes, and handwriting recognition.
The Support Vector Machine (SVM) method is a captivating computational model with notions that are reasonably
straightforward.
The classifier partitions data points by using a hyperplane that maximizes the margin.
An SVM classifier is often referred to as a discriminative classifier due to its distinguishing characteristics.
The Support Vector Machine (SVM) algorithm identifies an ideal hyperplane that aids in the classification of fresh data
points.
Support Vector Machines (SVM) is primarily recognized as a classification method, although it can be used for regression tasks as well.
It has the capability to effectively process and analyze both continuous and categorical information. Support Vector
Machines (SVM) use a hyperplane in a multidimensional space to effectively distinguish between distinct classes.
The Support Vector Machine (SVM) algorithm iteratively builds an ideal hyperplane to reduce errors.
The fundamental concept of Support Vector Machines (SVM) is to identify a hyperplane with the greatest margin, known as the Maximum Marginal Hyperplane (MMH), which effectively separates the dataset into distinct groups.

Fig. 3.1: Hyperplane with maximum margin; the support vectors are the points of each class closest to the hyperplane
Support Vectors:
* Support vectors refer to the specific data points that are located closest to the hyperplane.
* These points define the dividing line more precisely, since the margin is computed from them.
* They are the points most relevant to the construction of the classifier.
Hyperplane:
A hyperplane is a plane that is used to make decisions by separating a group of objects with various class memberships.
Margin:
* A margin refers to the gap between the two lines drawn through the nearest points of the two classes.
* It is calculated as the perpendicular distance from the line to the support vectors or nearest points.
* A wide gap between the classes indicates a good margin, whereas a narrower margin is considered bad.
What is the mechanism behind Support Vector Machines (SVM)?
The primary goal is to categorize the provided dataset in the most optimal manner. The gap between the closest points is referred to as the margin. The goal is to choose a hyperplane with the largest feasible distance between it and the support vectors in the provided dataset.
The SVM algorithm seeks to identify the hyperplane with the largest margin using the following steps:
* Create hyperplanes that separate the classes. The left-hand figure displays three hyperplanes: one black, one blue, and one orange. The blue and orange ones exhibit significant classification error, while the black one accurately separates the two groups.
* Choose the optimal hyperplane, the one with the greatest separation from the closest data points, as seen in the figure on the right-hand side.

Fig. 3.2: Candidate hyperplanes (left) and the optimal maximum-margin hyperplane (right)

Dealing with Non-linear and Inseparable Planes

Some problems cannot be solved using a linear hyperplane, as shown in the figure below (left-hand side). In such a situation, SVM uses a kernel trick to transform the input space into a higher-dimensional space, as shown on the right. The data points are plotted on the x-axis and z-axis (z is the squared sum of x and y: z = x² + y²). Now you can easily segregate these points using linear separation.
Fig. 3.3: Inseparable data in the original space (left) becomes linearly separable after the kernel transformation (right)
Support Vector Machine Kernels
The SVM method is practically implemented by using a kernel. A kernel maps an input data space to the desired format. The Support Vector Machine (SVM) algorithm employs a method known as the kernel trick: the kernel function maps a lower-dimensional input space to a higher-dimensional space.
Put simply, it makes an inseparable problem separable by introducing more dimensions. It is particularly beneficial in situations where non-linear data must be separated, and the kernel technique enhances the accuracy of the classifier.
* Linear Kernel: A linear kernel can be used as a normal dot product of any two given observations. The product between two vectors is the sum of the multiplication of each pair of input values.
K(x, xi) = sum(x * xi)
* Polynomial Kernel: A polynomial kernel is a more generalized form of the linear kernel. The polynomial kernel can distinguish curved or non-linear input spaces.
K(x, xi) = (1 + sum(x * xi))^d
where d is the degree of the polynomial; d = 1 is similar to the linear transformation. The degree needs to be manually specified in the learning algorithm.
* Radial Basis Function (RBF) Kernel: The radial basis function kernel is a popular kernel function commonly used in support vector machine classification. RBF can map an input space into an infinite-dimensional space.
K(x, xi) = exp(-gamma * sum((x - xi)^2))
Here gamma is a parameter, which ranges from 0 to 1. A higher value of gamma will fit the training dataset too closely, causing over-fitting; gamma = 0.1 is a good default value. The value of gamma needs to be manually specified in the learning algorithm.
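As a quick illustration (our own sketch, not from the original text), the three kernel formulas above can be written directly in NumPy; x and xi are plain feature vectors, and the parameter values are arbitrary:
import numpy as np

def linear_kernel(x, xi):
    # Dot product: the sum of the element-wise products of the two vectors.
    return np.dot(x, xi)

def polynomial_kernel(x, xi, d=2):
    # (1 + x . xi)^d; with d = 1 this reduces to a (shifted) linear kernel.
    return (1 + np.dot(x, xi)) ** d

def rbf_kernel(x, xi, gamma=0.1):
    # exp(-gamma * ||x - xi||^2); gamma controls how far the kernel 'reaches'.
    return np.exp(-gamma * np.sum((x - xi) ** 2))

x, xi = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(linear_kernel(x, xi), polynomial_kernel(x, xi, d=2), rbf_kernel(x, xi))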
Classifier Building in Scikit-Learn
So far, you have acquired knowledge of the theoretical underpinnings of Support Vector Machines (SVM). Now you will learn how to build one in Python using scikit-learn.
For building the model, the breast cancer dataset may be used, which is a well-recognized binary classification problem. The data is derived from a digitized image of a fine needle aspirate (FNA) of a breast mass, and the features describe the attributes of the cell nuclei visible in the image.
The dataset consists of 30 features, including mean radius, mean texture, mean perimeter, mean area, mean smoothness,
mean compactness, mean concavity, mean concave points, mean symmetry, mean fractal dimension, radius error, texture
error, perimeter error, area error, smoothness error, compactness error, Concavity error, concave points error, symmetry error,
fractal dimension error, worst radius, worst texture, worst perimeter, worst area, worst smoothness, worst compactness, worst
concavity, worst concave points, worst symmetry, and worst fractal dimension. Additionally, there is a target variable
indicating the type of cancer.
The dataset consists of two categories of cancer: malignant (having the potential to cause harm) and benign (not having the potential to cause harm). Using it, you can construct a model that accurately categorizes the specific form of cancer. The dataset may be accessed through the scikit-learn library or downloaded from the UCI Machine Learning Repository.
Loading Data:
Let's first load the required dataset you will use.
#Import scikit-learn dataset library
from sklearn import datasets

#Load dataset
cancer = datasets.load_breast_cancer()
Exploring Data:
After you have loaded the dataset, you might want to know a little bit more about it. You can check feature and target
names.

# print the names of the 30 features
print("Features: ", cancer.feature_names)

# print the label types of cancer ('malignant', 'benign')
print("Labels: ", cancer.target_names)

Features: ['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']
Labels: ['malignant' 'benign']
Let's explore it for a bit more. You can also check the shape of the dataset using shape.
# print data (feature) shape
cancer.data.shape

(569, 30)
Let's check top 5 records of the feature set.
# print the cancer data features (top 5 records)
print(cancer.data[0:5])

[[1.799e+01 1.038e+01 1.228e+02 1.001e+03 1.184e-01 2.776e-01 3.001e-01
  1.471e-01 2.419e-01 7.871e-02 1.095e+00 9.053e-01 8.589e+00 1.534e+02
  6.399e-03 4.904e-02 5.373e-02 1.587e-02 3.003e-02 6.193e-03 2.538e+01
  1.733e+01 1.846e+02 2.019e+03 1.622e-01 6.656e-01 7.119e-01 2.654e-01
  4.601e-01 1.189e-01]
 [2.057e+01 1.777e+01 1.329e+02 1.326e+03 8.474e-02 7.864e-02 8.690e-02
  7.017e-02 1.812e-01 5.667e-02 5.435e-01 7.339e-01 3.398e+00 7.408e+01
  5.225e-03 1.308e-02 1.860e-02 1.340e-02 1.389e-02 3.532e-03 2.499e+01
  2.341e+01 1.588e+02 1.956e+03 1.238e-01 1.866e-01 2.416e-01 1.860e-01
  2.750e-01 8.902e-02]
 [1.969e+01 2.125e+01 1.300e+02 1.203e+03 1.096e-01 1.599e-01 1.974e-01
  1.279e-01 2.069e-01 5.999e-02 7.456e-01 7.869e-01 4.585e+00 9.403e+01
  6.150e-03 4.006e-02 3.832e-02 2.058e-02 2.250e-02 4.571e-03 2.357e+01
  2.553e+01 1.525e+02 1.709e+03 1.444e-01 4.245e-01 4.504e-01 2.430e-01
  3.613e-01 8.758e-02]
 [1.142e+01 2.038e+01 7.758e+01 3.861e+02 1.425e-01 2.839e-01 2.414e-01
  1.052e-01 2.507e-01 9.744e-02 4.956e-01 1.156e+00 3.445e+00 2.723e+01
  9.110e-03 7.458e-02 5.661e-02 1.867e-02 5.963e-02 9.208e-03 1.491e+01
  2.650e+01 9.887e+01 5.677e+02 2.098e-01 8.663e-01 6.869e-01 2.575e-01
  6.638e-01 1.730e-01]
 [2.029e+01 1.434e+01 1.351e+02 1.297e+03 1.003e-01 1.328e-01 1.980e-01
  1.043e-01 1.809e-01 5.883e-02 7.572e-01 7.813e-01 5.438e+00 9.444e+01
  1.140e-02 2.461e-02 5.688e-02 1.885e-02 1.756e-02 5.115e-03 2.254e+01
  1.667e+01 1.522e+02 1.575e+03 1.374e-01 2.050e-01 4.000e-01 1.625e-01
  2.364e-01 7.678e-02]]

Let's take a look at the target set.


# print the cancer labels (0:malignant, 1:benign)
print(cancer.target)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 0 0 1 0 0 1 1 1 1 0 1 0 0 1 1 1 1 0 1 0 0
 1 0 1 0 0 1 1 1 0 0 1 0 0 0 1 1 1 0 1 1 0 0 1 1 1 0 0 1 1 1 1 0 1 1 0 1 1
 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0 1 0 1 0 0 1 0 0 1 1 0 1 1 0 1 1 1 1 0 1
 1 1 1 1 1 1 1 1 0 1 1 1 1 0 0 1 0 1 1 0 0 1 1 0 0 1 1 1 1 0 1 1 0 0 0 1 0
 1 0 1 1 1 0 1 1 0 0 1 0 0 0 0 1 0 0 0 1 0 1 0 1 1 0 1 0 0 0 0 1 1 0 0 1 1
 1 0 1 1 1 1 1 0 0 1 1 0 1 1 0 0 1 0 1 1 1 1 0 1 1 1 1 1 0 1 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1
 1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 1 0 0 0 1 1
 1 1 0 1 0 1 0 1 1 1 0 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 1 0 0
 0 1 0 0 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1
 1 0 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 0 1 1 1 1 1 0 1 1
 0 1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1
 1 1 1 1 1 1 0 1 0 1 1 0 1 1 1 1 1 0 0 1 0 1 0 1 1 1 1 1 0 1 1 0 1 0 1 0 0
 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 0 0 0 0 0 0 1]

Splitting Data:
Dividing the dataset into a training set and a test set is a prudent method for assessing model performance.
Partition the dataset using the train_test_split() method. Ensure that you include three parameters: the features, the target, and the test set size. In addition, you may use the random_state parameter to choose records randomly.
# Import train_test_split function
from sklearn.model_selection import train_test_split

# Split dataset into training set and test set


X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target,
                                                    test_size=0.3, random_state=109) # 70% training and 30% test

Generating Model:
Let's build the support vector machine model. First, import the SVM module and create a support vector classifier object by passing the argument kernel='linear' to the SVC() function.
Then, fit your model on the training set using fit() and perform prediction on the test set using predict().
#Import svm model
from sklearn import svm

#Create a svm Classifier


clf = svm.SVC(kernel='linear') # Linear Kernel

#Train the model using the training sets


clf.fit(X_train, y_train)

#Predict the response for test dataset


y_pred = clf.predict(X_test)
Evaluating the Model:
Let's estimate how accurately the classifier or model can predict the breast cancer of patients.
Accuracy can be computed by comparing actual test set values and predicted values.
#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics

# Model Accuracy: how often is the classifier correct?


print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
Accuracy: 0.9649122807017544
Well, you got a classification rate of 96.49%, considered as very good accuracy.
For further evaluation, you can also check precision and recall of model.
# Model Precision: what percentage of predicted positive tuples are actually positive?
print("Precision:", metrics.precision_score(y_test, y_pred))

# Model Recall: what percentage of actual positive tuples are labelled as such?
print("Recall:", metrics.recall_score(y_test, y_pred))

Precision: 0.9811320754716981
Recall: 0.9629629629629629
Well, you got a precision of 98% and recall of 96%, which are considered as very good values.
Tuning Hyperparameters:
Kernel:
* The kernel's primary role is to convert the input data of a particular dataset into the necessary format.
* There are several categories of functions, including linear, polynomial, and radial basis function (RBF).
* Polynomial and radial basis function (RBF) kernels are effective for modelling non-linear hyperplanes.
* The polynomial and RBF kernels calculate the boundary line in a higher-dimensional space.
* For some applications, it is recommended to use a more intricate kernel in order to effectively distinguish between classes that exhibit curvature or non-linearity.
* This conversion has the potential to result in classifiers that are more precise.
Regularization:
* Regularization refers to a technique used in machine learning and statistics to prevent overfitting by adding a penalty term to the loss function.
* The regularization parameter in scikit-learn's Python library is denoted by the C parameter and is used to control the degree of regularization.
* The penalty parameter, denoted as C, reflects the misclassification or error term.
* The misclassification or error term informs the SVM optimization algorithm about the acceptable level of inaccuracy.
* This is how the balance between the decision boundary and the misclassification term is controlled.
* A lower value of C results in a hyperplane with a larger margin (at the cost of more training misclassifications), whereas a higher value of C leads to a hyperplane with a smaller margin that tries to classify every training point correctly.
Gamma:
A smaller gamma number will result in a less precise fit to the training dataset, whereas a larger gamma value will lead
to an exact match to the training dataset, resulting in overfitting.
Put simply, a low gamma value takes into account just the neighboring points when determining the separation line,
while a high gamma value takes into account all the data points in this computation.
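One common way to search for good values of C, gamma, and the kernel together is cross-validated grid search. The sketch below is our own illustration (the parameter ranges are our own choices, and it assumes the X_train and y_train arrays from the cancer example above), using scikit-learn's GridSearchCV:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Illustrative grid; these ranges are assumptions, not prescribed values.
param_grid = {
    'C': [0.1, 1, 10],
    'gamma': [0.01, 0.1, 1],
    'kernel': ['linear', 'rbf'],
}
grid = GridSearchCV(SVC(), param_grid, cv=5)  # 5-fold cross-validation
grid.fit(X_train, y_train)
print("Best parameters:", grid.best_params_)
print("Best CV accuracy:", grid.best_score_)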
Benefits:
® SVM classifiers provide superior accuracy and have speedier prediction capabilities in comparison to the Naive
Bayes method.
« Additionally, they use less memory since they employ a subset of training points during the decision phase.
* Support Vector Machines (SVM) perform optimally when there is a clear separation between data points and when working with spaces that have a large number of dimensions.
Drawbacks:
* Support Vector Machines (SVM) are not well-suited for big datasets because of their lengthy training period, which is also longer than that of Naive Bayes.
© It exhibits suboptimal performance when dealing with overlapping classes and is also highly dependent on the
choice of kernel.
WHAT IS SUPPORT VECTOR MACHINES (SVM)?
Support Vector Machines (SVM) is a supervised machine learning algorithm used for classification and regression tasks.
SVM works by finding the optimal hyperplane that best separates the data into different classes. The basic idea is to identify
the hyperplane that maximizes the margin between classes.
Fig. 3.4: The maximum margin hyperplane, with the positive and negative hyperplanes on either side
Example of SVM:
* Illustration using the scenario from the KNN classifier example.
* Imagine encountering a peculiar creature with features of both cats and dogs.
* Goal: develop a model using SVM to accurately determine whether it is a cat or a dog.
* Training the model involves using a dataset with numerous cat and dog images so it learns their distinct features.
* Testing the model with the strange creature, SVM creates a decision boundary between cat and dog data.
* SVM identifies extreme cases (support vectors) representing the distinctive features of cats and dogs.
* Based on the support vectors, the model classifies the creature as a cat.
* Refer to the provided diagram for visualization.

Fig. 3.5: Training on past labelled data and predicting the class of new data
The SVM algorithm can be used for face detection, image classification, text categorization, etc.
TYPES OF SVM
1. Linear SVM:
* Used for linearly separable data.
* Assumes that the data can be separated by a straight line.
2. Non-linear SVM:
* Used when the data is not linearly separable.
* Employs kernel functions to map the input data into a higher-dimensional space where a hyperplane can be used to separate the classes.

3. Support Vector Regression (SVR):


® Used for regression tasks instead of classification.
* Predicts a continuous output rather than a discrete class label.
4. Nu-Support Vector Classification (Nu-SVC):
* Like C-SVC, but uses a parameter (nu) to control the number of support vectors. (A minimal instantiation of each variant is sketched below.)
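For reference, all of these variants are available in scikit-learn's svm module. A minimal sketch of instantiating each (default parameters, purely illustrative):
from sklearn import svm

linear_clf = svm.SVC(kernel='linear')  # 1. Linear SVM
rbf_clf = svm.SVC(kernel='rbf')        # 2. Non-linear SVM via a kernel function
regressor = svm.SVR()                  # 3. Support Vector Regression
nu_clf = svm.NuSVC(nu=0.5)             # 4. Nu-SVC: nu bounds the fraction of support vectors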

WORKING OF SVM ALGORITHM


Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose we have a dataset that has two tags (green and blue), and the dataset has two features x1 and x2. We want a classifier that can classify the pair (x1, x2) of co-ordinates as either green or blue. Consider the below image:
Fig. 3.6: Two classes of points in a 2-D feature space
As it is a 2-D space, just by using a straight line we can easily separate these two classes. But there can be multiple lines that can separate these classes. Consider the below image:
Fig. 3.7: Multiple candidate lines separating the two classes
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane. The SVM algorithm finds the closest points of the lines from both classes. These points are called support vectors. The distance between the vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The hyperplane with the maximum margin is called the optimal hyperplane.
Fig. 3.8: Support vectors and the optimal hyperplane
Non-Linear SVM:
If the data is linearly arranged, we can separate it by using a straight line, but for non-linear data, we cannot draw a single straight line. Consider the below image:
Fig. 3.9: Non-linearly separable data
So to separate these data points, we need to add one more dimension. For linear data, we have used the two dimensions x and y, so for non-linear data, we will add a third dimension z. It can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:

Fig. 3.10: The sample space after adding the third dimension z

So now, SVM will divide the datasets into classes in the following way. Consider the below image:

Fig. 3.11: The best hyperplane separating the classes in the lifted 3-D space
Since we are in 3-D space, the hyperplane looks like a plane parallel to the x-axis. If we convert it to 2-D space with z = 1, it becomes:
Fig. 3.12: The decision boundary in the original 2-D space is a circle of radius 1
Hence, we get a circumference of radius 1 in the case of non-linear data. (A code demonstration of this dimension-lifting idea follows below.)
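The sketch below is our own demonstration with synthetic data, not part of the original example: it generates two concentric classes that no straight line can separate, adds the feature z = x² + y², and fits a linear SVM in the lifted 3-D space:
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the original (x, y) space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Lift to 3-D by adding z = x^2 + y^2.
z = (X ** 2).sum(axis=1).reshape(-1, 1)
X_lifted = np.hstack([X, z])

clf = SVC(kernel='linear')
clf.fit(X_lifted, y)
print("Training accuracy in the lifted space:", clf.score(X_lifted, y))  # close to 1.0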
Dataset for Implementation: Save the below data as 'user_data.csv' in MS-Excel:
User ID    Gender   Age   EstimatedSalary   Purchased
15624510   Male     19    19000             0
15810944   Male     35    20000             0
15668575   Female   26    43000             0
15603246   Female   27    57000             0
15804002   Male     19    76000             0
15728773   Male     27    58000             0
15598044   Female   27    84000             0
15694829   Female   32    150000            1
15600575   Male     25    33000             0
15727311   Female   35    65000             0
15570769   Female   26    80000             0
15606274   Female   26    52000             0
15746139   Male     20    86000             0
15704987   Male     32    18000             0
15628972   Male     18    82000             0
15697686   Male     29    80000             0
15733883   Male     47    25000             1
15617482   Male     45    26000             1
15704583   Male     46    28000             1
15621083   Female   48    29000             1
15649487   Male     45    22000             1
15736760   Female   47    49000             1
Fig. 3.13
Data Pre-processing step
Till the Data pre-processing step, the code will remain the same. Below is the code:
#Data Pre-processing Step
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set= pd.read_csv('user_data.csv')

#Extracting Independent and dependent Variable


x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values

# Splitting the dataset into training and test set.


from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)
#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
After executing the above code, the data will be pre-processed. The dataset will look as follows:
Fig. 3.14: The pre-processed dataset (data_set viewed in the variable explorer)
The scaled output for the test set will be:

Fig. 3.15: The scaled test set (x_test)
Fitting the SVM classifier to the training set:
Now the training set will be fitted to the SVM classifier. To create the SVM classifier, we will import the SVC class from the sklearn.svm library. Below is the code for it:
from sklearn.svm import SVC # "Support vector classifier"
classifier = SVC(kernel='linear', random_state=0)
classifier.fit(x_train, y_train)
In the above code, we have used kernel='linear', as here we are creating an SVM for linearly separable data. However, we can change it for non-linear data. We then fitted the classifier to the training dataset (x_train, y_train).
Output:
Out[8]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='linear', max_iter=-1, probability=False, random_state=0,
    shrinking=True, tol=0.001, verbose=False)
The model performance can be altered by changing the value of C (the regularization factor), gamma, and the kernel.
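For example, swapping in a non-linear kernel is a one-line change; the C and gamma values below are arbitrary illustrations, not recommended settings:
# Hypothetical variant: RBF kernel with explicit C and gamma.
from sklearn.svm import SVC
classifier = SVC(kernel='rbf', C=10, gamma=0.1, random_state=0)
classifier.fit(x_train, y_train)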
* Predicting the test set result: Now, we will predict the output for the test set. For this, we will create a new vector y_pred. Below is the code for it:
#Predicting the test set result
y_pred= classifier.predict(x_test)
After getting the y_pred vector, we can compare the result of y_pred and y_test to check the difference between the actual values and the predicted values.

Output: Below is the output for the prediction of the test set:
Fig. 3.16: The prediction vector y_pred
* Creating the confusion matrix: Now we will see the performance of the SVM classifier, i.e., how many incorrect predictions there are compared to the Logistic Regression classifier. To create the confusion matrix, we need to import the confusion_matrix function of the sklearn library. After importing the function, we will call it using a new variable cm. The function takes two parameters, mainly y_true (the actual values) and y_pred (the values predicted by the classifier). Below is the code for it:
#Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm= confusion_matrix(y_test, y_pred)
Output:
Fig. 3.17: Confusion matrix for the SVM classifier (66 and 24 correct on the diagonal; 8 and 2 incorrect off the diagonal)
As we can see in the above output image, there are 66 + 24 = 90 correct predictions and 8 + 2 = 10 incorrect predictions. Therefore, we can say that our SVM model improved compared to the Logistic Regression model.

* Visualizing the training set result: Now we will visualize the training set result, below is the code for it:
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('SVM classifier (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output:
By executing the above code, we will get the output as:
Fig. 3.18: SVM classifier (Training set)
As we can see, the above output appears similar to the Logistic Regression output. In the output, we got a straight line as the hyperplane because we used a linear kernel in the classifier. We also discussed above that for a 2-D space, the hyperplane in SVM is a straight line.
* Visualizing the test set result:
#Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('SVM classifier (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output:
By executing the above code, we will get the output as:
Fig. 3.19: SVM classifier (Test set)
ADVANTAGES AND DISADVANTAGES OF SVM
Advantages of SVM:
1. SVM classifiers offer good accuracy and perform faster predictions compared to the Naive Bayes algorithm.
2. They also use less memory because they use a subset of training points in the decision phase.
3. SVM works well with a clear margin of separation and with high-dimensional space.
Disadvantages of SVM:
1. SVM is not suitable for large datasets because of its high training time; it also takes more time in training compared to Naive Bayes.
2. It works poorly with overlapping classes and is also sensitive to the type of kernel used.
K NEAREST NEIGHBOURS
The k-nearest neighbours (K-NN) algorithm is a popular supervised model used for both classification and regression, and it is a useful way to understand distance functions, voting systems, and hyperparameter optimization.
While K-NN can be used for classification and regression, here we will focus on building a classification model. Classification in machine learning is a supervised learning task that involves predicting a categorical label for a given input data point. The algorithm is trained on a labelled dataset and uses the input features to learn the mapping between the inputs and the corresponding class labels. We can then use the trained model to predict new, unseen data.
A Brief Introduction to K-Nearest Neighbours:
The K-NN algorithm functions as a voting system, where the class label of a new data point is determined by the majority class label among its k closest neighbours in the feature space. Imagine a small village with a few hundred residents, and you must decide which political party to vote for. To decide, you might ask your nearest neighbours which party they support. If the majority of your k nearest neighbours support party A, it is quite probable that you would also vote for party A. This is analogous to how the K-NN algorithm works: the class label of a new data point is determined by the majority class label among its k closest neighbours.
Now, let's examine another case in more detail. Consider a dataset containing information on two types of fruit: grapes and pears. For each fruit, you have a numerical value representing its roundness and its diameter, and you plot these data points. If presented with an unfamiliar fruit, you can add it to the plot and determine its identity from its distance to the k (a chosen number) closest points. In the example below, by selecting the three closest points, we find that all three correspond to pears; hence, we can be quite certain that the fruit is a pear. Considering the four closest points instead, three correspond to pears and one to a grape, so we can say there is a 75% probability that the fruit is a pear. Later, we will discuss methods for determining the optimal value of k and several techniques for measuring distance. A from-scratch sketch of the voting rule follows below.
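The following sketch (our own illustration with made-up sample points) implements exactly this rule: compute the Euclidean distance from the query to every labelled point, take the k nearest, and return the majority label:
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    # Distance from the query point to every training point.
    dists = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    # Labels of the k nearest neighbours.
    nearest = y_train[np.argsort(dists)[:k]]
    # Majority vote among those labels.
    return Counter(nearest).most_common(1)[0][0]

# Tiny fruit example: (roundness, diameter) pairs, values invented.
X = np.array([[0.9, 1.2], [1.0, 0.8], [0.7, 6.5], [0.8, 7.2]])
y = np.array(['grape', 'grape', 'pear', 'pear'])
print(knn_predict(X, y, np.array([0.75, 6.8]), k=3))  # 'pear'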
Fig. 3.20: Roundness score vs. diameter (cm) for grapes, pears, and the new fruit
The Dataset
To further illustrate the K-NN algorithm, let's work on a case study you may find while working as a data scientist. Let's
assume you are a data scientist at an online retailer, and you have been tasked with detecting fraudulent transactions. The
only features you have at this stage are:
e dist_from_home: The distance between the user’s home location and where the transaction was made.
* purchase_price_ratio: The ratio between the price of the item purchased in this transaction to the median purchase
price of that user.
The data has 39 observations which are individual transactions.
   dist_from_home  purchase_price_ratio  fraud
0             2.1                   6.4      1
1             3.8                   2.2      1
2            15.7                   4.4      1
3            26.7                   4.6      1
4            10.7                   4.9      1
K-Nearest Neighbors Workflow
To fit and train this model, we’ll be following The Machine Learning Workflow infographic.

1. Project set-up: understand the business goals (speak with your stakeholders and deeply understand the business goal behind the proposed model; this helps you scope the technical solution, the data sources to be collected, and how to evaluate model performance); choose the solution to your problem (focus on the category of models that drives the highest impact).
2. Data preparation: data collection (collect all the data you need for your models, whether from your own organization or from public sources); data cleaning (turn the messy raw data into clean, tidy data ready for analysis); feature engineering (manipulate the datasets to create variables that improve your model's prediction accuracy, creating the same features in both the training and testing sets); split the data (randomly divide the records into a training set and a testing set; for a more reliable assessment of model performance, generate multiple training and testing sets using cross-validation).
3. Modelling: hyperparameter tuning (for each model, use techniques to improve model performance); train your models on the training set; make predictions on the testing set; assess model performance (for each model, calculate metrics on the testing set such as accuracy, recall, and precision).
4. Deployment: deploy the model (embed the chosen model in dashboards, applications, or wherever you need it); monitor model performance (regularly test the performance of your model as your data changes to avoid model drift); improve your model (continuously iterate and improve the model post-deployment, replacing it with an updated version to improve performance).
Fig. 3.21: The Machine Learning Workflow
However, as our data is pretty clean, we won't carry out every step. We will do the following:
* Feature engineering.
* Splitting the data.
* Training the model.
* Hyperparameter tuning.
* Assessing model performance.

Visualize the Data

Let's start by visualizing our data: with Seaborn (which builds on Matplotlib), we can plot our two features in a scatterplot.
import seaborn as sns

sns.scatterplot(x=df['dist_from_home'], y=df['purchase_price_ratio'], hue=df['fraud'])
As you can see, there is a clear difference between these transactions, with fraudulent transactions being of much higher
value, compared to the customers’ median order. The trends around distance from home are somewhat hard to interpret, with
non-fraudulent transactions typically being closer to home but with several outliers.

Fig. 3.22: Scatterplot of dist_from_home vs. purchase_price_ratio, coloured by the fraud label
Normalizing and Splitting the Data
When building a machine learning model, it is crucial to partition the data into separate sets for training and testing purposes. The training data is used to optimize the model: the algorithm uses the training data to learn the correlation between the features and the target, aiming to identify a consistent structure that can then be used to make accurate predictions on unfamiliar data. The test data is used to assess the efficacy of the model: the model generates predictions on the test data, and these predictions are compared to the actual target values.
Normalizing the features is crucial for training a K-NN classifier, because K-NN calculates the distance between data points. The default method is the Euclidean distance, calculated as the square root of the sum of the squared differences between two points. In our scenario, purchase_price_ratio falls within the range of 0 to 8, whereas dist_from_home is much larger; without normalization, our estimate would be dominated by dist_from_home due to its larger values.
It is advisable to standardize the data after it has been divided into training and test sets. To avoid 'data leakage', the test set must be normalized separately: if all the data were normalized together, the model would gain extra knowledge about the test set during normalization.
The provided code segment divides the data into separate train and test sets, and then applies normalization using scikit-learn's StandardScaler. Initially, we call the .fit_transform() method on the training data, which fits our scaler to the mean and standard deviation of the training data. Subsequently, we apply it to the test data by invoking the .transform() method, which uses the previously learned values.
# Split the data into features (X) and target (y)
X = df.drop('fraud', axis=1)
y = df['fraud']

# Split the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Scale the features using StandardScaler
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Fitting and Evaluating the Model:
We are prepared to begin training the model. For this purpose, we will use a constant value of 3 for k, although we will refine this later. Initially, we instantiate a K-NN model and then train it using our training data, providing both the features and the target variable so the model can learn.
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
The model is now trained! We can make predictions on the test dataset, which we can use later to score the model.
y_pred = knn.predict(X_test)
The simplest way to evaluate this model is by using accuracy. We check the predictions against the actual values in the
test set and count up how many the model got right.
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.875
This is a pretty good score! However, we may be able to do better by optimizing our value of k.
Using Cross Validation to Get the Best Value of k:
Unfortunately, there is no magic method to determine the optimal value for k; we need to iterate over many distinct values and then use our best judgement.
In the code shown below, we provide a range of values for k and initialize an empty list to record the outcomes. Cross-validation is used to determine the accuracy scores, which removes the need to create separate training and test sets; data scaling is still required, however. We then iterate over the candidate values and append the scores to our list.
To carry out cross-validation, we use the cross_val_score function provided by scikit-learn. We pass in the K-NN model instance, our data, and the desired number of splits. In the code below, we use five splits, meaning the model divides the data into five groups of equal size: four of these groups are used for training and one for testing. The procedure repeats for each group, calculating an accuracy score for each, and these scores are averaged to determine the best model.
import numpy as np
from sklearn.model_selection import cross_val_score

k_values = [i for i in range(1, 31)]
scores = []

scaler = StandardScaler()
X = scaler.fit_transform(X)

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(knn, X, y, cv=5)
    scores.append(np.mean(score))
We can plot the results with the following code:
import matplotlib.pyplot as plt

sns.lineplot(x = k_values, y = scores, marker = 'o')
plt.xlabel("K Values")
plt.ylabel("Accuracy Score")

We can see from our chart that k = 9, 10, 11, 12, and 13 all have an accuracy score of just under 95%. As these are tied for the best score, it is advisable to use a smaller value for k: with higher values of k, the model uses more data points that are further away from the original. Another option would be to explore other evaluation metrics.

Fig. 3.23: Cross-validated accuracy score for each value of k
More Evaluation Metrics:
We can now train our model using the best k value using the code below.
best_index = np.argmax(scores)
best_k = k_values[best_index]

knn = KNeighborsClassifier(n_neighbors=best_k)
knn.fit(X_train, y_train)
Then evaluate with accuracy, precision, and recall (note: your results may differ due to randomization):
from sklearn.metrics import precision_score, recall_score

y_pred = knn.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
Accuracy: 0.875
Precision: 0.75
Recall: 1.0
* K-Nearest Neighbour is one of the simplest machine learning algorithms, based on the supervised learning technique.
* The K-NN algorithm assumes similarity between the new case/data and the available cases, and puts the new case into the category that is most similar to the available categories.
* K-NN is a non-parametric algorithm, which means it does not make any assumption about the underlying data.
* It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead it stores the dataset and, at the time of classification, performs an action on the dataset.
* The K-NN algorithm at the training phase just stores the dataset, and when it gets new data, it classifies that data into a category that is most similar to the new data.
Example: Suppose we have an image of a creature that looks like both a cat and a dog, and we want to know whether it is a cat or a dog. For this identification, we can use the K-NN algorithm, as it works on a similarity measure. Our K-NN model will find the features of the new data set that are similar to the cat and dog images, and based on the most similar features it will put the creature in either the cat or the dog category.

Fig. 3.24: K-NN classifier: from input value to predicted output
NEED OF K-NN ALGORITHM
Why do we need a K-NN Algorithm?
Suppose there are two categories, i.e., Category A and Category B, and we have a new data point x1. In which of these categories will this data point lie? To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point. Consider the below diagram:

Fig. 3.25: A new data point x1 between Category A and Category B, before and after assignment
WORKING OF K-NN ALGORITHM
How does K-NN work?
The K-NN working can be explained on the basis of the below algorithm:
* Step 1: Select the number K of neighbours.
* Step 2: Calculate the Euclidean distance of K number of neighbours.
* Step 3: Take the K nearest neighbours as per the calculated Euclidean distance.
* Step 4: Among these K neighbours, count the number of data points in each category.
* Step 5: Assign the new data point to the category for which the number of neighbours is maximum.
* Step 6: Our model is ready.
Suppose we have a new data point and we need to put it in the required category. Consider the below image:
Fig. 3.26: A new data point to be classified into Category A or Category B
* Firstly, we will choose the number of neighbours: we will choose k = 5.
* Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. It can be calculated as follows:
Fig. 3.27: Euclidean distance between points A(X1, Y1) and B(X2, Y2)
Euclidean distance between A and B = sqrt((X2 - X1)² + (Y2 - Y1)²)
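In code, the same distance is a one-liner; here is a small NumPy sketch:
import numpy as np

def euclidean_distance(a, b):
    # Square root of the sum of squared coordinate differences.
    return np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2))

print(euclidean_distance((1, 2), (4, 6)))  # 5.0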
* By calculating the Euclidean distance, we get the nearest neighbours: three nearest neighbours in Category A and two nearest neighbours in Category B. Consider the below image:
Fig. 3.28: Category A: 3 neighbours; Category B: 2 neighbours
* As we can see, the 3 nearest neighbours are from Category A; hence this new data point must belong to Category A.
How to select the value of K in the K-NN Algorithm?
Below are some points to remember while selecting the value of K in the K-NN algorithm:
* There is no fixed way to determine the best value for K, so we need to try several values to find the best one. The most preferred value for K is 5.
* A very low value for K, such as K = 1 or K = 2, can be noisy and leads to the effects of outliers in the model.
* Large values for K smooth out noise, but a value that is too large may blur the distinction between categories.
ADVANTAGES AND DISADVANTAGES OF K-NN ALGORITHM
Advantages of K-NN Algorithm:
* It is simple to implement.
* It is robust to noisy training data.
* It can be more effective if the training data is large.
Disadvantages of K-NN Algorithm:
* We always need to determine the value of K, which may be complex at times.
* The computation cost is high because of calculating the distance between the data points for all the training samples (see the sketch below for one way to mitigate this).
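In practice, scikit-learn can mitigate the distance-computation cost with tree-based neighbour search; a minimal sketch (the parameter choice here is illustrative):
from sklearn.neighbors import KNeighborsClassifier

# algorithm='kd_tree' (or 'ball_tree') builds an index over the training points
# so that neighbour queries need not scan every sample; 'auto' picks a strategy.
classifier = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')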
IMPLEMENTATION OF THE K-NN ALGORITHM
To do the Python implementation of the K-NN algorithm, we will use the same problem and dataset which we have used in Logistic Regression. But here we will improve the performance of the model. Below is the problem description:
Problem for K-NN Algorithm: There is a car manufacturer company that has manufactured a new SUV car. The company wants to give the ads to the users who are interested in buying that SUV. So, for this problem, we have a dataset that contains multiple users' information gathered through the social network. The dataset contains lots of information, but we will consider Estimated Salary and Age as the independent variables and the Purchased variable as the dependent variable. Below is the dataset:
User ID    Gender   Age   EstimatedSalary   Purchased
15624510   Male     19    19000             0
15810944   Male     35    20000             0
15668575   Female   26    43000             0
15603246   Female   27    57000             0
15804002   Male     19    76000             0
15728773   Male     27    58000             0
15598044   Female   27    84000             0
15694829   Female   32    150000            1
15600575   Male     25    33000             0
15727311   Female   35    65000             0
15570769   Female   26    80000             0
15606274   Female   26    52000             0
15746139   Male     20    86000             0
15704987   Male     32    18000             0
15628972   Male     18    82000             0
15697686   Male     29    80000             0
15733883   Male     47    25000             1
15617482   Male     45    26000             1
15704583   Male     46    28000             1
15621083   Female   48    29000             1
15649487   Male     45    22000             1
15736760   Female   47    49000             1
Fig. 3.29
Steps to implement the K-NN algorithm:
* Data Pre-processing step.
* Fitting the K-NN algorithm to the Training set.

* Predicting the test result.
* Test accuracy of the result (creation of the confusion matrix).
* Visualizing the test set result.
Data Pre-Processing Step:
The Data Pre-processing step will remain exactly the same as Logistic Regression. Below is the code for it:
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set= pd.read_csv('user_data.csv')

#Extracting Independent and dependent Variable


x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values

# Splitting the dataset into training and test set.


from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)

#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
By executing the above code, our dataset is imported into our program and pre-processed. After feature scaling, our test dataset will look like:
Fig. 3.30: The scaled test set (x_test)
From the above output image, we can see that our data is successfully scaled.

Fitting the K-NN classifier to the Training data:
Now we will fit the K-NN classifier to the training data. To do this, we will import the KNeighborsClassifier class of the sklearn.neighbors library. After importing the class, we will create the classifier object of the class. The parameters of this class will be:
* n_neighbors: To define the required neighbours of the algorithm. Usually, it takes 5.
* metric='minkowski': This is the default parameter and it decides the distance between the points.
* p=2: It is equivalent to the standard Euclidean metric.
And then we will fit the classifier to the training data. Below is the code for it:
#Fitting K-NN classifier to the training set
from sklearn.neighbors import KNeighborsClassifier
classifier= KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier.fit(x_train, y_train)
Output: By executing the above code, we will get the output as:
Out[10]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')
* Predicting the Test Result: To predict the test set result, we will create a y_pred vector as we did in Logistic Regression. Below is the code for it:
#Predicting the test set result
y_pred= classifier.predict(x_test)
Output:
The output for the above code will be:
Fig. 3.31: The prediction vector y_pred
Creating the Confusion Matrix:
Now we will create the confusion matrix for our K-NN model to see the accuracy of the classifier. Below is the code for it:
#Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm= confusion_matrix(y_test, y_pred)
In the above code, we have imported the confusion_matrix function and called it using the variable cm.
Output: By executing the above code, we will get the matrix as below:

Fig. 3.32: Confusion matrix for the K-NN model (64 and 29 correct on the diagonal; 3 and 4 incorrect off the diagonal)
In the above image, we can see there are 64 + 29 = 93 correct predictions and 3 + 4 = 7 incorrect predictions, whereas in Logistic Regression there were 11 incorrect predictions. So we can say that the performance of the model is improved by using the K-NN algorithm.
Visualizing the Training Set Result:
Now, we will visualize the training set result for the K-NN model. The code will remain the same as in Logistic Regression, except for the name of the graph. Below is the code for it:
#Visualizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('K-NN Algorithm (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:
By executing the above code, we will get the below graph:
Fig. 3.33: K-NN Algorithm (Training set)
The output graph is different from the graph which we obtained in Logistic Regression. It can be understood from the below points:
* As we can see, the graph shows red points and green points. The green points are for the Purchased (1) variable and the red points for the Not Purchased (0) variable.
* The graph shows an irregular boundary instead of a straight line or a curve because it is a K-NN algorithm, i.e., it finds the nearest neighbour.
* The graph has classified users into the correct categories, as most of the users who didn't buy the SUV are in the red region and users who bought the SUV are in the green region.
* The graph shows a good result, but still there are some green points in the red region and red points in the green region. This is no big issue, as it prevents the model from overfitting.
* Hence, our model is well trained.
Visualizing the Test Set Result:
After the training of the model, we will now test the result by putting a new dataset, i.c., Test dataset. Code remains the
same except some minor changes: such as X_train and y_train will be replaced by x_test and y_test.
Below is the code for it:
#Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('K-NN Algorithm (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output:
Fig. 3.34: K-NN Algorithm (Test set) — decision regions over Age vs. Estimated Salary
The above graph shows the output for the test dataset. As we can see in the graph, the predicted output is quite good, as most of the red points are in the red region and most of the green points are in the green region.
However, there are a few green points in the red region and a few red points in the green region. These are the incorrect observations that we observed in the confusion matrix (7 incorrect outputs).

1. What is a Support Vector Machine? What are its types?
2. How does SVM work?
3. Write the advantages and disadvantages of Decision Tree.
4. Write the Python implementation of SVM.
5. What is K Nearest Neighbours?
6. Explain the need of the KNN algorithm.
7. Describe the working of the KNN algorithm.
8. Write the advantages and disadvantages of the KNN algorithm.
9. Write the Python implementation of the KNN algorithm.

Unsupervised Learning: Clustering Algorithms
Chapter Outcomes...
After reading this chapter, students will be able to understand:
• The concept of K-means clustering.
• The working of the K-means algorithm.
• The failure of K-means.
• The implementation of the algorithm.
• The concept of Dimensionality Reduction and Subset Selection.
• The concept of Principal Component Analysis.
Learning Objectives...
• Describe the performance analysis of clustering for the given situation.
• Describe Dimensionality Reduction.

K-MEANS CLUSTERING
A popular clustering approach for effectively dividing spherical data into separate categories is k-means clustering. It is particularly useful as a feature-engineering step to improve supervised learning models, as well as an analytical tool when the groupings of data rows are not evident.
For this lesson, we assume that you have a basic understanding of Python and the ability to work with pandas DataFrames.

An Overview of K-Means Clustering:

Clustering models attempt to categorise data into discrete "clusters" or groups. This might serve as a compelling perspective in an analysis or as a characteristic in a supervised learning system.
Imagine a social environment where individuals are engaged in conversations inside separate clusters around a room. Upon initial observation of the room, one's gaze falls upon these gatherings of individuals. One approach would be to conceptually assign a distinct identification to each group of people by mentally placing a point at the centre of each group. Subsequently, you would be able to give each group a distinct name in order to characterise it. K-means clustering segments data in essentially the same manner.


Fig. 4.1: Unlabelled data ("Before k-means", left) and the clusters discovered by k-means ("After k-means", right)
On the left side of the picture, there are two separate sets of dots that are not labelled and are coloured to indicate
similarity. Applying a k-means algorithm to this dataset (on the right-hand side) can uncover two unique clusters (shown by
separate circles and colours).
When dealing with two dimensions, humans can easily divide these clusters. However, when working with higher
dimensions, a model is required to accomplish the same task.
The Dataset:
We are going to use the California housing data obtained via Kaggle (available at https://www.kaggle.com/datasets/camnugent/california-housing-prices?resource=download). Our analysis will incorporate geographic data, specifically latitude and longitude, together with the median house value. Our objective is to group the houses based on their geographical proximity and analyse the variations in property values throughout California. The dataset is stored in a CSV file named 'housing.csv' in our current working directory and is accessed using the pandas library.
import pandas as pd

home_data = pd.read_csv('housing.csv', usecols = ['longitude', 'latitude', 'median_house_value'])


home_data.head()

   longitude  latitude  median_house_value
0    -122.23     37.88            452600.0
1    -122.22     37.86            358500.0
2    -122.24     37.85            352100.0
3    -122.25     37.85            341300.0
4    -122.25     37.85            342200.0

The data include 3 variables that we have selected using the usecols parameter:
• longitude: A value representing how far west a house is. Higher values represent houses that are further west.
• latitude: A value representing how far north a house is. Higher values represent houses that are further north.
• median_house_value: The median house price within a block, measured in USD.
K-Means Clustering Workflow:
Similar to other Machine Learning algorithms, K-Means Clustering follows a specific process.

The typical workflow has four phases:
1. Project set-up:
   • Understand the business goals: speak with your stakeholders and deeply understand the business goal behind the model being proposed. A deep understanding of your business goals will help you scope the necessary technical solution, the data sources to be collected, and how to evaluate model performance.
   • Choose the solution to your problem: once you have a deep understanding of your problem, focus on which category of models drives the highest impact.
2. Data preparation:
   • Data collection: collect all the data you need for your models, whether from your own organization or from public or paid sources.
   • Data cleaning: turn the messy raw data into clean, tidy data ready for analysis.
   • Feature engineering: manipulate the datasets to create variables (features) that improve your model's prediction accuracy. Create the same features in both the training set and the testing set.
   • Split the data: randomly divide the records in the dataset into a training set and a testing set. For a more reliable assessment of model performance, generate multiple training and testing sets using cross-validation.
3. Modelling:
   • Hyperparameter tuning: for each model, use hyperparameter tuning techniques to improve model performance.
   • Train your models.
   • Make predictions on the testing set.
   • Assess model performance: for each model, calculate performance metrics on the testing set such as accuracy, recall, and precision.
4. Deployment:
   • Deploy the model: embed the model you chose in applications or wherever you need it.
   • Monitor model performance: regularly test the performance of your model as your data changes to avoid model drift.
   • Improve your model: continuously iterate and improve your model post-deployment, replacing it with an updated version when performance improves.
Fig. 4.2: Machine Learning Workflow


This infographic presents a simplified view of the machine learning workflow.
This lesson will primarily cover the processes of data collection and data splitting in the data preparation phase, as well as hyperparameter tuning, model training, and model performance assessment in the modelling phase. A significant portion of the effort in unsupervised learning algorithms is dedicated to hyperparameter tuning and evaluating performance to optimise the model's outcomes.
Visualize the Data:
First, we visualize our housing data. We analyse the location data by generating a plot coloured by the median price inside each block. In this lesson, we will utilise Seaborn to efficiently generate plots (refer to the 3rd semester book Data Storytelling, chapter Introduction to Data Visualization).
import seaborn as sns
sns.scatterplot(data = home_data, x = 'longitude', y = 'latitude', hue = 'median_house_value')

Fig. 4.3: House locations (longitude vs. latitude) coloured by median_house_value
It is evident that the majority of high-cost residences are located around the western coastline of California, whereas various regions exhibit concentrations of medium-priced homes. This is predictable, as waterfront properties generally hold a higher value compared to houses that are not located along the ocean.
Clusters are readily identifiable when utilising a limited number of 2 or 3 features. It becomes progressively challenging or impossible as the number of features grows, which is where a clustering model helps.
Normalizing the Data:
Normalisation of data is necessary when utilising distance-based methods such as k-Means Clustering. Failure to
normalise the data will result in variables with varying scales being given unequal importance in the distance calculation that
is being optimised during training. For instance, if we were to incorporate price into the cluster alongside latitude and
longitude, price would exert a disproportionate influence on the optimisations due to its substantially larger and broader range
compared to the constrained location variables.
We first set up training and test splits using train_test_split from sklearn.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(home_data[['latitude', 'longitude']],
    home_data[['median_house_value']], test_size=0.33, random_state=0)
Next, we normalize the training and test data using the preprocessing.normalize( ) method from sklearn.
from sklearn import preprocessing
X_train_norm = preprocessing.normalize(X_train)
X_test_norm = preprocessing.normalize(X_test)
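Note that preprocessing.normalize rescales each row (sample) to unit L2 norm by default, which is different from feature-wise scaling such as StandardScaler. A quick check of this behaviour (assuming numpy is available as np):
import numpy as np
#Each row of the normalized matrix has unit length
print(np.linalg.norm(X_train_norm, axis=1)[:5])  #prints an array of 1.0 values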
Fitting and Evaluating the Model:
During the initial iteration, we will select a predetermined number of clusters, denoted as k, which will be set to 3. Constructing and fitting models with sklearn is very straightforward. To instantiate KMeans, we will specify the number of clusters using the n_clusters attribute. We will set n_init to 'auto' to let sklearn choose the number of times the algorithm is run with different centroid seeds. Additionally, we will set random_state to 0 to ensure consistent results each time the code is executed. Next, we can train the model using the normalised training data by utilising the fit() function.
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters = 3, random_state = 0, n_init='auto')
kmeans.fit(X_train_norm)
Once the data are fit, we can access labels from the labels_ attribute. Below, we visualize the data we just fit.
sns.scatterplot(data = X_train, x = 'longitude', y = 'latitude', hue = kmeans.labels_)
Fig. 4.4: Training data coloured by the three cluster labels found by k-means
The data is now clearly segregated into three separate clusters, namely Northern California, Central California, and
Southern California. An alternative approach is to examine the dispersion of median house values in these three categories
using a boxplot.
sns.boxplot(x = kmeans.labels_, y = y_train['median_house_value'])
Fig. 4.5: Boxplot of median_house_value for each of the three clusters
We clearly see that the Northern and Southern clusters have similar distributions of median house values (clusters 0 and 2) that are higher than the prices in the central cluster (cluster 1).
We can evaluate the performance of the clustering algorithm using the silhouette score, which is part of sklearn.metrics; a score closer to 1 indicates a better fit.
from sklearn.metrics import silhouette_score

silhouette_score(X_train_norm, kmeans.labels_, metric='euclidean')


Due to our lack of analysis of the robustness of various cluster sizes, the adequacy of the k = 3 model remains uncertain. In the following section, we will examine various cluster counts and evaluate their performance in order to determine the optimal hyperparameter values for our model.
Choosing the best number of clusters:
An inherent limitation of k-means clustering is the inability to determine the optimal number of clusters solely based on
model execution. We must conduct a comprehensive evaluation of several value ranges and determine the optimal value of k.
Typically, we employ the Elbow technique to ascertain the ideal number of clusters, ensuring that we neither overfit the data
by using excessive clusters nor underfit it by using too few.
We implement the following loop to evaluate and store various model outcomes in order to determine the optimal
number of clusters.
K = range(2, 8)
fits = []
score = []

for k in K:
    # train the model for the current value of k on the training data
    model = KMeans(n_clusters = k, random_state = 0, n_init='auto').fit(X_train_norm)
    # append the model to fits
    fits.append(model)
    # append the silhouette score to score
    score.append(silhouette_score(X_train_norm, model.labels_, metric='euclidean'))
We can then first visually look at a few different values of k.
First we look at k = 2.
sns.scatterplot(data = X_train, x = 'longitude', y = 'latitude', hue = fits[0].labels_)

Fig. 4.6: Training data coloured by cluster labels for k = 2
The model does an OK job of splitting the state into two halves, but probably does not capture enough nuance in the California housing market.
Next, we look at k = 4.
sns.scatterplot(data = X_train, x = 'longitude', y = 'latitude', hue = fits[2].labels_)
Fig. 4.7: Training data coloured by cluster labels for k = 4

This figure categorises California into more coherent clusters based on the geographical location of residences, specifically their proximity to the northern or southern regions of the state. This model is highly likely to capture a greater level of subtlety in the housing market as we traverse the state.
Lastly, we examine the case where k is equal to 7 (fits[5], since K starts at 2).
sns.scatterplot(data = X_train, x = 'longitude', y = 'latitude', hue = fits[5].labels_)

Fig. 4.8: Training data coloured by cluster labels for k = 7
The graph depicted above exhibits an excessive number of clusters. We have sacrificed the simplicity of interpreting the clusters in order to achieve a geo-clustering result that appears "more accurate".
Generally, as the value of k increases, we observe enhancements in the clusters and their respective representations until reaching a specific threshold. Subsequently, we observe diminishing returns or, in some cases, even poorer performance. To aid in determining the value of k, we can employ a visual representation known as an elbow plot. In this plot, the y-axis represents the measure of goodness of fit, while the x-axis corresponds to the value of k.
sns.lineplot(x = K, y = score)

Fig. 4.9: Silhouette score (y-axis) plotted against the number of clusters k (x-axis)
Typically, we select the point at which the performance improvements level off or deteriorate. It appears that k = 5 is the optimal choice without risking overfitting. Furthermore, the clusters effectively divide California into distinct groups, which correspond reasonably well to different price ranges, as demonstrated below.
sns.scatterplot(data = X_train, x = 'longitude', y = 'latitude', hue = fits[3].labels_)

Fig. 4.10: Training data coloured by cluster labels for k = 5
sns.boxplot(x = fits[3].labels_, y = y_train['median_house_value'])
Fig. 4.11: Boxplot of median_house_value for each of the five clusters
Under what circumstances does k-means cluster analysis fail?
K-means clustering is most effective when applied to data that exhibit a spherical shape. Spherical data refers to data
points that cluster closely together in space. It is easier to visualise this in 2 or 3 dimensional space. Data that deviate from a
spherical shape or are not ideally spherical are not suitable for effective use in k-means clustering. For instance, the k-means
clustering algorithm would not perform effectively on the given data because it would fail to identify separate centroids to
cluster the two circles or arcs differently, even if they are visually distinct and should be labelled accordingly.

Fig. 4.12: Non-spherical data (e.g., two concentric circles or arcs) on which k-means fails
Is it advisable to partition your data into separate training and testing sets?
The choice to partition your data is contingent upon the objectives you have set for the clustering process. If the objective is simply to group your data at the conclusion of your research, then it is not obligatory. If you intend to utilise the clusters as a feature in a supervised learning model or for prediction, as demonstrated in the Scikit-Learn Tutorial: Baseball Analytics Pt 1 tutorial, it is necessary to partition your data prior to clustering in order to adhere to the recommended procedures for the supervised learning workflow.
WHAT IS K-MEANS CLUSTERING?
• K-Means Clustering is an unsupervised learning algorithm which groups the unlabelled dataset into different clusters.
• Here K defines the number of pre-defined clusters that need to be created in the process; if K = 2, there will be two clusters, for K = 3 there will be three clusters, and so on.
• It allows us to cluster the data into different groups and is a convenient way to discover the categories of groups in the unlabelled dataset on its own, without the need for any training.
• It is a centroid-based algorithm, where each cluster is associated with a centroid.
• The main aim of this algorithm is to minimize the sum of distances between the data points and their corresponding clusters.
• The algorithm takes the unlabelled dataset as input, divides the dataset into k clusters, and repeats the process until it finds the best clusters. The value of k should be predetermined in this algorithm.
The k-means clustering algorithm mainly performs two tasks:
• Determines the best values for the K center points or centroids by an iterative process.
• Assigns each data point to its closest k-center. The data points which are near a particular k-center create a cluster.
Hence each cluster has data points with some commonalities, and it is far away from other clusters.
The below diagram explains the working of the K-means Clustering Algorithm:
Fig. 4.13: Data before K-Means (left) and the clusters after K-Means (right)
WORKING OF K-MEANS ALGORITHM
How Does the k-means Algorithm Work?
The working of the K-Means algorithm is explained in the below steps:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points or centroids. (They can be points other than those from the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, which means reassign each data point to the new closest centroid of each cluster.
Step-6: If any reassignment occurred, then go to Step-4, else go to FINISH.
Step-7: The model is ready.
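These steps translate almost directly into code. Below is a minimal NumPy sketch for illustration only (the function name simple_kmeans is our own; in practice, prefer sklearn's KMeans, which also handles edge cases such as empty clusters):
import numpy as np

def simple_kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    #Step-2: select k random points from the dataset as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        #Step-3/5: assign each data point to its closest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        #Step-4: place a new centroid at the mean of each cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        #Step-6: if no centroid moved (no reassignment), the model is ready
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids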
Let us understand the above steps by considering the visual plots:
Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables is given below:
Fig. 4.14: Scatter plot of the two variables M1 and M2
• Let us take the number K of clusters, i.e., K = 2, to identify the dataset and to put the points into different clusters. It means here we will try to group these data into two different clusters.
• We need to choose some random k points or centroids to form the clusters. These points can be either points from the dataset or any other points. So, here we are selecting the below two points as K points, which are not part of our dataset. Consider the below image:
Fig. 4.15: Two randomly chosen centroids
• Now we will assign each data point of the scatter plot to its closest K-point or centroid. We will compute it by applying the mathematics that we have studied to calculate the distance between two points. So, we will draw a median between both the centroids. Consider the below image:
Fig. 4.16: The median line between the two centroids
• From the above image, it is clear that the points on the left side of the line are nearer to the K1 or blue centroid, and the points to the right of the line are closer to the yellow centroid. Let's colour them blue and yellow for clear visualization.
Fig. 4.17: Points coloured by their closest centroid
• As we need to find the closest cluster, we will repeat the process by choosing new centroids. To choose the new centroids, we will compute the center of gravity of the points in each cluster, and will find the new centroids as below:

Fig. 4.18: The recomputed centroids
• Next, we will reassign each data point to the new centroids. For this, we will repeat the same process of finding a median line. The median will be as in the below image:
Fig. 4.19: The new median line between the updated centroids
• From the above image, we can see one yellow point is on the left side of the line, and two blue points are to the right of the line. So, these three points will be assigned to new centroids.
Fig. 4.20: Points reassigned after the new median line
• As reassignment has taken place, we will again go to Step-4, which is finding new centroids or K-points.
• We will repeat the process by finding the center of gravity of each cluster, so the new centroids will be as shown in the below image:
Fig. 4.21: Centroids recomputed once more
• As we have the new centroids, we will again draw the median line and reassign the data points. So, the image will be:
Fig. 4.22: The median line after the final reassignment
• We can see in the above image that there are no dissimilar data points on either side of the line, which means our model is formed. Consider the below image:
Fig. 4.23: The converged clusters
As our model is ready, we can now remove the assumed centroids, and the two final clusters will be as shown in the below image:
Fig. 4.24: The two final clusters
FAILURE OF K-MEANS
Failures or challenges associated with K-Means:
1. Sensitive to Initial Centroid Positions: K-Means is sensitive to the initial placement of centroids. Different initializations can lead to different final cluster assignments, and the algorithm may converge to a local minimum rather than the global minimum.
2. Assumes Spherical Clusters: K-Means assumes that clusters are spherical and equally sized. In situations where clusters have different shapes, densities, or sizes, K-Means may fail to accurately capture the underlying structure of the data.
3. Sensitive to Outliers: Outliers can significantly impact the performance of K-Means. Since the algorithm relies on
the mean (centroid) of the data points in each cluster, outliers can disproportionately influence the centroid, leading
to suboptimal cluster assignments.
4. Requires Pre-specification of the Number of Clusters (K): One of the major limitations of K-Means is that it
requires the user to specify the number of clusters (K) in advance. Choosing an inappropriate value for K can result
in poor clustering results.
5. Limited to Euclidean Distance: K-Means uses Euclidean distance to measure the dissimilarity between data points
and centroids. This can be a limitation when dealing with data that does not adhere to Euclidean geometry or when
the features have different scales.
6. May Produce Unbalanced Clusters: K-Means can produce clusters of significantly different sizes. In cases where
the data naturally forms clusters of unequal sizes, K-Means may not be the most suitable algorithm.
7. Not Robust to Non-Convex Shapes: K-Means assumes that clusters are convex, which means it struggles with
non-convex shapes. If the true clusters have complex, non-convex boundaries, K-Means may fail to accurately
represent them.
8. Does Not Handle Categorical Data Well: K-Means is designed for numerical data, and it may not perform well
with categorical or binary features. Preprocessing techniques, such as one-hot encoding, are often required.
9. Noisy Data Impact: Noise in the data can lead to incorrect cluster assignments. K-Means is not robust to noisy data,
and outliers or irrelevant features can affect the clustering results.
10. Convergence to Local Optima: K-Means uses an iterative optimization process, and it may converge to a local
minimum rather than the global minimum. Multiple runs with different initializations are often performed to
mitigate this issue.
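The non-convexity failure (points 2 and 7 above) is easy to reproduce. Below is a short sketch using scikit-learn's make_circles to generate two concentric rings (mtp is assumed to be matplotlib.pyplot, as in the implementation section below):
from sklearn.datasets import make_circles
from sklearn.cluster import KMeans
import matplotlib.pyplot as mtp

#Two concentric circles: visually distinct, but not separable by centroids
X, y_true = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

#K-Means splits the plane in half instead of separating the two rings
mtp.scatter(X[:, 0], X[:, 1], c=labels)
mtp.title('K-Means on concentric circles')
mtp.show()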
IMPLEMENTATION OF K-MEANS ALGORITHM
Before implementation, let's understand what type of problem we will solve here. So, we have a dataset of
Mall_Customers, which is the data of customers who visit the mall and spend there.
In the given dataset, we have Customer_Id, Gender, Age, Annual Income ($), and Spending Score (which is the
calculated value of how much a customer has spent in the mall, the more the value, the more he has spent). From this dataset,
we need to calculate some patterns, as it is an unsupervised method, so we don't know what to calculate exactly.
The steps to be followed for the implementation are given below:
• Data pre-processing.
• Finding the optimal number of clusters using the elbow method.
• Training the K-means algorithm on the training dataset.
• Visualizing the clusters.
Step-1: Data Pre-processing Step
The first step will be the data pre-processing, as we did in our earlier topics of Regression and Classification. But for the
clustering problem, it will be different from other models. Let’s discuss it:
• Importing Libraries: As we did in previous topics, firstly we will import the libraries for our model, which is part of data pre-processing. The code is given below:
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
In the above code, numpy is imported for performing mathematical calculations, matplotlib is for plotting the graph, and pandas is for managing the dataset.
• Importing the Dataset: Next, we will import the dataset that we need to use. So here, we are using the Mall_Customers_data.csv dataset. It can be imported using the below code:
# Importing the dataset
dataset = pd.read_csv('Mall_Customers_data.csv')
By executing the above lines of code, we will get our dataset in the Spyder IDE. The dataset contains the columns CustomerID, Gender, Age, Annual Income (k$), and Spending Score (1-100), as shown below:
Fig. 4.25: The Mall_Customers dataset in the variable explorer
From the above dataset, we need to find some patterns in it.
• Extracting Independent Variables: Here we don't need any dependent variable for the data pre-processing step, as it is a clustering problem and we have no idea about what to determine. So we will just add a line of code for the matrix of features.
x = dataset.iloc[:, [3, 4]].values
As we can see, we are extracting only the 3rd and 4th features (Annual Income and Spending Score). It is because we need a 2-D plot to visualize the model, and some features, such as customer_id, are not required.
Step-2: Finding the optimal number of clusters using the elbow method
In the second step, we will try to find the optimal number of clusters for our clustering problem. So, as discussed above, here we are going to use the elbow method for this purpose.
As we know, the elbow method uses the WCSS (Within Cluster Sum of Squares) concept to draw the plot by plotting WCSS values on the Y-axis and the number of clusters on the X-axis. So we are going to calculate the WCSS values for different k values ranging from 1 to 10. Below is the code for it:
#finding optimal number of clusters using the elbow method
from sklearn.cluster import KMeans

wcss_list = []  #Initializing the list for the values of WCSS

#Using for loop for iterations from 1 to 10.
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)
mtp.plot(range(1, 11), wcss_list)
mtp.title('The Elbow Method Graph')
mtp.xlabel('Number of clusters (k)')
mtp.ylabel('wcss_list')
mtp.show()
As we can see in the above code, we have used the KMeans class of the sklearn.cluster library to form the clusters.
Next, we have created the wcss_list variable to initialize an empty list, which is used to contain the values of WCSS computed for the different values of k ranging from 1 to 10.
After that, we have initialized the for loop for the iteration over the different values of k ranging from 1 to 10; since a for loop in Python excludes the upper bound, it is written as 11 to include the value 10.
The rest of the code is similar to what we did in earlier topics: we have fitted the model on the matrix of features and then plotted the graph between the number of clusters and WCSS.
Output: After executing the above code, we will get the below output:
Fig. 4.26: The Elbow Method Graph (wcss_list vs. number of clusters k)
From the above plot, we can see the elbow point is at 5. So, the number of clusters here will be 5.
Fig. 4.27: The wcss_list values in the variable explorer
Step-3: Training the K-means algorithm on the training dataset
As we have got the number of clusters, we can now train the model on the dataset.
To train the model, we will use the same two lines of code as we have used in the above section, but here, instead of using i, we will use 5, as we know there are 5 clusters that need to be formed. The code is given below:
#training the K-means model on a dataset
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
y_predict = kmeans.fit_predict(x)

The first line is the same as above, creating the object of the KMeans class.
In the second line of code, we have created the variable y_predict, which holds the predicted cluster of each point.
By executing the above lines of code, we will get the y_predict variable. We can check it under the variable explorer option in the Spyder IDE. We can now compare the values of y_predict with our original dataset. Consider the below image:
Fig. 4.28: The y_predict values compared with the original dataset in the variable explorer
From the above image, we can now see that CustomerID 1 belongs to cluster 3 (as the index starts from 0, a label of 2 corresponds to the 3rd cluster), CustomerID 2 belongs to cluster 4, and so on.
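To make this comparison programmatic rather than visual, the predicted labels can be attached to the original dataset. A minimal sketch (the column name 'cluster' is our own choice):
#Appending the predicted cluster label of each customer to the dataset
dataset['cluster'] = y_predict
print(dataset.head())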
Step-4: Visualizing the Clusters
The last step is to visualize the clusters. As we have 5 clusters for our model, we will visualize each cluster one by one.
To visualize the clusters, we will use a scatter plot drawn with the mtp.scatter() function of matplotlib.
#visualizing the clusters
mtp.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s = 100, c = 'blue', label = 'Cluster 1')  #for first cluster
mtp.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s = 100, c = 'green', label = 'Cluster 2')  #for second cluster
mtp.scatter(x[y_predict == 2, 0], x[y_predict == 2, 1], s = 100, c = 'red', label = 'Cluster 3')  #for third cluster
mtp.scatter(x[y_predict == 3, 0], x[y_predict == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4')  #for fourth cluster
mtp.scatter(x[y_predict == 4, 0], x[y_predict == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5')  #for fifth cluster
mtp.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300, c = 'yellow', label = 'Centroid')
mtp.title('Clusters of customers')
mtp.xlabel('Annual Income (k$)')
mtp.ylabel('Spending Score (1-100)')
mtp.legend()
mtp.show()
In the above lines of code, we have written an mtp.scatter() call for each of the clusters, numbered 1 to 5. In each call, the first coordinate, e.g. x[y_predict == 0, 0], selects the x values (Annual Income) of the points belonging to that cluster from the matrix of features, and the second coordinate selects the corresponding y values (Spending Score).

Output:
Fig. 4.29: Clusters of customers — Annual Income (k$) vs. Spending Score (1-100), showing the five clusters and their centroids
The output image clearly shows the five different clusters with different colors. The clusters are formed between two parameters of the dataset: the annual income of the customer and the spending score. We can change the colors and labels as per requirement or choice. We can also observe some points from the above patterns, which are given below:
• Cluster 1 shows the customers with average salary and average spending, so we can categorize these customers as average.
• Cluster 2 shows the customers with a high income but low spending, so we can categorize them as careful.
• Cluster 3 shows low income and also low spending, so they can be categorized as sensible.
• Cluster 4 shows the customers with low income but very high spending, so they can be categorized as careless.
• Cluster 5 shows the customers with high income and high spending, so they can be categorized as target, and these customers can be the most profitable customers for the mall owner.
DIMENSIONALITY REDUCTION
What is the significance of Dimension Reduction in machine learning and predictive modelling?
The issue of an undesired increase in dimension is closely linked to the practice of measuring or recording data at a far more detailed level than in previous times. This is by no means a recent issue, but it has become increasingly important due to a significant increase in data.
Recently, there has been a significant surge in the utilisation of sensors in business. Sensors consistently capture and store data for subsequent analysis, and there can be a significant amount of redundancy in the captured data. Consider, for instance, the scenario of a motorbike racer participating in racing contests. His location and motion are determined by a GPS sensor on his bike, gyro meters, various video feeds, and his smart watch. Due to individual discrepancies in the recording process, the data will not be identical; however, these additional sources provide minimal extra information about his position. Suppose an analyst has access to all this data and is tasked with analysing the racing strategy of the biker. They will encounter numerous factors or dimensions that are comparable and provide little to no additional information. This issue pertains to the presence of excessive, undesired dimensions and requires a method for reducing the number of dimensions.
Now, let's examine further instances of innovative methods for gathering data:
• Casinos are collecting data by means of surveillance cameras and monitoring the activities of their customers.
• Political parties are collecting data by broadening their presence in the field.
• Smartphone applications gather extensive personal information about users.
• The set-top box gathers data regarding programme preferences and viewing schedules.
• Organisations are assessing the worth of their brand by analysing social media interactions such as comments, likes, number of followers, and the overall sentiment expressed, both positive and negative.
Increasing the number of variables leads to an increase in difficulties. To mitigate this issue, dimension reduction approaches offer a solution.

Dimension Reduction Techniques


Dimension reduction strategies refer to a set of methods used to reduce the number of variables or features in a dataset
while preserving its essential information.
Dimension reduction is the procedure of transforming a high-dimensional dataset into a lower-dimensional
representation while preserving the essential information. These strategies are commonly employed in machine learning
problem-solving to get enhanced features for classification or regression tasks.
Let us examine the image displayed below. The diagram displays two dimensions, x1 and x2, representing measurements of various objects in centimetres (x1) and inches (x2). When utilising both of these dimensions in machine learning, they will provide comparable information and introduce significant interference in the system. Therefore, it is more advantageous to utilise only one dimension. We have transformed the data from a two-dimensional format, involving variables x1 and x2, into a one-dimensional one represented by z1. This conversion has simplified the data and made it more easily understandable.
Fig. 4.30: Two correlated dimensions, x1 (centimetres) and x2 (inches), reduced to a single dimension z1
We can reduce the number of dimensions in a data collection from n to k, where k is less than n, using comparable
methods. The k dimensions can be identified directly or can be a combination of dimensions (weighted averages of
dimensions) or new dimension(s) that effectively represent several existing dimensions.
Image processing is a widely used application of this method. You may have encountered the Facebook application titled
"Which Celebrity Do You Resemble?”. However, have you ever contemplated the underlying algorithm employed in this?
Here is the solution: In order to determine the corresponding celebrity image, we employ pixel data, with each pixel
representing a single dimension. Each image contains a large number of pixels, which corresponds to a high number of
dimensions. Each dimension holds significance in this context. Arbitrarily excluding dimensions is not permissible in order to
enhance the comprehensibility of your entire dataset. Dimension reduction approaches are employed in such scenarios to
identify the important dimension(s) through the use of various methods. We will address these strategies briefly.
What benefits does Dimension Reduction offer?
Now, let's examine the advantages of implementing the Dimension Reduction process:
It facilitates data compression and decreases the necessary storage space.
It reduces the amount of time needed to conduct identical calculations. Reducing the number of dimensions results in decreased computational requirements. Additionally, a lower number of dimensions enables the use of algorithms that are not suitable for high-dimensional data.
It addresses the issue of multicollinearity, which enhances the performance of the model. It eliminates superfluous characteristics. For instance, it is unnecessary to store a value in two distinct units, such as metres and inches.
By reducing the dimensions of data to either two or three, we can accurately plot and visualise it. Subsequently, one can discern patterns with more clarity. Below, you can observe the process of converting 3-D data into 2-D: initially, the 2-D plane is established, and the points are then depicted on the two newly defined axes, z1 and z2.
® The number of input features, variables, or columns present in a given dataset is known as dimensionality, and the
process to reduce these features is called dimensionality reduction.
® A dataset contains a huge number of input features in various cases, which makes the predictive modelling task
more complicated. Because it is very difficult to visualize or make predictions for the training dataset with a high
number of features, for such cases, dimensionality reduction techniques are required to use.

• The dimensionality reduction technique can be defined as "a way of converting a higher-dimensional dataset into a lower-dimensional dataset while ensuring that it provides similar information." These techniques are widely used in machine learning for obtaining a better-fit predictive model while solving classification and regression problems.
e It is commonly used in the fields that deal with high-dimensional data, such as speech recognition, signal
processing, bioinformatics etc. It can also be used for data visualization, noise reduction, cluster analysis etc.

The common dimensionality reduction techniques can be grouped as follows:
• Feature Selection: Missing Value Ratio, Low Variance Filter, High Correlation Filter, Random Forest, Backward Feature Elimination, Forward Feature Selection.
• Components/Factors Based: Factor Analysis, Principal Component Analysis, Independent Component Analysis.
• Projection Based: ISOMAP, t-SNE, UMAP.
Fig. 4.31: Taxonomy of dimensionality reduction techniques
Advantages of Applying Dimensionality Reduction
• By reducing the dimensions of the features, the space required to store the dataset also gets reduced.
• Less computation/training time is required for reduced dimensions of features.
• Reduced dimensions of the features of the dataset help in visualizing the data quickly.
• It removes the redundant features (if present) by taking care of multicollinearity.
Disadvantages of Dimensionality Reduction
There are also some disadvantages of applying dimensionality reduction, which are given below:
• Some data may be lost due to dimensionality reduction.
• In the PCA dimensionality reduction technique, the principal components required to be considered are sometimes unknown.
What are the common methods to perform Dimension Reduction?
There are many methods to perform dimension reduction. The most common methods are listed below:
1. Missing Values: When we come across missing values when analysing data, how should we proceed? To begin, we should first determine the cause and then address the missing data or eliminate variables using suitable approaches. However, what if we encounter an excessive number of missing values? Should we replace missing values with imputed values or remove the variables entirely?
I would choose the latter option, as the variable would then carry few specifics about the dataset and would not contribute to enhancing the efficacy of the model. Next, is there a specific threshold for the number of missing values that would warrant deleting a variable? The answer differs depending on the specific circumstances, but if a variable carries relatively little information, it can be discarded once it has more than around 40-50% missing values.
2. Low Variance: Consider a situation where all observations in our dataset have the same value, say 5, indicating a constant variable. Do you believe it has the potential to enhance the efficacy of the model? No, because it has zero variance. If there are a large number of dimensions, it is advisable to exclude variables with low variance in comparison to others, as these variables will not effectively account for the variation in the target variables.
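scikit-learn provides VarianceThreshold for exactly this filter. A minimal sketch, assuming X is a numeric feature matrix (the 0.01 threshold is illustrative):
from sklearn.feature_selection import VarianceThreshold

#Drop features whose variance is below the chosen threshold
selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X)
print(X.shape, '->', X_reduced.shape)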
3. Decision Trees: This is a strategy that I particularly favour. It serves as a comprehensive approach to address various issues, such as handling missing values and outliers, and identifying relevant variables. It performed effectively during our Data Hackathon as well: multiple data scientists employed decision tree algorithms and achieved successful outcomes.
4. Random Forest: Random Forest is a method that is similar to a decision tree. I would suggest utilising the inherent feature importance offered by random forests to choose a reduced set of input features. It is important to note that random forests tend to show a bias towards variables with a higher number of distinct values, meaning they favour numeric variables over binary or categorical values.
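A minimal sketch of this idea, assuming X and y are a feature matrix and target vector (keeping the top 10 features is an arbitrary choice):
import numpy as np
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
#Rank features by impurity-based importance and keep the 10 most important
top_features = np.argsort(forest.feature_importances_)[::-1][:10]
print(top_features)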
5. Strong Correlation: Dimensions that have a strong correlation can negatively impact the model's performance. Furthermore, it is undesirable to have several variables that contain comparable information or variance, a phenomenon commonly referred to as "multicollinearity". To locate variables with high correlation, you can utilise either the Pearson correlation matrix for continuous data or the polychoric correlation matrix for discrete variables. Once identified, you can select among these highly correlated variables using the Variance Inflation Factor (VIF). Variables with a VIF greater than 5 can be eliminated.
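A minimal sketch of the correlation filter with pandas, assuming df is a DataFrame of continuous features (the 0.9 cut-off is illustrative):
import numpy as np

corr = df.corr().abs()  #absolute Pearson correlation matrix
#Keep only the upper triangle so each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
#Flag one variable from each highly correlated pair for removal
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
df_reduced = df.drop(columns=to_drop)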
6. Backward Feature Elimination: This method begins with all n dimensions. Calculate the sum of squared residuals (SSR) after removing each variable individually, repeating this process n times. Next, we find the variable whose removal results in the smallest increase in the SSR and remove it, resulting in a dataset with n-1 input features.
Continue this procedure until no remaining variables can be eliminated. In the Online Hackathon conducted by Analytics Vidhya on 11-12 Jun'15, the data scientist who secured the second rank utilised Backward Feature Elimination in linear regression to train their model.
Conversely, we can employ the "Forward Feature Selection" technique. This strategy involves selecting a single variable and evaluating the model's performance after introducing an additional variable. Variable selection is determined by the extent to which it improves model performance; a sketch of both directions with scikit-learn is given below.
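Both directions are available in recent versions of scikit-learn through SequentialFeatureSelector. A minimal sketch with forward selection (the estimator and the number of features to keep are illustrative choices):
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

#Greedily add one feature at a time based on cross-validated performance
sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=5,
                                direction='forward')  #use 'backward' for elimination
X_selected = sfs.fit_transform(X, y)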
7. Factor Analysis: Suppose there exists a strong correlation among certain variables. These variables can be grouped based on their correlations, meaning that all variables within a specific group may exhibit strong correlations with each other, but weak correlations with variables in other group(s). Each group signifies an individual underlying component or factor. These factors are relatively few in comparison to the vast number of dimensions; however, they are challenging to observe directly. There are essentially two approaches to conducting factor analysis:
• Exploratory Factor Analysis (EFA)
• Confirmatory Factor Analysis (CFA)
8. Principal Component Analysis (PCA): PCA is a method that involves transforming the variables into a new set of variables that are linear combinations of the original variables. The new set of variables is referred to as principal components. The components are obtained such that the first principal component captures as much of the variation in the original data as possible, followed by each subsequent component having the largest possible remaining variance.
The second principal component must be perpendicular to the first principal component. Put simply, it strives to capture the remaining variability in the data that is not accounted for by the first principal component. A two-dimensional dataset can have a maximum of two principal components. Displayed below is a summary of the data along with its first and second principal components. It is evident that the second principal component is perpendicular to the first principal component.

Fig. 4.32: Data plotted with its 1st and 2nd principal components (the 2nd is perpendicular to the 1st)
The principal components are influenced by the scale of measurement. To address this problem, it is necessary to standardise the variables prior to performing PCA; otherwise, applying PCA to your dataset loses its meaning. Also, if interpretability of the results is a priority for your analysis, PCA is not the appropriate technique for your project.
PRINCIPAL COMPONENT ANALYSIS (PCA)
Principal Component Analysis (PCA) is a statistical technique used to reduce the dimensionality of a dataset while
retaining as much information as possible. It is commonly used in data analysis and machine learning to identify the most
important features or variables that contribute the most to the overall variance in the data.
Principal Component Analysis (PCA) is a widely used unsupervised learning method that is employed to decrease the dimensionality of extensive datasets. It enhances interpretability while simultaneously reducing information loss. It facilitates the identification of the most prominent characteristics in a dataset and simplifies the process of visualising the data in both two and three dimensions. Principal Component Analysis (PCA) facilitates the identification of a series of linear combinations of variables.

Fig. 4.33: Points on a 2-D plane with the two principal components PC1 and PC2
In the above figure, we have several points plotted on a 2-D plane. There are two principal components. PC1 is the
primary principal component that explains the maximum variance in the data. PC2 is another principal component that is
orthogonal to PC1.
A principal component is a linear combination of the original variables in a dataset that captures the maximum amount of
variance in the data.
The Principal Components refer to a linear representation that effectively represents the majority of the variability
included in the data. They possess both a specific orientation and a measurable size. Principal components refer to the
orthogonal projections, or perpendicular projections, of data onto a lower-dimensional space.
Having grasped the fundamental concepts of PCA, we will now go into the subsequent aspect of PCA in the field of
Machine Learning.
Dimensionality here:
Dimensionality refers to the number of dimensions or variables in a given dataset or problem.
Dimensionality refers to the number of aspects or variables utilised in the investigation. Visualising and interpreting the
relationships between variables can provide challenges when working with high-dimensional data, such as datasets
containing a large number of variables. Dimensionality reduction approaches such as PCA are employed to retain the most
essential data while decreasing the number of variables in the dataset. PCA converts the original variables into a new set of
variables called principal components, which are linear combinations of the original variables. The dimensionality of the reduced dataset is determined by the number of principal components utilised in the analysis. The goal of PCA is to identify a reduced set of principal components that capture the most significant variation in the data. By lowering the dataset's dimensionality, PCA can optimise data processing, improve visualisation, and facilitate the identification of patterns and correlations among variables.
The mathematical formulation of dimensionality reduction in the context of Principal Component Analysis (PCA) can be
expressed as:
The objective of Principal Component Analysis (PCA) is to convert the initial variables in a dataset, represented by the
n X p data matrix X, into a new collection of k variables known as principal components. These components aim to capture
the most substantial variation contained in the data. The primary components are determined by calculating linear
combinations of the original variables according to the following formula:
PC_1 = a_11*x_1 + a_12*x_2 + ... + a_1p*x_p
PC_2 = a_21*x_1 + a_22*x_2 + ... + a_2p*x_p
...
PC_k = a_k1*x_1 + a_k2*x_2 + ... + a_kp*x_p
The term a_ij represents the loading or weight of variable x_j on principal component PC_i, where x_j refers to the jth variable in the data matrix X. The principal components are arranged in a specific order, with PC_1 capturing the highest amount of variation in the data, PC_2 capturing the second highest amount of variation, and so forth. The value of k, which represents the number of principal components utilised in the analysis, directly determines the reduced dimensionality of the dataset.
Correlation:
Correlation refers to the statistical relationship between two or more variables.
Correlation is a statistical term that quantifies the direction and magnitude of the linear relationship between two
variables. In the context of Principal Component Analysis (PCA), the covariance matrix is computed to represent the
pairwise correlations between all variables in the dataset. This matrix is a square matrix. The diagonal members of the
covariance matrix represent the variance of each variable, whereas the off-diagonal elements reflect the covariances between
distinct pairs of variables. The correlation coefficient, which ranges from —1 to 1, is a standardised statistic used to determine
the degree and direction of the linear relationship between two variables.
A correlation value of 0 indicates the absence of a linear relationship between the two variables, whereas correlation coefficients of 1 and -1 indicate perfect positive and negative correlations, respectively. The principal components in PCA are linear combinations of the original variables that maximise the amount of variation accounted for in the data. The calculation of the principal components involves the utilisation of the correlation matrix.
Within the context of Principal Component Analysis (PCA), correlation is mathematically expressed in the following manner:
The correlation matrix C is a symmetric matrix of size p x p, where p represents the number of variables (x_1, x_2, ..., x_p) in the dataset. It contains the correlation coefficients between these variables.
The correlation coefficient between two variables x_i and x_j is given by:
C_ij = cov(x_i, x_j) / (sd(x_i) * sd(x_j))
where sd(x_i) and sd(x_j) denote the standard deviations of variables x_i and x_j, and cov(x_i, x_j) denotes the covariance between x_i and x_j.
In matrix notation, for a column-standardised data matrix X, the correlation matrix can be expressed as:
C = (X^T X) / (n - 1)
where X^T is the transpose of X and n is the number of observations.
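As a quick sanity check of this identity, for a column-standardised data matrix the two computations agree. A minimal numpy sketch on synthetic data:
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  #standardise each column

C_manual = Xs.T @ Xs / (len(Xs) - 1)  #C = X^T X / (n - 1)
C_numpy = np.corrcoef(X, rowvar=False)
print(np.allclose(C_manual, C_numpy))  #True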
Orthogonality
The term "orthogonality" refers to the property of the principal components in the PCA technique whereby they are constructed to be perpendicular to one another. This means that there is no superfluous information shared among the principal components and that they are not interrelated.
The concept of orthogonality in Principal Component Analysis (PCA) can be mathematically defined as follows: each principal component is constructed to maximise the amount of variance it explains, while also satisfying the condition that it is perpendicular to all other principal components. The principal components are calculated as linear combinations of the original variables; therefore, every principal component is ensured to reflect a distinct and non-duplicative portion of the variability in the data.
The orthogonality constraint is defined as:
a_i1*a_j1 + a_i2*a_j2 + ... + a_ip*a_jp = 0
for all values of i and j where i is not equal to j. Consequently, the dot product of the loading vectors of distinct principal components is zero, signifying their orthogonality.
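This constraint can be verified numerically: the matrix of PCA loading vectors multiplied by its own transpose is the identity matrix. A minimal sketch with scikit-learn on synthetic data:
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

pca = PCA(n_components=3).fit(X)
#Rows of components_ are the loading vectors; off-diagonal entries are ~0
print(np.round(pca.components_ @ pca.components_.T, 6))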

PCA Implementation
Example: Let's take a look at how PCA can be implemented in Scikit-Learn. We will be using the Mushroom
classification dataset for this.
Download Datasets: https://www.kaggle.com/datasets/uciml/mushroom-classification
First, we need to import all the modules we need, which includes PCA, train_test_split, and labeling and scaling tools:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore")
After we load in the data, we will check for any null values. We will also encode the data with the LabelEncoder. The class feature is the first column in the dataset, so we split up the features and labels accordingly:
m_data = pd.read_csv('mushrooms.csv')

# Machine learning systems work with integers, we need to encode these


# string characters into ints

encoder = LabelEncoder()

# Now apply the transformation to all the columns:


for col in m_data.columns:
m_data[col] = encoder.fit_transform(m_data[col])

X_features = m_data.iloc[:,1:23]
y_label = m_data.iloc[:, 0]
We will now scale the features with the standard scaler. This is optional as we aren't actually running the classifier, but it
may impact how our data is analyzed by PCA:
# Scale the features
scaler = StandardScaler()
X_features = scaler.fit_transform(X_features)
We will now use PCA to get the list of features and plot which features have the most explanatory power, i.e., the most variance. These are the principal components. It looks like around 17 or 18 of the features explain the majority, almost 95%, of the variance in our data:
# Visualize
pca = PCA()
pca.fit_transform(X_features)
pca_variance = pca.explained_variance_

plt.figure(figsize=(8, 6))
plt.bar(range(22), pca_variance, alpha=0.5, align='center', label='individual variance')
plt.legend()
plt.ylabel('Variance ratio')
plt.xlabel('Principal components')
plt.show()

Fig. 4.34: Individual variance of each principal component (y-axis: variance ratio; x-axis: principal components)
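To make the "almost 95%" claim concrete, we can inspect the cumulative explained variance ratio (a small optional check of our own, continuing the example above; pca is the object fitted earlier):

import numpy as np

# Cumulative proportion of variance explained by the first k components
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(np.round(cumulative, 3))

# Smallest number of components whose cumulative ratio reaches 95%
print(int(np.argmax(cumulative >= 0.95)) + 1)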
Let's convert the features into the 17 top features. We will then plot a scatter plot of the data point classification based on
these 17 features:
pca2 = PCA(n_components=17)
pca2.fit(X_features)
x_3d = pca2.transform(X_features)

plt.figure(figsize=(8,6))
plt.scatter(x_3d[:,0], x_3d[:,1], c=m_data['class'])
plt.show()
Fig. 4.35: Scatter plot of the data on the first two of the 17 selected components, coloured by class
Let's also do this for the top 2 features and see how the classification changes:
pca3 = PCA(n_components=2)
pca3.fit(X_features)
x_3d = pca3.transform(X_features)

plt.figure(figsize=(8,6))
plt.scatter(x_3d[:,0], x_3d[:,1], c=m_data['class'])
plt.show()

Fig. 4.36: Scatter plot of the data on the top two principal components, coloured by class
SINGULAR VALUE DECOMPOSITION (SVD)
The primary objective of Singular Value Decomposition (SVD) is to streamline a matrix and facilitate computations
involving the matrix. The matrix is decomposed into its individual components, akin to the objective of Principal Component
Analysis (PCA). While it is not essential to comprehend all the intricacies of Singular Value Decomposition (SVD) in order
to apply it in your machine learning models, possessing a basic understanding of its functioning will enhance your ability to
determine its appropriate usage.
Singular Value Decomposition (SVD) can be performed on matrices that are either complex or real-valued. However, for
the purpose of clarity, we will focus on explaining the process of decomposing a real-valued matrix.
During Singular Value Decomposition (SVD), we are presented with a matrix containing data and our objective is to
decrease the number of columns in the matrix. This process decreases the number of dimensions in the matrix while retaining
the maximum amount of variability in the data.
The decomposition is written as:

A = U * D * V^T

Given a matrix A, it is possible to express this matrix as three separate matrices denoted U, D, and V. Matrix A contains x*y elements, matrix U is an orthogonal matrix with x*x elements, and matrix V is a separate orthogonal matrix with y*y elements. Finally, D is an x*y matrix whose only non-zero entries lie on its diagonal.

Decomposing the matrix places the singular values of the original matrix on the diagonal of D. Orthogonal matrices preserve their useful properties when multiplied by other numbers, and we can exploit this characteristic to obtain an approximation of matrix A: multiplying U and D by the transpose of matrix V reconstructs a matrix equal to the original matrix A.
By decomposing matrix A into U, D, and V, we have three distinct matrices that encapsulate the information of matrix A.
Interestingly, the most important information in these matrices is concentrated in the left-most columns, which correspond to the largest singular values. By selecting only these specific columns, we may obtain a reliable approximation of matrix A. The new matrix is more streamlined and easier to work with, as it possesses significantly fewer dimensions.
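A minimal NumPy sketch of this low-rank approximation idea (the matrix and the rank k are illustrative):

import numpy

# An arbitrary matrix to be approximated
A = numpy.random.default_rng(1).normal(size=(6, 4))

# Thin SVD: A = U @ diag(s) @ Vt, singular values sorted largest first
U, s, Vt = numpy.linalg.svd(A, full_matrices=False)

# Rank-k approximation: keep only the k largest singular values
k = 2
A_k = U[:, 0:k] @ numpy.diag(s[0:k]) @ Vt[0:k, :]

# The Frobenius-norm error shrinks as k grows
print(numpy.linalg.norm(A - A_k))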
Example of Singular Value Decomposition (SVD) Implementation
Singular Value Decomposition (SVD) is frequently employed for image compression. By reducing the amount of information stored in the red, green, and blue channels of the image, a simpler image can be obtained while retaining most of the visual information. Let us attempt to utilise Singular Value Decomposition (SVD) to compress an image and subsequently display it.
We will utilise multiple functions to manage the compression of the image. To execute this task, we simply require the
Numpy library and the Image function from the PIL library. Numpy provides a mechanism to perform the SVD computation.
import numpy
from PIL import Image
First, we will just write a function to load in the image and turn it into a Numpy array. We then want to select the red, green, and blue color channels from the image:
def load_image(image):
image = Image.open(image)
im_array = numpy.array(image)

red = im_array[:, :, 0]
green = im_array[:, :, 1]
blue = im_array[:, :, 2]

return red, green, blue


Having obtained the colours, our next step is to condense the colour channels. To begin, we can initiate the SVD
function from the Numpy library on the desired colour channel. Subsequently, we will generate an array consisting entirely of
zeros, which we will subsequently populate once the matrix multiplication has been executed. Next, we determine the desired
threshold for the singular value to be used during the calculations:
def channel_compress(color_channel, singular_value_limit):
u, s, v = numpy.linalg.svd(color_channel)
compressed = numpy.zeros((color_channel.shape[0], color_channel.shape[1]))
n = singular_value_limit

left_matrix = numpy.matmul(u[:, 0:n], numpy.diag(s)[0:n, 0:n])

inner_compressed = numpy.matmul(left_matrix, v[0:n, :])
compressed = inner_compressed.astype('uint8')
return compressed

red, green, blue = load_image("dog3.jpg")


singular_val_lim = 350
Subsequently, we perform matrix multiplication between the truncated columns of the U matrix and the diagonal matrix of singular values, as previously explained. We obtain the left matrix and then multiply it with the V matrix. To store the compressed values, we convert them to the data type 'uint8':
def compress_image(red, green, blue, singular_val_lim):
compressed_red = channel_compress(red, singular_val_lim)
compressed_green = channel_compress(green, singular_val_lim)
compressed_blue = channel_compress(blue, singular_val_lim)

im_red = Image.fromarray(compressed_red)
im_blue = Image.fromarray(compressed_blue)
im_green = Image.fromarray(compressed_green)

new_image = Image.merge("RGB", (im_red, im_green, im_blue))


new_image.show()
new_image.save("dog3-edited.jpg")

compress_image(red, green, blue, singular_val_lim)


We will be using this image of a dog to test our SVD compression on:

Fig. 4.37: The original test image
We also need to set the singular value limit we'll use; let's start with 350 for now:
red, green, blue = load_image("dog.jpg")
singular_val_lim = 350
Ultimately, we can obtain the condensed values for the three colour channels and convert them from Numpy arrays into
image components using PIL. Next, we simply need to combine the three channels and display the image. The desired image
should possess reduced dimensions and exhibit a more basic and uncomplicated design compared to the original image.

Fig. 4.38: The compressed image produced with the chosen singular value limit
Upon examining the image sizes, it becomes evident that the compressed version is smaller, but with some lossy compression applied. There is also some visual distortion present in the photograph.
You have the option to manipulate and modify the singular value limit. As the selected limit decreases, the compression will increase; however, past a certain threshold image artifacting becomes visible and the image quality deteriorates.
SUBSET SELECTION
Approaches of Dimension Reduction:
There are two ways to apply the dimension reduction technique, which are given below:
Feature Selection:
Feature selection is the process of selecting the subset of the relevant features and leaving out the irrelevant features
present in a dataset to build a model of high accuracy. In other words, it is a way of selecting the optimal features from the
input dataset.
Three methods are used for the feature selection:
1. Filter Methods:
In this method, the dataset is filtered, and a subset that contains only the relevant features is taken. Some common techniques of the filter method are listed below (a short scikit-learn sketch follows the list):
* Correlation
* Chi-Square Test
e ANOVA
* Information Gain etc.
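As an illustration of a filter method, a minimal sketch using scikit-learn's SelectKBest with the chi-square test (the Iris dataset is an illustrative stand-in; note that chi2 requires non-negative feature values):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# Keep the 2 features with the highest chi-square statistic w.r.t. the label
selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)

print(selector.scores_)  # chi-square score of each original feature
print(X_new.shape)       # (150, 2)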
2. Wrapper Methods:
The wrapper method has the same goal as the filter method, but it uses a machine learning model for its evaluation. In this method, some features are fed to the ML model and the performance is evaluated. The performance decides whether to add or remove those features to increase the accuracy of the model. This method is more accurate than the filter method but computationally more expensive. Some common techniques of wrapper methods are:
* Forward Selection.
* Backward Selection.
* Bi-directional Elimination.
3. Embedded Methods:
Embedded methods check the different training iterations of the machine learning model and evaluate the importance of
each feature. Some common techniques of Embedded methods are:
* LASSO
* Elastic Net
* Ridge Regression etc.
Feature Extraction:
Feature extraction is the process of transforming the space containing many dimensions into space with fewer
dimensions. This approach is useful when we want to keep the whole information but use fewer resources while processing
the information.
Some common feature extraction techniques are:
(a) Principal Component Analysis
(b) Linear Discriminant Analysis
(c) Kernel PCA
(d) Quadratic Discriminant Analysis
Common Techniques of Dimensionality Reduction:
(a) Principal Component Analysis
(b) Backward Elimination
(c) Forward Selection
(d) Score comparison
(e) Missing Value Ratio
(f) Low Variance Filter
(g) High Correlation Filter
(h) Random Forest
(i) Factor Analysis
(j) Auto-Encoder
INTRODUCTION TO PRINCIPAL COMPONENT ANALYSIS:
Principal Component Analysis is a statistical process that converts the observations of correlated features into a set of
linearly uncorrelated features with the help of orthogonal transformation. These new transformed features are called
the Principal Components. It is one of the popular tools that is used for exploratory data analysis and predictive modelling.
PCA works by considering the variance of each attribute, because a high-variance attribute tends to show a good split between the classes; hence PCA reduces the dimensionality while preserving such attributes. Some real-world applications of PCA are image processing, movie recommendation systems, and optimizing the power allocation in various communication channels.
Backward Feature Elimination:
The backward feature elimination technique is mainly used while developing Linear Regression or Logistic Regression
model. Below steps are performed in this technique to reduce the dimensionality or in feature selection:
* In this technique, firstly, all the n variables of the given dataset are taken to train the model.
* The performance of the model is checked.
* Now we will remove one feature at a time, train the model on the remaining n − 1 features (doing this n times, once per feature), and compute the performance of the model.
* We will identify the variable whose removal has made the smallest or no change in the performance of the model, and then we will drop that variable; after that, we will be left with n − 1 features.
* Repeat the complete process until no feature can be dropped.
In this technique, by selecting the optimum performance of the model and the maximum tolerable error rate, we can define the optimal number of features required for the machine learning algorithm.
Forward Feature Selection:
Forward feature selection follows the inverse of the backward elimination process. In this technique, we do not eliminate features; instead, we find the best features that produce the highest increase in the performance of the model. The following steps are performed in this technique (a scikit-learn sketch covering both directions follows the list):
o We start with a single feature only, and progressively we will add each feature at a time.
* Here we will train the model on each feature separately.
® The feature with the best performance is selected.
* The process will be repeated until we get a significant increase in the performance of the model.
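Both backward elimination and forward selection are available through scikit-learn's SequentialFeatureSelector. A minimal sketch (the estimator, dataset, and number of features are illustrative choices):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
estimator = LogisticRegression(max_iter=1000)

# Forward selection: start empty and greedily add the most helpful feature
forward = SequentialFeatureSelector(
    estimator, n_features_to_select=2, direction="forward").fit(X, y)
print(forward.get_support())   # boolean mask of the selected features

# Backward elimination: start with all features and greedily drop the least useful
backward = SequentialFeatureSelector(
    estimator, n_features_to_select=2, direction="backward").fit(X, y)
print(backward.get_support())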
Missing Value Ratio:
If a dataset has too many missing values, then we drop those variables as they do not carry much useful information. To
perform this, we can set a threshold level, and if a variable has missing values more than that threshold, we will drop that
variable. The higher the threshold value, the more efficient the reduction.
Low Variance Filter:
As with the missing value ratio technique, data columns with little variation in their data carry less information. Therefore, we calculate the variance of each variable, and all data columns with variance lower than a given threshold are dropped, because such low-variance features will hardly affect the target variable.
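scikit-learn implements this filter directly as VarianceThreshold. A minimal sketch (the toy data and the threshold are illustrative):

import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0.0, 2.0, 0.1],
              [0.0, 1.0, 0.2],
              [0.0, 3.0, 0.1],
              [0.0, 2.5, 0.3]])

# Drop every column whose variance falls below the threshold;
# the first column is constant (zero variance), so it is removed
selector = VarianceThreshold(threshold=0.001)
X_reduced = selector.fit_transform(X)
print(X_reduced.shape)  # (4, 2)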
High Correlation Filter:
High correlation refers to the case where two variables carry approximately the same information, which can degrade the performance of the model. The correlation between independent numerical variables is measured by the correlation coefficient; if this value exceeds a chosen threshold, we can remove one of the two variables from the dataset, preferably keeping the one that shows a higher correlation with the target variable.
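A minimal pandas sketch of this filter (the data, the 0.9 threshold, and the choice of which variable to drop are illustrative):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.01, size=200),  # nearly a duplicate of "a"
    "c": rng.normal(size=200),
})

# Upper triangle of the absolute correlation matrix (diagonal excluded)
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop one variable from every pair whose correlation exceeds the threshold
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(to_drop)                      # ['b']
df_reduced = df.drop(columns=to_drop)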
Random Forest:
Random Forest is a popular and very useful feature selection algorithm in machine learning. This algorithm contains an
in-built feature importance package, so we do not need to program it separately. In this technique, we need to generate a large
set of trees against the target variable, and with the help of usage statistics of each attribute, we need to find the subset of
features.
Advanced Algorithms in Al and ML 430 Unsupervised Learning: Clustering Algorithm
The random forest algorithm takes only numerical variables, so we need to convert the input data into numeric data using one-hot encoding.
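A minimal sketch of reading the in-built importance scores (the wine dataset and forest size are illustrative choices):

from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

data = load_wine()

# The fitted forest exposes one importance score per feature
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(data.data, data.target)

for name, score in zip(data.feature_names, forest.feature_importances_):
    print(f"{name}: {score:.3f}")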
Factor Analysis:
Factor analysis is a technique in which each variable is kept within a group according to the correlation with other
variables, it means variables within a group can have a high correlation between themselves, but they have a low correlation
with variables of other groups.
We can understand it with an example: suppose we have two variables, Income and Spend. These two variables have a high correlation, which means people with high income spend more, and vice versa. Such variables are put into a group,
and that group is known as the factor. The number of these factors will be reduced as compared to the original dimension of
the dataset.
Auto-encoders:
One of the popular methods of dimensionality reduction is the auto-encoder, which is a type of ANN (artificial neural network) whose main aim is to copy its inputs to its outputs. The input is compressed into a latent-space representation, and the output is reconstructed from this representation. It has mainly two parts (a minimal Keras sketch follows the list):
* Encoder: The function of the encoder is to compress the input to form the latent-space representation.
* Decoder: The function of the decoder is to recreate the output from the latent-space representation.
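A minimal Keras sketch of this encoder/decoder pair (layer sizes and activations are illustrative; a real auto-encoder would also be fitted on data):

from tensorflow import keras
from tensorflow.keras import layers

input_dim, latent_dim = 784, 32

# Encoder: compress the input into the latent-space representation
inputs = keras.Input(shape=(input_dim,))
encoded = layers.Dense(latent_dim, activation="relu")(inputs)

# Decoder: reconstruct the input from the latent representation
decoded = layers.Dense(input_dim, activation="sigmoid")(encoded)

autoencoder = keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.summary()
# Training would then copy inputs to outputs: autoencoder.fit(X, X, epochs=...)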

1. What is K-means clustering?
2. Describe the working of the K-means algorithm.
3. Explain the failure of K-means.
4. Write the implementation of the K-means algorithm.
5. What is dimensionality reduction?
6. Describe subset selection.
7. Describe Principal Component Analysis (PCA).

6...
Deep Learning for Sequential and Image Data
Chapter Outcomes...
After reading this chapter, students will be able to understand:
* The concept of sequential data.
* Meaning of RNN, LSTM, LSTM-GRU.
* The concept of transformers and GPT.
* The concept of image data, CNN.
* The concept of neural networks, transfer learning and fine tuning.

Learning Objectives...
® Implement Deep Learning for sequential data.
® Implement Deep Learning for Image data.

SEQUENTIAL DATA: RNN, LSTM, LSTM-GRU, INTRODUCTION TO TRANSFORMERS, GPT
Recurrent Neural Network (RNN)
A Recurrent Neural Network (RNN) is a type of artificial neural network that is designed to process sequential data by using feedback connections.
Neural networks replicate the cognitive abilities of the human brain in the domains of artificial intelligence, machine
learning, and deep learning, enabling computer algorithms to identify patterns and address typical problems.
Recurrent Neural Networks (RNNs) are a specific category of artificial neural networks that are capable of effectively
representing and analysing sequential data. Recurrent Neural Networks (RNNs), derived from feedforward networks, exhibit
behaviour that is analogous to that of human brains. In essence, recurrent neural networks possess the ability to predict
sequential data in a manner that is beyond the capabilities of other algorithms.

Fig. 6.1: A feedforward network (input layer, hidden layers, output layer) rotated anti-clockwise and compressed into the compact recurrent representation
In normal neural networks, the inputs and outputs are considered to be independent. However, in certain situations, such
as predicting the next word in a phrase, the preceding words become crucial and need to be remembered. Consequently, the
Recurrent Neural Network (RNN) was developed, employing a Hidden Layer to address the issue. The fundamental element
of RNN is the Hidden state, which retains precise details regarding a sequence.
Recurrent Neural Networks (RNNs) possess a Memory component that retains comprehensive information pertaining to
the computations. It utilises identical configurations for every input, as it generates the same result by executing the same
operation on all inputs or hidden layers.
The Architecture of a Traditional RNN:
RNNs are a type of neural network that has hidden states and allows past outputs to be used as inputs. They usually go
like this:

Fig. 6.2: An RNN unrolled across time steps t − 1, t, and t + 1
For each time step t, the activation a<t> and the output y<t> are expressed as follows:

a<t> = g1(W_aa · a<t−1> + W_ax · x<t> + b_a)   and   y<t> = g2(W_ya · a<t> + b_y)

where W_aa, W_ax, W_ya, b_a, b_y are coefficients that are shared across all time steps, and g1, g2 are activation functions.
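These two equations translate directly into code. A minimal NumPy sketch of a single time step (the dimensions and the linear output g2 are illustrative simplifications):

import numpy as np

rng = np.random.default_rng(0)
n_x, n_a, n_y = 3, 5, 2   # input, hidden, and output sizes

# Parameters shared across every time step
W_aa = rng.normal(size=(n_a, n_a))
W_ax = rng.normal(size=(n_a, n_x))
W_ya = rng.normal(size=(n_y, n_a))
b_a, b_y = np.zeros(n_a), np.zeros(n_y)

def rnn_step(a_prev, x_t):
    # a<t> = g1(W_aa a<t-1> + W_ax x<t> + b_a), with g1 = tanh
    a_t = np.tanh(W_aa @ a_prev + W_ax @ x_t + b_a)
    # y<t> = g2(W_ya a<t> + b_y); g2 left as the identity for simplicity
    y_t = W_ya @ a_t + b_y
    return a_t, y_t

a, x = np.zeros(n_a), rng.normal(size=n_x)
a, y = rnn_step(a, x)
print(a.shape, y.shape)   # (5,) (2,)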

Fig. 6.3: The internal structure of a single RNN cell
RNN architecture can vary depending on the problem you are trying to solve. From those with a single input and output
to those with many (with variations between).
Below are some examples of RNN architectures that can help you better understand this.
* One To One: There is only one pair here. A one-to-one architecture is used in traditional neural networks.
* One To Many: A single input in a one-to-many network might result in numerous outputs. One-to-many networks are used, for example, in the production of music.
* Many To One: In this scenario, a single output is produced by combining many inputs from distinct time steps.
Sentiment analysis and emotion identification use such networks, in which the class label is determined by a
sequence of words.
e Many To Many: For many to many, there are numerous options. Two inputs yield three outputs. Machine
translation systems, such as English to French or vice versa translation systems, use many to many networks.
* Recurrent Neural Networks or RNNs consist of directed connections that form a cycle, allowing the output provided by the LSTMs to be used as input in the current phase of the RNNs.
* These inputs are deeply embedded, and the memorization ability of the LSTMs lets these inputs get absorbed for a period in the internal memory.
Advanced Algorithms in Al and ML 6.3 Deep Leamning for Sequential and Image Data

* RNNs are therefore dependent on the inputs that are preserved by LSTMs and work under the synchronization
phenomenon of LSTMs.
© RNNs are mostly used in captioning images, time series analysis, recognizing handwritten data, and translating data
to machines.
* RNNs follow the working approach of feeding the output at time (t − 1) back as input at time t. Next, the output determined at time t is fed as input at time t + 1.
* Similarly, these processes are repeated for inputs of any length.
* Another fact about RNNs is that they store historical information, and there is no increase in the input size even if the model size is increased.
* RNNs look something like this when unfolded.

Fig. 6.4: An unfolded RNN showing the hidden state, the input at time t, and the output at time t
Working of Recurrent Neural Networks:
The information in recurrent neural networks cycles through a loop to the middle hidden layer.

Fig. 6.5: Information cycling through the loop of a recurrent network
The input layer, denoted as x, receives and processes the input data of the neural network before transmitting it to the
middle layer.
The middle layer h can consist of multiple hidden layers, each having its own activation functions, weights, and biases. In an ordinary neural network, these hidden layers are independent of one another, i.e., the network has no memory; a recurrent neural network is employed when the preceding inputs must influence later ones.
The Recurrent Neural Network will standardise the activation functions, weights, and biases so that each hidden layer possesses identical parameters. Instead of generating many hidden layers, it will produce a single layer and iterate over it as many times as needed.
Common Activation Functions:
A neuron's activation function dictates whether it should be turned on or off. Nonlinear activation functions usually transform a neuron's output to a number between 0 and 1 or between −1 and 1.

Fig. 6.6: The ReLU activation function, g(z) = max(0, z)
The following are some of the most commonly utilized functions:
* Sigmoid: g(z) = 1 / (1 + e^(−z))
* Tanh: g(z) = (e^z − e^(−z)) / (e^z + e^(−z))
* ReLU: g(z) = max(0, z)
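A minimal NumPy sketch of the three functions (helper names are our own):

import numpy as np

def sigmoid(z):
    # Squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Squashes any real number into (-1, 1)
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

def relu(z):
    # Zero for negative inputs, identity for positive inputs
    return np.maximum(0, z)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), tanh(z), relu(z))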
Pros and cons of Recurrent Neural Networks (RNN)
Benefits of Recurrent Neural Networks (RNNs):
o Efficiently process sequential data, such as text, voice, and time series.
* Unlike feedforward neural networks, this model is capable of processing inputs of any length.
* By distributing weights over multiple time steps, the efficiency of training is improved.
Drawbacks of Recurrent Neural Networks (RNNs):
* Susceptible to the issues of vanishing and exploding gradients, which impede the learning process.
* Training can be arduous, particularly for lengthy stretches.
o Characterised by a slower computational speed compared to alternative neural network topologies.
Recurrent Neural Network Vs Feedforward Neural Network:
A feed-forward neural network is characterised by a unidirectional flow of information, specifically from the input layer
to the output layer, while traversing the hidden layers. The data traverses the network on a direct path, without passing
through any node more than once.
The transfer of information between a recurrent neural network (RNN) and a feed-forward neural network is illustrated
in the two figures provided.

Fig. 6.7: A recurrent neural network compared with a feed-forward neural network
Feed-forward neural networks exhibit limited predictive capabilities due to their lack of information retention. Due to its
limited capacity to assess only the present input, a feed-forward network lacks the ability to perceive temporal sequencing.
With the exception of its training, it lacks any recollection of previous events.

The information is processed in a recurrent neural network (RNN) by a looping mechanism. Prior to forming a
conclusion, it assesses the present input and incorporates knowledge gained from previous inputs. In contrast, a recurrent
neural network has the ability to remember information through its internal memory. The system generates output, duplicates
it, and subsequently transmits it back to the network.
Backpropagation Through Time (BPTT)
Backpropagation via time refers to the application of the Backpropagation algorithm to a Recurrent Neural Network that
takes time series data as input.
In a typical Recurrent Neural Network (RNN), just one input is processed at a time, resulting in a single output. In backpropagation through time, by contrast, both the present and previous inputs are used. A timestep is the term used when numerous time series data points are fed into the RNN in sequence.
Fig. 6.8: An RNN unrolled through time for backpropagation through time (inputs x_0 ... x_n, outputs y_0 ... y_n, with shared weight matrices U, V, W)
Once the network has trained on a time set and produced an output, that output is used to calculate and accumulate the errors. The network is then rolled back up, and the weights are recalculated and adjusted to account for the errors.
Two Issues of Standard RNNs:
There are two key challenges that RNNs have had to overcome, but in order to comprehend them, one must first grasp
what a gradient is.
Fig. 6.9: Exploding gradients (the derivative size grows without bound) vs. vanishing gradients (the derivative size shrinks towards zero)
A gradient is a partial derivative with relation to its inputs. If you are uncertain about the implications, take into account
the following: A gradient measures the extent to which the output of a function changes when the inputs are altered slightly.
The slope of a function is synonymous with its gradient. A model’s learning speed increases as the slope becomes
steeper, resulting in a higher gradient. Conversely, the model will cease learning if the slope is zero. A gradient is employed
to quantify the variation in all weights with respect to the variation in error.
Exploding Gradients: Exploding gradients refer to a situation in which the algorithm assigns excessively high importance to the weights without any clear justification. Fortunately, the problem can be easily resolved by truncating or squashing the gradients.
Vanishing Gradients: Vanishing gradients refer to the situation where the values of the gradients become extremely small, resulting in the model either ceasing to learn or taking an excessively long time to learn. This problem was of significant concern during the 1990s and posed a greater challenge to resolve compared to the issue of exploding gradients. Fortunately, it was resolved by Sepp Hochreiter and Juergen Schmidhuber's LSTM idea.

RNN Applications:
Applications of Recurrent Neural Networks (RNN).
Recurrent Neural Networks are employed to address a range of issues related to sequential data. Several forms of
sequence data exist, with the following being the most prevalent: Audio, text, video, and biological sequences.
By utilising recurrent neural network (RNN) models and sequence datasets, you can address a wide range of issues, such
as:
* Automatic Speech Recognition.
* Music Generation.
* Machine Translations.
* Video Action Analysis.
® Genomic and DNA Sequencing Analysis.
Basic Python Implementation (RNN with Keras)
Import the required libraries:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
Here is a simple Sequential model that processes integer sequences, embeds each integer into a 64-dimensional vector,
and then uses an LSTM layer to handle the sequence of vectors.
model = keras.Sequential()
model.add(layers.Embedding(input_dim=1000, output_dim=64))
model.add(layers.LSTM(128))
model.add(layers.Dense(10))
model.summary()
Output:

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding (Embedding)        (None, None, 64)          64000
lstm (LSTM)                  (None, 128)               98816
dense (Dense)                (None, 10)                1290
=================================================================
Total params: 164,106
Trainable params: 164,106
Non-trainable params: 0
Long Short-Term Memory in Machine Learning
LSTM, short for Long Short-Term Memory, is a popular recurrent neural network (RNN) architecture that is extensively utilized in the field of Deep Learning. Its proficiency lies in capturing extended dependencies over time, rendering it well-suited for problems involving the prediction of sequences.
LSTM, in contrast to conventional neural networks, integrates feedback connections, enabling it to handle complete
sequences of data rather than individual data points. This feature renders it exceptionally efficient at comprehending and
foretelling patterns in sequential data like time series, text, and speech.
LSTM has become a powerful tool in artificial intelligence and deep learning, enabling breakthroughs in various fields
by uncovering valuable insights from sequential data.

Long Short-Term Memory (LSTM) Architecture:


In the preceding section, we were introduced to the concept of long short-term memory (LSTM) and its ability to address
the issue of the vanishing gradient problem encountered by recurrent neural networks (RNNs). In this section, we will delve
into the specifics of how LSTM overcomes this difficulty by acquiring knowledge of its design. At a conceptual level, LSTM
operates in a similar manner to an RNN cell. Below is an explanation of the internal mechanisms of the LSTM network. The
LSTM network architecture comprises three components, as depicted in the accompanying image, with each component
serving a distinct purpose.

Fig. 6.10: The three stages of an LSTM cell: forget irrelevant information, add or update new information, and pass the updated information on
The Logic Behind LSTM:
The initial phase determines whether the data from the preceding timestamp should be retained or disregarded as
inconsequential. In the subsequent phase, the cell endeavours to acquire novel information from the input it receives. Finally,
in the third segment, the cell transfers the changed information from the present timestamp to the subsequent timestamp. A
single-time step in LSTM refers to one complete cycle of the network.
The three components of an LSTM unit are commonly referred to as gates. They regulate the transmission of information
into and out of the memory cell or LSTM cell. The initial gate is referred to as the Forget gate, the subsequent gate is
recognised as the Input gate, and the final gate is denoted as the Output gate. An LSTM unit, comprising three gates and a memory cell (the LSTM cell), can be conceptualised as a layer of neurons in a conventional feedforward neural network, with each neuron possessing a hidden state and a current state.
Fig. 6.11: The three gates of an LSTM cell: the Forget gate, the Input gate, and the Output gate

An LSTM, similar to a basic RNN, possesses a hidden state. Here, H_{t−1} denotes the hidden state of the previous timestamp and H_t the hidden state of the current timestamp. In addition, an LSTM possesses a cell state, denoted C_{t−1} and C_t for the previous and current timestamps respectively.
Here, the hidden state is referred to as short-term memory, whereas the cell state is referred to as long-term memory. Please see the image provided below.
It is interesting to note that the cell state carries the information along with all the timestamps.
Fig. 6.12: An LSTM cell carrying the cell state (C_{t−1} → C_t) and the hidden state (H_{t−1} → H_t) across timestamps
Bob is a nice person. Dan on the other hand is evil.
Example of LSTM Working:
Let us use an example to comprehend the functioning of LSTM. Here we have two phrases separated by a full stop. The initial statement asserts that Bob is a nice person, while the subsequent statement characterises Dan as evil. The first sentence unambiguously refers to Bob, and the introduction of a full stop (.) marks the transition to discussing Dan.
As we transition from the initial sentence to the subsequent sentence, our network should recognise that we are no longer
discussing Bob. Our current focus is on Dan. In this context, the Forget gate of the network enables it to disregard or
eliminate some information. Let us comprehend the functions performed by these gates in the LSTM architecture.
Forget Gate:
In a cell of the LSTM neural network, the first step is to decide whether we should keep the information from the
previous time step or forget it. Here is the equation for forget gate.
f_t = σ(x_t · U_f + H_{t−1} · W_f)
Let us try to understand the equation. Here:
* x_t: input at the current timestamp.
* U_f: weight matrix associated with the input.
* H_{t−1}: the hidden state of the previous timestamp.
* W_f: the weight matrix associated with the hidden state.
Later, a sigmoid function is applied to it, making f_t a number between 0 and 1. This f_t is then multiplied with the cell state of the previous timestamp, as shown below:
C_{t−1} × f_t = 0          if f_t = 0 (forget everything)
C_{t−1} × f_t = C_{t−1}    if f_t = 1 (forget nothing)
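A minimal NumPy sketch of the forget-gate computation (the dimensions and random parameters are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_x, n_h = 4, 3                    # input and hidden-state sizes

x_t = rng.normal(size=n_x)         # input at the current timestamp
H_prev = rng.normal(size=n_h)      # hidden state from the previous timestamp
C_prev = rng.normal(size=n_h)      # cell state from the previous timestamp

U_f = rng.normal(size=(n_h, n_x))  # weights for the input
W_f = rng.normal(size=(n_h, n_h))  # weights for the hidden state

# f_t = sigma(x_t . U_f + H_{t-1} . W_f); every entry lies in (0, 1)
f_t = sigmoid(U_f @ x_t + W_f @ H_prev)

# Element-wise gating of the previous cell state:
# entries of f_t near 0 forget, entries near 1 keep
print(C_prev * f_t)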
LSTM vs RNN:

| Aspect | LSTM (Long Short-Term Memory) | RNN (Recurrent Neural Network) |
|---|---|---|
| Architecture | A type of RNN with additional memory cells. | A basic type of RNN. |
| Memory retention | Handles long-term dependencies and prevents the vanishing gradient problem. | Struggles with long-term dependencies and the vanishing gradient problem. |
| Cell structure | Complex cell structure with input, output, and forget gates. | Simple cell structure with only a hidden state. |
| Handling sequences | Suitable for processing sequential data. | Also designed for sequential data, but with limited memory. |
| Training efficiency | Slower training due to increased complexity. | Faster training due to the simpler architecture. |
| Performance on long sequences | Performs better on long sequences. | Struggles to retain information on long sequences. |
| Usage | Best suited for tasks requiring long-term memory, such as language translation and sentiment analysis. | Appropriate for simple sequential tasks, such as time series forecasting. |
| Vanishing gradient problem | Addresses the vanishing gradient problem. | Prone to the vanishing gradient problem. |
Long-Short Term Memory Networks
* LSTMs can be defined as Recurrent Neural Networks (RNN) that are programmed to learn and adapt to long-term dependencies.
* They can memorize and recall past data for a long period, and by default this is their sole behaviour.
* LSTMs are designed to retain information over time, and hence they are majorly used in time series predictions, because they can retain memory of previous inputs.
* This analogy comes from their chain-like structure consisting of four interacting layers that communicate with each other differently.
* Besides time series prediction, they can be used to construct speech recognizers, for development in pharmaceuticals, and for the composition of music loops as well.
* LSTMs work in a sequence of events. First, they do not tend to remember irrelevant details attained in the previous state.
* Next, they update certain cell-state values selectively and finally generate certain parts of the cell state as output. Below is the diagram of their operation.

Fig. 6.13: The sequence of operations inside an LSTM cell
Gated Recurrent Unit (GRU)
GRU or Gated Recurrent Unit is an advancement over the standard RNN, i.e., the recurrent neural network. GRUs are very similar to Long Short-Term Memory (LSTM). Just like LSTM, GRU uses gates to control the flow of information. They are relatively new compared to LSTM, which is why they offer some improvements over LSTM while having a simpler architecture.

Fig. 6.14: An LSTM cell (with both cell state C and hidden state H) compared to a GRU cell (hidden state H only)
Another interesting thing about GRU is that, unlike LSTM, it does not have a separate cell state (C_t); it only has a hidden state (H_t). Due to this simpler architecture, GRUs are faster to train.
The Architecture of Gated Recurrent Unit:
At each time t, the cell takes an input x_t and the hidden state H_{t−1} from the previous time t − 1. Subsequently, it generates a fresh hidden state H_t, which is then transmitted to the next time stamp.
A GRU cell consists of two gates, whereas an LSTM cell consists of three. The first gate is referred to as the Reset gate, while the second gate is known as the Update gate.

Fig. 6.15: A GRU cell mapping H_{t−1} and x_t to H_t
Reset Gate (Short-term memory):
The Reset gate is accountable for the network's transient (short-term) memory, namely the hidden state H_t. The following is the mathematical expression for the Reset gate:

r_t = σ(x_t · U_r + H_{t−1} · W_r)

If you remember the LSTM gate equation, this is very similar to it. The value of r_t will range from 0 to 1 because of the sigmoid function. Here U_r and W_r are the weight matrices for the reset gate.
Update Gate (Long-term memory):
Similarly, we have an Update gate for long-term memory, and the equation of the gate is shown below:

u_t = σ(x_t · U_u + H_{t−1} · W_u)

The only difference is in the weight matrices, i.e., U_u and W_u.
How GRU Works?
Now, let us examine the operation of these gates. Finding the hidden state H_t in a GRU involves a two-step procedure. The first step is to create the candidate hidden state, as demonstrated below.
Candidate Hidden State:

Ĥ_t = tanh(x_t · U_g + (r_t * H_{t−1}) · W_g)

where * denotes element-wise multiplication. The cell takes in the input x_t and the hidden state from the previous timestamp t − 1, the latter multiplied element-wise by the reset gate output r_t. This entire information is then passed to the tanh function, and the resultant value is the candidate hidden state.

The important aspect of this equation is the use of the reset gate value to regulate the influence of the previous hidden state on the candidate state. If the value of r_t is 1, the complete information from the previous hidden state H_{t−1} is taken into account; if the value of r_t is 0, the information from the previous hidden state is entirely disregarded.
Hidden State:
After obtaining the candidate state, it is utilised to generate the current hidden state H_t. This is where the Update gate becomes relevant. Rather than using a separate gate as in LSTM, the GRU employs a single update gate to regulate both the previous historical information H_{t−1} and the incoming fresh information from the candidate state:

H_t = u_t * H_{t−1} + (1 − u_t) * Ĥ_t

Assuming the value of u_t is around 0, the first term in the equation vanishes, indicating that the new hidden state carries little information from the prior hidden state and primarily comprises the information from the candidate state. Conversely, if the value of u_t is around 1, the second term becomes almost 0, and the current hidden state depends almost entirely on the first term, i.e., the information from the hidden state at the previous timestamp t − 1.
Hence, the value of u_t is critical in this equation, and it can range from 0 to 1.
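Putting the reset gate, update gate, candidate state, and hidden state together, a minimal NumPy sketch of one GRU step (the dimensions and random parameters are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_x, n_h = 4, 3

x_t = rng.normal(size=n_x)
H_prev = rng.normal(size=n_h)

# One (U, W) weight pair per gate plus one for the candidate state
U_r, W_r = rng.normal(size=(n_h, n_x)), rng.normal(size=(n_h, n_h))
U_u, W_u = rng.normal(size=(n_h, n_x)), rng.normal(size=(n_h, n_h))
U_g, W_g = rng.normal(size=(n_h, n_x)), rng.normal(size=(n_h, n_h))

r_t = sigmoid(U_r @ x_t + W_r @ H_prev)              # reset gate
u_t = sigmoid(U_u @ x_t + W_u @ H_prev)              # update gate
H_cand = np.tanh(U_g @ x_t + W_g @ (r_t * H_prev))   # candidate hidden state

# Blend the old memory and the new candidate with the update gate
H_t = u_t * H_prev + (1 - u_t) * H_cand
print(H_t)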

INTRODUCTION TO TRANSFORMERS

The Transformer in NLP is a novel architecture that aims to solve sequence-to-sequence tasks while handling long-range dependencies with ease. The Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.

Transformer’s Model Architecture

Fig. 6.16: The Transformer – Model Architecture (stacked encoder and decoder blocks with multi-head attention, masked multi-head attention, feed-forward layers, positional encodings added to the input and output embeddings, and a final softmax producing output probabilities)


The graphic above provides an excellent depiction of the construction of a Transformer. Let us initially concentrate
solely on the Encoder and Decoder components.
Direct your attention to the image below. The Encoder block consists of a single layer of Multi-Head Attention, which is
then followed by another layer of Feed Forward Neural Network. In contrast, the decoder is equipped with an additional
Masked Multi-Head Attention mechanism.

Fig. 6.17: The Encoder block (self-attention followed by a feed-forward network) and the Decoder block (with an additional masked multi-head attention layer), each repeated N times, with positional encodings added to the input and output embeddings
The encoder and decoder blocks consist of many identical encoders and decoders stacked on top of one another. The encoder stack and the decoder stack possess an equal number of units. The number of encoder and decoder units is a hyperparameter; the original paper utilises a total of 6 encoders and decoders.
Fig. 6.18: A stack of encoders and decoders translating the German input "Komm bitte her" into the English output "Please come here"
Let us see how this set-up of the encoder and the decoder stack works:
® The word embeddings of the input sequence are passed to the first encoder.
® These are then transformed and propagated to the next encoder.
Advanced Algorithms in Al and ML 6.13 Deep Learning for Sequential and Image Data

* The output from the last encoder in the encoder-stack is passed to all the decoders in the decoder-stack as shown in
the figure below:

Fig. 6.19: The output of the last encoder is passed to the encoder-decoder attention layer of every decoder in the decoder stack
An important thing to note here — in addition to the self-attention and feed-forward layers, the decoders also have one
more layer of Encoder-Decoder Attention layer. This helps the decoder focus on the appropriate parts of the input sequence.
Limitations of the Transformer
Transformer is undoubtedly a huge improvement over the RNN based seq2seq models. But it comes with its own share
of limitations:
* Attention can only deal with fixed-length text strings. The text has to be split into a certain number of segments or
chunks before being fed into the system as input.
® This chunking of text causes context fragmentation.
For example, if a sentence is split from the middle, then a significant amount of context is lost. In other words, the
text is split without respecting the sentence or any other semantic boundary.
GPT - GENERATIVE PRE-TRAINED TRANSFORMER
GPT, an acronym for Generative Pre-trained Transformer, is an advanced machine learning model that has garnered substantial attention and popularity in recent years. OpenAI's development of GPT has brought about a significant transformation in the domain of natural language processing. GPT has found applications in many tasks like text production, translation, summarization, and even the generation of computer code.
The main objective of GPT is to produce text of superior quality that closely resembles human language, utilising the
capabilities of deep learning and pre-training. Through the application of a transformer architecture and rigorous training on a
vast corpus of data, GPT possesses the capacity to comprehend and anticipate words, phrases, and even paragraphs by
leveraging the context and input it receives.
GPT has been praised as a significant advancement in artificial intelligence due to its sophisticated language modelling
abilities. It has been successfully employed in several applications. GPT has demonstrated its versatility and great promise by
assisting in content production, enhancing chatbots, facilitating language understanding, and even supporting creative
writing.
This section aims to provide a comprehensive analysis of GPT, including its definition, functioning, training methodology, practical uses, and constraints. Upon completion, you will possess a thorough comprehension of this formidable machine learning model and its influence across several disciplines.

What is GPT?
GPT, an acronym for Generative Pre-trained Transformer, is a sophisticated machine learning model created by OpenAI.
It belongs to the neural network-based models family and is specifically tailored for natural language processing tasks.
GPT is fundamentally a model for generating language. The model is trained on extensive text data to acquire the statistical patterns and structures of language, allowing it to produce coherent and contextually appropriate text in response to specific prompts or inputs.
GPT stands out due to its transformer architecture, which is a crucial characteristic. Transformers are a specific kind of
neural network that enables the model to effectively understand and represent distant connections and associations within the
text. GPT excels in comprehending the semantic relationships among words, phrases, and entire paragraphs.
The term "pre-trained" in GPT refers to the initial training phase, during which the model is trained on a vast collection of
text material, including books, papers, and webpages. GPT undergoes a pre-training phase where it acquires knowledge of
fundamental language patterns and rules from the incoming data, without focusing on any particular task.
Following the pre-training phase, GPT undergoes a fine-tuning procedure, during which it receives additional training on specific tasks or datasets to enhance its suitability for a particular job. The process of fine-tuning enables GPT to adjust its knowledge and abilities to meet the exact demands of a particular application or use case.
In summary, GPT is an advanced language generation model that utilises deep learning, transformers, and prolonged
training on massive datasets, making it a cutting-edge technology. The capacity to comprehend and produce material of
exceptional quality has rendered it an invaluable instrument in diverse fields, encompassing content generation, virtual
assistants, customer care chatbots, and language translation, among other applications.
Working of GPT
GPT, an acronym for Generative Pre-trained Transformer, utilises a distinctive structure and training methodology to
produce coherent and contextually appropriate text. To comprehend the functioning of GPT, one must possess knowledge of
transformers and the underlying concepts of pre-training and fine-tuning.
The fundamental essence of GPT lies in its transformer architecture. Transformers are neural network structures that
demonstrate exceptional proficiency in collecting extensive connections and associations in textual data, rendering them very
ideal for tasks involving language processing. The original transformer architecture comprises an encoder and a decoder; GPT utilises only the decoder portion, as it prioritises the generation of text in accordance with input prompts.
GPT undergoes two primary stages throughout its training: pre-training and fine-tuning. During the pre-training phase,
GPT is exposed to an extensive corpus of textual material, including books, articles, and webpages. It acquires the ability to
anticipate the subsequent word in a sentence by analysing the preceding words, assimilating the patterns and structures of
language in the process.
The process of unsupervised pre-training allows GPT to acquire a robust comprehension of grammar, syntax, and
semantics. Additionally, it aids the model in acquiring a broad understanding of the world and the context it is exposed to
through a variety of text sources.
Following the pre-training stage, the model proceeds to the fine-tuning phase. During this stage, GPT undergoes training
on certain tasks or datasets, which are meticulously chosen to correspond with the intended application or use case of the
model. Fine-tuning involves refining the pre-trained model by narrowing its attention to the specific target task and making
appropriate adjustments to its parameters.
The fine-tuning phase enables GPT to modify its knowledge and abilities to meet the precise demands of the work at
hand. For instance, if GPT undergoes fine-tuning with a dataset consisting of customer support discussions, it will acquire the
ability to provide responses that are pertinent and beneficial within the context of customer service.
During the process of inference, GPT utilises its pre-existing knowledge and the provided input prompt to generate text
by anticipating the most likely next word or sequence of words. The generated text is shaped by the context and semantics of
the input prompt, leading to coherent and contextually suitable responses.
While GPT demonstrates exceptional proficiency in producing text that resembles human language, it is not without its
constraints. It can sometimes generate inaccurate or illogical results, particularly when the input context is unclear or the
desired output is not well defined.
Overall, the architecture of GPT, along with its pre-training and fine-tuning procedure, allows it to produce text of
exceptional quality that closely resembles content written by humans. The training process enables GPT to comprehend the
complexities of language, enabling it to generate coherent and contextually appropriate responses.
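For a feel of this inference step, a minimal sketch using the openly available GPT-2 checkpoint via the Hugging Face transformers library (an external package, not part of this book's examples; the prompt is illustrative):

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# The model repeatedly predicts the most likely next token given the prompt
inputs = tokenizer("Deep learning is", return_tensors="pt")
output_ids = model.generate(inputs["input_ids"], max_length=25)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))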

Training GPT
GPT, also known as Generative Pre-trained Transformer, undergoes two primary stages throughout its training:
pre-training and fine-tuning. These stages are crucial to guarantee that the model can produce coherent and contextually
pertinent material. Now, let’s delve deeper into these training phases.
During the pre-training phase, GPT is exposed to a huge corpus of text data, often consisting of books, papers, and
webpages. The model acquires the ability to anticipate the subsequent word in a phrase by analysing the preceding words, so
efficiently capturing the statistical patterns and structures of language. The unsupervised training procedure facilitates the
development of GPT's comprehension of grammar, syntax, and semantics across different situations.
During the pre-training phase, GPT employs a transformer architecture, which is highly proficient in capturing extensive
connections and associations inside text. Transformers are composed of several layers of self-attention mechanisms, which
enable the model to assess the significance of various words and their surrounding contexts during text generation.
Following the successful completion of pre-training on a wide variety of texts, GPT proceeds to the fine-tuning stage.
During this phase, the model undergoes training on certain tasks or datasets that are meticulously chosen to correspond with
the intended application or use case. Through the process of fine-tuning, GPT is able to modify its pre-existing knowledge to
better suit the precise demands of the given task.
The fine-tuning method entails modifying the parameters of the pre-trained GPT model by utilising a dataset that is
specific to the task at hand. Through the exposure to task-specific data, GPT acquires the ability to produce text that is better
suited and more precise for the intended application.
It is important to mention that GPT might undergo additional refinement on various tasks or datasets. GPT's capacity to
transfer information between different fields renders it a versatile model capable of effectively managing a diverse range of
natural language processing jobs.
The performance of GPT is significantly influenced by the quality and diversity of the training data, both during
pre-training and fine-tuning. By training GPT on extensive and varied datasets, the model gains a comprehensive
understanding of language intricacies and improves its capacity to produce text of superior quality.
Training GPT is a demanding procedure that necessitates substantial computational resources and effort. The model is
commonly trained on dedicated hardware, such as graphics processing units (GPUs) or tensor processing units (TPUs), to
expedite the training process.
In general, the training process of GPT entails initially training the model on a substantial collection of textual data in
order to acquire knowledge about the statistical patterns inherent in language. Subsequently, the model undergoes fine-tuning
on specific tasks or datasets to tailor it for a certain application. The training procedure guarantees that GPT can produce
coherent and contextually appropriate text by utilising the received input.
Applications of GPT
The Generative Pre-trained Transformer (GPT) has been extensively utilised in several sectors. The capacity to produce
coherent and contextually appropriate text renders it a significant asset in various natural language processing jobs. Let us
examine some of the notable uses of GPT:
Content Generation: GPT is primarily used for content generation. It has the capability to automatically produce
articles, blogs, product descriptions, and other written content of excellent quality. GPT can aid content creators by offering
suggestions, outlines, and even finishing sentences.
Chatbots and Virtual Assistants: GPT has the ability to improve the conversational skills of chatbots and virtual
assistants. It allows them to produce responses to user inquiries that are more similar to those of a human and more suitable
for the situation, hence enhancing user interaction and involvement.
Translation and Summarization: GPT can be utilized for language translation and text summarizing endeavors.
Through the utilization of extensive multilingual datasets for training, the model can produce precise translations and
succinct summaries with a high level of accuracy.
Question Answering: GPT has been employed in question answering systems, leveraging its knowledge and comprehension of a specific subject to deliver comprehensive and informative responses to user inquiries.
Creative Writing: GPT can serve as a valuable instrument for writers engaging in creative writing and the process of
generating ideas. GPT can develop imaginative concepts, storylines, or even entire short novels or scripts when given a
prompt.

Code Generation: GPT has been utilised for code generation jobs, wherein it aids in the automated production of
computer code based on specified requirements or descriptions. This is very advantageous in expediting software
development procedures.
These examples illustrate only a small selection of the practical uses of GPT. Its adaptability and language-generation
capabilities open up opportunities in areas such as marketing, customer service, education, and others.
With ongoing advancements, the scope of GPT is anticipated to broaden significantly.
Limitations of GPT
Although GPT, also known as the Generative Pre-trained Transformer, is a remarkable model for generating language, it
does possess certain constraints. One must be cognizant of these constraints when employing GPT for diverse applications.
Now, let's examine some of the primary constraints of GPT:
Lack of Common Sense: GPT exhibits a deficiency in its ability to reason based on common
sense and possess a comprehensive understanding of the universe. While the system has the ability to generate logical and
contextually appropriate content, there are instances where it may provide inaccurate or non-sensical results. This is
particularly true when the input context is unclear or when the desired output relies on common sense information.
Over-Reliance on Training Data: The performance of GPT is significantly influenced by the quality and diversity of
the training data, leading to an over-reliance on it. Should the training data exhibit bias or limitations, GPT has the potential
to produce replies that are biased or erroneous. It is crucial to guarantee that the training data is varied and inclusive of many
viewpoints and settings.
Vulnerable to Adversarial Attacks: GPT is sensitive to deliberate adversarial attacks, wherein carefully constructed
input prompts can cause the model to produce biased or malicious outputs. This emphasises the significance of ensuring
strong resilience and protection while implementing GPT in practical scenarios.
Limited Domain Expertise: GPT faces challenges in comprehending and producing content pertaining to abstract concepts or subjects that
necessitate extensive expertise in the field. Producing precise and logical language in such situations can pose a challenge
for GPT.
Limited Control: GPT functions as an opaque model, making it difficult to exert precise control over the output text.
Although methods such as prompt engineering and conditioning can offer a certain degree of control, achieving accurate
output from GPT still poses a difficulty.
Contextual Dependency: Contextual dependency refers to the extent to which the quality and relevancy of the output
text are dependent on the input context. Slight modifications in the input can result in substantial fluctuations in the produced
output. Achieving uniform and precise text creation in various situations can be a challenging endeavour.
IMAGE DATA: (RESNET, VGG) PRE-TRAINED NEURAL NETWORKS,
TRANSFER LEARNING, FINE TUNING
Image Classification Using CNN (Convolutional Neural Networks)
Computer vision is a highly sought-after discipline in the area of data science, and Convolutional Neural Networks
(CNNs) have revolutionized the field and emerged as the cutting-edge technology for computer vision. Out of the many
varieties of neural networks, such as recurrent neural networks (RNN), long short-term memory (LSTM), artificial neural
networks (ANN) etc., convolutional neural networks (CNNs) are often regarded as the most widely used and favored.
Convolutional neural network models are very prevalent in the domain of image data. They exhibit exceptional performance
on computer vision tasks such as image categorization, object identification, and image recognition. Consequently, they have
been extensively used in artificial intelligence modeling, particularly for building image
classifiers. This section aims to acquaint you with the notion of image classification using Convolutional Neural Networks
(CNN) and demonstrate its functionality on diverse datasets.
A Convolutional Neural Network (CNN) is a kind of artificial neural network that is specifically designed for processing
and analyzing visual data. It is capable of automatically learning and extracting features from images via the use of
convolutional layers, which apply filters to the input data. CNNs have shown to be very effective in tasks such as image
classification, object detection, and image recognition.
A Convolutional Neural Network (CNN) is a very potent neural network that uses filters to extract distinctive
characteristics from images. Furthermore, it preserves the positional information of each pixel.
There are various datasets that you can leverage for applying convolutional neural networks. Here are three popular
datasets:
* MNIST
* CIFAR-10
* ImageNet
We will now see how to classify images using CNN on each of these datasets.
Using CNNs to Classify Hand-written Digits on the MNIST Dataset
Fig. 6.20: Sample handwritten digits (0-9) from the MNIST dataset
MNIST (Modified National Institute of Standards and Technology) is a well-known dataset used in Computer
Vision that was built by Yann LeCun et al. It is composed of images of handwritten digits (0-9), split into a training set
of 60,000 images and a test set of 10,000 images, where each image is 28 x 28 pixels in width and height.
This dataset is often used for practicing any algorithm made for image classification, as the dataset is fairly easy to
conquer. Hence, I recommend that this should be your first dataset if you are just foraying into the field.
MNIST comes with Keras by default, and you can simply load the train and test files using a few lines of code:
from keras.datasets import mnist

# loading the dataset
(X_train, y_train), (X_test, y_test) = mnist.load_data()

# let's print the shape of the dataset
print("X_train shape", X_train.shape)
print("y_train shape", y_train.shape)
print("X_test shape", X_test.shape)
print("y_test shape", y_test.shape)
Here is the shape of X (features) and y (target) for the training and validation data:
X_train shape (60000, 28, 28)
y_train shape (60000,)
X_test shape (10000, 28, 28)
y_test shape (10000,)
Steps to Build an Image Classification Model using CNN:
Before we train a CNN model, let us build a basic, Fully Connected Neural Network for the dataset. The basic steps to
build an image classification model using a neural network are:
1. Flatten the input image dimensions to 1D (width pixels x height pixels).
2. Normalize the image pixel values (divide by 255).
3. One-Hot Encode the categorical column.
4. Build a model architecture (Sequential) with Dense layers (fully connected layers).
5. Train the model and make predictions.
Here is how you can build a neural network model for MNIST. I have used relu and softmax as the activation functions
and the adam optimizer, with accuracy as the evaluation metric. The code contains all the steps from data loading to
preprocessing to fitting the model. I have commented on the relevant parts of the code for better understanding:
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Conv2D, MaxPool2D
from keras.utils import np_utils

# loading the dataset
(X_train, y_train), (X_test, y_test) = mnist.load_data()

# Flattening the images from the 28x28 pixels to 1D 784 pixels
X_train = X_train.reshape(60000, 784)
X_test = X_test.reshape(10000, 784)
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')

# normalizing the data to help with the training
X_train /= 255
X_test /= 255

# one-hot encoding using keras' numpy-related utilities
n_classes = 10
print("Shape before one-hot encoding: ", y_train.shape)
Y_train = np_utils.to_categorical(y_train, n_classes)
Y_test = np_utils.to_categorical(y_test, n_classes)
print("Shape after one-hot encoding: ", Y_train.shape)

# building a linear stack of layers with the sequential model
model = Sequential()
# hidden layer
model.add(Dense(100, input_shape=(784,), activation='relu'))
# output layer
model.add(Dense(10, activation='softmax'))

# looking at the model summary
model.summary()
# compiling the sequential model
model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam')
# training the model for 10 epochs
model.fit(X_train, Y_train, batch_size=128, epochs=10, validation_data=(X_test, Y_test))

After running the above code, you will see that we easily get a good validation accuracy of around 97%.
One major advantage of using ConvNets over NN is that you do not need to flatten the input images to 1D as they are
capable of working with image data in 2D. This helps in retaining the “spatial” properties of images.
Code for the CNN Model:
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Conv2D, MaxPool2D, Flatten
from keras.utils import np_utils
# to calculate accuracy
from sklearn.metrics import accuracy_score

# loading the dataset


(X_train, y_train), (X_test, y_test) = mnist.load_data()

# building the input vector from the 28x28 pixels


X_train = X_train.reshape(X_train.shape[0], 28, 28, 1)
X_test = X_test.reshape(X_test.shape[0], 28, 28, 1)
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')

# normalizing the data to help with the training
X_train /= 255
X_test /= 255

# one-hot encoding using keras' numpy-related utilities
n_classes = 10
print("Shape before one-hot encoding: ", y_train.shape)
Y_train = np_utils.to_categorical(y_train, n_classes)
Y_test = np_utils.to_categorical(y_test, n_classes)
print("Shape after one-hot encoding: ", Y_train.shape)

# building a linear stack of layers with the sequential model


model = Sequential()
# convolutional layer
model.add(Conv2D(25, kernel_size=(3,3), strides=(1,1), padding='valid', activation='relu',
                 input_shape=(28,28,1)))
model.add(MaxPool2D(pool_size=(1,1)))
# flatten output of conv
model.add(Flatten())
# hidden layer
model.add(Dense(100, activation='relu'))
# output layer
model.add(Dense(10, activation='softmax'))

# compiling the sequential model
model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam')

# training the model for 10 epochs
model.fit(X_train, Y_train, batch_size=128, epochs=10, validation_data=(X_test, Y_test))

In the above code, I have added the Conv2D layer and max pooling layers, which are essential components of a CNN
model.
Even though our max validation accuracy by using a simple neural network model was around 97%, the CNN model is
able to get 98%+ with just a single convolution layer!
Fig. 6.21: Training log over 10 epochs; the validation accuracy climbs to roughly 98%
Which CNN architectures have the best accuracy?
There are many CNN algorithms for many different tasks, such as object detection, object recognition, image
segmentation etc. However, some of the most commonly used CNN architectures that have been proven to have high
accuracy on various computer vision tasks include VGGNet, ResNet (Residual Network), InceptionNet, DenseNet (an example
of a very deep neural network), and YOLO.
The difference between CNN and other machine learning algorithms:
Convolutional Neural Networks (CNNs) are a type of Deep Learning algorithm that is primarily used for image
classification and object recognition tasks. Here are some key differences between CNNs and other machine-learning
algorithms:
1. Unlike traditional machine learning algorithms, CNNs can learn relevant features automatically as part of the training process.
2. CNNs have a unique layered architecture consisting of convolutional, pooling, and fully connected layers, which are
   designed to automatically learn the features and hierarchies of the input data, while other ML algorithms have
   different architectures.
3. CNNs can be computationally expensive due to their large number of parameters and complex architecture. Other
algorithms, such as decision trees and random forests, are typically faster and more computationally efficient.
Feature Maps:
A feature map is a set of filtered and transformed inputs that are learned by ConvNet’s convolutional layer. A feature
map can be thought of as an abstract representation of an input image, where each unit or neuron in the map corresponds to a
specific feature detected in the image, such as an edge, corner, or texture pattern.
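To make this concrete, here is a minimal sketch of how the feature maps of the CNN built earlier could be inspected in Keras; the layer index, the number of maps plotted, and the variable names (model, X_test) refer to the MNIST example above and are illustrative assumptions:
from keras.models import Model
import matplotlib.pyplot as plt

# build a secondary model that outputs the activations of the first conv layer
feature_map_model = Model(inputs=model.input, outputs=model.layers[0].output)

# compute the feature maps for a single test image
feature_maps = feature_map_model.predict(X_test[:1])

# plot the first 9 feature maps; each one highlights a different learned pattern
for i in range(9):
    plt.subplot(3, 3, i + 1)
    plt.imshow(feature_maps[0, :, :, i], cmap='gray')
plt.show()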
Convolutional Neural Networks (CNNs)
* CNN's popularly known as ConvNets majorly consists of several layers and are specifically used for image
processing and detection of objects.
* It was developed in 1998 by Yann LeCun and was first called LeNet. Back then, it was developed to recognize
digits and zip code characters.
* CNNs have wide usage in identifying the image of the satellites, medical image processing, series forecasting, and
anomaly detection.
* CNNs process the data by passing it through multiple layers and extracting features to exhibit convolutional
operations.
* The convolutional layer is typically followed by a Rectified Linear Unit (ReLU) activation that rectifies the feature map.
* The pooling layer then down-samples these feature maps. Pooling is essentially a sampling operation that reduces the
  dimensions of the feature map.
* Later, the resulting 2-D feature maps are flattened into a single, long, continuous, linear vector.
* The next layer, the Fully Connected Layer, takes the flattened vector from the pooling layer as input and identifies the
  image by classifying it.
Fig. 6.22: A typical CNN pipeline: input -> conv + ReLU -> maxpool -> conv + ReLU -> maxpool -> fully connected -> fully connected -> classification output (e.g., "cat")
Residual Network ResNet
Residual Network (ResNet) is one of the famous deep learning models that was introduced by Kaiming He, Xiangyu Zhang,
Shaoqing Ren, and Jian Sun in their paper "Deep Residual Learning for Image Recognition"
in 2015. The ResNet model is one of the most popular and successful deep learning models so far.
* AlexNet, the first CNN-based architecture, won the ImageNet 2012 competition.
* Every subsequent winning architecture used more layers in a deep neural network to reduce the error rate.
* This works for a smaller number of layers, but when we increase the number of layers, a common problem in
  deep learning arises, called the vanishing/exploding gradient.
* This causes the gradient to become 0 or too large. Thus, as we increase the number of layers, the training and test
  error rates also increase.

Fig. 6.23: Comparison of 20-layer vs 56-layer architecture: training error (left) and test error (right) plotted against training iterations
* In the above plot, we can observe that a 56-layer CNN yields a higher error rate on both the training and testing datasets
  than a 20-layer CNN architecture.
* After analyzing the error rates further, the authors concluded that they are caused by the vanishing/exploding
  gradient.
* ResNet, which was proposed in 2015 by researchers at Microsoft Research, introduced a new architecture called the
  Residual Network.
Residual Blocks:
The problem of training very deep networks has been relieved with the introduction of these Residual blocks and the
ResNet model is made up of these blocks.

Fig. 6.24: A residual block: the input x passes through weight layers with ReLU to produce F(x), while an identity skip connection carries x around them, giving the output F(x) + x
Deep Residual Learning for Image Recognition:


In the above figure, the very first thing we can notice is that there is a direct connection that skips some layers of the
model. This connection is called the "skip connection" and is the heart of residual blocks. Because of this skip connection,
the output of the block is no longer the same. Without the skip connection, the input x gets multiplied by the weights of the
layer, followed by adding a bias term.

Then comes the activation function f( ), and we get the output as H(x):
H(x) = f(wx + b), i.e., H(x) = f(x)
Now, with the introduction of the new skip connection technique, the output H(x) is changed to
H(x) = f(x) + x
But the dimension of the input may differ from that of the output, which can happen with convolutional or
pooling layers. Hence, this problem can be handled with these two approaches:
* The skip connection is zero-padded to increase its dimensions.
* 1 x 1 convolutional layers are added to the input to match the dimensions. In such a case, the output is:
H(x) = f(x) + w1 * x


Here an additional parameter w1 is added, whereas no additional parameter is added when using the first approach.
This skip-connection technique in ResNet solves the problem of vanishing gradients in deep CNNs by providing an
alternate shortcut path for the gradient to flow through. The skip connection also helps when a layer hurts the performance of
the architecture: that layer can then effectively be skipped by regularization.
There have been a series of breakthroughs in the field of Deep Learning and Computer Vision. Especially with the
introduction of very deep Convolutional neural networks, these models helped achieve state-of-the-art results on problems
such as image recognition and image classification.

So, over the years, deep learning architectures became deeper and deeper (adding more layers) to solve more and
more complex tasks, which also helped improve the performance of classification and recognition tasks and make
them more robust.
But as we keep adding more layers to the neural network, it becomes much more difficult to train, and the accuracy of
the model starts saturating and then degrades. Here ResNet comes to the rescue and helps resolve this problem.
Challenges that ResNets tackle:

* Vanishing gradients: ResNets prevent gradients from becoming too small, allowing for better training of deep
neural networks.

« Training difficulty: ResNets make deep neural networks easier to train by introducing skip connections.
* Noise sensitivity: ResNets are more robust to noise in the data, leading to improved performance.

* Accuracy limitations: ResNets have achieved state-of-the-art accuracy on a variety of tasks, including image
recognition and natural language processing.
Architecture of ResNet:

The architecture starts from a 34-layer plain network inspired by VGG-19, to which the shortcut connections or
skip connections are added. These skip connections, i.e., residual blocks, convert the architecture into the residual
network, as shown in Fig. 6.25 below:
Fig. 6.25: Side-by-side comparison of VGG-19, a 34-layer plain network, and the 34-layer residual network (the residual version adds skip connections around pairs of convolutional layers)
The layers of ResNet:
ResNet blocks are the building blocks of ResNet architectures. Each block contains convolutional layers, batch
normalization, activation functions, and skip connections. Skip connections allow information to flow directly from earlier
layers to later layers, preventing vanishing gradients and improving network performance. The number of residual blocks and
their configurations determine the depth and complexity of the ResNet model.
The main idea of ResNet:
ResNet introduces skip connections to deep neural networks, allowing information to flow directly between layers,
thereby mitigating the vanishing gradient problem and improving training efficiency. This results in enhanced performance
across various machine-learning tasks.
Stages of ResNet:
ResNet architectures typically comprise four or five stages, each containing multiple residual blocks for feature
extraction and refinement. The number of residual blocks per stage increases as the network deepens, enabling the learning of
more intricate feature representations.
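In practice, ResNet is rarely trained from scratch; a pre-trained network is usually loaded from a library and used directly or fine-tuned. Here is a minimal sketch using Keras Applications; the image file name is a placeholder assumption:
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input, decode_predictions
from tensorflow.keras.preprocessing import image
import numpy as np

# load ResNet50 with ImageNet weights
model = ResNet50(weights='imagenet')

# load and preprocess a single 224x224 image (the path is a placeholder)
img = image.load_img('example.jpg', target_size=(224, 224))
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

# predict and decode the top-3 ImageNet classes
preds = model.predict(x)
print(decode_predictions(preds, top=3)[0])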
VGG Neural Network
VGG Network is a convolutional neural network model proposed by K. Simonyan and A. Zisserman in the paper "Very
Deep Convolutional Networks for Large-Scale Image Recognition" [1]. This architecture achieved a top-5 test accuracy of
92.7% on ImageNet, which has over 14 million images belonging to 1000 classes.
It is one of the famous architectures in the deep learning field. It improved over the AlexNet architecture by replacing
the large 11 x 11 and 5 x 5 kernel-sized filters in the first and second layers with multiple 3 x 3 kernel-sized
filters one after another. It was trained for weeks using NVIDIA Titan Black GPUs.

Fig. 6.26: The VGG network: a deep stack of convolutional and pooling layers followed by fully connected layers
VGG16 Architecture
The input to the convolutional neural network is a fixed-size 224 x 224 RGB image. The only preprocessing it does is
subtracting the mean RGB values, which are computed on the training dataset, from each pixel.
Then the image is passed through a stack of convolutional (Conv.) layers with filters that have a very small
receptive field of 3 x 3, which is the smallest size that captures the notion of left/right, up/down, and center.
In one of the configurations, it also utilizes 1 x 1 convolution filters, which can be seen as a linear transformation of
the input channels followed by non-linearity. The convolutional stride is fixed to 1 pixel; the spatial padding of the
convolutional layer input is such that the spatial resolution is maintained after convolution, that is, the padding is 1 pixel for
the 3 x 3 Conv. layers.
Spatial pooling is then carried out by five max-pooling layers, which follow some of the Conv. layers (not all
the Conv. layers are followed by max-pooling). Max-pooling is performed over a 2 x 2 pixel window, with stride 2.
Fig. 6.27: VGG16 architecture: a 224 x 224 x 3 input passes through stacks of convolution + ReLU and max-pooling layers down to 7 x 7 x 512, followed by fully connected + ReLU layers (4096, 4096, 1000) and a softmax output
The architecture contains a stack of convolutional layers, whose depth differs between configurations, followed
by three Fully-Connected (FC) layers: the first two FC layers have 4096 channels each, and the third performs
1000-way classification and thus contains 1000 channels, one for each class.
The final layer is the soft-max layer. The configuration of the fully connected layers is similar in all networks.
All of the hidden layers are equipped with rectification (ReLU) non-linearity. Also, here one of the networks contains
Local Response Normalization (LRN); such normalization does not improve the performance on the trained dataset, but its
usage leads to increased memory consumption and computation time.
Architecture Summary:
* Input to the model is a fixed-size 224 x 224 RGB image.
* Pre-processing is subtracting the training-set mean RGB value from each pixel.
* Convolutional layers:
  o Stride fixed to 1 pixel.
  o Padding is 1 pixel for the 3 x 3 filters.
* Spatial pooling layers:
  o This layer does not count toward the depth of the network by convention.
  o Spatial pooling is done using max-pooling layers.
  o The window size is 2 x 2.
  o Stride fixed to 2.
  o The convnets used 5 max-pooling layers.
* Fully-connected layers:
  o 1st: 4096 (ReLU).
  o 2nd: 4096 (ReLU).
  o 3rd: 1000 (Softmax).
A minimal code sketch of one convolution-plus-pooling stage is given below.
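This is a minimal, illustrative Keras sketch of a single VGG-style stage: stacked 3 x 3 convolutions with stride 1 and resolution-preserving padding, followed by 2 x 2 max-pooling with stride 2. The filter count and the number of convolutions per stage are parameters of the particular configuration, not fixed by the text:
from tensorflow.keras import layers

def vgg_block(x, filters, num_convs):
    # stacked 3x3 convolutions; stride 1 and 'same' padding keep the spatial size
    for _ in range(num_convs):
        x = layers.Conv2D(filters, (3, 3), strides=(1, 1),
                          padding='same', activation='relu')(x)
    # 2x2 max-pooling with stride 2 halves the spatial resolution
    return layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2))(x)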
Architecture Configuration:
The below figure contains the Convolution Neural Network configuration of the VGG net with the following layers:
* VGG-11
* VGG-11 (LRN)
* VGG-13
* VGG-16 (Convl)
* VGG-16
* VGG-19
VGG Neural Network Architecture
* The VGG model, or VGG Net, that supports 16 layers is also referred to as VGG 16, which is a convolutional neural
network model proposed by A. Zisserman and K. Simonyan from the University of Oxford.
® These researchers published their model in the research paper titled, “Very Deep Convolutional Networks for Large-
Scale Image Recognition.”
* The VGG16 model achieves almost 92.7% top-5 test accuracy in ImageNet.
* ImageNet is a dataset consisting of more than 14 million images belonging to nearly 1000 classes.
* Moreover, it was one of the most popular models submitted to ILSVRC-2014.
* It replaces the large kernel-sized filters with several 3 x 3 kernel-sized filters one after the other, thereby making
  significant improvements over AlexNet.
* The VGG16 model was trained using Nvidia Titan Black GPUs for multiple weeks. As mentioned above,
  VGGNet-16 supports 16 layers and can classify images into 1000 object categories, including keyboard, animals,
  pencil, mouse etc.
* Additionally, the model has an image input size of 224-by-224.
VGG Architecture: VGG Nets are based on the most essential features of convolutional neural networks (CNN). The
following graphic shows the basic concept of how a CNN works:

Fig. 6.29: A CNN classifying an input image: convolutional layers followed by fully connected layers output class probabilities (e.g., kit fox 0.5956, red fox 0.3576, grey fox 0.0439, coyote 0.0013, Arctic fox 0.0003)
* Image data is the input of the CNN; the model output provides prediction categories for the input images. The VGG
  network is constructed with very small convolutional filters.
* The VGG-16 consists of 13 convolutional layers and three fully connected layers.
Let us take a brief look at the architecture of VGG:
Input:
* The VGG Net takes in an image input size of 224 x 224.
* For the ImageNet competition, the creators of the model cropped out the center 224 x 224 patch in each image to
  keep the input size of the image consistent.
Convolutional Layers:
* VGG's convolutional layers leverage a minimal receptive field, i.e., 3 x 3, the smallest possible size that still
  captures up/down and left/right. Moreover, there are also 1 x 1 convolution filters acting as a linear transformation
  of the input.
* This is followed by a ReLU unit, a huge innovation from AlexNet that reduces training time. ReLU stands
  for rectified linear unit activation function; it is a piecewise linear function that outputs the input if positive;
  otherwise, the output is zero.
® The convolution stride is fixed at 1 pixel to keep the spatial resolution preserved after convolution (stride is the
number of pixel shifts over the input matrix).
Hidden Layers:
* All the hidden layers in the VGG network use ReLU.
* VGG does not usually leverage Local Response Normalization (LRN) as it increases memory consumption and
  training time.
* Moreover, it makes no improvements to overall accuracy.
Fully-Connected Layers:
® The VGGNet has three fully connected layers.
* Out of the three layers, the first two have 4096 channels each, and the third has 1000 channels, one for each class.

Fig. 6.30: Fully Connected Layers


Transfer Learning for Deep Learning
Transfer learning is a powerful technique used in Deep Learning. By harnessing the ability to reuse existing models and
their knowledge on new problems, transfer learning has opened doors to training deep neural networks even with limited data.
This breakthrough is especially significant in data science, where practical scenarios often lack enough labeled data. In this
section, we delve into the depths of transfer learning, unraveling its concepts and exploring its applications in empowering
data scientists to tackle complex challenges with newfound efficiency and effectiveness.

Transfer Learning:
The reuse of a pre-trained model on a new problem is known as transfer learning in machine learning. In transfer
learning, a machine uses the knowledge learned from a prior task to improve prediction on a new task. You could, for
example, use the knowledge gained while training a classifier to predict whether an image contains cuisine to help
distinguish beverages.
The knowledge of an already trained machine learning model is transferred to a different but closely linked problem
throughout transfer learning.
For example, if you trained a simple classifier to predict whether an image contains a backpack, you could use the
model’s training knowledge to identify other objects such as sunglasses.
Fig. 6.31: Reusing a trained network for a new task: the same learned features feed output heads such as "probability for cat" and "probability for dog"
With transfer learning, we basically try to use what we have learned in one task to better understand the concepts in
another: weights learned by a network trained on "task A" are transferred to a new network performing "task B."
Because of the massive amount of computational power required, transfer learning is typically applied in computer vision and
natural language processing tasks like sentiment analysis.
Working of Transfer Learning:
In computer vision, neural networks typically aim to detect edges in the first layer, forms in the middle layer, and task-
specific features in the latter layers.
The early and central layers are employed as-is in transfer learning, and only the latter layers are retrained. This leverages
the labelled data from the task the model was originally trained on.

Fig. 6.32: Transfer learning: the common inner layers of a pre-trained model are kept frozen, while the input and output layers of the custom model are updated for the new task
Let's return to the example of a model that was built to identify a backpack in an image and will now be used to
detect sunglasses. Because the model has learned to recognise objects in the earlier layers, we will simply retrain the
subsequent layers to understand what distinguishes sunglasses from other objects. A minimal code sketch of this idea
appears after the list below.
* In computer vision, for example, neural networks usually try to detect edges in the earlier layers, shapes in
the middle layer and some task-specific features in the later layers.
* In transfer learning, the early and middle layers are used and we only retrain the latter layers. It helps leverage the
labelled data of the task it was initially trained on.
* Let’s go back to the example of a model trained for recognizing a backpack on an image, which will be used to
identify sunglasses. In the earlier layers, the model has learned to recognize objects, because of that we will only
retrain the latter layers so it will learn what separates sunglasses from other objects.
* In transfer learning, we try to transfer as much knowledge as possible from the previous task the model was trained
on to the new task at hand.
* This knowledge can be in various forms depending on the problem and the data. For example, it could be how
models are composed, which allows us to identify novel objects more easily.
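Here is a minimal sketch of this freeze-and-retrain idea in Keras. The VGG16 base, the sizes of the new head, and the binary sunglasses-vs-other setup are illustrative assumptions; a complete worked example using InceptionV3 appears later in this section:
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

# load a pre-trained base without its classification head
base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

# freeze the early and middle layers; only the new latter layers will be trained
for layer in base.layers:
    layer.trainable = False

# add new task-specific layers on top (e.g., sunglasses vs. other objects)
model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation='relu'),
    layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])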
Uses of Transfer Learning:
® Transfer learning has several benefits, but the main advantages are saving training time, better performance of
neural networks (in most cases), and not needing a lot of data.
® Usually, a lot of data is needed to train a neural network from scratch but access to that data isn't always available —
this is where transfer learning comes in handy.
© With transfer learning a solid machine learning model can be built with comparatively little training data because the
model is already pre-trained.
e This is especially valuable in natural language processing because mostly expert knowledge is required to
create large labelled data sets.
* Additionally, training time is reduced because it can sometimes take days or even weeks to train a deep neural
network from scratch on a complex task.
Need of Transfer Learning?
Transfer learning offers a number of advantages, the most important of which are reduced training time, improved neural
network performance (in most circumstances), and not needing a large amount of data.
To train a neural model from scratch, a lot of data is typically needed, but access to that data is not always possible — this
is when transfer learning comes in handy.

Fig. 6.33: Reusing stacked CNN layers: the convolutional layers of an old classifier are carried over into a new classifier
Because the model has already been pre-trained, a good machine learning model can be generated with fairly little
training data using transfer learning. This is especially useful in natural language processing, where huge labelled datasets
require a lot of expert knowledge to create. Additionally, training time is decreased because building a deep neural network
from scratch for a complex task can take days or even weeks.
Steps to Use Transfer Learning:
Transfer learning applies when we don't have enough annotated data to train our model, but a pre-trained model exists
that was trained on similar data and tasks. If you used TensorFlow to train the original model, you might simply restore it
and retrain some layers for your job. Transfer learning, however, only works if the features learned in the first task are
general, meaning they can be applied to the other task. Furthermore, the model's input must be the same size as it was when
it was first trained.
If your input is not the required size, add a step to resize it. The main approaches are:
1. Training a Model to Reuse It:
* Consider the situation in which you wish to tackle Task A but lack the necessary data to train a deep neural network.
Finding a related task B with a lot of data is one method to get around this.
® Utilize the deep neural network to train on task B and then use the model to solve task A. The problem you are
seeking to solve will decide whether you need to employ the entire model or just a few layers.
® If the input in both jobs is the same, you might reapply the model and make predictions for your new input.
Changing and retraining distinct task-specific layers and the output layer, on the other hand, is an approach to
investigate.
2. Using a Pre-Trained Model:
® The second option is to employ a model that has already been trained. There are a number of these models out there,
so do some research beforehand. The number of layers to reuse and retrain is determined by the task.
* Keras ships with a number of pre-trained models used in transfer learning, prediction, and fine-tuning. These models,
  as well as quick lessons on how to utilise them, may be found in the Keras documentation. Many research institutions
  also make trained models accessible.
e The most popular application of this form of transfer learning is deep learning.
3. Extraction of Features:
* Another option is to utilise deep learning to identify the optimum representation of your problem, which comprises
identifying the key features. This method is known as representation learning, and it can often produce significantly
better results than hand-designed representations.
e Feature creation in machine learning is mainly done by hand by researchers and domain specialists. Deep learning,
fortunately, can extract features automatically. Of course, this does not diminish the importance of feature
engineering and domain knowledge; you must still choose which features to include in your network.
Fig. 6.34: Feature extraction: the original data points are transformed into a reduced set of new features
4. Extraction of Features In Neural Networks:
® Neural networks, on the other hand, have the ability to learn which features are critical and which aren’t. Even for
complicated tasks that would otherwise necessitate a lot of human effort, a representation learning algorithm can
find a decent combination of characteristics in a short amount of time.
® The learned representation can then be applied to a variety of other challenges. Simply utilise the initial layers to
find the appropriate feature representation, but avoid using the network’s output because it is too task-specific.
Instead, send data into your network and output it through one of the intermediate layers.
* The raw data can then be understood as a representation of this layer. This method is commonly used in computer
vision since it can shrink your dataset, reducing computation time and making it more suited for classical
algorithms.
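As a minimal sketch of this idea, a pre-trained network can be truncated and used purely as a feature extractor, with the resulting vectors handed to a classical algorithm. The model choice, the scikit-learn classifier, and the placeholder data are illustrative assumptions:
from tensorflow.keras.applications import VGG16
from sklearn.linear_model import LogisticRegression
import numpy as np

# use a pre-trained network without its task-specific output layers
extractor = VGG16(weights='imagenet', include_top=False, pooling='avg',
                  input_shape=(224, 224, 3))

X_images = np.random.rand(8, 224, 224, 3)  # placeholder batch of images
y_labels = np.array([0, 1] * 4)            # placeholder binary labels

# each image becomes a 512-dimensional feature vector
features = extractor.predict(X_images)

# a classical algorithm is then trained on the extracted representation
clf = LogisticRegression(max_iter=1000).fit(features, y_labels)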
Models That Have Been Pre-Trained
There are a number of popular pre-trained machine learning models available. The Inception-v3 model, which was
developed for the ImageNet "Large Scale Visual Recognition Challenge," is one of them. Participants in this challenge had to
categorize pictures into 1,000 subcategories such as "zebra," "Dalmatian," and "dishwasher."
Code Implementation of Transfer Learning with Python:
Importing Libraries
import tensorflow as tf
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from glob import glob
from tensorflow.keras import Model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Input, Lambda, Conv2D, Dense, MaxPooling2D,
                                     Dropout, Flatten, GlobalAveragePooling2D)
from tensorflow.keras.preprocessing import image
from tensorflow.keras.preprocessing.image import ImageDataGenerator, load_img
from tensorflow.keras.callbacks import ReduceLROnPlateau
from tensorflow.keras.applications.inception_v3 import InceptionV3
from tensorflow.keras.applications.inception_v3 import preprocess_input
Uploading Data via Kaggle API:
from google.colab import files
files.upload()
# Saving kaggle.json to kaggle.json

!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets download -d mohamedhanyyy/chest-ctscan-images  # downloading the dataset via the kaggle API

from zipfile import ZipFile

file_name = "chest-ctscan-images.zip"
with ZipFile(file_name, 'r') as zip:
    zip.extractall()
    print('Done')
Designing Our CNN Model with the Help of a Pre-Trained Model:
InceptionV3_model = tf.keras.applications.InceptionV3(weights='imagenet', include_top=False,
                                                      input_shape=(224, 224, 3))

# fine-tune the last 15 layers; freeze all the layers before them
for layer in InceptionV3_model.layers[:-15]:
    layer.trainable = False

x = InceptionV3_model.output
x = GlobalAveragePooling2D()(x)
x = Flatten()(x)
x = Dense(units=512, activation='relu')(x)
x = Dropout(0.3)(x)
x = Dense(units=512, activation='relu')(x)
x = Dropout(0.3)(x)
output = Dense(units=4, activation='softmax')(x)
model = Model(InceptionV3_model.input, output)

model.summary()
Image Augmentation (For Preventing the Issue of Overfitting):
# Use the Image Data Generator to import the images from the dataset
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(rescale = 1./255,
                                   shear_range = 0.2,
                                   zoom_range = 0.2,
                                   horizontal_flip = True)

# no flip and zoom for the test dataset
test_datagen = ImageDataGenerator(rescale = 1./255)

# Make sure you provide the same target size as initialized for the image size
training_set = train_datagen.flow_from_directory('/content/Data/train',
                                                 target_size = (224, 224),
                                                 batch_size = 32,
                                                 class_mode = 'categorical')

# test generator used for validation below; the directory path is assumed
# to mirror the training folder of the downloaded dataset
test_set = test_datagen.flow_from_directory('/content/Data/test',
                                            target_size = (224, 224),
                                            batch_size = 32,
                                            class_mode = 'categorical')

Training Our Model:


# fit the model
# Run the cell. It will take some time to execute
r = model.fit_generator(
    training_set,
    validation_data=test_set,
    epochs=8,
    steps_per_epoch=len(training_set),
    validation_steps=len(test_set)
)
# plot the loss
plt.plot(r.history['loss'], label="train loss")
plt.plot(r.history['val_loss'], label="val loss")
Advanced Algorithms in Al and ML 6.33 Deop Learning for Sequential and Image Data

plt.legend()
plt.show()
plt.savefig(‘LossVal_loss")

# plot the accuracy


plt.plot(r.history['accuracy'], label='train acc')
plt.plot(r.history['val_accuracy'], label='val acc')
plt.legend()
plt.show()
plt.savefig('AccVal_acc')
Making Predictions:
import numpy as np

# get class probabilities on the test set, then take the most likely class
y_pred = model.predict(test_set)
y_pred = np.argmax(y_pred, axis=1)
y_pred
When the above code is executed, the respective output for classification using transfer learning is shown below: a plot
of the training loss ("train loss") versus the validation loss ("val loss") over the training epochs.
Fine-Tuning Large Language Models


Introduction:
In recent years, the field of natural language processing (NLP) has seen a significant shift due to the emergence of large
language models. These advanced models have enabled a diverse range of applications, including language translation,
sentiment analysis, and the development of intelligent chatbots.
However, what distinguishes these models is their ability to adapt to and specialize in certain tasks and domains, which
has become common practice. This allows them to reach their full potential and achieve exceptional performance. This
section provides a thorough exploration of fine-tuning large language models, including both fundamental and advanced
concepts.
Definition of Fine-Tuning:
Fine-tuning is a way of applying or utilizing transfer learning. Specifically, fine-tuning is a process that takes a model
that has already been trained for one given task and then tunes or tweaks the model to make it perform a second similar task.
Understanding Pre-Trained Language Models:
Pre-trained language models are extensive neural networks that have been trained on extensive collections of text data,
often obtained from the internet. The training method entails predicting omitted words or tokens within a given phrase
or sequence, thus endowing the model with a deep comprehension of grammar, context, and semantics. Through the analysis of
billions of phrases, these models can comprehend the complexities of language and accurately capture its subtleties.
Some well-known pre-trained language models are BERT (Bidirectional Encoder Representations from Transformers),
GPT-3 (Generative Pre-trained Transformer 3), RoBERTa (A Robustly Optimized BERT Pretraining Approach), and many
more. These models are renowned for their adeptness in executing tasks such as text production, emotion categorization, and
language comprehension with remarkable competence.
GPT-3:
GPT-3 (Generative Pre-trained Transformer 3) is a generative model and an innovative architecture for language models
that has revolutionized the fields of natural language creation and comprehension. The GPT-3 design is built around the
Transformer concept, which relies on attention mechanisms to achieve outstanding performance.
The Architecture of GPT-3:
Fig. 6.37: The Transformer architecture underlying GPT-3: input and output embeddings with positional encoding feed a stack of N attention-based layers, producing output probabilities
GPT-3 is composed of a series of Transformer encoder layers. Each layer consists of multi-head self-attention
mechanisms and feed-forward neural networks. The feed-forward networks are responsible for processing and transforming
the encoded representations. Meanwhile, the attention mechanism allows the model to identify and understand the
connections and associations between words.
GPT-3's primary breakthrough lies in its colossal scale: its staggering 175 billion parameters enable it to absorb an
immense volume of linguistic information.
Implementation of Code:
You can use the OpenAI API to interact with OpenAI's GPT-3 model. Here is an example of text generation using
GPT-3.
import openai

# Set up your OpenAI API credentials
openai.api_key = 'YOUR_API_KEY'

# Define the prompt for text generation
prompt = "A quick brown fox jumps"

# Make a request to GPT-3 for text generation
response = openai.Completion.create(
    engine="text-davinci-003",
    prompt=prompt,
    max_tokens=100,
    temperature=0.6
)

# Retrieve the generated text from the API response
generated_text = response.choices[0].text

# Print the generated text
print(generated_text)
Fine-Tuning: Tailoring Models to Our Needs:
However, it is important to note that while pre-trained language models are very skilled, they do not possess innate
expertise in any particular activity. While they possess a remarkable command of language, they need refinement in areas
such as sentiment analysis, language translation, or providing answers related to certain topics.
Fine-tuning is akin to applying a final adjustment to these adaptable models. Envision having a polymathic acquaintance
who has exceptional proficiency in several domains, but you need their mastery of one particular skill for a momentous event.
You would give them specialized training in that particular domain; that is exactly what we do when refining
pre-trained language models via fine-tuning.
Fig. 6.38: Two fine-tuning strategies on a labelled training set: keep the pre-trained transformer frozen and update only one or more newly added fully connected layers (left), or update all layers (right)
Fine-tuning refers to the process of training a pre-existing model on a smaller dataset that is specific to a particular task.
This dataset contains labelled examples that are relevant to the target task. By exposing the model to these labelled examples,
it can modify its parameters and internal representations to better align with the requirements of the target task.
The Need for Fine-Tuning LLMs:
Although pre-trained language models are impressive, they do not possess inherent task-specific capabilities. Fine-tuning
involves modifying these versatile models to achieve higher levels of accuracy and efficiency while performing certain jobs.
When faced with a particular NLP job, such as analyzing the sentiment of customer reviews or answering questions in a
certain field, it is necessary to adjust the pre-trained model to grasp the intricacies of that particular task and domain.
The advantages of fine-tuning are many. Firstly, it utilizes the knowledge acquired during pre-training, resulting in
significant savings of time and computational resources that would otherwise be necessary to train a model from scratch.
Furthermore, fine-tuning enables us to achieve superior performance on certain tasks by aligning the model with the
complexities and subtleties of the specific domain it was fine-tuned for.
Fine-Tuning LLMs Process: A Step-by-step Guide
The fine-tuning approach often entails providing the task-specific dataset to the pre-trained model and modifying its
parameters via backpropagation. The objective is to minimize the loss function, which quantifies the disparity between the
model's predictions and the actual labels in the dataset. The fine-tuning procedure involves updating the model's parameters,
hence enhancing its specialization for the specific job you are targeting.
In this guide, we will explore the steps involved in refining a substantial language model specifically for sentiment
analysis. The Hugging Face Transformers library will be used, offering convenient access to pre-trained models and tools for
fine-tuning.
Step 1: Load the Pre-trained Language Model and Tokenizer
The first step is to load the pre-trained language model and its corresponding tokenizer. For this example, we will use the
'distilbert-base-uncased' model, a lighter version of BERT.
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

# Load the pre-trained tokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Load the pre-trained model for sequence classification
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
Step 2: Prepare the Sentiment Analysis Dataset
We need a labelled dataset with text samples and corresponding sentiments for sentiment analysis. Let us create a small
dataset for illustration purposes:
texts = ["I loved the movie. It was great!",
         "The food was terrible.",
         "The weather is okay."]
sentiments = ["positive", "negative", "neutral"]
Next, we will use the tokenizer to convert the text samples into the token IDs and attention masks that the model requires.
# Tokenize the text samples
encoded_texts = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# Extract the input IDs and attention masks


input_ids = encoded_texts['input_ids']
attention_mask = encoded_texts['attention_mask']

# Convert the sentiment labels to numerical form


sentiment_labels = [sentiments.index(sentiment) for sentiment in sentiments]
Step 3: Add a Custom Classification Head
The pre-trained language model itself does not include a classification head. We must add one to the model to perform
sentiment analysis. In this case, we will add a simple linear layer.
import torch.nn as nn

# Add a custom classification head on top of the pre-trained model
num_classes = len(set(sentiment_labels))
classification_head = nn.Linear(model.config.hidden_size, num_classes)

# Replace the pre-trained model's classification head with our custom head
model.classifier = classification_head
# keep the model config consistent with the new head so the built-in loss works
model.config.num_labels = num_classes
Step 4: Fine-Tune the Model
With the custom classification head in place, we can now fine-tune the model on the sentiment analysis dataset. We will
use the AdamW optimizer and CrossEntropyLoss as the loss function.
import torch
import torch.optim as optim

# Define the optimizer and loss function
optimizer = optim.AdamW(model.parameters(), lr=2e-5)
criterion = nn.CrossEntropyLoss()

# Fine-tune the model
num_epochs = 3
for epoch in range(num_epochs):
    optimizer.zero_grad()
    outputs = model(input_ids, attention_mask=attention_mask, labels=torch.tensor(sentiment_labels))
    loss = outputs.loss
    loss.backward()
    optimizer.step()
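Once fine-tuned, the model can be used for inference. A minimal sketch is shown below; the example sentence is an illustrative assumption:
# put the model in evaluation mode and classify a new sentence
model.eval()
with torch.no_grad():
    inputs = tokenizer("The plot was dull.", return_tensors="pt")
    logits = model(**inputs).logits
    predicted = sentiments[int(torch.argmax(logits, dim=1))]
print(predicted)  # prints one of: positive / negative / neutral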
Instruction Fine-tuning:
Instruction fine-tuning is a specialized technique to tailor large language models to perform specific tasks based on
explicit instructions. While traditional fine-tuning involves training a model on task-specific data, instruction fine-tuning goes
further by incorporating high-level instructions or demonstrations to guide the model's behaviour.
Fig. 6.39: Instruction fine-tuning: the model is trained on task-specific examples paired with explicit instructions (e.g., "Summarize the following text:")
This approach allows developers to specify desired outputs, encourage certain behaviours, or achieve better control over
the model’s responses. In this comprehensive guide, we will explore the concept of instruction fine-tuning and its
implementation step-by-step.
Instruction Finetuning Process:
What if we could go beyond traditional fine-tuning and provide explicit instructions to guide the model’s behaviour?
Instruction fine-tuning does that, offering a new level of control and precision over model outputs. Here we will explore the
process of instruction fine-tuning large language models for sentiment analysis.
Step 1: Load the Pre-trained Language Model and Tokenizer
To begin, let's load the pre-trained language model and its tokenizer. We'll use GPT-2, the openly available
predecessor of GPT-3, for this example.
from transformers import GPT2Tokenizer, GPT2ForSequenceClassification

# Load the pre-trained tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# GPT-2 has no padding token by default; reuse the end-of-text token for padding
tokenizer.pad_token = tokenizer.eos_token

# Load the pre-trained model for sequence classification (three sentiment classes)
model = GPT2ForSequenceClassification.from_pretrained('gpt2', num_labels=3)
model.config.pad_token_id = tokenizer.pad_token_id
Step 2: Prepare the Instruction Data and Sentiment Analysis Dataset
For instruction fine-tuning, we need to augment the sentiment analysis dataset with explicit instructions for the model.
Let us create a small dataset for demonstration:
texts = ["I loved the movie. It was great!",
         "The food was terrible.",
         "The weather is okay."]
sentiments = ["positive", "negative", "neutral"]
instructions = ["Analyze the sentiment of the text and identify if it is positive.",
                "Analyze the sentiment of the text and identify if it is negative.",
                "Analyze the sentiment of the text and identify if it is neutral."]
Next, let’s tokenize the texts, sentiments, and instructions using the tokenizer:
# Tokenize the texts, sentiments, and instructions
encoded_texts = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
encoded_instructions = tokenizer(instructions, padding=True, truncation=True, return_tensors='pt')

# Extract input IDs, attention masks, and instruction IDs
input_ids = encoded_texts['input_ids']
attention_mask = encoded_texts['attention_mask']
instruction_ids = encoded_instructions['input_ids']
Step 3: Customize the Model Architecture with Instructions


To incorporate instructions during fine-tuning, we need to customize the model architecture. We can do this by
concatenating the instruction IDs with the input IDs:
import torch

# Concatenate instruction IDs with input IDs and adjust attention mask
input_ids = torch.cat([instruction_ids, input_ids], dim=1)
attention_mask = torch.cat([torch.ones_like(instruction_ids), attention_mask], dim=1)
Step 4: Fine-Tune the Model with Instructions
With the instructions incorporated, we can now fine-tune the GPT-2 model on the augmented dataset. During
fine-tuning, the instructions will guide the model's sentiment analysis behaviour.
import torch.optim as optim

# Define the optimizer and loss function
optimizer = optim.AdamW(model.parameters(), lr=2e-5)
criterion = torch.nn.CrossEntropyLoss()

# convert the sentiment strings to numerical class labels
sentiment_labels = [["positive", "negative", "neutral"].index(s) for s in sentiments]

# Fine-tune the model
num_epochs = 3
for epoch in range(num_epochs):
    optimizer.zero_grad()
    outputs = model(input_ids, attention_mask=attention_mask, labels=torch.tensor(sentiment_labels))
    loss = outputs.loss
    loss.backward()
    optimizer.step()
Instruction fine-tuning takes the power of traditional fine-tuning to the next level, allowing us to control the behaviour of
large language models precisely. By providing explicit instructions, we can guide the model’s output and achieve more
accurate and tailored results.
Key Differences Between the Two Approaches
Standard fine-tuning involves training a model on a labelled dataset, honing its abilities to perform specific tasks
effectively. But if we want to provide explicit instructions to guide the model's behaviour, instruction fine-tuning comes into
play, offering unparalleled control and adaptability.
Here are the critical differences between instruction finetuning and standard finetuning.
* Data Requirements: Standard fine-tuning relies on a significant amount of labeled data for the specific task,
whereas instruction fine-tuning benefits from the guidance provided by explicit instructions, making it more
adaptable with limited labelled data.
e Control and Precision: Instruction fine-tuning allows developers to specify desired outputs, encourage certain
behaviors, or achieve better control over the model’s responses. Standard fine-tuning may not offer this level of
control.
* Learning from Instructions: Instruction fine-tuning requires an additional step of incorporating instructions into
the model’s architecture, which standard fine-tuning does not.
Introducing Catastrophic Forgetting: A Perilous Challenge
As we sail into the world of fine-tuning, we encounter the perilous challenge of catastrophic forgetting. This
phenomenon occurs when the model's fine-tuning on a new task erases or "forgets" the knowledge gained during pre-training.
The model loses its understanding of the broader language structure as it focuses solely on the new task.
Imagine our language model as a ship’s cargo hold filled with various knowledge containers, each representing different
linguistic nuances. During pre-training, these containers are carefully filled with language understanding. The ship’s crew
rearranges the containers when we approach a new task and begin fine-tuning. They empty some to make space for new
task-specific knowledge. Unfortunately, some original knowledge is lost, leading to catastrophic forgetting.
Mitigating Catastrophic Forgetting: Safeguarding Knowledge
To navigate the waters of catastrophic forgetting, we need strategies to safeguard the valuable knowledge captured
during pre-training. There are two possible approaches.
Progressive Learning
Here we gradually introduce the new task to the model. Initially, the model focuses on pre-training knowledge and
slowly incorporates the new task data, minimizing the risk of catastrophic forgetting.
Multitask Instruction Fine-Tuning
Multitask instruction fine-tuning embraces a new paradigm by simultaneously training language models on multiple
tasks. Instead of fine-tuning the model for one task at a time, we provide explicit instructions for each task, guiding the
model's behaviour during fine-tuning (a small sketch of such a dataset follows the list of benefits below).
Benefits of Multitask Instruction Fine-Tuning
* Knowledge Transfer: The model gains insights and knowledge from different domains by training on multiple
tasks, enhancing its overall language understanding.
* Shared Representations: Multitask instruction fine-tuning allows the model to share representations across tasks.
This sharing of knowledge improves the model's generalization capabilities.
* Efficiency: Training on multiple tasks concurrently reduces the computational cost and time compared to
fine-tuning each task individually.
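As a sketch of what a multitask instruction dataset could look like (the tasks and instruction strings below are illustrative assumptions, and the tokenizer is the one from Step 2), each example carries its own instruction, which is prepended to the input before tokenization:
# Hypothetical multitask records: each example carries its own instruction,
# so one fine-tuning run covers several tasks at once.
multitask_data = [
    {"instruction": "Analyze the sentiment of the text and identify if it is positive.",
     "input": "The product broke after two days."},
    {"instruction": "Summarize the following article in one sentence.",
     "input": "The city council met on Monday to discuss the annual budget ..."},
    {"instruction": "Translate the sentence into French.",
     "input": "Good morning."},
]

# Prepend each instruction to its input, then tokenize as in Step 2.
texts = [ex["instruction"] + " " + ex["input"] for ex in multitask_data]
encoded = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")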
Real-world Applications of Fine-tuning LLMs:
We will take a closer look at some exciting real-world use cases of fine-tuning large language models, where NLP
advancements are transforming industries and empowering innovative solutions.
* Sentiment Analysis: Fine-tuning language models for sentiment analysis allows businesses to analyze customer
feedback, product reviews, and social media sentiments to understand public perception and make data-driven
decisions.
* Named Entity Recognition (NER): By fine-tuning models for NER, entities like names, dates, and locations can be
automatically extracted from text, enabling applications like information retrieval and document categorization.
* Language Translation: Fine-tuned models can be used for machine translation, breaking language barriers and
enabling seamless communication across different languages.
* Chatbots and Virtual Assistants: By fine-tuning language models, chatbots and virtual assistants can provide more
accurate and contextually relevant responses, enhancing user experiences.
* Medical Text Analysis: Fine-tuned models can aid in analyzing medical documents, electronic health records, and
medical literature, assisting healthcare professionals in diagnosis and research.
* Financial Analysis: Fine-tuning language models can be utilized in financial sentiment analysis, predicting market
trends, and generating financial reports from vast datasets.
* Legal Document Analysis: Fine-tuned models can help in legal document analysis, contract review, and automated
document summarization, saving time and effort for legal professionals.
Transfer Learning:
* In transfer learning, the knowledge of an already trained machine learning model is applied to a different but related
problem.
* For example, if you trained a simple classifier to predict whether an image contains a backpack, you could use the
knowledge that the model gained during its training to recognize other objects like sunglasses.
* With transfer learning, we basically try to exploit what has been learned in one task to improve generalization in
another.
* We transfer the weights that a network has learned at “task A” to a new “task B.”
* The general idea is to use the knowledge a model has learned from a task with a lot of available labelled training data
in a new task that doesn't have much data. Instead of starting the learning process from scratch, we start with
patterns learned from solving a related task.
* Transfer learning is mostly used in computer vision and natural language processing tasks like sentiment analysis
due to the huge amount of computational power required.
* Transfer learning isn't really a machine learning technique, but can be seen as a "design methodology" within the
field, much like active learning.
* It is also not an exclusive part or study-area of machine learning. Nevertheless, it has become quite popular in
combination with neural networks that require huge amounts of data and computational power. A minimal code
sketch is given after this list.
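To make the idea concrete, here is a minimal PyTorch sketch of transfer learning for images: it loads a network pre-trained on ImageNet via torchvision and swaps in a new output layer for the new task (the two-class head is purely an illustrative assumption):
import torch.nn as nn
from torchvision import models

# Load a ResNet-18 pre-trained on ImageNet ("task A").
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Swap in a new output layer for our new "task B"
# (two classes here is an illustrative assumption).
model.fc = nn.Linear(model.fc.in_features, 2)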
Why Use Fine-Tuning?
Assuming the original task is similar to the new task, using an artificial neural network that has already been designed
and trained allows us to take advantage of what the model has already learned without having to develop it from scratch.
When building a model from scratch, we usually must try many approaches through trial-and-error. For example, we
have to decide on:
* Number of layers.
* Types of layers.
* Order of layers.
* Number of nodes in each layer.
* How much regularization to use.
* Learning rate.
Building and validating our model can be a huge task in its own right, depending on what data we are training it on.
This is what makes the fine-tuning approach so attractive. If we can find a trained model that already does one task well,
and that task is similar to ours in at least some remote way, then we can take advantage of everything the model has already
learned and apply it to our specific task.
Now, of course, if the two tasks are different, then there will be some information that the model has learned that may
not apply to our new task, or there may be new information that the model needs to learn from the data regarding the new
task that was not learned from the previous task.
For example, a model trained on cars will never have seen a truck bed, so this feature is something new the
model would have to learn. However, think about everything our model for recognizing trucks could reuse from the
model that was originally trained on cars.

Fig. 6.40
This already trained model has learned to understand edges, shapes, and textures, and, more concretely, headlights,
door handles, windshields, tires, etc. All of these learned features are definitely things we could benefit from in our new
model for classifying trucks.
Freezing Weights
By freezing, we mean that we don't want the weights for these layers to update whenever we train the model on our new
data for our new task. We want to keep all of these weights the same as they were after being trained on the original task. We
only want the weights in our new or modified layers to be updating.

Fig. 6.41
After we do this, all that's left is just to train the model on our new data. Again, during this training process, the weights
from all the layers we kept from our original model will stay the same, and only the weights in our new layers will be
updating.
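As a hedged sketch of what this looks like in PyTorch (continuing the illustrative torchvision example from above; the two-class head and learning rate are assumptions), all pre-trained parameters are excluded from gradient updates and the optimizer only sees the new layer:
import torch.nn as nn
import torch.optim as optim
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze every pre-trained weight: these layers keep what they learned
# on the original task and receive no gradient updates.
for param in model.parameters():
    param.requires_grad = False

# The replacement layer is created after freezing, so it stays trainable.
model.fc = nn.Linear(model.fc.in_features, 2)

# Hand the optimizer only the new layer's parameters.
optimizer = optim.SGD(model.fc.parameters(), lr=1e-3)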

Questions:
1. How do Recurrent Neural Networks work?
2. What is GPT?
3. How does GPT work?
4. Describe RNN - Recurrent Neural Networks (RNNs).
5. Describe LSTM - Long Short-Term Memory Networks (LSTMs).
6. Describe GRU - Gated Recurrent Units.
7. What is the Transformer model?
8. What is a Transformer used for?
9. What is a Transformer in NLP?
10. How does a Transformer network work?
11. What does GPT do?
12. What are feature maps?
13. Which CNN algorithm has the best accuracy?
14. What is the difference between CNN and other machine learning algorithms?
15. What is ResNet?
16. What challenges do ResNets tackle?
17. What are the layers of ResNet?
18. What is the main idea of ResNet?
19. How many stages are there in ResNet?
20. What is Transfer Learning?
21. How does Transfer Learning work?
22. Why should you use Transfer Learning?
23. What is transfer learning in a CNN?
24. What is an example of learning transfer?
25. What type of learning is transfer learning?
26. What is transfer learning in RL?
27. What is Instruction Fine-tuning?
28. Describe Convolutional Neural Networks (CNNs).
29. Describe Residual Networks (ResNet).
30. What is VGG?
31. What is VGG16?
32. Define Fine-Tuning.
33. Why Use Fine-Tuning?
34. How to Fine-Tune?
35. Define Freezing Weights.
