AAM Book
1.2 Training a Model for Supervised Learning, Features - Understand Your Data Better, Feature Extraction and Engineering 1.5
1.3 Feature Engineering on - Numerical Data and Categorical Data and Text Data 1.6
1.4 Feature Scaling, Feature Selection 1.8
Practice Questions
2. Naive Bayes, Decision Tree and Random Forest 2.1 - 2.41
2.1 Introduction 2.1
2.2 Working of Classification 2.1
2.3 Naive Bayes Classifier 2.2
2.4 Types of Naive Bayes Model 2.2
2.5 Naive Bayes 2.8
2.5.1 What is Naive Bayes Classifier Algorithm? 2.8
2.5.2 Why is it called Naive Bayes? 2.9
Learning Objectives...
* Select a suitable model for the given data with justification.
1.1 INTRODUCTION: SELECTING A MODEL
What is Model Selection?
"The process of selecting the machine learning model most appropriate for a given issue is known as model selection."
Model selection is a procedure that may be used to compare models of the same type that have been set up with different model hyperparameters, as well as models of different types.
Why Model Selection?
Model selection is a procedure used by statisticians to examine the relative merits of different predictive methods and identify which one best fits the observed data. Model evaluation with the data used for training is not accepted in data science because it easily generates overoptimistic and overfitted models.
You may have to check things like:
* Overfitting and underfitting
* Generalization error
* Validation for model selection
For certain algorithms, the best way to reveal the problem’s structure to the learning algorithm is through specific data
preparation. The next logical step is to define model selection as the process of choosing amongst model development
workflows.
So, depending on your use case, you choose an ML model.
Fig. 1.1: Model selection techniques - probabilistic measures and resampling methods
Resampling Methods:
As the name implies, resampling methods are straightforward methods of rearranging data samples to see how well the model performs on samples of data it has not been trained on. Resampling, in other words, enables us to determine the model's generalizability.
There are two main types of re-sampling techniques:
Cross-validation:
It is a resampling procedure to evaluate models by splitting the data. Consider a situation where you have two models
and want to determine which one is the most appropriate for a certain issue. In this case, we can use a cross-validation
process.
So, let's say you are working on an SVM model and have a dataset over which we iterate multiple times. We will now divide the dataset into five groups. In each iteration, one group out of the five is used as test data. Machine learning models are evaluated on test data after being trained on training data.
Let's say you calculated the accuracy of each iteration; the table below lists the accuracy of each iteration.
Iteration 1: 88%
Iteration 2: 83%
Iteration 3: 86%
Iteration 4: 82%
Iteration 5: 84%
Iteration 6: 85%
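The same procedure takes only a few lines with scikit-learn's cross_val_score; a minimal sketch, where the SVM classifier and the synthetic dataset are illustrative assumptions rather than the text's own data:

# Minimal cross-validation sketch (SVM model and synthetic data are illustrative)
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=1)
scores = cross_val_score(SVC(), X, y, cv=5)  # 5 groups; each takes a turn as test data
for i, score in enumerate(scores, start=1):
    print(f"Iteration {i}: accuracy = {score:.2f}")
print("Mean accuracy:", scores.mean())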
Resampling only focuses on model performance, whereas probabilistic modelling concentrates on both model performance and complexity.
* An information criterion (IC) is a statistical metric that yields a score. The model with the lowest score is the most effective.
* Performance is calculated using in-sample data; therefore a test set is unnecessary. Instead, the score is calculated using all the training data.
* Less complexity entails a straightforward model with fewer parameters that is simple to learn and maintain but unable to detect fluctuations that affect a model's performance.
There are three statistical methods for calculating the degree of complexity and how well a particular model fits a
dataset:
Akaike Information Criterion (AIC)
AIC is a single numerical score that may be used to distinguish across many models the one that is most likely to be the
best fit for a given dataset. AIC ratings are only helpful when compared to other scores for the same dataset.
Lower AIC ratings are preferable: AIC calculates the model's accuracy in fitting the training data set and includes a penalty term for model complexity.
AIC = (2K - 2 log(L)) / N
K = the number of distinct variables or predictors.
L = the model's maximum likelihood.
N = the number of data points in the training set (especially helpful in the case of small datasets).
The drawback of AIC is that it struggles with generalizing models since it favors intricate models that retain more
training data. This implies that all tested models might still have a poor fit.
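As a worked illustration of the formula above, the sketch below computes AIC for a simple Gaussian linear fit; the data, the aic helper, and the use of the residual variance for the maximum likelihood are all assumptions for demonstration:

import numpy as np

def aic(log_likelihood, k, n):
    # AIC = (2K - 2 log(L)) / N, following the formula above
    return (2 * k - 2 * log_likelihood) / n

# Illustrative example: Gaussian log-likelihood of linear-regression residuals
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(scale=0.5, size=100)
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)
sigma2 = residuals.var()  # maximum-likelihood variance estimate
log_l = -0.5 * len(y) * (np.log(2 * np.pi * sigma2) + 1)
print("AIC:", aic(log_l, k=2, n=len(y)))  # k=2: slope and intercept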
Minimum Description Length (MDL):
According to the MDL concept, the explanation that allows for the most data compression is the best, given a small collection of observed data. Simply put, it is a technique that forms the cornerstone of statistical modelling, pattern recognition, and machine learning.
MDL = L(h) + L(D | h)
h = the model, D = the model's predictions.
L(h) is the number of bits needed to represent the model.
L(D | h) is the number of bits needed to represent the model's predictions given the model.
3. Feature Engineering:
(a) Create New Features:
® Derive new features that might capture important patterns or relationships.
* For example, extract date features from a timestamp, create interaction terms, or combine existing features.
(b) Polynomial Features:
* Introduce polynomial features to capture non-linear relationships.
* For instance, square or cube certain features to account for quadratic or cubic patterns.
(c) Dimensionality Reduction:
® Use techniques like Principal Component Analysis (PCA) to reduce the dimensionality of the dataset while
retaining essential information.
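A minimal sketch of points (b) and (c) above with scikit-learn; the random matrix and the parameter choices are illustrative:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import PolynomialFeatures

X = np.random.default_rng(1).normal(size=(200, 5))

# (b) Add squared and interaction terms to capture non-linear patterns
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
print(X_poly.shape)  # (200, 20): 5 original + 10 interactions + 5 squares

# (c) Reduce dimensionality while keeping 95% of the variance
X_reduced = PCA(n_components=0.95).fit_transform(X_poly)
print(X_reduced.shape)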
Data Splitting:
(a) Training and Testing Sets:
o Split your dataset into training and testing sets to evaluate your model's performance on unseen data.
Model Training:
(a) Choose a Model:
® Select a suitable algorithm based on your problem (For example, regression, classification) and data
characteristics.
(b) Train the Model:
® Feed the training data into the chosen model.
o Adjust model parameters using techniques like cross-validation.
Model Evaluation:
(a) Evaluate on Test Set:
® Assess the model’s performance on the testing set to estimate its generalization capability.
(b) Fine-tuning:
* If needed, fine-tune hyperparameters to improve performance.
Iterative Process:
(a) Refinement:
* Based on model performance, go back to feature engineering or adjust the model architecture.
(b) Cross-Validation:
® Perform cross-validation to ensure robustness of your model.
Deployment:
® Once satisfied with the model, deploy it to make predictions on new, unseen data.
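The workflow above, from data splitting through evaluation, can be sketched end to end; the synthetic dataset, the logistic regression model, and the grid of C values are illustrative assumptions:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Data splitting: hold out a test set for the final evaluation
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Model training with cross-validated hyperparameter fine-tuning
search = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.01, 0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)

# Model evaluation on the unseen test set
print("Best C:", search.best_params_["C"])
print("Test accuracy:", accuracy_score(y_test, search.predict(X_test)))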
1.3 FEATURE ENGINEERING ON - NUMERICAL DATA AND CATEGORICAL DATA AND TEXT DATA
What is Feature Engineering?
Fig. 1.3: Feature engineering - extracting features from raw data
Feature Engineering:
* Feature engineering refers to the process of using domain knowledge to select and transform the most relevant
variables from raw data when creating a predictive model using machine learning or statistical modelling.
® The goal of feature engineering and selection is to improve the performance of machine learning (ML) algorithms.
Data Pre-Processing:
* Data preprocessing is an important step in the data mining process.
* It refers to the cleaning, transforming, and integrating of data to make it ready for analysis.
* The goal of data preprocessing is to improve the quality of the data and to make it more suitable for the specific data
mining task.
Feature engineering techniques for numerical data, categorical data, and text data separately:
1. Numerical Data:
(a) Scaling:
* Standardize or normalize numerical features to ensure they are on a similar scale. This is important for
algorithms sensitive to feature scales.
(b) Binning:
* Convert numerical features into categorical features by binning or bucketing. This can help capture non-linear
relationships.
(c) Polynomial Features:
* Introduce polynomial features to capture non-linear relationships in the data.
(d) Log Transform:
o Apply a log transformation to numerical features to handle skewed distributions.
(e) Interactions:
® Create interaction terms between two or more numerical features.
(f) Outlier Handling:
* Identify and handle outliers using techniques such as truncation, transformation, or imputation.
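A minimal sketch of scaling, binning, and the log transform on a single assumed "income" column (the data and column name are illustrative):

import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer, StandardScaler

df = pd.DataFrame({"income": [25000, 40000, 60000, 120000, 900000]})

# (a) Scaling: zero mean, unit variance
df["income_scaled"] = StandardScaler().fit_transform(df[["income"]]).ravel()

# (b) Binning: bucket the values into 3 ordinal bins
df["income_bin"] = KBinsDiscretizer(n_bins=3, encode="ordinal").fit_transform(df[["income"]]).ravel()

# (d) Log transform: compress the skewed right tail
df["income_log"] = np.log1p(df["income"])
print(df)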
2. Categorical Data:
(a) One-Hot Encoding:
* Convert categorical variables into binary vectors using one-hot encoding.
(b) Label Encoding:
* Transform categorical labels into numerical values if the ordinal relationship is essential.
(c) Target Encoding:
* Encode categorical features based on the mean or median of the target variable for each category.
(d) Frequency Encoding:
* Encode categorical variables based on their frequency in the dataset.
(e) Embeddings:
* Use embeddings for categorical variables, especially useful in deep learning models.
(f) Dummy Variables:
o Create dummy variables for categorical features with multiple levels.
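A minimal sketch of one-hot, label, and frequency encoding with pandas; the "city" column and its values are illustrative assumptions:

import pandas as pd

df = pd.DataFrame({"city": ["Pune", "Mumbai", "Pune", "Delhi", "Mumbai", "Pune"]})

# (a)/(f) One-hot encoding / dummy variables
one_hot = pd.get_dummies(df["city"], prefix="city")

# (b) Label encoding (only meaningful if an ordinal relationship exists)
df["city_label"] = df["city"].astype("category").cat.codes

# (d) Frequency encoding: replace each category with its relative frequency
df["city_freq"] = df["city"].map(df["city"].value_counts(normalize=True))
print(pd.concat([df, one_hot], axis=1))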
3. Text Data:
(a) Tokenization:
* Break text into individual words or subwords (tokenization).
(b) TF-IDF (Term Frequency-Inverse Document Frequency):
* Convert text data into numerical vectors using TF-IDF to capture the importance of words in a document.
(c) Word Embeddings:
® Use pre-trained word embeddings like Word2Vec, GloVe, or FastText to represent words in a continuous vector
space.
(d) Bag-of-Words:
® Represent text as a bag of words, counting the frequency of each word.
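A minimal sketch of bag-of-words and TF-IDF with scikit-learn; the two toy documents are illustrative, and get_feature_names_out assumes scikit-learn 1.0 or later:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

# (a)+(d) Tokenization and bag-of-words counts
bow = CountVectorizer()
print(bow.fit_transform(docs).toarray())
print(bow.get_feature_names_out())

# (b) TF-IDF weights words by how informative they are per document
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(docs).toarray().round(2))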
Fig. 1.4: The impact of Standardization and Normalization on the Wine dataset (Alcohol feature)
Methods for Scaling:
Now that you have an idea of what feature scaling is, let us explore the methods available for doing feature scaling. Of all the methods available, the most common ones are:
Normalization:
Also known as min-max scaling or min-max normalization, it is the simplest method and consists of rescaling the range
of features to scale the range in [0, 1]. The general formula for normalization is given as:
x' = (x - min(x)) / (max(x) - min(x))
Here, max(x) and min(x) are the maximum and the minimum values of the feature respectively.
We can also do a normalization over different intervals, for example, choosing to have the variable laying in any [a, b]
interval, a and b being real numbers. To rescale a range between an arbitrary set of values [a, b], the formula becomes:
x' = a + ((x - min(x)) * (b - a)) / (max(x) - min(x))
Standardization:
Feature standardization makes the values of each feature in the data have zero mean and unit variance. The general method of calculation is to determine the distribution mean and standard deviation for each feature and calculate the new data point by the following formula:

x' = (x - x̄) / σ

Here, σ is the standard deviation of the feature vector, and x̄ is the average of the feature vector.
Scaling to unit length: The aim of this method is to scale the components of a feature vector such that the complete
vector has length one. This usually means dividing each component by the Euclidean length of the vector:
x' = x / ||x||, where ||x|| is the Euclidean length of the feature vector.
In addition to the above 3 widely used methods, there are some other methods to scale the features viz. Power
Transformer, Quantile Transformer, Robust Scaler etc. For the scope of this discussion, we are deliberately not diving into
the details of these techniques.
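A minimal sketch of the three methods above using their scikit-learn equivalents (MinMaxScaler, StandardScaler, and the per-sample Normalizer); the toy matrix is illustrative:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

# Normalization: rescale each feature to [0, 1]
print(MinMaxScaler().fit_transform(X))

# Standardization: zero mean and unit variance per feature
print(StandardScaler().fit_transform(X))

# Scaling to unit length: divide each row by its Euclidean norm
print(Normalizer(norm="l2").fit_transform(X))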
The million-dollar question: Normalization or Standardization
If you have ever built a machine learning pipeline, you must have faced this question of whether to Normalize or to Standardize. While there is no obvious answer to this question, since it really depends on the application, there are still a few generalizations that can be drawn.
Normalization is good to use when the distribution of data does not follow a Gaussian distribution. It can be useful in
algorithms that do not assume any distribution of the data like K-Nearest Neighbors.
In neural network algorithms that require data on a 0-1 scale, normalization is an essential pre-processing step. Another
popular example of data normalization is image processing, where pixel intensities have to be normalized to fit within a
certain range (i.e., 0 to 255 for the RGB color range).
Standardization can be helpful in cases where the data follows a Gaussian distribution, though this does not necessarily have to be true. Since standardization does not have a bounding range, even if there are outliers in the data, they will not be affected by standardization.
In clustering analyses, standardization comes in handy to compare similarities between features based on certain distance
measures. Another prominent example is the Principal Component Analysis, where we usually prefer standardization over
Min-Max scaling since we are interested in the components that maximize the variance.
There are some points which can be considered while deciding whether we need Standardization or Normalization:
* Standardization may be used when data represent a Gaussian distribution, while Normalization is great with a non-Gaussian distribution.
* The impact of outliers is very high in Normalization.
To conclude, you can always start by fitting your model to raw, normalized, and standardized data and compare the
performance for the best results.
The link between Data Scaling and Data Leakage
To apply Normalization or Standardization, we can use the prebuilt functions in scikit-learn or can create our own
custom function.
Data leakage mainly occurs when some information from the training data is revealed to the validation data. In order to
prevent the same, the point to pay attention to is to fit the scaler on the train data and then use it to transform the test data.
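A minimal sketch of the leakage-safe pattern: fit the scaler on the training split only, then reuse its statistics on the test split (the data here is illustrative):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).normal(size=(100, 4))
y = np.random.default_rng(1).integers(0, 2, size=100)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit the scaler on the train data only
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics: no leakage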
Define Feature Selection:
Feature Selection is defined as, "It is a process of automatically or manually selecting the subset of most appropriate and
relevant features to be used in model building.”
What is Feature Selection?
Feature is an attribute that has an impact on a problem or is useful for the problem, and choosing the important features
for the model is known as feature selection. Each machine learning process depends on feature engineering, which mainly
contains two processes, which are Feature Selection and Feature Extraction. Although feature selection and extraction
processes may have the same objective, both are completely different from each other. The main difference between them is
that feature selection is about selecting the subset of the original feature set, whereas feature extraction creates new features.
Feature selection is a way of reducing the input variable for the model by using only relevant data in order to reduce
overfitting in the model.
So, we can define feature Selection as, "It is a process of automatically or manually selecting the subset of most
appropriate and relevant features to be used in model building.” Feature selection is performed by either including the
important features or excluding the irrelevant features in the dataset without changing them.
Need for Feature Selection:
Before implementing any technique, it is important to understand the need for the technique, and so it is for Feature Selection. As we know, in machine learning, it is necessary to provide a pre-processed and good input dataset to get better outcomes. We collect a huge amount of data to train our model and help it to learn better. Generally, the dataset consists of noisy data, irrelevant data, and some part of useful data. Moreover, the huge amount of data also slows down the training process of the model, and with noise and irrelevant data, the model may not predict and perform well. So, it is very necessary to remove such noise and less-important data from the dataset, and to do this, Feature Selection techniques are used.
Selecting the best features helps the model to perform well. For example, suppose we want to create a model that
automatically decides which car should be crushed for a spare part, and to do this, we have a dataset. This dataset contains a
Model of the car, Year, Owner's name, Miles. So, in this dataset, the name of the owner does not contribute to the model
performance as it does not decide if the car should be crushed or not, so we can remove this column and select the rest of the
features (column) for the model building.
Below are some benefits of using feature selection in machine learning:
* It helps in avoiding the curse of dimensionality.
* It helps in the simplification of the model so that it can be easily interpreted by the researchers.
* It reduces the training time.
* It reduces overfitting and hence enhances generalization.
Feature Selection Techniques:
There are mainly two types of Feature Selection techniques, which are:
* Supervised Feature Selection technique: Supervised Feature selection techniques consider the target variable and
can be used for the labelled dataset.
* Unsupervised Feature Selection technique: Unsupervised Feature selection techniques ignore the target variable
and can be used for the unlabelled dataset.
Fig. 1.5: Feature selection techniques - Supervised Feature Selection (Filter, Embedded, and Wrapper methods) and Unsupervised Feature Selection
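As a small illustration of the filter method shown in Fig. 1.5, the sketch below scores each feature against the target and keeps the best k; the synthetic data and k=3 are assumptions:

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=300, n_features=10, n_informative=3, random_state=0)

# Filter method: score each feature against the target, keep the k best
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X, y)
print("Kept feature indices:", selector.get_support(indices=True))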
Practice Questions
Learning Objectives...
* Classify the given data using the Bayesian Method with stepwise justification.
* Describe the working of the Decision Tree Algorithm.
* Enlist applications of the Random Forest Algorithm.
2.1 INTRODUCTION
Suppose you hold the position of a product manager; your objective is to categorize customer feedback into positive and negative groups.
Or as a loan manager, your objective is to determine the creditworthiness of loan applicants, distinguishing between
those who pose a low risk and those who pose a high risk.
Or as a healthcare analyst, your objective is to forecast which people are susceptible to developing diabetes.
All of the cases exhibit a common issue in categorizing reviews, loan applications, and patients.
Naive Bayes is a very efficient and rapid classification technique that is well-suited for processing massive volumes
of data.
The Naive Bayes classifier is effectively used in many applications, including spam filtering, text classification,
sentiment analysis, and recommender systems.
The prediction of unknown class is accomplished by the use of Bayes' theory of probability.
2.2 WORKING OF CLASSIFICATION
When engaging in categorization, the first phase is analysing the issue and identifying probable characteristics and
labels.
Features are particular characteristics or properties that have an impact on the outcomes of the label.
For instance, before distributing a loan, bank management assesses the customer's employment, income, age, location, past loan history, transaction history, and credit score.
® These attributes are referred to as features that aid the model in categorizing clients.
* The categorization process consists of two distinct phases: a learning phase and an assessment phase.
® During the learning phase, the classifier acquires knowledge by training its model using a specific dataset.
* In the evaluation phase, the classifier assesses its performance. Evaluation of performance is based on many metrics
including accuracy, error, precision, and recall.
Fig. 2.1: The classification process - a model is developed on training data and assessed with performance measures
2.3 NAIVE BAYES CLASSIFIER
Naive Bayes Classifier:
® The Naive Bayes Classifier is a machine learning algorithm that is based on Bayes' theorem. It is used for
classification tasks, where it predicts the probability of an input belonging to a certain class based on its features.
* Naive Bayes is a statistical classification method that relies on Bayes' Theorem.
® This technique is considered to be one of the most straightforward methods in supervised learning.
® The Naive Bayes classifier is an algorithm that is known for its speed, accuracy, and reliability.
* Naive Bayes classifiers have excellent accuracy and efficiency when used on extensive datasets.
* The word "naive" means inexperienced or lacking in worldly wisdom.
® The Bayes classifier operates on the assumption that the impact of a specific feature on a class is not influenced by
other characteristics.
* For instance, the desirability of a loan applicant is contingent upon factors such as their income, previous loan and
transaction history, age, and geography.
* Although these traits are interrelated, they are nonetheless regarded as separate entities.
This assumption seems naive, but it simplifies calculation.
* This assumption is referred to as class conditional independence.
P(h | D) = (P(D | h) * P(h)) / P(D)
P(h): The probability of hypothesis h being true (regardless of the data). This is known as the prior probability of h.
P(D): The probability of the data (regardless of the hypothesis). This is known as the prior probability of D.
P(h | D): The probability of hypothesis h given the data D. This is known as the posterior probability.
P(D | h): The probability of data D given that the hypothesis h was true. This is known as the likelihood.
2.4 TYPES OF NAIVE BAYES MODEL
Types of Naive Bayes Model:
There are five types of NB models under the scikit-learn library:
Gaussian Naive Bayes: Often known as gaussiannb, it is a classification algorithm that operates on the assumption that the values of the features in the dataset are distributed according to a Gaussian distribution.
Multinomial Naive Bayes: It is a classification algorithm specifically designed for discrete count data. Suppose we encounter a text categorization issue as an example. In this context, instead of Bernoulli trials that only record whether a word occurs, we count the frequency of a word appearing in a document. This may be seen as the number of times a certain outcome, denoted as x_i, is observed throughout a series of n trials.
Bernoulli Naive Bayes: The binomial model is applicable when the feature vectors consist of binary values (i.e., zeros and ones). An example use case may include text categorization using a 'bag of words' model, where the binary values of 1 and 0 represent whether a word appears or does not appear in a given document, respectively.
Complement Naive Bayes: It is a variant of Multinomial NB that uses the complement of each class to determine the model weights. This method is well-suited for unbalanced data sets and often achieves better results than the Multinomial Naive Bayes (MNB) algorithm in text classification tasks.
Categorical Naive Bayes: It is a valuable method when the characteristics are distributed in a categorical manner. In order to use this procedure, it is necessary to convert the categorical variable into a numerical format by using the ordinal encoder.
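A minimal sketch pairing each data type described above with its scikit-learn class; the mapping labels simply paraphrase the descriptions in the text:

from sklearn.naive_bayes import (
    BernoulliNB, CategoricalNB, ComplementNB, GaussianNB, MultinomialNB,
)

# Pairing each data type with its scikit-learn Naive Bayes class
models = {
    "continuous (Gaussian) features": GaussianNB(),
    "discrete word/event counts": MultinomialNB(),
    "binary present/absent features": BernoulliNB(),
    "imbalanced count data": ComplementNB(),
    "ordinal-encoded categorical features": CategoricalNB(),
}
for data_type, model in models.items():
    print(f"{data_type}: {type(model).__name__}")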
What is the working mechanism of the Naive Bayes Classifier?
Let us comprehend the workings of Naive Bayes with an illustrative case: a case study of weather conditions and participating in sports. You must compute the likelihood of engaging in athletic activities, i.e., categorize whether players will participate or not, depending on the weather conditions.
First Approach (For a single characteristic)
The Naive Bayes classifier computes the probability of an occurrence by the following sequential steps:
Step 1: Compute the prior probability for the supplied class labels.
Step 2: Calculate the probability of each characteristic for each class.
Step 3: Substitute these values into Bayes' Formula and compute the posterior probability.
Step 4: Determine the class with the greater probability, assuming that the input belongs to that class.
For the convenience of the computation of prior and posterior probabilities, you may use two tables: the frequency table
and the likelihood table. Both of these tables will help with the computation of the prior and posterior probability. The
Frequency table records the frequency of labels for all characteristics. There are two likelihood tables that represent the probabilities of events occurring: Likelihood Table 1 displays the initial probabilities of labels, whereas Likelihood Table 2 presents the updated values known as posterior probabilities.
Frequency Table:

Weather     No   Yes
Overcast     0    4
Rainy        3    2
Sunny        2    3
Total        5    9

1. Calculate Prior Probabilities:
P(Overcast) = 4/14 = 0.29
P(Yes) = 9/14 = 0.64
2. Calculate Posterior Probabilities:
P(Overcast | Yes) = 4/9 = 0.44

Fig. 2.3: Weather-Temperature dataset fragment (multiply the same-class conditional probabilities)
Now suppose you want to calculate the probability of playing when the weather is overcast, and the temperature is mild.
Probability of Playing:
P(Play = Yes | Weather = Overcast, Temp = Mild) = P(Weather = Overcast, Temp = Mild | Play = Yes) * P(Play = Yes) ... (2.3)
P(Weather = Overcast, Temp = Mild | Play = Yes) = P(Overcast | Yes) * P(Mild | Yes) ... (2.4)
1. Calculate Prior Probabilities:
P(Yes) = 9/14 = 0.64
2. Calculate Posterior Probabilities:
P(Overcast | Yes) = 4/9 = 0.44    P(Mild | Yes) = 4/9 = 0.44
3. Put Posterior probabilities in equation (2.4)
P(Weather = Overcast, Temp = Mild | Play = Yes) = 0.44 x 0.44 = 0.1936 (Higher)
4. Put Prior and Posterior probabilities in equation (2.3)
P(Play = Yes | Weather = Overcast, Temp = Mild) = 0.1936 x 0.64 = 0.124
Similarly, you can calculate the probability of not playing:
Probability of not playing:
P(Play = No | Weather = Overcast, Temp = Mild) = P(Weather = Overcast, Temp = Mild | Play = No) * P(Play = No) ... (2.5)
P(Weather = Overcast, Temp = Mild | Play = No) = P(Weather = Overcast | Play = No) * P(Temp = Mild | Play = No) ... (2.6)
1. Calculate Prior Probabilities:
P(No) = 5/14 = 0.36
# Generate a synthetic dataset
from sklearn.datasets import make_classification

X, y = make_classification(
    n_features=6,
    n_classes=3,
    n_samples=800,
    n_informative=2,
    random_state=1,
    n_clusters_per_class=1,
)
Fig. 2.4: Scatter plot of the generated dataset with three target labels
As we can observe, there are three types of target labels, and we will be training a multiclass classification model.
Train Test Split:
Before we start the training process, we need to split the dataset into training and testing sets for model evaluation.

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Split the data (split parameters are illustrative)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=125)

# Model training
model = GaussianNB()
model.fit(X_train, y_train)

# Predict Output
predicted = model.predict([X_test[6]])
Model Evaluation:
We will now evaluate the model on the unseen test dataset. First, we will predict the values for the test dataset and use them to calculate accuracy and F1 score.
from sklearn.metrics import (
accuracy_score,
confusion_matrix,
ConfusionMatrixDisplay,
f1_score,
)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_pred, y_test)
f1 = f1_score(y_pred, y_test, average="weighted")
print("Accuracy:", accuracy)
print("F1 Score:", f1)
Our model has performed fairly well with default hyperparameters.
Accuracy: 0.8484848484848485
F1 Score: 0.8491119695890328
To visualize the confusion matrix, we will use confusion_matrix to calculate the true positives and true negatives and ConfusionMatrixDisplay to display the confusion matrix with the labels.
labels = [0,1,2]
cm = confusion_matrix(y_test, y_pred, labels=labels)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
disp.plot();
Fig. 2.5: Confusion matrix (true label vs. predicted label)
Code Available at: https://colab.research.google.com/drive/10yewmAEAkWusiQQcSJTdIZ35Gj90q8P?usp=sharing
Advantages:
* It is not only a simple approach but also a fast and accurate method for prediction.
* Naive Bayes has a very low computation cost.
* It can efficiently work on a large dataset.
* It performs well with a discrete response variable compared to a continuous variable.
* It can be used with multiple class prediction problems.
* It also performs well in the case of text analytics problems.
* When the assumption of independence holds, a Naive Bayes classifier performs better compared to other models like logistic regression.
Disadvantages:
* The assumption of independent features. In practice, it is almost impossible that the model will get a set of predictors which are entirely independent.
* If there is no training tuple of a particular class, this causes zero posterior probability. In this case, the model is unable to make predictions. This problem is known as the Zero Probability/Frequency Problem.
* Some popular examples of the Naive Bayes Algorithm are spam filtration, sentiment analysis, and classifying articles.
Why is it called Naive Bayes?
The Naive Bayes algorithm is comprised of two words, Naive and Bayes, which can be described as:
* Naive: It is called Naive because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Hence each feature individually contributes to identifying that it is an apple, without depending on the others.
Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
What is Bayes’ Theorem?
* Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the probability of a hypothesis with prior knowledge. It depends on the conditional probability.
* The formula for Bayes' theorem is given as:

P(A | B) = (P(B | A) * P(A)) / P(B)

Where,
P(A | B) is the Posterior probability: the probability of hypothesis A on the observed event B.
P(B | A) is the Likelihood probability: the probability of the evidence given that the probability of a hypothesis is true.
P(A) is the Prior probability: the probability of the hypothesis before observing the evidence.
P(B) is the Marginal probability: the probability of the evidence.
Explain Working of Naive Bayes’ Classifier
Working of the Naive Bayes' Classifier can be understood with the help of the below example:
Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using this dataset, we need to decide whether we should play or not on a particular day according to the weather conditions. To solve this problem, we need to follow the below steps:
1. Convert the given dataset into frequency tables.
2. Generate a Likelihood table by finding the probabilities of the given features.
3. Now, use Bayes' theorem to calculate the posterior probability.
Example 2.2: If the weather is sunny, then the Player should play or not?
Solution: To solve this, first consider the below dataset:
Outlook Play
0 Rainy Yes
1 Sunny Yes
2 Overcast Yes
3 Overcast Yes
4 Sunny No
5 Rainy Yes
6 Sunny Yes
7 Overcast Yes
8 Rainy No
9 Sunny No
10 Sunny Yes
11 Rainy No
12 Overcast Yes
13 Overcast Yes
Frequency table for the Weather Conditions:

Weather     Yes   No
Overcast     5     0
Rainy        2     2
Sunny        3     2
Total       10     4

Likelihood table of the Weather Conditions:

Weather     No             Yes
Overcast    0              5              5/14 = 0.35
Rainy       2              2              4/14 = 0.29
Sunny       2              3              5/14 = 0.35
All         4/14 = 0.29    10/14 = 0.71
P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.3
P(Sunny) = 0.35
P(Yes) = 0.71
So P(Yes|Sunny) = 0.3 * 0.71 / 0.35 = 0.60

P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.5
P(No) = 0.29
P(Sunny) = 0.35
So P(No|Sunny) = 0.5 * 0.29 / 0.35 = 0.41

So, as we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny).
Hence, on a Sunny day, the Player can play the game.
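The same calculation can be reproduced with pandas on the dataset above; a minimal sketch (computed with exact fractions, so the result is 0.6 before rounding):

import pandas as pd

outlook = ["Rainy", "Sunny", "Overcast", "Overcast", "Sunny", "Rainy", "Sunny",
           "Overcast", "Rainy", "Sunny", "Sunny", "Rainy", "Overcast", "Overcast"]
play = ["Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes",
        "Yes", "No", "No", "Yes", "No", "Yes", "Yes"]
df = pd.DataFrame({"Outlook": outlook, "Play": play})

# P(Sunny | Yes), P(Yes), and P(Sunny) straight from the frequency counts
p_sunny_given_yes = ((df["Outlook"] == "Sunny") & (df["Play"] == "Yes")).sum() / (df["Play"] == "Yes").sum()
p_yes = (df["Play"] == "Yes").mean()
p_sunny = (df["Outlook"] == "Sunny").mean()
print(round(p_sunny_given_yes * p_yes / p_sunny, 2))  # 0.6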
Write Advantages of the Naive Bayes Classifier:
* Naive Bayes is one of the fast and easy ML algorithms to predict a class of datasets.
* It can be used for Binary as well as Multi-class Classifications.
* It performs well in Multi-class predictions as compared to the other algorithms.
* It is the most popular choice for text classification problems.
Write Disadvantages of the Naive Bayes Classifier:
* Naive Bayes assumes that all features are independent or unrelated, so it cannot learn the relationship between features.
What are the Applications of the Naive Bayes Classifier?
* It is used for Credit Scoring.
* It is used in medical data classification.
* It can be used in real-time predictions because the Naive Bayes Classifier is an eager learner.
* It is used in Text classification such as Spam filtering and Sentiment analysis.
Fig. 2.6: Applications of the Naive Bayes Classifier
Write Implementation of the Naive Bayes Algorithm
Now we will implement a Naive Bayes Algorithm using Python. For this, we will use the "user_data" dataset, which we have used in our other classification models. Therefore, we can easily compare the Naive Bayes model with the other models.
Steps to Implement:
* Data Pre-processing step.
* Fitting Naive Bayes to the Training set.
* Predicting the test result.
* Test accuracy of the result (Creation of Confusion matrix).
* Visualizing the test set result.
1. Data Pre-processing step:
In this step, we will pre-process/prepare the data so that we can use it efficiently in our code. It is similar to what we did in data pre-processing. The code for this is given below:
# Importing the libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('user_data.csv')
x = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
In the above code, we have loaded the dataset into our program using dataset = pd.read_csv('user_data.csv'). The loaded dataset is divided into training and test sets, and then we have scaled the feature variables.
The output for the dataset is given as:
Fig. 2.7: The user_data dataset
2. Fitting Naive Bayes to the Training Set:
After the pre-processing step, now we will fit the Naive Bayes model to the Training set. Below is the code for it:

# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(x_train, y_train)
In the above code, we have used the GaussianNB classifier to fit it to the training dataset. We can also use other classifiers as per our requirement.
Output:
Out[6]: GaussianNB(priors=None, var_smoothing=1e-09)
3. Prediction of the test set result:
Now we will predict the test set result. For this, we will create a new predictor variable y_pred, and will use the predict function to make the predictions.

# Predicting the Test set results
y_pred = classifier.predict(x_test)
Output:
Fig. 2.8: Prediction vector y_pred and real test vector y_test
The above output shows the result for the prediction vector y_pred and the real vector y_test. We can see that some predictions are different from the real values; these are the incorrect predictions.
4. Creating Confusion Matrix:
Now we will check the accuracy of the Naive Bayes classifier using the Confusion matrix. Below is the code for it:
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
Output:
Fig. 2.9: Confusion matrix output (NumPy array)
As we can see in the above confusion matrix output, there are 7 + 3 = 10 incorrect predictions, and 65 + 25 = 90 correct
predictions.
5. Visualizing the training set result:
Next we will visualize the training set result using the Naive Bayes Classifier. Below is the code for it:

# Visualising the Training set results
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
X1, X2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(X1, X2, classifier.predict(nm.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(X1.min(), X1.max())
mtp.ylim(X2.min(), X2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Naive Bayes (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output:
Fig. 2.10: Naive Bayes (Training set) - Age vs. Estimated Salary
In the above output, we can see that the Naive Bayes classifier has segregated the data points with a fine boundary. It is a Gaussian curve, as we have used the GaussianNB classifier in our code.
6. Visualizing the Test set result:

# Visualising the Test set results (completed to mirror the training-set block above)
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
X1, X2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(X1, X2, classifier.predict(nm.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(X1.min(), X1.max()); mtp.ylim(X2.min(), X2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Naive Bayes (Test set)'); mtp.xlabel('Age'); mtp.ylabel('Estimated Salary')
mtp.legend(); mtp.show()
Output:
Fig. 2.11: Naive Bayes (Test set) - Age vs. Estimated Salary
DECISION TREE
* Decision Tree Diagram,
* Why use a Decision Tree?
* Working of the Decision Tree algorithm,
* Attribute Selection Measures (ASM),
* Advantages and Disadvantages of the Decision Tree,
* Implementation of the Decision Tree.
Introduction: Decision Tree
As a marketing manager, your objective is to identify a certain group of clients who have the highest probability of
buying your goods. Discovering your target demographic is a strategic approach to economize your marketing expenditure.
To attain a reduced loan default rate, it is necessary for loan managers to identify loan applications that carry a high level of
risk. The act of categorizing clients into groups based on their potential or lack thereof, or distinguishing between safe and
dangerous loan applications, is referred to as a classification challenge.
Classification is a two-step procedure, consisting of a learning phase and a prediction phase. During the learning phase,
the model is constructed using the provided training data. During the prediction stage, the model utilizes the available data to
forecast the expected response. A Decision tree is a straightforward and widely used categorization technique that facilitates
the comprehension and interpretation of data. It may be used for both categorization and estimation tasks.
The Decision Tree Algorithm
A decision tree is a hierarchical structure resembling a flowchart, in which each internal node represents a specific
characteristic or attribute, each branch represents a decision rule, and each leaf node represents a particular conclusion.
The highest-level node in a decision tree is referred to as the root node. It acquires the ability to divide data based on the
attribute value. The tree is divided into smaller parts using a recursive method known as recursive partitioning. This
diagrammatic framework facilitates the process of decision-making. It is a kind of visualization that resembles a flowchart
diagram and effectively replicates human-level reasoning. Hence, decision trees possess a high level of comprehensibility and
interpretability.
Fig. 2.13: Decision tree workflow - starting from the root node, the dataset is broken into smaller subsets using an ASM such as Information Gain, Gini Index, or Gain Ratio; the process is recursively repeated for each child, and performance is evaluated with measures such as accuracy, precision, and recall.
Info(D) = - Σ_i p_i log2(p_i)
Gini Index:
Another decision tree algorithm, CART (Classification and Regression Tree), uses the Gini method to create split points.

Gini(D) = 1 - Σ_{i=1}^{m} p_i^2
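A minimal sketch of both impurity measures for a column of class labels; the 9-Yes/5-No split used here is the classic weather example:

import numpy as np

def entropy(labels):
    # Info(D) = -sum(p_i * log2(p_i))
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # Gini(D) = 1 - sum(p_i^2)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

labels = ["Yes"] * 9 + ["No"] * 5
print(round(entropy(labels), 3))  # 0.940
print(round(gini(labels), 3))     # 0.459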
Splitting Data:
To understand model performance, dividing the dataset into a training set and a test set is a good strategy.
Let's split the dataset by using the function train_test_split(). You need to pass three parameters features: target, and
test_set size.
# Split dataset into training set and test set (70% training and 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
Building a Decision Tree Model:
Let's create a decision tree model using Scikit-learn.

# Create Decision Tree classifier object and train it
from sklearn.tree import DecisionTreeClassifier, export_graphviz
clf = DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)

# Visualize the trained tree (requires the pydotplus and graphviz packages;
# feature_cols is the list of feature column names defined for the dataset)
from io import StringIO
from IPython.display import Image
import pydotplus

dot_data = StringIO()
export_graphviz(clf, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True, feature_names=feature_cols, class_names=['0', '1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('diabetes.png')
Image(graph.create_png())
Fig. 2.14: The resulting (unpruned) decision tree
In the decision tree chart, each internal node has a decision rule that splits the data. Gini, referred to as the Gini ratio, measures the impurity of the node. You can say a node is pure when all of its records belong to the same class; such nodes are known as leaf nodes.
Here, the resultant tree is unpruned. This unpruned tree is unexplainable and not easy to understand. In the next section,
let's optimize it by pruning.
Optimizing Decision Tree Performance:
* criterion: optional (default="gini"), the attribute selection measure. This parameter allows us to use a different attribute selection measure. Supported criteria are "gini" for the Gini index and "entropy" for the information gain.
* splitter: string, optional (default="best"), the split strategy. This parameter allows us to choose the split strategy. Supported strategies are "best" to choose the best split and "random" to choose the best random split.
* max_depth: int or None, optional (default=None), the maximum depth of the tree. If None, then nodes are expanded until all the leaves contain less than min_samples_split samples. A higher value of maximum depth causes overfitting, and a lower value causes underfitting (Source).
In Scikit-learn, optimization of the decision tree classifier is performed only by pre-pruning. The maximum depth of the tree can be used as a control variable for pre-pruning. In the following example, you can plot a decision tree on the same data with max_depth=3. Other than the pre-pruning parameters, you can also try other attribute selection measures, such as entropy.
# Create Decision Tree classifier object with pre-pruning
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3)
clf = clf.fit(X_train, y_train)

Fig. 2.15: Pruned decision tree (node shown: glucose <= 158…, entropy = 0.9, samples = 152, value = [48, 104])
As you can see, this pruned model is less complex, more explainable, and easier to understand than the previous decision
tree model plot.
Decision Tree Pros:
* Decision trees are easy to interpret and visualize.
* They can easily capture non-linear patterns.
* They require less data preprocessing from the user; for example, there is no need to normalize columns.
* They can be used for feature engineering, such as predicting missing values, and are suitable for variable selection.
* The decision tree has no assumptions about distribution because of the non-parametric nature of the algorithm. (Source)
Decision Tree Cons:
* Sensitive to noisy data. It can overfit noisy data.
* A small variation (or variance) in data can result in a different decision tree. This can be reduced by bagging and boosting algorithms.
* Decision trees are biased toward imbalanced datasets, so it is recommended to balance out the dataset before creating the decision tree.
What is a Decision Tree?
e The Decision Tree serves as a supervised learning technique applicable to both classification and regression
problems, though it is predominantly favoured for addressing classification issues.
® This classifier adopts a tree structure, with internal nodes representing dataset features, branches embodying
decision rules, and each leaf node signifying an outcome.
* Within the Decision Tree, two types of nodes exist:
o Decision Nodes, facilitating decisions with multiple branches, and
o Leaf Nodes, which present the outcomes without additional branches.
® Decisions or tests are executed based on the dataset’s features.
© This method provides a graphical representation offering potential solutions to problems or decisions under specific
conditions.
* The construction of the tree employs the CART algorithm, denoting the Classification and Regression Tree
algorithm.
* The decision tree operates by posing questions and, contingent on the responses (Yes/No), progressively subdivides
the tree into subtrees.
Below diagram explains the general structure of a decision tree:
Fig. 2.16: General structure of a decision tree (root node, decision nodes, and leaf nodes)
Why use Decision Trees?
Reasons for using the Decision Tree:
® Decision Trees usually mimic human thinking ability while deciding, so it is easy to understand.
® The logic behind the decision tree can be easily understood because it shows a tree-like structure.
Decision Tree Terminologies:
Root Node: Root node is from where the decision tree starts. It represents the entire dataset, which further gets divided
into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to the given
conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other nodes are called the child nodes.
Fig. 2.17: Decision tree example with a decision node testing "Salary is between $50,000 - $80,000"
What are Attribute Selection Measures (ASM)?
Decision tree implementation involves the challenge of selecting the best attributes for both the root node and
sub-nodes.
A technique called Attribute Selection Measure (ASM) is employed to address this issue effectively.
ASM helps in the identification of the most suitable attributes for different nodes within the tree.
Gini Index = 1 - Σ_j P_j^2
Advantages of Decision Tree:
* Simple to understand, mirroring human decision-making processes.
* Effective for decision-related problem-solving.
* Facilitates consideration of all possible outcomes for a given problem.
* Requires less data cleaning compared to alternative algorithms.
Disadvantages of Decision Tree:
* Complex structure with numerous layers.
* Potential overfitting issues, which can be addressed with the Random Forest algorithm.
* Computational complexity may increase with more class labels.
Python Implementation of Decision Tree
Now we will implement the Decision tree using Python. For this, we will use the dataset "user_data.csv", which we have used in previous classification models. By using the same dataset, we can compare the Decision tree classifier with other classification models such as KNN, SVM, Logistic Regression, etc.
Steps will also remain the same, which are given below:
* Data Pre-processing step.
* Fitting a Decision-Tree algorithm to the Training set.
* Predicting the test result.
* Test accuracy of the result (Creation of Confusion matrix).
* Visualizing the test set result.
1. Data Pre-processing Step:

# Importing the libraries and the dataset
import pandas as pd
data_set = pd.read_csv('user_data.csv')
x = data_set.iloc[:, [2, 3]].values
y = data_set.iloc[:, 4].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)
In the above code, we have pre-processed the data and loaded the dataset, which is given as:
Fig. 2.18: The user_data dataset (DataFrame)
2. Fitting a Decision-Tree Algorithm to the Training Set:
Now we will fit the model to the training set. For this, we will import the DecisionTreeClassifier class from the sklearn.tree library. Below is the code for it:

# Fitting the Decision Tree classifier to the training set
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier.fit(x_train, y_train)

In the above code, we have created a classifier object, in which we have passed two main parameters:
* criterion='entropy': Criterion is used to measure the quality of a split, which is calculated by the information gain given by entropy.
* random_state=0: For generating the random states.
Below is the output for this:

Out[8]:
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=0, splitter='best')
3. Predicting the Test Result:
Now we will predict the test set result. We will create a new prediction vector y_pred. Below is the code for it:

# Predicting the test set result
y_pred = classifier.predict(x_test)

Output:
In the below output image, the predicted output and the real test output are given. We can clearly see that there are some values in the prediction vector which are different from the real vector values. These are prediction errors.
Fig. 2.19: Prediction vector y_pred and real test vector y_test
4. Test Accuracy of the Result (Creation of Confusion Matrix):
In the above output, we have seen that there were some incorrect predictions, so if we want to know the number of correct and incorrect predictions, we need to use the confusion matrix. Below is the code for it:

# Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
Output:
B8 cm - NumPy amay - o x
Savend
o] e ]
Fig. 2.20
In the above output image, we can see the confusion matrix, which has 6 + 3 = 9 incorrect predictions and 62 + 29 = 91 correct predictions. Therefore, we can say that, compared to other classification models, the Decision Tree classifier made a good prediction.
5. Visualizing the Training Set Result:
Here we will visualize the training set result. To visualize it, we will plot a graph for the decision tree classifier. The classifier will predict Yes or No for the users who have either Purchased or Not purchased the SUV car, as we did in Logistic Regression. Below is the code for it:

# Visualizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Decision Tree Algorithm (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

# Test accuracy of the result
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

# Visualizing the Decision Tree itself
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
plt.figure(figsize=(12, 8))
plot_tree(classifier, feature_names=['Feature_1', 'Feature_2'], class_names=['Class_0', 'Class_1'], filled=True)
plt.title("Decision Tree Visualization")
plt.show()
Output:
Fig. 2.21: Decision Tree Algorithm (Training set) - Age vs. Estimated Salary

The above output is completely different from the rest of the classification models. It has both vertical and horizontal lines that split the dataset according to the age and estimated salary variables.
As we can see, the tree is trying to capture each data point, which is a case of overfitting.

Fig. 2.22: Decision Tree Algorithm (Test set) - Age vs. Estimated Salary
# Modelling
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, ConfusionMatrixDisplay
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from scipy.stats import randint
# Tree Visualisation
from sklearn.tree import export_graphviz
from IPython.display import Image
import graphviz
Random Forests Workflow:
To fit and train this model, we’ll be following The Machine Learning Workflow infographic; however, as our data is
pretty clean, we won’t be carrying out every step. We will do the following:
* Feature engineering
* Split the data
* Train the model
* Hyperparameter tuning
* Assess model performance
Preprocessing Data for Random Forests:
Tree-based models are much more robust to outliers than linear models, and they do not need variables to be normalized
to work. As such, we need to do very little preprocessing on our data.
* We will map our 'default' column, which contains no and yes, to 0's and 1's, respectively. We will treat unknown values as no for this example.
* We will also map our target, y, to 1's and 0's.

bank_data['default'] = bank_data['default'].map({'no': 0, 'yes': 1, 'unknown': 0})
bank_data['y'] = bank_data['y'].map({'no': 0, 'yes': 1})
Splitting the Data:
When training any supervised learning model, it is important to split the data into training and test data. The training data is used to fit the model. The algorithm uses the training data to learn the relationship between the features and the target. The test data is used to evaluate the performance of the model.
The code below splits the data into separate variables for the features and target, then splits into training and test data.

# Split the data into features (X) and target (y)
X = bank_data.drop('y', axis=1)
y = bank_data['y']

# Split into training and test sets (test_size is illustrative)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
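The tree-visualization loop below references a random forest rf that must already be fitted; a minimal sketch of that missing training step (the default hyperparameters are an assumption):

# Assumed training step for the forest visualized below (defaults are illustrative)
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)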
for i in range(3):
tree = rf.estimators_[i]
dot_data = export_graphviz(tree,
feature_names=X_train.columns,
filled=True,
max_depth=2,
impurity=False,
proportion=True)
graph = graphviz.Source(dot_data)
display(graph)
Fig. 2.24: One tree of the forest, truncated to its first levels (root: cons.conf.idx <= -35.45, samples = 100.0%, value = [0.885, 0.115])
Fig. 2.26: Another tree of the forest (root: cons.conf.idx <= -35.45, samples = 100.0%, value = [0.886, 0.114])
Each tree image is limited to only showing the first few nodes. These trees can get very large and difficult to visualize. The colors represent the majority class of each node (box), with red indicating majority 0 (no subscription) and blue indicating majority 1 (subscription). The colors get darker the closer the node gets to being fully 0 or 1. Each node also contains the following information:
* The variable name and value used for splitting.
* The % of total samples in each split.
* The % split between classes in each split.
Hyperparameter Tuning:
The code below uses Scikit-Learn's RandomizedSearchCV, which will randomly search parameters within a range per
hyperparameter. We define the hyperparameters to use and their ranges in the param_dist dictionary. In our case, we are
using:
n_estimators: the number of decision trees in the forest. Increasing this hyperparameter generally improves the
performance of the model but also increases the computational cost of training and predicting.
max_depth: the maximum depth of each decision tree in the forest. Setting a higher value for max_depth can lead to
overfitting while setting it too low can lead to underfitting.
param_dist = {'n_estimators': randint(50,500),
'max_depth': randint(1,20)}
RandomizedSearchCV will train many models (the number is defined by n_iter) and save each one as a variable. The code below creates a variable for the best model and prints its hyperparameters. In this case, we haven't passed a scoring metric to the function, so it defaults to accuracy. This function also uses cross-validation, which means it splits the data into five equal-sized groups and uses four to train and one to test the result. It will loop through each group and give an accuracy score, which is averaged to find the best model.
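The creation and fitting of the search object itself is not shown above; a minimal sketch of what it might look like (the n_iter and cv values are assumptions):

# Randomized search over param_dist, scored with 5-fold cross-validation (assumed settings)
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

rf = RandomForestClassifier()
rand_search = RandomizedSearchCV(rf, param_distributions=param_dist, n_iter=5, cv=5)
rand_search.fit(X_train, y_train)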
# Create a variable for the best model
best_rf = rand_search.best_estimator_
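The confusion matrix plotted below assumes predictions have already been generated from the best model and the matrix computed; a minimal sketch of those steps:

# Generate predictions with the best model and build the confusion matrix (assumed steps)
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

y_pred = best_rf.predict(X_test)
cm = confusion_matrix(y_test, y_pred)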
ConfusionMatrixDisplay(confusion_matrix=cm).plot();
[Fig. 2.27: Confusion matrix plot — true label vs. predicted label, with counts shown on a colour scale]
We should also evaluate the best model with accuracy, precision, and recall (note: your results may differ due to randomization).
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_pred = best_rf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
Output:
Accuracy: 0.885
Precision: 0.578
Recall: 0.0873
The below code plots the importance of each feature, using the model’s internal score to find the best way to split the
data within each decision tree.
# Create a series containing feature importances from the model and feature names from the training data
feature_importances = pd.Series(best_rf.feature_importances_, index=X_train.columns).sort_values(ascending=False)
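The bar chart shown below can be produced directly from this series; a minimal sketch (assuming matplotlib is available):

# Plot the sorted feature importances as a bar chart
import matplotlib.pyplot as plt

feature_importances.plot.bar()
plt.show()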
This tells us that the consumer confidence index, at the time of the call, was the biggest predictor in whether the person
subscribed.
[Fig. 2.28: Bar chart of feature importances — cons.conf.idx, cons.price.idx, age, default]
Random Forest Algorithm
* "Random Forest is a classifier that contains a number of decision trees on various subsets of the given dataset and
takes the average to improve the predictive accuracy of that dataset.”
* Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of those predictions, predicts the final output.
* A greater number of trees in the forest leads to higher accuracy and prevents the problem of overfitting.
[Fig. 2.30: Random forest — an instance is passed to multiple trees, each votes Class-A or Class-B, and the majority vote gives the final class]
An Overview of Random Forests
There are mainly four sectors where Random Forest is mostly used:
1. Banking: Banking sector mostly uses this algorithm for the identification of loan risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the disease can be identified.
3. Land Use: We can identify the areas of similar land use by this algorithm.
4. Marketing: Marketing trends can be identified using this algorithm.
#importing datasets
data_set= pd.read_csv('user_data.csv')
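# (assumed, for completeness: the elided variable-extraction and split steps, as elsewhere in this book)
#Extracting Independent and dependent Variable
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values

#Splitting the dataset into training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)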
#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
In the above code, we have pre-processed the data: the dataset is loaded, the variables are extracted and split, and the features are scaled. Next, we fit the Random Forest classifier to the training set:
#Fitting the Random Forest classifier to the training set
from sklearn.ensemble import RandomForestClassifier
classifier= RandomForestClassifier(n_estimators=10, criterion="entropy")
classifier.fit(x_train, y_train)
In the above code, the classifier object takes the below parameters:
* n_estimators: The required number of trees in the Random Forest. The default value is 10. We can choose any number, but we need to take care of the overfitting issue.
* criterion: It is a function to analyze the accuracy of the split. Here we have taken "entropy" for the information gain.
Output:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
3. Predicting the Test Set Result:
Since our model is fitted to the training set, we can now predict the test result. For prediction, we will create a new prediction vector y_pred. Below is the code for it:
#Predicting the test set result
y_pred= classifier.predict(x_test)
Output:
The prediction vector is given as:
[Fig. 2.32: The prediction vector y_pred]
By checking the above prediction vector against the test set real vector, we can determine the incorrect predictions made by the classifier.
4. Creating the Confusion Matrix:
Now we will create the confusion matrix to determine the correct and incorrect predictions. Below is the code for it:
#Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm= confusion_matrix(y_test, y_pred)
Output:
[Fig. 2.33: Confusion matrix — [[64, 4], [4, 28]]]
As we can see in the above matrix, there are 4 + 4 = 8 incorrect predictions and 64 + 28 = 92 correct predictions.
5. Visualizing the Training Set Result:
Here we will visualize the training set result. To do so, we will plot a graph for the Random Forest classifier. The classifier predicts Yes or No for the users who have either purchased or not purchased the SUV car, as we did in Logistic Regression. Below is the code for it:
#Visualizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Random Forest Algorithm (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
[Fig. 2.34: Decision regions of the Random Forest classifier on the training set (Age vs. Estimated Salary)]
The above image is the visualization result for the Random Forest classifier working with the training set result. It is very
much similar to the Decision tree classifier. Each data point corresponds to each user of the user_data, and the purple and
green regions are the prediction regions. The purple region is classified for the users who did not purchase the SUV car, and
the green region is for the users who purchased the SUV.
So, in the Random Forest classifier, we have taken 10 trees that have predicted Yes or No for the Purchased variable. The classifier took the majority of the predictions and provided the result.
6. Visualizing the Test Set Result:
Now we will visualize the test set result. Below is the code for it:
#Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Random Forest Algorithm (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Supervised Learning: Support Vector Machines, K-Nearest Neighbours
Learning Objectives...
® Describe Support Vector Machines.
® Enlist the advantages and disadvantages of the KNN algorithm.
[Fig. 3.1: Two classes separated by a hyperplane, with the margin and support vectors marked on the X/Y axes]
Support Vectors:
* Support vectors refer to the specific data points that are located closest to the hyperplane.
* These points will provide a more precise definition of the dividing line by computing the margins.
® These considerations are particularly relevant to the building of the classifier.
Hyperplane:
A hyperplane is a plane that is used to make decisions by separating a group of objects with various class memberships.
Margin:
* A margin refers to the gap between the two lines drawn through the nearest points of each class.
® The calculation involves determining the perpendicular distance between the line and either the support vectors or
the nearest points.
* A wider gap between the classes indicates a favorable margin, whereas a narrower margin is deemed unfavorable.
What is the mechanism behind Support Vector Machines (SVM)?
The primary goal is to categorize the provided dataset in the most optimal manner.
The gap between the closest spots is referred to as the margin.
The goal is to choose a hyperplane that has the largest feasible distance between the support vectors in the provided
dataset.
The SVM algorithm seeks to identify the hyperplane with the largest margin using the following steps:
* Create hyperplanes that effectively separate the classes. The left-hand side figure displays three hyperplanes: one in black, one in blue, and one in orange. The blue and orange hyperplanes exhibit significant classification error, while the black one accurately separates the two groups.
* Choose the optimal hyperplane that achieves the highest level of separation from the closest data points, as seen in
the picture on the right-hand side.
[Fig. 3.2: Left — three candidate hyperplanes; right — the optimal hyperplane with the maximum margin and support vectors marked]
[Fig. 3.3: Scatter plots of Class A and Class B points on the X-axis]
Support Vector Machine Kernels
The SVM method is practically implemented by using a kernel. A kernel maps an input data space into the desired format.
The Support Vector Machine (SVM) algorithm employs a method known as the kernel trick. In this context, the kernel function maps a lower-dimensional input space to a higher-dimensional space.
Put simply, it turns a problem that is not separable into a separable one by introducing more dimensions.
It is particularly beneficial in situations where there is a need to separate non-linear data.
The kernel technique enhances the accuracy of the classifier.
* Linear Kernel: A linear kernel can be used as the normal dot product of any two given observations. The product between two vectors is the sum of the multiplication of each pair of input values.
K(x, xi) = sum(x * xi)
* Polynomial Kernel: A polynomial kernel is a more generalized form of the linear kernel. The polynomial kernel can distinguish between curved or nonlinear input spaces.
K(x, xi) = 1 + sum(x * xi)^d
where d is the degree of the polynomial. d = 1 is similar to the linear transformation. The degree needs to be manually specified in the learning algorithm.
* Radial Basis Function Kernel: The radial basis function kernel is a popular kernel function commonly used in support vector machine classification. RBF can map an input space into an infinite-dimensional space.
K(x, xi) = exp(-gamma * sum((x - xi)^2))
Here gamma is a parameter, which ranges from 0 to 1. A higher value of gamma will fit the training dataset too closely, which causes over-fitting. Gamma = 0.1 is a good default value. The value of gamma needs to be manually specified in the learning algorithm.
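As a quick illustration, the kernel is selected via the kernel argument of scikit-learn's SVC; a sketch (the parameter values are illustrative and not tied to any dataset):

from sklearn.svm import SVC

# Three classifiers that differ only in the kernel function
linear_svc = SVC(kernel='linear')
poly_svc = SVC(kernel='poly', degree=3)    # polynomial kernel with degree d = 3
rbf_svc = SVC(kernel='rbf', gamma=0.1)     # RBF kernel; gamma = 0.1 as suggested above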
Classifier Building in Scikit-Learn
So far, you have acquired knowledge of the theoretical underpinnings of Support Vector Machines (SVM). Now you will be taught how to build one in Python using scikit-learn.
Within the framework of model building, the cancer dataset may be used, which is a well-recognized binary classification problem. This information is derived from a digitized picture of a fine needle aspirate (FNA) of a breast mass. The features describe the attributes of the cell nuclei seen in the picture.
The dataset consists of 30 features, including mean radius, mean texture, mean perimeter, mean area, mean smoothness, mean compactness, mean concavity, mean concave points, mean symmetry, mean fractal dimension, radius error, texture error, perimeter error, area error, smoothness error, compactness error, concavity error, concave points error, symmetry error,
fractal dimension error, worst radius, worst texture, worst perimeter, worst area, worst smoothness, worst compactness, worst
concavity, worst concave points, worst symmetry, and worst fractal dimension. Additionally, there is a target variable
indicating the type of cancer.
The dataset consists of two categories of cancer: malignant (having the potential to cause damage) and benign (not
having the potential to cause harm). At this location, you have the ability to construct a model that can accurately categorize
the specific form of cancer. The dataset may be accessed using the scikit-learn library or downloaded from the UCI Machine
Learning Library.
Loading Data:
Let's first load the required dataset you will use.
#Import scikit-learn dataset library
from sklearn import datasets
#Load dataset
cancer = datasets.load_breast_cancer()
Exploring Data:
After you have loaded the dataset, you might want to know a little bit more about it. You can check the feature and target names, and the shape of the data:
# print the shape of the feature set: (records, features) — print statement assumed
print(cancer.data.shape)
(569, 30)
Let's check top 5 records of the feature set.
# print the cancer data features (top 5 records)
print(cancer.data[0:5])
Splitting Data:
Dividing the dataset into a training set and a test set is a prudent method for comprehending model performance.
Partition the dataset by using the train_test_split() method. Please ensure that you include three parameters: features, target, and test set size. In addition, you may use the random_state parameter to randomly choose records.
# Import train_test_split function
from sklearn.model_selection import train_test_split
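The split call itself is not shown; it might look like this (the test size and random_state values are assumptions):

# Split the dataset: 70% training and 30% test (assumed proportions)
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=109)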
Generating Model:
Let's build a support vector machine model. First, import the SVM module and create a support vector classifier object by passing the argument kernel as the linear kernel to the SVC() function.
Then, fit your model on the train set using fit() and perform prediction on the test set using predict().
#Import svm model
from sklearn import svm
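The model creation, training, prediction, and scoring steps that produce the output below might look like this (a sketch using scikit-learn's metrics module; variable names follow the split above):

from sklearn import metrics

# Create an SVM classifier with a linear kernel
clf = svm.SVC(kernel='linear')

# Train the model on the training set
clf.fit(X_train, y_train)

# Predict the response for the test set
y_pred = clf.predict(X_test)

# Evaluate precision and recall
print("Precision:", metrics.precision_score(y_test, y_pred))
print("Recall:", metrics.recall_score(y_test, y_pred))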
Precision: 0.9811320754716981
Recall: 0.9629629629629629
Well, you got a precision of 98% and a recall of 96%, which are considered very good values.
Tuning Hyperparameters:
Kernel:
* The kernel's primary role is to convert the input data of a particular dataset into the necessary format.
® There are several categories of functions, including linear, polynomial, and radial basis function (RBF).
* Polynomial and Radial Basis Function (RBF) are effective for modelling non-linear hyperplanes.
* The polynomial and RBF kernels calculate the boundary line in a higher-dimensional space.
* For some applications, it is recommended to use a more intricate kernel in order to effectively distinguish between
classes that exhibit curvature or non-linearity.
* This conversion has the potential to result in classifiers that are more precise.
Regularization:
* Regularization refers to a technique used in machine learning and statistics to prevent overfitting by adding a penalty
term to the loss function.
® The regularization parameter in Scikit-learn's Python library is denoted by the C parameter and is used to control the
degree of regularization.
® The penalty parameter, denoted as C, reflects the misclassification or error term.
® The misclassification or error term informs the SVM optimization algorithm about the acceptable level of
inaccuracy.
® Here is a method to manipulate the balance between the decision boundary and the misclassification term.
* A lower value of C results in a hyperplane with a larger margin (tolerating more misclassifications), whereas a higher value of C leads to a hyperplane with a smaller margin.
Gamma:
A smaller gamma number will result in a less precise fit to the training dataset, whereas a larger gamma value will lead
to an exact match to the training dataset, resulting in overfitting.
Put simply, a low gamma value takes into account just the neighboring points when determining the separation line,
while a high gamma value takes into account all the data points in this computation.
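These three hyperparameters (kernel, C, and gamma) are often tuned together; a sketch using a grid search (the grid values are illustrative):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Example grid over kernel, C and gamma (values are illustrative)
param_grid = {'kernel': ['linear', 'rbf'],
              'C': [0.1, 1, 10],
              'gamma': [0.01, 0.1, 1]}

grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)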
Benefits:
® SVM classifiers provide superior accuracy and have speedier prediction capabilities in comparison to the Naive
Bayes method.
* Additionally, they use less memory since they employ a subset of training points during the decision phase.
* Support Vector Machines (SVM) perform optimally when there is a distinct distinction between data points and
when working with spaces that have a large number of dimensions.
Drawbacks:
* Support Vector Machines (SVM) are not well-suited to big datasets due to their lengthy training period, which is also longer than that of Naive Bayes.
* It exhibits suboptimal performance when dealing with overlapping classes and is also highly dependent on the
choice of kernel.
WHAT ARE SUPPORT VECTOR MACHINES (SVM)?
Support Vector Machines (SVM) is a supervised machine learning algorithm used for classification and regression tasks.
SVM works by finding the optimal hyperplane that best separates the data into different classes. The basic idea is to identify
the hyperplane that maximizes the margin between classes.
[Fig. 3.4: Maximum-margin hyperplane between the positive and negative hyperplanes]
Example of SVM:
* Illustration using the scenario from the KNN classifier example.
* Imagine encountering a peculiar creature with features of both cats and dogs.
* Goal: Develop a model using SVM to accurately determine whether it is a cat or dog.
* Training the model involves using a dataset with numerous cat and dog images to learn their distinct features.
® Testing the model with the strange creature, SVM creates a decision boundary between cat and dog data.
* SVM identifies extreme cases (support vectors) representing the distinctive features of cats and dogs.
* Based on support vectors, the model classifies the creature as a cat.
* Refer to the provided diagram for visualization.
[Fig. 3.5: SVM workflow — past labelled data used for model training; the model then predicts the class of new data]
The SVM algorithm can be used for face detection, image classification, text categorization, etc.
TYPES OF SVM
1. Linear SVM:
* Used for linearly separable data.
* Assumes that the data can be separated by a straight line.
2. Non-linear SVM:
* Used when the data is not linearly separable.
* Employs kernel functions to map the input data into a higher-dimensional space where a hyperplane can be used to separate the classes.
* Nu-SVC (a further variant) is like C-SVC but uses a parameter to control the number of support vectors.
[Fig. 3.6: Linearly separable data — two classes plotted in 2-d space]
Since the data in the image above lies in a 2-d space, by just using a straight line we can easily separate these two classes. But there can be multiple lines that can separate these classes. Consider the below image:
[Fig. 3.7: Multiple candidate lines separating the two classes]
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane. The SVM algorithm finds the closest points of the lines from both classes. These points are called support vectors. The distance between the vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The hyperplane with the maximum margin is called the optimal hyperplane.
Fig. 3.8
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear data, we cannot draw a
single straight line. Consider the below image:
Iy
A A
v
X
Fig. 3.9
So to separate these data points, we need to add one more dimension. For linear data, we have used two dimensions x and y, so for non-linear data, we will add a third dimension z. It can be calculated as:
z = x^2 + y^2
By adding the third dimension, the sample space will become as below image:
[Fig. 3.10: The sample space after adding the third dimension z]
So now, SVM will divide the datasets into classes in the following way. Consider the below image:
[Fig. 3.11: The best hyperplane separating the classes in 3-d space]
Since we are in 3-d space, the hyperplane looks like a plane parallel to the x-axis. If we convert it into 2-d space with z = 1, it becomes:
[Fig. 3.12: The decision boundary in 2-d space — a circle around the inner class]
Hence, we get a circumference of radius 1 in the case of non-linear data.
Dataset for Implementation: Save the below data as 'user_data.csv' in MS-Excel.
User ID    Gender   Age   EstimatedSalary   Purchased
15624510   Male     19    19000             0
15810944   Male     35    20000             0
15668575   Female   26    43000             0
15603246   Female   27    57000             0
15804002   Male     19    76000             0
15728773   Male     27    58000             0
15598044   Female   27    84000             0
15694829   Female   32    150000            1
15600575   Male     25    33000             0
15727311   Female   35    65000             0
15570769   Female   26    80000             0
15606274   Female   26    52000             0
15746139   Male     20    86000             0
15704987   Male     32    18000             0
15628972   Male     18    82000             0
15697686   Male     29    80000             0
15733883   Male     47    25000             1
15617482   Male     45    26000             1
15704583   Male     46    28000             1
15621083   Female   48    29000             1
15649487   Male     45    22000             1
15736760   Female   47    49000             1
Fig. 3.13
Data Pre-processing step
Till the Data pre-processing step, the code will remain the same. Below is the code:
#Data Pre-processing Step
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
#importing datasets
data_set= pd.read_csv('user_data.csv')
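The remaining pre-processing steps (variable extraction, splitting, and feature scaling) mirror the earlier chapters; a sketch with assumed column indices:

#Extracting Independent and dependent Variable
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values

#Splitting the dataset into training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)

#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)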
[Fig. 3.14: The scaled training set (x_train) as a NumPy array]
The scaled output for the test set will be:
[Fig. 3.15: The scaled test set (x_test) as a NumPy array]
Fitting the SVM classifier to the training set:
Now the training set will be fitted to the SVM classifier. To create the SVM classifier, we will import the SVC class from the sklearn.svm library. Below is the code for it:
from sklearn.svm import SVC # "Support vector classifier"
classifier = SVC(kernel='linear', random_state=0)
classifier.fit(x_train, y_train)
In the above code, we have used kernel='linear', as here we are creating an SVM for linearly separable data. However, we can change it for non-linear data. We then fitted the classifier to the training dataset (x_train, y_train).
Output:
Out[8]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='linear', max_iter=-1, probability=False, random_state=0,
    shrinking=True, tol=0.001, verbose=False)
The model performance can be altered by changing the value of C (the regularization factor), gamma, and the kernel.
* Predicting the test set result: Now, we will predict the output for the test set. For this, we will create a new vector y_pred. Below is the code for it:
#Predicting the test set result
y_pred= classifier.predict(x_test)
After getting the y_pred vector, we can compare the result of y_pred and y_test to check the difference between the actual and predicted values.
Output: Below is the output for the prediction of the test set:
[Fig. 3.16: The prediction vector y_pred as a NumPy array]
* Creating the confusion matrix: Now we will see the performance of the SVM classifier, i.e., how many incorrect predictions there are compared to the Logistic Regression classifier. To create the confusion matrix, we need to import the confusion_matrix function of the sklearn library. After importing the function, we will call it using a new variable cm. The function takes two parameters, mainly y_true (the actual values) and y_pred (the values returned by the classifier). Below is the code for it:
#Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm= confusion_matrix(y_test, y_pred)
Output:
[Fig. 3.17: Confusion matrix — [[66, 2], [8, 24]]]
As we can see in the above output image, there are 66 + 24 = 90 correct predictions and 8 + 2 = 10 incorrect predictions. Therefore, we can say that our SVM model improved compared to the Logistic Regression model.
* Visualizing the training set result: Now we will visualize the training set result, below is the code for it:
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('SVM classifier (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output:
By executing the above code, we will get the output as:
[Fig. 3.18: SVM classifier (Training set) — decision regions over Age vs. Estimated Salary]
As we can see, the above output appears similar to the Logistic Regression output. In the output, we got a straight line as the hyperplane because we have used a linear kernel in the classifier. We have also discussed above that, for 2-d space, the hyperplane in SVM is a straight line.
* Visualizing the test set result:
#Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('SVM classifier (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output:
By executing the above code, we will get the output as:
[Fig. 3.19: SVM classifier (Test set) — decision regions over Age vs. Estimated Salary]
ADVANTAGES AND DISADVANTAGES OF SVM
Advantages of SVM:
1. SVM classifiers offer good accuracy and perform faster predictions compared to the Naive Bayes algorithm.
2. They also use less memory because they use a subset of training points in the decision phase.
3. SVM works well with a clear margin of separation and with high-dimensional space.
Disadvantages of SVM:
1. SVM is not suitable for large datasets because of its high training time; it also takes more time in training compared to Naive Bayes.
2. It works poorly with overlapping classes and is also sensitive to the type of kernel used.
K-NEAREST NEIGHBOURS
The k-nearest neighbors (K-NN) algorithm is a popular supervised model used for both classification and regression, and it is a useful way to understand distance functions, voting systems, and hyperparameter optimization.
While K-NN can be used for classification and regression, here we will focus on building a classification model.
Classification in machine learning is a supervised learning task that involves predicting a categorical label for a given input
data point. The algorithm is trained on a labelled dataset and uses the input features to learn the mapping between the inputs
and the corresponding class labels. We can use the trained model to predict new, unseen data.
A Brief Introduction to K-Nearest Neighbours:
The K-NN algorithm functions as a voting system, where the class label of a new data point is determined by the majority class label among its k closest neighbors in the feature space. Envision a little hamlet with a limited populace
of a few hundred inhabitants, wherein you are faced with the imperative task of determining the political party that merits
your vote. In order to do this, you may approach your closest neighbors and inquire about their political party affiliation. If
the majority of your k closest neighbors endorse party A, it is quite probable that you would likewise cast your vote in favor
of party A. The process is analogous to the functioning of the K-NN algorithm, in which the class label of a new data point is
determined based on the majority class label among its k closest neighbors.
Now, let’s examine another case in further detail. Consider a dataset containing information on two types of fruit: grapes
and pears. You possess a numerical value representing the fruit's roundness and its diameter. You choose to graphically
represent these data points. If presented with an unfamiliar fruit, one might also include it in the graph and then determine its
identity by measuring the distance to the k (a numerical value) closest spots. In the above example, by selecting three
locations for measurement, we may confidently conclude that the three closest points correspond to pears. Hence, | am quite
certain that this object is a pear. By considering the four closest locations, it is seen that three of them correspond to pears,
while one corresponds to a grape. Consequently, we may confidently state that there is a 75% probability that the object in
question is a pear. In this post, we will discuss the methods for determining the optimal value of k and several techniques for
measuring distance.
[Fig. 3.20: Roundness score vs. diameter (cm) for grapes, pears, and the new fruit]
The Dataset
To further illustrate the K-NN algorithm, let's work on a case study you may find while working as a data scientist. Let's
assume you are a data scientist at an online retailer, and you have been tasked with detecting fraudulent transactions. The
only features you have at this stage are:
* dist_from_home: The distance between the user's home location and where the transaction was made.
* purchase_price_ratio: The ratio between the price of the item purchased in this transaction and the median purchase price for that user.
The data has 39 observations which are individual transactions.
    dist_from_home   purchase_price_ratio   fraud
0   21               6.4                    1
1   38               2.2                    1
2   15.7             4.4                    1
3   26.7             4.6                    1
4   10.7             4.9                    1
K-Nearest Neighbors Workflow
To fit and train this model, we’ll be following The Machine Learning Workflow infographic.
[Fig. 3.21: The Machine Learning Workflow — modelling (make predictions on the testing set) and deployment stages]
However, as our data is pretty clean, we won’t carry out every step. We will do the following:
* Feature engineering.
* Splitting the data.
* Train the model.
* Hyperparameter tuning.
* Assess model performance.
[Fig. 3.22: purchase_price_ratio vs. dist_from_home, coloured by fraud label (0/1)]
Normalizing and Splitting the Data
When building a machine learning model, it is crucial to partition the data into separate sets for training and testing
purposes. The training data is used to optimize the model. The algorithm utilizes the training data to acquire knowledge about
the correlation between the characteristics and the goal. Its objective is to identify a consistent structure within the training
data, which can then be used to create accurate forecasts on unfamiliar data. The test data is used to assess the efficacy of the
model. The model undergoes testing on the test data by using it to generate predictions and then comparing these predictions
to the actual target values.
Normalizing the features is crucial for training a K-NN classifier. The reason is that K-NN calculates the distance between data points. The default method is to use the Euclidean distance, which is calculated as the square root of the sum of the squared differences between two points. In our scenario, the purchase_price_ratio falls within the range of 0 to 8, whereas dist_from_home is much larger. Without normalization, our estimate would be significantly influenced by the variable dist_from_home due to its larger values.
It is advisable to standardize the data after it has been divided into training and test sets. To avoid 'data leakage', the scaler must be fitted on the training set only and then applied to the test set; if all the data were normalized together, the model would gain extra knowledge about the test set during normalization.
The provided code segment divides the data into separate train and test sets, and then applies normalization using scikit-
learn's standard scaler. Initially, we use the .fit_transform() method on the training data, which adjusts our scaler to the
average and standard deviation of the training data. Subsequently, we may use this approach on the test data by invoking the
.transform() function, which employs the previously acquired values.
# Split the data into features (X) and target (y)
X = df.drop('fraud', axis=1)
y = df['fraud']
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
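The scaling and first model fit that produce the accuracy below are not shown; a minimal sketch (the choice of k = 3 is an assumption):

from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Scale the features: fit on the training data only, then apply to the test data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Fit a first K-NN model and score it on the test set
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print("Accuracy:", knn.score(X_test, y_test))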
Accuracy: 0.875
This is a pretty good score! However, we may be able to do better by optimizing our value of k.
Using Cross Validation to Get the Best Value of k:
Regrettably, there is no miraculous method to determine the optimal value for k. We need to iterate over many distinct
values and thereafter use our most astute discernment.
In the code shown below, we provide a range of values for the variable k and initialize an empty list to record the
outcomes. Cross-validation is used to determine the accuracy scores, obviating the necessity for creating separate training and
test sets. However, data scaling is still required. Subsequently, we iterate over the data and append the scores to our list.
In order to carry out cross-validation, we use the cross_val_score function provided by scikit-learn. We provide the
K-NN model instance, our data, and the desired number of splits as input. In the code below, we use five splits, indicating
that the model will divide the data into five groups of equal size. Four of these groups will be used for training, while one will be used for testing the results. The program will go over each group and calculate an accuracy score for each. These scores
will be averaged to determine the optimal model.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

k_values = [i for i in range(1, 31)]
scores = []

scaler = StandardScaler()
X = scaler.fit_transform(X)

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(knn, X, y, cv=5)
    scores.append(np.mean(score))
We can plot the results with the following code:
import seaborn as sns
import matplotlib.pyplot as plt

sns.lineplot(x = k_values, y = scores, marker = 'o')
plt.xlabel("K Values")
plt.ylabel("Accuracy Score")
We can see from our chart that K = 9, 10, 11, 12, and 13 all have an accuracy score of just under 95%. As these are tied
for the best score, it is advisable to use a smaller value for K. This is because when using higher values of K, the model will
use more data points that are further away from the original. Another option would be to explore other evaluation metrics.
[Fig. 3.23: Accuracy score vs. K values (1 to 30)]
More Evaluation Metrics:
We can now train our model using the best k value using the code below.
best_index = np.argmax(scores)
best_k = k_values[best_index]
knn = KNeighborsClassifier(n_neighbors=best_k)
knn.fit(X_train, y_train)
We can then evaluate the model with accuracy, precision, and recall (note: your results may differ due to randomization).
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_pred = knn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
Accuracy: 0.875
Precision: 0.75
Recall: 1.0
* K-Nearest Neighbour is one of the simplest Machine Learning algorithms, based on the Supervised Learning technique.
* The K-NN algorithm assumes similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.
* K-NN is a non-parametric algorithm, which means it does not make any assumption about the underlying data.
* It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and, at the time of classification, performs an action on the dataset.
* The K-NN algorithm, at the training phase, just stores the dataset, and when it gets new data, it classifies that data into a category that is most similar to the new data.
Example: Suppose we have an image of a creature that looks like both a cat and a dog, but we want to know whether it is a cat or a dog. For this identification, we can use the K-NN algorithm, as it works on a similarity measure. Our K-NN model will find the features of the new data set that are similar to the cats' and dogs' images, and based on the most similar features it will put the image in either the cat or the dog category.
[Fig. 3.24: K-NN classifier assigning the new image to the cat or dog category]
NEED OF K-NN ALGORITHM
Why do we need a K-NN Algorithm?
Suppose there are two categories, i.e., Category A and Category B, and we have a new data point x1; in which of these categories will this data point lie? To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular dataset. Consider the below diagram:
[Fig. 3.25: Before and after K-NN — the new data point between Category A and Category B is assigned to a category]
WORKING OF K-NN ALGORITHM
How does K-NN work?
The K-NN working can be explained on the basis of the below algorithm:
* Step-1: Select the number K of the neighbors.
* Step-2: Calculate the Euclidean distance of K number of neighbors.
* Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
* Step-4: Among these k neighbors, count the number of the data points in each category.
* Step-5: Assign the new data point to that category for which the number of neighbors is maximum.
* Step-6: Our model is ready.
Suppose we have a new data point and we need to put it in the required category. Consider the below image:
[Fig. 3.26: A new data point plotted between Category A and Category B]
* Firstly, we will choose the number of neighbors, so we will choose k = 5.
* Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the distance
between two points, which we have already studied in geometry. It can be calculated as:
[Fig. 3.27: Euclidean distance between points A(X1, Y1) and B(X2, Y2)]
Euclidean Distance between A(X1, Y1) and B(X2, Y2) = sqrt((X2 - X1)^2 + (Y2 - Y1)^2)
* By calculating the Euclidean distance, we got the nearest neighbors: three nearest neighbors in category A and two nearest neighbors in category B. Consider the below image:
[Fig. 3.28: The five nearest neighbors of the new data point — three in Category A, two in Category B]
® As we can see the 3 nearest neighbors are from category A, hence this new data point must belong to category A.
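A quick numerical sketch of the distance computation used above (NumPy assumed; the two points are illustrative):

import numpy as np

a = np.array([2.0, 3.0])   # point A (X1, Y1)
b = np.array([5.0, 7.0])   # point B (X2, Y2)

# Euclidean distance: sqrt((X2 - X1)^2 + (Y2 - Y1)^2)
dist = np.sqrt(np.sum((b - a) ** 2))
print(dist)   # 5.0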
How to select the value of K in the K-NN Algorithm?
Below are some points to remember while selecting the value of K in the K-NN algorithm:
* There is no way to determine the best value for "K", so we need to try some values to find the best out of them. The most preferred value for K is 5.
* A very low value for K, such as K = 1 or K = 2, can be noisy and sensitive to the effects of outliers in the model.
* Large values for K are good for smoothing out noise, but the model may then face some difficulties, such as including points from other categories.
ADVANTAGES AND DISADVANTAGES OF K-NN ALGORITHM
Advantages of K-NN Algorithm
* It is simple to implement.
* It is robust to noisy training data.
* It can be more effective if the training data is large.
Disadvantages of K-NN Algorithm
* It always needs to determine the value of K, which may be complex at times.
* The computation cost is high because of calculating the distance between the data points for all the training samples.
IMPLEMENTATION OF THE K-NN ALGORITHM
To do the Python implementation of the K-NN algorithm, we will use the same problem and dataset which we used in Logistic Regression, but here we will improve the performance of the model. Below is the problem description:
Problem for the K-NN Algorithm: There is a car manufacturer company that has manufactured a new SUV car. The company wants to show ads to the users who are interested in buying that SUV. So, for this problem, we have a dataset that contains information about multiple users gathered through a social network. The dataset contains lots of information, but we will consider Estimated Salary and Age as the independent variables and the Purchased variable as the dependent variable. Below is the dataset:
User ID    Gender   Age   EstimatedSalary   Purchased
15624510   Male     19    19000             0
15810944   Male     35    20000             0
15668575   Female   26    43000             0
15603246   Female   27    57000             0
15804002   Male     19    76000             0
15728773   Male     27    58000             0
15598044   Female   27    84000             0
15694829   Female   32    150000            1
15600575   Male     25    33000             0
15727311   Female   35    65000             0
15570769   Female   26    80000             0
15606274   Female   26    52000             0
15746139   Male     20    86000             0
15704987   Male     32    18000             0
15628972   Male     18    82000             0
15697686   Male     29    80000             0
15733883   Male     47    25000             1
15617482   Male     45    26000             1
15704583   Male     46    28000             1
15621083   Female   48    29000             1
15649487   Male     45    22000             1
15736760   Female   47    49000             1
Fig. 3.29
Steps to implement the K-NN algorithm:
* Data Pre-processing step.
* Fitting the K-NN algorithm to the Training set.
#importing datasets
data_set= pd.read_csv('user_data.csv')
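# (assumed, for completeness: the elided variable-extraction and split steps, as elsewhere in this book)
#Extracting Independent and dependent Variable
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values

#Splitting the dataset into training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)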
#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
By executing the above code, our dataset is imported to our program and well pre-processed. After feature scaling our
test dataset will look like:
[Fig. 3.30: The scaled test set (x_test) as a NumPy array]
From the above output image, we can see that our data is successfully scaled.
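The fitting and prediction steps for the K-NN classifier are not shown above; a sketch consistent with the confusion matrix that follows (n_neighbors=5 and the Minkowski metric with p=2, i.e., Euclidean distance, are assumptions):

#Fitting the K-NN classifier to the training set
from sklearn.neighbors import KNeighborsClassifier
classifier= KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier.fit(x_train, y_train)

#Predicting the test set result
y_pred= classifier.predict(x_test)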
[Fig. 3.31: The prediction vector y_pred]
Creating the Confusion Matrix:
Now we will create the Confusion Matrix for our K-NN model to see the accuracy of the classifier. Below is the code for
it:
#Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm= confusion_matrix(y_test, y_pred)
[Fig. 3.32: Confusion matrix — [[64, 4], [3, 29]]]
In the above image, we can see there are 64 + 29 = 93 correct predictions and 3 + 4 = 7 incorrect predictions, whereas, in
Logistic Regression, there were 11 incorrect predictions. So we can say that the performance of the model is improved by
using the K-NN algorithm.
Visualizing the Training Set Result:
Now, we will visualize the training set result for the K-NN model. The code will remain the same as in Logistic Regression, except for the name of the graph. Below is the code for it:
#Visualizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('K-NN Algorithm (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output:
By executing the above code, we will get the below graph:
[Fig. 3.33: K-NN Algorithm (Training set) — decision regions over Age vs. Estimated Salary]
The output graph is different from the graph which we obtained in Logistic Regression. It can be understood from the below points:
* As we can see, the graph shows red points and green points. The green points are for the Purchased (1) variable and the red points for the not Purchased (0) variable.
* The graph shows an irregular boundary instead of a straight line or a curve because it is a K-NN algorithm, i.e., it finds the nearest neighbors.
* The graph has classified users into the correct categories, as most of the users who didn't buy the SUV are in the red region and the users who bought the SUV are in the green region.
* The graph shows a good result, but still there are some green points in the red region and red points in the green region. This is not a big issue, as it prevents the model from overfitting.
* Hence, our model is well trained.
Visualizing the Test Set Result:
After the training of the model, we will now test the result by using a new dataset, i.e., the test dataset. The code remains the same except for some minor changes: x_train and y_train will be replaced by x_test and y_test.
Below is the code for it:
#Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('K-NN Algorithm (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output:
[Fig. 3.34: K-NN Algorithm (Test set) — decision regions over Age vs. Estimated Salary]
The above graph shows the output for the test dataset. As we can see in the graph, the predicted output is good, as most of the red points are in the red region and most of the green points are in the green region.
However, there are a few green points in the red region and a few red points in the green region. These are the incorrect observations that we observed in the confusion matrix (7 incorrect predictions).
Unsupervised Learning: Clustering Algorithm
Chapter Outcomes...
After reading this chapter, students will be able to understand:
* The concept of K-means clustering.
* The working of the K-means algorithm.
* The failure of K-means.
Learning Objectives...
® Describe the performance analysis of clustering for the given situation.
® Describe Dimensionality Reduction.
K-MEANS CLUSTERING
A popular clustering approach for effectively dividing spherical data into separate categories is k-means clustering. This
is particularly useful as a feature-engineering step to improve supervised learning models, as well as an analytical tool when
the groupings of data rows are not evident.
For this lesson, we assume that you have a basic understanding of Python and the ability to work with pandas
Dataframes.
Clustering models attempt to categorise data into discrete "clusters” or groups. This might serve as a compelling
perspective in an analysis or as a characteristic in a supervised learning system.
Imagine a social environment where individuals are engaged in conversations inside separate clusters around a room. Upon initial observation of the room, one's gaze falls upon a gathering of individuals. One approach would be to conceptually assign a distinct identification to each group of people by mentally placing a point at the centre of each group. Subsequently, you would have the capacity to designate a distinct appellation to each group in order to characterise them. K-means clustering segments data in essentially this manner.
Fig. 4.1
On the left side of the picture, there are two separate sets of dots that are not labelled and are coloured to indicate
similarity. Applying a k-means algorithm to this dataset (on the right-hand side) can uncover two unique clusters (shown by
separate circles and colours).
When dealing with two dimensions, humans can easily divide these clusters. However, when working with higher
dimensions, a model is required to accomplish the same task.
The Dataset:
We are going to use the California housing data obtained via Kaggle (available here:
https://www.kaggle.com/datasets/camnugent/california-housing-prices?resource=download). Our analysis will incorporate
geographic data, specifically latitude and longitude, together with the median house value. Our objective is to group the
houses based on their geographical proximity and analyse the variations in property values throughout California. The dataset
is stored in a CSV file named 'housing.csv' in our current working directory and is accessed using the pandas library.
import pandas as pd
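The loading step would look something like the following (the file name is as stated above; the usecols selection matches the variable list below, and the home_data name is an assumption used in later snippets):

# Load only the three columns we need from the housing CSV
home_data = pd.read_csv('housing.csv', usecols = ['longitude', 'latitude', 'median_house_value'])
home_data.head()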
The data include 3 variables that we have selected using the usecols parameter:
* longitude: A value representing how far west a house is. Higher values represent houses that are further west.
* latitude: A value representing how far north a house is. Higher values represent houses that are further north.
* median_house_value: The median house price within a block, measured in USD.
K-Means Clustering Workflow:
Similar to other Machine Learning algorithms, K-Means Clustering follows a specific process.
[Figure: The Machine Learning Workflow — modelling (make predictions on the testing set) and deployment (monitor performance, avoid model drift, improve post-deployment) stages]
[Fig. 4.5: Latitude vs. longitude scatter plot, coloured by median_house_value (USD 100,000 to 500,000)]
We clearly see that the Northern and Southern clusters have similar distributions of median house values (clusters 0 and
2) that are higher than the prices in the central cluster (cluster 1).
We can evaluate the performance of the clustering algorithm using a silhouette score, which is a part of sklearn.metrics, where a higher score represents a better fit.
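The loop below assumes the data has already been split and normalised, and that the candidate values of k and two bookkeeping lists have been set up; a sketch consistent with the fits[...] indexing used later (the split proportions and K = range(2, 8) are assumptions):

from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn import preprocessing

# Split, then normalise train and test separately to avoid leakage
X_train, X_test, y_train, y_test = train_test_split(
    home_data[['latitude', 'longitude']], home_data['median_house_value'],
    test_size=0.33, random_state=0)
X_train_norm = preprocessing.normalize(X_train)
X_test_norm = preprocessing.normalize(X_test)

K = range(2, 8)   # candidate cluster counts (assumed)
fits = []         # fitted models, one per value of k
score = []        # silhouette score for each fit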
from sklearn.metrics import silhouette_score

for k in K:
    # train the model for the current value of k on the training data
    model = KMeans(n_clusters = k, random_state = 0, n_init='auto').fit(X_train_norm)
    # store the fit and its silhouette score
    fits.append(model)
    score.append(silhouette_score(X_train_norm, model.labels_))
[Fig. 4.6: Cluster assignments over latitude vs. longitude for k = 2]
The model does an OK job of splitting the state into two halves, but it probably does not capture enough nuance in the California housing market.
Next, we look at k = 4.
sns.scatterplot(data = X_train, x = 'longitude', y = 'latitude', hue = fits[2].labels_)
[Fig. 4.7: Cluster assignments over latitude vs. longitude for k = 4]
This figure categorises California into more coherent clusters based on the geographical location of residences,
specifically their proximity to the northern or southern regions of the state. This model is highly likely to capture a greater
level of subtlety in the housing market as we traverse the state.
Lastly, we examine the case where k is equal to 7.
sns.scatterplot(data = X_train, x = 'longitude', y = 'latitude', hue = fits[5].labels_)  # fits[5] corresponds to k = 7 with K = range(2, 8)
[Fig. 4.8: Cluster assignments over latitude vs. longitude for k = 7]
The graph depicted above exhibits an excessive number of clusters. We have sacrificed the simplicity of interpreting the clusters in order to achieve a geo-clustering result that is deemed "more accurate".
Generally, as the value of K increases, we observe enhancements in the clusters and their respective representations until reaching a specific threshold. Subsequently, we observe diminishing returns or, in some cases, even poorer performance.
To aid in determining the value of k, we can employ a visual representation known as an elbow plot. In this plot, the y-axis
represents the measure of goodness of fit, while the x-axis corresponds to the value of k.
sns.lineplot(x = K, y = score)
[Fig. 4.9: Elbow plot — silhouette score (y-axis) against the value of k (x-axis)]
Typically, we select the point at which the performance improvements level off or deteriorate. It appears that k = 5 is the
optimal choice without risking overfitting. Furthermore, the clusters effectively divide California into distinct groups, which
correspond reasonably well to different price ranges, as demonstrated below.
sns.scatterplot(data = X_train, x = 'longitude', y = 'latitude', hue = fits[3].labels_)
[Fig. 4.11: Box plots of median_house_value (USD) by cluster for k = 5]
Under what circumstances does k-means cluster analysis fail?
K-means clustering is most effective when applied to data that exhibit a spherical shape. Spherical data refers to data
points that cluster closely together in space. It is easier to visualise this in 2 or 3 dimensional space. Data that deviate from a
spherical shape or are not ideally spherical are not suitable for effective use in k-means clustering. For instance, the k-means
clustering algorithm would not perform effectively on the given data because it would fail to identify separate centroids to
cluster the two circles or arcs differently, even if they are visually distinct and should be labelled accordingly.
[Fig. 4.12: Concentric circles and arcs — non-spherical data shapes on which k-means fails]
Is it advisable to partition your data into separate training and testing sets?
The choice to partition your data is contingent upon the objectives you have set for the clustering process. If the objective is to group your data at the conclusion of your research, then it is not obligatory. If you intend to utilize the clusters as a feature in a supervised learning model or for prediction, as demonstrated in the Scikit-Learn Tutorial: Baseball Analytics Pt 1 tutorial, it is necessary to partition your data prior to clustering in order to adhere to the recommended procedures for the supervised learning workflow.
WHAT IS K-MEANS CLUSTERING?
® K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled dataset into different
clusters.
* Here K defines the number of pre-defined clusters that need to be created in the process: if K = 2, there will be two clusters, for K = 3 there will be three clusters, and so on.
It allows us to cluster the data into different groups and is a convenient way to discover the categories of groups in the unlabelled dataset on its own, without the need for any training.
It is a centroid-based algorithm, where each cluster is associated with a centroid.
The main aim of this algorithm is to minimize the sum of distances between the data point and their corresponding
clusters.
The algorithm takes the unlabelled dataset as input, divides the dataset into k-number of clusters, and repeats the
process until it does not find the best clusters. The value of k should be predetermined in this algorithm.
The k-means clustering algorithm mainly performs
two tasks
Determines the best value for K center points or centroids by an iterative process.
Assigns each data point to its closest k-center. Those data points which are near to the particular k-center, create a
cluster.
Hence each cluster has datapoints with some commonalities, and it is away from other clusters.
The below diagram explains the working of the K-means Clustering Algorithm:
[Diagram: data points before K-Means, and the same points grouped into clusters after K-Means]
Fig. 4.13
WORKING OF K-MEANS ALGORITHM
How Does the k-means Algorithm Work?
The working of the K-Means algorithm is explained in the below steps:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points or centroids. (They can be points other than those in the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid of each cluster.
Step-6: If any reassignment occurs, go to Step-4; else go to FINISH.
Step-7: The model is ready.
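These steps can also be condensed into a short NumPy sketch. This is an illustrative from-scratch implementation, not the library version used later in the chapter; the data array X, the value of k, and all names are placeholders.
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step-2: pick k random points from the dataset as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step-3: assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step-4: recompute each centroid as the mean of its assigned points
        # (handling of empty clusters is omitted for brevity)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Steps 5-6: stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids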
Let us understand the above steps by considering the visual plots:
Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables is given below:
Fig. 4.14
• Let us take the number K of clusters, i.e., K = 2, to identify the dataset and to put the points into different clusters. It means here we will try to group these data into two different clusters.
• We need to choose some K random points or centroids to form the clusters. These points can be either points from the dataset or any other points. So, here we are selecting the below two points as K points, which are not part of our dataset. Consider the below image:
Fig. 4.15
• Now we will assign each data point of the scatter plot to its closest K-point or centroid. We will compute this by applying the mathematics we have studied to calculate the distance between two points. So, we will draw a median between both the centroids. Consider the below image:
Fig. 4.16
• From the above image, it is clear that points on the left side of the line are nearer to the K1 or blue centroid, and points to the right of the line are closer to the yellow centroid. Let's color them blue and yellow for clear visualization.
Fig. 4.17
• As we need to find the closest cluster, we will repeat the process by choosing new centroids. To choose the new centroids, we will compute the center of gravity of the points in each cluster, and place the new centroids there, as below:
Fig. 4.18
Next, we will reassign each data point to the new centroids. For this, we will repeat the same process of finding a median line. The median will be like the below image:
Fig. 4.19
From the above image, we can see that one yellow point is on the left side of the line, and two blue points are to the right of the line. So, these three points will be assigned to new centroids.
Fig. 4.20
• As reassignment has taken place, we will again go to Step-4, which is finding new centroids or K-points.
• We will repeat the process by finding the center of gravity of each cluster, so the new centroids will be as shown in the below image:
Fig. 4.21
• As we have got the new centroids, we will again draw the median line and reassign the data points. So, the image will be:
Fig. 4.22
• We can see in the above image that there are no dissimilar data points on either side of the line, which means our model is formed. Consider the below image:
Fig. 4.23
As our model is ready, we can now remove the assumed centroids, and the two final clusters will be as shown in the below image:
Fig. 4.24
FAILURE OF K-MEANS
Failures or challenges associated with K-Means:
1. Sensitive to Initial Centroid Positions: K-Means is sensitive to the initial placement of centroids. Different
initializations can lead to different final cluster assignments, and the algorithm may converge to a local minimum
rather than the global minimum.
2. Assumes Spherical Clusters: K-Means assumes that clusters are spherical and equally sized. In situations where
clusters have different shapes, densities, or sizes, K-Means may fail to accurately capture the underlying structure of
the data.
3. Sensitive to Outliers: Outliers can significantly impact the performance of K-Means. Since the algorithm relies on
the mean (centroid) of the data points in each cluster, outliers can disproportionately influence the centroid, leading
to suboptimal cluster assignments.
4. Requires Pre-specification of the Number of Clusters (K): One of the major limitations of K-Means is that it
requires the user to specify the number of clusters (K) in advance. Choosing an inappropriate value for K can result
in poor clustering results.
5. Limited to Euclidean Distance: K-Means uses Euclidean distance to measure the dissimilarity between data points
and centroids. This can be a limitation when dealing with data that does not adhere to Euclidean geometry or when
the features have different scales.
6. May Produce Unbalanced Clusters: K-Means can produce clusters of significantly different sizes. In cases where
the data naturally forms clusters of unequal sizes, K-Means may not be the most suitable algorithm.
7. Not Robust to Non-Convex Shapes: K-Means assumes that clusters are convex, which means it struggles with
non-convex shapes. If the true clusters have complex, non-convex boundaries, K-Means may fail to accurately
represent them.
8. Does Not Handle Categorical Data Well: K-Means is designed for numerical data, and it may not perform well
with categorical or binary features. Preprocessing techniques, such as one-hot encoding, are often required.
9. Noisy Data Impact: Noise in the data can lead to incorrect cluster assignments. K-Means is not robust to noisy data,
and outliers or irrelevant features can affect the clustering results.
10. Convergence to Local Optima: K-Means uses an iterative optimization process, and it may converge to a local
minimum rather than the global minimum. Multiple runs with different initializations are often performed to
mitigate this issue.
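A quick way to see failures 2 and 7 in practice is to run K-Means on data whose clusters are non-convex, such as scikit-learn's two-moons dataset; the algorithm cuts straight through the arcs instead of separating them. A minimal sketch:
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
import matplotlib.pyplot as plt

X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# K-Means splits the moons with a straight boundary instead of following the arcs
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.show()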
IMPLEMENTATION OF K-MEANS ALGORITHM
Before implementation, let's understand what type of problem we will solve here. We have a dataset of Mall_Customers, which is the data of customers who visit a mall and spend there.
In the given dataset, we have Customer_Id, Gender, Age, Annual Income ($), and Spending Score (a calculated value of how much a customer has spent in the mall; the higher the value, the more the customer has spent). From this dataset, we need to find some patterns; as this is an unsupervised method, we don't know exactly what to compute in advance.
The steps to be followed for the implementation are given below:
• Data pre-processing.
• Finding the optimal number of clusters using the elbow method.
• Training the K-means algorithm on the training dataset.
• Visualizing the clusters.
Step-1: Data Pre-processing Step
The first step is data pre-processing, as in our earlier topics of Regression and Classification. But for the clustering problem, it differs from the other models. Let's discuss it:
• Importing Libraries: As in previous topics, we will first import the libraries for our model, which is part of data pre-processing. The code is given below:
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
In the above code, numpy is imported for performing mathematical calculations, matplotlib for plotting the graph, and pandas for managing the dataset.
• Importing the Dataset: Next, we will import the dataset that we need to use, here the Mall_Customers_data.csv dataset. It can be imported using the below code:
# importing the dataset
dataset = pd.read_csv('Mall_Customers_data.csv')
By executing the above lines of code, we will get our dataset in the Spyder IDE. The dataset looks like the below image:
[Dataframe view of the dataset with columns CustomerID, Gender, Age, Annual Income (k$), and Spending Score (1-100)]
Fig. 4.25
From the above dataset, we need to find some patterns in it.
• Extracting Independent Variables: Here we don't need any dependent variable for the data pre-processing step, as it is a clustering problem and we have no idea about what to determine. So we will just add a line of code for the matrix of features:
x = dataset.iloc[:, [3, 4]].values
As we can see, we are extracting only the 3rd and 4th features. This is because we need a 2D plot to visualize the model, and some features, such as customer_id, are not required.
Step-2: Finding the optimal number of clusters using the elbow method
In the second step, we will try to find the optimal number of clusters for our clustering problem. So, as discussed above,
here we are going to use the elbow method for this purpose.
As we know, the elbow method uses the WCSS concept to draw the plot by plotting WCSS values on the Y-axis and the number of clusters on the X-axis. So we are going to calculate the WCSS values for different k values ranging from 1 to 10. Below is the code for it:
# finding the optimal number of clusters using the elbow method
from sklearn.cluster import KMeans
wcss_list = []  # initializing the list for the values of WCSS
# using a for loop for iterations on k ranging from 1 to 10
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)
mtp.plot(range(1, 11), wcss_list)
mtp.title('The Elbow Method Graph')
mtp.xlabel('Number of clusters(k)')
mtp.ylabel('wcss_list')
mtp.show()
As we can see in the above code, we have used the KMeans class of the sklearn.cluster library to form the clusters.
Next, we have created the wcss_list variable as an empty list, which is used to hold the WCSS value computed for each value of k from 1 to 10.
After that, we have initialized the for loop to iterate over the different values of k from 1 to 10; since Python's range excludes the upper bound, it is written as 11 so that the value 10 is included.
The rest of the code is similar to what we did in earlier topics: we have fitted the model on the matrix of features and then plotted the graph between the number of clusters and WCSS.
Output: After executing the above code, we will get the below output:
[Line plot titled "The Elbow Method Graph": WCSS on the Y-axis against the number of clusters (k) on the X-axis]
Fig. 4.26
From the above plot, we can see the elbow point is at 5. So, the number of clusters here will be 5.
[Variable explorer view of wcss_list showing the ten computed WCSS values]
Fig. 4.27
Step-3: Training the K-means algorithm on the training dataset
As we have got the number of clusters, we can now train the model on the dataset.
To train the model, we will use the same two lines of code as in the above section, but here, instead of using i, we will use 5, as we know there are 5 clusters that need to be formed. The code is given below:
# training the K-means model on the dataset
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
y_predict = kmeans.fit_predict(x)
The first line is the same as above, creating the object of the KMeans class.
In the second line of code, we have created the variable y_predict by fitting the model and predicting the cluster of each record.
By executing the above lines of code, we will get the y_predict variable. We can check it under the Variable Explorer option in the Spyder IDE, and we can now compare the values of y_predict with our original dataset. Consider the below image:
[Side-by-side Variable Explorer view of the dataset and y_predict; for example, CustomerID 1 belongs to cluster 3]
Fig. 4.28
From the above image, we can now see that CustomerID 1 belongs to cluster 3 (as the index starts from 0, the stored value 2 corresponds to cluster 3), CustomerID 2 belongs to cluster 4, and so on.
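If desired, the predicted cluster can also be attached to the original dataframe for easier inspection; a small illustrative addition:
# attach the predicted cluster label to the original dataframe (illustrative)
dataset['Cluster'] = y_predict
print(dataset.head())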
Step-4: Visualizing the Clusters
The last step is to visualize the clusters. As we have 5 clusters for our model, we will visualize each cluster one by one.
To visualize the clusters, we will draw a scatter plot using the mtp.scatter() function of matplotlib.
# visualizing the clusters
mtp.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s = 100, c = 'blue', label = 'Cluster 1')  # first cluster
mtp.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s = 100, c = 'green', label = 'Cluster 2')  # second cluster
mtp.scatter(x[y_predict == 2, 0], x[y_predict == 2, 1], s = 100, c = 'red', label = 'Cluster 3')  # third cluster
mtp.scatter(x[y_predict == 3, 0], x[y_predict == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4')  # fourth cluster
mtp.scatter(x[y_predict == 4, 0], x[y_predict == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5')  # fifth cluster
mtp.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300, c = 'yellow', label = 'Centroid')
mtp.title('Clusters of customers')
mtp.xlabel('Annual Income (k$)')
mtp.ylabel('Spending Score (1-100)')
mtp.legend()
mtp.show()
In the above lines of code, we have written one call per cluster, for clusters 1 to 5. In each mtp.scatter call, the first coordinate, e.g., x[y_predict == 0, 0], selects the first feature (annual income) of the points assigned to that cluster, and the second coordinate selects the second feature (spending score); the cluster index in y_predict ranges from 0 to 4.
Output:
[Scatter plot titled "Clusters of customers": Spending Score (1-100) versus Annual Income (k$), showing the five coloured clusters and the yellow centroids]
Fig. 4.29
The output image clearly shows the five different clusters in different colors. The clusters are formed between two parameters of the dataset: the annual income of the customer and the spending score. We can change the colors and labels as per requirement or choice. We can also observe some points from the above patterns, which are given below:
• Cluster 1 shows the customers with average salary and average spending, so we can categorize these customers as standard.
• Cluster 2 shows the customers with a high income but low spending, so we can categorize them as careful.
• Cluster 3 shows low income and also low spending, so these customers can be categorized as sensible.
• Cluster 4 shows the customers with low income and very high spending, so they can be categorized as careless.
• Cluster 5 shows the customers with high income and high spending, so they can be categorized as target; these customers can be the most profitable customers for the mall owner.
DIMENSIONALITY REDUCTION
What is the significance of Dimension Reduction in machine learning and predictive modelling?
The issue of an undesired increase in dimension is closely linked to the practice of measuring or recording data at a far more detailed level than was done previously. This is by no means a recent issue, but it has become increasingly important due to the significant increase in data.
Recently, there has been a significant surge in the use of sensors in business. Sensors continuously capture and store data for subsequent analysis, and there can be a significant amount of redundancy in the captured data. Consider, for instance, a motorbike racer participating in racing contests. His location and motion are determined using a GPS sensor on his motorbike, gyroscopes, various video feeds, and his smart watch. Due to individual discrepancies in the recording process, the data will not be identical; however, these additional sources provide minimal extra information about his position. Suppose an analyst has access to all this data and is tasked with analysing the racing strategy of the biker. They will encounter numerous factors or dimensions that are comparable and provide little to no additional information. This is the problem of excessive, unwanted dimensions, and it requires a method for reducing the number of dimensions.
Now, let’s examine further instances of innovative methods for gathering data:
• Casinos are collecting data by means of surveillance cameras and monitoring the activities of their customers.
• Political parties are collecting data by broadening their presence in the field.
• Smartphone applications gather extensive personal information about users.
• Set-top boxes gather data about programme preferences and viewing schedules.
• Organisations assess the worth of their brand by analysing social media interactions such as comments, likes, number of followers, and the overall sentiment expressed, both positive and negative.
Increasing the number of variables leads to an increase in difficulties. To mitigate this issue, dimension reduction approaches offer a solution.
Fig. 4.30
We can reduce the number of dimensions in a data collection from n to k, where k is less than n, using comparable
methods. The k dimensions can be identified directly or can be a combination of dimensions (weighted averages of
dimensions) or new dimension(s) that effectively represent several existing dimensions.
Image processing is a widely used application of this method. You may have encountered the Facebook application titled
"Which Celebrity Do You Resemble?”. However, have you ever contemplated the underlying algorithm employed in this?
Here is the solution: In order to determine the corresponding celebrity image, we employ pixel data, with each pixel
representing a single dimension. Each image contains a large number of pixels, which corresponds to a high number of
dimensions. Each dimension holds significance in this context. Arbitrarily excluding dimensions is not permissible in order to
enhance the comprehensibility of your entire dataset. Dimension reduction approaches are employed in such scenarios to
identify the important dimension(s) through the use of various methods. We will address these strategies briefly.
What benefits does Dimension Reduction offer?
Now, let's examine the advantages of implementing the Dimension Reduction process:
It facilitates data compression and decreases the necessary storage space.
It reduces the amount of time needed to conduct identical calculations. Reducing the number of dimensions results in
decreased computational requirements. Additionally, a lower number of dimensions enables the use of algorithms that are not
suitable for high-dimensional data.
It addresses the issue of multicollinearity, which enhances the performance of the model. It eliminates superfluous
characteristics. For instance, it is unnecessary to store a value in two distinct units, such as metres and inches.
By reducing the dimensions of data to two or three, we can accurately plot and visualise it, and subsequently discern patterns with more clarity. Below, you can observe the process of converting 3D data into 2D: first the 2D plane is established, and then the points are depicted on the two newly defined axes, z1 and z2.
• The number of input features, variables, or columns present in a given dataset is known as dimensionality, and the process of reducing these features is called dimensionality reduction.
• In various cases a dataset contains a huge number of input features, which makes the predictive modelling task more complicated. Because it is very difficult to visualize or make predictions for a training dataset with a high number of features, dimensionality reduction techniques are required for such cases.
• A dimensionality reduction technique can be defined as "a way of converting a higher-dimensional dataset into a lower-dimensional dataset while ensuring that it provides similar information." These techniques are widely used in machine learning for obtaining a better-fitting predictive model while solving classification and regression problems.
• It is commonly used in fields that deal with high-dimensional data, such as speech recognition, signal processing, bioinformatics, etc. It can also be used for data visualization, noise reduction, cluster analysis, etc.
[Diagram: overview of dimensionality reduction techniques]
Fig. 4.31
Advantages of Applying Dimensionality Reduction
• By reducing the dimensions of the features, the space required to store the dataset also gets reduced.
• Less computation/training time is required for reduced dimensions of features.
• Reduced dimensions of the dataset's features help in visualizing the data quickly.
• It removes redundant features (if present) by taking care of multicollinearity.
Disadvantages of Dimensionality Reduction
There are also some disadvantages of applying dimensionality reduction, which are given below:
• Some data may be lost due to dimensionality reduction.
• In the PCA dimensionality reduction technique, the number of principal components to retain is sometimes unknown in advance.
What are the common methods to perform Dimension Reduction?
There are many methods to perform Dimension reduction. I have listed the most common methods below:
1. Missing Values: When we come across missing values while analysing data, how should we proceed? To begin, we should determine the cause, and then either impute the missing data or eliminate the affected variables using suitable approaches. However, what if we encounter an excessive number of missing values? Should we replace them with imputed values or remove the variables entirely?
I would choose the latter option, as such a variable carries few specifics about the dataset and would not contribute to enhancing the efficacy of the model. Next, is there a specific threshold for the number of missing values that would warrant deleting a variable? The answer differs depending on the specific circumstances, but as a rough rule, a variable can be discarded if it has more than around 40-50% missing values. A sketch of this filter is given below.
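As an illustration of the 40-50% rule, columns whose share of missing values exceeds a chosen threshold can be dropped with pandas. A minimal sketch on a toy dataframe (the threshold is a judgment call, not a fixed rule):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [np.nan, np.nan, np.nan, 4]})
# keep only the columns whose share of missing values is at most 50%
df = df.loc[:, df.isnull().mean() <= 0.5]
print(df.columns.tolist())  # ['a'] -- column 'b' (75% missing) is dropped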
2. Low Variance: Consider a situation where all observations in our dataset have the same value, say 5, indicating a constant variable. Do you believe it has the potential to enhance the efficacy of the model? No, because it has zero variance.
If there are a large number of dimensions, it is advisable to exclude variables with low variance in comparison to the others, as these variables will not effectively account for the variation in the target variables.
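scikit-learn provides VarianceThreshold for exactly this filter. A minimal sketch (note that features should be on comparable scales for a shared threshold to be meaningful):
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[5, 1.0], [5, 2.0], [5, 3.0]])  # the first column is constant
selector = VarianceThreshold(threshold=0.0)    # drop zero-variance features
X_reduced = selector.fit_transform(X)
print(X_reduced)  # only the second column survives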
3. Decision Trees: This is a strategy that I particularly favour. It serves as a comprehensive approach to address various issues such as handling missing values and outliers, and identifying relevant variables. It performed effectively during our Data Hackathon as well: multiple data scientists employed decision tree algorithms and achieved successful outcomes.
4. Random Forest: Random Forest is a method similar to a decision tree. I would suggest utilising the inherent feature importance offered by random forests to choose a reduced set of input features. It is important to note that random forests tend to show a bias towards variables with a higher number of distinct values, meaning they favour numeric variables over binary or categorical ones.
5. Strong Correlation: Dimensions that have a strong correlation can negatively impact the model's performance. Furthermore, it is undesirable to have several variables that contain comparable information, a phenomenon commonly referred to as "multicollinearity". To locate variables with high correlation, you can utilise either the Pearson correlation matrix for continuous data or the polychoric correlation matrix for discrete variables, and then screen the highly correlated variables using the Variance Inflation Factor (VIF). Variables with a VIF greater than 5 can be eliminated; a sketch of this check follows.
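A hedged sketch of the VIF check described above, using statsmodels' variance_inflation_factor; the toy data and the cut-off of 5 follow the text, and in practice a constant column is often added before computing VIFs:
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
X = pd.DataFrame({'x1': x1,
                  'x2': 2 * x1 + rng.normal(scale=0.01, size=100),  # nearly collinear with x1
                  'x3': rng.normal(size=100)})

vif = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                index=X.columns)
print(vif)  # x1 and x2 show very large VIFs; drop one of them if VIF > 5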
6. Backward Feature Elimination: This method begins with all n dimensions. Compute the sum of squared residuals (SSR) after removing each variable in turn, repeating this process n times. Then we find the variable whose removal results in the smallest increase in the SSR and remove it, leaving a dataset with n − 1 input features.
Continue this procedure until no remaining variable can be eliminated. In the Online Hackathon conducted by Analytics Vidhya on 11-12 Jun'15, the data scientist who secured second rank utilised Backward Feature Elimination in linear regression to train their model.
Conversely, we can employ the "Forward Feature Selection" technique. This strategy involves selecting a single variable and evaluating the model's performance when an additional variable is introduced. Variable selection is determined by the extent to which it improves model performance.
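Both directions are available in scikit-learn's SequentialFeatureSelector. A minimal sketch with a linear regression estimator; the dataset and the number of features to retain are illustrative:
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)
# direction='backward' mirrors Backward Feature Elimination; 'forward' is Forward Selection
sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=5,
                                direction='backward')
sfs.fit(X, y)
print(sfs.get_support())  # boolean mask of the retained features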
7. Factor Analysis: Suppose there exists a strong correlation among certain variables. These variables can be grouped based on their correlations, meaning that all variables within a specific group may exhibit strong correlations with each other, but weak correlations with variables in other group(s). Each group represents a single underlying construct or factor. These factors are few in comparison to the large number of dimensions; however, they are difficult to observe directly. There are essentially two approaches to conducting factor analysis:
• Exploratory Factor Analysis (EFA)
• Confirmatory Factor Analysis (CFA)
8. Principal Component Analysis (PCA): This method transforms the variables into a new set of variables that are linear combinations of the original variables. The new variables are referred to as principal components. The components are obtained such that the first principal component captures as much of the variation in the original data as possible, and each subsequent component in turn has the largest variance possible.
The second principal component must be perpendicular to the first principal component. Put simply, it strives to capture the remaining variability in the data that is not accounted for by the first principal component. A two-dimensional dataset can have a maximum of two principal components. Displayed below is a dataset along with its first and second principal components; it is evident that the second principal component is perpendicular to the first.
[Plot of a 2-D dataset with its first and second principal components marked]
Fig. 4.32
The principal components are influenced by the scale of measurement. To address this problem, it is necessary to standardise the variables prior to performing PCA; otherwise, applying PCA to your dataset becomes meaningless. Also, if interpretability of the results is a priority for your analysis, PCA is not the appropriate technique for your project.
PRINCIPAL COMPONENT ANALYSIS (PCA)
Principal Component Analysis (PCA) is a statistical technique used to reduce the dimensionality of a dataset while
retaining as much information as possible. It is commonly used in data analysis and machine learning to identify the most
important features or variables that contribute the most to the overall variance in the data.
Principal Component Analysis (PCA) is a widely used unsupervised learning method that is employed to decrease the dimensionality of extensive datasets. It enhances interpretability while simultaneously reducing information loss. It facilitates the identification of the most prominent characteristics in a dataset and simplifies the process of visualising the data in both two and three dimensions. PCA does so by identifying a series of linear combinations of variables.
Fig. 4.33
In the above figure, we have several points plotted on a 2-D plane. There are two principal components. PC1 is the
primary principal component that explains the maximum variance in the data. PC2 is another principal component that is
orthogonal to PC1.
A principal component is a linear combination of the original variables in a dataset that captures the maximum amount of
variance in the data.
The Principal Components refer to a linear representation that effectively represents the majority of the variability
included in the data. They possess both a specific orientation and a measurable size. Principal components refer to the
orthogonal projections, or perpendicular projections, of data onto a lower-dimensional space.
Having grasped the fundamental concepts of PCA, we will now go into the subsequent aspect of PCA in the field of
Machine Learning.
Dimensionality here:
Dimensionality refers to the number of dimensions or variables in a given dataset or problem.
Dimensionality refers to the number of aspects or variables utilised in the investigation. Visualising and interpreting the relationships between variables can pose challenges when working with high-dimensional data, such as datasets containing a large number of variables. Dimensionality reduction approaches such as PCA are employed to retain the most essential information while decreasing the number of variables in the dataset. PCA converts the original variables into a new set of variables called principal components, which are linear combinations of the original variables. The dimensionality of the reduced dataset is determined by the number of principal components utilised in the investigation. The goal of PCA is to identify a reduced set of principal components that capture the most significant variation in the data. By lowering the dataset's dimensionality, PCA can optimise data processing, improve visualisation, and facilitate the identification of patterns and correlations among variables.
The mathematical formulation of dimensionality reduction in the context of Principal Component Analysis (PCA) can be
expressed as:
The objective of PCA is to convert the initial variables in a dataset, represented by the n × p data matrix X, into a new collection of k variables known as principal components, which capture the most substantial variation contained in the data. The principal components are determined by calculating linear combinations of the original variables according to the following formulas:
PC_1 = a_11 x_1 + a_12 x_2 + ... + a_1p x_p
PC_2 = a_21 x_1 + a_22 x_2 + ... + a_2p x_p
...
PC_k = a_k1 x_1 + a_k2 x_2 + ... + a_kp x_p
The term "a_ij" represents the loading or weight of variable "x_j" on principal component "PC_i". Here, "x_j" refers to
the j* variable in the data matrix "X". The principle components are arranged in a specific order, with PC_| capturing the
highest amount of variation in the data, PC_2 capturing the second highest amount of variation, and so forth. The value of k,
which represents the number of principal components utilised in the analysis, directly influences the decreased
dimensionality of the dataset.
Correlation:
Correlation refers to the statistical relationship between two or more variables.
Correlation is a statistical term that quantifies the direction and magnitude of the linear relationship between two
variables. In the context of Principal Component Analysis (PCA), the covariance matrix is computed to represent the
pairwise correlations between all variables in the dataset. This matrix is a square matrix. The diagonal members of the
covariance matrix represent the variance of each variable, whereas the off-diagonal elements reflect the covariances between
distinct pairs of variables. The correlation coefficient, which ranges from —1 to 1, is a standardised statistic used to determine
the degree and direction of the linear relationship between two variables.
A correlation value of 0 indicates the absence of a linear relationship between the two variables, whereas correlation coefficients of 1 and −1 indicate perfect positive and negative correlations, respectively. The principal components in PCA are linear combinations of the original variables that maximise the amount of variation accounted for in the data. The calculation of the principal components involves the correlation matrix.
Within the context of Principal Component Analysis (PCA), correlation is mathematically expressed in the following
manner:
The correlation matrix C is a symmetric n × n matrix, where n represents the number of variables (x_1, x_2, ..., x_n) in the dataset. It contains the correlation coefficients between these variables.
The formula for calculating the correlation coefficient between two variables x_i and x_j is:
C_ij = cov(x_i, x_j) / (sd(x_i) * sd(x_j))
where sd(x_i) and sd(x_j) denote the standard deviations of variables x_i and x_j, and cov(x_i, x_j) represents the covariance between them.
In matrix notation, the correlation matrix can be expressed as C = XᵀX / (n − 1), where X is the standardised data matrix and n is the number of observations.
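This matrix form can be checked numerically: after standardising the columns of X, XᵀX/(n − 1) reproduces NumPy's correlation matrix. A small verification sketch:
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # standardise each column

C = Xs.T @ Xs / (len(Xs) - 1)
print(np.allclose(C, np.corrcoef(X, rowvar=False)))  # True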
Orthogonality
The term "orthogonality” refers to the property of the principle components in the PCA technique, where they are
constructed to be perpendicular to one other. This suggests that there is no superfluous information among the primary
components and that they are not interrelated.
The concept of orthogonality in Principal Component Analysis (PCA) can be mathematically defined as follows: each
principal component is constructed to maximise the amount of variance it explains, while also satisfying the condition that it
is perpendicular to all other principal components. The primary components are calculated by combining the original
variables in a linear manner. Therefore, every primary component is ensured to reflect a distinct and non-duplicative portion
of the variability in the data.
The orthogonality constraint is defined as:
a_i1 * a_j1 + a_i2 * a_j2 + ... + a_ip * a_jp = 0
for all values of i and j where i ≠ j. Consequently, the dot product of the loading vectors for distinct principal components is zero, signifying their orthogonality.
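The constraint can be verified directly on fitted components: the loading vectors returned by scikit-learn's PCA have pairwise dot products of (numerically) zero. A brief check, with random data standing in for a real dataset:
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(100, 5))
pca = PCA(n_components=3).fit(X)

# rows of components_ are the loading vectors a_i; their Gram matrix is the identity
gram = pca.components_ @ pca.components_.T
print(np.allclose(gram, np.eye(3)))  # True: the components are orthonormal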
PCA Implementation
Example: Let's take a look at how PCA can be implemented in Scikit-Learn. We will be using the Mushroom
classification dataset for this.
Download Datasets: https://www.kaggle.com/datasets/uciml/mushroom-classification
First, we need to import all the modules we need, which includes PCA, train_test_split, and labeling and scaling tools:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore")
After we load in the data, we will check for any null values. We will also encode the data with the LabelEncoder. The
class feature is the first column in the dataset, so we split up the features and labels accordingly:
m_data = pd.read_csv('mushrooms.csv')
encoder = LabelEncoder()
X_features = m_data.iloc[:,1:23]
y_label = m_data.iloc[:, 0]
We will now scale the features with the standard scaler. This is optional as we aren't actually running the classifier, but it
may impact how our data is analyzed by PCA:
# Scale the features
scaler = StandardScaler()
X_features = scaler.fit_transform(X_features)
We will now use PCA to get the list of features and plot which features have the most explanatory power, i.e., the most variance. These are the principal components. It looks like around 17 or 18 of the features explain the majority, almost 95%, of our data:
# Visualize
pca = PCA()
pca.fit_transform(X_features)
pca_variance = pca.explained_variance_
plt.figure(figsize=(8, 6))
plt.bar(range(22), pca_variance, alpha=0.5, align='center', label='individual variance')
plt.legend()
plt.ylabel('Variance ratio')
plt.xlabel('Principal components')
plt.show()
[Bar chart of the individual explained variance of each principal component]
Fig. 4.34
Let's convert the features into the 17 top features. We will then plot a scatter plot of the data point classification based on
these 17 features:
pca2 = PCA(n_components=17)
pca2.fit(X_features)
x_3d = pca2.transform(X_features)
plt.figure(figsize=(8,6))
plt.scatter(x_3d[:,0], x_3d[:,5], c=m_data['class'])
plt.show()
Fig. 4.35
Let's also do this for the top 2 features and see how the classification changes:
pca3 = PCA(n_components=2)
pca3.fit(X_features)
x_3d = pca3.transform(X_features)
plt.figure(figsize=(8,6))
plt.scatter(x_3d[:,0], x_3d[:,1], c=m_data['class'])
plt.show()
Fig. 4.36
SINGULAR VALUE DECOMPOSITION (SVD)
The primary objective of Singular Value Decomposition (SVD) is to streamline a matrix and facilitate computations
involving the matrix. The matrix is decomposed into its individual components, akin to the objective of Principal Component
Analysis (PCA). While it is not essential to comprehend all the intricacies of Singular Value Decomposition (SVD) in order
to apply it in your machine learning models, possessing a basic understanding of its functioning will enhance your ability to
determine its appropriate usage.
Singular Value Decomposition (SVD) can be performed on matrices that are either complex or real-valued. However, for
the purpose of clarity, we will focus on explaining the process of decomposing a real-valued matrix.
During Singular Value Decomposition (SVD), we are presented with a matrix containing data and our objective is to
decrease the number of columns in the matrix. This process decreases the number of dimensions in the matrix while retaining
the maximum amount of variability in the data.
The matrix is factored as:
A = U * D * V^T
where V^T is the transpose of matrix V. Given a matrix A, it is possible to express it as three separate matrices denoted U, D, and V. Matrix A contains x × y elements, matrix U is an orthogonal matrix with x × x elements, and matrix V is a separate orthogonal matrix with y × y elements; D is an x × y matrix whose non-zero entries lie on its diagonal.
Decomposing the matrix places the singular values of the original matrix on the diagonal of D. Because U and V are orthogonal, multiplying U, D, and the transpose of V back together reproduces the original matrix A.
By decomposing matrix A into U, D, and V, we have three distinct matrices that together encapsulate the information of matrix A.
Interestingly, the most important information in these matrices is concentrated in the left-most columns. By keeping only these columns, we can obtain a reliable approximation of matrix A. The new matrix is more streamlined and easier to work with, as it has significantly fewer dimensions.
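The column-truncation idea can be written in a few lines of NumPy: keep the first k singular values and vectors, then multiply back to approximate A. A minimal sketch:
import numpy as np

A = np.random.default_rng(0).normal(size=(8, 6))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 3  # number of singular values to keep
A_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.linalg.norm(A - A_approx))  # approximation error shrinks as k grows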
Example of Singular Value Decomposition (SVD) Implementation
Singular Value Decomposition (SVD) is frequently employed for image compression. By reducing the information in the pixel values comprising the red, green, and blue channels of an image, a less intricate image can be obtained while retaining the essential visual information. Let us attempt to utilise SVD to compress an image and subsequently display it.
We will utilise multiple functions to manage the compression of the image. To execute this task, we simply require the
Numpy library and the Image function from the PIL library. Numpy provides a mechanism to perform the SVD computation.
import numpy
from PIL import Image
First, we will write a function to load in the image and turn it into a Numpy array. We then want to select the red, green, and blue color channels from the image:
def load_image(image):
    image = Image.open(image)
    im_array = numpy.array(image)

    red = im_array[:, :, 0]
    green = im_array[:, :, 1]
    blue = im_array[:, :, 2]
    return red, green, blue
Fig. 4.37
We also need to set the singular value limit we'll use; let's start with 350 for now:
red, green, blue = load_image("dog.jpg")
singular_val_lim = 350
Ultimately, we can obtain the compressed values for the three colour channels and convert them from Numpy arrays into image components using PIL. Then we simply need to combine the three channels and display the image. The resulting image should be smaller and simpler in appearance than the original image.
Fig. 4.38
Upon examining the image sizes, it becomes evident that the compressed version is smaller, though with some lossy compression applied; some visual distortion is also present in the photograph.
You can manipulate and modify the singular value limit. As the selected limit decreases, the compression increases; however, below a certain threshold image artifacting becomes visible and the image quality deteriorates.
def compress_image(red, green, blue, singular_val_lim):
    compressed_red = channel_compress(red, singular_val_lim)
    compressed_green = channel_compress(green, singular_val_lim)
    compressed_blue = channel_compress(blue, singular_val_lim)

    im_red = Image.fromarray(compressed_red)
    im_green = Image.fromarray(compressed_green)
    im_blue = Image.fromarray(compressed_blue)

    # combine the three compressed channels back into one RGB image and display it
    new_image = Image.merge("RGB", (im_red, im_green, im_blue))
    new_image.show()
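The helper channel_compress used above is not shown in the listing; a plausible implementation, assuming it truncates the SVD of a single colour channel at singular_value_limit singular values, is:
import numpy

# Assumed helper: compress one colour channel by truncating its SVD.
def channel_compress(color_channel, singular_value_limit):
    u, s, v = numpy.linalg.svd(color_channel)
    n = singular_value_limit
    # rebuild the channel from the first n singular values only
    left_matrix = numpy.matmul(u[:, 0:n], numpy.diag(s)[0:n, 0:n])
    inner_compressed = numpy.matmul(left_matrix, v[0:n, :])
    return inner_compressed.astype('uint8')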
SUBSET SELECTION
Approaches of Dimension Reduction:
There are two ways to apply the dimension reduction technique, which are given below:
Feature Selection:
Feature selection is the process of selecting the subset of the relevant features and leaving out the irrelevant features
present in a dataset to build a model of high accuracy. In other words, it is a way of selecting the optimal features from the
input dataset.
Three methods are used for the feature selection:
1. Filter Methods:
In this method, the dataset is filtered, and a subset that contains only the relevant features is taken. Some common techniques of the filter method are:
• Correlation
• Chi-Square Test
• ANOVA
• Information Gain, etc.
2. Wrapper Methods:
The wrapper method has the same goal as the filter method, but it uses a machine learning model for its evaluation. In this method, some features are fed to the ML model and its performance is evaluated. The performance decides whether to add or remove those features to increase the accuracy of the model. This method is more accurate than the filter method but more complex to run. Some common techniques of wrapper methods are:
• Forward Selection
• Backward Selection
• Bi-directional Elimination
3. Embedded Methods:
Embedded methods check the different training iterations of the machine learning model and evaluate the importance of each feature. Some common techniques of embedded methods are:
• LASSO
• Elastic Net
• Ridge Regression, etc.
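An embedded-method sketch: LASSO zeroes out the coefficients of unhelpful features, and scikit-learn's SelectFromModel keeps only the features with non-negligible weights (the dataset and alpha are illustrative):
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)
selector = SelectFromModel(Lasso(alpha=0.5)).fit(X, y)
print(selector.get_support())  # features whose LASSO coefficient is non-zero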
Feature Extraction:
Feature extraction is the process of transforming the space containing many dimensions into space with fewer
dimensions. This approach is useful when we want to keep the whole information but use fewer resources while processing
the information.
Some common feature extraction techniques are:
(a) Principal Component Analysis
(b) Linear Discriminant Analysis
(c) Kernel PCA
(d) Quadratic Discriminant Analysis
Common Techniques of Dimensionality Reduction:
(a) Principal Component Analysis
(b) Backward Elimination
(c) Forward Selection
(d) Score Comparison
(e) Missing Value Ratio
(f) Low Variance Filter
(g) High Correlation Filter
6. Deep Learning for Sequential and Image Data
Chapter Outcomes...
After reading this chapter, students will be able to understand:
The concept of sequential data.
Learning Objectives...
• Implement Deep Learning for sequential data.
• Implement Deep Learning for image data.
Fig. 6.1
In normal neural networks, the inputs and outputs are considered to be independent. However, in certain situations, such
as predicting the next word in a phrase, the preceding words become crucial and need to be remembered. Consequently, the
Recurrent Neural Network (RNN) was developed, employing a Hidden Layer to address the issue. The fundamental element
of RNN is the Hidden state, which retains precise details regarding a sequence.
Recurrent Neural Networks (RNNs) possess a Memory component that retains comprehensive information pertaining to
the computations. It utilises identical configurations for every input, as it generates the same result by executing the same
operation on all inputs or hidden layers.
The Architecture of a Traditional RNN:
RNNs are a type of neural network that has hidden states and allows past outputs to be used as inputs. They usually look like this:
Fig. 6.2
For each time step t, the activation a&lt;t&gt; and the output y&lt;t&gt; are expressed as follows:
a&lt;t&gt; = g1(W_aa a&lt;t-1&gt; + W_ax x&lt;t&gt; + b_a)    and    y&lt;t&gt; = g2(W_ya a&lt;t&gt; + b_y)
where W_ax, W_aa, W_ya, b_a, b_y are coefficients shared temporally, and g1, g2 are activation functions.
Fig. 6.3
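A single RNN time step, written out in NumPy to match the equations above; the choice of tanh for g1 and the identity for g2, and the weight shapes, are illustrative:
import numpy as np

def rnn_step(x_t, a_prev, W_aa, W_ax, W_ya, b_a, b_y):
    # a<t> = g1(W_aa a<t-1> + W_ax x<t> + b_a), with g1 = tanh
    a_t = np.tanh(W_aa @ a_prev + W_ax @ x_t + b_a)
    # y<t> = g2(W_ya a<t> + b_y), with g2 taken as the identity here
    y_t = W_ya @ a_t + b_y
    return a_t, y_t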
RNN architecture can vary depending on the problem you are trying to solve. From those with a single input and output
to those with many (with variations between).
Below are some examples of RNN architectures that can help you better understand this.
• One To One: There is only one input-output pair here. A one-to-one architecture is used in traditional neural networks.
• One To Many: A single input in a one-to-many network can result in numerous outputs. One-to-many networks are used, for example, in music generation.
• Many To One: In this scenario, a single output is produced by combining many inputs from distinct time steps. Sentiment analysis and emotion identification use such networks, in which the class label is determined by a sequence of words.
• Many To Many: For many-to-many, there are numerous options, e.g., two inputs yielding three outputs. Machine translation systems, such as English-to-French or vice versa, use many-to-many networks.
• Recurrent Neural Networks or RNNs consist of directed connections that form a cycle, allowing the input provided from the LSTMs to be used as input in the current phase of RNNs.
• These inputs are deeply embedded as inputs, and the memorization ability of LSTMs lets these inputs get absorbed for a period in the internal memory.
• RNNs are therefore dependent on the inputs that are preserved by LSTMs and work under the synchronization phenomenon of LSTMs.
• RNNs are mostly used in captioning images, time series analysis, recognizing handwritten data, and translating data for machines.
• RNNs work by feeding the output from time (t − 1) back as input at time t; likewise, the output determined at time t is fed as input at time t + 1.
• Similarly, these processes are repeated for inputs of any length.
• Another fact about RNNs is that they store historical information, and there is no increase in the input size even if the model size is increased.
• RNNs look something like this when unfolded:
[Diagram of an unfolded RNN: input at time t, hidden state, and output at time t]
Fig. 6.4
Working of Recurrent Neural Networks:
The information in recurrent neural networks cycles through a loop to the middle hidden layer.
Fig. 6.5
The input layer, denoted as x, receives and processes the input data of the neural network before transmitting it to the
middle layer.
The middle layer h can consist of multiple hidden layers, each with its own activation functions, weights, and biases. In an ordinary network these hidden layers are independent of one another, and the network has no memory. A recurrent neural network instead standardises the activation functions, weights, and biases so that each hidden layer has identical parameters: rather than creating many distinct hidden layers, it creates a single layer and iterates over it as many times as needed.
Common Activation Functions:
A neuron's activation function dictates whether it should be turned on or off. Nonlinear functions usually transform a neuron's output to a number between 0 and 1 or between −1 and 1.
Fig. 6.6
The following are some of the most commonly utilized functions:
• Sigmoid: expressed by the formula g(z) = 1 / (1 + e^(−z)).
• Tanh: expressed by the formula g(z) = (e^z − e^(−z)) / (e^z + e^(−z)).
• Relu: expressed by the formula g(z) = max(0, z).
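The three functions in NumPy, matching the formulas above:
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))    # maps any input into (0, 1)

def tanh(z):
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))  # maps into (-1, 1)

def relu(z):
    return np.maximum(0, z)            # passes positives, zeroes out negatives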
Pros and cons of Recurrent Neural Networks (RNN)
Benefits of Recurrent Neural Networks (RNNs):
• Efficiently process sequential data, such as text, voice, and time series.
• Unlike feedforward neural networks, RNNs can process inputs of any length.
• By sharing weights over multiple time steps, the efficiency of training is improved.
Drawbacks of Recurrent Neural Networks (RNNs):
• Susceptible to the issues of vanishing and exploding gradients, which impede the learning process.
• Training can be arduous, particularly for long sequences.
• Characterised by slower computation compared to alternative neural network architectures.
Recurrent Neural Network Vs Feedforward Neural Network:
A feed-forward neural network is characterised by a unidirectional flow of information, specifically from the input layer
to the output layer, while traversing the hidden layers. The data traverses the network on a direct path, without passing
through any node more than once.
The transfer of information in a recurrent neural network and in a feed-forward neural network is illustrated in the two figures provided.
Information is processed in a recurrent neural network by a looping mechanism: before forming a conclusion, it assesses the present input and incorporates knowledge gained from previous inputs. In contrast to a feed-forward network, a recurrent neural network can remember information through its internal memory: the system generates output, duplicates it, and subsequently transmits it back into the network.
Backpropagation Through Time (BPTT)
Backpropagation through time refers to the application of the Backpropagation algorithm to a Recurrent Neural Network that takes time series data as input.
In a typical Recurrent Neural Network, one input is processed at a time, resulting in a single output. In backpropagation through time, by contrast, both the present and previous inputs contribute. A timestep is the term used for each point of the time series, and multiple timesteps of data are fed into the RNN at once.
[Diagram of an RNN unrolled through time: inputs x_0 ... x_n, shared weights U, W, V, and outputs y_0 ... y_n]
Fig. 6.8
Once the network has trained on a time set and produced an output, that output is used to calculate and accumulate the errors. The network is then rolled back up, and the weights are recalculated and adjusted to account for the errors.
Two Issues of Standard RNNs:
There are two key challenges that RNNs have had to overcome, but in order to comprehend them, one must first grasp
what a gradient is.
[Diagram contrasting exploding gradients (derivative size grows without bound) and vanishing gradients (derivative size shrinks toward zero)]
Fig. 6.9
A gradient is a partial derivative with respect to the inputs. If you are uncertain about the implications, consider the following: a gradient measures the extent to which the output of a function changes when the inputs are altered slightly.
The slope of a function is synonymous with its gradient. A model's learning speed increases as the slope becomes steeper, resulting in a higher gradient. Conversely, the model will cease learning if the slope is zero. A gradient is employed to quantify the variation in all weights with respect to the variation in error.
Exploding Gradients: Exploding gradients refer to a situation in which the algorithm assigns excessively high importance to the weights without any clear justification. Fortunately, the problem can be easily resolved by truncating or squashing (clipping) the gradients.
Vanishing Gradients: Vanishing gradients refer to the situation where the values of the gradients become extremely small, resulting in the model either ceasing to learn or taking an excessively long time to learn. This problem was of significant concern during the 1990s and posed a greater challenge to resolve compared to the issue of exploding gradients. Fortunately, it was resolved by Sepp Hochreiter and Juergen Schmidhuber's LSTM idea.
RNN Applications:
Applications of Recurrent Neural Networks (RNN).
Recurrent Neural Networks are employed to address a range of issues related to sequential data. Several forms of
sequence data exist, with the following being the most prevalent: Audio, text, video, and biological sequences.
By utilising recurrent neural network (RNN) models and sequence datasets, you can address a wide range of issues, such
as:
• Automatic Speech Recognition
• Music Generation
• Machine Translation
• Video Action Analysis
• Genomic and DNA Sequencing Analysis
Basic Python Implementation (RNN with Keras)
Import the required libraries:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
Here is a simple Sequential model that processes integer sequences, embeds each integer into a 64-dimensional vector,
and then uses an LSTM layer to handle the sequence of vectors.
model = keras.Sequential()
model.add(layers.Embedding(input_dim=1000, output_dim=64))
model.add(layers.LSTM(128))
model.add(layers.Dense(10))
model.summary()
Output:
Model: "sequential"
[Diagram of an LSTM cell: forget irrelevant information, add/update new information, pass updated information]
Fig. 6.10
The Logic Behind LSTM:
The initial phase determines whether the data from the preceding timestamp should be retained or disregarded as inconsequential. In the subsequent phase, the cell endeavours to acquire novel information from the input it receives. Finally, in the third phase, the cell transfers the updated information from the present timestamp to the subsequent timestamp. A single time step in LSTM refers to one complete cycle of the network.
The three components of an LSTM unit are commonly referred to as gates. They regulate the transmission of information into and out of the memory cell, or LSTM cell. The first gate is referred to as the Forget gate, the second gate is known as the Input gate, and the final gate is the Output gate. An LSTM unit, comprising three gates and a memory cell or LSTM cell, can be conceptualised as a layer of neurons within a conventional feedforward neural network, where each neuron possesses a hidden layer and a current state.
Fig. 6.11: The three gates of an LSTM cell: the Forget gate, the Input gate, and the Output gate
An LSTM, like a basic RNN, has a hidden state, where H(t-1) denotes the hidden state of the previous timestamp and H(t) the hidden state of the current timestamp. In addition, an LSTM has a cell state, denoted C(t-1) and C(t) for the previous and current timestamps respectively.
Here, the hidden state is known as short-term memory, whereas the cell state is known as long-term memory. Please see the figure below.
It is interesting to note that the cell state carries the information along all the timestamps.
Fig. 6.12: An LSTM cell carrying the cell state C(t-1) → C(t) and the hidden state H(t-1) → H(t) across timestamps
Example of LSTM Working:
Let us use an example to understand the functioning of LSTM. Consider two sentences terminated by a full stop: "Bob is a nice person. Dan on the other hand is evil." The first statement asserts that Bob is amiable, while the second characterises Dan as malevolent. The first sentence unambiguously refers to Bob, and the full stop (.) marks the transition to discussing Dan.
As we transition from the initial sentence to the subsequent sentence, our network should recognise that we are no longer
discussing Bob. Our current focus is on Dan. In this context, the Forget gate of the network enables it to disregard or
eliminate some information. Let us comprehend the functions performed by these gates in the LSTM architecture.
Forget Gate:
In a cell of the LSTM network, the first step is to decide whether to keep the information from the previous time step or forget it. The equation for the forget gate is:
f_t = σ(x_t · U_f + H_{t-1} · W_f)
where:
* x_t: input at the current timestamp
* U_f: weight matrix associated with the input
* H_{t-1}: hidden state of the previous timestamp
* W_f: weight matrix associated with the hidden state
A sigmoid function is applied to this sum, making f_t a number between 0 and 1. This f_t is then multiplied element-wise with the cell state of the previous timestamp, as shown below:
C_{t-1} × f_t = 0        if f_t = 0 (forget everything)
C_{t-1} × f_t = C_{t-1}  if f_t = 1 (forget nothing)
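To make the gate equation concrete, here is a minimal NumPy sketch of the forget-gate computation; the dimensions and the randomly initialised U_f and W_f matrices are illustrative assumptions:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

input_dim, hidden_dim = 4, 3  # illustrative sizes
rng = np.random.default_rng(0)

U_f = rng.normal(size=(input_dim, hidden_dim))   # input weights
W_f = rng.normal(size=(hidden_dim, hidden_dim))  # hidden-state weights

x_t = rng.normal(size=(input_dim,))      # input at the current timestamp
h_prev = rng.normal(size=(hidden_dim,))  # hidden state from t-1
c_prev = rng.normal(size=(hidden_dim,))  # cell state from t-1

f_t = sigmoid(x_t @ U_f + h_prev @ W_f)  # forget gate: each entry lies in (0, 1)
c_kept = c_prev * f_t                    # portion of the old cell state retained
print(f_t, c_kept)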
LSTM vs RNN:
Architecture: LSTM is a type of RNN with additional memory cells; RNN is the basic type.
Memory Retention: LSTM handles long-term dependencies and mitigates the vanishing gradient problem; RNN struggles with long-term dependencies and the vanishing gradient problem.
Cell Structure: LSTM has a complex cell structure with input, output, and forget gates; RNN has a simple cell structure with only a hidden state.
Handling Sequences: LSTM is well suited to processing sequential data; RNN is also designed for sequential data but has limited memory.
Training Efficiency: LSTM trains more slowly due to its increased complexity; RNN trains faster due to its simpler architecture.
Performance on Long Sequences: LSTM performs better on long sequences; RNN struggles to retain information over long sequences.
Usage: LSTM is best suited for tasks requiring long-term memory, such as language translation and sentiment analysis; RNN is appropriate for simple sequential tasks, such as short-horizon time series forecasting.
Vanishing Gradient Problem: LSTM addresses the vanishing gradient problem; RNN is prone to it.
Long Short-Term Memory Networks
* LSTMs can be defined as Recurrent Neural Networks (RNNs) that are designed to learn and adapt to long-term dependencies.
* They can memorize and recall past data over long periods, and by default this is their sole behaviour.
* LSTMs are designed to retain information over time, and hence they are widely used in time series prediction, because they can retain memory of previous inputs.
* The analogy comes from their chain-like structure of four interacting layers that communicate with each other in different ways.
* Besides time series prediction, they can be used to build speech recognizers, support pharmaceutical development, and compose music loops.
* LSTMs work in a sequence of steps. First, they forget irrelevant details retained from the previous state.
* Next, they selectively update certain cell-state values, and finally they emit certain parts of the cell state as output. Below is a diagram of their operation.
Fig. 6.13
Gated Recurrent Unit (GRU)
A GRU, or gated recurrent unit, is an advancement of the standard recurrent neural network (RNN). GRUs are very similar to Long Short-Term Memory (LSTM): just like LSTM, a GRU uses gates to control the flow of information. GRUs are newer than LSTM, and they offer some improvements over it, including a simpler architecture.
Fig. 6.14: LSTM (cell state C and hidden state H) compared with GRU (hidden state H only)
Another interesting thing about the GRU is that, unlike LSTM, it does not have a separate cell state (Ct); it only has a hidden state (Ht). Due to this simpler architecture, GRUs are faster to train.
The Architecture of Gated Recurrent Unit:
At each time step t, the cell takes an input x_t and the hidden state H_{t-1} from the previous time step t-1. Subsequently, it produces a new hidden state H_t, which is passed on to the next time step.
The GRU cell has two gates, whereas the LSTM cell has three. The first gate is the Reset gate and the second is the Update gate.
Fig. 6.15: The GRU cell with its Reset and Update gates
Reset Gate (Short-term Memory):
The Reset gate is responsible for the network's short-term memory, namely the hidden state (H_t). The following is the mathematical expression for the Reset gate:
r_t = σ(x_t · U_r + H_{t-1} · W_r)
If you recall the LSTM gate equation, this is very similar. The value of r_t ranges from 0 to 1 because of the sigmoid function. Here U_r and W_r are the weight matrices for the reset gate.
Update Gate (Long-term Memory):
Similarly, there is an Update gate for long-term memory, whose equation is shown below:
u_t = σ(x_t · U_u + H_{t-1} · W_u)
The only difference is the weight matrices, i.e. U_u and W_u.
How GRU Works?
Now, let us examine how these gates operate. Finding the hidden state H_t in a GRU is a two-step procedure. The first step is to compute the candidate hidden state, as shown below.
Candidate Hidden State:
Ĥ_t = tanh(x_t · U_g + (r_t ⊙ H_{t-1}) · W_g)
The cell takes the input and the hidden state from the previous timestamp t-1, the latter multiplied element-wise by the reset gate output r_t. This combined information is passed to the tanh function, and the resulting value is the candidate hidden state.
H_t = u_t ⊙ H_{t-1} + (1 - u_t) ⊙ Ĥ_t
If the value of u_t is 1, the second term becomes entirely 0 and the current hidden state depends entirely on the first term, i.e. the information from the hidden state at the previous timestamp t-1; conversely, if u_t is 0, the hidden state comes entirely from the candidate.
Hence, we can conclude that the value of u_t is very critical in this equation, and it can range from 0 to 1.
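A minimal NumPy sketch of one full GRU step under the equations above (the dimensions and random weights are illustrative assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

input_dim, hidden_dim = 4, 3
rng = np.random.default_rng(1)
U_r, U_u, U_g = (rng.normal(size=(input_dim, hidden_dim)) for _ in range(3))
W_r, W_u, W_g = (rng.normal(size=(hidden_dim, hidden_dim)) for _ in range(3))

x_t = rng.normal(size=(input_dim,))
h_prev = rng.normal(size=(hidden_dim,))

r_t = sigmoid(x_t @ U_r + h_prev @ W_r)             # reset gate
u_t = sigmoid(x_t @ U_u + h_prev @ W_u)             # update gate
h_cand = np.tanh(x_t @ U_g + (r_t * h_prev) @ W_g)  # candidate hidden state
h_t = u_t * h_prev + (1 - u_t) * h_cand             # new hidden state
print(h_t)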
Fig. 6.17: The Transformer architecture: stacked (N×) encoder and decoder blocks with multi-head attention, masked multi-head attention in the decoder, feed-forward layers, positional encoding, input/output embeddings (outputs shifted right), and output probabilities
The encoder and decoder blocks are stacks of many identical encoders and decoders. The encoder stack and the decoder stack have the same number of units. The number of encoder and decoder units is a hyperparameter; the original paper uses a total of 6 encoders and 6 decoders.
Fig. 6.18: An encoder-decoder stack translating the German input "Komm bitte her" to the English output "Please come here"
Let us see how this set-up of the encoder and the decoder stack works:
* The word embeddings of the input sequence are passed to the first encoder.
* These are then transformed and propagated to the next encoder.
* The output from the last encoder in the encoder-stack is passed to all the decoders in the decoder-stack as shown in
the figure below:
Fig. 6.19: The output of the final encoder is passed to every decoder in the decoder stack; each encoder and each decoder contains a self-attention layer
An important thing to note here: in addition to the self-attention and feed-forward layers, each decoder also has an Encoder-Decoder Attention layer. This helps the decoder focus on the appropriate parts of the input sequence.
Limitations of the Transformer
Transformer is undoubtedly a huge improvement over the RNN based seq2seq models. But it comes with its own share
of limitations:
* Attention can only deal with fixed-length text strings. The text has to be split into a certain number of segments or
chunks before being fed into the system as input.
® This chunking of text causes context fragmentation.
For example, if a sentence is split from the middle, then a significant amount of context is lost. In other words, the
text is split without respecting the sentence or any other semantic boundary.
GPT - GENERATIVE PRE-TRAINED TRANSFORMER
GPT, an acronym for Generative Pre-trained Transformer, is an advanced machine learning model that has garnered substantial attention and popularity in recent years. OpenAI's development of GPT has brought about a significant transformation in the domain of natural language processing. GPT has found applications in many tasks, such as text production, translation, summarization, and even the generation of computer code.
The main objective of GPT is to produce text of superior quality that closely resembles human language, utilising the
capabilities of deep learning and pre-training. Through the application of a transformer architecture and rigorous training on a
vast corpus of data, GPT possesses the capacity to comprehend and anticipate words, phrases, and even paragraphs by
leveraging the context and input it receives.
GPT has been praised as a significant advancement in artificial intelligence due to its sophisticated language modelling
abilities. It has been successfully employed in several applications. GPT has demonstrated its versatility and great promise by
assisting in content production, enhancing chatbots, facilitating language understanding, and even supporting creative
writing.
This section aims to provide a comprehensive analysis of GPT, including its definition, functioning, training methodology, practical uses, and constraints. Upon completion, you will possess a thorough comprehension of this formidable machine learning model and its influence across several disciplines.
What is GPT?
GPT, an acronym for Generative Pre-trained Transformer, is a sophisticated machine learning model created by OpenAI.
It belongs to the neural network-based models family and is specifically tailored for natural language processing tasks.
GPT is fundamentally a model for generating language. The model is trained on extensive text data to acquire the
statistical patterns and structures of language, allowing it to produce coherent and contextually appropriate text in response to
specific prompts or inputs.
GPT stands out due to its transformer architecture, which is a crucial characteristic. Transformers are a specific kind of
neural network that enables the model to effectively understand and represent distant connections and associations within the
text. GPT excels in comprehending the semantic relationships among words, phrases, and entire paragraphs.
The term "pre-trained” in GPT refers to the first training phase, during which the model is trained on a vast collection of
text material, including books, papers, and webpages. GPT undergoes a pre-training phase where it acquires knowledge of
fundamental language patterns and rules from the incoming data, without focusing on any particular task.
Following the pre-training phase, GPT undergoes a fine-tuning procedure, during which it receives additional training on
specific tasks or datasets to enhance its suitability for those particular jobs. The process of fine-tuning enables GPT to adjust its
knowledge and abilities to meet the exact demands of a particular application or use case.
In summary, GPT is an advanced language generation model that utilises deep learning, transformers, and prolonged training on massive datasets, making it a cutting-edge technology. Its capacity to comprehend and produce material of
exceptional quality has rendered it an invaluable instrument in diverse fields, encompassing content generation, virtual
assistants, customer care chatbots, and language translation, among other applications.
Working of GPT
GPT, an acronym for Generative Pre-trained Transformer, utilises a distinctive structure and training methodology to
produce coherent and contextually appropriate text. To comprehend the functioning of GPT, one must possess knowledge of
transformers and the underlying concepts of pre-training and fine-tuning.
The fundamental essence of GPT lies in its transformer architecture. Transformers are neural network structures that demonstrate exceptional proficiency in capturing extensive connections and associations in textual data, rendering them well suited to tasks involving language processing. The full transformer architecture comprises an encoder and a decoder; within GPT, only the decoder stack is used, as the model prioritises generating text in accordance with input prompts.
GPT undergoes two primary stages throughout its training: pre-training and fine-tuning. During the pre-training phase,
GPT is exposed to an extensive corpus of textual material, including books, articles, and webpages. It acquires the ability to
anticipate the subsequent word in a sentence by analysing the preceding words, assimilating the patterns and structures of
language in the process.
The process of unsupervised pre-training allows GPT to acquire a robust comprehension of grammar, syntax, and
semantics. Additionally, it aids the model in acquiring a broad understanding of the world and the context it is exposed to
through a variety of text sources.
Following the pre-training stage, the model proceeds to the fine-tuning phase. During this stage, GPT undergoes training
on certain tasks or datasets, which are meticulously chosen to correspond with the intended application or use case of the
model. Fine-tuning involves refining the pre-trained model by narrowing its attention to the specific target task and making
appropriate adjustments to its parameters.
The fine-tuning phase enables GPT to modify its knowledge and abilities to meet the precise demands of the work at
hand. For instance, if GPT undergoes fine-tuning with a dataset consisting of customer support discussions, it will acquire the
ability to provide responses that are pertinent and beneficial within the context of customer service.
During the process of inference, GPT utilises its pre-existing knowledge and the provided input prompt to generate text
by anticipating the most likely next word or sequence of words. The generated text is shaped by the context and semantics of
the input prompt, leading to coherent and contextually suitable responses.
While GPT demonstrates exceptional proficiency in producing text that resembles human language, it is not without its
constraints. It can sometimes generate inaccurate or illogical results, particularly when the input context is unclear or the
desired output is not well defined.
Overall, the architecture of GPT, along with its pre-training and fine-tuning procedure, allows it to produce text of
exceptional quality that closely resembles content written by humans. The training process enables GPT to comprehend the
complexities of language, enabling it to generate coherent and contextually appropriate responses.
Training GPT
GPT, also known as Generative Pre-trained Transformer, undergoes two primary stages throughout its training:
pre-training and fine-tuning. These stages are crucial to guarantee that the model can produce coherent and contextually
pertinent material. Now, let’s delve deeper into these training phases.
During the pre-training phase, GPT is exposed to a huge corpus of text data, often consisting of books, papers, and
webpages. The model acquires the ability to anticipate the subsequent word in a phrase by analysing the preceding words, thereby
efficiently capturing the statistical patterns and structures of language. The unsupervised training procedure facilitates the
development of GPT's comprehension of grammar, syntax, and semantics across different situations.
During the pre-training phase, GPT employs a transformer architecture, which is highly proficient in capturing extensive
connections and associations inside text. Transformers are composed of several layers of self-attention mechanisms, which
enable the model to assess the significance of various words and their surrounding contexts during text generation.
Following the successful completion of pre-training on a wide variety of texts, GPT proceeds to the fine-tuning stage.
During this phase, the model undergoes training on certain tasks or datasets that are meticulously chosen to correspond with
the intended application or use case. Through the process of fine-tuning, GPT is able to modify its pre-existing knowledge to
better suit the precise demands of the given task.
The fine-tuning method entails modifying the parameters of the pre-trained GPT model by utilising a dataset that is
specific to the task at hand. Through the exposure to task-specific data, GPT acquires the ability to produce text that is better
suited and more precise for the intended application.
It is important to mention that GPT might undergo additional refinement on various tasks or datasets. GPT's capacity to
transfer information between different fields renders it a versatile model capable of effectively managing a diverse range of
natural language processing jobs.
The performance of GPT is significantly influenced by the quality and diversity of the training data, both during
pre-training and fine-tuning. By training GPT on extensive and varied datasets, the model gains a comprehensive
understanding of language intricacies and improves its capacity to produce text of superior quality.
Training GPT is a demanding procedure that necessitates substantial computational resources and effort. The model is
commonly trained on dedicated hardware, such as graphics processing units (GPUs) or tensor processing units (TPUs), to
expedite the training process.
In general, the training process of GPT entails initially training the model on a substantial collection of textual data in
order to acquire knowledge about the statistical patterns inherent in language. Subsequently, the model undergoes fine-tuning
on specific tasks or datasets to tailor it for a certain application. The training procedure guarantees that GPT can produce
coherent and contextually appropriate text by utilising the received input.
Applications of GPT
The Generative Pre-trained Transformer (GPT) has been extensively utilised in several sectors. The capacity to produce
coherent and contextually appropriate text renders it a significant asset in various natural language processing jobs. Let us
examine some of the notable uses of GPT:
Content Generation: GPT is primarily used for content generation. It has the capability to automatically produce articles, blogs, product descriptions, and other written content of excellent quality. GPT can aid content creators by offering suggestions, outlines, and even finishing sentences.
Chatbots and Virtual Assistants: GPT has the ability to improve the conversational skills of chatbots and virtual
assistants. It allows them to produce responses to user inquiries that are more similar to those of a human and more suitable
for the situation, hence enhancing user interaction and involvement.
Translation and Summarization: GPT can be utilized for language translation and text summarizing endeavors.
Through the utilization of extensive multilingual datasets for training, the model can produce precise translations and
succinct summaries with a high level of accuracy.
Question Answering: GPT has been employed in question answering systems, leveraging its knowledge and comprehension of a specific subject to deliver comprehensive and informative responses to user inquiries.
Creative Writing: GPT can serve as a valuable instrument for writers engaging in creative writing and the process of
generating ideas. GPT can develop imaginative concepts, storylines, or even entire short novels or scripts when given a
prompt.
Code Generation: GPT has been utilised for code generation jobs, wherein it aids in the automated production of
computer code based on specified requirements or descriptions. This is very advantageous in expediting software
development procedures.
These examples illustrate only a selection of the practical uses of GPT. Its adaptability and language generation capabilities open opportunities in several areas, such as marketing, customer service, and education. With ongoing advancements, the scope of GPT is anticipated to broaden significantly.
Limitations of GPT
Although GPT, also known as Generative Pre-trained Transformer, is a remarkable model for generating language, it does possess certain constraints. One must be cognizant of these constraints when employing GPT for diverse applications. Now, let's examine some of the primary constraints of GPT:
Lack of Common Sense: GPT exhibits a deficiency in its ability to reason based on common sense and lacks a comprehensive understanding of the world. While the system has the ability to generate logical and contextually appropriate content, there are instances where it may produce inaccurate or nonsensical results. This is particularly true when the input context is unclear or when the desired output relies on common-sense knowledge.
Over-Reliance on Training Data: The performance of GPT is significantly influenced by the quality and diversity of the training data, leading to an over-reliance on it. Should the training data exhibit bias or limitations, GPT has the potential to produce replies that are biased or erroneous. It is crucial to guarantee that the training data is varied and inclusive of many viewpoints and settings.
Vulnerable to Adversarial Attacks: GPT is sensitive to deliberate adversarial attacks, wherein carefully constructed input prompts can cause the model to produce biased or malicious outputs. This emphasises the significance of ensuring strong resilience and protection while implementing GPT in practical scenarios.
Limited Domain Expertise: GPT faces challenges in comprehending and producing content pertaining to abstract concepts or subjects that necessitate extensive expertise in the field. Producing precise and logical language in such situations can be a challenge for GPT.
Limited Control: GPT functions as an opaque model, making it difficult to exert precise control over the output text.
Although methods such as prompt engineering and conditioning can offer a certain degree of control, achieving accurate
output from GPT still poses a difficulty.
Contextual Dependency: Contextual dependency refers to the extent to which the quality and relevancy of the output
text are dependent on the input context. Slight modifications in the input can result in substantial fluctuations in the produced
output. Achieving uniform and precise text creation in various situations can be a challenging endeavour.
IMAGE DATA: (RESNET, VGG) PRE-TRAINED NEURAL NETWORKS, TRANSFER LEARNING, FINE TUNING
Image Classification Using CNN (Convolutional Neural Networks)
Computer vision is a highly sought-after discipline in the area of data science, and Convolutional Neural Networks
(CNNs) have revolutionized the field and emerged as the cutting-edge technology for computer vision. Out of the many
varieties of neural networks, such as recurrent neural networks (RNN), long short-term memory (LSTM), artificial neural
networks (ANN) etc., convolutional neural networks (CNNs) are often regarded as the most widely used and favored.
Convolutional neural network models are very prevalent in the domain of image data. They exhibit exceptional performance on computer vision tasks such as image categorization, object identification, and image recognition. Consequently, they have been extensively used in artificial intelligence modeling, particularly for building image classifiers. This section aims to acquaint you with the notion of image classification using Convolutional Neural Networks (CNN) and demonstrate its functionality on diverse datasets.
A Convolutional Neural Network (CNN) is a kind of artificial neural network that is specifically designed for processing
and analyzing visual data. It is capable of automatically learning and extracting features from images via the use of
convolutional layers, which apply filters to the input data. CNNs have proven to be very effective in tasks such as image classification, object detection, and image recognition.
A Convolutional Neural Network (CNN) is a very potent neural network that uses filters to extract distinctive characteristics from images. Furthermore, it preserves the positional information of each pixel.
There are various datasets that you can leverage for applying convolutional neural networks. Here are three popular
datasets:
* MNIST
* CIFAR-10
* ImageNet
We will now see how to classify images using CNN on each of these datasets.
Using CNNs to Classify Hand-written Digits on the MNIST Dataset
Fig. 6.20: Sample handwritten digits (0-9) from the MNIST dataset
MNIST (Modified National Institute of Standards and Technology) is a well-known dataset used in Computer Vision that was built by Yann LeCun et al. It is composed of images of handwritten digits (0-9), split into a training set of 60,000 images and a test set of 10,000, where each image is 28 × 28 pixels in width and height.
This dataset is often used for practicing any algorithm made for image classification, as the dataset is fairly easy to conquer. Hence, I recommend this as your first dataset if you are just foraying into the field.
MNIST comes with Keras by default, and you can simply load the train and test files using a few lines of code:
from keras.datasets import mnist
Here is how you can build a neural network model for MNIST. I have used relu and softmax as the activation functions and the adam optimizer, with accuracy as the evaluation metric. The code covers all the steps from data loading to preprocessing to fitting the model. I have commented on the relevant parts of the code for better understanding:
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.utils import to_categorical  # to_categorical replaces the older np_utils helper in modern Keras
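The full listing is truncated here, so the following is a minimal runnable sketch of the pipeline the text describes (data loading, preprocessing, a small dense network, and fitting); the layer sizes are assumptions:

# Load and preprocess: flatten to 784 features, scale to [0, 1], one-hot the labels
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = X_train.reshape(-1, 784).astype('float32') / 255.0
X_test = X_test.reshape(-1, 784).astype('float32') / 255.0
y_train, y_test = to_categorical(y_train), to_categorical(y_test)

# A small dense network with a relu hidden layer and a softmax output
model = Sequential([
    Dense(512, activation='relu', input_shape=(784,)),
    Dropout(0.2),
    Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=128)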
After running the above code, you will see that we easily get a good validation accuracy of around 97%.
One major advantage of using ConvNets over plain NNs is that you do not need to flatten the input images to 1D, as ConvNets can work with image data in 2D. This helps in retaining the "spatial" properties of images.
Code for the CNN Model:
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Conv2D, MaxPool2D, Flatten
from keras.utils import to_categorical  # modern replacement for np_utils
# to calculate accuracy
from sklearn.metrics import accuracy_score
In the above code, I have added the Conv2D layer and max pooling layers, which are essential components of a CNN model.
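The model definition itself is not reproduced above, so here is a minimal sketch consistent with the description: a single Conv2D layer plus max pooling ahead of the dense layers (the filter and unit counts are assumptions); the 2D images only need a trailing channel dimension:

# Input stays 2D (28 x 28 x 1); no flattening of the raw images is required
model = Sequential([
    Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=(28, 28, 1)),
    MaxPool2D(pool_size=(2, 2)),
    Flatten(),
    Dense(128, activation='relu'),
    Dropout(0.2),
    Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])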
Even though our max validation accuracy by using a simple neural network model was around 97%, the CNN model is
able to get 98%+ with just a single convolution layer!
Fig. 6.21: Keras training log over 10 epochs: training accuracy climbs from about 94% to over 99%, while validation accuracy settles around 98%
Which CNN architecture has the best accuracy?
There are many CNN architectures for different tasks, such as object detection, object recognition, and image segmentation. However, some of the most commonly used CNN architectures that have been proven to achieve high accuracy on various computer vision tasks include VGGNet, ResNet (Residual Network), InceptionNet, DenseNet (an example of a deep neural network), and YOLO.
The difference between CNNs and other machine learning algorithms:
Convolutional Neural Networks (CNNs) are a type of deep learning algorithm that is primarily used for image classification and object recognition tasks. Here are some key differences between CNNs and other machine-learning algorithms:
1. Unlike classical machine learning algorithms, CNNs can learn relevant features automatically as part of the training process.
2. CNNs have a unique layered architecture consisting of convolutional, pooling, and fully connected layers, which is designed to automatically learn the features and hierarchies of the input data, whereas other ML algorithms have different architectures.
3. CNNs can be computationally expensive due to their large number of parameters and complex architecture. Other algorithms, such as decision trees and random forests, are typically faster and more computationally efficient.
Feature Maps:
A feature map is a set of filtered and transformed inputs that are learned by a ConvNet's convolutional layer. A feature map can be thought of as an abstract representation of an input image, where each unit or neuron in the map corresponds to a specific feature detected in the image, such as an edge, corner, or texture pattern.
Convolutional Neural Networks (CNNs)
* CNNs, popularly known as ConvNets, consist of several layers and are specifically used for image processing and object detection.
* The architecture was developed in 1998 by Yann LeCun and was first called LeNet. Back then, it was developed to recognize digits and zip code characters.
* CNNs are widely used in satellite image identification, medical image processing, series forecasting, and anomaly detection.
* CNNs process the data by passing it through multiple layers and extracting features through convolutional operations.
* The Convolutional Layer is followed by a Rectified Linear Unit (ReLU) activation that rectifies the feature map.
* The Pooling layer then down-samples these feature maps; pooling is essentially a sampling operation that reduces the dimensions of the feature map.
* The result is then flattened into a single, long, continuous, linear vector.
* The next layer, the Fully Connected Layer, takes the flattened vector from the Pooling Layer as input and classifies the image.
Fig. 6.23: Comparison of 20-layer vs 56-layer architectures: training error and test error plotted against iterations; the 56-layer network shows higher error throughout
* In the above plot, we can observe that a 56-layer CNN gives a higher error rate on both the training and testing datasets than a 20-layer CNN architecture.
* After analyzing the error rate further, the authors were able to conclude that it is caused by vanishing/exploding gradients.
* ResNet, proposed in 2015 by researchers at Microsoft Research, introduced a new architecture called the Residual Network to address this.
Residual Blocks:
The problem of training very deep networks has been alleviated by the introduction of residual blocks, and the ResNet model is made up of these blocks.
Fig. 6.24: A residual block: the input x passes through weight layers (with ReLU) to produce F(x), while an identity skip connection adds x, giving F(x) + x
In the above figure, the first thing we notice is a direct connection that skips some layers of the model. This connection is called the "skip connection" and is the heart of residual blocks. Because of this skip connection, the output is no longer the same. Without the skip connection, the input X is multiplied by the weights of the layer and a bias term is added.
Then comes the activation function f(), and we get the output H(x):
H(x) = f(wx + b), or simply H(x) = f(x)
With the introduction of the skip connection technique, the output changes to:
H(x) = f(x) + x
However, the dimension of the input may differ from that of the output, which can happen with convolutional or pooling layers. This problem can be handled with these two approaches:
* The skip connection is padded with zeros to increase its dimensions.
* 1 × 1 convolutional layers are added to the input to match the dimensions. In such a case, the output is H(x) = f(x) + w1 · x, where w1 is the additional 1 × 1 convolution applied to x.
So, over the years, deep learning architectures became deeper and deeper (adding more layers) to solve ever more complex tasks, which also helped improve performance on classification and recognition tasks and made models more robust.
But as we keep adding layers to a neural network, it becomes much more difficult to train, and the model's accuracy starts to saturate and then degrade. ResNet rescues us from that scenario and helps resolve this problem.
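A residual block of this kind is straightforward to express in Keras. The sketch below (the filter count, kernel size, and use of batch normalization are illustrative choices) implements H(x) = F(x) + x with an identity skip connection, assuming x already has the same number of channels:

from tensorflow.keras import layers

def residual_block(x, filters=64):
    # F(x): two stacked 3 x 3 convolutions with batch normalization
    f = layers.Conv2D(filters, 3, padding='same')(x)
    f = layers.BatchNormalization()(f)
    f = layers.Activation('relu')(f)
    f = layers.Conv2D(filters, 3, padding='same')(f)
    f = layers.BatchNormalization()(f)
    # The identity skip connection: add the unchanged input x to F(x)
    out = layers.Add()([f, x])
    return layers.Activation('relu')(out)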
Challenges ResNets tackle:
* Vanishing gradients: ResNets prevent gradients from becoming too small, allowing for better training of deep neural networks.
* Training difficulty: ResNets make deep neural networks easier to train by introducing skip connections.
* Noise sensitivity: ResNets are more robust to noise in the data, leading to improved performance.
* Accuracy limitations: ResNets have achieved state-of-the-art accuracy on a variety of tasks, including image recognition and natural language processing.
Architecture of ResNet:
There is a 34-layer plain network in the architecture that is inspired by VGG-19 in which the shortcut connection or the
skip connections are added. These skip connections or the residual blocks then convert the architecture into the residual
network as shown in the Fig. 6.25 below:
Fig. 6.25: The 34-layer residual network: stacked 3 × 3 convolution layers (64 to 512 filters) with pooling, and skip connections forming the residual blocks
The layers of ResNet:
ResNet blocks are the building blocks of ResNet architectures. Each block contains convolutional layers, batch
normalization, activation functions, and skip connections. Skip connections allow information to flow directly from earlier
layers to later layers, preventing vanishing gradients and improving network performance. The number of residual blocks and
their configurations determine the depth and complexity of the ResNet model.
The main idea of ResNet:
ResNet introduces skip connections to deep neural networks, allowing information to flow directly between layers,
thereby mitigating the vanishing gradient problem and improving training efficiency. This results in enhanced performance
across various machine-learning tasks.
Stages of ResNet:
ResNet architectures typically comprise four or five stages, each containing multiple residual blocks for feature
extraction and refinement. The number of residual blocks per stage increases as the network deepens, enabling the learning of
more intricate feature representations.
VGG Neural Network
VGG Network is a convolutional neural network model proposed by K. Simonyan and A. Zisserman in the paper "Very Deep Convolutional Networks for Large-Scale Image Recognition" [1]. This architecture achieved a top-5 test accuracy of 92.7% on ImageNet, which has over 14 million images belonging to 1000 classes.
It is one of the famous architectures in the deep learning field. It improved over the AlexNet architecture by replacing the large kernel-sized filters (11 × 11 and 5 × 5 in the first and second layers, respectively) with multiple 3 × 3 kernel-sized filters one after another. It was trained for weeks using NVIDIA Titan Black GPUs.
Fig. 6.26: The VGG network: stacked blocks of 3 × 3 convolution layers with max pooling, followed by fully connected layers
VGG16 Architecture
The input to the convolutional neural network is a fixed-size 224 × 224 RGB image. The only preprocessing it does is subtract the mean RGB values, computed on the training dataset, from each pixel.
The image is then run through a stack of convolutional (Conv.) layers with filters of a very small receptive field, 3 × 3, which is the smallest size able to capture the notions of left/right, up/down, and center.
One of the configurations also utilizes 1 × 1 convolution filters, which can be seen as a linear transformation of the input channels followed by a non-linearity. The convolutional stride is fixed to 1 pixel, and the spatial padding of the convolutional layer input is chosen so that the spatial resolution is preserved after convolution; that is, the padding is 1 pixel for the 3 × 3 Conv. layers.
Spatial pooling is then carried out by five max-pooling layers, which follow some of the Conv. layers (not all Conv. layers are followed by max-pooling). Max-pooling is performed over a 2 × 2-pixel window, with stride 2.
Fig. 6.27: VGG-16 data flow: a 224 × 224 × 3 input passes through convolution + ReLU blocks (starting at 224 × 224 × 64) and max pooling down to 7 × 7 × 512, then fully connected + ReLU layers of 1 × 1 × 4096, ending with a 1 × 1 × 1000 softmax
The architecture contains a stack of convolutional layers which have a different depth in different architectures which are
followed by three Fully-Connected (FC) layers: the first two FC have 4096 channels each and the third FC performs
1000-way classification and thus contains 1000 channels that is one for each class.
The final layer is the soft-max layer. The configuration of the fully connected layers is the same in all networks.
All of the hidden layers are equipped with the rectification (ReLU) non-linearity. One of the networks also contains Local Response Normalization (LRN); such normalization does not improve performance on the trained dataset, but it does increase memory consumption and computation time.
Architecture Summary:
* Input to the model is a fixed-size 224 × 224 RGB image.
* Pre-processing consists of subtracting the training set's mean RGB value from each pixel.
* Convolutional layers:
  o Stride fixed to 1 pixel.
  o Padding is 1 pixel for the 3 × 3 layers.
* Spatial pooling layers:
  o Max-pooling over a 2 × 2 window, with stride fixed to 2.
  o By convention, this layer does not count toward the depth of the network.
* Fully-connected layers:
  o 1st: 4096 (ReLU).
  o 2nd: 4096 (ReLU).
  o 3rd: 1000 (Softmax).
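In practice, VGG-16 rarely needs to be built by hand; Keras ships a pre-trained version, as in this short sketch:

from tensorflow.keras.applications import VGG16

# Load VGG-16 with ImageNet weights and its 1000-way classifier head
model = VGG16(weights='imagenet', include_top=True)
model.summary()  # 13 convolutional layers followed by 3 fully connected layers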
Architecture Configuration:
The figure below shows the Convolutional Neural Network configurations of the VGG net with the following variants:
* VGG-11
* VGG-11 (LRN)
* VGG-13
* VGG-16 (Conv1)
* VGG-16
* VGG-19
VGG Neural Network Architecture
* The VGG model, or VGG Net, that supports 16 layers is also referred to as VGG-16; it is a convolutional neural network model proposed by A. Zisserman and K. Simonyan of the University of Oxford.
* These researchers published their model in the research paper titled "Very Deep Convolutional Networks for Large-Scale Image Recognition."
* The VGG-16 model achieves almost 92.7% top-5 test accuracy on ImageNet.
* ImageNet is a dataset consisting of more than 14 million images belonging to nearly 1000 classes.
* Moreover, it was one of the most popular models submitted to ILSVRC-2014.
* It replaces the large kernel-sized filters with several 3 × 3 kernel-sized filters one after the other, thereby making significant improvements over AlexNet.
* The VGG-16 model was trained using Nvidia Titan Black GPUs over multiple weeks. As mentioned above, VGGNet-16 supports 16 layers and can classify images into 1000 object categories, including keyboard, animals, pencil, mouse, etc.
* Additionally, the model has an image input size of 224-by-224.
VGG Architecture
VGG Nets are based on the most essential features of convolutional neural networks (CNN). The following figure shows the basic concept of how a CNN works:
Fig. 6.29: A CNN mapping an input image to class probabilities (red fox 0.3576, grey fox 0.0439, coyote 0.0013, Arctic fox 0.0003)
The architecture of a Convolutional Neural Network:
* Image data is the input of the CNN; the model output provides prediction categories for the input images. The VGG network is constructed with very small convolutional filters.
* The VGG-16 consists of 13 convolutional layers and three fully connected layers.
Let us take a brief look at the architecture of VGG:
Input:
* The VGG Net takes in an image input size of 224 × 224.
* For the ImageNet competition, the creators of the model cropped out the center 224 × 224 patch of each image to keep the input size consistent.
Convolutional Layers:
* VGG's convolutional layers use a minimal receptive field, i.e., 3 × 3, the smallest possible size that still captures up/down and left/right. Moreover, there are also 1 × 1 convolution filters acting as a linear transformation of the input.
* This is followed by a ReLU unit, a significant innovation from AlexNet that reduces training time. ReLU stands for rectified linear unit activation function; it is a piecewise linear function that outputs the input if positive and zero otherwise.
* The convolution stride is fixed at 1 pixel to keep the spatial resolution preserved after convolution (the stride is the number of pixel shifts over the input matrix).
Hidden Layers:
* All the hidden layers in the VGG network use ReLU.
* VGG does not usually use Local Response Normalization (LRN), as it increases memory consumption and training time.
* Moreover, it makes no improvement to overall accuracy.
Fully-Connected Layers:
* The VGGNet has three fully connected layers.
* Of the three layers, the first two have 4096 channels each, and the third has 1000 channels, one for each class.
Transfer Learning:
The reuse of a pre-trained model on a new problem is known as transfer learning in machine learning. In transfer learning, a machine uses the knowledge learned from a prior task to improve prediction on a new task. You could, for example, use the information gained while training a classifier to predict whether an image contains cuisine to help distinguish beverages.
In transfer learning, the knowledge of an already trained machine learning model is transferred to a different but closely related problem.
For example, if you trained a simple classifier to predict whether an image contains a backpack, you could use the model's training knowledge to identify other objects such as sunglasses.
Fig. 6.31
With transfer learning, we basically try to use what has been learned in one task to better understand the concepts in another: weights from a network trained on "task A" are transferred to a new network performing "task B."
Because of the massive amount of computational power required, transfer learning is typically applied to computer vision and natural language processing tasks such as sentiment analysis.
Working of Transfer Learning:
In computer vision, neural networks typically detect edges in the first layers, shapes in the middle layers, and task-specific features in the later layers.
In transfer learning, the early and central layers are reused, and only the later layers are retrained. This leverages the labelled data from the task the model was originally trained on.
Fig. 6.32: Transfer learning: the common inner layers of a pre-trained model are reused in a custom model, and only the task-specific input and output layers are replaced
Let’s return to the example of a model that has been intended to identify a backpack in an image and will now be used to
detect sunglasses. Because the model has trained to recognise objects in the earlier levels, we will simply retrain the
subsequent layers to understand what distinguishes sunglasses from other objects.
* In transfer learning, we try to transfer as much knowledge as possible from the previous task the model was trained
on to the new task at hand.
* This knowledge can be in various forms depending on the problem and the data. For example, it could be how
models are composed, which allows us to identify novel objects more easily.
Uses of Transfer Learning:
® Transfer learning has several benefits, but the main advantages are saving training time, better performance of
neural networks (in most cases), and not needing a lot of data.
® Usually, a lot of data is needed to train a neural network from scratch but access to that data isn't always available —
this is where transfer learning comes in handy.
* With transfer learning, a solid machine learning model can be built with comparatively little training data because the model is already pre-trained.
* This is especially valuable in natural language processing, where expert knowledge is usually required to create large labelled datasets.
* Additionally, training time is reduced because it can sometimes take days or even weeks to train a deep neural
network from scratch on a complex task.
Need for Transfer Learning:
Transfer learning offers a number of advantages, the most important of which are reduced training time, improved neural network performance (in most circumstances), and not needing a large amount of data.
A lot of data is typically needed to train a neural network from scratch, but access to that data is not always possible; this is when transfer learning comes in handy.
Fig. 6.33: Reusing a stack of pre-trained CNN layers in a new network
Because the model has already been pre-trained, a good machine learning model can be generated with fairly little training data using transfer learning. This is especially useful in natural language processing, where huge labelled datasets require a lot of expert knowledge to create. Additionally, training time is decreased, because building a deep neural network from scratch for a complex task can take days or even weeks.
Steps to Use Transfer Learning:
Transfer learning applies when we don't have enough annotated data to train our model and a pre-trained model exists that has been trained on similar data and tasks. If you used TensorFlow to train the original model, you might simply restore it and retrain some layers for your job. Transfer learning, however, only works if the features learnt in the first task are general, meaning they can be applied to the other activity. Furthermore, the model's input must be the same size as it was when it was first trained. If it isn't, add a step to resize your input to the required size.
1. Training a Model to Reuse It:
* Consider the situation in which you wish to tackle Task A but lack the necessary data to train a deep neural network.
Finding a related task B with a lot of data is one method to get around this.
® Utilize the deep neural network to train on task B and then use the model to solve task A. The problem you are
seeking to solve will decide whether you need to employ the entire model or just a few layers.
® If the input in both jobs is the same, you might reapply the model and make predictions for your new input.
Changing and retraining distinct task-specific layers and the output layer, on the other hand, is an approach to
investigate.
2. Using a Pre-Trained Model:
* The second option is to employ a model that has already been trained. There are a number of these models out there, so do some research beforehand. The number of layers to reuse and retrain is determined by the task.
* Keras provides nine pre-trained models that can be used for transfer learning, prediction, and fine-tuning. These models, as well as some quick lessons on how to utilise them, may be found here. Many research institutions also make trained models accessible.
* The most popular application of this form of transfer learning is deep learning.
3. Extraction of Features:
* Another option is to utilise deep learning to identify the optimum representation of your problem, which comprises
identifying the key features. This method is known as representation learning, and it can often produce significantly
better results than hand-designed representations.
e Feature creation in machine learning is mainly done by hand by researchers and domain specialists. Deep learning,
fortunately, can extract features automatically. Of course, this does not diminish the importance of feature
engineering and domain knowledge; you must still choose which features to include in your network.
Fig. 6.34: Feature extraction: mapping data points from raw features to new learned features
4. Extraction of Features In Neural Networks:
® Neural networks, on the other hand, have the ability to learn which features are critical and which aren’t. Even for
complicated tasks that would otherwise necessitate a lot of human effort, a representation learning algorithm can
find a decent combination of characteristics in a short amount of time.
® The learned representation can then be applied to a variety of other challenges. Simply utilise the initial layers to
find the appropriate feature representation, but avoid using the network’s output because it is too task-specific.
Instead, send data into your network and output it through one of the intermediate layers.
* The raw data can then be understood as a representation of this layer. This method is commonly used in computer
vision since it can shrink your dataset, reducing computation time and making it more suited for classical
algorithms.
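In Keras, sending data through a network and reading out an intermediate layer as the representation looks roughly like the sketch below; the choice of VGG-16 and of its 'block4_pool' layer is an illustrative assumption (input preprocessing is omitted for brevity):

import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.models import Model

base = VGG16(weights='imagenet', include_top=False)
# Expose an intermediate layer as the output instead of the task-specific head
feature_extractor = Model(inputs=base.input,
                          outputs=base.get_layer('block4_pool').output)

images = np.random.rand(2, 224, 224, 3)  # placeholder batch of images
features = feature_extractor.predict(images)
print(features.shape)  # intermediate representation usable by classical algorithms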
Models That Have Been Pre-Trained
There are a number of popular pre-trained machine learning models available. One of them is the Inception-v3 model, which was developed for the ImageNet "Large Visual Recognition Challenge." Participants in this challenge had to categorize pictures into 1,000 subcategories such as "zebra," "Dalmatian," and "dishwasher."
Code Implementation of Transfer Learning with Python:
Importing Libraries
import tensorflow as tf
import pandas as pd
import matplotlib.pyplot as plt
from tensorflow.keras.applications.inception_v3 import InceptionV3
from tensorflow.keras.layers import Dense, Dropout, Flatten, GlobalAveragePooling2D
from tensorflow.keras.models import Model

# Load InceptionV3 pre-trained on ImageNet, without its top classification layers
InceptionV3_model = InceptionV3(input_shape=(224, 224, 3), weights='imagenet', include_top=False)

# Freeze the pre-trained layers so that only the new head is trained
for layer in InceptionV3_model.layers:
    layer.trainable = False

# Attach a custom classification head
x = InceptionV3_model.output
x = GlobalAveragePooling2D()(x)
x = Flatten()(x)
x = Dense(units=512, activation='relu')(x)
x = Dropout(0.3)(x)
x = Dense(units=512, activation='relu')(x)
x = Dropout(0.3)(x)
output = Dense(units=4, activation='softmax')(x)  # 4 output classes in this example

model = Model(InceptionV3_model.input, output)
model.summary()
Image Augmentation (for preventing the issue of overfitting):
# Use the ImageDataGenerator to import the images from the dataset
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Illustrative augmentation settings; the generator definition is assumed here
train_datagen = ImageDataGenerator(rescale=1./255, shear_range=0.2,
                                   zoom_range=0.2, horizontal_flip=True)

# Make sure you provide the same target size as initialized for the image size
training_set = train_datagen.flow_from_directory('/content/Data/train',
                                                 target_size=(224, 224),
                                                 batch_size=32,
                                                 class_mode='categorical')

plt.legend()
plt.savefig('LossVal_loss')  # save before plt.show() clears the figure
plt.show()
Plot: the training loss and validation loss curves, both decreasing over the course of training
Fig. 6.37: The stacked (N×) Transformer layers used in GPT: positional encoding added to the input/output embeddings before the layer stack
GPT-3 is composed of a series of Transformer decoder layers. Each layer consists of multi-head self-attention mechanisms and feed-forward neural networks. The feed-forward networks are responsible for processing and transforming the encoded representations, while the attention mechanism allows the model to identify and understand the connections and associations between words.
GPT-3's primary breakthrough lies in its colossal scale: it encodes an immense volume of linguistic information in its staggering 175 billion parameters.
Implementation of Code:
You can use the OpenAI API to interact with OpenAI's GPT-3 models. Here is an example of text generation using GPT-3:
import openai
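A minimal completion call, written against the legacy openai Python client (versions before 1.0); the API key is a placeholder and the model name is an illustrative choice:

openai.api_key = "YOUR_API_KEY"  # placeholder: supply your own key

response = openai.Completion.create(
    model="text-davinci-003",  # a GPT-3-family model
    prompt="Write a one-sentence summary of transfer learning.",
    max_tokens=60,
)
print(response.choices[0].text.strip())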
Fig. 6.38: Two fine-tuning strategies: keep the pre-trained transformer frozen and update only the newly added fully connected layers, or update all layers
Fine-tuning refers to the process of training a pre-existing model on a smaller dataset that is specific to a particular task.
This dataset contains labelled examples that are relevant to the target task. By exposing the model to these labelled examples,
it can modify its parameters and internal representations to better align with the requirements of the target task.
The Need for Fine-Tuning LLMs:
Although pre-trained language models are impressive, they do not possess inherent task-specific capabilities. Fine-tuning adapts these versatile models to achieve higher levels of accuracy and efficiency on specific tasks. When faced with a particular NLP job, such as analyzing the sentiment of customer reviews or answering questions in a certain field, it is necessary to adjust the pre-trained model so that it grasps the intricacies of that particular task and domain.
The advantages of fine-tuning are many. Firstly, it leverages the knowledge acquired during pre-training, saving the substantial time and computational resources that would otherwise be necessary to train a model from scratch. Furthermore, fine-tuning enables us to achieve superior performance on specific tasks by aligning the model with the complexities and subtleties of the domain it was fine-tuned for.
Fine-Tuning LLMs Process: A Step-by-step Guide
The fine-tuning approach often entails providing the task-specific dataset to the pre-trained model and modifying its
parameters via backpropagation. The objective is to minimize the loss function, which quantifies the disparity between the
model's predictions and the actual labels in the dataset. The fine-tuning procedure involves updating the model's parameters,
hence enhancing its specialization for the specific job you are targeting.
In this guide, we will explore the steps involved in refining a substantial language model specifically for sentiment
analysis. The Hugging Face Transformers library will be used, offering convenient access to pre-trained models and tools for
fine-tuning.
Step 1: Load the Pre-trained Language Model and Tokenizer
The first step is to load the pre-trained language model and its corresponding tokenizer. For this example, we will use the
'distilbert-base-uncased' model, a lighter version of BERT.
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
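# A sketch completing Step 1 and defining the custom head used below; the
# binary (2-class) sentiment setup and DistilBERT's hidden size of 768 are
# assumptions
import torch.nn as nn
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
classification_head = nn.Linear(768, 2)  # maps hidden states to 2 sentiment classes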
# Replace the pre-trained model's classification head with our custom head
model.classifier = classification_head
Step 4: Fine-Tune the Model
With the custom classification head in place, we can now fine-tune the model on the sentiment analysis dataset. We will
use the AdamW optimizer and CrossEntropyLoss as the loss function.
import torch.optim as optim
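A minimal training-loop sketch is shown below; the learning rate, epoch count, and the train_dataloader of tokenized batches are illustrative assumptions rather than part of the original listing.
import torch.nn as nn

optimizer = optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = nn.CrossEntropyLoss()
model.train()
for epoch in range(3):
    for batch in train_dataloader:  # assumed DataLoader of tokenized examples
        optimizer.zero_grad()
        outputs = model(input_ids=batch['input_ids'],
                        attention_mask=batch['attention_mask'])
        loss = loss_fn(outputs.logits, batch['labels'])
        loss.backward()
        optimizer.step()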
Fig. 6.39
This approach allows developers to specify desired outputs, encourage certain behaviours, or achieve better control over
the model’s responses. In this comprehensive guide, we will explore the concept of instruction fine-tuning and its
implementation step-by-step.
Instruction Finetuning Process:
What if we could go beyond traditional fine-tuning and provide explicit instructions to guide the model’s behaviour?
Instruction fine-tuning does that, offering a new level of control and precision over model outputs. Here we will explore the
process of instruction fine-tuning large language models for sentiment analysis.
Step 1: Load the Pre-trained Language Model and Tokenizer
To begin, let's load the pre-trained language model and its tokenizer. GPT-3 itself is not available through the
Transformers library, so we'll use GPT-2, its openly available predecessor, for this example.
from transformers import GPT2Tokenizer, GPT2ForSequenceClassification
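# A sketch completing Step 1; the 'gpt2' checkpoint and 2-class setup are
# assumptions
import torch
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 defines no pad token by default
model = GPT2ForSequenceClassification.from_pretrained('gpt2', num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id
# Encode the task instruction to prepend to every input; input_ids and
# attention_mask are assumed to come from tokenizing a review
instruction = "Classify the sentiment of the following review:"
instruction_ids = tokenizer(instruction, return_tensors='pt').input_ids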
# Concatenate instruction IDs with input IDs and adjust attention mask
input_ids = torch.cat([instruction_ids, input_ids], dim=1)
attention_mask = torch.cat([torch.ones_like(instruction_ids), attention_mask], dim=1)
Step 4: Fine-Tune the Model with Instructions
With the instructions incorporated, we can now fine-tune the GPT-2 model on the augmented dataset. During
fine-tuning, the instructions will guide the model’s sentiment analysis behaviour.
import torch.optim as optim
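A single optimization step might look like the sketch below; the learning rate and the labels tensor of sentiment classes are illustrative assumptions.
optimizer = optim.AdamW(model.parameters(), lr=2e-5)
optimizer.zero_grad()
# Transformers computes the cross-entropy loss internally when labels are passed
outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
outputs.loss.backward()
optimizer.step()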
When building a model from scratch, we usually have to try many approaches through trial and error. For example, we
have to decide on:
* Number of layers.
* Types of layers.
* Order of layers.
* Number of nodes in each layer.
* How much regularization to use.
* Learning rate.
Building and validating our model can be a huge task in its own right, depending on what data we are training it on.
This is what makes the fine-tuning approach so attractive. If we can find a trained model that already does one task well,
and that task is similar to ours in at least some remote way, then we can take advantage of everything the model has already
learned and apply it to our specific task.
Now, of course, if the two tasks are different, then there will be some information that the model has learned that may
not apply to our new task, or there may be new information that the model needs to learn from the data regarding the new
task that was not learned from the previous task.
For example, a model trained on cars is not going to have ever seen a truck bed, so this feature is something new the
model would have to learn about. However, think about everything our model for recognizing trucks could use from the
model that was originally trained on cars.
Fig. 6.40
This already-trained model has learned to understand edges, shapes, and textures, and, more concretely, object parts
such as headlights, door handles, windshields, and tires. All of these learned features are definitely things we could
benefit from in our new model for classifying trucks.
Freezing Weights:
By freezing, we mean that we don't want the weights for these layers to update whenever we train the model on our new
data for our new task. We want to keep all of these weights the same as they were after being trained on the original task. We
only want the weights in our new or modified layers to update.
Fig. 6.41
After we do this, all that's left is just to train the model on our new data. Again, during this training process, the weights
from all the layers we kept from our original model will stay the same, and only the weights in our new layers will
update.
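As a brief sketch of how freezing looks in Keras (the VGG16 base and two-class head here are illustrative assumptions, not the chapter's running example):
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Model

# Load a convolutional base pre-trained on ImageNet and freeze every layer
base = VGG16(weights='imagenet', include_top=False, pooling='avg')
for layer in base.layers:
    layer.trainable = False  # keep the original task's weights unchanged
# Only this newly added layer's weights will update during training
output = Dense(2, activation='softmax')(base.output)
model = Model(base.input, output)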
Practice Questions:
1. How do Recurrent Neural Networks work?
2. What is GPT?
3. How does GPT work?
4. Describe Recurrent Neural Networks (RNNs).
5. Describe Long Short-Term Memory Networks (LSTMs).
6. Describe Gated Recurrent Units (GRUs).
7. What is the Transformer model?
8. What is a Transformer used for?
9. What is a Transformer in NLP?
10. How does a Transformer network work?
11. What does GPT do?
12. What are feature maps?
13. Which CNN algorithm has the best accuracy?
14. What is the difference between CNN and other machine learning algorithms?
15. What is ResNet?
16. What challenges do ResNets tackle?
17. What are the layers of ResNet?
18. What is the main idea of ResNet?
19. How many stages are there in ResNet?
20. What is Transfer Learning?
21. How does Transfer Learning work?
22. Why should you use Transfer Learning?
23. What is transfer learning in a CNN?
24. What is an example of learning transfer?
25. What type of learning is transfer learning?
26. What is transfer learning in RL?
27. What is Instruction Finetuning?
28. Describe Convolutional Neural Networks (CNNs).
29. Describe Residual Networks (ResNet).
30. What is VGG?
31. What is VGG16?
32. Define Fine-Tuning.
33. Why use Fine-Tuning?
34. How to Fine-Tune?
35. Define Freezing Weights.