
Chapter 2

Supervised Learning

"The best way to get a good prediction is to get a lot of data."
– Peter Norvig

2.1 Introduction to Supervised Learning
2.2 Introduction to Datasets in Machine Learning
    2.2.1 Datasets Available in sklearn.datasets
    2.2.2 Datasets Available on Kaggle
    2.2.3 Datasets Available in mglearn.datasets
    2.2.4 Supervised Machine Learning Algorithms
2.3 K-Nearest Neighbor (K-NN) Algorithm for Machine Learning
2.4 Linear Models
    2.4.1 Bias-Variance Tradeoff and Regularization
    2.4.2 Linear Models for Regression
    2.4.3 Regularization: Preventing Overfitting
    2.4.4 Linear Models for Classification
    2.4.5 Multiclass Classification with Linear Models
    2.4.6 Model Complexity and Regularization
    2.4.7 Strengths and Weaknesses of Linear Models
    2.4.8 Evaluation Metrics for Regression and Classification
2.5 Naive Bayes Classifiers
    2.5.1 Types of Naive Bayes Classifiers
    2.5.2 Advantages and Disadvantages
2.6 Decision Tree Algorithm
    2.6.1 Introduction to Decision Trees
    2.6.2 Decision Criteria: Gini Index and Information Gain
    2.6.3 Example of a Decision Tree Algorithm with Calculations
    2.6.4 Dataset
    2.6.5 Step-by-Step Algorithm for Building a Decision Tree
    2.6.6 Decision Tree Pruning
    2.6.7 Advantages and Disadvantages of Pruning
2.7 Ensembles of Decision Trees
    2.7.1 Bagging (Bootstrap Aggregating)
    2.7.2 Random Forests
    2.7.3 Boosting
    2.7.4 Gradient Boosting
    2.7.5 Comparison of Ensemble Methods
2.8 Kernelized Support Vector Machines
    2.8.1 The Kernel Trick
    2.8.2 Common Kernel Functions
    2.8.3 Choosing the Right Kernel
    2.8.4 Regularization and Hyperparameters
    2.8.5 Advantages of Kernelized SVMs
    2.8.6 Disadvantages of Kernelized SVMs
    2.8.7 Example of Kernelized SVM
2.9 Uncertainty Estimates from Classifiers
    2.9.1 Why Do We Need Uncertainty Estimates?
    2.9.2 Types of Classifiers and Uncertainty Estimates
    2.9.3 Practical Considerations



Supervised learning is one of the most widely used and fundamental types of machine learning.
It involves training a model on a labeled dataset, where the input-output pairs are
known. The goal is to learn a mapping from inputs to outputs that can be generalized to new,
unseen data. In a supervised learning scenario, the model makes predictions or classifications
based on the training data, and the learning process is supervised by comparing the predictions
against the actual labels. This type of learning is powerful and has numerous applications across
various domains, such as finance, healthcare, marketing, and more.

In supervised learning, there are two main tasks: classification and regression. Classification
involves predicting a discrete label or category. For example, it may be used to determine whether
an email is spam or not, or whether a patient has a particular disease. Regression, on the other
hand, is about predicting a continuous value, such as forecasting stock prices or estimating the
price of a house based on its characteristics. These tasks rely on labeled datasets, where the
correct output is known and used to train the model.

The success of supervised learning models depends heavily on the quality and quantity of the
labeled data available for training. A well-labeled and diverse dataset allows the model to learn
the underlying patterns and relationships within the data, which can then be applied to new cases.
However, collecting such datasets can be a challenge and often requires significant effort in data
collection and labeling. Additionally, the choice of algorithm plays a critical role in the model’s
performance. Simple models like linear regression are easy to interpret, while more complex
models, such as neural networks, can handle intricate patterns but are harder to interpret.

One of the key advantages of supervised learning is its ability to produce interpretable results,
particularly in models like linear regression or decision trees. These models allow us to understand
the relationships between the input features and the output, which is valuable for fields such as
healthcare or finance, where transparency in decision-making is crucial. However, supervised
learning also presents challenges, including the risk of overfitting, where the model performs
well on the training data but poorly on unseen data. To address this, techniques such as cross-
validation, regularization, and pruning are often employed.

In this chapter, we will explore the concepts, algorithms, and techniques that form the foundation
of supervised learning. We will dive into various models, ranging from simple linear classifiers
to more sophisticated methods like support vector machines and neural networks. Through real-
world examples and case studies, we will demonstrate how these models are applied in practice
and discuss the key considerations involved in building and evaluating supervised learning models.
By the end of this chapter, you will have a comprehensive understanding of supervised learning
and be prepared to apply these techniques in your own projects.



2.1 Introduction to Supervised Learning
Supervised learning is a type of machine learning where a model is trained on labeled data, mean-
ing that each training example is associated with a specific output. The purpose of supervised
learning is for the model to learn a relationship or mapping between the input data (features) and
the correct output (labels). This can be thought of as teaching the machine to make decisions
based on examples, much like how a teacher might supervise a student by showing them the
correct answers during a lesson. Once trained, the model can generalize what it has learned to
make predictions or classifications on new, unseen data.
Real-World Example: Spam Email Detection
One classic example of supervised learning is spam email detection. In this case, the input features could be the content of the email, the sender’s
address, or even specific words or phrases used in the email. The output would be whether the
email is labeled as "spam" or "not spam." The machine learning algorithm is trained using a
dataset of emails that have already been labeled by humans as either spam or not spam. By
learning from this dataset, the model can recognize patterns such as frequent use of certain
keywords or suspicious email addresses that are more likely to be associated with spam. Once
trained, the model can then predict whether a new incoming email is spam, helping email services
like Gmail or Outlook automatically filter spam messages into a separate folder.
Real-World Example: Predicting House Prices
Another common example of supervised learning is predicting house prices based on various features. In this case, the input features could include
the size of the house, the number of bedrooms, the location, and other relevant details. The
output is the price of the house, which is a continuous value. A real estate company could train
a model using historical data about houses that have been sold in the past, including all the
features of the house and the final selling price. By learning from this labeled dataset, the model
can identify patterns, such as houses in certain neighborhoods being more expensive, or larger
houses commanding higher prices. After training, the model can make predictions about the
selling price of new houses based on their features.
Real-World Example: Medical Diagnosis
Supervised learning is also used in healthcare for medical diagnosis. For example, consider a system designed to detect whether a patient has a certain
disease, such as diabetes, based on various medical tests (input features like blood sugar levels,
age, weight, and other health indicators). The output is a binary label: whether the patient has
the disease or not. The model is trained using a dataset of patients whose medical test results
and diagnoses are already known. By learning from this data, the algorithm can then predict
whether a new patient, based on their test results, is likely to have the disease. This can assist
doctors in making quicker, data-driven diagnoses.
In all these examples, the model’s job is to find patterns in the data and make decisions or
predictions based on those patterns. The key advantage of supervised learning is that, given
enough labeled data, it can generalize well to new data, meaning it can make accurate predictions
even on examples it hasn’t seen before.
There are two main types of supervised learning:
• Classification: Where the output is a categorical label. The task is to assign an input to



one of several predefined classes.
• Regression: Where the output is a continuous value. The task is to predict a numerical
value based on the input.

2.2 Introduction to Datasets in Machine Learning


Datasets are the backbone of any machine learning model. They provide the raw material from
which models learn patterns, make predictions, and derive insights. In essence, a dataset is a
collection of data that is typically organized in a structured format, such as a table or matrix,
where each row corresponds to an instance (or sample), and each column corresponds to a feature
(or attribute) of the data. The last column in supervised learning datasets often contains labels
or target values that the model is supposed to predict.

Importance of Datasets
In machine learning, the quality, quantity, and relevance of data are often more critical than
the complexity of the algorithms themselves. A well-curated dataset allows a model to learn
effectively and generalize well to new, unseen data. Conversely, poor or insufficient data can
lead to inaccurate predictions, overfitting, or underfitting, regardless of the sophistication of the
model.
Datasets serve several key purposes in machine learning:
• Training: The model learns from the training data by identifying patterns and relation-
ships between the features and the target variable.
• Validation: During the model development phase, validation datasets are used to tune
hyperparameters and assess the model’s performance to avoid overfitting.
• Testing: After the model is trained, a separate testing dataset is used to evaluate how well
the model performs on unseen data, providing an unbiased estimate of its accuracy.

Types of Datasets
Datasets in machine learning can be broadly categorized into several types, depending on their
purpose and structure:
• Supervised Datasets: These datasets include input features and corresponding labels or
target values. Examples include the Iris dataset for classification and the Boston Housing
dataset for regression.
• Unsupervised Datasets: These datasets contain input features without associated labels.
They are used in tasks such as clustering, where the goal is to find patterns or groupings
within the data.
• Semi-supervised Datasets: These datasets contain a mixture of labeled and unlabeled
data, and they are particularly useful when labeling data is expensive or time-consuming.



• Reinforcement Learning Datasets: In reinforcement learning, datasets are often gen-
erated dynamically through interactions with an environment, where the agent learns from
the consequences of its actions.

Structure of Datasets
Typically, a dataset is structured into:
• Features (Attributes): These are the independent variables or inputs that the model
uses to make predictions. In a dataset, features are represented as columns.
• Labels (Targets): In supervised learning, labels are the dependent variables or outputs
that the model is trained to predict. They are also represented as columns, usually the last
column in a dataset.
• Instances (Samples): Each row in a dataset corresponds to an instance, which is a single
data point consisting of various feature values and, in the case of supervised learning, a
corresponding label.

Real-world vs. Synthetic Datasets


• Real-world Datasets: These are derived from real-world sources such as sensors, surveys,
databases, or logs. They often contain noise, missing values, and irregularities, which
makes them more challenging but also more valuable for developing robust machine learning
models.
• Synthetic Datasets: These are artificially generated datasets that are often used for test-
ing and educational purposes. They are useful for illustrating specific aspects of algorithms
without the complexity of real-world data.

Sources of Datasets
Several platforms and libraries provide ready-to-use datasets for machine learning, including:
• scikit-learn (sklearn.datasets): Offers a variety of standard datasets for both classifi-
cation and regression tasks.
• UCI Machine Learning Repository: A popular resource for finding real-world datasets
across various domains.
• Kaggle: A platform for data science competitions that provides access to a wide range of
datasets along with challenges to solve using machine learning.
• mglearn.datasets: Part of the mglearn package, it offers synthetic datasets specifically
designed to illustrate machine learning concepts.
Datasets are fundamental to the process of building, training, and evaluating machine learning
models. The selection of the right dataset, along with proper preprocessing and understanding
of its structure, is crucial for developing models that perform well and generalize effectively. As



machine learning continues to advance, the importance of high-quality datasets will only grow,
making them one of the most critical resources in the field.

2.2.1 Datasets Available in sklearn.datasets


The sklearn.datasets module in scikit-learn provides a variety of datasets that are commonly
used for testing and benchmarking machine learning algorithms. These datasets are either toy
datasets that are small enough to be loaded into memory, or real-world datasets that are slightly
larger and are often used in research. Below is a list of the available datasets, categorized into
toy datasets and real-world datasets:

Toy Datasets
Toy datasets are small datasets that are easy to manipulate and can be quickly used to test
algorithms. They include:
• Iris Dataset (load_iris()): A dataset for classification that contains measurements of
iris flowers in three species.
• Digits Dataset (load_digits()): A dataset for image classification with 8x8 pixel images
of handwritten digits.
• Wine Dataset (load_wine()): A dataset for classification with chemical analysis of wines
grown in the same region.
• Breast Cancer Dataset (load_breast_cancer()): A dataset for binary classification
containing features of breast cancer tumors.
• Boston Housing Dataset (load_boston()): A regression dataset containing housing
values in suburbs of Boston (deprecated in scikit-learn 1.0 and removed in 1.2).

Real-World Datasets
These datasets are slightly larger and are often used in research:
• California Housing Dataset (fetch_california_housing()): A regression dataset con-
taining house prices and features from California districts.
• 20 Newsgroups Dataset (fetch_20newsgroups()): A dataset for text classification with
newsgroup posts from 20 different categories.
• LFW People Dataset (fetch_lfw_people()): A dataset for face recognition with labeled
faces in the wild.
• Olive Oil Dataset (fetch_openml(data_id=171)): A regression dataset about the com-
position of olive oils.
• COIL20 Dataset (fetch_openml(data_id=40922)): A dataset for image classification
containing images of 20 different objects.



How to Use sklearn.datasets
To load and use these datasets, you can simply call the corresponding functions. Below is a
sample code showing how to load and use the Iris dataset:

[ ]: from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data  # Features
y = iris.target  # Labels

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize the k-NN classifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Make predictions
y_pred = knn.predict(X_test)

# Evaluate the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy * 100:.2f}%')

2.2.2 Datasets Available on Kaggle


Kaggle is a popular platform for data science and machine learning competitions. It also serves
as a rich repository of datasets across a wide variety of domains, including healthcare, finance,
social media, and more. Many datasets on Kaggle are real-world datasets, often large, complex,
and noisy, which provide a great opportunity for hands-on learning and experimentation.
Some important datasets available on Kaggle include:
• Titanic Dataset: This is a classification dataset where the goal is to predict whether a
passenger survived the Titanic disaster based on features such as age, ticket class, and sex.
It is commonly used for beginner tutorials in machine learning.
• House Prices - Advanced Regression Techniques: This dataset is used for regression
tasks, where the goal is to predict the price of a house based on various features like the
number of rooms, the year built, and location.



• MNIST Handwritten Digits: This dataset consists of 28x28 pixel images of handwritten
digits (0-9) and is widely used for image classification tasks.
• COVID-19 Open Research Dataset (CORD-19): A rich dataset of scholarly articles
related to COVID-19, used for natural language processing tasks such as text classification
and information retrieval.
• Credit Card Fraud Detection: This dataset contains transactions made by credit cards
and is used to predict whether a transaction is fraudulent or not, making it useful for
anomaly detection tasks.

Example - Using Kaggle’s Titanic Dataset


Below is an example of how to use the Titanic dataset from Kaggle to build a simple machine
learning model to predict survival:

[ ]: import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the Titanic dataset from Kaggle (assuming it has been downloaded locally)
data = pd.read_csv('titanic.csv')

# Display the first few rows of the dataset
print(data.head())

# Feature selection and preprocessing
data = data.dropna()  # Remove rows with missing data for simplicity
X = data[['Pclass', 'Age', 'Fare']]  # Features: Passenger class, Age, and Fare
y = data['Survived']  # Target: Survival (1 for survived, 0 for did not survive)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize and train the logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy * 100:.2f}%')

In this example, the Titanic dataset is loaded, preprocessed, and used to train a
LogisticRegression model to predict whether a passenger survived the Titanic disaster. The
model is then evaluated using the accuracy score.
Kaggle provides a wide range of datasets that cater to both beginners and advanced practitioners.
These datasets allow users to practice and experiment with different algorithms and techniques
in machine learning. Moreover, Kaggle competitions offer a unique opportunity to work on real-
world problems while competing with data scientists worldwide.

2.2.3 Datasets Available in mglearn.datasets


The mglearn.datasets module, part of the mglearn package, provides a few synthetic datasets
that are particularly useful for understanding the behavior of machine learning algorithms. These
datasets are designed to be simple yet illustrative, making them excellent for educational purposes.

Available Datasets in mglearn.datasets


• Forge Dataset (make_forge()): A simple 2D classification dataset with two features. It
is often used to illustrate classification algorithms.
• Wave Dataset (make_wave()): A regression dataset with a single feature, used to demon-
strate regression algorithms.
• Blobs Dataset (make_blobs()): A dataset for clustering, containing multiple Gaussian
blobs.
• Moons Dataset (make_moons()): A dataset for binary classification with two interleaving
half circles, commonly used to illustrate non-linear classification.
• Circles Dataset (make_circles()): A dataset for binary classification with concentric
circles, used to illustrate non-linear classification.

How to Use mglearn.datasets


To use the datasets from mglearn.datasets, you can follow the steps below. Here is an example
with the Forge dataset:

[ ]: import mglearn
import matplotlib.pyplot as plt

# Generate the Forge dataset
X, y = mglearn.datasets.make_forge()

# Visualize the dataset
plt.figure(figsize=(8, 6))
mglearn.discrete_scatter(X[:, 0], X[:, 1], y)
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("Forge Dataset")
plt.legend(["Class 0", "Class 1"], loc="best")
plt.show()

This code snippet shows how to generate the Forge dataset using make_forge() from
mglearn.datasets and visualize it using a scatter plot.

Visualizing the Data


The mglearn library is particularly powerful because it provides visualization functions that help
you understand how algorithms work on different datasets. For example, the discrete_scatter
function used in the code above creates a scatter plot where the points are color-coded based on
their class, making it easy to visualize classification problems.

Conclusion
The datasets provided by sklearn.datasets, mglearn.datasets, and Kaggle serve as invaluable
resources for learning, experimenting, and advancing in the field of machine learning. While
sklearn.datasets offers a wide variety of real-world and toy datasets commonly used in both
research and practical applications, mglearn.datasets focuses on synthetic datasets that are
ideal for educational purposes. Kaggle, on the other hand, provides access to a vast repository of
real-world datasets across diverse domains, along with competitions that challenge users to solve
real-world problems using machine learning techniques. Together, these resources enable rapid
prototyping, visualization, and deeper insights into the behavior of machine learning models,
catering to both beginners and experts. Whether you’re just starting out or conducting advanced
research, these datasets provide the foundation for hands-on learning and help accelerate progress
in machine learning.

2.2.4 Supervised Machine Learning Algorithms


A wide variety of algorithms are used in supervised learning, each suited for different types of
problems and datasets. Some of the most commonly used supervised learning algorithms include:
• k-Nearest Neighbors (k-NN): A simple algorithm that classifies data points based on
the labels of the nearest neighbors in the feature space.
• Linear Models: These models, such as linear regression and logistic regression, assume a
linear relationship between the input features and the output.
• Naive Bayes Classifiers: Probabilistic models that apply Bayes’ theorem with strong
independence assumptions between features.



• Decision Trees: A tree-like model of decisions where each internal node represents a test
on a feature, and each leaf node represents a class label or continuous value.
• Ensembles of Decision Trees: Methods such as Random Forests and Gradient Boosting
combine multiple decision trees to improve prediction accuracy.
• Kernelized Support Vector Machines (SVMs): These algorithms use kernel func-
tions to transform data into higher dimensions, making it easier to find a hyperplane that
separates different classes.
• Uncertainty Estimates from Classifiers: Techniques like probabilistic classifiers and
Bayesian methods that provide estimates of uncertainty in their predictions, making them
valuable in situations where confidence in predictions is important.
Supervised learning forms the backbone of many real-world applications, including image recog-
nition, speech processing, and medical diagnosis. Understanding the various algorithms and their
strengths and limitations is essential to applying supervised learning effectively.

2.3 K-Nearest Neighbor (K-NN) Algorithm for Machine Learning

K-Nearest Neighbor (K-NN) is a simple, yet powerful machine learning algorithm that falls under
the category of supervised learning. It is mainly used for classification tasks, although it can also
be applied to regression problems. The algorithm assumes that similar data points are likely to
belong to the same class and classifies new data points based on the similarity between them and
the available labeled data.
The K-NN algorithm does not involve any training phase in the traditional sense, which is why
it is called a lazy learner. Instead, it stores the dataset and performs classification or regression
only when a new data point needs to be predicted.

Key Steps in K-NN Algorithm


The working of the K-NN algorithm can be broken down into the following steps:
1. Choose the value of K: The number of nearest neighbors (K) is chosen. Commonly used
values for K range between 1 and 10.
2. Calculate the distance: For the given test data point, calculate the distance between
this point and all the points in the training dataset. The most common distance metric
used is the Euclidean distance, which is given by:
\[
d = \sqrt{(X_2 - X_1)^2 + (Y_2 - Y_1)^2}
\]

where $(X_1, Y_1)$ and $(X_2, Y_2)$ are the coordinates of two data points.
3. Find the K nearest neighbors: Based on the calculated distances, find the K nearest
neighbors of the test data point.



4. Assign the category: Count how many of the nearest neighbors belong to each category
and assign the category that is most frequent among the K neighbors to the new data point.
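The steps above map almost directly to code. The following minimal sketch (not taken from the book's examples; the data points are hypothetical) implements steps 2–4 with NumPy for a small binary classification problem; scikit-learn's KNeighborsClassifier, used in the Python example later in this section, does the same work far more efficiently.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify one point by majority vote among its k nearest neighbors."""
    # Step 2: Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 3: indices of the k nearest neighbors
    nearest = np.argsort(distances)[:k]
    # Step 4: assign the most frequent category among those neighbors
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Tiny illustrative dataset (hypothetical points): Category 0 vs Category 1
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
                    [6.0, 6.0], [6.5, 7.0], [7.0, 6.5]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([1.8, 1.6]), k=3))  # expected: 0
print(knn_predict(X_train, y_train, np.array([6.4, 6.3]), k=3))  # expected: 1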

Example of K-NN for Classification


Consider two categories of data points, Category A and Category B. Given a new data point,
we want to determine which category this point belongs to. As shown in Figure 2.1, the dataset
consists of two groups: Category A and Category B. The K-NN algorithm works by identifying
the nearest data points to a new point and assigning a category based on the majority vote among
its nearest neighbors.

Next, we calculate the Euclidean distances from the new data point to all the points in the
dataset, as shown in Figure 2.2. Euclidean distance is a measure of the straight-line distance



between two points and can be calculated using the formula:
\[
d = \sqrt{(X_2 - X_1)^2 + (Y_2 - Y_1)^2}
\]

where $(X_1, Y_1)$ and $(X_2, Y_2)$ represent the coordinates of two points.

In this case, we choose K = 3. As shown in Figure 2.3, the 3 nearest neighbors are identified.
Among these neighbors, one belongs to Category A and two belong to Category B. The new data point
is then classified into Category B because the majority of its neighbors belong to this category.

Finally, the classification result is shown in Figure 2.4, where the new data point is assigned to
Category B based on the majority vote.



Choosing the Value of K
The value of K plays a crucial role in the performance of the K-NN algorithm. A small value of
K, such as 1 or 2, may lead to overfitting and make the model sensitive to noise. Conversely, a
large value of K can smoothen the decision boundaries but may lead to underfitting. A typical
value used for K is 5, but this can vary depending on the dataset.
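One common way to pick K in practice is to score several candidate values on held-out data and keep the best one. The sketch below (using the Iris data introduced earlier; the range 1–10 follows the guideline above) illustrates the idea; cross-validation would give a more reliable estimate than a single split.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

# Score a range of K values on held-out data and keep the best one
scores = {}
for k in range(1, 11):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    scores[k] = knn.score(X_val, y_val)

best_k = max(scores, key=scores.get)
print(scores)
print(f"Best K on this split: {best_k}")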

Advantages and Disadvantages of K-NN


Advantages:

• Simple to implement and understand.

• Does not require any training phase, making it computationally inexpensive in that aspect.

• Works well with small datasets and cases where data is not linearly separable.

Disadvantages:

• Computationally expensive during classification, especially for large datasets, as it calculates


the distance between the test point and all training points.

• Sensitive to the scale of the data, meaning that features with larger scales can dominate
the distance calculation.

The K-Nearest Neighbor algorithm is a fundamental method in machine learning and has broad
applications in various domains, from image recognition to recommendation systems. Its sim-
plicity and effectiveness make it a popular choice, especially for small datasets and classification
problems. However, care must be taken when selecting the value of K and handling large datasets
to ensure optimal performance.



Example with Python Code
Here is an example of implementing the k-NN algorithm using the popular Iris dataset in Python:

[Ex]: # Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
iris_df = pd.DataFrame(X, columns=iris.feature_names)
iris_df['species'] = pd.Categorical.from_codes(y, iris.target_names)

# Data visualization: Pairplot of the Iris dataset
sns.pairplot(iris_df, hue='species', markers=["o", "s", "D"])
plt.suptitle('Pairplot of the Iris Dataset', y=1.02)
plt.show()

# Data visualization: Distribution of each feature
plt.figure(figsize=(12, 8))
for i, column in enumerate(iris_df.columns[:-1]):
    plt.subplot(2, 2, i + 1)
    sns.histplot(iris_df[column], kde=True)
    plt.title(f'Distribution of {column}')
plt.tight_layout()
plt.show()

# Correlation matrix heatmap (numeric_only skips the categorical species column)
plt.figure(figsize=(8, 6))
corr_matrix = iris_df.corr(numeric_only=True)
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix of Iris Features')
plt.show()

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize the k-NN classifier with k=3
knn = KNeighborsClassifier(n_neighbors=3)

# Train the model
knn.fit(X_train, y_train)

# Make predictions on the test set
y_pred = knn.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy * 100:.2f}%')

# Confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(conf_matrix, annot=True, cmap='Blues', fmt='g')
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()



Accuracy: 100.00%



2.4 Linear Models
Linear models are fundamental tools in machine learning due to their simplicity and interpretabil-
ity. They are widely used in predictive modeling, particularly for tasks where the relationship
between input features and the target variable is assumed to be linear. In these models, the
output is predicted by combining input features in a linear manner, making them computation-
ally efficient and easy to understand. This makes them ideal for large datasets, especially when
working in high-dimensional spaces.
In machine learning, linear models can be applied to both regression and classification prob-
lems. For regression, the goal is to predict a continuous output, such as estimating house prices
based on features like house size or the number of rooms. For classification, the task is to assign
discrete class labels, such as determining if an email is spam or not. Despite their simplicity,
linear models can be effective, especially when the relationship between input features and the
output is approximately linear.
Linear regression is used for predicting a continuous outcome by modeling the relationship be-
tween the input features and the target variable as a linear function. The general form of a linear
regression model is:

ŷ = w[0] · x[0] + w[1] · x[1] + · · · + w[p] · x[p] + b

where:



• ŷ represents the predicted value,
• w are the learned weights,
• x are the input features, and
• b is the bias term.
One of the most common methods for linear regression is Ordinary Least Squares (OLS),
which minimizes the mean squared error between the predicted and actual values. However, linear
models can sometimes overfit the training data, especially when dealing with many features. To
address this, regularization techniques like Ridge Regression and Lasso Regression are used.
Ridge regression (L2 regularization) penalizes large weights to improve generalization, while Lasso
regression (L1 regularization) can shrink some coefficients to zero, effectively selecting important
features.
For classification tasks, linear models can also be applied. Logistic Regression is a widely used
linear model for binary classification tasks. Although it is called regression, it solves classification
problems by predicting the probability of a class using the sigmoid function, which converts
the linear combination of input features into a probability. This makes it suitable for tasks like
spam detection or binary decision-making problems. Another linear model for classification is the
Support Vector Machine (SVM), which finds the hyperplane that best separates different
classes in the feature space, maximizing the margin between them. SVMs are effective for both
binary and multiclass classification problems and are especially robust when dealing with noisy
data.
In multiclass classification problems, linear models are often extended using techniques like the
one-vs-rest approach. This method trains a separate binary classifier for each class, where one
class is treated as positive, and the rest are treated as negative. The class with the highest
confidence in its prediction is selected, enabling the application of linear models to problems
involving more than two classes.
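As a hedged sketch of the one-vs-rest idea, scikit-learn's OneVsRestClassifier can wrap a binary linear model such as LogisticRegression and fit one binary classifier per class; the wrapper, dataset, and parameter choices below are illustrative rather than taken from the text.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)  # 3 classes

# Explicit one-vs-rest: one binary logistic regression per class
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X, y)

print(len(ovr.estimators_))          # 3 binary classifiers, one per class
print(ovr.predict(X[:5]))            # the class with the highest confidence wins
print(ovr.decision_function(X[:2]))  # per-class confidence scores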
Regularization plays a critical role in controlling model complexity. It helps avoid overfitting
by penalizing large coefficients, thus encouraging simpler models that generalize better to unseen
data. In linear regression, this is typically done using α in Ridge Regression or the C parameter in
Logistic Regression and SVMs. Larger regularization values lead to smaller coefficients, reducing
overfitting, while smaller values can result in underfitting, where the model fails to capture the
underlying pattern in the data.
Despite their simplicity, linear models have several strengths, including ease of interpretation, fast
computation, and suitability for high-dimensional data. They perform well when the relationship
between input features and output is approximately linear. However, they do have limitations.
Linear models assume that the relationship between the features and the target is linear, which
may not always be the case in real-world data. When the relationship is more complex or
nonlinear, other models such as decision trees, neural networks, or kernel methods might be more
appropriate.
In terms of evaluation metrics, linear models are commonly evaluated using Mean Squared
Error (MSE), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE)



for regression tasks. These metrics provide insight into the model’s prediction accuracy by quan-
tifying the difference between predicted and actual values. For classification tasks, metrics such
as accuracy, precision, recall, F1 score, and ROC-AUC are used to assess how well the
model is distinguishing between different classes.
In summary, linear models are powerful and efficient for solving both regression and classification
problems, especially when the relationship between inputs and outputs is close to linear. With
the right application of regularization, these models can prevent overfitting and perform well on
various types of data. Understanding linear models lays a strong foundation for exploring more
complex machine learning algorithms.

2.4.1 Bias-Variance Tradeoff and Regularization


Machine learning models aim to find patterns in the data that generalize well to unseen examples.
A key challenge in achieving this is managing the tradeoff between bias and variance. This section
explains how these concepts interact and how regularization techniques help address them.

Bias-Variance Tradeoff
The bias-variance tradeoff refers to the balance between the errors introduced by bias and vari-
ance.
• Bias: Bias is the error introduced by approximating a real-world problem (which may be
complex) by a simplified model. High bias models (e.g., linear models) are too simple and
fail to capture the underlying data patterns, leading to underfitting.
• Variance: Variance is the model’s sensitivity to changes in the training data. High variance
models (e.g., highly complex models) may capture noise in the training data, leading to
overfitting.
• Tradeoff : As model complexity increases, bias decreases, but variance increases. The goal
is to find an optimal balance between bias and variance to minimize the total error.
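The tradeoff can be observed empirically by varying model complexity and comparing training and test error. The sketch below uses synthetic data and arbitrary polynomial degrees (both hypothetical choices, not from the text): low degrees underfit (high bias, both errors high), while very high degrees overfit (low training error, rising test error).

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic 1-D data: a noisy sine wave (for illustration only)
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 3, 60)).reshape(-1, 1)
y = np.sin(2 * X).ravel() + rng.normal(scale=0.2, size=60)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

# Model complexity = polynomial degree; watch train vs. test error diverge
for degree in [1, 3, 9, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")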

Overfitting and Underfitting


• Overfitting: This occurs when the model captures noise in the training data, leading to
poor generalization on unseen data. Overfitting models have high variance and low bias.
In the diagram below, the model fits every point in the training set, including noise.
• Underfitting: Underfitting occurs when the model is too simple to capture the underlying
patterns in the data. Underfitted models have high bias and low variance. The model fails
to fit the data well, as shown below.

Example: Predicting House Prices


Let’s consider an example of predicting house prices using features like house size, number of
bedrooms, and location. Suppose the training dataset has the following information:



(Figure: total error, bias², and variance plotted against model complexity; the optimal complexity minimizes total error.)

(Figure: underfitting and overfitting illustrated on a plot of house price against size.)

Size (sq ft) Bedrooms Location Price (in $1000)


1200 3 Suburb 300
1500 4 City Center 450
900 2 Suburb 200
1700 4 Suburb 480
1300 3 City Center 350



A simple linear model (high bias) might only consider size to predict prices, ignoring the other
features. This could lead to underfitting:

Price = β0 + β1 · Size

On the other hand, a complex model (high variance) might use high-degree polynomials to fit
every training point exactly, leading to overfitting. To mitigate this, regularization techniques like
Ridge or Lasso regression can be applied, which add penalty terms to control model complexity.

Regularization
Regularization techniques help reduce overfitting by penalizing large coefficients in the model.
This encourages simpler models that generalize better to unseen data.

Lasso (L1) Regularization


Lasso regression adds the absolute value of the coefficients to the loss function:

\[
\text{Cost} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{i=1}^{m} |w_i|
\]

Lasso helps reduce the impact of less important features by driving some coefficients to zero, thus
performing feature selection.
Ridge (L2) Regularization
Ridge regression adds the squared magnitude of the coefficients as a penalty:

\[
\text{Cost} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{i=1}^{m} w_i^2
\]

Ridge regression prevents overfitting by shrinking the coefficients, but unlike Lasso, it does not
set coefficients exactly to zero.
Elastic Net Regularization
Elastic Net is a combination of L1 and L2 regularization. It adds both the absolute and squared
penalties:

\[
\text{Cost} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \left( \alpha \sum_{i=1}^{m} |w_i| + (1 - \alpha) \sum_{i=1}^{m} w_i^2 \right)
\]

Elastic Net combines the benefits of both Lasso and Ridge regularization.
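A minimal Elastic Net sketch using scikit-learn is shown below; note that scikit-learn's alpha corresponds to λ in the formula above and l1_ratio plays the role of α, and the particular values 0.1 and 0.5 are arbitrary choices for illustration (the dataset is the same California Housing data used in the regression examples later in this chapter).

from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load data and split
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# alpha is the overall penalty strength (lambda above); l1_ratio mixes the penalties:
# 1.0 = pure Lasso (L1), 0.0 = pure Ridge (L2).
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=10000)
enet.fit(X_train, y_train)

print("Coefficients:", enet.coef_)
print("Test MSE:", mean_squared_error(y_test, enet.predict(X_test)))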



(Figure: L1 (diamond) and L2 (circular) regularization constraint regions in the (w1, w2) weight plane.)

2.4.2 Linear Models for Regression


What is Regression?
Regression is a type of supervised machine learning used for predicting continuous values. In
regression tasks, the goal is to predict a numerical output based on input features. For example,
you might use regression to predict a person’s height based on their age, or to estimate the price
of a house based on its size and location. Unlike classification, where the output is a discrete label
(e.g., "spam" or "not spam"), regression deals with continuous values that can take any number
within a range.

Formula for Linear Regression


In linear regression, we predict the output value y by combining the input features x0 , x1 , . . . , xp
linearly. The equation for a linear regression model looks like this:

y = w0 · x0 + w1 · x1 + · · · + wp · xp + b

Where:
• x0 , x1 , . . . , xp are the input features (for example, the size of the house, the number of
rooms, or the location).
• w0 , w1 , . . . , wp are the weights assigned to each of these features. These weights determine
how much influence each feature has on the output value.



• b is the bias term, which shifts the predicted value.
The weights w0 , w1 , . . . , wp are learned from the training data, and the model uses these weights
to make predictions on new data.

Example: Predicting House Prices


To make this clearer, let’s consider an example. Suppose we want to predict the price of a house
based on two features: the size of the house (in square feet) and the number of rooms.
The linear regression model might look like this:

Price = (100 · size) + (50 · rooms) + 10

This equation tells us that:


• For every additional square foot, the price of the house increases by $100.
• For every additional room, the price increases by $50.
• The bias b = 10 represents a base price for the house, accounting for other factors not
included in the model.
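Plugging the example's weights into code makes the linear form explicit; the sketch below simply evaluates the equation above for a hypothetical house.

# Weights and bias from the example above
w_size, w_rooms, b = 100, 50, 10

def predict_price(size_sqft, rooms):
    # Price = (100 * size) + (50 * rooms) + 10
    return w_size * size_sqft + w_rooms * rooms + b

# A hypothetical 1500 sq ft house with 3 rooms
print(predict_price(1500, 3))  # 100*1500 + 50*3 + 10 = 150160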

Graphical Representation
A graphical representation of this concept can be shown by plotting the house price (the target
variable) against one of the features, such as the size of the house. The line represents the model’s
predictions, and the data points represent actual house prices for different sizes.

(Figure: house price plotted against size, with data points and the fitted regression line.)

In this graph:



• The blue points represent actual house prices for different house sizes.
• The red line is the regression line. It represents the predictions of the linear regression
model.
• As the size of the house increases, the price also increases in a linear fashion, as shown by
the straight line.

Interpretation of Weights and Bias


The weights in the linear regression model determine how much each feature contributes to the
prediction. In our house price example:
• The weight w0 = 100 tells us that for every additional square foot, the price increases by
$100.
• The weight w1 = 50 tells us that for every additional room, the price increases by $50.
• The bias b = 10 adds a base amount to the price to account for factors outside the features
considered.

How the Model Learns


The goal of training a linear regression model is to find the optimal values for the weights
w0 , w1 , . . . , wp and bias b such that the predictions made by the model are as close as possi-
ble to the actual values in the training data. This is done by minimizing a loss function, typically
the mean squared error, which measures the average squared difference between the predicted
values and the actual values.

\[
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2
\]

Where:
• ŷi is the predicted value for the i-th sample,
• yi is the actual value for the i-th sample,
• n is the number of samples in the dataset.
The lower the mean squared error, the better the model fits the data.
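As a quick worked example of this formula (with made-up predictions), the MSE can be computed directly with NumPy; the result matches what scikit-learn's mean_squared_error would return for the same arrays.

import numpy as np

y_actual = np.array([300, 450, 200, 480])     # actual prices (hypothetical, in $1000s)
y_predicted = np.array([310, 430, 210, 470])  # model predictions

# MSE = (1/n) * sum((y_hat - y)^2)
mse = np.mean((y_predicted - y_actual) ** 2)
print(mse)  # (10^2 + (-20)^2 + 10^2 + (-10)^2) / 4 = 175.0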

Example: Linear Regression: Actual vs Predicted Prices (California Housing)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load the California housing dataset
california = fetch_california_housing()
X = pd.DataFrame(california.data, columns=california.feature_names)
y = pd.Series(california.target)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the model
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

# Predictions
y_pred = lin_reg.predict(X_test)

# Calculate and print the Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")

# Visualize actual vs predicted values
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, edgecolors=(0, 0, 0), alpha=0.7)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red', linestyle='--', linewidth=2)
plt.xlabel("Actual Prices")
plt.ylabel("Predicted Prices")
plt.title("Linear Regression: Actual vs Predicted Prices (California Housing)")
plt.show()

Mean Squared Error: 0.555891598695244



Linear regression is a simple yet powerful technique for predicting continuous values. It is par-
ticularly effective when there is a linear relationship between the input features and the target
variable. Understanding linear regression and its application is fundamental to grasping more
advanced machine learning models and techniques.

2.4.3 Regularization: Preventing Overfitting


What is Overfitting?

In machine learning, overfitting occurs when a model becomes too complex and learns not only the
underlying patterns in the training data but also the noise and random fluctuations. This means
the model performs very well on the training data but struggles to generalize to new, unseen data,
leading to poor performance. Overfitting often happens when the model’s complexity increases,
which can be caused by using too many features or training the model for too long.

Regularization is a technique that helps prevent overfitting by introducing a penalty term in the
model’s cost function. This penalty discourages the model from assigning too much importance
(or weight) to any one feature, thus keeping the model simpler and more generalizable.



Ridge Regression (L2 Regularization)

Ridge regression, also known as L2 regularization, adds a penalty term to the loss function
proportional to the square of the weights. The regularized loss function for ridge regression is
given by:

\[
\text{Loss} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 + \lambda \sum_{j=1}^{p} w_j^2
\]

Where:

• ŷi is the predicted value for the i-th data point,

• yi is the actual value for the i-th data point,

• wj are the weights associated with the features,

• λ is the regularization parameter (sometimes called α), controlling the strength of the
penalty.
The additional term $\lambda \sum_{j=1}^{p} w_j^2$ penalizes large weights, encouraging the model to keep the weights
smaller. This makes the model simpler and less prone to overfitting.

(Figure: training and test error versus model complexity; test error is lowest at the optimal complexity.)

Example: Suppose we are predicting house prices based on various features such as size, number
of rooms, etc. Ridge regression prevents any one feature (like size) from having an overly large
weight, ensuring that all features contribute more evenly to the prediction.



Input Table

House   Size (sq. ft.)   Number of Rooms   Price (in $1000s)
1       2000             3                 300
2       1800             3                 280
3       2200             4                 350
4       1700             2                 220

Ridge Regression Equation


The Ridge regression model is given by:

\[ \mathbf{w} = (X^T X + \lambda I)^{-1} X^T \mathbf{y} \]

Where:
• X is the matrix of input features (in this case, Size and Rooms).
• y is the vector of target values (in this case, Price).
• λ is the regularization parameter (we are using λ = 1).
• I is the identity matrix.

Step-by-Step Calculation
1. Feature Matrix (X) and Target Vector (y)
From the given table, the input features and target values are:

\[
X = \begin{bmatrix} 2000 & 3 \\ 1800 & 3 \\ 2200 & 4 \\ 1700 & 2 \end{bmatrix}, \qquad
y = \begin{bmatrix} 300 \\ 280 \\ 350 \\ 220 \end{bmatrix}
\]

2. Calculate $X^T X$
\[
X^T X = \begin{bmatrix} 15700000 & 22100 \\ 22100 & 38 \end{bmatrix}
\]

3. Calculate $X^T y$
\[
X^T y = \begin{bmatrix} 1874000 \\ 4230 \end{bmatrix}
\]



4. Regularization Term: $\lambda I$
\[
\lambda I = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}
\]

5. Compute $X^T X + \lambda I$
\[
X^T X + \lambda I = \begin{bmatrix} 15700001 & 22100 \\ 22100 & 39 \end{bmatrix}
\]

6. Compute the Inverse $(X^T X + \lambda I)^{-1}$
The inverse of the matrix is:
\[
(X^T X + \lambda I)^{-1} = \frac{1}{124490039} \begin{bmatrix} 39 & -22100 \\ -22100 & 15700001 \end{bmatrix}
= \begin{bmatrix} 3.13 \times 10^{-7} & -1.77 \times 10^{-4} \\ -1.77 \times 10^{-4} & 1.26 \times 10^{-1} \end{bmatrix}
\]

7. Compute $w = (X^T X + \lambda I)^{-1} X^T y$
\[
w = \begin{bmatrix} 3.13 \times 10^{-7} & -1.77 \times 10^{-4} \\ -1.77 \times 10^{-4} & 1.26 \times 10^{-1} \end{bmatrix}
\begin{bmatrix} 1874000 \\ 4230 \end{bmatrix}
\]

Multiplying this out:
\[
w_0 \approx 0.1961, \qquad w_1 \approx 10.6499
\]

So the weights are:
\[
w_0 = 0.1961 \ (\text{for Size}), \qquad w_1 = 10.6499 \ (\text{for Rooms})
\]

8. Intercept (Bias)
The intercept b is computed during the fitting process:

b = −121.9437

Final Results
The final values for the Ridge regression model are:

Feature Value
Size (w0) 0.1961
Rooms (w1) 10.6499
Intercept (b) −121.9437
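For readers who want to check this kind of calculation numerically, the hedged sketch below applies the closed-form formula w = (XᵀX + λI)⁻¹Xᵀy to the table's data with NumPy. Because scikit-learn's Ridge additionally fits an intercept (centering the data first), and because the intermediate values above are rounded, the numbers printed here may not reproduce the worked figures exactly.

import numpy as np
from sklearn.linear_model import Ridge

# Data from the input table above
X = np.array([[2000, 3], [1800, 3], [2200, 4], [1700, 2]], dtype=float)
y = np.array([300, 280, 350, 220], dtype=float)
lam = 1.0  # regularization parameter lambda

# Closed-form Ridge solution without an intercept: w = (X^T X + lambda*I)^(-1) X^T y
w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print("closed-form weights (no intercept):", w)

# scikit-learn's Ridge also fits an intercept by centering the data,
# so its coefficients and intercept will differ from the plain closed form.
ridge = Ridge(alpha=lam)
ridge.fit(X, y)
print("sklearn coefficients:", ridge.coef_, "intercept:", ridge.intercept_)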



Example: Ridge Regression (L2 Regularization) on California Housing Dataset

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load the California housing dataset
california = fetch_california_housing()
X = pd.DataFrame(california.data, columns=california.feature_names)
y = pd.Series(california.target)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Ridge model with alpha = 0.1 (regularization strength)
ridge_reg = Ridge(alpha=0.1)
ridge_reg.fit(X_train, y_train)

# Predictions
y_pred = ridge_reg.predict(X_test)

# Calculate and print the Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (Ridge): {mse}")

# Visualize actual vs predicted values
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, edgecolors=(0, 0, 0), alpha=0.7)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red', linestyle='--', linewidth=2)
plt.xlabel("Actual Prices")
plt.ylabel("Predicted Prices")
plt.title("Ridge Regression (L2): Actual vs Predicted Prices (California Housing)")
plt.show()

Mean Squared Error (Ridge): 0.5558827543113799



Lasso Regression (L1 Regularization)
Lasso regression, also known as L1 regularization, takes the idea of regularization further by
not only penalizing large weights but also shrinking some weights to exactly zero. The regularized
loss function for lasso regression is:

\[
\text{Loss} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 + \lambda \sum_{j=1}^{p} |w_j|
\]

Where:
• $\lambda \sum_{j=1}^{p} |w_j|$ is the L1 regularization term, which penalizes the absolute values of the weights.
Lasso regression is especially useful when we have a large number of features, many of which
might be irrelevant. The L1 regularization can shrink the coefficients of less important features
to zero, effectively performing feature selection.
Example: In a scenario where we are using many features to predict house prices, lasso regression
can automatically select only the most important features, such as house size and location, and
discard less relevant features like the color of the house.



(Figure: L1 (diamond) and L2 (circular) regularization constraint regions in the (w1, w2) weight plane.)

Input Table

House   Size (sq. ft.)   Number of Rooms   Price (in $1000s)
1       2000             3                 300
2       1800             3                 280
3       2200             4                 350
4       1700             2                 220

Lasso Regression Equation


The Lasso regression model is given by:

\[
\text{Minimize} \quad \sum_{i=1}^{n} \left( y_i - (\mathbf{w}^T \mathbf{x}_i + b) \right)^2 + \lambda \sum_{j=1}^{p} |w_j|
\]

Where:
• X is the matrix of input features (in this case, Size and Rooms).
• y is the vector of target values (in this case, Price).
• λ is the regularization parameter (we are using λ = 1).



Final Results
After applying Lasso regression with λ = 1, we get the following values:

Feature Value
Size (w0) 0.19
Rooms (w1) 0
Intercept (b) −81.88

In this case, Lasso regression set the coefficient w1 (for Rooms) to 0, effectively eliminating that
feature from the model.

Lasso Regression (L1 Regularization) on California Housing Dataset

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load the California housing dataset
california = fetch_california_housing()
X = pd.DataFrame(california.data, columns=california.feature_names)
y = pd.Series(california.target)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Lasso model with alpha = 0.1 (regularization strength)
lasso_reg = Lasso(alpha=0.1, max_iter=10000)
lasso_reg.fit(X_train, y_train)

# Predictions
y_pred = lasso_reg.predict(X_test)

# Calculate and print the Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (Lasso): {mse}")

# Visualize actual vs predicted values



plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, edgecolors=(0, 0, 0), alpha=0.7)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red',
         linestyle='--', linewidth=2)
plt.xlabel("Actual Prices")
plt.ylabel("Predicted Prices")
plt.title("Lasso Regression (L1): Actual vs Predicted Prices (California Housing)")
plt.show()

Mean Squared Error (Lasso): 0.6135115198058131

Comparing Lasso and Ridge: Lasso tends to zero out less important features, effectively performing feature selection. Ridge shrinks the weights of all features but does not eliminate any, so it's better suited when you believe all features contribute to the prediction. In both cases, alpha is the regularization strength, and increasing it will apply stronger regularization, leading to smaller weights (and possibly reduced overfitting).
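
The contrast is easy to see by inspecting the learned coefficients directly. The sketch below (an added illustration that reuses the California Housing setup from the examples above) fits both models with alpha = 0.1 and counts how many coefficients each one drives exactly to zero; Lasso typically zeroes out a few features here, while Ridge keeps all of them nonzero.

import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import train_test_split

# Reload the data and refit both models so the snippet is self-contained
california = fetch_california_housing()
X = pd.DataFrame(california.data, columns=california.feature_names)
y = california.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

ridge = Ridge(alpha=0.1).fit(X_train, y_train)
lasso = Lasso(alpha=0.1, max_iter=10000).fit(X_train, y_train)

# Side-by-side coefficients: Lasso can produce exact zeros, Ridge only shrinks
comparison = pd.DataFrame({"Ridge": ridge.coef_, "Lasso": lasso.coef_}, index=X.columns)
print(comparison)
print("Coefficients equal to zero -> Ridge:", (ridge.coef_ == 0).sum(),
      "| Lasso:", (lasso.coef_ == 0).sum())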

Differences Between L1 and L2 Regularization



Feature                     L1 Regularization (Lasso)                      L2 Regularization (Ridge)
Penalty Term                λ Σ_{i=1}^{n} |w_i|                            λ Σ_{i=1}^{n} w_i²
Effect on Coefficients      Shrinks some coefficients to zero,             Shrinks coefficients but never to zero
                            performing feature selection
Solution Space Shape        Diamond-shaped constraint region               Circular (ellipsoid) constraint region
Sparsity                    Produces sparse models (selects                Produces non-sparse models (keeps all
                            important features)                            features)
Use Cases                   Useful when feature selection is needed        Useful when multicollinearity exists or
                            or there are many irrelevant features          when all features are important
Optimization                Leads to sparse solutions due to sharp         Leads to smoother, non-sparse solutions
                            corners in the constraint region
Computational Complexity    Computationally more intensive due to          Computationally simpler due to smooth
                            non-differentiable points                      constraint region

Why Use Regularization?

Without regularization, models tend to fit the training data too well, capturing noise and random
fluctuations in the data. This leads to high training accuracy but poor performance on unseen
data. Regularization helps by limiting the complexity of the model and ensuring that it generalizes
better to new data. By penalizing large weights, ridge regression keeps all the features in the model
but controls their influence. Lasso regression, on the other hand, not only controls complexity
but also performs feature selection by shrinking some weights to zero.

The key difference between L1 and L2 regularization is that L1 (lasso) tends to produce sparse
models (with fewer features), while L2 (ridge) keeps all the features but reduces their influence.

Regularization is a powerful tool for preventing overfitting in machine learning models. By adding
a penalty to the model’s loss function, we can control the size of the weights and keep the model
simpler, leading to better generalization on unseen data. Ridge regression (L2 regularization)
penalizes large weights, while lasso regression (L1 regularization) can shrink some weights to
zero, effectively performing feature selection. These techniques are essential when working with
high-dimensional datasets or when you want to balance complexity and performance.



2.4.4 Linear Models for Classification
What is Classification?
Classification is a type of supervised learning used when the target variable is categorical, meaning
we aim to predict discrete categories or classes. For example, classification can be used to predict
whether an email is spam or not (binary classification) or to determine if a patient has a particular
disease based on their medical records. The goal of a classification algorithm is to find a decision
boundary that separates the data points into different classes.

Logistic Regression
Logistic regression is a widely used linear model for binary classification tasks, where there are two
possible outcomes (e.g., yes/no, true/false). While linear regression models predict continuous
values, logistic regression transforms the output into a probability using a special function called
the sigmoid function or logistic function.
The sigmoid function is defined as:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
Where:
• z = w0 · x0 + w1 · x1 + · · · + wp · xp + b is the linear combination of input features.
• w0 , w1 , . . . , wp are the weights learned by the model.
• b is the bias term.
The sigmoid function squashes the output of the linear equation into a range between 0 and
1, which can be interpreted as the probability of the input belonging to a certain class. If the
predicted probability is greater than 0.5, the model assigns the input to the positive class (e.g.,
spam), and if it’s less than 0.5, the model assigns it to the negative class (e.g., not spam).

[Figure: the sigmoid curve mapping the linear score z to a probability between 0 and 1, with the P = 0.5 decision threshold marked.]
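
As a quick numeric illustration (a small added sketch, not part of the original text), the snippet below evaluates the sigmoid for a few linear scores z and applies the 0.5 threshold:

import numpy as np

def sigmoid(z):
    # Squash a real-valued score into the (0, 1) range
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores z = w . x + b for three inputs
scores = np.array([-2.0, 0.0, 1.4])
probabilities = sigmoid(scores)
predictions = (probabilities > 0.5).astype(int)  # 1 = positive class (e.g., spam)

for z, p, label in zip(scores, probabilities, predictions):
    print(f"z = {z:+.1f} -> P(positive) = {p:.3f} -> predicted class = {label}")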

Example: Spam Detection


Let’s walk through an example of how logistic regression is used for binary classification, such as
detecting whether an email is spam or not.



Suppose we have features like the number of suspicious words in the email and the sender’s
reputation score. Using logistic regression, the model would learn the relationship between these
features and whether an email is spam. After training, the model might compute a probability
that an email is spam based on these features.
For example, the model might compute a 0.8 (or 80%) probability that a particular email is spam.
Since 0.8 is greater than 0.5, the model classifies the email as spam.

[Figure: an email fed into the logistic regression model, which outputs Spam with probability 80% and Not Spam with probability 20%.]

In this case, the model predicts that there is an 80% chance that the email is spam, so it classifies
it as spam.

Decision Boundary in Logistic Regression


In logistic regression, the model learns a decision boundary that separates the data into two
classes. The decision boundary is the point where the model predicts a probability of 0.5. Any
data points on one side of the boundary will be classified into one class, and those on the other
side will be classified into the opposite class.

[Figure: data points of two classes plotted against Feature 1 and Feature 2, separated by a dashed decision boundary.]

In this diagram, the dashed line represents the decision boundary learned by logistic regression.



All the blue points on one side of the boundary are classified as belonging to one class, while the
red points on the other side are classified as belonging to the other class.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Load the breast cancer dataset
cancer = load_breast_cancer()
X = pd.DataFrame(cancer.data, columns=cancer.feature_names)
y = pd.Series(cancer.target)

# Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Logistic Regression model
log_reg = LogisticRegression(max_iter=10000)
log_reg.fit(X_train, y_train)

# Predictions on the test set
y_pred = log_reg.predict(X_test)

# Calculate and print the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

# Generate the classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Generate the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)

# Visualize the confusion matrix
plt.matshow(conf_matrix, cmap='Blues', alpha=0.7)
for i in range(conf_matrix.shape[0]):
    for j in range(conf_matrix.shape[1]):
        plt.text(x=j, y=i, s=conf_matrix[i, j], va='center', ha='center')
plt.title("Confusion Matrix for Breast Cancer Classification")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()

Accuracy: 95.61%
Classification Report:
precision recall f1-score support

0 0.97 0.91 0.94 43


1 0.95 0.99 0.97 71

accuracy 0.96 114


macro avg 0.96 0.95 0.95 114
weighted avg 0.96 0.96 0.96 114

Confusion Matrix:
[[39 4]
[ 1 70]]



Logistic regression is a simple yet powerful algorithm for binary classification tasks. By trans-
forming the output of a linear equation into a probability using the sigmoid function, it allows us
to make decisions based on whether the probability is above or below a certain threshold (usually
0.5). Logistic regression is widely used in applications such as spam detection, medical diagnoses,
and many other areas where decisions need to be made between two possible outcomes.

2.4.5 Multiclass Classification with Linear Models


What if there are more than two categories?
In many real-world problems, classification tasks involve more than two categories. For instance,
classifying images of animals may require identifying whether the animal is a cat, dog, or bird.
Linear models can handle multiclass classification by using a technique called one-vs-rest (OvR)
or one-vs-all (OvA).
In this approach, the model trains a separate classifier for each class, where one class is considered
"positive" and all other classes are "negative." When predicting a new instance, each classifier
outputs a confidence score, and the class with the highest score is selected as the final prediction.

[Figure: three classes (Class A, Class B, Class C) plotted against Feature 1 and Feature 2, separated by dashed one-vs-rest decision boundaries.]

In this diagram, the dashed lines represent decision boundaries for three classes: Class A, Class
B, and Class C. When a new data point is evaluated, the classifiers each calculate a score, and
the class with the highest confidence is selected.

Example: Classifying Types of Animals


For example, if we want to classify images of animals as either a cat, dog, or bird, a separate
logistic regression classifier is trained for each category:



• One classifier identifies if the image is a cat or not.
• Another classifier determines if the image is a dog or not.
• The third classifier checks if the image is a bird or not.
Each classifier outputs a confidence score, and the class with the highest confidence is chosen.
For example, if the dog classifier has a confidence score of 0.75, while the cat and bird classifiers
have scores of 0.6 and 0.4 respectively, the image will be classified as a dog.
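
The final decision step of one-vs-rest is just an argmax over the per-class confidence scores. The tiny sketch below uses the illustrative scores from the animal example above (an added snippet, not from the original text):

# Hypothetical one-vs-rest confidence scores for a single image
scores = {"cat": 0.6, "dog": 0.75, "bird": 0.4}

# One-vs-rest prediction: pick the class whose classifier is most confident
predicted_class = max(scores, key=scores.get)
print("Predicted class:", predicted_class)  # dog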

2.4.6 Model Complexity and Regularization


What is Model Complexity?
A model’s complexity refers to its ability to capture detailed patterns in the data. Complex
models can learn intricate relationships, but they are also prone to overfitting, where the model
memorizes the training data rather than learning the general patterns.

Regularization: Balancing Simplicity and Complexity


Regularization helps balance between a model that is too simple (which underfits the data) and
a model that is too complex (which overfits the data). Regularization techniques, such as Lasso
(L1 regularization) and Ridge (L2 regularization), add penalties to the model’s loss function to
discourage overly large weights. This keeps the model simpler and helps it generalize better to
unseen data.

[Figure: error on training and test data versus model complexity; training error decreases steadily while test error follows a U-shaped curve whose minimum marks the optimal complexity.]

In the diagram, as the model complexity increases, the training error decreases, but the test error
follows a U-shaped curve. Regularization helps by pushing the model towards optimal complexity,
where the test error is minimized.
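
To see this trade-off numerically, the sketch below (an added illustration on synthetic data, not from the text) fits a deliberately flexible model, a degree-12 polynomial regression, while sweeping the Ridge penalty. Training error generally keeps falling as alpha shrinks, while test error is usually lowest at some intermediate alpha, mirroring the U-shaped curve described above.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Synthetic 1-D problem: y = sin(2*pi*x) plus noise, fitted with a flexible
# polynomial model whose effective complexity is controlled by alpha
rng = np.random.RandomState(0)
X_train = rng.uniform(0, 1, 25)[:, None]
X_test = rng.uniform(0, 1, 200)[:, None]
y_train = np.sin(2 * np.pi * X_train).ravel() + rng.normal(scale=0.3, size=25)
y_test = np.sin(2 * np.pi * X_test).ravel() + rng.normal(scale=0.3, size=200)

for alpha in [1e-6, 1e-3, 1e-1, 1e1, 1e3]:
    model = make_pipeline(PolynomialFeatures(degree=12), StandardScaler(), Ridge(alpha=alpha))
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"alpha = {alpha:g}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")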



Example: Multiclass Classification Using Logistic Regression

Let’s use the well-known Iris dataset as an example. This dataset has three classes: Setosa,
Versicolor, and Virginica. We’ll use logistic regression as the linear model for this multiclass
classification task.

# Importing necessary libraries
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data    # Features
y = iris.target  # Target classes

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize and train a Logistic Regression model (one-vs-rest)
logistic_reg = LogisticRegression(multi_class='ovr', solver='lbfgs', max_iter=200)
logistic_reg.fit(X_train, y_train)

# Predict on the test set
y_pred = logistic_reg.predict(X_test)

# Print classification report
print("Classification Report:\n", classification_report(y_test, y_pred, target_names=iris.target_names))

# Print confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)

# Plotting the confusion matrix for better visualization
df_cm = pd.DataFrame(conf_matrix, index=iris.target_names, columns=iris.target_names)
plt.figure(figsize=(8, 6))
sns.heatmap(df_cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix - Logistic Regression (Multiclass)')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

Classification Report:
precision recall f1-score support

setosa 1.00 1.00 1.00 19


versicolor 1.00 0.85 0.92 13
virginica 0.87 1.00 0.93 13

accuracy 0.96 45
macro avg 0.96 0.95 0.95 45
weighted avg 0.96 0.96 0.96 45

Confusion Matrix:
[[19 0 0]
[ 0 11 2]
[ 0 0 13]]



2.4.7 Strengths and Weaknesses of Linear Models
Strengths of Linear Models
Linear models have several key advantages that make them highly useful in many machine learning
tasks:

• Easy to interpret: Linear models provide clear insights into how the input features affect
the output. For example, we can easily interpret the coefficients in a linear regression model
as the amount of change in the target variable per unit change in the input feature.

• Fast to train and use: Linear models are computationally efficient and can be trained
quickly, even on large datasets.

• Works well with high-dimensional data: Linear models perform well when there are
many features (high-dimensional data), even if some features are irrelevant.

Weaknesses of Linear Models


Despite their strengths, linear models also have limitations:

• Struggles with nonlinear relationships: Linear models assume a straight-line relationship between features and the target. They may perform poorly if the true relationship is nonlinear or involves interactions between features.

• May underperform on complex datasets: Without regularization or feature transformations, linear models can underfit, especially when dealing with complex or non-linearly separable data.

In summary, linear models are simple, interpretable, and fast, but they may struggle with complex
relationships in the data. Regularization helps address this by controlling model complexity and
preventing overfitting, making linear models more effective on a wider range of tasks.

2.4.8 Evaluation Metrics for Regression and Classification


In machine learning, evaluation metrics are essential for understanding how well a model is per-
forming. The choice of evaluation metric depends on the type of problem being solved—whether
it’s a regression task, where the goal is to predict continuous values, or a classification task, where
the objective is to assign labels to discrete classes. These metrics help quantify the accuracy and
effectiveness of the model, guiding improvements and fine-tuning of the algorithm.

For regression tasks, common evaluation metrics include Mean Squared Error (MSE), Root
Mean Squared Error (RMSE), and Mean Absolute Error (MAE). These metrics measure
how close the predicted values are to the actual values. On the other hand, for classification
tasks, metrics like accuracy, precision, recall, F1 score, and ROC-AUC are used to assess
how well the model distinguishes between different classes. Understanding these metrics allows
practitioners to make informed decisions about the performance and reliability of their models.



1. Linear Regression Metrics
Let the actual values ytrue and predicted values ypred be:

ytrue = [3.0, 2.5, 4.0, 5.5, 6.0]

ypred = [2.8, 2.6, 3.9, 5.4, 5.8]


1. Mean Squared Error (MSE): Measures the average of the squares of the errors, i.e., the
average squared difference between the actual and predicted values.
$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

• Sample Calculation:

$$\text{MSE} = \frac{1}{5}\left[(3.0-2.8)^2 + (2.5-2.6)^2 + (4.0-3.9)^2 + (5.5-5.4)^2 + (6.0-5.8)^2\right] = \frac{0.11}{5} = 0.022$$

• Interpretation: A lower MSE indicates that the model’s predictions are close to the
actual values.
2. Root Mean Squared Error (RMSE): The square root of MSE, which brings the error metric back to the same units as the target variable.

$$\text{RMSE} = \sqrt{\text{MSE}}$$

• Sample Calculation:

$$\text{RMSE} = \sqrt{0.022} \approx 0.148$$

• Interpretation: The RMSE tells us that the predictions are off by about 0.148 units on average.
3. Mean Absolute Error (MAE): Measures the average of the absolute errors, i.e., the absolute difference between actual and predicted values.

$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$$

• Sample Calculation:

$$\text{MAE} = \frac{1}{5}\left[|3.0-2.8| + |2.5-2.6| + |4.0-3.9| + |5.5-5.4| + |6.0-5.8|\right] = \frac{0.7}{5} = 0.14$$

• Interpretation: A lower MAE indicates better prediction performance.



4. R-squared (R²): Represents the proportion of the variance in the dependent variable that is predictable from the independent variables.

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

Where ȳ is the mean of the actual values.

• Sample Calculation:

$$R^2 = 1 - \frac{0.11}{9.3} \approx 0.988$$

• Interpretation: An R² value closer to 1 means that the model explains a high proportion of variance.
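
These hand calculations can be checked with scikit-learn's metric functions. The short sketch below (an added snippet) reproduces the values above:

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, 2.5, 4.0, 5.5, 6.0])
y_pred = np.array([2.8, 2.6, 3.9, 5.4, 5.8])

mse = mean_squared_error(y_true, y_pred)
print(f"MSE:  {mse:.3f}")                                   # 0.022
print(f"RMSE: {np.sqrt(mse):.3f}")                          # ~0.148
print(f"MAE:  {mean_absolute_error(y_true, y_pred):.3f}")   # 0.14
print(f"R^2:  {r2_score(y_true, y_pred):.3f}")              # ~0.988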

2. Logistic Regression Metrics


Let the actual labels ytrue and predicted labels ypred be:

ytrue = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
ypred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
1. Confusion Matrix: A table used to describe the performance of a classification model,
showing the correct and incorrect predictions.

Predicted 0 Predicted 1
Actual 0 TN = 4 FP = 1
Actual 1 FN = 1 TP = 4

2. Accuracy: The proportion of correct predictions (both true positives and true negatives)
among the total predictions.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

• Sample Calculation:

$$\text{Accuracy} = \frac{4 + 4}{4 + 4 + 1 + 1} = 0.8$$

• Interpretation: The model is 80% accurate, meaning it correctly predicted 80% of the outcomes.

3. Precision: The proportion of true positives among the predicted positives.

$$\text{Precision} = \frac{TP}{TP + FP}$$



• Sample Calculation:

$$\text{Precision} = \frac{4}{4 + 1} = 0.8$$

• Interpretation: The model's precision is 80%, meaning that when the model predicts a positive class, it is correct 80% of the time.
4. Recall (Sensitivity or True Positive Rate): The proportion of true positives among the actual positives.

$$\text{Recall} = \frac{TP}{TP + FN}$$

• Sample Calculation:

$$\text{Recall} = \frac{4}{4 + 1} = 0.8$$

• Interpretation: The recall is 80%, meaning the model correctly identifies 80% of the actual positive cases.
5. F1 Score: The harmonic mean of precision and recall. It is useful when the class distribution is imbalanced.

$$F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

• Sample Calculation:

$$F_1 = 2 \times \frac{0.8 \times 0.8}{0.8 + 0.8} = 0.8$$

• Interpretation: The F1 score is 0.8, indicating a good balance between precision and recall.
6. ROC-AUC (Receiver Operating Characteristic - Area Under Curve): Measures the ability of the model to distinguish between classes. A value closer to 1 is better.

• Sample Value:

ROC-AUC = 0.85

• Interpretation: A ROC-AUC score of 0.85 indicates the model is good at distinguishing between the positive and negative classes.
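
The confusion-matrix-based metrics above can be reproduced with scikit-learn in a few lines (an added sketch; ROC-AUC is omitted because it is computed from predicted probabilities or scores rather than hard 0/1 labels, and the 0.85 above is only an illustrative value):

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))  # [[4 1], [1 4]]
print("Accuracy :", accuracy_score(y_true, y_pred))    # 0.8
print("Precision:", precision_score(y_true, y_pred))   # 0.8
print("Recall   :", recall_score(y_true, y_pred))      # 0.8
print("F1 score :", f1_score(y_true, y_pred))          # 0.8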

2.5 Naive Bayes Classifiers


Naive Bayes classifiers are a family of probabilistic classifiers based on Bayes’ Theorem. These
classifiers assume that all features are independent of each other given the class label, which is
referred to as the naive assumption. Despite this simplifying assumption, Naive Bayes classifiers
perform remarkably well for many real-world tasks such as text classification, spam detection,
and sentiment analysis, due to their ability to handle high-dimensional data efficiently.



Bayes’ Theorem describes the relationship between the posterior probability (the probability of a
class given certain features) and the likelihood (the probability of features given a class). It can
be written as:

$$P(y|X) = \frac{P(X|y) \cdot P(y)}{P(X)}$$

Where:
• P (y|X) is the posterior probability of class y given the feature vector X.
• P (X|y) is the likelihood of observing the feature vector X given class y.
• P (y) is the prior probability of class y, which represents how common the class is in the
dataset.
• P (X) is the marginal probability of the feature vector X, which serves as a normalization
constant.
Naive Bayes classifiers leverage this theorem by estimating the posterior probability for each class
and then selecting the class with the highest posterior probability.

2.5.1 Types of Naive Bayes Classifiers


There are three main types of Naive Bayes classifiers depending on the nature of the input
features: Gaussian Naive Bayes, Multinomial Naive Bayes, and Bernoulli Naive Bayes.
Each of these classifiers uses a different probability distribution to model the likelihood of the
features.

1. Gaussian Naive Bayes (For Continuous Data)


Gaussian Naive Bayes is used when the features are continuous and are assumed to follow a
normal (Gaussian) distribution. The likelihood of a feature xi given a class y is modeled using
the Gaussian distribution:

$$P(x_i|y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \cdot e^{-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}}$$

Where:
• µy is the mean of feature xi for class y.
• σy2 is the variance of feature xi for class y.
• xi is the value of the feature.



Consider a dataset with study hours and their outcomes (Pass/Fail):

Study Hours Outcome (Pass/Fail)


2 Fail
4 Fail
6 Pass
8 Pass
10 Pass

We want to predict whether a student who studied for 7 hours will pass or fail.
Step 1: Calculate the Mean (µ) and Variance (σ²) for Each Class

For class "Pass":

$$\mu_{\text{Pass}} = \frac{6 + 8 + 10}{3} = \frac{24}{3} = 8, \qquad \sigma^2_{\text{Pass}} = \frac{(6-8)^2 + (8-8)^2 + (10-8)^2}{3} = \frac{4 + 0 + 4}{3} \approx 2.67$$

For class "Fail":

$$\mu_{\text{Fail}} = \frac{2 + 4}{2} = \frac{6}{2} = 3, \qquad \sigma^2_{\text{Fail}} = \frac{(2-3)^2 + (4-3)^2}{2} = \frac{1 + 1}{2} = 1$$

Step 2: Apply the Gaussian Naive Bayes Formula


The likelihood for each class is computed using the Gaussian distribution formula:

$$P(x_i|y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \cdot \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right)$$

Where:
• xi is the feature value (study hours in this case, x = 7),
• µy is the mean for the class y,
• σy2 is the variance for the class y.
For class "Pass":
$$P(x = 7|\text{Pass}) = \frac{1}{\sqrt{2\pi \cdot 2.67}} \cdot \exp\left(-\frac{(7-8)^2}{2 \cdot 2.67}\right)$$

• First, calculate the denominator:

$$\sqrt{2\pi \cdot 2.67} \approx 4.08$$



• Then, calculate the exponent:

$$\frac{(7-8)^2}{2 \cdot 2.67} = \frac{1}{5.34} \approx 0.187$$

• Now, calculate the exponential term:

$$\exp(-0.187) \approx 0.829$$

• Finally, calculate the likelihood:

$$P(x = 7|\text{Pass}) \approx \frac{1}{4.08} \cdot 0.829 \approx 0.203$$

For class "Fail":

$$P(x = 7|\text{Fail}) = \frac{1}{\sqrt{2\pi \cdot 1}} \cdot \exp\left(-\frac{(7-3)^2}{2 \cdot 1}\right)$$

• First, calculate the denominator:

$$\sqrt{2\pi \cdot 1} \approx 2.506$$

• Then, calculate the exponent:

$$\frac{(7-3)^2}{2 \cdot 1} = \frac{16}{2} = 8$$

• Now, calculate the exponential term:

$$\exp(-8) \approx 0.000335$$

• Finally, calculate the likelihood:

$$P(x = 7|\text{Fail}) \approx \frac{1}{2.506} \cdot 0.000335 \approx 0.000133$$

Step 3: Compare the Likelihoods and Make a Prediction


The posterior probabilities are proportional to the likelihoods calculated for each class. Since:

P (x = 7|Pass) ≈ 0.203, P (x = 7|Fail) ≈ 0.000133

We predict that the student will pass, as P (x = 7|Pass) is much greater than P (x = 7|Fail).
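
Before the full scikit-learn example below, the two likelihoods can be double-checked with scipy (an added sketch; note that a complete Gaussian Naive Bayes classifier would also multiply each likelihood by the class prior, 3/5 for Pass and 2/5 for Fail, which does not change the prediction here):

import numpy as np
from scipy.stats import norm

hours = np.array([2, 4, 6, 8, 10])
passed = np.array([0, 0, 1, 1, 1])  # 0 = Fail, 1 = Pass

x_new = 7
for label, name in [(1, "Pass"), (0, "Fail")]:
    values = hours[passed == label]
    mu = values.mean()
    var = values.var()  # population variance, as in the hand calculation
    likelihood = norm.pdf(x_new, loc=mu, scale=np.sqrt(var))
    prior = (passed == label).mean()
    print(f"{name}: likelihood = {likelihood:.6f}, prior = {prior:.2f}, product = {likelihood * prior:.6f}")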

Example Code:

# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load the Iris dataset
data = load_iris()
X = data.data    # Features (continuous values)
y = data.target  # Labels

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize and train the Gaussian Naive Bayes model
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Predict on the test set
y_pred = gnb.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Gaussian Naive Bayes Accuracy: {accuracy:.2f}")

Gaussian Naive Bayes Accuracy: 0.98

2. Multinomial Naive Bayes (For Discrete Data)


Multinomial Naive Bayes is suitable for discrete data, particularly for text classification tasks,
where features represent the frequency or count of events (e.g., word occurrences in documents).
The likelihood of a feature xi given a class y is proportional to its frequency in the dataset:

$$P(x_i|y) = \frac{\text{Count of } x_i \text{ in class } y}{\text{Total count of features in class } y}$$

Example: Spam Detection


Consider a dataset with word counts for the words "Buy" and "Free" in emails:

Email Buy (Count) Free (Count) Class (Spam/Not Spam)


1 3 2 Spam
2 0 1 Not Spam
3 2 3 Spam
4 1 0 Not Spam



We want to classify a new email with word counts: "Buy" = 2 and "Free" = 1.
Step 1: Calculate Prior Probabilities
The prior probabilities are based on how many emails are classified as spam or not spam:
• Spam: There are 2 spam emails out of 4:

$$P(\text{Spam}) = \frac{2}{4} = 0.5$$

• Not Spam: There are 2 not spam emails out of 4:

$$P(\text{Not Spam}) = \frac{2}{4} = 0.5$$

Step 2: Calculate the Likelihood for Each Class


Next, we calculate the likelihood of observing the words "Buy" and "Free" in both spam and not
spam emails.
Total Word Count for Each Class:
• For Spam:
Total Word Count (Spam) = (3 + 2) + (2 + 3) = 10

• For Not Spam:

Total Word Count (Not Spam) = (0 + 1) + (1 + 0) = 2

Likelihood Calculation for Each Class:


• For Spam:

$$P(\text{Buy} = 2|\text{Spam}) = \frac{2}{10} = 0.2, \qquad P(\text{Free} = 1|\text{Spam}) = \frac{1}{10} = 0.1$$

• For Not Spam:

$$P(\text{Buy} = 2|\text{Not Spam}) = \frac{1}{2} = 0.5, \qquad P(\text{Free} = 1|\text{Not Spam}) = \frac{1}{2} = 0.5$$

Step 3: Apply Bayes’ Theorem


Now, using Bayes’ Theorem, we calculate the posterior probability for each class:

P (Spam|Buy = 2, Free = 1) ∝ P (Spam) · P (Buy = 2|Spam) · P (Free = 1|Spam)

Substitute the values:



P (Spam|Buy = 2, Free = 1) ∝ 0.5 · 0.2 · 0.1 = 0.01

Similarly, for not spam:

P (Not Spam|Buy = 2, Free = 1) ∝ P (Not Spam)·P (Buy = 2|Not Spam)·P (Free = 1|Not Spam)

P (Not Spam|Buy = 2, Free = 1) ∝ 0.5 · 0.5 · 0.5 = 0.125

Step 4: Compare Posterior Probabilities


The posterior probabilities are proportional to the likelihoods we just computed:
• For Spam:
P (Spam|Buy = 2, Free = 1) ∝ 0.01

• For Not Spam:


P (Not Spam|Buy = 2, Free = 1) ∝ 0.125

Since P (Not Spam) > P (Spam), we classify the email as Not Spam.

Example Code:

# Import necessary libraries
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Create a simple toy dataset for spam detection
emails = [
    'Free entry in 2 a weekly competition to win tickets',       # Spam
    'Upto 20% discount on selected items for a limited time',    # Spam
    'Congratulations! You have won a prize, claim now',          # Spam
    'Hello, are you available for a meeting tomorrow?',          # Not Spam
    'Reminder: Your appointment is scheduled for Monday',        # Not Spam
    'Don\'t forget to submit your project by end of this week',  # Not Spam
    'Claim your free ticket to the concert this weekend',        # Spam
    'Meeting reminder: Agenda and notes attached'                # Not Spam
]

# Labels for each email (Spam = 1, Not Spam = 0)
labels = [1, 1, 1, 0, 0, 0, 1, 0]

# Convert text data into a bag-of-words representation (word frequencies)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3, random_state=42)

# Initialize and train the Multinomial Naive Bayes model
mnb = MultinomialNB()
mnb.fit(X_train, y_train)

# Predict on the test set
y_pred = mnb.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Multinomial Naive Bayes Accuracy: {accuracy:.2f}")

# Display results
print("\nPredicted labels for test data:")
# Note: i is the position within the test split, not the index into the original
# `emails` list, so the text shown here may not correspond to the actual test email.
for i, email in enumerate(X_test.toarray()):
    print(f"Email: '{emails[i]}', Predicted Label: {y_pred[i]}")

Multinomial Naive Bayes Accuracy: 0.33

Predicted labels for test data:
Email: 'Free entry in 2 a weekly competition to win tickets', Predicted Label: 0
Email: 'Upto 20% discount on selected items for a limited time', Predicted Label: 1
Email: 'Congratulations! You have won a prize, claim now', Predicted Label: 1

3. Bernoulli Naive Bayes (For Binary Data)


Bernoulli Naive Bayes is used when features are binary, meaning they can either be present
(1) or absent (0).

The likelihood is calculated based on the presence or absence of the feature:

$$P(x_i = 1|y) = \frac{\text{Count of feature } x_i \text{ being present in class } y}{\text{Total number of samples in class } y}$$



Example: Email Classification with Binary Features (Bernoulli Naive Bayes)
Consider the following dataset where features "Buy" and "Free" are either present (1) or absent
(0):

Email Buy (0/1) Free (0/1) Class (Spam/Not Spam)


1 1 1 Spam
2 0 1 Not Spam
3 1 1 Spam
4 1 0 Not Spam

We need to classify a new email with features: Buy = 1 and Free = 0.


Step 1: Calculate the Prior Probabilities
The prior probabilities are calculated based on how many emails are classified as spam or not
spam:
• P(Spam): There are 2 spam emails out of 4, so:

$$P(\text{Spam}) = \frac{2}{4} = 0.5$$

• P(Not Spam): There are 2 not spam emails out of 4, so:

$$P(\text{Not Spam}) = \frac{2}{4} = 0.5$$

Step 2: Calculate the Likelihood for Each Feature (Buy, Free)


Next, we calculate the likelihood of observing the features "Buy" = 1 and "Free" = 0 for each
class (Spam and Not Spam).
For Spam:
There are 2 emails classified as Spam. Now, we compute the likelihood for "Buy" and "Free" in
the spam class:

$$P(\text{Buy} = 1|\text{Spam}) = \frac{2}{2} = 1, \qquad P(\text{Free} = 0|\text{Spam}) = \frac{0}{2} = 0$$

For Not Spam:


There are 2 emails classified as Not Spam. Now, we compute the likelihood for "Buy" and "Free"
in the not spam class:

$$P(\text{Buy} = 1|\text{Not Spam}) = \frac{1}{2} = 0.5, \qquad P(\text{Free} = 0|\text{Not Spam}) = \frac{1}{2} = 0.5$$



Step 3: Apply Bayes’ Theorem
We now calculate the posterior probability for each class (Spam and Not Spam) using Bayes’
theorem. We ignore P (X) since it’s the same for both classes, so we only need to compare the
numerators:
Posterior for Spam:

P (Spam|Buy = 1, Free = 0) ∝ P (Spam) · P (Buy = 1|Spam) · P (Free = 0|Spam)


Substitute the values:
P (Spam|Buy = 1, Free = 0) ∝ 0.5 · 1 · 0 = 0

Posterior for Not Spam:

P (Not Spam|Buy = 1, Free = 0) ∝ P (Not Spam)·P (Buy = 1|Not Spam)·P (Free = 0|Not Spam)

Substitute the values:

P (Not Spam|Buy = 1, Free = 0) ∝ 0.5 · 0.5 · 0.5 = 0.125

Step 4: Compare the Posterior Probabilities


The final step is to compare the posterior probabilities:
• P (Spam|Buy = 1, Free = 0) = 0
• P (Not Spam|Buy = 1, Free = 0) = 0.125
Since P (Not Spam) > P (Spam), we classify the email as Not Spam.
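
Before the scikit-learn example below, the arithmetic above can be mirrored in a few lines of plain Python (an added sketch with no smoothing; scikit-learn's BernoulliNB applies Laplace smoothing by default, so it would avoid the exact zero probability seen for the Spam class):

# The four training emails from the table above
emails = [
    {"Buy": 1, "Free": 1, "label": "Spam"},
    {"Buy": 0, "Free": 1, "label": "Not Spam"},
    {"Buy": 1, "Free": 1, "label": "Spam"},
    {"Buy": 1, "Free": 0, "label": "Not Spam"},
]
new_email = {"Buy": 1, "Free": 0}

scores = {}
for cls in ("Spam", "Not Spam"):
    rows = [e for e in emails if e["label"] == cls]
    prior = len(rows) / len(emails)
    score = prior
    for feature, value in new_email.items():
        p_present = sum(e[feature] for e in rows) / len(rows)  # P(feature = 1 | class)
        score *= p_present if value == 1 else (1 - p_present)
    scores[cls] = score

print(scores)                                      # {'Spam': 0.0, 'Not Spam': 0.125}
print("Prediction:", max(scores, key=scores.get))  # Not Spam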

Example Code

# Import necessary libraries
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score

# Example binary dataset (presence/absence of specific words)
data = [
    'Buy the new car now',
    'Free offer for limited time',
    'Don\'t miss out on this amazing deal',
    'Get your new car insurance',
    'No purchase required, free sample'
]
labels = [1, 1, 1, 0, 0]  # Spam (1) or Not Spam (0)

# Convert text data into binary features (word presence/absence)
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(data)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3, random_state=42)

# Initialize and train the Bernoulli Naive Bayes model
bnb = BernoulliNB()
bnb.fit(X_train, y_train)

# Predict on the test set
y_pred = bnb.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Bernoulli Naive Bayes Accuracy: {accuracy:.2f}")

Bernoulli Naive Bayes Accuracy: 0.50

2.5.2 Advantages and Disadvantages


Advantages

• Simple and computationally efficient.

• Performs well with small datasets and high-dimensional data.

• Handles multi-class classification.

Disadvantages

• Assumes feature independence, which may not hold in all cases.

• Sensitive to zero-frequency problems (handled using Laplace smoothing).

• May be outperformed by more complex models in certain applications.



2.6 Decision Tree Algorithm
2.6.1 Introduction to Decision Trees
A Decision Tree is a supervised learning algorithm that can be used for both classification and
regression tasks. It works by splitting the dataset into smaller subsets based on the most signifi-
cant feature that provides the best separation. The result is a tree-like model of decisions. Each
internal node represents a "test" or "decision" on a feature, each branch represents the outcome
of that decision, and each leaf node represents the class label or predicted value.
Key Concepts of Decision Trees:
• Root Node: Represents the entire dataset and the first feature to split on.
• Decision Nodes: Nodes where the dataset is further split.
• Leaf Nodes: Nodes where the final classification or regression outcome is present.
• Branches: The splitting decision that leads from one node to another.

2.6.2 Decision Criteria: Gini Index and Information Gain


To decide the best feature to split the dataset, Decision Trees use metrics like the Gini Index or
Information Gain. These criteria evaluate how "pure" or "impure" the data is after the split.

Gini Index
The Gini Index is used to measure the impurity of a dataset. Lower Gini index values indicate
better splits. It works by evaluating how often a randomly chosen element from the dataset would
be incorrectly classified.
Formula for Gini Index:
$$\text{Gini}(D) = 1 - \sum_{i=1}^{n} p_i^2$$

Where:
• pi is the proportion of class i in the dataset.

Information Gain (Entropy)


The Information Gain measures how much "information" is gained by splitting the data based
on a particular feature. It is calculated as the difference between the entropy before the split and
the entropy after the split.
Formula for Entropy:
$$\text{Entropy}(D) = -\sum_{i=1}^{n} p_i \log_2(p_i)$$

Where:



• pi is the proportion of class i in the dataset.
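
Both impurity measures are easy to compute directly. The helper functions below are an added sketch (not from the text) that evaluates them for any list of class labels:

import numpy as np

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # Shannon entropy in bits: minus the sum of p * log2(p)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

labels = ["Buy", "Buy", "Not Buy", "Buy", "Not Buy"]
print(f"Gini    = {gini(labels):.3f}")     # 1 - (0.6^2 + 0.4^2) = 0.48
print(f"Entropy = {entropy(labels):.3f}")  # about 0.971 bits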

2.6.3 Example of a Decision Tree Algorithm with Calculations


We will use a simple example of customer data to build a decision tree. Our goal is to predict
whether a customer will buy a product based on two features: Age and Income.

2.6.4 Dataset
Customer Age Income Outcome (Buy/Not Buy)
1 Young Low Not Buy
2 Young High Buy
3 Middle Low Buy
4 Old Low Not Buy
5 Old High Buy
6 Middle High Buy
7 Young Low Not Buy
8 Old Low Buy
9 Young High Buy

Step-by-Step Calculation
Step 1: Gini Index Calculation for the Whole Dataset
First, we calculate the Gini index for the target variable (Outcome). We have two outcomes:
"Buy" and "Not Buy". In the dataset:
• 5 customers decided to Buy
• 4 customers decided Not Buy
The Gini index for the whole dataset is:

$$\text{Gini}(D) = 1 - \left[\left(\frac{5}{9}\right)^2 + \left(\frac{4}{9}\right)^2\right] = 1 - \left(\frac{25}{81} + \frac{16}{81}\right) = 1 - \frac{41}{81} = 0.49$$

Step 2: Gini Index for "Age"


Now, we calculate the Gini index for each category of Age: Young, Middle, and Old.
For Age = Young:

$$\text{Gini}(\text{Young}) = 1 - \left[\left(\frac{2}{4}\right)^2 + \left(\frac{2}{4}\right)^2\right] = 1 - 0.5 = 0.5$$



For Age = Middle:

$$\text{Gini}(\text{Middle}) = 1 - \left(\frac{2}{2}\right)^2 = 0$$

For Age = Old:

$$\text{Gini}(\text{Old}) = 1 - \left[\left(\frac{2}{3}\right)^2 + \left(\frac{1}{3}\right)^2\right] = 1 - \frac{5}{9} = 0.44$$

The weighted Gini index for Age is:


$$\text{Gini}(\text{Age}) = \frac{4}{9} \times 0.5 + \frac{2}{9} \times 0 + \frac{3}{9} \times 0.44 = 0.222 + 0 + 0.147 = 0.369$$

Step 3: Gini Index for "Income"


Now, we calculate the Gini index for Income.
For Income = Low:

$$\text{Gini}(\text{Low}) = 1 - \left[\left(\frac{3}{5}\right)^2 + \left(\frac{2}{5}\right)^2\right] = 0.48$$

For Income = High:

$$\text{Gini}(\text{High}) = 1 - \left[\left(\frac{3}{4}\right)^2 + \left(\frac{1}{4}\right)^2\right] = 0.375$$

The weighted Gini index for Income is:


$$\text{Gini}(\text{Income}) = \frac{5}{9} \times 0.48 + \frac{4}{9} \times 0.375 = 0.267 + 0.167 = 0.434$$

Step 4: Information Gain for "Age"


Now, let’s calculate the entropy for Age.
For Age = Young:
 
2 2 2 2
Entropy(Y oung) = − log2 + log2 =1
4 4 4 4

For Age = Middle:


$$\text{Entropy}(\text{Middle}) = 0$$

For Age = Old:

$$\text{Entropy}(\text{Old}) = -\left(\frac{2}{3}\log_2\frac{2}{3} + \frac{1}{3}\log_2\frac{1}{3}\right) = 0.918$$



The weighted entropy for Age is:

$$\text{Entropy}(\text{Age}) = \frac{4}{9} \times 1 + \frac{2}{9} \times 0 + \frac{3}{9} \times 0.918 = 0.444 + 0 + 0.306 = 0.75$$

The information gain for Age is:

$$\text{Information Gain}(\text{Age}) = \text{Entropy}(D) - \text{Entropy}(\text{Age}) = 0.940 - 0.75 = 0.19$$

[Figure: the resulting decision tree. The root node tests Age < 30; its left branch splits on Income and its right branch tests Age < 50, with leaf nodes labeled Buy or Not Buy.]

2.6.5 Step-by-Step Algorithm for Building a Decision Tree


The Decision Tree Algorithm consists of creating Root Nodes, Decision Nodes, Leaf Nodes,
and Branches by recursively selecting the best feature based on the Gini Index or Information
Gain. The following steps outline how to build a decision tree:
1. Start with the Entire Dataset: The first step in building the decision tree is to consider
the entire dataset as the root of the tree. Each feature will be evaluated to determine which
provides the best split.



2. Calculate Gini Index/Information Gain: For each feature, calculate the Gini Index
or Information Gain to assess the quality of the split.
• Gini Index: Measures the impurity of a dataset. The formula is:

$$\text{Gini} = 1 - \sum_i (p_i)^2$$

where pi is the probability of each class in the dataset.

• Information Gain: Measures the reduction in entropy after a dataset is split on a feature. The formula is:

$$IG = \text{Entropy(parent)} - \sum \frac{n}{N}\,\text{Entropy(child)}$$

where n is the size of the child node, and N is the size of the parent node.
3. Select the Best Feature: Based on the calculated Gini Index or Information Gain, select
the feature that provides the best split (i.e., the feature with the lowest Gini Index or the
highest Information Gain). This feature becomes the Root Node of the tree.
4. Create Decision Nodes: For each possible value of the selected feature, create Decision
Nodes. These nodes represent the decisions made by splitting the dataset based on the
feature values.
5. Split the Dataset: Split the dataset into subsets according to the feature values chosen
at the root. The records in each subset share the same value for the root feature.
6. Repeat for Each Subset: For each subset created from the split, repeat the process of
calculating the Gini Index or Information Gain for each feature, and select the best feature
for further splitting. The chosen features become new Decision Nodes.
7. Create Leaf Nodes: When no further splits are possible (i.e., all records in a subset
belong to the same class), create a Leaf Node for that subset. This leaf node represents
the final prediction or class label.
8. Continue Until All Data is Classified: Continue recursively splitting the dataset and
creating decision nodes and leaf nodes until all records are classified or some stopping
criterion (e.g., maximum depth or minimum node size) is met.
9. Stop and Form the Tree: Once the process is complete, the decision tree will consist of
a Root Node, several Decision Nodes, Leaf Nodes, and Branches connecting them.
The tree can now be used to classify new data points by following the path from the root
to the leaf nodes.

2.6.6 Decision Tree Pruning


Pruning is a technique used in decision trees to reduce their complexity and increase their gener-
alization ability. Decision trees are prone to overfitting, especially when they grow deep and start
capturing noise and outliers in the training data. Pruning helps address this issue by trimming
unnecessary branches and nodes, resulting in a simpler tree that performs better on new, unseen
data.



Why Pruning is Important

• Overfitting Prevention: Unpruned decision trees can overfit the training data, meaning
they learn patterns that are specific to the training data and do not generalize well to new
data.

• Improved Interpretability: Pruned trees are smaller and easier to interpret compared
to large, complex trees.

• Reduced Complexity: Pruned trees have fewer nodes and branches, making them less
complex and computationally efficient.

• Improved Generalization: Pruned trees are better at generalizing to unseen data and
thus perform better on test datasets.

Types of Pruning

There are two main types of pruning techniques:

Pre-pruning (Early Stopping)

In pre-pruning, the growth of the decision tree is halted early based on certain conditions. Com-
mon pre-pruning criteria include limiting the maximum depth of the tree, requiring a minimum
number of samples in a node before splitting, or setting a threshold on the minimum impurity
decrease. Pre-pruning helps avoid building an excessively large tree in the first place, but it might
also prevent capturing useful patterns.

Post-pruning (Reduced Error Pruning)

In post-pruning, a fully grown decision tree is created first, and then branches that contribute
less to the overall prediction accuracy are removed. The decision to prune a node is based on
metrics like the reduction in accuracy or a pruning criterion such as the complexity parameter.
Post-pruning generally results in better generalization as it uses the entire tree to evaluate which
branches to prune.

Pruning Process with an Example

Suppose we have the following decision tree built using customer data to predict whether a
customer will buy a product:



[Figure: a decision tree whose root tests Age < 30; the left branch tests Income > 50K and the right branch tests Age < 50, each leading to Buy or Not Buy leaves.]

1. Pre-pruning: During the tree-building process, we stop growing the tree if a node has
fewer than 5 samples or if the depth of the tree reaches 3.
2. Post-pruning: After constructing the tree, we examine each node and remove the branches
that contribute the least to prediction accuracy. For example, suppose the node "Income >
50K?" does not significantly improve the classification accuracy. We can prune this branch
and replace it with the majority class ("Buy").

Pruning Metrics
• Gini Index: Measures the impurity of a node. Lower Gini index values indicate purer
nodes.
• Entropy: Another measure of impurity used in information gain. Lower entropy values
indicate purer nodes.
• Cost Complexity Pruning (CCP): A method in which nodes are removed based on a
cost-complexity measure that considers both the size of the tree and the error rate.

Implementing Pruning in Scikit-learn


Scikit-learn provides parameters to control tree growth and pruning:
• Pre-pruning Parameters:
– max_depth: Limits the maximum depth of the tree.



– min_samples_split: The minimum number of samples required to split a node.

– min_samples_leaf: The minimum number of samples required to be at a leaf node.

– min_impurity_decrease: A node will be split if this split induces a decrease in impurity greater than or equal to this value.

• Post-pruning Parameter:

– ccp_alpha: Complexity parameter used for Minimal Cost-Complexity Pruning.

Example Code

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

# Load dataset
X, y = load_iris(return_X_y=True)

# Train a decision tree with pre-pruning using max_depth
clf_prepruned = DecisionTreeClassifier(max_depth=3, random_state=0)
clf_prepruned.fit(X, y)

# Train a fully grown tree for post-pruning
clf_full = DecisionTreeClassifier(random_state=0)
clf_full.fit(X, y)

# Apply post-pruning using ccp_alpha
path = clf_full.cost_complexity_pruning_path(X, y)
ccp_alphas = path.ccp_alphas  # Get effective alphas
clf_postpruned = DecisionTreeClassifier(random_state=0, ccp_alpha=ccp_alphas[-2])  # Prune using the second-to-last alpha value
clf_postpruned.fit(X, y)

# Plot the original and pruned trees
fig, ax = plt.subplots(1, 2, figsize=(12, 5))
plot_tree(clf_full, filled=True, ax=ax[0], fontsize=8)
ax[0].set_title("Full Tree")

plot_tree(clf_postpruned, filled=True, ax=ax[1], fontsize=8)
ax[1].set_title("Post-Pruned Tree")

plt.show()



2.6.7 Advantages and Disadvantages of Pruning

Pre-Pruning
  Advantages:
  • Reduces computational cost by limiting tree growth.
  • Avoids overfitting during tree construction.
  Disadvantages:
  • May lead to underfitting if stopped too early.
  • Difficult to choose the right stopping criteria.

Post-Pruning
  Advantages:
  • Results in simpler, more interpretable trees.
  • Generally leads to better performance on unseen data.
  Disadvantages:
  • Requires constructing a fully grown tree, which can be computationally expensive.
  • Additional complexity in choosing pruning criteria.

Pruning is a crucial step in building decision trees that are not only effective but also efficient. By
understanding the different pruning techniques and their impact, one can create decision trees that
generalize well and avoid overfitting, making them suitable for various real-world applications.

Based on the Gini Index or Information Gain, the decision tree algorithm selects the feature that
best splits the data at each step, creating Root Nodes, Decision Nodes, and Leaf Nodes.
The tree grows recursively until all records are classified or a stopping criterion is met. This
step-by-step approach allows decision trees to effectively classify data into distinct categories or
predict continuous outcomes.



2.7 Ensembles of Decision Trees
Ensemble methods in machine learning use multiple models to improve the performance and
accuracy of predictions. These techniques combine weak learners, such as decision trees, to create
a strong learner. The four key ensemble methods are Bagging (Bootstrap Aggregating),
Random Forests, Boosting, and Gradient Boosting.
• Bagging: Creates multiple models on different bootstrap samples of the data and combines
their predictions. Reduces variance and overfitting.
• Random Forests: A type of bagging where each tree is also trained on a random subset
of features. Further reduces correlation between trees and improves generalization.
• Boosting: Sequentially trains models, focusing on correcting the errors of previous ones.
Improves accuracy but can be sensitive to noise.
• Gradient Boosting: A specific type of boosting that uses gradients to minimize the loss
function. Often achieves high accuracy but can be computationally expensive.

2.7.1 Bagging (Bootstrap Aggregating)


Bagging, short for Bootstrap Aggregating, is an ensemble technique that aims to reduce
the variance of machine learning models. The main idea is to train multiple models on different
random subsets of the data (created through bootstrapping) and then aggregate their predictions.
The final prediction is the majority vote (for classification) or the average (for regression).

Algorithm for Bagging


1. Randomly select subsets of the training data with replacement (bootstrapping).
2. Train a decision tree on each subset.
3. Aggregate the predictions from all the trees (majority voting for classification, averaging
for regression).

Example: Predicting House Prices


Consider a dataset with features such as the number of rooms, square footage, and house price:

Rooms    Square Footage    Price (in $)
3        1500              300,000
2        1000              200,000
4        1800              400,000
5        2200              500,000

We create multiple subsets of the data by sampling with replacement and train a decision tree
on each subset. The final predicted price for a new house is the average of the predictions from
all the trees.
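
As a concrete sketch (an added illustration, assuming a recent scikit-learn version that takes the estimator parameter, as in the comprehensive example later in this section), the same idea looks like this on the toy table above:

import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Toy data from the table above: [rooms, square footage] -> price
X = np.array([[3, 1500], [2, 1000], [4, 1800], [5, 2200]])
y = np.array([300_000, 200_000, 400_000, 500_000])

# Each tree is trained on a bootstrap sample of the rows and the final
# prediction is the average over all trees
bagging = BaggingRegressor(estimator=DecisionTreeRegressor(),
                           n_estimators=20, random_state=0)
bagging.fit(X, y)

new_house = np.array([[4, 2000]])
print("Bagged price prediction:", bagging.predict(new_house)[0])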



2.7.2 Random Forests
Random Forest is a specific type of bagging technique that introduces additional randomness
by selecting random subsets of features at each node of a decision tree. This makes the trees less
correlated and improves the model’s generalization.

Algorithm for Random Forests


1. Randomly sample the training data with replacement to create multiple bootstrap samples.
2. For each tree, select a random subset of features at each node to split on.
3. Build the decision tree using the selected features and data.
4. Aggregate the predictions from all trees (majority voting for classification, averaging for
regression).

Example: Classifying Emails as Spam or Not Spam


Consider a dataset with features such as the number of "Buy" and "Free" words in emails:

Buy (Count) Free (Count) Class (Spam/Not Spam)


3 2 Spam
0 1 Not Spam
2 3 Spam
1 0 Not Spam

For each tree in the random forest, a random subset of features is selected to split on. The final
classification is based on the majority vote from all the trees.

2.7.3 Boosting
Boosting is an ensemble technique where models are trained sequentially, and each model tries
to correct the errors made by the previous model. Unlike bagging, where trees are built indepen-
dently, boosting builds trees in a sequence, and each tree focuses on the mistakes of the previous
one.

Algorithm for Boosting


1. Train an initial model on the dataset.
2. Calculate the error of the model.
3. Focus on the misclassified or poorly predicted instances by adjusting their weights.
4. Train a new model that focuses on the errors of the previous model.
5. Repeat the process for a set number of models.



Example: Predicting Credit Card Fraud
Consider a dataset where the goal is to classify whether a transaction is fraudulent or not:

Transaction Amount    Country    Merchant Type    Fraud (0/1)
500                   USA        Retail           0
3000                  UK         Online           1
100                   USA        Retail           0
2500                  UK         Online           1

The first model may classify most transactions correctly, but some may be misclassified. The
second model will focus on the misclassified transactions by adjusting their weights, and this
process continues iteratively.

2.7.4 Gradient Boosting


Gradient Boosting is a type of boosting where the models are trained sequentially, and each
new model tries to reduce the residual errors of the previous model. It minimizes the loss function
(e.g., mean squared error for regression) using gradient descent.

Algorithm for Gradient Boosting


1. Initialize the model with a simple prediction (e.g., the mean value for regression).
2. Calculate the residuals (the difference between the actual and predicted values).
3. Train a new decision tree to predict the residuals.
4. Update the model by adding the predictions of the new tree to the previous predictions.
5. Repeat the process for a set number of trees.

Example: Predicting House Prices Using Gradient Boosting


Consider the same house price dataset:

Rooms    Square Footage    Price (in $)
3        1500              300,000
2        1000              200,000
4        1800              400,000
5        2200              500,000

1. The first model predicts the average house price (e.g., 350,000).
2. The residuals are calculated: Residual = Actual Price - Predicted Price.
3. A new decision tree is trained to predict the residuals, and the model is updated with the predictions from this tree.
4. The process continues iteratively, improving the model's predictions with each new tree.
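
The loop below is a bare-bones sketch of these steps for squared-error loss (an added illustration, not the text's own code; scikit-learn's GradientBoostingRegressor automates exactly this kind of residual fitting):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[3, 1500], [2, 1000], [4, 1800], [5, 2200]])
y = np.array([300_000, 200_000, 400_000, 500_000], dtype=float)

# Step 1: start from a constant prediction (the mean price)
prediction = np.full_like(y, y.mean())

# Steps 2-4: repeatedly fit a small tree to the residuals and update the model
learning_rate = 0.5
for round_number in range(3):
    residuals = y - prediction                     # what the current model still gets wrong
    tree = DecisionTreeRegressor(max_depth=1, random_state=0)
    tree.fit(X, residuals)
    prediction += learning_rate * tree.predict(X)  # nudge predictions toward the targets
    print(f"Round {round_number + 1}: mean absolute residual = {np.abs(y - prediction).mean():,.0f}")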



Comprehensive Example: Bagging, Random Forest, Boosting, and Gradient Boost-
ing in Python

Let’s consider a comprehensive example using the sklearn library to demonstrate the differences
between Bagging, Random Forest, Boosting, and Gradient Boosting for a classification problem.
We will use the Breast Cancer dataset from sklearn.datasets to classify whether a tumor is benign
or malignant. Each method will be trained on the same dataset and evaluated for performance.

# Import necessary libraries
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (BaggingClassifier, RandomForestClassifier,
                              AdaBoostClassifier, GradientBoostingClassifier)
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data    # Features
y = data.target  # Target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 1. Bagging Classifier (Base: Decision Tree) - Use 'estimator' instead of 'base_estimator'
bagging = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50, random_state=42)
bagging.fit(X_train, y_train)
y_pred_bagging = bagging.predict(X_test)

# 2. Random Forest Classifier
random_forest = RandomForestClassifier(n_estimators=50, random_state=42)
random_forest.fit(X_train, y_train)
y_pred_rf = random_forest.predict(X_test)

# 3. AdaBoost Classifier (Boosting) - Specify the SAMME algorithm
adaboost = AdaBoostClassifier(estimator=DecisionTreeClassifier(), n_estimators=50,
                              random_state=42, algorithm='SAMME')
adaboost.fit(X_train, y_train)
y_pred_adaboost = adaboost.predict(X_test)

# 4. Gradient Boosting Classifier
gradient_boosting = GradientBoostingClassifier(n_estimators=50, random_state=42)
gradient_boosting.fit(X_train, y_train)
y_pred_gb = gradient_boosting.predict(X_test)

# Evaluate each model's performance
print("Bagging Classifier Accuracy: {:.2f}%".format(accuracy_score(y_test, y_pred_bagging) * 100))
print("Bagging Classifier Report:\n", classification_report(y_test, y_pred_bagging, target_names=data.target_names))

print("Random Forest Classifier Accuracy: {:.2f}%".format(accuracy_score(y_test, y_pred_rf) * 100))
print("Random Forest Classifier Report:\n", classification_report(y_test, y_pred_rf, target_names=data.target_names))

print("AdaBoost Classifier Accuracy: {:.2f}%".format(accuracy_score(y_test, y_pred_adaboost) * 100))
print("AdaBoost Classifier Report:\n", classification_report(y_test, y_pred_adaboost, target_names=data.target_names))

print("Gradient Boosting Classifier Accuracy: {:.2f}%".format(accuracy_score(y_test, y_pred_gb) * 100))
print("Gradient Boosting Classifier Report:\n", classification_report(y_test, y_pred_gb, target_names=data.target_names))

# Plotting Feature Importance for Random Forest and Gradient Boosting
fig, axes = plt.subplots(1, 2, figsize=(16, 5))
fig.suptitle('Feature Importances')

# Random Forest Feature Importance
axes[0].barh(data.feature_names, random_forest.feature_importances_)
axes[0].set_title('Random Forest Feature Importance')
axes[0].set_xlabel('Importance')

# Gradient Boosting Feature Importance
axes[1].barh(data.feature_names, gradient_boosting.feature_importances_)
axes[1].set_title('Gradient Boosting Feature Importance')
axes[1].set_xlabel('Importance')

plt.tight_layout()
plt.show()



Bagging Classifier Accuracy: 95.91%
Bagging Classifier Report:
precision recall f1-score support

malignant 0.95 0.94 0.94 63


benign 0.96 0.97 0.97 108

accuracy 0.96 171


macro avg 0.96 0.95 0.96 171
weighted avg 0.96 0.96 0.96 171

Random Forest Classifier Accuracy: 97.08%


Random Forest Classifier Report:
precision recall f1-score support

malignant 0.98 0.94 0.96 63


benign 0.96 0.99 0.98 108

accuracy 0.97 171


macro avg 0.97 0.96 0.97 171
weighted avg 0.97 0.97 0.97 171

AdaBoost Classifier Accuracy: 92.98%


AdaBoost Classifier Report:
precision recall f1-score support

malignant 0.88 0.94 0.91 63


benign 0.96 0.93 0.94 108

accuracy 0.93 171


macro avg 0.92 0.93 0.93 171
weighted avg 0.93 0.93 0.93 171

Gradient Boosting Classifier Accuracy: 95.91%


Gradient Boosting Classifier Report:
precision recall f1-score support

malignant 0.95 0.94 0.94 63


benign 0.96 0.97 0.97 108

accuracy 0.96 171


macro avg 0.96 0.95 0.96 171
weighted avg 0.96 0.96 0.96 171



2.7.5 Comparison of Ensemble Methods

Bagging
  Advantages:
  • Reduces variance by averaging predictions from multiple models.
  • Handles high variance datasets well.
  • Can improve the stability and accuracy of machine learning algorithms.
  Disadvantages:
  • Does not reduce bias significantly.
  • Computationally expensive due to multiple models.

Random Forests
  Advantages:
  • Reduces both variance and overfitting by using random subsets of features.
  • Works well with high-dimensional data.
  • Less sensitive to noisy data.
  Disadvantages:
  • Less interpretable than a single decision tree.
  • Can still suffer from bias if the base model is too simple.

Boosting
  Advantages:
  • Reduces both bias and variance.
  • Focuses on correcting the mistakes of the previous models, improving accuracy.
  • Suitable for imbalanced datasets.
  Disadvantages:
  • Prone to overfitting if not properly regularized.
  • Slower to train due to sequential model building.

Gradient Boosting
  Advantages:
  • Produces highly accurate models.
  • Works well with complex datasets and captures intricate relationships.
  • Can handle both classification and regression tasks effectively.
  Disadvantages:
  • Computationally expensive and slower to train.
  • Requires careful tuning of hyperparameters.
  • More prone to overfitting compared to Random Forests.

Feature-by-feature comparison of Bagging, Random Forests, Boosting, and Gradient Boosting:

Model Type
  • Bagging: Can be applied to any base model, for both regression and classification.
  • Random Forest: Specific to decision trees, but works for both regression and classification.
  • Boosting: Can be applied to any weak learner, used for both regression and classification.
  • Gradient Boosting: Typically applied to decision trees, used for both regression and classification.

Feature Selection
  • Bagging: All features are considered for splits.
  • Random Forest: A random subset of features is used at each split.
  • Boosting: All features are considered at each step.
  • Gradient Boosting: Each tree corrects the errors of previous models by fitting to residuals.

Variance Reduction
  • Bagging: Reduces variance by averaging the predictions of multiple models.
  • Random Forest: Further reduces variance by decorrelating trees using random feature selection.
  • Boosting: Reduces variance and bias by focusing on correcting the mistakes made by previous models.
  • Gradient Boosting: Primarily reduces bias by minimizing residuals in subsequent models.

Interpretability
  • Bagging: Base models retain their interpretability.
  • Random Forest: Less interpretable due to random feature selection.
  • Boosting: Less interpretable due to sequential learning.
  • Gradient Boosting: Least interpretable due to iterative corrections of residuals.

Performance
  • Bagging: Works well with high-variance models.
  • Random Forest: Often performs better than bagging due to the additional randomness and decorrelation of trees.
  • Boosting: Generally performs better than bagging and random forests on complex datasets.
  • Gradient Boosting: Highly accurate, but can overfit without proper tuning.

Sequential Training
  • Bagging: No, models are trained independently.
  • Random Forest: No, trees are trained independently.
  • Boosting: Yes, models are trained sequentially, focusing on correcting previous errors.
  • Gradient Boosting: Yes, models are trained sequentially, focusing on reducing residual errors.

Parallelization
  • Bagging: Easy to parallelize, as models are independent.
  • Random Forest: Easy to parallelize, as trees are independent.
  • Boosting: Harder to parallelize due to sequential training.
  • Gradient Boosting: Harder to parallelize due to sequential training.

Bias vs. Variance Tradeoff
  • Bagging: Reduces variance but may still have bias.
  • Random Forest: Reduces variance but can have bias if the features are not sufficiently decorrelated.
  • Boosting: Reduces both bias and variance over time by focusing on misclassified or mispredicted examples.
  • Gradient Boosting: Primarily reduces bias by focusing on residuals.

Ensemble methods such as Bagging, Random Forests, Boosting, and Gradient Boosting are powerful
tools for improving the accuracy and robustness of decision trees. By combining multiple trees,
these methods reduce overfitting, enhance model stability, and deliver better performance compared
to individual trees.

2.8 Kernelized Support Vector Machines


Support Vector Machines (SVMs) are a powerful set of supervised learning methods used for
classification and regression tasks. The core idea of SVM is to find a hyperplane that best
separates data points of different classes. However, in many real-world problems, the data is
not linearly separable in its original feature space. To address this, kernelized SVMs are used,
which apply the "kernel trick" to implicitly transform the input features into a higher-dimensional
space where a linear separator can be found.

2.8.1 The Kernel Trick


The kernel trick allows SVM to fit the maximum-margin hyperplane in a transformed feature space
without explicitly computing the transformation. Instead, it uses a kernel function to compute the
dot product of the transformed feature vectors directly. This makes it computationally efficient,
even for high-dimensional transformations.
Mathematically, instead of applying the transformation ϕ(x) and computing ⟨ϕ(x), ϕ(x′ )⟩, the
kernel function K(x, x′ ) computes the inner product in the transformed space:

K(x, x′ ) = ⟨ϕ(x), ϕ(x′ )⟩
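
The following short numerical sketch (an illustration added here, not part of any library API)
verifies this identity for a degree-2 polynomial kernel K(x, x′) = (x · x′)² on two-dimensional
inputs, whose explicit feature map is ϕ(x) = (x₁², √2·x₁x₂, x₂²):

import numpy as np

def phi(x):
    # Explicit degree-2 feature map for 2-D inputs (illustrative helper)
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
x_prime = np.array([3.0, 0.5])

explicit = np.dot(phi(x), phi(x_prime))   # inner product in the transformed space
kernel = np.dot(x, x_prime) ** 2          # kernel trick: no explicit transformation

print(explicit, kernel)                   # both evaluate to 16.0

Both quantities agree, which is why an SVM can operate in the transformed space while only ever
evaluating K(x, x′) on the original features.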

2.8.2 Common Kernel Functions


Several kernel functions are commonly used in SVMs, each allowing for different types of decision
boundaries:
• Linear Kernel: This kernel is used when the data is linearly separable. It computes the
standard dot product between feature vectors:

K(x, x′ ) = x · x′

• Polynomial Kernel: This kernel allows for curved decision boundaries. It is useful for
capturing interactions between features.

K(x, x′ ) = (γx · x′ + r)d

where d is the degree of the polynomial, γ is a scaling factor, and r is a constant.


• Radial Basis Function (RBF) Kernel: The RBF kernel is the most widely used kernel.
It can create complex, non-linear decision boundaries.

K(x, x′) = exp(−γ∥x − x′∥²)

where γ is a parameter that controls the influence of each training example.



• Sigmoid Kernel: This kernel is related to neural networks and is often used as an
  approximation for them.

  K(x, x′) = tanh(γx · x′ + r)
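
As a quick, hedged illustration, scikit-learn's pairwise kernel helpers can evaluate each of these
functions on a small toy matrix; the parameter values below are arbitrary choices for demonstration:

import numpy as np
from sklearn.metrics.pairwise import (linear_kernel, polynomial_kernel,
                                      rbf_kernel, sigmoid_kernel)

# Three toy samples with two features each
X = np.array([[0.0, 1.0],
              [1.0, 1.0],
              [2.0, 0.0]])

print("Linear:\n", linear_kernel(X))
print("Polynomial (degree=2):\n", polynomial_kernel(X, degree=2, gamma=1, coef0=1))
print("RBF (gamma=0.5):\n", rbf_kernel(X, gamma=0.5))
print("Sigmoid:\n", sigmoid_kernel(X, gamma=0.1, coef0=0))

The same kernel names ('linear', 'poly', 'rbf', 'sigmoid') are passed to the kernel parameter of
sklearn.svm.SVC, as the examples later in this section show.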

2.8.3 Choosing the Right Kernel


Selecting the right kernel depends on the nature of the data and the problem. Linear kernels are
used for linearly separable data, while non-linear kernels such as the RBF and polynomial kernels
are used for more complex datasets.
Example: If you are working on a dataset where the decision boundary is clearly a straight line,
the linear kernel would suffice. However, if the decision boundary is more complex, the RBF
kernel would be more appropriate, as it can map the data to a higher-dimensional space and
create a non-linear decision boundary.

2.8.4 Regularization and Hyperparameters


Kernelized SVMs have several hyperparameters that control the model’s complexity and perfor-
mance:
• C: The regularization parameter that controls the trade-off between maximizing the margin
and minimizing classification error. A lower value of C creates a wider margin but allows
for some misclassifications, while a higher value penalizes misclassification more heavily,
potentially leading to overfitting.
• Gamma: In RBF and polynomial kernels, γ controls the reach of each individual training
  example. A small γ gives each example a far-reaching influence, producing smoother decision
  boundaries, while a large γ restricts its influence to nearby examples, which can lead to more
  complex boundaries and overfitting. A tuning sketch for C and γ follows below.
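
A common way to choose C and γ is a cross-validated grid search. The following sketch (the grid
values are illustrative assumptions, not recommendations) tunes an RBF-kernel SVC inside a
pipeline so that scaling is refit on each training fold:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scale inside the pipeline so the scaler is refit on each training fold
pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
param_grid = {'svc__C': [0.1, 1, 10, 100],
              'svc__gamma': [0.001, 0.01, 0.1, 1]}

grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print("Test accuracy: {:.2f}%".format(grid.score(X_test, y_test) * 100))

Keeping the scaler inside the pipeline avoids leaking information from the validation folds into
the model selection step.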

2.8.5 Advantages of Kernelized SVMs


• Flexibility: Kernelized SVMs are highly flexible and can model both linear and non-linear
relationships by choosing the appropriate kernel.
• Robustness to High-Dimensional Data: SVMs perform well in high-dimensional spaces
and are less prone to overfitting compared to other classifiers.
• Support for Non-linear Boundaries: By using non-linear kernels, SVMs can separate
data that is not linearly separable in its original feature space.

2.8.6 Disadvantages of Kernelized SVMs


• Computationally Intensive: The kernel trick requires calculating the dot product of
every pair of data points, which can become computationally expensive as the dataset
grows.
• Requires Careful Tuning: The performance of an SVM depends heavily on selecting the
right kernel and tuning the hyperparameters such as C and γ, which can be time-consuming.



2.8.7 Example of Kernelized SVM
Multiclass Classification with SVM Linear Models: Wine Dataset Example
In this example, we’ll use the Wine dataset from the scikit-learn library. The Wine dataset
consists of 178 samples of wine, with 13 features and 3 classes. We’ll build a multiclass classifier
using Logistic Regression and Support Vector Classifier (SVC) to classify the wines into their
respective categories.

Dataset Description
Features: Alcohol, Malic acid, Ash, Alkalinity of ash, Magnesium, Total phenols, Flavanoids,
Nonflavanoid phenols, Proanthocyanins, Color intensity, Hue, OD280/OD315 of diluted wines,
and Proline.
Target classes: 3 types of wine, labeled as 0, 1, and 2.

# Import necessary libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.preprocessing import StandardScaler  # Import scaler
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Load the Wine dataset
wine = load_wine()
X = wine.data    # Features
y = wine.target  # Target classes

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3,
                                                    random_state=42)

# Initialize and train a Logistic Regression model with increased max_iter
logistic_reg = LogisticRegression(multi_class='ovr', solver='lbfgs', max_iter=500)
logistic_reg.fit(X_train, y_train)

# Predict on the test set using Logistic Regression
y_pred_logistic = logistic_reg.predict(X_test)



# Initialize and train a Support Vector Classifier (SVC) model with a linear kernel
svm_classifier = SVC(kernel='linear', decision_function_shape='ovr')
svm_classifier.fit(X_train, y_train)

# Predict on the test set using SVC
y_pred_svm = svm_classifier.predict(X_test)

# Evaluate the Logistic Regression model
print("Logistic Regression Classification Report:\n",
      classification_report(y_test, y_pred_logistic, target_names=wine.target_names))
print("Logistic Regression Accuracy: {:.2f}%".format(
    accuracy_score(y_test, y_pred_logistic) * 100))

# Evaluate the SVM model
print("\nSupport Vector Classifier (SVC) Classification Report:\n",
      classification_report(y_test, y_pred_svm, target_names=wine.target_names))
print("SVM Classifier Accuracy: {:.2f}%".format(
    accuracy_score(y_test, y_pred_svm) * 100))

# Plot confusion matrices
fig, axes = plt.subplots(1, 2, figsize=(12, 6))

# Confusion matrix for Logistic Regression
conf_matrix_logistic = confusion_matrix(y_test, y_pred_logistic)
df_cm_logistic = pd.DataFrame(conf_matrix_logistic, index=wine.target_names,
                              columns=wine.target_names)
sns.heatmap(df_cm_logistic, annot=True, fmt='d', cmap='Blues', ax=axes[0])
axes[0].set_title('Logistic Regression Confusion Matrix')
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('Actual')

# Confusion matrix for SVM
conf_matrix_svm = confusion_matrix(y_test, y_pred_svm)
df_cm_svm = pd.DataFrame(conf_matrix_svm, index=wine.target_names,
                         columns=wine.target_names)
sns.heatmap(df_cm_svm, annot=True, fmt='d', cmap='Blues', ax=axes[1])
axes[1].set_title('SVM Confusion Matrix')
axes[1].set_xlabel('Predicted')
axes[1].set_ylabel('Actual')



plt.tight_layout()
plt.show()

Logistic Regression Classification Report:
               precision    recall  f1-score   support

     class_0       1.00      1.00      1.00        19
     class_1       1.00      0.95      0.98        21
     class_2       0.93      1.00      0.97        14

    accuracy                           0.98        54
   macro avg       0.98      0.98      0.98        54
weighted avg       0.98      0.98      0.98        54

Logistic Regression Accuracy: 98.15%

Support Vector Classifier (SVC) Classification Report:
               precision    recall  f1-score   support

     class_0       1.00      1.00      1.00        19
     class_1       1.00      0.95      0.98        21
     class_2       0.93      1.00      0.97        14

    accuracy                           0.98        54
   macro avg       0.98      0.98      0.98        54
weighted avg       0.98      0.98      0.98        54

SVM Classifier Accuracy: 98.15%

Kernelized SVMs are a powerful and flexible tool for classification and regression tasks, especially
when the data is not linearly separable. By choosing the appropriate kernel function and tuning
hyperparameters, SVMs can model complex relationships and perform well even in high-dimensional
datasets.

Here is a detailed example of how to use Support Vector Machines (SVMs) with different kernel
functions (Polynomial, RBF, and Sigmoid) using the scikit-learn library in Python.

# Import necessary libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report
import matplotlib.pyplot as plt
import numpy as np

# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data[:, :2]  # Use only the first two features for visualization
y = iris.target       # Target classes

# Standardize the features
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# Define and train SVM models with different kernels
svm_poly = SVC(kernel='poly', degree=3, C=1, gamma='auto')
svm_poly.fit(X_train, y_train)

svm_rbf = SVC(kernel='rbf', C=1, gamma=0.5)
svm_rbf.fit(X_train, y_train)

svm_sigmoid = SVC(kernel='sigmoid', C=1, gamma='auto')
svm_sigmoid.fit(X_train, y_train)

# Make predictions
y_pred_poly = svm_poly.predict(X_test)
y_pred_rbf = svm_rbf.predict(X_test)
y_pred_sigmoid = svm_sigmoid.predict(X_test)

# Evaluate each model



print("Polynomial Kernel SVM Classification Report:\n",␣
,→classification_report(y_test, y_pred_poly, target_names=iris.target_names))
print("Polynomial Kernel SVM Accuracy: {:.2f}%".
,→format(accuracy_score(y_test, y_pred_poly) * 100))

print("\nRBF Kernel SVM Classification Report:\n",␣


,→classification_report(y_test, y_pred_rbf, target_names=iris.target_names))
print("RBF Kernel SVM Accuracy: {:.2f}%".format(accuracy_score(y_test,␣
,→y_pred_rbf) * 100))

print("\nSigmoid Kernel SVM Classification Report:\n",␣


,→classification_report(y_test, y_pred_sigmoid, target_names=iris.
,→target_names))

print("Sigmoid Kernel SVM Accuracy: {:.2f}%".


,→format(accuracy_score(y_test, y_pred_sigmoid) * 100))

# Plot the decision boundaries for each kernel
def plot_svm_decision_boundary(model, ax, title):
    # Create a mesh grid covering the feature space
    h = .02  # step size in the mesh
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

    # Plot decision boundary
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    ax.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.8)

    # Plot original data points
    scatter = ax.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm, s=20,
                         edgecolors='k')
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    ax.set_xticks(())
    ax.set_yticks(())
    ax.set_title(title)
    legend1 = ax.legend(*scatter.legend_elements(), title="Classes")
    ax.add_artist(legend1)

# Create subplots
fig, axes = plt.subplots(1, 3, figsize=(18, 5))



# Plot for Polynomial Kernel
plot_svm_decision_boundary(svm_poly, axes[0], 'Polynomial Kernel (Degree=3)')

# Plot for RBF Kernel
plot_svm_decision_boundary(svm_rbf, axes[1], 'RBF Kernel (Gamma=0.5)')

# Plot for Sigmoid Kernel
plot_svm_decision_boundary(svm_sigmoid, axes[2], 'Sigmoid Kernel')

plt.show()

Polynomial Kernel SVM Classification Report:
               precision    recall  f1-score   support

      setosa       1.00      0.89      0.94        19
  versicolor       0.54      1.00      0.70        13
   virginica       1.00      0.31      0.47        13

    accuracy                           0.76        45
   macro avg       0.85      0.73      0.71        45
weighted avg       0.87      0.76      0.74        45

Polynomial Kernel SVM Accuracy: 75.56%

RBF Kernel SVM Classification Report:
               precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        19
  versicolor       0.54      0.54      0.54        13
   virginica       0.54      0.54      0.54        13

    accuracy                           0.73        45
   macro avg       0.69      0.69      0.69        45
weighted avg       0.73      0.73      0.73        45

RBF Kernel SVM Accuracy: 73.33%

Sigmoid Kernel SVM Classification Report:
               precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        19
  versicolor       0.71      0.38      0.50        13
   virginica       0.58      0.85      0.69        13

    accuracy                           0.78        45
   macro avg       0.76      0.74      0.73        45
weighted avg       0.80      0.78      0.77        45

Sigmoid Kernel SVM Accuracy: 77.78%

2.9 Uncertainty Estimates from Classifiers


Uncertainty estimates from classifiers are essential in machine learning, as they provide not only
predictions but also a sense of how confident the model is about these predictions. This is
particularly important in tasks where incorrect predictions can have significant consequences,
such as in medical diagnosis or autonomous driving.

2.9.1 Why Do We Need Uncertainty Estimates?


In many real-world applications, it’s not enough for a classifier to output a single prediction. We
need to know how confident the model is in its predictions to:

• Make better decisions: Knowing the uncertainty helps in making more informed decisions.
  For example, in medical diagnosis, a model that is uncertain about a patient's condition
  might suggest further tests before making a final diagnosis.

• Improve model performance: By identifying areas where a model is uncertain, more data
  can be collected or more sophisticated models can be used to reduce uncertainty.

• Build trust in model predictions: In high-stakes scenarios like finance or healthcare,
  having uncertainty estimates helps build trust and validate the model's predictions.



2.9.2 Types of Classifiers and Uncertainty Estimates
1. Probabilistic Classifiers
Probabilistic classifiers output probabilities that serve as uncertainty estimates. For example,
logistic regression outputs the probability that a sample belongs to a particular class.
Example: In binary classification, if a logistic regression model predicts a probability of 0.8 for
class A, the model is 80% confident that the sample belongs to class A. However, if the predicted
probability is 0.55, the model is less certain.
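
The sketch below (an added illustration on the breast cancer data, with assumed settings) shows how
predict_proba exposes these per-class probabilities; the maximum probability per row can be read as
the model's confidence:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_test[:5])   # per-class probabilities for five samples
confidence = proba.max(axis=1)          # probability of the predicted class

print(np.round(proba, 3))
print("Prediction confidence:", np.round(confidence, 3))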

2. Non-Probabilistic Classifiers with Calibrated Probabilities


Some classifiers, such as Support Vector Machines (SVMs), do not inherently provide probability
estimates. However, techniques like Platt Scaling and Isotonic Regression can be used to
map their scores to probabilities.
• Platt Scaling: A method that fits a logistic regression model to the output of SVM scores.
• Isotonic Regression: A non-parametric calibration method that can improve probability
estimates.
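
A minimal sketch of both options with scikit-learn's CalibratedClassifierCV is shown below;
method='sigmoid' corresponds to Platt scaling and method='isotonic' to isotonic regression
(the dataset and settings are assumptions for illustration):

from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# A plain SVC only exposes decision_function scores, not probabilities
svm = SVC(kernel='rbf', gamma='scale')

# method='sigmoid' = Platt scaling; method='isotonic' = isotonic regression
calibrated_svm = CalibratedClassifierCV(svm, method='sigmoid', cv=5)
calibrated_svm.fit(X_train, y_train)

print(calibrated_svm.predict_proba(X_test[:3]))   # calibrated class probabilities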

3. Bayesian Neural Networks


Bayesian neural networks provide a distribution over the model’s weights, allowing uncertainty to
propagate to the predictions. This provides a distribution over the predicted class probabilities,
rather than a single point estimate.
Example: A Bayesian neural network predicting an image as 60% dog and 40% cat can also
provide an uncertainty estimate on these predictions.

4. Ensemble Methods
Ensemble methods, such as Random Forests and Gradient Boosting, average predictions from
multiple models. The variation in the predictions across different models can be interpreted as
an uncertainty estimate.
Example: In a Random Forest, if the trees consistently predict the same class, the model is
confident. If the trees are divided, the model is uncertain.
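
The following sketch (an added illustration; the dataset and settings are assumed) reads the per-tree
votes of a fitted random forest and turns the disagreement between trees into a simple uncertainty
score:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

# Each tree's vote for each test sample (class labels here are already 0/1)
votes = np.stack([tree.predict(X_test) for tree in forest.estimators_])
class1_fraction = votes.mean(axis=0)   # fraction of trees voting for class 1

# 0.0 = all trees agree; values near 0.5 = maximal disagreement (high uncertainty)
disagreement = 1 - np.maximum(class1_fraction, 1 - class1_fraction)

print("Class-1 vote fraction (first 5):", np.round(class1_fraction[:5], 2))
print("Disagreement score (first 5):", np.round(disagreement[:5], 2))

The forest's predict_proba reports a closely related quantity, the class probabilities averaged
across trees.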

5. Monte Carlo Dropout in Neural Networks


Monte Carlo dropout is a technique where dropout is applied at inference time, allowing the
model to make multiple predictions and use the variance in those predictions as an uncertainty
estimate.
Example: For house price prediction, a model might apply dropout at inference to produce a
range of predicted prices, and the variance in the predictions can be interpreted as the uncertainty.
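
A hedged sketch of this idea with Keras is shown below; it assumes TensorFlow is installed, and the
small network and synthetic regression data are purely illustrative, not a model from the text:

import numpy as np
import tensorflow as tf
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=42)
X, y = X.astype('float32'), y.astype('float32')

model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer='adam', loss='mse')
model.fit(X, y, epochs=20, verbose=0)

# Keep dropout active at inference (training=True) and repeat the forward pass;
# the spread of the sampled outputs is the uncertainty estimate.
x_new = X[:1]
samples = np.stack([model(x_new, training=True).numpy().ravel() for _ in range(100)])
print("Mean prediction:", samples.mean())
print("Predictive std (uncertainty):", samples.std())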



2.9.3 Practical Considerations
• Thresholding: You may want to make predictions only when the model's uncertainty is
  low, for example, predicting only when the highest class probability exceeds 0.8 (see the
  short sketch after this list).
• Calibration: Many classifiers are not inherently well-calibrated. Calibrating them using
techniques like Platt Scaling ensures the predicted probabilities correspond well to true
uncertainties.
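
As a concrete (assumed) illustration of thresholding, the sketch below keeps only predictions whose
top-class probability reaches 0.8 and reports how many test samples survive the filter:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_test)
confident = proba.max(axis=1) >= 0.8      # keep only high-confidence predictions
predictions = proba.argmax(axis=1)

print("Coverage: {:.1f}% of test samples".format(100 * confident.mean()))
print("Accuracy on confident subset: {:.2f}%".format(
    100 * np.mean(predictions[confident] == y_test[confident])))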
Uncertainty estimates from classifiers are crucial for making well-informed decisions in machine
learning. They allow models to quantify their confidence in predictions, which is essential in
applications such as healthcare, finance, and autonomous systems, where incorrect predictions
can have costly consequences. Probabilistic classifiers, Bayesian methods, and ensemble models all
offer ways to measure uncertainty, helping improve the reliability and trustworthiness of machine
learning systems.
