2
Supervised Learning
2.4.8 Evaluation Metrics for Regression and Classification
2.5 Naive Bayes Classifiers
2.5.1 Types of Naive Bayes Classifiers
2.5.2 Advantages and Disadvantages
2.6 Decision Tree Algorithm
2.6.1 Introduction to Decision Trees
2.6.2 Decision Criteria: Gini Index and Information Gain
2.6.3 Example of a Decision Tree Algorithm with Calculations
2.6.4 Dataset
2.6.5 Step-by-Step Algorithm for Building a Decision Tree
2.6.6 Decision Tree Pruning
2.6.7 Advantages and Disadvantages of Pruning
2.7 Ensembles of Decision Trees
2.7.1 Bagging (Bootstrap Aggregating)
2.7.2 Random Forests
2.7.3 Boosting
2.7.4 Gradient Boosting
2.7.5 Comparison of Ensemble Methods
2.8 Kernelized Support Vector Machines
2.8.1 The Kernel Trick
2.8.2 Common Kernel Functions
2.8.3 Choosing the Right Kernel
2.8.4 Regularization and Hyperparameters
2.8.5 Advantages of Kernelized SVMs
2.8.6 Disadvantages of Kernelized SVMs
2.8.7 Example of Kernelized SVM
2.9 Uncertainty Estimates from Classifiers
2.9.1 Why Do We Need Uncertainty Estimates?
2.9.2 Types of Classifiers and Uncertainty Estimates
2.9.3 Practical Considerations
In supervised learning, there are two main tasks: classification and regression. Classification
involves predicting a discrete label or category. For example, it may be used to determine whether
an email is spam or not, or whether a patient has a particular disease. Regression, on the other
hand, is about predicting a continuous value, such as forecasting stock prices or estimating the
price of a house based on its characteristics. These tasks rely on labeled datasets, where the
correct output is known and used to train the model.
The success of supervised learning models depends heavily on the quality and quantity of the
labeled data available for training. A well-labeled and diverse dataset allows the model to learn
the underlying patterns and relationships within the data, which can then be applied to new cases.
However, collecting such datasets can be a challenge and often requires significant effort in data
collection and labeling. Additionally, the choice of algorithm plays a critical role in the model’s
performance. Simple models like linear regression are easy to interpret, while more complex
models, such as neural networks, can handle intricate patterns but are harder to interpret.
One of the key advantages of supervised learning is its ability to produce interpretable results,
particularly in models like linear regression or decision trees. These models allow us to understand
the relationships between the input features and the output, which is valuable for fields such as
healthcare or finance, where transparency in decision-making is crucial. However, supervised
learning also presents challenges, including the risk of overfitting, where the model performs
well on the training data but poorly on unseen data. To address this, techniques such as cross-
validation, regularization, and pruning are often employed.
In this chapter, we will explore the concepts, algorithms, and techniques that form the foundation
of supervised learning. We will dive into various models, ranging from simple linear classifiers
to more sophisticated methods like support vector machines and neural networks. Through real-
world examples and case studies, we will demonstrate how these models are applied in practice
and discuss the key considerations involved in building and evaluating supervised learning models.
By the end of this chapter, you will have a comprehensive understanding of supervised learning
and be prepared to apply these techniques in your own projects.
Importance of Datasets
In machine learning, the quality, quantity, and relevance of data are often more critical than
the complexity of the algorithms themselves. A well-curated dataset allows a model to learn
effectively and generalize well to new, unseen data. Conversely, poor or insufficient data can
lead to inaccurate predictions, overfitting, or underfitting, regardless of the sophistication of the
model.
Datasets serve several key purposes in machine learning:
• Training: The model learns from the training data by identifying patterns and relation-
ships between the features and the target variable.
• Validation: During the model development phase, validation datasets are used to tune
hyperparameters and assess the model’s performance to avoid overfitting.
• Testing: After the model is trained, a separate testing dataset is used to evaluate how well
the model performs on unseen data, providing an unbiased estimate of its accuracy.
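To make these three roles concrete, the snippet below is a minimal sketch (not from the original text): it carves the Iris data into 60% training, 20% validation, and 20% test portions. The proportions and random_state are illustrative choices.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Illustrative 60/20/20 split into training, validation, and test sets
X, y = load_iris(return_X_y=True)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 90 30 30

The validation set is used while tuning hyperparameters; the test set is touched only once, at the very end, to obtain an unbiased estimate of performance.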
Types of Datasets
Datasets in machine learning can be broadly categorized into several types, depending on their
purpose and structure:
• Supervised Datasets: These datasets include input features and corresponding labels or
target values. Examples include the Iris dataset for classification and the Boston Housing
dataset for regression.
• Unsupervised Datasets: These datasets contain input features without associated labels.
They are used in tasks such as clustering, where the goal is to find patterns or groupings
within the data.
• Semi-supervised Datasets: These datasets contain a mixture of labeled and unlabeled
data, and they are particularly useful when labeling data is expensive or time-consuming.
Structure of Datasets
Typically, a dataset is structured into:
• Features (Attributes): These are the independent variables or inputs that the model
uses to make predictions. In a dataset, features are represented as columns.
• Labels (Targets): In supervised learning, labels are the dependent variables or outputs
that the model is trained to predict. They are also represented as columns, usually the last
column in a dataset.
• Instances (Samples): Each row in a dataset corresponds to an instance, which is a single
data point consisting of various feature values and, in the case of supervised learning, a
corresponding label.
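As a small illustration (hypothetical housing rows, not a dataset used in the text), the DataFrame below has two feature columns, one label column, and one instance per row:

import pandas as pd

# Hypothetical housing data: 'Size' and 'Rooms' are features, 'Price' is the label,
# and each row is one instance (sample).
df = pd.DataFrame({
    "Size":  [2000, 1800, 2200],   # feature column
    "Rooms": [3, 3, 4],            # feature column
    "Price": [300, 280, 350],      # label column (target)
})
X = df[["Size", "Rooms"]]  # features
y = df["Price"]            # labels
print(df)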
Sources of Datasets
Several platforms and libraries provide ready-to-use datasets for machine learning, including:
• scikit-learn (sklearn.datasets): Offers a variety of standard datasets for both classifi-
cation and regression tasks.
• UCI Machine Learning Repository: A popular resource for finding real-world datasets
across various domains.
• Kaggle: A platform for data science competitions that provides access to a wide range of
datasets along with challenges to solve using machine learning.
• mglearn.datasets: Part of the mglearn package, it offers synthetic datasets specifically
designed to illustrate machine learning concepts.
Datasets are fundamental to the process of building, training, and evaluating machine learning
models. The selection of the right dataset, along with proper preprocessing and understanding
of its structure, is crucial for developing models that perform well and generalize effectively. The following subsections survey commonly used sources of toy and real-world datasets.
Toy Datasets
Toy datasets are small datasets that are easy to manipulate and can be quickly used to test
algorithms. They include:
• Iris Dataset (load_iris()): A dataset for classification that contains measurements of
iris flowers in three species.
• Digits Dataset (load_digits()): A dataset for image classification with 8x8 pixel images
of handwritten digits.
• Wine Dataset (load_wine()): A dataset for classification with chemical analysis of wines
grown in the same region.
• Breast Cancer Dataset (load_breast_cancer()): A dataset for binary classification
containing features of breast cancer tumors.
• Boston Housing Dataset (load_boston()): A regression dataset containing housing
values in suburbs of Boston (deprecated in scikit-learn 1.0 and removed in version 1.2).
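Each loader returns a Bunch object whose arrays and metadata can be inspected directly; the short sketch below uses load_iris() as an example.

from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)        # (150, 4): 150 instances, 4 features
print(iris.feature_names)     # sepal/petal lengths and widths
print(iris.target_names)      # ['setosa' 'versicolor' 'virginica']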
Real-World Datasets
These datasets are slightly larger and are often used in research:
• California Housing Dataset (fetch_california_housing()): A regression dataset con-
taining house prices and features from California districts.
• 20 Newsgroups Dataset (fetch_20newsgroups()): A dataset for text classification with
newsgroup posts from 20 different categories.
• LFW People Dataset (fetch_lfw_people()): A dataset for face recognition with labeled
faces in the wild.
• Olive Oil Dataset (fetch_openml(data_id=171)): A regression dataset about the com-
position of olive oils.
• COIL20 Dataset (fetch_openml(data_id=40922)): A dataset for image classification
containing images of 20 different objects.
[ ]: import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the Titanic dataset from Kaggle (assuming it has been downloaded locally)
data = pd.read_csv('titanic.csv')

# Minimal preprocessing: keep a few numeric features and drop rows with missing values
data = data[['Survived', 'Pclass', 'Age', 'Fare']].dropna()
X, y = data[['Pclass', 'Age', 'Fare']], data['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a logistic regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
In this example, the Titanic dataset is loaded, preprocessed, and used to train a
LogisticRegression model to predict whether a passenger survived the Titanic disaster. The
model is then evaluated using the accuracy score.
Kaggle provides a wide range of datasets that cater to both beginners and advanced practitioners.
These datasets allow users to practice and experiment with different algorithms and techniques
in machine learning. Moreover, Kaggle competitions offer a unique opportunity to work on real-
world problems while competing with data scientists worldwide.
[ ]: import mglearn
import matplotlib.pyplot as plt

# Generate the synthetic forge dataset and plot it
X, y = mglearn.datasets.make_forge()
mglearn.discrete_scatter(X[:, 0], X[:, 1], y)
plt.show()

This code snippet generates the forge dataset using make_forge() from mglearn.datasets and
visualizes it using a scatter plot.
Conclusion
The datasets provided by sklearn.datasets, mglearn.datasets, and Kaggle serve as invaluable
resources for learning, experimenting, and advancing in the field of machine learning. While
sklearn.datasets offers a wide variety of real-world and toy datasets commonly used in both
research and practical applications, mglearn.datasets focuses on synthetic datasets that are
ideal for educational purposes. Kaggle, on the other hand, provides access to a vast repository of
real-world datasets across diverse domains, along with competitions that challenge users to solve
real-world problems using machine learning techniques. Together, these resources enable rapid
prototyping, visualization, and deeper insights into the behavior of machine learning models,
catering to both beginners and experts. Whether you’re just starting out or conducting advanced
research, these datasets provide the foundation for hands-on learning and help accelerate progress
in machine learning.
d = \sqrt{(X_2 - X_1)^2 + (Y_2 - Y_1)^2}

where (X_1, Y_1) and (X_2, Y_2) are the coordinates of two data points.
3. Find the K nearest neighbors: Based on the calculated distances, find the K nearest
neighbors of the test data point.
Next, we calculate the Euclidean distances from the new data point to all the points in the
dataset, as shown in Figure 2.2. The Euclidean distance measures the straight-line distance between two points in the feature space.
In this case, we choose K = 3. As shown in Figure 2.3, the 3 nearest neighbors are identified.
Among these neighbors, 1 belongs to Category A and 2 belong to Category B. The new data point
is then classified into Category B because the majority of its neighbors belong to this category.
Finally, the classification result is shown in Figure 2.4, where the new data point is assigned to
Category B based on the majority vote.
Advantages:
• Does not require any training phase, making it computationally inexpensive in that aspect.
• Works well with small datasets and cases where data is not linearly separable.
Disadvantages:
• Sensitive to the scale of the data, meaning that features with larger scales can dominate
the distance calculation.
The K-Nearest Neighbor algorithm is a fundamental method in machine learning and has broad
applications in various domains, from image recognition to recommendation systems. Its sim-
plicity and effectiveness make it a popular choice, especially for small datasets and classification
problems. However, care must be taken when selecting the value of K and handling large datasets
to ensure optimal performance.
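Since the walkthrough above is described only in prose, the following is a minimal scikit-learn sketch of the same idea, assuming the Iris data and K = 3 (both are illustrative choices); it also produces the y_test and y_pred arrays consumed by the confusion-matrix plot below.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load data and split into training and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# K-Nearest Neighbors with K = 3 (Euclidean distance by default)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Make predictions and evaluate
y_pred = knn.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")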
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(conf_matrix, annot=True, cmap='Blues', fmt='g')
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
Bias-Variance Tradeoff
The bias-variance tradeoff refers to the balance between the errors introduced by bias and vari-
ance.
• Bias: Bias is the error introduced by approximating a real-world problem (which may be
complex) by a simplified model. High bias models (e.g., linear models) are too simple and
fail to capture the underlying data patterns, leading to underfitting.
• Variance: Variance is the model’s sensitivity to changes in the training data. High variance
models (e.g., highly complex models) may capture noise in the training data, leading to
overfitting.
• Tradeoff : As model complexity increases, bias decreases, but variance increases. The goal
is to find an optimal balance between bias and variance to minimize the total error.
[Figure: total error as a function of model complexity, decomposed into bias² and variance, with the optimal model complexity at the minimum of the total error. A companion plot of Price versus Size illustrates underfitting and overfitting.]
For example, a simple model (high bias) might fit the data with a single straight line:

Price = \beta_0 + \beta_1 \cdot Size
On the other hand, a complex model (high variance) might use high-degree polynomials to fit
every training point exactly, leading to overfitting. To mitigate this, regularization techniques like
Ridge or Lasso regression can be applied, which add penalty terms to control model complexity.
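The effect can be sketched with polynomial features of increasing degree; the synthetic data, the degrees 1, 4, and 15, and the split below are illustrative assumptions rather than an example from the text.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic noisy data with a single input feature
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 60)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.3, size=60)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Low degree tends to underfit (high bias); very high degree tends to overfit (high variance)
for degree in [1, 4, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree,
          round(mean_squared_error(y_train, model.predict(X_train)), 3),
          round(mean_squared_error(y_test, model.predict(X_test)), 3))

Typically the training error keeps falling as the degree grows, while the test error eventually rises again, which is exactly the tradeoff described above.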
Regularization
Regularization techniques help reduce overfitting by penalizing large coefficients in the model.
This encourages simpler models that generalize better to unseen data.
Lasso (L1) Regularization
Lasso regression adds the absolute values of the coefficients as a penalty:

Cost = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{i=1}^{m} |w_i|
Lasso helps reduce the impact of less important features by driving some coefficients to zero, thus
performing feature selection.
Ridge (L2) Regularization
Ridge regression adds the squared magnitude of the coefficients as a penalty:
Cost = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{i=1}^{m} w_i^2
Ridge regression prevents overfitting by shrinking the coefficients, but unlike Lasso, it does not
set coefficients exactly to zero.
Elastic Net Regularization
Elastic Net is a combination of L1 and L2 regularization. It adds both the absolute and squared
penalties:
Cost = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \left( \alpha \sum_{i=1}^{m} |w_i| + (1 - \alpha) \sum_{i=1}^{m} w_i^2 \right)
Elastic Net combines the benefits of both Lasso and Ridge regularization.
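In scikit-learn this corresponds to the ElasticNet estimator, where alpha roughly plays the role of λ and l1_ratio the role of α (the library's exact penalty differs by constant factors); the dataset and parameter values below are illustrative.

from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# l1_ratio=0.5 weights the L1 and L2 penalties roughly equally
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=10000)
enet.fit(X_train, y_train)
print("R^2 on test data:", round(enet.score(X_test, y_test), 3))
print("Coefficients:", enet.coef_)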
[Figure: geometric comparison of the L2 and L1 regularization penalties in weight space.]
y = w0 · x0 + w1 · x1 + · · · + wp · xp + b
Where:
• x0 , x1 , . . . , xp are the input features (for example, the size of the house, the number of
rooms, or the location).
• w0 , w1 , . . . , wp are the weights assigned to each of these features. These weights determine
how much influence each feature has on the output value.
Graphical Representation
A graphical representation of this concept can be shown by plotting the house price (the target
variable) against one of the features, such as the size of the house. The line represents the model’s
predictions, and the data points represent actual house prices for different sizes.
[Figure: house price ($) plotted against house size, showing the fitted regression line and the actual data points.]
The quality of the fit is measured by the mean squared error (MSE):
MSE = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2
Where:
• ŷi is the predicted value for the i-th sample,
• yi is the actual value for the i-th sample,
• n is the number of samples in the dataset.
The lower the mean squared error, the better the model fits the data.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Load the California housing dataset and split it
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit an ordinary least-squares model
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

# Predictions
y_pred = lin_reg.predict(X_test)

# Plot predicted versus actual prices
plt.scatter(y_test, y_pred, alpha=0.3)
plt.xlabel("Actual Prices")
plt.ylabel("Predicted Prices")
plt.title("Linear Regression: Actual vs Predicted Prices (California Housing)")
plt.show()
In machine learning, overfitting occurs when a model becomes too complex and learns not only the
underlying patterns in the training data but also the noise and random fluctuations. This means
the model performs very well on the training data but struggles to generalize to new, unseen data,
leading to poor performance. Overfitting often happens when the model’s complexity increases,
which can be caused by using too many features or training the model for too long.
Regularization is a technique that helps prevent overfitting by introducing a penalty term in the
model’s cost function. This penalty discourages the model from assigning too much importance
(or weight) to any one feature, thus keeping the model simpler and more generalizable.
Ridge regression, also known as L2 regularization, adds a penalty term to the loss function
proportional to the square of the weights. The regularized loss function for ridge regression is
given by:
Loss = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 + \lambda \sum_{j=1}^{p} w_j^2
Where:
• λ is the regularization parameter (sometimes called α), controlling the strength of the
penalty.
The additional term \lambda \sum_{j=1}^{p} w_j^2 penalizes large weights, encouraging the model to keep the weights
smaller. This makes the model simpler and less prone to overfitting.
[Figure: training and test error as a function of model complexity; the test error is minimized at an intermediate, optimal complexity.]
Example: Suppose we are predicting house prices based on various features such as size, number
of rooms,etc. Ridge regression prevents any one feature (like size) from having an overly large
weight, ensuring that all features contribute more evenly to the prediction.
w = (XT X + λI)−1 XT y
Where:
• X is the matrix of input features (in this case, Size and Rooms).
• y is the vector of target values (in this case, Price).
• λ is the regularization parameter (we are using λ = 1).
• I is the identity matrix.
Step-by-Step Calculation
1. Feature Matrix (X) and Target Vector (y)
From the given table, the input features (Size, Rooms) and target values (Price) are:

X = \begin{bmatrix} 2000 & 3 \\ 1800 & 3 \\ 2200 & 4 \\ 1700 & 2 \end{bmatrix}, \qquad y = \begin{bmatrix} 300 \\ 280 \\ 350 \\ 220 \end{bmatrix}
2. Calculate X^T X

X^T X = \begin{bmatrix} 15700000 & 22100 \\ 22100 & 38 \end{bmatrix}
3. Calculate X^T y

X^T y = \begin{bmatrix} 1874000 \\ 4230 \end{bmatrix}
Solving w = (X^T X + \lambda I)^{-1} X^T y for the weights gives:

w_0 \approx 0.1961, \quad w_1 \approx 10.6499
8. Intercept (Bias)
The intercept b is computed during the fitting process:
b = −121.9437
Final Results
The final values for the Ridge regression model are:
Feature Value
Size (w0) 0.1961
Rooms (w1) 10.6499
Intercept (b) −121.9437
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load the California housing dataset and split it
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Ridge model with alpha = 0.1 (regularization strength)
ridge_reg = Ridge(alpha=0.1)
ridge_reg.fit(X_train, y_train)

# Predictions
y_pred = ridge_reg.predict(X_test)
print("Test MSE:", mean_squared_error(y_test, y_pred))

# Plot predicted versus actual prices
plt.scatter(y_test, y_pred, alpha=0.3)
plt.xlabel("Actual Prices")
plt.ylabel("Predicted Prices")
plt.title("Ridge Regression (L2): Actual vs Predicted Prices (California Housing)")
plt.show()
Lasso regression, also known as L1 regularization, adds a penalty term proportional to the absolute values of the weights. The regularized loss function for lasso regression is given by:

Loss = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 + \lambda \sum_{j=1}^{p} |w_j|
Where:
• \lambda \sum_{j=1}^{p} |w_j| is the L1 regularization term, which penalizes the absolute values of the weights.
Lasso regression is especially useful when we have a large number of features, many of which
might be irrelevant. The L1 regularization can shrink the coefficients of less important features
to zero, effectively performing feature selection.
Example: In a scenario where we are using many features to predict house prices, lasso regression
can automatically select only the most important features, such as house size and location, and
discard less relevant features like the color of the house.
[Figure: geometric comparison of the L2 and L1 regularization penalties in weight space.]
Input Table
For the input table of Size, Rooms, and Price, lasso regression solves:

Minimize \; \sum_{i=1}^{n} \left( y_i - (w^T x_i + b) \right)^2 + \lambda \sum_{j=1}^{p} |w_j|
Where:
• X is the matrix of input features (in this case, Size and Rooms).
• y is the vector of target values (in this case, Price).
• λ is the regularization parameter (we are using λ = 1).
Feature Value
Size (w0) 0.19
Rooms (w1) 0
Intercept (b) −81.88
In this case, Lasso regression set the coefficient w1 (for Rooms) to 0, effectively eliminating that
feature from the model.
from sklearn.linear_model import Lasso

# Initialize and train the Lasso model with alpha = 0.1 (regularization strength),
# reusing the California housing split (X_train, X_test, y_train, y_test) from the
# Ridge example above
lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X_train, y_train)

# Predictions
y_pred = lasso_reg.predict(X_test)

# Plot predicted versus actual prices
plt.scatter(y_test, y_pred, alpha=0.3)
plt.xlabel("Actual Prices")
plt.ylabel("Predicted Prices")
plt.title("Lasso Regression (L1): Actual vs Predicted Prices (California Housing)")
plt.show()
Comparing Lasso and Ridge: Lasso tends to zero out less important features, effectively per-
forming feature selection. Ridge shrinks the weights of all features but does not eliminate any,
so it’s better suited when you believe all features contribute to the prediction. In both cases,
alpha is the regularization strength, and increasing it will apply stronger regularization, leading
to smaller weights (and possibly reduced overfitting).
Without regularization, models tend to fit the training data too well, capturing noise and random
fluctuations in the data. This leads to high training accuracy but poor performance on unseen
data. Regularization helps by limiting the complexity of the model and ensuring that it generalizes
better to new data. By penalizing large weights, ridge regression keeps all the features in the model
but controls their influence. Lasso regression, on the other hand, not only controls complexity
but also performs feature selection by shrinking some weights to zero.
The key difference between L1 and L2 regularization is that L1 (lasso) tends to produce sparse
models (with fewer features), while L2 (ridge) keeps all the features but reduces their influence.
Regularization is a powerful tool for preventing overfitting in machine learning models. By adding
a penalty to the model’s loss function, we can control the size of the weights and keep the model
simpler, leading to better generalization on unseen data. Ridge regression (L2 regularization)
penalizes large weights, while lasso regression (L1 regularization) can shrink some weights to
zero, effectively performing feature selection. These techniques are essential when working with
high-dimensional datasets or when you want to balance complexity and performance.
Logistic Regression
Logistic regression is a widely used linear model for binary classification tasks, where there are two
possible outcomes (e.g., yes/no, true/false). While linear regression models predict continuous
values, logistic regression transforms the output into a probability using a special function called
the sigmoid function or logistic function.
The sigmoid function is defined as:
\sigma(z) = \frac{1}{1 + e^{-z}}
Where:
• z = w0 · x0 + w1 · x1 + · · · + wp · xp + b is the linear combination of input features.
• w0 , w1 , . . . , wp are the weights learned by the model.
• b is the bias term.
The sigmoid function squashes the output of the linear equation into a range between 0 and
1, which can be interpreted as the probability of the input belonging to a certain class. If the
predicted probability is greater than 0.5, the model assigns the input to the positive class (e.g.,
spam), and if it’s less than 0.5, the model assigns it to the negative class (e.g., not spam).
[Figure: the predicted probability for an example email, with the decision threshold at P = 0.5; the model assigns a spam probability of 80%.]
In this case, the model predicts that there is an 80% chance that the email is spam, so it classifies
it as spam.
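A tiny NumPy sketch of this decision rule (the score z ≈ 1.4 is chosen only because σ(1.4) ≈ 0.80, matching the spam example above; the other scores are arbitrary):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 1.4])       # linear scores w·x + b for three emails
probs = sigmoid(z)                   # approximately [0.12, 0.50, 0.80]
labels = (probs > 0.5).astype(int)   # threshold at 0.5
print(probs, labels)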
[Figure: a two-dimensional feature space (Feature 1 vs. Feature 2) with the decision boundary learned by logistic regression drawn as a dashed line.]
In this diagram, the dashed line represents the decision boundary learned by logistic regression.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Load the breast cancer dataset
X, y = load_breast_cancer(return_X_y=True)

# Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the logistic regression model and make predictions
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(f"Accuracy: {accuracy_score(y_test, y_pred) * 100:.2f}%")
print("Classification Report:")
print(classification_report(y_test, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
Accuracy: 95.61%
Classification Report:
precision recall f1-score support
Confusion Matrix:
[[39 4]
[ 1 70]]
[Figure: a two-dimensional feature space with dashed decision boundaries separating three classes: Class A, Class B, and Class C.]
In this diagram, the dashed lines represent decision boundaries for three classes: Class A, Class
B, and Class C. When a new data point is evaluated, the classifiers each calculate a score, and
the class with the highest confidence is selected.
[Figure: training and test error versus model complexity; the test error reaches its minimum at an intermediate, optimal complexity.]
In the diagram, as the model complexity increases, the training error decreases, but the test error
follows a U-shaped curve. Regularization helps by pushing the model towards optimal complexity,
where the test error is minimized.
Let’s use the well-known Iris dataset as an example. This dataset has three classes: Setosa,
Versicolor, and Virginica. We’ll use logistic regression as the linear model for this multiclass
classification task.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Load the Iris dataset and split it (30% test)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a multiclass logistic regression model and make predictions
logistic_reg = LogisticRegression(max_iter=200)
logistic_reg.fit(X_train, y_train)
y_pred = logistic_reg.predict(X_test)

print("Classification Report:")
print(classification_report(y_test, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
Classification Report:
precision recall f1-score support
accuracy 0.96 45
macro avg 0.96 0.95 0.95 45
weighted avg 0.96 0.96 0.96 45
Confusion Matrix:
[[19 0 0]
[ 0 11 2]
[ 0 0 13]]
• Easy to interpret: Linear models provide clear insights into how the input features affect
the output. For example, we can easily interpret the coefficients in a linear regression model
as the amount of change in the target variable per unit change in the input feature.
• Fast to train and use: Linear models are computationally efficient and can be trained
quickly, even on large datasets.
• Works well with high-dimensional data: Linear models perform well when there are
many features (high-dimensional data), even if some features are irrelevant.
In summary, linear models are simple, interpretable, and fast, but they may struggle with complex
relationships in the data. Regularization helps address this by controlling model complexity and
preventing overfitting, making linear models more effective on a wider range of tasks.
For regression tasks, common evaluation metrics include Mean Squared Error (MSE), Root
Mean Squared Error (RMSE), and Mean Absolute Error (MAE). These metrics measure
how close the predicted values are to the actual values. On the other hand, for classification
tasks, metrics like accuracy, precision, recall, F1 score, and ROC-AUC are used to assess
how well the model distinguishes between different classes. Understanding these metrics allows
practitioners to make informed decisions about the performance and reliability of their models.
1. Mean Squared Error (MSE): Measures the average of the squared differences between the actual and predicted values.

MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

• Sample Calculation:

MSE = \frac{1}{5} \left[ (3.0 - 2.8)^2 + (2.5 - 2.6)^2 + (4.0 - 3.9)^2 + (5.5 - 5.4)^2 + (6.0 - 5.9)^2 \right] = 0.016
• Interpretation: A lower MSE indicates that the model’s predictions are close to the
actual values.
2. Root Mean Squared Error (RMSE): The square root of MSE, which brings the error
metric back to the same units as the target variable.
RMSE = \sqrt{MSE}

• Sample Calculation:

RMSE = \sqrt{0.016} \approx 0.126
• Interpretation: The RMSE tells us that the predictions are off by about 0.126 units
on average.
3. Mean Absolute Error (MAE): Measures the average of the absolute errors, i.e., the
absolute difference between actual and predicted values.
MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|

• Sample Calculation:

MAE = \frac{1}{5} \left[ |3.0 - 2.8| + |2.5 - 2.6| + |4.0 - 3.9| + |5.5 - 5.4| + |6.0 - 5.9| \right] = 0.12
4. R² (Coefficient of Determination): Measures the proportion of the variance in the target variable that is explained by the model.

R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}

• Interpretation: An R² value closer to 1 means that the model explains a high proportion of variance.
Consider the following true and predicted labels for a binary classification task with ten samples:

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
1. Confusion Matrix: A table used to describe the performance of a classification model,
showing the correct and incorrect predictions.
Predicted 0 Predicted 1
Actual 0 TN = 4 FP = 1
Actual 1 FN = 1 TP = 4
2. Accuracy: The proportion of correct predictions (both true positives and true negatives)
among the total predictions.
Accuracy = \frac{TP + TN}{TP + TN + FP + FN}

• Sample Calculation:

Accuracy = \frac{4 + 4}{4 + 4 + 1 + 1} = 0.8
3. Precision: The proportion of true positives among all positive predictions.

Precision = \frac{TP}{TP + FP}

• Sample Calculation:

Precision = \frac{4}{4 + 1} = 0.8
• Interpretation: The model’s precision is 80%, meaning that when the model predicts
a positive class, it is correct 80% of the time.
4. Recall (Sensitivity or True Positive Rate): The proportion of true positives among
the actual positives.
Recall = \frac{TP}{TP + FN}

• Sample Calculation:

Recall = \frac{4}{4 + 1} = 0.8
• Interpretation: The recall is 80%, meaning the model correctly identifies 80% of the
actual positive cases.
5. F1 Score: The harmonic mean of precision and recall. It is useful when the class distri-
bution is imbalanced.
F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall}

• Sample Calculation:

F1 = 2 \times \frac{0.8 \times 0.8}{0.8 + 0.8} = 0.8
• Interpretation: The F1 score is 0.8, indicating a good balance between precision and
recall.
6. ROC-AUC (Receiver Operating Characteristic - Area Under Curve): Measures
the ability of the model to distinguish between classes. A value closer to 1 is better.
• Sample Value:
ROC-AUC = 0.85
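For reference, the same numbers can be reproduced with scikit-learn's metric functions applied to the y_true and y_pred lists above; ROC-AUC is omitted from this sketch because it requires predicted probabilities rather than hard labels.

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))   # [[4 1], [1 4]]
print(accuracy_score(y_true, y_pred))     # 0.8
print(precision_score(y_true, y_pred))    # 0.8
print(recall_score(y_true, y_pred))       # 0.8
print(f1_score(y_true, y_pred))           # 0.8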
P(y|X) = \frac{P(X|y) \cdot P(y)}{P(X)}
Where:
• P (y|X) is the posterior probability of class y given the feature vector X.
• P (X|y) is the likelihood of observing the feature vector X given class y.
• P (y) is the prior probability of class y, which represents how common the class is in the
dataset.
• P (X) is the marginal probability of the feature vector X, which serves as a normalization
constant.
Naive Bayes classifiers leverage this theorem by estimating the posterior probability for each class
and then selecting the class with the highest posterior probability.
P(x_i|y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \cdot \exp\left( -\frac{(x_i - \mu_y)^2}{2\sigma_y^2} \right)
Where:
• µy is the mean of feature xi for class y.
• σy2 is the variance of feature xi for class y.
• xi is the value of the feature.
We want to predict whether a student who studied for 7 hours will pass or fail.
Step 1: Calculate the Mean (µ) and Variance (σ 2 ) for Each Class
For class "Pass":
\mu_{Pass} = \frac{6 + 8 + 10}{3} = \frac{24}{3} = 8

\sigma^2_{Pass} = \frac{(6 - 8)^2 + (8 - 8)^2 + (10 - 8)^2}{3} = \frac{4 + 0 + 4}{3} \approx 2.67
P(x_i|y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \cdot \exp\left( -\frac{(x_i - \mu_y)^2}{2\sigma_y^2} \right)
Where:
• xi is the feature value (study hours in this case, x = 7),
• µy is the mean for the class y,
• σy2 is the variance for the class y.
For class "Pass":

P(x = 7|\text{Pass}) = \frac{1}{\sqrt{2\pi \cdot 2.67}} \cdot \exp\left( -\frac{(7 - 8)^2}{2 \cdot 2.67} \right)

• First, calculate the denominator:

\sqrt{2\pi \cdot 2.67} \approx 4.10

• Then the exponent:

\frac{(7 - 8)^2}{2 \cdot 2.67} = \frac{1}{5.34} \approx 0.187, \qquad \exp(-0.187) \approx 0.829

For class "Fail", the corresponding exponent is much larger in magnitude, so the likelihood is far smaller:

\exp(-8) \approx 0.000335
We predict that the student will pass, as P (x = 7|Pass) is much greater than P (x = 7|Fail).
Example Code:
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Gaussian Naive Bayes Accuracy: {accuracy:.2f}")
P(x_i|y) = \frac{\text{Count of } x_i \text{ in class } y}{\text{Total count of features in class } y}
P (Not Spam|Buy = 2, Free = 1) ∝ P (Not Spam)·P (Buy = 2|Not Spam)·P (Free = 1|Not Spam)
Since P (Not Spam) > P (Spam), we classify the email as Not Spam.
Example Code:
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Multinomial Naive Bayes Accuracy: {accuracy:.2f}")
# Display results
print("\nPredicted labels for test data:")
for i, email in enumerate(X_test.toarray()):
print(f"Email: '{emails[i]}', Predicted Label: {y_pred[i]}")
Email: 'Upto 20% discount on selected items for a limited time', Predicted Label: 1
Email: 'Congratulations! You have won a prize, claim now', Predicted Label: 1
P(\text{Spam}) = \frac{2}{4} = 0.5, \qquad P(\text{Not Spam}) = \frac{2}{4} = 0.5

P(\text{Buy} = 1|\text{Spam}) = \frac{2}{2} = 1, \qquad P(\text{Free} = 0|\text{Spam}) = \frac{0}{2} = 0

P(\text{Buy} = 1|\text{Not Spam}) = \frac{1}{2} = 0.5, \qquad P(\text{Free} = 0|\text{Not Spam}) = \frac{1}{2} = 0.5
P (Not Spam|Buy = 1, Free = 0) ∝ P (Not Spam)·P (Buy = 1|Not Spam)·P (Free = 0|Not Spam)
Example Code
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Bernoulli Naive Bayes Accuracy: {accuracy:.2f}")
Disadvantages
Gini Index
The Gini Index is used to measure the impurity of a dataset. Lower Gini index values indicate
better splits. It works by evaluating how often a randomly chosen element from the dataset would
be incorrectly classified.
Formula for Gini Index:
Gini(D) = 1 - \sum_{i=1}^{n} p_i^2
Where:
• pi is the proportion of class i in the dataset.
Formula for Entropy:

Entropy(D) = - \sum_{i=1}^{n} p_i \log_2 p_i

Where:
• p_i is the proportion of class i in the dataset. The information gain of a split is the reduction in entropy it achieves.
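Both impurity measures are straightforward to compute; the helper functions below are a small sketch (not part of the original text), checked against a 50/50 split, for which the Gini index is 0.5 and the entropy is 1 bit.

import numpy as np

def gini(proportions):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    p = np.asarray(proportions)
    return 1.0 - np.sum(p ** 2)

def entropy(proportions):
    """Entropy in bits: -sum p * log2(p), ignoring zero proportions."""
    p = np.asarray(proportions)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(gini([0.5, 0.5]))     # 0.5
print(entropy([0.5, 0.5]))  # 1.0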
2.6.4 Dataset
Customer Age Income Outcome (Buy/Not Buy)
1 Young Low Not Buy
2 Young High Buy
3 Middle Low Buy
4 Old Low Not Buy
5 Old High Buy
6 Middle High Buy
7 Young Low Not Buy
8 Old Low Buy
9 Young High Buy
Step-by-Step Calculation
Step 1: Gini Index Calculation for the Whole Dataset
First, we calculate the Gini index for the target variable (Outcome). We have two outcomes:
"Buy" and "Not Buy". In the dataset:
• 6 customers decided to Buy
• 3 customers decided Not to Buy
The Gini index for the whole dataset is:

Gini(D) = 1 - \left( \left(\frac{6}{9}\right)^2 + \left(\frac{3}{9}\right)^2 \right) = 1 - \left( \frac{36}{81} + \frac{9}{81} \right) = 1 - \frac{45}{81} \approx 0.44
The weighted entropy of the Age attribute (4 Young, 2 Middle, and 3 Old customers, with group entropies of 1, 0, and 0.918 respectively) is:

Entropy(Age) = \frac{4}{9} \times 1 + \frac{2}{9} \times 0 + \frac{3}{9} \times 0.918 = 0.444 + 0 + 0.306 = 0.75
[Figure: the resulting decision tree, starting from a root node with Yes/No branches leading to further splits and leaf nodes.]
• Overfitting Prevention: Unpruned decision trees can overfit the training data, meaning
they learn patterns that are specific to the training data and do not generalize well to new
data.
• Improved Interpretability: Pruned trees are smaller and easier to interpret compared
to large, complex trees.
• Reduced Complexity: Pruned trees have fewer nodes and branches, making them less
complex and computationally efficient.
• Improved Generalization: Pruned trees are better at generalizing to unseen data and
thus perform better on test datasets.
Types of Pruning
In pre-pruning, the growth of the decision tree is halted early based on certain conditions. Com-
mon pre-pruning criteria include limiting the maximum depth of the tree, requiring a minimum
number of samples in a node before splitting, or setting a threshold on the minimum impurity
decrease. Pre-pruning helps avoid building an excessively large tree in the first place, but it might
also prevent capturing useful patterns.
In post-pruning, a fully grown decision tree is created first, and then branches that contribute
less to the overall prediction accuracy are removed. The decision to prune a node is based on
metrics like the reduction in accuracy or a pruning criterion such as the complexity parameter.
Post-pruning generally results in better generalization as it uses the entire tree to evaluate which
branches to prune.
Suppose we have the following decision tree built using customer data to predict whether a
customer will buy a product:
[Figure: the example decision tree, with a root split followed by Yes/No branches, including an "Income > 50K?" decision node.]
1. Pre-pruning: During the tree-building process, we stop growing the tree if a node has
fewer than 5 samples or if the depth of the tree reaches 3.
2. Post-pruning: After constructing the tree, we examine each node and remove the branches
that contribute the least to prediction accuracy. For example, suppose the node "Income >
50K?" does not significantly improve the classification accuracy. We can prune this branch
and replace it with the majority class ("Buy").
Pruning Metrics
• Gini Index: Measures the impurity of a node. Lower Gini index values indicate purer
nodes.
• Entropy: Another measure of impurity used in information gain. Lower entropy values
indicate purer nodes.
• Cost Complexity Pruning (CCP): A method in which nodes are removed based on a
cost-complexity measure that considers both the size of the tree and the error rate.
• Post-pruning Parameter: in scikit-learn, the cost-complexity parameter ccp_alpha controls how aggressively subtrees are pruned; larger values produce smaller trees.
Example Code
# Load dataset
X, y = load_iris(return_X_y=True)
plt.show()
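The original cell survives only in part, so the sketch below shows one way to carry out cost-complexity post-pruning with scikit-learn's ccp_alpha parameter, continuing from the Iris data loaded above; the split and loop structure are illustrative.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Compute the effective alphas along the cost-complexity pruning path
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Train one tree per alpha and compare its size and test accuracy
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    tree.fit(X_train, y_train)
    print(f"alpha={alpha:.4f}  nodes={tree.tree_.node_count}  "
          f"test accuracy={tree.score(X_test, y_test):.2f}")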
Pruning is a crucial step in building decision trees that are not only effective but also efficient. By
understanding the different pruning techniques and their impact, one can create decision trees that
generalize well and avoid overfitting, making them suitable for various real-world applications.
Based on the Gini Index or Information Gain, the decision tree algorithm selects the feature that
best splits the data at each step, creating Root Nodes, Decision Nodes, and Leaf Nodes.
The tree grows recursively until all records are classified or a stopping criterion is met. This
step-by-step approach allows decision trees to effectively classify data into distinct categories or
predict continuous outcomes.
We create multiple subsets of the data by sampling with replacement and train a decision tree
on each subset. The final predicted price for a new house is the average of the predictions from
all the trees.
For each tree in the random forest, a random subset of features is selected to split on. The final
classification is based on the majority vote from all the trees.
2.7.3 Boosting
Boosting is an ensemble technique where models are trained sequentially, and each model tries
to correct the errors made by the previous model. Unlike bagging, where trees are built indepen-
dently, boosting builds trees in a sequence, and each tree focuses on the mistakes of the previous
one.
The first model may classify most transactions correctly, but some may be misclassified. The
second model will focus on the misclassified transactions by adjusting their weights, and this
process continues iteratively.
1. The first model predicts the average house price (e.g., 350,000).
2. The residuals are calculated: Residual = Actual Price − Predicted Price.
3. A new decision tree is trained to predict the residuals, and the model is updated with the predictions from this tree.
4. The process continues iteratively, improving the model's predictions with each new tree.
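A hand-rolled sketch of these four steps, using synthetic prices, squared-error loss, shallow regression trees, and an illustrative learning rate of 0.1 (none of these choices come from the text):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic house data: one feature (size) and a price-like target
rng = np.random.RandomState(0)
X = rng.uniform(1000, 3000, size=(100, 1))
y = 100 + 0.1 * X.ravel() + rng.normal(scale=20, size=100)

# Step 1: the initial prediction is the average price
pred = np.full_like(y, y.mean())

# Steps 2-4: repeatedly fit a small tree to the residuals and update the model
learning_rate = 0.1
for _ in range(50):
    residuals = y - pred                                          # step 2
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)   # step 3
    pred += learning_rate * tree.predict(X)                       # step 4

print("Mean absolute error after boosting:", round(np.mean(np.abs(y - pred)), 2))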
Let’s consider a comprehensive example using the sklearn library to demonstrate the differences
between Bagging, Random Forest, Boosting, and Gradient Boosting for a classification problem.
We will use the Breast Cancer dataset from sklearn.datasets to classify whether a tumor is benign
or malignant. Each method will be trained on the same dataset and evaluated for performance.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Load the data and split it
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Bagging with decision trees as the base estimator
bagging = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50, random_state=42)
bagging.fit(X_train, y_train)
y_pred_bagging = bagging.predict(X_test)

# Boosting (AdaBoost)
adaboost = AdaBoostClassifier(n_estimators=50, random_state=42)
adaboost.fit(X_train, y_train)
y_pred_adaboost = adaboost.predict(X_test)

# Gradient Boosting
gradient_boosting = GradientBoostingClassifier(n_estimators=50, random_state=42)
gradient_boosting.fit(X_train, y_train)
y_pred_gb = gradient_boosting.predict(X_test)

print("Bagging accuracy:", accuracy_score(y_test, y_pred_bagging))
print("AdaBoost accuracy:", accuracy_score(y_test, y_pred_adaboost))
print("Gradient Boosting accuracy:", accuracy_score(y_test, y_pred_gb))
Random Forests
Advantages:
• Reduces both variance and overfitting by using random subsets of features.
• Works well with high-dimensional data.
• Less sensitive to noisy data.
Disadvantages:
• Less interpretable than a single decision tree.
• Can still suffer from bias if the base model is too simple.

Gradient Boosting
Advantages:
• Produces highly accurate models.
• Works well with complex datasets and captures intricate relationships.
• Can handle both classification and regression tasks effectively.
Disadvantages:
• Computationally expensive and slower to train.
• Requires careful tuning of hyperparameters.
• More prone to overfitting compared to Random Forests.
Ensemble methods such as Bagging, Random Forests, Boosting, and Gradient Boosting are powerful tools for improving the accuracy and robustness of decision trees. By combining multiple trees, they reduce variance and produce more reliable predictions than any single tree alone.
• Linear Kernel: The simplest kernel, equivalent to the ordinary dot product in the original feature space:

K(x, x') = x \cdot x'

• Polynomial Kernel: This kernel allows for curved decision boundaries and is useful for capturing interactions between features: K(x, x') = (x \cdot x' + c)^d, where d is the polynomial degree.
Dataset Description Features: Alcohol, Malic acid, Ash, Alkalinity of ash, Magnesium, Total phe-
nols, Flavanoids, Nonflavanoid phenols, Proanthocyanins, Color intensity, Hue, OD280/OD315
of diluted wines, and Proline. Target classes: 3 types of wine labeled as 0, 1, and 2
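The code for this example did not survive extraction intact, so the sketch below shows one plausible setup; scaling the features and using an RBF-kernel SVC are assumptions, chosen because kernelized SVMs are sensitive to feature scales.

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Load the Wine dataset and split it (30% test)
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=42)

# Scale the features, then fit an RBF-kernel SVM
svm = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0))
svm.fit(X_train, y_train)
y_pred = svm.predict(X_test)

print(classification_report(y_test, y_pred, target_names=wine.target_names))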
accuracy 0.98 54
macro avg 0.98 0.98 0.98 54
weighted avg 0.98 0.98 0.98 54
accuracy 0.98 54
macro avg 0.98 0.98 0.98 54
weighted avg 0.98 0.98 0.98 54
Kernelized SVMs are a powerful and flexible tool for classification and regression tasks, especially when the data is not linearly separable. By choosing an appropriate kernel function and tuning the regularization hyperparameters, they can fit complex decision boundaries while controlling overfitting.
Here's a detailed example of how to use Support Vector Machines (SVM) with different kernel
functions (Polynomial, RBF, and Sigmoid) using the scikit-learn library in Python.
from sklearn.svm import SVC

# Fit SVMs with three kernels (X_train, y_train are assumed from earlier in the example)
svm_poly = SVC(kernel='poly').fit(X_train, y_train)
svm_rbf = SVC(kernel='rbf').fit(X_train, y_train)
svm_sigmoid = SVC(kernel='sigmoid').fit(X_train, y_train)

# Make predictions
y_pred_poly = svm_poly.predict(X_test)
y_pred_rbf = svm_rbf.predict(X_test)
y_pred_sigmoid = svm_sigmoid.predict(X_test)
ax.set_xlim(xx.min(), xx.max())
ax.set_ylim(yy.min(), yy.max())
ax.set_xticks(())
ax.set_yticks(())
ax.set_title(title)
legend1 = ax.legend(*scatter.legend_elements(), title="Classes")
ax.add_artist(legend1)
# Create subplots
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
plt.show()
accuracy 0.76 45
macro avg 0.85 0.73 0.71 45
weighted avg 0.87 0.76 0.74 45
accuracy 0.73 45
macro avg 0.69 0.69 0.69 45
weighted avg 0.73 0.73 0.73 45
accuracy 0.78 45
macro avg 0.76 0.74 0.73 45
weighted avg 0.80 0.78 0.77 45
• Make better decisions: Knowing the uncertainty helps in making more informed de-
cisions. For example, in medical diagnosis, a model that is uncertain about a patient’s
condition might suggest further tests before making a final diagnosis.
4. Ensemble Methods
Ensemble methods, such as Random Forests and Gradient Boosting, average predictions from
multiple models. The variation in the predictions across different models can be interpreted as
an uncertainty estimate.
Example: In a Random Forest, if the trees consistently predict the same class, the model is
confident. If the trees are divided, the model is uncertain.
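A short scikit-learn sketch of this idea (the dataset and forest size are illustrative): predict_proba reports the fraction of trees voting for each class, which can be read as a confidence score.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

# Class probabilities for the first few test samples: values near 0 or 1 mean the
# trees agree (high confidence); values near 0.5 mean they disagree (high uncertainty).
print(forest.predict_proba(X_test[:5]))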