Supervised - ML Complete Book
ACADEMIC BACKGROUND
I, Syed Muhammad Awais Raza, am a passionate Bachelor's in Artificial Intelligence student at COMSATS University Islamabad, deeply interested in
Data Science and Machine Learning. I specialize in data analysis, visualization, and Python programming, with proficiency in key libraries such
as Pandas, NumPy, Seaborn, Plotly, and Scikit-learn. My expertise also extends to TensorFlow and PyTorch, enabling me to build advanced
machine learning models.
Author Details
The content is structured to cater to both beginners and intermediate learners, gradually building complexity. Each section is supplemented
with examples, visualizations, and Python code to reinforce learning.
PREREQUISITES
To get the most out of this guide, it is recommended that readers have:
For those new to these topics, introductory resources are provided in early sections to ensure a smooth learning curve.
GUIDE OBJECTIVE
By the end of this guide, readers will:
This guide is aimed at preparing readers to apply these concepts to real-world data problems, advancing their proficiency in Machine Learning
and Data Science.
Table of Contents
1. What is Machine Learning?
What is Traditional Programming
Key Difference Between Traditional Programming and Machine Learning
Importance of Machine Learning
Types of Machine Learning
2. Supervised Machine Learning
Introduction
Types of Supervised Learning Problems
Regression
Classification
3. Key Concepts in Supervised Machine Learning
4. Applications of Supervised Machine Learning
5. Data Preprocessing
6. Supervised ML Models
Models
Other important concepts
7. Linear Regression
Simple linear regression
Multiple Linear Regression
Polynomial Linear Regression
Ridge Regression
Lasso Regression
Logistic Regression
8. Evaluation Metrics
Evaluation metrics for regression
Evaluation metrics for classification
9. Support Vector Machine (SVM)
SVM Regressor
SVM Classifier
10. Parameters of a Model
Parameters used in SVM
11. K- Nearest Neighbors (KNN)
KNN Regressor
KNN classifier
Distances used in KNN
Parameters used in KNN
12. Decision Tree
Important terms
Decision Tree Regressor
Decision Tree Classifier
Splitting criterion
13. Ensemble Algorithms
Bagging
Boosting
Stacking
Blending
14. Bagging
Random Forest
Random Forest Regressor
Random Forest Classifier
15. Boosting
Boosting Algorithms
Adaboost
Adaboost Regressor
Adaboost Classifier
XGBoost
XGBoost Regressor
XGBoost Classifier
CatBoost
Catboost Regressor
Catboost Classifier
16. Hyperparameter Tuning and Cross Validation
Techniques for Hyperparameter Tuning
Grid Search
Random Search
Bayesian Optimization
Cross Validation
K-Fold Cross-Validation
Leave-One-Out Cross-Validation (LOOCV)
Stratified K-Fold Cross-Validation
Time Series Cross-Validation
Group K-Fold Cross-Validation
17. Pipeline
Components
Creating and executing a pipeline in Python
Advantages of using Pipeline
18. Probability
Introduction
Rules of probability
Bayes' Theorem
Application of probability
19. Naive Bayes Algorithm
Types of Naive Bayes
20. Conclusion
1. What is Machine Learning?
Machine Learning allows a machine to learn and reason on its own. It differs from conventional programming in that, rather than writing code line by line and specifying a program's behavior in advance, the machine learns on its own from the data fed into it. How this learning happens is not fixed; it depends on the kind of machine learning that is used.
Traditional programming, by contrast, is characterized by:
Explicit Instructions: Software is developed by writing instructions so comprehensive that the computer follows them to the letter.
Rule-Based Logic: The program works through a fixed procedure, following rules defined by human experts.
Static Behavior: Because the program is not adaptive, manual changes are needed every time something changes.
Unlike Machine Learning, this traditional approach involves no learning from data, so the program does not improve over time.
Problem: We want to check if a number is prime (a prime number is a natural number greater than 1 that cannot be formed by
multiplying two smaller natural numbers).
Traditional Programming Approach: We define a function is_prime() that takes an integer n and returns True if it is prime and
False otherwise. The logic checks divisibility by numbers up to √n for efficiency.
Testing: We test the function on a list of numbers and print whether each is prime or not.
This is a classic example of traditional programming where we write explicit logic to solve a specific problem without any learning or data
involved.
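To make this concrete, here is a minimal sketch of the traditional-programming approach just described; the is_prime() function comes from the text above, while the test values are illustrative.
def is_prime(n):
    """Return True if n is a prime number, False otherwise."""
    if n < 2:
        return False
    # Check divisibility only up to the square root of n for efficiency
    for divisor in range(2, int(n ** 0.5) + 1):
        if n % divisor == 0:
            return False
    return True

# Test the function on a list of numbers
for number in [2, 10, 17, 25, 29]:
    print(f"{number} is prime: {is_prime(number)}")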
# Predict a new sample (sepal length, sepal width, petal length, petal width)
# (assumes `knn` is a trained classifier and `iris` is the loaded Iris dataset; see the sketch below)
new_sample = [[5.0, 3.5, 1.6, 0.2]]
predicted_class = knn.predict(new_sample)
print(f"Predicted class for the new sample: {iris.target_names[predicted_class][0]}")
Explanation:
Problem: We want to classify flowers into one of three species based on their physical features.
Machine Learning Approach: Instead of writing explicit rules, we train a model (k-Nearest Neighbors) using labeled data (the Iris
dataset). The model "learns" the relationship between features (inputs) and species (outputs) during training.
Testing: We test the model on new, unseen data to check its performance (accuracy). The trained model then predicts the class of new
data points (flowers) based on what it has learned.
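Here is a minimal sketch of the full machine learning approach described above, covering loading the labeled Iris data, training the k-Nearest Neighbors model, and testing it on unseen data; the split ratio and the choice of k = 3 are illustrative.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the labeled Iris dataset (features and species labels)
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

# Train a k-Nearest Neighbors classifier on the labeled data
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Evaluate on unseen data
print(f"Test accuracy: {accuracy_score(y_test, knn.predict(X_test)):.2f}")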
Note
In traditional programming, we tell the computer exactly what to do with a set of rules. In machine learning, the
computer finds the rules itself by learning from data.
Note: There is no need to worry about the code yet; we will cover everything step by step as we go.
2. Unsupervised Learning
Unlabeled Data: Works with data that has not been labeled in advance.
Pattern Discovery: Its main goal is to uncover hidden structure or patterns in the data.
Evaluation: Measured by how relevant and interpretable the discovered patterns are.
Examples:
Clustering: Market segmentation.
Dimensionality Reduction: Reducing high-dimensional data, for example with Principal Component Analysis (PCA).
Algorithms:
1. K-Means Clustering
2. Hierarchical Clustering
3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
4. Principal Component Analysis (PCA)
5. Independent Component Analysis (ICA)
3. Reinforcement Learning
Interaction: An agent interacts with an environment.
Rewards and Penalties: The agent learns from rewards or penalties received for the actions it performs.
Exploration vs. Exploitation: Balances trying new strategies with making the most of known ones.
Evaluation: Measured by the cumulative reward accumulated over time.
Examples:
Game Playing: AlphaGo.
Robotics: Robotic arms learning manipulation tasks.
Algorithms:
1. Q-Learning
2. Deep Q-Network (DQN)
3. SARSA (State-Action-Reward-State-Action)
4. Policy Gradient Methods
5. Deep Deterministic Policy Gradient (DDPG)
2. Introduction to Supervised Learning
2.1 What is Supervised Learning?
As noted earlier, supervised learning involves training a model with labeled data. This means that for each input used during training, the
corresponding output or label is known. In its simplest form, the model's purpose is to learn the input-output mapping so that it can correctly
predict the output for previously unseen inputs.
Supervised Learning vs. Unsupervised Learning: In supervised learning the model is given input-output pairs, whereas in unsupervised
learning the model works with inputs only and no output labels are provided.
Supervised Learning vs. Reinforcement Learning: In supervised learning the model is trained on a fixed dataset of inputs and corresponding
outputs, whereas in reinforcement learning an agent interacts with an environment and receives rewards or penalties based on the actions
it performs.
Definition: Regression tasks are those in which the model predicts a continuous numerical value from the provided input features.
Examples:
Predicting House Prices: Based on features such as area, number of rooms, and location, the model estimates the house price.
Forecasting Stock Prices: Using historical stock prices to predict future prices.
Definition: Classification tasks involve predicting discrete labels or categories based on input features.
Examples:
Email Spam Detection: Classifying messages as spam or not spam based on their content.
Image Recognition: Labeling an image, for example as 'cat' or 'dog'.
The Iris dataset is a classic example of classification problems. It contains three classes of iris plants: Setosa , Versicolour , and
Virginica , based on four features— sepal length , sepal width , petal length , and petal width .
3. Key Concepts
Labels: The target outcome or the answer that the model is trying to predict. In supervised learning, labels are the known outputs used to
train the model.
Features: The input variables or attributes used by the model to make predictions. Features are the data points that the model uses to
learn patterns.
Training: The process of teaching the machine learning model using a labeled dataset so that it can learn to make predictions or
decisions. During training, the model adjusts its parameters based on the input features and their corresponding labels.
Testing: The process of evaluating the trained model on a separate dataset to assess its performance and accuracy. Testing helps
determine how well the model generalizes to new, unseen data.
Residuals: The differences between the observed values and the predicted values from a regression model. They represent the portion of
the dependent variable that is not explained by the independent variable(s) in the model. The difference between the actual and predicted
values.
Assumptions (Important)
Assumptions are the conditions or prerequisites that a model's underlying algorithms or statistical methods rely on to produce accurate and
reliable results.
Why is it important to understand the assumptions of each model?
Simplification of Reality: Assumptions make complex real-world problems easier to understand by simplifying the data or relationships.
Model Requirements: They are conditions that must be true for a model or method to work correctly and give valid results.
Accurate Predictions: Assumptions help the model generalize from the data to make accurate predictions for new or unseen data.
Data Relationships: Assumptions often describe the expected relationships between different variables in the data.
Impact of Violations: If assumptions are violated, the model's results may be unreliable or incorrect.
4. Applications of Supervised ML
Disease Diagnosis: Classify medical images or patient data to diagnose diseases.
Predicting Patient Outcomes: Forecast disease progression or patient recovery.
Credit Scoring: Assess creditworthiness based on financial data.
Fraud Detection: Identify fraudulent transactions by analyzing patterns.
Customer Segmentation: Group customers based on purchasing behavior for targeted marketing.
Churn Prediction: Predict which customers are likely to stop using a service.
Recommendation Systems: Suggest products based on past purchase behavior.
Demand Forecasting: Predict future sales for inventory and supply chain management.
5. Data Preprocessing
Before building any model, the first step is to preprocess the data.
import pandas as pd
# Load dataset
df = pd.read_csv('data.csv')
Feature Selection: Choosing the most relevant features to improve model performance.
Feature Engineering: Creating new features from existing data.
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Assuming X is the feature matrix selected from df
# Standard Scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Min-Max Normalization
min_max_scaler = MinMaxScaler()
X_normalized = min_max_scaler.fit_transform(X)
5.2.4 Handling Categorical Data (Encoding)
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
# One-Hot Encoding
one_hot_encoder = OneHotEncoder()
X_encoded = one_hot_encoder.fit_transform(df[['categorical_feature']]).toarray()
# Label Encoding
label_encoder = LabelEncoder()
df['encoded_feature'] = label_encoder.fit_transform(df['categorical_feature'])
6. Supervised Learning Models and Other Concepts That We Will Learn in This Guide:
6.1 Models
1. Linear Regression
2. Simple Linear Regression
3. Multiple Linear Regression
4. Polynomial Linear Regression
5. Ridge Regression
6. Lasso Regression
7. Logistic Regression
8. Support Vector Machine (SVM)
Regressor
Classifier
9. K-Nearest Neighbors (KNN)
Regressor
Classifier
Other concepts that are important to learn:
Euclidean distance
Manhattan distance
Minkowski distance
Hamming distance
10. Decision Tree
Classifier
Regressor
Other concepts that are important to learn:
Entropy
Gini impurity
Information gain
11. Ensemble Algorithms
Bagging:
Random Forest
Boosting:
AdaBoost
Gradient Boosting
XGBoost
LightGBM
CatBoost
Stacking
Blending
12. Naïve Bayes Algorithm
7.1.1 Definition
Linear Regression is a method to predict the value of a dependent variable ( Y ) based on the value(s) of one or more independent variables ( X
).
For a general linear regression model with ( p ) independent variables, the relationship between the dependent variable ( Y ) and the
independent variables ( X_1, X_2, \ldots, X_p ) is expressed as:
Y = β_0 + β_1 X_1 + β_2 X_2 + ⋯ + β_p X_p + ε
Where: Y is the dependent variable, X_1, …, X_p are the independent variables, β_0 is the intercept, β_1, …, β_p are the coefficients, and ε is the error term.
1. Independent Variable (Feature): X represents the number of study hours (independent variable). It is the input variable used to predict the
output (dependent variable).
2. Dependent Variable (Target): y represents the exam score (dependent variable). This is the variable that we are trying to predict based on
the independent variable.
3. Linear Relationship: Linear Regression assumes that there is a linear relationship between the independent and dependent variables. The relationship can be represented by the equation y = mx + b + ε, where m is the slope, b is the intercept, and ε is the error term.
4. Training: During training, the model learns the slope and intercept that best fit the training data.
5. Prediction: After training, the model uses the learned slope and intercept to predict the exam scores for new study hours (X_test).
6. Evaluation Metrics:
Mean Squared Error (MSE): Measures the average of the squared differences between actual and predicted values. Lower MSE
indicates better fit.
R-squared (R^2 Score): Measures how well the independent variable explains the variance in the dependent variable. Ranges from 0
to 1, with values closer to 1 indicating a better fit.
Note: There is no need to worry about evaluation metrics yet; we will study them in detail after the linear regression model.
7.2 Types of Linear Regression:
Simple Linear Regression
Description: Models the relationship between a single independent variable and a dependent variable with a linear equation.
Ridge Regression
Description: Adds L2 regularization to the linear regression model to penalize large coefficients and prevent overfitting.
Lasso Regression
Description: Adds L1 regularization to the linear regression model, which can shrink some coefficients to zero, effectively performing
feature selection and reducing model complexity.
7.3.2 Definition
Simple Linear Regression models the relationship between two variables with the linear equation y = β_0 + β_1 x + ε, where y is the dependent variable, x is the independent variable, β_0 is the intercept, β_1 is the slope, and ε is the error term.
7.3.4 Assumptions
Linearity: The relationship between the dependent and independent variables is linear.
Independence: Residuals are independent of each other.
Homoscedasticity: Residuals have constant variance across all levels of the independent variable.
Normality of Residuals: Residuals are normally distributed.
No Endogeneity: The independent variable is not correlated with the residuals.
# Make predictions
y_pred = model.predict(X)
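The prediction line above assumes a fitted model. Here is a minimal sketch of the study-hours example under that assumption; the synthetic data and values are illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Illustrative data: study hours (X) and exam scores (y)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([35, 45, 50, 58, 62, 70, 78, 85])

# Fit a simple linear regression model (learns the slope and intercept)
model = LinearRegression()
model.fit(X, y)

# Make predictions and evaluate
y_pred = model.predict(X)
print("Slope:", model.coef_[0], "Intercept:", model.intercept_)
print("MSE:", mean_squared_error(y, y_pred), "R^2:", r2_score(y, y_pred))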
7.4.2 Definition
Multiple Linear Regression (MLR) is a method used to model the linear relationship between a dependent variable (also known as the response
or target variable) and multiple independent variables (also called predictors or features). The goal of MLR is to find the best-fitting linear
equation that explains the relationship between the dependent variable and the independent variables.
y = β0 + β1 x1 + β2 x2 + ⋯ + βn xn + ϵ
Where: y is the dependent variable, x_1, …, x_n are the independent variables, β_0 is the intercept, β_1, …, β_n are the coefficients, and ε is the error term.
7.5.2 Definition
Polynomial Regression is a regression technique where the model is represented by a polynomial equation.
For a single independent variable x, the polynomial regression model is expressed as:
y = β_0 + β_1 x + β_2 x² + β_3 x³ + ⋯ + β_n xⁿ + ε
Where: y is the dependent variable, x is the independent variable, β_0, …, β_n are the coefficients of the polynomial terms, n is the degree of the polynomial, and ε is the error term.
7.5.4 Assumptions
Linearity in Parameters: The model is linear in terms of the coefficients, even though the relationship between variables may be
nonlinear.
Independence of Errors: The residuals are independent of each other.
Homoscedasticity: The variance of the residuals is constant across all levels of the independent variables.
No Multicollinearity: Independent variables, including polynomial terms, are not perfectly correlated.
Normality of Errors: The residuals are normally distributed.
Sufficiently Large Sample Size: A larger sample size is generally needed to ensure reliable estimates.
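As a brief illustration of fitting a polynomial model in practice, here is a hedged sketch using scikit-learn's PolynomialFeatures; the degree and the synthetic data are illustrative choices.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Illustrative data with a quadratic relationship plus noise
rng = np.random.RandomState(42)
X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 0.5 * X.ravel() ** 2 + X.ravel() + rng.normal(scale=0.5, size=50)

# Expand the single feature into polynomial terms (x, x^2), then fit a linear model
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
model = LinearRegression().fit(X_poly, y)

print("Coefficients:", model.coef_, "Intercept:", model.intercept_)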
The Lasso (L1) regression cost function is:
Cost Function = RSS + λ ∑_{j=1}^{p} |β_j|
Where:
∑_{j=1}^{p} |β_j| is the L1 norm (sum of absolute values) of the coefficients.
λ ∑_{j=1}^{p} |β_j| penalizes large coefficients, which helps in feature selection and reducing overfitting.
The Ridge (L2) regression cost function is:
Cost Function = RSS + λ ∑_{j=1}^{p} β_j²
Where:
∑_{j=1}^{p} β_j² is the L2 norm (sum of squared values) of the coefficients.
λ ∑_{j=1}^{p} β_j² penalizes large coefficients, shrinking them toward zero and reducing overfitting.
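To show how these penalties are applied in practice, here is a minimal sketch using scikit-learn's Ridge and Lasso estimators; the alpha values (the λ above) and the synthetic data are illustrative.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Illustrative data: 100 samples, 5 features, only two of which matter
rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Ridge (L2 penalty) shrinks coefficients; Lasso (L1 penalty) can zero some out
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("Ridge coefficients:", ridge.coef_)
print("Lasso coefficients:", lasso.coef_)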
1. Probability (Sigmoid Function):
The model estimates the probability of the positive class using the sigmoid function:
P(Y = 1 | X) = 1 / (1 + e^{−(β_0 + β_1 X_1 + β_2 X_2 + ⋯ + β_p X_p)})
2. Odds:
Odds = P(Y = 1 | X) / (1 − P(Y = 1 | X))
3. Logit (Log-Odds):
The logit is the natural log of the odds:
Logit = log( P(Y = 1 | X) / (1 − P(Y = 1 | X)) )
The logit function transforms the probability into a linear relationship with the predictors:
Logit = β_0 + β_1 X_1 + β_2 X_2 + ⋯ + β_p X_p
This linear relationship allows for the estimation of the coefficients βj using maximum likelihood estimation.
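Here is a minimal sketch of fitting a logistic regression classifier and recovering the predicted probabilities described by the equations above; the dataset and settings are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Binary classification dataset (malignant vs. benign tumors)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the model; its coefficients correspond to the beta terms in the logit equation
log_reg = LogisticRegression(max_iter=5000)
log_reg.fit(X_train, y_train)

# P(Y = 1 | X) for the first few test samples
print(log_reg.predict_proba(X_test[:3])[:, 1])
print("Test accuracy:", log_reg.score(X_test, y_test))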
1. Accuracy:
Formula:
Accuracy = (True Positives + True Negatives) / Total Predictions
Explanation: Measures the overall correctness of the model by calculating the proportion of true positive and true negative predictions out of all predictions.
Accuracy in Python:
from sklearn.metrics import accuracy_score
# Assuming y_test contains the true labels and y_pred contains the predicted labels
accuracy = accuracy_score(y_test, y_pred)
2. Precision:
Formula:
Precision = True Positives / (True Positives + False Positives)
Explanation: Indicates the proportion of positive identifications that were actually correct, focusing on the accuracy of positive
predictions.
Precision in Python:
from sklearn.metrics import precision_score
# Assuming y_test contains the true labels and y_pred contains the predicted labels
precision = precision_score(y_test, y_pred, average='binary') # use average='macro' for multiclass
3. Recall (Sensitivity):
Formula:
Recall = True Positives / (True Positives + False Negatives)
Explanation: Measures the ability of the model to identify all relevant instances, focusing on how well the model captures positive
cases.
Recall in Python:
from sklearn.metrics import recall_score
# Assuming y_test contains the true labels and y_pred contains the predicted labels
recall = recall_score(y_test, y_pred, average='binary') # use average='macro' for multiclass
4. F1 Score:
Formula:
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
Explanation: The harmonic mean of precision and recall, providing a balance between the two metrics, especially useful for
imbalanced datasets.
F1 score in Python:
from sklearn.metrics import f1_score
# Assuming y_test contains the true labels and y_pred contains the predicted labels
f1 = f1_score(y_test, y_pred, average='binary') # use average='macro' for multiclass
5. ROC-AUC Score:
Explanation: Measures how well the model separates the classes across all classification thresholds, computed from predicted probabilities.
ROC-AUC in Python:
from sklearn.metrics import roc_auc_score
# Assuming y_test contains the true labels and y_pred_proba contains the predicted probabilities
roc_auc = roc_auc_score(y_test, y_pred_proba)
MAE = (1/n) ∑_{i=1}^{n} |y_i − ŷ_i|
Explanation: Calculates the average absolute difference between actual and predicted values, providing a measure of the average
error magnitude.
To compute the Mean Absolute Error (MAE) in Python, use the following code:
from sklearn.metrics import mean_absolute_error
# Assuming y_test contains the true labels and y_pred contains the predicted values
mae = mean_absolute_error(y_test, y_pred)
MSE = (1/n) ∑_{i=1}^{n} (y_i − ŷ_i)²
Explanation: Computes the average squared difference between actual and predicted values, emphasizing larger errors due to
squaring.
To compute the Mean Squared Error (MSE) in Python, use the following code:
from sklearn.metrics import mean_squared_error
# Assuming y_test contains the true labels and y_pred contains the predicted values
mse = mean_squared_error(y_test, y_pred)
RMSE = √MSE
Explanation: The square root of the mean squared error, providing error magnitude in the same units as the target variable.
To compute the Root Mean Squared Error (RMSE) in Python, use the following code:
import numpy as np
from sklearn.metrics import mean_squared_error
# Assuming y_test contains the true labels and y_pred contains the predicted values
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
To compute the R-squared (coefficient of determination) in Python, you can use the following code:
from sklearn.metrics import r2_score
# Assuming y_test contains the true labels and y_pred contains the predicted values
r_squared = r2_score(y_test, y_pred)
Purpose: Classify data into distinct categories by finding the optimal hyperplane that separates different classes.
Working: Maximizes the margin between classes and uses support vectors (data points closest to the hyperplane) to determine the
optimal boundary.
Kernel Trick: Applies different kernels (linear, polynomial, radial basis function) to handle non-linear data by mapping it to higher
dimensions.
2. SVM as a Regressor (Support Vector Regression - SVR):
Purpose: Predict continuous values by finding a function that approximates the target values within a specified margin of tolerance.
Working: Minimizes the error within a defined margin while allowing some deviations, aiming to fit as many data points as possible
within this margin.
Kernel Trick: Similar to classification, different kernels can be used to handle non-linear relationships between features and target
values.
9.4 SVM Regressor (Support Vector Regression - SVR)
9.4.1 Introduction to SVM Regressor (Support Vector Regression - SVR)
Support Vector Regression (SVR) extends SVM for regression tasks. It finds a function that fits the data within a specified margin of tolerance,
known as the epsilon-insensitive tube. SVR focuses on minimizing the error while allowing some deviations within this margin.
# Making predictions
y_pred = svr.predict(X_test)
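The prediction line above assumes a fitted SVR model. A minimal sketch under that assumption follows; the kernel, C, and epsilon values, as well as the synthetic data, are illustrative.
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Illustrative data: a noisy sine curve
rng = np.random.RandomState(42)
X = np.sort(rng.uniform(0, 5, 200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Epsilon-insensitive tube of width 0.1 around the fitted function
svr = SVR(kernel='rbf', C=10.0, epsilon=0.1)
svr.fit(X_train, y_train)

y_pred = svr.predict(X_test)
print("Test MSE:", mean_squared_error(y_test, y_pred))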
9.5.2 Definition
The SVM classifier works by constructing a hyperplane in a high-dimensional space that separates different classes. The key idea is to
maximize the margin between the hyperplane and the nearest data points from each class, known as support vectors.
Here's the mathematical formulation for the Support Vector Machine (SVM) classifier:
For a linearly separable dataset, the SVM aims to find the optimal hyperplane that maximizes the margin between two classes.
1. Hyperplane Equation: The hyperplane that separates the classes is defined as:
w⋅x+b=0
where w is the weight vector (normal to the hyperplane) and b is the bias term.
2. Margin: The margin is the distance between the hyperplane and the closest data points:
Margin = 2 / ∥w∥
To maximize this margin, we need to minimize ∥w∥² subject to the constraint that all data points are correctly classified.
3. Optimization Problem: The SVM optimization problem can be formulated as a quadratic programming problem:
Minimize (1/2) ∥w∥²
subject to:
yi (w ⋅ xi + b) ≥ 1 for all i
where yi is the class label of the i-th data point, xi is the feature vector of the i-th data point.
For non-linearly separable data, SVM can use kernel functions to transform the data into a higher-dimensional space where a linear separation
is possible.
1. Kernel Trick: Instead of explicitly mapping the data to a higher-dimensional space, SVM uses a kernel function K(xi , xj ) to compute the
dot product in the transformed space. Common kernels include:
linear, polynomial, and radial basis function (RBF) kernels.
2. Dual Optimization Problem: With a kernel, the classifier is trained by solving the dual problem:
Maximize ∑_{i=1}^{N} α_i − (1/2) ∑_{i=1}^{N} ∑_{j=1}^{N} α_i α_j y_i y_j K(x_i, x_j)
subject to:
0 ≤ α_i ≤ C for all i
∑_{i=1}^{N} α_i y_i = 0
1. Linear Separability:
Assumes the data can be separated linearly or can be mapped to a higher-dimensional space where linear separation is possible.
2. Clear Margin of Separation:
Assumes there is a clear margin between classes, which the SVM can maximize to improve classification accuracy.
3. Feature Independence:
Assumes that the features are independent of each other, and each data point is independently and identically distributed (i.i.d.).
4. Feature Scaling:
Assumes that features are on similar scales; otherwise, preprocessing like normalization or standardization is needed.
5. Support Vector Significance:
Assumes that only the support vectors, which are the data points closest to the hyperplane, are crucial in defining the decision
boundary and margin.
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
# Load dataset
data = datasets.load_iris()
X = data.data
y = data.target
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Train the SVM classifier (kernel and C values are illustrative)
clf = SVC(kernel='rbf', C=1.0)
clf.fit(X_train, y_train)
# Make predictions
y_pred = clf.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, target_names=data.target_names)
print(f"Accuracy: {accuracy:.2f}")
print("Classification Report:")
print(report)
Explanation
2. Kernel: Defines the function used to map data into higher dimensions to make it linearly separable. Common kernels include:
4. ϵ (Epsilon): In SVR, it specifies the width of the margin within which no penalty is given for errors. It defines the tube around the
predicted function within which errors are ignored.
5. Degree: (for Polynomial Kernel) Specifies the degree of the polynomial used in the kernel function. Higher values create more complex
decision boundaries.
11. K-Nearest Neighbors (KNN)
11.1 Introduction to K-Nearest Neighbors (KNN)
K-Nearest Neighbors (KNN) is a simple, instance-based learning algorithm used for both classification and regression tasks. It operates on the
principle that similar data points are located near each other in feature space.
1. Distance Calculation: Compute the distance between the query point x and each training point x_i, typically using the Euclidean distance:
d(x, x_i) = √( ∑_{j=1}^{p} (x_j − x_{ij})² )
where p is the number of features, and xj and xij are the feature values of x and xi , respectively.
2. Finding Nearest Neighbors: Identify the k nearest neighbors based on the smallest distances.
3. Regression Prediction: Predict the target value for x as the average of the target values of its k nearest neighbors:
ŷ = (1/k) ∑_{i=1}^{k} y_i
# Making predictions
y_pred = knn_regressor.predict(X_test)
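The snippet above assumes a fitted knn_regressor. Here is a minimal sketch under that assumption; the value of k and the synthetic data are illustrative.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Illustrative one-dimensional regression data
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 10, 150)).reshape(-1, 1)
y = 2 * X.ravel() + rng.normal(scale=1.0, size=150)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Predict each target as the average of its 5 nearest neighbors
knn_regressor = KNeighborsRegressor(n_neighbors=5)
knn_regressor.fit(X_train, y_train)

y_pred = knn_regressor.predict(X_test)
print("Test MSE:", mean_squared_error(y_test, y_pred))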
1. Distance Calculation: Compute the distance between the query point x and each training point x_i, typically using the Euclidean distance:
d(x, x_i) = √( ∑_{j=1}^{p} (x_j − x_{ij})² )
where p is the number of features, and xj and xij are the feature values of x and xi , respectively.
2. Finding Nearest Neighbors: Identify the k nearest neighbors based on the smallest distances.
3. Classification Decision: Assign the class label to x by majority voting among the k nearest neighbors:
y^ = mode(y1 , y2 , … , yk )
# Making predictions
y_pred = knn_classifier.predict(X_test)
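The snippet above assumes a fitted knn_classifier. A minimal sketch under that assumption follows; the dataset and the value of k are illustrative.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Iris classification data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Assign each sample the majority class among its 5 nearest neighbors
knn_classifier = KNeighborsClassifier(n_neighbors=5)
knn_classifier.fit(X_train, y_train)

y_pred = knn_classifier.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))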
a. Euclidean Distance Calculates the straight-line distance between two points:
d(x, x_i) = √( ∑_{j=1}^{p} (x_j − x_{ij})² )
where xj and xij are feature values of x and xi , and p is the number of features.
b. Manhattan Distance Calculates the sum of the absolute differences between corresponding feature values:
d(x, x_i) = ∑_{j=1}^{p} |x_j − x_{ij}|
c. Minkowski Distance Generalizes Euclidean and Manhattan distances with a parameter p to control the distance metric:
d(x, x_i) = ( ∑_{j=1}^{p} |x_j − x_{ij}|^p )^{1/p}
d. Hamming Distance Measures the number of differing elements between two categorical or binary vectors:
d(x, x_i) = ∑_{j=1}^{p} 1(x_j ≠ x_{ij})
where 1(xj ≠ xij ) is 1 if xj and xij are different, and 0 if they are the same.
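To make these formulas concrete, here is a small sketch computing each distance between two example vectors using NumPy and SciPy; the vectors themselves are illustrative.
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
x_i = np.array([4.0, 0.0, 3.0])

print("Euclidean:", distance.euclidean(x, x_i))           # square root of the sum of squared differences
print("Manhattan:", distance.cityblock(x, x_i))           # sum of absolute differences
print("Minkowski (p=3):", distance.minkowski(x, x_i, 3))  # generalizes the two above
print("Hamming:", distance.hamming([1, 0, 1], [1, 1, 1]) * 3)  # count of differing positions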
11.7 Parameters Used in KNN
1. k (Number of Neighbors)
Description: The number of neighboring points considered when making a prediction. Small values can overfit; large values can oversmooth.
2. Distance Metric
Description: The function used to calculate the distance between data points.
Common Types: Euclidean, Manhattan, Minkowski, and Hamming distances.
By adjusting the p values, different distances can be utilized by the model for improved accuracy:
Euclidean Distance: Use p = 2
Manhattan Distance: Use p = 1
Minkowski Distance: p can be any positive value
Hamming Distance: Not influenced by p
3. Weight Function
Description: Determines how the distance affects the contribution of neighbors to the prediction.
Options:
Uniform: All neighbors contribute equally.
Distance: Neighbors closer to the query point have a higher influence, typically inversely proportional to their distance.
12. Decision Tree
12.1 Introduction to Decision Tree
A Decision Tree is a versatile and intuitive machine learning model used for both classification and regression tasks. It splits the data into
subsets based on feature values, forming a tree-like structure of decisions and their possible consequences.
The goal is to split the data in a way that maximizes the information gain or minimizes impurity at each node.
1. Root Node:
Description: The top node of the tree from which all branches originate. It represents the entire dataset.
2. Leaf Node:
Description: The terminal nodes of the tree where predictions are made. In regression, it represents the average target value; in
classification, it represents the majority class.
3. Internal Node:
Description: Nodes between the root and leaf nodes that represent decisions based on feature values.
4. Branch:
Description: The connection between nodes, representing the outcome of a decision or split.
5. Split:
Description: The process of dividing the dataset into subsets based on feature values.
6. Pruning:
Description: The process of removing branches from the tree to prevent overfitting and improve generalization.
7. Depth:
Description: The number of levels in the tree from the root node to the deepest leaf node.
8. Feature Importance:
Description: A measure of how much a feature contributes to the decision-making process in the tree.
1. Variance:
Description: Variance is a statistical measure of the dispersion or spread of a set of values. It quantifies how much the values differ from their mean.
Mathematical Formula:
Var(X) = (1/n) ∑_{i=1}^{n} (x_i − x̄)²
where x1 , x2 , … , xn are the values, x̄ is the mean, and n is the number of values.
2. Splitting Criterion
The goal is to minimize the variance of target values within each subset. For a split s that divides the data into subsets A and B, the
variance reduction ΔVar is given by:
ΔVar = Var(Y) − ( |A|/(|A|+|B|) · Var(Y_A) + |B|/(|A|+|B|) · Var(Y_B) )
where Var(Y ) is the variance of the target variable Y in the original dataset, and Var(YA ) and Var(YB ) are the variances in subsets
A and B, respectively.
3. Prediction
The prediction for a data point x is the mean of the target values of the data points in the leaf node where x falls:
ŷ = (1/|L|) ∑_{i∈L} y_i
where L is the set of training samples in the leaf node, and yi are their target values.
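Below is a small sketch that evaluates the variance reduction ΔVar for one candidate split using NumPy; the feature values, targets, and threshold are illustrative.
import numpy as np

# Illustrative target values and a candidate split on a single feature
feature = np.array([1, 2, 3, 4, 5, 6])
target = np.array([1.0, 1.2, 1.1, 3.0, 3.2, 2.9])
threshold = 3.5

left = target[feature <= threshold]
right = target[feature > threshold]

# Weighted variance after the split, compared with the parent variance
n = len(target)
weighted_var = (len(left) / n) * left.var() + (len(right) / n) * right.var()
delta_var = target.var() - weighted_var
print("Variance reduction:", delta_var)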
1. Feature Independence:
Assumes that features are independent and do not interact with each other in affecting the target variable.
2. Linear Relationships:
Assumes that the target variable can be approximated well by linear splits in the feature space, though it can model non-linear
relationships as well.
3. Feature Relevance:
Assumes that relevant features are used for splitting; irrelevant features can reduce model performance.
4. Overfitting:
Assumes that the model might overfit the training data if not properly pruned or regularized.
# Making predictions
y_pred = dt_regressor.predict(X_test)
We generate synthetic data where the target values have a quadratic relationship with the feature values.
We split the data into training and test sets.
We create a DecisionTreeRegressor model and fit it to the training data.
We make predictions on the test data and evaluate the model using Mean Squared Error (MSE) and R^2 Score.
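The prediction snippet above is only a fragment; here is a hedged reconstruction of the full example described in the list above, with illustrative synthetic data and settings.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic data with a quadratic relationship between feature and target
rng = np.random.RandomState(42)
X = np.sort(rng.uniform(-3, 3, 200)).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(scale=0.3, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Limit depth to reduce overfitting (illustrative choice)
dt_regressor = DecisionTreeRegressor(max_depth=4, random_state=42)
dt_regressor.fit(X_train, y_train)

y_pred = dt_regressor.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))
print("R^2 Score:", r2_score(y_test, y_pred))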
12.6 Decision Tree Classifier
12.6.1 Introduction to Decision Tree Classifier
A Decision Tree Classifier is a type of supervised learning model used for classifying data into distinct categories. It operates by splitting the
dataset into subsets based on feature values, leading to a tree-like structure where each branch represents a decision rule.
1. Gini Impurity:
Description: Measures the impurity of a node in classification trees. It calculates the likelihood of incorrect classification if a randomly chosen element is labeled according to the distribution of labels in the node.
Mathematical Formula:
Gini = 1 − ∑_{i=1}^{C} p_i²
where pi is the probability of a randomly chosen element being classified correctly for class i out of C total classes.
2. Entropy:
Description: Measures the disorder or randomness in classification trees. It quantifies the amount of uncertainty or surprise in the
node’s class distribution.
Mathematical Formula:
Entropy = − ∑_{i=1}^{C} p_i log₂(p_i)
where pi is the proportion of instances of class i in the node.
3. Information Gain:
Description: Quantifies the reduction in uncertainty about the target variable after a split in classification trees. It assesses how much
information is gained by dividing the data based on a particular feature.
Mathematical Formula:
Information Gain = Entropy(parent node) − ∑_{j=1}^{k} (n_j / n) × Entropy(child node_j)
where nj is the number of instances in child node j, and n is the total number of instances in the parent node.
4. Decision Rule:
The decision rule for splitting is based on selecting the feature and split point that maximizes the Information Gain or minimizes the
Gini Impurity.
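As a quick numerical illustration of these criteria, the sketch below computes Gini impurity, entropy, and the information gain of a split with NumPy; the class labels and the split are illustrative.
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left, right = np.array([0, 0, 0, 1]), np.array([0, 1, 1, 1])

info_gain = entropy(parent) - (len(left) / len(parent)) * entropy(left) \
            - (len(right) / len(parent)) * entropy(right)
print("Gini(parent):", gini(parent))
print("Entropy(parent):", entropy(parent))
print("Information gain of the split:", info_gain)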
1. Feature Independence:
Assumes that features are independent and do not interact with each other.
2. Balanced Classes:
Assumes that the class distribution is balanced, although the tree can handle imbalances to some extent.
# Making predictions
y_pred = dt_classifier.predict(X_test)
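The prediction line above assumes a fitted dt_classifier; a minimal sketch under that assumption follows, with the dataset, criterion, and depth as illustrative choices.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Split on Gini impurity; limit depth to avoid overfitting
dt_classifier = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)
dt_classifier.fit(X_train, y_train)

y_pred = dt_classifier.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))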
4. Improved Generalization:
Enhances the ability to generalize to new data by reducing bias and variance.
5. Flexibility:
Can be applied to various base models and handles different data types effectively.
6. Error Reduction:
Reduces both systematic and random errors, leading to more stable outcomes.
14. Random Forest: A Bagging Algorithm
14.1 Introduction
Random Forest is an ensemble learning method that operates as a bagging algorithm to improve classification and regression tasks by
combining the predictions of multiple decision trees.
14.2 Explanation
Training: Random Forest builds a collection of decision trees, each trained on a random subset of the data and features. Each tree is
constructed using a different bootstrap sample (random sampling with replacement) and a random subset of features at each split.
Prediction: The final prediction is made by aggregating the predictions of all individual trees. For classification, predictions are combined
by majority vote, while for regression, predictions are averaged. This approach enhances accuracy, reduces overfitting, and increases
robustness.
1. Random Forest Classifier:
Description: Utilizes an ensemble of decision trees to classify data into discrete categories. Each tree casts a vote for a class, and the class with the majority vote is selected as the final prediction.
Example: Classifying emails as spam or not spam.
2. Random Forest Regressor:
Description: Uses an ensemble of decision trees to predict continuous values. Each tree provides a prediction, and the final
prediction is the average of all individual tree predictions.
Example: Predicting house prices based on features like location and size.
14.4 Random Forest Regressor
14.4.1 Introduction
The Random Forest Regressor is an ensemble learning method designed for regression tasks. It builds multiple decision trees to predict
continuous outcomes and combines their predictions to enhance accuracy and robustness.
14.4.2 Explanation
Training: The Random Forest Regressor creates B decision trees using different bootstrap samples of the data. Each tree is trained on a
random subset of features at each split, which helps in reducing variance and overfitting.
Prediction: For each new input x, the regressor obtains predictions from each tree Tb . The final prediction is computed by averaging the
outputs of all the trees. This averaging smooths out predictions and improves overall accuracy.
Benefits: By combining the predictions of multiple trees, the Random Forest Regressor reduces the impact of noise and variance in the
data, leading to more stable and reliable predictions.
ŷ_b = T_b(x)
14.4.4 Assumptions
1. Feature Independence:
Assumes that the features used for splitting nodes are independent, though it can handle correlated features reasonably well.
2. No Assumption of Linearity:
Does not require a linear relationship between the features and the target variable.
3. Large Sample Size:
Assumes that the dataset is sufficiently large to build multiple robust trees and achieve stable predictions.
4. Diverse Trees:
Assumes that individual decision trees are diverse, which is encouraged by using different subsets of features and data.
5. Complete Data:
Assumes that the data used to train the model is complete and free from missing values, though it can handle missing data to some
extent through imputation or other methods.
# Making predictions
y_pred = rf_regressor.predict(X_test)
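The prediction snippet above assumes a trained rf_regressor. Here is a minimal sketch under that assumption; the number of trees and the synthetic data are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Illustrative synthetic regression data with 4 features
rng = np.random.RandomState(42)
X = rng.randn(300, 4)
y = 2 * X[:, 0] - X[:, 1] + 0.5 * X[:, 2] ** 2 + rng.normal(scale=0.2, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Average the predictions of 100 bootstrapped trees
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)
rf_regressor.fit(X_train, y_train)

y_pred = rf_regressor.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))
print("R^2 Score:", r2_score(y_test, y_pred))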
14.5.2 Definition
The Random Forest classifier operates by creating an ensemble of decision trees using bootstrap sampling (random sampling with
replacement) and random feature selection at each split. Each tree in the forest makes a prediction, and the final classification is determined by
majority voting among all the trees.
1. Bootstrap Sampling:
For each tree, a bootstrap sample of the dataset is created by randomly sampling with replacement. This means some observations may be repeated, and some may be omitted.
2. Tree Construction:
Each decision tree is built using a random subset of features at each split. The goal is to find the best split that maximizes the
reduction in impurity (e.g., Gini impurity or entropy).
3. Voting Mechanism:
For classification tasks, each tree in the forest predicts a class label. The final prediction is determined by majority voting among all
trees:
ŷ = mode({y_i}_{i=1}^{T})
where ŷ is the final prediction, T is the number of trees, and y_i is the prediction from the i-th tree.
4. Impurity Measurement:
Commonly used impurity measures include:
Gini Impurity:
Gini = 1 − ∑_{k=1}^{K} p_k²
Entropy:
Entropy = − ∑_{k=1}^{K} p_k log₂(p_k)
14.5.4 Assumptions
1. Independence of Trees:
Assumes that individual decision trees are diverse and independent, which is achieved by random sampling and feature selection.
2. Sufficient Data:
Assumes there is enough data to create multiple robust trees. Each tree should be trained on a representative sample of the data.
3. Feature Randomization:
Assumes that randomly selecting subsets of features at each split will improve generalization by reducing correlation between trees.
4. Bootstrap Sampling:
Assumes that bootstrap samples (samples with replacement) are representative of the overall dataset and provide a diverse basis for
training trees.
5. Aggregation:
Assumes that aggregating predictions from multiple trees (via majority voting) will yield a more accurate and stable model compared
to individual decision trees.
# Making predictions
y_pred = rf_classifier.predict(X_test)
# Evaluating the model
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
In this code:
We generate synthetic data with 5 features and binary target values (0 or 1).
We split the data into training and test sets.
We create a RandomForestClassifier model with 100 trees and fit it to the training data.
We make predictions on the test data and evaluate the model using accuracy, precision, recall, and F1 score metrics.
15. Boosting Algorithms
Boosting is an ensemble technique that combines multiple weak learners sequentially, with each new model focusing on correcting the errors
of the previous ones. This iterative approach enhances predictive accuracy and robustness by reducing bias and variance.
1. AdaBoost (Adaptive Boosting):
Description: AdaBoost adjusts the weights of incorrectly classified instances in the training set, giving more importance to misclassified examples in subsequent models. The final model is a weighted combination of all the weak learners.
Key Points:
Iteratively focuses on errors made by previous models.
Combines weak learners to form a strong classifier.
Typically uses decision stumps (shallow trees) as base learners.
2. Gradient Boosting Machines (GBM):
Description: GBM builds models sequentially where each new model corrects the errors of the previous ones by minimizing a loss
function using gradient descent. This technique improves the model iteratively.
Key Points:
Uses gradient descent to optimize the loss function.
Each model is trained to correct the residuals of the previous model.
Can handle various loss functions, including regression and classification.
3. XGBoost (Extreme Gradient Boosting):
Description: XGBoost is an optimized and scalable variant of gradient boosting. It incorporates regularization to reduce overfitting
and is designed to be highly efficient and flexible.
Key Points:
Includes L1 and L2 regularization to control overfitting.
Uses a distributed and parallelized computation for faster training.
Provides high performance and is widely used in competitive machine learning.
4. LightGBM (Light Gradient Boosting Machine):
Description: LightGBM is a fast, distributed gradient boosting framework that employs histogram-based techniques to improve
computational efficiency and scalability.
Key Points:
Uses histogram-based algorithms for faster computation.
Handles large datasets and high-dimensional features efficiently.
Provides support for categorical features without needing explicit encoding.
5. CatBoost (Categorical Boosting):
Description: CatBoost is designed to handle categorical features effectively and uses symmetric trees to improve boosting with
gradient descent. It is optimized to work with a variety of data types and provides robust performance.
Key Points:
Handles categorical features directly using advanced encoding techniques.
Uses symmetric trees to reduce the complexity of boosting.
Includes built-in support for various data preprocessing tasks.
These boosting algorithms are widely used for their ability to improve model performance and handle complex data patterns. Each has its
strengths and is chosen based on specific requirements such as speed, handling of categorical features, and scalability.
15.2 AdaBoost: Introduction and Definition
15.2.1 Introduction
AdaBoost, short for Adaptive Boosting, is a popular boosting algorithm that combines multiple weak classifiers to form a strong classifier by
focusing on the mistakes of previous models.
15.2.2 Definition
AdaBoost works by training a series of weak classifiers sequentially, with each subsequent classifier correcting the errors made by the previous
ones. It adjusts the weights of misclassified data points to emphasize them in future iterations and combines the predictions of all classifiers to
improve overall accuracy and robustness.
1. AdaBoost Classifier:
Description: Uses weak classifiers to boost classification performance by focusing on the errors of previous classifiers and combining their outputs through weighted voting.
2. AdaBoost Regressor:
Description: Applies the AdaBoost algorithm to regression tasks, improving weak regression models by focusing on errors and
combining their predictions to enhance accuracy.
15.3 AdaBoost Regressor: Introduction and Explanation
15.3.1 Introduction
AdaBoost Regressor is a variant of the AdaBoost algorithm tailored for regression tasks. It combines multiple weak regression models to create
a strong model that improves prediction accuracy by iteratively focusing on errors.
15.3.2 Explanation
Training Process:
Compute the residuals of the current weak regressor f_m:
r_i = y_i − f_m(x_i)
Where y_i is the true target value and f_m(x_i) is the predicted value of the weak regressor.
Compute the weighted error ε_m of the weak regressor:
ε_m = ( ∑_i w_i · |r_i| ) / ( ∑_i w_i )
Compute the weight α_m of the weak regressor:
α_m = (1/2) · log( (1 − ε_m) / ε_m )
Update the sample weights to emphasize poorly predicted points, then normalize them:
w_i ← w_i · exp(α_m · |r_i|)
w_i ← w_i / ∑_i w_i
3. Final Prediction:
Combine the predictions from all weak regressors using their weights αm :
ŷ = ∑_{m=1}^{M} α_m f_m(x)
1. Weak Learners:
Assumes that weak regression models (e.g., decision stumps) are used and can be improved by focusing on their errors.
2. Sequential Learning:
Assumes that models can be improved iteratively by focusing on previously mispredicted data points.
3. Feature Independence:
Assumes that features are not necessarily independent, but the algorithm can handle correlated features.
4. Additive Model:
Assumes that the final model's performance improves additively as weak models are combined.
5. Data Distribution:
Assumes the data is reasonably representative of the problem space, and the model may require a sufficiently large dataset to
capture complex patterns.
# Making predictions
y_pred = ada_boost_regressor.predict(X_test)
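The prediction line above assumes a fitted ada_boost_regressor; a minimal sketch under that assumption follows, with illustrative data and parameter values.
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Illustrative noisy nonlinear data
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 6, 300)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Sequentially combine shallow trees, re-weighting poorly predicted points
ada_boost_regressor = AdaBoostRegressor(n_estimators=100, learning_rate=0.5, random_state=0)
ada_boost_regressor.fit(X_train, y_train)

y_pred = ada_boost_regressor.predict(X_test)
print("Test MSE:", mean_squared_error(y_test, y_pred))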
15.4.2 Definition
AdaBoost works by training a series of weak classifiers sequentially. Each classifier is trained to correct the errors made by the previous ones.
The algorithm adjusts the weights of misclassified data points to emphasize them in future iterations. Finally, the predictions of all classifiers
are combined using weighted voting to improve overall accuracy and robustness.
1. Weighted Error: For each weak classifier h_t, compute the weighted error:
ε_t = ( ∑_{i=1}^{N} w_i · I(y_i ≠ h_t(x_i)) ) / ( ∑_{i=1}^{N} w_i )
2. Classifier Weight and Weight Update: Compute the weight of the weak classifier:
α_t = (1/2) · ln( (1 − ε_t) / ε_t )
The sample weights are then increased for misclassified points and normalized so that they sum to one:
w_i ← w_i / ∑_{j=1}^{N} w_j
3. Final Model: The final classifier H(x) is a weighted vote of all weak classifiers:
H(x) = sign( ∑_{t=1}^{T} α_t h_t(x) )
15.4.4 Assumptions
1. Weak Learners: AdaBoost assumes that weak classifiers are available, which are slightly better than random guessing.
2. Additive Model: The model assumes that combining multiple weak classifiers can improve overall performance.
3. Sequential Training: The algorithm assumes that sequentially training classifiers and adjusting weights based on errors is effective in
reducing bias and variance.
4. Misclassified Emphasis: AdaBoost assumes that focusing more on misclassified examples will improve the classifier's accuracy.
5. Data Independence: The algorithm assumes that each weak classifier’s errors are somewhat independent of others.
# Making predictions
y_pred = ada_boost_classifier.predict(X_test)
y_pred_proba = ada_boost_classifier.predict_proba(X_test)[:, 1]
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
print("ROC-AUC Score:", roc_auc)
In this code:
15.5.2 Explanation:
Gradient Boosting Framework:
XGBoost builds on the gradient boosting framework by creating an ensemble of decision trees where each tree corrects the errors of
the previous ones.
It uses gradient descent to minimize a loss function, iteratively improving model performance.
Optimizations: XGBoost adds L1 and L2 regularization and parallelized, distributed computation on top of the standard gradient boosting framework, which makes it efficient and scalable.
1. XGBoost Regressor:
Description: Used for regression tasks, predicting continuous values by minimizing a regression-specific loss function with gradient boosting.
2. XGBoost Classifier:
Description: Applied to classification tasks, predicting discrete class labels by optimizing a classification-specific loss function with
gradient boosting.
15.6 XGBoost Regressor: Introduction and Definition
15.6.1 Introduction: XGBoost Regressor is a powerful machine learning model based on the gradient boosting framework, specifically
designed for regression tasks. It is known for its high efficiency, accuracy, and ability to handle large datasets with numerous features.
15.6.2 Definition: XGBoost Regressor builds an ensemble of decision trees, where each tree is trained to minimize the residual errors of the
previous trees using gradient descent. By optimizing a loss function iteratively and incorporating regularization, XGBoost Regressor produces
robust predictions for continuous target variables.
1. Objective Function: XGBoost minimizes a regularized objective:
Objective = ∑_{i=1}^{n} L(y_i, ŷ_i) + ∑_{k=1}^{K} Ω(f_k)
L(yi , y^i ): The loss function, typically mean squared error (MSE) for regression, which measures the difference between the predicted
value y^i and the actual value yi .
Ω(fk ): The regularization term, which controls the complexity of the model by penalizing the number of leaves in the trees and the
magnitude of leaf weights.
2. Prediction at Step m: The prediction for a data point xi after m iterations is given by:
ŷ_i^{(m)} = ∑_{k=1}^{m} f_k(x_i)
Where fk (xi ) represents the output of the k-th tree for input xi .
3. Gradient and Hessian Calculation: For each iteration m, the gradients g_i^{(m)} and Hessians h_i^{(m)} are calculated for each data point:
g_i^{(m)} = ∂L(y_i, ŷ_i^{(m−1)}) / ∂ŷ_i^{(m−1)}
h_i^{(m)} = ∂²L(y_i, ŷ_i^{(m−1)}) / ∂(ŷ_i^{(m−1)})²
These values are used to update the model in the direction that minimizes the loss.
4. Tree Structure and Leaf Values: The structure of each tree is determined by finding the optimal split points for the features, and the
value of each leaf wj is calculated by:
w_j = − ( ∑_{i∈leaf j} g_i^{(m)} ) / ( ∑_{i∈leaf j} h_i^{(m)} + λ )
5. Final Prediction: After M boosting rounds, the final prediction is:
ŷ_i = ∑_{k=1}^{M} f_k(x_i)
1. Additive Model:
Assumes that the final model can be constructed as an additive combination of weak learners (decision trees).
2. Independence of Residuals:
Assumes that the residuals (errors) of the model are independent and that each tree in the ensemble corrects these residuals.
3. No Multicollinearity:
Assumes that the features are not highly collinear, as multicollinearity can affect the importance of features and lead to less reliable
predictions.
4. Sufficient Data:
Assumes that the dataset is large and representative, allowing the model to capture complex patterns without overfitting.
5. Stationarity:
Assumes that the underlying data distribution is stable over time, which is crucial for the model's generalization to new data.
# Making predictions
y_pred = xgboost_regressor.predict(X_test)
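The prediction line above assumes a fitted xgboost_regressor. A minimal sketch follows; it assumes the separate xgboost package is installed, and the parameter values and synthetic data are illustrative.
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Illustrative synthetic regression data
rng = np.random.RandomState(42)
X = rng.randn(500, 5)
y = 3 * X[:, 0] + X[:, 1] ** 2 - 2 * X[:, 2] + rng.normal(scale=0.3, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Gradient-boosted trees minimizing squared error with regularization
xgboost_regressor = xgb.XGBRegressor(n_estimators=200, learning_rate=0.1, max_depth=3)
xgboost_regressor.fit(X_train, y_train)

y_pred = xgboost_regressor.predict(X_test)
print("Test MSE:", mean_squared_error(y_test, y_pred))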
15.7.2 Definition
XGBoost is a gradient boosting algorithm that builds an ensemble of decision trees sequentially. Each new tree aims to correct the errors of
the previous trees by minimizing a specific loss function. It uses gradient descent to optimize the model and incorporates regularization
techniques to control model complexity and improve performance.
1. Objective Function: The regularized objective is:
Obj(θ) = ∑_{i=1}^{n} L(y_i, ŷ_i) + ∑_{k=1}^{K} Ω(f_k)
where Ω(f_k) is the regularization term for each tree f_k, and θ represents all model parameters.
2. Loss Function: The loss function measures the difference between predicted values y^i and true values yi . Common choices include mean
squared error for regression and log loss for classification.
3. Regularization Term: XGBoost uses L1 (Lasso) and L2 (Ridge) regularization to penalize complex models and avoid overfitting:
Ω(f) = γT + (1/2) λ ∑_{j=1}^{T} ∥w_j∥²
where T is the number of leaves in the tree, γ and λ are regularization parameters, and wj represents the weights of the leaves.
4. Gradient Boosting: During each boosting iteration, XGBoost minimizes the loss function using the gradient descent approach. The new
tree is added to the ensemble to correct the residual errors of the previous model:
F_{t+1}(x) = F_t(x) + η · h_{t+1}(x)
where η is the learning rate and h_{t+1}(x) is the new tree.
15.7.4 Assumptions
1. Weak Learners: XGBoost assumes that decision trees (or other base models) are weak learners that can be improved through boosting.
2. Additive Model: The algorithm assumes that adding new models sequentially will enhance overall performance.
3. Gradient Descent: XGBoost relies on gradient descent to optimize the loss function and adjust the model parameters.
4. Regularization: The algorithm assumes that incorporating regularization will help control model complexity and reduce overfitting.
5. Data Scalability: XGBoost assumes that it can efficiently handle large-scale datasets and complex models.
# Making predictions
y_pred = xgboost_classifier.predict(X_test)
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
In this code:
Point to know: You will not see much difference between the XGBoost regressor and classifier in theory, but in practical coding you will see a difference when initializing the model:
In Regressor: xgb.XGBRegressor
In classifier: xgb.XGBClassifier
15.8.2 Explanation: CatBoost builds an ensemble of decision trees using gradient boosting, where each tree corrects the errors of its
predecessors. It incorporates techniques like symmetric trees and ordered boosting to reduce overfitting and improve predictive accuracy. Its
unique handling of categorical variables involves transforming them into numerical values using target statistics, making it particularly
effective for datasets with a lot of categorical features.
CatBoost Classifier:
Description: A variant of CatBoost designed for classification tasks, predicting categorical outcomes by optimizing a classification-specific loss function.
CatBoost Regressor:
Description: A variant of CatBoost used for regression tasks, predicting continuous values by minimizing regression-specific loss
functions.
15.9 CatBoost Regressor: Introduction and Explanation
15.9.1 Introduction: CatBoost Regressor is an advanced machine learning model developed by Yandex for regression tasks. It excels at
handling datasets with categorical features and provides high performance through sophisticated gradient boosting methods.
15.9.2 Explanation: CatBoost Regressor builds an ensemble of decision trees using gradient boosting to predict continuous target values. It
optimizes regression-specific loss functions, such as mean squared error (MSE) or mean absolute error (MAE), to minimize prediction errors.
CatBoost’s unique features include effective handling of categorical variables through target encoding and the use of ordered boosting to
prevent overfitting, making it well-suited for diverse and complex datasets.
1. Objective Function: CatBoost minimizes a regularized objective:
Objective = ∑_{i=1}^{n} L(y_i, ŷ_i) + ∑_{k=1}^{K} Ω(f_k)
L(yi , y^i ): The loss function for regression, such as mean squared error (MSE) or mean absolute error (MAE), which measures the
difference between the predicted value y^i and the actual target yi .
Ω(fk ): The regularization term, which penalizes the complexity of each decision tree to prevent overfitting.
2. Gradient and Hessian Calculation: For each iteration m, the gradients g_i^{(m)} and Hessians h_i^{(m)} are calculated as:
g_i^{(m)} = ∂L(y_i, ŷ_i^{(m−1)}) / ∂ŷ_i^{(m−1)}
h_i^{(m)} = ∂²L(y_i, ŷ_i^{(m−1)}) / ∂(ŷ_i^{(m−1)})²
These values guide the construction of the next tree to reduce the residuals.
3. Tree Construction: The structure of each decision tree fk is optimized to minimize the loss function:
f_k = argmin_f [ ∑_{i=1}^{n} (g_i^{(m)} − f(x_i))² + λ · Complexity(f) ]
4. Final Prediction:
ŷ_i = ∑_{k=1}^{M} f_k(x_i)
This combines the outputs from all decision trees in the ensemble to produce the final predicted value for each data point.
2. Independent Residuals: Assumes that the residuals or errors from previous iterations are independent and can be corrected by
subsequent trees.
3. Sufficiently Large and Diverse Dataset: Assumes that the dataset is large and diverse enough to capture the underlying patterns and
avoid overfitting.
4. Categorical Feature Handling: Assumes that categorical features are effectively encoded using techniques such as target encoding to
improve model performance.
5. Stationary Data Distribution: Assumes that the data distribution remains stable over time, allowing the model to generalize well to
future data.
# Making predictions
y_pred = catboost_regressor.predict(X_test)
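The prediction line above assumes a fitted catboost_regressor. Here is a minimal sketch; it assumes the separate catboost package is installed, and the parameters and synthetic data are illustrative.
import numpy as np
from catboost import CatBoostRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Illustrative synthetic regression data
rng = np.random.RandomState(0)
X = rng.randn(500, 5)
y = 2 * X[:, 0] - X[:, 3] + rng.normal(scale=0.2, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Gradient boosting with ordered boosting; verbose=0 silences training output
catboost_regressor = CatBoostRegressor(iterations=200, learning_rate=0.1, depth=4, verbose=0)
catboost_regressor.fit(X_train, y_train)

y_pred = catboost_regressor.predict(X_test)
print("Test MSE:", mean_squared_error(y_test, y_pred))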
15.10.2 Explanation: CatBoost Classifier builds an ensemble of decision trees using gradient boosting, where each tree improves upon the
errors of its predecessors. It leverages techniques like target encoding to manage categorical variables and ordered boosting to prevent
overfitting. The model optimizes a classification-specific loss function, such as logarithmic loss, to predict categorical outcomes with high
precision and efficiency.
1. Objective Function: The model minimizes a regularized objective:
Objective = ∑_{i=1}^{n} L(y_i, ŷ_i) + ∑_{k=1}^{K} Ω(f_k)
L(yi , y^i ): The classification loss function, such as logarithmic loss (log loss) or cross-entropy loss, which measures the difference
between the predicted probability y^i and the actual class yi .
Ω(fk ): The regularization term that penalizes the complexity of each decision tree to avoid overfitting.
2. Gradient and Hessian Calculation: For each iteration m, the gradients g_i^{(m)} and Hessians h_i^{(m)} are computed as:
g_i^{(m)} = ∂L(y_i, ŷ_i^{(m−1)}) / ∂ŷ_i^{(m−1)}
h_i^{(m)} = ∂²L(y_i, ŷ_i^{(m−1)}) / ∂(ŷ_i^{(m−1)})²
These gradients and Hessians guide the construction of the next tree to minimize the classification error.
3. Tree Construction: Each decision tree fk is optimized to minimize the classification loss:
f_k = argmin_f [ ∑_{i=1}^{n} (g_i^{(m)} − f(x_i))² + λ · Complexity(f) ]
4. Final Prediction:
ŷ_i = σ( ∑_{k=1}^{M} f_k(x_i) )
Here, σ is the sigmoid function applied to the sum of all decision trees' outputs to produce the final predicted probability for each class.
2. Independence of Residuals: Assumes that the residuals (errors) from previous iterations are independent and that subsequent trees can
correct these residuals.
3. Sufficiently Large and Diverse Dataset: Assumes that the dataset is large and diverse enough to capture the underlying patterns and
avoid overfitting.
4. Categorical Feature Handling: Assumes effective handling of categorical features through techniques like target encoding to improve
model performance.
5. Stationary Data Distribution: Assumes that the data distribution remains stable over time, allowing the model to generalize well to new
data.
# Making predictions
y_pred = catboost_classifier.predict(X_test)
y_pred_proba = catboost_classifier.predict_proba(X_test)[:, 1] # Probabilities for the positive class
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
print("ROC-AUC Score:", roc_auc)
In this code:
Additional Insights
While gradient boosting methods like CatBoost are powerful and popular, several related techniques receive less attention in introductory material. For instance:
LightGBM (Light Gradient Boosting Machine) appears less often in learning resources than the traditional algorithms covered here, yet it is a fast, memory-efficient gradient boosting framework that performs well on large datasets.
Stacking and Blending are ensemble techniques that involve training multiple base models and combining their predictions using meta-
models or weighted averages. These methods often receive less focus but can significantly enhance model performance.
Understanding these methods can offer valuable insights and improve modeling strategies, even if they are not the main focus in some
learning resources.
16. Choosing the Best Model: Hyperparameter Tuning and Cross-Validation
16.1 Hyperparameter Tuning: Explanation and Importance
Hyperparameters are configuration values chosen before training (for example, a learning rate or a tree depth) rather than learned from the data. Tuning them matters for several reasons:
1. Improves Model Performance:
Hyperparameters significantly impact how well a model learns from data and generalizes to new, unseen data. Tuning these parameters helps in finding the best configuration that yields the highest performance metrics, such as accuracy, precision, recall, or F1-score.
2. Avoids Overfitting and Underfitting:
Properly tuned hyperparameters help in balancing the model complexity and prevent overfitting (where the model learns noise in the
training data) or underfitting (where the model fails to capture underlying patterns).
3. Enhances Model Robustness:
Hyperparameter tuning helps in making the model more robust to variations in the data and ensures that it performs well across
different scenarios and datasets.
Some commonly tuned hyperparameters are listed below; the sketch after this list shows where each of them appears in code.
1. Learning Rate:
Controls how much the model's weights are adjusted with each iteration. A smaller learning rate may lead to more precise results but require more iterations, while a larger learning rate may speed up training but risk overshooting the optimal solution.
2. Number of Trees (in Ensemble Methods):
Determines how many trees are used in models like Random Forest or Gradient Boosting. More trees generally improve model
performance but increase computational cost.
3. Max Depth (in Decision Trees):
Limits the depth of the tree. A deeper tree can model more complex relationships but may lead to overfitting, while a shallower tree
may underfit the data.
4. Regularization Parameters:
Such as alpha in Lasso or lambda in Ridge regression, which control the amount of penalty applied to model complexity. These
parameters help in preventing overfitting by discouraging overly complex models.
5. Number of Neighbors (in KNN):
Defines how many neighboring points are considered when making predictions. A small number may lead to overfitting, while a large
number may smooth out predictions too much.
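Each of these hyperparameters corresponds directly to a constructor argument in scikit-learn. The values below are arbitrary and only illustrate where the knobs live:
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import Ridge, Lasso
from sklearn.neighbors import KNeighborsClassifier
gb = GradientBoostingClassifier(learning_rate=0.1, n_estimators=200)  # learning rate, number of trees
rf = RandomForestClassifier(n_estimators=300)                         # number of trees
dt = DecisionTreeClassifier(max_depth=5)                              # max depth
ridge = Ridge(alpha=1.0)                                              # regularization strength in Ridge
lasso = Lasso(alpha=0.01)                                             # regularization strength in Lasso
knn = KNeighborsClassifier(n_neighbors=7)                             # number of neighbors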
The following techniques are commonly used to search the hyperparameter space:
1. Grid Search:
Involves exhaustively evaluating every combination of hyperparameter values from a predefined grid. It is thorough but can become computationally expensive when the grid is large.
Example in python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
# Load data
data = load_iris()
X = data.data
y = data.target
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define model
model = RandomForestClassifier()
# Define a hyperparameter grid to search (illustrative values)
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10]
}
# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)
# Fit GridSearchCV
grid_search.fit(X_train, y_train)
print("Best parameters:", grid_search.best_params_)
2. Random Search:
Involves sampling hyperparameter values randomly from predefined ranges. While less exhaustive than grid search, it can be more
efficient and often finds good hyperparameters faster.
Example in python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
# Load data
data = load_iris()
X = data.data
y = data.target
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define model
model = RandomForestClassifier()
# Define distributions to sample hyperparameters from (illustrative ranges)
param_dist = {
    'n_estimators': randint(100, 500),
    'max_depth': randint(3, 20),
    'min_samples_split': randint(2, 11)
}
# Initialize RandomizedSearchCV
random_search = RandomizedSearchCV(estimator=model, param_distributions=param_dist, n_iter=100, cv=5, n_jobs=-1,
                                   verbose=2, random_state=42)
# Fit RandomizedSearchCV
random_search.fit(X_train, y_train)
print("Best parameters:", random_search.best_params_)
3. Bayesian Optimization:
Uses probabilistic models to explore the hyperparameter space more intelligently. It models the performance function and chooses
hyperparameters that are likely to improve the performance.
Example in python
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load data
data = load_iris()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
def objective(trial):
    # Suggest hyperparameter values for this trial (illustrative ranges)
    n_estimators = trial.suggest_int('n_estimators', 50, 300)
    max_depth = trial.suggest_int('max_depth', 2, 20)
    min_samples_split = trial.suggest_int('min_samples_split', 2, 10)
    # Initialize model
    model = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth,
                                   min_samples_split=min_samples_split, random_state=42)
    # Train model
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Evaluate model
    accuracy = accuracy_score(y_test, y_pred)
    return accuracy
# Run the optimization study
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
print("Best parameters:", study.best_params)
16.2 Cross-Validation: Explanation and Importance
Cross-validation evaluates a model on several different train/validation splits of the data rather than a single hold-out set. Its main benefits are:
1. Reduces Overfitting:
By validating the model on different subsets of the data, cross-validation helps in ensuring that the model does not overfit to the
training data and can generalize well to new, unseen data.
2. Provides a Reliable Performance Estimate:
Cross-validation offers a more accurate estimate of a model’s performance by averaging results over multiple iterations, which
reduces the variability that might arise from a single train-test split.
3. Maximizes Data Utilization:
By using different parts of the data for training and validation, cross-validation ensures that the model is evaluated on all available
data, making full use of the dataset.
Example in python (K-Fold Cross-Validation)
from sklearn.model_selection import cross_val_score, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
# Load data
data = load_iris()
X = data.data
y = data.target
# Initialize model
model = RandomForestClassifier()
# Initialize KFold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
# Perform K-Fold cross-validation
scores = cross_val_score(model, X, y, cv=kf)
# Print results
print("Cross-Validation Scores:", scores)
print("Mean Accuracy:", scores.mean())
print("Standard Deviation:", scores.std())
Example in python (Leave-One-Out Cross-Validation)
import numpy as np
from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
# Load data
data = load_iris()
X = data.data
y = data.target
# Initialize model
model = RandomForestClassifier()
# Initialize LeaveOneOut
loo = LeaveOneOut()
# Perform leave-one-out cross-validation (one fold per sample, so this can be slow)
scores = cross_val_score(model, X, y, cv=loo)
# Print results
print("Cross-Validation Scores:", scores)
print("Mean Accuracy:", np.mean(scores))
print("Standard Deviation:", np.std(scores))
Example in python (Stratified K-Fold Cross-Validation)
import numpy as np
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
# Load data
data = load_iris()
X = data.data
y = data.target
# Initialize model
model = RandomForestClassifier()
# Initialize StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Perform stratified cross-validation (each fold preserves the class proportions)
scores = cross_val_score(model, X, y, cv=skf)
# Print results
print("Cross-Validation Scores:", scores)
print("Mean Accuracy:", np.mean(scores))
print("Standard Deviation:", np.std(scores))
Example in python (Time Series Cross-Validation)
import numpy as np
from sklearn.model_selection import cross_val_score, TimeSeriesSplit
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
# Load data (iris is used only for illustration; in practice the data should be time-ordered)
data = load_iris()
X = data.data
y = data.target
# Initialize model
model = RandomForestClassifier()
# Initialize TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
# Perform time-series cross-validation (training folds always precede validation folds)
scores = cross_val_score(model, X, y, cv=tscv)
print("Cross-Validation Scores:", scores)
print("Mean Accuracy:", np.mean(scores))
print("Standard Deviation:", np.std(scores))
Example in python (Group K-Fold Cross-Validation)
import numpy as np
from sklearn.model_selection import cross_val_score, GroupKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
# Load data
data = load_iris()
X = data.data
y = data.target
# Hypothetical group labels (in practice these identify related samples, e.g. the same patient or session)
groups = np.arange(len(X)) % 10
# Initialize model
model = RandomForestClassifier()
# Initialize GroupKFold
gkf = GroupKFold(n_splits=5)
# Perform group-aware cross-validation (samples sharing a group never appear in both train and validation)
scores = cross_val_score(model, X, y, cv=gkf, groups=groups)
# Print results
print("Cross-Validation Scores:", scores)
print("Mean Accuracy:", np.mean(scores))
print("Standard Deviation:", np.std(scores))
Explanation:
1. Data Splitting:
Divide the dataset into multiple subsets or folds based on the chosen cross-validation method.
2. Model Training and Validation:
For each fold, train the model on the training set and evaluate its performance on the validation set.
3. Performance Aggregation:
Collect performance metrics (e.g., accuracy, precision, recall) from each fold and compute the average to get an overall estimate of
the model’s performance.
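These three steps are what cross_val_score performs internally. A minimal sketch of the same loop, assuming the X, y, and model from the iris examples above (X and y as NumPy arrays):
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score
from sklearn.base import clone
kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []
for train_idx, val_idx in kf.split(X):                      # 1. Split the data into folds
    fold_model = clone(model)                               # fresh, unfitted copy of the model for each fold
    fold_model.fit(X[train_idx], y[train_idx])              # 2. Train on the training fold
    y_val_pred = fold_model.predict(X[val_idx])             #    ...and validate on the held-out fold
    fold_scores.append(accuracy_score(y[val_idx], y_val_pred))
print("Average accuracy across folds:", np.mean(fold_scores))  # 3. Aggregate the per-fold metrics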
17. Pipeline: Detailed Explanation and Importance
17.1 What is a Pipeline?
In machine learning, a pipeline is a sequential process that streamlines the workflow of transforming data, applying algorithms, and making
predictions. It automates the steps involved in preprocessing data, training models, and evaluating performance, ensuring a consistent and
reproducible workflow.
1. Streamlines the Workflow:
Pipelines organize and automate the steps involved in machine learning processes, making workflows more efficient and easier to manage. This reduces the complexity of handling different steps separately and ensures a smooth transition between data processing and model training.
2. Ensures Reproducibility:
By encapsulating all preprocessing, training, and evaluation steps in a single pipeline, it becomes easier to reproduce results. This is
crucial for validating and comparing models consistently.
3. Prevents Data Leakage:
Pipelines help in avoiding data leakage by ensuring that preprocessing steps are only applied to training data during model training
and not to validation or test data. This separation maintains the integrity of the evaluation process.
4. Simplifies Hyperparameter Tuning:
When using techniques like grid search or random search for hyperparameter tuning, pipelines ensure that all preprocessing steps
are consistently applied, making it easier to tune model parameters without reimplementing preprocessing logic.
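As an illustration of this last point, a pipeline can be passed directly to GridSearchCV, and hyperparameters of a step are referenced as <step name>__<parameter>. A minimal sketch, assuming a train/test split like the ones used earlier in this chapter:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000))
])
# Step parameters are addressed as <step name>__<parameter>
param_grid = {'classifier__C': [0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)   # the scaler is re-fit inside every CV fold, which also prevents leakage
print("Best parameters:", search.best_params_)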
The main components of a pipeline are:
1. Data Preprocessing:
Feature Engineering: Creating new features or transforming existing ones to improve model performance (e.g., scaling, encoding, imputation).
Data Transformation: Applying transformations like normalization, standardization, or encoding to prepare data for modeling.
Data Splitting: Dividing data into training, validation, and test sets to evaluate model performance effectively.
2. Model Training:
Algorithm Selection: Choosing the appropriate machine learning algorithm (e.g., decision trees, SVM, or neural networks) based on
the problem type.
Training: Fitting the model to the training data using selected algorithms and hyperparameters.
3. Model Evaluation:
Metrics Calculation: Evaluating the model's performance using metrics such as accuracy, precision, recall, or mean squared error.
Validation: Testing the model on validation data to fine-tune hyperparameters and assess generalization.
4. Prediction and Post-Processing:
Prediction: Using the trained model to make predictions on new or unseen data.
Post-Processing: Applying additional transformations to predictions if needed, such as converting probabilities to class labels.
1. Pipeline Creation:
Pipelines are typically created using libraries like scikit-learn in Python, which provide classes like Pipeline to define and manage the sequence of operations. Each step in the pipeline is represented as a tuple containing a name and a transformation or model.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', LogisticRegression())
])
2. Pipeline Execution:
Once created, pipelines can be used to fit data, make predictions, and evaluate models with minimal additional code. The fit()
method trains the model and applies preprocessing steps, while predict() uses the trained model for predictions.
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
1. Consistency:
Ensures that the same preprocessing and modeling steps are applied consistently across different datasets and experiments.
2. Modularity:
Allows for easy modification and experimentation by changing individual components (e.g., swapping models or preprocessing
techniques) without altering the entire workflow.
3. Scalability:
Facilitates scaling machine learning workflows by integrating with tools for automated training and deployment, such as MLflow or
Apache Airflow.
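Putting the components and advantages together, the sketch below chains imputation, scaling, and encoding with a model via ColumnTransformer, and then swaps the final estimator without touching the rest of the workflow. The column names are hypothetical and only stand in for a dataset with mixed numeric and categorical features:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
# Hypothetical column groups
numeric_cols = ['age', 'income']
categorical_cols = ['city']
preprocessing = ColumnTransformer([
    ('num', Pipeline([('impute', SimpleImputer(strategy='median')),
                      ('scale', StandardScaler())]), numeric_cols),
    ('cat', Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                      ('encode', OneHotEncoder(handle_unknown='ignore'))]), categorical_cols),
])
full_pipeline = Pipeline([
    ('preprocessing', preprocessing),
    ('classifier', LogisticRegression(max_iter=1000))
])
# Modularity: swap the final estimator without changing the preprocessing steps
full_pipeline.set_params(classifier=RandomForestClassifier())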
18. Probability
18.1 Introduction to Probability
Probability is a branch of mathematics that deals with the likelihood of an event occurring. It provides a systematic way to quantify uncertainty.
The concept of probability is deeply embedded in many fields, from everyday decision-making to scientific research, finance, and machine
learning.
18.2 Definition
At its core, probability is a measure of how likely something is to happen. It helps us answer questions like: What is the chance that it will rain tomorrow? How likely is it to roll a 3 on a fair die?
Probability assigns a value between 0 and 1 to the occurrence of an event. A probability of 0 means the event will not happen, while a
probability of 1 means the event will certainly happen. A probability value closer to 0 indicates that the event is less likely to occur, whereas a
value closer to 1 suggests that the event is more likely to happen.
1. Basic Formula:
Probability is expressed as a ratio of the number of favorable outcomes to the total number of possible outcomes. If P(A) denotes the probability of event A, it is calculated as:
P(A) = (Number of favorable outcomes) / (Total number of possible outcomes)
2. Probability Values:
A probability always lies in the range from 0 to 1, as described above.
3. Types of Probability:
Theoretical Probability: Based on reasoning or mathematical principles. For example, the probability of rolling a 3 on a fair six-sided die is:
die is:
P(rolling a 3) = 1/6
since there are 6 possible outcomes and only one favorable outcome.
Experimental Probability: Based on actual experiments or observations. For example, if you roll a die 60 times and get a 3 in 10 of
those rolls, the experimental probability is:
P(rolling a 3) = 10/60 = 1/6
Subjective Probability: Based on personal judgment or belief, rather than empirical evidence. For example, estimating the likelihood
of rain based on experience.
4. Probability Rules:
Addition Rule:
For mutually exclusive events (events that cannot occur simultaneously), the probability of either event A or event B occurring is:
P (A ∪ B) = P (A) + P (B)
For events that are not mutually exclusive (they can occur together), the overlap is subtracted:
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
Multiplication Rule:
For independent events (events where the occurrence of one does not affect the occurrence of the other), the probability of both
event A and event B occurring is:
P (A ∩ B) = P (A) × P (B)
Complement Rule:
The probability that event A does not occur is one minus the probability that it does (a numerical sketch of these rules follows Bayes' Theorem below):
P(A′) = 1 − P(A)
5. Bayes' Theorem:
A formula that describes the probability of an event, based on prior knowledge of conditions that might be related to the event. It is
given by:
P(A ∣ B) = (P(B ∣ A) × P(A)) / P(B)
Here, P (A ∣ B) is the probability of event A given that B has occurred, P (B ∣ A) is the probability of B given A, P (A) is the prior
probability of A, and P (B) is the probability of B.
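These rules and Bayes' Theorem are easy to check numerically. A short sketch using exact fractions; the die probabilities follow the examples above, and the values fed into Bayes' Theorem are purely illustrative:
from fractions import Fraction
p_roll_3 = Fraction(1, 6)                # theoretical probability of rolling a 3
p_roll_2 = Fraction(1, 6)
# Addition rule (mutually exclusive): P(2 or 3) = P(2) + P(3)
p_2_or_3 = p_roll_2 + p_roll_3           # 1/3
# Multiplication rule (independent): P(3 on the first roll and 3 on the second roll)
p_3_then_3 = p_roll_3 * p_roll_3         # 1/36
# Complement rule: P(not rolling a 3)
p_not_3 = 1 - p_roll_3                   # 5/6
# Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B), with illustrative values
p_a, p_b_given_a, p_b = Fraction(1, 3), Fraction(1, 2), Fraction(1, 4)
p_a_given_b = p_b_given_a * p_a / p_b    # 2/3
print(p_2_or_3, p_3_then_3, p_not_3, p_a_given_b)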
Probability has wide-ranging applications:
1. Decision-Making:
Probability helps in making informed decisions by evaluating the likelihood of various outcomes. For example, in finance, probability is used to assess the risk of investments.
2. Predictive Modeling:
In machine learning, probability is used in models such as logistic regression and Naive Bayes classifiers to predict outcomes based
on input features.
3. Games and Gambling:
Probability underpins the strategies and odds in games and gambling, helping players understand their chances of winning or losing.
4. Statistics:
Probability is foundational in statistics for hypothesis testing, confidence intervals, and analyzing data distributions.
Naive Bayes is a robust classification algorithm that utilizes probability to determine class labels. It is based on Bayes' Theorem, which
helps in calculating the probability of a class given the features of an observation. By assuming that features are conditionally
independent given the class label, Naive Bayes simplifies the complex calculations involved in classification tasks.
Probabilistic Approach: Naive Bayes models uncertainty by predicting the likelihood of each class and choosing the one with the
highest probability.
Simplicity: The assumption of feature independence simplifies the model and reduces computational complexity.
Robustness: Despite its simplistic assumptions, Naive Bayes often performs well in practice, especially with large datasets and in
scenarios with incomplete or noisy data.
19. Naive Bayes Algorithm: Introduction and Explanation
19.1 Introduction:
Naive Bayes is a classification algorithm based on Bayes' Theorem, which predicts the class of an observation based on its features. Despite its
simplicity, it is highly effective for tasks like spam filtering, text classification, and sentiment analysis.
19.2 Explanation:
Probabilistic Approach: Naive Bayes calculates the probability of each class given the features of the observation and selects the class
with the highest probability.
Independence Assumption: The algorithm assumes that features are independent of each other given the class label, which simplifies
the calculations and makes it computationally efficient.
Classification: It predicts the class of new observations by comparing the probabilities of all possible classes based on the features
provided.
19.3 Applications:
Text Classification: Categorizing emails as spam or not spam.
Medical Diagnosis: Classifying medical conditions based on symptoms.
Recommendation Systems: Suggesting products based on user preferences.
19.4 Types of Naive Bayes
Gaussian Naive Bayes
Description: Assumes that features follow a Gaussian (normal) distribution. It is used when the features are continuous and are assumed to be normally distributed.
Application: Suitable for problems where features are numeric and follow a normal distribution.
Bernoulli Naive Bayes
Description: Assumes binary/boolean features (i.e., features are either present or absent). It is used for binary/boolean features and is a variant of the multinomial Naive Bayes.
Application: Document classification where features are binary, such as the presence or absence of certain words.
Complement Naive Bayes
Description: An adaptation of the multinomial Naive Bayes, designed to improve performance on imbalanced datasets by complementing the class distribution.
Application: Suitable for text classification problems with imbalanced classes.
Categorical Naive Bayes
Description: Designed for categorical data where features are categorical rather than continuous. It extends the idea of multinomial Naive Bayes to handle categorical feature values.
Application: Data with categorical features, like survey responses or categorical demographic data.
Each type of Naive Bayes classifier has its own strengths and is best suited for specific types of data and problems. Choosing the right one
depends on the nature of your features and the problem you're trying to solve.
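Each of these variants has a counterpart in scikit-learn's naive_bayes module; a quick sketch of how they are instantiated:
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB, ComplementNB, CategoricalNB
gaussian_nb = GaussianNB()        # continuous, roughly normally distributed features
multinomial_nb = MultinomialNB()  # count features, e.g. word counts in text
bernoulli_nb = BernoulliNB()      # binary/boolean features
complement_nb = ComplementNB()    # imbalanced text classification
categorical_nb = CategoricalNB()  # categorical (non-numeric) feature values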
19.5 Naive Bayes Algorithm: Mathematical Formulation
1. Bayes' Theorem:
Naive Bayes relies on Bayes' Theorem to calculate the probability of a class given the features of an observation. The formula is:
P(C ∣ X) = (P(X ∣ C) × P(C)) / P(X)
where:
P(C ∣ X) is the posterior probability of class C given the features X, P(X ∣ C) is the likelihood of the features given the class, P(C) is the prior probability of the class, and P(X) is the probability of the features (the evidence).
2. Independence Assumption:
Naive Bayes assumes that features are conditionally independent given the class. This simplifies the likelihood calculation:
P(X ∣ C) = ∏_{i=1}^{n} P(x_i ∣ C)
where xi represents individual features. This assumption allows for efficient computation even with many features.
3. Classification Rule:
To classify a new observation, Naive Bayes computes the posterior probability for each class and chooses the class with the highest probability:
Ĉ = argmax_C P(C ∣ X)
Since P(X) is the same for all classes, it suffices to maximize P(X ∣ C) × P(C):
Ĉ = argmax_C P(C) × ∏_{i=1}^{n} P(x_i ∣ C)
A short from-scratch sketch of this rule is given just before the Python example below.
4. Law of Total Probability:
Description: This law states that the probability of an event can be found by summing the probabilities of that event across all possible ways it can occur.
Formula:
P(A) = ∑_i P(A ∩ B_i) = ∑_i P(A ∣ B_i) × P(B_i)
Explanation: If Bi are mutually exclusive events that partition the sample space, the probability of event A is the sum of the probabilities
of A occurring given each Bi , weighted by the probability of each Bi .
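To make the classification rule concrete, here is a minimal from-scratch sketch of the Gaussian variant: it estimates P(C) and per-feature Gaussian likelihoods from training data, then picks the class with the largest log-posterior. It assumes X and y are NumPy arrays and is an illustrative implementation, not the scikit-learn one used in the library example that follows.
import numpy as np
def fit_gaussian_nb(X, y):
    classes = np.unique(y)
    stats = {}
    for c in classes:
        Xc = X[y == c]
        stats[c] = {
            "prior": len(Xc) / len(X),    # P(C)
            "mean": Xc.mean(axis=0),      # per-feature mean
            "var": Xc.var(axis=0) + 1e-9, # per-feature variance (small smoothing term)
        }
    return stats
def predict_gaussian_nb(X, stats):
    preds = []
    for x in X:
        log_posteriors = {}
        for c, s in stats.items():
            # log P(C) + sum_i log P(x_i | C) under a Gaussian likelihood
            log_likelihood = -0.5 * np.sum(
                np.log(2 * np.pi * s["var"]) + (x - s["mean"]) ** 2 / s["var"]
            )
            log_posteriors[c] = np.log(s["prior"]) + log_likelihood
        preds.append(max(log_posteriors, key=log_posteriors.get))
    return np.array(preds)
On a dataset like iris, the predictions from this sketch should closely match those of scikit-learn's GaussianNB, since both implement the same decision rule.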
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# Load dataset
data = load_iris()
X, y = data.data, data.target
# Split data and fit a Gaussian Naive Bayes model (iris features are continuous)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = GaussianNB()
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
# Print results
print("Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", class_report)
Explanation: The Gaussian Naive Bayes model is fitted on the training split, and its predictions on the test split are summarized with accuracy, a confusion matrix, and a classification report giving per-class precision, recall, and F1 score.
As you close this book, remember that the true power of supervised learning lies in its application. Whether you're building predictive models,
enhancing business insights, or advancing research, the concepts and techniques you've learned are the keys to unlocking new opportunities
and driving innovation.
The future of machine learning is bright, and your journey is just beginning. Keep experimenting, stay curious, and continue to push the
boundaries of what's possible with data. The world of supervised learning is ever-evolving, and with the knowledge you've gained, you're well-
equipped to be at the forefront of this exciting field.
Thank you for joining me on this adventure. The next chapter in your machine learning journey awaits—may it be filled with discovery, growth,
and success.
In shaa Allah, a new guide on unsupervised Machine Learning will be shared with you soon...