ML Tutorial
Classification Algorithms
Introduction
Logistic Regression is a statistical model commonly used for binary
classification tasks, where the goal is to classify data into one of two possible
outcomes. Unlike linear regression, which predicts a continuous value, logistic
regression predicts the probability of a sample belonging to a particular class.
The core idea of logistic regression is to map input features to a probability
score between 0 and 1 using the logistic (or sigmoid) function. This makes it
suitable for binary classification tasks, such as spam vs. not spam or disease
vs. no disease. Logistic regression works by finding a decision boundary that
separates the classes in the feature space, maximizing the likelihood of correct
classification.
\[
h(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n)}}
\]
where:
h(x) is the predicted probability that the data point belongs to the
positive class.
If \( h(x) \geq 0.5 \), the point is classified as belonging to the positive class;
otherwise, it is classified as the negative class.
Loss Function
Logistic regression is trained by minimizing the log loss (binary cross-entropy):
\[
L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log h(x_i) + (1 - y_i) \log\left(1 - h(x_i)\right) \right]
\]
where:
\( y_i \in \{0, 1\} \) is the true label of the \( i \)-th training example and \( h(x_i) \) is its predicted probability.
The goal is to minimize the total loss over all training examples, which can be
achieved through gradient descent or other optimization algorithms.
Minimizing this loss function helps the model make accurate predictions.
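As a minimal sketch, here is how this might look with scikit-learn's LogisticRegression; the library, synthetic dataset, and parameter values are assumptions for illustration, not part of the tutorial.
```python
# Minimal logistic regression sketch on illustrative synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the model; scikit-learn minimizes the log loss internally.
model = LogisticRegression()
model.fit(X_train, y_train)

# h(x): predicted probability of the positive class.
proba = model.predict_proba(X_test)[:, 1]
preds = (proba >= 0.5).astype(int)  # default 0.5 decision threshold

print("Accuracy:", accuracy_score(y_test, preds))
print("Log loss:", log_loss(y_test, proba))
```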
Pros and Cons of Logistic Regression
Pros:
Cons:
Outlier Detection and Removal: Identifying and removing outliers before
training can help improve logistic regression performance.
Additional Notes
Regularization: Regularization techniques, such as L1 (Lasso) and L2
(Ridge) regularization, are often applied to logistic regression to prevent
overfitting and handle multicollinearity. This introduces a penalty for large
coefficients, encouraging the model to find simpler solutions.
Threshold Tuning: The default threshold for classification is 0.5, but it can
be adjusted depending on the specific problem. For example, in medical
diagnoses where false negatives are costly, a lower threshold might be
chosen.
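A short sketch of threshold tuning, assuming predicted probabilities from an already fitted classifier (the values below are made up for illustration):
```python
import numpy as np

# Hypothetical predicted probabilities from a fitted logistic regression model.
proba = np.array([0.15, 0.42, 0.55, 0.91])

# Default threshold of 0.5 vs. a lower threshold (e.g., 0.3) to reduce false negatives.
default_preds = (proba >= 0.5).astype(int)
lowered_preds = (proba >= 0.3).astype(int)

print(default_preds)  # [0 0 1 1]
print(lowered_preds)  # [0 1 1 1]
```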
Tree Models
Introduction
Decision Trees are a popular supervised learning algorithm used for both
classification and regression tasks. They work by recursively splitting the
dataset into subsets based on feature values, creating a tree-like structure
where each internal node represents a feature (or attribute), each branch
represents a decision rule, and each leaf node represents an outcome (or class
label).
The goal of a decision tree is to create a model that predicts the target variable
by learning simple decision rules inferred from the data features. Decision trees
are intuitive, easy to interpret, and can handle both categorical and continuous
data. Their transparency and straightforward visualization make them a popular
choice among practitioners, especially when model interpretability is crucial.
Formula to Predict a New Point
The prediction of a new data point in a decision tree involves traversing the tree
from the root to a leaf node. At each internal node, the algorithm evaluates a
feature and makes a decision based on its value, directing the traversal to the
left or right branch until it reaches a leaf node, which contains the predicted
outcome.
In order to build the tree, we need to decide which features to split on, both for
the root node and the internal nodes. This is done by evaluating each possible
split and choosing the one that maximizes the "purity" (or minimizes impurity)
of the resulting nodes. Here’s how:
1. For each feature, test possible thresholds to split the data into two
groups (for binary splits).
2. Impurity Measures:
\[
\text{Entropy}(D) = - \sum_{i=1}^{C} p_i \log_2(p_i)
\]
3. The feature and threshold with the highest Information Gain (or largest
reduction in impurity) becomes the split point for the current node.
4. Repeat this process recursively for each child node, treating each child
node as the new parent node.
5. Continue until a stopping criterion is met, such as reaching a maximum
tree depth, a minimum number of samples in a node, or achieving zero
impurity.
6. Stopping Criteria:
Maximum Depth: Prevents the tree from growing too complex and
overfitting.
The tree algorithm selects the feature and the split point that minimizes the
impurity for the resulting child nodes, aiming to create homogeneous groups of
outcomes.
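To make the split selection concrete, the sketch below computes entropy and Information Gain for one candidate split. The labels and the split itself are made up purely for illustration.
```python
import numpy as np

def entropy(labels):
    """Entropy(D) = -sum(p_i * log2(p_i)) over the classes present in labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    """Reduction in entropy achieved by splitting parent into left/right groups."""
    n = len(parent)
    weighted_child = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted_child

# Toy labels for one candidate threshold on some feature (illustrative only).
parent = np.array([0, 0, 0, 1, 1, 1, 1, 0])
left   = np.array([0, 0, 0, 0])   # samples with feature value <= threshold
right  = np.array([1, 1, 1, 1])   # samples with feature value >  threshold

print("Information Gain:", information_gain(parent, left, right))  # 1.0 for this pure split
```
The tree builder would repeat this calculation for every feature and threshold and keep the split with the largest gain.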
1. Interpretability: Decision trees are easy to visualize and interpret, making it
clear how decisions are made.
4. Handles Both Numerical and Categorical Data: They can work with both
types of data without special transformations.
5. Robust to Feature Scaling: Decision trees are not sensitive to the scale of
data, unlike some other models.
Cons:
2. Sensitive to Noisy Data: Small changes in the data can lead to a completely
different tree structure, which can reduce model stability.
While outliers are less likely to affect trees significantly, pruning or limiting the
depth can help avoid nodes formed solely due to outliers.
Variance: Decision trees have high variance because small changes in the
training data can lead to completely different splits and a different tree
structure. This high variance makes individual decision trees prone to
overfitting, particularly on small datasets.
Additional Notes
1. Pruning: To avoid overfitting, trees often require pruning. Pruning involves
removing branches that have little importance or adding stopping criteria
(like a maximum depth) to control the tree’s complexity.
2. Tree Depth: Limiting the maximum depth of a tree is another way to control
for overfitting and improve model generalization.
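To illustrate the pruning and depth controls just described, here is a hedged scikit-learn sketch; the dataset and the specific parameter values (max_depth, min_samples_leaf, ccp_alpha) are illustrative choices, not recommendations from the tutorial.
```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unconstrained tree (likely to overfit) vs. a depth-limited, pruned tree.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned_tree = DecisionTreeClassifier(
    max_depth=4,         # stopping criterion: maximum depth
    min_samples_leaf=5,  # stopping criterion: minimum samples per leaf
    ccp_alpha=0.01,      # cost-complexity pruning strength (illustrative value)
    random_state=0,
).fit(X_train, y_train)

print("Full tree test accuracy:  ", full_tree.score(X_test, y_test))
print("Pruned tree test accuracy:", pruned_tree.score(X_test, y_test))
```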
Random Forest Tutorial
Introduction
Random Forest is an ensemble learning method that combines multiple
decision trees to improve the accuracy and stability of predictions. Developed
by Leo Breiman, Random Forest uses a technique known as bagging (Bootstrap
Aggregating) to build a "forest" of individual decision trees, where each tree is
trained on a random subset of the data. By averaging the trees' predictions (for
regression) or taking a majority vote over them (for classification), Random
Forest reduces the risk of overfitting that individual decision trees often face.
Random Forest is highly effective in both classification and regression tasks,
making it versatile across many fields, including finance, healthcare, and e-
commerce. The goal is to improve the accuracy and generalization of
predictions while maintaining model interpretability.
Formula to Predict a New Point
Random Forest prediction involves aggregating predictions from multiple
decision trees. Each tree in the forest makes an independent prediction, and
the final output is determined by averaging (for regression) or taking the
majority vote (for classification) of these individual predictions.
1. Classification:
\[
\hat{y} = \text{mode}\left\{ T_1(x), T_2(x), \dots, T_m(x) \right\}
\]
where \( T_i(x) \) is the prediction of the \(i\)-th tree in the forest, and \( m \)
is the total number of trees. The mode, or majority vote, of these
predictions determines the final output.
2. Regression:
\[
\hat{y} = \frac{1}{m} \sum_{i=1}^{m} T_i(x)
\]
where \( T_i(x) \) is the prediction of the \(i\)-th tree, and the average value
of the predictions is used as the final output.
The randomness introduced during the training phase, both in the selection of
data samples and feature subsets, helps to reduce variance and improve the
model’s ability to generalize.
The final performance of the forest is typically evaluated based on the chosen
task's loss function, such as cross-entropy or accuracy for classification, and
Mean Squared Error for regression.
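The sketch below mirrors the majority-vote formula using scikit-learn; the synthetic dataset and the number of trees are assumptions for illustration.
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative synthetic data.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)

forest = RandomForestClassifier(n_estimators=100, random_state=1)
forest.fit(X, y)

x_new = X[:1]  # treat the first row as a "new" point for demonstration

# Each fitted tree T_i(x) votes; the forest reports the majority class.
tree_votes = np.array([tree.predict(x_new)[0] for tree in forest.estimators_])
print("Share of trees voting for class 1:", np.mean(tree_votes == 1))
print("Forest prediction (majority vote):", forest.predict(x_new)[0])
```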
Random Forest uses bagging and feature selection to create diverse trees in
the forest:
3. Handles Large Feature Sets: By using a random subset of features for each
split, Random Forests are effective even with high-dimensional data.
4. Works with Missing Data: It can handle missing values by splitting based
on surrogate splits or by averaging.
Cons:
2. Less Interpretability: The ensemble of trees makes the model less
interpretable than individual decision trees.
In general, Random Forests strike a good balance between bias and variance,
which often leads to better generalization compared to a single decision tree.
Additional Notes
1. Hyperparameters:
2. Out-of-Bag (OOB) Error:
Since each tree in the Random Forest is trained on a bootstrap sample
(a subset drawn with replacement), about 1/3 of the data is left out of each
sample. These “out-of-bag” samples can be used to estimate model
performance without needing a separate validation set (see the sketch after
this list).
3. Feature Importance:
4. Applications:
Random Forest is widely used in areas where accuracy and stability are
crucial, such as medical diagnosis, fraud detection, financial modeling,
and image recognition.
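A brief sketch of the OOB estimate and impurity-based feature importances in scikit-learn; the dataset and hyperparameters are illustrative assumptions.
```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=0)

# bootstrap=True is the default; oob_score=True enables the out-of-bag estimate.
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X, y)

# OOB accuracy: estimated from samples each tree never saw during training.
print("OOB score:", forest.oob_score_)

# Impurity-based feature importances, one value per feature (they sum to 1).
for i, imp in enumerate(forest.feature_importances_):
    print(f"feature_{i}: {imp:.3f}")
```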
Gradient Boosting Machines (GBM) Tutorial
Introduction
Gradient Boosting Machines (GBM) is an ensemble machine learning algorithm
that builds models sequentially by combining the strengths of many weak
learners, typically decision trees, to form a strong predictive model. Unlike
Random Forest, where trees are trained independently, in Gradient Boosting,
each tree is trained to correct the errors of its predecessor. GBMs are highly
effective for both classification and regression tasks, excelling in performance
on tabular datasets with complex relationships.
The core idea behind GBM is to minimize the residual error (or loss) of the
previous model by adding a new tree at each step, designed to model the
residuals or gradient of the loss function.
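A minimal hand-rolled sketch of this residual-fitting loop for regression with squared error, using shallow trees as weak learners; the data, learning rate, and tree count are illustrative assumptions rather than the tutorial's settings.
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Illustrative 1-D regression data.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_trees = 100

# Start from a constant prediction (the mean), then fit each tree to the residuals.
prediction = np.full_like(y, y.mean())
trees = []
for _ in range(n_trees):
    residuals = y - prediction                # negative gradient of squared error
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residuals)
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print("Final training MSE:", np.mean((y - prediction) ** 2))
```
In practice a library implementation (e.g., scikit-learn's GradientBoostingRegressor) handles this loop, but the structure is the same: each new tree models the current residuals, scaled by the learning rate.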
1. Mean Squared Error (MSE) for regression:
\[
\text{Loss} = \frac{1}{N} \sum_{i=1}^{N} (y_i - F_T(x_i))^2
\]
where \( y_i \) is the true target value, \( F_T(x_i) \) is the final prediction,
and \( N \) is the number of samples.
3. Learning Rate:
1. High Accuracy: GBM often provides superior predictive accuracy due to its
iterative nature and ability to reduce bias.
Cons:
1. Sensitive to Outliers: Since each new model tries to reduce errors, GBM
can amplify the influence of outliers.
3. Prone to Overfitting: With many trees and a high learning rate, GBMs can
overfit on noisy datasets.
Variance: GBMs have high variance, as each new tree is trained on the
errors of the previous one. This variance can lead to overfitting, especially
with high learning rates or too many trees. Regularization techniques, such
as shrinkage (lower learning rates) and early stopping, are often used to
manage variance.
Additional Notes
1. Learning Rate:
The learning rate controls how much each tree contributes to the final
model. Lower learning rates often yield better results but require more
trees, increasing training time.
2. Regularization Techniques:
3. Hyperparameter Tuning:
The main parameters to tune are the number of trees, tree depth, and
learning rate. Grid search and cross-validation are commonly used to
identify the best combination of parameters (see the sketch after this list).
4. Extensions:
5. Feature Importance:
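A hedged sketch of hyperparameter tuning with grid search and cross-validation; the grid values below are arbitrary illustrations, and scikit-learn's GradientBoostingClassifier stands in for any GBM implementation.
```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Illustrative grid over the main GBM parameters mentioned above.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    cv=5,                 # 5-fold cross-validation
    scoring="accuracy",
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
```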
XGBoost Tutorial
1. Introduction
Extreme Gradient Boosting (XGBoost) is an optimized version of Gradient
Boosting that focuses on speed and performance, making it widely popular for
machine learning competitions and practical applications. It was developed
with the aim of improving on traditional Gradient Boosting by offering efficient,
scalable, and flexible implementations. XGBoost achieves these enhancements
by optimizing the algorithm’s core structure and applying advanced
regularization techniques.
In XGBoost, the boosting process works by sequentially adding decision trees
(usually small trees with limited depth) to correct the residual errors made by
previous trees, gradually improving the overall accuracy of the model.
\[
\hat{y} = \sum_{k=1}^{K} f_k(x)
\]
where:
\( f_k(x) \) is the prediction of the \( k \)-th tree in the ensemble, and \( K \) is the
total number of trees.
3. Loss Function
The loss function in XGBoost combines two components:
1. Prediction error: This measures the error between predicted values and
true values, commonly using mean squared error for regression tasks and
log loss for classification.
\[
L = \sum_{i=1}^{N} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)
\]
where:
\( l(y_i, \hat{y}_i) \) is the loss function measuring the error between actual \
( y_i \) and predicted \( \hat{y}_i \),
\( \Omega(f_k) \) is the regularization term for each tree \( f_k \), which
controls complexity and penalizes large trees to prevent overfitting.
The regularization term is commonly defined as
\[
\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2
\]
where:
\( T \) is the number of leaves in the tree,
\( w_j \) are the leaf weights, and \( \gamma \) and \( \lambda \) are regularization
parameters that penalize overly complex trees.
Pros
High Performance: XGBoost is highly optimized and fast, often
outperforming other gradient boosting models.
Cons
Complexity: XGBoost has many hyperparameters, which can make tuning
complex and time-consuming.
Bias: XGBoost generally has low bias. By building successive trees that
learn from residuals, the model reduces bias incrementally, making it
effective for complex tasks.
7. Additional Notes
Learning Rate: The learning rate (also called eta) controls the contribution
of each new tree. Lower values (e.g., 0.01–0.1) usually yield better
generalization but require more trees to converge.
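A minimal sketch with the xgboost Python package, assuming it is installed; the dataset and all parameter values are illustrative, not tuned recommendations.
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier  # assumes the xgboost package is available

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBClassifier(
    n_estimators=300,      # number of boosting rounds
    learning_rate=0.05,    # eta: contribution of each new tree
    max_depth=4,           # depth of each tree
    reg_lambda=1.0,        # L2 regularization on leaf weights
)
model.fit(X_train, y_train)

print("Test accuracy:", model.score(X_test, y_test))
```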
XGBoost is one of the most powerful and flexible models for classification and
regression tasks, but achieving optimal results requires tuning its parameters
and monitoring its behavior carefully.
AdaBoost Tutorial
1. Introduction
Adaptive Boosting (AdaBoost) is an ensemble learning technique designed to
improve the accuracy of weak learners, usually decision stumps (single-split
decision trees), by iteratively focusing on the mistakes made in previous
rounds. The goal of AdaBoost is to combine a sequence of these weak
learners, each improving upon the errors of the last, to form a strong predictive
model.
AdaBoost achieves this by increasing the weights of misclassified samples,
forcing subsequent learners to pay more attention to them, and thus iteratively
reducing errors. This makes AdaBoost especially useful in classification tasks
where it produces a final model that has reduced error rates compared to
individual weak models.
The final prediction for a new point \( x \) is a weighted vote of the weak learners:
\[
F(x) = \text{sign}\left( \sum_{t=1}^{T} \alpha_t h_t(x) \right)
\]
where:
\( h_t(x) \) is the prediction of the \( t \)-th weak learner,
\( \alpha_t \) is the weight assigned to that learner, and
\( \text{sign} \) outputs the class label based on the sign of the weighted
sum.
The weight \( \alpha_t \) of each learner is given by:
\[
\alpha_t = \frac{1}{2} \ln \left( \frac{1 - \text{Error}_t}{\text{Error}_t} \right)
\]
where \( \text{Error}_t \) is the weighted error of the \( t \)-th learner. This
weight ensures that models with lower error have higher influence in the final
prediction.
3. Loss Function
The exponential loss function is typically used in AdaBoost, which penalizes
misclassifications more heavily as the iterations progress. Given a set of
predictions, the exponential loss \( L \) for AdaBoost is defined as:
\[
L = \sum_{i=1}^{N} e^{-y_i \cdot F(x_i)}
\]
where:
\( y_i \in \{-1, +1\} \) is the true label of sample \( i \), and \( F(x_i) \) is the weighted
prediction of the ensemble for that sample.
The exponential loss increases significantly as the predictions deviate from the
true labels, which forces AdaBoost to iteratively adjust weights and focus on
hard-to-classify samples.
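A short sketch with scikit-learn's AdaBoostClassifier using decision stumps as weak learners; the data and hyperparameters are assumptions for illustration.
```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Decision stumps (max_depth=1) as the weak learners.
# Note: the parameter is named base_estimator in older scikit-learn releases.
stump = DecisionTreeClassifier(max_depth=1)
model = AdaBoostClassifier(estimator=stump, n_estimators=100, learning_rate=0.5, random_state=0)
model.fit(X, y)

# alpha_t for each weak learner (lower weighted error -> larger weight).
print("First five learner weights:", model.estimator_weights_[:5])
print("Training accuracy:", model.score(X, y))
```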
Pros
Effectively Reduces Bias: By combining weak learners and emphasizing
misclassified points, AdaBoost lowers bias, making it effective for complex
classification tasks.
Cons
Sensitive to Outliers and Noisy Data: Outliers receive increased weight as
misclassified points, which can harm performance, as the model may focus
excessively on these points.
Limited to Weak Learners: Works best with simple learners like decision
stumps, and using complex models can lead to overfitting.
Noisy or outlying points can receive ever-larger weights over the iterations, hurting
generalization on test data. Methods like data cleaning or weight clipping can
help mitigate this issue.
7. Additional Notes
Learning Rate: AdaBoost includes a learning rate parameter that scales the
influence of each weak learner. Smaller learning rates can improve
generalization by reducing overfitting but require more learners to achieve
high accuracy.
Bagging Tutorial
1. Introduction
Bagging (Bootstrap Aggregating) is an ensemble learning technique aimed at
reducing variance and preventing overfitting in high-variance models. It works
by creating multiple versions of a dataset using bootstrap sampling and training
a model independently on each version. By aggregating predictions from each
individual model (usually decision trees), Bagging can produce a final, more
robust prediction. Bagging is especially effective when applied to high-variance
models like decision trees, and one of its most popular implementations is
Random Forest.
The main goal of Bagging is to combine many weak models to create a stronger
model with improved stability and accuracy.
3. Loss Function
Bagging does not use a specific loss function for combining models, as each
individual model is trained independently. However, the overall goal is to reduce
the Mean Squared Error (MSE) for regression tasks or classification error for
classification tasks. Each model in Bagging is trained on its own bootstrapped
sample, where it optimizes a suitable loss function for that model (e.g., Gini
impurity or entropy in decision trees for classification).
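As a hedged illustration, the sketch below compares a single tree with a bagged ensemble of trees using scikit-learn's BaggingClassifier; the dataset and ensemble size are illustrative assumptions.
```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# A single (high-variance) tree vs. a bagged ensemble of 100 trees,
# each trained on a bootstrap sample of the data.
# Note: the parameter is named base_estimator in older scikit-learn releases.
single_tree = DecisionTreeClassifier(random_state=0)
bagged_trees = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    bootstrap=True,
    random_state=0,
)

print("Single tree CV accuracy:", cross_val_score(single_tree, X, y, cv=5).mean())
print("Bagged trees CV accuracy:", cross_val_score(bagged_trees, X, y, cv=5).mean())
```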
Pros
Reduces Variance: Bagging effectively reduces the variance of models like
decision trees, making them more stable.
Cons
Increased Computational Cost: Training multiple models increases the
computational cost, especially for large datasets or complex models.
Bagging’s effectiveness lies in its ability to lower variance, making it ideal for
high-variance models that may otherwise overfit.
7. Additional Notes
Bootstrap Sampling: Bagging relies on bootstrap sampling, where each
data point has roughly a 63% chance (about \( 1 - 1/e \)) of appearing at least
once in a given bootstrap sample. This randomness contributes to the
diversity among models.
Out-of-Bag (OOB) Error: The OOB error is calculated using data points not
included in the bootstrap sample for each model, providing an unbiased
estimate of the model’s generalization performance.
Stacking and Voting Tutorial
2. Formula to Predict a New Point
For both Stacking and Voting, we have an ensemble of base models, each
providing a prediction for a new data point \( x \). Let \( h_i(x) \) be the
prediction of the \( i \)-th base model.
Stacking
In Stacking, each base model \( h_i(x) \) makes a prediction, and these
predictions are used as features in a meta-model, which learns to combine
them optimally.
For a new point \( x \):
Voting
In Voting, the final prediction is a direct aggregation of all base model
predictions.
where:
3. Loss Function
Stacking
Stacking’s loss function depends on the meta-model chosen. For example:
Each base model is first trained independently to minimize its own loss, and
then the meta-model is trained to minimize its loss on the predictions of the
base models.
Voting
Voting doesn’t use an explicit loss function to combine predictions. Each base
model is trained independently on its dataset to optimize its respective loss,
and the final prediction is an aggregation of these outputs.
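The sketch below builds both ensembles with scikit-learn; the base models, meta-model, and dataset are illustrative choices rather than prescriptions from the tutorial.
```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Diverse base models, as recommended for both techniques.
base_models = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("svc", SVC(probability=True, random_state=0)),
]

# Voting: aggregate base predictions directly (soft voting averages probabilities).
voting = VotingClassifier(estimators=base_models, voting="soft").fit(X, y)

# Stacking: a logistic regression meta-model learns how to combine base predictions.
stacking = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(),
).fit(X, y)

print("Voting training accuracy:  ", voting.score(X, y))
print("Stacking training accuracy:", stacking.score(X, y))
```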
Pros
Increased Predictive Power: Combining diverse models can capture more
patterns and reduce errors in predictions.
Cons
Computational Cost: Training multiple models, especially with a meta-
model, can be computationally intensive.
May Not Always Improve Performance: If base models are too similar, the
ensemble may not perform better than individual models.
Because ensemble methods average predictions, they can be more robust to outliers than
individual models.
In Stacking, the meta-model can sometimes “learn” to reduce the influence of
outlier-sensitive models if other base models are more robust. In Voting, the
impact of outliers is reduced if robust models are part of the ensemble.
Voting: The bias and variance depend on the types of base models used. If
using diverse models, Voting can balance bias and variance. Majority voting
in classification is less likely to overfit, while averaging in regression can
help reduce variance.
7. Additional Notes
Choice of Meta-Model (Stacking): A linear regression or logistic regression
model is often used as the meta-model for simplicity, but complex models
(e.g., neural networks) can also be used depending on the problem.
Model Diversity: Both Stacking and Voting benefit from diverse base
models to reduce correlation among errors, which improves the ensemble’s
effectiveness.
Stacking and Voting are powerful ensemble techniques that, when used with
diverse models, can significantly enhance performance by aggregating
individual strengths and minimizing weaknesses.
K-Nearest Neighbors (KNN) Tutorial
Introduction
K-Nearest Neighbors (KNN) is a non-parametric, supervised learning
algorithm used for classification and regression tasks. In KNN, predictions for
a new data point are made based on the "k" closest points in the training
dataset. The primary goal of KNN is to classify or predict the outcome for a
new instance by looking at similar data points and finding a consensus.
KNN is straightforward and effective for low-dimensional data and problems
where similar items are likely to belong to the same category. However, it is
computationally expensive for large datasets since it needs to compare each
point with all other points in the dataset.
\[
\text{Distance}(x, x') = \sqrt{\sum_{i=1}^d (x_i - x'_i)^2}
\]
where:
d is the number of features,
x_i and x'_i are the feature values of the training and new data point
respectively.
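A minimal hand-rolled sketch of this distance-based prediction; the toy data points and labels are made up solely to illustrate the majority vote among the k nearest neighbors.
```python
import numpy as np

def knn_predict(x_new, X_train, y_train, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point.
    distances = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))
    nearest = np.argsort(distances)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy 2-D training data (illustrative only).
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1, 0])

print(knn_predict(np.array([1.1, 1.0]), X_train, y_train, k=3))  # expected: 0
print(knn_predict(np.array([5.1, 5.0]), X_train, y_train, k=3))  # expected: 1
```
In practice a library implementation such as scikit-learn's KNeighborsClassifier would be used, but the underlying idea is the same.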
Loss Function
KNN does not have a specific loss function because it is a lazy learner: it
doesn’t build a model during training. Instead, it stores the entire training
dataset and calculates distances on-the-fly when making predictions. However,
KNN's accuracy or error can be calculated using metrics like:
Cons:
Effect of Outliers on KNN
Outliers can significantly affect KNN because it classifies or predicts based on
nearby points. If an outlier lies within the vicinity of a new data point, it may
lead to misclassification. Techniques like standardization, feature scaling, and
choosing an appropriate distance metric can help reduce the effect of outliers
in KNN.
This high variance occurs because KNN directly relies on the dataset without
abstracting patterns or trends. Increasing the value of k can help reduce
variance by averaging more points, but it may increase bias if too many
neighbors are included.
Additional Notes
Choice of Distance Metric: Common choices include Euclidean,
Manhattan, and Minkowski distances. For categorical data, Hamming
Distance is often used.
Support Vector Machines (SVM) Tutorial
Introduction
The main goal of SVM is to find the optimal hyperplane that separates data
points of different classes with the maximum margin. This maximization of the
margin between classes enhances generalization, making SVM highly effective
in binary classification tasks. SVM is popular in fields like image recognition,
text categorization, and bioinformatics due to its ability to handle high-
dimensional data.
\[
y = \text{sign} (w^T x + b)
\]
where:
\( \text{sign}(\cdot) \) function returns \( +1 \) or \( -1 \) depending on which
side of the hyperplane the point lies.
Loss Function
The loss function combines margin maximization with the Hinge Loss:
\[
L = \frac{1}{2} \| w \|^2 + C \sum_{i=1}^{N} \max\left(0, 1 - y_i (w^T x_i + b)\right)
\]
where:
\( C \) is a regularization parameter controlling the trade-off between a wide
margin and misclassification penalties.
The Hinge Loss \( \max(0, 1 - y_i (w^T x_i + b)) \) is zero if the point is correctly
classified and beyond the margin; otherwise, it penalizes based on how far the
point is from the correct margin boundary.
Robust to overfitting: Especially with the use of regularization (C
parameter).
Cons:
This high variance makes SVM prone to overfitting if parameters (especially the
kernel and \( C \)) are not carefully selected.
Additional Notes
Kernel Trick: The kernel trick allows SVM to map data into higher
dimensions to create a linear separation where it is otherwise impossible.
Some popular kernels include:
Polynomial Kernel: Captures polynomial relationships.
Support Vectors: Only the points that lie on the margin, known as support
vectors, affect the final model, leading to a sparse solution.
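A hedged sketch with scikit-learn's SVC, using an RBF kernel and feature scaling; the dataset and the C/gamma values are illustrative assumptions.
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The RBF kernel maps the data implicitly to a higher-dimensional space;
# C controls the trade-off between a wide margin and hinge-loss penalties.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X_train, y_train)

print("Test accuracy:", model.score(X_test, y_test))
print("Support vectors per class:", model.named_steps["svc"].n_support_)
```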
Clustering Tutorial
K-Means
Introduction
K-Means is a popular unsupervised learning algorithm used for clustering,
where it groups data points into a predefined number of clusters. The main
objective of K-Means is to partition the data into clusters such that data points
within a cluster are more similar to each other than to those in other clusters.
This similarity is measured by the distance between points, often using
Euclidean distance. K-Means is effective for segmenting datasets where a
natural grouping exists, making it useful in applications like customer
segmentation, image compression, and pattern recognition.
4. Update centroids: Calculate the mean of all points within each cluster and
update the centroids accordingly.
Formula to Predict Cluster for a New Point
The formula used in K-Means for assigning a point \( x \) to a cluster is based
on minimizing the Euclidean distance:
\[
d(x, \mu_j) = \sqrt{\sum_{i=1}^{n} (x_i - \mu_{j,i})^2}
\]
where:
\( \mu_j \) is the centroid of cluster \( j \), and \( n \) is the number of features.
The point \( x \) is assigned to the cluster \( j \) that minimizes \( d(x, \mu_j) \).
The objective function for K-Means, called the within-cluster sum of squares
(WCSS), is represented as:
\[
\text{WCSS} = \sum_{j=1}^{k} \sum_{x \in C_j} \| x - \mu_j \|^2
\]
where:
\( C_j \) represents cluster \( j \), \( \mu_j \) is its centroid, and \( k \) is the number of clusters.
This function calculates the squared distance between each point and its
centroid, summing it across all clusters. K-Means aims to minimize this value.
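A minimal scikit-learn sketch; the synthetic blobs and the choice of three clusters are assumptions for illustration. Note that KMeans's inertia_ attribute is exactly the WCSS defined above.
```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three natural groups (illustrative).
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# k-means++ initialization is the scikit-learn default.
kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
kmeans.fit(X)

# inertia_ is the within-cluster sum of squares (WCSS) that K-Means minimizes.
print("WCSS:", kmeans.inertia_)
print("Cluster of a new point:", kmeans.predict([[0.0, 0.0]])[0])
```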
Works well with spherical clusters: K-Means performs well with clusters
that are roughly circular in shape.
Cons:
Low variance: If the initialization process is handled well (e.g., with K-
Means++), K-Means produces consistent results across runs. However, if
centroids are poorly initialized, variance may increase.
Additional Notes
K-Means++ Initialization: K-Means++ is a method to improve centroid
initialization, helping to reduce the likelihood of poor clustering and
convergence to a local minimum.
Hierarchical Clustering
Agglomerative (most common): Starts with each data point as its own
cluster and merges the closest clusters iteratively until only one cluster
remains.
Divisive: Starts with all points in a single cluster and recursively splits them
until each data point is its own cluster.
Key Steps:
1. Calculate distances: Compute the distance matrix between each pair of
data points.
2. Merge closest clusters: Find the two closest clusters (based on a distance
metric) and merge them.
Ward’s linkage: Minimizes the variance within each cluster, often producing
balanced clusters.
The choice of metric and linkage can affect cluster shape and separation, so
they should be selected based on data characteristics.
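A short sketch of agglomerative clustering with Ward's linkage using SciPy; the synthetic data and the distance threshold used to cut the dendrogram are illustrative assumptions.
```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)

# Build the merge hierarchy with Ward's linkage on Euclidean distances.
Z = linkage(X, method="ward")

# "Cut" the dendrogram at a chosen distance threshold to obtain flat clusters.
labels = fcluster(Z, t=10.0, criterion="distance")
print("Number of clusters at this cut:", len(set(labels)))
```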
The resulting dendrogram shows how clusters merge at different levels, allowing the user to decide the best number of clusters based
on their requirements.
Suitable for arbitrary shapes: Can capture clusters of different shapes and
sizes more naturally than K-Means.
Cons:
High bias: This is because once clusters are formed, they cannot be
changed, leading to rigid clustering structures.
Low variance: Results tend to be stable across different runs because there
is no random initialization. However, this depends on the choice of linkage
and distance metric.
Additional Notes
Dendrogram Cutting: The depth at which the dendrogram is "cut"
determines the number of clusters. The threshold can be chosen by
analyzing the dendrogram and selecting a height where there is a
significant gap between levels.
Scalability: Hierarchical clustering may not perform well with large datasets
due to its high computational requirements, but it is effective for small to
medium datasets.
Comparison of K-Means and Hierarchical Clustering:
Algorithm Type: K-Means is iterative and distance-based, while Hierarchical
Clustering is agglomerative (bottom-up) or divisive (top-down).
Speed and Complexity: K-Means is faster, with complexity of \( O(n \times k \times i) \),
while Hierarchical Clustering is slower, with complexity of \( O(n^2 \log(n)) \).
Summary
K-Means is suitable for large datasets with spherical clusters and requires
pre-specifying the number of clusters.