
ML Tutorial

Classification Algorithms

Logistic Regression Tutorial

Introduction
Logistic Regression is a statistical model commonly used for binary
classification tasks, where the goal is to classify data into one of two possible
outcomes. Unlike linear regression, which predicts a continuous value, logistic
regression predicts the probability of a sample belonging to a particular class.
The core idea of logistic regression is to map input features to a probability
score between 0 and 1 using the logistic (or sigmoid) function. This makes it
suitable for binary classification tasks, such as spam vs. not spam or disease vs. no disease. Logistic regression works by finding a decision boundary that
separates the classes in the feature space, maximizing the likelihood of correct
classification.

Formula to Predict a New Point


For a given data point \( x \) with features \( x_1, x_2, \ldots, x_n \), the
prediction in logistic regression is given by the logistic (or sigmoid) function
applied to the linear combination of the features:

\[
h(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n)}}
\]

where:

\( h(x) \) is the predicted probability that the data point belongs to the positive class.

\( \beta_0 \) is the intercept term.

\( \beta_1, \beta_2, \dots, \beta_n \) are the coefficients associated with each feature.

If \( h(x) \geq 0.5 \), the point is classified as belonging to the positive class;
otherwise, it is classified as the negative class.
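
As a minimal NumPy sketch of this rule (the coefficient and feature values below are hypothetical, chosen only for illustration):

```python
import numpy as np

def predict_proba(x, beta0, beta):
    """Sigmoid of the linear combination beta0 + beta . x."""
    z = beta0 + np.dot(beta, x)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients for a two-feature model
beta0, beta = -1.0, np.array([0.8, -0.5])
x_new = np.array([2.0, 1.0])

p = predict_proba(x_new, beta0, beta)
label = int(p >= 0.5)  # 1 = positive class, 0 = negative class
print(p, label)
```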

The Loss Function


Logistic regression uses Binary Cross-Entropy Loss (or Log Loss) to measure
the error between the predicted probability and the actual class label. The loss
function for a single training example is:

\[
\text{Loss}(y, h(x)) = -\left( y \cdot \log(h(x)) + (1 - y) \cdot \log(1 - h(x)) \right)
\]

where:

\( y \) is the actual class label (0 or 1).

\( h(x) \) is the predicted probability from the logistic function.

The goal is to minimize the total loss over all training examples, which can be
achieved through gradient descent or other optimization algorithms.
Minimizing this loss function helps the model make accurate predictions.
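
The loss over a batch of examples can be computed directly, as in the sketch below; the small eps clipping is an implementation detail added here to avoid log(0), not part of the formula itself:

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean log loss over all training examples."""
    y_prob = np.clip(y_prob, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

print(binary_cross_entropy(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.6])))
```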

Pros and Cons of Logistic Regression
Pros:

1. Interpretability: Coefficients provide insight into the importance of each feature.

2. Efficiency: Simple and computationally efficient, suitable for small to medium-sized datasets.

3. Probability Outputs: Provides probabilities rather than hard classifications, which is useful for understanding confidence in predictions.

4. Less Prone to Overfitting: Logistic regression has a lower risk of overfitting, especially if regularization is applied.

Cons:

1. Assumption of Linearity: Assumes a linear relationship between the features and the log-odds of the outcome, which may not hold for all datasets.

2. Limited to Binary Classification: Needs extensions (e.g., multinomial logistic regression) for multiclass classification tasks.

3. Sensitive to Feature Scaling: Requires standardized or normalized features for optimal performance.

4. Poor Performance with Complex Relationships: Logistic regression does not capture non-linear relationships between features and target variables well.

Effect of Outliers on Logistic Regression


Outliers can significantly impact logistic regression because they can skew
the decision boundary due to the linear nature of the model. For instance, if an
outlier is far from the main cluster of data, it may push the boundary closer to
the remaining points, causing misclassification.
To mitigate the impact of outliers:

Robust Scaling: Scaling features using robust methods can reduce sensitivity to outliers.

Regularization: Applying regularization, such as L2 (Ridge), can make the model more robust to extreme values.

Outlier Detection and Removal: Identifying and removing outliers before training can help improve logistic regression performance.
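
For illustration, the first two mitigations could be combined in a scikit-learn pipeline along the following lines (the parameter values are placeholders, not recommendations):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import LogisticRegression

# RobustScaler centers and scales with the median and IQR, so extreme values
# distort the scaling less; C controls the strength of the L2 regularization.
model = make_pipeline(RobustScaler(), LogisticRegression(C=1.0, penalty="l2"))
# model.fit(X_train, y_train); model.predict_proba(X_test)
```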

Bias and Variance in Logistic Regression


Bias: Logistic regression generally has high bias because it assumes a
linear relationship between the features and the log-odds of the target. This
assumption can lead to underfitting if the true relationship is non-linear.

Variance: Logistic regression typically has low variance, making it a stable model that does not vary significantly with different training sets. This is especially true when regularization is applied.

This high-bias, low-variance property makes logistic regression a suitable choice for simpler problems but limits its effectiveness for complex or highly non-linear data.

Additional Notes
Regularization: Regularization techniques, such as L1 (Lasso) and L2
(Ridge) regularization, are often applied to logistic regression to prevent
overfitting and handle multicollinearity. This introduces a penalty for large
coefficients, encouraging the model to find simpler solutions.

Threshold Tuning: The default threshold for classification is 0.5, but it can
be adjusted depending on the specific problem. For example, in medical
diagnoses where false negatives are costly, a lower threshold might be
chosen.

Multinomial Logistic Regression: For multiclass classification, logistic regression can be extended to multinomial logistic regression. Techniques like One-vs-Rest (OvR) or Softmax can be applied to handle multiple classes.

Feature Engineering: Logistic regression can benefit significantly from feature engineering, especially when dealing with non-linear data. Transforming features (e.g., using polynomial features or interaction terms) can improve its performance on more complex datasets.

Tree Models

Decision Trees Tutorial

Introduction
Decision Trees are a popular supervised learning algorithm used for both
classification and regression tasks. They work by recursively splitting the
dataset into subsets based on feature values, creating a tree-like structure
where each internal node represents a feature (or attribute), each branch
represents a decision rule, and each leaf node represents an outcome (or class
label).

The goal of a decision tree is to create a model that predicts the target variable
by learning simple decision rules inferred from the data features. Decision trees
are intuitive, easy to interpret, and can handle both categorical and continuous
data. Their transparency and straightforward visualization make them a popular
choice among practitioners, especially when model interpretability is crucial.

Formula to Predict a New Point
The prediction of a new data point in a decision tree involves traversing the tree
from the root to a leaf node. At each internal node, the algorithm evaluates a
feature and makes a decision based on its value, directing the traversal to the
left or right branch until it reaches a leaf node, which contains the predicted
outcome.

In order to build the tree, we need to decide which features to split on, both for
the root node and the internal nodes. This is done by evaluating each possible
split and choosing the one that maximizes the "purity" (or minimizes impurity)
of the resulting nodes. Here’s how:

Selecting the Root Node and Internal Nodes


1. Evaluate All Features for the Best Split:

For each feature, test possible thresholds to split the data into two
groups (for binary splits).

Calculate the impurity of each split using a criterion such as Gini impurity, Entropy, or Mean Squared Error (MSE) for regression.

2. Impurity Measures:

The goal is to reduce impurity at each split. Common impurity measures are:

Gini Impurity for classification:


\[
Gini(D) = 1 - \sum_{i=1}^{C} p_i^2
\]
where \( p_i \) is the proportion of samples in class \( i \), calculated
as:
\[
p_i = \frac{n_i}{N}
\]
where:

\( n_i \) is the number of samples in class \( i \),

\( N \) is the total number of samples in the dataset.

Entropy for classification:


\[
Entropy(D) = - \sum_{i=1}^{C} p_i \log_2(p_i)
\]

Mean Squared Error (MSE) for regression:


\[
MSE(D) = \frac{1}{n} \sum_{j=1}^{n} (y_j - \hat{y})^2
\]
where \( y_j \) is the true value of a sample in the node, \( \hat{y} \)
is the average value for that node, and \( n \) is the number of
samples.

3. Calculate Information Gain (or Reduction in Impurity):

For each possible split, compute the Information Gain or Reduction in Impurity. This is the change in impurity from the parent node to the child nodes. The split that maximizes this gain is selected.

For a split on feature \( X \) with threshold \( t \), Information Gain can be calculated as:
\[
IG(D, X, t) = I(D) - \left( \frac{|D_{left}|}{|D|} I(D_{left}) +
\frac{|D_{right}|}{|D|} I(D_{right}) \right)
\]
where:

\( I(D) \) is the impurity of the parent node,

\( I(D_{left}) \) and \( I(D_{right}) \) are the impurities of the child


nodes after the split,

\( |D_{left}| \) and \( |D_{right}| \) are the number of samples in the


left and right child nodes, respectively,

\( |D| \) is the total number of samples in the parent node.

4. Choose the Best Split:

The feature and threshold with the highest Information Gain (or largest
reduction in impurity) becomes the split point for the current node.

5. Recursion for Internal Nodes:

Repeat this process recursively for each child node, treating each child
node as the new parent node.

Continue until a stopping criterion is met, such as reaching a maximum
tree depth, a minimum number of samples in a node, or achieving zero
impurity.

6. Stopping Criteria:

Maximum Depth: Prevents the tree from growing too complex and
overfitting.

Minimum Samples per Node: Ensures each node represents a


significant portion of data.

Impurity Threshold: Stops the split if the reduction in impurity is below


a defined threshold.

The Loss Function


In the context of decision trees, the loss function is based on the impurity of
the nodes. The goal is to minimize the impurity when splitting nodes.
Commonly used impurity measures are:

Gini Impurity for classification:


\[
Gini(D) = 1 - \sum_{i=1}^{C} p_i^2
\]

Entropy for classification:


\[
Entropy(D) = - \sum_{i=1}^{C} p_i \log_2(p_i)
\]

Mean Squared Error (MSE) for regression:


\[
MSE(D) = \frac{1}{n} \sum_{j=1}^{n} (y_j - \hat{y})^2
\]

The tree algorithm selects the feature and the split point that minimizes the
impurity for the resulting child nodes, aiming to create homogeneous groups of
outcomes.
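
A small NumPy sketch of Gini impurity and the resulting information gain for a candidate split (the label array is illustrative):

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, left, right):
    """Reduction in Gini impurity obtained by splitting parent into left/right."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

y = np.array([0, 0, 1, 1, 1, 0])
print(information_gain(y, y[:3], y[3:]))  # gain for splitting after the third sample
```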

Pros and Cons of Decision Trees


Pros:

1. Interpretability: Decision trees are easy to visualize and interpret, making it
clear how decisions are made.

2. Non-linear Relationships: They can capture non-linear relationships between features and the target variable without requiring any transformation.

3. Minimal Data Preparation: Decision trees require little data preprocessing, as they do not require scaling, normalization, or one-hot encoding.

4. Handles Both Numerical and Categorical Data: They can work with both types of data without special transformations.

5. Robust to Feature Scaling: Decision trees are not sensitive to the scale of data, unlike some other models.

6. Works Well on Large Datasets: With certain optimizations (like pruning), decision trees can work well on large datasets.

Cons:

1. Prone to Overfitting: Without limitations on depth or splits, decision trees can easily overfit the data, especially if the tree is deep.

2. Sensitive to Noisy Data: Small changes in the data can lead to a completely different tree structure, which can reduce model stability.

3. Biased Towards Dominant Classes: If one class is more prevalent, decision trees might lean towards predicting it more often, especially in imbalanced datasets.

4. Suboptimal Splits in High Dimensions: In high-dimensional spaces, decision trees can struggle to find optimal splits, often leading to subpar performance compared to other models.

5. Requires Pruning: Pruning is necessary to prevent overfitting but requires additional computational effort and complexity.

How Outliers Affect Decision Trees


Decision trees are generally robust to outliers because they partition data into
homogeneous groups, rather than relying on statistical parameters (like mean
and standard deviation). However, in certain cases, outliers can still influence
the splits if they cause impurity calculations to favor divisions that isolate them.

While outliers are less likely to affect trees significantly, pruning or limiting the
depth can help avoid nodes formed solely due to outliers.

Bias and Variance


Bias: Decision trees have low bias. They can fit complex patterns in the
data well, which often allows them to capture non-linear relationships.

Variance: Decision trees have high variance because small changes in the
training data can lead to completely different splits and a different tree
structure. This high variance makes individual decision trees prone to
overfitting, particularly on small datasets.

To address this high variance, ensemble methods like Random Forests or Gradient Boosted Trees are commonly used, which reduce the variance by combining multiple trees.

Additional Notes
1. Pruning: To avoid overfitting, trees often require pruning. Pruning involves
removing branches that have little importance or adding stopping criteria
(like a maximum depth) to control the tree’s complexity.

2. Tree Depth: Limiting the maximum depth of a tree is another way to control
for overfitting and improve model generalization.

3. Feature Importance: Decision trees provide a way to assess feature


importance by observing how much each feature reduces impurity across
the splits. Features that lead to greater reductions in impurity are
considered more important.

4. Ensemble Methods: To improve accuracy, decision trees are often used in


ensemble methods such as Random Forests and Gradient Boosting. These
methods combine multiple trees to produce more robust models with better
generalization.

5. Handling Missing Values: Decision trees can handle missing values by


considering multiple possible splits or by using surrogate splits, which look
for alternative splits in case of missing data.

By understanding these key points about decision trees, you’ll be well-


equipped to apply them effectively in various machine learning contexts and
identify when they might be a suitable choice.

Random Forest Tutorial

Introduction
Random Forest is an ensemble learning method that combines multiple
decision trees to improve the accuracy and stability of predictions. Developed
by Leo Breiman, Random Forest uses a technique known as bagging (Bootstrap
Aggregating) to build a "forest" of individual decision trees, where each tree is
trained on a random subset of the data. By averaging or majority voting the
predictions of each tree, Random Forest reduces the risk of overfitting that
individual decision trees often face.
Random Forest is highly effective in both classification and regression tasks,
making it versatile across many fields, including finance, healthcare, and e-
commerce. The goal is to improve the accuracy and generalization of
predictions while maintaining model interpretability.

Formula to Predict a New Point
Random Forest prediction involves aggregating predictions from multiple
decision trees. Each tree in the forest makes an independent prediction, and
the final output is determined by averaging (for regression) or taking the
majority vote (for classification) of these individual predictions.

1. Classification:
\[
\hat{y} = \text{mode}\left\{ T_1(x), T_2(x), \dots, T_m(x) \right\}
\]
where \( T_i(x) \) is the prediction of the \(i\)-th tree in the forest, and \( m \)
is the total number of trees. The mode, or majority vote, of these
predictions determines the final output.

2. Regression:
\[
\hat{y} = \frac{1}{m} \sum_{i=1}^{m} T_i(x)
\]
where \( T_i(x) \) is the prediction of the \(i\)-th tree, and the average value
of the predictions is used as the final output.

The randomness introduced during the training phase, both in the selection of
data samples and feature subsets, helps to reduce variance and improve the
model’s ability to generalize.
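
The majority-vote aggregation can be made explicit with scikit-learn's RandomForestClassifier, whose fitted trees are exposed as estimators_; note that scikit-learn's own predict averages class probabilities rather than counting hard votes, so the two can occasionally differ:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Hard majority vote across the individual trees for the first five points
per_tree = np.array([tree.predict(X[:5]) for tree in forest.estimators_])
majority_vote = (per_tree.mean(axis=0) >= 0.5).astype(int)

print(majority_vote)
print(forest.predict(X[:5]))  # forest prediction (averaged probabilities)
```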

The Loss Function


The Random Forest model itself doesn’t use a specific loss function during
training. Instead, it relies on the underlying decision trees, which are typically
trained with the following objectives:

Classification: Minimizing impurity at each split, using measures such as


Gini impurity or Entropy.

Regression: Minimizing Mean Squared Error (MSE) at each split within


each tree.

The final performance of the forest is typically evaluated based on the chosen
task's loss function, such as cross-entropy or accuracy for classification, and
Mean Squared Error for regression.

How Trees in the Forest Are Built

Random Forest uses bagging and feature selection to create diverse trees in
the forest:

1. Bagging (Bootstrap Aggregating):

Each tree is trained on a different random subset of the training data


with replacement. This means that some samples may appear multiple
times in a subset, while others may not appear at all.

2. Random Feature Selection:

For each split in a tree, only a random subset of the features is


considered. This ensures diversity among the trees and reduces
correlation, leading to better generalization.

3. Growing the Trees:

Each tree is grown to its maximum depth, with no pruning, to allow it to


capture complex patterns in the data.

The combination of bagging and random feature selection helps in


decorrelating the trees, so their combined predictions are more robust than
those of individual trees.

Pros and Cons of Random Forest


Pros:

1. High Accuracy: Random Forest generally offers better accuracy than


individual decision trees due to its ensemble approach.

2. Reduces Overfitting: The combination of bagging and random feature


selection reduces overfitting, especially on large datasets.

3. Handles Large Feature Sets: By using a random subset of features for each
split, Random Forests are effective even with high-dimensional data.

4. Works with Missing Data: It can handle missing values by splitting based
on surrogate splits or by averaging.

5. Feature Importance: Random Forests provide feature importance scores,


helping in feature selection and model interpretability.

Cons:

1. Complexity: Compared to a single decision tree, Random Forests are


computationally intensive and require more memory and processing power.

2. Less Interpretability: The ensemble of trees makes the model less
interpretable than individual decision trees.

3. Longer Training Times: Building a large number of trees can result in


longer training times, especially for large datasets.

4. No Extrapolation for Regression: For regression tasks, Random Forest


cannot extrapolate beyond the training data range.

How Outliers Affect Random Forests


Random Forests are generally robust to outliers because they average the
results of multiple trees. While outliers might affect individual trees, their
influence is diluted by the ensemble’s aggregation mechanism. However, if
outliers are extreme, they can still affect the splits in some trees, leading to
minor noise in predictions.

Bias and Variance


Bias: Random Forests have low bias because they capture complex
patterns due to their high capacity for flexibility within individual trees.

Variance: Random Forests reduce variance by averaging multiple trees,


making the final model less sensitive to noise and fluctuations in the
training data.

In general, Random Forests strike a good balance between bias and variance,
which often leads to better generalization compared to a single decision tree.

Additional Notes
1. Hyperparameters:

The key hyperparameters for Random Forest include the number of


trees ( n_estimators ), the maximum depth of each tree ( max_depth ), the
number of features to consider at each split ( max_features ), and
minimum samples per split ( min_samples_split ).

Tuning these hyperparameters is crucial to achieving optimal


performance, especially for larger datasets.

2. OOB (Out-of-Bag) Error:

Since each tree in the Random Forest is trained on a bootstrap sample
(subset with replacement), about 1/3 of the data is left out of each
sample. These “out-of-bag” samples can be used to estimate model
performance without needing a separate validation set.

3. Feature Importance:

Random Forest provides feature importance scores based on how much


each feature reduces impurity across splits in the forest. This is
valuable for feature selection and interpretability.

4. Applications:

Random Forest is widely used in areas where accuracy and stability are
crucial, such as medical diagnosis, fraud detection, financial modeling,
and image recognition.

Random Forest provides a flexible, accurate, and robust machine learning


approach, suitable for a wide variety of datasets and problems, while helping
address the overfitting issues associated with single decision trees.

Gradient Boosting Machines (GBM) Tutorial

Introduction
Gradient Boosting Machines (GBM) is an ensemble machine learning algorithm
that builds models sequentially by combining the strengths of many weak
learners, typically decision trees, to form a strong predictive model. Unlike
Random Forest, where trees are trained independently, in Gradient Boosting,
each tree is trained to correct the errors of its predecessor. GBMs are highly
effective for both classification and regression tasks, excelling in performance
on tabular datasets with complex relationships.
The core idea behind GBM is to minimize the residual error (or loss) of the
previous model by adding a new tree at each step, designed to model the
residuals or gradient of the loss function.

Formula to Predict a New Point


A GBM model is built incrementally, where each new model \( h_t(x) \) added to
the ensemble targets the residual error of the previous iteration:

1. Initial Prediction: The first model \( F_0(x) \) makes an initial prediction,


typically the mean of the target values for regression.

2. Residual Learning: Each subsequent model, \( h_t(x) \), is trained on the


residuals (errors) from the previous iteration.

3. Final Prediction: The final model \( F_T(x) \) at iteration \( T \) combines all


models:
\[
F_T(x) = F_0(x) + \sum_{t=1}^{T} \gamma_t \cdot h_t(x)
\]
where \( \gamma_t \) is a learning rate that controls how much each model
contributes to the final prediction. For regression, the prediction is simply
the sum of all models, while for classification, it’s often the sum passed
through a transformation (e.g., logistic function) to output probabilities.

4. Gradient Boosting: Each model \( h_t(x) \) is fitted to the gradient of the


loss function (the residuals), making the model iterative and focusing on
errors.

The Loss Function


GBMs can use various loss functions based on the task:

1. Mean Squared Error (MSE) for regression:
\[
\text{Loss} = \frac{1}{N} \sum_{i=1}^{N} (y_i - F_T(x_i))^2
\]
where \( y_i \) is the true target value, \( F_T(x_i) \) is the final prediction,
and \( N \) is the number of samples.

2. Logistic Loss for binary classification:


\[
\text{Loss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(F_T(x_i)) + (1 - y_i)
\log(1 - F_T(x_i)) \right]
\]
GBM iteratively fits each model to the gradient of the loss function,
minimizing it over time.

3. Learning Rate:

The learning rate \( \gamma \) is a hyperparameter that scales the


contribution of each tree, balancing the speed and accuracy of training.
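
A minimal sketch of this procedure for regression under squared-error loss, where the negative gradient is simply the residual (function names and hyperparameter values are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbm(X, y, n_trees=50, learning_rate=0.1, max_depth=2):
    """Minimal gradient boosting for squared error: each tree fits the residuals."""
    f0 = y.mean()                      # initial prediction F_0(x)
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_trees):
        residuals = y - pred           # negative gradient of the squared error
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        pred += learning_rate * tree.predict(X)
        trees.append(tree)
    return f0, trees

def predict_gbm(X, f0, trees, learning_rate=0.1):
    """F_T(x) = F_0(x) plus the sum of scaled tree predictions."""
    return f0 + learning_rate * sum(t.predict(X) for t in trees)
```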

Pros and Cons of GBM


Pros:

1. High Accuracy: GBM often provides superior predictive accuracy due to its
iterative nature and ability to reduce bias.

2. Flexibility: It can handle both regression and classification tasks with


various loss functions.

3. Handles Complex Data: GBMs can learn complex, non-linear relationships


in the data.

4. Customizability: Multiple hyperparameters, including learning rate, number


of trees, and tree depth, can be fine-tuned for optimal performance.

Cons:

1. Sensitive to Outliers: Since each new model tries to reduce errors, GBM
can amplify the influence of outliers.

2. Longer Training Time: Due to the sequential training of trees, GBM is


slower than parallel algorithms like Random Forest.

3. Prone to Overfitting: With many trees and a high learning rate, GBMs can
overfit on noisy datasets.

4. Complexity in Tuning: Tuning parameters such as learning rate, tree depth,


and number of trees can be challenging and requires cross-validation.

How Outliers Affect Gradient Boosting Machines


Gradient Boosting Machines are sensitive to outliers because each new model
emphasizes reducing the residuals of the previous predictions. This can cause
the model to place excessive weight on outliers, leading to potential overfitting.
Techniques like reducing tree depth, adding regularization, or using robust loss
functions (e.g., Huber loss) can mitigate this issue.

Bias and Variance


Bias: GBMs have low bias due to their iterative, residual-reducing
approach. Each new model reduces the bias by focusing on the residuals,
making the ensemble highly flexible and able to capture complex patterns.

Variance: GBMs have high variance, as each new tree is trained on the
errors of the previous one. This variance can lead to overfitting, especially
with high learning rates or too many trees. Regularization techniques, such
as shrinkage (lower learning rates) and early stopping, are often used to
manage variance.

Additional Notes
1. Learning Rate:

The learning rate controls how much each tree contributes to the final
model. Lower learning rates often yield better results but require more
trees, increasing training time.

2. Regularization Techniques:

Regularization is essential to prevent overfitting. Techniques include


early stopping (stopping training when performance on validation data
no longer improves), subsampling (training each tree on a random
subset of data), and shrinkage (applying a small learning rate).

3. Hyperparameter Tuning:

The main parameters to tune are the number of trees, tree depth, and
learning rate. Grid search and cross-validation are commonly used to
identify the best combination of parameters.

4. Extensions:

XGBoost (Extreme Gradient Boosting), LightGBM, and CatBoost are


advanced implementations of GBM that improve on the standard GBM
in terms of speed, efficiency, and performance, each with unique
optimizations.

5. Feature Importance:

Like Random Forest, GBMs can calculate feature importance, helping to


identify the most influential features in the model.

XGBoost Tutorial
1. Introduction
Extreme Gradient Boosting (XGBoost) is an optimized version of Gradient
Boosting that focuses on speed and performance, making it widely popular for
machine learning competitions and practical applications. It was developed
with the aim of improving on traditional Gradient Boosting by offering efficient,
scalable, and flexible implementations. XGBoost achieves these enhancements
by optimizing the algorithm’s core structure and applying advanced
regularization techniques.
In XGBoost, the boosting process works by sequentially adding decision trees
(usually small trees with limited depth) to correct the residual errors made by
previous trees, gradually improving the overall accuracy of the model.

2. Formula to Predict a New Point


The XGBoost prediction for a new point \( x \) is given by the summation of all
the trees in the model. Each tree \( f_k \) produces a prediction, and the final
prediction \( \hat{y} \) is the sum of these predictions:

\[
\hat{y} = \sum_{k=1}^{K} f_k(x)
\]

where:

\( K \) is the total number of trees,

\( f_k(x) \) represents the prediction of the \( k \)-th tree,

\( \hat{y} \) is the final prediction, which can be used for regression or as a


probability in classification.

The objective function in XGBoost is minimized by adding a new tree \( f_t(x) \)


at each step that best fits the residuals (the differences between predicted and
actual values).

3. Loss Function
The loss function in XGBoost combines two components:

1. Prediction error: This measures the error between predicted values and
true values, commonly using mean squared error for regression tasks and
log loss for classification.

2. Regularization term: To prevent overfitting, XGBoost incorporates


regularization terms on tree complexity (such as the number of leaves or
leaf weights).

The general form of the objective function \( L \) is:

\[
L = \sum_{i=1}^{N} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)
\]
where:

\( l(y_i, \hat{y}_i) \) is the loss function measuring the error between actual \
( y_i \) and predicted \( \hat{y}_i \),

\( \Omega(f_k) \) is the regularization term for each tree \( f_k \), which
controls complexity and penalizes large trees to prevent overfitting.

The regularization term is defined as:


\[
\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2
\]
where:

\( T \) is the number of leaves in the tree,

\( \gamma \) and \( \lambda \) are hyperparameters that control the


regularization strength,

\( w_j \) are the weights of each leaf node.
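
Assuming the separate xgboost package is installed, a basic usage sketch might look like this; gamma and reg_lambda correspond to the \( \gamma \) and \( \lambda \) penalties above, and all values are illustrative only:

```python
import xgboost as xgb  # assumes `pip install xgboost`
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# gamma penalizes the number of leaves; reg_lambda penalizes large leaf weights.
model = xgb.XGBRegressor(n_estimators=300, learning_rate=0.05, max_depth=3,
                         gamma=1.0, reg_lambda=1.0)
model.fit(X_tr, y_tr)
print(model.predict(X_te[:5]))
```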

4. Pros and Cons of XGBoost

Pros
High Performance: XGBoost is highly optimized and fast, often
outperforming other gradient boosting models.

Regularization: The model has in-built regularization to prevent overfitting,


which helps improve generalization.

Handling of Missing Values: XGBoost can automatically handle missing


data by learning which path in the tree should be taken when missing
values are encountered.

Flexible Objective Functions: It allows custom loss functions and can be


applied to both regression and classification tasks.

Cons
Complexity: XGBoost has many hyperparameters, which can make tuning
complex and time-consuming.

Memory Usage: With large datasets, XGBoost can consume significant


memory, especially for high-dimensional data.

Sensitivity to Noise: While less sensitive than AdaBoost, it can still be


influenced by noisy or highly irrelevant features if not tuned properly.

5. Effect of Outliers on XGBoost


Outliers can have an impact on XGBoost, although it is typically less sensitive
than linear models due to the tree structure. Trees naturally segment the data,
which can reduce the impact of extreme values. However, if outliers affect the
early stages of boosting, subsequent trees may still overfit to those values.
Regularization and careful tuning can help mitigate this impact.

6. Bias and Variance in XGBoost

Bias: XGBoost generally has low bias. By building successive trees that
learn from residuals, the model reduces bias incrementally, making it
effective for complex tasks.

Variance: XGBoost can exhibit moderate to high variance because it has a


large capacity for fitting complex patterns, which can lead to overfitting on
noisy data. Regularization techniques and hyperparameter tuning (e.g.,
limiting tree depth, adjusting learning rate) are necessary to control
variance and prevent overfitting.

7. Additional Notes
Learning Rate: The learning rate (also called eta) controls the contribution
of each new tree. Lower values (e.g., 0.01–0.1) usually yield better
generalization but require more trees to converge.

Early Stopping: XGBoost allows early stopping based on validation metrics,


which can prevent overfitting by stopping the training process once
performance ceases to improve.

Cross-Validation: Using k-fold cross-validation is often beneficial in


XGBoost to ensure the model is generalizing well.

Distributed Computing: XGBoost supports distributed processing, allowing


it to scale efficiently across multiple cores or machines, making it suitable
for very large datasets.

XGBoost is one of the most powerful and flexible models for classification and
regression tasks, but achieving optimal results requires tuning its parameters
and monitoring its behavior carefully.

AdaBoost Tutorial
1. Introduction
Adaptive Boosting (AdaBoost) is an ensemble learning technique designed to
improve the accuracy of weak learners, usually decision stumps (single-split
decision trees), by iteratively focusing on the mistakes made in previous
rounds. The goal of AdaBoost is to combine a sequence of these weak
learners, each improving upon the errors of the last, to form a strong predictive
model.

AdaBoost achieves this by increasing the weights of misclassified samples,
forcing subsequent learners to pay more attention to them, and thus iteratively
reducing errors. This makes AdaBoost especially useful in classification tasks
where it produces a final model that has reduced error rates compared to
individual weak models.

2. Formula to Predict a New Point


The final prediction of AdaBoost is a weighted vote from each weak learner. In
the context of binary classification, the final prediction \( F(x) \) for an input \( x
\) is given by:
\[
F(x) = \text{sign} \left( \sum_{t=1}^{T} \alpha_t \cdot h_t(x) \right)
\]
where:

\( T \) is the total number of weak learners,

\( h_t(x) \) is the prediction made by the \( t \)-th weak learner,

\( \alpha_t \) is the weight assigned to the \( t \)-th weak learner, determined


by the error of that learner,

\( \text{sign} \) outputs the class label based on the sign of the weighted
sum.

The weight \( \alpha_t \) for each weak learner is calculated as:

\[
\alpha_t = \frac{1}{2} \ln \left( \frac{1 - \text{Error}_t}{\text{Error}_t} \right)
\]
where \( \text{Error}_t \) is the weighted error of the \( t \)-th learner. This
weight ensures that models with lower error have higher influence in the final
prediction.
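
A compact sketch of these update rules using decision stumps from scikit-learn (labels are assumed to be coded as -1/+1, and the function names are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_adaboost(X, y, n_rounds=20):
    """Minimal AdaBoost with decision stumps; y must be coded as -1/+1."""
    n = len(y)
    w = np.full(n, 1.0 / n)                     # sample weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.clip(np.sum(w * (pred != y)) / np.sum(w), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)   # weight of this weak learner
        w *= np.exp(-alpha * y * pred)          # up-weight misclassified points
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def predict_adaboost(X, stumps, alphas):
    """Sign of the weighted vote over all weak learners."""
    return np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))
```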

3. Loss Function
The exponential loss function is typically used in AdaBoost, which penalizes
misclassifications more heavily as the iterations progress. Given a set of
predictions, the exponential loss \( L \) for AdaBoost is defined as:
\[
L = \sum_{i=1}^{N} e^{-y_i \cdot F(x_i)}
\]
where:

\( N \) is the total number of training samples,

\( y_i \) is the true label for each sample,

\( F(x_i) \) is the ensemble prediction.

The exponential loss increases significantly as the predictions deviate from the
true labels, which forces AdaBoost to iteratively adjust weights and focus on
hard-to-classify samples.

4. Pros and Cons of AdaBoost

Pros
Effectively Reduces Bias: By combining weak learners and emphasizing
misclassified points, AdaBoost lowers bias, making it effective for complex
classification tasks.

Good Generalization: AdaBoost typically does not overfit easily, especially


when combined with simple base learners.

Focus on Difficult Samples: By giving higher weights to misclassified


samples, AdaBoost excels at learning from hard cases.

Cons
Sensitive to Outliers and Noisy Data: Outliers receive increased weight as
misclassified points, which can harm performance, as the model may focus
excessively on these points.

Limited to Weak Learners: Works best with simple learners like decision
stumps, and using complex models can lead to overfitting.

Not Optimal for Large Datasets: AdaBoost’s iterative nature can be


computationally expensive for very large datasets.

5. Effect of Outliers on AdaBoost


Outliers can have a pronounced effect on AdaBoost. Since AdaBoost increases
the weight of misclassified points, outliers may become highly weighted and
receive undue focus in subsequent rounds. This can make AdaBoost overly
sensitive to outliers, potentially leading to overfitting and reducing generalization on test data. Methods like data cleaning or weight clipping can
help mitigate this issue.

6. Bias and Variance in AdaBoost


Bias: AdaBoost has low bias. By combining weak learners and focusing on
hard-to-classify points, the model minimizes the overall error, reducing
bias.

Variance: AdaBoost tends to have higher variance since it is sensitive to


fluctuations in the training data. Changes in the training set, especially in
challenging or noisy points, can affect its performance. However, variance
can be managed by adjusting the number of learners or by applying
AdaBoost to simple learners.

7. Additional Notes
Learning Rate: AdaBoost includes a learning rate parameter that scales the
influence of each weak learner. Smaller learning rates can improve
generalization by reducing overfitting but require more learners to achieve
high accuracy.

Hyperparameter Tuning: The main parameters to tune in AdaBoost are the


number of weak learners and the learning rate. Increasing the number of
weak learners often improves accuracy but increases computational cost.

Binary Classification: AdaBoost is mainly used for binary classification,


although it can be adapted for multi-class classification with techniques like
one-vs-all or one-vs-one.

Bagging Tutorial
1. Introduction
Bagging (Bootstrap Aggregating) is an ensemble learning technique aimed at
reducing variance and preventing overfitting in high-variance models. It works
by creating multiple versions of a dataset using bootstrap sampling and training
a model independently on each version. By aggregating predictions from each
individual model (usually decision trees), Bagging can produce a final, more
robust prediction. Bagging is especially effective when applied to high-variance models like decision trees, and one of its most popular implementations is
Random Forest.
The main goal of Bagging is to combine many weak models to create a stronger
model with improved stability and accuracy.

2. Formula to Predict a New Point


For a new point \( x \), the prediction in Bagging is the aggregated result from
all models trained on the bootstrapped samples. Let \( h_i(x) \) be the prediction
from the \( i \)-th model. The Bagging prediction \( \hat{y} \) for regression is
the average of all predictions, and for classification, it is the majority vote.

Regression (average of predictions):


\[
\hat{y} = \frac{1}{N} \sum_{i=1}^{N} h_i(x)
\]

Classification (majority vote):


\[
\hat{y} = \text{mode} \{ h_1(x), h_2(x), \dots, h_N(x) \}
\]
where:

\( N \) is the number of models,

\( h_i(x) \) is the prediction from the \( i \)-th model,

\( \text{mode} \) represents the most frequent class label among all


predictions.
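
In scikit-learn this corresponds to BaggingClassifier, whose default base model is a decision tree; a minimal sketch:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=400, random_state=0)

# Each of the 100 trees is trained on a bootstrap sample of the data;
# oob_score=True reuses the left-out points to estimate accuracy.
bag = BaggingClassifier(n_estimators=100, oob_score=True, random_state=0).fit(X, y)

print(bag.oob_score_)      # out-of-bag accuracy estimate
print(bag.predict(X[:5]))  # majority vote across the trees
```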

3. Loss Function
Bagging does not use a specific loss function for combining models, as each
individual model is trained independently. However, the overall goal is to reduce
the Mean Squared Error (MSE) for regression tasks or classification error for
classification tasks. Each model in Bagging is trained on its own bootstrapped
sample, where it optimizes a suitable loss function for that model (e.g., Gini
impurity or entropy in decision trees for classification).

4. Pros and Cons of Bagging

Pros
Reduces Variance: Bagging effectively reduces the variance of models like
decision trees, making them more stable.

Handles Overfitting: Since each model is trained on a different subset,


Bagging reduces the risk of overfitting that may occur with a single decision
tree.

Parallelizable: Bagging can be parallelized easily, as each model is


independent, allowing for faster computation with multiple cores or
distributed systems.

Cons
Increased Computational Cost: Training multiple models increases the
computational cost, especially for large datasets or complex models.

Less Effective on Low-Variance Models: Bagging is best for high-variance


models like decision trees; it may not improve performance significantly on
low-variance models.

Memory Intensive: Bagging requires multiple copies of the data to create


bootstrap samples, which can be memory-intensive for large datasets.

5. Effect of Outliers on Bagging


Bagging is generally less sensitive to outliers than a single decision tree since
each individual model is trained on a slightly different dataset. However, if
outliers appear frequently in the bootstrapped samples, they can still affect
individual model predictions. To mitigate this, Bagging can be combined with
robust algorithms or preprocessing steps like outlier removal.

6. Bias and Variance in Bagging


Bias: Bagging slightly increases bias since it combines multiple models, but
this increase is typically minimal.

Variance: Bagging effectively reduces variance by averaging the


predictions from multiple models, leading to improved generalization on test
data.

Bagging’s effectiveness lies in its ability to lower variance, making it ideal for
high-variance models that may otherwise overfit.

7. Additional Notes
Bootstrap Sampling: Bagging relies on bootstrap sampling, where each data point has roughly a 63% chance (about 1 − 1/e) of appearing at least once in a given bootstrap sample. This randomness contributes to the diversity among models.

Out-of-Bag (OOB) Error: The OOB error is calculated using data points not
included in the bootstrap sample for each model, providing an unbiased
estimate of the model’s generalization performance.

Feature Importance: In models like Random Forest, which are based on


Bagging, feature importance scores can be derived by observing how much
each feature splits in the trees.

Bagging is a straightforward yet powerful method to enhance model


performance by focusing on reducing variance and stabilizing predictions.

Stacking and Voting Tutorial


1. Introduction
Stacking and Voting are ensemble learning techniques that combine multiple
models to improve overall prediction accuracy. Unlike Bagging and Boosting,
which rely on creating variations of a single model type, Stacking and Voting
allow the use of multiple different model types in a single ensemble. Both
methods aim to aggregate the strengths of individual models, but they differ in
how predictions are combined.

Stacking (or Stacked Generalization) combines predictions from different


base models (also called level-0 models) by training a final “meta-model”
on these predictions. The meta-model learns how to best combine base
model predictions to improve accuracy.

Voting is a simpler method where each model in the ensemble directly


contributes to the final prediction, either through majority voting (for
classification) or averaging (for regression).

2. Formula to Predict a New Point

For both Stacking and Voting, we have an ensemble of base models, each
providing a prediction for a new data point \( x \). Let \( h_i(x) \) be the
prediction of the \( i \)-th base model.

Stacking
In Stacking, each base model \( h_i(x) \) makes a prediction, and these
predictions are used as features in a meta-model, which learns to combine
them optimally.
For a new point \( x \):

1. Obtain predictions from each base model \( h_i(x) \).

2. Pass these predictions to the meta-model to produce the final prediction, \(


\hat{y} \).

The prediction formula:


\[
\hat{y} = g(h_1(x), h_2(x), \dots, h_N(x))
\]
where \( g \) is the meta-model and \( N \) is the number of base models.

Voting
In Voting, the final prediction is a direct aggregation of all base model
predictions.

For Classification (Majority Voting):


\[
\hat{y} = \text{mode} \{ h_1(x), h_2(x), \dots, h_N(x) \}
\]

For Regression (Averaging):


\[
\hat{y} = \frac{1}{N} \sum_{i=1}^{N} h_i(x)
\]

where:

\( N \) is the number of base models,

\( \text{mode} \) denotes the most common class label in classification.
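
Both schemes are available in scikit-learn as StackingClassifier and VotingClassifier; a minimal sketch with two illustrative base models:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, random_state=0)
base = [("rf", RandomForestClassifier(random_state=0)),
        ("svm", SVC(probability=True, random_state=0))]

stack = StackingClassifier(estimators=base,
                           final_estimator=LogisticRegression())  # meta-model g
vote = VotingClassifier(estimators=base, voting="soft")           # averaged probabilities

print(stack.fit(X, y).predict(X[:5]))
print(vote.fit(X, y).predict(X[:5]))
```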

3. Loss Function

Stacking
Stacking’s loss function depends on the meta-model chosen. For example:

Classification: The meta-model may minimize logistic loss.

Regression: The meta-model may minimize mean squared error (MSE).

Each base model is first trained independently to minimize its own loss, and
then the meta-model is trained to minimize its loss on the predictions of the
base models.

Voting
Voting doesn’t use an explicit loss function to combine predictions. Each base
model is trained independently on its dataset to optimize its respective loss,
and the final prediction is an aggregation of these outputs.

4. Pros and Cons of Stacking and Voting

Pros
Increased Predictive Power: Combining diverse models can capture more
patterns and reduce errors in predictions.

Flexibility with Different Models: Allows the use of different algorithms


within the same ensemble, optimizing strengths and weaknesses of each.

Reduced Overfitting (Stacking): Stacking reduces the tendency to overfit


by training a meta-model, which better generalizes the output of base
models.

Cons
Computational Cost: Training multiple models, especially with a meta-
model, can be computationally intensive.

Complexity in Tuning (Stacking): Choosing the right base models and a


suitable meta-model can require extensive tuning.

May Not Always Improve Performance: If base models are too similar, the
ensemble may not perform better than individual models.

5. Effect of Outliers on Stacking and Voting


Both Stacking and Voting can be affected by outliers, especially if the base
models are sensitive to them (e.g., Decision Trees). However, since ensemble methods average predictions, they can be more robust to outliers than
individual models.
In Stacking, the meta-model can sometimes “learn” to reduce the influence of
outlier-sensitive models if other base models are more robust. In Voting, the
impact of outliers is reduced if robust models are part of the ensemble.

6. Bias and Variance in Stacking and Voting


Stacking: Typically has low bias and low variance due to the diversity of
base models and the aggregation process by the meta-model. By
combining models with differing biases and variances, Stacking can
achieve an optimal bias-variance trade-off.

Voting: The bias and variance depend on the types of base models used. If
using diverse models, Voting can balance bias and variance. Majority voting
in classification is less likely to overfit, while averaging in regression can
help reduce variance.

7. Additional Notes
Choice of Meta-Model (Stacking): A linear regression or logistic regression
model is often used as the meta-model for simplicity, but complex models
(e.g., neural networks) can also be used depending on the problem.

Soft Voting in Classification: Voting can be extended to soft voting for


classification, where the class probabilities from each model are averaged,
and the class with the highest average probability is selected. Soft voting
often yields better performance than hard voting.

Model Diversity: Both Stacking and Voting benefit from diverse base
models to reduce correlation among errors, which improves the ensemble’s
effectiveness.

Stacking and Voting are powerful ensemble techniques that, when used with
diverse models, can significantly enhance performance by aggregating
individual strengths and minimizing weaknesses.

K-Nearest Neighbors (KNN) Tutorial

Introduction
K-Nearest Neighbors (KNN) is a non-parametric, supervised learning
algorithm used for classification and regression tasks. In KNN, predictions for
a new data point are made based on the "k" closest points in the training
dataset. The primary goal of KNN is to classify or predict the outcome for a
new instance by looking at similar data points and finding a consensus.
KNN is straightforward and effective for low-dimensional data and problems
where similar items are likely to belong to the same category. However, it is
computationally expensive for large datasets since it needs to compare each
point with all other points in the dataset.

How KNN Works


In KNN, the algorithm computes the distance between the new data point and
all points in the training set, then selects the "k" nearest neighbors. Based on
the majority class of these neighbors, KNN assigns the class label (in
classification) or calculates the average outcome (in regression).

Formula to Predict a New Point


For a new data point x', the k closest points are selected using a distance
metric, typically Euclidean Distance. The formula for the Euclidean Distance
between x and x' is:

\[
\text{Distance}(x, x') = \sqrt{\sum_{i=1}^d (x_i - x'_i)^2}
\]
where:

\( d \) is the number of features,

\( x_i \) and \( x'_i \) are the feature values of the training point and the new data point, respectively.

Once the k nearest neighbors are identified:

In classification, KNN assigns the most frequent class among the


neighbors.

In regression, KNN averages the outcomes of the nearest neighbors.
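
A bare-bones NumPy version of this procedure for classification (the function name and default k are illustrative):

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=5):
    """Classify x_new by majority vote among its k nearest training points."""
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))  # Euclidean distances
    nearest = np.argsort(dists)[:k]                        # indices of the k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```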

Loss Function
KNN does not have a specific loss function because it is a lazy learner: it
doesn’t build a model during training. Instead, it stores the entire training
dataset and calculates distances on-the-fly when making predictions. However,
KNN's accuracy or error can be calculated using metrics like:

Classification Error: Fraction of misclassified points.

Mean Squared Error (MSE) for regression tasks.

Pros and Cons of KNN


Pros:

1. Simple and Intuitive: Easy to understand and implement with minimal


configuration.

2. Non-parametric: No assumptions about data distribution.

3. Adaptable: Can be used for both classification and regression tasks.

Cons:

1. Computationally Expensive: High memory and time complexity, especially


for large datasets.

2. Sensitive to the Choice of "k": A poorly chosen value of k can significantly


impact performance.

3. Sensitive to Feature Scaling: Requires normalization/standardization as it


relies on distance measures.

4. Imbalanced Classes: KNN may perform poorly when classes are


imbalanced.

Effect of Outliers on KNN
Outliers can significantly affect KNN because it classifies or predicts based on
nearby points. If an outlier lies within the vicinity of a new data point, it may
lead to misclassification. Techniques like standardization, feature scaling, and
choosing an appropriate distance metric can help reduce the effect of outliers
in KNN.

Bias and Variance in KNN


Bias: KNN has low bias because it does not assume a model structure.

Variance: KNN has high variance because it is sensitive to the specific


instances in the training dataset. The performance may change drastically
with slight variations in the data.

This high variance occurs because KNN directly relies on the dataset without
abstracting patterns or trends. Increasing the value of k can help reduce
variance by averaging more points, but it may increase bias if too many
neighbors are included.

Additional Notes
Choice of Distance Metric: Common choices include Euclidean,
Manhattan, and Minkowski distances. For categorical data, Hamming
Distance is often used.

Scaling Features: Since KNN relies on distances, features should be


normalized to avoid bias towards features with larger ranges.

Optimal Value of k: Typically found through cross-validation. An odd


number for k is often chosen to avoid ties in classification tasks.

Support Vector Machine (SVM) Tutorial


Introduction
Support Vector Machine (SVM) is a supervised machine learning algorithm
primarily used for classification tasks but can also be applied to regression.

The main goal of SVM is to find the optimal hyperplane that separates data
points of different classes with the maximum margin. This maximization of the
margin between classes enhances generalization, making SVM highly effective
in binary classification tasks. SVM is popular in fields like image recognition,
text categorization, and bioinformatics due to its ability to handle high-
dimensional data.

How SVM Works


SVM classifies data by finding a hyperplane that best separates the classes.
The "support vectors" are the data points closest to this hyperplane,
influencing its position and orientation. SVM aims to maximize the distance
(margin) between support vectors of the two classes, ensuring a more robust
classifier.

Formula to Predict a New Point


For a given data point \( x \), SVM predicts the class label \( y \) as follows:

\[
y = \text{sign} (w^T x + b)
\]

where:

\( w \) is the weight vector that defines the orientation of the hyperplane,

\( b \) is the bias term that shifts the hyperplane,

\( \text{sign}(\cdot) \) function returns \( +1 \) or \( -1 \) depending on which
side of the hyperplane the point lies.

The Loss Function


The objective of SVM is to maximize the margin while minimizing the
classification error. The Hinge Loss function is used to penalize misclassified
points or points within the margin. The loss function for SVM is given by:
\[
\text{Loss} = \frac{1}{2} ||w||^2 + C \sum_{i=1}^N \max(0, 1 - y_i (w^T x_i + b))
\]

where:

\( ||w||^2 \) controls the margin (keeps it large),

\( C \) is a regularization parameter that balances maximizing the margin


and minimizing misclassifications,

\( y_i \) is the true label for point \( x_i \).

The Hinge Loss \( \max(0, 1 - y_i (w^T x_i + b)) \) is zero if the point is correctly
classified and beyond the margin; otherwise, it penalizes based on how far the
point is from the correct margin boundary.
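
For reference, the regularized hinge loss above can be computed directly in NumPy (labels are assumed to be coded as -1/+1):

```python
import numpy as np

def svm_loss(w, b, X, y, C=1.0):
    """0.5 * ||w||^2 plus C times the summed hinge loss; y coded as -1/+1."""
    margins = y * (X @ w + b)
    hinge = np.maximum(0.0, 1.0 - margins)  # zero for points beyond the margin
    return 0.5 * np.dot(w, w) + C * hinge.sum()
```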

Calculating the Hyperplane and Updating It


To calculate the hyperplane, SVM solves an optimization problem to find \( w \)
and \( b \) that maximize the margin and minimize classification error. The
gradient descent method is typically used to adjust \( w \) and \( b \) iteratively.
The hyperplane can be updated by:

1. Minimizing the loss: Using optimization techniques like Stochastic


Gradient Descent (SGD) or Quadratic Programming.

2. Support Vectors: Since only support vectors define the hyperplane,


adjusting \( w \) and \( b \) to ensure that these points lie on the margin
helps update the hyperplane.

Pros and Cons of SVM


Pros:

Effective in high-dimensional spaces: Works well when the number of


features is greater than the number of samples.

Robust to overfitting: Especially with the use of regularization (C
parameter).

Kernel Trick: Allows SVM to create non-linear decision boundaries using


kernel functions like RBF or polynomial kernels.

Cons:

Inefficient for large datasets: Training SVM is computationally intensive for


large datasets.

Choice of Kernel: Selecting an appropriate kernel can be challenging and


may impact performance.

Sensitive to parameter tuning: Requires careful tuning of the C and kernel


parameters for optimal results.

Effect of Outliers on SVM


Outliers can significantly affect the hyperplane as they may be located close to
the decision boundary, altering the margin and causing SVM to misclassify
nearby points. Using a soft margin (controlled by parameter \( C \)) helps SVM
tolerate some misclassified points, making it more robust to outliers. A high \( C
\) value places more emphasis on correctly classifying each point, while a low \
( C \) allows more flexibility for misclassification.

Bias and Variance


Bias: SVM generally has low bias, especially with non-linear kernels, as it
can capture complex patterns.

Variance: It tends to have high variance, especially with complex kernels,


as it may be sensitive to changes in the training data.

This high variance makes SVM prone to overfitting if parameters (especially the
kernel and \( C \)) are not carefully selected.

Additional Notes
Kernel Trick: The kernel trick allows SVM to map data into higher
dimensions to create a linear separation where it is otherwise impossible.
Some popular kernels include:

Linear Kernel: Suitable for linearly separable data.

Polynomial Kernel: Captures polynomial relationships.

RBF (Gaussian) Kernel: Suitable for non-linearly separable data.

Support Vectors: Only the points that lie on the margin, known as support
vectors, affect the final model, leading to a sparse solution.
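
A short scikit-learn sketch of an RBF-kernel SVM on a non-linearly separable toy dataset (the C and gamma values are illustrative and would normally be tuned):

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

# Scaling matters because the RBF kernel is distance-based.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X, y)
print(clf.score(X, y))
```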

Clustering Tutorial
K-Means
Introduction
K-Means is a popular unsupervised learning algorithm used for clustering,
where it groups data points into a predefined number of clusters. The main
objective of K-Means is to partition the data into clusters such that data points
within a cluster are more similar to each other than to those in other clusters.
This similarity is measured by the distance between points, often using
Euclidean distance. K-Means is effective for segmenting datasets where a
natural grouping exists, making it useful in applications like customer
segmentation, image compression, and pattern recognition.

How K-Means Works


1. Choose the number of clusters (k): The user predefines the number of
clusters, often selected based on techniques like the Elbow Method.

2. Initialize centroids: Randomly initialize \( k \) centroids (one for each cluster).

3. Assign points to nearest centroids: Each data point is assigned to the nearest centroid, creating \( k \) clusters.

4. Update centroids: Calculate the mean of all points within each cluster and update the centroids accordingly.

5. Repeat steps 3-4: Iterate until convergence (when centroids no longer move significantly or a maximum number of iterations is reached).

The algorithm aims to minimize the within-cluster sum of squares (WCSS), which measures the squared distance between each point and its centroid, as implemented in the sketch below.
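
The following is a minimal NumPy sketch of this loop (Lloyd's algorithm). It mirrors steps 2-5 above and reports the WCSS at the end; the function name is illustrative, and for simplicity it assumes no cluster ever becomes empty.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-Means on an (N, n_features) array; returns labels, centroids, WCSS."""
    rng = np.random.default_rng(seed)
    # Step 2: initialize centroids by picking k distinct data points at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign every point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of the points assigned to it.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop when the centroids no longer move significantly.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # WCSS: squared distance from every point to the centroid of its own cluster.
    wcss = ((X - centroids[labels]) ** 2).sum()
    return labels, centroids, wcss
```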

Formula to Predict Cluster for a New Point
The formula used in K-Means for assigning a point \( x \) to a cluster is based
on minimizing the Euclidean distance:
\[
d(x, \mu_j) = \sqrt{\sum_{i=1}^{n} (x_i - \mu_{j,i})^2}
\]
where:

\( x \) is the data point,

\( \mu_j \) is the centroid of cluster \( j \),

\( n \) is the number of features.

The point \( x \) is assigned to the cluster \( j \) that minimizes \( d(x, \mu_j) \).

Objective Function: Within-Cluster Sum of Squares (WCSS)

The objective function for K-Means, called the within-cluster sum of squares
(WCSS), is represented as:
\[
\text{WCSS} = \sum_{j=1}^{k} \sum_{x \in C_j} \| x - \mu_j \|^2
\]
where:

\( k \) is the number of clusters,

\( C_j \) represents cluster \( j \),

\( x \) is a data point in \( C_j \), and

\( \mu_j \) is the centroid of \( C_j \).

This function calculates the squared distance between each point and its
centroid, summing it across all clusters. K-Means aims to minimize this value.

Pros and Cons of K-Means


Pros:

Simple and efficient: Easy to implement and computationally efficient on large datasets.

Scalable: Can handle a large number of data points.

Works well with spherical clusters: K-Means performs well with clusters
that are roughly circular in shape.

Cons:

Sensitive to the initial choice of centroids: Random initialization can lead to different results; hence, running the algorithm multiple times or using initialization techniques like K-Means++ is recommended.

Difficulty with non-spherical clusters: Struggles to identify clusters of arbitrary shapes, as it relies on distance-based metrics.

Requires pre-specifying k: The number of clusters \( k \) must be defined beforehand, which may not be straightforward.

Effect of Outliers on K-Means


Outliers can have a significant impact on K-Means clustering. Since K-Means
relies on calculating centroids as the mean of all points in a cluster, even a
single outlier can distort the centroid's position. This misplacement can cause
the clustering to be inaccurate, leading to clusters that do not represent the
natural groupings in the data.
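
A tiny numeric illustration of this effect: adding one extreme point to a tight cluster moves its mean, and therefore its centroid, far from where the bulk of the points lie. The numbers below are made up purely for demonstration.

```python
import numpy as np

cluster = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.0]])
with_outlier = np.vstack([cluster, [[10.0, 10.0]]])

print("centroid without outlier:", cluster.mean(axis=0))       # ~[1.03, 1.00]
print("centroid with outlier:   ", with_outlier.mean(axis=0))  # ~[2.82, 2.80]
```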

Bias and Variance


High bias: K-Means is a relatively rigid algorithm, particularly because it
forms clusters based on spherical distance from centroids. This results in a
high bias, making it less flexible for complex data structures.

Low variance: If the initialization process is handled well (e.g., with K-Means++), K-Means produces consistent results across runs. However, if centroids are poorly initialized, variance may increase.

Additional Notes
K-Means++ Initialization: K-Means++ is a method to improve centroid
initialization, helping to reduce the likelihood of poor clustering and
convergence to a local minimum.

Elbow Method: A technique to help determine the optimal number of clusters by plotting the WCSS for different values of \( k \) and identifying the point where adding more clusters minimally reduces WCSS (see the sketch after this list).

Interpretability: K-Means clustering can sometimes be challenging to interpret if clusters are not well-separated.
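
Here is a hedged sketch of the Elbow Method using scikit-learn (which exposes the WCSS as `inertia_`) and matplotlib; the synthetic blobs and the range of \( k \) values are only for illustration.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# Fit K-Means for several k values and record the WCSS (inertia_ in scikit-learn).
# scikit-learn's default init="k-means++" already uses the improved initialization
# mentioned above.
ks = range(1, 10)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, wcss, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("WCSS (inertia)")
plt.title("Elbow Method")
plt.show()
```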

Hierarchical Clustering Tutorial


Introduction
Hierarchical clustering is an unsupervised learning algorithm used to group
data points into clusters without specifying the number of clusters beforehand.
Unlike K-Means, which forms a predefined number of clusters, hierarchical
clustering creates a tree-like structure (dendrogram) that represents nested
clusters within each other. This algorithm is particularly useful for exploring the
natural structure in data, as it allows you to visualize clusters at various levels
of granularity. It is frequently used in fields like bioinformatics, customer
segmentation, and document clustering.

How Hierarchical Clustering Works


Hierarchical clustering builds clusters in a bottom-up (agglomerative) or top-
down (divisive) approach:

Agglomerative (most common): Starts with each data point as its own
cluster and merges the closest clusters iteratively until only one cluster
remains.

Divisive: Starts with all points in a single cluster and recursively splits them
until each data point is its own cluster.

Key Steps:
1. Calculate distances: Compute the distance matrix between each pair of
data points.

2. Merge closest clusters: Find the two closest clusters (based on a distance
metric) and merge them.

3. Update distance matrix: Recalculate the distances between the new cluster and remaining clusters.

4. Repeat until one cluster remains.

Hierarchical clustering does not require specifying the number of clusters in advance, and clusters can be determined by cutting the dendrogram at the desired level.
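
A minimal sketch of the agglomerative workflow using SciPy, assuming two well-separated synthetic groups: `linkage` performs the pairwise-distance and merging steps above in one call, and `dendrogram` draws the resulting tree.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

# Agglomerative clustering: compute the full merge history (steps 1-4 above).
# 'ward' is the linkage criterion; 'euclidean' is the distance metric.
Z = linkage(X, method="ward", metric="euclidean")

# The dendrogram shows the nested cluster structure; cutting it at a chosen
# height yields a flat clustering.
dendrogram(Z)
plt.xlabel("Data point index")
plt.ylabel("Merge distance")
plt.show()
```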

Distance Metrics and Linkage Criteria


Distance and linkage criteria significantly impact hierarchical clustering's
structure. Some popular distance metrics include Euclidean, Manhattan, and
Cosine distances. The linkage criterion determines how distances between
clusters are calculated:

Single linkage: Distance between the closest points of two clusters.

Complete linkage: Distance between the farthest points of two clusters.

Average linkage: Average distance between points across clusters.

Ward’s linkage: Minimizes the variance within each cluster, often producing
balanced clusters.

The choice of metric and linkage can affect cluster shape and separation, so
they should be selected based on data characteristics.
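
One common (though not the only) way to compare linkage choices is the cophenetic correlation, which measures how faithfully the dendrogram preserves the original pairwise distances. The sketch below assumes SciPy and random data; higher values suggest a better fit for that particular dataset.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
pairwise = pdist(X, metric="euclidean")

for method in ("single", "complete", "average", "ward"):
    Z = linkage(X, method=method)
    corr, _ = cophenet(Z, pairwise)  # correlation between dendrogram and raw distances
    print(f"{method:>8} linkage: cophenetic correlation = {corr:.3f}")
```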

Objective Function: Dendrogram Representation


Unlike K-Means, hierarchical clustering does not optimize a specific objective
function. Instead, it builds a dendrogram: a tree diagram that shows the
hierarchy of clusters. The dendrogram helps visualize how clusters form at different levels, allowing the user to decide the best number of clusters based on their requirements.

Pros and Cons of Hierarchical Clustering


Pros:

No need to specify k: It automatically creates a nested structure of clusters.

Easy to visualize: Dendrograms allow a clear visualization of data hierarchy, making it easier to understand the clustering at various levels.

Suitable for arbitrary shapes: Can capture clusters of different shapes and
sizes more naturally than K-Means.

Cons:

Computationally expensive: Hierarchical clustering has a time complexity of \( O(n^2 \log(n)) \), which can be inefficient for large datasets.

Sensitive to noise and outliers: Outliers can affect the dendrogram structure and potentially lead to misleading clusters.

Difficulty in adjusting: Once clusters are formed, adjusting the structure requires re-running the entire algorithm.

How Outliers Affect Hierarchical Clustering


Outliers can have a significant impact on hierarchical clustering. Since the
algorithm continuously merges clusters, an outlier can be treated as a separate
cluster or merged into an existing cluster, disrupting the natural clustering.
Additionally, in cases with single-linkage, outliers may create "chains," where
distant points are connected due to their proximity to other points, distorting
cluster structure.

Bias and Variance


Hierarchical clustering generally exhibits:

High bias: This is because once clusters are formed, they cannot be
changed, leading to rigid clustering structures.

Low variance: Results tend to be stable across different runs because there is no random initialization. However, this depends on the choice of linkage and distance metric.

Additional Notes
Dendrogram Cutting: The depth at which the dendrogram is "cut"
determines the number of clusters. The threshold can be chosen by
analyzing the dendrogram and selecting a height where there is a
significant gap between levels.

Distance Threshold: A cutoff distance can be set to stop merging clusters at a certain level, helping control cluster compactness (see the sketch after this list).

Scalability: Hierarchical clustering may not perform well with large datasets
due to its high computational requirements, but it is effective for small to
medium datasets.
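
The sketch below shows both ways of turning a dendrogram into flat clusters with SciPy's `fcluster`: cutting at a distance threshold, or requesting a fixed number of clusters. The data and threshold values are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])
Z = linkage(X, method="ward")

# Cut the dendrogram at a fixed height (distance threshold)...
labels_by_distance = fcluster(Z, t=10.0, criterion="distance")

# ...or ask directly for a fixed number of clusters.
labels_by_count = fcluster(Z, t=2, criterion="maxclust")

print("clusters at distance 10:", len(set(labels_by_distance)))
print("clusters with maxclust=2:", len(set(labels_by_count)))
```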

K-Means VS Hierarchical Clustering

Here's a comparison table between K-Means and Hierarchical Clustering:

| Aspect | K-Means Clustering | Hierarchical Clustering |
| --- | --- | --- |
| Type | Partitional (divides data into distinct, non-overlapping clusters) | Hierarchical (builds a nested cluster structure or hierarchy) |
| Number of Clusters (k) | Must be specified in advance | Determined by analyzing the dendrogram or setting a distance cutoff |
| Algorithm Type | Iterative and distance-based | Agglomerative (bottom-up) or divisive (top-down) |
| Distance Metric | Primarily Euclidean distance | Various metrics (Euclidean, Manhattan, Cosine, etc.) |
| Linkage Criteria | Not applicable | Single, Complete, Average, or Ward’s linkage |
| Speed and Complexity | Faster, with complexity of \( O(n \times k \times i) \) | Slower, with complexity of \( O(n^2 \log(n)) \) |
| Scalability | Efficient on large datasets | Inefficient for very large datasets |
| Cluster Shape | Works best with spherical, well-separated clusters | Suitable for clusters of arbitrary shapes |
| Outlier Sensitivity | Sensitive to outliers | Highly sensitive, especially with single linkage |
| Random Initialization | Centroid initialization affects results (e.g., K-Means++) | No random initialization; stable clusters across runs |
| Interpretability | Moderate; clusters are less interpretable without visualization | High; dendrogram provides a clear visualization of cluster hierarchy |
| Variance and Bias | High variance if initialized poorly; lower bias with defined structure | Lower variance; high bias due to rigid clustering structure |
| Usage Cases | Customer segmentation, image compression | Bioinformatics, gene clustering, document clustering |
| Result Representation | Cluster labels for each point | Dendrogram (tree structure) and optional cluster labels |

Summary
K-Means is suitable for large datasets with spherical clusters and requires
pre-specifying the number of clusters.

Hierarchical Clustering is ideal for smaller datasets, provides a hierarchical structure, and is effective when you want to explore nested clusters.
