
ML Tutorial

Classification Algorithms

Logistic Regression Tutorial

Introduction
Logistic Regression is a statistical model commonly used for binary
classification tasks, where the goal is to classify data into one of two possible
outcomes. Unlike linear regression, which predicts a continuous value, logistic
regression predicts the probability of a sample belonging to a particular class.
The core idea of logistic regression is to map input features to a probability
score between 0 and 1 using the logistic (or sigmoid) function. This makes it
suitable for binary classification tasks, such as spam vs. not spam or disease vs. no disease. Logistic regression works by finding a decision boundary that
separates the classes in the feature space, maximizing the likelihood of correct
classification.

Formula to Predict a New Point


For a given data point \( x \) with features \( x_1, x_2, \ldots, x_n \), the
prediction in logistic regression is given by the logistic (or sigmoid) function
applied to the linear combination of the features:

\[
h(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n)}}
\]

where:

\( h(x) \) is the predicted probability that the data point belongs to the positive class.

\( \beta_0 \) is the intercept term.

\( \beta_1, \beta_2, \dots, \beta_n \) are the coefficients associated with each feature.

If \( h(x) \geq 0.5 \), the point is classified as belonging to the positive class;
otherwise, it is classified as the negative class.
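
As a minimal NumPy sketch of this rule (the coefficient and feature values below are hypothetical, chosen only for illustration):

```python
import numpy as np

def predict_proba(x, beta0, beta):
    """Sigmoid of the linear combination beta0 + beta . x."""
    z = beta0 + np.dot(beta, x)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients for a two-feature model
beta0, beta = -1.0, np.array([0.8, -0.5])
x_new = np.array([2.0, 1.0])

p = predict_proba(x_new, beta0, beta)
label = int(p >= 0.5)  # 1 = positive class, 0 = negative class
print(p, label)
```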

The Loss Function


Logistic regression uses Binary Cross-Entropy Loss (or Log Loss) to measure
the error between the predicted probability and the actual class label. The loss
function for a single training example is:

\[
\text{Loss}(y, h(x)) = -\left( y \cdot \log(h(x)) + (1 - y) \cdot \log(1 - h(x)) \right)
\]

where:

\( y \) is the actual class label (0 or 1).

\( h(x) \) is the predicted probability from the logistic function.

The goal is to minimize the total loss over all training examples, which can be
achieved through gradient descent or other optimization algorithms.
Minimizing this loss function helps the model make accurate predictions.
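
The loss over a batch of examples can be computed directly, as in the sketch below; the small eps clipping is an implementation detail added here to avoid log(0), not part of the formula itself:

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean log loss over all training examples."""
    y_prob = np.clip(y_prob, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

print(binary_cross_entropy(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.6])))
```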

Pros and Cons of Logistic Regression
Pros:

1. Interpretability: Coefficients provide insight into the importance of each feature.

2. Efficiency: Simple and computationally efficient, suitable for small to medium-sized datasets.

3. Probability Outputs: Provides probabilities rather than hard classifications, which is useful for understanding confidence in predictions.

4. Less Prone to Overfitting: Logistic regression has a lower risk of overfitting, especially if regularization is applied.

Cons:

1. Assumption of Linearity: Assumes a linear relationship between the features and the log-odds of the outcome, which may not hold for all datasets.

2. Limited to Binary Classification: Needs extensions (e.g., multinomial logistic regression) for multiclass classification tasks.

3. Sensitive to Feature Scaling: Requires standardized or normalized features for optimal performance.

4. Poor Performance with Complex Relationships: Logistic regression does not capture non-linear relationships between features and target variables well.

Effect of Outliers on Logistic Regression


Outliers can significantly impact logistic regression because they can skew
the decision boundary due to the linear nature of the model. For instance, if an
outlier is far from the main cluster of data, it may push the boundary closer to
the remaining points, causing misclassification.
To mitigate the impact of outliers:

Robust Scaling: Scaling features using robust methods can reduce sensitivity to outliers.

Regularization: Applying regularization, such as L2 (Ridge), can make the model more robust to extreme values.

Outlier Detection and Removal: Identifying and removing outliers before training can help improve logistic regression performance.
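
For illustration, the first two mitigations could be combined in a scikit-learn pipeline along the following lines (the parameter values are placeholders, not recommendations):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import LogisticRegression

# RobustScaler centers and scales with the median and IQR, so extreme values
# distort the scaling less; C controls the strength of the L2 regularization.
model = make_pipeline(RobustScaler(), LogisticRegression(C=1.0, penalty="l2"))
# model.fit(X_train, y_train); model.predict_proba(X_test)
```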

Bias and Variance in Logistic Regression


Bias: Logistic regression generally has high bias because it assumes a
linear relationship between the features and the log-odds of the target. This
assumption can lead to underfitting if the true relationship is non-linear.

Variance: Logistic regression typically has low variance, making it a stable model that does not vary significantly with different training sets. This is especially true when regularization is applied.

This high-bias, low-variance property makes logistic regression a suitable choice for simpler problems but limits its effectiveness for complex or highly non-linear data.

Additional Notes
Regularization: Regularization techniques, such as L1 (Lasso) and L2
(Ridge) regularization, are often applied to logistic regression to prevent
overfitting and handle multicollinearity. This introduces a penalty for large
coefficients, encouraging the model to find simpler solutions.

Threshold Tuning: The default threshold for classification is 0.5, but it can
be adjusted depending on the specific problem. For example, in medical
diagnoses where false negatives are costly, a lower threshold might be
chosen.

Multinomial Logistic Regression: For multiclass classification, logistic regression can be extended to multinomial logistic regression. Techniques like One-vs-Rest (OvR) or Softmax can be applied to handle multiple classes.

Feature Engineering: Logistic regression can benefit significantly from feature engineering, especially when dealing with non-linear data. Transforming features (e.g., using polynomial features or interaction terms) can improve its performance on more complex datasets.

Tree Models

Decision Trees Tutorial

Introduction
Decision Trees are a popular supervised learning algorithm used for both
classification and regression tasks. They work by recursively splitting the
dataset into subsets based on feature values, creating a tree-like structure
where each internal node represents a feature (or attribute), each branch
represents a decision rule, and each leaf node represents an outcome (or class
label).

The goal of a decision tree is to create a model that predicts the target variable
by learning simple decision rules inferred from the data features. Decision trees
are intuitive, easy to interpret, and can handle both categorical and continuous
data. Their transparency and straightforward visualization make them a popular
choice among practitioners, especially when model interpretability is crucial.

Formula to Predict a New Point
The prediction of a new data point in a decision tree involves traversing the tree
from the root to a leaf node. At each internal node, the algorithm evaluates a
feature and makes a decision based on its value, directing the traversal to the
left or right branch until it reaches a leaf node, which contains the predicted
outcome.

In order to build the tree, we need to decide which features to split on, both for
the root node and the internal nodes. This is done by evaluating each possible
split and choosing the one that maximizes the "purity" (or minimizes impurity)
of the resulting nodes. Here’s how:

Selecting the Root Node and Internal Nodes


1. Evaluate All Features for the Best Split:

For each feature, test possible thresholds to split the data into two
groups (for binary splits).

Calculate the impurity of each split using a criterion such as Gini impurity, Entropy, or Mean Squared Error (MSE) for regression.

2. Impurity Measures:

The goal is to reduce impurity at each split. Common impurity measures are:

Gini Impurity for classification:


\[
Gini(D) = 1 - \sum_{i=1}^{C} p_i^2
\]
where \( p_i \) is the proportion of samples in class \( i \), calculated
as:
\[
p_i = \frac{n_i}{N}
\]
where:

\( n_i \) is the number of samples in class \( i \),

\( N \) is the total number of samples in the dataset.

Entropy for classification:


\[
Entropy(D) = - \sum_{i=1}^{C} p_i \log_2(p_i)
\]

Mean Squared Error (MSE) for regression:


\[
MSE(D) = \frac{1}{n} \sum_{j=1}^{n} (y_j - \hat{y})^2
\]
where \( y_j \) is the true value of a sample in the node, \( \hat{y} \)
is the average value for that node, and \( n \) is the number of
samples.

3. Calculate Information Gain (or Reduction in Impurity):

For each possible split, compute the Information Gain or Reduction in Impurity. This is the change in impurity from the parent node to the child nodes. The split that maximizes this gain is selected.

For a split on feature \( X \) with threshold \( t \), Information Gain can be calculated as:
\[
IG(D, X, t) = I(D) - \left( \frac{|D_{left}|}{|D|} I(D_{left}) +
\frac{|D_{right}|}{|D|} I(D_{right}) \right)
\]
where:

\( I(D) \) is the impurity of the parent node,

\( I(D_{left}) \) and \( I(D_{right}) \) are the impurities of the child


nodes after the split,

\( |D_{left}| \) and \( |D_{right}| \) are the number of samples in the


left and right child nodes, respectively,

\( |D| \) is the total number of samples in the parent node.

4. Choose the Best Split:

The feature and threshold with the highest Information Gain (or largest
reduction in impurity) becomes the split point for the current node.

5. Recursion for Internal Nodes:

Repeat this process recursively for each child node, treating each child
node as the new parent node.

Continue until a stopping criterion is met, such as reaching a maximum
tree depth, a minimum number of samples in a node, or achieving zero
impurity.

6. Stopping Criteria:

Maximum Depth: Prevents the tree from growing too complex and
overfitting.

Minimum Samples per Node: Ensures each node represents a


significant portion of data.

Impurity Threshold: Stops the split if the reduction in impurity is below


a defined threshold.

The Loss Function


In the context of decision trees, the loss function is based on the impurity of
the nodes. The goal is to minimize the impurity when splitting nodes.
Commonly used impurity measures are:

Gini Impurity for classification:


\[
Gini(D) = 1 - \sum_{i=1}^{C} p_i^2
\]

Entropy for classification:


\[
Entropy(D) = - \sum_{i=1}^{C} p_i \log_2(p_i)
\]

Mean Squared Error (MSE) for regression:


\[
MSE(D) = \frac{1}{n} \sum_{j=1}^{n} (y_j - \hat{y})^2
\]

The tree algorithm selects the feature and the split point that minimizes the
impurity for the resulting child nodes, aiming to create homogeneous groups of
outcomes.
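
A small NumPy sketch of Gini impurity and the resulting information gain for a candidate split (the label array is illustrative):

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, left, right):
    """Reduction in Gini impurity obtained by splitting parent into left/right."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

y = np.array([0, 0, 1, 1, 1, 0])
print(information_gain(y, y[:3], y[3:]))  # gain for splitting after the third sample
```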

Pros and Cons of Decision Trees


Pros:

1. Interpretability: Decision trees are easy to visualize and interpret, making it
clear how decisions are made.

2. Non-linear Relationships: They can capture non-linear relationships between features and the target variable without requiring any transformation.

3. Minimal Data Preparation: Decision trees require little data preprocessing, as they do not require scaling, normalization, or one-hot encoding.

4. Handles Both Numerical and Categorical Data: They can work with both types of data without special transformations.

5. Robust to Feature Scaling: Decision trees are not sensitive to the scale of data, unlike some other models.

6. Works Well on Large Datasets: With certain optimizations (like pruning), decision trees can work well on large datasets.

Cons:

1. Prone to Overfitting: Without limitations on depth or splits, decision trees can easily overfit the data, especially if the tree is deep.

2. Sensitive to Noisy Data: Small changes in the data can lead to a completely different tree structure, which can reduce model stability.

3. Biased Towards Dominant Classes: If one class is more prevalent, decision trees might lean towards predicting it more often, especially in imbalanced datasets.

4. Suboptimal Splits in High Dimensions: In high-dimensional spaces, decision trees can struggle to find optimal splits, often leading to subpar performance compared to other models.

5. Requires Pruning: Pruning is necessary to prevent overfitting but requires additional computational effort and complexity.

How Outliers Affect Decision Trees


Decision trees are generally robust to outliers because they partition data into
homogeneous groups, rather than relying on statistical parameters (like mean
and standard deviation). However, in certain cases, outliers can still influence
the splits if they cause impurity calculations to favor divisions that isolate them.

While outliers are less likely to affect trees significantly, pruning or limiting the
depth can help avoid nodes formed solely due to outliers.

Bias and Variance


Bias: Decision trees have low bias. They can fit complex patterns in the
data well, which often allows them to capture non-linear relationships.

Variance: Decision trees have high variance because small changes in the
training data can lead to completely different splits and a different tree
structure. This high variance makes individual decision trees prone to
overfitting, particularly on small datasets.

To address this high variance, ensemble methods like Random Forests or Gradient Boosted Trees are commonly used, which reduce the variance by combining multiple trees.

Additional Notes
1. Pruning: To avoid overfitting, trees often require pruning. Pruning involves
removing branches that have little importance or adding stopping criteria
(like a maximum depth) to control the tree’s complexity.

2. Tree Depth: Limiting the maximum depth of a tree is another way to control
for overfitting and improve model generalization.

3. Feature Importance: Decision trees provide a way to assess feature


importance by observing how much each feature reduces impurity across
the splits. Features that lead to greater reductions in impurity are
considered more important.

4. Ensemble Methods: To improve accuracy, decision trees are often used in


ensemble methods such as Random Forests and Gradient Boosting. These
methods combine multiple trees to produce more robust models with better
generalization.

5. Handling Missing Values: Decision trees can handle missing values by


considering multiple possible splits or by using surrogate splits, which look
for alternative splits in case of missing data.

By understanding these key points about decision trees, you’ll be well-


equipped to apply them effectively in various machine learning contexts and
identify when they might be a suitable choice.

Random Forest Tutorial

Introduction
Random Forest is an ensemble learning method that combines multiple
decision trees to improve the accuracy and stability of predictions. Developed
by Leo Breiman, Random Forest uses a technique known as bagging (Bootstrap
Aggregating) to build a "forest" of individual decision trees, where each tree is
trained on a random subset of the data. By averaging or majority voting the
predictions of each tree, Random Forest reduces the risk of overfitting that
individual decision trees often face.
Random Forest is highly effective in both classification and regression tasks,
making it versatile across many fields, including finance, healthcare, and e-
commerce. The goal is to improve the accuracy and generalization of
predictions while maintaining model interpretability.

Formula to Predict a New Point
Random Forest prediction involves aggregating predictions from multiple
decision trees. Each tree in the forest makes an independent prediction, and
the final output is determined by averaging (for regression) or taking the
majority vote (for classification) of these individual predictions.

1. Classification:
\[
\hat{y} = \text{mode}\left\{ T_1(x), T_2(x), \dots, T_m(x) \right\}
\]
where \( T_i(x) \) is the prediction of the \(i\)-th tree in the forest, and \( m \)
is the total number of trees. The mode, or majority vote, of these
predictions determines the final output.

2. Regression:
\[
\hat{y} = \frac{1}{m} \sum_{i=1}^{m} T_i(x)
\]
where \( T_i(x) \) is the prediction of the \(i\)-th tree, and the average value
of the predictions is used as the final output.

The randomness introduced during the training phase, both in the selection of
data samples and feature subsets, helps to reduce variance and improve the
model’s ability to generalize.
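
The majority-vote aggregation can be made explicit with scikit-learn's RandomForestClassifier, whose fitted trees are exposed as estimators_; note that scikit-learn's own predict averages class probabilities rather than counting hard votes, so the two can occasionally differ:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Hard majority vote across the individual trees for the first five points
per_tree = np.array([tree.predict(X[:5]) for tree in forest.estimators_])
majority_vote = (per_tree.mean(axis=0) >= 0.5).astype(int)

print(majority_vote)
print(forest.predict(X[:5]))  # forest prediction (averaged probabilities)
```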

The Loss Function


The Random Forest model itself doesn’t use a specific loss function during
training. Instead, it relies on the underlying decision trees, which are typically
trained with the following objectives:

Classification: Minimizing impurity at each split, using measures such as


Gini impurity or Entropy.

Regression: Minimizing Mean Squared Error (MSE) at each split within


each tree.

The final performance of the forest is typically evaluated based on the chosen
task's loss function, such as cross-entropy or accuracy for classification, and
Mean Squared Error for regression.

How Trees in the Forest Are Built

Random Forest uses bagging and feature selection to create diverse trees in
the forest:

1. Bagging (Bootstrap Aggregating):

Each tree is trained on a different random subset of the training data


with replacement. This means that some samples may appear multiple
times in a subset, while others may not appear at all.

2. Random Feature Selection:

For each split in a tree, only a random subset of the features is


considered. This ensures diversity among the trees and reduces
correlation, leading to better generalization.

3. Growing the Trees:

Each tree is grown to its maximum depth, with no pruning, to allow it to


capture complex patterns in the data.

The combination of bagging and random feature selection helps in


decorrelating the trees, so their combined predictions are more robust than
those of individual trees.

Pros and Cons of Random Forest


Pros:

1. High Accuracy: Random Forest generally offers better accuracy than


individual decision trees due to its ensemble approach.

2. Reduces Overfitting: The combination of bagging and random feature


selection reduces overfitting, especially on large datasets.

3. Handles Large Feature Sets: By using a random subset of features for each
split, Random Forests are effective even with high-dimensional data.

4. Works with Missing Data: It can handle missing values by splitting based
on surrogate splits or by averaging.

5. Feature Importance: Random Forests provide feature importance scores,


helping in feature selection and model interpretability.

Cons:

1. Complexity: Compared to a single decision tree, Random Forests are


computationally intensive and require more memory and processing power.

2. Less Interpretability: The ensemble of trees makes the model less
interpretable than individual decision trees.

3. Longer Training Times: Building a large number of trees can result in


longer training times, especially for large datasets.

4. No Extrapolation for Regression: For regression tasks, Random Forest


cannot extrapolate beyond the training data range.

How Outliers Affect Random Forests


Random Forests are generally robust to outliers because they average the
results of multiple trees. While outliers might affect individual trees, their
influence is diluted by the ensemble’s aggregation mechanism. However, if
outliers are extreme, they can still affect the splits in some trees, leading to
minor noise in predictions.

Bias and Variance


Bias: Random Forests have low bias because they capture complex
patterns due to their high capacity for flexibility within individual trees.

Variance: Random Forests reduce variance by averaging multiple trees,


making the final model less sensitive to noise and fluctuations in the
training data.

In general, Random Forests strike a good balance between bias and variance,
which often leads to better generalization compared to a single decision tree.

Additional Notes
1. Hyperparameters:

The key hyperparameters for Random Forest include the number of


trees ( n_estimators ), the maximum depth of each tree ( max_depth ), the
number of features to consider at each split ( max_features ), and
minimum samples per split ( min_samples_split ).

Tuning these hyperparameters is crucial to achieving optimal


performance, especially for larger datasets.

2. OOB (Out-of-Bag) Error:

Since each tree in the Random Forest is trained on a bootstrap sample
(subset with replacement), about 1/3 of the data is left out of each
sample. These “out-of-bag” samples can be used to estimate model
performance without needing a separate validation set.

3. Feature Importance:

Random Forest provides feature importance scores based on how much


each feature reduces impurity across splits in the forest. This is
valuable for feature selection and interpretability.

4. Applications:

Random Forest is widely used in areas where accuracy and stability are
crucial, such as medical diagnosis, fraud detection, financial modeling,
and image recognition.

Random Forest provides a flexible, accurate, and robust machine learning


approach, suitable for a wide variety of datasets and problems, while helping
address the overfitting issues associated with single decision trees.

Gradient Boosting Machines (GBM) Tutorial

Introduction
Gradient Boosting Machines (GBM) is an ensemble machine learning algorithm
that builds models sequentially by combining the strengths of many weak
learners, typically decision trees, to form a strong predictive model. Unlike
Random Forest, where trees are trained independently, in Gradient Boosting,
each tree is trained to correct the errors of its predecessor. GBMs are highly
effective for both classification and regression tasks, excelling in performance
on tabular datasets with complex relationships.
The core idea behind GBM is to minimize the residual error (or loss) of the
previous model by adding a new tree at each step, designed to model the
residuals or gradient of the loss function.

Formula to Predict a New Point


A GBM model is built incrementally, where each new model \( h_t(x) \) added to
the ensemble targets the residual error of the previous iteration:

1. Initial Prediction: The first model \( F_0(x) \) makes an initial prediction,


typically the mean of the target values for regression.

2. Residual Learning: Each subsequent model, \( h_t(x) \), is trained on the


residuals (errors) from the previous iteration.

3. Final Prediction: The final model \( F_T(x) \) at iteration \( T \) combines all


models:
\[
F_T(x) = F_0(x) + \sum_{t=1}^{T} \gamma_t \cdot h_t(x)
\]
where \( \gamma_t \) is a learning rate that controls how much each model
contributes to the final prediction. For regression, the prediction is simply
the sum of all models, while for classification, it’s often the sum passed
through a transformation (e.g., logistic function) to output probabilities.

4. Gradient Boosting: Each model \( h_t(x) \) is fitted to the gradient of the


loss function (the residuals), making the model iterative and focusing on
errors.

The Loss Function


GBMs can use various loss functions based on the task:

1. Mean Squared Error (MSE) for regression:
\[
\text{Loss} = \frac{1}{N} \sum_{i=1}^{N} (y_i - F_T(x_i))^2
\]
where \( y_i \) is the true target value, \( F_T(x_i) \) is the final prediction,
and \( N \) is the number of samples.

2. Logistic Loss for binary classification:


\[
\text{Loss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(F_T(x_i)) + (1 - y_i)
\log(1 - F_T(x_i)) \right]
\]
GBM iteratively fits each model to the gradient of the loss function,
minimizing it over time.

3. Learning Rate:

The learning rate \( \gamma \) is a hyperparameter that scales the


contribution of each tree, balancing the speed and accuracy of training.
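
A minimal sketch of this procedure for regression under squared-error loss, where the negative gradient is simply the residual (function names and hyperparameter values are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbm(X, y, n_trees=50, learning_rate=0.1, max_depth=2):
    """Minimal gradient boosting for squared error: each tree fits the residuals."""
    f0 = y.mean()                      # initial prediction F_0(x)
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_trees):
        residuals = y - pred           # negative gradient of the squared error
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        pred += learning_rate * tree.predict(X)
        trees.append(tree)
    return f0, trees

def predict_gbm(X, f0, trees, learning_rate=0.1):
    """F_T(x) = F_0(x) plus the sum of scaled tree predictions."""
    return f0 + learning_rate * sum(t.predict(X) for t in trees)
```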

Pros and Cons of GBM


Pros:

1. High Accuracy: GBM often provides superior predictive accuracy due to its
iterative nature and ability to reduce bias.

2. Flexibility: It can handle both regression and classification tasks with


various loss functions.

3. Handles Complex Data: GBMs can learn complex, non-linear relationships


in the data.

4. Customizability: Multiple hyperparameters, including learning rate, number


of trees, and tree depth, can be fine-tuned for optimal performance.

Cons:

1. Sensitive to Outliers: Since each new model tries to reduce errors, GBM
can amplify the influence of outliers.

2. Longer Training Time: Due to the sequential training of trees, GBM is


slower than parallel algorithms like Random Forest.

3. Prone to Overfitting: With many trees and a high learning rate, GBMs can
overfit on noisy datasets.

4. Complexity in Tuning: Tuning parameters such as learning rate, tree depth,


and number of trees can be challenging and requires cross-validation.

How Outliers Affect Gradient Boosting Machines


Gradient Boosting Machines are sensitive to outliers because each new model
emphasizes reducing the residuals of the previous predictions. This can cause
the model to place excessive weight on outliers, leading to potential overfitting.
Techniques like reducing tree depth, adding regularization, or using robust loss
functions (e.g., Huber loss) can mitigate this issue.

Bias and Variance


Bias: GBMs have low bias due to their iterative, residual-reducing
approach. Each new model reduces the bias by focusing on the residuals,
making the ensemble highly flexible and able to capture complex patterns.

Variance: GBMs have high variance, as each new tree is trained on the
errors of the previous one. This variance can lead to overfitting, especially
with high learning rates or too many trees. Regularization techniques, such
as shrinkage (lower learning rates) and early stopping, are often used to
manage variance.

Additional Notes
1. Learning Rate:

The learning rate controls how much each tree contributes to the final
model. Lower learning rates often yield better results but require more
trees, increasing training time.

2. Regularization Techniques:

Regularization is essential to prevent overfitting. Techniques include


early stopping (stopping training when performance on validation data
no longer improves), subsampling (training each tree on a random
subset of data), and shrinkage (applying a small learning rate).

3. Hyperparameter Tuning:

The main parameters to tune are the number of trees, tree depth, and
learning rate. Grid search and cross-validation are commonly used to
identify the best combination of parameters.

4. Extensions:

XGBoost (Extreme Gradient Boosting), LightGBM, and CatBoost are


advanced implementations of GBM that improve on the standard GBM
in terms of speed, efficiency, and performance, each with unique
optimizations.

5. Feature Importance:

Like Random Forest, GBMs can calculate feature importance, helping to


identify the most influential features in the model.

XGBoost Tutorial
1. Introduction
Extreme Gradient Boosting (XGBoost) is an optimized version of Gradient
Boosting that focuses on speed and performance, making it widely popular for
machine learning competitions and practical applications. It was developed
with the aim of improving on traditional Gradient Boosting by offering efficient,
scalable, and flexible implementations. XGBoost achieves these enhancements
by optimizing the algorithm’s core structure and applying advanced
regularization techniques.
In XGBoost, the boosting process works by sequentially adding decision trees
(usually small trees with limited depth) to correct the residual errors made by
previous trees, gradually improving the overall accuracy of the model.

2. Formula to Predict a New Point


The XGBoost prediction for a new point \( x \) is given by the summation of all
the trees in the model. Each tree \( f_k \) produces a prediction, and the final
prediction \( \hat{y} \) is the sum of these predictions:

\[
\hat{y} = \sum_{k=1}^{K} f_k(x)
\]

where:

\( K \) is the total number of trees,

\( f_k(x) \) represents the prediction of the \( k \)-th tree,

\( \hat{y} \) is the final prediction, which can be used for regression or as a


probability in classification.

The objective function in XGBoost is minimized by adding a new tree \( f_t(x) \)


at each step that best fits the residuals (the differences between predicted and
actual values).

3. Loss Function
The loss function in XGBoost combines two components:

1. Prediction error: This measures the error between predicted values and
true values, commonly using mean squared error for regression tasks and
log loss for classification.

2. Regularization term: To prevent overfitting, XGBoost incorporates


regularization terms on tree complexity (such as the number of leaves or
leaf weights).

The general form of the objective function \( L \) is:

\[
L = \sum_{i=1}^{N} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)
\]
where:

\( l(y_i, \hat{y}_i) \) is the loss function measuring the error between actual \
( y_i \) and predicted \( \hat{y}_i \),

\( \Omega(f_k) \) is the regularization term for each tree \( f_k \), which
controls complexity and penalizes large trees to prevent overfitting.

The regularization term is defined as:


\[
\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2
\]
where:

\( T \) is the number of leaves in the tree,

\( \gamma \) and \( \lambda \) are hyperparameters that control the


regularization strength,

\( w_j \) are the weights of each leaf node.
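
Assuming the separate xgboost package is installed, a basic usage sketch might look like this; gamma and reg_lambda correspond to the \( \gamma \) and \( \lambda \) penalties above, and all values are illustrative only:

```python
import xgboost as xgb  # assumes `pip install xgboost`
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# gamma penalizes the number of leaves; reg_lambda penalizes large leaf weights.
model = xgb.XGBRegressor(n_estimators=300, learning_rate=0.05, max_depth=3,
                         gamma=1.0, reg_lambda=1.0)
model.fit(X_tr, y_tr)
print(model.predict(X_te[:5]))
```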

4. Pros and Cons of XGBoost

Pros
High Performance: XGBoost is highly optimized and fast, often
outperforming other gradient boosting models.

Regularization: The model has in-built regularization to prevent overfitting,


which helps improve generalization.

Handling of Missing Values: XGBoost can automatically handle missing


data by learning which path in the tree should be taken when missing
values are encountered.

Flexible Objective Functions: It allows custom loss functions and can be


applied to both regression and classification tasks.

Cons
Complexity: XGBoost has many hyperparameters, which can make tuning
complex and time-consuming.

Memory Usage: With large datasets, XGBoost can consume significant


memory, especially for high-dimensional data.

Sensitivity to Noise: While less sensitive than AdaBoost, it can still be


influenced by noisy or highly irrelevant features if not tuned properly.

5. Effect of Outliers on XGBoost


Outliers can have an impact on XGBoost, although it is typically less sensitive
than linear models due to the tree structure. Trees naturally segment the data,
which can reduce the impact of extreme values. However, if outliers affect the
early stages of boosting, subsequent trees may still overfit to those values.
Regularization and careful tuning can help mitigate this impact.

6. Bias and Variance in XGBoost

Bias: XGBoost generally has low bias. By building successive trees that
learn from residuals, the model reduces bias incrementally, making it
effective for complex tasks.

Variance: XGBoost can exhibit moderate to high variance because it has a


large capacity for fitting complex patterns, which can lead to overfitting on
noisy data. Regularization techniques and hyperparameter tuning (e.g.,
limiting tree depth, adjusting learning rate) are necessary to control
variance and prevent overfitting.

7. Additional Notes
Learning Rate: The learning rate (also called eta) controls the contribution
of each new tree. Lower values (e.g., 0.01–0.1) usually yield better
generalization but require more trees to converge.

Early Stopping: XGBoost allows early stopping based on validation metrics,


which can prevent overfitting by stopping the training process once
performance ceases to improve.

Cross-Validation: Using k-fold cross-validation is often beneficial in


XGBoost to ensure the model is generalizing well.

Distributed Computing: XGBoost supports distributed processing, allowing


it to scale efficiently across multiple cores or machines, making it suitable
for very large datasets.

XGBoost is one of the most powerful and flexible models for classification and
regression tasks, but achieving optimal results requires tuning its parameters
and monitoring its behavior carefully.

AdaBoost Tutorial
1. Introduction
Adaptive Boosting (AdaBoost) is an ensemble learning technique designed to
improve the accuracy of weak learners, usually decision stumps (single-split
decision trees), by iteratively focusing on the mistakes made in previous
rounds. The goal of AdaBoost is to combine a sequence of these weak
learners, each improving upon the errors of the last, to form a strong predictive
model.

AdaBoost achieves this by increasing the weights of misclassified samples,
forcing subsequent learners to pay more attention to them, and thus iteratively
reducing errors. This makes AdaBoost especially useful in classification tasks
where it produces a final model that has reduced error rates compared to
individual weak models.

2. Formula to Predict a New Point


The final prediction of AdaBoost is a weighted vote from each weak learner. In
the context of binary classification, the final prediction \( F(x) \) for an input \( x
\) is given by:
\[
F(x) = \text{sign} \left( \sum_{t=1}^{T} \alpha_t \cdot h_t(x) \right)
\]
where:

\( T \) is the total number of weak learners,

\( h_t(x) \) is the prediction made by the \( t \)-th weak learner,

\( \alpha_t \) is the weight assigned to the \( t \)-th weak learner, determined


by the error of that learner,

\( \text{sign} \) outputs the class label based on the sign of the weighted
sum.

The weight \( \alpha_t \) for each weak learner is calculated as:

\[
\alpha_t = \frac{1}{2} \ln \left( \frac{1 - \text{Error}_t}{\text{Error}_t} \right)
\]
where \( \text{Error}_t \) is the weighted error of the \( t \)-th learner. This
weight ensures that models with lower error have higher influence in the final
prediction.
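
A compact sketch of these update rules using decision stumps from scikit-learn (labels are assumed to be coded as -1/+1, and the function names are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_adaboost(X, y, n_rounds=20):
    """Minimal AdaBoost with decision stumps; y must be coded as -1/+1."""
    n = len(y)
    w = np.full(n, 1.0 / n)                     # sample weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.clip(np.sum(w * (pred != y)) / np.sum(w), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)   # weight of this weak learner
        w *= np.exp(-alpha * y * pred)          # up-weight misclassified points
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def predict_adaboost(X, stumps, alphas):
    """Sign of the weighted vote over all weak learners."""
    return np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))
```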

3. Loss Function
The exponential loss function is typically used in AdaBoost, which penalizes
misclassifications more heavily as the iterations progress. Given a set of
predictions, the exponential loss \( L \) for AdaBoost is defined as:
\[
L = \sum_{i=1}^{N} e^{-y_i \cdot F(x_i)}
\]
where:

\( N \) is the total number of training samples,

\( y_i \) is the true label for each sample,

\( F(x_i) \) is the ensemble prediction.

The exponential loss increases significantly as the predictions deviate from the
true labels, which forces AdaBoost to iteratively adjust weights and focus on
hard-to-classify samples.

4. Pros and Cons of AdaBoost

Pros
Effectively Reduces Bias: By combining weak learners and emphasizing
misclassified points, AdaBoost lowers bias, making it effective for complex
classification tasks.

Good Generalization: AdaBoost typically does not overfit easily, especially


when combined with simple base learners.

Focus on Difficult Samples: By giving higher weights to misclassified


samples, AdaBoost excels at learning from hard cases.

Cons
Sensitive to Outliers and Noisy Data: Outliers receive increased weight as
misclassified points, which can harm performance, as the model may focus
excessively on these points.

Limited to Weak Learners: Works best with simple learners like decision
stumps, and using complex models can lead to overfitting.

Not Optimal for Large Datasets: AdaBoost’s iterative nature can be


computationally expensive for very large datasets.

5. Effect of Outliers on AdaBoost


Outliers can have a pronounced effect on AdaBoost. Since AdaBoost increases
the weight of misclassified points, outliers may become highly weighted and
receive undue focus in subsequent rounds. This can make AdaBoost overly
sensitive to outliers, potentially leading to overfitting and reducing generalization on test data. Methods like data cleaning or weight clipping can
help mitigate this issue.

6. Bias and Variance in AdaBoost


Bias: AdaBoost has low bias. By combining weak learners and focusing on
hard-to-classify points, the model minimizes the overall error, reducing
bias.

Variance: AdaBoost tends to have higher variance since it is sensitive to


fluctuations in the training data. Changes in the training set, especially in
challenging or noisy points, can affect its performance. However, variance
can be managed by adjusting the number of learners or by applying
AdaBoost to simple learners.

7. Additional Notes
Learning Rate: AdaBoost includes a learning rate parameter that scales the
influence of each weak learner. Smaller learning rates can improve
generalization by reducing overfitting but require more learners to achieve
high accuracy.

Hyperparameter Tuning: The main parameters to tune in AdaBoost are the


number of weak learners and the learning rate. Increasing the number of
weak learners often improves accuracy but increases computational cost.

Binary Classification: AdaBoost is mainly used for binary classification,


although it can be adapted for multi-class classification with techniques like
one-vs-all or one-vs-one.

Bagging Tutorial
1. Introduction
Bagging (Bootstrap Aggregating) is an ensemble learning technique aimed at
reducing variance and preventing overfitting in high-variance models. It works
by creating multiple versions of a dataset using bootstrap sampling and training
a model independently on each version. By aggregating predictions from each
individual model (usually decision trees), Bagging can produce a final, more
robust prediction. Bagging is especially effective when applied to high-variance models like decision trees, and one of its most popular implementations is
Random Forest.
The main goal of Bagging is to combine many weak models to create a stronger
model with improved stability and accuracy.

2. Formula to Predict a New Point


For a new point \( x \), the prediction in Bagging is the aggregated result from
all models trained on the bootstrapped samples. Let \( h_i(x) \) be the prediction
from the \( i \)-th model. The Bagging prediction \( \hat{y} \) for regression is
the average of all predictions, and for classification, it is the majority vote.

Regression (average of predictions):


\[
\hat{y} = \frac{1}{N} \sum_{i=1}^{N} h_i(x)
\]

Classification (majority vote):


\[
\hat{y} = \text{mode} \{ h_1(x), h_2(x), \dots, h_N(x) \}
\]
where:

\( N \) is the number of models,

\( h_i(x) \) is the prediction from the \( i \)-th model,

\( \text{mode} \) represents the most frequent class label among all


predictions.
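
In scikit-learn this corresponds to BaggingClassifier, whose default base model is a decision tree; a minimal sketch:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=400, random_state=0)

# Each of the 100 trees is trained on a bootstrap sample of the data;
# oob_score=True reuses the left-out points to estimate accuracy.
bag = BaggingClassifier(n_estimators=100, oob_score=True, random_state=0).fit(X, y)

print(bag.oob_score_)      # out-of-bag accuracy estimate
print(bag.predict(X[:5]))  # majority vote across the trees
```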

3. Loss Function
Bagging does not use a specific loss function for combining models, as each
individual model is trained independently. However, the overall goal is to reduce
the Mean Squared Error (MSE) for regression tasks or classification error for
classification tasks. Each model in Bagging is trained on its own bootstrapped
sample, where it optimizes a suitable loss function for that model (e.g., Gini
impurity or entropy in decision trees for classification).

4. Pros and Cons of Bagging

Pros
Reduces Variance: Bagging effectively reduces the variance of models like
decision trees, making them more stable.

Handles Overfitting: Since each model is trained on a different subset,


Bagging reduces the risk of overfitting that may occur with a single decision
tree.

Parallelizable: Bagging can be parallelized easily, as each model is


independent, allowing for faster computation with multiple cores or
distributed systems.

Cons
Increased Computational Cost: Training multiple models increases the
computational cost, especially for large datasets or complex models.

Less Effective on Low-Variance Models: Bagging is best for high-variance


models like decision trees; it may not improve performance significantly on
low-variance models.

Memory Intensive: Bagging requires multiple copies of the data to create


bootstrap samples, which can be memory-intensive for large datasets.

5. Effect of Outliers on Bagging


Bagging is generally less sensitive to outliers than a single decision tree since
each individual model is trained on a slightly different dataset. However, if
outliers appear frequently in the bootstrapped samples, they can still affect
individual model predictions. To mitigate this, Bagging can be combined with
robust algorithms or preprocessing steps like outlier removal.

6. Bias and Variance in Bagging


Bias: Bagging slightly increases bias since it combines multiple models, but
this increase is typically minimal.

Variance: Bagging effectively reduces variance by averaging the


predictions from multiple models, leading to improved generalization on test
data.

Bagging’s effectiveness lies in its ability to lower variance, making it ideal for
high-variance models that may otherwise overfit.

7. Additional Notes
Bootstrap Sampling: Bagging relies on bootstrap sampling, where each data point has roughly a 63% chance (about 1 − 1/e) of appearing at least once in a given bootstrap sample. This randomness contributes to the diversity among models.

Out-of-Bag (OOB) Error: The OOB error is calculated using data points not
included in the bootstrap sample for each model, providing an unbiased
estimate of the model’s generalization performance.

Feature Importance: In models like Random Forest, which are based on


Bagging, feature importance scores can be derived by observing how much
each feature splits in the trees.

Bagging is a straightforward yet powerful method to enhance model


performance by focusing on reducing variance and stabilizing predictions.

Stacking and Voting Tutorial


1. Introduction
Stacking and Voting are ensemble learning techniques that combine multiple
models to improve overall prediction accuracy. Unlike Bagging and Boosting,
which rely on creating variations of a single model type, Stacking and Voting
allow the use of multiple different model types in a single ensemble. Both
methods aim to aggregate the strengths of individual models, but they differ in
how predictions are combined.

Stacking (or Stacked Generalization) combines predictions from different


base models (also called level-0 models) by training a final “meta-model”
on these predictions. The meta-model learns how to best combine base
model predictions to improve accuracy.

Voting is a simpler method where each model in the ensemble directly


contributes to the final prediction, either through majority voting (for
classification) or averaging (for regression).

2. Formula to Predict a New Point

For both Stacking and Voting, we have an ensemble of base models, each
providing a prediction for a new data point \( x \). Let \( h_i(x) \) be the
prediction of the \( i \)-th base model.

Stacking
In Stacking, each base model \( h_i(x) \) makes a prediction, and these
predictions are used as features in a meta-model, which learns to combine
them optimally.
For a new point \( x \):

1. Obtain predictions from each base model \( h_i(x) \).

2. Pass these predictions to the meta-model to produce the final prediction, \(


\hat{y} \).

The prediction formula:


\[
\hat{y} = g(h_1(x), h_2(x), \dots, h_N(x))
\]
where \( g \) is the meta-model and \( N \) is the number of base models.

Voting
In Voting, the final prediction is a direct aggregation of all base model
predictions.

For Classification (Majority Voting):


\[
\hat{y} = \text{mode} \{ h_1(x), h_2(x), \dots, h_N(x) \}
\]

For Regression (Averaging):


\[
\hat{y} = \frac{1}{N} \sum_{i=1}^{N} h_i(x)
\]

where:

\( N \) is the number of base models,

\( \text{mode} \) denotes the most common class label in classification.
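
Both schemes are available in scikit-learn as StackingClassifier and VotingClassifier; a minimal sketch with two illustrative base models:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, random_state=0)
base = [("rf", RandomForestClassifier(random_state=0)),
        ("svm", SVC(probability=True, random_state=0))]

stack = StackingClassifier(estimators=base,
                           final_estimator=LogisticRegression())  # meta-model g
vote = VotingClassifier(estimators=base, voting="soft")           # averaged probabilities

print(stack.fit(X, y).predict(X[:5]))
print(vote.fit(X, y).predict(X[:5]))
```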

3. Loss Function

Stacking
Stacking’s loss function depends on the meta-model chosen. For example:

Classification: The meta-model may minimize logistic loss.

Regression: The meta-model may minimize mean squared error (MSE).

Each base model is first trained independently to minimize its own loss, and
then the meta-model is trained to minimize its loss on the predictions of the
base models.

Voting
Voting doesn’t use an explicit loss function to combine predictions. Each base
model is trained independently on its dataset to optimize its respective loss,
and the final prediction is an aggregation of these outputs.

4. Pros and Cons of Stacking and Voting

Pros
Increased Predictive Power: Combining diverse models can capture more
patterns and reduce errors in predictions.

Flexibility with Different Models: Allows the use of different algorithms


within the same ensemble, optimizing strengths and weaknesses of each.

Reduced Overfitting (Stacking): Stacking reduces the tendency to overfit


by training a meta-model, which better generalizes the output of base
models.

Cons
Computational Cost: Training multiple models, especially with a meta-
model, can be computationally intensive.

Complexity in Tuning (Stacking): Choosing the right base models and a


suitable meta-model can require extensive tuning.

May Not Always Improve Performance: If base models are too similar, the
ensemble may not perform better than individual models.

5. Effect of Outliers on Stacking and Voting


Both Stacking and Voting can be affected by outliers, especially if the base
models are sensitive to them (e.g., Decision Trees). However, since ensemble methods average predictions, they can be more robust to outliers than
individual models.
In Stacking, the meta-model can sometimes “learn” to reduce the influence of
outlier-sensitive models if other base models are more robust. In Voting, the
impact of outliers is reduced if robust models are part of the ensemble.

6. Bias and Variance in Stacking and Voting


Stacking: Typically has low bias and low variance due to the diversity of
base models and the aggregation process by the meta-model. By
combining models with differing biases and variances, Stacking can
achieve an optimal bias-variance trade-off.

Voting: The bias and variance depend on the types of base models used. If
using diverse models, Voting can balance bias and variance. Majority voting
in classification is less likely to overfit, while averaging in regression can
help reduce variance.

7. Additional Notes
Choice of Meta-Model (Stacking): A linear regression or logistic regression
model is often used as the meta-model for simplicity, but complex models
(e.g., neural networks) can also be used depending on the problem.

Soft Voting in Classification: Voting can be extended to soft voting for


classification, where the class probabilities from each model are averaged,
and the class with the highest average probability is selected. Soft voting
often yields better performance than hard voting.

Model Diversity: Both Stacking and Voting benefit from diverse base
models to reduce correlation among errors, which improves the ensemble’s
effectiveness.

Stacking and Voting are powerful ensemble techniques that, when used with
diverse models, can significantly enhance performance by aggregating
individual strengths and minimizing weaknesses.

K-Nearest Neighbors (KNN) Tutorial

Introduction
K-Nearest Neighbors (KNN) is a non-parametric, supervised learning
algorithm used for classification and regression tasks. In KNN, predictions for
a new data point are made based on the "k" closest points in the training
dataset. The primary goal of KNN is to classify or predict the outcome for a
new instance by looking at similar data points and finding a consensus.
KNN is straightforward and effective for low-dimensional data and problems
where similar items are likely to belong to the same category. However, it is
computationally expensive for large datasets since it needs to compare each
point with all other points in the dataset.

How KNN Works


In KNN, the algorithm computes the distance between the new data point and
all points in the training set, then selects the "k" nearest neighbors. Based on
the majority class of these neighbors, KNN assigns the class label (in
classification) or calculates the average outcome (in regression).

Formula to Predict a New Point


For a new data point x', the k closest points are selected using a distance
metric, typically Euclidean Distance. The formula for the Euclidean Distance
between x and x' is:

\[
\text{Distance}(x, x') = \sqrt{\sum_{i=1}^d (x_i - x'_i)^2}
\]
where:

\( d \) is the number of features,

\( x_i \) and \( x'_i \) are the feature values of the training point and the new data point, respectively.

Once the k nearest neighbors are identified:

In classification, KNN assigns the most frequent class among the


neighbors.

In regression, KNN averages the outcomes of the nearest neighbors.
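
A bare-bones NumPy version of this procedure for classification (the function name and default k are illustrative):

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=5):
    """Classify x_new by majority vote among its k nearest training points."""
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))  # Euclidean distances
    nearest = np.argsort(dists)[:k]                        # indices of the k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```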

Loss Function
KNN does not have a specific loss function because it is a lazy learner: it
doesn’t build a model during training. Instead, it stores the entire training
dataset and calculates distances on-the-fly when making predictions. However,
KNN's accuracy or error can be calculated using metrics like:

Classification Error: Fraction of misclassified points.

Mean Squared Error (MSE) for regression tasks.

Pros and Cons of KNN


Pros:

1. Simple and Intuitive: Easy to understand and implement with minimal


configuration.

2. Non-parametric: No assumptions about data distribution.

3. Adaptable: Can be used for both classification and regression tasks.

Cons:

1. Computationally Expensive: High memory and time complexity, especially


for large datasets.

2. Sensitive to the Choice of "k": A poorly chosen value of k can significantly


impact performance.

3. Sensitive to Feature Scaling: Requires normalization/standardization as it


relies on distance measures.

4. Imbalanced Classes: KNN may perform poorly when classes are


imbalanced.

Effect of Outliers on KNN
Outliers can significantly affect KNN because it classifies or predicts based on
nearby points. If an outlier lies within the vicinity of a new data point, it may
lead to misclassification. Techniques like standardization, feature scaling, and
choosing an appropriate distance metric can help reduce the effect of outliers
in KNN.

Bias and Variance in KNN


Bias: KNN has low bias because it does not assume a model structure.

Variance: KNN has high variance because it is sensitive to the specific


instances in the training dataset. The performance may change drastically
with slight variations in the data.

This high variance occurs because KNN directly relies on the dataset without
abstracting patterns or trends. Increasing the value of k can help reduce
variance by averaging more points, but it may increase bias if too many
neighbors are included.

Additional Notes
Choice of Distance Metric: Common choices include Euclidean,
Manhattan, and Minkowski distances. For categorical data, Hamming
Distance is often used.

Scaling Features: Since KNN relies on distances, features should be


normalized to avoid bias towards features with larger ranges.

Optimal Value of k: Typically found through cross-validation. An odd


number for k is often chosen to avoid ties in classification tasks.

Support Vector Machine (SVM) Tutorial


Introduction
Support Vector Machine (SVM) is a supervised machine learning algorithm
primarily used for classification tasks but can also be applied to regression.

The main goal of SVM is to find the optimal hyperplane that separates data
points of different classes with the maximum margin. This maximization of the
margin between classes enhances generalization, making SVM highly effective
in binary classification tasks. SVM is popular in fields like image recognition,
text categorization, and bioinformatics due to its ability to handle high-
dimensional data.

How SVM Works


SVM classifies data by finding a hyperplane that best separates the classes.
The "support vectors" are the data points closest to this hyperplane,
influencing its position and orientation. SVM aims to maximize the distance
(margin) between support vectors of the two classes, ensuring a more robust
classifier.

Formula to Predict a New Point


For a given data point \( x \), SVM predicts the class label \( y \) as follows:

\[
y = \text{sign} (w^T x + b)
\]

where:

\( w \) is the weight vector that defines the orientation of the hyperplane,

\( b \) is the bias term that shifts the hyperplane,

\( \text{sign}(\cdot) \) function returns \( +1 \) or \( -1 \) depending on which
side of the hyperplane the point lies.

The Loss Function


The objective of SVM is to maximize the margin while minimizing the
classification error. The Hinge Loss function is used to penalize misclassified
points or points within the margin. The loss function for SVM is given by:
\[
\text{Loss} = \frac{1}{2} ||w||^2 + C \sum_{i=1}^N \max(0, 1 - y_i (w^T x_i + b))
\]

where:

\( ||w||^2 \) controls the margin (keeps it large),

\( C \) is a regularization parameter that balances maximizing the margin


and minimizing misclassifications,

\( y_i \) is the true label for point \( x_i \).

The Hinge Loss \( \max(0, 1 - y_i (w^T x_i + b)) \) is zero if the point is correctly
classified and beyond the margin; otherwise, it penalizes based on how far the
point is from the correct margin boundary.
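
For reference, the regularized hinge loss above can be computed directly in NumPy (labels are assumed to be coded as -1/+1):

```python
import numpy as np

def svm_loss(w, b, X, y, C=1.0):
    """0.5 * ||w||^2 plus C times the summed hinge loss; y coded as -1/+1."""
    margins = y * (X @ w + b)
    hinge = np.maximum(0.0, 1.0 - margins)  # zero for points beyond the margin
    return 0.5 * np.dot(w, w) + C * hinge.sum()
```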

Calculating the Hyperplane and Updating It


To calculate the hyperplane, SVM solves an optimization problem to find \( w \)
and \( b \) that maximize the margin and minimize classification error. The
gradient descent method is typically used to adjust \( w \) and \( b \) iteratively.
The hyperplane can be updated by:

1. Minimizing the loss: Using optimization techniques like Stochastic


Gradient Descent (SGD) or Quadratic Programming.

2. Support Vectors: Since only support vectors define the hyperplane,


adjusting \( w \) and \( b \) to ensure that these points lie on the margin
helps update the hyperplane.

Pros and Cons of SVM


Pros:

Effective in high-dimensional spaces: Works well when the number of


features is greater than the number of samples.

Robust to overfitting: Especially with the use of regularization (C
parameter).

Kernel Trick: Allows SVM to create non-linear decision boundaries using


kernel functions like RBF or polynomial kernels.

Cons:

Inefficient for large datasets: Training SVM is computationally intensive for


large datasets.

Choice of Kernel: Selecting an appropriate kernel can be challenging and


may impact performance.

Sensitive to parameter tuning: Requires careful tuning of the C and kernel


parameters for optimal results.

Effect of Outliers on SVM


Outliers can significantly affect the hyperplane as they may be located close to
the decision boundary, altering the margin and causing SVM to misclassify
nearby points. Using a soft margin (controlled by parameter \( C \)) helps SVM
tolerate some misclassified points, making it more robust to outliers. A high \( C
\) value places more emphasis on correctly classifying each point, while a low \
( C \) allows more flexibility for misclassification.

Bias and Variance


Bias: SVM generally has low bias, especially with non-linear kernels, as it
can capture complex patterns.

Variance: It tends to have high variance, especially with complex kernels,


as it may be sensitive to changes in the training data.

This high variance makes SVM prone to overfitting if parameters (especially the
kernel and \( C \)) are not carefully selected.

Additional Notes
Kernel Trick: The kernel trick allows SVM to map data into higher
dimensions to create a linear separation where it is otherwise impossible.
Some popular kernels include:

Linear Kernel: Suitable for linearly separable data.

Polynomial Kernel: Captures polynomial relationships.

RBF (Gaussian) Kernel: Suitable for non-linearly separable data.

Support Vectors: Only the points that lie on the margin, known as support
vectors, affect the final model, leading to a sparse solution.
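
A short scikit-learn sketch of an RBF-kernel SVM on a non-linearly separable toy dataset (the C and gamma values are illustrative and would normally be tuned):

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

# Scaling matters because the RBF kernel is distance-based.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X, y)
print(clf.score(X, y))
```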

Clustering Tutorial
K-Means
Introduction
K-Means is a popular unsupervised learning algorithm used for clustering,
where it groups data points into a predefined number of clusters. The main
objective of K-Means is to partition the data into clusters such that data points
within a cluster are more similar to each other than to those in other clusters.
This similarity is measured by the distance between points, often using
Euclidean distance. K-Means is effective for segmenting datasets where a
natural grouping exists, making it useful in applications like customer
segmentation, image compression, and pattern recognition.

How K-Means Works


1. Choose the number of clusters (k): The user predefines the number of
clusters, often selected based on techniques like the Elbow Method.

2. Initialize centroids: Randomly initialize \( k \) centroids (one for each cluster).

3. Assign points to nearest centroids: Each data point is assigned to the nearest centroid, creating \( k \) clusters.

4. Update centroids: Calculate the mean of all points within each cluster and update the centroids accordingly.

5. Repeat steps 3-4: Iterate until convergence (when centroids no longer move significantly or a maximum number of iterations is reached).

The algorithm aims to minimize the within-cluster sum of squares (WCSS), which measures the squared distance between each point and its centroid, as implemented in the sketch below.
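
The following is a minimal NumPy sketch of this loop (Lloyd's algorithm). It mirrors steps 2-5 above and reports the WCSS at the end; the function name is illustrative, and for simplicity it assumes no cluster ever becomes empty.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-Means on an (N, n_features) array; returns labels, centroids, WCSS."""
    rng = np.random.default_rng(seed)
    # Step 2: initialize centroids by picking k distinct data points at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign every point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of the points assigned to it.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop when the centroids no longer move significantly.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # WCSS: squared distance from every point to the centroid of its own cluster.
    wcss = ((X - centroids[labels]) ** 2).sum()
    return labels, centroids, wcss
```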

Formula to Predict Cluster for a New Point
The formula used in K-Means for assigning a point \( x \) to a cluster is based
on minimizing the Euclidean distance:
\[
d(x, \mu_j) = \sqrt{\sum_{i=1}^{n} (x_i - \mu_{j,i})^2}
\]
where:

\( x \) is the data point,

\( \mu_j \) is the centroid of cluster \( j \),

\( n \) is the number of features.

The point \( x \) is assigned to the cluster \( j \) that minimizes \( d(x, \mu_j) \).

Objective Function: Within-Cluster Sum of Squares (WCSS)

The objective function for K-Means, called the within-cluster sum of squares
(WCSS), is represented as:
\[
\text{WCSS} = \sum_{j=1}^{k} \sum_{x \in C_j} \| x - \mu_j \|^2
\]
where:

\( k \) is the number of clusters,

\( C_j \) represents cluster \( j \),

\( x \) is a data point in \( C_j \), and

\( \mu_j \) is the centroid of \( C_j \).

This function calculates the squared distance between each point and its
centroid, summing it across all clusters. K-Means aims to minimize this value.

Pros and Cons of K-Means


Pros:

Simple and efficient: Easy to implement and computationally efficient on large datasets.

Scalable: Can handle a large number of data points.

Works well with spherical clusters: K-Means performs well with clusters
that are roughly circular in shape.

Cons:

Sensitive to the initial choice of centroids: Random initialization can lead to different results; hence, running the algorithm multiple times or using initialization techniques like K-Means++ is recommended.

Difficulty with non-spherical clusters: Struggles to identify clusters of arbitrary shapes, as it relies on distance-based metrics.

Requires pre-specifying k: The number of clusters \( k \) must be defined beforehand, which may not be straightforward.

Effect of Outliers on K-Means


Outliers can have a significant impact on K-Means clustering. Since K-Means
relies on calculating centroids as the mean of all points in a cluster, even a
single outlier can distort the centroid's position. This misplacement can cause
the clustering to be inaccurate, leading to clusters that do not represent the
natural groupings in the data.
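
A tiny numeric illustration of this effect: adding one extreme point to a tight cluster moves its mean, and therefore its centroid, far from where the bulk of the points lie. The numbers below are made up purely for demonstration.

```python
import numpy as np

cluster = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.0]])
with_outlier = np.vstack([cluster, [[10.0, 10.0]]])

print("centroid without outlier:", cluster.mean(axis=0))       # ~[1.03, 1.00]
print("centroid with outlier:   ", with_outlier.mean(axis=0))  # ~[2.82, 2.80]
```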

Bias and Variance


High bias: K-Means is a relatively rigid algorithm, particularly because it
forms clusters based on spherical distance from centroids. This results in a
high bias, making it less flexible for complex data structures.

Low variance: If the initialization process is handled well (e.g., with K-Means++), K-Means produces consistent results across runs. However, if centroids are poorly initialized, variance may increase.

Additional Notes
K-Means++ Initialization: K-Means++ is a method to improve centroid
initialization, helping to reduce the likelihood of poor clustering and
convergence to a local minimum.

Elbow Method: A technique to help determine the optimal number of clusters by plotting the WCSS for different values of \( k \) and identifying the point where adding more clusters minimally reduces WCSS (see the sketch after this list).

Interpretability: K-Means clustering can sometimes be challenging to interpret if clusters are not well-separated.
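
Here is a hedged sketch of the Elbow Method using scikit-learn (which exposes the WCSS as `inertia_`) and matplotlib; the synthetic blobs and the range of \( k \) values are only for illustration.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# Fit K-Means for several k values and record the WCSS (inertia_ in scikit-learn).
# scikit-learn's default init="k-means++" already uses the improved initialization
# mentioned above.
ks = range(1, 10)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, wcss, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("WCSS (inertia)")
plt.title("Elbow Method")
plt.show()
```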

Hierarchical Clustering Tutorial


Introduction
Hierarchical clustering is an unsupervised learning algorithm used to group
data points into clusters without specifying the number of clusters beforehand.
Unlike K-Means, which forms a predefined number of clusters, hierarchical
clustering creates a tree-like structure (dendrogram) that represents nested
clusters within each other. This algorithm is particularly useful for exploring the
natural structure in data, as it allows you to visualize clusters at various levels
of granularity. It is frequently used in fields like bioinformatics, customer
segmentation, and document clustering.

How Hierarchical Clustering Works


Hierarchical clustering builds clusters in a bottom-up (agglomerative) or top-
down (divisive) approach:

Agglomerative (most common): Starts with each data point as its own
cluster and merges the closest clusters iteratively until only one cluster
remains.

Divisive: Starts with all points in a single cluster and recursively splits them
until each data point is its own cluster.

Key Steps:
1. Calculate distances: Compute the distance matrix between each pair of
data points.

2. Merge closest clusters: Find the two closest clusters (based on a distance
metric) and merge them.

3. Update distance matrix: Recalculate the distances between the new cluster and remaining clusters.

4. Repeat until one cluster remains.

Hierarchical clustering does not require specifying the number of clusters in advance, and clusters can be determined by cutting the dendrogram at the desired level.
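
A minimal sketch of the agglomerative workflow using SciPy, assuming two well-separated synthetic groups: `linkage` performs the pairwise-distance and merging steps above in one call, and `dendrogram` draws the resulting tree.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

# Agglomerative clustering: compute the full merge history (steps 1-4 above).
# 'ward' is the linkage criterion; 'euclidean' is the distance metric.
Z = linkage(X, method="ward", metric="euclidean")

# The dendrogram shows the nested cluster structure; cutting it at a chosen
# height yields a flat clustering.
dendrogram(Z)
plt.xlabel("Data point index")
plt.ylabel("Merge distance")
plt.show()
```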

Distance Metrics and Linkage Criteria


Distance and linkage criteria significantly impact hierarchical clustering's
structure. Some popular distance metrics include Euclidean, Manhattan, and
Cosine distances. The linkage criterion determines how distances between
clusters are calculated:

Single linkage: Distance between the closest points of two clusters.

Complete linkage: Distance between the farthest points of two clusters.

Average linkage: Average distance between points across clusters.

Ward’s linkage: Minimizes the variance within each cluster, often producing
balanced clusters.

The choice of metric and linkage can affect cluster shape and separation, so
they should be selected based on data characteristics.
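
One common (though not the only) way to compare linkage choices is the cophenetic correlation, which measures how faithfully the dendrogram preserves the original pairwise distances. The sketch below assumes SciPy and random data; higher values suggest a better fit for that particular dataset.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
pairwise = pdist(X, metric="euclidean")

for method in ("single", "complete", "average", "ward"):
    Z = linkage(X, method=method)
    corr, _ = cophenet(Z, pairwise)  # correlation between dendrogram and raw distances
    print(f"{method:>8} linkage: cophenetic correlation = {corr:.3f}")
```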

Objective Function: Dendrogram Representation


Unlike K-Means, hierarchical clustering does not optimize a specific objective
function. Instead, it builds a dendrogram: a tree diagram that shows the
hierarchy of clusters. The dendrogram helps visualize how clusters form at different levels, allowing the user to decide the best number of clusters based on their requirements.

Pros and Cons of Hierarchical Clustering


Pros:

No need to specify k: It automatically creates a nested structure of clusters.

Easy to visualize: Dendrograms allow a clear visualization of data hierarchy, making it easier to understand the clustering at various levels.

Suitable for arbitrary shapes: Can capture clusters of different shapes and
sizes more naturally than K-Means.

Cons:

Computationally expensive: Hierarchical clustering has a time complexity of \( O(n^2 \log(n)) \), which can be inefficient for large datasets.

Sensitive to noise and outliers: Outliers can affect the dendrogram structure and potentially lead to misleading clusters.

Difficulty in adjusting: Once clusters are formed, adjusting the structure requires re-running the entire algorithm.

How Outliers Affect Hierarchical Clustering


Outliers can have a significant impact on hierarchical clustering. Since the
algorithm continuously merges clusters, an outlier can be treated as a separate
cluster or merged into an existing cluster, disrupting the natural clustering.
Additionally, in cases with single-linkage, outliers may create "chains," where
distant points are connected due to their proximity to other points, distorting
cluster structure.

Bias and Variance


Hierarchical clustering generally exhibits:

High bias: This is because once clusters are formed, they cannot be
changed, leading to rigid clustering structures.

Low variance: Results tend to be stable across different runs because there is no random initialization. However, this depends on the choice of linkage and distance metric.

Additional Notes
Dendrogram Cutting: The depth at which the dendrogram is "cut"
determines the number of clusters. The threshold can be chosen by
analyzing the dendrogram and selecting a height where there is a
significant gap between levels.

Distance Threshold: A cutoff distance can be set to stop merging clusters at a certain level, helping control cluster compactness (see the sketch after this list).

Scalability: Hierarchical clustering may not perform well with large datasets
due to its high computational requirements, but it is effective for small to
medium datasets.
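
The sketch below shows both ways of turning a dendrogram into flat clusters with SciPy's `fcluster`: cutting at a distance threshold, or requesting a fixed number of clusters. The data and threshold values are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])
Z = linkage(X, method="ward")

# Cut the dendrogram at a fixed height (distance threshold)...
labels_by_distance = fcluster(Z, t=10.0, criterion="distance")

# ...or ask directly for a fixed number of clusters.
labels_by_count = fcluster(Z, t=2, criterion="maxclust")

print("clusters at distance 10:", len(set(labels_by_distance)))
print("clusters with maxclust=2:", len(set(labels_by_count)))
```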

K-Means VS Hierarchical Clustering

Here's a comparison table between K-Means and Hierarchical Clustering:

| Aspect | K-Means Clustering | Hierarchical Clustering |
| --- | --- | --- |
| Type | Partitional (divides data into distinct, non-overlapping clusters) | Hierarchical (builds a nested cluster structure or hierarchy) |
| Number of Clusters (k) | Must be specified in advance | Determined by analyzing the dendrogram or setting a distance cutoff |
| Algorithm Type | Iterative and distance-based | Agglomerative (bottom-up) or divisive (top-down) |
| Distance Metric | Primarily Euclidean distance | Various metrics (Euclidean, Manhattan, Cosine, etc.) |
| Linkage Criteria | Not applicable | Single, Complete, Average, or Ward’s linkage |
| Speed and Complexity | Faster, with complexity of \( O(n \times k \times i) \) | Slower, with complexity of \( O(n^2 \log(n)) \) |
| Scalability | Efficient on large datasets | Inefficient for very large datasets |
| Cluster Shape | Works best with spherical, well-separated clusters | Suitable for clusters of arbitrary shapes |
| Outlier Sensitivity | Sensitive to outliers | Highly sensitive, especially with single linkage |
| Random Initialization | Centroid initialization affects results (e.g., K-Means++) | No random initialization; stable clusters across runs |
| Interpretability | Moderate; clusters are less interpretable without visualization | High; dendrogram provides a clear visualization of cluster hierarchy |
| Variance and Bias | High variance if initialized poorly; lower bias with defined structure | Lower variance; high bias due to rigid clustering structure |
| Usage Cases | Customer segmentation, image compression | Bioinformatics, gene clustering, document clustering |
| Result Representation | Cluster labels for each point | Dendrogram (tree structure) and optional cluster labels |

Summary
K-Means is suitable for large datasets with spherical clusters and requires
pre-specifying the number of clusters.

Hierarchical Clustering is ideal for smaller datasets, provides a hierarchical structure, and is effective when you want to explore nested clusters.
