PART2

Clustering is an unsupervised machine learning technique that groups similar data points based on their characteristics, with applications in market segmentation, anomaly detection, and medical imaging. Various clustering methods include centroid-based (e.g., K-means), density-based (e.g., DBSCAN), and hierarchical clustering, each with distinct algorithms and use cases. The document also covers association rule mining (particularly market basket analysis with the Apriori algorithm), dimensionality reduction with PCA, the bias-variance trade-off, and cross-validation techniques.


Clustering

• Clustering is a technique used in machine learning and data analysis to group similar objects or data points together
based on their inherent characteristics or patterns.
• It is an unsupervised learning method, meaning that it does not rely on labeled data but instead aims to discover
patterns and relationships within the data itself.
• The goal of clustering is to partition a dataset into groups, known as clusters, such that the objects within each cluster
are more similar to each other than to those in other clusters. The similarity between objects is typically measured
using distance metrics, such as Euclidean distance or cosine similarity, depending on the nature of the data.
Uses of Clustering
Market Segmentation – Businesses use clustering to group their customers so that targeted advertisements reach a wider, more relevant audience.
Market Basket Analysis – Shop owners analyse their sales to find out which items are frequently bought together. For example, a well-known US study found that diapers and beer were often purchased together by fathers.
Social Network Analysis – Social media sites use your data to understand your browsing behaviour and provide targeted friend or content recommendations.
Medical Imaging – Doctors use clustering to locate diseased areas in diagnostic images such as X-rays.
Anomaly Detection – Clustering can be used to find outliers in a real-time data stream or to flag potentially fraudulent transactions.
Simplifying large datasets – Each cluster is given a cluster ID once clustering is complete, so an entire feature vector can be summarised by its cluster ID. Clustering is effective when it can represent a complicated case with a straightforward cluster ID, which makes complex datasets simpler to work with.
Types of Clustering
1. Centroid-based Clustering (Partitioning methods)
• Groups data points on the basis of their closeness. Typical similarity measures for these algorithms are Euclidean distance, Manhattan distance or Minkowski distance.
• Example: K-means clustering
• The primary drawback of these algorithms is the requirement that we establish the number of clusters, “k”, either intuitively or scientifically, and the outcome can vary with the initial positioning of the centroids.
2. Density-based Clustering
• Finds groups based on the density of data points.
• Determines the number of clusters automatically.
• Ideally suited for datasets with irregularly shaped or overlapping clusters.
• Unlike centroid-based methods, it does not need a preset number of clusters and is far less sensitive to initialisation.
• Example: DBSCAN
Types of Clustering
3. Connectivity-based Clustering (Hierarchical clustering)
• Each data point is initially treated as a separate cluster; the most similar clusters are then merged repeatedly until one large cluster contains all of the data points.
4. Distribution-based Clustering (Model-based methods)
• The data elements are grouped according to probabilities derived from statistical distributions. Data objects with a higher likelihood of belonging to a distribution are included in the corresponding cluster, and the further a data point lies from a cluster’s central point, the less likely it is to be included in that cluster.
K-means Clustering
The k-means clustering algorithm mainly performs two tasks:
• Determines the best value for the K center points, or centroids, by an iterative process.
• Assigns each data point to its closest k-center. The data points near a particular k-center form a cluster.
Algorithm:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as centroids. (They can be points other than those in the input dataset.)
Step-3: Assign each data point to its closest centroid, forming the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat Step-3, i.e. reassign each data point to the new closest centroid.
Step-6: If any reassignment occurs, go to Step-4; otherwise go to FINISH.
Step-7: The model is ready.
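As a rough illustration of these steps (not part of the original slides), here is a minimal NumPy sketch; the function name kmeans and the random initialisation are choices made for this example.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """A minimal K-means sketch following the steps above."""
    rng = np.random.default_rng(seed)
    # Step-2: pick K random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iters):
        # Step-3 / Step-5: assign every point to its closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # Step-6: no reassignment occurred, so stop
        labels = new_labels
        # Step-4: move each centroid to the mean of the points assigned to it
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

# The 8 points from the worked example further below
X = np.array([[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]], float)
centroids, labels = kmeans(X, k=3)
print(centroids, labels)
```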
K-means Clustering
[Illustration, K = 2: choose K random points as centroids; assign each data point to its closest centroid; compute the centre of gravity of each group and take it as the new centroid; draw the new median line and reassign the data points; repeat the process with the new centroids until no dissimilar points lie on either side of the line; then remove the assumed centroids, the final clusters are formed and the model is ready.]
K-means Clustering
Example
Use the k-means algorithm and Euclidean distance to cluster the following 8 examples into 3 clusters:
A1=(2,10), A2=(2,5), A3=(8,4), A4=(5,8), A5=(7,5), A6=(6,4), A7=(1,2), A8=(4,9).
Suppose that the initial seeds (centers of each cluster) are A1, A4 and A7. Run the k-means algorithm for 1 epoch
only.
At the end of this epoch show:
a) The new clusters (i.e. the examples belonging to each cluster)
b) The centers of the new clusters
c) Draw a 10 by 10 space with all the 8 points and show the clusters after the first epoch and the new centroids.
d) How many more iterations are needed to converge? Draw the result for each epoch.
K-means Clustering
Solution:
Randomly selected initial centroids (seeds):
seed1 = A1 = (2, 10), seed2 = A4 = (5, 8), seed3 = A7 = (1, 2)
Euclidean distance between a and b:
d(a, b) = √((x_a − x_b)² + (y_a − y_b)²)
The distance matrix based on the Euclidean distance is given in the table below.
K-means Clustering
Epoch 1: Start

Data point | Distance to seed1 | Distance to seed2 | Distance to seed3 | Cluster label
A1 | 0 | √13 ≈ 3.61 | √65 ≈ 8.06 | Cluster 1
A2 | 5 | √18 ≈ 4.24 | √10 ≈ 3.16 | Cluster 3
A3 | √72 ≈ 8.49 | 5 | √53 ≈ 7.28 | Cluster 2
A4 | √13 ≈ 3.61 | 0 | √52 ≈ 7.21 | Cluster 2
A5 | √50 ≈ 7.07 | √13 ≈ 3.61 | √45 ≈ 6.71 | Cluster 2
A6 | √52 ≈ 7.21 | √17 ≈ 4.12 | √29 ≈ 5.39 | Cluster 2
A7 | √65 ≈ 8.06 | √52 ≈ 7.21 | 0 | Cluster 3
A8 | √5 ≈ 2.24 | √2 ≈ 1.41 | √58 ≈ 7.62 | Cluster 2

New clusters:
Cluster 1: {A1}
Cluster 2: {A3, A4, A5, A6, A8}
Cluster 3: {A2, A7}

Centers of the new clusters:
C1 = (2, 10)
C2 = ((8+5+7+6+4)/5, (4+8+5+4+9)/5) = (6, 6)
C3 = ((2+1)/2, (5+2)/2) = (1.5, 3.5)
K-means Clustering
After the 2nd epoch the results would be:
Cluster 1: {A1, A8}, Cluster 2: {A3, A4, A5, A6}, Cluster 3: {A2, A7},
with centers C1 = (3, 9.5), C2 = (6.5, 5.25) and C3 = (1.5, 3.5).
After the 3rd epoch the results would be:
Cluster 1: {A1, A4, A8}, Cluster 2: {A3, A5, A6}, Cluster 3: {A2, A7},
with centers C1 = (3.66, 9), C2 = (7, 4.33) and C3 = (1.5, 3.5).
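As a cross-check (an illustrative sketch, not from the slides), the same run can be reproduced with scikit-learn's KMeans by passing the seeds A1, A4, A7 as the explicit init array; after convergence the centers should match the epoch-3 result above.

```python
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]], float)
seeds = np.array([[2, 10], [5, 8], [1, 2]], float)   # A1, A4, A7

# n_init=1 because we supply the initial centroids ourselves.
km = KMeans(n_clusters=3, init=seeds, n_init=1).fit(points)
print("labels:", km.labels_)              # cluster index of A1..A8
print("centers:", km.cluster_centers_)    # approx (3.67, 9), (7, 4.33), (1.5, 3.5)
```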
K-means Clustering
Elbow Method
The Elbow method is one of the most popular ways to find the optimal number of clusters. This method uses the WCSS value. WCSS stands for Within-Cluster Sum of Squares, which measures the total variation within the clusters. The formula to calculate the value of WCSS (for 3 clusters) is given below:

WCSS = Σ_{Pi in Cluster1} distance(Pi, C1)² + Σ_{Pi in Cluster2} distance(Pi, C2)² + Σ_{Pi in Cluster3} distance(Pi, C3)²

• Execute K-means clustering on the dataset for different K values (typically ranging from 1 to 10).
• For each value of K, calculate the WCSS value.
• Plot a curve between the calculated WCSS values and the number of clusters K.
• The sharp point of bend, where the plot looks like an arm, is taken as the best value of K.
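A small sketch of the elbow procedure (assuming scikit-learn, whose inertia_ attribute is exactly the WCSS), run on the 8 example points:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]], float)

wcss = []
for k in range(1, 8):                      # K cannot exceed the number of points here
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)               # inertia_ = within-cluster sum of squares
    print(k, round(km.inertia_, 2))
# Plotting wcss against k and looking for the "elbow" suggests the best K.
```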
Nearest Neighbour Clustering
Use nearest neighbourhood algorithm and Euclidean distance to cluster the examples from the previous exercise.
Assume that the threshold is 4.
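The slides do not spell out the algorithm, so the sketch below assumes the usual sequential nearest-neighbour variant: a point joins the cluster of its nearest already-processed point if that distance is within the threshold, otherwise it starts a new cluster.

```python
import numpy as np

def nearest_neighbour_clustering(X, threshold):
    """Assumed sequential nearest-neighbour clustering."""
    labels = [0]                                      # first point starts cluster 0
    for i in range(1, len(X)):
        d = np.linalg.norm(X[:i] - X[i], axis=1)      # distances to earlier points
        j = int(d.argmin())
        if d[j] <= threshold:
            labels.append(labels[j])                  # join the nearest point's cluster
        else:
            labels.append(max(labels) + 1)            # open a new cluster
    return labels

X = np.array([[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]], float)
print(nearest_neighbour_clustering(X, threshold=4))   # [0, 1, 2, 0, 2, 2, 1, 0]
```

With a threshold of 4, this variant yields the clusters {A1, A4, A8}, {A2, A7}, {A3, A5, A6}.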
Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
• The K-means algorithm works for finding spherical-shaped or convex clusters, i.e. it is suitable only for compact and well-separated clusters; moreover, it is severely affected by the presence of noise and outliers in the data.
• Given a dataset containing non-convex clusters and outliers, K-means has difficulty finding the clusters.
• DBSCAN clusters the dense regions in the data space that are separated by regions of lower point density.
• The key idea is that, for each point of a cluster, the neighborhood of a given radius has to contain at least a minimum number of points.
DBSCAN
Parameters required for DBSCAN
eps: defines the neighborhood around a data point, i.e. if the distance between two points is lower than or equal to ‘eps’ then they are considered neighbors. If the eps value is chosen too small, a large part of the data will be considered outliers. If it is chosen very large, the clusters will merge and the majority of the data points will end up in the same cluster. One way to find the eps value is based on the k-distance graph.

MinPts: the minimum number of neighbors (data points) within the eps radius. The larger the dataset, the larger the value of MinPts that should be chosen. As a general rule, the minimum MinPts can be derived from the number of dimensions D in the dataset as MinPts >= D + 1. The minimum value of MinPts should be at least 3.

In this algorithm, we have 3 types of data points.


Core Point: A point is a core point if it has more than MinPts points within eps.
Border Point: A point which has fewer than MinPts within eps but it is in the
neighborhood of a core point.
Noise or outlier: A point which is not a core point or border point.
DBSCAN
Definitions used in DBSCAN clustering
Directly density reachable: a data point a is directly density reachable from a point b if
• |N(b)| ≥ MinPts, i.e. b is a core point, and
• a ∈ N(b), i.e. a is in the epsilon neighborhood of b.
Density connected: “A point a is density connected to a point b with respect to ε and MinPts if there is a point c such that both a and b are density reachable from c w.r.t. ε and MinPts.”
DBSCAN - Algorithm
1. The algorithm starts with an arbitrary point which has not been visited, and its neighborhood information is retrieved using the ε parameter.
2. If this point has at least MinPts points within its ε neighborhood, cluster formation starts. Otherwise the point is labeled as noise. This point can later be found within the ε neighborhood of a different point and thus can be made part of a cluster. The concepts of density-reachable and density-connected points are important here.
3. If a point is found to be a core point, then the points within its ε neighborhood are also part of the cluster. So all points found within the ε neighborhood are added, along with their own ε neighborhoods if they are also core points.
4. The above process continues until the density-connected cluster is completely found.
5. The process restarts with a new point, which can be part of a new cluster or labeled as noise.
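For reference, a brief sketch with scikit-learn's DBSCAN (the two blobs and the outlier below are made-up data for illustration); eps and min_samples play the roles of ε and MinPts:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
blob1 = rng.normal(loc=[0, 0], scale=0.3, size=(30, 2))     # dense region 1
blob2 = rng.normal(loc=[5, 5], scale=0.3, size=(30, 2))     # dense region 2
outlier = np.array([[10.0, -10.0]])                         # isolated point
X = np.vstack([blob1, blob2, outlier])

db = DBSCAN(eps=0.8, min_samples=4).fit(X)
print(db.labels_)     # two cluster ids plus -1 for the noise point
```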
Hierarchical Clustering
Complete linkage and single linkage
• In the example distance matrix, D(3,5) is the minimum entry, so items 3 and 5 are merged into one cluster: the rows and columns for 3 and 5 are removed and replaced by a single entry "35".
• Under complete linkage, since d(1,3) = 3 and d(1,5) = 11, the larger distance is kept, so D(1, "35") = 11.
• Continuing in this way, after 6 steps everything is clustered.
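A short sketch of hierarchical clustering with SciPy, run on the same 8 points used earlier (that choice of data is just for illustration); "single" and "complete" select the linkage criterion:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]], float)

Z_single = linkage(X, method="single", metric="euclidean")      # single linkage
Z_complete = linkage(X, method="complete", metric="euclidean")  # complete linkage

# Cut each dendrogram into 3 flat clusters for comparison.
print(fcluster(Z_single, t=3, criterion="maxclust"))
print(fcluster(Z_complete, t=3, criterion="maxclust"))
```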
Association Rule Mining
Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other
items in the transaction.

Example:
Market basket analysis using association rules
• This analysis might tell a retailer that customers often purchase shampoo and conditioner together, so putting
both items on promotion at the same time would not create a significant increase in revenue, while a promotion
involving just one of the items would likely drive sales of the other.
• The outcome of this type of technique is, in simple terms, a set of rules that can be understood as “if this, then
that”.

Retailers are interested in analysing the data to learn about the purchasing behaviour of their customers.

Transaction | Items
t1 | {T-shirt, Trousers, Belt}
t2 | {T-shirt, Jacket}
t3 | {Jacket, Gloves}
t4 | {T-shirt, Trousers, Jacket}
t5 | {T-shirt, Trousers, Sneakers, Jacket, Belt}
t6 | {Trousers, Sneakers, Belt}
t7 | {Trousers, Belt, Sneakers}
Association Rule Mining
An association rule implies that if an item A occurs, then item B also occurs with a certain probability.

Metrics to measure the precision of a rule:
• Support is an indication of how frequently the item set appears in the data set.
• Confidence is an indication of how often the rule has been found to be true.
• Lift is the ratio of the observed support to that expected if X and Y were independent. Greater lift values indicate stronger associations.
Association Rule Mining
Example of association rules
Items: I = {T-Shirt, Trousers, Belt, Jacket, Gloves, Sneakers}
Transactions: e.g. t1 = {T-Shirt, Trousers, Belt}
An association rule is then defined as an implication of the form X ⇒ Y, where X, Y ⊆ I and X ∩ Y = ∅.

Transaction | Items
t1 | {T-shirt, Trousers, Belt}
t2 | {T-shirt, Jacket}
t3 | {Jacket, Gloves}
t4 | {T-shirt, Trousers, Jacket}
t5 | {T-shirt, Trousers, Sneakers, Jacket, Belt}
t6 | {Trousers, Sneakers, Belt}
t7 | {Trousers, Belt, Sneakers}

Support: supp(X ⇒ Y) = (number of transactions containing X ∪ Y) / (total number of transactions)
Confidence: conf(X ⇒ Y) = supp(X ∪ Y) / supp(X)
Lift: lift(X ⇒ Y) = supp(X ∪ Y) / (supp(X) × supp(Y))
Association Rule Mining
Conviction: conv(X ⇒ Y) = (1 − supp(Y)) / (1 − conf(X ⇒ Y))
It can be interpreted as the ratio of the expected frequency that X occurs without Y (if X and Y were independent) to the observed frequency of incorrect predictions. A high value means that the consequent depends strongly on the antecedent.
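These four metrics can be computed directly on the t1-t7 transactions; the sketch below evaluates the illustrative rule {T-shirt, Trousers} ⇒ {Belt} (the choice of rule is an assumption for the example):

```python
transactions = [
    {"T-shirt", "Trousers", "Belt"},
    {"T-shirt", "Jacket"},
    {"Jacket", "Gloves"},
    {"T-shirt", "Trousers", "Jacket"},
    {"T-shirt", "Trousers", "Sneakers", "Jacket", "Belt"},
    {"Trousers", "Sneakers", "Belt"},
    {"Trousers", "Belt", "Sneakers"},
]

def support(itemset):
    # Fraction of transactions that contain every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def rule_metrics(X, Y):
    supp = support(X | Y)
    conf = supp / support(X)
    lift = supp / (support(X) * support(Y))
    conviction = float("inf") if conf == 1 else (1 - support(Y)) / (1 - conf)
    return supp, conf, lift, conviction

# {T-shirt, Trousers} => {Belt}: support 2/7, confidence 2/3, lift 7/6, conviction 9/7
print(rule_metrics({"T-shirt", "Trousers"}, {"Belt"}))
```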
Association Rule Mining
The Apriori algorithm generates association rules for a given data set. Association rules are usually required to satisfy a user-specified minimum support and a user-specified minimum confidence at the same time.
Frequent item set: an item set whose support is greater than or equal to a minsup threshold.
Given a set of transactions T, the goal of association rule mining is to find all rules having
• support >= minsup threshold
• confidence >= minconf threshold
Two-step approach:
1. Frequent item set generation (generate all item sets whose support >= minsup threshold).
2. Rule generation (generate high-confidence rules from each frequent item set, where each rule is a binary partitioning of a frequent item set, with confidence >= minconf threshold).

Apriori Algorithm

1. Candidate item sets are generated using only the large item
sets of the previous pass without considering the
transactions in the database.
2. The large item set of the previous pass is joined with itself
to generate all item sets whose size is higher by 1.
3. Each generated item set that has a subset which is not
large is deleted. The remaining item sets are the candidate
ones.
Association Rule Mining
The Apriori algorithm is an influential algorithm for mining frequent itemsets for boolean association rules.
Key concepts:
• Frequent itemsets: the sets of items which have minimum support (denoted by Li for the i-th itemset).
• Apriori property: any subset of a frequent itemset must be frequent.
• Join operation: to find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself.
Pseudo-code:
• Join step: Ck is generated by joining Lk-1 with itself.
• Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.
Ck: candidate itemset of size k
Lk: frequent itemset of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
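For comparison with the pseudo-code, here is a hedged sketch using the mlxtend library (assuming it is installed; apriori, association_rules and TransactionEncoder are mlxtend names, and the minsup/minconf values are arbitrary choices) applied to the t1-t7 baskets:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [
    ["T-shirt", "Trousers", "Belt"],
    ["T-shirt", "Jacket"],
    ["Jacket", "Gloves"],
    ["T-shirt", "Trousers", "Jacket"],
    ["T-shirt", "Trousers", "Sneakers", "Jacket", "Belt"],
    ["Trousers", "Sneakers", "Belt"],
    ["Trousers", "Belt", "Sneakers"],
]

# One-hot encode the baskets into a boolean DataFrame.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Step 1: frequent itemsets with support >= minsup; Step 2: rules with confidence >= minconf.
frequent = apriori(onehot, min_support=3/7, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```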
Dimensionality Reduction
• The curse of dimensionality is a common problem in machine learning, where the performance of the model deteriorates
as the number of features increases. This is because the complexity of the model increases with the number of features, and
it becomes more difficult to find a good solution.
• High-dimensional data can also lead to over-fitting, where the model fits the training data too closely and does not
generalize well to new data.
• Dimensionality reduction can help to mitigate these problems by reducing the complexity of the model and improving its
generalization performance.

Two main methods of dimensionality reduction are:
1. Feature selection - to reduce the dimensionality of the feature space, this process finds the most informative features or eliminates uninformative ones. Feature selection can be done manually or using software tools. It requires an understanding of which aspects of your dataset are important for whatever predictions you are making, and which are not.
2. Feature extraction - derive new features that are composites of the existing features. This reduces the number of features while keeping as much information as possible.
Dimensionality Reduction
Feature selection methods:
• Filter methods: features are selected based on their statistical properties, such as their correlation with the target variable or their variance.
• Wrapper methods: involve training a machine learning model to evaluate the performance of different subsets of features. A search algorithm is used to select the subset of features that results in the best model performance.
• Embedded methods: a hybrid of filter and wrapper methods. Feature selection is integrated into the model training process, and features are selected based on their importance in the model.
• Univariate feature selection: involves selecting features based on their individual performance in relation to the target variable. Metrics used for selection include ANOVA and chi-squared.

Feature extraction:
• Involves creating new features by combining or transforming the original features. The goal is to create a set of features that captures the essence of the original data in a lower-dimensional space.
• Methods for feature extraction include principal component analysis (PCA), linear discriminant analysis (LDA) and t-distributed stochastic neighbor embedding (t-SNE).
• PCA is a popular technique that projects the original features onto a lower-dimensional space while preserving as much of the variance as possible.
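A compact sketch of two of these approaches with scikit-learn (the iris dataset and the choice of keeping 2 features are assumptions for the demo): SelectKBest is a filter/univariate method using the ANOVA F-score, and RFE is a wrapper method.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Filter / univariate: score every feature, keep the 2 with the highest ANOVA F-score.
filter_selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
print("univariate F-scores:", np.round(filter_selector.scores_, 1))

# Wrapper: recursive feature elimination around a simple model.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X, y)
print("RFE keeps features:", rfe.support_)
```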
Dimensionality Reduction
How do we measure the amount of information?
• Variance is a measure of how much a variable is spread out. If the variance of a variable (feature) is very low, it does
not tell us much when building a model.
• Variation within the given datasets must be retained as much as possible while doing dimensionality reduction.

The figure shows the distribution of two variables, x and y. x ranges from 1 to 6 while the y values lie between 1 and 2. In this case, x has high variance. If these are the only two features used to predict a target variable, the role of x in the prediction is much higher than that of y.
What are Principal Components?
• New variables that are constructed as linear combinations, or mixtures, of the initial variables.
• The combinations are done in such a way that the new variables (i.e., principal components) are uncorrelated and most of the information within the initial variables is squeezed, or compressed, into the first components.
• So the idea is that 10-dimensional data gives you 10 principal components, but PCA tries to put the maximum possible information in the first component, then the maximum remaining information in the second, and so on, as shown in the scree plot.
• This reduces dimensionality without losing much information: the components with low information are discarded and the remaining components are kept as your new variables.
Principal Components
Example: students’ exam results in ‘Maths’ and ‘Science’.
• Examine how the data points spread in orthogonal directions. The first direction, A, lies along the higher variation, and the second direction, B, lies along the lower variation and is perpendicular to A (as shown in the figure).
• Observe that a 1-unit change in C1 has more impact on “Science” than it does on “Math”, whereas a 1-unit change in C2 affects “Math” more than “Science”. Therefore “Science” holds more information than “Math” for C1, and vice versa for C2. So how can we know which components best explain the variables?
Principal Components
• The first principal component PC1 accounts for the largest possible variance in the data set. It explains the
higher variation on the dataset, so it has more information than others.
• Since all principal components are uncorrelated, the second component PC2 will be perpendicular to PC1.
How PCA constructs Principal Components
• Draw random lines that pass through the center point. Find the distance between each projected point and
the center. The line that gives the highest sum of squared distances will be the PC1
Interpretation of the line representing PC1
• The slope of the PC1 line means that for every 2.54 units of Science we go 1 unit up in Math; so Science is 2.54 times more important than Math for PC1.
• 1 unit of PC1 is equal to 0.93 units of Science and 0.36 units of Math. These scores are called the “loading scores” of PC1.
Interpretation of the line representing PC2
• Since PC2 is perpendicular to PC1, we can easily find the linear combination of “Math” and “Science” for PC2.
• A 1-unit change of PC2 equals −0.36 units of Science and 0.93 units of Math. This shows that “Math” is about 2.5 times more important than “Science” for PC2.

The values [0.93, 0.36] are also known as the eigenvector, or singular vector, of PC1; [−0.36, 0.93] is the eigenvector, or singular vector, of PC2.
PCA - example
Given the data in the table, reduce the dimension from 2 to 1.
Step 1: Calculate the mean of X1 and X2.
Step 2: Calculate the covariance matrix.
PCA - example
Step 3: Calculate the eigenvalues of the covariance matrix.
Step 4: Calculate the eigenvectors. The eigenvector equation reduces to two equations in a parameter t, where t is any real number. Taking t = 1 and normalising gives the unit eigenvector corresponding to λ1; the eigenvector corresponding to λ2 is found similarly.
PCA - example
Step 5: Compute PC1 by projecting each mean-centred sample onto the unit eigenvector corresponding to λ1 (this is shown, for example, for the first sample).

Results of the calculation:
λ1 = 30.3849, λ2 = 6.6151
Explained variance of PC1 = λ1 / (λ1 + λ2) = 30.3849 / (30.3849 + 6.6151) = 0.8212
Explained variance of PC2 = λ2 / (λ1 + λ2) = 6.6151 / (30.3849 + 6.6151) = 0.1788
Overfitting and underfitting
[Illustration with three polynomial fits: a linear model is underfit; a quadratic model is what we need; a higher-degree model captures random errors (overfit).]
Bias and variance in ML models
Low Bias: A low-bias model generally fits the training data well because it can capture complex relationships in the
data. However, this doesn’t guarantee good performance on unseen data. A low-bias model might still overfit the
training data if variance is too high.
High Bias: A model with high bias tends to be overly simplistic, assuming a linear relationship when the data might
be more complex. For example, if you’re using a linear regression model for non-linear data, it could result in high
bias, which leads to underfitting.
Low Variance: A low-variance model tends to generalize well but may underfit the data if its bias is too high. Such a
model may overlook complex relationships within the data.
High Variance: High-variance models are prone to overfitting, as they adapt too closely to the specific data points in
the training set, including noise. This leads to a model that performs well on the training data but struggles with
generalization, meaning it will perform poorly on new data.
Ways to reduce high bias in ML
1. Use a more complex model
One option is to switch to a more complex model that can capture the intricate relationships within the data. However, while a more complex model can reduce bias, it can also introduce higher variance, so it’s essential to monitor the trade-off between bias and variance carefully.
2. Increase the number of features
Sometimes, the issue of high bias arises because the model lacks enough relevant features to accurately represent the underlying
patterns in the data. Including additional, meaningful features allows the model to learn more intricate patterns.
Example: A model predicting house prices might perform better if it considers features like the neighborhood, year built, and number
of bedrooms in addition to square footage.
Feature selection is still important to avoid adding irrelevant data, which might increase the complexity unnecessarily and lead to
overfitting.
3. Reduce regularization
Regularization techniques like L1 and L2 penalties are often used to prevent overfitting by reducing model complexity. However, if
regularization is too strong, it can introduce bias by overly simplifying the model. Adjusting the regularization strength can help:
Reduce regularization slightly to allow the model to capture more patterns from the data.
This adjustment should be carefully balanced to avoid increasing the risk of overfitting.
4. Increase the size of the training data
A larger dataset allows the model to better learn the underlying patterns, reducing bias. When more training data is available, the model
can differentiate between real patterns and random noise more effectively. If obtaining more data is challenging, data augmentation
techniques can be used to artificially expand the dataset.
Example: In image classification tasks, augmenting images by rotating, flipping, or adjusting brightness can provide additional data
points, helping to improve the model’s performance and reduce bias.
Ways to reduce variance in ML
1. Cross validation
Cross-validation is a powerful technique for reducing variance and avoiding overfitting. It works by dividing the dataset into multiple subsets, training the model on some subsets while validating it on others.
K-Fold Cross-Validation is one of the most popular approaches. The data is split into ‘K’ number of folds, and the model is trained and tested
on each fold. This helps in ensuring that the model performs consistently across different subsets of the data, making it less likely to overfit to
any one subset.

2. Feature selection
Reducing the number of irrelevant or noisy features can help decrease variance. Irrelevant features often introduce noise into the model,
causing it to overfit the training data.

Feature selection techniques, such as Recursive Feature Elimination (RFE) and Principal Component Analysis (PCA), can be employed to
identify the most relevant features and remove the unnecessary ones.
By focusing on the key features, the model becomes simpler and less sensitive to noise, which reduces overfitting.

3. Regularization
Regularization techniques such as L1 (Lasso) and L2 (Ridge) penalize overly complex models, discouraging the model from fitting the noise
in the training data. Regularization can be a powerful tool for controlling the complexity of models like neural networks and logistic
regression.
L2 Regularization works by adding a penalty proportional to the square of the magnitude of the coefficients.
L1 Regularization adds a penalty proportional to the absolute value of the coefficients, effectively leading to sparse models by driving some
coefficients to zero.
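A brief sketch contrasting the two penalties with scikit-learn (the synthetic dataset and alpha=1.0 are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic regression data with many uninformative features.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2: shrinks coefficients towards zero
lasso = Lasso(alpha=1.0).fit(X, y)   # L1: drives some coefficients exactly to zero

print("non-zero Ridge coefficients:", int(np.sum(ridge.coef_ != 0)))
print("non-zero Lasso coefficients:", int(np.sum(lasso.coef_ != 0)))
```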
Ways to reduce variance in ML
4. Ensemble methods
Ensemble methods combine predictions from multiple models to reduce overall variance. These models average out individual errors, leading
to a more robust and generalizable solution.
Bagging (e.g., Random Forests) reduces variance by training multiple models on different subsets of the data and averaging their predictions.
Boosting (e.g., Gradient Boosting) reduces both bias and variance by sequentially training models, where each model corrects the errors of its
predecessor.

5. Simplifying the model
In some cases, reducing model complexity by using a simpler algorithm can decrease variance. For example, switching from a complex decision tree to a linear regression model might lead to better generalization, particularly when the dataset is small or the problem is relatively straightforward.
However, simplifying the model too much may introduce bias, so it’s essential to find the right balance.

6. Early stopping
In iterative learning processes, such as training deep learning models, early stopping can be an effective technique to prevent overfitting. By
monitoring the performance of the model on a validation set during training, the process can be halted when the validation error starts
increasing, indicating that the model is beginning to overfit the training data.
This prevents the model from becoming too complex and ensures it generalizes better to unseen data.
Cross Validation Techniques
Cross-validation is a statistical method that estimates how well a trained model will work on unseen data. The model's
efficiency is validated by training it on a subset of input data and testing on a different subset. Cross-validation helps in
building a generalized model. Due to the iterative nature of modeling, cross-validation is useful for both performance
estimation and model selection

Training data set — used to train the model, it can vary but typically we use 60% of the available data for
training.
Validation data set — Once we select a model that performs well on the training data, we run it on the validation data set. This is a subset of the data, usually ranging from 10% to 20%. The validation data set helps provide an unbiased evaluation of the model’s fitness. If the error on the validation dataset increases, then we have an overfitting model.
Test dataset — Also called as holdout data set. This dataset contains data that has never been used in the
training. Test data set helps with final model evaluation. Typically would be 5% to 20% of the dataset.
Sometimes there can be only training and test set and no validation set.
Cross Validation Techniques

• Due to sample variability between the training and test sets, our model may give good predictions on training data but fail to generalize to test data. This leads to a low training error rate but a high test error rate.
• When we split the dataset into training, validation and test sets, we use only a subset of the data, and when we train on fewer observations the model will not perform as well and will overestimate the test error rate relative to a model fit on the entire dataset.

To solve these two issues we use an approach called cross-validation.
Cross Validation Techniques
What is cross-validation?
• Cross-validation is a statistical technique which involves partitioning the data into subsets, training the model on one subset and using another subset to evaluate the model’s performance.
• To reduce variability we perform multiple rounds of cross-validation with different subsets from
the same data. We combine the validation results from these multiple rounds to come up with an
estimate of the model’s predictive performance.
• Cross-validation will give us a more accurate estimate of a model’s performance

Cross-validation techniques
•LOOCV -Leave one out cross-validation
•K Fold
•Stratified cross-validation
•Time series cross-validation
Cross Validation Techniques
Leave one out cross validation — LOOCV
• In LOOCV we divide the data set into two parts. In one
part we have a single observation, which is our test data
and in the other part, we have all the other observations
from the dataset forming our training data.
• If we have a data set with n observations, then the training data contains n−1 observations and the test data contains 1 observation.
• This process is iterated for each data point as shown below. Repeating this process n times generates n Mean Square Errors (MSE).
Cross Validation Techniques

Leave one out cross validation — LOOCV


Advantages of LOOCV
• Far less bias, as we use the entire dataset for training, compared with the validation set approach where only a subset (60% in our example above) of the data is used for training.
• No randomness in the training/test splits, so performing LOOCV multiple times will yield the same results.
Disadvantages of LOOCV
• The MSE will vary because each test set contains a single observation. This can introduce variability; if the data point is an outlier, the variability will be much higher.
• Execution is expensive, as the model has to be fitted n times.
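A minimal LOOCV sketch with scikit-learn (the diabetes dataset and linear regression are stand-ins chosen for the example):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_diabetes(return_X_y=True)

# One model fit per observation: each split tests on a single left-out point.
scores = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(), scoring="neg_mean_squared_error")
print("LOOCV estimate of MSE:", -scores.mean())
```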
Cross Validation Techniques
K-fold cross validation
• This technique involves randomly dividing the dataset into k groups, or folds, of approximately equal size. The first fold is kept for testing and the model is trained on the other k−1 folds.
• The process is repeated k times, and each time a different fold, i.e. a different group of data points, is used for validation.
• As we repeat the process k times, we get k Mean Square Errors MSE_1, MSE_2, …, MSE_k, so the k-fold CV error is computed by taking the average of the MSE over the k folds:
CV error = (1/k) (MSE_1 + MSE_2 + … + MSE_k)
• LOOCV is a variant of k-fold where k = n.
• Typically the value of k in k-fold is 5 or 10. When k is 10, we also refer to it as 10-fold cross-validation.
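A matching 10-fold sketch (same assumed dataset and model as the LOOCV example above):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_diabetes(return_X_y=True)

kf = KFold(n_splits=10, shuffle=True, random_state=0)   # k = 10 folds
scores = cross_val_score(LinearRegression(), X, y,
                         cv=kf, scoring="neg_mean_squared_error")
print("10-fold CV estimate of MSE:", -scores.mean())    # average MSE over the folds
```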
Cross Validation Techniques
K fold cross validation
Advantages of k-fold or 10-fold cross-validation
• Computation time is reduced, as we repeat the process only 10 times when the value of k is 10.
• Reduced bias.
• Every data point gets tested exactly once and is used in training k−1 times.
• The variance of the resulting estimate is reduced as k increases.
Disadvantages of k-fold or 10-fold cross-validation
• The training algorithm is computationally intensive, as the algorithm has to be rerun from scratch k times.
Cross Validation Techniques
Stratified cross-validation
• Stratification is a technique where we rearrange the data in a way that each fold has a good
representation of the whole dataset. It forces each fold to have at least m instances of each class.
This approach ensures that one class of data is not overrepresented especially when the target
variable is unbalanced.
• For example, in a binary classification problem where we want to predict whether a passenger on the Titanic survived or not, we have two classes: the passenger either survived or did not survive. We ensure that each fold has a percentage of passengers that survived and a percentage of passengers that did not survive.
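A small sketch showing stratified folds on an unbalanced synthetic problem (the class weights are made up for illustration); each fold keeps roughly the same class proportions as the whole dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

# Unbalanced binary problem: roughly 10% positives.
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    print(f"fold {fold}: positive rate in test = {y[test_idx].mean():.2f}")
```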
Cross Validation Techniques
Time series cross-validation
• Splitting time series data randomly does not help as the time-related data will be messed up. If
we are working on predicting stock prices and if we randomly split the data then it will not help.
Hence we need a different approach for performing cross-validation.
• For time series cross-validation we use forward chaining, also referred to as rolling-origin: the origin at which the forecast is based rolls forward in time.
• In time series cross-validation each day serves as test data, and we consider the previous days’ data as the training set.
Cross Validation Techniques
Time series cross-validation
• D1, D2, D3 etc. are each day’s data and days highlighted in blue are used for training and days
highlighted in yellow are used for test.

we start training the model with a minimum number of observations and use the next day's data to test the
model and we keep moving through the data set. This ensures that we consider the time series aspect of the
data for prediction.
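A short forward-chaining sketch with scikit-learn's TimeSeriesSplit (the 12 "days" are dummy data; test_size=1 makes each test set a single day, mirroring the description above):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)   # stands in for daily data D1..D12

tscv = TimeSeriesSplit(n_splits=5, test_size=1)
for train_idx, test_idx in tscv.split(X):
    # Training always uses only the days that come before the test day.
    print("train:", train_idx, "-> test:", test_idx)
```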
