PART2
• Clustering is a technique used in machine learning and data analysis to group similar objects or data points together
based on their inherent characteristics or patterns.
• It is an unsupervised learning method, meaning that it does not rely on labeled data but instead aims to discover
patterns and relationships within the data itself.
• The goal of clustering is to partition a dataset into groups, known as clusters, such that the objects within each cluster
are more similar to each other than to those in other clusters. The similarity between objects is typically measured
using distance metrics, such as Euclidean distance or cosine similarity, depending on the nature of the data.
Uses of Clustering
Market Segmentation – Businesses use clustering to group their customers and run targeted advertisements for each segment to attract a larger audience.
Market Basket Analysis – Shop owners analyze their sales to figure out which items are most often bought together by customers. For example, a frequently cited US study found that diapers and beer were often bought together by fathers.
Social Network Analysis – Social media sites use your data to understand your browsing behaviour and provide you with
targeted friend recommendations or content recommendations.
Medical Imaging – Doctors use clustering to identify diseased regions in diagnostic images such as X-rays.
Anomaly Detection – Clustering can be used to find outliers in a real-time data stream or to flag potentially fraudulent transactions.
Simplify working with large datasets – Once clustering is complete, each cluster is given a cluster ID, and an observation’s entire feature set can then be summarised by that single ID. Clustering is effective when a complicated case can be represented by a straightforward cluster ID, and applying the same principle makes complex datasets simpler to work with.
Types of Clustering
1. Centroid-based Clustering (Partitioning methods)
• Groups data points on the basis of their closeness. Typical similarity measures for these algorithms are Euclidean distance, Manhattan distance or Minkowski distance.
• Example: K-means clustering
• The primary drawback of these algorithms is the requirement to establish the number of clusters, “k,” either intuitively or scientifically.
2. Density-based Clustering (Model-based methods)
• Finds groups based on the density of data points.
• Determines the number of clusters automatically.
• Ideally suited for datasets with irregularly shaped or overlapping clusters.
• Unlike centroid-based methods, which require a preset number of clusters and are highly sensitive to the initial positioning of centroids (so their outcomes can vary), density-based methods do not depend on these choices.
• Example: DBSCAN
Types of Clustering
3. Connectivity-based Clustering (Hierarchical clustering)
• Each data point is initially treated as a separate cluster; the most similar clusters are then merged step by step until one large cluster containing all of the data points is formed.
4. Distribution-based Clustering
• Data elements are grouped according to probabilities derived from statistical distributions: data objects with a higher likelihood of belonging to a cluster’s distribution are included in that cluster. Every cluster has a central point, and the further a data point lies from that centre, the less likely it is to be included in the cluster.
• Example: Gaussian Mixture Models
K-means Clustering
The k-means clustering algorithm mainly performs two tasks: it determines the best positions for the K centroids, and it assigns each data point to its closest centroid.
Algorithm:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as centroids. (They need not come from the input dataset.)
Step-3: Assign each data point to its closest centroid, which forms the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, i.e. reassign each data point to the new closest centroid of each cluster.
Step-6: If any reassignment occurred, go back to Step-4; otherwise go to FINISH.
Step-7: The model is ready.
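A minimal sketch of these steps using scikit-learn (the library choice, the toy points and the parameter values are illustrative assumptions, not part of the original slides):

```python
# Minimal K-means sketch with scikit-learn on a few 2-D toy points.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[2, 10], [2, 5], [8, 4], [5, 8],
              [7, 5], [6, 4], [1, 2], [4, 9]])

# Steps 1-2: choose K and initial centroids; steps 3-6 run inside fit().
# Note: scikit-learn uses k-means++ initialization by default rather than
# purely random centroids.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print("Cluster labels:", labels)
print("Final centroids:\n", kmeans.cluster_centers_)
```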
K-means Clustering
[Figure: step-by-step illustration of K-means with K = 2 – choose K random points as centroids, assign each data point to its closest centroid, compute the centre of gravity of each group to obtain new centroids, draw a new median line and reassign the data, and repeat until no dissimilar points lie on either side of the line; the assumed centroids are then removed, the final clusters are formed and the model is ready.]
K-means Clustering
Example
Use the k-means algorithm and Euclidean distance to cluster the following 8 examples into 3 clusters:
A1=(2,10), A2=(2,5), A3=(8,4), A4=(5,8), A5=(7,5), A6=(6,4), A7=(1,2), A8=(4,9).
Suppose that the initial seeds (centers of each cluster) are A1, A4 and A7. Run the k-means algorithm for 1 epoch
only.
At the end of this epoch show:
a) The new clusters (i.e. the examples belonging to each cluster)
b) The centers of the new clusters
c) Draw a 10 by 10 space with all the 8 points and show the clusters after the first epoch and the new centroids.
d) How many more iterations are needed to converge? Draw the result for each epoch.
K-means Clustering
Solution:
The distance matrix based on the Euclidean distance, d(a, b) = √((x_a − x_b)² + (y_a − y_b)²), is given below.
K-means Clustering
Epoch 1: Start
Data point | Distance to seed 1 (A1) | Distance to seed 2 (A4) | Distance to seed 3 (A7) | Cluster label
A1 | 0 | 3.61 | 8.06 | Cluster 1
A2 | 5 | 4.24 | 3.16 | Cluster 3
A3 | 8.49 | 5 | 7.28 | Cluster 2
A4 | 3.61 | 0 | 7.21 | Cluster 2
A5 | 7.07 | 3.61 | 6.71 | Cluster 2
A6 | 7.21 | 4.12 | 5.39 | Cluster 2
A7 | 8.06 | 7.21 | 0 | Cluster 3
A8 | 2.24 | 1.41 | 7.62 | Cluster 2

New clusters:
Cluster 1: {A1}
Cluster 2: {A3, A4, A5, A6, A8}
Cluster 3: {A2, A7}

Centers of the new clusters:
C1 = (2, 10)
C2 = ((8+5+7+6+4)/5, (4+8+5+4+9)/5) = (6, 6)
C3 = ((2+1)/2, (5+2)/2) = (1.5, 3.5)
K-means Clustering
After the 2nd epoch the results would be:
Cluster 1: {A1, A8}, Cluster 2: {A3, A4, A5, A6}, Cluster 3: {A2, A7}
with centers C1 = (3, 9.5), C2 = (6.5, 5.25) and C3 = (1.5, 3.5).

After the 3rd epoch the results would be:
Cluster 1: {A1, A4, A8}, Cluster 2: {A3, A5, A6}, Cluster 3: {A2, A7}
with centers C1 = (3.66, 9), C2 = (7, 4.33) and C3 = (1.5, 3.5).
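The first epoch above can be checked with a short NumPy sketch (only the assignment and centroid-update steps; the array names are mine):

```python
import numpy as np

A = np.array([[2, 10], [2, 5], [8, 4], [5, 8],
              [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)  # A1..A8
seeds = A[[0, 3, 6]]                        # initial seeds A1, A4, A7

# Epoch 1: assign each point to its nearest seed (Euclidean distance).
dists = np.linalg.norm(A[:, None, :] - seeds[None, :, :], axis=2)
labels = dists.argmin(axis=1)               # 0 -> cluster 1, 1 -> cluster 2, 2 -> cluster 3
print("Assignments:", labels + 1)

# Update step: the new centroid is the mean of the points in each cluster.
new_centers = np.array([A[labels == k].mean(axis=0) for k in range(3)])
print("New centers:\n", new_centers)        # expected: (2, 10), (6, 6), (1.5, 3.5)
```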
K-means Clustering
Elbow Method
The Elbow method is one of the most popular ways to find the optimal number of clusters. This method uses the concept of
WCSS value. WCSS stands for Within Cluster Sum of Squares, which defines the total variations within a cluster. The
formula to calculate the value of WCSS (for 3 clusters) is given below:
WCSS = Σ(Pi in Cluster1) distance(Pi, C1)² + Σ(Pi in Cluster2) distance(Pi, C2)² + Σ(Pi in Cluster3) distance(Pi, C3)²
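In practice the WCSS values are computed in a loop over candidate K values; a sketch using scikit-learn, whose `inertia_` attribute is its name for the WCSS of a fitted model:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[2, 10], [2, 5], [8, 4], [5, 8],
              [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)

# Compute WCSS for K = 1..6 and look for the "elbow" in the curve.
wcss = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)
print(wcss)  # plot K vs. WCSS and pick the K where the curve bends
```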
MinPts (a DBSCAN parameter): the minimum number of neighbours (data points) within the eps radius. The larger the dataset, the larger the value of MinPts that should be chosen. As a general rule, the minimum MinPts can be derived from the number of dimensions D in the dataset as MinPts >= D + 1, and MinPts should be chosen to be at least 3.
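A minimal DBSCAN sketch with scikit-learn (the toy data, eps and min_samples values are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups of points plus one isolated outlier.
X = np.array([[1, 1], [1.2, 0.8], [0.9, 1.1],
              [8, 8], [8.1, 7.9], [7.9, 8.2],
              [4, 0]])

# eps: neighbourhood radius; min_samples: MinPts (>= D + 1, here D = 2).
db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)   # -1 marks noise/outlier points
```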
Example: Market basket analysis using association rules
• This analysis might tell a retailer that customers often purchase shampoo and conditioner together, so putting
both items on promotion at the same time would not create a significant increase in revenue, while a promotion
involving just one of the items would likely drive sales of the other.
• The outcome of this type of technique is, in simple terms, a set of rules that can be understood as “if this, then
that”.
Support is an indication of how frequently the item set appears in the data set.
Confidence is an indication of how often the rule has been found to be true.
Lift is the ratio of the observed support to the support expected if X and Y were independent. Greater lift values indicate stronger associations.
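As an illustration, the three measures can be computed directly from transaction counts; the tiny basket data and helper function below are made up for the sketch:

```python
# Support, confidence and lift for the rule {shampoo} -> {conditioner},
# computed on a tiny made-up set of transactions.
transactions = [
    {"shampoo", "conditioner"},
    {"shampoo", "conditioner", "soap"},
    {"shampoo"},
    {"soap"},
    {"conditioner", "soap"},
]
n = len(transactions)

def support(itemset):
    # Fraction of transactions that contain every item in the itemset.
    return sum(itemset <= t for t in transactions) / n

supp_xy = support({"shampoo", "conditioner"})     # 2/5 = 0.4
conf = supp_xy / support({"shampoo"})             # 0.4 / 0.6 ~ 0.67
lift = conf / support({"conditioner"})            # 0.67 / 0.6 ~ 1.11
print(supp_xy, conf, lift)
```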
Association Rule Mining
Example of association rules
[Table: example association rules with their confidence and lift values.]
Association Rule Mining
Conviction
It can be interpreted as the ratio of the expected frequency that X occurs without Y if X and Y were independent divided
by the observed frequency of incorrect predictions. A high value means that the consequent depends strongly on the
antecedent.
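In symbols, using the support and confidence measures defined earlier (this is the standard formulation, stated here for reference rather than taken from these slides):

conviction(X ⇒ Y) = (1 − supp(Y)) / (1 − conf(X ⇒ Y))

A rule that is always correct (confidence 1) has infinite conviction, while independent X and Y give a conviction of 1.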
Association Rule Mining
Apriori algorithm generates association rules for a given data set. Association rules are usually required to satisfy a
user-specified minimum support and a user-specified minimum confidence at the same time.
Frequent item set – an item set whose support is greater than or equal to a minsup threshold.
Given a set of transactions T, the goal of association rule mining is to find all rules having
• support >= minsup threshold
• confidence >= minconf threshold
Two-step approach:
1. Frequent item set generation (generate all item sets whose support >= minsup threshold)
2. Rule generation (generate high-confidence rules from each frequent item set, where each rule is a binary partitioning of a frequent item set and confidence >= minconf threshold)
Apriori Algorithm
1. Candidate item sets are generated using only the large item
sets of the previous pass without considering the
transactions in the database.
2. The large item set of the previous pass is joined with itself
to generate all item sets whose size is higher by 1.
3. Each generated item set that has a subset which is not
large is deleted. The remaining item sets are the candidate
ones.
Association Rule Mining
The Apriori algorithm is an influential algorithm for mining frequent itemsets for Boolean association rules.
Key concepts:
• Frequent Itemsets: the sets of items which have minimum support (denoted by Li for the ith itemset).
• Apriori Property: any subset of a frequent itemset must be frequent.
• Join Operation: to find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself.
• Pseudo-code:
  Join Step: Ck is generated by joining Lk-1 with itself.
  Prune Step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.
  Ck: candidate itemset of size k
  Lk: frequent itemset of size k
  L1 = {frequent items};
  for (k = 1; Lk != ∅; k++) do begin
      Ck+1 = candidates generated from Lk;
      for each transaction t in the database do
          increment the count of all candidates in Ck+1 that are contained in t;
      Lk+1 = candidates in Ck+1 with min_support;
  end
  return ∪k Lk;
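A compact Python sketch of the join-and-prune loop described above (the toy transactions and the minimum support count are assumptions, not the slides' example):

```python
from itertools import combinations

transactions = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"},
                {"B", "C"}, {"A", "B", "C", "D"}]
min_support = 3  # absolute count

def frequent_itemsets(transactions, min_support):
    # L1: frequent 1-itemsets
    items = {i for t in transactions for i in t}
    Lk = {frozenset([i]) for i in items
          if sum(i in t for t in transactions) >= min_support}
    result, k = set(Lk), 1
    while Lk:
        # Join step: candidates of size k+1 built from Lk.
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # Prune step: drop candidates with an infrequent k-subset.
        Ck = {c for c in Ck if all(frozenset(s) in Lk for s in combinations(c, k))}
        # Count support and keep only the frequent candidates.
        Lk = {c for c in Ck if sum(c <= t for t in transactions) >= min_support}
        result |= Lk
        k += 1
    return result

print(frequent_itemsets(transactions, min_support))
```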
1 unit of PC1 is equal to 0.93 units of “science” and 0.36 units of Math.
These scores are called “loading scores” of PC1
Interpretation of the line representing PC1
• Since PC2 is perpendicular to PC1, we can easily find the linear combination of “Math” and “Science” for
PC2
• 1 unit change of PC2 equals -0.36 units of Science and 0.93 units of Math. It shows that “Math” is 2.5 times
more important than “Science” for PC2.
Taking t = 1, the unit eigenvector corresponding to λ1 is (0.93, 0.36), i.e. the loading scores of PC1 quoted above.
Similarly, the eigenvector corresponding to λ2 is (−0.36, 0.93).
PCA - example
Step 5: Compute PC1
λ1 = 30.3849, λ2 = 6.6151
Explained variance of PC1 = λ1 / (λ1 + λ2) = 30.3849 / (30.3849 + 6.6151) = 0.8212
Explained variance of PC2 = λ2 / (λ1 + λ2) = 6.6151 / (30.3849 + 6.6151) = 0.1788
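The loading scores and explained-variance ratios can be reproduced with scikit-learn; the sketch below uses invented Science/Math marks, so its exact numbers will differ from the worked example above:

```python
import numpy as np
from sklearn.decomposition import PCA

# Two correlated features: "Science" and "Math" marks (made-up data).
rng = np.random.default_rng(0)
science = rng.normal(70, 10, 100)
math = 0.8 * science + rng.normal(0, 4, 100)
X = np.column_stack([science, math])

pca = PCA(n_components=2).fit(X)
print("Loading scores (rows = PCs):\n", pca.components_)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```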
Overfitting and underfitting
[Figure: three models fitted to the same data – a linear model is underfit, a quadratic model is what we need, and a higher-degree polynomial captures random errors (overfit).]
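A sketch of the same idea in code (the data and the polynomial degrees are assumptions chosen to show underfitting, a good fit and overfitting):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-3, 3, 40))
y = 0.5 * x**2 + rng.normal(0, 0.5, 40)          # truly quadratic data + noise
X = x.reshape(-1, 1)

for degree in (1, 2, 15):                        # underfit, good fit, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"degree={degree:2d}  mean CV R^2={score:.3f}")
```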
Bias and variance in ML models
Low Bias: A low-bias model generally fits the training data well because it can capture complex relationships in the
data. However, this doesn’t guarantee good performance on unseen data. A low-bias model might still overfit the
training data if variance is too high.
High Bias: A model with high bias tends to be overly simplistic, assuming a linear relationship when the data might
be more complex. For example, if you’re using a linear regression model for non-linear data, it could result in high
bias, which leads to underfitting.
Low Variance: A low-variance model tends to generalize well but may underfit the data if its bias is too high. Such a
model may overlook complex relationships within the data.
High Variance: High-variance models are prone to overfitting, as they adapt too closely to the specific data points in
the training set, including noise. This leads to a model that performs well on the training data but struggles with
generalization, meaning it will perform poorly on new data.
Ways to reduce high bias in ML
1. Use a more complex model
Use a more complex model that can capture the intricate relationships within the data. However, while a more complex model can reduce bias, it can also introduce higher variance, so it’s essential to monitor the trade-off between bias and variance carefully.
2. Increase the number of features
Sometimes, the issue of high bias arises because the model lacks enough relevant features to accurately represent the underlying
patterns in the data. Including additional, meaningful features allows the model to learn more intricate patterns.
Example: A model predicting house prices might perform better if it considers features like the neighborhood, year built, and number
of bedrooms in addition to square footage.
Feature selection is still important to avoid adding irrelevant data, which might increase the complexity unnecessarily and lead to
overfitting.
3. Reduce regularization
Regularization techniques like L1 and L2 penalties are often used to prevent overfitting by reducing model complexity. However, if
regularization is too strong, it can introduce bias by overly simplifying the model. Adjusting the regularization strength can help:
Reduce regularization slightly to allow the model to capture more patterns from the data.
This adjustment should be carefully balanced to avoid increasing the risk of overfitting.
4. Increase the size of the training data
A larger dataset allows the model to better learn the underlying patterns, reducing bias. When more training data is available, the model
can differentiate between real patterns and random noise more effectively. If obtaining more data is challenging, data augmentation
techniques can be used to artificially expand the dataset.
Example: In image classification tasks, augmenting images by rotating, flipping, or adjusting brightness can provide additional data
points, helping to improve the model’s performance and reduce bias.
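A brief sketch combining ideas 1-3 above (hypothetical data and pipeline; the alpha values and polynomial degree are assumptions):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, (200, 1))
y = np.sin(2 * X[:, 0]) + rng.normal(0, 0.1, 200)    # non-linear target (made up)

# High-bias baseline: a plain linear fit with strong regularization.
simple = make_pipeline(StandardScaler(), Ridge(alpha=100.0))
# Reduced bias: richer features (a more complex model) and weaker regularization.
richer = make_pipeline(PolynomialFeatures(5, include_bias=False),
                       StandardScaler(), Ridge(alpha=1.0))

for name, model in [("high bias", simple), ("reduced bias", richer)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```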
Ways to reduce variance in ML
1. Cross validation
Cross-validation is a powerful technique for reducing variance and avoiding overfitting. It works by dividing the dataset into multiple subsets, training the model on some subsets while validating it on others.
K-Fold Cross-Validation is one of the most popular approaches. The data is split into ‘K’ number of folds, and the model is trained and tested
on each fold. This helps in ensuring that the model performs consistently across different subsets of the data, making it less likely to overfit to
any one subset.
2. Feature selection
Reducing the number of irrelevant or noisy features can help decrease variance. Irrelevant features often introduce noise into the model,
causing it to overfit the training data.
Feature selection techniques, such as Recursive Feature Elimination (RFE) and Principal Component Analysis (PCA), can be employed to
identify the most relevant features and remove the unnecessary ones.
By focusing on the key features, the model becomes simpler and less sensitive to noise, which reduces overfitting.
3. Regularization
Regularization techniques such as L1 (Lasso) and L2 (Ridge) penalize overly complex models, discouraging the model from fitting the noise
in the training data. Regularization can be a powerful tool for controlling the complexity of models like neural networks and logistic
regression.
L2 Regularization works by adding a penalty proportional to the square of the magnitude of the coefficients.
L1 Regularization adds a penalty proportional to the absolute value of the coefficients, effectively leading to sparse models by driving some
coefficients to zero.
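A short sketch combining K-Fold cross-validation with L1/L2 regularization (the synthetic data and alpha values are assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 30))                        # many features, few informative
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 1, 100)

# K-Fold CV gives a variance-aware performance estimate; L1/L2 penalties shrink
# the coefficients and reduce sensitivity to noise in the training data.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
for name, model in [("OLS", LinearRegression()),
                    ("Ridge (L2)", Ridge(alpha=10.0)),
                    ("Lasso (L1)", Lasso(alpha=0.1))]:
    print(name, cross_val_score(model, X, y, cv=cv).mean())
```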
Ways to reduce variance in ML
4. Ensemble methods
Ensemble methods combine predictions from multiple models to reduce overall variance. These models average out individual errors, leading
to a more robust and generalizable solution.
Bagging (e.g., Random Forests) reduces variance by training multiple models on different subsets of the data and averaging their predictions.
Boosting (e.g., Gradient Boosting) reduces both bias and variance by sequentially training models, where each model corrects the errors of its
predecessor.
5. Simplifying the model
In some cases, reducing model complexity by using a simpler algorithm can decrease variance. For example, switching from a complex decision tree to a linear regression model might lead to better generalization, particularly when the dataset is small or the problem is relatively straightforward.
However, simplifying the model too much may introduce bias, so it’s essential to find the right balance.
6. Early stopping
In iterative learning processes, such as training deep learning models, early stopping can be an effective technique to prevent overfitting. By
monitoring the performance of the model on a validation set during training, the process can be halted when the validation error starts
increasing, indicating that the model is beginning to overfit the training data.
This prevents the model from becoming too complex and ensures it generalizes better to unseen data.
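A minimal early-stopping sketch with scikit-learn's gradient boosting (the dataset and the stopping parameters are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Early stopping: hold out 10% as a validation set and stop once the
# validation score has not improved for 10 consecutive iterations.
gb = GradientBoostingClassifier(n_estimators=500,
                                validation_fraction=0.1,
                                n_iter_no_change=10,
                                random_state=0).fit(X, y)
print("Boosting rounds actually used:", gb.n_estimators_)
```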
Cross Validation Techniques
Cross-validation is a statistical method that estimates how well a trained model will work on unseen data. The model's
efficiency is validated by training it on a subset of input data and testing on a different subset. Cross-validation helps in
building a generalized model. Due to the iterative nature of modeling, cross-validation is useful for both performance
estimation and model selection
Training data set – used to train the model; the proportion can vary, but typically about 60% of the available data is used for training.
Validation data set – once we select a model that performs well on the training data, we run it on the validation data set. This subset usually ranges from 10% to 20% of the data and provides an unbiased evaluation of the model’s fitness. If the error on the validation dataset increases, the model is overfitting.
Test dataset – also called the holdout data set. It contains data that has never been used in training and is used for the final model evaluation. It is typically 5% to 20% of the dataset.
Sometimes there are only training and test sets and no validation set.
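A sketch of such a 60/20/20 split using scikit-learn's train_test_split applied twice (the Iris data is just a convenient stand-in):

```python
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# 60% train, 20% validation, 20% test (typical, not mandatory, proportions).
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)
print(len(X_train), len(X_val), len(X_test))   # 90 / 30 / 30 for the 150 Iris samples
```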
Cross Validation Techniques
• Due to sample variability between the training and test sets, our model may give better predictions on training data but fail to generalize on test data. This leads to a low training error rate but a high test error rate.
• When we split the dataset into training, validation and test sets, we train on only a subset of the data, and a model trained on fewer observations tends to perform worse; as a result, the test error rate for a model fit on the entire dataset tends to be overestimated.
Cross-validation techniques
• LOOCV – Leave-one-out cross-validation
• K-Fold cross-validation
• Stratified cross-validation
• Time series cross-validation
Cross Validation Techniques
Leave one out cross validation — LOOCV
• In LOOCV we divide the data set into two parts. In one
part we have a single observation, which is our test data
and in the other part, we have all the other observations
from the dataset forming our training data.
• If we have a data set with n observations, then the training data contains n−1 observations and the test data contains 1 observation.
• This process is iterated for each data point. Repeating it n times generates n values of the Mean Squared Error (MSE).
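A minimal LOOCV sketch with scikit-learn (the diabetes dataset and linear regression are stand-ins chosen for the example):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_diabetes(return_X_y=True)

# One observation is held out as the test set in each of the n iterations.
loo = LeaveOneOut()
scores = cross_val_score(LinearRegression(), X, y,
                         cv=loo, scoring="neg_mean_squared_error")
print("Number of fits:", len(scores))          # n fits, one MSE per held-out point
print("LOOCV estimate of MSE:", -scores.mean())
```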
Cross Validation Techniques
Time series cross-validation – we start training the model with a minimum number of observations, use the next day's data to test the model, and keep moving forward through the data set. This ensures that we account for the time series aspect of the data when making predictions.
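A minimal sketch with scikit-learn's TimeSeriesSplit showing the growing training window (the ten dummy observations are an assumption):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)               # 10 time-ordered observations

# The training window grows forward in time; the test fold is always "the next" data.
tscv = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in tscv.split(X):
    print("train:", train_idx, "test:", test_idx)
```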