Unit - 1-1

Unsupervised Learning

In artificial intelligence, machine learning that takes place in the absence of human supervision is
known as unsupervised machine learning. In contrast to supervised learning, unsupervised machine
learning models are given unlabeled data and are left to discover patterns and insights on their
own, without explicit direction or instruction.

Unsupervised machine learning analyzes and clusters unlabeled datasets using machine learning
algorithms. These algorithms find hidden patterns in the data without any human intervention, i.e., we
do not give any output values to our model. The training data contains only input parameter values,
and the model discovers the groups or patterns on its own.


How does unsupervised learning work?

Unsupervised learning works by analyzing unlabeled data to identify patterns and relationships. The
data is not labeled with any predefined categories or outcomes, so the algorithm must find these
patterns and relationships on its own. This can be a challenging task, but it can also be very
rewarding, as it can reveal insights into the data that would not be apparent from a labeled dataset.

The dataset in Figure A is mall data containing information about the clients who subscribe to the mall.
Once subscribed, a client is issued a membership card, so the mall has complete information about
the customer and his/her every purchase. Using this data and unsupervised learning techniques, the
mall can easily group clients based on the parameters we feed in.
The input to the unsupervised learning models is as follows:

• Unstructured data: may contain noisy (meaningless) data, missing values, or unknown data.

• Unlabeled data: contains only values for the input parameters; there is no target value (output).
Such data is easier to collect than the labeled data used in the supervised approach.

Unsupervised Learning Algorithms

There are mainly three types of algorithms used for unsupervised datasets:

• Clustering

• Association Rule Learning

• Dimensionality Reduction

Clustering

Clustering in unsupervised machine learning is the process of grouping unlabeled data into clusters
based on their similarities. The goal of clustering is to identify patterns and relationships in the data
without any prior knowledge of the data’s meaning.

Broadly, this technique is applied to group data based on different patterns, such as the similarities or
differences that the model finds. These algorithms process raw, unclassified data objects into groups.
For example, in the figure above we have not given any output parameter values, so this technique
would be used to group clients based on the input parameters provided by our data.

Some common clustering algorithms

• K-means Clustering: Partitioning Data into K Clusters

• Hierarchical Clustering: Building a Hierarchical Structure of Clusters


• Density-Based Clustering (DBSCAN): Identifying Clusters Based on Density

• Mean-Shift Clustering: Finding Clusters Based on Mode Seeking

• Spectral Clustering: Utilizing Spectral Graph Theory for Clustering
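As a quick illustration, here is a minimal sketch of clustering unlabeled data with scikit-learn, assuming the library is installed; the blob data is synthetic and the choice of 3 clusters is only for this example.

# Minimal sketch: clustering unlabeled points with K-means in scikit-learn.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Synthetic, unlabeled 2-D points (the generated labels are discarded).
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Group the points into 3 clusters using only their coordinates.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(labels[:10])              # cluster id assigned to the first 10 points
print(kmeans.cluster_centers_)  # coordinates of the 3 cluster centres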

Association Rule Learning

Association rule learning, also known as association rule mining, is a common technique used to
discover associations in unsupervised machine learning. It is a rule-based ML technique that finds
useful relations between parameters of a large dataset. It is mainly used for market basket analysis,
which helps to better understand the relationships between different products. For example, shopping
stores use algorithms based on this technique to find the relationship between the sales of one product
and another based on customer behaviour: if a customer buys milk, they may also buy bread, eggs, or
butter. Once trained well, such models can be used to increase sales by planning targeted offers.

• Apriori Algorithm: A Classic Method for Rule Induction

• FP-Growth Algorithm: An Efficient Alternative to Apriori

• Eclat Algorithm: Exploiting Closed Itemsets for Efficient Rule Mining

• Efficient Tree-based Algorithms: Handling Large Datasets with Scalability

Dimensionality Reduction

Dimensionality reduction is the process of reducing the number of features in a dataset while
preserving as much information as possible. This technique is useful for improving the performance
of machine learning algorithms and for data visualization. Examples of dimensionality reduction
algorithms include:

• Principal Component Analysis (PCA): Linear Transformation for Reduced Dimensions

• Linear Discriminant Analysis (LDA): Dimensionality Reduction for Discrimination

• Non-negative Matrix Factorization (NMF): Decomposing Data into Non-negative Components

• Locally Linear Embedding (LLE): Preserving Local Geometry in Reduced Dimensions

• Isomap: Capturing Global Relationships in Reduced Dimensions
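To make the idea concrete, here is a minimal sketch of PCA with scikit-learn, assuming the library is installed; the Iris dataset is used only as a convenient 4-feature example.

# Minimal sketch: reducing 4 features to 2 principal components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                  # 150 samples x 4 features
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)      # 150 samples x 2 features

print(X_reduced.shape)
print(pca.explained_variance_ratio_)  # share of variance kept by each component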

Challenges of Unsupervised Learning

Here are the key challenges of unsupervised learning:

• Evaluation: Assessing the performance of unsupervised learning algorithms is difficult without predefined labels or categories.

• Interpretability: Understanding the decision-making process of unsupervised learning models is often challenging.

• Overfitting: Unsupervised learning algorithms can overfit to the specific dataset used for training, limiting their ability to generalize to new data.

• Data quality: Unsupervised learning algorithms are sensitive to the quality of the input data. Noisy or incomplete data can lead to misleading or inaccurate results.

• Computational complexity: Some unsupervised learning algorithms, particularly those dealing with high-dimensional data or large datasets, can be computationally expensive.

Advantages of Unsupervised learning

• No labeled data required: Unlike supervised learning, unsupervised learning does not
require labeled data, which can be expensive and time-consuming to collect.

• Can uncover hidden patterns: Unsupervised learning algorithms can identify patterns and
relationships in data that may not be obvious to humans.

• Can be used for a variety of tasks: Unsupervised learning can be used for a variety of
tasks, such as clustering, dimensionality reduction, and anomaly detection.

• Can be used to explore new data: Unsupervised learning can be used to explore new data
and gain insights that may not be possible with other methods.

Disadvantages of Unsupervised learning

• Difficult to evaluate: It can be difficult to evaluate the performance of unsupervised learning algorithms, as there are no predefined labels or categories against which to compare results.

• Can be difficult to interpret: It can be difficult to understand the decision-making process of unsupervised learning models.

• Can be sensitive to the quality of the data: Unsupervised learning algorithms can be sensitive to the quality of the input data. Noisy or incomplete data can lead to misleading or inaccurate results.

• Can be computationally expensive: Some unsupervised learning algorithms, particularly those dealing with high-dimensional data or large datasets, can be computationally expensive.

Applications of Unsupervised learning

• Customer segmentation: Unsupervised learning can be used to segment customers into groups based on their demographics, behavior, or preferences. This can help businesses better understand their customers and target them with more relevant marketing campaigns.

• Fraud detection: Unsupervised learning can be used to detect fraud in financial data by identifying transactions that deviate from the expected patterns. This can help prevent fraud by flagging these transactions for further investigation.

• Recommendation systems: Unsupervised learning can be used to recommend items to users based on their past behavior or preferences. For example, a recommendation system might use unsupervised learning to identify users who have similar taste in movies, and then recommend movies that those users have enjoyed.

• Natural language processing (NLP): Unsupervised learning is used in a variety of NLP tasks, including topic modeling, document clustering, and part-of-speech tagging.

• Image analysis: Unsupervised learning is used in a variety of image analysis tasks, including image segmentation, object detection, and image pattern recognition.

What is Supervised learning?

Supervised learning is a type of machine learning algorithm that learns from labeled data. Labeled
data is data that has been tagged with a correct answer or classification.

Supervised learning, as the name indicates, involves a supervisor acting as a teacher. We teach or
train the machine using data that is well labelled, which means some data is already tagged with the
correct answer. The machine is then provided with a new set of examples (data), and the supervised
learning algorithm analyses the training data (the set of training examples) and produces a correct
outcome from the labeled data.

For example, a labeled dataset of images of elephants, camels and cows would have each image tagged
with either "Elephant", "Camel" or "Cow".

Key Points:

• Supervised learning involves training a machine from labeled data.

• Labeled data consists of examples with the correct answer or classification.

• The machine learns the relationship between inputs (e.g., the animal images) and outputs (their labels).

• The trained machine can then make predictions on new, unlabeled data.
Supervised vs. Unsupervised Machine Learning

Parameters | Supervised machine learning | Unsupervised machine learning
Input Data | Algorithms are trained using labeled data. | Algorithms are used against data that is not labeled.
Computational Complexity | Simpler method. | Computationally complex.
Accuracy | Highly accurate. | Less accurate.
No. of classes | No. of classes is known. | No. of classes is not known.
Data Analysis | Uses offline analysis. | Uses real-time analysis of data.
Algorithms used | Linear and Logistic regression, KNN, Random forest, multi-class classification, decision tree, Support Vector Machine, Neural Network, etc. | K-Means clustering, Hierarchical clustering, Apriori algorithm, etc.
Output | Desired output is given. | Desired output is not given.
Training data | Uses training data to infer the model. | No training data is used.
Complex model | It is not possible to learn larger and more complex models than with supervised learning. | It is possible to learn larger and more complex models with unsupervised learning.
Model | We can test our model. | We cannot test our model.
Called as | Supervised learning is also called classification. | Unsupervised learning is also called clustering.
Example | Optical character recognition. | Finding a face in an image.
Supervision | Supervised learning needs supervision to train the model. | Unsupervised learning does not need any supervision to train the model.

What is Clustering ?

The task of grouping data points based on their similarity with each other is called Clustering or
Cluster Analysis. This method is defined under the branch of Unsupervised Learning, which aims at
gaining insights from unlabelled data points, that is, unlike supervised learning we don’t have a
target variable.

Clustering aims at forming groups of homogeneous data points from a heterogeneous dataset. It
evaluates similarity using a metric such as Euclidean distance, cosine similarity, or Manhattan
distance, and then groups the points with the highest similarity together.

For example, in the graph given below, we can clearly see three circular clusters forming on the basis
of distance.

It is not necessary that the clusters formed be circular in shape; the shape of clusters can be
arbitrary, and there are many algorithms that work well at detecting arbitrarily shaped clusters.
For example, in the graph given below, the clusters formed are not circular in shape.

Types of Clustering

Broadly speaking, there are 2 types of clustering that can be performed to group similar data points:

• Hard Clustering: In this type of clustering, each data point either belongs to a cluster completely or
not at all. For example, say there are 4 data points and we have to cluster them into 2 clusters;
each data point will then belong either to cluster 1 or to cluster 2.

Data Points Clusters

A C1

B C2

C C2

D C1

• Soft Clustering: In this type of clustering, instead of assigning each data point to a separate
cluster, a probability or likelihood of the point belonging to each cluster is evaluated. For example,
say there are 4 data points and we have to cluster them into 2 clusters; we then evaluate, for every
data point, a probability of it belonging to each of the two clusters. This probability is calculated
for all data points.

Data Points Probability of C1 Probability of C2

A 0.91 0.09

B 0.3 0.7

C 0.17 0.83

D 1 0
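A Gaussian Mixture Model is one common way to obtain exactly this kind of soft assignment. The sketch below is only an illustration, assuming scikit-learn is installed; the seven 1-D points are made up.

# Minimal sketch of soft clustering: each point gets a probability per cluster.
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.array([[1.0], [1.2], [0.9], [8.0], [8.3], [7.9], [4.5]])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
probs = gmm.predict_proba(X)          # one row per point: P(C1), P(C2)

for x, p in zip(X.ravel(), probs):
    print(f"x={x:.1f}  P(C1)={p[0]:.2f}  P(C2)={p[1]:.2f}")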

Uses of Clustering

Now before we begin with types of clustering algorithms, we will go through the use cases of
Clustering algorithms. Clustering algorithms are majorly used for:

• Market Segmentation – Businesses use clustering to group their customers and use targeted
advertisements to attract more audience.

• Market Basket Analysis – Shop owners analyze their sales data to figure out which items are
often bought together by customers. For example, according to a study in the USA, diapers and
beer were frequently bought together by fathers.

• Social Network Analysis – Social media sites use your data to understand your browsing
behaviour and provide you with targeted friend recommendations or content
recommendations.

• Medical Imaging – Doctors use Clustering to find out diseased areas in diagnostic images like
X-rays.

• Anomaly Detection – To find outliers in a stream of real-time dataset or forecasting


fraudulent transactions we can use clustering to identify them.

• Simplify working with large datasets – Each cluster is given a cluster ID after clustering is
complete. An entire feature set can then be reduced to its cluster ID. Clustering is effective when
it can represent a complicated case with a simple cluster ID; using the same principle, clustering
can make complex datasets simpler.

There are many more use cases for clustering, but these are some of the major and common ones.
Moving forward, we will discuss clustering algorithms that will help you perform the above tasks.

Types of Clustering Algorithms

At the surface level, clustering helps in the analysis of unstructured data. Graphing, the shortest
distance, and the density of the data points are a few of the elements that influence cluster
formation. Clustering is the process of determining how related the objects are based on a metric
called the similarity measure. Similarity metrics are easier to locate in smaller sets of features. It gets
harder to create similarity measures as the number of features increases. Depending on the type of
clustering algorithm being utilized in data mining, several techniques are employed to group the data
from the datasets. In this part, the clustering techniques are described. Various types of clustering
algorithms are:

1. Centroid-based Clustering (Partitioning methods)

2. Density-based Clustering (Model-based methods)

3. Connectivity-based Clustering (Hierarchical clustering)

4. Distribution-based Clustering

We will be going through each of these types in brief.

1. Centroid-based Clustering (Partitioning methods)

Partitioning methods are the easiest clustering algorithms. They group data points on the basis
of their closeness. Generally, the similarity measures chosen for these algorithms are Euclidean
distance, Manhattan distance or Minkowski distance. The dataset is separated into a
predetermined number of clusters, and each cluster is referenced by a vector of values. Each input
data point is compared with these vector values and joins the cluster it is closest to.

The primary drawback for these algorithms is the requirement that we establish the number of
clusters, “k,” either intuitively or scientifically (using the Elbow Method) before any clustering
machine learning system starts allocating the data points. Despite this, it is still the most popular
type of clustering. K-means and K-medoids clustering are some examples of this type of clustering.
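As an aside, the Elbow Method mentioned above can be sketched in a few lines: run K-means for several values of k and look for the point where the within-cluster sum of squares (inertia) stops dropping sharply. This assumes scikit-learn is installed and uses synthetic data; the range of k is arbitrary.

# Minimal sketch of the Elbow Method for choosing k.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=500, centers=4, random_state=7)

for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=7).fit(X)
    # inertia_ = within-cluster sum of squared distances; the "elbow" in this
    # curve suggests a reasonable k.
    print(f"k={k}  inertia={km.inertia_:.1f}")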

2. Density-based Clustering (Model-based methods)

Density-based clustering, a model-based method, finds groups based on the density of data points.
Contrary to centroid-based clustering, which requires that the number of clusters be predefined and
is sensitive to initialization, density-based clustering determines the number of clusters automatically
and is less susceptible to beginning positions. They are great at handling clusters of different sizes
and forms, making them ideally suited for datasets with irregularly shaped or overlapping clusters.
These methods manage both dense and sparse data regions by focusing on local density and can
distinguish clusters with a variety of morphologies.

In contrast, centroid-based grouping, like k-means, has trouble finding arbitrary shaped clusters. Due
to its preset number of cluster requirements and extreme sensitivity to the initial positioning of
centroids, the outcomes can vary. Furthermore, the tendency of centroid-based approaches to
produce spherical or convex clusters restricts their capacity to handle complicated or irregularly
shaped clusters. In conclusion, density-based clustering overcomes the drawbacks of centroid-based
techniques by autonomously choosing cluster sizes, being resilient to initialization, and successfully
capturing clusters of various sizes and forms. The most popular density-based clustering algorithm
is DBSCAN.

3. Connectivity-based Clustering (Hierarchical clustering)

A method for assembling related data points into hierarchical clusters is called hierarchical clustering.
Each data point is initially taken into account as a separate cluster, which is subsequently combined
with the clusters that are the most similar to form one large cluster that contains all of the data
points.

Think about how you may arrange a collection of items based on how similar they are. Each object
begins as its own cluster at the base of the tree when using hierarchical clustering, which creates a
dendrogram, a tree-like structure. The closest pairings of clusters are then combined into larger
clusters after the algorithm examines how similar the objects are to one another. When every object
is in one cluster at the top of the tree, the merging process has finished. Exploring various granularity
levels is one of the fun things about hierarchical clustering. To obtain a given number of clusters, you
can select to cut the dendrogram at a particular height. The more similar two objects are within a
cluster, the closer they are. It’s comparable to classifying items according to their family trees, where
the nearest relatives are clustered together and the wider branches signify more general
connections. There are 2 approaches for Hierarchical clustering:

• Divisive Clustering: It follows a top-down approach; here we consider all data points to be
part of one big cluster, and then this cluster is divided into smaller groups.

• Agglomerative Clustering: It follows a bottom-up approach, here we consider all data points
to be part of individual clusters and then these clusters are clubbed together to make one big
cluster with all data points.

4. Distribution-based Clustering

In distribution-based clustering, data points are grouped according to their likelihood of belonging
to the same probability distribution (such as a Gaussian, binomial, or other distribution) within the
data. The data elements are grouped using a probability-based distribution based on statistical
distributions; data objects that have a higher likelihood of belonging to a cluster are included in it.
Every cluster has a central point, and the further a data point is from that central point, the less
likely it is to be included in the cluster.

A notable drawback of density and boundary-based approaches is the need to specify the clusters a
priori for some algorithms, and primarily the definition of the cluster form for the bulk of algorithms.
There must be at least one tuning or hyper-parameter selected, and while doing so should be simple,
getting it wrong could have unanticipated repercussions. Distribution-based clustering has a definite
advantage over proximity and centroid-based clustering approaches in terms of flexibility, accuracy,
and cluster structure. The key issue is that, in order to avoid overfitting, many clustering methods
only work with simulated or manufactured data, or when the bulk of the data points certainly belong
to a preset distribution. The most popular distribution-based clustering algorithm is the Gaussian
Mixture Model.

Applications of Clustering in different fields:

1. Marketing: It can be used to characterize & discover customer segments for marketing
purposes.

2. Biology: It can be used for classification among different species of plants and animals.

3. Libraries: It is used in clustering different books on the basis of topics and information.

4. Insurance: It is used to understand customers and their policies and to identify fraud.

5. City Planning: It is used to make groups of houses and to study their values based on their
geographical locations and other factors present.
6. Earthquake studies: By learning the earthquake-affected areas we can determine the
dangerous zones.

7. Image Processing: Clustering can be used to group similar images together, classify images
based on content, and identify patterns in image data.

8. Genetics: Clustering is used to group genes that have similar expression patterns and identify
gene networks that work together in biological processes.

9. Finance: Clustering is used to identify market segments based on customer behavior, identify
patterns in stock market data, and analyze risk in investment portfolios.

10. Customer Service: Clustering is used to group customer inquiries and complaints into
categories, identify common issues, and develop targeted solutions.

11. Manufacturing: Clustering is used to group similar products together, optimize production
processes, and identify defects in manufacturing processes.

12. Medical diagnosis: Clustering is used to group patients with similar symptoms or diseases,
which helps in making accurate diagnoses and identifying effective treatments.

13. Fraud detection: Clustering is used to identify suspicious patterns or anomalies in financial
transactions, which can help in detecting fraud or other financial crimes.

14. Traffic analysis: Clustering is used to group similar patterns of traffic data, such as peak
hours, routes, and speeds, which can help in improving transportation planning and
infrastructure.

15. Social network analysis: Clustering is used to identify communities or groups within social
networks, which can help in understanding social behavior, influence, and trends.

16. Cybersecurity: Clustering is used to group similar patterns of network traffic or system
behavior, which can help in detecting and preventing cyberattacks.

17. Climate analysis: Clustering is used to group similar patterns of climate data, such as
temperature, precipitation, and wind, which can help in understanding climate change and
its impact on the environment.

18. Sports analysis: Clustering is used to group similar patterns of player or team performance
data, which can help in analyzing player or team strengths and weaknesses and making
strategic decisions.

19. Crime analysis: Clustering is used to group similar patterns of crime data, such as location,
time, and type, which can help in identifying crime hotspots, predicting future crime trends,
and improving crime prevention strategies.

Partitioning Method: This clustering method classifies the information into multiple groups based on
the characteristics and similarity of the data. It is up to the data analyst to specify the number of
clusters that have to be generated for the clustering method. In the partitioning method, when a
database (D) contains multiple (N) objects, the partitioning method constructs user-specified (K)
partitions of the data, in which each partition represents a cluster and a particular region. Many
algorithms come under the partitioning method; some of the popular ones are K-Means, PAM
(K-Medoids), and the CLARA algorithm (Clustering Large Applications). In this section, we will see
the working of the K-Means algorithm in detail.

K-Means (a centroid-based technique): The K-means algorithm takes the input parameter K from the
user and partitions a dataset containing N objects into K clusters, so that the similarity among the
data objects inside a group (intra-cluster) is high while the similarity with data objects outside the
cluster (inter-cluster) is low. The similarity of a cluster is determined with respect to the mean value
of the cluster. It is a type of squared-error algorithm. At the start, K objects are randomly chosen
from the dataset, each of which represents a cluster mean (centre). The remaining data objects are
assigned to the nearest cluster based on their distance from the cluster mean. The new mean of each
cluster is then calculated from the data objects assigned to it.

Algorithm:

K mean:

Input:
K: The number of clusters in which the dataset has to be divided
D: A dataset containing N number of objects

Output:
A dataset of K clusters

Method:

1. Randomly assign K objects from the dataset (D) as cluster centres (C).

2. (Re)assign each object to the cluster whose mean it is most similar to.

3. Update the cluster means, i.e., recalculate the mean of each cluster with the updated assignments.

4. Repeat steps 2 and 3 until no change occurs.

Figure – K-mean Clustering


Flowchart:
Figure – K-mean Clustering

Example: Suppose we want to group the visitors to a website using just their age as follows:

16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66

Initial Cluster:

K=2
Centroid(C1) = 16 [16]
Centroid(C2) = 22 [22]

Note: These two points are chosen randomly from the dataset.

Iteration-1:

C1 = 16.33 [16, 16, 17]


C2 = 37.25 [20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66]

Iteration-2:

C1 = 19.55 [16, 16, 17, 20, 20, 21, 21, 22, 23]
C2 = 46.90 [29, 36, 41, 42, 43, 44, 45, 61, 62, 66]

Iteration-3:

C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]

Iteration-4:

C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]
There is no change between iterations 3 and 4, so we stop. Therefore, we get two clusters, (16-29)
and (36-66), using the K-Means algorithm.
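The iterations above can be reproduced with a short plain-Python sketch of the same procedure (reassign each point to the nearest centroid, recompute the means, stop when nothing changes); the starting centroids 16 and 22 are taken from the example.

# Minimal sketch reproducing the 1-D age example above.
ages = [16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
centroids = [16.0, 22.0]              # the two randomly chosen starting points

while True:
    # Assignment step: each age joins the cluster with the nearest centroid.
    clusters = [[], []]
    for a in ages:
        nearest = min(range(2), key=lambda i: abs(a - centroids[i]))
        clusters[nearest].append(a)

    # Update step: recompute each centroid as the mean of its cluster.
    new_centroids = [sum(c) / len(c) for c in clusters]
    if new_centroids == centroids:    # no change -> converged
        break
    centroids = new_centroids
    print([round(c, 2) for c in centroids], clusters)

print("final clusters:", clusters)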

Hierarchical clustering is a connectivity-based clustering model that groups the data points together
that are close to each other based on the measure of similarity or distance. The assumption is that
data points that are close to each other are more similar or related than data points that are farther
apart.

A dendrogram, a tree-like figure produced by hierarchical clustering, depicts the hierarchical
relationships between groups. Individual data points are located at the bottom of the dendrogram,
while the largest clusters, which include all the data points, are located at the top. In order to
generate different numbers of clusters, the dendrogram can be sliced at various heights.

The dendrogram is created by iteratively merging or splitting clusters based on a measure of
similarity or distance between data points. Clusters are divided or merged repeatedly until all data
points are contained within a single cluster, or until the predetermined number of clusters is
attained.

We can look at the dendrogram and measure the height at which the branches of the dendrogram
form distinct clusters to calculate the ideal number of clusters. The dendrogram can be sliced at this
height to determine the number of clusters.

Types of Hierarchical Clustering

Basically, there are two types of hierarchical Clustering:

1. Agglomerative Clustering

2. Divisive clustering

Hierarchical Agglomerative Clustering

It is also known as the bottom-up approach or hierarchical agglomerative clustering (HAC). It
produces a structure that is more informative than the unstructured set of clusters returned by flat
clustering, and it does not require us to prespecify the number of clusters. Bottom-up algorithms
treat each data point as a singleton cluster at the outset and then successively agglomerate pairs of
clusters until all clusters have been merged into a single cluster that contains all the data.

Algorithm :

given a dataset (d1, d2, d3, ..., dN) of size N

# compute the distance matrix
for i = 1 to N:
    # the distance matrix is symmetric about the primary diagonal,
    # so we compute only its lower part
    for j = 1 to i:
        dis_mat[i][j] = distance(di, dj)

each data point is a singleton cluster

repeat
    merge the two clusters having minimum distance
    update the distance matrix
until only a single cluster remains

Hierarchical Agglomerative Clustering

Steps:

• Consider each alphabet as a single cluster and calculate the distance of one cluster from all
the other clusters.

• In the second step, comparable clusters are merged together to form a single cluster. Let's
say cluster (B) and cluster (C) are very similar to each other, so we merge them in the second
step; similarly for clusters (D) and (E). At the end of this step we have the clusters [(A), (BC), (DE), (F)].

• We recalculate the proximity according to the algorithm and merge the two nearest
clusters ([(DE), (F)]) together to form the new clusters [(A), (BC), (DEF)].

• Repeating the same process, the clusters DEF and BC are comparable and are merged together
to form a new cluster. We are now left with the clusters [(A), (BCDEF)].

• At last, the two remaining clusters are merged together to form a single cluster [(ABCDEF)].
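A runnable counterpart to the walkthrough above, assuming SciPy is installed: linkage() performs the successive merges and fcluster() cuts the resulting tree at a chosen number of clusters. The six 2-D points are made up to stand in for the items A..F.

# Minimal sketch of agglomerative clustering with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1, 1], [1.2, 1.1], [1.1, 0.9],
                   [5, 5], [5.1, 5.2], [9, 9]])

Z = linkage(points, method="average")   # "average" = group-average distance
print(Z)                                # each row: clusters merged, distance, new size

labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
print(labels)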

Hierarchical Divisive clustering


It is also known as the top-down approach. This algorithm also does not require us to prespecify the
number of clusters. Top-down clustering requires a method for splitting a cluster that contains the
whole data, and proceeds by splitting clusters recursively until individual data points have been split
into singleton clusters.

Algorithm :

given a dataset (d1, d2, d3, ..., dN) of size N

at the top we have all the data in one cluster
the cluster is split using a flat clustering method, e.g. K-Means

repeat
    choose the best cluster among all the clusters to split
    split that cluster with the flat clustering algorithm
until each data point is in its own singleton cluster

Hierarchical Divisive clustering

Computing Distance Matrix

While merging two clusters we check the distance between every pair of clusters and merge the
pair with the least distance (most similarity). But how is that distance determined? There are
different ways of defining inter-cluster distance/similarity. Some of them are:

1. Min Distance: Find the minimum distance between any two points of the cluster.

2. Max Distance: Find the maximum distance between any two points of the cluster.
3. Group Average: Find the average distance between every two points of the clusters.

4. Ward’s Method: The similarity of two clusters is based on the increase in squared error when
two clusters are merged.

For example, if we group the given data using different methods, we may get different results:

Figure – Distance matrix comparison in hierarchical clustering

Hierarchical Agglomerative vs Divisive Clustering

• Divisive clustering is more complex than agglomerative clustering, since in divisive clustering we
need a flat clustering method as a "subroutine" to split each cluster until every data point has its
own singleton cluster.

• Divisive clustering is more efficient if we do not generate a complete hierarchy all the way
down to individual data leaves. The time complexity of a naive agglomerative clustering
is O(n^3), because we exhaustively scan the N x N matrix dist_mat for the lowest distance in
each of the N-1 iterations. Using a priority queue data structure we can reduce this complexity
to O(n^2 log n), and with some further optimizations it can be brought down to O(n^2). For
divisive clustering, given a fixed number of top levels and using an efficient flat algorithm like
K-Means, divisive algorithms are linear in the number of patterns and clusters.

• A divisive algorithm is also more accurate. Agglomerative clustering makes decisions by
considering local patterns or neighboring points without initially taking the global distribution
of the data into account, and these early decisions cannot be undone, whereas divisive
clustering takes the global distribution of the data into consideration when making top-level
partitioning decisions.

Density-Based Spatial Clustering Of Applications With Noise (DBSCAN)

Clusters are dense regions in the data space, separated by regions of lower point density.
The DBSCAN algorithm is based on this intuitive notion of "clusters" and "noise". The key idea is that,
for each point of a cluster, the neighborhood of a given radius has to contain at least a minimum
number of points.

Why DBSCAN?

Partitioning methods (K-means, PAM clustering) and hierarchical clustering work for finding
spherical-shaped clusters or convex clusters. In other words, they are suitable only for compact and
well-separated clusters. Moreover, they are also severely affected by the presence of noise and
outliers in the data.

Real-life data may contain irregularities, like:

1. Clusters can be of arbitrary shape such as those shown in the figure below.

2. Data may contain noise.


The figure above shows a data set containing non-convex shape clusters and outliers. Given such
data, the k-means algorithm has difficulties in identifying these clusters with arbitrary shapes.

Parameters Required For DBSCAN Algorithm

1. eps: It defines the neighborhood around a data point, i.e. if the distance between two points
is lower than or equal to 'eps' then they are considered neighbors. If the eps value is chosen too
small, a large part of the data will be considered outliers. If it is chosen very large, the clusters
will merge and the majority of the data points will end up in the same cluster. One way to find
a good eps value is the k-distance graph (see the sketch after this list).

2. MinPts: The minimum number of neighbors (data points) within the eps radius. The larger the
dataset, the larger the value of MinPts that should be chosen. As a general rule, the minimum
MinPts can be derived from the number of dimensions D in the dataset as MinPts >= D+1. The
minimum value of MinPts should be at least 3.
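Here is a minimal sketch of the k-distance graph mentioned in point 1, assuming scikit-learn is installed; the two-moons data and the choice k = 4 are only illustrative. For every point we take the distance to its k-th nearest neighbour and sort these distances; a sharp bend ("knee") in the sorted curve suggests a candidate eps.

# Minimal sketch: k-distance values for choosing eps.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

k = 4                                         # often set equal to MinPts
# Ask for k+1 neighbours because the query point itself is returned at distance 0.
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
distances, _ = nn.kneighbors(X)
k_dist = np.sort(distances[:, -1])            # distance to the k-th neighbour, sorted

print(k_dist[::50])   # inspect the sorted curve; the knee suggests eps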

In this algorithm, we have 3 types of data points.


Core Point: A point is a core point if it has at least MinPts points within eps.
Border Point: A point that has fewer than MinPts points within eps but lies in the neighborhood of a
core point.
Noise or outlier: A point that is neither a core point nor a border point.
Steps Used In DBSCAN Algorithm

1. Find all the neighbor points within eps and identify the core points or visited with more than
MinPts neighbors.

2. For each core point if it is not already assigned to a cluster, create a new cluster.

3. Recursively find all its density-connected points and assign them to the same cluster as the
core point.
Points a and b are said to be density connected if there exists a point c that has a sufficient
number of points in its neighborhood and both a and b are within eps distance of it. This is a
chaining process: if b is a neighbor of c, c is a neighbor of d, and d is a neighbor of e, which in
turn is a neighbor of a, then b is connected to a.

4. Iterate through the remaining unvisited points in the dataset. Those points that do not
belong to any cluster are noise.

Pseudocode For DBSCAN Clustering Algorithm

DBSCAN(dataset, eps, MinPts) {
    C = 1                                  # cluster index
    for each unvisited point p in dataset {
        mark p as visited
        N = neighboring points of p within eps
        if |N| < MinPts:
            mark p as noise
        else:
            add p to cluster C
            for each point p' in N {
                if p' is not visited:
                    mark p' as visited
                    N' = neighboring points of p' within eps
                    if |N'| >= MinPts:
                        N = N U N'         # expand the neighborhood
                if p' is not a member of any cluster:
                    add p' to cluster C
            }
            C = C + 1
    }
}
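For comparison with the pseudocode, here is a minimal sketch using scikit-learn's DBSCAN, assuming the library is installed; the two-moons data and the values eps = 0.2, min_samples = 5 are only illustrative.

# Minimal sketch: DBSCAN labels dense points with a cluster id and noise with -1.
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_

print(set(labels))                        # cluster ids; -1 marks noise/outliers
print(list(labels).count(-1), "noise points")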

When Should We Use DBSCAN Over K-Means In Clustering Analysis?

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and K-Means are both
clustering algorithms that group together data points with similar characteristics. However, they
work on different principles and are suitable for different types of data. We prefer DBSCAN when
the data is not spherical in shape or the number of classes is not known beforehand.

Difference Between DBSCAN and K-Means.

DBSCAN | K-Means
In DBSCAN we need not specify the number of clusters. | K-Means is very sensitive to the number of clusters, so it needs to be specified.
Clusters formed in DBSCAN can be of any arbitrary shape. | Clusters formed in K-Means are spherical or convex in shape.
DBSCAN can work well with datasets having noise and outliers. | K-Means does not work well with outliers; outliers can skew the clusters in K-Means to a very large extent.
In DBSCAN, two parameters are required for training the model. | In K-Means, only one parameter is required for training the model.
Clusters formed in K-means and DBSCAN

Outlier influence on DBSCAN

Spectral Co-Clustering Algorithm

Spectral co-clustering is a clustering algorithm that uses spectral graph theory to find clusters in both
rows and columns of a data matrix simultaneously. This is done by constructing a bi-partite graph
from the data matrix, where the rows and columns of the matrix are represented as nodes in the
graph, and the entries in the matrix are represented as edges between the nodes.

The spectral co-clustering algorithm then uses the eigenvectors of the graph Laplacian to find the
clusters in the data matrix. This is done by treating the rows and columns of the data matrix as two
separate sets of nodes and using the eigenvectors to partition each set into clusters.
One advantage of the spectral co-clustering algorithm is that it can handle data with missing entries.
This is because the algorithm only uses the non-zero entries in the data matrix to construct the bi-
partite graph, and therefore does not require the matrix to be complete.

Another advantage of the spectral co-clustering algorithm is that it can find clusters of different sizes
and shapes. This is because the algorithm uses the eigenvectors of the graph Laplacian, which are
sensitive to the local structure of the graph and can therefore identify clusters of different shapes
and sizes.

Analyzing patterns to partition the data samples according to some criteria is called clustering. The
data mining technique that allows simultaneous clustering of the rows and columns of a matrix is
called biclustering. Given a set of m samples, each represented by an n-dimensional feature vector,
the entire dataset can be represented as a matrix of m rows and n columns. The biclustering algorithm
generates biclusters: subsets of rows that exhibit similar behavior across a subset of columns. A
biclustering of a dataset is a collection of pairs of sample and feature subsets B = (L1, F1), (L2, F2),
..., (Lr, Fr) such that the collection (L1, L2, ..., Lr) forms a partition of the set of samples and the
collection (F1, F2, ..., Fr) forms a partition of the set of features. Each pair (Lk, Fk) is a bicluster.
Types of Biclusters:

• Biclusters with a constant value: Rows and columns are reordered to group together similar rows
and columns with similar, constant values. A perfect constant bicluster is a matrix in which all
values are equal.

20.0 20.0 20.0 20.0 20.0

20.0 20.0 20.0 20.0 20.0

20.0 20.0 20.0 20.0 20.0

20.0 20.0 20.0 20.0 20.0

20.0 20.0 20.0 20.0 20.0

• Bicluster with constant values on rows or columns: In these biclusters, the rows and columns
should be normalized.

• Bicluster with constant values on rows:

20.0 20.0 20.0 20.0 20.0

21.0 21.0 21.0 21.0 21.0

22.0 22.0 22.0 22.0 22.0

23.0 23.0 23.0 23.0 23.0

24.0 24.0 24.0 24.0 24.0

• Bicluster with constant value on columns:

20.0 21.0 22.0 23.0 24.0


20.0 21.0 22.0 23.0 24.0

20.0 21.0 22.0 23.0 24.0

20.0 21.0 22.0 23.0 24.0

20.0 21.0 22.0 23.0 24.0

• Bicluster with coherent values: The subsets of rows or columns will almost have the same
score.

• Additive:

1.0 4.0 5.0 0.0 1.5

4.0 7.0 8.0 3.0 4.5

3.0 6.0 7.0 2.0 3.5

5.0 8.0 9.0 4.0 5.5

2.0 5.0 6.0 1.0 2.5

• Multiplicative:

1.0 0.5 2.0 0.2 0.8

2.0 1.0 4.0 0.4 1.6

3.0 1.5 6.0 0.6 2.4

4.0 2.0 8.0 0.8 3.2

5.2 2.5 10.0 1.0 4.0


• Unusually high/low values: In these matrices, we can have decimals, integers, etc, and in the
top left 4 values are negative, and the bottom right 4 values are positive.

-10 -10 0.1 0.1

-10 -10 0.2 0.3

0.3 0.2 10 10

0.3 0.2 10 10

• Submatrices with low variance: In the matrix v, the values in v11, v12, v13, v14, v21, v31, v41
range from 0.0 to 0.8, while the values in v22, v23, v32, v33, v42, v43 range from 0.1 to 0.2.

0.5 0.5 0.0 0.0

0.5 0.1 0.2 0.7

0.8 0.2 0.2 0.7

0.8 0.1 0.1 0.9

Bi-Partite Graph:

The vertex set divides into two disjoint sets V1 and V2, and each edge in the graph joins a vertex in
V1 to a vertex in V2.

Row/Column C1 C2 C3 C4

R1 0.1 0.0 0.0 0.2

R2 0.5 0.0 0.0 0.3

R3 0.0 0.2 0.1 0.0


R4 0.0 0.2 0.0 0.2

Spectral Co-Clustering:

Spectral co-clustering takes a bipartite graph as input: the data is divided into two sets of nodes
connected by edges. It finds biclusters with higher values and rearranges the matrix so that these
higher values lie along the diagonal blocks.

For an input matrix A with entries Aij, the matrix is first normalized as

    An = R^(-1/2) * A * C^(-1/2)

where R is the diagonal matrix whose i-th diagonal entry is the row sum of A (the sum of Aij over j),
and C is the diagonal matrix whose j-th diagonal entry is the column sum of A (the sum of Aij over i).

A singular value decomposition of the normalized matrix,

    An = U * S * V^T,

then provides the partitioning of the rows and columns of A: a subset of the left singular vectors
gives the row partition and a subset of the right singular vectors gives the column partition. The
number of singular vectors used is l = ceil(log2 K), where K is the number of clusters.
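A minimal sketch of this procedure using scikit-learn's SpectralCoclustering, assuming the library is installed; the biclustered matrix is synthetic and the choice of 3 clusters is illustrative.

# Minimal sketch: co-clustering rows and columns of a data matrix.
import numpy as np
from sklearn.datasets import make_biclusters
from sklearn.cluster import SpectralCoclustering

data, rows, cols = make_biclusters(shape=(30, 20), n_clusters=3, random_state=0)

model = SpectralCoclustering(n_clusters=3, random_state=0).fit(data)

print(model.row_labels_)       # bicluster id for every row
print(model.column_labels_)    # bicluster id for every column

# Reorder rows and columns by their labels to reveal the block structure
# along the diagonal.
reordered = data[np.argsort(model.row_labels_)][:, np.argsort(model.column_labels_)]
print(reordered.shape)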
Spectral Biclustering:

It assumes the input matrix has a hidden checkerboard structure. In this structure, rows and columns
are partitioned so that the entries of any bicluster in the Cartesian product of row clusters and
column clusters are approximately constant.
Association Rule Mining

Association rule mining finds interesting associations and relationships among large sets of data
items. These rules show how frequently an itemset occurs in transactions. A typical example is
market basket analysis, one of the key techniques used by large retailers to show associations
between items. It allows retailers to identify relationships between the items that people frequently
buy together. Given a set of transactions, we can find rules that will predict the occurrence of an
item based on the occurrences of other items in the transaction.

TID Items

1 Bread, Milk

2 Bread, Diaper, Beer, Eggs

3 Milk, Diaper, Beer, Coke

4 Bread, Milk, Diaper, Beer

5 Bread, Milk, Diaper, Coke

Before we start defining the rule, let us first see the basic definitions.

Support Count (σ) – The frequency of occurrence of an itemset. Here σ({Milk, Bread, Diaper}) = 2.

Frequent Itemset – An itemset whose support is greater than or equal to the minsup threshold.

Association Rule – An implication expression of the form X -> Y, where X and Y are any two
itemsets.

Example: {Milk, Diaper}->{Beer}

Rule Evaluation Metrics –

• Support(s) – The number of transactions that include the items in both the {X} and {Y} parts of the
rule, as a percentage of the total number of transactions. It is a measure of how frequently the
collection of items occurs together, as a fraction of all transactions.

• Support(X => Y) = supp(X ∪ Y) / |T| – interpreted as the fraction of transactions that contain both
X and Y.

• Confidence(c) – The ratio of the number of transactions that include all items in {X} as well as {Y}
to the number of transactions that include all items in {X}.

• Conf(X => Y) = supp(X ∪ Y) / supp(X) – it measures how often the items in Y appear in transactions
that also contain the items in X.

• Lift(l) – The lift of the rule X => Y is the confidence of the rule divided by the expected confidence,
assuming that the itemsets X and Y are independent of each other. The expected confidence is
simply the support of {Y}.

• Lift(X => Y) = Conf(X => Y) / supp(Y) – a lift value near 1 indicates that X and Y appear together
about as often as expected, greater than 1 means they appear together more often than expected,
and less than 1 means they appear together less often than expected. Greater lift values indicate a
stronger association.

Example – From the above table, for the rule {Milk, Diaper} => {Beer}:

s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4

c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 = 0.67

l = supp({Milk, Diaper, Beer}) / (supp({Milk, Diaper}) * supp({Beer})) = 0.4 / (0.6 * 0.6) = 1.11
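The same numbers can be checked with a few lines of plain Python over the transaction table above.

# Minimal sketch: support, confidence and lift computed from the transactions.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    # Fraction of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {"Milk", "Diaper"}, {"Beer"}

s = support(X | Y)                  # 2/5 = 0.4
c = support(X | Y) / support(X)     # 0.4 / 0.6 = 0.67
l = c / support(Y)                  # 0.67 / 0.6 = 1.11

print(f"support={s:.2f}  confidence={c:.2f}  lift={l:.2f}")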

The association rule is very useful in analyzing datasets. The data is collected using bar-code
scanners in supermarkets. Such databases consist of a large number of transaction records, each of
which lists all items bought by a customer in a single purchase. The manager can then know whether
certain groups of items are consistently purchased together and use this data for adjusting store
layouts, cross-selling, and promotions based on these statistics.

Apriori Algorithm

The Apriori algorithm was proposed by R. Agrawal and R. Srikant in 1994 for finding frequent itemsets
in a dataset for boolean association rules. The algorithm is named Apriori because it uses prior
knowledge of frequent itemset properties. We apply an iterative, level-wise search in which frequent
k-itemsets are used to find frequent (k+1)-itemsets.

To improve the efficiency of the level-wise generation of frequent itemsets, an important property
called the Apriori property is used, which helps by reducing the search space.

Apriori Property –
All non-empty subsets of a frequent itemset must be frequent. The key concept of the Apriori
algorithm is the anti-monotonicity of the support measure. Apriori assumes that:
All subsets of a frequent itemset must be frequent (Apriori property).
If an itemset is infrequent, all its supersets will be infrequent.

Before we start understanding the algorithm, go through the definitions explained earlier in this
section.
Consider the following dataset; we will find the frequent itemsets and generate association rules
for it.

minimum support count is 2


minimum confidence is 60%

Step-1: K=1
(I) Create a table containing the support count of each item present in the dataset, called C1 (the
candidate set).

(II) Compare each candidate set item's support count with the minimum support count (here
min_support = 2); if the support_count of a candidate set item is less than min_support, remove
that item. This gives us itemset L1.

Step-2: K=2

• Generate candidate set C2 using L1 (this is called the join step). The condition for joining Lk-1 and
Lk-1 is that they should have (K-2) elements in common.

• Check whether all subsets of an itemset are frequent or not, and if not, remove that itemset.
(For example, the subsets of {I1, I2} are {I1} and {I2}, which are frequent. Check this for each
itemset.)

• Now find the support count of these itemsets by searching the dataset.

(II) Compare the candidate set (C2) support counts with the minimum support count (here
min_support = 2); if the support_count of a candidate set item is less than min_support, remove
that item. This gives us itemset L2.

Step-3:

o Generate candidate set C3 using L2 (join step). The condition for joining Lk-1 and Lk-1 is that
they should have (K-2) elements in common, so here, for L2, the first element should match.
The itemsets generated by joining L2 are {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4},
{I2, I4, I5}, {I2, I3, I5}.

o Check whether all subsets of these itemsets are frequent or not, and if not, remove that
itemset. (Here the subsets of {I1, I2, I3} are {I1, I2}, {I2, I3}, {I1, I3}, which are frequent.
For {I2, I3, I4}, the subset {I3, I4} is not frequent, so remove it. Similarly check every
itemset.)

o Find the support count of the remaining itemsets by searching the dataset.

(II) Compare the candidate set (C3) support counts with the minimum support count (here
min_support = 2); if the support_count of a candidate set item is less than min_support, remove
that item. This gives us itemset L3.

Step-4:

o Generate candidate set C4 using L3 (join step). The condition for joining Lk-1 and Lk-1 (K=4)
is that they should have (K-2) elements in common, so here, for L3, the first 2 elements
(items) should match.

o Check whether all subsets of these itemsets are frequent or not. (Here the itemset formed by
joining L3 is {I1, I2, I3, I5}, and its subsets include {I1, I3, I5}, which is not frequent.) So
there is no itemset in C4.

o We stop here because no further frequent itemsets are found.

Thus, we have discovered all the frequent itemsets. Now the generation of strong association rules
comes into the picture. For that, we need to calculate the confidence of each rule.

Confidence –
A confidence of 60% means that 60% of the customers who purchased milk and bread also bought
butter.

Confidence(A -> B) = Support_count(A ∪ B) / Support_count(A)

So here, by taking any one frequent itemset as an example, we will show the rule generation.
Itemset {I1, I2, I3} // from L3
So the rules can be:
[I1^I2]=>[I3] //confidence = sup(I1^I2^I3)/sup(I1^I2) = 2/4*100=50%
[I1^I3]=>[I2] //confidence = sup(I1^I2^I3)/sup(I1^I3) = 2/4*100=50%
[I2^I3]=>[I1] //confidence = sup(I1^I2^I3)/sup(I2^I3) = 2/4*100=50%
[I1]=>[I2^I3] //confidence = sup(I1^I2^I3)/sup(I1) = 2/6*100=33%
[I2]=>[I1^I3] //confidence = sup(I1^I2^I3)/sup(I2) = 2/7*100=28%
[I3]=>[I1^I2] //confidence = sup(I1^I2^I3)/sup(I3) = 2/6*100=33%

So if minimum confidence is 50%, then first 3 rules can be considered as strong association rules.
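The rule-generation step above can also be sketched in plain Python; the support counts below are the ones quoted in the worked example (sup(I1)=6, sup(I2)=7, sup(I3)=6, each pair=4, the triple=2) and are otherwise only illustrative values.

# Minimal sketch: generating rules from the frequent itemset {I1, I2, I3}.
from itertools import combinations

support_count = {
    frozenset({"I1"}): 6, frozenset({"I2"}): 7, frozenset({"I3"}): 6,
    frozenset({"I1", "I2"}): 4, frozenset({"I1", "I3"}): 4, frozenset({"I2", "I3"}): 4,
    frozenset({"I1", "I2", "I3"}): 2,
}

itemset = frozenset({"I1", "I2", "I3"})
min_conf = 0.50

# For every proper non-empty subset A, test the rule A -> (itemset - A).
for r in range(1, len(itemset)):
    for A in combinations(sorted(itemset), r):
        A = frozenset(A)
        conf = support_count[itemset] / support_count[A]
        verdict = "strong" if conf >= min_conf else "weak"
        print(f"{sorted(A)} => {sorted(itemset - A)}  confidence={conf:.1%}  ({verdict})")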

Limitations of Apriori Algorithm


Apriori Algorithm can be slow. The main limitation is time required to hold a vast number of
candidate sets with much frequent itemsets, low minimum support or large itemsets i.e. it is not an
efficient approach for large number of datasets. For example, if there are 10^4 from frequent 1-
itemsets, it need to generate more than 10^7 candidates into 2-length which in turn they will be
tested and accumulate. Furthermore, to detect frequent pattern in size 100 i.e. v1, v2… v100, it have
to generate 2^100 candidate itemsets that yield on costly and wasting of time of candidate
generation. So, it will check for many sets from candidate itemsets, also it will scan database many
times repeatedly for finding candidate itemsets. Apriori will be very low and inefficiency when
memory capacity is limited with large number of transactions.
