
Unsupervised Learning

UNIT-2
SUPERVISED ML ALGORITHM vs UNSUPERVISED ML ALGORITHM

• Supervised learning algorithms are trained using labeled data; unsupervised learning algorithms are trained using unlabeled data.
• A supervised learning model takes direct feedback to check whether it is predicting the correct output; an unsupervised learning model does not take any feedback.
• A supervised learning model predicts the output; an unsupervised learning model finds the hidden patterns in the data.
• In supervised learning, input data is provided to the model along with the output; in unsupervised learning, only input data is provided to the model.
• The goal of supervised learning is to train the model so that it can predict the output when given new data; the goal of unsupervised learning is to find the hidden patterns and useful insights in an unknown dataset.
• Supervised learning needs supervision to train the model; unsupervised learning does not need any supervision.
• Supervised learning can be categorized into Classification and Regression problems; unsupervised learning can be classified into Clustering and Association problems.
• Supervised learning can be used in cases where we know the inputs as well as the corresponding outputs; unsupervised learning can be used in cases where we have only input data and no corresponding output data.
• A supervised learning model generally produces an accurate result; an unsupervised learning model may give less accurate results in comparison.
• Supervised learning is not close to true Artificial Intelligence, because we must first train the model for each kind of data before it can predict the correct output; unsupervised learning is closer to true Artificial Intelligence, as it learns much like a child learns daily routine things through experience.
• Supervised learning includes algorithms such as Linear Regression, Logistic Regression, Support Vector Machines, multi-class classification, Decision Trees, and Bayesian Logic; unsupervised learning includes algorithms such as Clustering, KNN, and the Apriori algorithm.
Unsupervised Learning
 Unsupervised learning is a type of machine learning that learns from unlabeled data. This
means that the data does not have any pre-existing labels or categories.

 The goal of unsupervised learning is to discover patterns and relationships in the data
without any explicit guidance.
 Unsupervised learning is a machine learning technique in which models are not supervised
using a training dataset. Instead, the model itself finds the hidden patterns and insights in the
given data. It can be compared to the learning that takes place in the human brain while
learning new things. It can be defined as:

 Unsupervised learning is a type of machine learning in which models are trained using
unlabeled dataset and are allowed to act on that data without any supervision.
 Unsupervised learning cannot be directly applied to a regression or
classification problem because unlike supervised learning, we have the input
data but no corresponding output data. The goal of unsupervised learning is
to find the underlying structure of dataset, group that data according to
similarities, and represent that dataset in a compressed format.
 Example: Suppose the unsupervised learning algorithm is given an input
dataset containing images of different types of cats and dogs. The algorithm
is never trained upon the given dataset, which means it does not have any
idea about the features of the dataset. The task of the unsupervised
learning algorithm is to identify the image features on its own. The unsupervised
learning algorithm will perform this task by clustering the image dataset into
groups according to the similarities between the images.
 Why use Unsupervised Learning?
 Below are some main reasons which describe the importance of Unsupervised
Learning:
• Unsupervised learning is helpful for finding useful insights from the data.
• Unsupervised learning is much like the way a human learns to think from their own
experiences, which makes it closer to real AI.
• Unsupervised learning works on unlabeled and uncategorized data, which makes
it all the more important.
• In the real world, we do not always have input data with corresponding outputs,
so we need unsupervised learning to handle such cases.
 Working of Unsupervised Learning
 Working of unsupervised learning can be understood by the below diagram:

Here, we have taken unlabeled input data, which means it is not categorized and the
corresponding outputs are also not given. This unlabeled input data is fed to the machine
learning model in order to train it. First, the model interprets the raw data to find the hidden
patterns in the data and then applies a suitable algorithm such as k-means clustering,
hierarchical clustering, etc.

Once it applies the suitable algorithm, the algorithm divides the data objects into groups
according to the similarities and differences between the objects.
Types of Unsupervised Learning
Types
 Clustering: A clustering problem is where you want to discover the inherent groupings in the data, such
as grouping customers by purchasing behavior.
 Clustering is a type of unsupervised machine learning algorithm that groups similar data points into
clusters. The goal of clustering is to identify patterns or structures in the data that are not easily visible
by other methods.
 Applications of Clustering:
 1. Customer Segmentation: Clustering can be used to group customers based on their behavior,
demographics, and preferences.
 2. Image Segmentation: Clustering can be used to segment images into different regions based on
color, texture, and other features.
 3. Gene Expression Analysis: Clustering can be used to group genes based on their expression levels in
different samples.
 4. Recommendation Systems: Clustering can be used to group users based on their preferences and
recommend items to them.
 Association: An association rule learning problem is where you want to discover rules that describe large portions of
your data, such as people that buy X also tend to buy Y.
 Association in Machine Learning (ML) refers to a type of unsupervised learning algorithm that aims to discover
interesting patterns, relationships, or associations between variables in a dataset.
 Goal of Association : The primary goal of association algorithms is to identify strong rules or patterns that describe the
relationships between different attributes or features in a dataset.

 Applications of Association:
 1. Market Basket Analysis: Association algorithms can be used to analyze customer purchasing behavior and identify
patterns in the items that are purchased together.
 2. Recommendation Systems: Association algorithms can be used to build recommendation systems that suggest
products or services based on a customer's past purchases or behavior.
 3. Anomaly Detection: Association algorithms can be used to detect anomalies or outliers in a dataset by identifying
patterns or relationships that are unusual or unexpected.
Clustering

1. Hierarchical clustering
2. K-means clustering
3. Gaussian Mixture Models (GMMs)
4. Principal Component Analysis (Unit 3)
5. Singular Value Decomposition (Unit 3)
6. Density-Based Spatial Clustering of Applications with Noise (DBSCAN) (Unit 3)

Association rule learning

• Apriori Algorithm
• FP-Growth Algorithm
K-means Clustering

 K-means clustering is an unsupervised machine learning algorithm used to group a dataset into k clusters. It is
an iterative algorithm that starts by randomly selecting k centroids in the dataset. After selecting the centroids,
the entire dataset is divided into clusters based on the distance of the data points from the centroid. In the new
clusters, the centroids are calculated by taking the mean of the data points.
 With the new centroids, we regroup the dataset into new clusters. This process continues until we get a stable
cluster. K-means clustering is a partition clustering algorithm. We call it partition clustering because of the
reason that the k-means clustering algorithm partitions the entire dataset into mutually exclusive clusters.
 Here K defines the number of pre-defined clusters that need to be created in the process

 It allows us to cluster the data into different groups and is a convenient way to discover the categories
in an unlabeled dataset on its own, without the need for any training.
 It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this
algorithm is to minimize the sum of distances between the data points and their corresponding cluster centroids.

 The algorithm takes the unlabelled dataset as input, divides the dataset into k clusters, and repeats
the process until it finds the best clusters.
K-means Clustering Algorithm Steps
 K-means Clustering Algorithm
 To understand the process of clustering using the k-means clustering algorithm and solve the
numerical example, let us first state the algorithm. Given a dataset of N entries and a number K
as the number of clusters that need to be formed, we will use the following steps to find the
clusters using the k-means algorithm.
1. First, we will select K random entries from the dataset and use them as centroids.
2. Now, we will find the distance of each entry in the dataset from the centroids. You can use any
distance metric, such as Euclidean distance, Manhattan distance, or squared Euclidean distance.
3. After finding the distance of each data entry from the centroids, we will start assigning the data
points to clusters. We will assign each data point to the cluster whose centroid it is closest to.
4. After assigning the points to clusters, we will calculate the new centroid of the clusters. For this,
we will use the mean of each data point in the same cluster as the new centroid. If the newly
created centroids are the same as the centroids in the previous iteration, we will consider the
current clusters to be final. Hence, we will stop the execution of the algorithm. If any of the newly
created centroids is different from the centroids in the previous iteration, we will go to step 2.
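In practice these steps are rarely coded by hand. The sketch below is a minimal illustration using scikit-learn (it assumes scikit-learn and NumPy are installed) on the 15 points of the numerical example that follows; KMeans performs the initialization, assignment, and centroid-update loop internally.

```python
# Minimal k-means sketch with scikit-learn (assumes sklearn/numpy are installed).
# The 15 points are the A1-A15 coordinates from the numerical example below.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([
    [2, 10], [2, 6],  [11, 11], [6, 9],   [6, 4],
    [1, 2],  [5, 10], [4, 9],   [10, 12], [7, 5],
    [9, 11], [4, 6],  [3, 10],  [3, 8],   [6, 11],
])

# K = 3 clusters; n_init restarts the algorithm with different random centroids
# and keeps the best run, which mitigates a poor initial choice of centroids.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

print("Cluster label of each point:", kmeans.labels_)
print("Final centroids:\n", kmeans.cluster_centers_)
print("Inertia (sum of squared distances to centroids):", kmeans.inertia_)
```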
 Applications of K-means Clustering in Machine Learning
 K-means clustering algorithm finds its applications in various domains. Following are some of the
popular applications of k-means clustering.
• Document Classification: Using k-means clustering, we can divide documents into various
clusters based on their content, topics, and tags.
• Customer segmentation: Supermarkets and e-commerce websites divide their customers into
various clusters based on their transaction data and demography. This helps the business to target
appropriate customers with relevant products to increase sales.
• Cyber profiling: In cyber profiling, we collect data from individuals as well as groups to identify
their relationships. With k-means clustering, we can easily make clusters of people based on their
connection to each other to identify any available patterns.
• Image segmentation: We can use k-means clustering to perform image segmentation by grouping
similar pixels into clusters.
• Fraud detection in banking and insurance: By using historical data on frauds, banks and
insurance agencies can predict potential frauds by the application of k-means clustering.
 K-means Clustering Numerical Example with Solution
 Now that we have discussed the algorithm, let us solve a numerical problem on k means clustering.
The problem is as follows. You are given 15 points in the Cartesian coordinate system as follows.

Point Coordinates
A1 (2,10)
A2 (2,6)
A3 (11,11)
A4 (6,9)
A5 (6,4)
A6 (1,2)
A7 (5,10)
A8 (4,9)
A9 (10,12)
A10 (7,5)
A11 (9,11)
A12 (4,6)
A13 (3,10)
A14 (3,8)

A15 (6,11)

INPUT DATA SET


 We are also given the information that we need to make 3 clusters, i.e. K = 3. We will solve
this numerical on k-means clustering using the approach discussed below.
 First, we will randomly choose 3 centroids from the given data. Let us consider A2 (2,6), A7 (5,10), and A15
(6,11) as the centroids of the initial clusters. Hence, we will consider that
• Centroid 1 = (2,6) is associated with cluster 1.
• Centroid 2 = (5,10) is associated with cluster 2.
• Centroid 3 = (6,11) is associated with cluster 3.
 Now we will find the Euclidean distance between each point and the centroids. Based on the minimum
distance of each point from the centroids, we will assign the points to a cluster. The distances of the
given points from the centroids are tabulated in the following table.

Point   Distance from Centroid 1 (2,6)   Distance from Centroid 2 (5,10)   Distance from Centroid 3 (6,11)   Assigned Cluster

A1 (2,10) 4 3 4.123106 Cluster 2


A2 (2,6) 0 5 6.403124 Cluster 1
A3 (11,11) 10.29563 6.082763 5 Cluster 3
A4 (6,9) 5 1.414214 2 Cluster 2
A5 (6,4) 4.472136 6.082763 7 Cluster 1
A6 (1,2) 4.123106 8.944272 10.29563 Cluster 1
A7 (5,10) 5 0 1.414214 Cluster 2
A8 (4,9) 3.605551 1.414214 2.828427 Cluster 2
A9 (10,12) 10 5.385165 4.123106 Cluster 3
A10 (7,5) 5.09902 5.385165 6.082763 Cluster 1
A11 (9,11) 8.602325 4.123106 3 Cluster 3
A12 (4,6) 2 4.123106 5.385165 Cluster 1
A13 (3,10) 4.123106 2 3.162278 Cluster 2
A14 (3,8) 2.236068 2.828427 4.242641 Cluster 1
A15 (6,11) 6.403124 1.414214 0 Cluster 3

Results from 1st iteration of K means clustering


 At this point, we have completed the first iteration of the k-means clustering algorithm and assigned
each point into a cluster.

 In the above table, you can observe that the point that is closest to the centroid of a given cluster is
assigned to the cluster.

 Now, we will calculate the new centroid for each cluster.

 In cluster 1, we have 6 points i.e. A2 (2,6), A5 (6,4), A6 (1,2), A10 (7,5), A12 (4,6), A14 (3,8). To
calculate the new centroid for cluster 1, we will find the mean of the x and y coordinates of each point
in the cluster. Hence, the new centroid for cluster 1 is (3.833, 5.167).
 In cluster 2, we have 5 points i.e. A1 (2,10), A4 (6,9), A7 (5,10) , A8 (4,9), and A13 (3,10). Hence, the
new centroid for cluster 2 is (4, 9.6)
 In cluster 3, we have 4 points i.e. A3 (11,11), A9 (10,12), A11 (9,11), and A15 (6,11). Hence, the new
centroid for cluster 3 is (9, 11.25).
 Now that we have calculated new centroids for each cluster, we will calculate the distance of each data
point from the new centroids. Then, we will assign the points to clusters based on their distance from
the centroids. The results for this process have been given in the following table.
Point   Distance from Centroid 1 (3.833, 5.167)   Distance from Centroid 2 (4, 9.6)   Distance from Centroid 3 (9, 11.25)   Assigned Cluster

A1 (2,10) 5.169 2.040 7.111 Cluster 2


A2 (2,6) 2.013 4.118 8.750 Cluster 1
A3 (11,11) 9.241 7.139 2.016 Cluster 3
A4 (6,9) 4.403 2.088 3.750 Cluster 2
A5 (6,4) 2.461 5.946 7.846 Cluster 1
A6 (1,2) 4.249 8.171 12.230 Cluster 1
A7 (5,10) 4.972 1.077 4.191 Cluster 2
A8 (4,9) 3.837 0.600 5.483 Cluster 2
A9 (10,12) 9.204 6.462 1.250 Cluster 3
A10 (7,5) 3.171 5.492 6.562 Cluster 1
A11 (9,11) 7.792 5.192 0.250 Cluster 3
A12 (4,6) 0.850 3.600 7.250 Cluster 1
A13 (3,10) 4.904 1.077 6.129 Cluster 2
A14 (3,8) 2.953 1.887 6.824 Cluster 2
A15 (6,11) 6.223 2.441 3.010 Cluster 2
Results from 2nd iteration of K means clustering
 Now, we have completed the second iteration of the k-means clustering algorithm and assigned each
point into an updated cluster. In the above table, you can observe that the point closest to the new
centroid of a given cluster is assigned to the cluster.
 Now, we will calculate the new centroid for each cluster for the third iteration.
• In cluster 1, we have 5 points i.e. A2 (2,6), A5 (6,4), A6 (1,2), A10 (7,5), and A12 (4,6). To calculate the
new centroid for cluster 1, we will find the mean of the x and y coordinates of each point in the cluster.
Hence, the new centroid for cluster 1 is (4, 4.6).
• In cluster 2, we have 7 points i.e. A1 (2,10), A4 (6,9), A7 (5,10) , A8 (4,9), A13 (3,10), A14 (3,8), and
A15 (6,11). Hence, the new centroid for cluster 2 is (4.143, 9.571)
• In cluster 3, we have 3 points i.e. A3 (11,11), A9 (10,12), and A11 (9,11). Hence, the new centroid for
cluster 3 is (10, 11.333).
 At this point, we have calculated new centroids for each cluster. Now, we will calculate the distance of
each data point from the new centroids. Then, we will assign the points to clusters based on their
distance from the centroids. The results for this process have been given in the following table.
Point   Distance from Centroid 1 (4, 4.6)   Distance from Centroid 2 (4.143, 9.571)   Distance from Centroid 3 (10, 11.333)   Assigned Cluster

A1 (2,10) 5.758 2.186 8.110 Cluster 2


A2 (2,6) 2.441 4.165 9.615 Cluster 1
A3 (11,11) 9.485 7.004 1.054 Cluster 3
A4 (6,9) 4.833 1.943 4.631 Cluster 2
A5 (6,4) 2.088 5.872 8.353 Cluster 1
A6 (1,2) 3.970 8.197 12.966 Cluster 1
A7 (5,10) 5.492 0.958 5.175 Cluster 2
A8 (4,9) 4.400 0.589 6.438 Cluster 2
A9 (10,12) 9.527 6.341 0.667 Cluster 3
A10 (7,5) 3.027 5.390 7.008 Cluster 1
A11 (9,11) 8.122 5.063 1.054 Cluster 3
A12 (4,6) 1.400 3.574 8.028 Cluster 1
A13 (3,10) 5.492 1.221 7.126 Cluster 2
A14 (3,8) 3.544 1.943 7.753 Cluster 2
A15 (6,11) 6.705 2.343 4.014 Cluster 2
 Now, we have completed the third iteration of the k-means clustering algorithm and assigned each point into an
updated cluster. In the above table, you can observe that the point that is closest to the new centroid of a given
cluster is assigned to that cluster.
 Again, we calculate the new centroid for each cluster.
• In cluster 1, we have 5 points i.e. A2 (2,6), A5 (6,4), A6 (1,2), A10 (7,5), and A12 (4,6). Hence, the new
centroid for cluster 1 is (4, 4.6).
• In cluster 2, we have 7 points i.e. A1 (2,10), A4 (6,9), A7 (5,10), A8 (4,9), A13 (3,10), A14 (3,8), and A15
(6,11). Hence, the new centroid for cluster 2 is (4.143, 9.571).
• In cluster 3, we have 3 points i.e. A3 (11,11), A9 (10,12), and A11 (9,11). Hence, the new centroid for cluster 3
is (10, 11.333).
 These newly computed centroids are identical to the centroids from the previous iteration, so the cluster
assignments can no longer change. Hence, the clusters are final and the algorithm stops, with Cluster 1 = {A2, A5,
A6, A10, A12}, Cluster 2 = {A1, A4, A7, A8, A13, A14, A15}, and Cluster 3 = {A3, A9, A11}.
 Advantages of K-means Clustering Algorithm
 Following are some of the advantages of the k-means clustering algorithm.
• Easy to implement: K-means clustering is an iterative and relatively simple
algorithm. In fact, we can also perform k-means clustering manually, as we did in the
numerical example.
• Scalability: We can use k-means clustering on anything from 10 records to 10 million
records in a dataset. It will give us results in both cases.
• Convergence: The k-means clustering algorithm is guaranteed to give us results. It
guarantees convergence. Thus, we will get the result of the execution of the algorithm
for sure.
• Generalization: K-means clustering doesn’t apply to a specific problem. From
numerical data to text documents, you can use the k-means clustering algorithm on any
dataset to perform clustering. It can also be applied to datasets of different sizes having
entirely different distributions in the dataset. Hence, this algorithm is completely
generalized.
• Choice of centroids: You can warm-start the choice of centroids in an easy manner.
Hence, the algorithm allows you to choose and assign centroids that fit well with the
dataset.
 Disadvantages of K-means Clustering Algorithm
 With all the advantages, the k-means algorithm has certain disadvantages too which are discussed below.
• Deciding the number of clusters: In k-means clustering, you need to decide the number of clusters in advance,
typically by using the elbow method (a short sketch of the elbow method is given after this list).
• Choice of initial centroids: The number of iterations in the clustering process completely depends on the choice
of centroids. Hence, you need to properly choose the centroids in the initial step for maximizing the efficiency of
the algorithm.
• Effect of outliers: In the execution of the k-means clustering algorithm, we use all the points in a cluster to
determine the centroids for the next iteration. If there are outliers in the dataset, they highly affect the position of
the centroids. Due to this, the clustering becomes inaccurate. To avoid this, you can try to identify outliers and
remove them in the data cleaning process.
• Curse of Dimensionality: If the number of dimensions in the dataset increases, the distance of the data points
from a given point starts converging to a specific value. Due to this, k-means clustering that calculates the
clusters based on the distance between the points becomes inefficient. To overcome this problem, you can use
advanced clustering algorithms like spectral clustering.
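As mentioned in the first point of the list above, K has to be chosen in advance, usually with the elbow method: fit k-means for several values of K and pick the value where inertia stops dropping sharply. The sketch below is a minimal illustration, assuming scikit-learn and matplotlib are available and using synthetic toy data.

```python
# Elbow-method sketch: plot inertia for K = 1..8 and pick the "elbow".
# Assumes scikit-learn and matplotlib are installed; X is any (n_samples, 2) array.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy data: three loose groups shifted along the diagonal.
X = rng.normal(size=(300, 2)) + rng.choice([0, 6, 12], size=(300, 1))

inertias = []
ks = range(1, 9)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("Inertia")
plt.title("Elbow method: choose K where the curve bends")
plt.show()
```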
Dunn Index

 Inertia calculates the sum of the (squared) distances of all the points within a cluster
from the centroid of that cluster.

 if the distance between the centroid of a cluster and the points in that cluster is
small, it means that the points are closer to each other. So, inertia makes sure
that the first property of clusters is satisfied. But it does not care about the
second property – that different clusters should be as different from each other
as possible.
 This is where the Dunn index comes into action.
Along with the distance between the centroid and points, the Dunn index
also takes into account the distance between two clusters. This
distance between the centroids of two different clusters is known as inter-
cluster distance.
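Concretely, the Dunn index is usually defined as the minimum inter-cluster distance divided by the maximum intra-cluster diameter, so larger values indicate compact, well-separated clusters. The sketch below is a simple NumPy illustration of inertia and one common variant of the Dunn index (using centroid-to-centroid inter-cluster distances), computed on the final clusters of the k-means example above.

```python
# Inertia and a simple Dunn index for the final clusters of the k-means example.
# Inter-cluster distance is taken between centroids; diameter is the largest
# pairwise distance inside a cluster (one common variant of the Dunn index).
import numpy as np
from itertools import combinations

clusters = {
    1: np.array([[2, 6], [6, 4], [1, 2], [7, 5], [4, 6]]),                      # A2, A5, A6, A10, A12
    2: np.array([[2, 10], [6, 9], [5, 10], [4, 9], [3, 10], [3, 8], [6, 11]]),  # A1, A4, A7, A8, A13, A14, A15
    3: np.array([[11, 11], [10, 12], [9, 11]]),                                 # A3, A9, A11
}
centroids = {c: pts.mean(axis=0) for c, pts in clusters.items()}

# Inertia: sum of squared distances of every point to its own centroid.
inertia = sum(((pts - centroids[c]) ** 2).sum() for c, pts in clusters.items())

# Dunn index: min distance between cluster centroids / max intra-cluster diameter.
min_inter = min(np.linalg.norm(centroids[a] - centroids[b]) for a, b in combinations(clusters, 2))
max_diam = max(
    max(np.linalg.norm(p - q) for p, q in combinations(pts, 2))
    for pts in clusters.values()
)

print(f"Inertia: {inertia:.3f}")
print(f"Dunn index: {min_inter / max_diam:.3f}")
```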
Hierarchical Clustering
 Hierarchical clustering is another unsupervised machine learning algorithm, which is used to group
the unlabelled datasets into a cluster and also known as hierarchical cluster analysis or HCA.
 Hierarchical clustering is an unsupervised machine learning algorithm used to group data points into
various clusters based on the similarity between them. It is based on the idea of creating a hierarchy
of clusters, where each cluster is made up of smaller clusters that can be further divided into even
smaller clusters. This hierarchical structure makes it easy to visualize the data and identify patterns
within the data.

 In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this tree-shaped
structure is known as the dendrogram.

The hierarchical clustering technique has two approaches:

1. Agglomerative clustering

2. Divisive clustering
Agglomerative
 Agglomerative is a bottom-up approach, in which the algorithm starts with taking all data points as single
clusters and merging them until one cluster is left.
 A bottom-up approach where each data point starts as its own cluster and merges with the closest cluster
progressively.

 Divisive algorithm is the reverse of the agglomerative algorithm as it is a top-down approach.


 Agglomerative clustering is a type of data clustering method used in unsupervised learning. It is an iterative
process that groups similar objects into clusters based on some measure of similarity. Agglomerative
clustering uses a bottom-up approach for dividing data points into clusters. It starts with individual data
points and merges them into larger clusters until all of the objects are clustered together.
 The algorithm begins by assigning each object to its own cluster. It then uses a distance metric to determine
the similarity between objects and clusters. If two clusters have similar elements, they are merged together
into a larger cluster. This continues until all objects are grouped into one final cluster. For example,
consider the following image.
Agglomerative Hierarchical clustering

 This means that the algorithm considers each data point as a single cluster at the
beginning, and then starts combining the closest pairs of clusters. It does this until
all the clusters are merged into a single cluster that contains all the data points.
 Key Features of Agglomerative Clustering:

• Hierarchical structure: It generates a hierarchy of clusters, typically visualized


using a dendrogram.

• Distance metric: Determines how similar two clusters or data points are.

• Linkage criterion: Determines how the distance between clusters is measured.


 Calculation of Distance Between Two Clusters
 The distance between clusters in agglomerative clustering can be calculated using three
approaches namely single linkage, complete linkage, and average linkage.
• In the single linkage approach, we take the distance between the nearest points in two
clusters as the distance between the clusters.
• In the complete linkage approach, we take the distance between the farthest points in two
clusters as the distance between the clusters.
• In the average linkage approach, we take the average distance between each pair of points in
two given clusters as the distance between the clusters. You can also take the distance
between the centroids of the clusters as their distance from each other.
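These linkage criteria map directly onto SciPy's hierarchical-clustering utilities. The following is a minimal sketch, assuming SciPy and matplotlib are installed; "single", "complete", and "average" correspond to the three approaches described above.

```python
# Agglomerative clustering sketch with SciPy: build the linkage matrix,
# cut the tree into 3 clusters, and draw the dendrogram.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.array([
    [2, 10], [2, 6],  [11, 11], [6, 9],   [6, 4],
    [1, 2],  [5, 10], [4, 9],   [10, 12], [7, 5],
])

# method can be "single", "complete", or "average" (the three linkage criteria above)
Z = linkage(X, method="average", metric="euclidean")

labels = fcluster(Z, t=3, criterion="maxclust")   # cut the hierarchy into 3 clusters
print("Cluster label of each point:", labels)

dendrogram(Z)                                     # tree-shaped view of the merges
plt.xlabel("Data point index")
plt.ylabel("Merge distance")
plt.show()
```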
Agglomerative Hierarchical Clustering Example
 Workflow for Hierarchical Agglomerative clustering

1. Start with individual points: Each data point is its own cluster. For example if you have 5 data points you
start with 5 clusters each containing just one data point.

2. Calculate distances between clusters: Calculate the distance between every pair of clusters. Initially since
each cluster has one point this is the distance between the two data points.

3. Merge the closest clusters: Identify the two clusters with the smallest distance and merge them into a single
cluster.

4. Update distance matrix: After merging you now have one less cluster. Recalculate the distances between the
new cluster and the remaining clusters.

5. Repeat steps 3 and 4: Keep merging the closest clusters and updating the distance matrix until you have only
one cluster left.

6. Create a dendrogram: As the process continues you can visualize the merging of clusters using a tree-like
diagram called a dendrogram. It shows the hierarchy of how clusters are merged.
Step-1: Create each data point as a single cluster. Let's say there
are N data points, so the number of clusters will also be N.
Step-2: Take two closest data points or clusters and merge them to form
one cluster. So, there will now be N-1 clusters.
Step-3: Again, take the two closest clusters and merge them together
to form one cluster. There will be N-2 clusters.
Step-4: Repeat Step 3 until only one cluster is left. So, we will get the following
clusters. Consider the below images:
 Step-5: Once all the clusters are combined into one big cluster, develop the dendrogram to
divide the clusters as per the problem.
 The dendrogram is a tree-like structure that is mainly used to store each step as a memory
that the HC algorithm performs.
 In the dendrogram plot, the Y-axis shows the Euclidean distances between the data points,
and the x-axis shows all the data points of the given dataset.
 What is Divisive Clustering?
 Divisive clustering is also a type of hierarchical clustering that is used to create clusters of data points. It is
an unsupervised learning algorithm that begins by placing all the data points in a single cluster and then
progressively splits the clusters until each data point is in its own cluster. Divisive clustering is useful for
analyzing datasets that may have complex structures or patterns, as it can help identify clusters that may
not be obvious at first glance.
 Divisive clustering works by first assigning all the data points to one cluster. Then, it looks for ways to split
this cluster into two or more smaller clusters. This process continues until each data point is in its own
cluster. For example, consider the following image.

Divisive Clustering Example


 Advantages of Hierarchical Clustering
1. Robustness: Hierarchical clustering is more robust than other methods since it does not require a predetermined number of clusters to be
specified. Instead, it creates hierarchical clusters based on the similarity between the objects, which makes it more reliable and accurate.
2. Easy to interpret: Hierarchical clustering produces a tree-like structure that is easy to interpret and understand. This makes it ideal for data
analysis as it can provide insights into the data without requiring complex algorithms or deep learning models.
3. Flexible: Hierarchical clustering is a flexible method that can be used on any type of data. It can also be used with different types of similarity
functions and distance measures, allowing for customization based on the application at hand.
4. Scalable: Hierarchical clustering is a scalable method that can easily handle large datasets without becoming computationally expensive or time-
consuming. This makes it suitable for applications such as customer segmentation where large datasets need to be processed quickly and
accurately.
5. Visualization: Hierarchical clustering produces a visual tree structure that can be used to gain insights into the data quickly and easily. This
makes it an ideal choice for exploratory data analysis as it allows researchers to gain an understanding of the data at a glance.
6. Versatile: Hierarchical clustering can be used for both supervised and unsupervised learning tasks, making it extremely versatile in its range of
applications.
7. Easier to apply: Since there are no parameters to specify in hierarchical clustering, it is much easier to apply compared to other methods such
as k-means clustering or k-prototypes clustering. This makes it ideal for novice users who need to quickly apply clustering techniques with
minimal effort.
8. Greater accuracy: Hierarchical clustering often tends to produce superior results compared to other methods of clustering due to its ability to
create more meaningful clusters based on similarities between objects rather than arbitrary boundaries set by cluster centroids or other
parameters.
9. Non-linearity: Agglomerative or divisive clustering can handle non-linear datasets better than other methods, which makes it suitable for cases
where linearity cannot be assumed in the dataset being analyzed.
10. Multiple-level output: By producing a hierarchical tree structure, hierarchical clustering provides multiple levels of output which allows users to
view data at different levels of detail depending on their needs. This flexibility makes it an attractive choice in many situations where multiple
levels of analysis are required.
 Disadvantages of Hierarchical Clustering

 1. Computational Complexity: Hierarchical clustering can be computationally expensive, especially for


large datasets.
 2. Sensitive to Distance Metric: The choice of distance metric can significantly affect the clustering results.
 3. Difficult to Interpret: The dendrogram can be difficult to interpret for large datasets.
 4. No Clear Cluster Boundaries: Hierarchical clustering does not provide clear cluster boundaries, making it
difficult to determine the optimal number of clusters.
 5. Sensitive to Outliers: Hierarchical clustering can be sensitive to outliers, which can affect the clustering
results.
 6. Not Suitable for High-Dimensional Data: Hierarchical clustering can be challenging for high-dimensional
data, as the distance metric may not be effective.
 7. Difficult to Handle Non-Spherical Clusters: Hierarchical clustering can struggle with non-spherical
clusters or clusters with varying densities.
 8. Requires Careful Selection of Parameters: Hierarchical clustering requires careful selection of
parameters, such as the distance metric and linkage method.
 9. Can be Affected by Noise: Hierarchical clustering can be affected by noise in the data, which can lead to
incorrect clustering results.
 10. Not Suitable for Real-Time Data: Hierarchical clustering can be challenging for real-time data, as the
clustering results may not be updated quickly enough.
 Applications of Hierarchical Clustering
 Hierarchical clustering is a type of unsupervised machine learning that can be used for many different applications. It is used to group
similar data points into clusters, which can then be used for further analysis. Here are 10 applications of hierarchical clustering:
1. Customer segmentation: Agglomerative or divisive clustering can be used to group customers into different clusters based on their
demographic, spending, and other characteristics. This can be used to better understand customer behavior and to target marketing
campaigns.
2. Image segmentation: Hierarchical clustering can be used to segment images into different regions, which can then be used for further
analysis.
3. Text analysis: Hierarchical clustering can be used to group text documents based on their content, which can then be used for text mining or
text classification tasks.
4. Gene expression analysis: Hierarchical clustering can be used to group genes based on their expression levels, which can then be used to
better understand gene expression patterns.
5. Anomaly detection: Hierarchical clustering can be used to detect anomalies in data, which can then be used for fraud detection or other
tasks.
6. Recommendation systems: Hierarchical clustering can be used to group users based on their preferences, which can then be used to
recommend items to them.
7. Risk assessment: Agglomerative or divisive clustering can be used to group different risk factors in order to better understand the overall
risk of a portfolio.
8. Network analysis: Hierarchical clustering can be used to group nodes in a network based on their connections, which can then be used to
better understand network structures.
9. Market segmentation: Hierarchical clustering can be used to group markets into different segments, which can then be used to target
different products or services to them.
10. Outlier detection: Hierarchical clustering can be used to detect outliers in data, which can then be used for further analysis.
Probabilistic Clustering
 Probabilistic clustering is a type of clustering algorithm that assigns data points
to clusters based on the probability that they belong to each cluster.

 Probabilistic hierarchical clustering is a machine-learning technique that groups


similar data points into clusters. This method assigns probabilities to the
likelihood of a point belonging to each cluster.
Gaussian mixture models (GMMs)

 Gaussian mixture models (GMMs) are a type of machine learning algorithm. They
are used to classify data into different categories based on the probability
distribution.
 The Gaussian Mixture Model (GMM) is defined as a mixture model that combines several
probability distributions whose parameters are not specified in advance.
 GMM therefore requires estimating statistics such as the mean and standard deviation (the
parameters) of each component.

 It is used to estimate the parameters of the probability distributions to best fit the
density of a given training dataset.

 There are plenty of techniques available to estimate the parameters of a Gaussian
Mixture Model (GMM); Maximum Likelihood Estimation is one of the most
popular among them.
The Expectation-Maximization (EM) algorithm is one of the best techniques for estimating the
parameters of the Gaussian distributions.

In the EM algorithm, the E-step estimates the expected value of each latent variable, whereas the M-step
optimizes the parameters using Maximum Likelihood Estimation (MLE).

This process is repeated until a good set of latent values is obtained and a maximum-likelihood fit to the
data is achieved.
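The EM loop does not need to be hand-coded in practice: scikit-learn's GaussianMixture runs the E-step and M-step internally until convergence. The sketch below is a minimal illustration under that assumption, fitting two Gaussian components to synthetic data.

```python
# GMM sketch: scikit-learn's GaussianMixture fits the mixture with the EM algorithm.
# Assumes scikit-learn and NumPy are installed; the data are two synthetic blobs.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=1.0, size=(200, 2)),   # component 1
    rng.normal(loc=[5, 5], scale=1.5, size=(200, 2)),   # component 2
])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(X)            # E-step / M-step iterations happen inside fit()

print("Estimated means:\n", gmm.means_)
print("Estimated mixing weights:", gmm.weights_)
print("Soft (probabilistic) assignment of the first point:", gmm.predict_proba(X[:1]))
```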
Advantages of EM algorithm

• It is very easy to implement the first two basic steps of the EM algorithm in
various machine learning problems, which are E-step and M- step.

• It is mostly guaranteed that likelihood will enhance after each iteration.

• It often generates a solution for the M-step in the closed form.


Disadvantages of EM algorithm

• The convergence of the EM algorithm is very slow.

• It can make convergence for the local optima only.

• It takes both forward and backward probability into consideration. It is opposite


to that of numerical optimization, which takes only forward probabilities.
Association Rule Learning
 It is a type of unsupervised learning technique that checks for the dependency of one data item on
another data item and maps them accordingly, so that the relationship can be used profitably.
 It is a rule-based machine learning method for discovering interesting relations between variables in
large databases.
 Association rule mining finds interesting associations and relationships among large sets of data
items.
 This rule shows how frequently an itemset occurs in a transaction.
 A typical example is Market Basket Analysis. Market Basket Analysis is one of the key techniques
used by large retailers to uncover associations between items.
 Association rule mining is a powerful technique that can be applied to many different types of
datasets. It is commonly used in market basket analysis to identify products that are frequently
purchased together, but it can also be applied to other domains such as healthcare, finance, and
social media.
 Example: Bread, Milk
 Types of Association Rule Learning
 (i) Apriori algorithm
 (ii) F-P growth algorithm
 (iii) Eclat algorithm
 How does Association Rule Learning work?

 Association rule learning works on the concept of an If-Then statement, such as: if A, then B.

 Here the "If" element is called the antecedent, and the "Then" statement is called the consequent.

 Relationships in which we can find an association between two items are known as single cardinality.
It is all about creating rules, and as the number of items increases, the cardinality increases accordingly. So,
to measure the associations between thousands of data items, several metrics are used. These metrics are
given below:

• Support

• Confidence

• Lift
 1. Support (frequency of occurrence)

 Support is the frequency of an item, i.e. how frequently it appears in the dataset. It is defined as the fraction of the transactions T that
contain the itemset X. For an itemset X and a total of T transactions, it can be written as:

 Supp(X) = Freq(X) / T

 2. Confidence (conditional probability)

 Confidence indicates how often the rule has been found to be true, i.e. how often the items X and Y occur together in the dataset given
that X has already occurred. It is the ratio of the transactions that contain both X and Y to the number of transactions that contain X.

 Confidence = Freq(X, Y) / Freq(X)

 3. Lift

 Lift is the strength of a rule. It is the ratio of the observed support to the support expected if X and Y were independent of each other:

 Lift = Supp(X, Y) / (Supp(X) * Supp(Y))

It has three possible cases:

• If Lift = 1: the occurrence of the antecedent and the consequent are independent of each other.

• If Lift > 1: the two itemsets are positively dependent on each other (they occur together more often than expected).

• If Lift < 1: one item is a substitute for the other, which means one item has a negative effect on the presence of the other.
Example

 Suppose you have 4000 customer transactions in a Big Bazar. You have to
calculate the Support, Confidence, and Lift for two products, and you may say
Biscuits and Chocolate. This is because customers frequently buy these two
items together.

 Out of 4000 transactions, 400 contain Biscuits and 600 contain Chocolate, and 200
of the transactions contain both Biscuits and Chocolates.
Using this data, we will find the support, confidence, and lift.
Support

 Support refers to the default popularity of any product. You find the support by dividing
the number of transactions containing that product by the total number of transactions.

 Support (Biscuits) = (Transactions relating biscuits) / (Total transactions)

 Support (Biscuits) = 400/ 4000 = 10 %


Confidence

 Confidence refers to the likelihood that a customer who bought biscuits also bought
chocolates. So, you need to divide the number of transactions that comprise both
biscuits and chocolates by the total number of transactions that contain biscuits to
get the confidence.

 Confidence = (Transactions relating both biscuits and Chocolate) / (Total


transactions involving Biscuits)

 Confidence = 200/400 = 50%

It means that 50 percent of customers who bought biscuits bought chocolates also.
Lift

 Lift refers to the increase in the ratio of the sale of chocolates when you sell
biscuits. Mathematically:

 Lift = Confidence (Biscuits → Chocolates) / Support (Chocolates)

 Lift = 50% / 15% ≈ 3.33
Conclusion

 It means that customers who buy biscuits are about 3.3 times more likely to also buy
chocolates than a randomly chosen customer, so the two items are positively associated.

 If the lift value had been below one, it would mean that people are unlikely to buy both
items together.

 The larger the value, the stronger the combination.
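The arithmetic above can be checked with a few lines of plain Python (no libraries assumed):

```python
# Support, confidence, and lift for the biscuits/chocolates example (plain Python).
total_transactions = 4000
biscuits = 400            # transactions containing biscuits
chocolates = 600          # transactions containing chocolates
both = 200                # transactions containing both

support_biscuits   = biscuits / total_transactions            # 0.10 -> 10%
support_chocolates = chocolates / total_transactions          # 0.15 -> 15%
support_both       = both / total_transactions                # 0.05 -> 5%

# Confidence of the rule Biscuits -> Chocolates
confidence = support_both / support_biscuits                  # 0.50 -> 50%

# Lift of the rule Biscuits -> Chocolates
lift = support_both / (support_biscuits * support_chocolates)  # ~3.33

print(f"Support(Biscuits)  = {support_biscuits:.0%}")
print(f"Confidence(B -> C) = {confidence:.0%}")
print(f"Lift(B -> C)       = {lift:.2f}")
```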


Apriori Algorithm
 Apriori algorithm refers to the algorithm that is used to calculate the association rules between objects. It means how
two or more objects are related to one another.
 It is used to find frequent item sets in a transaction database and generate association rules based on those item sets.
The algorithm was first introduced by Rakesh Agrawal and Ramakrishnan Srikant in 1994.
 This algorithm uses frequent itemsets to generate association rules. It is designed to work on databases that contain
transactions. The algorithm uses a breadth-first search and a hash tree to count itemsets efficiently. It is mainly
used for market basket analysis and helps to understand which products can be bought together. It can also be used in
the healthcare field to find drug reactions for patients.
 The Apriori algorithm works by iteratively scanning the database to find frequent item sets of increasing size. It uses a
"bottom-up" approach, starting with individual items and gradually adding more items to the candidate item sets until
no more frequent item sets can be found.

 We can also say that the apriori algorithm is an association rule learning that analyzes that people who bought product
A also bought product B.
 Name of the algorithm is Apriori because it uses prior knowledge of frequent itemset properties.

 Generally, you operate the Apriori algorithm on a database that consists of a huge number of transactions.
What is Apriori Algorithm?
 The Apriori algorithm refers to an algorithm that is used in mining frequent item sets and the relevant
association rules.

 Generally, the apriori algorithm operates on a database containing a huge number of transactions. For
example, the items customers buy at a Big Bazar.
 Frequent Item Set
 Frequent Itemset is an itemset whose support value is greater than a threshold value(support).
 Apriori algorithm uses frequent itemsets to generate association rules. To improve the efficiency of level-
wise generation of frequent itemsets, an important property is used called Apriori property which helps by
reducing the search space.

 Apriori algorithm helps the customers to buy their products with ease and increases the sales
performance of the particular store.
 The Apriori algorithm is a frequent pattern mining algorithm used in market basket analysis. We use the
apriori algorithm to generate frequent itemsets in a transaction dataset. It is an iterative algorithm that we
used to generate frequent itemsets starting from a set of one item to create bigger itemsets.
 How the Apriori Algorithm Works?
 The Apriori Algorithm operates through a systematic process that involves several key steps:

1. Identifying Frequent Itemsets: The algorithm begins by scanning the dataset to identify individual items (1-item) and
their frequencies. It then establishes a minimum support threshold, which determines whether an itemset is considered
frequent.

2. Generating Candidate Itemsets: Once frequent 1-itemsets (single items) are identified, the algorithm generates candidate
2-itemsets by combining frequent items. This process continues iteratively, forming larger itemsets (k-itemsets) until
no more frequent itemsets can be found.

3. Removing Infrequent Itemsets: The algorithm employs a pruning technique based on the Apriori property, which
states that if an itemset is infrequent, all its supersets must also be infrequent. This significantly reduces the number of
combinations that need to be evaluated.

4. Generating Association Rules: After identifying frequent itemsets, the algorithm generates association rules that
illustrate how items relate to one another, using metrics like support, confidence, and lift to evaluate the strength of these
relationships.

These metrics are given below:

SUPPORT

CONFIDENCE

LIFT
 APRIORI ALGORITHM STEPS:-
 Step 1: Data Preparation Collect and prepare the transactional data, which includes a set of
items and a set of transactions.
 Step 2: Support Calculation Calculate the support for each item in the dataset, which is the
proportion of transactions that contain the item.
 Step 3: Frequent Itemset Generation Generate the frequent itemsets, which are the itemsets
that meet the minimum support threshold.
 Step 4: Candidate Generation Generate candidate itemsets of size k+1 from the frequent
itemsets of size k.
 Step 5: Support Counting Count the support for each candidate itemset.
 Step 6: Pruning Prune the candidate itemsets that do not meet the minimum support
threshold.
 Step 7: Frequent Itemset Generation (again)Generate the frequent itemsets of size k+1 from
the pruned candidate itemsets.
 Step 8: Association Rule Generation Generate association rules from the frequent itemsets.
 Step 9: Rule Pruning Prune the association rules that do not meet the minimum confidence
threshold. Output the resulting association rules.
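The sketch below is a compact, plain-Python illustration of Steps 2 to 7 (support counting, candidate generation, and pruning), run on the T1-T6 transaction data used in the worked example that follows. It is a teaching sketch rather than an optimized implementation.

```python
# A compact, plain-Python sketch of the Apriori steps above (candidate generation,
# support counting, and pruning), run on the T1-T6 dataset of the worked example.
from itertools import combinations

transactions = [
    {1, 2, 3}, {2, 3, 4}, {4, 5}, {1, 2, 4}, {1, 2, 3, 5}, {1, 2, 3, 4},
]
min_sup = 3  # 50% of 6 transactions

def support_count(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

# Steps 2-3: frequent 1-itemsets.
items = sorted({i for t in transactions for i in t})
frequent = [{frozenset([i]) for i in items if support_count({i}) >= min_sup}]

# Steps 4-7: build (k+1)-item candidates from frequent k-itemsets, prune, and count.
k = 1
while frequent[-1]:
    candidates = {a | b for a in frequent[-1] for b in frequent[-1] if len(a | b) == k + 1}
    # Apriori property: drop candidates with any infrequent k-item subset.
    candidates = {c for c in candidates
                  if all(frozenset(s) in frequent[-1] for s in combinations(c, k))}
    frequent.append({c for c in candidates if support_count(c) >= min_sup})
    k += 1

for level, itemsets in enumerate(frequent[:-1], start=1):
    print(f"L{level}:", [sorted(s) for s in itemsets])
```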
 Apriori Algorithm Example

 Consider the following dataset and find frequent item sets and generate association rules for them.
Assume that minimum support threshold (s = 50%) and minimum confident threshold (c = 80%).

 LIST OF ITEMS (ASSUMED): 1 = Papaya, 2 = Orange, 3 = Banana, 4 = Apple, 5 = Grapes

Transaction List of items


T1 1, 2, 3

T2 2, 3, 4

T3 4, 5

T4 1, 2, 4

T5 1, 2, 3, 5

T6 1, 2, 3, 4
 Solution

 Finding frequent item sets:

 Support threshold=50% ⇒ 0.5*6 = 3 ⇒ min_sup = 3

 Step-1:

 (i) Create a table containing support count of each item present in dataset –
Called C1 (candidate set).

Item Count
1 4
2 5
3 4
4 4
5 2
 (ii) Prune Step: Compare each candidate itemset's support count with the minimum
support count. The above table shows that item 5 does not meet min_sup = 3,
thus it is removed; only items 1, 2, 3, and 4 meet the min_sup count.

 This gives us the following item set L1.

Item Count
1 4
2 5
3 4
4 4

Step-2:
(i) Join step: Generate candidate set C2 (2-itemset) using L1.And find out the occurrences of 2-
itemset from the given dataset.
Item Count
1, 2 4
1, 3 3
1, 4 2
2, 3 4
2, 4 3
3, 4 2

(ii) Prune Step: Compare each candidate itemset's support count with the minimum support count. The
above table shows that the item sets {1, 4} and {3, 4} do not meet min_sup = 3, thus they are
removed.
This gives us the following item set L2.
Item Count
1, 2 4
1, 3 3
2, 3 4
2, 4 3
 Step-3:

 (i) Join step: Generate candidate set C3 (3-itemset) using L2.And find out the
occurrences of 3-itemset from the given dataset.

Item Count
1, 2, 3 3
1, 2, 4 2
1, 3, 4 1
2, 3, 4 2

(ii) Prune Step: Compare each candidate itemset's support count with the minimum support count. The
above table shows that the itemsets {1, 2, 4}, {1, 3, 4}, and {2, 3, 4} do not meet min_sup = 3, thus
they are removed. Only the itemset {1, 2, 3} meets the min_sup count.
 Generate Association Rules:

 Thus, we have discovered all the frequent item-sets. Now we need to generate strong association rules (satisfies
the minimum confidence threshold) from frequent item sets. For that we need to calculate confidence of each
rule.

 The given Confidence threshold is 80%.

 All possible association rules from the frequent itemset {1, 2, 3} are:

 {1, 2} ⇒ {3}: Confidence = support{1, 2, 3} / support{1, 2} = (3/4) × 100 = 75% (Rejected)

 {1, 3} ⇒ {2}: Confidence = support{1, 2, 3} / support{1, 3} = (3/3) × 100 = 100% (Selected)

 {2, 3} ⇒ {1}: Confidence = support{1, 2, 3} / support{2, 3} = (3/4) × 100 = 75% (Rejected)

 {1} ⇒ {2, 3}: Confidence = support{1, 2, 3} / support{1} = (3/4) × 100 = 75% (Rejected)

 {2} ⇒ {1, 3}: Confidence = support{1, 2, 3} / support{2} = (3/5) × 100 = 60% (Rejected)

 {3} ⇒ {1, 2}: Confidence = support{1, 2, 3} / support{3} = (3/4) × 100 = 75% (Rejected)

 This shows that only the association rule {1, 3} ⇒ {2} is strong for a minimum confidence threshold of 80%.
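The rule-generation step above can be verified with a short plain-Python sketch that enumerates every antecedent/consequent split of the frequent itemset {1, 2, 3} and keeps only the rules meeting the 80% confidence threshold:

```python
# Rule generation for the frequent itemset {1, 2, 3} (plain Python).
from itertools import combinations

transactions = [
    {1, 2, 3}, {2, 3, 4}, {4, 5}, {1, 2, 4}, {1, 2, 3, 5}, {1, 2, 3, 4},
]
min_confidence = 0.8

def support_count(itemset):
    return sum(1 for t in transactions if set(itemset) <= t)

frequent_itemset = {1, 2, 3}
for r in range(1, len(frequent_itemset)):
    for antecedent in combinations(sorted(frequent_itemset), r):
        consequent = frequent_itemset - set(antecedent)
        confidence = support_count(frequent_itemset) / support_count(antecedent)
        verdict = "Selected" if confidence >= min_confidence else "Rejected"
        print(f"{set(antecedent)} => {consequent}: confidence = {confidence:.0%} ({verdict})")
```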
 Applications of Apriori Algorithm
 Below are some applications of Apriori algorithm used in today’s companies and startups

1. E-commerce: Used to recommend products that are often bought together, like laptop + laptop
bag, increasing sales.

2. Food Delivery Services: Identifies popular combos, such as burger + fries, to offer combo deals to
customers.

3. Streaming Services: Recommends related movies or shows based on what users often watch
together, like action + superhero movies.

4. Financial Services: Analyzes spending habits to suggest personalized offers, such as credit card
deals based on frequent purchases.

5. Travel & Hospitality: Creates travel packages (e.g., flight + hotel) by finding commonly purchased
services together.

6. Health & Fitness: Suggests workout plans or supplements based on users’ past activities,
like protein shakes + workouts.
FP Growth Algorithm
 The FP Growth algorithm in data mining is a popular method for frequent pattern mining.
 The algorithm is efficient for mining frequent item sets in large datasets. It works by constructing
a frequent pattern tree (FP-tree) from the input dataset.
 The FP Growth algorithm is a frequent pattern mining algorithm used in market basket analysis.
 The FP-Growth or Frequent Pattern Growth algorithm is an advancement to the apriori
algorithm. While using the apriori algorithm for association rule mining, we need to scan the
transaction dataset multiple times. In the FP growth algorithm, we just need to scan the dataset
twice.
 we also don’t need to generate candidate sets while generating the frequent itemsets.
 We create an FP-Tree and use it to determine the frequent itemsets. Thus, the FP-Growth
algorithm helps us perform frequent pattern mining with less computing resources and even
lesser time.
 What is an FP-Tree in FP Growth Algorithm?
 An FP-Tree is a tree data structure created from the transaction data while generating frequent
itemsets in the FP growth algorithm. To create an FP-Tree, we first scan the transaction dataset
and record the support count of each item. Then, we create a tree structure where each node in
the tree represents an item in the dataset and its frequency count. The root node has no
associated item and is used as a starting point for the tree. We denote the root node by None or
Null. The children of a node in the fp-tree represent the items that frequently co-occur with the
parent item in the dataset.

 To construct the tree efficiently, we first transform the dataset by sorting the items in each
transaction based on their support count. We do this to make sure that the frequent items appear
early in each transaction. This leads to more frequent items being near the root node resulting in
a compact and efficient tree.
 Shortcomings of Apriori Algorithm

1. Using Apriori needs a generation of candidate itemsets. These itemsets may be large in number if
the itemset in the database is huge.
2. Apriori needs multiple scans of the database to check the support of each itemset generated and
this leads to high costs.

 These shortcomings can be overcome using the FP growth algorithm.
 Frequent Pattern Growth Algorithm

 This algorithm is an improvement to the Apriori method. A frequent pattern is generated without the
need for candidate generation. FP growth algorithm represents the database in the form of a tree
called a frequent pattern tree or FP tree.
 This tree structure will maintain the association between the itemsets. The database is fragmented
using one frequent item. This fragmented part is called “pattern fragment”. The itemsets of these
fragmented patterns are analyzed. Thus with this method, the search for frequent itemsets is
reduced comparatively.
 FP Tree

 Frequent Pattern Tree is a tree-like structure that is made with the initial itemsets of the database.
The purpose of the FP tree is to mine the most frequent pattern. Each node of the FP tree
represents an item of the itemset.
 The root node represents null while the lower nodes represent the itemsets. The association of the
nodes with the lower nodes that is the itemsets with the other itemsets are maintained while forming
the tree.
 Frequent Pattern Algorithm Steps

 The frequent pattern growth method lets us find the frequent pattern without candidate generation.
 Let us see the steps followed to mine the frequent pattern using frequent pattern growth
algorithm:
 #1) The first step is to scan the database to find the occurrences of the itemsets in the database. This
step is the same as the first step of Apriori. The count of 1-itemsets in the database is called support
count or frequency of 1-itemset.
 #2) The second step is to construct the FP tree. For this, create the root of the tree. The root is
represented by null.
 #3) The next step is to scan the database again and examine the transactions. Examine the first
transaction and find out the itemset in it. The itemset with the max count is taken at the top, the next
itemset with lower count and so on. It means that the branch of the tree is constructed with transaction
itemsets in descending order of count.
 #4) The next transaction in the database is examined. The itemsets are ordered in descending order of
count. If any itemset of this transaction is already present in another branch (for example in the 1st
transaction), then this transaction branch would share a common prefix to the root.

 This means that the common itemset is linked to the new node of another itemset in this transaction.
 #5) Also, the count of the itemset is incremented as it occurs in the transactions.
Both the common node and new node count is increased by 1 as they are created
and linked according to transactions.
 #6) The next step is to mine the created FP Tree. For this, the lowest node is
examined first along with the links of the lowest nodes. The lowest node represents
the frequency pattern length 1. From this, traverse the path in the FP Tree. This path
or paths are called a conditional pattern base.
 Conditional pattern base is a sub-database consisting of prefix paths in the FP tree
occurring with the lowest node (suffix).
 #7) Construct a Conditional FP Tree, which is formed by a count of itemsets in the
path. The itemsets meeting the threshold support are considered in the Conditional
FP Tree.
 #8) Frequent Patterns are generated from the Conditional FP Tree.
Transaction ID Items
T1 I1, I3, I4
T2 I2, I3, I5, I6
T3 I1, I2, I3, I5
T4 I2, I5
T5 I1, I3, I5

DataSet for FP-Growth Algorithm Numerical Example

FP-Tree created from the transaction data


 Application Examples of FP-Growth Algorithm

 The FP-Growth algorithm has various practical applications as a data mining algorithm for efficiently
extracting frequent patterns. Some typical applications are described below.
• Market Basket Analysis: Market Basket Analysis is a method to understand what products customers tend
to purchase together. For example, from point-of-sale (POS) data in a supermarket, it is possible to identify
which items are often purchased together, and the FP-Growth algorithm can effectively perform basket
analysis by finding frequent item sets.
• Web Click Stream Analysis: Website click logs can be used to analyze the behaviour patterns of website
users. The FP-Growth algorithm can extract frequent page-transition patterns from web clickstream data
and use them to improve websites, build recommendation systems, etc.
• DNA Analysis: In the fields of biology and bioinformatics, the FP-Growth algorithm is also used in DNA
analysis. By extracting frequent patterns in gene sequences, it can help understand the role and interactions
of specific genes and identify the causes of disease.
• Network Traffic Analysis: The FP-Growth algorithm is sometimes used to detect anomalous behavior in
network traffic data, such as communication patterns or attacks. Finding anomalous communication patterns
can help identify security threats.
• Social Network Analysis: The FP-Growth algorithm may be applied to understand user relationships and
group structure from social network data. For example, it is used to investigate how often friends share
common interests on social networking sites.
 Advantages Of FP Growth Algorithm

1. This algorithm needs to scan the database only twice when compared to Apriori which
scans the transactions for each iteration.
2. The pairing of items is not done in this algorithm and this makes it faster.
3. The database is stored in a compact version in memory.
4. It is efficient and scalable for mining both long and short frequent patterns.
 Disadvantages Of FP-Growth Algorithm

1. FP Tree is more cumbersome and difficult to build than Apriori.


2. It may be expensive.
3. When the database is large, the algorithm may not fit in the shared memory.
Algorithm by Numerical
 Let’s scan the database and compute the frequency of each item as shown in
the below table.
Let’s consider minimum support as 3. After removing all the items below minimum support in
the above table, we would remain with these items - {K: 5, E: 4, M : 3, O : 3, Y : 3}. Let’s re-
order the transaction database based on the items above minimum support. In this step, in
each transaction, we will remove infrequent items and re-order them in the descending order
of their frequency, as shown in the table below.
First Transaction {K, E, M, O, Y}:
In this transaction, all items are simply linked, and their
support count is initialized as 1.
Second Transaction {K, E, O, Y}:
In this transaction, we will increase the support count of K and E in the
tree to 2. As no direct link is available from E to O, we will insert a new
path for O and Y and initialize their support count as 1.
Third Transaction {K, E, M}:
After inserting this transaction, the tree will look as shown below. We
will increase the support count for K and E to 3 and for M to 2.
Fourth Transaction {K, M, Y} and Fifth Transaction {K, E, O}:
These transactions are inserted in the same way: the count of K rises to 5, a new M → Y branch is added
directly under K for the fourth transaction, and the counts of E and of O under E are incremented for the
fifth transaction.
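The tree construction walked through above can be mirrored in a few lines of plain Python. The sketch below is a minimal illustration (not a full FP-Growth miner): it inserts the five frequency-ordered transactions, sharing prefixes and incrementing node counts, and then prints the resulting FP-tree.

```python
# A small sketch of the FP-tree construction walked through above: the five
# frequency-ordered transactions (frequent items K, E, M, O, Y; min support
# count = 3) are inserted one by one, sharing prefixes and incrementing counts.
class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def insert(root, transaction):
    node = root
    for item in transaction:
        # Reuse an existing child (shared prefix) or start a new branch.
        if item not in node.children:
            node.children[item] = Node(item, node)
        node = node.children[item]
        node.count += 1

def show(node, depth=0):
    for child in node.children.values():
        print("  " * depth + f"{child.item}: {child.count}")
        show(child, depth + 1)

# Transactions already filtered to frequent items and re-ordered by frequency.
ordered_transactions = [
    ["K", "E", "M", "O", "Y"],
    ["K", "E", "O", "Y"],
    ["K", "E", "M"],
    ["K", "M", "Y"],
    ["K", "E", "O"],
]

root = Node(None, None)          # the null root of the FP-tree
for t in ordered_transactions:
    insert(root, t)

show(root)   # prints K:5 with the E:4 subtree and the separate K -> M -> Y branch
```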
 FP Growth vs Apriori

Pattern Generation
• FP Growth: generates patterns by constructing an FP tree.
• Apriori: generates patterns by pairing the items into singletons, pairs, and triplets.

Candidate Generation
• FP Growth: there is no candidate generation.
• Apriori: uses candidate generation.

Process
• FP Growth: the process is faster compared to Apriori; the runtime increases linearly with the number of itemsets.
• Apriori: the process is comparatively slower than FP Growth; the runtime increases exponentially with the number of itemsets.

Memory Usage
• FP Growth: a compact version of the database is saved.
• Apriori: the candidate combinations are saved in memory.