Unit 2: Unsupervised Learning
Supervised ML Algorithm vs. Unsupervised ML Algorithm
• Supervised learning algorithms are trained using labeled data; unsupervised learning algorithms are trained using unlabeled data.
• A supervised learning model takes direct feedback to check whether it is predicting the correct output; an unsupervised learning model does not take any feedback.
• A supervised learning model predicts the output; an unsupervised learning model finds the hidden patterns in the data.
• In supervised learning, input data is provided to the model along with the output; in unsupervised learning, only input data is provided to the model.
• The goal of supervised learning is to train the model so that it can predict the output when given new data; the goal of unsupervised learning is to find hidden patterns and useful insights from the unknown dataset.
• Supervised learning needs supervision to train the model; unsupervised learning does not need any supervision.
• Supervised learning can be categorized into classification and regression problems; unsupervised learning can be classified into clustering and association problems.
• Supervised learning is used for cases where we know the inputs as well as the corresponding outputs; unsupervised learning is used for cases where we have only input data and no corresponding output data.
The goal of unsupervised learning is to discover patterns and relationships in the data
without any explicit guidance.
Unsupervised learning is a machine learning technique in which models are not supervised using a labeled training dataset. Instead, the model itself finds the hidden patterns and insights in the given data. It can be compared to the learning that takes place in the human brain while learning new things. It can be defined as:
Unsupervised learning is a type of machine learning in which models are trained using
unlabeled dataset and are allowed to act on that data without any supervision.
Unsupervised learning cannot be directly applied to a regression or classification problem because, unlike supervised learning, we have the input data but no corresponding output data. The goal of unsupervised learning is to find the underlying structure of the dataset, group the data according to similarities, and represent the dataset in a compressed format.
Example: Suppose the unsupervised learning algorithm is given an input
dataset containing images of different types of cats and dogs. The algorithm
is never trained upon the given dataset, which means it does not have any
idea about the features of the dataset. The task of the unsupervised
learning algorithm is to identify the image features on their own.
Unsupervised learning algorithm will perform this task by clustering the
image dataset into the groups according to similarities between images.
Why use Unsupervised Learning?
Below are some main reasons which describe the importance of Unsupervised
Learning:
• Unsupervised learning is helpful for finding useful insights from the data.
• Unsupervised learning is very similar to how a human learns to think from their own experiences, which makes it closer to real AI.
• Unsupervised learning works on unlabeled and uncategorized data, which makes it all the more important.
• In the real world, we do not always have input data with the corresponding output, so to solve such cases we need unsupervised learning.
Working of Unsupervised Learning
Working of unsupervised learning can be understood by the below diagram:
Here, we have taken unlabeled input data, which means it is not categorized and corresponding outputs are not given. This unlabeled input data is fed to the machine learning model in order to train it. First, the model interprets the raw data to find the hidden patterns in it, and then applies a suitable algorithm such as k-means clustering, hierarchical clustering, etc.
Once the suitable algorithm is applied, it divides the data objects into groups according to the similarities and differences between the objects.
Types of Unsupervised Learning
Clustering: A clustering problem is where you want to discover the inherent groupings in the data, such
as grouping customers by purchasing behavior.
Clustering is a type of unsupervised machine learning algorithm that groups similar data points into
clusters. The goal of clustering is to identify patterns or structures in the data that are not easily visible
by other methods.
Applications of Clustering:
1. Customer Segmentation: Clustering can be used to group customers based on their behavior,
demographics, and preferences.
2. Image Segmentation: Clustering can be used to segment images into different regions based on
color, texture, and other features.
3. Gene Expression Analysis: Clustering can be used to group genes based on their expression levels in
different samples.
4. Recommendation Systems: Clustering can be used to group users based on their preferences and
recommend items to them.
Association: An association rule learning problem is where you want to discover rules that describe large portions of
your data, such as people that buy X also tend to buy Y.
Association in Machine Learning (ML) refers to a type of unsupervised learning algorithm that aims to discover
interesting patterns, relationships, or associations between variables in a dataset.
Goal of Association : The primary goal of association algorithms is to identify strong rules or patterns that describe the
relationships between different attributes or features in a dataset.
Applications of Association:
1. Market Basket Analysis: Association algorithms can be used to analyze customer purchasing behavior and identify
patterns in the items that are purchased together.
2. Recommendation Systems: Association algorithms can be used to build recommendation systems that suggest
products or services based on a customer's past purchases or behavior.
3. Anomaly Detection: Association algorithms can be used to detect anomalies or outliers in a dataset by identifying
patterns or relationships that are unusual or unexpected.
Clustering
1. Hierarchical clustering
2. K-means clustering
3. Gaussian Mixture Models (GMMs)
4. Principal Component Analysis (Unit 3)
5. Singular Value Decomposition (Unit 3)
6. Density-Based Spatial Clustering of Applications with Noise (DBSCAN) (Unit 3)
K-means clustering is an unsupervised machine learning algorithm used to group a dataset into k clusters. It is
an iterative algorithm that starts by randomly selecting k centroids in the dataset. After selecting the centroids,
the entire dataset is divided into clusters based on the distance of the data points from the centroid. In the new
clusters, the centroids are calculated by taking the mean of the data points.
With the new centroids, we regroup the dataset into new clusters. This process continues until we get a stable
cluster. K-means clustering is a partition clustering algorithm. We call it partition clustering because of the
reason that the k-means clustering algorithm partitions the entire dataset into mutually exclusive clusters.
Here, K defines the number of pre-defined clusters that need to be created in the process.
It allows us to cluster the data into different groups and is a convenient way to discover the categories of groups in the unlabeled dataset on its own, without the need for any training.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this algorithm is to minimize the sum of distances between the data points and their corresponding cluster centroids.
The algorithm takes the unlabelled dataset as input, divides the dataset into k clusters, and repeats the process until it finds the best clusters.
K-means Clustering Algorithm Steps
To understand the process of clustering using the k-means clustering algorithm and solve the
numerical example, let us first state the algorithm. Given a dataset of N entries and a number K
as the number of clusters that need to be formed, we will use the following steps to find the
clusters using the k-means algorithm.
1. First, we will select K random entries from the dataset and use them as centroids.
2. Now, we will find the distance of each entry in the dataset from the centroids. You can use any distance metric such as Euclidean distance, Manhattan distance, or squared Euclidean distance.
3. After finding the distance of each data entry from the centroids, we will start assigning the data points to clusters. We will assign each data point to the cluster whose centroid is at the least distance from it.
4. After assigning the points to clusters, we will calculate the new centroid of the clusters. For this,
we will use the mean of each data point in the same cluster as the new centroid. If the newly
created centroids are the same as the centroids in the previous iteration, we will consider the
current clusters to be final. Hence, we will stop the execution of the algorithm. If any of the newly
created centroids is different from the centroids in the previous iteration, we will go to step 2.
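A minimal NumPy sketch of these four steps is given below; the toy data, the choice of k, and the use of Euclidean distance are illustrative assumptions and not part of the notes above.

```python
import numpy as np

def k_means(X, k, max_iters=100, seed=0):
    """Plain k-means following the steps above: pick k random points as centroids,
    assign points to the nearest centroid, recompute centroids, repeat until stable."""
    rng = np.random.default_rng(seed)
    # Step 1: select K random entries from the dataset as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: Euclidean distance of every point from every centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Step 3: assign each point to the cluster with the nearest centroid.
        labels = distances.argmin(axis=1)
        # Step 4: the new centroid is the mean of the points in each cluster.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # centroids unchanged -> stop
            break
        centroids = new_centroids
    return centroids, labels

# Illustrative unlabeled data (two obvious groups).
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2], [8.0, 8.0], [8.5, 7.9], [7.8, 8.3]])
centroids, labels = k_means(X, k=2)
print(centroids, labels)
```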
Applications of K-means Clustering in Machine Learning
K-means clustering algorithm finds its applications in various domains. Following are some of the
popular applications of k-means clustering.
• Document Classification: Using k-means clustering, we can divide documents into various
clusters based on their content, topics, and tags.
• Customer segmentation: Supermarkets and e-commerce websites divide their customers into
various clusters based on their transaction data and demography. This helps the business to target
appropriate customers with relevant products to increase sales.
• Cyber profiling: In cyber profiling, we collect data from individuals as well as groups to identify
their relationships. With k-means clustering, we can easily make clusters of people based on their
connection to each other to identify any available patterns.
• Image segmentation: We can use k-means clustering to perform image segmentation by grouping
similar pixels into clusters.
• Fraud detection in banking and insurance: By using historical data on frauds, banks and
insurance agencies can predict potential frauds by the application of k-means clustering.
K-means Clustering Numerical Example with Solution
Now that we have discussed the algorithm, let us solve a numerical problem on k-means clustering.
The problem is as follows: you are given 15 points in the Cartesian coordinate system.
Point Coordinates
A1 (2,10)
A2 (2,6)
A3 (11,11)
A4 (6,9)
A5 (6,4)
A6 (1,2)
A7 (5,10)
A8 (4,9)
A9 (10,12)
A10 (7,5)
A11 (9,11)
A12 (4,6)
A13 (3,10)
A14 (3,8)
A15 (6,11)
Each point is assigned to the cluster whose centroid is closest to it. After the first assignment step, the clusters are as follows.
In cluster 1, we have 6 points i.e. A2 (2,6), A5 (6,4), A6 (1,2), A10 (7,5), A12 (4,6), A14 (3,8). To
calculate the new centroid for cluster 1, we will find the mean of the x and y coordinates of each point
in the cluster. Hence, the new centroid for cluster 1 is (3.833, 5.167).
In cluster 2, we have 5 points i.e. A1 (2,10), A4 (6,9), A7 (5,10), A8 (4,9), and A13 (3,10). Hence, the new centroid for cluster 2 is (4, 9.6).
In cluster 3, we have 4 points i.e. A3 (11,11), A9 (10,12), A11 (9,11), and A15 (6,11). Hence, the new
centroid for cluster 3 is (9, 11.25).
Now that we have calculated new centroids for each cluster, we will calculate the distance of each data
point from the new centroids. Then, we will assign the points to clusters based on their distance from
the centroids. The results for this process have been given in the following table.
Point | Distance from Centroid 1 (3.833, 5.167) | Distance from Centroid 2 (4, 9.6) | Distance from Centroid 3 (9, 11.25) | Assigned Cluster
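The rows of this table can be reproduced with a short NumPy sketch; the 15 points and the three new centroids are the ones stated above, while the variable names are only illustrative.

```python
import numpy as np

# The 15 points A1..A15 from the example above.
points = {
    "A1": (2, 10), "A2": (2, 6),  "A3": (11, 11), "A4": (6, 9),   "A5": (6, 4),
    "A6": (1, 2),  "A7": (5, 10), "A8": (4, 9),   "A9": (10, 12), "A10": (7, 5),
    "A11": (9, 11), "A12": (4, 6), "A13": (3, 10), "A14": (3, 8),  "A15": (6, 11),
}

# New centroids computed above as the mean of each cluster's points.
centroids = np.array([(3.833, 5.167), (4, 9.6), (9, 11.25)])

# Euclidean distance of every point from every centroid, and the nearest one.
for name, p in points.items():
    d = np.linalg.norm(np.array(p) - centroids, axis=1)
    print(name, np.round(d, 3), "-> assigned to cluster", d.argmin() + 1)
```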
Inertia actually calculates the sum of distances of all the points within a cluster
from the centroid of that cluster.
If the distance between the centroid of a cluster and the points in that cluster is
small, it means that the points are closer to each other. So, inertia makes sure
that the first property of clusters is satisfied. But it does not care about the
second property – that different clusters should be as different from each other
as possible.
This is where the Dunn index comes into action.
Along with the distance between the centroid and points, the Dunn index
also takes into account the distance between two clusters. This
distance between the centroids of two different clusters is known as inter-
cluster distance.
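As a rough sketch (assuming we already have the data points, their cluster labels, and the centroids; the function names are hypothetical), the two quantities can be computed as follows:

```python
import numpy as np
from itertools import combinations

def inertia(X, labels, centroids):
    """Sum of squared distances of every point to the centroid of its own cluster
    (this is how scikit-learn defines kmeans.inertia_)."""
    return sum(np.sum((X[labels == j] - c) ** 2) for j, c in enumerate(centroids))

def min_inter_cluster_distance(centroids):
    """Smallest distance between the centroids of two different clusters,
    i.e. the inter-cluster distance that the Dunn index takes into account."""
    return min(np.linalg.norm(a - b) for a, b in combinations(centroids, 2))

# Tiny illustrative example.
X = np.array([[1, 1], [1, 2], [8, 8], [9, 8]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[1.0, 1.5], [8.5, 8.0]])
print(inertia(X, labels, centroids), min_inter_cluster_distance(centroids))
```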
Hierarchical Clustering
Hierarchical clustering is another unsupervised machine learning algorithm, which is used to group
the unlabelled datasets into a cluster and also known as hierarchical cluster analysis or HCA.
Hierarchical clustering is an unsupervised machine learning algorithm used to group data points into
various clusters based on the similarity between them. It is based on the idea of creating a hierarchy
of clusters, where each cluster is made up of smaller clusters that can be further divided into even
smaller clusters. This hierarchical structure makes it easy to visualize the data and identify patterns
within the data.
In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this tree-shaped
structure is known as the dendrogram.
1. Agglomerative Clustering
2. Divisive Clustering
Agglomerative
Agglomerative is a bottom-up approach, in which the algorithm starts with taking all data points as single
clusters and merging them until one cluster is left.
A bottom-up approach where each data point starts as its own cluster and merges with the closest cluster
progressively.
• Distance metric: Determines how similar two clusters or data points are.
1. Start with individual points: Each data point is its own cluster. For example if you have 5 data points you
start with 5 clusters each containing just one data point.
2. Calculate distances between clusters: Calculate the distance between every pair of clusters. Initially since
each cluster has one point this is the distance between the two data points.
3. Merge the closest clusters: Identify the two clusters with the smallest distance and merge them into a single
cluster.
4. Update distance matrix: After merging you now have one less cluster. Recalculate the distances between the
new cluster and the remaining clusters.
5. Repeat steps 3 and 4: Keep merging the closest clusters and updating the distance matrix until you have only
one cluster left.
6. Create a dendrogram: As the process continues you can visualize the merging of clusters using a tree-like
diagram called a dendrogram. It shows the hierarchy of how clusters are merged.
Step-1: Create each data point as a single cluster. Let's say there
are N data points, so the number of clusters will also be N.
Step-2: Take two closest data points or clusters and merge them to form
one cluster. So, there will now be N-1 clusters.
Step-3: Again, take the two closest clusters and merge them together
to form one cluster. There will be N-2 clusters.
Step-4: Repeat Step 3 until only one cluster is left.
Step-5: Once all the clusters are combined into one big cluster, develop the dendrogram to
divide the clusters as per the problem.
The dendrogram is a tree-like structure that is mainly used to record each merge step that the HC algorithm performs.
In the dendrogram plot, the Y-axis shows the Euclidean distances between the data points,
and the x-axis shows all the data points of the given dataset.
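A minimal sketch of agglomerative clustering and its dendrogram using SciPy; the toy data and the choice of Ward linkage are assumptions made here for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Small illustrative dataset: each row is one data point (its own initial cluster).
X = np.array([[1, 2], [2, 2], [0, 1], [8, 8], [8, 9], [9, 8]])

# Bottom-up (agglomerative) merging; 'ward' is one common linkage criterion.
Z = linkage(X, method="ward")

# Cut the hierarchy into, say, 2 flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)

# The dendrogram records the order and distance of every merge:
# the y-axis shows the merge distance, the x-axis the data points.
dendrogram(Z)
plt.xlabel("data points")
plt.ylabel("merge distance")
plt.show()
```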
What is Divisive Clustering?
Divisive clustering is also a type of hierarchical clustering that is used to create clusters of data points. It is
an unsupervised learning algorithm that begins by placing all the data points in a single cluster and then
progressively splits the clusters until each data point is in its own cluster. Divisive clustering is useful for
analyzing datasets that may have complex structures or patterns, as it can help identify clusters that may
not be obvious at first glance.
Divisive clustering works by first assigning all the data points to one cluster. Then, it looks for ways to split
this cluster into two or more smaller clusters. This process continues until each data point is in its own
cluster.
Gaussian mixture models (GMMs) are a type of machine learning algorithm. They
are used to classify data into different categories based on the probability
distribution.
The Gaussian Mixture Model (GMM) is defined as a mixture model that represents the data as a combination of several Gaussian probability distribution functions.
GMM also requires estimating statistics such as the mean and standard deviation (the parameters) of each component.
It is used to estimate the parameters of the probability distributions to best fit the
density of a given training dataset.
There are plenty of techniques available to estimate the parameters of the Gaussian Mixture Model (GMM); Maximum Likelihood Estimation is one of the most popular among them.
The Expectation-Maximization (EM) algorithm is one of the best techniques that helps us estimate the parameters of the Gaussian distributions.
In the EM algorithm, the E-step estimates the expected value of each latent variable, whereas the M-step optimizes the parameters using Maximum Likelihood Estimation (MLE). This process is repeated until a good set of latent values and a maximum-likelihood fit to the data are achieved.
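A minimal sketch of fitting a GMM with scikit-learn's GaussianMixture, which runs the EM procedure described above internally; the synthetic data and the number of components are assumptions made for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Illustrative 1-D data drawn from two different normal distributions.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 0.5, 200)]).reshape(-1, 1)

# Fit a 2-component mixture; the fitting uses the EM algorithm internally.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

print(gmm.means_)            # estimated means of the two Gaussians
print(gmm.covariances_)      # estimated (co)variances
print(gmm.predict([[4.8]]))  # most likely component for a new point
```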
Advantages of EM algorithm
• It is very easy to implement the first two basic steps of the EM algorithm, the E-step and the M-step, in various machine learning problems.
Association rule learning works on the concept of if-then statements, such as "if A then B". Here the "if" element is called the antecedent, and the "then" statement is called the consequent.
Relationships in which we find an association between two items are known as single cardinality. Association rule learning is all about creating rules, and as the number of items increases, the cardinality increases accordingly. So, to measure the associations between thousands of data items, there are several metrics. These metrics are given below:
• Support
• Confidence
• Lift
1. Support (frequency of occurrence)
Support is the frequency of an itemset, i.e., how frequently it appears in the dataset. It is defined as the fraction of the transactions T that contain the itemset X:
Supp(X) = Freq(X) / |T|
2. Confidence (conditional probability)
Confidence indicates how often the rule has been found to be true, i.e., how often items X and Y occur together in the dataset given that X has occurred. It is the ratio of the number of transactions that contain both X and Y to the number of transactions that contain X:
Confidence(X ⇒ Y) = Supp(X ∪ Y) / Supp(X)
3. Lift
Lift measures the strength of a rule. It is the ratio of the observed support to the support expected if X and Y were independent of each other:
Lift(X ⇒ Y) = Supp(X ∪ Y) / (Supp(X) × Supp(Y))
It has three possible ranges of values:
• Lift = 1: The occurrence of the antecedent and the consequent are independent of each other.
• Lift > 1: The two itemsets are dependent on each other to some degree (a positive association).
• Lift < 1: One item is a substitute for the other, which means one item has a negative effect on the other.
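A small sketch of how these three metrics could be computed directly from a list of transactions; the transactions and function names below are only illustrative.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Confidence(A => C) = supp(A union C) / supp(A)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

def lift(transactions, antecedent, consequent):
    """Observed support of A union C divided by the support expected if A and C were independent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / (
        support(transactions, antecedent) * support(transactions, consequent))

# Illustrative transactions.
T = [{"bread", "milk"}, {"bread", "butter"}, {"milk", "butter"}, {"bread", "milk", "butter"}]
print(support(T, {"bread"}), confidence(T, {"bread"}, {"milk"}), lift(T, {"bread"}, {"milk"}))
```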
Example
Suppose you have 4000 customer transactions in a Big Bazar. You have to
calculate the Support, Confidence, and Lift for two products, and you may say
Biscuits and Chocolate. This is because customers frequently buy these two
items together.
Out of 4000 transactions, 400 contain Biscuits and 600 contain Chocolate, and 200 transactions contain both Biscuits and Chocolate.
Using this data, we will find out the support, confidence, and lift.
Support
Support refers to the default popularity of any product. You find the support by dividing the number of transactions containing that product by the total number of transactions. Here, Support (Biscuits) = 400 / 4000 = 10% and Support (Chocolate) = 600 / 4000 = 15%.
Confidence refers to the possibility that customers who bought biscuits also bought chocolates. You find it by dividing the number of transactions that contain both biscuits and chocolates by the number of transactions that contain biscuits: Confidence = 200 / 400 = 50%.
It means that 50 percent of customers who bought biscuits also bought chocolates.
Lift
Lift refers to the increase in the ratio of the sale of chocolates when you sell
biscuits. The mathematical equations of lift are given below.
Lift = Confidence / Support = 50 / 10 = 5
Conclusion
It means that the probability of people buying both biscuits and chocolates
together is five times more than that of purchasing the biscuits alone.
If the lift value is below one, it means that the people are unlikely to buy both
the items together.
We can also say that the Apriori algorithm is an association rule learning algorithm that analyzes whether people who bought product A also tend to buy product B.
The algorithm is named Apriori because it uses prior knowledge of frequent itemset properties.
Generally, the Apriori algorithm is operated on a database that consists of a huge number of transactions.
What is Apriori Algorithm?
The Apriori algorithm is an algorithm used for mining frequent item sets and the relevant association rules.
Generally, the apriori algorithm operates on a database containing a huge number of transactions. For
example, the items customers buy at a Big Bazar.
Frequent Item Set
A frequent itemset is an itemset whose support value is greater than or equal to a threshold value (minimum support).
Apriori algorithm uses frequent itemsets to generate association rules. To improve the efficiency of level-
wise generation of frequent itemsets, an important property is used called Apriori property which helps by
reducing the search space.
Apriori algorithm helps the customers to buy their products with ease and increases the sales
performance of the particular store.
The Apriori algorithm is a frequent pattern mining algorithm used in market basket analysis. We use the Apriori algorithm to generate frequent itemsets in a transaction dataset. It is an iterative algorithm that starts from 1-itemsets and builds progressively larger frequent itemsets.
How the Apriori Algorithm Works?
The Apriori Algorithm operates through a systematic process that involves several key steps:
1. Identifying Frequent Itemsets: The algorithm begins by scanning the dataset to identify individual items (1-itemsets) and their frequencies. It then establishes a minimum support threshold, which determines whether an itemset is considered frequent.
2. Creating Candidate Itemsets: Once frequent 1-itemsets (single items) are identified, the algorithm generates candidate 2-itemsets by combining frequent items. This process continues iteratively, forming larger itemsets (k-itemsets) until no more frequent itemsets can be found.
3. Removing Infrequent Itemsets: The algorithm employs a pruning technique based on the Apriori property, which states that if an itemset is infrequent, all its supersets must also be infrequent. This significantly reduces the number of combinations that need to be evaluated.
4. Generating Association Rules: After identifying frequent itemsets, the algorithm generates association rules that
illustrate how items relate to one another, using metrics like support, confidence, and lift to evaluate the strength of these
relationships.
Support(X) = Freq(X) / (Total number of transactions)
Confidence(X ⇒ Y) = Supp(X ∪ Y) / Supp(X)
Lift(X ⇒ Y) = Supp(X ∪ Y) / (Supp(X) × Supp(Y))
APRIORI ALGORITHM STEPS:-
Step 1: Data Preparation Collect and prepare the transactional data, which includes a set of
items and a set of transactions.
Step 2: Support Calculation Calculate the support for each item in the dataset, which is the
proportion of transactions that contain the item.
Step 3: Frequent Itemset Generation Generate the frequent itemsets, which are the itemsets
that meet the minimum support threshold.
Step 4: Candidate Generation Generate candidate itemsets of size k+1 from the frequent
itemsets of size k.
Step 5: Support Counting Count the support for each candidate itemset.
Step 6: Pruning Prune the candidate itemsets that do not meet the minimum support threshold.
Step 7: Frequent Itemset Generation (again)Generate the frequent itemsets of size k+1 from
the pruned candidate itemsets.
Step 8: Association Rule Generation Generate association rules from the frequent itemsets.
Step 9: Rule Pruning Prune the association rules that do not meet the minimum confidence
threshold. Output the resulting association rules.
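As a usage sketch, these steps are implemented, for example, by the mlxtend library (an assumption here, not part of the notes); the tiny dataset is illustrative, and the 50% support and 80% confidence thresholds match those used in the worked example that follows.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Step 1: transactional data, one list of items per transaction.
transactions = [["milk", "bread"], ["milk", "butter"], ["milk", "bread", "butter"], ["bread"]]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Steps 2-7: frequent itemsets that meet the minimum support threshold.
frequent = apriori(df, min_support=0.5, use_colnames=True)

# Steps 8-9: association rules that meet the minimum confidence threshold.
rules = association_rules(frequent, metric="confidence", min_threshold=0.8)
print(frequent)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```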
Apriori Algorithm Example
Consider the following dataset and find frequent item sets and generate association rules for them.
Assume that minimum support threshold (s = 50%) and minimum confident threshold (c = 80%).
List of items (assumed): 1 = Papaya, 2 = Orange, 3 = Banana, 4 = Apple, 5 = Grapes
Transaction Items
T1 1, 2, 3
T2 2, 3, 4
T3 4, 5
T4 1, 2, 4
T5 1, 2, 3, 5
T6 1, 2, 3, 4
Solution
Step-1:
(i) Create a table containing support count of each item present in dataset –
Called C1 (candidate set).
Item Count
1 4
2 5
3 4
4 4
5 2
(ii) Prune Step: Compare each candidate itemset's support count with the minimum support count (min_sup = 3, i.e., 50% of 6 transactions). The above table shows that item 5 does not meet min_sup = 3, thus it is removed; only items 1, 2, 3, and 4 meet the min_sup count. This gives us the frequent 1-itemset L1.
Item Count
1 4
2 5
3 4
4 4
Step-2:
(i) Join step: Generate candidate set C2 (2-itemset) using L1.And find out the occurrences of 2-
itemset from the given dataset.
Item Count
1, 2 4
1, 3 3
1, 4 2
2, 3 4
2, 4 3
3, 4 2
(ii) Prune Step: Compare each candidate itemset's support count with the minimum support count. The above table shows that the item sets {1, 4} and {3, 4} do not meet min_sup = 3, thus they are removed.
This gives us the following item set L2.
Item Count
1, 2 4
1, 3 3
2, 3 4
2, 4 3
Step-3:
(i) Join step: Generate candidate set C3 (3-itemset) using L2.And find out the
occurrences of 3-itemset from the given dataset.
Item Count
1, 2, 3 3
1, 2, 4 2
1, 3, 4 1
2, 3, 4 2
(ii) Prune Step: Compare each candidate itemset's support count with the minimum support count. The above table shows that the itemsets {1, 2, 4}, {1, 3, 4} and {2, 3, 4} do not meet min_sup = 3, thus they are removed. Only the itemset {1, 2, 3} meets the min_sup count.
Generate Association Rules:
Thus, we have discovered all the frequent item-sets. Now we need to generate strong association rules (satisfies
the minimum confidence threshold) from frequent item sets. For that we need to calculate confidence of each
rule.
All possible association rules from the frequent item set {1, 2, 3} are:
{1, 2} ⇒ {3}: Confidence = support{1, 2, 3} / support{1, 2} = (3/4) * 100 = 75% (Rejected)
{1, 3} ⇒ {2}: Confidence = support{1, 2, 3} / support{1, 3} = (3/3) * 100 = 100% (Selected)
{2, 3} ⇒ {1}: Confidence = support{1, 2, 3} / support{2, 3} = (3/4) * 100 = 75% (Rejected)
{1} ⇒ {2, 3}: Confidence = support{1, 2, 3} / support{1} = (3/4) * 100 = 75% (Rejected)
{2} ⇒ {1, 3}: Confidence = support{1, 2, 3} / support{2} = (3/5) * 100 = 60% (Rejected)
{3} ⇒ {1, 2}: Confidence = support{1, 2, 3} / support{3} = (3/4) * 100 = 75% (Rejected)
This shows that only the association rule {1, 3} ⇒ {2} is strong when the minimum confidence threshold is 80%.
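The counts and confidences above can be cross-checked with a short Python sketch over the six transactions (with T1 taken as {1, 2, 3}, consistent with the support counts in Step 1); the brute-force candidate generation here is a simplification of the Apriori join step.

```python
from itertools import combinations

# The six transactions from the example (items 1-5).
transactions = [{1, 2, 3}, {2, 3, 4}, {4, 5}, {1, 2, 4}, {1, 2, 3, 5}, {1, 2, 3, 4}]
min_sup = 3  # 50% of 6 transactions

def count(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(set(itemset) <= t for t in transactions)

# Frequent 1-, 2- and 3-itemsets (brute force over all k-item combinations;
# the real Apriori would join the previous level's frequent itemsets instead).
items = sorted({i for t in transactions for i in t})
for k in (1, 2, 3):
    print(f"L{k}:", [c for c in combinations(items, k) if count(c) >= min_sup])

# Confidence of every rule that can be formed from the frequent itemset {1, 2, 3}.
for r in (1, 2):
    for antecedent in combinations((1, 2, 3), r):
        consequent = set((1, 2, 3)) - set(antecedent)
        conf = count((1, 2, 3)) / count(antecedent)
        print(set(antecedent), "=>", consequent, f"{conf:.0%}")
```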
Applications of Apriori Algorithm
Below are some applications of Apriori algorithm used in today’s companies and startups
1. E-commerce: Used to recommend products that are often bought together, like laptop + laptop
bag, increasing sales.
2. Food Delivery Services: Identifies popular combos, such as burger + fries, to offer combo deals to
customers.
3. Streaming Services: Recommends related movies or shows based on what users often watch
together, like action + superhero movies.
4. Financial Services: Analyzes spending habits to suggest personalized offers, such as credit card
deals based on frequent purchases.
5. Travel & Hospitality: Creates travel packages (e.g., flight + hotel) by finding commonly purchased
services together.
6. Health & Fitness: Suggests workout plans or supplements based on users’ past activities,
like protein shakes + workouts.
FP Growth Algorithm
The FP Growth algorithm in data mining is a popular method for frequent pattern mining.
The algorithm is efficient for mining frequent item sets in large datasets. It works by constructing
a frequent pattern tree (FP-tree) from the input dataset.
The FP Growth algorithm is a frequent pattern mining algorithm used in market basket analysis.
The FP-Growth or Frequent Pattern Growth algorithm is an advancement to the apriori
algorithm. While using the apriori algorithm for association rule mining, we need to scan the
transaction dataset multiple times. In the FP growth algorithm, we just need to scan the dataset
twice.
We also don't need to generate candidate sets while generating the frequent itemsets.
We create an FP-Tree and use it to determine the frequent itemsets. Thus, the FP-Growth algorithm helps us perform frequent pattern mining with fewer computing resources and in less time.
What is an FP-Tree in FP Growth Algorithm?
An FP-Tree is a tree data structure created from the transaction data while generating frequent
itemsets in the FP growth algorithm. To create an FP-Tree, we first scan the transaction dataset
and record the support count of each item. Then, we create a tree structure where each node in
the tree represents an item in the dataset and its frequency count. The root node has no
associated item and is used as a starting point for the tree. We denote the root node by None or
Null. The children of a node in the fp-tree represent the items that frequently co-occur with the
parent item in the dataset.
To construct the tree efficiently, we first transform the dataset by sorting the items in each
transaction based on their support count. We do this to make sure that the frequent items appear
early in each transaction. This leads to more frequent items being near the root node resulting in
a compact and efficient tree.
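A small sketch of this preprocessing step, counting item supports in a first scan and reordering each transaction by descending support; the transactions used here are the example ones listed in the table further below.

```python
from collections import Counter

# Example transactions.
transactions = [
    ["I1", "I3", "I4"],
    ["I2", "I3", "I5", "I6"],
    ["I1", "I2", "I3", "I5"],
    ["I2", "I5"],
    ["I1", "I3", "I5"],
]

# First scan: support count of every item.
support = Counter(item for t in transactions for item in t)

# Reorder each transaction by descending support (ties broken by item name),
# so the most frequent items sit closest to the root of the FP-tree.
# (In the full algorithm, items below the minimum support would also be dropped.)
ordered = [sorted(t, key=lambda item: (-support[item], item)) for t in transactions]
print(support)
print(ordered)
```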
Shortcomings of Apriori Algorithm
1. Using Apriori needs a generation of candidate itemsets. These itemsets may be large in number if
the itemset in the database is huge.
2. Apriori needs multiple scans of the database to check the support of each itemset generated and
this leads to high costs.
These shortcomings can be overcome using the FP growth algorithm.
Frequent Pattern Growth Algorithm
This algorithm is an improvement to the Apriori method. A frequent pattern is generated without the
need for candidate generation. FP growth algorithm represents the database in the form of a tree
called a frequent pattern tree or FP tree.
This tree structure will maintain the association between the itemsets. The database is fragmented
using one frequent item. This fragmented part is called “pattern fragment”. The itemsets of these
fragmented patterns are analyzed. Thus with this method, the search for frequent itemsets is
reduced comparatively.
FP Tree
Frequent Pattern Tree is a tree-like structure that is made with the initial itemsets of the database.
The purpose of the FP tree is to mine the most frequent pattern. Each node of the FP tree
represents an item of the itemset.
The root node represents null, while the lower nodes represent the itemsets. The associations between the nodes, that is, between the itemsets and the other itemsets, are maintained while forming the tree.
Frequent Pattern Algorithm Steps
The frequent pattern growth method lets us find the frequent pattern without candidate generation.
Let us see the steps followed to mine the frequent pattern using frequent pattern growth
algorithm:
#1) The first step is to scan the database to find the occurrences of the itemsets in the database. This
step is the same as the first step of Apriori. The count of 1-itemsets in the database is called support
count or frequency of 1-itemset.
#2) The second step is to construct the FP tree. For this, create the root of the tree. The root is
represented by null.
#3) The next step is to scan the database again and examine the transactions. Examine the first
transaction and find out the itemset in it. The itemset with the max count is taken at the top, the next
itemset with lower count and so on. It means that the branch of the tree is constructed with transaction
itemsets in descending order of count.
#4) The next transaction in the database is examined. Its itemsets are ordered in descending order of count. If any itemset of this transaction is already present in another branch (for example, in the 1st transaction), then this transaction's branch shares a common prefix starting from the root.
This means that the common itemset is linked to a new node for the remaining itemset(s) of this transaction.
#5) Also, the count of the itemset is incremented as it occurs in the transactions.
Both the common node and new node count is increased by 1 as they are created
and linked according to transactions.
#6) The next step is to mine the created FP Tree. For this, the lowest node is
examined first along with the links of the lowest nodes. The lowest node represents
the frequency pattern length 1. From this, traverse the path in the FP Tree. This path
or paths are called a conditional pattern base.
Conditional pattern base is a sub-database consisting of prefix paths in the FP tree
occurring with the lowest node (suffix).
#7) Construct a Conditional FP Tree, which is formed by a count of itemsets in the
path. The itemsets meeting the threshold support are considered in the Conditional
FP Tree.
#8) Frequent Patterns are generated from the Conditional FP Tree.
Transaction ID Items
T1 I1, I3, I4
T2 I2, I3, I5, I6
T3 I1, I2, I3, I5
T4 I2, I5
T5 I1, I3, I5
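Using the five transactions above, a minimal sketch with the mlxtend library (the library choice and the 40% minimum support are assumptions made here for illustration):

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

# The five transactions from the table above.
transactions = [
    ["I1", "I3", "I4"],
    ["I2", "I3", "I5", "I6"],
    ["I1", "I2", "I3", "I5"],
    ["I2", "I5"],
    ["I1", "I3", "I5"],
]

te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# FP-Growth builds the FP-tree internally and mines the frequent itemsets from it.
print(fpgrowth(df, min_support=0.4, use_colnames=True))
```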
The FP-Growth algorithm has various practical applications as a data mining algorithm for efficiently
extracting frequent patterns. Some typical applications are described below.
• Market Basket Analysis: Market Basket Analysis is a method to understand what products customers tend
to purchase together. For example, from point-of-sale (POS) data in a supermarket, it is possible to identify
which items are often purchased together, and the FP-Growth algorithm can effectively perform basket
analysis by finding frequent item sets.
• Web Click Stream Analysis: Website click logs can be used to analyze the behavior patterns of website users. The FP-Growth algorithm can extract frequent page-transition patterns from web clickstream data and use them to improve websites, build recommendation systems, etc.
• DNA Analysis: In the fields of biology and bioinformatics, the FP-Growth algorithm is also used in DNA
analysis. By extracting frequent patterns in gene sequences, it can help understand the role and interactions
of specific genes and identify the causes of disease.
• Network Traffic Analysis: The FP-Growth algorithm is sometimes used to detect anomalous behavior in
network traffic data, such as communication patterns or attacks. Finding anomalous communication patterns
can help identify security threats.
• Social Network Analysis: The FP-Growth algorithm may be applied to understand user relationships and
group structure from social network data. For example, it is used to investigate how often friends share
common interests on social networking sites.
Advantages Of FP Growth Algorithm
1. This algorithm needs to scan the database only twice when compared to Apriori which
scans the transactions for each iteration.
2. The pairing of items is not done in this algorithm and this makes it faster.
3. The database is stored in a compact version in memory.
4. It is efficient and scalable for mining both long and short frequent patterns.
Disadvantages Of FP-Growth Algorithm
FP Growth vs. Apriori
Pattern Generation
• FP Growth generates patterns by constructing an FP tree.
• Apriori generates patterns by pairing the items into singletons, pairs and triplets.
Candidate Generation
• FP Growth does not require candidate generation.
• Apriori uses candidate generation.
Process
• FP Growth is faster; its runtime increases linearly with an increase in the number of itemsets.
• Apriori is comparatively slower than FP Growth; its runtime increases exponentially with an increase in the number of itemsets.
Memory Usage
• FP Growth saves a compact version of the database in memory.
• Apriori saves the candidate combinations in memory.