ML 4

K-Means Clustering is an unsupervised machine learning algorithm that groups unlabeled data into clusters based on similarity, utilizing Euclidean distance for categorization. The FP Growth algorithm is a frequent pattern mining method that constructs a compact FP-tree to efficiently identify frequent itemsets in large datasets, outperforming traditional methods like Apriori. Both algorithms have distinct advantages and limitations, with K-Means being faster but less flexible, and FP Growth being efficient and scalable but more complex to implement.


K-Means Clustering is an unsupervised machine learning algorithm which groups an unlabeled dataset into different clusters. It is used to organize data into groups based on their similarity.

How does K-Means clustering work?


We are given a data set of items, each with certain features and values for these
features (like a vector). The task is to categorize those items into groups.
To achieve this we will use the K-means algorithm. 'K' in the name of the
algorithm represents the number of groups/clusters into which we want to
classify our items.

The algorithm will categorize the items into k groups or clusters of similarity.
To calculate that similarity we will use the Euclidean distance as a
measurement. The algorithm works as follows:
1. First we randomly initialize k points called means or cluster centroids.
2. We categorize each item to its closest mean and we update the mean's
coordinates, which are the averages of the items categorized in that
cluster so far.
3. We repeat the process for a given number of iterations and at the end,
we have our clusters.
The "points" mentioned above are called means because they are the mean
values of the items categorized in them. To initialize these means, we have
a lot of options. An intuitive method is to initialize the means at random
items in the data set. Another method is to initialize the means at random
values between the boundaries of the data set. For example for a
feature x the items have values in [0,3] we will initialize the means with
values for x at [0,3].
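To make the procedure above concrete, here is a minimal NumPy sketch of the algorithm, initializing the means at randomly chosen items as described. The data set, the value of k, and the iteration count are illustrative assumptions, not part of the original text.

```python
import numpy as np

def k_means(items, k, iterations=100, seed=0):
    """Cluster `items` (an n x d array) into k groups using Euclidean distance."""
    rng = np.random.default_rng(seed)
    # Initialize the means at k randomly chosen items from the data set.
    means = items[rng.choice(len(items), size=k, replace=False)].astype(float)
    for _ in range(iterations):
        # Assignment step: categorize each item to its closest mean.
        distances = np.linalg.norm(items[:, None, :] - means[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each mean to the average of the items in its cluster.
        for j in range(k):
            if np.any(labels == j):
                means[j] = items[labels == j].mean(axis=0)
    return means, labels

# Tiny illustrative data set: two obvious groups in 2-D.
data = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                 [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]])
centroids, assignments = k_means(data, k=2)
print(centroids)
print(assignments)
```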

K-means clustering is a special case of the Expectation-Maximization (EM) algorithm, specifically when applied to a mixture of Gaussians with certain simplifications. Here's a clear explanation:

Expectation-Maximization (EM) Algorithm:

The EM algorithm is an iterative method used to find maximum likelihood estimates of parameters
in probabilistic models, typically where the model depends on unobserved latent variables.

Steps:

1. E-step (Expectation): Compute the expected value of the latent variables using the current
estimates of the parameters.
2. M-step (Maximization): Maximize the expected log-likelihood to update the parameters.

K-Means as a Special Case of EM:

K-means can be interpreted as a simplified EM algorithm applied to a Gaussian Mixture Model (GMM) with the following restrictions:

Component            GMM (EM)                                   K-Means (Special Case)
Clusters             Gaussian components                        Hard-assigned clusters
Covariance           Full covariance matrices                   Identity (spherical) covariance for all clusters
Responsibilities     Soft assignment (probability)              Hard assignment (0 or 1)
Parameters learned   Means, covariances, mixing coefficients    Only cluster centroids

Mapping K-Means to EM Steps:


EM Step K-Means Equivalent

E-step Assign each data point to the nearest centroid (hard assignment)

M-step Update each centroid to be the mean of the points assigned to it


Conclusion:

K-means is a limiting case of the EM algorithm:

 It assumes clusters are spherical,


 Uses hard assignments instead of soft probabilities,
 Doesn’t estimate covariances or priors.

Because of these simplifications, K-means is faster but less flexible than a full EM on Gaussian
mixtures.
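As an illustration of this relationship (a sketch assuming scikit-learn is available; the toy data is made up), the snippet below fits K-means and a spherical-covariance Gaussian mixture to the same data, contrasting K-means' hard labels with the GMM's soft responsibilities.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy data: two roughly spherical blobs.
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
               rng.normal(4, 0.5, size=(50, 2))])

# K-means: hard assignments, only centroids are learned.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("K-means centroids:\n", km.cluster_centers_)
print("Hard labels:", km.labels_[:5])

# EM on a spherical Gaussian mixture: soft responsibilities per point.
gmm = GaussianMixture(n_components=2, covariance_type="spherical",
                      random_state=0).fit(X)
print("GMM means:\n", gmm.means_)
print("Soft responsibilities:\n", gmm.predict_proba(X[:5]).round(3))
```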

The FP Growth algorithm in data mining is a popular method for frequent pattern
mining. The algorithm is efficient for mining frequent item sets in large datasets. It
works by constructing a frequent pattern tree (FP-tree) from the input dataset.
The FP Growth algorithm was proposed by Han et al. in 2000 and is a powerful tool for
frequent pattern mining in data mining. It is widely used in various applications
such as market basket analysis, bioinformatics, and web usage mining.

FP Growth in Data Mining


The FP Growth algorithm is a popular method for frequent pattern mining in data
mining. It works by constructing a frequent pattern tree (FP-tree) from the input
dataset. The FP-tree is a compressed representation of the dataset that captures the
frequency and association information of the items in the data.

The algorithm first scans the dataset and maps each transaction to a path in the
tree. Items are ordered in each transaction based on their frequency, with the most
frequent items appearing first. Once the FP tree is constructed, frequent itemsets
can be generated by recursively mining the tree. This is done by starting at the
bottom of the tree and working upwards, finding all combinations of itemsets that
satisfy the minimum support threshold.

The FP Growth algorithm in data mining has several advantages over other
frequent pattern mining algorithms, such as Apriori. The Apriori algorithm is not
well suited to large datasets because it generates a large number of
candidates and requires multiple scans of the database to mine frequent itemsets. In
comparison, the FP Growth algorithm requires only two scans of the data and a
comparatively small amount of memory to construct the FP tree. It can also be
parallelized to improve performance.

Working of the FP Growth Algorithm


The working of the FP Growth algorithm in data mining can be summarized in the
following steps:
 Scan the database:
In this step, the algorithm scans the input dataset to determine the frequency
of each item. This determines the order in which items are added to the FP
tree, with the most frequent items added first.
 Sort items:
In this step, the items in the dataset are sorted in descending order of
frequency. The infrequent items that do not meet the minimum support
threshold are removed from the dataset. This helps to reduce the dataset's
size and improve the algorithm's efficiency.
 Construct the FP-tree:
In this step, the FP-tree is constructed. The FP-tree is a compact data
structure that stores the frequent itemsets and their support counts.
 Generate frequent itemsets:
Once the FP-tree has been constructed, frequent itemsets can be generated
by recursively mining the tree. Starting at the bottom of the tree, the
algorithm finds all combinations of frequent item sets that satisfy the
minimum support threshold.
 Generate association rules:
Once all frequent item sets have been generated, the algorithm post-
processes the generated frequent item sets to generate association rules,
which can be used to identify interesting relationships between the items in
the dataset.
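The steps above can be run end to end with an off-the-shelf implementation. The sketch below assumes the mlxtend library is installed; the basket data and thresholds are made up for illustration.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

# Illustrative transaction database.
transactions = [
    ["bread", "milk", "butter"],
    ["bread", "milk"],
    ["milk", "butter"],
    ["bread", "butter"],
    ["bread", "milk", "butter"],
]

# Encode transactions as a boolean item matrix.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

# Steps 1-4: scan, sort, build the FP-tree, and mine frequent itemsets.
frequent = fpgrowth(onehot, min_support=0.6, use_colnames=True)
print(frequent)

# Step 5: post-process the frequent itemsets into association rules.
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```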

FP Tree
The FP-tree (Frequent Pattern tree) is a data structure used in the FP Growth
algorithm for frequent pattern mining. It represents the frequent itemsets in the
input dataset compactly and efficiently. The FP tree consists of the following
components:

 Root Node:
The root node of the FP-tree represents an empty set. It has no associated
item but a pointer to the first node of each item in the tree.
 Item Node:
Each item node in the FP-tree represents a unique item in the dataset. It
stores the item name and the frequency count of the item in the dataset.
 Header Table:
The header table lists all the unique items in the dataset, along with their
frequency count. It is used to track each item's location in the FP tree.
 Child Node:
Each child node of an item node represents an item that co-occurs with the
item the parent node represents in at least one transaction in the dataset.
 Node Link:
The node-link is a pointer that connects each item in the header table to the
first node of that item in the FP-tree. It is used to traverse the conditional
pattern base of each item during the mining process.

The FP tree is constructed by scanning the input dataset and inserting each
transaction into the tree one at a time. For each transaction, the items are sorted in
descending order of frequency count and then added to the tree in that order. If an
item already exists along the current path in the tree, its frequency count is
incremented; if it does not, a new node is created for that item and a new branch is
added to the tree. We will understand in detail how the FP-tree is constructed in the
next section.
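A minimal Python sketch of these components (the class and function names are illustrative, not from the original source) is shown below; it inserts frequency-sorted transactions and threads the header-table node links, using the ordered itemsets from the example in the next section.

```python
class FPNode:
    """One item node in the FP-tree."""
    def __init__(self, item, parent):
        self.item = item          # item name (None for the root node)
        self.count = 1            # frequency count along this path
        self.parent = parent
        self.children = {}        # item -> child FPNode
        self.node_link = None     # link to the next node holding the same item

def insert_transaction(root, sorted_items, header_table):
    """Insert one frequency-sorted transaction into the FP-tree."""
    node = root
    for item in sorted_items:
        if item in node.children:
            # Item already on this path: just increment its count.
            node.children[item].count += 1
        else:
            # Item not on this path: create a new node and branch.
            child = FPNode(item, parent=node)
            node.children[item] = child
            # Thread the new node onto the header table's node-link chain.
            if item not in header_table:
                header_table[item] = child
            else:
                link = header_table[item]
                while link.node_link is not None:
                    link = link.node_link
                link.node_link = child
        node = node.children[item]

# Usage with the ordered itemsets from the worked example in the next section:
root, header = FPNode(None, None), {}
for t in [["K","E","M","O","Y"], ["K","E","O","Y"], ["K","E","M"],
          ["K","M","Y"], ["K","E","O"]]:
    insert_transaction(root, t, header)
print(root.children["K"].count)   # 5: every transaction passes through K
```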

Algorithm by Han
Let’s understand with an example how the FP Growth algorithm in data mining
can be used to mine frequent itemsets. Suppose we have a dataset of transactions as
shown below:

Transaction ID Items
T1 {M, N, O, E, K, Y}
T2 {D, O, E, N, Y, K}
T3 {K, A, M, E}
T4 {M, C, U, Y, K}
T5 {C, O, K, O, E, I}

Let’s scan the above database and compute the frequency of each item as shown in
the below table.

Item Frequency
A 1
C 2
D 1
E 4
I 1
K 5
M 3
N 2
O 3
U 1
Y 3

Let’s consider minimum support as 3. After removing all the items below
minimum support in the above table, we would be left with these items - {K : 5, E :
4, M : 3, O : 3, Y : 3}. Let’s re-order the transaction database based on the items
above minimum support: in each transaction, we will remove the infrequent items
and re-order the remaining items in descending order of their frequency, as
shown in the table below.

Transaction ID Items Ordered Itemset


T1 {M, N, O, E, K, Y} {K, E, M, O, Y}
T2 {D, O, E, N, Y, K} {K, E, O, Y}
T3 {K, A, M, E} {K, E, M}
T4 {M, C, U, Y, K} {K, M, Y}
T5 {C, O, K, O, E, I} {K, E, O}

Now we will use the ordered itemset in each transaction to build the FP tree. Each
transaction will be inserted individually to build the FP tree, as shown below -

 First Transaction {K, E, M, O, Y}:


In this transaction, all items are simply linked, and their support count is
initialized as 1.

 Second Transaction {K, E, O, Y}:


In this transaction, we will increase the support count of K and E in the tree
to 2. As no direct link is available from E to O, we will insert a new path
for O and Y and initialize their support count as 1.

 Third Transaction {K, E, M}:

After inserting this transaction, we will increase the support count for K and
E to 3 and for M to 2.
 Fourth Transaction {K, M, Y} and Fifth Transaction {K, E, O}:
After inserting the last two transactions, K's support count becomes 5 and E's
becomes 4; a new branch K → M → Y is created for the fourth transaction, and
the count of the existing E → O path is incremented for the fifth.

Now we will create a conditional pattern base for all the items. The conditional
pattern base of an item is the set of prefix paths in the tree that end at that item. For
example, for item O, the paths {K, E, M} and {K, E} lead to item O. The conditional
pattern bases for all items are shown in the table below:

Item Conditional Pattern Base


Y {K, E, M, O : 1}, {K, E, O : 1}, {K, M : 1}
O {K, E, M : 1}, {K, E : 2}
M {K, E : 2}, {K : 1}
E {K : 4}
K

Now, for each item, we will build a conditional frequent pattern tree. It is computed
by taking the conditional pattern base of a given frequent item, summing the support
counts of each item across all the paths in that base, and keeping the items that still
satisfy the minimum support threshold. The resulting conditional FP trees are shown
in the table below:

Item Conditional Pattern Base Conditional FP Tree


Y {K, E, M, O : 1}, {K, E, O : 1}, {K, M : 1} {K : 3}
O {K, E, M : 1}, {K, E : 2} {K, E : 3}
M {K, E : 2}, {K: 1} {K : 3}
E {K: 4} {K: 4}
K

From the above conditional FP trees, we will generate the frequent itemsets as
shown in the table below:

Item Frequent Patterns


Y {K, Y - 3}
O {K, O - 3}, {E, O - 3}, {K, E, O - 3}
M {K, M - 3}
E {K, E - 4}

FP Growth Algorithm Vs. Apriori Algorithm


Here's a factor-by-factor comparison between the FP Growth algorithm and the
Apriori algorithm:

Working: FP Growth uses an FP-tree to mine frequent itemsets, whereas Apriori
mines frequent itemsets in an iterative manner - 1-itemsets, 2-itemsets, 3-itemsets,
etc.

Candidate generation: FP Growth generates frequent itemsets by constructing the
FP-tree and recursively generating conditional pattern bases, whereas Apriori
generates candidate itemsets by joining and pruning.

Data scanning: FP Growth scans the database only twice, to construct the FP-tree
and to generate the conditional pattern bases, whereas Apriori scans the database
multiple times to find frequent itemsets.

Memory usage: FP Growth requires less memory because the FP-tree compresses
the database, whereas Apriori requires a large amount of memory to store candidate
itemsets.

Speed: FP Growth is faster due to efficient data compression and generation of
frequent itemsets, whereas Apriori is slower due to multiple database scans and
candidate generation.

Scalability: FP Growth performs well on large datasets due to efficient data
compression, whereas Apriori performs poorly on large datasets due to the large
number of candidate itemsets.

Advantages of FP Growth Algorithm


The FP Growth algorithm in data mining has several advantages over other
frequent itemset mining algorithms, as mentioned below:
 Efficiency:
FP Growth algorithm is faster and more memory-efficient than other
frequent itemset mining algorithms such as Apriori, especially on large
datasets with high dimensionality. This is because it generates frequent
itemsets by constructing the FP-Tree, which compresses the database and
requires only two scans.
 Scalability:
FP Growth algorithm scales well with increasing database size and itemset
dimensionality, making it suitable for mining frequent itemsets in large
datasets.
 Resistant to noise:
FP Growth algorithm is more resistant to noise in the data than other
frequent itemset mining algorithms, as it generates only frequent itemsets
and ignores infrequent itemsets that may be caused by noise.
 Parallelization:
FP Growth algorithm can be easily parallelized, making it suitable for
distributed computing environments and allowing it to take advantage of
multi-core processors.


Disadvantages of FP Growth Algorithm


While the FP Growth algorithm in data mining has several advantages, it also has
some limitations and disadvantages, as mentioned below:

 Memory consumption:
Although the FP Growth algorithm is more memory-efficient than other
frequent itemset mining algorithms, storing the FP-Tree and the conditional
pattern bases can still require a significant amount of memory, especially for
large datasets.
 Complex implementation:
The FP Growth algorithm is more complex than other frequent itemset
mining algorithms, making it more difficult to understand and implement.

Mining frequent itemsets from an FP-Tree (Frequent Pattern Tree) is a core step in the FP-Growth
algorithm, which is an efficient method of finding frequent patterns in large datasets without
candidate generation.
What is an FP-Tree?

An FP-Tree is a compressed representation of the transaction database that maintains the itemset
association. It's built in such a way that frequent items are arranged in a prefix-tree structure with
their counts.

Steps to Mine Frequent Itemsets from an FP-Tree

Let’s break the process into 3 major steps:

Step 1: Construct the FP-Tree

From the transaction database:

1. Count frequency of all items.


2. Remove infrequent items (those below the minimum support threshold).
3. Sort each transaction by item frequency (descending).
4. Insert transactions into the tree, sharing common prefixes.

You now have the FP-Tree.

Step 2: Mine the FP-Tree (Recursive Mining)

Mining the FP-Tree is done bottom-up using a technique called conditional FP-Trees.

1. Start from the least frequent item (from the header table).
2. Construct its conditional pattern base (the set of paths leading to that item).
3. Build the conditional FP-Tree from the conditional pattern base.
4. Repeat recursively to mine this conditional FP-Tree.
5. Combine itemsets found in the subtree with the item used to build the conditional FP-Tree.

Step 3: Generate Frequent Itemsets

For every item:

 Collect all its prefix paths.


 Build a conditional FP-tree from those paths.
 Recursively mine the conditional FP-tree.
 Combine patterns with the item itself to generate all frequent itemsets.

Example (Simplified)

Assume a simplified transaction dataset:


T1: a b d e
T2: b c d
T3: a b d
T4: a c d e
T5: b c d

 Minimum support: 3
 Frequent items: a, b, c, d
 FP-tree built from sorted transactions.

Now, for item c, we collect prefix paths that end in c, form its conditional FP-tree, and recursively
find combinations with c.
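A small sketch of that prefix-path collection for item c, written in plain Python for this illustration. Because each transaction corresponds to one path in the FP-tree, the items preceding c in each filtered, frequency-sorted transaction form its conditional pattern base:

```python
from collections import Counter

transactions = [
    ["a", "b", "d", "e"],   # T1
    ["b", "c", "d"],        # T2
    ["a", "b", "d"],        # T3
    ["a", "c", "d", "e"],   # T4
    ["b", "c", "d"],        # T5
]
min_support = 3

# Count item frequencies and drop infrequent items (here only 'e' is dropped).
freq = Counter(item for t in transactions for item in t)
keep = {item for item, c in freq.items() if c >= min_support}

# Sort each transaction by descending frequency (ties broken alphabetically).
ordered = [sorted((i for i in t if i in keep), key=lambda i: (-freq[i], i))
           for t in transactions]

# Conditional pattern base of 'c': the items preceding 'c' in each ordered transaction.
cpb = Counter()
for t in ordered:
    if "c" in t:
        cpb[tuple(t[:t.index("c")])] += 1
print(cpb)   # Counter({('d', 'b'): 2, ('d', 'a'): 1})
# Within this base only 'd' reaches the minimum support (2 + 1 = 3),
# so {c, d} is the frequent itemset contributed by item c.
```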

Recursive Nature

The mining is done recursively:

 You break the problem into smaller conditional trees,


 Until the tree is empty or contains a single path (then combinations are easy to generate).

Advantages of FP-Growth

 Avoids candidate generation (unlike Apriori).


 Much faster for dense datasets.
 More memory efficient due to compression.

Summary

Mining frequent itemsets from an FP-Tree involves:

Step Action

1 Build the FP-Tree from the transaction DB

2 Traverse bottom-up to build conditional pattern bases

3 Construct conditional FP-Trees recursively

4 Generate frequent itemsets by combining conditional patterns

What is Dimensionality Reduction?


Dimensionality reduction is the process of reducing the number of features (variables) in a dataset
while preserving as much information as possible. It helps in:
 Speeding up computations
 Removing noise/redundancy
 Visualizing high-dimensional data

1. Principal Component Analysis (PCA)


What is PCA?

PCA transforms the original features into a new set of orthogonal features called principal
components, ordered by the amount of variance they capture.

How PCA Works:

1. Standardize the data (zero mean, unit variance)


2. Compute the covariance matrix
3. Find eigenvalues and eigenvectors of the covariance matrix
4. Sort eigenvectors by eigenvalue magnitude (variance)
5. Project data onto the top-k eigenvectors (principal components)

Key Points:

 Unsupervised method
 Reduces redundancy by projecting onto directions of maximum variance
 First principal component = direction of highest variance
 Can reconstruct original data approximately using fewer components

Example:

From a 100-dimensional dataset, PCA may reduce it to just 2 or 3 dimensions for visualization,
keeping 90–95% of the variance.
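The five steps above translate almost directly into NumPy. The sketch below uses made-up data, and how much variance two components retain depends entirely on the dataset:

```python
import numpy as np

def pca(X, k):
    """Project X (n samples x d features) onto its top-k principal components."""
    # 1. Standardize: zero mean, unit variance per feature.
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Covariance matrix of the standardized features.
    cov = np.cov(X_std, rowvar=False)
    # 3. Eigenvalues / eigenvectors (eigh: the covariance matrix is symmetric).
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 4. Sort eigenvectors by descending eigenvalue (variance captured).
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # 5. Project onto the top-k eigenvectors and report variance kept.
    explained = eigvals[:k].sum() / eigvals.sum()
    return X_std @ eigvecs[:, :k], explained

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))          # made-up 10-dimensional data
X2, ratio = pca(X, k=2)
print(X2.shape, f"variance kept: {ratio:.2%}")
```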

2. Singular Value Decomposition (SVD)


What is SVD?

SVD is a matrix factorization technique. Any matrix A of size m × n can be decomposed as:

A = U Σ V^T

Where:

 U = left singular vectors (orthogonal)


 Σ = diagonal matrix of singular values
 V^T = right singular vectors (orthogonal)

How SVD is Used for Dimensionality Reduction:

1. Compute the SVD of data matrix A


2. Keep only the top-k singular values in Σ, and corresponding columns in U and V
3. Reconstruct the matrix with reduced rank:

A_k = U_k Σ_k V_k^T

This A_k is a low-rank approximation of A.
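A small NumPy sketch of this truncation; the matrix and the rank k are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 5))              # made-up m x n data matrix

# Full SVD: A = U Σ V^T.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the top-k singular values and the corresponding vectors.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# A_k is the best rank-k approximation of A in the least-squares sense.
print("original rank:", np.linalg.matrix_rank(A))
print("approximation error:", np.linalg.norm(A - A_k))
```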

Key Points:

 Works even when covariance matrix is hard to compute


 Often used in recommendation systems (e.g., Netflix)
 More general than PCA; PCA is derived from SVD of the covariance matrix

PCA vs SVD
Feature PCA SVD

Based on Covariance matrix Matrix factorization

Output Principal components Singular values & vectors

Use case Variance analysis, visualization Data compression, matrix approximation

Handles sparse data Less ideal More suitable

Can reconstruct data? Yes, approx. Yes, approx.

Real-Life Applications
 PCA: Face recognition, gene expression analysis, data visualization
 SVD: Image compression, collaborative filtering, latent semantic analysis (NLP)
