ML 4
The K-means algorithm categorizes the items into k groups, or clusters, of similar items. To measure that similarity we use the Euclidean distance. The algorithm works as follows:
1. First, we randomly initialize k points, called means or cluster centroids.
2. We assign each item to its closest mean and update that mean's coordinates, which are the averages of the items assigned to that cluster so far.
3. We repeat the process for a given number of iterations, and at the end we have our clusters.
The "points" mentioned above are called means because they are the mean
values of the items categorized in them. To initialize these means, we have
a lot of options. An intuitive method is to initialize the means at random
items in the data set. Another method is to initialize the means at random
values between the boundaries of the data set. For example, if for a feature x the items take values in [0, 3], we initialize each mean's x coordinate to a random value in [0, 3]. A minimal sketch of the procedure follows below.
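Here is a minimal NumPy sketch of the whole procedure, using random-item initialization and a fixed iteration count; the function and variable names are our own illustrative choices, not a reference implementation:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-means: random-item initialization, Euclidean distance,
    fixed number of iterations."""
    rng = np.random.default_rng(seed)
    # Initialize the k means at randomly chosen items from the data set.
    means = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # Assign each item to its closest mean (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update each mean to the average of the items in its cluster.
        for j in range(k):
            if np.any(labels == j):
                means[j] = X[labels == j].mean(axis=0)
    return means, labels

# Toy usage: two well-separated 2-D blobs.
X = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [2.9, 3.1]])
means, labels = kmeans(X, k=2)
```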
The EM algorithm is an iterative method used to find maximum likelihood estimates of parameters
in probabilistic models, typically where the model depends on unobserved latent variables.
Steps:
1. E-step (Expectation): Compute the expected value of the latent variables using the current
estimates of the parameters.
2. M-step (Maximization): Maximize the expected log-likelihood to update the parameters.
Viewed this way, K-means is a special case of EM on a Gaussian mixture: its E-step assigns each data point to the nearest centroid (a hard assignment rather than a probabilistic one), and its M-step recomputes each centroid as the mean of its assigned points.
Because of these simplifications, K-means is faster but less flexible than a full EM on Gaussian
mixtures.
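To make the contrast concrete, here is a minimal NumPy sketch of EM for a two-component 1-D Gaussian mixture; the initialization strategy and function name are our own illustrative choices:

```python
import numpy as np

def em_gmm_1d(x, n_iters=50):
    """Minimal EM for a two-component 1-D Gaussian mixture."""
    # Crude initialization: place the means at the extremes of the data.
    mu = np.array([x.min(), x.max()])
    var = np.array([x.var(), x.var()])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iters):
        # E-step: soft responsibilities (contrast with K-means' hard assignment).
        dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances from the responsibilities.
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-9
    return mu, var, pi

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(5.0, 1.0, 200)])
mu, var, pi = em_gmm_1d(x)   # mu converges near [0, 5]
```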
The FP Growth algorithm in data mining is a popular method for frequent pattern
mining. The algorithm is efficient for mining frequent item sets in large datasets. It
works by constructing a frequent pattern tree (FP-tree) from the input dataset.
The FP Growth algorithm was developed by Han et al. in 2000 and is a powerful tool for frequent pattern mining in data mining. It is widely used in applications such as market basket analysis, bioinformatics, and web usage mining.
The algorithm first scans the dataset and maps each transaction to a path in the
tree. Items are ordered in each transaction based on their frequency, with the most
frequent items appearing first. Once the FP tree is constructed, frequent itemsets
can be generated by recursively mining the tree. This is done by starting at the
bottom of the tree and working upwards, finding all combinations of itemsets that
satisfy the minimum support threshold.
The FP Growth algorithm in data mining has several advantages over other
frequent pattern mining algorithms, such as Apriori. The Apriori algorithm is not
suitable for handling large datasets because it generates a large number of
candidates and requires multiple scans of the database to mine frequent itemsets. In
comparison, the FP Growth algorithm requires only two scans of the data and a
small amount of memory to construct the FP-tree. It can also be parallelized to
improve performance.
FP Tree
The FP-tree (Frequent Pattern tree) is a data structure used in the FP Growth
algorithm for frequent pattern mining. It represents the frequent itemsets in the
input dataset compactly and efficiently. The FP tree consists of the following
components:
Root Node:
The root node of the FP-tree represents the empty set. It has no associated item; it only holds links to its child nodes.
Item Node:
Each item node in the FP-tree represents a unique item in the dataset. It
stores the item name and the frequency count of the item in the dataset.
Header Table:
The header table lists all the unique items in the dataset, along with their
frequency count. It is used to track each item's location in the FP tree.
Child Node:
Each child node of an item node represents an item that co-occurs with the
item the parent node represents in at least one transaction in the dataset.
Node Link:
The node-link is a pointer that connects each item in the header table to the
first node of that item in the FP-tree. It is used to traverse the conditional
pattern base of each item during the mining process.
The FP-tree is constructed by scanning the input dataset and inserting each transaction into the tree one at a time. For each transaction, the items are sorted in descending order of frequency count and then added to the tree in that order. If the current item already exists as a child of the current node, its frequency count is incremented and the existing path is shared. If it does not, a new node is created for that item and a new branch is added to the tree. We will see in detail how the FP-tree is constructed in the next section; a sketch in code is given below.
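The following is a minimal Python sketch of this construction; the class and function names (FPNode, build_fp_tree) are our own, not part of any library. It mirrors the components described above: root, item nodes, header table, and node-links.

```python
class FPNode:
    """One node of the FP-tree: an item, its count, a parent link,
    its children, and a node-link to the next node for the same item."""
    def __init__(self, item, parent):
        self.item = item
        self.count = 1
        self.parent = parent
        self.children = {}
        self.node_link = None

def build_fp_tree(transactions, min_support):
    # First scan: count how many transactions contain each item.
    freq = {}
    for t in transactions:
        for item in set(t):
            freq[item] = freq.get(item, 0) + 1
    frequent = {i: c for i, c in freq.items() if c >= min_support}

    root = FPNode(None, None)   # the root represents the empty set
    header = {}                 # header table: item -> first node for that item
    # Second scan: insert each transaction, most frequent items first.
    for t in transactions:
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            if item in node.children:
                node.children[item].count += 1      # shared prefix: bump count
            else:
                child = FPNode(item, node)          # new node, new branch
                node.children[item] = child
                # Thread the node-link chain starting from the header table.
                if item not in header:
                    header[item] = child
                else:
                    link = header[item]
                    while link.node_link is not None:
                        link = link.node_link
                    link.node_link = child
            node = node.children[item]
    return root, header

# The transactions from the worked example below.
transactions = [["M", "N", "O", "E", "K", "Y"], ["D", "O", "E", "N", "Y", "K"],
                ["K", "A", "M", "E"], ["M", "C", "U", "Y", "K"],
                ["C", "O", "K", "I", "E"]]
root, header = build_fp_tree(transactions, min_support=3)
print(root.children["K"].count)                 # 5: every transaction contains K
print(root.children["K"].children["E"].count)   # 4
```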
Algorithm by Han
Let’s understand with an example how the FP Growth algorithm in data mining
can be used to mine frequent itemsets. Suppose we have a dataset of transactions as
shown below:
Transaction ID | Items
T1 | {M, N, O, E, K, Y}
T2 | {D, O, E, N, Y, K}
T3 | {K, A, M, E}
T4 | {M, C, U, Y, K}
T5 | {C, O, K, I, E}
Let’s scan the above database and compute the frequency of each item, as shown in the table below.
Item | Frequency
A | 1
C | 2
D | 1
E | 4
I | 1
K | 5
M | 3
N | 2
O | 3
U | 1
Y | 3
Let’s consider the minimum support to be 3. After removing all items below the minimum support from the above table, we are left with {K : 5, E : 4, M : 3, O : 3, Y : 3}. Next, we re-order the transaction database based on these items: in each transaction, we remove the infrequent items and sort the rest in descending order of frequency, as shown in the table below.
Transaction ID | Ordered Frequent Items
T1 | {K, E, M, O, Y}
T2 | {K, E, O, Y}
T3 | {K, E, M}
T4 | {K, M, Y}
T5 | {K, E, O}
Now we will use the ordered itemset in each transaction to build the FP-tree, inserting the transactions one at a time. The finished tree looks like this (each node shows item : count):
null (root)
└── K : 5
    ├── E : 4
    │   ├── M : 2
    │   │   └── O : 1
    │   │       └── Y : 1
    │   └── O : 2
    │       └── Y : 1
    └── M : 1
        └── Y : 1
Now we will create a conditional pattern base for every item. The conditional pattern base of an item is the set of prefix paths in the tree that end at that item. For example, for item O the prefix paths are {K, E, M} (count 1) and {K, E} (count 2). The conditional pattern bases for all items are shown in the table below:
Item | Conditional Pattern Base
Y | {K, E, M, O : 1}, {K, E, O : 1}, {K, M : 1}
O | {K, E, M : 1}, {K, E : 2}
M | {K, E : 2}, {K : 1}
E | {K : 4}
K | {}
Now, for each item, we will build a conditional frequent pattern tree. It is computed by summing, over all paths in the item's conditional pattern base, the support count of each element, and keeping only the elements whose total meets the minimum support. The conditional frequent pattern trees are shown in the table below:
Item | Conditional FP-tree
Y | {K : 3}
O | {K : 3, E : 3}
M | {K : 3}
E | {K : 4}
K | {}
From the above conditional FP-trees, we generate the frequent itemsets by combining each item with every subset of its conditional tree, as shown in the table below:
Item | Frequent Itemsets
Y | {K, Y : 3}
O | {K, O : 3}, {E, O : 3}, {K, E, O : 3}
M | {K, M : 3}
E | {K, E : 4}
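As a sanity check, the same itemsets can be reproduced with the third-party mlxtend library (assuming it is installed; note that its min_support parameter is a fraction of transactions, so a count of 3 out of 5 becomes 0.6):

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [["M", "N", "O", "E", "K", "Y"], ["D", "O", "E", "N", "Y", "K"],
                ["K", "A", "M", "E"], ["M", "C", "U", "Y", "K"],
                ["C", "O", "K", "I", "E"]]
# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder().fit(transactions)
df = pd.DataFrame(te.transform(transactions), columns=te.columns_)
# min_support is a fraction of transactions here: 3 of 5 = 0.6.
print(fpgrowth(df, min_support=0.6, use_colnames=True))
```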
The FP Growth algorithm also has some disadvantages:
Memory consumption:
Although the FP Growth algorithm is more memory-efficient than other
frequent itemset mining algorithms, storing the FP-Tree and the conditional
pattern bases can still require a significant amount of memory, especially for
large datasets.
Complex implementation:
The FP Growth algorithm is more complex than other frequent itemset
mining algorithms, making it more difficult to understand and implement.
Mining frequent itemsets from an FP-Tree (Frequent Pattern Tree) is a core step in the FP-Growth
algorithm, which is an efficient method of finding frequent patterns in large datasets without
candidate generation.
What is an FP-Tree?
An FP-Tree is a compressed representation of the transaction database that maintains the itemset
association. It's built in such a way that frequent items are arranged in a prefix-tree structure with
their counts.
Mining the FP-Tree is done bottom-up using a technique called conditional FP-Trees.
1. Start from the least frequent item (from the header table).
2. Construct its conditional pattern base (the set of paths leading to that item).
3. Build the conditional FP-Tree from the conditional pattern base.
4. Repeat recursively to mine this conditional FP-Tree.
5. Combine itemsets found in the subtree with the item used to build the conditional FP-Tree.
✅ Example (Simplified)
Minimum support: 3
Frequent items: a, b, c, d
FP-tree built from sorted transactions.
Now, for item c, we collect prefix paths that end in c, form its conditional FP-tree, and recursively
find combinations with c.
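To make step 2 concrete, here is a small Python sketch; for brevity it reads the prefix paths directly off the ordered transactions of the earlier worked example instead of following node-links through the tree, and the function name is our own:

```python
def conditional_pattern_base(item, ordered_transactions):
    """Collect the prefix paths that end at `item` (step 2 above)."""
    base = []
    for t in ordered_transactions:
        if item in t:
            prefix = t[:t.index(item)]   # the path leading to the item
            if prefix:
                base.append(prefix)
    return base

# The ordered transactions from the worked example earlier.
ordered = [["K", "E", "M", "O", "Y"], ["K", "E", "O", "Y"], ["K", "E", "M"],
           ["K", "M", "Y"], ["K", "E", "O"]]
print(conditional_pattern_base("O", ordered))
# -> [['K', 'E', 'M'], ['K', 'E'], ['K', 'E']]
```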
Principal Component Analysis (PCA)
PCA transforms the original features into a new set of orthogonal features, called principal components, ordered by the amount of variance they capture.
Key Points:
Unsupervised method
Reduces redundancy by projecting onto directions of maximum variance
First principal component = direction of highest variance
Can reconstruct original data approximately using fewer components
Example:
From a 100-dimensional dataset, PCA may reduce it to just 2 or 3 dimensions for visualization,
keeping 90–95% of the variance.
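A small sketch with scikit-learn, assuming it is installed; the data here is synthetic and constructed to lie near a 3-dimensional subspace, so most of the variance survives the projection:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 200 samples in 100 dimensions that really live
# in a 3-dimensional subspace, plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 100)) + 0.1 * rng.normal(size=(200, 100))

pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)             # 200 x 3 projection
print(pca.explained_variance_ratio_.sum())   # close to 1 for this data
```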
Singular Value Decomposition (SVD)
SVD is a matrix factorization technique. Any matrix A of size m × n can be decomposed as:
A = U Σ V^T
Where:
U is an m × m orthogonal matrix whose columns are the left singular vectors,
Σ is an m × n diagonal matrix holding the singular values in decreasing order,
V is an n × n orthogonal matrix whose columns are the right singular vectors.
Keeping only the k largest singular values (and the corresponding columns of U and V) gives the best rank-k approximation of A:
A_k = U_k Σ_k V_k^T
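A minimal NumPy sketch of the decomposition and the rank-k approximation above (the matrix sizes are arbitrary illustration choices):

```python
import numpy as np

A = np.random.default_rng(0).normal(size=(6, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)  # A == U @ diag(s) @ Vt

k = 2  # keep the two largest singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]       # best rank-k approximation
print(np.linalg.norm(A - A_k))                    # Frobenius reconstruction error
```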
PCA vs SVD
In practice, PCA is usually computed via the SVD of the mean-centered data matrix: the right singular vectors give the principal components, and the squared singular values are proportional to the variance each component captures.
Real-Life Applications
PCA: Face recognition, gene expression analysis, data visualization
SVD: Image compression, collaborative filtering, latent semantic analysis (NLP)