U3 - FP Trees - 5th Sem - DS
As we all know, Apriori is an algorithm for frequent pattern mining that focuses on
generating candidate itemsets and discovering the most frequent itemsets. It greatly
reduces the number of itemsets that must be examined in the database; however, Apriori
has its own shortcomings as well.
The FP-Growth Algorithm was proposed by Han et al. in 2000. It is an efficient and
scalable method for mining the complete set of frequent patterns by pattern fragment
growth, using an extended prefix-tree structure, called the frequent-pattern tree
(FP-tree), for storing compressed but crucial information about frequent patterns.
In their study, Han et al. showed that the method outperforms other popular methods
for mining frequent patterns, e.g. the Apriori Algorithm and TreeProjection. Later
works showed that FP-Growth also performs better than other methods, including Eclat
and Relim. The popularity and efficiency of the FP-Growth Algorithm have led to many
studies proposing variations to improve its performance.
An FP-tree has the following structure:
1. One root labelled as "null", with a set of item-prefix subtrees as children, and a
frequent-item-header table.
2. Each node in an item-prefix subtree has three fields:
○ Item-name: the item this node represents.
○ Count: the number of transactions represented by the portion of the path reaching
this node.
○ Node-link: a link to the next node in the FP-tree carrying the same item name, or
null if there is none.
3. Each entry in the frequent-item-header table has two fields: the item-name and the
head of the node-link, which points to the first node in the FP-tree carrying that
item name.
This tree structure maintains the associations between the itemsets. The database
is fragmented using one frequent item at a time; each fragmented part is called a
“pattern fragment”. The itemsets of these pattern fragments are then analyzed. With
this method, the search for frequent itemsets is reduced considerably.
FP Tree
Frequent Pattern Tree is a tree-like structure that is made with the initial itemsets of
the database. The purpose of the FP tree is to mine the most frequent pattern. Each
node of the FP tree represents an item of the itemset.
The root node represents null, while the lower nodes represent the itemsets. The
associations between the nodes, that is, between an itemset and the other itemsets it
occurs with, are maintained while forming the tree.
Let us see the steps followed to mine the frequent patterns using the frequent
pattern growth algorithm:
#1) The first step is to scan the database to find the occurrences of the itemsets in
the database. This step is the same as the first step of Apriori. The count of each
1-itemset in the database is called its support count or frequency.
#2) The second step is to construct the FP tree. For this, create the root of the tree.
The root is represented by null.
#3) The next step is to scan the database again and examine the transactions.
Examine the first transaction and find the items in it. The item with the maximum
count is taken at the top, followed by the item with the next lower count, and so on.
This means that a branch of the tree is constructed with the transaction's items in
descending order of count.
#4) The next transaction in the database is examined. Its items are again ordered in
descending order of count. If any items of this transaction are already present in
another branch (for example, from the 1st transaction), then this transaction's
branch shares a common prefix with that branch, starting from the root.
This means that the new nodes for the remaining items of this transaction are linked
below the shared prefix.
#5) The count of each itemset is incremented as it occurs in the transactions: the
counts of both the common nodes and the new nodes are increased by 1 as they are
created and linked according to the transactions.
#6) The next step is to mine the created FP tree. For this, the lowest node (the
least frequent item) is examined first, along with its node-links. The lowest node
represents a frequent pattern of length 1. From it, traverse the paths in the FP tree
back towards the root. This path or set of paths is called the conditional pattern
base. (A minimal code sketch of these steps is given below.)
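To make steps #1 to #6 concrete, here is a minimal Python sketch of FP-tree
construction and conditional-pattern-base extraction. The names FPNode,
build_fp_tree and conditional_pattern_base, as well as the sample transactions, are
illustrative assumptions and not part of any standard library.

```python
from collections import Counter, defaultdict

class FPNode:
    """One node of the FP-tree: item name, count, parent link, children."""
    def __init__(self, item, parent):
        self.item = item          # item name, or None for the root ("null")
        self.count = 0            # transactions sharing the prefix ending here
        self.parent = parent
        self.children = {}        # item name -> child FPNode

def build_fp_tree(transactions, min_support):
    # Pass 1: support counts of 1-itemsets (step #1).
    counts = Counter(item for t in transactions for item in t)
    frequent = {i for i, c in counts.items() if c >= min_support}

    root = FPNode(None, None)     # step #2: root represented by null
    header = defaultdict(list)    # frequent-item-header table: item -> node-links

    # Pass 2: insert each transaction with items in descending support order
    # (steps #3 to #5), sharing common prefixes and incrementing counts.
    for t in transactions:
        items = sorted((i for i in t if i in frequent),
                       key=lambda i: (-counts[i], i))
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)   # extend the node-link chain
            child.count += 1
            node = child
    return root, header

def conditional_pattern_base(item, header):
    # Step #6: for each node carrying `item`, walk up to the root and record
    # the prefix path together with that node's count.
    base = []
    for node in header[item]:
        path, parent = [], node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        if path:
            base.append((list(reversed(path)), node.count))
    return base

if __name__ == "__main__":
    transactions = [["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
                    ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
                    ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"]]
    root, header = build_fp_tree(transactions, min_support=2)
    print(conditional_pattern_base("I5", header))
```

Running the sketch prints, for the least frequent item I5, the prefix paths leading
to its nodes together with the counts of those nodes, which is exactly the
conditional pattern base described in step #6.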
FP Growth vs Apriori
Pattern Generation
○ FP Growth: generates patterns by constructing an FP tree.
○ Apriori: generates patterns by pairing items into singletons, pairs, triplets, and so on.
Candidate Generation
○ FP Growth: there is no candidate generation.
○ Apriori: uses candidate generation.
Process
○ FP Growth: the process is faster as compared to Apriori; the runtime increases linearly with an increase in the number of itemsets.
○ Apriori: the process is comparatively slower than FP Growth; the runtime increases exponentially with an increase in the number of itemsets.
Memory Usage
○ FP Growth: a compact version of the database is kept in memory.
○ Apriori: the candidate itemsets are kept in memory.
Advantages of FP Growth Algorithm
1. This algorithm needs to scan the database only twice, whereas Apriori scans
the transactions once for every iteration.
2. The pairing of items is not done in this algorithm, and this makes it
faster.
3. The database is stored in a compact version in memory.
4. It is efficient and scalable for mining both long and short frequent
patterns.
Word Co-occurrence
Word co-occurrence statistics describe how words occur together, which in turn
captures the relationships between words. Word co-occurrence statistics are
computed simply by counting how often two or more words occur together in a
given corpus. Co-occurrence matrices analyze text in context. Word
embeddings and vector semantics are ways to understand words in their
context, namely semantic analysis in NLP (as opposed to syntactic analysis
such as language modeling using n-grams, Part-of-Speech (POS) tagging, and
Named Entity Recognition (NER)).
The word-word co-occurrence matrix is also called the term-term matrix. It is a
square matrix, since both its rows and its columns correspond to the words of the
vocabulary. If two words appear in the same context, they are said to have occurred
in the same occurrence context. The context is usually taken as a window around a
word, for example a span of (k + n) words around it; this window serves as the
context. A window-based context also captures co-occurrences between words which are
close to each other but are in different sentences.
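As a rough illustration of windowed co-occurrence counting, here is a small Python
sketch; the function name cooccurrence_counts, the window size, and the example
sentence are assumptions made for the example, not taken from a specific library.

```python
from collections import Counter, defaultdict

def cooccurrence_counts(tokens, window=2):
    """Count how often each pair of words appears within `window` positions
    of each other (the window serves as the context)."""
    counts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        # Look at up to `window` words to the right; record the pair in both
        # directions so the resulting matrix is square and symmetric.
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            counts[word][tokens[j]] += 1
            counts[tokens[j]][word] += 1
    return counts

if __name__ == "__main__":
    text = "he is not lazy he is intelligent he is smart"
    counts = cooccurrence_counts(text.split(), window=2)
    print(counts["he"]["is"])   # how often "he" and "is" co-occur in the window
```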
Clickstream Analysis
From the clickstream data itself, a webmaster may be able to identify that users
from certain channels view more or fewer pages, on average. For example, they
may notice that users from search engines view twice as many pages as users
from social media. As a result, the webmaster may choose to focus more
resources on the former channel.
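To show what such a channel comparison might look like, here is a small Python
sketch that computes the average pages viewed per session by channel; the record
layout (channel, pages_viewed) and the sample sessions are hypothetical, not a real
dataset.

```python
from collections import defaultdict

# Hypothetical clickstream records: one entry per session, with the traffic
# channel and the number of pages viewed in that session.
sessions = [
    {"channel": "search", "pages_viewed": 6},
    {"channel": "search", "pages_viewed": 4},
    {"channel": "social", "pages_viewed": 2},
    {"channel": "social", "pages_viewed": 3},
]

# Average pages viewed per session, grouped by channel.
totals = defaultdict(lambda: [0, 0])   # channel -> [page sum, session count]
for s in sessions:
    totals[s["channel"]][0] += s["pages_viewed"]
    totals[s["channel"]][1] += 1

for channel, (pages, count) in totals.items():
    print(channel, pages / count)
```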
It may sound cliché, but the first step in effective clickstream analysis is
understanding your objectives. It's helpful to know whether you are using
clickstream analysis to review your traffic channels and advertising strategies or
to improve your content and its interlinking. Once you know this, the remaining
steps will be much easier.
Like with any data analysis, the next step in clickstream analysis is to collect the
data itself. There are numerous ways to collect clickstream data which we’ll
discuss later. Following the collection, it’s helpful to have a way to visualize or
otherwise review the data in a convenient format. This may be offered by the
same tools used for data collection.
Generally speaking, data analysis is all about identifying patterns and exceptions.
Regardless of what you are trying to achieve with clickstream analysis, you will
want to look at patterns and exceptions in the way users interact with your pages.
If you are comparing traffic channels or advertising strategies, also look at
patterns and exceptions in how users from varying sources interact with your
website.
After analyzing the patterns and exceptions in clickstream data, you should
be able to draw conclusions about the pages on and users of your website.
Finally, you can implement these conclusions to improve your website.