U3 - FP Trees - 5th Sem - DS

Unit 3 - FP Trees

Prepared by: Varun Rao (Dean, Data Science & AI)


For: Data Science - 3rd years

As we all know, Apriori is an algorithm for frequent pattern mining that works by
generating candidate itemsets and discovering the most frequent ones. While its
pruning step greatly reduces the number of candidate itemsets that must be
examined, Apriori has its own shortcomings as well.

The FP-Growth Algorithm was proposed by Han et al. in 2000. It is an efficient and
scalable method for mining the complete set of frequent patterns by pattern fragment
growth, using an extended prefix-tree structure, called the frequent-pattern tree
(FP-tree), to store compressed but crucial information about frequent patterns. In his
study, Han showed that the method outperforms other popular methods for mining frequent
patterns, e.g. the Apriori Algorithm and TreeProjection. Later works showed that
FP-Growth also performs better than other methods, including Eclat and Relim. The
popularity and efficiency of the FP-Growth Algorithm have led to many studies that
propose variations to improve its performance.

Han defines the FP-tree as the tree structure given below:

1. One root labelled "null", with a set of item-prefix subtrees as children, and a
frequent-item-header table.

2. Each node in an item-prefix subtree consists of three fields:

○ Item-name: registers which item the node represents;

○ Count: the number of transactions represented by the portion of the path
reaching the node;

○ Node-link: links to the next node in the FP-tree carrying the same item
name, or null if there is none.

3. Each entry in the frequent-item-header table consists of two fields:

○ Item-name: the same as in the node;

○ Head of node-link: a pointer to the first node in the FP-tree carrying the
item name.

The construction of an FP-tree is subdivided into three major steps.

1. Scan the data set to determine the support count of each item, discard the
infrequent items and sort the frequent items in decreasing order of support.
2. Scan the data set one transaction at a time to create the FP-tree. For each
transaction:
1. If it is a unique transaction, form a new path and set the counter for each
node to 1.
2. If it shares a common prefix itemset, then increment the common itemset
node counters and create new nodes as needed.
3. Continue until every transaction has been mapped onto the tree.
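The construction steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation; the `Node` class, the toy `transactions`, and the `min_support` value are all made-up for the example.

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}          # item-name -> child Node

def build_fp_tree(transactions, min_support):
    # Step 1: count supports, keep frequent items, sort by decreasing count.
    support = Counter(item for t in transactions for item in t)
    frequent = {i for i, c in support.items() if c >= min_support}

    root = Node(None, None)         # the "null" root
    header = {}                     # item-name -> list of nodes (node-links)
    for t in transactions:
        # Order each transaction's frequent items by decreasing support.
        items = sorted((i for i in t if i in frequent),
                       key=lambda i: (-support[i], i))
        node = root
        for item in items:          # Step 2: share prefixes, grow new paths.
            if item not in node.children:
                child = Node(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            node = node.children[item]
            node.count += 1
    return root, header

transactions = [{"a", "b"}, {"b", "c", "d"}, {"a", "b", "c"}, {"a", "b", "d"}]
root, header = build_fp_tree(transactions, min_support=2)
print({item: sum(n.count for n in nodes) for item, nodes in header.items()})
```

Summing the counts over each item's node-links recovers that item's total support, which is a useful sanity check on the tree.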

Shortcomings Of Apriori Algorithm

1. Apriori requires the generation of candidate itemsets. These candidate
itemsets may be very numerous if the number of items in the database is large.
2. Apriori needs multiple scans of the database to check the support of
each itemset generated, and this leads to high costs.

These shortcomings can be overcome using the FP-Growth algorithm.

Frequent Pattern Growth Algorithm

This algorithm is an improvement on the Apriori method. Frequent patterns are
generated without the need for candidate generation. The FP-Growth algorithm
represents the database in the form of a tree called a frequent pattern tree, or FP
tree.

This tree structure maintains the association between the itemsets. The database
is fragmented using one frequent item at a time. Each fragmented part is called a
"pattern fragment". The itemsets of these pattern fragments are analyzed. Thus,
with this method, the search for frequent itemsets is reduced considerably.
FP Tree
Frequent Pattern Tree is a tree-like structure that is made with the initial itemsets of
the database. The purpose of the FP tree is to mine the most frequent pattern. Each
node of the FP tree represents an item of the itemset.

The root node represents null, while the lower nodes represent the itemsets. The
associations of the nodes with the lower nodes, that is, of the itemsets with the
other itemsets, are maintained while forming the tree.

Frequent Pattern Algorithm Steps

The frequent pattern growth method lets us find the frequent pattern without
candidate generation.

Let us see the steps followed to mine the frequent pattern using the frequent
pattern growth algorithm:

#1) The first step is to scan the database to find the occurrences of the itemsets.
This step is the same as the first step of Apriori. The count of 1-itemsets in the
database is called the support count or frequency of the 1-itemset.

#2) The second step is to construct the FP tree. For this, create the root of the tree.
The root is represented by null.

#3) The next step is to scan the database again and examine the transactions.
Examine the first transaction and find the itemset in it. The item with the
maximum count is taken at the top, followed by the next item with a lower count,
and so on. That is, the branch of the tree is constructed with the transaction's
items in descending order of count.

#4) The next transaction in the database is examined. Its items are again ordered
in descending order of count. If any prefix of this transaction is already present
in another branch (for example, from the 1st transaction), then this transaction's
branch shares that common prefix from the root.

This means that the common itemset is linked to a new node for the remaining
items of this transaction.

#5) The count of each node is incremented as it occurs in transactions: both
common nodes and new nodes are increased by 1 as transactions are mapped onto
the tree.

#6) The next step is to mine the created FP tree. For this, the lowest node is
examined first, along with the node-links of the lowest nodes. The lowest node
represents a frequency pattern of length 1. From there, traverse the paths in the
FP tree. These paths are called the conditional pattern base.

The conditional pattern base is a sub-database consisting of the prefix paths in
the FP tree that occur with the lowest node (the suffix).

#7) Construct a conditional FP tree, which is formed from the counts of itemsets
along these paths. Only the itemsets meeting the threshold support are considered
in the conditional FP tree.

#8) Frequent patterns are generated from the conditional FP tree.
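The conditional pattern base of steps #6-#7 can be sketched without an explicit tree: once each transaction is ordered by decreasing support, the prefix path for a suffix item is simply the items that precede it. The `transactions` and `min_support` below are made-up toy values for illustration.

```python
from collections import Counter

transactions = [{"a", "b"}, {"b", "c", "d"}, {"a", "b", "c"}, {"a", "b", "d"}]
min_support = 2

support = Counter(i for t in transactions for i in t)
frequent = [i for i, c in support.items() if c >= min_support]

def conditional_pattern_base(suffix):
    # Each ordered transaction containing the suffix contributes the
    # prefix path that precedes it (with count 1 per transaction here).
    base = []
    for t in transactions:
        items = sorted((i for i in t if i in frequent),
                       key=lambda i: (-support[i], i))
        if suffix in items:
            prefix = items[:items.index(suffix)]
            if prefix:
                base.append(tuple(prefix))
    return base

print(conditional_pattern_base("d"))   # → [('b', 'c'), ('b', 'a')]
```

In a real FP-tree the prefix paths carry the suffix node's count, so identical paths are merged; here each transaction contributes its path with a count of one.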

FP Growth vs Apriori

Pattern Generation
FP Growth: generates patterns by constructing an FP tree.
Apriori: generates patterns by pairing the items into singletons, pairs and triplets.

Candidate Generation
FP Growth: there is no candidate generation.
Apriori: uses candidate generation.

Process
FP Growth: the process is faster than Apriori; its runtime increases linearly with the number of itemsets.
Apriori: the process is comparatively slower; its runtime increases exponentially with the number of itemsets.

Memory Usage
FP Growth: a compact version of the database is saved in memory.
Apriori: the candidate combinations are saved in memory.

Advantages Of FP Growth Algorithm

1. This algorithm needs to scan the database only twice, compared
to Apriori, which scans the transactions for each iteration.
2. The pairing of items is not done in this algorithm and this makes it
faster.
3. The database is stored in a compact version in memory.
4. It is efficient and scalable for mining both long and short frequent
patterns.

Disadvantages Of FP-Growth Algorithm

1. The FP tree is more cumbersome and difficult to build than Apriori's
candidate sets.
2. It may be expensive to build.
3. When the database is large, the tree may not fit in main
memory.

Finding co-occurring words

Word co-occurrence statistics describe how words occur together, which in turn
captures the relationships between words. Word co-occurrence statistics are
computed simply by counting how often two or more words occur together in a
given corpus. Co-occurrence matrices analyze text in context. Word
embeddings and vector semantics are ways to understand words in their
context, namely semantic analysis in NLP (in contrast to syntactic analysis
such as language modeling using n-grams, Part-of-Speech (POS) tagging, and
Named Entity Recognition (NER)).

Unlike the occurrence matrix, which is rectangular, the co-occurrence matrix is
a square matrix that depicts the co-occurrence of two terms in a context. Thus,
the co-occurrence matrix is also sometimes called the term-term matrix. It is
square because it relates each term to every other term.

Typically there are two approaches which are followed:

1. Term-context matrix, e.g. each sentence is treated as a context (there can
be other definitions as well). If two terms occur in the same context, they
are said to have occurred in the same occurrence context.

2. k-skip-n-gram approach, e.g. a sliding window that includes (k+n) words.
This window serves as the context; terms that co-occur within it are said
to have co-occurred.


A disadvantage of the term-context matrix is that it does not consider words
that are close to each other but fall in different sentences.
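A sliding-window co-occurrence count of the kind described above can be sketched as follows. The toy corpus and window size are made-up for illustration, and the matrix is kept as a nested Counter rather than a dense array.

```python
from collections import Counter, defaultdict

corpus = ["the cat sat on the mat", "the dog sat on the log"]
window = 2                                  # words to each side

cooc = defaultdict(Counter)                 # term -> Counter of co-terms
for sentence in corpus:
    tokens = sentence.split()
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                cooc[w][tokens[j]] += 1

print(cooc["sat"]["on"])   # "sat" and "on" are adjacent in both sentences
```

Because co-occurrence within a window is symmetric, `cooc[a][b]` always equals `cooc[b][a]`, which is what makes the resulting matrix a square term-term matrix.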

The concept of looking into word co-occurrences can be extended in
many ways. For example, we may count how many times a sequence of
three words occurs together to generate trigram frequencies. We may
even count how many times a pair of words occurs together in sentences
irrespective of their positions; such occurrences are called
skip-bigram frequencies. Because of such variations in how co-occurrences
are specified, these methods are known in general as n-gram methods. The
term context window is often used to specify the co-occurrence relationship.

Co-occurrence analysis is simply the counting of paired data within a
collection unit. For example, buying shampoo and a brush at a drug store
is an example of co-occurrence. Here the data is the brush and the
shampoo, and the collection unit is the particular transaction. In this
example, the paired data is {shampoo, brush} and it occurs once. Of
course, more items can be purchased at a time, so the pairings become
more numerous as each item is paired with each other item. For
example, if in addition to the two items a third item is purchased, say
goo, then there are three pairings ({shampoo, brush}, {shampoo, goo},
{brush, goo}), again each with a count of one.
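The pair counting just described is a short exercise with the standard library; the `baskets` list below reproduces the shampoo/brush/goo example from the text.

```python
from collections import Counter
from itertools import combinations

baskets = [["shampoo", "brush", "goo"]]

pair_counts = Counter()
for basket in baskets:
    # Every unordered pair of distinct items in the basket co-occurs once.
    pair_counts.update(combinations(sorted(basket), 2))

print(pair_counts)
```

Sorting each basket first gives every pair a canonical order, so {shampoo, brush} and {brush, shampoo} are counted as the same pairing.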

Mining a click stream from a news site


The clickstream analysis is the tracking and analysis of visits to websites.
Although there are other ways to collect this data, clickstream analysis
typically uses the Web server log files to monitor and measure website
activity. This analysis can be used to report user behavior on a specific
website, such as routing, stickiness (a user’s tendency to remain at the
website), where users come from and where they go from the site. It can also
be used for more aggregate measurements, such as the number of hits
(visits), page views, and unique and repeat visitors, which are of value in
understanding how the website operates from a technical, user experience
and business perspective.
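The aggregate measurements mentioned above (hits, page views, unique and repeat visitors) can be computed directly from a parsed server log. The records and field names (`user`, `page`) below are made-up for illustration, standing in for whatever a real log parser would produce.

```python
from collections import Counter

log = [
    {"user": "u1", "page": "/home"},
    {"user": "u1", "page": "/news/sports"},
    {"user": "u2", "page": "/home"},
    {"user": "u1", "page": "/home"},
    {"user": "u3", "page": "/news/tech"},
]

page_views = len(log)                                     # total hits
views_per_page = Counter(r["page"] for r in log)          # per-page views
visits_per_user = Counter(r["user"] for r in log)         # hits per visitor
unique_visitors = len(visits_per_user)
repeat_visitors = sum(1 for c in visits_per_user.values() if c > 1)

print(page_views, unique_visitors, repeat_visitors)   # 5 3 1
print(views_per_page.most_common(1))                  # [('/home', 3)]
```

Real clickstream data would also carry timestamps, referrers and session identifiers, which is what enables the routing and stickiness measures described in the text.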

Applications of Clickstream Analysis


Clickstream analysis has a wide range of applications. Using just clickstream data,
webmasters can identify which pieces of content may need to be improved and optimize
the links between other pieces of content. However, combining clickstream data with
session analytics — to compare traffic channels or advertising strategies — may be
most popular.

Comparing traffic channels

Webmasters can use clickstream analysis to compare traffic channels if they
know how their users first reached the website. With most website analytics
tools, webmasters will have this information; for example, whether a given user
reached the website through a search engine, social media, or by typing the
website's URL into their browser.

From the clickstream data itself, a webmaster may be able to identify that users
from certain channels view more or fewer pages on average. For example, they
may notice that users from search engines view twice as many pages as users
from social media. As a result, the webmaster may choose to focus more
resources on the former channel.

Improving existing content

Clickstream analysis can still be incredibly powerful, even without session
analytics. By looking at the path users take through a website, webmasters are
able to see where users "drop off". With this information, they can choose to
improve the pieces of content which caused users to leave the website.

How to Do Clickstream Analysis


Clickstream analysis is surprisingly easy to get started with. In just four steps,
you can begin to gain insights on the behavior of your website’s users.
1. Understand your objectives

It may sound cliché, but the first step in effective clickstream analysis is
understanding your objectives. It's helpful to know whether you are using
clickstream analysis to review your traffic channels and advertising strategies or
to improve your content and its interlinking. Once you know this, the remaining
steps will be much easier.

2. Collect and visualize data

Like with any data analysis, the next step in clickstream analysis is to collect the
data itself. There are numerous ways to collect clickstream data which we’ll
discuss later. Following the collection, it’s helpful to have a way to visualize or
otherwise review the data in a convenient format. This may be offered by the
same tools used for data collection.

3. Identify patterns and exceptions

Generally speaking, data analysis is all about identifying patterns and exceptions.
Regardless of what you are trying to achieve with clickstream analysis, you will
want to look at patterns and exceptions in the way users interact with your pages.
If you are comparing traffic channels or advertising strategies, also look at
patterns and exceptions in how users from varying sources interact with your
website.

4. Draw and implement conclusions

After analyzing the patterns and exceptions in clickstream data, you should
be able to draw conclusions about the pages on and users of your website.
Finally, you can implement these conclusions to improve your website.
