FP Growth Algorithm Implementation

This document summarizes an article that implements the FP Growth algorithm to find association rules in a dataset. FP Growth is an efficient algorithm that uses an FP-tree structure to find frequent itemsets without candidate generation. The article discusses the FP Growth concept and implements it using Java on a general social survey dataset. This allows the algorithm to determine relevant association rules that occur in the records. The FP-tree structure compresses the dataset and allows mining of frequent patterns by pattern fragment growth without candidate generation.

Article  in  International Journal of Computer Applications · May 2014


DOI: 10.5120/16233-5613


International Journal of Computer Applications (0975 – 8887)
Volume 93 – No.8, May 2014

FP Growth Algorithm Implementation

Shivam Sidhu, Upendra Kumar Meena, Aditya Nawani, Himanshu Gupta, Narina Thakur

Department of Computer Science, Bharati Vidyapeeth’s College of Engineering, New Delhi, India

ABSTRACT
Data mining discovers and assesses significant patterns in data and then validates the patterns identified. It evaluates data from different perspectives and summarizes it into valuable information, which can in turn be used to design business strategies that increase revenue, drive down costs, or both. The Apriori association algorithm is based on pre-computed frequent item sets and has to scan the entire transaction log, dataset or database, which becomes a problem with large item sets. With FP trees there is no need for candidate generation, unlike in the Apriori algorithm, and the frequently occurring item sets are discovered by traversing the FP tree. This paper discusses the FP Tree concept and implements it in Java for a General Social Survey dataset. We use this approach to determine association rules that occur in the dataset. In this manner, relevant rules and patterns can be established in any set of records.

General Terms
Association, FP Growth Algorithm

Keywords
Data mining, Frequent Pattern Tree, Apriori, Association

1. INTRODUCTION
The data mining section discusses data mining techniques such as association. After this brief discussion, the concept of FP Trees (which comes under the Association type) is discussed in detail. The main reason for the popularity of the widely used FP Tree concept is the ‘FP Tree growth’ technique, given by Han.

The General Social Survey (GSS), a statistical dataset, was taken from the MathCS.org website. The GSS conducts basic scientific research on the structure and development of American society with a data-collection program designed to monitor social changes within nations. The GSS data sets contain a standard ‘core’ of demographic and attitudinal questions[1], plus topics of special interest, representing the population of adults, 18 years of age or older[1].

1.1 Data Mining
Data mining is the practice of repeatedly searching huge chunks of data to determine patterns and trends that go beyond simple analysis. Data mining uses[2] sophisticated mathematical algorithms to segment the data and evaluate the probability of future events.

1.2 The need for data mining
Enormous amounts of data are being collected[3] and warehoused during:
• purchases at department/grocery[3] stores
• bank/credit card transactions
• Web data, e-commerce[4]

Data is being collected and stored at enormous speeds (GB/hour), as in the cases of remote sensors on a satellite, telescopes scanning the skies, microarrays generating gene expression data, and scientific simulations generating terabytes of data.

Data mining may help scientists
• in classifying and segmenting data
• in hypothesis formation

Moreover, it helps provide better, customized services for a competitive edge (e.g. in Customer Relationship Management) in today’s world, where competitive pressure is strong.

1.3 Association
Association rules are if-then statements that help uncover relationships between seemingly unrelated data in a relational database. Association rules, which are based on the concept of strong rules, were introduced for discovering regularities between products in large-scale transaction data recorded by point-of-sale (POS) systems. An association rule is divided into two parts, an antecedent (if) and a consequent (then). An antecedent is an item found in the data. A consequent is an item that is found together with the antecedent. Association also uses the criteria of Support and Confidence to identify the most important relationships, where Support indicates how frequently the items appear in the database and Confidence indicates how often the if-then statement has been found to be true among the records that contain the antecedent.
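The two measures can be made concrete with a short sketch. The Python below is illustrative only and is not the paper's Java implementation; the function names `support` and `confidence` and the five baskets are assumptions made up for this example.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent (0.0 if the antecedent never occurs)."""
    antecedent = frozenset(antecedent)
    both = antecedent | frozenset(consequent)
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    n_both = sum(1 for t in transactions if both <= t)
    return n_both / n_antecedent if n_antecedent else 0.0


# Hypothetical baskets, invented for illustration only.
baskets = [
    frozenset({"cheese", "bread", "eggs"}),
    frozenset({"cheese", "bread"}),
    frozenset({"bread", "milk"}),
    frozenset({"cheese", "bread", "eggs", "milk"}),
    frozenset({"eggs"}),
]

# {cheese, bread} appears in 3 of the 5 baskets.
print(support(baskets, {"cheese", "bread"}))
# 2 of those 3 baskets also contain eggs.
print(confidence(baskets, {"cheese", "bread"}, {"eggs"}))
```

A rule is usually reported only when both values clear user-chosen thresholds, which is how the minimum support and confidence levels are used later in the paper.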


For example, the rule {cheese, bread} => {eggs} found in the sales data of a supermarket would indicate that if a customer buys cheese and bread together, she is likely to also buy eggs.

Algorithms used in association rules are:
• Apriori algorithm
• FP Tree growth algorithm
• Eclat algorithm
• GUHA procedure ASSOC

1.4 FP Growth Algorithm
The Association technique gave way to the FP-Growth Algorithm, propounded by Han[5]. It is an efficient method that mines the complete set of frequent patterns by pattern fragment growth, using an extended prefix-tree structure. The tree structure stores compressed information about frequent patterns. In his study, Han showed that, owing to the Divide and Conquer method and other techniques, this algorithm is more efficient than other popular methods for frequent pattern mining, e.g. the Apriori Algorithm.[6]

This algorithm begins by compressing the input database, thereby developing an instance of a frequent pattern tree. The compressed database is then divided into a few conditional databases, where every database represents one unique frequent pattern. Finally, each database is mined separately. Hence, the search costs are significantly lessened, offering good selectivity.[7]

The reasons the FP Growth algorithm is more efficient than other algorithms are:
1. Divide and Conquer: the data to be mined is decomposed into sub-datasets according to the frequent patterns identified. This leads to a more focused search of smaller databases.
2. There is no candidate generation. As a result no candidate test is required.
3. No repeated scans of the whole database.

1.5 FP Trees
• A frequent pattern tree consists of a root[8] labelled null, a set of item-prefix subtrees as the children of the root, and a frequent-item header table.
• Each node in the item-prefix subtree[8] consists of three fields: item-name, count and node-link, where item-name registers which item the node represents, count registers the number of transactions represented by the portion of the path reaching that node, and node-link links to the next node in the FP Tree that carries the same item-name, or null if there is none.
• Each entry in the frequent-item header table consists of two fields: an item-name and a head of the node-link.[9]

2. FP-TREE CONSTRUCTIVE ALGORITHM
Algorithm: FP-Growth

Input: A database DB, represented by an FP-tree built according to Algorithm 1, and a minimum support threshold ξ.
Output: The complete set of frequently occurring patterns.
Method: call FP-growth(FP-tree, null).

Procedure FP-growth(Tree, α) {
    if Tree contains a single prefix path then {    // Mining single prefix-path FP-tree
        let P be the single prefix-path part of Tree;
        let Q be the multipath part with the topmost branching node replaced by a null root;
        for each combination (denoted as β) of the nodes in the path P do
            generate pattern β ∪ α with support = minimum support of nodes in β;
        let freq_pattern_set(P) be the set of patterns so generated;
    }
    else let Q be Tree;
    for each item ai in Q do {    // Mining multipath FP-tree
        generate pattern β = ai ∪ α with support = ai.support;
        construct β’s conditional pattern-base and then β’s conditional FP-tree Tree_β;
        if Tree_β ≠ Ø then
            call FP-growth(Tree_β, β);
        let freq_pattern_set(Q) be the set of patterns so generated;
    }
    return (freq_pattern_set(P) ∪ freq_pattern_set(Q) ∪ (freq_pattern_set(P) × freq_pattern_set(Q)))
}

3. IMPLEMENTATION

Fig 1: Flowchart of stages during implementation

Figure 1 shows the implementation process. The implementation starts with the user feedback dataset, which was obtained online and comprises a range of attributes. This dataset is then cleaned by rectifying and resolving the missing and incorrect values. The FP Growth algorithm is then applied to the clean dataset, which results in the formation of the association rules required for analysis.

3.1 Dataset
The dataset was obtained online and comprises a range of attributes: race, age, sex, marital status, number of siblings, number of children, etc. It is a user feedback dataset. Different people from many backgrounds and societies were asked questions, and they provided information about themselves.

The main parameters that were taken into consideration are discussed now. SEX is classified as Male or Female. MARITAL STATUS is divided into five groups: Married, Never Married, Widowed, Divorced and Separated. HIGHEST DEGREE obtained by an individual can be High School, Graduate,
Bachelor, Less than High School (Less than HS) or Junior College. SPEAK LANGUAGE OTHER THAN ENGLISH can be Yes or No. Out of the many attributes, SPEAK LANGUAGE OTHER THAN ENGLISH, SEX, MARITAL STATUS and HIGHEST DEGREE were the major ones taken into consideration.

3.2 Data Preprocessing
First, the data was cleaned. All the missing values were resolved, and wrong values were rectified. This is essential, as cleaner data provides for a better analysis.

A numeric value was assigned to each of the input entries. This way, all the categories taken had a unique numeric identity. The Excel file was converted to a .txt file and fed as input to the system. Since four broad categories were selected, four columns consisting of unique numeric assignments were used.

Table 1. Parameters of dataset and codes assigned to them

COLUMN                               CODE ASSIGNED   ATTRIBUTES
SPEAK LANGUAGE OTHER THAN ENGLISH    1               Yes
                                     2               No
SEX                                  101             Male
                                     102             Female
MARITAL STATUS                       201             Married
                                     202             Never Married
                                     203             Widowed
                                     204             Divorced
                                     205             Separated
HIGHEST DEGREE                       300             Less than HS
                                     301             High School
                                     302             Junior College
                                     303             Bachelor
                                     304             Graduate

4. RESULT AND ANALYSIS
4.1 Finding association rules
Finally, after readying the dataset for input and usage, the association rules in it are found. The default support and confidence levels are taken as 20% and 80% respectively. Ultimately, one rule is found, {102, 301} -> {2}, which suggests that when the element fields ‘Female’ and ‘High School’ are found, they are accompanied by the field ‘No’, where 102 denotes the value ‘Female’ of the SEX attribute, 301 denotes the value ‘High School’ of the HIGHEST DEGREE attribute, and 2 denotes the ‘No’ value of the SPEAK LANGUAGE OTHER THAN ENGLISH attribute. In a nutshell, when a person is female and her highest degree of qualification is High School, then 81.05% of the time she cannot speak any language other than English. Figure 2 shows the association rules that were found for the dataset.

FP TREE CREATION....
File name is: tp14.txt
No. of records in input file is: 2023
No. of columns in input file is: 304
Min support is: 404 records (20%)
Confidence is: 80%
FP TREE
(1) 2:1468 (ref to null)
(1.1) 102:808 (ref to null)
(1.1.1) 301:445 (ref to 301:341)
(1.1.1.1) 201:210 (ref to 201:77)
(1.1.1.2) 202:91 (ref to null)
(1.1.2) 201:163 (ref to 201:210)
(1.1.3) 202:75 (ref to 202:47)
(1.2) 301:341 (ref to 301:113)
(1.2.1) 201:167 (ref to 201:185)
(1.2.1.1) 101:167 (ref to 101:185)
(1.2.2) 101:174 (ref to 101:167)
(1.2.2.1) 202:103 (ref to 202:51)
(1.3) 201:185 (ref to null)
(1.3.1) 101:185 (ref to 101:72)
(1.4) 101:134 (ref to 101:77)
(1.4.1) 202:75 (ref to 202:50)
(2) 102:286 (ref to 102:808)
(2.1) 301:104 (ref to 301:445)
(2.1.1) 201:40 (ref to 201:89)
(2.1.1.1) 1:40 (ref to 1:89)
(2.1.2) 1:64 (ref to 1:72)
(2.1.2.1) 202:39 (ref to 202:103)
(2.2) 201:89 (ref to 201:167)
(2.2.1) 1:89 (ref to 1:79)
(2.3) 1:93 (ref to 1:77)
(2.3.1) 202:51 (ref to 202:75)
(3) 301:113 (ref to null)
(3.1) 201:41 (ref to 201:163)
(3.1.1) 101:41 (ref to 101:134)
(3.1.1.1) 1:41 (ref to 1:93)
(3.2) 101:72 (ref to null)
(3.2.1) 1:72 (ref to null)
(3.2.1.1) 202:50 (ref to 202:75)
(4) 201:77 (ref to 201:40)
(4.1) 101:77 (ref to 101:79)
(4.1.1) 1:77 (ref to 1:40)
(5) 101:79 (ref to 101:174)
(5.1) 1:79 (ref to 1:64)
(5.1.1) 202:47 (ref to 202:91)
FP tree storage is: 914 bytes.
Association Rules obtained from FP tree:-
a) {102 301} -> {2} 81.05%

Fig 2: The association rules in the dataset
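The reported confidence can be checked by hand from the node counts in the Figure 2 listing. Only two 301 nodes lie below a 102 node: node (1.1.1), on the path 2, 102, 301, and node (2.1), on the path 102, 301. The Python sketch below is an illustrative check rather than the paper's code; the helper name `rule_confidence` is hypothetical, and the counts are copied from the listing.

```python
# Each entry: (items on the path from the root to a 301 node that passes
# through a 102 node, count stored on that 301 node). Counts are copied
# from the Figure 2 listing.
paths_with_102_and_301 = [
    ((2, 102, 301), 445),  # node (1.1.1)
    ((102, 301), 104),     # node (2.1)
]


def rule_confidence(paths, consequent):
    """Share of the antecedent's records whose path also holds `consequent`."""
    total = sum(count for _, count in paths)
    matching = sum(count for items, count in paths if consequent in items)
    return matching / total


conf = rule_confidence(paths_with_102_and_301, consequent=2)
print(f"{conf:.4f}")  # 445 / 549
```

The quotient 445/549 is about 0.8106, which agrees with the 81.05% in the program output if that figure was truncated rather than rounded. The count of 445 records for {2, 102, 301} is also above the minimum support of 404 records.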

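To experiment with the tree structure of Section 1.5 and the recursive procedure of Section 2 on other data, a compact sketch may help. The Python below is an assumption-laden illustration, not the Java program used in this paper: the names `Node`, `build_tree` and `fp_growth` are hypothetical, node-links are simplified to a list per header-table entry, the single-prefix-path shortcut is omitted, and the six records are invented (they only reuse the codes of Table 1).

```python
from collections import defaultdict


class Node:
    """FP-tree node holding an item, a count, its parent and its children."""

    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_tree(weighted_transactions, min_count):
    """Build an FP-tree from (items, count) pairs; return (root, header)."""
    freq = defaultdict(int)
    for items, count in weighted_transactions:
        for item in set(items):
            freq[item] += count
    freq = {i: c for i, c in freq.items() if c >= min_count}
    root = Node(None, None)
    header = defaultdict(list)  # item -> every node that carries the item
    for items, count in weighted_transactions:
        # Keep frequent items only, most frequent first (ties by item code).
        ordered = sorted((i for i in set(items) if i in freq),
                         key=lambda i: (-freq[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += count
            node = child
    return root, header


def fp_growth(weighted_transactions, min_count, suffix=frozenset()):
    """Yield (itemset, support count) for every frequent itemset."""
    _, header = build_tree(weighted_transactions, min_count)
    for item, nodes in header.items():
        pattern = suffix | {item}
        yield pattern, sum(n.count for n in nodes)
        # Conditional pattern base: each node's prefix path, weighted by
        # the node's own count.
        base = []
        for n in nodes:
            path, parent = [], n.parent
            while parent is not None and parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                base.append((path, n.count))
        yield from fp_growth(base, min_count, pattern)


# Invented records using the Table 1 codes (not the survey data).
records = [
    [2, 102, 201, 301],
    [2, 102, 202, 301],
    [2, 101, 201, 301],
    [1, 102, 201, 303],
    [2, 102, 201, 301],
    [1, 101, 202, 304],
]
result = dict(fp_growth([(r, 1) for r in records], min_count=3))
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

Each recursive call works on a smaller conditional database, which is the divide-and-conquer step, and no candidate itemsets are generated or tested along the way.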

4.2 Analysis of association rules



Fig 3: Frequency of occurrences for all records on the basis of ‘SPEAK LANGUAGE OTHER THAN ENGLISH’

Fig 6: Frequency of occurrences for all records on the basis of ‘HIGHEST DEGREE’

From Figure 3, we observe that people who speak only English outnumber those who speak another language. Figure 4 shows the frequency of occurrences on the basis of ‘SEX’. Here, we can conclude that the female population exceeds the male population, and hence the sex ratio (number of females / number of males) is 1.17. In Figure 5, the frequency of occurrences is based on ‘MARITAL STATUS’. Here it can be inferred that nearly half the population is married and more than a quarter are single. Similarly, from Figure 6 we can observe that graduates constitute the smallest percentage of the population, whereas high school pass-outs are in the majority.
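These frequencies can also be recovered from the Figure 2 listing, because the total frequency of an item is the sum of the counts of all tree nodes that carry it (the nodes chained together by node-links from the header table). The Python sketch below is an illustrative cross-check and not part of the paper's program; the variable names are hypothetical and the counts are copied from the listing.

```python
# Node counts per item code, copied from the Figure 2 listing.
node_counts = {
    2:   [1468],                                  # speaks no other language
    1:   [40, 64, 89, 93, 41, 72, 77, 79],        # speaks another language
    102: [808, 286],                              # female
    101: [167, 174, 185, 134, 41, 72, 77, 79],    # male
    201: [210, 163, 167, 185, 40, 89, 41, 77],    # married
    202: [91, 75, 103, 75, 39, 51, 50, 47],       # never married
}

# Total frequency of each item = sum over all nodes carrying that item.
totals = {item: sum(counts) for item, counts in node_counts.items()}
n_records = totals[1] + totals[2]  # every record answers Yes or No

print("records:", n_records)
print("female:", totals[102], "male:", totals[101])
print("females per male:", totals[102] / totals[101])
print("share married:", totals[201] / n_records)
print("share never married:", totals[202] / n_records)
```

The totals come to 1094 females and 929 males out of 2023 records, a quotient of about 1.1776, which is consistent with the reported 1.17 if that value was truncated. The married and never-married shares come to roughly 48% and 26%, in line with the observations above.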

Fig 4: Frequency of occurrences for all records on the basis of ‘SEX’

Fig 5: Frequency of occurrences for all records on the basis of ‘MARITAL STATUS’

5. CONCLUSION
This paper presents the importance of using the FP Tree algorithm to obtain association rules between related data, which helps in targeting favourable association rules according to the requirements. This technique can prove to be extremely useful in market research. One can find otherwise hidden information and relationships in the data, and take further decisions based on the acquired knowledge.

For a long time, algorithms such as Apriori have been used in the field of pattern analysis. However, these algorithms have drawbacks such as repeated scans of the whole database and candidate generation, which further requires candidate tests. Hence, if the data is too large or complex, the time and complexity increase.

The FP-growth algorithm uses the ‘Divide and Conquer’ strategy and does not require candidate generation or candidate tests. Furthermore, it does not undergo repeated scans of the data. So, it can be concluded that the FP-growth algorithm has considerable future scope in the area of marketing in the organized sector. Hence, we could see greater involvement of the FP Tree concept in competitive global markets in the future.


6. REFERENCES
[1] General Social Survey (Subset) of 2008, 1 Oct 2009 (http://sda.berkeley.edu/archive.htm)
[2] J. Bhatia, Anu Gupta, “Mining of Quantitative Association Rules in Agricultural Data Warehouse: A Road Map”, International Journal of Information Science and Intelligent System, 3(1): 187-198, 2014.
[3] D. Pugazhendi, “Apriori algorithm on Marine Fisheries Biological Data”, International Journal of Computer Science & Engineering Technology, Dec 12, 2013.
[4] Santhosh Kumar, B.; Rukmani, K. V., “Implementation of Web Usage Mining Using APRIORI and FP Growth Algorithms”, International Journal of Advanced Networking & Applications, May/Jun 2010, Vol. 1 Issue 6.
[5] Jiawei Han, Micheline Kamber, “Data Mining: Concepts and Techniques”, June 2011, Elsevier.
[6] Yong Qiu; Yong-Jie Lan; Qing-Song Xie, “An improved algorithm of mining from FP-tree”, Machine Learning and Cybernetics, 2004. Proceedings of 2004 International Conference on (Volume: 3)
[7] Yi Sui; FengJing Shao; Rencheng Sun; Jinlong Wang, “A Sequential Pattern Mining Algorithm Based on Improved FP-tree”, Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing, 2008. SNPD '08. Ninth ACIS International Conference
[8] M.H Nadimi-Shahraki, Norwati Mustapha, Md Nasir B Sulaiman, Ali B Mamat, “Efficient Candidacy Reduction For Frequent Pattern Mining”, International Journal of Computer Science and Information Security, Vol. 6, No. 3, 2009.
[9] Changjie Tang, Charles X. Ling, Xiaofang Zhou, Nick Cercone, Xue Li, “Advanced Data Mining and Applications”, 4th International Conference, ADMA 2008, Chengdu, China, October 8-10, 2008, Proceedings, Springer 2008.

IJCATM : www.ijcaonline.org
