FP Growth Algorithm Implementation
International Journal of Computer Applications (0975 – 8887)
Volume 93 – No.8, May 2014
For example, the rule {cheese, bread} => {eggs} found in the sales data of a supermarket would indicate that if a customer buys cheese and bread together, she is likely to also buy eggs.

Algorithms used in association rule mining are:
Apriori algorithm
FP Tree growth algorithm
Eclat algorithm
GUHA procedure ASSOC

1.4 FP Growth Algorithm
The Association technique gave way to the FP-Growth Algorithm, propounded by Han [5]. It is an efficient method wherein mining is done by growing pattern fragments over an extended prefix-tree structure that represents the complete set of frequent patterns. The tree structure stores compressed information about frequent patterns. In his study, Han proved that, owing to the Divide and Conquer strategy among other techniques, this algorithm is more efficient than other popular methods for frequent pattern mining, e.g. the Apriori algorithm [6].

This algorithm begins by compressing the input database, thereby developing an instance of a frequent pattern tree. The compressed database is then divided into a set of conditional databases, where every database represents one unique frequent pattern. Finally, every database is mined separately. Hence, the search costs are significantly lessened, offering good selectivity [7].

The reasons the FP Growth algorithm is more efficient than other algorithms are:
1. Divide and Conquer: the mining data is decomposed into sub-datasets according to the frequent patterns identified, which leads to a more focused search over smaller databases.
2. There is no candidate generation, so no candidate tests are required.
3. There are no repeated scans of the whole database.

1.5 FP Trees
A frequent pattern tree consists of a root [8] labelled as null, a set of item-prefix subtrees as the children of the root, and a frequent-item header table.

Each node in the item-prefix subtree [8] consists of three fields: item-name, count and node-link, where item-name registers which item the node represents, count registers the number of transactions represented by the portion of the path reaching that node, and node-link links to the next node in the FP Tree that carries the same item-name, or null if there is none.

Each entry in the frequent-item header table consists of two fields: an item-name and a head of the node-link [9].

2. FP-TREE CONSTRUCTIVE ALGORITHM
Algorithm: FP-Growth
Input: database DB, depicted by the FP-tree built according to Algorithm 1, and a minimum support threshold ξ.
Output: The complete set of frequent patterns.
Method: call FP-growth(FP-tree, null).

Procedure FP-growth(Tree, α) {
(01) if Tree comprises a single prefix path then { // Mining single prefix-path FP-tree
(02)     let P be the single prefix-path part of Tree;
(03)     let Q be the multipath part with the topmost branching node replaced by a null root;
(04)     for each combination (denoted as β) of the nodes in the path P do
(05)         generate pattern β ∪ α with support = minimum support of nodes in β;
(06)     let freq_pattern_set(P) be the set of patterns so generated;
     }
(07) else let Q be Tree;
(08) for each item ai in Q do { // Mining multipath FP-tree
(09)     generate pattern β = ai ∪ α with support = ai.support;
(10)     construct β's conditional pattern-base and then β's conditional FP-tree Tree_β;
(11)     if Tree_β ≠ Ø then
(12)         call FP-growth(Tree_β, β);
(13)     let freq_pattern_set(Q) be the set of patterns so generated;
     }
(14) return (freq_pattern_set(P) ∪ freq_pattern_set(Q) ∪ (freq_pattern_set(P) × freq_pattern_set(Q)))
}

3. IMPLEMENTATION

Fig 1: Flowchart of stages during implementation

Figure 1 shows the implementation process. The implementation starts with a user feedback dataset obtained online, comprising a range of attributes. This dataset is then cleaned by resolving missing values and rectifying incorrect ones. The FP Growth algorithm is then applied to the clean dataset, which results in the formation of the association rules required for analysis.

3.1 Dataset
The dataset was obtained online and comprises a range of attributes: race, age, sex, marital status, number of siblings, number of children, etc. It is a user feedback dataset. Different people from many backgrounds and societies were asked questions, and they provided information about themselves.

Now the main parameters that were taken into consideration are discussed. SEX is classified as Male or Female. MARITAL STATUS is divided into five groups: Married, Never Married, Widowed, Divorced and Separated. HIGHEST DEGREE obtained by an individual can be High School, Graduate,
Bachelor, Less than High School (Less than HS), or Junior College. SPEAK LANGUAGE OTHER THAN ENGLISH can be Yes or No. Out of the many attributes, SPEAK LANGUAGE OTHER THAN ENGLISH, SEX, MARITAL STATUS and HIGHEST DEGREE were the major ones taken into consideration.

3.2 Data Preprocessing
First, the data was cleaned. All the missing values were resolved, and wrong values were rectified. This is essential, as cleaner data provides for a better analysis.

A numeric value was assigned to each of the input entries. This way, every category taken had a unique numeric identity. The Excel file was converted to a .txt file and fed as input to the system. Since four broad categories were selected, four columns consisting of unique numeric assignments were used.

Table 1. Parameters of the dataset and the codes assigned to them

COLUMN | CODE ASSIGNED | ATTRIBUTE
SPEAK LANGUAGE OTHER THAN ENGLISH | 1 | Yes
 | 2 | No
SEX | 101 | Male
 | 102 | Female
MARITAL STATUS | 201 | Married
 | 202 | Never Married
 | 203 | Widowed
 | 204 | Divorced
 | 205 | Separated
HIGHEST DEGREE | 300 | Less than HS
 | 301 | High School
 | 302 | Junior College
 | 303 | Bachelor
 | 304 | Graduate

4. RESULT AND ANALYSIS
4.1 Finding association rules
Finally, after readying the dataset for input and usage, the association rules in it are found. The default support and confidence levels are taken as 20% and 80% respectively. Ultimately, one rule is found: {102, 301} -> {2}. It signifies that when the element fields 'Female' and 'High School' are found, they are accompanied by the field 'No', where 102 depicts the value 'Female' of the SEX attribute, 301 depicts the value 'High School' of the HIGHEST DEGREE attribute, and 2 depicts the value 'No' of the SPEAK LANGUAGE OTHER THAN ENGLISH attribute. In a nutshell, when a person is female and her highest degree of qualification is High School, then 81.05% of the time she cannot speak any language other than English. Figure 2 shows the association rules that were found for the dataset.

FP TREE CREATION....
File name is: tp14.txt
No. of records in input file is: 2023
No. of columns in input file is: 304
Min support is: 404 records (20%)
Confidence is: 80%
FP TREE
(1) 2:1468 (ref to null)
(1.1) 102:808 (ref to null)
(1.1.1) 301:445 (ref to 301:341)
(1.1.1.1) 201:210 (ref to 201:77)
(1.1.1.2) 202:91 (ref to null)
(1.1.2) 201:163 (ref to 201:210)
(1.1.3) 202:75 (ref to 202:47)
(1.2) 301:341 (ref to 301:113)
(1.2.1) 201:167 (ref to 201:185)
(1.2.1.1) 101:167 (ref to 101:185)
(1.2.2) 101:174 (ref to 101:167)
(1.2.2.1) 202:103 (ref to 202:51)
(1.3) 201:185 (ref to null)
(1.3.1) 101:185 (ref to 101:72)
(1.4) 101:134 (ref to 101:77)
(1.4.1) 202:75 (ref to 202:50)
(2) 102:286 (ref to 102:808)
(2.1) 301:104 (ref to 301:445)
(2.1.1) 201:40 (ref to 201:89)
(2.1.1.1) 1:40 (ref to 1:89)
(2.1.2) 1:64 (ref to 1:72)
(2.1.2.1) 202:39 (ref to 202:103)
(2.2) 201:89 (ref to 201:167)
(2.2.1) 1:89 (ref to 1:79)
(2.3) 1:93 (ref to 1:77)
(2.3.1) 202:51 (ref to 202:75)
(3) 301:113 (ref to null)
(3.1) 201:41 (ref to 201:163)
(3.1.1) 101:41 (ref to 101:134)
(3.1.1.1) 1:41 (ref to 1:93)
(3.2) 101:72 (ref to null)
(3.2.1) 1:72 (ref to null)
(3.2.1.1) 202:50 (ref to 202:75)
(4) 201:77 (ref to 201:40)
(4.1) 101:77 (ref to 101:79)
(4.1.1) 1:77 (ref to 1:40)
(5) 101:79 (ref to 101:174)
(5.1) 1:79 (ref to 1:64)
(5.1.1) 202:47 (ref to 202:91)
FP tree storage is: 914 bytes.
Association Rules obtained from FP tree:-
a) {102 301} -> {2} 81.05%

Fig 2: The association rules in the dataset
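The 81.05% figure reported for the rule is its confidence. As an illustration only, a rule's confidence can be computed as support(A ∪ B) / support(A) over the encoded transactions. The snippet below is a sketch written for this article, not the paper's program: the function names are our own, and the tiny transaction list is hypothetical data using the Table 1 codes, not the actual 2,023-record dataset.

```python
def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def confidence(antecedent, consequent, transactions):
    """confidence(A -> B) = support(A union B) / support(A)."""
    return (support_count(antecedent | consequent, transactions)
            / support_count(antecedent, transactions))

# Hypothetical toy transactions encoded with the Table 1 codes:
# 102 = Female, 301 = High School, 2 = speaks no language other than English.
transactions = [
    {102, 301, 2}, {102, 301, 2}, {102, 301, 2}, {102, 301, 2},
    {102, 301, 1}, {101, 303, 2},
]
print(confidence({102, 301}, {2}, transactions))  # 4/5 = 0.8
```

With these made-up transactions the rule {102, 301} -> {2} holds in 4 of the 5 transactions containing {102, 301}, i.e. with 80% confidence; the same computation over the real dataset yields the 81.05% above.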
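The FP-tree structure of Section 1.5 and the mining procedure of Section 2 can be summarised in a short, self-contained Python sketch. This is a simplified illustration written for this article (the names FPNode, build_tree and mine are ours, and the single-prefix-path optimisation of step (01) is omitted); it is not the implementation whose output appears in Figure 2.

```python
from collections import defaultdict

class FPNode:
    """FP-tree node: item-name, count and a parent link (Section 1.5)."""
    def __init__(self, item, parent):
        self.item = item        # item-name
        self.count = 0          # transactions routed through this node
        self.parent = parent
        self.children = {}

def build_tree(transactions, min_support):
    """Two scans: count frequent items, then insert sorted transactions."""
    freq = defaultdict(int)
    for t in transactions:
        for item in t:
            freq[item] += 1
    freq = {i: c for i, c in freq.items() if c >= min_support}
    root = FPNode(None, None)
    header = defaultdict(list)  # item -> its nodes (the node-links)
    for t in transactions:
        items = sorted((i for i in t if i in freq),
                       key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            if item not in node.children:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
            node.count += 1
    return root, header, freq

def mine(header, freq, min_support, suffix=frozenset()):
    """Recursively mine frequent itemsets via conditional pattern bases."""
    patterns = {}
    for item in sorted(freq, key=lambda i: freq[i]):
        new_suffix = suffix | {item}
        patterns[new_suffix] = sum(n.count for n in header[item])
        # Collect the prefix paths leading to this item, with multiplicity.
        conditional = []
        for node in header[item]:
            path, parent = [], node.parent
            while parent is not None and parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            conditional.extend([path] * node.count)
        # Build the conditional FP-tree and recurse (steps 10-12).
        _, cond_header, cond_freq = build_tree(conditional, min_support)
        patterns.update(mine(cond_header, cond_freq, min_support, new_suffix))
    return patterns

# Toy usage: with min_support = 2, {a, b} is frequent with support 3.
root, header, freq = build_tree([['a', 'b'], ['b', 'c'],
                                 ['a', 'b', 'c'], ['a', 'b']], 2)
print(mine(header, freq, 2)[frozenset({'a', 'b'})])  # 3
```

Note how the sketch mirrors the paper's two key claims: items are counted and the tree built in exactly two passes over the data (no repeated scans), and frequent itemsets grow from suffixes via conditional trees rather than by generating and testing candidates.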
5. CONCLUSION
This paper presents the importance of using the FP Tree
algorithm in order to obtain association rules between related
data, which would help in targeting favourable association
rules according to the requirements. This technique can prove to be extremely useful in market research. One can find
otherwise hidden information and relationships from the data,
and take further decisions based on the acquired knowledge.
Fig 4: Frequency of occurrences for all records on the basis of ‘SEX’

For a long time, many different algorithms, like the Apriori algorithm, have been used in the field of pattern analysis. But it has been found that these algorithms possess drawbacks such as repeated scans of the whole database and candidate generation, which further requires candidate tests. Hence, if the data is too large or complex, the time and complexity increase.
The FP-growth algorithm uses the ‘Divide and Conquer’ strategy and requires neither candidate generation nor candidate tests. Furthermore, it does not perform repeated scans of the data.
So, it can be safely concluded that the FP-growth algorithm
has a vast future scope in the area of marketing in the
organized sector. Hence, we could see greater involvement of
the FP Tree concept in competitive global markets in the
future.