
MODULE-4

Classifier and Association Analysis

Classifiers:
Classification is a form of data analysis that extracts models describing important data classes. Such models,
called classifiers, predict categorical (discrete, unordered) class labels. For example, we can build a
classification model to categorize bank loan applications as either safe or risky. Such analysis can help
provide us with a better understanding of the data at large. Many classification methods have been proposed
by researchers in machine learning, pattern recognition, and statistics.
Classification: Alternative Techniques:
Bayesian Classification:
 Bayesian classifiers are statistical classifiers.
 They can predict class membership probabilities, such as the probability that a given tuple belongs to a
particular class.
 Bayesian classification is based on Bayes’ theorem.
Bayes’ Theorem:
 Let X be a data tuple. In Bayesian terms, X is considered "evidence" and it is described by measurements made on a set of n attributes.
 Let H be some hypothesis, such as that the data tuple X belongs to a specified class C.
 For classification problems, we want to determine P(H|X), the probability that the hypothesis H holds given the "evidence" or observed data tuple X.
 P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X.
Bayes’ theorem is useful in that it provides a way of calculating the posterior probability,
P(H|X), from P(H), P(X|H), and P(X).

P(H|X) = P(X|H) P(H) / P(X)
Example:
Problem
Three boxes labeled as A, B, and C, are present. Details of the boxes are:
Box A contains 2 red and 3 black balls
Box B contains 3 red and 1 black ball
And box C contains 1 red ball and 4 black balls
All the three boxes are identical having an equal probability to be picked up. Therefore, what is the
probability that the red ball was picked up from box A?
Solution :
Let E denote the event that a red ball is picked up and A, B and C denote that the ball is picked up from
their respective boxes. Therefore the conditional probability would be P(A|E) which needs to be
calculated.
The existing probabilities P(A) = P(B) = P (C) = 1 / 3,
since all boxes have equal probability of getting picked.
P(E|A) = Number of red balls in box A / Total number of balls in box A = 2 / 5
Similarly, P(E|B) = 3 / 4 and P(E|C) = 1 / 5
Then evidence P(E) = P(E|A)*P(A) + P(E|B)*P(B) + P(E|C)*P(C)
= (2/5) * (1/3) + (3/4) * (1/3) + (1/5) * (1/3) = 0.45
Therefore, P(A|E) = P(E|A) * P(A) / P(E) = (2/5) * (1/3) / 0.45 = 0.296
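The same calculation can be written as a short Python sketch; the priors and likelihoods below are simply the numbers from this example:

# Boxes example: posterior P(A|E) via Bayes' theorem.
priors = {"A": 1/3, "B": 1/3, "C": 1/3}          # each box is equally likely to be picked
likelihoods = {"A": 2/5, "B": 3/4, "C": 1/5}     # P(red ball | box)

# Evidence P(E) by the law of total probability.
p_e = sum(likelihoods[box] * priors[box] for box in priors)

# Posterior P(A|E) = P(E|A) * P(A) / P(E).
p_a_given_e = likelihoods["A"] * priors["A"] / p_e
print(round(p_e, 2), round(p_a_given_e, 3))      # 0.45 0.296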
Naive Bayesian Classification:
The naïve Bayesian classifier, or simple Bayesian classifier, works as follows:
1. Let X be a training set of tuples and their associated class labels. As usual, each tuple is represented by an
n-dimensional attribute vector, X = (x1, x2, …,xn), depicting n measurements made on the tuple from n
attributes, respectively, A1, A2, …, An.
2. Suppose that there are m classes, C1, C2, …, Cm. Given a tuple, X, the classifier will predict that X
belongs to the class having the highest posterior probability, conditioned on X. That is, the naïve Bayesian
classifier predicts that tuple X belongs to the class Ci if and only if
P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i.
Thus we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is called the maximum posteriori hypothesis. By Bayes' theorem,
P(Ci|X) = P(X|Ci) P(Ci) / P(X)
3. As P(X) is constant for all classes, only P(X|Ci)P(Ci) need be maximized. If the class prior probabilities are not known, then it is commonly assumed that the classes are equally likely, that is, P(C1) = P(C2) = … = P(Cm), and we would therefore maximize P(X|Ci). Otherwise, we maximize P(X|Ci)P(Ci).
4. Given data sets with many attributes, it would be extremely computationally expensive to compute P(X|Ci). In order to reduce computation in evaluating P(X|Ci), the naive assumption of class-conditional independence is made. This presumes that the values of the attributes are conditionally independent of one another, given the class label of the tuple. Thus,
P(X|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci).
5. We can easily estimate the probabilities P(x1|Ci), P(x2|Ci), …, P(xn|Ci) from the training tuples.
6. For each attribute, we look at whether the attribute is categorical or continuous-valued. For instance, to compute P(X|Ci), we consider the following:
If Ak is categorical, then P(xk|Ci) is the number of tuples of class Ci in D having the value xk for Ak, divided by |Ci,D|, the number of tuples of class Ci in D.
If Ak is continuous-valued, then a little more work is needed: the attribute is typically assumed to follow a Gaussian distribution whose mean and standard deviation are estimated from the class-Ci training tuples, and P(xk|Ci) is evaluated from that density.
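As a rough illustration of the two cases, the sketch below estimates P(xk|Ci) from training tuples stored as Python dictionaries. The "class" key, the column-name convention, and the Gaussian model for continuous attributes are assumptions of this sketch, not part of the text above.

import math

def categorical_likelihood(rows, attr, value, target_class):
    # P(xk|Ci) for a categorical attribute: fraction of class-Ci tuples having Ak = xk.
    class_rows = [r for r in rows if r["class"] == target_class]
    return sum(1 for r in class_rows if r[attr] == value) / len(class_rows)

def gaussian_likelihood(rows, attr, value, target_class):
    # P(xk|Ci) for a continuous attribute, assuming a Gaussian with the class's mean and variance.
    vals = [r[attr] for r in rows if r["class"] == target_class]
    mean = sum(vals) / len(vals)
    var = sum((v - mean) ** 2 for v in vals) / len(vals)
    return math.exp(-((value - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)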
Example:

We wish to predict the class label of a tuple using naïve Bayesian classification, given the training data shown in the table above. The data tuples are described by the attributes age, income, student, and credit rating. The class label attribute, buys computer, has two distinct values (namely, {yes, no}). Let C1 correspond to the class buys computer = yes and C2 correspond to buys computer = no. The tuple we wish to classify is
X = {age = "youth", income = "medium", student = "yes", credit_rating = "fair"}
We need to maximize P(X|Ci)P(Ci), for i = 1, 2. P(Ci), the prior probability of each class, can be computed based on the training tuples:
P(buys computer = yes) = 9/14 = 0.643
P(buys computer = no) = 5/14 = 0.357
To compute P(X|Ci), for i = 1, 2, we compute the following conditional probabilities:
P(age = youth | buys computer =yes) =2/9 =0.222
P(income=medium |buys computer=yes) =4/9 =0.444
P(student=yes | buys computer=yes) =6/9 =0.667
P(credit rating=fair| buys computer=yes) =6/9 =0.667

P(age=youth | buys computer=no) =3/5 =0.600

P(income=medium |buys computer=no) =2/5 =0.400


P(student=yes |buys computer=no) =1/5 =0.200
P(credit rating=fair | buys computer=no) =2/5 =0.400

Using these probabilities, we obtain


P(X | buys computer=yes) = P(age=youth | buys computer=yes)
×P(income=medium | buys computer=yes)
×P(student=yes | buys computer=yes)
×P(credit rating=fair | buys computer=yes)
= 0.222×0.444×0.667×0.667 = 0.044.
Similarly,
P(X | buys computer=no) = 0.600×0.400×0.200×0.400=0.019.

To find the class Ci that maximizes P(X|Ci)P(Ci), we compute


P(X| buys computer=yes) P(buys computer=yes)=0.044× 0.643 =0.028
P(X | buys computer=no) P(buys computer=no) =0.019 × 0.357 =0.007
Therefore, the naïve Bayesian classifier predicts buys computer = yes for tuple X.
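The prediction above can be reproduced with a few lines of Python; the probabilities are simply the values computed in the worked example:

# Naive Bayes prediction for X using the probabilities from the example above.
priors = {"yes": 9/14, "no": 5/14}
cond = {  # P(attribute value | class)
    "yes": {"age=youth": 2/9, "income=medium": 4/9, "student=yes": 6/9, "credit_rating=fair": 6/9},
    "no":  {"age=youth": 3/5, "income=medium": 2/5, "student=yes": 1/5, "credit_rating=fair": 2/5},
}
x = ["age=youth", "income=medium", "student=yes", "credit_rating=fair"]

scores = {}
for c in priors:
    score = priors[c]
    for feature in x:
        score *= cond[c][feature]        # class-conditional independence assumption
    scores[c] = score

print(scores)                            # {'yes': ~0.028, 'no': ~0.007}
print(max(scores, key=scores.get))       # yes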
Bayesian Belief Networks:
Bayesian belief networks—probabilistic graphical models, which unlike naïve Bayesian classifiers allow the
representation of dependencies among subsets of attributes.
The naïve Bayesian classifier makes the assumption of class-conditional independence, that is, given the class label of a tuple, the values of the attributes are assumed to be conditionally independent of one another. When this assumption holds true, the naïve Bayesian classifier is the most accurate in comparison with all other classifiers. In practice, however, dependencies can exist between variables, and Bayesian belief networks allow such dependencies to be modeled.
They provide a graphical model of causal relationships, on which learning can be performed.
A belief network is defined by two components—a directed acyclic graph and a set of
conditional probability tables (See Figure).

Each node in the directed acyclic graph represents a random variable. The variables may be discrete- or
continuous-valued.
They may correspond to actual attributes given in the data or to “hidden variables” believed to form a
relationship.
Each arc represents a probabilistic dependence. If an arc is drawn from a node Y to a node Z, then Y is a
parent or immediate predecessor of Z, and Z is a descendant of Y.
Each variable is conditionally independent of its non descendants in the graph, given its parents.

A belief network has one conditional probability table (CPT) for each variable. The CPT for a variable Y
specifies the conditional distribution P(Y|Parents(Y)), where Parents(Y) are the parents of Y.
For example, consider a new burglar alarm installed at home. It is fairly reliable at detecting a burglary, but it also sometimes responds to minor earthquakes. We have two neighbors, John and Mary, who have promised to call us at work when they hear the alarm. John nearly always calls when he hears the alarm, but sometimes confuses the telephone ringing with the alarm and calls then too. Mary likes loud music and sometimes misses the alarm. Given the evidence of who has or has not called, we would like to estimate the probability of a burglary.
Problem:
What is the probability that the alarm has sounded, neither a burglary nor an earthquake has occurred, and both John and Mary call?
Sol: P(J ∧ M ∧ A ∧ ¬B ∧ ¬E)
= P(J|A) P(M|A) P(A|¬B, ¬E) P(¬B) P(¬E)
= 0.90 × 0.70 × 0.001 × 0.999 × 0.998
= 0.00062
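A tiny Python sketch of this query is given below. The conditional probability values are the ones assumed in this example (the standard burglary-alarm network), and the joint probability is just the product of each variable conditioned on its parents:

# Joint probability P(J, M, A, ~B, ~E) for the burglary-alarm belief network.
p_j_given_a = 0.90                 # John calls | alarm
p_m_given_a = 0.70                 # Mary calls | alarm
p_a_given_not_b_not_e = 0.001      # alarm | no burglary, no earthquake
p_not_b = 0.999
p_not_e = 0.998

# Factorization over the network: each node is conditioned only on its parents.
joint = p_j_given_a * p_m_given_a * p_a_given_not_b_not_e * p_not_b * p_not_e
print(f"{joint:.6f}")              # 0.000628, i.e. about 0.00062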
Association Analysis
Association:
Association mining aims to extract interesting correlations, frequent patterns, associations or causal
structures among sets of items or objects in transaction databases, relational database or other data
repositories. Association rules are widely used in various areas such as telecommunication networks, market
and risk management, inventory control, cross-marketing, catalog design, loss-leader analysis, clustering,
classification, etc.
Problem Definition:

This section reviews the basic terminology used in association analysis and presents a formal
description of the task.
Binary Representation Market basket data can be represented in a binary format as shown in Table
6.2, where each row corresponds to a transaction and each column corresponds to an item. An item can
be treated as a binary variable whose value is one if the item is present in a transaction and zero
otherwise. Because the presence of an item in a transaction is often considered more important than its
absence, an item is an asymmetric binary variable.

TID Bread Milk Diapers Beer Eggs Cola


1 1 1 0 0 0 0
2 1 0 1 1 1 0
3 0 1 1 1 0 1
4 1 1 1 1 0 0
5 1 1 1 0 0 1
Table 6.2. A binary 0/1 representation of market basket data.

Itemset and Support Count Let I = {i1, i2, ..., id} be the set of all items in a market basket data set and T = {t1, t2, ..., tN} be the set of all transactions. Each transaction ti contains a subset of items chosen from I. In association analysis, a collection of zero or more items is termed an itemset. If an itemset contains k items, it is called a k-itemset. For instance, {Beer, Diapers, Milk} is an example of a 3-itemset. The null (or empty) set is an itemset that does not contain any items.

Mathematically, the support count, σ(X), for an itemset X can be stated as follows:

σ(X) = |{ti | X ⊆ ti, ti ∈ T}|,

where the symbol | · | denotes the number of elements in a set. In the data set shown in Table 6.2, the support count for {Beer, Diapers, Milk} is equal to two because there are only two transactions that contain all three items.

Association Rule An association rule is an implication expression of the form X −→ Y, where X and Y are disjoint itemsets, i.e., X ∩ Y = ∅. The strength of an association rule can be
measured in terms of its support and confidence. Support determines how often a rule is
applicable to a given data set, while confidence determines how frequently items in Y appear in
transactions that contain X. The formal definitions of these metrics are
Support, s(X −→ Y) = σ(X ∪ Y) / N                (6.1)
Confidence, c(X −→ Y) = σ(X ∪ Y) / σ(X)          (6.2)

Example 6.1. Consider the rule {Milk, Diapers} −→ {Beer}. Since the support count for
{Milk, Diapers, Beer} is 2 and the total number of transactions is 5, the rule’s support is 2/5 =
0.4. The rule’s confidence is obtained by dividing the support count for {Milk, Diapers, Beer} by
the support count for {Milk, Diapers}. Since there are 3 transactions that contain milk and
diapers, the confidence for this rule is 2/3 = 0.67.
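These two measures are easy to compute directly; the sketch below evaluates the rule {Milk, Diapers} −→ {Beer} on the transactions of Table 6.2:

# Support and confidence for {Milk, Diapers} -> {Beer} on the Table 6.2 transactions.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

def support_count(itemset, transactions):
    # sigma(X): number of transactions containing every item of X.
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"Milk", "Diapers"}, {"Beer"}
N = len(transactions)
support = support_count(X | Y, transactions) / N                                   # 2/5 = 0.4
confidence = support_count(X | Y, transactions) / support_count(X, transactions)   # 2/3 = 0.67
print(support, round(confidence, 2))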
Definition 6.1 (Association Rule Discovery). Given a set of transactions T, find all the rules
having support ≥ minsup and confidence ≥ minconf , where minsup and minconf are the
corresponding support and confidence thresholds.
A brute-force approach for mining association rules is to compute the support and confidence for every possible rule. This approach is prohibitively expensive because the total number of possible rules extracted from a data set that contains d items is
R = 3^d − 2^(d+1) + 1.

Even for the small data set of Table 6.2 (d = 6), this approach requires us to compute the support and confidence for 3^6 − 2^7 + 1 = 602 rules. More than 80% of the rules are discarded after applying minsup = 20% and minconf = 50%, so most of the computation is wasted. To avoid performing needless computations, it would be useful to prune the rules early without having to compute their support and confidence values. From Equation 6.1, notice that the support of a rule X −→ Y depends only on the support of its corresponding itemset, X ∪ Y. For example, the following rules have identical support because they involve items from the
{ Beer, Diapers, Milk}:
{Beer, Diapers} −→ {Milk}, {Beer, Milk} −→ {Diapers},
{Diapers, Milk} −→ {Beer}, {Beer} −→ {Diapers, Milk},
{Milk} −→ {Beer,Diapers}, {Diapers} −→ {Beer,Milk}.
If the itemset is infrequent, then all six candidate rules can be pruned immediately without our
having to compute their confidence values.
Therefore, a common strategy adopted by many association rule mining algorithms is to decompose
the problem into two major subtasks:
1. Frequent Itemset Generation, whose objective is to find all the itemsets that satisfy the minsup threshold. These itemsets are called frequent itemsets.

2. Rule Generation, whose objective is to extract all the high-confidence rules from the frequent itemsets
found in the previous step. These rules are called strong rules.

Frequent Item set Generation:


A lattice structure can be used to enumerate the list of all possible itemsets; for example, an itemset lattice can be drawn for I = {a, b, c, d, e}. In general, a data set that contains k items can potentially generate up to 2^k − 1 frequent itemsets, excluding the null set. Because k can be very large in many practical applications, the search space of itemsets that needs to be explored is exponentially large.
To find frequent item sets we have two algorithms
a) Apriori Algorithm
b) FP-Growth
a) Apriori Algorithm:
Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining frequent item sets
for Boolean association rules. The name of the algorithm is based on the fact that the algorithm uses prior
knowledge of frequent item set properties, as we shall see later. Apriori employs an iterative approach known
as a level-wise search, where k-itemsets are used to explore (k+1)-itemsets.
 First, the set of frequent 1-itemsets is found by scanning the database to accumulate the count for each
item, and collecting those items that satisfy minimum support. The resulting set is denoted by L1.
 Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no
more frequent k-item sets can be found. The finding of each Lk requires one full scan of the database.
 To improve the efficiency of the level-wise generation of frequent item sets, an important property called
the Apriori property is used to reduce the search space.
Apriori property: All nonempty subsets of a frequent itemset must also be frequent.
The Apriori property is based on the following observation. By definition, if an itemset I does not satisfy the minimum support threshold, min sup, then I is not frequent, that is, P(I) < min sup. If an item A is added to the itemset I, then the resulting itemset (i.e., I ∪ A) cannot occur more frequently than I. Therefore, I ∪ A is not frequent either, that is, P(I ∪ A) < min sup.
This property belongs to a special category of properties called antimonotonicity, in the sense that if a set cannot pass a test, all of its supersets will fail the same test as well. It is called antimonotonicity because the property is monotonic in the context of failing a test.
A two-step process is followed, consisting of join and prune actions.
1. The join step: To find Lk, a set of candidate k-itemsets is generated by joining Lk−1 with itself. This set of candidates is denoted Ck.
2. The prune step: Ck is a superset of Lk, that is, its members may or may not be frequent, but all of the frequent k-itemsets are included in Ck. A database scan to determine the count of each candidate in Ck results in the determination of Lk.
Example:
1. In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1. The algorithm scans all of the transactions to count the number of occurrences of each item.
2. The candidates satisfying minimum support form the set of frequent 1-itemsets, L1.
3. To discover the set of frequent 2-itemsets, L2, the algorithm uses the join L1 ⋈ L1 to generate a candidate set of 2-itemsets, C2. Note that no candidates are removed from C2 during the prune step because each subset of the candidates is also frequent.
4. Next, the transactions in D are scanned and the support count of each candidate itemset in C2 is accumulated.
5. The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2 having minimum support.
6. The set of candidate 3-itemsets, C3, is generated next. From the join step, we first get C3 = L2 ⋈ L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}. Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the four latter candidates cannot possibly be frequent. We therefore remove them from C3, thereby saving the effort of unnecessarily obtaining their counts during the subsequent scan of D to determine L3.
7. The transactions in D are scanned to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.
8. The algorithm uses L3 ⋈ L3 to generate a candidate set of 4-itemsets, C4. Although the join results in {{I1, I2, I3, I5}}, itemset {I1, I2, I3, I5} is pruned because its subset {I2, I3, I5} is not frequent. Thus, C4 = ∅, and the algorithm terminates, having found all of the frequent itemsets.
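The level-wise join/prune/count loop can be sketched in Python as follows. This is only an illustrative implementation of the idea described above; the candidate generation by pairwise unions and the counting by full scans are simplifications.

from itertools import combinations

def apriori(transactions, min_support_count):
    # Minimal Apriori sketch: level-wise generation of frequent itemsets.
    items = sorted({i for t in transactions for i in t})
    # L1: frequent 1-itemsets.
    current = [frozenset([i]) for i in items
               if sum(1 for t in transactions if i in t) >= min_support_count]
    all_frequent = list(current)
    k = 2
    while current:
        prev = set(current)
        # Join step: candidate k-itemsets from pairs of (k-1)-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Scan the database to count each surviving candidate.
        current = [c for c in candidates
                   if sum(1 for t in transactions if c <= t) >= min_support_count]
        all_frequent.extend(current)
        k += 1
    return all_frequent

# Example on the market basket data of Table 6.2 with a minimum support count of 3:
transactions = [{"Bread", "Milk"}, {"Bread", "Diapers", "Beer", "Eggs"},
                {"Milk", "Diapers", "Beer", "Cola"}, {"Bread", "Milk", "Diapers", "Beer"},
                {"Bread", "Milk", "Diapers", "Cola"}]
print(apriori(transactions, min_support_count=3))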
Rule generation:

This section describes how to extract association rules efficiently from a given frequent itemset. Each frequent k-itemset, Y, can produce up to 2^k − 2 association rules, ignoring rules with an empty antecedent or consequent.

Example 6.2. Let X = {1, 2, 3} be a frequent itemset. There are six candidate association rules that can be generated from X: {1, 2} −→ {3}, {1, 3} −→ {2}, {2, 3} −→ {1}, {1} −→ {2, 3}, {2} −→ {1, 3}, and {3} −→ {1, 2}. As the support of each of them is identical to the support of X, the rules must satisfy the support threshold.

Computing the confidence of an association rule does not require additional scans of the transaction data set. Consider the rule {1, 2} −→ {3}, which is generated from the frequent itemset X = {1, 2, 3}. The confidence for this rule is σ({1, 2, 3})/σ({1, 2}). Because {1, 2, 3} is frequent, the anti-monotone property of support ensures that {1, 2} must be frequent, too. Since the support counts for both itemsets were already found during frequent itemset generation, there is no need to read the data set again to compute the confidence.

Confidence-Based Pruning
Unlike the support measure, confidence does not have any monotone property. For example, the confidence for X −→ Y can be larger, smaller, or equal to the confidence for another rule X′ −→ Y′, where X′ ⊆ X and Y′ ⊆ Y. Nevertheless, if we compare rules generated from the same frequent itemset Y, the following theorem holds for the confidence measure.

Theorem 6.2. If a rule X −→ Y − X does not satisfy the confidence threshold, then any rule X′ −→ Y − X′, where X′ is a subset of X, must not satisfy the confidence threshold as well.

To prove this theorem, consider the following two rules: X′ −→ Y − X′ and X −→ Y − X, where X′ ⊂ X. The confidences of the rules are σ(Y)/σ(X′) and σ(Y)/σ(X), respectively. Since X′ is a subset of X, σ(X′) ≥ σ(X). Therefore, the former rule cannot have a higher confidence than the latter rule.

Rule Generation in Apriori Algorithm

The Apriori algorithm uses a level-wise approach for generating association rules, where each level corresponds to the number of items that belong to the rule consequent. Initially, all the high-confidence rules that have only one item in the rule consequent are extracted. These rules are then used to generate new candidate rules. For example, if {acd} −→ {b} and {abd} −→ {c} are high-confidence rules, then the candidate rule {ad} −→ {bc} is generated by merging the consequents of both rules. Figure 6.15 shows a lattice structure for the association rules generated from the frequent itemset {a, b, c, d}. If any node in the lattice has low confidence, then according to Theorem 6.2 the entire subgraph spanned by that node can be pruned immediately. Suppose the confidence for {bcd} −→ {a} is low. Then all the rules containing item a in their consequent, including {cd} −→ {ab}, {bd} −→ {ac}, {bc} −→ {ad}, and {d} −→ {abc}, can be discarded.
Algorithm 6.2 Rule generation of the Apriori algorithm.
1: for each frequent k-itemset fk, k ≥ 2 do
2:   H1 = {i | i ∈ fk}   {1-item consequents of the rule}
3:   call ap-genrules(fk, H1)
4: end for

Algorithm 6.3 Procedure ap-genrules(fk, Hm).
1: k = |fk|   {size of frequent itemset}
2: m = |Hm|   {size of rule consequent}
3: if k > m + 1 then
4:   Hm+1 = apriori-gen(Hm)
5:   for each hm+1 ∈ Hm+1 do
6:     conf = σ(fk)/σ(fk − hm+1)
7:     if conf ≥ minconf then
8:       output the rule (fk − hm+1) −→ hm+1
9:     else
10:      delete hm+1 from Hm+1
11:    end if
12:  end for
13:  call ap-genrules(fk, Hm+1)
14: end if
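A Python rendering of these two procedures might look as follows. It is a sketch under the assumption that the support counts found during frequent itemset generation are available in a dictionary keyed by frozensets; the names generate_rules and ap_genrules are only illustrative.

def ap_genrules(fk, consequents, support, min_conf, rules):
    # Grow rule consequents level-wise, in the spirit of Algorithm 6.3.
    if not consequents:
        return
    k, m = len(fk), len(next(iter(consequents)))
    if k > m + 1:
        # apriori-gen on the consequents: merge m-item consequents into (m+1)-item ones.
        candidates = {a | b for a in consequents for b in consequents if len(a | b) == m + 1}
        surviving = set()
        for h in candidates:
            conf = support[fk] / support[fk - h]
            if conf >= min_conf:
                rules.append((fk - h, h, conf))        # rule (fk - h) -> h
                surviving.add(h)
        ap_genrules(fk, surviving, support, min_conf, rules)

def generate_rules(frequent_itemsets, support, min_conf):
    # Start from 1-item consequents of each frequent k-itemset (k >= 2), as in Algorithm 6.2.
    rules = []
    for fk in frequent_itemsets:
        if len(fk) < 2:
            continue
        h1 = set()
        for i in fk:
            h = frozenset([i])
            conf = support[fk] / support[fk - h]
            if conf >= min_conf:
                rules.append((fk - h, h, conf))        # high-confidence 1-item-consequent rules
                h1.add(h)
        ap_genrules(fk, h1, support, min_conf, rules)
    return rules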

Compact Representation of Frequent Itemsets


In practice, the number of frequent item sets produced from a transaction data set can be very large. It is
useful to identify a small representative set of item sets from which all other frequent item sets can be
derived. Two such representations are presented in this section in the form of maximal and closed frequent
item sets.

Maximal Frequent Item sets:

The item sets in the lattice are divided into two groups: those that are frequent and those that are
infrequent. A frequent item set border, which is represented by a dashed line, is also illustrated in the
diagram. Every item set located above the border is frequent, while those located below the border (the
shaded nodes) are infrequent. Among the itemsets residing near the border, {a, d}, {a, c, e}, and {b, c, d, e} are considered to be maximal frequent itemsets because their immediate supersets are infrequent. An itemset such as {a, d} is maximal frequent because all of its immediate supersets, {a, b, d}, {a, c, d}, and {a, d, e}, are infrequent. In contrast, {a, c} is non-maximal because one of its immediate supersets, {a, c, e}, is frequent.

Maximal frequent itemsets effectively provide a compact representation of frequent itemsets. In other words, they form the smallest set of itemsets from which all frequent itemsets can be derived. For example, the frequent itemsets shown in Figure 6.16 can be divided into two groups:

• Frequent item sets that begin with item a and that may contain items c, d, or e. This
group includes item sets such as {a}, {a, c}, {a, d}, {a, e}, and {a, c, e}.

• Frequent item sets that begin with items b, c, d, or e. This group includes item sets
such as {b}, {b, c}, {c, d},{b, c, d, e}, etc.
Frequent item sets that belong in the first group are subsets of either {a, c, e} or {a, d}, while those
that belong in the second group are subsets of {b, c, d, e}. Hence, the maximal frequent item sets {a, c,
e}, {a, d}, and {b, c, d, e} provide a compact representation of the frequent item sets as shown in
figure.
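The defining check (no frequent proper superset) is straightforward to code; the helper below marks as maximal any frequent itemset that has no frequent proper superset, and the small example reuses a few of the itemsets named in the text:

def maximal_frequent(frequent_itemsets):
    # Maximal frequent itemsets: frequent itemsets with no frequent proper superset.
    fsets = [frozenset(s) for s in frequent_itemsets]
    return [s for s in fsets if not any(s < other for other in fsets)]   # '<' is proper subset

# Illustrative subset of the frequent itemsets discussed above:
freq = [{"a"}, {"a", "c"}, {"a", "d"}, {"a", "e"}, {"a", "c", "e"},
        {"b"}, {"c"}, {"d"}, {"e"}, {"b", "c"}, {"c", "d"}, {"b", "c", "d", "e"}]
print(maximal_frequent(freq))   # the maximal itemsets: {a, d}, {a, c, e}, {b, c, d, e}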
Closed Frequent Item sets:
An item set X is closed if none of its immediate supersets has exactly the same support
count as X.

To determine the support count of each itemset, we have associated each node (itemset) in the lattice with a list of its corresponding transaction IDs. For example, since the node {b, c} is associated with transaction IDs 1, 2, and 3, its support count is equal to three. From the transactions given in this diagram, notice that every transaction that contains b also contains c. Consequently, the support for {b} is identical to that of {b, c}, and {b} should not be considered a closed itemset. Similarly, since c occurs in every transaction that contains both a and d, the itemset {a, d} is not closed. On the other hand, {b, c} is a closed itemset because it does not have the same support count as any of its supersets.

In the previous example, assuming that the support threshold is 40%, {b,c} is a closed frequent itemset
because its support is 60%. The rest of the closed frequent itemsets are indicated by the shaded nodes.
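The closedness check can likewise be coded directly from the definition; in the sketch below, supports is assumed to be a dictionary mapping frequent itemsets (frozensets) to their support counts:

def closed_frequent(supports):
    # Closed itemsets: no proper superset has the same support count.
    # (By the anti-monotonicity of support, this matches checking only immediate supersets.)
    return [x for x, count in supports.items()
            if not any(x < y and c == count for y, c in supports.items())]

# Example from the discussion above: every transaction containing b also contains c,
# so {b} (support 3) is not closed while {b, c} (support 3) is.
supports = {frozenset({"b"}): 3, frozenset({"b", "c"}): 3}
print(closed_frequent(supports))   # only {b, c} is reported as closed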
As a further example, consider the transaction data set shown in Table 6.5. The items can be divided into three groups:
(1) Group A, which contains items a1 through a5;
(2) Group B, which contains items b1 through b5; and
(3) Group C, which contains items c1 through c5.

Note that items within each group are perfectly associated with each other and they do not appear with items from another group. Assuming the support threshold is 20%, the total number of frequent itemsets is 3 × (2^5 − 1) = 93. However, there are only three closed frequent itemsets in the data: {a1, a2, a3, a4, a5}, {b1, b2, b3, b4, b5}, and {c1, c2, c3, c4, c5}. It is often sufficient to present only the closed frequent itemsets to the analysts instead of the entire set of frequent itemsets.

Table 6.5. A transaction data set for mining closed itemsets.


TID  a1 a2 a3 a4 a5 b1 b2 b3 b4 b5 c1 c2 c3 c4 c5
1    1  1  1  1  1  0  0  0  0  0  0  0  0  0  0
2    1  1  1  1  1  0  0  0  0  0  0  0  0  0  0
3    1  1  1  1  1  0  0  0  0  0  0  0  0  0  0
4    0  0  0  0  0  1  1  1  1  1  0  0  0  0  0
5    0  0  0  0  0  1  1  1  1  1  0  0  0  0  0
6    0  0  0  0  0  1  1  1  1  1  0  0  0  0  0
7    0  0  0  0  0  0  0  0  0  0  1  1  1  1  1
8    0  0  0  0  0  0  0  0  0  0  1  1  1  1  1
9    0  0  0  0  0  0  0  0  0  0  1  1  1  1  1
10   0  0  0  0  0  0  0  0  0  0  1  1  1  1  1

Figure 6.18. Relationships among frequent, maximal frequent, and closed frequent itemsets.

Closed frequent itemsets are useful for removing some of the redundant association rules. An association rule X −→ Y is redundant if there exists another rule X′ −→ Y′, where X is a subset of X′ and Y is a subset of Y′, such that the support and confidence for both rules are identical. In the example shown in Figure 6.17, {b} is not a closed frequent itemset while {b, c} is closed. The association rule {b} −→ {d, e} is therefore redundant because it has the same support and confidence as {b, c} −→
{d, e}. Such redundant rules are not generated if closed frequent itemsets are used for rule
generation.

FP-Growth Algorithm :

FP-growth takes a radically different approach to discovering frequent itemsets. The algorithm does not subscribe to the generate-and-test paradigm of Apriori.

FP-Tree Representation:
An FP-tree is a compressed representation of the input data. It is constructed by reading the data set one
transaction at a time and mapping each transaction onto a path in the FP-tree. As different transactions
can have several items in common, their paths may overlap. The more the paths overlap with one another,
the more compression we can achieve using the FP-tree structure. If the size of the FP-tree is small
enough to fit into main memory, this will allow us to extract frequent itemsets directly from the
structure in memory instead of making repeated passes over the data stored on disk.

Figure 6.24 shows a data set that contains ten transactions and five items. The structures of the FP-tree
after reading the first three transactions are also depicted in the diagram. Each node in the tree
contains the label of an item along with a counter that shows the number of transactions mapped onto
the given path. Initially, the FP-tree contains only the root node represented by the null symbol. The
FP-tree is subsequently extended in the following way:

1. The data set is scanned once to determine the support count of each item. Infrequent items are
discarded, while the frequent items are sorted in decreasing support counts. For the data set
shown in Figure 6.24, a is the most frequent item, followed by b, c, d, and e.
2. The algorithm makes a second pass over the data to construct the FP-tree. After reading the first transaction, {a, b}, the nodes labeled as a and b are created. A path is then formed from null → a → b to encode the transaction. Every node along the path has a frequency count of 1.

3. After reading the second transaction, {b,c,d}, a new set of nodes is created for items b, c, and
d. A path is then formed to represent the transaction by connecting the nodes null → b → c
→ d. Every node along this path also has a frequency count equal to one. Although the first
two transactions have an item in common, which is b, their paths are disjoint because the
transactions do not share a common prefix.

4. The third transaction, {a,c,d,e}, shares a common prefix item (which is a) with the first
transaction. As a result, the path for the third transaction, null → a → c → d → e, overlaps with
the path for the first transaction, null → a → b. Because of their overlapping path, the
frequency count for node a is incremented to two, while the frequency counts for the newly
created nodes, c, d, and e, are equal to one.

5. This process continues until every transaction has been mapped onto one of the paths given in
the FP-tree. The resulting FP-tree after reading all the transactions is shown at the bottom of
Figure 6.24.
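The two-pass construction described above can be sketched in Python as follows. For brevity this sketch keeps only the tree itself; the node-link pointers that connect occurrences of the same item (used later to extract prefix paths) are omitted.

class FPNode:
    def __init__(self, item, parent=None):
        self.item = item            # item label (None for the root)
        self.count = 1              # number of transactions mapped onto this node
        self.parent = parent
        self.children = {}          # item -> FPNode

def build_fp_tree(transactions, min_support_count):
    # Pass 1: count item supports and discard infrequent items.
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    frequent = {i: c for i, c in counts.items() if c >= min_support_count}

    # Pass 2: insert each transaction as a path, with items sorted by decreasing support.
    root = FPNode(None)
    for t in transactions:
        items = sorted((i for i in t if i in frequent), key=lambda i: -frequent[i])
        node = root
        for item in items:
            if item in node.children:
                node.children[item].count += 1                    # overlapping prefix: bump counter
            else:
                node.children[item] = FPNode(item, parent=node)   # start a new branch
            node = node.children[item]
    return root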

The size of an FP-tree is typically smaller than the size of the uncompressed data because many
transactions in market basket data often share a few items in common. In the best-case scenario, where
all the transactions have the same set of items, the FP-tree contains only a single branch of nodes. The
worst-case scenario happens when every transaction has a unique set of items. As none of the transactions have any items in common, the size of the FP-tree is effectively the same as the size of the original data. However, the physical storage requirement for the FP-tree is higher because it requires additional space to store pointers between nodes and counters for each item.

The size of an FP-tree also depends on how the items are ordered. If the ordering scheme
in the preceding example is reversed, i.e., from lowest to highest support item, the resulting FP-tree
is shown in Figure 6.25. The tree appears to be denser because the branching factor at the root node has
increased from 2 to 5 and the number of nodes containing the high support items such as a and b
has increased from 3 to 12. Nevertheless, ordering by decreasing support counts does not always lead
to the smallest tree. For example, suppose we augment the data set given in Figure 6.24 with 100
transactions that contain {e}, 80 transactions that contain {d}, 60 transactions that contain {c}, and 40 transactions that contain {b}. Item e is now the most frequent item, followed by d, c, b, and a. With the augmented transactions, ordering by decreasing support counts will result in an FP-tree similar to Figure 6.25, while a scheme based on increasing support counts produces a smaller FP-tree similar to Figure 6.24(iv).

Figure 6.25. An FP-tree representation for the data set shown in Figure 6.24 with a different item ordering scheme.

Frequent Itemset Generation in FP-Growth Algorithm:

FP-growth is an algorithm that generates frequent itemsets from an FP-tree by exploring the tree in a bottom-up fashion. Given the example tree shown in Figure 6.24, the algorithm looks for frequent itemsets ending in e first, followed by d, c, b, and finally, a. This bottom-up strategy for finding frequent itemsets ending with a particular item is equivalent to a suffix-based approach. Since every transaction is mapped onto a path in the FP-tree, we can derive the frequent itemsets ending with a particular item, say, e, by examining only the paths containing node e. These paths can be accessed rapidly using the pointers associated with node e. The extracted paths are shown in Figure 6.26(a).

Table 6.6. The list of frequent itemsets ordered by their corresponding suffixes.

Suffix Frequent Itemsets


e {e}, {d,e}, {a,d,e}, {c,e},{a,e}
d {d}, {c,d}, {b,c,d}, {a,c,d}, {b,d}, {a,b,d},
{a,d}
c {c}, {b,c}, {a,b,c}, {a,c}
b {b}, {a,b}
a {a}

After finding the frequent item sets ending in e, the algorithm proceeds to look for frequent item sets
ending in d by processing the paths associated with node d. The corresponding paths are shown in
Figure 6.26(b). This process continues until all the paths associated with nodes c, b, and finally a,
are processed. The paths for these items are shown in Figures 6.26(c), (d), and (e), while their
corresponding frequent item sets are summarized in Table 6.6.
FP-growth finds all the frequent itemsets ending with a particular suffix by employing a divide-and-conquer strategy to split the problem into smaller subproblems. For example, suppose we are interested in finding all frequent itemsets ending in e. To do this, we must first check whether the itemset {e} itself is frequent. If it is frequent, we consider the subproblem of finding frequent itemsets ending in de, followed by ce, be, and ae. In turn, each of these subproblems is further decomposed into smaller subproblems. By merging the solutions obtained from the subproblems, all the frequent itemsets ending in e can be found.

This divide-and-conquer approach is the key strategy employed by the FP-growth algorithm.
For a more concrete example on how to solve the sub problems, consider the task of finding frequent
itemsets ending with e.

1. The first step is to gather all the paths containing node e. These initial paths are
called prefix paths and are shown in Figure 6.27(a).

2. From the prefix paths shown in Figure 6.27(a), the support count for e is obtained by
adding the support counts associated with node e. Assuming that the minimum
support count is 2, {e} is declared a frequent item set because its support count is
3.
3. Because {e} is frequent, the algorithm has to solve the subproblems of finding frequent itemsets ending in de, ce, be, and ae. Before solving these subproblems, it must first convert the prefix paths into a conditional FP-tree, which is structurally similar to an FP-tree, except it is used to find frequent itemsets ending with a particular suffix. A conditional FP-tree is obtained in the following way:
(a) First, the support counts along the prefix paths must be updated because some of the
counts include transactions that do not contain item e. For example, the rightmost path
shown in Figure 6.27(a), null −→ b:2 −→ c:2 −→ e:1, includes a transaction {b, c}
that does not contain item e. The counts along the prefix path must therefore be
adjusted to 1 to reflect the actual number of transactions containing {b, c, e}.
(b) The prefix paths are truncated by removing the nodes for e. These nodes can be removed
because the support counts along the prefix paths have been updated to reflect only
transactions that contain e and the sub problems of finding frequent item sets ending in de,
ce, be, and ae no longer need information about node e.
(c) After updating the support counts along the prefix paths, some of the items may no
longer be frequent. For example, the node b appears only once and has a support count
equal to 1, which means that there is only one transaction that contains both b and e. Item
b can be safely ignored from subsequent analysis because all itemsets ending in be must
be infrequent.

The conditional FP-tree for e is shown in Figure 6.27(b). The tree looks different than the original
prefix paths because the frequency counts have been updated and the nodes b and e have been
eliminated.

4. FP-growth uses the conditional FP-tree for e to solve the sub problems of finding frequent itemsets
ending in de, ce, and ae. To find the frequent itemsets ending in de, the prefix paths for d are gathered
from the conditional FP-tree for e (Figure 6.27(c)). By adding the frequency counts associated with
node d, we obtain the support count for {d, e}. Since the support count is equal to 2, {d, e} is
declared a frequent item set. Next, the algorithm constructs the conditional FP-tree for de using the
approach described in step 3. After updating the support counts and removing the infrequent item c,
the conditional FP-tree for de is shown in Figure 6.27(d). Since the conditional FP-tree contains
only one item, a, whose support is equal to minsup, the algorithm extracts the frequent item set

{a, d, e} and moves on to the next subproblem, which is to generate frequent itemsets ending in ce. After processing the prefix paths for c, only {c, e} is found to be frequent. The algorithm then proceeds to solve the next subproblem and finds {a, e} to be the only frequent itemset remaining.

This example illustrates the divide-and-conquer approach used in the FP-growth algorithm.
At each recursive step, a conditional FP-tree is constructed by updating the frequency counts along the
prefix paths and removing all infrequent items. Because the sub problems are disjoint, FP-growth will
not generate any duplicate item sets. In addition, the counts associated with the nodes allow the
algorithm to perform support counting while generating the common suffix item sets.
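The recursion can be illustrated with the simplified sketch below. Instead of real conditional FP-trees, it represents each conditional pattern base as a plain list of (prefix-path, count) pairs, so it captures the divide-and-conquer structure rather than the tree compression; transactions are assumed to be given as lists with items already sorted by decreasing support.

def pattern_growth(cond_db, suffix, min_count, results):
    # cond_db: conditional pattern base for `suffix`, a list of (ordered item list, count) pairs.
    counts = {}
    for items, count in cond_db:
        for i in items:
            counts[i] = counts.get(i, 0) + count
    for item, count in counts.items():
        if count < min_count:
            continue                                   # infrequent items are dropped
        new_suffix = [item] + suffix
        results.append((new_suffix, count))            # e.g. {e}, then {d, e}, {a, d, e}, ...
        # Conditional pattern base for the new suffix: prefixes of `item` in each path.
        new_db = []
        for items, c in cond_db:
            if item in items:
                prefix = [i for i in items[:items.index(item)] if counts[i] >= min_count]
                if prefix:
                    new_db.append((prefix, c))
        pattern_growth(new_db, new_suffix, min_count, results)

# Initial call on the first three transactions described for Figure 6.24, with min count 2:
data = [(["a", "b"], 1), (["b", "c", "d"], 1), (["a", "c", "d", "e"], 1)]
found = []
pattern_growth(data, [], 2, found)
print(found)   # frequent itemsets (as suffix lists) with their support counts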
FP-growth is an interesting algorithm because it illustrates how a compact representation of
the transaction data set helps to efficiently generate frequent item sets. In addition, for certain
transaction data sets, FP-growth outperforms the standard Apriori algorithm by several orders of
magnitude. The run-time performance of FP-growth depends on the compaction factor of the data set.
If the resulting conditional FP-trees are very bushy (in the worst case, a full prefix tree), then the
performance of the algorithm degrades significantly because it has to generate a large number of sub
problems and merge the results returned by each sub problem.
