Week 6 - Basic Association Analysis

The document describes the Apriori algorithm for mining association rules from transactional data. The algorithm uses the Apriori property that any subset of a frequent itemset must also be frequent. It generates candidate itemsets in multiple passes over the data, pruning the search space at each pass by removing any candidates with infrequent subsets. This reduces the computational complexity compared to a brute force approach.

UET

Since 2004

ĐẠI HỌC CÔNG NGHỆ, ĐHQGHN


VNU-University of Engineering and Technology

INT3209 - DATA MINING


Week 6: Basic Association Analysis

Duc-Trong Le

Slide credit: Vipin Kumar et al.,


https://www-users.cse.umn.edu/~kumar001/dmbook

Hanoi, 09/2021
Association Rule Mining

● Given a set of transactions, find rules that will predict the occurrence of
  an item based on the occurrences of other items in the transaction

Market-Basket transactions

Example of Association Rules:
  {Diaper} → {Beer}
  {Milk, Bread} → {Eggs, Coke}
  {Beer, Bread} → {Milk}

● Implication means co-occurrence, not causality!

2
Definition: Frequent Itemset
● Itemset
  – A collection of one or more items
    ◆ Example: {Milk, Bread, Diaper}
  – k-itemset
    ◆ An itemset that contains k items

● Support count (σ)
  – Frequency of occurrence of an itemset
  – E.g. σ({Milk, Bread, Diaper}) = 2

● Support
  – Fraction of transactions that contain an itemset
  – E.g. s({Milk, Bread, Diaper}) = 2/5

● Frequent Itemset
  – An itemset whose support is greater than or equal to a minsup threshold

3
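For concreteness, a minimal Python sketch of these definitions. The five baskets below are reconstructed from the counts quoted on this slide (σ = 2, s = 2/5), so treat the exact transactions as an assumption:

```python
# Example market-basket transactions (assumed to match the slide's counts).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions that contain every item of X."""
    return sum(1 for t in transactions if set(itemset) <= t)

def support(itemset, transactions):
    """s(X): fraction of transactions that contain X."""
    return support_count(itemset, transactions) / len(transactions)

print(support_count({"Milk", "Bread", "Diaper"}, transactions))   # 2
print(support({"Milk", "Bread", "Diaper"}, transactions))         # 0.4 = 2/5
```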
Definition: Association Rule
● Association Rule
  – An implication expression of the form X → Y, where X and Y are itemsets
  – Example: {Milk, Diaper} → {Beer}

● Rule Evaluation Metrics
  – Support (s)
    ◆ Fraction of transactions that contain both X and Y
  – Confidence (c)
    ◆ Measures how often items in Y appear in transactions that contain X

4
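Continuing the sketch above (it reuses the `transactions` list defined there), the two rule metrics can be computed as follows:

```python
def rule_metrics(X, Y, transactions):
    """Support and confidence of the rule X -> Y for disjoint itemsets X and Y."""
    def s(itemset):
        return sum(1 for t in transactions if set(itemset) <= t) / len(transactions)
    support = s(set(X) | set(Y))        # fraction of transactions containing X and Y
    confidence = support / s(X)         # how often Y appears in transactions with X
    return support, confidence

s_val, c_val = rule_metrics({"Milk", "Diaper"}, {"Beer"}, transactions)
print(round(s_val, 2), round(c_val, 2))   # 0.4 0.67, matching the example rules a few slides later
```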
Association Rule Mining Task

● Given a set of transactions T, the goal of association rule mining is to find
  all rules having
  – support ≥ minsup threshold
  – confidence ≥ minconf threshold

● Brute-force approach:
  – List all possible association rules
  – Compute the support and confidence for each rule
  – Prune rules that fail the minsup and minconf thresholds
  ⇒ Computationally prohibitive!

5
Computational Complexity
● Given d unique items:
  – Total number of itemsets = 2^d
  – Total number of possible association rules:
      R = 3^d − 2^(d+1) + 1
  – If d = 6, R = 602 rules

6
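The closed form above can be cross-checked by direct enumeration; a small sketch (standard library only):

```python
from math import comb

def total_rules(d):
    closed_form = 3**d - 2**(d + 1) + 1
    # Direct count: choose an itemset of size k >= 2, then split it into a
    # non-empty antecedent and a non-empty consequent (2^k - 2 ways).
    direct = sum(comb(d, k) * (2**k - 2) for k in range(2, d + 1))
    assert closed_form == direct
    return closed_form

print(total_rules(6))   # 602
```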
Mining Association Rules

Example of Rules:
{Milk,Diaper} → {Beer} (s=0.4, c=0.67)
{Milk,Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper,Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk,Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk,Beer} (s=0.4, c=0.5)
{Milk} → {Diaper,Beer} (s=0.4, c=0.5)

Observations:
● All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}
● Rules originating from the same itemset have identical support but
can have different confidence
● Thus, we may decouple the support and confidence requirements
7
Mining Association Rules

● Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup

2. Rule Generation
– Generate high confidence rules from each
frequent itemset, where each rule is a binary
partitioning of a frequent itemset

● Frequent itemset generation is still computationally expensive

8
Frequent Itemset Generation

Given d items, there are 2^d possible candidate itemsets

9
Frequent Itemset Generation
● Brute-force approach:
  – Each itemset in the lattice is a candidate frequent itemset
  – Count the support of each candidate by scanning the database
  – Match each transaction against every candidate
  – Complexity ~ O(NMw) ⇒ expensive since M = 2^d !!!

10
Frequent Itemset Generation Strategies

● Reduce the number of candidates (M)
  – Complete search: M = 2^d
  – Use pruning techniques to reduce M

● Reduce the number of transactions (N)
  – Reduce size of N as the size of itemset increases
  – Used by DHP and vertical-based mining algorithms

● Reduce the number of comparisons (NM)
  – Use efficient data structures to store the candidates or transactions
  – No need to match every candidate against every transaction

11
Reducing Number of Candidates

● Apriori principle:
  – If an itemset is frequent, then all of its subsets must also be frequent

● Apriori principle holds due to the following property of the support measure:

    ∀ X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)

  – Support of an itemset never exceeds the support of its subsets
  – This is known as the anti-monotone property of support

12
Illustrating Apriori Principle

(Figure: itemset lattice — once an itemset is found to be infrequent, all of
its supersets are pruned)

13
Illustrating Apriori Principle

Items (1-itemsets)

Minimum Support = 3

If every subset is considered,
  C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41
With support-based pruning,
  6 + 6 + 4 = 16

14
Illustrating Apriori Principle

Items (1-itemsets)

Pairs (2-itemsets)
(No need to generate candidates involving Coke or Eggs)

Minimum Support = 3

If every subset is considered,
  C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41
With support-based pruning,
  6 + 6 + 4 = 16

16
Illustrating Apriori Principle

Items (1-itemsets)

Pairs (2-itemsets)
(No need to generate candidates involving Coke or Eggs)

Triplets (3-itemsets)

Minimum Support = 3

If every subset is considered,
  C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41
With support-based pruning,
  6 + 6 + 4 = 16

18
Illustrating Apriori Principle

Items (1-itemsets)

Pairs (2-itemsets)
(No need to generate candidates involving Coke or Eggs)

Triplets (3-itemsets)

Minimum Support = 3

If every subset is considered,
  C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41
With support-based pruning,
  6 + 6 + 4 = 16
  6 + 6 + 1 = 13

20
Apriori Algorithm

– Fk: frequent k-itemsets
– Lk: candidate k-itemsets

● Algorithm
  – Let k = 1
  – Generate F1 = {frequent 1-itemsets}
  – Repeat until Fk is empty
    ◆ Candidate Generation: Generate Lk+1 from Fk
    ◆ Candidate Pruning: Prune candidate itemsets in Lk+1 containing subsets of
      length k that are infrequent
    ◆ Support Counting: Count the support of each candidate in Lk+1 by scanning
      the DB
    ◆ Candidate Elimination: Eliminate candidates in Lk+1 that are infrequent,
      leaving only those that are frequent ⇒ Fk+1

21
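A compact Python sketch of this loop. It follows the slide's structure (generation, pruning, support counting, elimination) but uses a simple union-based merge for candidate generation; the Fk-1 × Fk-1 merge itself is shown a few slides later:

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Sketch of the loop above: `frequent` plays the role of Fk,
    `candidates` the role of Lk+1 in the slide's notation."""
    transactions = [frozenset(t) for t in transactions]

    def sigma(itemset):
        # Support count: number of transactions containing the itemset.
        return sum(1 for t in transactions if itemset <= t)

    items = sorted({i for t in transactions for i in t})
    frequent = [frozenset([i]) for i in items if sigma(frozenset([i])) >= minsup_count]
    all_frequent = list(frequent)
    k = 1
    while frequent:
        # Candidate generation: union of two frequent k-itemsets sharing k-1 items
        # (a simple variant of the Fk-1 x Fk-1 merge shown on the next slides).
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
        # Candidate pruning: drop candidates that have an infrequent k-subset.
        fk = set(frequent)
        candidates = {c for c in candidates
                      if all(frozenset(s) in fk for s in combinations(c, k))}
        # Support counting + candidate elimination.
        frequent = [c for c in candidates if sigma(c) >= minsup_count]
        all_frequent.extend(frequent)
        k += 1
    return all_frequent

# With the 5-transaction market-basket sketch from earlier and minsup count 3,
# this returns 4 frequent items and 4 frequent pairs (no frequent triplet).
```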
Candidate Generation: Brute-force method

22
Candidate Generation: Merge Fk-1 and F1 itemsets

23
Candidate Generation: Fk-1 x Fk-1 Method

● Merge two frequent (k-1)-itemsets if their first (k-2) items are identical

● F3 = {ABC, ABD, ABE, ACD, BCD, BDE, CDE}
  – Merge(ABC, ABD) = ABCD
  – Merge(ABC, ABE) = ABCE
  – Merge(ABD, ABE) = ABDE
  – Do not merge(ABD, ACD) because they share only a prefix of length 1 instead
    of length 2

24
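A minimal sketch of this merge step, reproducing the example above (itemsets kept in lexicographic order):

```python
def generate_candidates(F_prev):
    """F_{k-1} x F_{k-1} merge: combine two frequent (k-1)-itemsets whose
    first (k-2) items agree."""
    F_prev = sorted(tuple(sorted(x)) for x in F_prev)
    candidates = []
    for i in range(len(F_prev)):
        for j in range(i + 1, len(F_prev)):
            a, b = F_prev[i], F_prev[j]
            if a[:-1] == b[:-1]:                     # identical (k-2)-item prefix
                candidates.append(a[:-1] + (a[-1], b[-1]))
    return candidates

F3 = ["ABC", "ABD", "ABE", "ACD", "BCD", "BDE", "CDE"]
print(generate_candidates(F3))
# three candidates: ABCD, ABCE, ABDE (matching the example above)
```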
Candidate Pruning

● Let F3 = {ABC, ABD, ABE, ACD, BCD, BDE, CDE} be the set of frequent 3-itemsets

● L4 = {ABCD, ABCE, ABDE} is the set of candidate 4-itemsets generated (from the
  previous slide)

● Candidate pruning
  – Prune ABCE because ACE and BCE are infrequent
  – Prune ABDE because ADE is infrequent

● After candidate pruning: L4 = {ABCD}

25
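A sketch of the pruning step, reproducing the example above:

```python
from itertools import combinations

def prune(candidates, F_prev):
    """Candidate pruning: keep a k-itemset only if every (k-1)-subset is frequent."""
    frequent_prev = {frozenset(x) for x in F_prev}
    kept = []
    for c in candidates:
        c_set = frozenset(c)
        if all(frozenset(s) in frequent_prev
               for s in combinations(c_set, len(c_set) - 1)):
            kept.append(c)
    return kept

F3 = ["ABC", "ABD", "ABE", "ACD", "BCD", "BDE", "CDE"]
print(prune(["ABCD", "ABCE", "ABDE"], F3))
# ['ABCD'] -- ABCE is dropped (ACE, BCE infrequent), ABDE is dropped (ADE infrequent)
```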
Candidate Generation: Fk-1 x Fk-1 Method

26
Illustrating Apriori Principle

Items (1-itemsets)

Pairs (2-itemsets)
(No need to generate candidates involving Coke or Eggs)

Triplets (3-itemsets)

Minimum Support = 3

If every subset is considered,
  C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41
With support-based pruning,
  6 + 6 + 1 = 13

Use of the Fk-1 × Fk-1 method for candidate generation results in only one
3-itemset. This is eliminated after the support counting step.

27
Alternate Fk-1 x Fk-1 Method

● Merge two frequent (k-1)-itemsets if the last (k-2) items of the first one are
  identical to the first (k-2) items of the second

● F3 = {ABC, ABD, ABE, ACD, BCD, BDE, CDE}
  – Merge(ABC, BCD) = ABCD
  – Merge(ABD, BDE) = ABDE
  – Merge(ACD, CDE) = ACDE
  – Merge(BCD, CDE) = BCDE

28
Candidate Pruning for Alternate Fk-1 x Fk-1 Method

● Let F3 = {ABC, ABD, ABE, ACD, BCD, BDE, CDE} be the set of frequent 3-itemsets

● L4 = {ABCD, ABDE, ACDE, BCDE} is the set of candidate 4-itemsets generated
  (from the previous slide)

● Candidate pruning
  – Prune ABDE because ADE is infrequent
  – Prune ACDE because ACE and ADE are infrequent
  – Prune BCDE because BCE is infrequent

● After candidate pruning: L4 = {ABCD}

29
Support Counting of Candidate Itemsets

● Scan the database of transactions to determine the support of each candidate
  itemset
  – Must match every candidate itemset against every transaction, which is an
    expensive operation

30
Support Counting of Candidate Itemsets

● To reduce the number of comparisons, store the candidate itemsets in a hash
  structure
  – Instead of matching each transaction against every candidate, match it
    against candidates contained in the hashed buckets

31
Support Counting: An Example
Suppose you have 15 candidate itemsets of length 3:
  {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4},
  {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}

How many of these itemsets are supported by transaction {1, 2, 3, 5, 6}?

32
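A direct set-containment check answers the question: only three of the 15 candidates are contained in the transaction.

```python
candidates = [
    {1, 4, 5}, {1, 2, 4}, {4, 5, 7}, {1, 2, 5}, {4, 5, 8},
    {1, 5, 9}, {1, 3, 6}, {2, 3, 4}, {5, 6, 7}, {3, 4, 5},
    {3, 5, 6}, {3, 5, 7}, {6, 8, 9}, {3, 6, 7}, {3, 6, 8},
]
transaction = {1, 2, 3, 5, 6}

supported = [c for c in candidates if c <= transaction]   # subset test
print(len(supported), supported)   # 3 [{1, 2, 5}, {1, 3, 6}, {3, 5, 6}]
```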
Rule Generation

● Given a frequent itemset L, find all non-empty subsets f ⊂ L such that
  f → L – f satisfies the minimum confidence requirement
  – If {A,B,C,D} is a frequent itemset, candidate rules:
    ABC → D, ABD → C, ACD → B, BCD → A,
    A → BCD, B → ACD, C → ABD, D → ABC,
    AB → CD, AC → BD, AD → BC, BC → AD,
    BD → AC, CD → AB

● If |L| = k, then there are 2^k – 2 candidate association rules
  (ignoring L → ∅ and ∅ → L)

33
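A sketch that enumerates the candidate rules of a frequent itemset, confirming the 2^k − 2 count:

```python
from itertools import combinations

def candidate_rules(L):
    """All candidate rules f -> (L - f) for non-empty proper subsets f of L."""
    L = set(L)
    rules = []
    for r in range(1, len(L)):                      # size of the antecedent f
        for f in combinations(sorted(L), r):
            rules.append((set(f), L - set(f)))
    return rules

rules = candidate_rules({"A", "B", "C", "D"})
print(len(rules))   # 14 = 2^4 - 2, as stated above
```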
Rule Generation

● In general, confidence does not have an anti-monotone property
  – c(ABC → D) can be larger or smaller than c(AB → D)

● But confidence of rules generated from the same itemset has an anti-monotone
  property
  – E.g., suppose {A,B,C,D} is a frequent 4-itemset:

      c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)

34
Rule Generation for Apriori Algorithm

(Figure: lattice of rules — once a rule is found to have low confidence, the
rules below it in the lattice are pruned)

35
Association Analysis: Basic Concepts
and Algorithms

Algorithms and Complexity

36
Factors Affecting Complexity of Apriori

● Choice of minimum support threshold

● Dimensionality (number of items) of the data set

● Size of database

● Average transaction width


37
Factors Affecting Complexity of Apriori

● Choice of minimum support threshold


– lowering support threshold results in more frequent itemsets
– this may increase number of candidates and max length of
frequent itemsets
● Dimensionality (number of items) of the data set

● Size of database

● Average transaction width

38
Impact of Support Based Pruning

Items (1-itemsets)

Minimum Support = 3:
  If every subset is considered,
    C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41
  With support-based pruning,
    6 + 6 + 4 = 16

Minimum Support = 2:
  If every subset is considered,
    C(6,1) + C(6,2) + C(6,3) + C(6,4) = 6 + 15 + 20 + 15 = 56

39
Factors Affecting Complexity of Apriori

● Choice of minimum support threshold


– lowering support threshold results in more frequent itemsets
– this may increase number of candidates and max length of
frequent itemsets
● Dimensionality (number of items) of the data set
– More space is needed to store support count of itemsets
– if number of frequent itemsets also increases, both computation
and I/O costs may also increase
● Size of database

● Average transaction width


40
Factors Affecting Complexity of Apriori

● Choice of minimum support threshold


– lowering support threshold results in more frequent itemsets
– this may increase number of candidates and max length of
frequent itemsets
● Dimensionality (number of items) of the data set
– More space is needed to store support count of itemsets
– if number of frequent itemsets also increases, both computation
and I/O costs may also increase
● Size of database
– run time of algorithm increases with number of transactions
● Average transaction width

41
Factors Affecting Complexity of Apriori

● Choice of minimum support threshold


– lowering support threshold results in more frequent itemsets
– this may increase number of candidates and max length of
frequent itemsets
● Dimensionality (number of items) of the data set
– More space is needed to store support count of itemsets
– if number of frequent itemsets also increases, both computation
and I/O costs may also increase
● Size of database
– run time of algorithm increases with number of transactions
● Average transaction width
– transaction width increases the max length of frequent itemsets
– number of subsets in a transaction increases with its width,
increasing computation time for support counting

42
Factors Affecting Complexity of Apriori

43
Compact Representation of Frequent Itemsets

● Some frequent itemsets are redundant because their supersets are also frequent
  – Consider the data set in the figure; assume support threshold = 5
  – Number of frequent itemsets (see figure)

● Need a compact representation

44
Maximal Frequent Itemset
An itemset is maximal frequent if it is frequent and none of its immediate
supersets is frequent

(Figure: itemset lattice with the border between frequent and infrequent
itemsets; the maximal itemsets lie just below the border)

45
What are the Maximal Frequent Itemsets in this Data?

Minimum support threshold = 5

(Figure: a transaction data set over items A1–A10, B1–B10, C1–C10)

46
An illustrative example

(Figure: 10 transactions over items A–J)

Support threshold (by count): 5
  Frequent itemsets: ?
  Maximal itemsets: ?

47
An illustrative example

(Figure: 10 transactions over items A–J)

Support threshold (by count): 5
  Frequent itemsets: {F}
  Maximal itemsets: {F}

Support threshold (by count): 4
  Frequent itemsets: ?
  Maximal itemsets: ?

48
An illustrative example

(Figure: 10 transactions over items A–J)

Support threshold (by count): 5
  Frequent itemsets: {F}
  Maximal itemsets: {F}

Support threshold (by count): 4
  Frequent itemsets: {E}, {F}, {E,F}, {J}
  Maximal itemsets: {E,F}, {J}

Support threshold (by count): 3
  Frequent itemsets: ?
  Maximal itemsets: ?

49
An illustrative example

(Figure: 10 transactions over items A–J)

Support threshold (by count): 5
  Frequent itemsets: {F}
  Maximal itemsets: {F}

Support threshold (by count): 4
  Frequent itemsets: {E}, {F}, {E,F}, {J}
  Maximal itemsets: {E,F}, {J}

Support threshold (by count): 3
  Frequent itemsets: all subsets of {C,D,E,F}, plus {J}
  Maximal itemsets: {C,D,E,F}, {J}

50
Another illustrative example

(Figure: 10 transactions over items A–J)

Support threshold (by count): 5
  Maximal itemsets: {A}, {B}, {C}

Support threshold (by count): 4
  Maximal itemsets: {A,B}, {A,C}, {B,C}

Support threshold (by count): 3
  Maximal itemsets: {A,B,C}

51
Closed Itemset

● An itemset X is closed if none of its immediate supersets has the same
  support as the itemset X
● X is not closed if at least one of its immediate supersets has the same
  support count as X

52
52
Maximal Frequent vs Closed Frequent Itemsets

Minimum support = 2

(Figure: itemset lattice with support counts — some itemsets are closed but not
maximal, others are both closed and maximal)

# Closed frequent itemsets = 9
# Maximal frequent itemsets = 4

54
What are the Closed Itemsets in this Data?

(Figure: the transaction data set over items A1–A10, B1–B10, C1–C10)

55
Maximal vs Closed Itemsets

56
Pattern Evaluation

● Association rule algorithms can produce a large number of rules

● Interestingness measures can be used to prune/rank the patterns
  – In the original formulation, support & confidence are the only measures used

57
Computing Interestingness Measure

● Given X → Y or {X,Y}, the information needed to compute interestingness can be
  obtained from a contingency table

Contingency table:

            Y      ¬Y
   X       f11    f10    f1+
  ¬X       f01    f00    f0+
           f+1    f+0     N

  f11: support of X and Y
  f10: support of X and ¬Y
  f01: support of ¬X and Y
  f00: support of ¬X and ¬Y

● Used to define various measures: support, confidence, Gini, entropy, etc.

58
Drawback of Confidence
  Customers   Tea   Coffee   …
  C1           0      1      …
  C2           1      0      …
  C3           1      1      …
  C4           1      0      …

Association Rule: Tea → Coffee

Confidence ≅ P(Coffee|Tea) = 150/200 = 0.75

Confidence > 50%, meaning people who drink tea are more likely to drink coffee
than not to drink coffee
So the rule seems reasonable

59
Drawback of Confidence

             Coffee   ¬Coffee
   Tea         150        50     200
  ¬Tea         650       150     800
               800       200    1000

Association Rule: Tea → Coffee

Confidence = P(Coffee|Tea) = 150/200 = 0.75

but P(Coffee) = 0.8, which means knowing that a person drinks tea reduces the
probability that the person drinks coffee!
⇒ Note that P(Coffee|¬Tea) = 650/800 = 0.8125

60
Drawback of Confidence
  Customers   Tea   Honey   …
  C1           0      1     …
  C2           1      0     …
  C3           1      1     …
  C4           1      0     …

Association Rule: Tea → Honey

Confidence ≅ P(Honey|Tea) = 100/200 = 0.50
Confidence = 50%, which may mean that drinking tea has little influence on
whether honey is used or not
So the rule seems uninteresting
But P(Honey) = 120/1000 = 0.12 (hence tea drinkers are far more likely to have
honey than customers in general)

61
Measure for Association Rules

● So, what kind of rules do we really want?
  – Confidence(X → Y) should be sufficiently high
    ◆ To ensure that people who buy X will more likely buy Y than not buy Y

  – Confidence(X → Y) > support(Y)
    ◆ Otherwise, the rule will be misleading because having item X actually
      reduces the chance of having item Y in the same transaction
    ◆ Is there any measure that captures this constraint?
      – Answer: Yes. There are many of them.

62
Statistical Relationship between X and Y

● The criterion
    confidence(X → Y) = support(Y)
  is equivalent to:
  – P(Y|X) = P(Y)
  – P(X,Y) = P(X) × P(Y)   (X and Y are independent)

● If P(X,Y) > P(X) × P(Y): X & Y are positively correlated
● If P(X,Y) < P(X) × P(Y): X & Y are negatively correlated

63
Measures that take into account statistical dependence

  Lift = c(X → Y) / s(Y) = P(Y|X) / P(Y)
  Interest = s(X,Y) / (s(X) × s(Y)) = P(X,Y) / (P(X) × P(Y))

● Lift is used for rules while Interest is used for itemsets (the two coincide
  numerically)

64
Example: Lift/Interest

             Coffee   ¬Coffee
   Tea         150        50     200
  ¬Tea         650       150     800
               800       200    1000

Association Rule: Tea → Coffee

Confidence = P(Coffee|Tea) = 0.75
but P(Coffee) = 0.8
⇒ Interest = 0.15 / (0.2 × 0.8) = 0.9375  (< 1, therefore negatively associated)

So, is it enough to use confidence/interest for pruning?

65
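A quick numeric check of the example, using the counts from the table above:

```python
# Counts from the contingency table above.
n, n_tea, n_coffee, n_tea_coffee = 1000, 200, 800, 150

confidence = n_tea_coffee / n_tea                    # P(Coffee | Tea) = 0.75
lift = confidence / (n_coffee / n)                   # 0.75 / 0.8  = 0.9375
interest = (n_tea_coffee / n) / ((n_tea / n) * (n_coffee / n))   # 0.15 / 0.16 = 0.9375
print(confidence, lift, interest)   # both measures < 1: a negative association
```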
There are lots of
measures proposed in
the literature

66
Simpson’s Paradox

● The observed relationship in the data may be influenced by the presence of
  other confounding factors (hidden variables)
  – Hidden variables may cause the observed relationship to disappear or
    reverse its direction!

● Proper stratification is needed to avoid generating spurious patterns

67
Simpson’s Paradox

● Recovery rate from Covid
  – Hospital A: 80%
  – Hospital B: 90%
● Which hospital is better?

68
Simpson’s Paradox

● Recovery rate from Covid
  – Hospital A: 80%
  – Hospital B: 90%
● Which hospital is better?
● Covid recovery rate on older population
  – Hospital A: 50%
  – Hospital B: 30%
● Covid recovery rate on younger population
  – Hospital A: 99%
  – Hospital B: 98%

69
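The reversal is just a weighted-average effect. The sketch below uses the per-group rates quoted above with a hypothetical patient mix (the mix is an assumption for illustration, not data from the slides):

```python
# Hypothetical patient mix (assumed for illustration, not taken from the slides);
# the per-group recovery rates are the ones quoted above.
def overall_rate(rate_old, rate_young, n_old, n_young):
    # Overall recovery rate is a weighted average of the per-group rates.
    return (rate_old * n_old + rate_young * n_young) / (n_old + n_young)

print(overall_rate(0.50, 0.99, 500, 500))   # Hospital A: ~0.745
print(overall_rate(0.30, 0.98, 100, 900))   # Hospital B: ~0.912
# B looks better overall, even though A has the higher rate in BOTH age groups,
# because A treats a much larger share of the older (harder) cases.
```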
Simpson’s Paradox

● Covid-19 deaths (per 100,000 of population)
  – County A: 15
  – County B: 10
● Which county is managing the pandemic better?

70
Simpson’s Paradox

● Covid-19 deaths (per 100,000 of population)
  – County A: 15
  – County B: 10
● Which county is managing the pandemic better?
● Covid death rate on older population
  – County A: 20
  – County B: 40
● Covid death rate on younger population
  – County A: 2
  – County B: 5

71
