
Association Rules

CS 5331 by Rattikorn Hewett


Texas Tech University

Outline
n Association Rule Mining – Basic Concepts
n Association Rule Mining Algorithms:
¨ Single-dimensional Boolean associations
¨ Multi-level associations
¨ Multi-dimensional associations
n Association vs. Correlation
n Adding constraints
n Applications/extensions of frequent pattern mining
n Summary

Multiple-level Association Rules
Why?
n Hard to find strong associations among low conceptual
level data (e.g., less support counts for “skim milk” than
“milk”)
n Associations among high-level data are likely to be known
and uninteresting
n Easier to find interesting associations among items at
multiple conceptual levels, rather than only among single
level data

Approaches: uniform vs. reduced threshold


n Uniform min support
¨ uses same min support threshold at all levels
+ Search is simplified (dealing with one threshold) and
optimized (itemsets whose ancestors are infrequent need not be examined)
- If the threshold is set too high
→ might miss associations at low levels
if it is set too low
→ too many uninteresting associations

Uniform support (min_sup = 5% at both levels):
Level 1: Milk [support = 10%]
Level 2: 2% Milk [support = 6%], Skim Milk [support = 4%, infrequent]

Reduced support (min_sup = 5% at level 1, 3% at level 2):
Level 1: Milk [support = 10%]
Level 2: 2% Milk [support = 6%], Skim Milk [support = 4%, frequent]

Reduced Min Support
n Four strategies:
1. level-by-level: full breadth search on every node
2. level-cross filtering by single item: items are examined only
if their parents are frequent (e.g., do not examine 2% Milk and Skim Milk)
3. level-cross filtering by k-itemsets: examine only children of
frequent k-itemsets (e.g., the 2-itemset Milk & Bread is infrequent, so do not
examine its children)
Top level: min_sup = 5%; bottom level: min_sup = 3%

Strategy 2: Milk [support = 4%] is infrequent at the top level,
so 2% Milk [support = 3%] and Skim Milk [support = 2%] are not examined.

Strategy 3: the 2-itemset Milk & Bread [support = 4%] is infrequent,
so its children 2% Milk & wheat bread [support = 2%] and
Skim Milk & white bread [support = 1%] are not examined
(Milk itself has [support = 4%]).

Strategy 1 is too relaxed, 3 is too restrictive; 2 is like 3 but less
restricted because it filters with 1-itemsets.
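A minimal Python sketch of strategy 2 (level-cross filtering by single item); the hierarchy, support, and threshold names here are illustrative assumptions, not from any particular library:

def level_cross_filter(top_items, children, support, min_sups):
    """Examine an item only if its parent was frequent at the parent's
    level; min_sups holds one threshold per concept level."""
    frequent, level_items = [], list(top_items)
    for min_sup in min_sups:
        survivors = [i for i in level_items if support(i) >= min_sup]
        frequent.extend(survivors)
        # Children of infrequent parents are never examined
        # (e.g., 2% Milk and Skim Milk under an infrequent Milk).
        level_items = [c for p in survivors for c in children.get(p, [])]
    return frequent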

Reduced Min Support (cont)

4. Controlled level-cross filtering by single item: add a level
passage threshold (e.g., the user slides the level passage threshold
between 5% and 2% -- this can be done for each concept hierarchy)

Top level: min_sup = 5%
Milk [support = 4%]
Bottom level: min_sup = 2%
2% Milk [support = 3%]   Skim Milk [support = 2%]
→ Method 2 could miss associations such as 2% Milk ⇒ Skim Milk

Top level: min_sup = 5%, level-passage-sup = 4%
Milk [support = 4%] passes the level passage threshold
Bottom level: min_sup = 2%
2% Milk [support = 3%]   Skim Milk [support = 2%]

Flexible Support Constraints
n Why flexible support constraints?
¨ Real life occurrence frequencies vary greatly
n Diamond, watch, pens in a shopping basket
¨ Uniform support may not be an interesting model
n A flexible model
¨ Usually, lower level, more dimension combinations, and
longer pattern length → smaller support
¨ General rules should be easy to specify and understand
¨ Special items and special group of items may be
specified individually and have higher priority

Multi-Level Mining
n A top-down, progressive deepening approach:
¨ First mine high-level frequent items:
milk (15%), bread (10%)
¨ Then mine their lower-level “weaker” frequent itemsets:
skim milk (5%), wheat bread (4%)

n Different min_support threshold across multi-levels


lead to different algorithms:
¨ If adopting the same min_support across multi-levels
then toss t if any of t’s ancestors is infrequent.
¨ If adopting reduced min_support at lower levels
then examine only those descendents whose ancestor’s support
is frequent/non-negligible.

Redundancy checking
n Must check if the resulting rules from multi-
level association mining are redundant
E.g.,
1. Milk ⇒ Bread [support 8%, confidence 70%]
2. Skim Milk ⇒ Bread [support 2%, confidence 72%]
Suppose about 1/4 of milk sales are skim milk, then
Rule 1. can estimate that
Skim Milk ⇒ Bread [support = 1/4 of 8% = 2%, confidence 70%]
This makes Rule 2 “redundant” since it is close to what
is “expected” from Rule 1.
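A small sketch of this check in Python; the tolerance value and the assumption that the ancestor share (here 1/4) is known are illustrative:

def is_redundant(anc_sup, anc_conf, desc_sup, desc_conf, share, tol=0.05):
    """A descendant rule is redundant if its support and confidence are
    close to the values expected from its ancestor rule alone."""
    exp_sup = share * anc_sup   # e.g., 1/4 of 8% = 2%
    exp_conf = anc_conf         # confidence is expected to carry over
    return (abs(desc_sup - exp_sup) <= tol * exp_sup and
            abs(desc_conf - exp_conf) <= tol * exp_conf)

# Milk => Bread [8%, 70%]; Skim Milk => Bread [2%, 72%]; skim = 1/4 of milk
print(is_redundant(0.08, 0.70, 0.02, 0.72, 0.25))  # True: Rule 2 is redundant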

Outline
n Association Rule Mining – Basic Concepts
n Association Rule Mining Algorithms:
¨ Single-dimensional Boolean associations
¨ Multi-level associations
¨ Multi-dimensional associations
n Association vs. Correlation
n Adding constraints
n Applications/extensions of frequent pattern mining
n Summary


Multi-dimensional Associations
n Involve two or more dimensions (or predicates)
Example:
Single-dimensional rule: buys(X, “milk”) ⇒ buys(X, “bread”)
Multi-dimensional rule: age(X, “0..10”) ⇒ income(X, “0..2K”)
n Two types of multi-dimensional assoc. rules:
¨ Inter-dimension assoc. rules (no repeated predicates)
age(X,”19-25”) ∧ occupation(X,“student”) ⇒ buys(X,“coke”)
¨ hybrid-dimension assoc. rules (repeated predicates)
age(X,”19-25”) ∧ buys(X, “popcorn”) ⇒ buys(X, “coke”)
n Here we’ll deal with inter-dimension associations


Multi-dimension Mining
n Attribute types:
¨ Categorical: finite number of values, no ordering among values
¨ Quantitative: numeric, implicit ordering among values

n Techniques for mining multi-dimensional associations


¨ Search for frequent predicate sets (as opposed to frequent itemsets)
¨ Classified by how “quantitative” attributes are treated
E.g., {age, occupation, buys} is a 3-predicate set
Techniques can be categorized by how age values are treated


Multi-dimension Mining (MDM) Techniques
1. Concept-based
¨ Quantitative attribute values are treated as predefined
categories/ranges
¨ Discretization occurs prior to mining using predefined concept
hierarchies
2. Distribution-based
¨ Quantitative attribute values are treated as quantities to satisfy
some criteria (e.g., max confidence)
¨ Discretization occurs during mining process using “bins” based
on the distribution of the data
3. Distance-based
¨ Quantitative attribute values are treated as quantities to capture
meaning of interval data
¨ Discretization occurs during mining process using the distance
between data points

Concept-based MDM
n Numeric values are replaced by ranges or predefined concepts
n Two approaches depending on how data are stored:
¨ Relational tables
n Modify the Apriori to finding all frequent predicate sets

n Finding k-predicate sets will require k or k+1 table scans.

¨ Data cubes
n Well suited since data cubes are multi-dimensional structures

n The cells of n-D cuboid store support/confidence of n-


predicate sets (cuboids represent aggregated dimensions)
n To reduce candidates generated, apply the Apriori principle :
every subset of frequent predicate set must be frequent


Distribution-based MDM
n Unlike concept-based approach, numeric attribute values are
dynamically discretized to meet some criteria
¨ Example of discretization: binning
n Equiwidth: same interval size
n Equidepth: same number of data points in each bin
n Homogeneity-based: data points in each bin are uniformly distributed
¨ Example of criteria:
n Compact
n Strong rules (i.e., high confidence/support)

n Resulting rules are referred to as Quantitative Association Rules


n Consider a 2-D quantitative association rule: Aquan1 ∧ Bquan2 ⇒ Ccat
E.g., age(X,“30-39”) ∧ income(X, “40K-44K”) ⇒ buys(X, “HD TV”)


Distribution-based MDM - Example


ARCS – Association Rule Clustering System
n For each quantitative attribute, discretize the numeric values based on
the data distribution, e.g., by binning techniques Income 2 3 11 4
30-34K
¨ 2-D table of the resulting bins of the two
35-39K 5 20 23 6
quantitative attributes on LHS of the rule
¨ Each cell holds count distribution in each 40-44K 1 33 46 9

category of the attribute on the RHS of the rule 45-49K 2 12 10 14

n Finding frequent predicate sets 25-29 30-34 35-39 40-44


Age
¨ Generate strong associations (same as in Apriori)
age(X,“30-34”) ∧ income(X, “40K-44K”) ⇒ buys(X, “HD TV”)
age(X,“35-39”) ∧ income(X, “40K-44K”) ⇒ buys(X, “HD TV”)

n Simplify resulting rules


¨ Rule “clusters” ( here in “grids”) are further combined
age(X,“30-39”) ∧ income(X, “40K-44K”) ⇒ buys(X, “HD TV”)
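A minimal sketch of the rule-combining step, under the simplifying assumption that only adjacent age bins sharing the same income bin and RHS are merged (the function name is illustrative):

def merge_adjacent_bins(strong_bins):
    """Combine runs of adjacent (lo, hi) age bins that each already form
    a strong rule with the same income bin and RHS."""
    merged = []
    for lo, hi in sorted(strong_bins):
        if merged and lo <= merged[-1][1] + 1:  # adjacent or overlapping
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged

# The two strong rules for ages 30-34 and 35-39 combine into one 30-39 rule:
print(merge_adjacent_bins([(30, 34), (35, 39)]))  # [(30, 39)]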


Distance-based MDM
n Binning methods do not capture the semantics of interval
data, e.g., Price ($): 7 20 22 50 51 53
Equi-width Equi-depth Distance-
(width $10) (depth 2) based
[0,10] [7,20] [7,7]
[11,20] [22,50] [20,22]
[21,30] [51,53] [50,53]
[31,40]
[41,50]
[51,60]
n Distance-based partitioning, more meaningful discretization
considering:
¨ density/number of points in an interval
¨ “closeness” of points in an interval


Distance-based MDM (contd)


n Distance measures: e.g., two points (x1,x2,x3) and (t1,t2,t3)
¨ Euclidean 3
∑ i =1
( xi − ti ) 2
¨ Manhattan 3
| xi − ti |
n Two phases: ∑ i =1

¨ Identify clusters (Ch 8)


n Data points in each cluster satisfy both frequency threshold
and density threshold ~ support
¨ Obtain association rules
n Define degree of association ~ confidence, e.g., the Manhattan
distance between cluster centroids (a centroid is the average of the
data points in the cluster)
n Three conditions:
¨ Clusters in the LHS are each strongly associated with each cluster in the RHS
¨ Clusters in LHS collectively occur together
¨ Clusters in RHS collectively occur together
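Minimal sketches of the two distance measures above:

import math

def euclidean(x, t):
    # square root of the sum of squared coordinate differences
    return math.sqrt(sum((xi - ti) ** 2 for xi, ti in zip(x, t)))

def manhattan(x, t):
    # sum of absolute coordinate differences
    return sum(abs(xi - ti) for xi, ti in zip(x, t))

print(euclidean((1, 2, 3), (4, 6, 3)))  # 5.0
print(manhattan((1, 2, 3), (4, 6, 3)))  # 7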

Outline
n Association Rule Mining – Basic Concepts
n Association Rule Mining Algorithms:
¨ Single-dimensional Boolean associations
¨ Multi-level associations
¨ Multi-dimensional associations
n Association vs. Correlation
n Adding constraints
n Applications/extensions of frequent pattern mining
n Summary


Association & Correlation analysis


Basketball Not basketball Sum (row)

Cereal 400 350 750

Not cereal 200 50 250

Sum(col.) 600 400 1000

n Suppose: min support 20%, min confidence = 50%


n Probability of buying cereal = 750/1000 = 75%
n basketball ⇒ cereal [400/1000 = 40%, 400/600 = 66.7%]
The chance of buying cereal (even without this rule) is already higher than 66.7%
→ the implication of this rule is not interesting
a “strong” rule (high conf) but “uninformative” (prob. of RHS > conf)


Association & Correlation analysis (contd)
Basketball Not basketball Sum (row)

Cereal 400 350 750

Not cereal 200 50 250

Sum(col.) 600 400 1000

n Define corr_{A,B} = P(A ∩ B) / ( P(A) P(B) )
= 1 if A and B are independent
< 1 if A and B are negatively correlated
> 1 if A and B are positively correlated
Note corr_{A,B} = Lift(A ⇒ B). Does this notation fit the definition?

n Corrbasketball, cereal = (400/1000)/[(600/1000)(750/1000)] = 0.89


€à basketball and cereal are negatively correlated
n Corrbasketball, not cereal = (200/1000)/[(600/1000)(250/1000)] = 1.3
à basketball and not cereal are positively correlated
But basketball ⇒ not cereal [200/1000 = 20%,200/600 = 33.3%]
“Not strong” but “informative” (prob of not buying cereal only 25%)
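A quick sketch that reproduces these numbers from the contingency table:

def lift(n_ab, n_a, n_b, n):
    """corr(A,B) = P(A and B) / (P(A) * P(B)) = Lift(A => B)."""
    return (n_ab / n) / ((n_a / n) * (n_b / n))

# 600 basketball, 750 cereal, 400 both; 250 not-cereal, 200 both; n = 1000
print(round(lift(400, 600, 750, 1000), 2))  # 0.89 -> negatively correlated
print(round(lift(200, 600, 250, 1000), 2))  # 1.33 -> positively correlated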

Association & Correlation analysis (contd)

n Association and Correlation are not the same


¨ basketball ⇒ cereal -ve corr:
P(A & B) < P(A) P(B)
strong
Informative:
uninformative & -vely correlated P(B) < Conf(AàB)
P(B) < P(A & B)/P(A)
¨ basketball ⇒ not cereal P(B)P(A) < P(A & B)
not strong -ve corr = uninformative
informative & +vely correlated
Can LHS and RHS of a rule be negatively
correlated and yet the rule is informative?


Association & Correlation analysis (contd)

n Association and Correlation are not the same


n Mining of correlated rules
¨ I.e., rules involve correlated itemsets (instead of frequent itemsets)
¨ Correlation value of a set of items can be calculated (cf. corrA,B)
¨ Use the χ2 statistic to test if the correlation value is statistically
significant
¨ Upward closed property – if A has the property, so does every superset of A
n Correlation is upward closed (if A is a correlated itemset, so is its superset)
n χ2 is upward closed (within each significance level)
¨ Search upward for correlated itemsets starting from an empty set
to find minimal correlated item sets
n In datacube – random walk algorithms are used
n In general – still an open problem when dealing with large dimensions
n See also [Brin et al., 97] Is “frequent itemset” upward closed?

Outline
n Association Rule Mining – Basic Concepts
n Association Rule Mining Algorithms:
¨ Single-dimensional Boolean associations
¨ Multi-level associations
¨ Multi-dimensional associations
n Association vs. Correlation
n Adding constraints
n Applications/extensions of frequent pattern mining
n Summary


Constraint-based Mining
n Finding all the patterns in a database
autonomously? — unrealistic!
¨ The patterns could be too many but not focused!
n Constraint-based mining allows
¨ Specification of constraints on what is to be mined
→ more effective mining, e.g.,
Metarule: template A(x,y) ∧ B(x,w) ⇒ buys(x, “HD TV”) to guide the search
Rule constraint: small sales (price < $10) trigger big sales (sum > $200)
¨ System optimization
→ more efficient mining, e.g., data mining query optimization
n Constraint-based mining aims to reduce search and
find all answers that satisfy a given constraint

Constrained Frequent Pattern Mining


A Mining Query Optimization Problem
n Given a frequent pattern mining query with a set of constraints C, the
algorithm should be
¨ sound: it only finds frequent sets that satisfy C
Which is harder?
¨ complete: all frequent sets satisfying C are found
n A naïve solution:
¨ First find all frequent sets, and then test them for constraint
satisfaction
n More efficient approaches:
¨ Analyze the properties of constraints comprehensively
¨ Push them as deeply as possible inside the frequent pattern
computation and still ensure completeness of the answer.

What kind of rule constraints can be pushed as above?



Rule constraints
n Types of rule constraints:
¨ Anti-monotone
¨ Monotone
¨ Succinct
¨ Convertible
¨ Inconvertible
n The first four types can be pushed in the mining process to
improve efficiency without losing completeness of the
answers


(Anti-)monotone constraints
c = a rule constraint
A = an itemset, B = a proper superset of A
n Monotone: A satisfies c → any B satisfies c
n Anti-monotone: A doesn't satisfy c → no B satisfies c

Item  Profit
a      40
b       0
c     -20
d      10
e     -30
f      30
g      20
h     -10

Examples:
n sum(A.Price) ≥ v is monotone
n min(A.Price) ≤ v is monotone
n sum(A.Price) ≤ v is anti-monotone
n min(A.Price) ≥ v is anti-monotone
n C: range(A.profit) ≤ 15 is anti-monotone
¨ Itemset ab violates C (range = 40 > 15)
¨ So does every superset of ab
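A minimal sketch of the anti-monotone check for C on the profit table above:

profit = {'a': 40, 'b': 0, 'c': -20, 'd': 10, 'e': -30,
          'f': 30, 'g': 20, 'h': -10}

def satisfies_range(itemset, v=15):
    values = [profit[i] for i in itemset]
    return max(values) - min(values) <= v

# Anti-monotone: once a candidate violates C, every superset can be pruned.
print(satisfies_range('dg'))  # True  (range = 10): keep and extend
print(satisfies_range('ab'))  # False (range = 40): prune ab and all supersets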


Succinct constraints
n Succinct: there is a “formula” to generate precisely all itemsets
satisfying the constraint
¨ itemsets satisfying the constraint can be enumerated before support
counting starts
Item Price
¨ Succinct constraints are pre-counting prunable
a 40
Examples: b 10
n c: max(A.Price) ≥ 20 is monotone and succinct c 22
An itemset satisfies c is of the form A1 ∪ A2, where d 25
A2 is {b} - a set (can be empty) of items with prices ≤ v e 30
A1 is a non-empty subset of {a, c, d, e} - a set of items with prices ≥ v
n min(A.Price) ≤ v is succinct and monotone
TID Transaction
n sum(A.Price) ≤ v is not succinct but anti-monotone 10 a, b, c, d
n sum(A.Price) ≥ v is not succinct but monotone 20 a, c, d
30 a, b, d
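A small sketch that enumerates, before any support counting, exactly the itemsets satisfying max(A.Price) ≥ 20 (v = 20) from the price table above, following the A1 ∪ A2 form; names are illustrative:

from itertools import chain, combinations

price = {'a': 40, 'b': 10, 'c': 22, 'd': 25, 'e': 30}
v = 20
high = [i for i in price if price[i] >= v]  # {a, c, d, e}
low = [i for i in price if price[i] < v]    # {b}

def powerset(items):
    return chain.from_iterable(combinations(items, r)
                               for r in range(len(items) + 1))

# Every satisfying itemset is a non-empty A1 from `high` plus any A2 from `low`.
satisfying = [set(a1) | set(a2)
              for a1 in powerset(high) if a1
              for a2 in powerset(low)]
print(len(satisfying))  # 30 = (2^4 - 1) * 2^1 candidates to count support for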


The Apriori Algorithm — Example


Min support = 2

Database D:
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D → C1: {1}:2  {2}:3  {3}:3  {4}:1  {5}:3
L1: {1}:2  {2}:3  {3}:3  {5}:3

C2: {1 2} {1 3} {1 5} {2 3} {2 5} {3 5}
Scan D → C2 counts: {1 2}:1  {1 3}:2  {1 5}:1  {2 3}:2  {2 5}:3  {3 5}:2
L2: {1 3}:2  {2 3}:2  {2 5}:3  {3 5}:2

C3: {2 3 5};  Scan D → L3: {2 3 5}:2
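A compact Python sketch of the same level-wise computation (self-join candidate generation plus the Apriori subset check); it is illustrative, not an optimized implementation:

from itertools import combinations

D = {100: {1, 3, 4}, 200: {2, 3, 5}, 300: {1, 2, 3, 5}, 400: {2, 5}}

def apriori(db, min_sup):
    support = lambda s: sum(1 for t in db.values() if s <= t)
    items = sorted({i for t in db.values() for i in t})
    L = [frozenset([i]) for i in items if support(frozenset([i])) >= min_sup]
    frequent, k = list(L), 2
    while L:
        # Join L_{k-1} with itself; keep size-k unions all of whose
        # (k-1)-subsets are frequent (the Apriori principle).
        C = {a | b for a in L for b in L if len(a | b) == k}
        C = [c for c in C
             if all(frozenset(s) in set(L) for s in combinations(c, k - 1))]
        L = [c for c in C if support(c) >= min_sup]  # one scan of D per level
        frequent += L
        k += 1
    return frequent

print(sorted(sorted(s) for s in apriori(D, 2)))
# 9 frequent itemsets; the largest is [2, 3, 5] with support 2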

Naïve: Apriori + Constraint: sum(S.price) < 5
(the price of item k is k)

Run Apriori exactly as on the previous slide, then test the resulting
frequent itemsets against the constraint: of {1}, {2}, {3}, {5}, {1 3},
{2 3}, {2 5}, {3 5}, and {2 3 5}, only {1}, {2}, {3}, and {1 3}
(sum = 4) satisfy sum(S.price) < 5.

[Apriori trace as on the previous slide]

Pushing the constraint: sum(S.price) < 5 (the price of item k is k)

With positive prices, sum(S.price) < 5 is anti-monotone, so it can be
checked during candidate generation: {5} (sum = 5) already violates the
constraint, so {5} and every superset of it are pruned before any
further counting; likewise {2 3} (sum = 5) is never counted. The
frequent itemsets satisfying the constraint are {1}, {2}, {3}, and {1 3}.

[Apriori trace as on the previous slide, with the violating candidates
pruned]
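A sketch of pushing an anti-monotone constraint into the candidate-generation step, reusing the structure of the Apriori sketch above (the subset-pruning step is omitted for brevity; names are illustrative):

def apriori_with_antimonotone(db, min_sup, constraint):
    """Apriori where candidates violating an anti-monotone constraint
    are discarded before support counting; completeness is preserved."""
    support = lambda s: sum(1 for t in db.values() if s <= t)
    items = sorted({i for t in db.values() for i in t})
    L = [frozenset([i]) for i in items
         if constraint(frozenset([i])) and support(frozenset([i])) >= min_sup]
    frequent, k = list(L), 2
    while L:
        C = {a | b for a in L for b in L if len(a | b) == k}
        C = [c for c in C if constraint(c)]  # push the constraint here
        L = [c for c in C if support(c) >= min_sup]
        frequent += L
        k += 1
    return frequent

# sum(S.price) < 5 with price(k) = k: {5} is pruned before any counting.
D = {100: {1, 3, 4}, 200: {2, 3, 5}, 300: {1, 2, 3, 5}, 400: {2, 5}}
answers = apriori_with_antimonotone(D, 2, lambda s: sum(s) < 5)
print(sorted(sorted(s) for s in answers))  # [[1], [1, 3], [2], [3]]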

Pushing a Succinct Constraint: min(S.price) ≤ 1

min(S.price) ≤ 1 is succinct: the satisfying itemsets are exactly those
containing item 1, so they can be enumerated before support counting
starts, and only candidates containing item 1 are ever generated and
counted. With min support = 2, the answers are {1} and {1 3}.

[Apriori trace as on the base example slide, restricted to itemsets
containing item 1]

Convertible constraints
n Constraints that can become anti-monotone or monotone when items
in itemsets are ordered in a certain way Item Profit
a 40
Example: b 0
c -20
C: avg(S.profit) ≥ 15 d 10
e -30
n C is not anti-monotone nor monotone f 30
g 20
n If Items are added in value-descending order h -10

<a, f, g, d, b, h, c, e>
40 30 20 10 0 -10 -20 -30
ascending dg satisfies C, so does dg*
gb violates C, so does gbh, and
gb* (note * = strings representing itemsets with each item value ≤ b’s value)
à C becomes anti-monotone
ascending monotone
n C with respect to value-descending order is anti-monotone convertible

Strongly Convertible Constraints
n avg(X) ≥ 15 is convertible anti-monotone w.r.t. item
value descending order R: <a, f, g, d, b, h, c, e>

n avg(X) ≥ 15 is convertible monotone w.r.t. item


value ascending order R⁻¹: <e, c, h, b, d, g, f, a>

n We say, avg(X) ≥ 15 is strongly convertible


More examples
Constraint                                   Convertible     Convertible   Strongly
                                             anti-monotone   monotone      convertible
avg(S) ≤ v, ≥ v                              Yes             Yes           Yes
median(S) ≤ v, ≥ v                           Yes             Yes           Yes
sum(S) ≤ v (items of any value, v ≥ 0)       Yes             No            No
sum(S) ≤ v (items of any value, v ≤ 0)       No              Yes           No
sum(S) ≥ v (items of any value, v ≥ 0)       No              Yes           No
sum(S) ≥ v (items of any value, v ≤ 0)       Yes             No            No
……


Common SQL-based constraints
Constraint Antimonotone Monotone Succinct
v∈S no yes yes
S ⊇ V no yes yes
S⊆V yes no yes
min(S) ≤ v no yes yes
min(S) ≥ v yes no yes
max(S) ≤ v yes no yes

max(S) ≥ v no yes yes


count(S) ≤ v yes no weakly
count(S) ≥ v no yes weakly
sum(S) ≤ v ( a ∈ S, a ≥ 0 ) yes no no
sum(S) ≥ v ( a ∈ S, a ≥ 0 ) no yes no
range(S) ≤ v yes no no
range(S) ≥ v no yes no
avg(S) θ v, θ ∈ { =, ≤, ≥ } convertible convertible no
support(S) ≥ ξ yes no no
support(S) ≤ ξ no yes no

Classification of Constraints

(Diagram: the constraint classes Antimonotone, Monotone, Succinct,
Convertible anti-monotone, and Convertible monotone; Strongly
convertible is the intersection of the two convertible classes, and
Inconvertible lies outside all of them.)


Mining with convertible constraints
n C: avg(S.profit) ≥ 25

TID   Transaction              Item  sup  Profit
10    a, b, c, d, f            a     2     40
20    b, c, d, f, g            f     4     30
30    a, c, d, e, f            g     2     20
40    c, e, f, g, h            d     3     10
                               b     2      0
                               h     1    -10
                               c     4    -20
                               e     2    -30

n List the items in every transaction in value-descending
order R: <a, f, g, d, b, h, c, e>
¨ C is convertible anti-monotone w.r.t. R
n Scan the transaction DB once
¨ remove infrequent items: drop h

TID   Transaction (in R order)
10    a, f, d, b, c
20    f, g, d, b, c
30    a, f, d, c, e
40    f, g, h, c, e

n C can't be pushed into the level-wise framework
¨ Itemset df violates C - we want to prune it
¨ But since adf satisfies C, Apriori needs df to assemble adf,
so df cannot be pruned
n But C can be pushed into the frequent-pattern
growth framework!

Recap: Constraint-based mining


n All types of rule constraints but inconvertible can be used to
guide the mining process to improve mining efficiency
n Anti-monotone constraints can be applied at each iteration
of Apriori-like algorithms while guaranteeing completeness
¨ Pushing non-anti-monotone constraints into the mining process will
not guarantee completeness
n Itemsets satisfy succinct constraints can be determined
before support counting begins
¨ no need to iteratively check the rule constraint during the mining
process
¨ succinct constraints are pre-computing pushable
n Convertible constraints can’t be pushed in level-wise
mining algorithm such as Apriori 40

Handling Multiple Constraints
n Different constraints may require different or even
conflicting item-ordering
n If there exists an order R s.t. both C1 and C2 are convertible
w.r.t. R, then there is no conflict between the two
convertible constraints
n If there exists conflict on order of items
¨ Try to satisfy one constraint first
¨ Then using the order for the other constraint to mine frequent
itemsets in the corresponding projected database


Outline
n Association Rule Mining – Basic Concepts
n Association Rule Mining Algorithms:
¨ Single-dimensional Boolean associations
¨ Multi-level associations
¨ Multi-dimensional associations
n Association vs. Correlation
n Adding constraints
n Applications/extensions of frequent pattern mining
n Summary


Extensions/applications
n The following is not an exhaustive list
n Some topics are likely to be assigned for
your presentations in the second half of this
class


Sequential Pattern Mining


n Sequence data vs. Time-series data
¨ sequences of ordered events (with or without explicit notion of time)
¨ sequences of values/events typically measured at equal time intervals
n Time-series data are sequence data but not viz.
n Sequential Pattern mining
¨ Deals with frequent sequential patterns (as opposed to frequent patterns)
¨ Problem: given a set of sequences, find the complete set of frequent
subsequences

n Applications of sequential pattern mining


¨ Customer shopping sequences, e.g., First buy computer, then CD-ROM,
and then digital camera, within 3 months.
¨ Medical treatment, natural disasters (e.g., earthquakes), science &
engineering processes, stocks and markets, etc.
¨ Telephone calling patterns, Weblog click streams
¨ DNA sequences and gene structures

Studies on Sequential Pattern Mining
n Concept introduction and an initial Apriori-like algorithm
¨ R. Agrawal & R. Srikant. “Mining sequential patterns,” ICDE’95
n GSP—An Apriori-based, influential mining method (developed at IBM
Almaden)
¨ R. Srikant & R. Agrawal. “Mining sequential patterns: Generalizations and
performance improvements,” EDBT’96
n From sequential patterns to episodes (Apriori-like + constraints)
¨ H. Mannila, H. Toivonen & A.I. Verkamo. “Discovery of frequent episodes
in event sequences,” Data Mining and Knowledge Discovery, 1997
n Mining sequential patterns with constraints
¨ M.N. Garofalakis, R. Rastogi, K. Shim: SPIRIT: Sequential Pattern Mining
with Regular Expression Constraints. VLDB 1999


Classification-Based on Associations
n Mine association possible rules (PR) in form of
condset è c
¨ Condset: a set of attribute-value pairs
¨ C: class label
n Build Classifier
¨ Organize rules according to decreasing precedence
based on confidence and support
n B. Liu, W. Hsu & Y. Ma. Integrating classification and
association rule mining. In KDD’98


Iceberg Cube computation
n It is too costly to materialize a high dimen. cube
¨ 20 dimensions each with 99 distinct values may lead to 10020 cube cells
¨ Even if there is only one nonempty cell in each 1010 cells, the cube will still
contain 1030 nonempty cells
n Observation: Trivial cells are usually not interesting
¨ Nontrivial: large volume of sales, or high profit
n Solution:
¨ Iceberg cube—materialize only nontrivial cells of a data cube – cf.
tip of the iceberg
¨ Computation: Based on Apriori-like pruning, e.g.,
n BUC [Bayer & Ramakrishnan, 99]
n bottom-up cubing, efficient bucket-sort alg.
n Only handles anti-monotonic iceberg cubes
¨ If a cell c violates the HAVING clause, so do all more specific cells
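A toy sketch of this pruning idea, assuming a COUNT-based HAVING clause; for brevity it refines only prefix group-bys (state, then state+city, ...) rather than all 2^n cuboids, and all names are illustrative:

def iceberg(rows, n_dims, min_count, depth=0, prefix=()):
    """Partition on one more dimension at a time; a partition whose count
    already violates HAVING COUNT(*) >= min_count is never refined,
    since all of its more specific cells must violate it too."""
    if depth == n_dims:
        return {}
    cells, partitions = {}, {}
    for r in rows:
        partitions.setdefault(r[depth], []).append(r)
    for value, part in partitions.items():
        if len(part) >= min_count:
            cell = prefix + (value,)
            cells[cell] = len(part)
            cells.update(iceberg(part, n_dims, min_count, depth + 1, cell))
    return cells

rows = [("TX", "Dallas", "TV"), ("TX", "Dallas", "TV"),
        ("TX", "Austin", "PC"), ("CA", "LA", "TV")]
print(iceberg(rows, 3, 2))
# {('TX',): 3, ('TX', 'Dallas'): 2, ('TX', 'Dallas', 'TV'): 2}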


Spatial and Multi-Media Association


A Progressive Refinement Method: Why?
n Mining operator can be expensive or cheap, fine or rough
¨ Trade speed with quality: step-by-step refinement.
n Superset coverage property:
¨ Preserve all the positive answers—allow a false positive
but not a false negative.
n Two- or multi-step mining:
¨ First apply rough/cheap operator (superset coverage)
¨ Then apply expensive algorithm on a substantially
reduced candidate set (Koperski & Han, SSD’95).


Spatial Associations
n Hierarchy of spatial relationship:
¨ “g_close_to”: near_by, touch, intersect, contain, etc.
¨ First search for rough relationship and then refine it.
n Two-step mining of spatial association:
¨ Step1: rough spatial computation (as a filter)
¨ Step2: Detailed spatial algorithm (as refinement)
n Apply only to those objects which have passed the rough
spatial association test (no less than min_support)


Mining Multimedia Associations


Correlations with color, spatial relationships, etc.
From coarse to fine resolution mining


Outline
n Association Rule Mining – Basic Concepts
n Association Rule Mining Algorithms:
¨ Single-dimensional Boolean associations
¨ Multi-level associations
¨ Multi-dimensional associations
n Association vs. Correlation
n Adding constraints
n Applications/extensions of frequent pattern mining
n Summary


Achievements
n Frequent pattern mining—an important task in data mining
n Frequent pattern mining methodology
¨ Candidate generation-test vs. projection-based (frequent-pattern growth)
¨ Vertical vs. horizontal format (itemsets vs. transaction sets)
¨ Various optimization methods: database partition, scan reduction, hash
tree, sampling, border computation, clustering, etc.
n Related frequent pattern mining algorithm: scope extension
¨ Mining closed frequent itemsets and max-patterns (e.g., MaxMiner,
CLOSET, CHARM, etc.)
¨ Mining multi-level, multi-dimensional frequent patterns with flexible
support constraints
¨ Constraint pushing for mining optimization
¨ From frequent patterns to correlation and causality


Applications
n Related problems which need frequent pattern mining
¨ Association-based classification
¨ Iceberg cube computation
¨ Database compression by frequent patterns
¨ Mining sequential patterns (GSP, PrefixSpan, SPADE, etc.)
n Mining partial periodicity, cyclic associations, etc.
n Mining frequent structures, trends, etc.
n Typical application examples
¨ Market-basket analysis, Weblog analysis, DNA mining,
etc.


Some Research Problems


n Multi-dimensional gradient analysis: patterns regarding
changes and differences
¨ Not just counts—other measures, e.g., avg(profit)
n Mining top-k frequent patterns without support constraint
n Partial periodic patterns
n DNA sequence analysis and pattern classification


References
Frequent-pattern Mining Methods
n R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for
generation of frequent itemsets. Journal of Parallel and Distributed Computing, 2000.
n R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of
items in large databases. SIGMOD'93, 207-216, Washington, D.C.
n R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94
487-499, Santiago, Chile.
n J. Han, J. Pei, and Y. Yin: “Mining frequent patterns without candidate generation”. In
Proc. ACM-SIGMOD’2000, pp. 1-12, Dallas, TX, May 2000.
n H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering
association rules. KDD'94, 181-192, Seattle, WA, July 1994.


References
Frequent-pattern Mining Methods
n A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining
association rules in large databases. VLDB'95, 432-443, Zurich, Switzerland.
n C. Silverstein, S. Brin, R. Motwani, and J. Ullman. Scalable techniques for mining
causal structures. VLDB'98, 594-605, New York, NY.
n R. Srikant and R. Agrawal. Mining generalized association rules. VLDB'95, 407-419,
Zurich, Switzerland, Sept. 1995.
n R. Srikant and R. Agrawal. Mining quantitative association rules in large relational
tables. SIGMOD'96, 1-12, Montreal, Canada.
n H. Toivonen. Sampling large databases for association rules. VLDB'96, 134-145,
Bombay, India, Sept. 1996.
n M.J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New algorithms for fast discovery
of association rules. KDD’97. August 1997.


References
Performance Improvements
n S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and
implication rules for market basket analysis. SIGMOD'97, Tucson, Arizona, May 1997.
n D.W. Cheung, J. Han, V. Ng, and C.Y. Wong. Maintenance of discovered association
rules in large databases: An incremental updating technique. ICDE'96, New Orleans,
LA.
n T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Data mining using two-
dimensional optimized association rules: Scheme, algorithms, and visualization.
SIGMOD'96, Montreal, Canada.
n E.-H. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for association
rules. SIGMOD'97, Tucson, Arizona.
n J.S. Park, M.S. Chen, and P.S. Yu. An effective hash-based algorithm for mining
association rules. SIGMOD'95, San Jose, CA, May 1995.


References
Performance Improvements
n G. Piatetsky-Shapiro. Discovery, analysis, and presentation of strong rules. In G.
Piatetsky-Shapiro and W. J. Frawley, Knowledge Discovery in Databases. AAAI/MIT
Press, 1991.
n J.S. Park, M.S. Chen, and P.S. Yu. An effective hash-based algorithm for mining
association rules. SIGMOD'95, San Jose, CA.
n S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with
relational database systems: Alternatives and implications. SIGMOD'98, Seattle, WA.
n K. Yoda, T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Computing optimized
rectilinear regions for association rules. KDD'97, Newport Beach, CA, Aug. 1997.
n M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithm for discovery of
association rules. Data Mining and Knowledge Discovery, 1:343-374, 1997.


References
Multi-level, correlation, ratio rules, etc
n S. Brin, R. Motwani, and C. Silverstein. Beyond market basket: Generalizing association
rules to correlations. SIGMOD'97, 265-276, Tucson, Arizona.
n J. Han and Y. Fu. Discovery of multiple-level association rules from large databases.
VLDB'95, 420-431, Zurich, Switzerland.
n M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A.I. Verkamo. Finding
interesting rules from large sets of discovered association rules. CIKM'94, 401-408,
Gaithersburg, Maryland.
n F. Korn, A. Labrinidis, Y. Kotidis, and C. Faloutsos. Ratio rules: A new paradigm for fast,
quantifiable data mining. VLDB'98, 582-593, New York, NY
n B. Lent, A. Swami, and J. Widom. Clustering association rules. ICDE'97, 220-231,
Birmingham, England.
n R.J. Miller and Y. Yang. Association rules over interval data. SIGMOD'97, 452-461,
Tucson, Arizona.
n A. Savasere, E. Omiecinski, and S. Navathe. Mining for strong negative associations in a
large database of customer transactions. ICDE'98, 494-502, Orlando, FL, Feb. 1998.
n J. Pei, A.K.H. Tung, J. Han. Fault-Tolerant Frequent Pattern Mining: Problems and
Challenges. SIGMOD DMKD’01, Santa Barbara, CA.


References
Mining Max-patterns and Closed itemsets
n R. J. Bayardo. Efficiently mining long patterns from databases. SIGMOD'98, 85-93,
Seattle, Washington.
n J. Pei, J. Han, and R. Mao, "CLOSET: An Efficient Algorithm for Mining Frequent
Closed Itemsets", Proc. 2000 ACM-SIGMOD Int. Workshop on Data Mining and
Knowledge Discovery (DMKD'00), Dallas, TX, May 2000.
n N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets
for association rules. ICDT'99, 398-416, Jerusalem, Israel, Jan. 1999.
n M. Zaki. Generating Non-Redundant Association Rules. KDD'00. Boston, MA. Aug.
2000
n M. Zaki. CHARM: An Efficient Algorithm for Closed Association Rule Mining, SIAM’02


References
Constraint-based Mining
n G. Grahne, L. Lakshmanan, and X. Wang. Efficient mining of constrained correlated
sets. ICDE'00, 512-521, San Diego, CA, Feb. 2000.
n Y. Fu and J. Han. Meta-rule-guided mining of association rules in relational databases.
KDOOD'95, 39-46, Singapore, Dec. 1995.
n J. Han, L. V. S. Lakshmanan, and R. T. Ng, "Constraint-Based, Multidimensional Data
Mining", COMPUTER (special issues on Data Mining), 32(8): 46-50, 1999.
n L. V. S. Lakshmanan, R. Ng, J. Han and A. Pang, "Optimization of Constrained
Frequent Set Queries with 2-Variable Constraints", SIGMOD’99
n R. Ng, L.V.S. Lakshmanan, J. Han & A. Pang. “Exploratory mining and pruning
optimizations of constrained association rules.” SIGMOD’98
n J. Pei, J. Han, and L. V. S. Lakshmanan, "Mining Frequent Itemsets with Convertible
Constraints", Proc. 2001 Int. Conf. on Data Engineering (ICDE'01), April 2001.
n J. Pei and J. Han "Can We Push More Constraints into Frequent Pattern Mining?",
Proc. 2000 Int. Conf. on Knowledge Discovery and Data Mining (KDD'00), Boston, MA,
August 2000.
n R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item constraints.
KDD'97, 67-73, Newport Beach, California


References

Sequential Pattern Mining Methods


n R. Agrawal and R. Srikant. Mining sequential patterns. ICDE'95, 3-14, Taipei, Taiwan.
n R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and
performance improvements. EDBT’96.
n J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, M.-C. Hsu, "FreeSpan: Frequent
Pattern-Projected Sequential Pattern Mining", Proc. 2000 Int. Conf. on Knowledge
Discovery and Data Mining (KDD'00), Boston, MA, August 2000.
n H. Mannila, H Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event
sequences. Data Mining and Knowledge Discovery, 1:259-289, 1997.
n J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu, "PrefixSpan: Mining
Sequential Patterns Efficiently by Prefix-Projected Pattern Growth", Proc. 2001 Int.
Conf. on Data Engineering (ICDE'01), Heidelberg, Germany, April 2001.
n B. Ozden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules. ICDE'98,
412-421, Orlando, FL.
n S. Ramaswamy, S. Mahajan, and A. Silberschatz. On the discovery of interesting
patterns in association rules. VLDB'98, 368-379, New York, NY.
n M.J. Zaki. Efficient enumeration of frequent sequences. CIKM’98. Novermber 1998.
n M.N. Garofalakis, R. Rastogi, K. Shim: SPIRIT: Sequential Pattern Mining with Regular
Expression Constraints. VLDB 1999: 223-234, Edinburgh, Scotland.

References
Mining in Spatial, Multimedia, Text & Web Databases
n K. Koperski, J. Han, and G. B. Marchisio, "Mining Spatial and Image Data through
Progressive Refinement Methods", Revue internationale de géomatique (European
Journal of GIS and Spatial Analysis), 9(4):425-440, 1999.
n A. K. H. Tung, H. Lu, J. Han, and L. Feng, "Breaking the Barrier of Transactions:
Mining Inter-Transaction Association Rules", Proc. 1999 Int. Conf. on Knowledge
Discovery and Data Mining (KDD'99), San Diego, CA, Aug. 1999, pp. 297-301.
n J. Han, G. Dong and Y. Yin, "Efficient Mining of Partial Periodic Patterns in Time Series
Database", Proc. 1999 Int. Conf. on Data Engineering (ICDE'99), Sydney, Australia,
March 1999, pp. 106-115
n H. Lu, L. Feng, and J. Han, "Beyond Intra-Transaction Association Analysis:Mining
Multi-Dimensional Inter-Transaction Association Rules", ACM Transactions on
Information Systems (TOIS’00), 18(4): 423-454, 2000.
n O. R. Zaiane, M. Xin, J. Han, "Discovering Web Access Patterns and Trends by
Applying OLAP and Data Mining Technology on Web Logs," Proc. Advances in Digital
Libraries Conf. (ADL'98), Santa Barbara, CA, April 1998, pp. 19-29
n O. R. Zaiane, J. Han, and H. Zhu, "Mining Recurrent Items in Multimedia with
Progressive Resolution Refinement", ICDE'00, San Diego, CA, Feb. 2000, pp. 461-470


References

Mining for Classification and Data Cube Computation


n K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and iceberg cubes.
SIGMOD'99, 359-370, Philadelphia, PA, June 1999.
n M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman. Computing
iceberg queries efficiently. VLDB'98, 299-310, New York, NY, Aug. 1998.
n J. Han, J. Pei, G. Dong, and K. Wang, “Computing Iceberg Data Cubes with Complex
Measures”, Proc. ACM-SIGMOD’2001, Santa Barbara, CA, May 2001.
n M. Kamber, J. Han, and J. Y. Chiang. Metarule-guided mining of multi-dimensional
association rules using data cubes. KDD'97, 207-210, Newport Beach, California.
n K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and iceberg cubes.
SIGMOD’99
n T. Imielinski, L. Khachiyan, and A. Abdulghani. Cubegrades: Generalizing association
rules. Technical Report, Aug. 2000


