Data Mining: Concepts and Techniques
— Slides for Textbook —
— Chapter 6 —
Applications:
Basket data analysis, cross-marketing, catalog
design, loss-leader analysis, clustering,
classification, etc.
major(x, “CS”) ^ takes(x, “DB”) → grade(x, “A”) [1%, 75%]
Association Rule: Basic
Concepts
Given: (1) a database of transactions; (2) each transaction is a list of items (purchased by a customer in a visit)
Find: all rules that correlate the presence of one set of items with that of another set of items
Find all rules X & Y → Z with minimum support and confidence:
support, s: probability that a transaction contains {X, Y, Z}
confidence, c: conditional probability that a transaction having {X, Y} also contains Z
Example: C → A (50%, 100%)
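These two measures are easy to compute directly. Below is a minimal Python sketch; the toy transaction database is an assumption for illustration, chosen so the output reproduces C → A (50%, 100%).

def support(itemset, db):
    # Fraction of transactions that contain every item of `itemset`
    itemset = set(itemset)
    return sum(itemset <= t for t in db) / len(db)

def confidence(lhs, rhs, db):
    # Conditional probability that a transaction with `lhs` also has `rhs`
    return support(set(lhs) | set(rhs), db) / support(lhs, db)

# Illustrative transaction DB (assumed, not from the slide text)
transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]

print(support({"A", "C"}, transactions))       # 0.5 -> s = 50%
print(confidence({"C"}, {"A"}, transactions))  # 1.0 -> c = 100%, i.e. C -> A (50%, 100%)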
Association Rule Mining: A Road
Map
Boolean vs. quantitative associations (based on the types of values handled):
buys(x, “SQLServer”) ^ buys(x, “DMBook”) → buys(x, “DBMiner”) [0.2%, 60%]
age(x, “30..39”) ^ income(x, “42..48K”) → buys(x, “PC”) [1%, 75%]
Single-dimensional vs. multidimensional associations (see the examples above)
Single-level vs. multiple-level analysis:
Items bought are referenced at different levels of abstraction, e.g.,
age(x, “30..39”) → buys(x, “laptop computer”)
age(x, “30..39”) → buys(x, “computer”)
Example of generating candidates, with L3 = {abc, abd, acd, ace, bcd}:
Self-joining: L3 * L3
abcd from abc and abd
acde from acd and ace
Pruning:
acde is removed because ade is not in L3
C4 = {abcd}
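The self-join and prune steps translate almost line for line into code. A sketch (the frozenset representation and the name apriori_gen follow common usage, not the textbook's exact pseudocode):

from itertools import combinations

def apriori_gen(Lk):
    # Lk: set of frozensets, each a frequent k-itemset
    k = len(next(iter(Lk)))
    sorted_sets = [tuple(sorted(s)) for s in Lk]
    joined = set()
    # Self-joining: Lk * Lk -- merge itemsets agreeing on their first k-1 items
    for a in sorted_sets:
        for b in sorted_sets:
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                joined.add(frozenset(a + (b[-1],)))
    # Pruning: drop candidates that have an infrequent k-subset
    return {c for c in joined
            if all(frozenset(sub) in Lk for sub in combinations(c, k))}

L3 = {frozenset(s) for s in ("abc", "abd", "acd", "ace", "bcd")}
print(apriori_gen(L3))  # {frozenset({'a','b','c','d'})}; acde pruned since ade not in L3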
Methods to Improve Apriori’s
Efficiency
Hash-based itemset counting: a k-itemset whose corresponding hash bucket count is below the threshold cannot be frequent (see the sketch after this list)
Transaction reduction: A transaction that does not contain
any frequent k-itemset is useless in subsequent scans
Partitioning: Any itemset that is potentially frequent in DB
must be frequent in at least one of the partitions of DB
Sampling: mining on a subset of given data, lower support
threshold + a method to determine the completeness
Dynamic itemset counting: add new candidate itemsets
only when all of their subsets are estimated to be frequent
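As an illustration of the first technique, here is a DHP-style hash-filter sketch for 2-itemsets; the bucket count, hash function, and toy database are arbitrary choices, not from the slides.

from itertools import combinations

def pair_bucket_counts(db, n_buckets=11):
    # While scanning for frequent 1-itemsets, also hash every pair in each
    # transaction into a bucket and count the hits per bucket
    buckets = [0] * n_buckets
    for t in db:
        for pair in combinations(sorted(t), 2):
            buckets[hash(pair) % n_buckets] += 1
    return buckets

def may_be_frequent(pair, buckets, min_count):
    # A 2-itemset whose bucket count is below min_count cannot be frequent
    # (hash randomization is fine here: counts and lookups share one run)
    return buckets[hash(tuple(sorted(pair))) % len(buckets)] >= min_count

db = [{"a", "b", "c"}, {"a", "c"}, {"a", "d"}, {"b", "e"}]
buckets = pair_bucket_counts(db)
print(may_be_frequent(("a", "c"), buckets, min_count=2))  # True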
Is Apriori Fast Enough? —
Performance Bottlenecks
Huge candidate sets:
10^4 frequent 1-itemsets will generate 10^7 candidate 2-itemsets
To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates
Multiple scans of database:
needs (n + 1) scans, where n is the length of the longest pattern
Mining Frequent Patterns
Without Candidate
Generation
Compress a large database into a compact,
Frequent-Pattern tree (FP-tree) structure
highly condensed, but complete for frequent
pattern mining
avoid costly database scans
Develop an efficient, FP-tree-based frequent
pattern mining method
A divide-and-conquer methodology: decompose
mining tasks into smaller ones
Avoid candidate generation: sub-database test only!
Construct FP-tree from a
Transaction DB
min_support = 0.5 (i.e., minimum count = 3)

TID  Items bought               (ordered) frequent items
100  {f, a, c, d, g, i, m, p}   {f, c, a, m, p}
200  {a, b, c, f, l, m, o}      {f, c, a, b, m}
300  {b, f, h, j, o}            {f, b}
400  {b, c, k, s, p}            {c, b, p}
500  {a, f, c, e, l, p, m, n}   {f, c, a, m, p}

Steps:
1. Scan DB once, find frequent 1-itemsets (single item patterns)
2. Order frequent items in frequency descending order
3. Scan DB again, construct the FP-tree

Header table (item : frequency): f : 4, c : 4, a : 3, b : 3, m : 3, p : 3

[FP-tree figure: root {} with branches f:4 → c:3 → a:3 → m:2 → p:2, a:3 → b:1 → m:1, f:4 → b:1, and c:1 → b:1 → p:1]
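A compact Python sketch of the two-scan construction (class and function names are mine; tie-breaking among equally frequent items is arbitrary here, whereas the slide's tree happens to list f before c):

from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}

def build_fptree(db, min_count):
    freq = defaultdict(int)
    for t in db:                              # scan 1: frequent 1-itemsets
        for item in set(t):
            freq[item] += 1
    freq = {i: c for i, c in freq.items() if c >= min_count}
    root, header = FPNode(None, None), defaultdict(list)
    for t in db:                              # scan 2: insert ordered frequent items
        node = root
        for item in sorted((i for i in set(t) if i in freq),
                           key=lambda i: (-freq[i], i)):
            if item not in node.children:     # share common prefixes
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])  # node-links
            node = node.children[item]
            node.count += 1
    return root, header, freq

db = [list("facdgimp"), list("abcflmo"), list("bfhjo"),
      list("bcksp"), list("afcelpmn")]
root, header, freq = build_fptree(db, min_count=3)
print(sorted(freq.items()))  # [('a',3), ('b',3), ('c',4), ('f',4), ('m',3), ('p',3)]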
Benefits of the FP-tree Structure
Completeness:
never breaks a long pattern of any transaction
preserves complete information for frequent pattern mining
Compactness:
reduces irrelevant information: infrequent items are gone
frequency descending ordering: more frequent items are more likely to be shared
the tree is never larger than the original database (not counting node-links and counts)

Mining Frequent Patterns Using the FP-tree
Method:
For each item, construct its conditional pattern base, and then its conditional FP-tree
Repeat the process on each newly created conditional FP-tree
Until the resulting FP-tree is empty, or it contains only a single path (a single path generates all the combinations of its sub-paths, each of which is a frequent pattern)
To build a conditional pattern base and conditional FP-tree:
Accumulate all transformed prefix paths of that item to form its conditional pattern base
Construct the FP-tree for the frequent items of the pattern base
Example: mining all patterns containing m
m-conditional pattern base: (fca:2), (fcab:1); m-conditional FP-tree: {} → f:3 → c:3 → a:3
Cond. pattern base of “am”: (fc:3); am-conditional FP-tree: {} → f:3 → c:3
Cond. pattern base of “cm”: (f:3); cm-conditional FP-tree: {} → f:3
All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam
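The mining method as a recursive sketch, reusing build_fptree and the root/header/freq from the construction sketch; on this example it yields exactly the eight m-patterns listed above.

def fp_growth(root, header, freq, min_count, suffix=()):
    # For each frequent item, emit (item + suffix), then recurse on the
    # conditional FP-tree built from its conditional pattern base
    for item in sorted(freq, key=lambda i: freq[i]):
        pattern = (item,) + suffix
        yield pattern, freq[item]
        cond_db = []
        for node in header[item]:             # follow the node-links
            path, p = [], node.parent
            while p.item is not None:         # collect the prefix path
                path.append(p.item)
                p = p.parent
            cond_db.extend([path] * node.count)
        if cond_db:
            sub = build_fptree(cond_db, min_count)
            yield from fp_growth(*sub, min_count, pattern)

patterns = dict(fp_growth(root, header, freq, min_count=3))
print(sorted(p for p in patterns if "m" in p))
# [('a','m'), ('c','a','m'), ('c','m'), ('f','a','m'), ('f','c','a','m'),
#  ('f','c','m'), ('f','m'), ('m',)]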
Principles of Frequent Pattern
Growth
Pattern growth property:
Let α be a frequent itemset in DB, B be α's conditional pattern base, and β be an itemset in B. Then α ∪ β is a frequent itemset in DB iff β is frequent in B.
Example: “abcdef” is a frequent pattern if and only if
“abcde” is a frequent pattern, and
“f” is frequent in the set of transactions containing “abcde”.
Why Is Frequent Pattern Growth
Fast?
Our performance study shows
FP-growth is an order of magnitude faster
than Apriori, and is also faster than tree-
projection
Reasoning
No candidate generation, no candidate test
Use compact data structure
Eliminate repeated database scan
Basic operations are counting and FP-tree building
Chapter 6: Mining
Association Rules in Large
Databases
Association rule mining
Mining single-dimensional Boolean association
rules from transactional databases
Mining multilevel association rules from
transactional databases
Mining multidimensional association rules from
transactional databases and data warehouse
From association mining to correlation analysis
Constraint-based association mining
Summary
Multiple-Level Association Rules
Items often form a hierarchy (e.g., all → computer → laptop computer).
Items at the lower level are expected to have lower support.
Rules regarding itemsets at appropriate levels could be quite useful.
Transaction databases can be encoded based on dimensions and levels.
We can explore shared multi-level mining.
A Concept Hierarchy
[Figure: a concept hierarchy over items]
all
  computer: desktop, laptop
  software: education, finance
  printer: color, b/w
  computer accessory
(vendors such as IBM, Dell, Microsoft, HP, and Sony appear at the leaf level)
Mining Multi-Level Associations
Level-by-level independent
Level-cross filtering by k-itemset
Level-cross filtering by single item
Controlled level-cross filtering by single item
Uniform support example:
Level 1 (min_sup = 5%): computer [support = 10%]
Reduced Support
Example: controlled level-cross filtering by single item
Level 1 (min_sup = 12%): computer [support = 10%], not frequent at level 1
Level passage threshold: 8%, which computer's 10% support passes
Level 2 (min_sup = 3%): laptop computer, desktop computer
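A minimal sketch of this filtering scheme with the figure's thresholds; the level-2 supports (6% and 4%) are assumed for illustration, only the thresholds come from the figure.

# Supports: level-1 value from the figure, level-2 values assumed
support = {"computer": 0.10, "laptop computer": 0.06, "desktop computer": 0.04}
parent = {"laptop computer": "computer", "desktop computer": "computer"}
min_sup = {1: 0.12, 2: 0.03}     # reduced support: lower threshold at the lower level
level_passage = 0.08             # set between the two min_sup values

def frequent(item, level):
    # A level-2 item is examined only if its level-1 parent reaches the
    # level passage threshold, even when the parent itself is not frequent
    if level == 2 and support[parent[item]] < level_passage:
        return False
    return support[item] >= min_sup[level]

print(frequent("computer", 1))         # False: 10% < 12%
print(frequent("laptop computer", 2))  # True: computer passes 8%, and 6% >= 3%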
Example: “computer → b/w printer”, where the items within the rule are referenced at different levels of abstraction.
Quantitative attributes: numeric, with an implicit ordering among values (e.g., age, income, price)
Search for frequent k-predicate sets, e.g., {age, income, buys} is a 3-predicate set
Techniques can be categorized by how quantitative attributes are treated:
1. Using static discretization of quantitative
attributes
Quantitative attributes are statically
discretized by using predefined concept
hierarchies.
2. Quantitative association rules
Quantitative attributes are dynamically discretized into “bins” based on the distribution of the data.
3. Distance-based association rules
This is a dynamic discretization process that considers the distance between data points.
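To contrast static and dynamic discretization, a small sketch; the age values and the ranges are illustrative, not from the text.

ages = [23, 31, 35, 38, 44, 52, 61, 63]

# 1. Static discretization: predefined concept-hierarchy ranges
def static_bin(age):
    for lo, hi in [(20, 29), (30, 39), (40, 49), (50, 59), (60, 69)]:
        if lo <= age <= hi:
            return f"{lo}..{hi}"

# 2. Dynamic discretization: equal-frequency bins follow the data distribution
def equal_frequency_bins(values, k):
    s = sorted(values)
    n = len(s)
    return [s[i * n // k:(i + 1) * n // k] for i in range(k)]

print([static_bin(a) for a in ages])   # fixed ranges, however the data falls
print(equal_frequency_bins(ages, 4))   # bin edges adapt to the distribution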
Static Discretization of
Quantitative Attributes
Discretized prior to mining using concept hierarchies; numeric values are replaced by ranges.
In a relational database, finding all frequent k-predicate sets will require k or k+1 table scans.
A data cube is well suited for mining: the cells of an n-dimensional cuboid correspond to the predicate sets, so mining from data cubes can be much faster.

Lattice of cuboids over (age, income, buys):
()
(age) (income) (buys)
(age, income) (age, buys) (income, buys)
(age, income, buys)
Quantitative Association Rules
Cluster “adjacent” association rules to form general rules using a 2-D grid.
Steps:
1. Binning
2. Find frequent predicate sets
3. Clustering
Example: item_type(x, “electronic”) ^ manufacturer(x, “foreign”) → price(x, “$200”)
Association rules do not allow for the approximation of attribute values: the support and confidence measures do not consider the closeness of values for a given attribute. That is why distance-based association rules are required.

The diameter of cluster S projected on attribute set X:
d(S[X]) = Σ_{i=1..N} Σ_{j=1..N} dist_X(t_i[X], t_j[X]) / (N(N−1))
where N is the number of tuples in S and dist_X is a distance metric, such as Euclidean distance or Manhattan distance.
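A direct transcription of d(S[X]) in Python, using Manhattan distance on a single numeric attribute; the sample values are illustrative.

from itertools import combinations

def diameter(values, dist=lambda a, b: abs(a - b)):
    # d(S[X]): average distance over all ordered pairs (i != j);
    # each unordered pair contributes twice, hence the factor 2
    n = len(values)
    return 2 * sum(dist(a, b) for a, b in combinations(values, 2)) / (n * (n - 1))

print(diameter([7, 20, 22]))  # 10.0 -- tight clusters give small diameters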
Clusters and Distance
Measurements(Cont.)
The diameter, d, assesses the density of a cluster C_X over an attribute set X:
d(C_X) ≤ d0 (density threshold)
|C_X| ≥ s0 (frequency threshold)
Finding clusters and distance-based rules:
the density threshold, d0, replaces the notion of support
a modified version of the BIRCH clustering algorithm is used
Chapter 6: Mining
Association Rules in Large
Databases
Association rule mining
Mining single-dimensional Boolean association
rules from transactional databases
Mining multilevel association rules from
transactional databases
Mining multidimensional association rules from
transactional databases and data warehouse
From association mining to correlation analysis
Constraint-based association mining
Summary
Interestingness Measurements
Objective measures:
two popular measures: support and confidence
Subjective measures:
a rule (pattern) is interesting if it is unexpected (surprising to the user) and/or actionable (the user can do something with it)
Criticism to Support and
Confidence
Example 1 (Aggarwal & Yu, PODS ’98):
Among 5000 students:
3000 play basketball (B)
3750 eat cereal (C)
2000 both play basketball and eat cereal

lift = P(A ∪ B) / (P(A) P(B))

lift(B, C) = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89
lift(B, ¬C) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33

Although “play basketball → eat cereal” has 66.7% confidence, 75% of all students eat cereal, and the lift below 1 shows that B and C are in fact negatively correlated.
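The same arithmetic in a few lines of code (the slide writes P(A ∪ B) for what is computed as the joint probability of A and B):

def lift(p_ab, p_a, p_b):
    # lift > 1: positive correlation; lift < 1: negative correlation
    return p_ab / (p_a * p_b)

print(round(lift(2000/5000, 3000/5000, 3750/5000), 2))  # 0.89: basketball vs. cereal
print(round(lift(1000/5000, 3000/5000, 1250/5000), 2))  # 1.33: basketball vs. no cereal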
Criticism to Support and
Confidence (Cont.)
Example 2:
X 1 1 1 1 0 0 0 0
Y 1 1 0 0 0 0 0 0
Z 0 1 1 1 1 1 1 1
X and Y are positively correlated, while X and Z are negatively related, yet support and confidence would rank X ⇒ Z higher.
Constraint-based mining: making good use of constraints!
What kinds of constraints can be used in mining?
Knowledge type constraint: classification,
association, etc.
Data constraint: SQL-like queries
Find product pairs sold together in Vancouver in Dec.’98.
Dimension/level constraints:
in relevance to region, price, brand, customer category.
Rule constraints
small sales (price < $10) triggers big sales (sum > $200).
Interestingness constraints:
thresholds on measures such as support and confidence (e.g., strong rules only)
Rule Constraints in Association
Mining
Two kinds of rule constraints:
Rule form constraints: meta-rule guided mining
Rule content constraints: constraint-based query optimization; a two-variable constraint confines both sides of the rule (L and R), e.g.,
sum(LHS) < min(RHS) ^ max(RHS) < 5 × sum(LHS)
Constraint-Based Association Query
Database: (1) trans(TID, Itemset), (2) itemInfo(Item, Type, Price)
A constrained association query (CAQ) is of the form {(S1, S2) | C}, where C is a set of constraints on S1, S2, including a frequency constraint.
A classification of (single-variable) constraints:
Class constraint: S ⊆ A, e.g., S ⊆ Item
Domain constraints:
S θ v, θ ∈ {=, ≠, <, ≤, >, ≥}, e.g., S.Price < 100
v θ S, θ is ∈ or ∉, e.g., snacks ∉ S.Type
V θ S, or S θ V, θ ∈ {⊆, ⊂, ⊄, =, ≠}, e.g., {snacks, sodas} ⊆ S.Type
Aggregation constraints: agg(S) θ v, where agg ∈ {min, max, sum, count, avg} and θ ∈ {=, ≠, <, ≤, >, ≥}, e.g., count(S1.Type) = 1, avg(S2.Price) < 100
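These constraint classes translate directly into predicates over a candidate itemset S. A sketch against a made-up itemInfo table:

# Made-up itemInfo(Item, Type, Price) table for illustration
item_type = {"chips": "snacks", "cola": "sodas", "TV": "electronics"}
price = {"chips": 3, "cola": 2, "TV": 500}

S = {"chips", "cola"}

# Domain constraint: S.Price < 100 (every item's price below 100)
print(all(price[i] < 100 for i in S))                       # True

# Domain constraint: {snacks, sodas} is a subset of S.Type
print({"snacks", "sodas"} <= {item_type[i] for i in S})     # True

# Aggregation constraints: count(S) = 2 and avg(S.Price) < 100
print(len(S) == 2 and sum(price[i] for i in S) / len(S) < 100)  # True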
Constrained Association Query
Optimization Problem
Given a CAQ = {(S1, S2) | C}, the algorithm should be:
sound: it only finds frequent sets that satisfy the given constraints C
complete: all frequent sets satisfying the given constraints C are found
Constraint properties exploited for optimization:
Anti-monotonicity
Monotonicity
Succinctness
Convertible constraints
Inconvertible constraints