Data Mining: Concepts and Techniques
— Slides for Textbook —
— Chapter 6 —

©Jiawei Han and Micheline Kamber
Intelligent Database Systems Research Lab
School of Computing Science
Simon Fraser University, Canada
http://www.cs.sfu.ca
Chapter 6: Mining Association Rules in Large Databases

- Association rule mining
- Mining single-dimensional Boolean association rules from transactional databases
- Mining multilevel association rules from transactional databases
- Mining multidimensional association rules from transactional databases and data warehouses
- From association mining to correlation analysis
- Constraint-based association mining
- Summary
What Is Association Mining?

- Association rule mining: finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
- Applications: basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
Examples

Rule form: "Body → Head [support, confidence]"

- buys(x, "computers") → buys(x, "financial_mgmt_s/w") [2%, 60%]
- major(x, "CS") ∧ takes(x, "DB") → grade(x, "A") [1%, 75%]
Association Rule: Basic Concepts

- Given: (1) a database of transactions, (2) each transaction is a list of items (purchased by a customer in a visit)
- Find: all rules that correlate the presence of one set of items with that of another set of items
  - E.g., 98% of people who purchase tires and auto accessories also get automotive services done
- Applications
  - * → Maintenance Agreement (what the store should do to boost Maintenance Agreement sales)
  - Home Electronics → * (what other products the store should stock up on)
  - Attached mailing in direct marketing
Rule Measures: Support and Confidence

(Figure: Venn diagram of customers who buy computers, customers who buy
financial management software, and customers who buy both.)

Find all the rules X ∧ Y → Z with minimum support and confidence:
- support, s: probability that a transaction contains {X, Y, Z}
- confidence, c: conditional probability that a transaction having {X, Y} also contains Z
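As a concrete illustration, here is a minimal Python sketch of these two measures (my own example, not from the slides), run on the transaction table shown on the next slide:

def support(itemset, transactions):
    # Fraction of transactions containing every item in `itemset`.
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(lhs, rhs, transactions):
    # P(t contains rhs | t contains lhs) = support(lhs U rhs) / support(lhs).
    return support(lhs | rhs, transactions) / support(lhs, transactions)

transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
print(support({"A", "C"}, transactions))        # 0.5
print(confidence({"A"}, {"C"}, transactions))   # 0.666...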
Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Let minimum support = 50% and minimum confidence = 50%. We have:
- A → C (50%, 66.6%)
- C → A (50%, 100%)
Association Rule Mining: A Road Map

- Boolean vs. quantitative associations (based on the types of values handled)
  - buys(x, "SQLServer") ∧ buys(x, "DMBook") → buys(x, "DBMiner") [0.2%, 60%]
  - age(x, "30..39") ∧ income(x, "42..48K") → buys(x, "PC") [1%, 75%]
- Single-dimensional vs. multidimensional associations (see the examples above)
- Single-level vs. multiple-level analysis
  - Items bought are referenced at different levels of abstraction:
    - age(x, "30..39") → buys(x, "laptop computer")
    - age(x, "30..39") → buys(x, "computer")
Chapter 6: Mining Association Rules in Large Databases

- Association rule mining
- Mining single-dimensional Boolean association rules from transactional databases
- Mining multilevel association rules from transactional databases
- Mining multidimensional association rules from transactional databases and data warehouses
- From association mining to correlation analysis
- Constraint-based association mining
- Summary
Mining Association Rules — An Example

Min. support: 50%, min. confidence: 50%

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Frequent Itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A, C}             50%

For rule A → C:
  support = support({A, C}) = 50%
  confidence = support({A, C}) / support({A}) = 66.6%

The Apriori principle: any subset of a frequent itemset must also be frequent.
Mining Frequent Itemsets: the Key Step

- Find the frequent itemsets: the sets of items that have minimum support
  - A subset of a frequent itemset must also be a frequent itemset; i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets
  - Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
- Use the frequent itemsets to generate association rules
The Apriori Algorithm

- Join step: Ck is generated by joining Lk-1 with itself
- Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
- Pseudo-code:
    Ck: candidate itemset of size k
    Lk: frequent itemset of size k

    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in database do
            increment the count of all candidates in Ck+1 that are contained in t;
        Lk+1 = candidates in Ck+1 with min_support;
    end
    return ∪k Lk;
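To make the pseudo-code concrete, here is a compact, runnable Python sketch of the same loop (my own illustration, not from the slides); min_sup is an absolute count:

from itertools import combinations

def apriori(transactions, min_sup):
    """Return {frozenset: count} of all frequent itemsets (Apriori)."""
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s: c for s, c in counts.items() if c >= min_sup}
    frequent = dict(Lk)
    k = 1
    while Lk:
        # Join step: merge pairs of frequent k-itemsets into (k+1)-candidates,
        # then prune any candidate with an infrequent k-subset.
        items = list(Lk)
        candidates = set()
        for i in range(len(items)):
            for j in range(i + 1, len(items)):
                union = items[i] | items[j]
                if len(union) == k + 1 and all(
                        frozenset(s) in Lk for s in combinations(union, k)):
                    candidates.add(union)
        # Count all surviving candidates with one scan of the database.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        Lk = {s: c for s, c in counts.items() if c >= min_sup}
        frequent.update(Lk)
        k += 1
    return frequent

# The example database from the next slide; min support 50% of 4 transactions = 2.
db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
for s, c in sorted(apriori(db, 2).items(), key=lambda x: (len(x[0]), sorted(x[0]))):
    print(set(s), c)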
The Apriori Algorithm — Example

Database D:
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D → C1 counts: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
L1: {1}:2, {2}:3, {3}:3, {5}:3

C2 (join L1 with itself): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D → C2 counts: {1 2}:1, {1 3}:2, {1 5}:1, {2 3}:2, {2 5}:3, {3 5}:2
L2: {1 3}:2, {2 3}:2, {2 5}:3, {3 5}:2

C3: {2 3 5}
Scan D → L3: {2 3 5}:2
How to Generate Candidates?

- Suppose the items in Lk-1 are listed in an order
- Step 1: self-joining Lk-1
    insert into Ck
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
- Step 2: pruning
    forall itemsets c in Ck do
        forall (k-1)-subsets s of c do
            if (s is not in Lk-1) then
                delete c from Ck
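The same two steps in Python (my illustration), using sorted tuples so the "share the first k-2 items, differ in the last" join condition can be checked directly:

from itertools import combinations

def generate_candidates(Lk_1, k):
    """Self-join + prune: (k-1)-itemsets as sorted tuples -> k-candidates."""
    Lset = set(Lk_1)
    Ck = set()
    for p in Lk_1:
        for q in Lk_1:
            # Join: equal on the first k-2 items, p's last item < q's last item.
            if p[:k - 2] == q[:k - 2] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # Prune: every (k-1)-subset of c must be frequent.
                if all(s in Lset for s in combinations(c, k - 1)):
                    Ck.add(c)
    return Ck

# The example from the slide after next:
L3 = [("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")]
print(generate_candidates(sorted(L3), 4))   # {('a', 'b', 'c', 'd')}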
How to Count Supports of Candidates?

- Why is counting supports of candidates a problem?
  - The total number of candidates can be very large
  - One transaction may contain many candidates
- Method:
  - Candidate itemsets are stored in a hash-tree
  - A leaf node of the hash-tree contains a list of itemsets and counts
  - An interior node contains a hash table
  - Subset function: finds all the candidates contained in a transaction
Example of Generating Candidates

- L3 = {abc, abd, acd, ace, bcd}
- Self-joining: L3 * L3
  - abcd from abc and abd
  - acde from acd and ace
- Pruning:
  - acde is removed because ade is not in L3
- C4 = {abcd}
Methods to Improve Apriori's Efficiency

- Hash-based itemset counting: a k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
- Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans
- Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
- Sampling: mine a subset of the given data with a lowered support threshold, plus a method to determine completeness
- Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent
Is Apriori Fast Enough? — Performance Bottlenecks

- The core of the Apriori algorithm:
  - Use frequent (k-1)-itemsets to generate candidate frequent k-itemsets
  - Use database scans and pattern matching to collect counts for the candidate itemsets
- The bottleneck of Apriori: candidate generation
  - Huge candidate sets:
    - 10^4 frequent 1-itemsets will generate 10^7 candidate 2-itemsets
    - To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates
  - Multiple scans of the database:
    - Needs (n + 1) scans, where n is the length of the longest pattern
Mining Frequent Patterns Without Candidate Generation

- Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure
  - highly condensed, but complete for frequent pattern mining
  - avoids costly database scans
- Develop an efficient, FP-tree-based frequent pattern mining method
  - A divide-and-conquer methodology: decompose mining tasks into smaller ones
  - Avoid candidate generation: sub-database test only!
Construct FP-tree from a Transaction DB

min_support = 0.5

TID   Items bought                 (ordered) frequent items
100   {f, a, c, d, g, i, m, p}     {f, c, a, m, p}
200   {a, b, c, f, l, m, o}        {f, c, a, b, m}
300   {b, f, h, j, o}              {f, b}
400   {b, c, k, s, p}              {c, b, p}
500   {a, f, c, e, l, p, m, n}     {f, c, a, m, p}

Steps:
1. Scan DB once, find the frequent 1-itemsets (single-item patterns)
2. Order frequent items in frequency-descending order: f:4, c:4, a:3, b:3, m:3, p:3
3. Scan DB again, construct the FP-tree

(Figure: header table linking each item to its nodes, and the FP-tree rooted at {}
with paths f:4-c:3-a:3-m:2-p:2, f:4-c:3-a:3-b:1-m:1, f:4-b:1, and c:1-b:1-p:1.)
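A minimal FP-tree construction in Python (my own sketch, assuming a simple node class with child links and a header table of node-links; ties in frequency are broken alphabetically, which may order items differently from the slide's figure, but any fixed total order works):

from collections import Counter, defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}          # item -> FPNode

def build_fptree(transactions, min_sup):
    """Two scans: count items, then insert ordered frequent items."""
    freq = Counter(i for t in transactions for i in t)
    freq = {i: c for i, c in freq.items() if c >= min_sup}
    root = FPNode(None, None)
    header = defaultdict(list)       # item -> list of its nodes (node-links)
    for t in transactions:
        # Keep only frequent items, ordered by descending frequency.
        items = sorted((i for i in t if i in freq),
                       key=lambda i: (-freq[i], i))
        node = root
        for i in items:
            if i not in node.children:
                child = FPNode(i, node)
                node.children[i] = child
                header[i].append(child)
            node = node.children[i]
            node.count += 1
    return root, header

# The five transactions above; support 0.5 of 5 transactions -> count >= 3.
db = ["facdgimp", "abcflmo", "bfhjo", "bcksp", "afcelpmn"]
root, header = build_fptree([set(t) for t in db], min_sup=3)
print({i: sum(n.count for n in header[i]) for i in header})
# item supports: f:4, c:4, a:3, b:3, m:3, p:3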
Benefits of the FP-tree Structure

- Completeness:
  - never breaks a long pattern of any transaction
  - preserves complete information for frequent pattern mining
- Compactness:
  - reduces irrelevant information: infrequent items are gone
  - frequency-descending ordering: more frequent items are more likely to be shared
  - never larger than the original database (not counting node-links and counts)
  - Example: for the Connect-4 DB, the compression ratio can be over 100
Mining Frequent Patterns Using FP-tree

- General idea (divide-and-conquer)
  - Recursively grow frequent patterns using the FP-tree
- Method
  - For each item, construct its conditional pattern base, and then its conditional FP-tree
  - Repeat the process on each newly created conditional FP-tree
  - Until the resulting FP-tree is empty, or it contains only one path (a single path will generate all the combinations of its sub-paths, each of which is a frequent pattern)
Major Steps to Mine FP-tree

1) Construct the conditional pattern base for each node in the FP-tree
2) Construct the conditional FP-tree from each conditional pattern base
3) Recursively mine conditional FP-trees and grow frequent patterns obtained so far
   - If the conditional FP-tree contains a single path, simply enumerate all the patterns
Step 1: From FP-tree to Conditional Pattern Base

- Start at the frequent-item header table of the FP-tree
- Traverse the FP-tree by following the node-links of each frequent item
- Accumulate all transformed prefix paths of that item to form its conditional pattern base

(Figure: the FP-tree and header table from the construction slide above.)

Conditional pattern bases:
item   conditional pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1
Properties of FP-tree for Conditional Pattern Base Construction

- Node-link property
  - For any frequent item ai, all the possible frequent patterns that contain ai can be obtained by following ai's node-links, starting from ai's head in the FP-tree header
- Prefix path property
  - To calculate the frequent patterns for a node ai in a path P, only the prefix sub-path of ai in P needs to be accumulated, and its frequency count should carry the same count as node ai
Step 2: Construct Conditional FP-tree

- For each pattern base:
  - Accumulate the count for each item in the base
  - Construct the FP-tree for the frequent items of the pattern base

Example for m:
  m-conditional pattern base: fca:2, fcab:1
  m-conditional FP-tree: {} → f:3 → c:3 → a:3 (b is dropped: count 1 is below min support)
  All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam
Mining Frequent Patterns by Creating Conditional Pattern Bases

Item   Conditional pattern base      Conditional FP-tree
p      {(fcam:2), (cb:1)}            {(c:3)}|p
m      {(fca:2), (fcab:1)}           {(f:3, c:3, a:3)}|m
b      {(fca:1), (f:1), (c:1)}       Empty
a      {(fc:3)}                      {(f:3, c:3)}|a
c      {(f:3)}                       {(f:3)}|c
f      Empty                         Empty
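Putting steps 1-3 together, a recursive FP-growth sketch (again my own illustration, reusing the build_fptree helper from the earlier sketch):

def fp_growth(transactions, min_sup, suffix=frozenset()):
    """Yield (frequent itemset, support) pairs by recursive pattern growth."""
    root, header = build_fptree(transactions, min_sup)
    for item, nodes in header.items():
        support = sum(n.count for n in nodes)
        pattern = suffix | {item}
        yield pattern, support
        # Conditional pattern base: prefix paths of `item`, weighted by count.
        cond_base = []
        for n in nodes:
            path, p = [], n.parent
            while p.item is not None:
                path.append(p.item)
                p = p.parent
            cond_base.extend([set(path)] * n.count)
        # Recurse on the conditional FP-tree of `pattern`.
        yield from fp_growth(cond_base, min_sup, pattern)

db = ["facdgimp", "abcflmo", "bfhjo", "bcksp", "afcelpmn"]
patterns = dict(fp_growth([set(t) for t in db], min_sup=3))
print(patterns[frozenset("fcam")])   # 3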
Step 3: Recursively Mine the Conditional FP-tree

From the m-conditional FP-tree ({} → f:3 → c:3 → a:3):

- Conditional pattern base of "am": (fc:3)
  am-conditional FP-tree: {} → f:3 → c:3
- Conditional pattern base of "cm": (f:3)
  cm-conditional FP-tree: {} → f:3
- Conditional pattern base of "cam": (f:3)
  cam-conditional FP-tree: {} → f:3
Single FP-tree Path Generation

- Suppose an FP-tree T has a single path P
- The complete set of frequent patterns of T can be generated by enumerating all the combinations of the sub-paths of P

Example: the m-conditional FP-tree {} → f:3 → c:3 → a:3 yields all frequent
patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam
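The enumeration itself is just a power set over the path items, each combined with the current suffix; a quick sketch of mine:

from itertools import chain, combinations

def single_path_patterns(path, suffix):
    """All combinations of sub-path items, each unioned with the suffix."""
    subsets = chain.from_iterable(combinations(path, r)
                                  for r in range(len(path) + 1))
    return [set(s) | suffix for s in subsets]

print(single_path_patterns(["f", "c", "a"], {"m"}))
# 8 patterns: m, fm, cm, am, fcm, fam, cam, fcam (set print order may vary)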
Principles of Frequent Pattern Growth

- Pattern growth property
  - Let α be a frequent itemset in DB, B be α's conditional pattern base, and β be an itemset in B. Then α ∪ β is a frequent itemset in DB iff β is frequent in B.
- "abcdef" is a frequent pattern, if and only if
  - "abcde" is a frequent pattern, and
  - "f" is frequent in the set of transactions containing "abcde"
Why Is Frequent Pattern Growth Fast?

- Our performance study shows
  - FP-growth is an order of magnitude faster than Apriori, and is also faster than tree-projection
- Reasoning
  - No candidate generation, no candidate test
  - Uses a compact data structure
  - Eliminates repeated database scans
  - The basic operations are counting and FP-tree building
Chapter 6: Mining Association Rules in Large Databases

- Association rule mining
- Mining single-dimensional Boolean association rules from transactional databases
- Mining multilevel association rules from transactional databases
- Mining multidimensional association rules from transactional databases and data warehouses
- From association mining to correlation analysis
- Constraint-based association mining
- Summary
Multiple-Level Association Rules

- Items often form a hierarchy (e.g., all > computer, printer; computer > desktop, laptop; printer > color, b/w; with brands such as Dell and HP below).
- Items at the lower level are expected to have lower support.
- Rules regarding itemsets at appropriate levels could be quite useful.
- A transaction database can be encoded based on dimensions and levels.
- We can explore shared multi-level mining.
A Concept Hierarchy

all
├─ computer
│  ├─ desktop (IBM, Dell)
│  └─ laptop
└─ printer
   ├─ color
   └─ b/w
A Concept Hierarchy (Expanded)

all
├─ computer
│  ├─ desktop
│  └─ laptop
├─ software
│  ├─ edu s/w
│  └─ finance
├─ printer
│  ├─ color
│  └─ b/w
└─ computer accessory

(with brands such as IBM, Microsoft, HP, and Sony at the lowest level)
Mining Multi-Level Associations

- A top-down, progressive deepening approach:
  - First find high-level strong rules:
      computer → printer [20%, 60%]
  - Then find their lower-level "weaker" rules:
      laptop computer → color printer [6%, 50%]
Multi-level Association: Uniform Support vs. Reduced Support

- Uniform support: the same minimum support for all levels
  - + One minimum support threshold; no need to examine itemsets containing any item whose ancestors do not have minimum support
  - – Lower-level items do not occur as frequently. If the support threshold is
    - too high → miss low-level associations
    - too low → generate too many high-level associations
- Reduced support: reduced minimum support at lower levels
  - There are 4 search strategies:
    - Level-by-level independent
    - Level-cross filtering by k-itemset
    - Level-cross filtering by single item
    - Controlled level-cross filtering by single item
Uniform Support

Multi-level mining with uniform support:

Level 1, min_sup = 5%:  computer [support = 10%]
Level 2, min_sup = 5%:  laptop computer [support = 6%], desktop computer [support = 4%]
Reduced Support

Level 1, min_sup = 12%:  computer [support = 10%]
Level 2, min_sup = 3%:   laptop computer [support = 6%], desktop computer [support = 4%]
Level-Cross Filtering by a Single Item

Level 1, min_sup = 12%:  computer [support = 10%] → not frequent
Level 2, min_sup = 3%:   laptop computer, desktop computer → not examined
Level-Cross Filtering by k-Itemset

Level 1, min_sup = 5%:  {computer, printer} [support = 7%]
Level 2, min_sup = 2%:  {laptop computer, color printer} [support = 2%]
                        {laptop computer, b/w printer} [support = 1%]
                        {desktop computer, b/w printer} [support = 1%]
Controlled Level-Cross Filtering by a Single Item

Level 1, min_sup = 12%:  computer [support = 10%]
Level passage threshold = 8%
Level 2, min_sup = 3%:   laptop computer [support = 6%], desktop computer [support = 4%]
Cross-Level Association Rules

- Example
  - "computer → b/w printer", where items within the rule are not required to belong to the same concept level
- Examples of same-level association rules
  - "computer → printer"
  - "desktop computer → b/w printer"
Multi-level Association: Redundancy Filtering

- Some rules may be redundant due to "ancestor" relationships between items.
- Example
  - desktop computer → b/w printer [support = 8%, confidence = 70%]
  - IBM desktop computer → b/w printer [support = 2%, confidence = 72%]
- We say the first rule is an ancestor of the second rule.
- A rule is redundant if its support is close to the "expected" value, based on the rule's ancestor.
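As a worked illustration (under the assumption, not stated on the slide, that IBM machines account for a quarter of desktop computer sales): the expected support of the second rule would be 8% × 1/4 = 2%, which matches its actual support, and its confidence (72%) is close to its ancestor's (70%), so the second rule adds little information and can be filtered out.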
Chapter 6: Mining Association Rules in Large Databases

- Association rule mining
- Mining single-dimensional Boolean association rules from transactional databases
- Mining multilevel association rules from transactional databases
- Mining multidimensional association rules from transactional databases and data warehouses
- From association mining to correlation analysis
- Constraint-based association mining
- Summary
Multi-Dimensional Association: Concepts

- Single-dimensional rules (intra-dimension association rules):
    buys(X, "IBM desktop computer") → buys(X, "Sony b/w printer")
- Multi-dimensional rules: ≥ 2 dimensions or predicates
  - Inter-dimension association rules (no repeated predicates):
      age(X, "19-25") ∧ occupation(X, "student") → buys(X, "laptop")
  - Hybrid-dimension association rules (repeated predicates):
      age(X, "20-29") ∧ buys(X, "laptop") → buys(X, "b/w printer")
Types of Attributes

- Categorical attributes (nominal attributes)
  - finite number of possible values, no ordering among values (e.g., occupation, brand, color)
- Quantitative attributes
  - numeric, implicit ordering among values (e.g., age, income, price)
Techniques for Mining MD Associations

- Search for frequent k-predicate sets:
  - Example: {age, occupation, buys} is a 3-predicate set.
- Techniques can be categorized by how quantitative attributes such as age are treated:

1. Using static discretization of quantitative attributes
   - Quantitative attributes are statically discretized using predefined concept hierarchies.
2. Quantitative association rules
   - Quantitative attributes are dynamically discretized into "bins" based on the distribution of the data.
3. Distance-based association rules
   - A dynamic discretization process that considers the distance between data points.
Static Discretization of Quantitative Attributes

- Discretized prior to mining using concept hierarchies.
- Numeric values are replaced by ranges.
- In a relational database, finding all frequent k-predicate sets will require k or k+1 table scans.
- A data cube is well suited for mining:
  - The cells of an n-dimensional cuboid correspond to the predicate sets.
  - Mining from data cubes can be much faster.

(Figure: lattice of cuboids — (), (age), (income), (buys), (age, income),
(age, buys), (income, buys), (age, income, buys).)
Quantitative Association Rules

- Numeric attributes are dynamically discretized
  - such that the confidence or compactness of the rules mined is maximized.
- 2-D quantitative association rules: Aquan1 ∧ Aquan2 → Acat
- Example:
    age(X, "30..39") ∧ income(X, "42K..48K") → buys(X, "high resolution TV")
Clustering the Association Rules

- Cluster "adjacent" association rules to form general rules using a 2-D grid. For example, the four rules

    age(X, 34) ∧ income(X, "31K - 40K") → buys(X, "high resolution TV")
    age(X, 35) ∧ income(X, "31K - 40K") → buys(X, "high resolution TV")
    age(X, 34) ∧ income(X, "41K - 50K") → buys(X, "high resolution TV")
    age(X, 35) ∧ income(X, "41K - 50K") → buys(X, "high resolution TV")

  can be combined into

    age(X, "34..35") ∧ income(X, "31K..50K") → buys(X, "high resolution TV")
ARCS (Association Rule Clustering System)

How does ARCS work?
1. Binning
2. Finding frequent predicate sets
3. Clustering
4. Optimizing
Limitations of ARCS

- Only quantitative attributes on the LHS of rules.
- Only 2 attributes on the LHS (2-D limitation).
- An alternative to ARCS:
  - non-grid-based
  - equi-depth binning
  - clustering based on a measure of partial completeness.
Mining Distance-based Association Rules

- Binning methods do not capture the semantics of interval data:

Price($)   Equi-width (width $10)   Equi-depth (depth 2)   Distance-based
7          [0,10]                   [7,20]                 [7,7]
20         [11,20]                  [22,50]                [20,22]
22         [21,30]                  [51,53]                [50,53]
50         [31,40]
51         [41,50]
53         [51,60]

- Distance-based partitioning gives a more meaningful discretization, considering:
  - density/number of points in an interval
  - "closeness" of points in an interval
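A small sketch of mine contrasting two of these partitionings on the slide's price list; the distance-based intervals come from a naive gap-based grouping (the $10 gap threshold is my own choice for illustration):

prices = [7, 20, 22, 50, 51, 53]

# Equi-depth: a fixed number of values (here 2) per interval.
equi_depth = [prices[i:i + 2] for i in range(0, len(prices), 2)]
print(equi_depth)   # [[7, 20], [22, 50], [51, 53]] -> splits close 50, 51, 53 apart

# Distance-based: start a new interval when the gap to the previous
# value exceeds a threshold (here $10), so close values stay together.
intervals, cur = [], [prices[0]]
for p in prices[1:]:
    if p - cur[-1] > 10:
        intervals.append(cur)
        cur = [p]
    else:
        cur.append(p)
intervals.append(cur)
print([[c[0], c[-1]] for c in intervals])   # [[7, 7], [20, 22], [50, 53]]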
Mining Distance-based Association Rules (Cont.)

- Equi-depth partitioning groups distant values together.
- Equi-width partitioning splits values that are close together and creates intervals for which there is no data.
- Distance-based partitioning considers the density (number of points) in an interval and the closeness of points in an interval.
Approximation in Data Values

- item_type(X, "electronic") ∧ manufacturer(X, "foreign") → price(X, $200)
- Association rules don't allow for approximation of attribute values: support and confidence measures do not consider the closeness of values for a given attribute.
- That is why distance-based association rules are required.
Mining Distance-based Association Rules (Cont.)

- A two-phase algorithm is used to mine distance-based association rules:
  - The first phase employs clustering to find the intervals or clusters.
  - The second phase obtains distance-based association rules by searching for groups of clusters that occur frequently together.
- Example of a distance-based association rule: CX → CY
Mining Distance-based Association Rules (Cont.)

- Example: CX is a cluster over age and CY is a cluster over income.
- When the age-clustered tuples CX are projected onto the attribute income, their corresponding income values lie within the income cluster CY, or close to it.
- The distance measures the degree of association between CX and CY: the smaller the distance, the stronger the association.
Clusters and Distance Measurements

- S[X] is a set of N tuples t1, t2, …, tN projected on the attribute set X
- The diameter of S[X]:

    d(S[X]) = \frac{\sum_{i=1}^{N}\sum_{j=1}^{N} \mathrm{dist}_X(t_i[X], t_j[X])}{N(N-1)}

- dist_X: a distance metric, e.g., Euclidean distance or Manhattan distance
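As a quick check of the definition (my sketch, using Manhattan distance on scalar values):

def diameter(values):
    """Average pairwise distance over all ordered pairs (i != j)."""
    n = len(values)
    total = sum(abs(a - b) for a in values for b in values)
    return total / (n * (n - 1))

print(diameter([7, 7]))        # 0.0 -> tight cluster
print(diameter([7, 20, 22]))   # (13 + 15 + 2) * 2 / 6 = 10.0 -> looser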
Clusters and Distance Measurements (Cont.)

- The diameter, d, assesses the density of a cluster CX, where
    d(CX) ≤ d0
    |CX| ≥ s0
- Finding clusters and distance-based rules:
  - the density threshold, d0, replaces the notion of support
  - use a modified version of the BIRCH clustering algorithm
Chapter 6: Mining Association Rules in Large Databases

- Association rule mining
- Mining single-dimensional Boolean association rules from transactional databases
- Mining multilevel association rules from transactional databases
- Mining multidimensional association rules from transactional databases and data warehouses
- From association mining to correlation analysis
- Constraint-based association mining
- Summary
Interestingness Measurements

- Objective measures: two popular measurements are
  - support, and
  - confidence
- Subjective measures (Silberschatz & Tuzhilin, KDD95): a rule (pattern) is interesting if
  - it is unexpected (surprising to the user), and/or
  - actionable (the user can do something with it)
Criticism of Support and Confidence

- Example 1 (Aggarwal & Yu, PODS98):
  - Among 5000 students,
    - 3000 play basketball
    - 3750 eat cereal
    - 2000 both play basketball and eat cereal
  - play basketball → eat cereal [40%, 66.7%] is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.
  - play basketball → not eat cereal [20%, 33.3%] is far more accurate, although it has lower support and confidence.

                 basketball   not basketball   sum (row)
    cereal       2000         1750             3750
    not cereal   1000         250              1250
    sum (col.)   3000         2000             5000
Lift

    \mathrm{lift} = \frac{P(A \cup B)}{P(A)P(B)}

    \mathrm{lift}(B, C) = \frac{2000/5000}{(3000/5000)(3750/5000)} = 0.89

    \mathrm{lift}(B, \neg C) = \frac{1000/5000}{(3000/5000)(1250/5000)} = 1.33
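The same numbers, computed directly (my sketch):

def lift(p_ab, p_a, p_b):
    # lift > 1: positive correlation; < 1: negative; = 1: independent.
    return p_ab / (p_a * p_b)

n = 5000
print(round(lift(2000/n, 3000/n, 3750/n), 2))   # 0.89 -> B, C negatively correlated
print(round(lift(1000/n, 3000/n, 1250/n), 2))   # 1.33 -> B, not-C positively correlated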
Criticism of Support and Confidence (Cont.)

- Example 2:
  - X and Y: positively correlated
  - X and Z: negatively related
  - support and confidence of X => Z dominate

    X   1 1 1 1 0 0 0 0
    Y   1 1 0 0 0 0 0 0
    Z   0 1 1 1 1 1 1 1

    Rule   Support   Confidence
    X=>Y   25%       50%
    X=>Z   37.50%    75%

- We need a measure of dependent or correlated events:

    \mathrm{corr}_{A,B} = \frac{P(A \cup B)}{P(A)P(B)}
Other Interestingness Measures: Interest

- Interest (correlation, lift): \frac{P(A \wedge B)}{P(A)P(B)}
  - takes both P(A) and P(B) into consideration
  - P(A ∧ B) = P(A)·P(B) if A and B are independent events
  - A and B are negatively correlated if the value is less than 1; otherwise A and B are positively correlated

    X   1 1 1 1 0 0 0 0        Itemset   Support   Interest
    Y   1 1 0 0 0 0 0 0        X,Y       25%       2
    Z   0 1 1 1 1 1 1 1        X,Z       37.50%    0.9
                               Y,Z       12.50%    0.57
Chapter 6: Mining Association Rules in Large Databases

- Association rule mining
- Mining single-dimensional Boolean association rules from transactional databases
- Mining multilevel association rules from transactional databases
- Mining multidimensional association rules from transactional databases and data warehouses
- From association mining to correlation analysis
- Constraint-based association mining
- Summary
Constraint-Based Mining

- Interactive, exploratory mining of gigabytes of data?
  - Could it be real? Making good use of constraints!
- What kinds of constraints can be used in mining?
  - Knowledge type constraint: classification, association, etc.
  - Data constraint: SQL-like queries
    - Find product pairs sold together in Vancouver in Dec. '98.
  - Dimension/level constraints:
    - in relevance to region, price, brand, customer category.
  - Rule constraints:
    - small sales (price < $10) triggers big sales (sum > $200).
  - Interestingness constraints:
    - strong rules (min_support ≥ 3%, min_confidence ≥ 60%).
Rule Constraints in Association Mining

- Two kinds of rule constraints:
  - Rule form constraints: meta-rule guided mining.
    - P(x, y) ∧ Q(x, w) → takes(x, "database systems").
  - Rule (content) constraints: constraint-based query optimization.
    - sum(LHS) < 100 ∧ min(LHS) > 20 ∧ count(LHS) > 3 ∧ sum(RHS) > 1000
- 1-variable vs. 2-variable constraints:
  - 1-var: a constraint confining only one side (L/R) of the rule, e.g., as shown above.
  - 2-var: a constraint confining both sides (L and R).
    - sum(LHS) < min(RHS) ∧ max(RHS) < 5 * sum(LHS)
Constraint-Based Association Query

- Database: (1) trans(TID, Itemset), (2) itemInfo(Item, Type, Price)
- A constrained association query (CAQ) is in the form {(S1, S2) | C}, where C is a set of constraints on S1, S2, including a frequency constraint
- A classification of (single-variable) constraints:
  - Class constraint: S ⊂ A, e.g., S ⊂ Item
  - Domain constraint:
    - SθV, θ ∈ {=, ≠, <, ≤, >, ≥}, e.g., S.Price < 100
    - VθS, θ is ∈ or ∉, e.g., snacks ∉ S.Type
    - VθS, or SθV, θ ∈ {⊆, ⊂, ⊄, =, ≠}, e.g., {snacks, sodas} ⊆ S.Type
  - Aggregation constraint: agg(S)θV, where agg ∈ {min, max, sum, count, avg}, and θ ∈ {=, ≠, <, ≤, >, ≥}
    - e.g., count(S1.Type) = 1, avg(S2.Price) < 100
Constrained Association Query Optimization Problem

- Given a CAQ = {(S1, S2) | C}, the algorithm should be:
  - sound: it only finds frequent sets that satisfy the given constraints C
  - complete: all frequent sets satisfying the given constraints C are found
- A naive solution:
  - Apply Apriori to find all frequent sets, and then test them for constraint satisfaction one by one.
- Our approach:
  - Comprehensive analysis of the properties of constraints, trying to push them as deeply as possible inside the frequent set computation.
Anti-monotone and Monotone Constraints

- A constraint Ca is anti-monotone iff for any pattern S not satisfying Ca, none of the super-patterns of S can satisfy Ca
- A constraint Cm is monotone iff for any pattern S satisfying Cm, every super-pattern of S also satisfies it
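For intuition (my example, assuming non-negative prices): sum(S.Price) ≤ v is anti-monotone, since adding items can only raise the sum, while sum(S.Price) ≥ v is monotone. An anti-monotone constraint can be pushed into Apriori's candidate pruning exactly like min-support; a sketch:

def prune_candidates(candidates, price, max_total=100):
    """Anti-monotone push: drop any itemset whose price sum already exceeds
    max_total -- no superset of it can satisfy sum(S.Price) <= max_total
    (assumes non-negative prices)."""
    return [s for s in candidates if sum(price[i] for i in s) <= max_total]

price = {"tv": 600, "snack": 5, "soda": 3}
print(prune_candidates([{"snack", "soda"}, {"tv", "snack"}], price))
# [{'snack', 'soda'}] -> {'tv', 'snack'} and all its supersets are pruned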
Succinct Constraint

- A subset of items Is ⊆ I is a succinct set if it can be expressed as σp(I) for some selection predicate p, where σ is a selection operator
- SP ⊆ 2^I is a succinct power set if there is a fixed number of succinct sets I1, …, Ik ⊆ I such that SP can be expressed in terms of the strict power sets of I1, …, Ik using union and minus
- A constraint Cs is succinct provided the set of itemsets satisfying Cs is a succinct power set
Convertible Constraint

- Suppose all items in patterns are listed in a total order R
- A constraint C is convertible anti-monotone iff a pattern S satisfying the constraint implies that each suffix of S w.r.t. R also satisfies C
- A constraint C is convertible monotone iff a pattern S satisfying the constraint implies that each pattern of which S is a suffix w.r.t. R also satisfies C
Relationships Among Categories of Constraints

(Figure: diagram relating the categories — succinctness, anti-monotonicity, and
monotonicity within the class of convertible constraints, with inconvertible
constraints outside.)