Chapter 26: Data Mining (Some Slides Courtesy of Rich Caruana, Cornell University; based on Ramakrishnan and Gehrke, Database Management Systems, 3rd Edition)
Impurity reduction of a split s at node t:
Delta phi(s,t) = phi(t) - p_L * phi(t_L) - p_R * phi(t_R)
where t_L and t_R are the left and right children of t, and p_L and p_R are the fractions of records sent to them.
Example (Contd.)
Impurity of root node:
phi(0.5,0.5)
Impurity of whole tree:
0.6* phi(0.83,0.17)
+ 0.4 * phi(0,1)
Impurity reduction:
phi(0.5,0.5)
- 0.6* phi(0.83,0.17)
- 0.4 * phi(0,1)
Tree: root split X1<=1 (50%,50%); Yes branch: (83%,17%), No branch: (0%,100%)
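As a quick illustration (code added here, not from the slides), the impurity reduction above can be evaluated directly once an impurity function phi is chosen; the Gini-style phi below is just one possible example.

```python
def gini(p1, p2):
    """One example impurity function phi(p1, p2); any concave phi could be plugged in."""
    return 1.0 - p1**2 - p2**2

def impurity_reduction(phi, root, left, right, p_left):
    """Delta phi(s,t) = phi(t) - p_L * phi(t_L) - p_R * phi(t_R)."""
    p_right = 1.0 - p_left
    return phi(*root) - p_left * phi(*left) - p_right * phi(*right)

# The example split: root (50%,50%); Yes child (83%,17%) receives 60% of the
# records, No child (0%,100%) receives 40%.
print(impurity_reduction(gini, (0.5, 0.5), (0.83, 0.17), (0.0, 1.0), 0.6))
```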
Error Reduction as Impurity Function
Possible impurity function: resubstitution error R(T,D).
Example:
R(no tree, D) = 0.5
R(T_1, D) = 0.6*0.17
R(T_2, D) = 0.4*0.25 + 0.6*0.33
T_1: root X1<=1 (50%,50%); Yes branch: (83%,17%), No branch: (0%,100%)
T_2: root X2<=1 (50%,50%); Yes branch: (66%,33%), No branch: (25%,75%)
Problems with Resubstitution Error
Obvious problem:
There are situations
where no split can
decrease impurity
Example:
R(no tree, D) = 0.2
R(T_1, D) = 0.6*0.17 + 0.4*0.25 = 0.2
Tree T_1: root X3<=1 (80%,20%); Yes branch: 6 records (83%,17%), No branch: 4 records (75%,25%)
Problems with Resubstitution Error
More subtle problem:
Two candidate splits of the same 8 records (50%,50%):
Split on X3<=1: Yes branch 4 records (75%,25%), No branch 4 records (25%,75%)
Split on X4<=1: Yes branch 2 records (100%,0%), No branch 6 records (33%,66%)
Both splits leave the resubstitution error at 0.25, yet only the second isolates a pure node.
Problems with Resubstitution Error
Root node: n records, q of class 1
Left child node: n1 records, q1 of class 1
Right child node: n2 records, (q - q1) of class 1,
n1 + n2 = n
Tree: root X3<=1, n records (q/n, (n-q)/n); Yes branch: n1 records (q1/n1, (n1-q1)/n1), No branch: n2 records ((q-q1)/n2, (n2-q+q1)/n2)
Problems with Resubstitution Error
Tree structure:
Root node: n records (q/n, (n-q)/n)
Left child: n1 records (q1/n1, (n1-q1)/n1)
Right child: n2 records ((q-q1)/n2, (n2-q+q1)/n2)
Impurity before split:
Error: q/n (assuming class 1 is the minority class at the root and in both children)
Impurity after split:
Left child: n1/n * q1/n1 = q1/n
Right child: n2/n * (q-q1)/n2 = (q-q1)/n
Total error: q1/n + (q-q1)/n = q/n
Problems with Resubstitution Error
Heart of the problem:
Assume two classes:
phi(p(1|t), p(2|t)) = phi(p(1|t), 1 - p(1|t)) = phi(p(1|t))
Resubstitution error has the following property:
phi(p1 + p2) = phi(p1) + phi(p2)
Example: Only Root Node
[Figure: phi plotted over p in [0, 1]; only the root node X3<=1, 8 records (50%,50%)]
Example: Split (75,25), (25,75)
[Figure: phi plotted over p in [0, 1]; root X3<=1, 8 records (50%,50%), split into Yes: 4 records (75%,25%) and No: 4 records (25%,75%)]
Example: Split (33,66), (100,0)
[Figure: phi plotted over p in [0, 1]; split X4<=1 (80%,20%) with branches No: 6 records (33%,66%) and Yes: 2 records (100%,0%)]
Remedy: Concavity
Use impurity functions that are concave:
phi'' < 0
Example impurity functions
Entropy: phi(t) = - sum_j p(j|t) log p(j|t)
Gini index: phi(t) = 1 - sum_j p(j|t)^2
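To make the contrast concrete, here is a small Python sketch (added for illustration, not part of the slides) that compares the resubstitution error with the two concave impurity functions on the earlier X3<=1 example: the error-based phi reports no gain, while Gini and entropy both report a positive impurity reduction.

```python
import math

def error(probs):
    # resubstitution error: fraction misclassified when predicting the majority class
    return 1.0 - max(probs)

def entropy(probs):
    return -sum(p * math.log(p, 2) for p in probs if p > 0)

def gini(probs):
    return 1.0 - sum(p * p for p in probs)

def impurity_reduction(phi, root, children):
    """children: list of (weight, class distribution) pairs whose weights sum to 1."""
    return phi(root) - sum(w * phi(dist) for w, dist in children)

root = (0.8, 0.2)                                   # 10 records: 8 vs 2
children = [(0.6, (5/6, 1/6)), (0.4, (3/4, 1/4))]   # 6 records and 4 records

for phi in (error, gini, entropy):
    print(phi.__name__, round(impurity_reduction(phi, root, children), 4))
```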
Example Split With Concave Phi
[Figure: a concave phi plotted over p in [0, 1]; split X4<=1 (80%,20%) with branches No: 6 records (33%,66%) and Yes: 2 records (100%,0%)]
Nonnegative Decrease in Impurity
Theorem: Let phi(p_1, ..., p_J) be a strictly
concave function on p_1, ..., p_J >= 0 with sum_j p_j = 1.
Then for any split s:
Delta phi(s,t) >= 0
With equality if and only if:
p(j|t_L) = p(j|t_R) = p(j|t), j = 1, ..., J
Note: Entropy and the Gini index are concave.
CART Univariate Split Selection
Use gini-index as impurity function
For each numerical or ordered attribute X,
consider all binary splits s of the form
X <= x
where x in dom(X)
For each categorical attribute X, consider all
binary splits s of the form
X in A, where A subset dom(X)
At a node t, select the split s* such that
Delta phi(s*,t) is maximal over all splits s considered
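A minimal sketch of this selection for a single numerical attribute, assuming two classes (0/1 labels) and the Gini index; this is illustrative code, not the book's implementation.

```python
def gini(labels):
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)      # labels are 0/1
    return 1.0 - p1**2 - (1 - p1)**2

def best_numeric_split(xs, ys):
    """Return (threshold x, impurity reduction) of the best split X <= x."""
    n = len(xs)
    parent = gini(ys)
    best = (None, 0.0)
    for x in sorted(set(xs)):
        left = [y for xi, y in zip(xs, ys) if xi <= x]
        right = [y for xi, y in zip(xs, ys) if xi > x]
        if not left or not right:
            continue
        gain = parent - len(left) / n * gini(left) - len(right) / n * gini(right)
        if gain > best[1]:
            best = (x, gain)
    return best

# toy data: attribute value and class label
xs = [1, 1, 2, 2, 3, 3, 4, 4]
ys = [1, 1, 1, 0, 0, 0, 0, 0]
print(best_numeric_split(xs, ys))       # -> (2, 0.28125)
```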
CART: Shortcut for Categorical Splits
Computational shortcut if |Y|=2.
Theorem: Let X be a categorical attribute with
dom(X) = {b_1, ..., b_k}, |Y| = 2, let phi be a concave
function, and order the categories so that
p(Y=1 | X=b_1) <= ... <= p(Y=1 | X=b_k).
Then the best split is of the form:
X in {b_1, b_2, ..., b_l} for some l < k
Benefit: We need only check k-1 subsets of
dom(X) instead of 2^(k-1) - 1 subsets
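A small Python sketch of this shortcut for two classes (one illustrative reading of the theorem, not the book's code): order the categories by their class-1 proportion and evaluate only the k-1 prefix splits.

```python
def gini(p1):
    return 1.0 - p1**2 - (1 - p1)**2

def best_categorical_split(values, labels):
    """values: categorical attribute values; labels: 0/1 class labels."""
    n = len(values)
    # sort categories b by p(Y=1 | X=b)
    cats = sorted(set(values),
                  key=lambda b: sum(l for v, l in zip(values, labels) if v == b)
                                / sum(1 for v in values if v == b))
    parent = gini(sum(labels) / n)
    best = (None, 0.0)
    for l in range(1, len(cats)):                 # only k-1 prefix subsets
        subset = set(cats[:l])
        left = [lab for v, lab in zip(values, labels) if v in subset]
        right = [lab for v, lab in zip(values, labels) if v not in subset]
        gain = (parent - len(left) / n * gini(sum(left) / len(left))
                       - len(right) / n * gini(sum(right) / len(right)))
        if gain > best[1]:
            best = (subset, gain)
    return best

values = ['a', 'a', 'b', 'b', 'b', 'c', 'c', 'c']
labels = [ 1,   1,   1,   0,   0,   0,   0,   0 ]
print(best_categorical_split(values, labels))
```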
CART Multivariate Split Selection
For numerical predictor variables, examine
splitting predicates s of the form:
sum_i a_i * X_i <= c
with the constraint:
sum_i a_i^2 = 1
Select splitting predicate s* with
maximum decrease in impurity.
Problems with CART Split Selection
Biased towards variables with more possible splits
(an M-category variable has 2^(M-1) - 1 possible
binary splits, an M-valued ordered variable has
M-1 possible splits)
Computationally expensive for categorical
variables with large domains
Pruning Methods
Test dataset pruning
Direct stopping rule
Cost-complexity pruning
MDL pruning
Pruning by randomization testing
Top-Down and Bottom-Up Pruning
Two classes of methods:
Top-down pruning: Stop growth of the
tree at the right size. Need a statistic that
indicates when to stop growing a subtree.
Bottom-up pruning: Grow an overly large
tree and then chop off subtrees that
overfit the training data.
Stopping Policies
A stopping policy indicates when further growth of
the tree at a node t is counterproductive.
All records are of the same class
The attribute values of all records are identical
All records have missing values
At most one class has a number of records
larger than a user-specified number
All records go to the same child node if t is split
(only possible with some split selection
methods)
Test Dataset Pruning
Use an independent test sample D to
estimate the misclassification cost using
the resubstitution estimate R(T,D) at
each node
Select the subtree T' of T with the
smallest expected cost
Test Dataset Pruning Example
Tree: root X1<=1 (50%,50%); No branch: leaf (0%,100%); Yes branch: internal node (83%,17%) split on X2<=1 with leaves (100%,0%) and (75%,25%)
Test set:
X1 X2 Class
1 1 Yes
1 2 Yes
1 2 Yes
1 2 Yes
1 1 Yes
1 2 No
2 1 No
2 1 No
2 2 No
2 2 No
Only root: 10% misclassification
Full tree: 30% misclassification
Cost Complexity Pruning
(Breiman, Friedman, Olshen, Stone, 1984)
Some more tree notation
t: node in tree T
leaf(T): set of leaf nodes of T
|leaf(T)|: number of leaf nodes of T
T_t: subtree of T rooted at t
{t}: subtree of T_t containing only node t
Notation: Example
leaf(T) = {t1,t2,t3}
|leaf(T)|=3
Tree rooted at node t: T_t
Tree consisting of only node t: {t}
leaf(T_t) = {t1, t2}
leaf({t}) = {t}
Example tree: root X1<=1 with children t (split X2<=1, with leaves t1: No and t2: Yes) and leaf t3: No
Cost-Complexity Pruning
Test dataset pruning is the ideal case, if we
have a large test dataset. But:
We might not have a large test dataset
We want to use all available records for tree
construction
If we do not have a test dataset, we do not
obtain honest classification error estimates
Remember cross-validation: Re-use training
dataset in a clever way to estimate the
classification error.
Cost-Complexity Pruning
1. /* cross-validation step */
   Construct tree T using D
2. Partition D into V subsets D_1, ..., D_V
3. for (i=1; i<=V; i++)
       Construct tree T_i from (D \ D_i)
       Use D_i to calculate the estimate R(T_i, D \ D_i)
   endfor
4. /* estimation step */
   Calculate R(T,D) from the R(T_i, D \ D_i)
Cross-Validation Step
[Diagram: the error estimate R of the full tree T is to be derived from the estimates R_1, R_2, R_3 of the cross-validation trees]
Cost-Complexity Pruning
Problem: How can we relate the
misclassification error of the CV-trees to
the misclassification error of the large
tree?
Idea: Use a parameter that has the same
meaning over different trees, and relate
trees with similar parameter settings.
Such a parameter is the cost-complexity
of the tree.
Cost-Complexity Pruning
Cost complexity of a tree T:
R_alpha(T) = R(T) + alpha * |leaf(T)|
For each alpha, there is a tree that minimizes the
cost complexity:
alpha = 0: full tree
alpha = infinity: only root node
[Figure: a sequence of nested trees shown for alpha = 0.0, 0.25, 0.4, and 0.6]
Cost-Complexity Pruning
When should we prune the subtree rooted at t?
R_alpha({t}) = R(t) + alpha
R_alpha(T_t) = R(T_t) + alpha * |leaf(T_t)|
Define
g(t) = (R(t) - R(T_t)) / (|leaf(T_t)| - 1)
Each node has a critical value g(t):
alpha < g(t): keep the subtree T_t rooted at t
alpha >= g(t): prune the subtree rooted at t to {t}
For each alpha we obtain a unique minimum
cost-complexity tree.
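The critical value g(t) is easy to compute bottom-up. The sketch below (illustrative only; the Node class and the example numbers are made up) finds the weakest link, i.e. the internal node with the smallest g(t), which is the first subtree to be pruned as alpha grows.

```python
class Node:
    def __init__(self, R_node, children=()):
        self.R_node = R_node              # resubstitution error R(t) of node t alone
        self.children = list(children)    # empty list -> t is a leaf

def leaves_and_error(t):
    """Return (|leaf(T_t)|, R(T_t)) for the subtree rooted at t."""
    if not t.children:
        return 1, t.R_node
    n_leaves, R_subtree = 0, 0.0
    for c in t.children:
        nl, rs = leaves_and_error(c)
        n_leaves += nl
        R_subtree += rs
    return n_leaves, R_subtree

def weakest_link(t, best=None):
    """Internal node with minimal g(t) = (R(t) - R(T_t)) / (|leaf(T_t)| - 1)."""
    if t.children:
        n_leaves, R_subtree = leaves_and_error(t)
        g = (t.R_node - R_subtree) / (n_leaves - 1)
        if best is None or g < best[0]:
            best = (g, t)
        for c in t.children:
            best = weakest_link(c, best)
    return best

# errors are fractions of the whole training set
root = Node(0.5, [Node(0.1, [Node(0.05), Node(0.02)]), Node(0.0)])
g, node = weakest_link(root)
print("smallest critical value g(t):", g)   # this subtree is pruned first
```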
Example Revisited
[Figure: the minimum cost-complexity trees for alpha >= 0.45, 0.3 < alpha < 0.45, 0.2 < alpha <= 0.3, and 0 < alpha <= 0.2]
Cost Complexity Pruning
1. Let T_1 > T_2 > ... > {t} be the nested cost-complexity sequence of subtrees of T rooted at t.
Let alpha_1 < ... < alpha_k be the sequence of
associated critical values of alpha. Define
alpha'_k = sqrt(alpha_k * alpha_(k+1))
2. Let T_i be the tree grown from D \ D_i
3. Let T_i(alpha'_k) be the minimal cost-complexity
tree for alpha'_k
Cost Complexity Pruning
4. Let R(T_i(alpha'_k)) be the
misclassification cost of T_i(alpha'_k) based
on D_i
5. Define the V-fold cross-validation
misclassification estimate as follows:
R*(T_k) = (1/V) * sum_i R(T_i(alpha'_k))
6. Select the subtree with the smallest
estimated CV error
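For completeness, a hedged sketch of how this selection can be done with scikit-learn, if that library is available: cost_complexity_pruning_path supplies candidate alphas, and cross-validation picks among them. The dataset is chosen only for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate alphas from the cost-complexity pruning path of the fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    clf = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    score = cross_val_score(clf, X, y, cv=5).mean()   # CV estimate of accuracy
    if score > best_score:
        best_alpha, best_score = alpha, score

print("chosen alpha:", best_alpha, "CV accuracy:", round(best_score, 3))
```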
k-SE Rule
Let T* be the subtree of T that minimizes
the misclassification error R(T_k) over all k
But R(T_k) is only an estimate:
Estimate the standard error
SE(R(T*)) of R(T*)
Let T** be the smallest tree such that
R(T**) <= R(T*) + k*SE(R(T*)); use T**
instead of T*
Intuition: A smaller tree is easier to
understand.
Cost Complexity Pruning
Advantages:
No independent test dataset necessary
Gives estimate of misclassification error, and
chooses tree that minimizes this error
Disadvantages:
Originally devised for small datasets; is it still
necessary for large datasets?
Computationally very expensive for large
datasets (need to grow V trees from nearly all
the data)
Missing Values
What is the problem?
During computation of the splitting predicate,
we can selectively ignore records with missing
values (note that this has some problems)
But if a record r is missing the value of the
splitting attribute, r cannot
participate further in tree construction
Algorithms for missing values address this
problem.
Mean and Mode Imputation
Assume record r has missing value r.X, and
splitting variable is X.
Simplest algorithm:
If X is numerical (categorical), impute the
overall mean (mode)
Improved algorithm:
If X is numerical (categorical), impute the
mean(X|t.C) (the mode(X|t.C))
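A small pure-Python sketch of both variants (illustration only; it assumes t.C denotes the class label of the records at node t, and the toy records are made up):

```python
from statistics import mean

# records at node t: (X value or None if missing, class label C)
records = [
    (3.0, "yes"), (4.0, "yes"), (None, "yes"),
    (10.0, "no"), (12.0, "no"), (None, "no"),
]

# Simplest algorithm: impute the overall mean of X.
overall = mean(x for x, _ in records if x is not None)
simple = [(x if x is not None else overall, c) for x, c in records]

# Improved algorithm: impute mean(X | class label) at this node.
by_class = {}
for x, c in records:
    if x is not None:
        by_class.setdefault(c, []).append(x)
class_means = {c: mean(xs) for c, xs in by_class.items()}
improved = [(x if x is not None else class_means[c], c) for x, c in records]

print(simple)    # missing values replaced by the overall mean 7.25
print(improved)  # missing values replaced by 3.5 ("yes") and 11.0 ("no")
```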
Decision Trees: Summary
Many applications of decision trees
There are many algorithms available for:
Split selection
Pruning
Handling Missing Values
Data Access
Decision tree construction still active research
area (after 20+ years!)
Challenges: Performance, scalability, evolving
datasets, new applications
Lecture Overview
Data Mining I: Decision Trees
Data Mining II: Clustering
Data Mining III: Association Analysis
Supervised Learning
F(x): true function (usually not known)
D: training sample drawn from F(x)
57,M,195,0,125,95,39,25,0,1,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0
78,M,160,1,130,100,37,40,1,0,0,0,1,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0
69,F,180,0,115,85,40,22,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
18,M,165,0,110,80,41,30,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
54,F,135,0,115,95,39,35,1,1,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0
84,F,210,1,135,105,39,24,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
89,F,135,0,120,95,36,28,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0
49,M,195,0,115,85,39,32,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0
40,M,205,0,115,90,37,18,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
74,M,250,1,130,100,38,26,1,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
77,F,140,0,125,100,40,30,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,1
Class labels (one per record above): 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0
Supervised Learning
F(x): true function (usually not known)
D: training sample (x,F(x))
57,M,195,0,125,95,39,25,0,1,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0 0
78,M,160,1,130,100,37,40,1,0,0,0,1,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0 1
69,F,180,0,115,85,40,22,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0 0
18,M,165,0,110,80,41,30,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 0
54,F,135,0,115,95,39,35,1,1,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0 1
G(x): model learned from D
71,M,160,1,130,105,38,20,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0 ?
Goal: E[(F(x) - G(x))^2] is small (near zero) for
future samples
Supervised Learning
Well-defined goal:
Learn G(x) that is a good approximation
to F(x) from training sample D
Well-defined error metrics:
Accuracy, RMSE, ROC, ...
Supervised Learning
Training dataset:
Test dataset:
71,M,160,1,130,105,38,20,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0 ?
57,M,195,0,125,95,39,25,0,1,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0
78,M,160,1,130,100,37,40,1,0,0,0,1,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0
69,F,180,0,115,85,40,22,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
18,M,165,0,110,80,41,30,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
54,F,135,0,115,95,39,35,1,1,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0
84,F,210,1,135,105,39,24,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
89,F,135,0,120,95,36,28,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0
49,M,195,0,115,85,39,32,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0
40,M,205,0,115,90,37,18,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
74,M,250,1,130,100,38,26,1,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
77,F,140,0,125,100,40,30,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,1
Class labels (one per training record above): 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0
Un-Supervised Learning
Training dataset:
Test dataset:
71,M,160,1,130,105,38,20,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0 ?
57,M,195,0,125,95,39,25,0,1,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0
78,M,160,1,130,100,37,40,1,0,0,0,1,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0
69,F,180,0,115,85,40,22,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
18,M,165,0,110,80,41,30,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
54,F,135,0,115,95,39,35,1,1,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0
84,F,210,1,135,105,39,24,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
89,F,135,0,120,95,36,28,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0
49,M,195,0,115,85,39,32,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0
40,M,205,0,115,90,37,18,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
74,M,250,1,130,100,38,26,1,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
77,F,140,0,125,100,40,30,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,1
Class labels (one per training record above): 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0
Un-Supervised Learning
Data Set:
57,M,195,0,125,95,39,25,0,1,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0
78,M,160,1,130,100,37,40,1,0,0,0,1,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0
69,F,180,0,115,85,40,22,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
18,M,165,0,110,80,41,30,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
54,F,135,0,115,95,39,35,1,1,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0
84,F,210,1,135,105,39,24,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
89,F,135,0,120,95,36,28,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0
49,M,195,0,115,85,39,32,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0
40,M,205,0,115,90,37,18,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
74,M,250,1,130,100,38,26,1,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
77,F,140,0,125,100,40,30,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,1
Supervised vs. Unsupervised Learning
Supervised
y = F(x): true function
D: labeled training set, D = {x_i, F(x_i)}
Learn:
G(x): model trained to predict the labels of D
Goal:
E[(F(x) - G(x))^2] -> 0
Well-defined criteria:
Accuracy, RMSE, ...
Unsupervised
Generator: true model
D: unlabeled data sample, D = {x_i}
Learn:
??????????
Goal:
??????????
Well-defined criteria:
??????????
What to Learn/Discover?
Statistical Summaries
Generators
Density Estimation
Patterns/Rules
Associations (see previous segment)
Clusters/Groups (this segment)
Exceptions/Outliers
Changes in Patterns Over Time or Location
Clustering: Unsupervised Learning
Given:
Data Set D (training set)
Similarity/distance metric/information
Find:
Partitioning of data
Groups of similar/close items
Similarity?
Groups of similar customers
Similar demographics
Similar buying behavior
Similar health
Similar products
Similar cost
Similar function
Similar store
Similarity usually is domain/problem specific
Distance Between Records
d-dim vector space representation and distance
metric
r_1: 57,M,195,0,125,95,39,25,0,1,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0
r_2: 78,M,160,1,130,100,37,40,1,0,0,0,1,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0
...
r_N: 18,M,165,0,110,80,41,30,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Distance(r_1, r_2) = ???
Pairwise distances between points (no d-dim space)
Similarity/dissimilarity matrix
(upper or lower diagonal)
Distance: 0 = near, larger values = far
Similarity: 0 = far, larger values = near
-- 1 2 3 4 5 6 7 8 9 10
1 - d d d d d d d d d
2 - d d d d d d d d
3 - d d d d d d d
4 - d d d d d d
5 - d d d d d
6 - d d d d
7 - d d d
8 - d d
9 - d
Properties of Distances: Metric Spaces
A metric space is a set S with a global
distance function d. For every two points
x, y in S, the distance d(x,y) is a
nonnegative real number.
A metric space must also satisfy
d(x,y) = 0 iff x = y
d(x,y) = d(y,x) (symmetry)
d(x,y) + d(y,z) >= d(x,z) (triangle inequality)
Minkowski Distance (L_p Norm)
Consider two records x = (x_1, ..., x_d), y = (y_1, ..., y_d):
d(x,y) = (|x_1 - y_1|^p + |x_2 - y_2|^p + ... + |x_d - y_d|^p)^(1/p)
Special cases:
p=1: Manhattan distance
d(x,y) = |x_1 - y_1| + |x_2 - y_2| + ... + |x_d - y_d|
p=2: Euclidean distance
d(x,y) = sqrt((x_1 - y_1)^2 + (x_2 - y_2)^2 + ... + (x_d - y_d)^2)
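A direct Python rendering of these formulas (an illustrative sketch; the example vectors are just the numeric fields of two of the records above):

```python
def minkowski(x, y, p):
    """L_p distance between two d-dimensional records x and y."""
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1.0 / p)

x = [57, 195, 125, 95, 39, 25]
y = [78, 160, 130, 100, 37, 40]
print(minkowski(x, y, 1))   # p=1: Manhattan distance
print(minkowski(x, y, 2))   # p=2: Euclidean distance
```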
Only Binary Variables
2x2 Table:
       0     1     Sum
 0     a     b     a+b
 1     c     d     c+d
Sum   a+c   b+d   a+b+c+d
Simple matching coefficient (symmetric):
d(x,y) = (b + c) / (a + b + c + d)
Jaccard coefficient (asymmetric, ignores the 0-0 matches counted in a):
d(x,y) = (b + c) / (b + c + d)
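A small sketch computing both coefficients for two binary records (illustration only):

```python
def binary_distances(x, y):
    """x, y: equal-length 0/1 vectors; returns (simple matching, Jaccard) distances."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
    b = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    c = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    d = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    simple_matching = (b + c) / (a + b + c + d)
    jaccard = (b + c) / (b + c + d) if (b + c + d) > 0 else 0.0
    return simple_matching, jaccard

x = [0, 1, 0, 0, 0, 1, 0, 0, 0, 0]
y = [1, 0, 0, 0, 1, 0, 1, 1, 1, 0]
print(binary_distances(x, y))
```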
Nominal and Ordinal Variables
Nominal: Count the number of matching variables
m: # of matches, d: total # of variables
d(x,y) = (d - m) / d
Ordinal: Bucketize and transform to numerical:
Consider record x with value x_i for the i-th attribute of record x; new value x'_i:
x'_i = (x_i - 1) / (|dom(X_i)| - 1)
Mixtures of Variables
Weigh each variable differently
Can take importance of variable into
account (although usually hard to quantify
in practice)
Clustering: Informal Problem Definition
Input:
A data set of N records each given as a d-
dimensional data feature vector.
Output:
Determine a natural, useful partitioning of the
data set into a number of (k) clusters and noise
such that we have:
High similarity of records within each cluster (intra-
cluster similarity)
Low similarity of records between clusters (inter-
cluster similarity)
Types of Clustering
Hard Clustering:
Each object is in one and only one cluster
Soft Clustering:
Each object has a probability of being in each
cluster
Clustering Algorithms
Partitioning-based clustering
K-means clustering
K-medoids clustering
EM (expectation maximization) clustering
Hierarchical clustering
Divisive clustering (top down)
Agglomerative clustering (bottom up)
Density-Based Methods
Regions of dense points separated by sparser regions
of relatively low density
K-Means Clustering Algorithm
Initialize k cluster centers
Do
Assignment step: Assign each data point to its closest cluster center
Re-estimation step: Re-compute cluster centers
While (there are still changes in the cluster centers)
Visualization at:
http://www.delft-cluster.nl/textminer/theory/kmeans/kmeans.html
Issues
Why does K-Means work:
How does it find the cluster centers?
Does it find an optimal clustering?
What are good starting points for the algorithm?
What is the right number of cluster centers?
How do we know it will terminate?
K-Means: Distortion
Communication between sender and receiver
Sender encodes dataset: each point x_i is sent as a cluster index encode(x_i) in {1, ..., k}
Receiver decodes dataset: index j is decoded as center_j
Distortion:
D = sum_{i=1..N} (x_i - center_encode(x_i))^2
A good clustering has minimal distortion.
Properties of the Minimal Distortion
Recall: Distortion
Property 1: Each data point x_i is encoded by its nearest cluster center center_j. (Why?)
Property 2: When the algorithm stops, the partial derivative of the Distortion with respect to each center attribute is zero.
Distortion = sum_{i=1..N} (x_i - center_encode(x_i))^2
Property 2 Followed Through
Calculating the partial derivative:
D = sum_{i=1..N} (x_i - center_encode(x_i))^2 = sum_{j=1..k} sum_{i in Cluster(center_j)} (x_i - center_j)^2
dD/dcenter_j = d/dcenter_j sum_{i in Cluster(center_j)} (x_i - center_j)^2
             = -2 * sum_{i in Cluster(center_j)} (x_i - center_j)  =!  0
Thus at the minimum:
center_j = (1 / |Cluster(center_j)|) * sum_{i in Cluster(center_j)} x_i
K-Means Minimal Distortion Property
Property 1: Each data point x_i is encoded by its nearest cluster center center_j
Property 2: Each center is the centroid of its cluster.
How do we improve a configuration:
Change encoding (encode a point by its nearest
cluster center)
Change the cluster center (make each center the
centroid of its cluster)
K-Means Minimal Distortion Property
(Contd.)
Termination? Count the number of distinct
configurations
Optimality? We might get stuck in a local
optimum.
Try different starting configurations.
Choose the starting centers smart.
Choosing the number of centers?
Hard problem. Usually choose number of
clusters that minimizes some criterion.
K-Means: Summary
Advantages:
Good for exploratory data analysis
Works well for low-dimensional data
Reasonably scalable
Disadvantages
Hard to choose k
Often clusters are non-spherical
K-Medoids
Similar to K-Means, but for categorical
data or data in a non-vector space.
Since we cannot compute the cluster
center (think text data), we take the
most representative data point in the
cluster.
This data point is called the medoid (the
object that lies in the center).
Agglomerative Clustering
Algorithm:
Put each item in its own cluster (all singletons)
Find all pairwise distances between clusters
Merge the two closest clusters
Repeat until everything is in one cluster
Observations:
Results in a hierarchical clustering
Yields a clustering for each possible number of clusters
Greedy clustering: Result is not optimal for any cluster
size
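A naive single-link version of this algorithm in Python (illustration only; single-link, i.e. closest-pair, distance between clusters is just one possible choice, and the implementation makes no attempt to be efficient):

```python
import math

def single_link(c1, c2):
    """Distance between two clusters = distance of their closest pair of points."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def agglomerative(points):
    """Return the clustering obtained after each merge step."""
    clusters = [[p] for p in points]          # every item starts in its own cluster
    history = [list(clusters)]
    while len(clusters) > 1:
        # find and merge the two closest clusters
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: single_link(clusters[ab[0]], clusters[ab[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        history.append(list(clusters))
    return history

points = [(1, 1), (1, 2), (5, 5), (5, 6), (9, 1)]
for step in agglomerative(points):
    print(step)
```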
Agglomerative Clustering Example
Density-Based Clustering
A cluster is defined as a connected dense
component.
Density is defined in terms of number of
neighbors of a point.
We can find clusters of arbitrary shape
DBSCAN
Eps-neighborhood of a point
N_Eps(p) = {q in D | dist(p,q) <= Eps}
Core point
|N_Eps(q)| >= MinPts
Directly density-reachable
A point p is directly density-reachable from a point q wrt. Eps, MinPts if
1) p in N_Eps(q) and
2) |N_Eps(q)| >= MinPts (core point condition).
DBSCAN
Density-reachable
A point p is density-reachable from a point q wrt. Eps and MinPts if
there is a chain of points p_1, ..., p_n with p_1 = q and p_n = p such that p_(i+1)
is directly density-reachable from p_i
Density-connected
A point p is density-connected to a point q wrt. Eps and MinPts if
there is a point o such that both p and q are density-reachable
from o wrt. Eps and MinPts.
DBSCAN
Cluster
A cluster C satisfies:
1) For all p, q: if p is in C and q is density-reachable
from p wrt. Eps and MinPts, then q is in C.
(Maximality)
2) For all p, q in C: p is density-connected to q wrt.
Eps and MinPts. (Connectivity)
Noise
Those points not belonging to any cluster
DBSCAN
Can show
(1) Every density-reachable set is a cluster:
The set
O = {o | o is density-reachable from p wrt. Eps and MinPts}
is a cluster wrt. Eps and MinPts.
(2) Every cluster is a density-reachable set:
Let C be a cluster wrt. Eps and MinPts and let p be any point in C
with |N_Eps(p)| >= MinPts. Then C equals the set
O = {o | o is density-reachable from p wrt. Eps and MinPts}.
This motivates the following algorithm:
For each point, DBSCAN determines the Eps-neighborhood and
checks whether it contains at least MinPts data points
If so, it labels the point with a cluster number
If a neighbor q of a point p already has a cluster number,
associate this number with p
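The sketch below implements this idea in its usual form (a simplified illustration, not the original DBSCAN code): every unvisited core point starts a new cluster, the cluster is expanded through the neighborhoods of its core points, and whatever remains unreachable is labeled as noise (-1).

```python
import math
from collections import deque

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (its Eps-neighborhood)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)           # None = unvisited, NOISE = noise, >0 = cluster id
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:        # not a core point; tentatively noise
            labels[i] = NOISE
            continue
        cluster_id += 1                     # start a new cluster and expand it
        labels[i] = cluster_id
        queue = deque(neighbors)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:          # border point: density-reachable, claim it
                labels[j] = cluster_id
            if labels[j] is not None:
                continue                    # already assigned
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts: # j is a core point: keep expanding
                queue.extend(j_neighbors)
    return labels

points = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9), (5, 15)]
print(dbscan(points, eps=1.5, min_pts=3))   # two clusters plus one noise point
```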
DBSCAN
Arbitrary shape clusters found by DBSCAN
DBSCAN: Summary
Advantages:
Finds clusters of arbitrary shapes
Disadvantages:
Targets low dimensional spatial data
Hard to visualize for >2-dimensional data
Needs clever index to be scalable
How do we set the magic parameters?
Lecture Overview
Data Mining I: Decision Trees
Data Mining II: Clustering
Data Mining III: Association Analysis
Market Basket Analysis
Consider shopping cart filled with several
items
Market basket analysis tries to answer the
following questions:
Who makes purchases?
What do customers buy together?
In what order do customers purchase items?
Market Basket Analysis
Given:
A database of
customer transactions
Each transaction is a
set of items
Example:
Transaction with TID
111 contains items
{Pen, Ink, Milk, Juice}
TID CID Date Item Qty
111 201 5/1/99 Pen 2
111 201 5/1/99 Ink 1
111 201 5/1/99 Milk 3
111 201 5/1/99 Juice 6
112 105 6/3/99 Pen 1
112 105 6/3/99 Ink 1
112 105 6/3/99 Milk 1
113 106 6/5/99 Pen 1
113 106 6/5/99 Milk 1
114 201 7/1/99 Pen 2
114 201 7/1/99 Ink 2
114 201 7/1/99 Juice 4
Market Basket Analysis (Contd.)
Co-occurrences
80% of all customers purchase items X, Y and
Z together.
Association rules
60% of all customers who purchase X and Y
also buy Z.
Sequential patterns
60% of customers who first buy X also
purchase Y within three weeks.
Confidence and Support
We prune the set of all possible association
rules using two interestingness measures:
Confidence of a rule:
X => Y has confidence c if P(Y|X) = c
Support of a rule:
X => Y has support s if P(XY) = s
We can also define
Support of an itemset (a co-occurrence) XY:
XY has support s if P(XY) = s
Example
Examples:
{Pen} => {Milk}
Support: 75%
Confidence: 75%
{Ink} => {Pen}
Support: 75%
Confidence: 100%
TID CID Date Item Qty
111 201 5/1/99 Pen 2
111 201 5/1/99 Ink 1
111 201 5/1/99 Milk 3
111 201 5/1/99 Juice 6
112 105 6/3/99 Pen 1
112 105 6/3/99 Ink 1
112 105 6/3/99 Milk 1
113 106 6/5/99 Pen 1
113 106 6/5/99 Milk 1
114 201 7/1/99 Pen 2
114 201 7/1/99 Ink 2
114 201 7/1/99 Juice 4
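A short Python check of these numbers on the four transactions above (illustration only):

```python
transactions = [
    {"Pen", "Ink", "Milk", "Juice"},   # TID 111
    {"Pen", "Ink", "Milk"},            # TID 112
    {"Pen", "Milk"},                   # TID 113
    {"Pen", "Ink", "Juice"},           # TID 114
]

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs):
    return support(lhs | rhs) / support(lhs)

print(support({"Pen", "Milk"}), confidence({"Pen"}, {"Milk"}))   # 0.75 0.75
print(support({"Ink", "Pen"}), confidence({"Ink"}, {"Pen"}))     # 0.75 1.0
```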
Example
Find all itemsets with
support >= 75%?
TID CID Date Item Qty
111 201 5/1/99 Pen 2
111 201 5/1/99 Ink 1
111 201 5/1/99 Milk 3
111 201 5/1/99 Juice 6
112 105 6/3/99 Pen 1
112 105 6/3/99 Ink 1
112 105 6/3/99 Milk 1
113 106 6/5/99 Pen 1
113 106 6/5/99 Milk 1
114 201 7/1/99 Pen 2
114 201 7/1/99 Ink 2
114 201 7/1/99 Juice 4
Example
Can you find all
association rules with
support >= 50%?
TID CID Date Item Qty
111 201 5/1/99 Pen 2
111 201 5/1/99 Ink 1
111 201 5/1/99 Milk 3
111 201 5/1/99 Juice 6
112 105 6/3/99 Pen 1
112 105 6/3/99 Ink 1
112 105 6/3/99 Milk 1
113 106 6/5/99 Pen 1
113 106 6/5/99 Milk 1
114 201 7/1/99 Pen 2
114 201 7/1/99 Ink 2
114 201 7/1/99 Juice 4
Market Basket Analysis: Applications
Sample Applications
Direct marketing
Fraud detection for medical insurance
Floor/shelf planning
Web site layout
Cross-selling
Applications of Frequent Itemsets
Market Basket Analysis
Association Rules
Classification (especially: text, rare
classes)
Seeds for construction of Bayesian
Networks
Web log analysis
Collaborative filtering
Association Rule Algorithms
More abstract problem redux
Breadth-first search
Depth-first search
Problem Redux
Abstract:
A set of items {1, 2, ..., k}
A database of transactions
(itemsets) D = {T1, T2, ..., Tn},
Tj subset of {1, 2, ..., k}
GOAL:
Find all itemsets that appear in at
least x transactions
(appear in == are subsets of)
I subset T: T supports I
For an itemset I, the number of
transactions it appears in is called
the support of I.
x is called the minimum support.
Concrete:
I = {milk, bread, cheese, ...}
D = { {milk,bread,cheese},
{bread,cheese,juice}, ... }
GOAL:
Find all itemsets that appear in at
least 1000 transactions
{milk,bread,cheese} supports
{milk,bread}
Problem Redux (Contd.)
Definitions:
An itemset is frequent if it is a
subset of at least x
transactions. (FI.)
An itemset is maximally
frequent if it is frequent and it
does not have a frequent
superset. (MFI.)
GOAL: Given x, find all frequent
(maximally frequent) itemsets
(to be stored in the FI (MFI)).
Obvious relationship:
MFI subset FI
Example:
D={ {1,2,3}, {1,2,3}, {1,2,3},
{1,2,4} }
Minimum support x = 3
{1,2} is frequent
{1,2,3} is maximal frequent
Support({1,2}) = 4
All maximal frequent itemsets:
{1,2,3}
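A brute-force check of this example in Python (illustrative; it enumerates every candidate itemset, which is only feasible for tiny item sets):

```python
from itertools import combinations

D = [{1, 2, 3}, {1, 2, 3}, {1, 2, 3}, {1, 2, 4}]
min_support = 3
items = sorted(set().union(*D))

def support(itemset):
    return sum(1 for t in D if itemset <= t)

# All frequent itemsets (FI).
FI = [set(c) for r in range(1, len(items) + 1)
             for c in combinations(items, r)
             if support(set(c)) >= min_support]

# Maximally frequent itemsets (MFI): frequent itemsets with no frequent superset.
MFI = [f for f in FI if not any(f < g for g in FI)]

print("support({1,2}) =", support({1, 2}))   # 4
print("FI :", FI)
print("MFI:", MFI)                           # [{1, 2, 3}]
```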
The Itemset Lattice
{}
{2} {1} {4} {3}
{1,2} {2,3} {1,3} {1,4} {2,4}
{1,2,3,4}
{1,2,3}
{3,4}
{1,2,4} {1,3,4} {2,3,4}
Frequent Itemsets
Frequent itemsets
Infrequent itemsets
{}
{2} {1} {4} {3}
{1,2} {2,3} {1,3} {1,4} {2,4}
{1,2,3,4}
{1,2,3}
{3,4}
{1,2,4} {1,3,4} {2,3,4}
Breadth First Search: 1-Itemsets
{}
{2} {1} {4} {3}
{1,2} {2,3} {1,3} {1,4} {2,4}
{1,2,3,4}
{1,2,3}
{3,4}
{1,2,4} {1,3,4} {2,3,4}
The Apriori Principle:
I infrequent => (I union {x}) infrequent
Infrequent
Frequent
Currently examined
Don't know
Breadth First Search: 2-Itemsets
{}
{2} {1} {4} {3}
{1,2} {2,3} {1,3} {1,4} {2,4}
{1,2,3,4}
{1,2,3}
{3,4}
{1,2,4} {1,3,4} {2,3,4}
Infrequent
Frequent
Currently examined
Don't know
Breadth First Search: 3-Itemsets
{}
{2} {1} {4} {3}
{1,2} {2,3} {1,3} {1,4} {2,4}
{1,2,3,4}
{1,2,3}
{3,4}
{1,2,4} {1,3,4} {2,3,4}
Infrequent
Frequent
Currently examined
Don't know
Breadth First Search: Remarks
We prune infrequent itemsets and avoid counting them
To find an itemset with k items, we need to
count all 2^k subsets
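A compact level-wise (Apriori-style) sketch of this breadth-first search in Python (for illustration; candidate generation is deliberately simplified):

```python
from itertools import combinations

def apriori(D, min_support):
    """D: list of transactions (sets). Returns {frozenset: support count} for all FI."""
    items = sorted(set().union(*D))
    frequent = {}
    # Level 1: frequent single items.
    level = [frozenset([i]) for i in items
             if sum(1 for t in D if i in t) >= min_support]
    k = 1
    while level:
        for c in level:
            frequent[c] = sum(1 for t in D if c <= t)
        # Generate (k+1)-item candidates by unioning frequent k-itemsets ...
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # ... and apply the Apriori principle: drop candidates with an infrequent k-subset.
        level = [c for c in candidates
                 if all(frozenset(s) in frequent for s in combinations(c, k))
                 and sum(1 for t in D if c <= t) >= min_support]
        k += 1
    return frequent

D = [{1, 2, 3}, {1, 2, 3}, {1, 2, 3}, {1, 2, 4}]
print(apriori(D, min_support=3))
```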
Depth First Search (1)
{}
{2} {1} {4} {3}
{1,2} {2,3} {1,3} {1,4} {2,4}
{1,2,3,4}
{1,2,3}
{3,4}
{1,2,4} {1,3,4} {2,3,4}
Infrequent
Frequent
Currently examined
Don't know
Depth First Search (2)
{}
{2} {1} {4} {3}
{1,2} {2,3} {1,3} {1,4} {2,4}
{1,2,3,4}
{1,2,3}
{3,4}
{1,2,4} {1,3,4} {2,3,4}
Infrequent
Frequent
Currently examined
Don't know
Depth First Search (3)
{}
{2} {1} {4} {3}
{1,2} {2,3} {1,3} {1,4} {2,4}
{1,2,3,4}
{1,2,3}
{3,4}
{1,2,4} {1,3,4} {2,3,4}
Infrequent
Frequent
Currently examined
Don't know
Depth First Search (4)
{}
{2} {1} {4} {3}
{1,2} {2,3} {1,3} {1,4} {2,4}
{1,2,3,4}
{1,2,3}
{3,4}
{1,2,4} {1,3,4} {2,3,4}
Infrequent
Frequent
Currently examined
Don't know
Depth First Search (5)
{}
{2} {1} {4} {3}
{1,2} {2,3} {1,3} {1,4} {2,4}
{1,2,3,4}
{1,2,3}
{3,4}
{1,2,4} {1,3,4} {2,3,4}
Infrequent
Frequent
Currently examined
Don't know
Depth First Search: Remarks
We prune frequent itemsets and avoid counting
them (works only for maximal frequent
itemsets)
To find an itemset with k items, we need to
count k prefixes
BFS Versus DFS
Breadth First Search
Prunes infrequent
itemsets
Uses anti-
monotonicity: Every
superset of an
infrequent itemset is
infrequent
Depth First Search
Prunes frequent
itemsets
Uses monotonicity:
Every subset of a
frequent itemset is
frequent
Extensions
Imposing constraints
Only find rules involving the dairy department
Only find rules involving expensive products
Only find expensive rules
Only find rules with whiskey on the right hand side
Only find rules with milk on the left hand side
Hierarchies on the items
Calendars (every Sunday, every 1st of the month)
Itemset Constraints
Definition:
A constraint is an arbitrary property of itemsets.
Examples:
The itemset has support greater than 1000.
No element of the itemset costs more than $40.
The items in the set average more than $20.
Goal:
Find all itemsets satisfying a given constraint P.
Solution:
If P is a support constraint, use the Apriori Algorithm.
Negative Pruning in Apriori
{}
{2} {1} {4} {3}
{2,3} {1,3} {1,4} {2,4}
{1,2,3,4}
{1,2,3}
{3,4}
{1,2,4} {1,3,4} {2,3,4}
{1,2}
Frequent
Infrequent
Currently examined
Don't know
Frequent
Infrequent
Currently examined
Don't know
Negative Pruning in Apriori
{}
{2} {1} {4} {3}
{2,3} {1,3} {1,4} {2,4}
{1,2,3,4}
{1,2,3}
{3,4}
{1,2,4} {1,3,4} {2,3,4}
{1,2}
Negative Pruning in Apriori
{}
{2} {1} {4} {3}
{2,3} {1,3} {1,4} {2,4}
{1,2,3,4}
{1,2,3}
{3,4}
{1,2,4} {1,3,4} {2,3,4}
{1,2}
Frequent
Infrequent
Currently examined
Don't know
Two Trivial Observations
Apriori can be applied to any constraint P that is
antimonotone.
Start from the empty set.
Prune supersets of sets that do not satisfy P.
Itemset lattice is a boolean algebra, so Apriori
also applies to a monotone Q.
Start from set of all items instead of empty set.
Prune subsets of sets that do not satisfy Q.
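A tiny illustration of the two observations (the item names, prices, and constraint functions below are made up for the example): P is antimonotone, so supersets of a P-violator can be pruned while growing itemsets from {}; Q is monotone, so subsets of a Q-violator can be pruned while shrinking from the full item set.

```python
prices = {"pen": 2, "ink": 5, "milk": 3, "whiskey": 45}

def P(itemset):
    """Antimonotone: 'no element costs more than $40' (violated -> all supersets violated)."""
    return all(prices[i] <= 40 for i in itemset)

def Q(itemset):
    """Monotone: 'some element costs less than $4' (satisfied -> all supersets satisfied)."""
    return any(prices[i] < 4 for i in itemset)

def extend(itemset, item):
    """Bottom-up growth step with antimonotone pruning."""
    candidate = itemset | {item}
    return candidate if P(candidate) else None    # prune: no superset can satisfy P

print(extend(frozenset({"pen"}), "whiskey"))      # None -> pruned by P
print(Q(frozenset({"ink", "whiskey"})))           # False -> no subset satisfies Q either
```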
Negative Pruning a Monotone Q
{}
{2} {1} {4} {3}
{2,3} {1,3} {1,4} {2,4}
{1,2,3,4}
{1,2,3}
{3,4}
{1,2,4} {1,3,4} {2,3,4}
{1,2}
Satisfies Q
Doesn't satisfy Q
Currently examined
Don't know
Positive Pruning in Apriori
{}
{2} {1} {4} {3}
{2,3} {1,3} {1,4} {2,4}
{1,2,3,4}
{1,2,3}
{3,4}
{1,2,4} {1,3,4} {2,3,4}
{1,2}
Frequent
Infrequent
Currently examined
Don't know
{2,3}
Positive Pruning in Apriori
{}
{2} {1} {4} {3}
{1,3} {1,4} {2,4}
{1,2,3,4}
{1,2,3}
{3,4}
{1,2,4} {1,3,4} {2,3,4}
{1,2}
Frequent
Infrequent
Currently examined
Don't know
Positive Pruning in Apriori
{}
{2} {1} {4} {3}
{2,3} {1,3} {1,4} {2,4}
{1,2,3,4}
{1,2,3}
{3,4}
{1,2,4} {1,3,4} {2,3,4}
{1,2}
Frequent
Infrequent
Currently examined
Don't know
Classifying Constraints
Antimonotone:
support(I) > 1000
max(I) < 100
Neither:
average(I) > 50
variance(I) < 2
3 < sum(I) < 50
Monotone:
sum(I) > 3
min(I) < 40
These are the
constraints we
really want.
The Problem Redux
Current Techniques:
Approximate the difficult constraints.
Monotone approximations are common.
New Goal:
Given constraints P and Q, with P antimonotone
(support) and Q monotone (statistical constraint).
Find all itemsets that satisfy both P and Q.
Recent solutions:
Newer algorithms can handle both P and Q
Conceptual Illustration of Problem
[Figure: the itemset lattice between {} and the full item set D; itemsets satisfying the antimonotone constraint P form a downward-closed region (all subsets satisfy P), itemsets satisfying the monotone constraint Q form an upward-closed region (all supersets satisfy Q), and the answer set is their intersection, satisfying P & Q]
Applications
Spatial association rules
Web mining
Market basket analysis
User/customer profiling
Extensions: Sequential Patterns