All Data Mining Chapters
Types of Data
Data Quality
Data Preprocessing
Attributes and Objects
– An attribute is also known as a variable, field, characteristic, dimension, or feature
– An object is also known as a record, point, case, sample, entity, or instance
[Figure: the same line segments measured with two different scales. One scale preserves only the ordering property of length; the other preserves both the ordering and additivity properties of length.]
Types of Attributes
– Nominal: examples include ID numbers, eye color, gender {male, female}
– Ordinal: examples include rankings, grades, height in {tall, medium, short}
– Interval: examples include calendar dates, temperatures in Celsius or Fahrenheit
– Ratio: examples include temperature in Kelvin, length, counts, elapsed time
Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a
collection of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete
attributes
Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and
represented using a finite number of digits.
– Continuous attributes are typically represented as floating-
point variables.
Important Characteristics of Data
– Sparsity
◆ Only presence counts
– Resolution
◆ Patterns depend on the scale
– Size
◆ Type of analysis may depend on size of data
Document Data
– Each document is represented as a term vector; each term is an attribute and its value is the number of times the term appears in the document.

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1    3     0      5     0     2      6     0     2      0        2
Document 2    0     7      0     2     1      0     0     3      0        0
Document 3    0     1      0     0     1      2     2     0      3        0
Transaction Data
– A special type of record data, where each transaction involves a set of items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
Graph Data
– Examples: generic graphs, molecular structures, webpages with hyperlinks
[Figure: a small example graph with numbered nodes.]
Sequences of transactions
[Figure: an example sequence of transactions ordered in time; each element of the sequence is a set of items/events.]
Ordered Data
– Genomic sequence data:
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Spatio-Temporal Data
[Figure: average monthly temperature of land and ocean.]
Missing Values
– Reasons for missing values include information that was not collected and attributes that are not applicable to all objects.

Duplicate Data
– Data set may include data objects that are duplicates, or almost duplicates, of one another
– Examples: same person with multiple email addresses
Data cleaning
– Process of dealing with duplicate data issues
Similarity measure
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1]
Dissimilarity measure
– Numerical measure of how different two data objects
are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
Proximity refers to either a similarity or a dissimilarity
Similarity/Dissimilarity for Simple Attributes
– The appropriate similarity and dissimilarity measures depend on the attribute type (nominal, ordinal, interval, or ratio).
Euclidean Distance

d(x, y) = sqrt( Σ_k (x_k − y_k)² )

where x_k and y_k are the k-th attributes (components) of data objects x and y.

Example points:
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

Distance Matrix:
      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0
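As a quick check (not part of the original slides), the distance matrix above can be reproduced with a few lines of NumPy:

```python
import numpy as np

# Points from the example: p1(0,2), p2(2,0), p3(3,1), p4(5,1)
pts = np.array([[0, 2], [2, 0], [3, 1], [5, 1]], dtype=float)

# Pairwise Euclidean distances: d(x, y) = sqrt(sum_k (x_k - y_k)^2)
diff = pts[:, None, :] - pts[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))

print(np.round(dist, 3))   # matches the 4x4 distance matrix above
```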
Minkowski Distance

A generalization of Euclidean distance:
d(x, y) = ( Σ_k |x_k − y_k|^r )^(1/r)

– r = 1: City block (Manhattan, L1 norm) distance
– r = 2: Euclidean (L2 norm) distance
– r → ∞: supremum (Lmax or L∞ norm) distance; the maximum difference between any attribute of the objects

Example points:
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

L1    p1  p2  p3  p4
p1    0   4   4   6
p2    4   0   2   4
p3    4   2   0   2
p4    6   4   2   0

L2    p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0

L∞    p1  p2  p3  p4
p1    0   2   3   5
p2    2   0   1   3
p3    3   1   0   2
p4    5   3   2   0
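The L1, L2, and L∞ matrices above follow from the same four points by changing r; a small NumPy sketch (illustrative only):

```python
import numpy as np

pts = np.array([[0, 2], [2, 0], [3, 1], [5, 1]], dtype=float)
absdiff = np.abs(pts[:, None, :] - pts[None, :, :])

L1   = absdiff.sum(axis=-1)                  # r = 1: city-block distance
L2   = np.sqrt((absdiff ** 2).sum(axis=-1))  # r = 2: Euclidean distance
Linf = absdiff.max(axis=-1)                  # r -> infinity: supremum distance

print(L1)                  # matches the L1 matrix above
print(np.round(L2, 3))     # matches the L2 matrix above
print(Linf)                # matches the L-infinity matrix above
```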
Mahalanobis Distance

mahalanobis(x, y) = (x − y)ᵀ Σ⁻¹ (x − y)

where Σ is the covariance matrix of the input data.

Example:
Covariance matrix:
Σ = [ 0.3  0.2
      0.2  0.3 ]

A = (0.5, 0.5), B = (0, 1), C = (1.5, 1.5)
Mahal(A, B) = 5
Mahal(A, C) = 4

Since Σ = [0.3 0.2; 0.2 0.3],
Σ⁻¹ = 1/(0.09 − 0.04) × [0.3 −0.2; −0.2 0.3] = 20 × [0.3 −0.2; −0.2 0.3] = [6 −4; −4 6]

Mahal(A, B) = [0.5  −0.5] [6 −4; −4 6] [0.5; −0.5] = 5
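The worked example can be verified numerically; this sketch simply inverts the given covariance matrix and evaluates the quadratic form:

```python
import numpy as np

Sigma = np.array([[0.3, 0.2],
                  [0.2, 0.3]])
Sigma_inv = np.linalg.inv(Sigma)   # approximately [[6, -4], [-4, 6]]

A = np.array([0.5, 0.5])
B = np.array([0.0, 1.0])
C = np.array([1.5, 1.5])

def mahal(x, y):
    """Squared Mahalanobis distance (x - y)^T Sigma^-1 (x - y), as used on the slide."""
    d = x - y
    return d @ Sigma_inv @ d

print(mahal(A, B))   # ≈ 5.0
print(mahal(A, C))   # ≈ 4.0
```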
Common Properties of a Distance
Distances, such as the Euclidean distance, have some well-known properties:
1. d(x, y) ≥ 0 for all x and y, and d(x, y) = 0 only if x = y (positivity)
2. d(x, y) = d(y, x) for all x and y (symmetry)
3. d(x, z) ≤ d(x, y) + d(y, z) for all x, y, and z (triangle inequality)
A distance that satisfies these properties is called a metric.

Similarity Between Binary Vectors: SMC versus Jaccard
x = 1000000000
y = 0000001001
f01 = 2 (number of attributes where x is 0 and y is 1), f10 = 1, f00 = 7, f11 = 0
SMC = (f11 + f00) / (f01 + f10 + f11 + f00) = 7/10 = 0.7
Jaccard (J) = f11 / (f01 + f10 + f11) = 0/3 = 0
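A short illustrative computation of SMC and Jaccard for the two vectors above (not from the slides):

```python
# Simple Matching Coefficient (SMC) and Jaccard for the binary vectors above
x = [1,0,0,0,0,0,0,0,0,0]
y = [0,0,0,0,0,0,1,0,0,1]

f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)

smc = (f11 + f00) / (f01 + f10 + f11 + f00)   # 7/10 = 0.7
jaccard = f11 / (f01 + f10 + f11)             # 0/3  = 0.0
print(smc, jaccard)
```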
Cosine Similarity

If d1 and d2 are two document vectors, then
cos(d1, d2) = <d1, d2> / (||d1|| ||d2||),
where <d1, d2> is the vector dot product and ||d|| is the length (Euclidean norm) of vector d.

Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
<d1, d2> = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449
cos(d1, d2) = 5 / (6.481 × 2.449) = 0.315
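The cosine value can be confirmed with NumPy (illustrative check only):

```python
import numpy as np

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0], dtype=float)
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2], dtype=float)

cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 4))   # 0.315
```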
Visually Evaluating Correlation
[Figure: scatter plots showing correlation values ranging from −1 to 1.]

Drawback of Correlation
x = (−3, −2, −1, 0, 1, 2, 3), y = (9, 4, 1, 0, 1, 4, 9), i.e., y_i = x_i²
mean(x) = 0, mean(y) = 4
std(x) = 2.16, std(y) = 3.74
The correlation between x and y is 0, even though y is completely determined by x, because the relationship is nonlinear.
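A quick check of this example with NumPy (the x values are the ones listed above):

```python
import numpy as np

x = np.array([-3, -2, -1, 0, 1, 2, 3], dtype=float)
y = x ** 2                      # perfect but nonlinear relationship y_i = x_i^2

print(x.mean(), y.mean())                                   # 0.0 4.0
print(round(x.std(ddof=1), 2), round(y.std(ddof=1), 2))     # 2.16 3.74
print(np.corrcoef(x, y)[0, 1])  # 0.0 -- correlation misses the nonlinear dependence
```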
Domain of application
– Similarity measures tend to be specific to the type of
attribute and data
– Record data, images, graphs, sequences, 3D-protein
structure, etc. tend to have different measures
However, one can identify various properties that a proximity measure should ideally have
– Symmetry is a common one
– Tolerance to noise and outliers is another
– Ability to find more types of patterns?
– Many others possible
The measure must be applicable to the data and
produce results that agree with domain knowledge
Information Based Measures
For
– a variable (event), X,
– with n possible values (outcomes), x1, x2 …, xn
– each outcome having probability, p1, p2 …, pn
– the entropy of X, H(X), is given by

H(X) = − Σ_{i=1}^{n} p_i log₂ p_i
Suppose we have
– a number of observations (m) of some attribute, X,
e.g., the hair color of students in the class,
– where there are n different possible values
– and the number of observations in the i-th category is m_i
– Then, for this sample,

H(X) = − Σ_{i=1}^{n} (m_i / m) log₂ (m_i / m)
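As an illustration (the hair-color counts below are hypothetical, not from the slides), sample entropy can be computed directly from the category counts:

```python
import math

def entropy(counts):
    """H = -sum (m_i/m) * log2(m_i/m) over the non-empty categories."""
    m = sum(counts)
    return -sum((mi / m) * math.log2(mi / m) for mi in counts if mi > 0)

# Hypothetical hair-color counts for 20 students
print(entropy([5, 5, 5, 5]))          # 2.0 (uniform over 4 categories = log2 4)
print(round(entropy([10, 6, 4]), 3))  # ≈ 1.485 (less uniform, lower entropy)
```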
Using weights to combine the similarities of heterogeneous attributes:

– similarity(x, y) = Σ_{k=1}^{n} ω_k δ_k s_k(x, y) / Σ_{k=1}^{n} ω_k δ_k

where s_k(x, y) is the similarity for the k-th attribute, ω_k is the weight of that attribute, and δ_k is 0 if the k-th attribute is an asymmetric attribute with two 0 values or a missing value, and 1 otherwise.
Data Preprocessing

Aggregation
Sampling
Discretization and Binarization
Attribute Transformation
Dimensionality Reduction
Feature subset selection
Feature creation
Aggregation
– Combining two or more attributes (or objects) into a single attribute (or object)

Purpose
– Data reduction
◆ Reduce the number of attributes or objects
– Change of scale
◆ Cities aggregated into regions, states, countries, etc.
◆ Days aggregated into weeks, months, or years
– More “stable” data
◆ Aggregated data tends to have less variability
[Figure: data consists of four groups of points and two outliers; the data is one-dimensional, but a random y component is added to reduce overlap.]
[Figure: histogram of petal length counts (values 0 to 8).]
Net Primary Production (NPP) is a measure of plant growth used by ecosystem scientists.
Curse of Dimensionality
– When dimensionality increases, data becomes increasingly sparse in the space that it occupies
Dimensionality Reduction

Purpose:
– Avoid curse of dimensionality
– Reduce amount of time and memory required by data
mining algorithms
– Allow data to be more easily visualized
– May help to eliminate irrelevant features or reduce
noise
Techniques
– Principal Components Analysis (PCA)
– Singular Value Decomposition
– Others: supervised and non-linear techniques
Dimensionality Reduction: PCA
– Goal is to find a projection that captures the largest amount of variation in the data
– The principal components are the eigenvectors of the covariance matrix of the data; the data is projected onto the eigenvectors with the largest eigenvalues
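A minimal sketch of this idea, using synthetic data (none of the numbers come from the slides): compute the covariance matrix, take its eigenvectors, and project onto the leading component:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 2-D data with most of its variance along one direction
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
X = X - X.mean(axis=0)                      # center the data

cov = np.cov(X, rowvar=False)               # covariance matrix of the attributes
eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvectors = principal components
order = np.argsort(eigvals)[::-1]           # sort by decreasing variance
components = eigvecs[:, order]

X_reduced = X @ components[:, :1]           # project onto the first principal component
print(eigvals[order] / eigvals.sum())       # fraction of variance captured by each PC
```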
Metrics for Performance Evaluation

Confusion Matrix:

                        PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL     Class=Yes    a (TP)       b (FN)
CLASS      Class=No     c (FP)       d (TN)

a: TP (true positive)
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
Problem with Accuracy
– Consider a 2-class problem with a highly imbalanced class distribution: a model that predicts every record as the majority class can achieve very high accuracy while never detecting the rare class. Measures that focus on the positive (rare) class are therefore needed.

                        PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL     Class=Yes    a            b
CLASS      Class=No     c            d

Precision (p) = a / (a + c)
Recall (r) = a / (a + b)
F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)
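These measures follow directly from the confusion-matrix counts; a small helper function (illustrative only):

```python
def classification_measures(a, b, c, d):
    """a=TP, b=FN, c=FP, d=TN, following the confusion matrix above."""
    accuracy  = (a + d) / (a + b + c + d)
    precision = a / (a + c)
    recall    = a / (a + b)
    f_measure = 2 * a / (2 * a + b + c)
    return accuracy, precision, recall, f_measure

# Imbalanced example from the next slide: 10 TP, 0 FN, 10 FP, 980 TN
print(classification_measures(10, 0, 10, 980))
# accuracy 0.99, precision 0.5, recall 1.0, F ≈ 0.67
```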
Alternative Measures

Example 1:
                        PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL     Class=Yes    10           0
CLASS      Class=No     10           980

Precision (p) = 10 / (10 + 10) = 0.5
Recall (r) = 10 / (10 + 0) = 1
F-measure (F) = (2 × 1 × 0.5) / (1 + 0.5) = 0.67
Accuracy = 990 / 1000 = 0.99

Example 2:
                        PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL     Class=Yes    1            9
CLASS      Class=No     0            990

Precision (p) = 1 / (1 + 0) = 1
Recall (r) = 1 / (1 + 9) = 0.1
F-measure (F) = (2 × 0.1 × 1) / (1 + 0.1) = 0.18
Accuracy = 991 / 1000 = 0.991
Alternative Measures (continued)

                        PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL     Class=Yes    40           10
CLASS      Class=No     10           40

Precision (p) = 0.8, Recall (r) = 0.8, F-measure (F) = 0.8, Accuracy = 0.8

                        PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL     Class=Yes    40           10
CLASS      Class=No     1000         4000

Precision (p) ≈ 0.04, TPR = Recall (r) = 0.8, FPR = 0.2, F-measure (F) ≈ 0.08, Accuracy ≈ 0.8

In general:
                  PREDICTED CLASS
                  Yes    No
ACTUAL    Yes     TP     FN
CLASS     No      FP     TN

TPR (true positive rate) = TP / (TP + FN)
FPR (false positive rate) = FP / (FP + TN)
                        PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL     Class=Yes    10           40
CLASS      Class=No     10           40

Precision (p) = 0.5, TPR = Recall (r) = 0.2, FPR = 0.2

                        PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL     Class=Yes    25           25
CLASS      Class=No     25           25

Precision (p) = 0.5, TPR = Recall (r) = 0.5, FPR = 0.5
ROC (Receiver Operating Characteristic) Curve
– Plots TPR (y-axis) against FPR (x-axis)

(TPR, FPR):
– (0,0): declare everything to be negative class
– (1,1): declare everything to be positive class
– (1,0): ideal

Diagonal line:
– Random guessing
– Below diagonal line:
  ◆ prediction is opposite of the true class
[Figure: a decision tree with splits on x1 and x2; each leaf is labeled with the fraction of positive instances it contains, which serves as a score that can be thresholded to produce the ROC curve.]
At threshold t:
TPR=0.5, FNR=0.5, FPR=0.12, TNR=0.88
Using ROC for Model Comparison

Neither model consistently outperforms the other:
– M1 is better for small FPR
– M2 is better for large FPR
Constructing an ROC Curve (5 positive and 5 negative instances; counts at each threshold):

TP   5    4    4    3    3    3    3    2    2    1    0
FP   5    5    4    4    3    2    1    1    0    0    0
TN   0    0    1    1    2    3    4    4    5    5    5
FN   0    1    1    2    2    2    2    3    3    4    5
TPR  1    0.8  0.8  0.6  0.6  0.6  0.6  0.4  0.4  0.2  0
FPR  1    1    0.8  0.8  0.6  0.4  0.2  0.2  0    0    0

ROC Curve: plot the (FPR, TPR) pairs.
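The TPR and FPR rows can be recomputed from the counts in the table (a check, not from the slides):

```python
# Counts at each threshold, copied from the table above (5 positives, 5 negatives)
TP = [5, 4, 4, 3, 3, 3, 3, 2, 2, 1, 0]
FP = [5, 5, 4, 4, 3, 2, 1, 1, 0, 0, 0]
TN = [0, 0, 1, 1, 2, 3, 4, 4, 5, 5, 5]
FN = [0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 5]

TPR = [tp / (tp + fn) for tp, fn in zip(TP, FN)]   # true positive rate (recall)
FPR = [fp / (fp + tn) for fp, tn in zip(FP, TN)]   # false positive rate

print(TPR)   # 1.0, 0.8, 0.8, 0.6, 0.6, 0.6, 0.6, 0.4, 0.4, 0.2, 0.0
print(FPR)   # 1.0, 1.0, 0.8, 0.8, 0.6, 0.4, 0.2, 0.2, 0.0, 0.0, 0.0
# Plotting FPR (x) against TPR (y) gives the ROC curve.
```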
Cost-Sensitive Classification
– Misclassifying rare class as majority class is more expensive than misclassifying majority as rare class
Sampling-based approaches can also be used to handle class imbalance.

Count matrix f(i, j): number of records of actual class i predicted as class j

                        PREDICTED CLASS
                        Class=Yes      Class=No
ACTUAL     Class=Yes    f(Yes, Yes)    f(Yes, No)
CLASS      Class=No     f(No, Yes)     f(No, No)

Cost matrix C(i, j): cost of misclassifying a class i example as class j

                        PREDICTED CLASS
                        Class=Yes      Class=No
ACTUAL     Class=Yes    C(Yes, Yes)    C(Yes, No)
CLASS      Class=No     C(No, Yes)     C(No, No)

Cost = Σ_{i,j} C(i, j) × f(i, j)
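A tiny sketch of the total-cost computation; the count and cost matrices below are hypothetical:

```python
import numpy as np

# Hypothetical counts f(i, j): rows = actual (Yes, No), columns = predicted (Yes, No)
f = np.array([[40,   10],
              [1000, 4000]])
# Hypothetical cost matrix C(i, j): missing a rare "Yes" is 25x more costly
C = np.array([[0, 25],
              [1,  0]])

total_cost = (C * f).sum()   # Cost = sum_{i,j} C(i,j) * f(i,j)
print(total_cost)            # 10*25 + 1000*1 = 1250
```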
Instance-Based Learning

Basic idea:
– If it walks like a duck, quacks like a duck, then it's probably a duck
– To classify a test record, compute its distance to the training records, identify its k nearest neighbors, and use their class labels (e.g., by majority vote) to determine the class label of the test record

A common choice of distance is the Euclidean distance:
d(x, y) = sqrt( Σ_i (x_i − y_i)² )

The Euclidean measure can produce counter-intuitive results in high dimensions, e.g.:
111111111110 vs 011111111111 → d = 1.4142
000000000001 vs 100000000000 → d = 1.4142
Both pairs have the same distance, even though the first pair shares many 1s and the second pair shares none.

Nearest neighbor classifiers are local classifiers.
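A minimal k-nearest-neighbor classifier in this spirit (Euclidean distance, majority vote); the training data and labels are made up for illustration:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    """Classify x_test by majority vote among its k nearest training records."""
    dists = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))   # Euclidean distances
    nearest = np.argsort(dists)[:k]                           # indices of k closest records
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.5, 4.5]])
y_train = ["duck", "duck", "goose", "goose"]
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))   # "duck"
```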
Bayesian Classifiers

Given a record with attributes (X1, X2, …, Xd):
– Goal is to predict class Y
– Specifically, we want to find the value of Y that maximizes P(Y | X1, X2, …, Xd)
– Can we estimate P(Y | X1, X2, …, Xd) directly from data?

Approach:
– Compute the posterior probability P(Y | X1, X2, …, Xd) using Bayes' theorem:

P(Y | X1 X2 … Xd) = P(X1 X2 … Xd | Y) P(Y) / P(X1 X2 … Xd)

Example training data (categorical: Refund, Marital Status; continuous: Taxable Income; class: Evade):

Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
Class-conditional probabilities are estimated from the data, e.g., P(Refund=Yes | Yes) = 0.

For a continuous attribute, naïve Bayes can assume a normal distribution:
P(Xi | Yj) = (1 / sqrt(2π σij²)) exp( −(Xi − μij)² / (2 σij²) )
For example, estimating P(Income = 120 | No) uses the sample mean of Income for class No (110) in the exponent, giving a term of the form exp( −(120 − 110)² / (2σ²) ).
Naïve Bayes Example

Name            Give Birth  Can Fly  Live in Water  Have Legs  Class
human           yes         no       no             yes        mammals
python          no          no       no             no         non-mammals
salmon          no          no       yes            no         non-mammals
whale           yes         no       yes            no         mammals
frog            no          no       sometimes      yes        non-mammals
komodo          no          no       no             yes        non-mammals
bat             yes         yes      no             yes        mammals
pigeon          no          yes      no             yes        non-mammals
cat             yes         no       no             yes        mammals
leopard shark   yes         no       yes            no         non-mammals
turtle          no          no       sometimes      yes        non-mammals
penguin         no          no       sometimes      yes        non-mammals
porcupine       yes         no       no             yes        mammals
eel             no          no       yes            no         non-mammals
salamander      no          no       sometimes      yes        non-mammals
gila monster    no          no       no             yes        non-mammals
platypus        no          no       no             yes        mammals
owl             no          yes      no             yes        non-mammals
dolphin         yes         no       yes            no         mammals
eagle           no          yes      no             yes        non-mammals

A: attributes of the test record (Give Birth = yes, Can Fly = no, Live in Water = yes, Have Legs = no)
M: mammals, N: non-mammals

P(A | M) = (6/7) × (6/7) × (2/7) × (2/7) = 0.06
P(A | N) = (1/13) × (10/13) × (3/13) × (4/13) = 0.0042
P(A | M) P(M) = 0.06 × (7/20) = 0.021
P(A | N) P(N) = 0.0042 × (13/20) = 0.0027

Since P(A | M) P(M) > P(A | N) P(N), the test record is classified as a mammal.
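The class-conditional products above can be checked with a few lines of arithmetic (only the counts from the example are used):

```python
# Naive Bayes for the test record A = (Give Birth=yes, Can Fly=no, Live in Water=yes, Have Legs=no)
# Counts taken from the table: 7 mammals (M), 13 non-mammals (N)
p_A_given_M = (6/7) * (6/7) * (2/7) * (2/7)       # ≈ 0.06
p_A_given_N = (1/13) * (10/13) * (3/13) * (4/13)  # ≈ 0.0042

p_M, p_N = 7/20, 13/20
print(p_A_given_M * p_M)   # ≈ 0.021
print(p_A_given_N * p_N)   # ≈ 0.0027  -> classify A as a mammal
```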
Bayesian Belief Networks
[Figures: example directed acyclic graphs. In the first, D is the parent of C, A is a child of C, B is a descendant of D, and D is an ancestor of A. The second shows a naïve Bayes structure over attributes X1, X2, X3, X4, …, Xd. The third relates Exercise and Diet to Blood Pressure and Chest Pain.]
Rule-Based Classifier
– Classifies records using a collection of "if…then…" rules of the form (Condition) → y, where the condition is the rule antecedent and the class label y is the consequent of a rule
– Example rule: (Status=Single) → No, with Coverage = 40% and Accuracy = 50%

Example test records to be classified by a rule set:

Name            Blood Type  Give Birth  Can Fly  Live in Water  Class
hawk            warm        no          yes      no             ?
grizzly bear    warm        yes         no       no             ?
lemur           warm        yes         no       no             ?
turtle          cold        no          no       sometimes      ?
dogfish shark   cold        yes         no       yes            ?
Characteristics of Rule Sets: Strategy 1
Mutually exclusive rules
– Classifier contains mutually exclusive rules if the rules are independent of each other
– Each record is covered by at most one rule
Exhaustive rules
– Classifier has exhaustive coverage if it accounts for every possible combination of attribute values
– Each record is covered by at least one rule
Characteristics of Rule Sets: Strategy 2
– Rules are not mutually exclusive: a record may trigger more than one rule
– Rules are not exhaustive: a record may not trigger any rule

Name     Blood Type  Give Birth  Can Fly  Live in Water  Class
turtle   cold        no          no       sometimes      ?
Rule Ordering Schemes
Rule-based ordering
– Individual rules are ranked based on their quality
Class-based ordering
– Rules that belong to the same class appear together
Direct Method:
◆ Extract rules directly from data
◆ Examples: RIPPER, CN2, Holte’s 1R
Indirect Method:
◆ Extract rules from other classification models (e.g., decision trees, neural networks, etc.)
◆ Examples: C4.5rules
[Figure: example of sequential covering; rule R1 is grown first, then rule R2 is grown to cover the remaining positive examples.]

Rule Growing

Two common strategies:
– General-to-specific: start from an empty rule {} and greedily add conjuncts. In the example, adding Refund=No gives coverage (Yes: 3, No: 4); extending that rule with Status=Single gives (Yes: 2, No: 1), with Status=Divorced (Yes: 1, No: 0), with Status=Married (Yes: 0, No: 3), …, and with Income>80K (Yes: 3, No: 1). The rule grown is Refund=No, Status=Single → (Class=Yes).
– Specific-to-general: start from a specific rule such as Refund=No, Status=Single, Income=85K → (Class=Yes) or Refund=No, Status=Single, Income=90K → (Class=Yes), and generalize it by removing conjuncts.
– FOIL's information gain (used, e.g., to decide whether extending rule R0 into rule R1 is worthwhile):

Gain(R0, R1) = p1 × [ log2( p1 / (p1 + n1) ) − log2( p0 / (p0 + n0) ) ]

where p0, n0 are the numbers of positive and negative instances covered by R0, and p1, n1 are the numbers covered by R1.
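FOIL's gain can be evaluated directly from coverage counts; the numbers below are hypothetical:

```python
import math

def foil_gain(p0, n0, p1, n1):
    """Gain(R0, R1) = p1 * [log2(p1/(p1+n1)) - log2(p0/(p0+n0))]."""
    return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

# Hypothetical example: R0 covers 3 positives / 4 negatives;
# after adding a conjunct, R1 covers 2 positives / 1 negative.
print(round(foil_gain(p0=3, n0=4, p1=2, n1=1), 3))   # ≈ 1.275, a positive gain
```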
[Figure: indirect method; a decision tree (root P splitting No/Yes into subtrees Q and R) is converted into an equivalent rule set.]
Chapter 5
Association Analysis: Basic Concepts
Market-Basket transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!
Support (s) of a rule X → Y is the fraction of transactions that contain both X and Y; confidence (c) measures how often items in Y appear in transactions that contain X.

Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
Computationally prohibitive!
Computational Complexity
Given d unique items:
– Total number of itemsets = 2^d
– Total number of possible association rules:

R = Σ_{k=1}^{d−1} [ C(d, k) × Σ_{j=1}^{d−k} C(d−k, j) ] = 3^d − 2^(d+1) + 1

For d = 6, R = 602 rules.
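A quick check of the closed form for d = 6 (which gives 602 rules):

```python
from math import comb

def num_rules(d):
    """R = sum_{k=1}^{d-1} C(d, k) * sum_{j=1}^{d-k} C(d-k, j) = 3^d - 2^(d+1) + 1."""
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

print(num_rules(6), 3**6 - 2**7 + 1)   # 602 602
```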
Observations:
• All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but
can have different confidence
• Thus, we may decouple the support and confidence requirements
Mining Association Rules
Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high confidence rules from each frequent itemset,
where each rule is a binary partitioning of a frequent itemset
Frequent Itemset Generation
[Figure: the itemset lattice over the items {A, B, C, D, E}, from the null set through the 1-itemsets, 2-itemsets, 3-itemsets, and so on up to ABCDE; given d items, there are 2^d possible candidate itemsets.]
Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the database: match each transaction against every candidate
– With N transactions, M = 2^d candidates, and maximum transaction width w, the complexity is roughly O(N M w), which is expensive

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
Apriori principle:
– If an itemset is frequent, then all of its subsets must also
be frequent
∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)
– Support of an itemset never exceeds the support of its
subsets
– This is known as the anti-monotone property of support
[Figure: the itemset lattice for {A, B, C, D, E}; once an itemset is found to be infrequent, all of its supersets (up to ABCDE) are pruned.]
Illustrating Apriori Principle

TID  Items
1    Bread, Milk
2    Beer, Bread, Diaper, Eggs
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Bread, Coke, Diaper, Milk

Minimum Support = 3

Items (1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Coke and Eggs fall below the minimum support and are discarded; candidate 2-itemsets are generated only from the remaining frequent items.
Candidate Generation: F(k−1) × F(k−1) Method

Let F3 = {ABC, ABD, ABE, ACD, BCD, BDE, CDE} be the set of frequent 3-itemsets.

Merge two frequent (k−1)-itemsets if their first k−2 items are identical:
– Merge(ABC, ABD) = ABCD
– Merge(ABC, ABE) = ABCE
– Merge(ABD, ABE) = ABDE

Candidate pruning:
– Prune ABCE because ACE and BCE are infrequent
– Prune ABDE because ADE is infrequent
Alternate F(k−1) × F(k−1) Method

Let F3 = {ABC, ABD, ABE, ACD, BCD, BDE, CDE} be the set of frequent 3-itemsets.

Merge two frequent (k−1)-itemsets if the last k−2 items of the first one are identical to the first k−2 items of the second:
– Merge(ABC, BCD) = ABCD
– Merge(ABD, BDE) = ABDE
– Merge(ACD, CDE) = ACDE
– Merge(BCD, CDE) = BCDE
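A sketch of the merge-and-prune step for the F3 above (an illustration, not the book's implementation):

```python
from itertools import combinations

F3 = [("A","B","C"), ("A","B","D"), ("A","B","E"), ("A","C","D"),
      ("B","C","D"), ("B","D","E"), ("C","D","E")]
frequent = set(F3)

# Merge step: combine two frequent 3-itemsets whose first k-2 = 2 items are identical
candidates = set()
for x, y in combinations(sorted(F3), 2):
    if x[:2] == y[:2]:
        candidates.add(tuple(sorted(set(x) | set(y))))

# Prune step: drop candidates that have an infrequent 3-item subset
pruned = {c for c in candidates
          if all(s in frequent for s in combinations(c, 3))}

print(sorted(candidates))  # three candidates: ABCD, ABCE, ABDE
print(sorted(pruned))      # only ABCD survives (ACE, BCE, and ADE are infrequent)
```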
Support Counting of Candidate Itemsets

TID  Items
1    Bread, Milk
2    Beer, Bread, Diaper, Eggs
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Bread, Coke, Diaper, Milk

Candidate itemsets:
{Beer, Diaper, Milk}
{Beer, Bread, Diaper}
{Bread, Diaper, Milk}
{Beer, Bread, Milk}
Subset generation: given transaction t = {1, 2, 3, 5, 6}, all ten of its 3-itemsets can be enumerated systematically by prefix (level 1 fixes the first item, level 2 the second, and so on):
123, 125, 126, 135, 136, 156, 235, 236, 256, 356
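Enumerating these subsets is a plain combinations problem; a quick check:

```python
from itertools import combinations

t = [1, 2, 3, 5, 6]
subsets = list(combinations(t, 3))
print(len(subsets))   # 10 candidate 3-itemsets are contained in this transaction
print(subsets)        # (1,2,3), (1,2,5), (1,2,6), (1,3,5), ..., (3,5,6)
```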
Support Counting Using a Hash Tree

– Candidate 3-itemsets are stored in the leaves of a hash tree. At each level an item is hashed: items 1, 4, 7 go to the left branch, items 2, 5, 8 to the middle branch, and items 3, 6, 9 to the right branch.
– The 15 candidate 3-itemsets in the example are: {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}.
– To count supports for the transaction {1, 2, 3, 5, 6}, the transaction is hashed down the tree by enumerating prefixes (1 + {2 3 5 6}, 2 + {3 5 6}, 3 + {5 6}, then 12 + {3 5 6}, 13 + {5 6}, 15 + {6}, and so on), so only the leaves reachable along these paths are visited.
– Match the transaction against 11 out of the 15 candidates, instead of all 15.