DMML Unit 2

ASSOCIATION MINING

Association Rule Mining

 Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

Market-Basket transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}
What Is Association Mining?

 Association rule mining: finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
 Frequent pattern: a pattern (set of items, sequence, etc.) that occurs frequently in a database.
Definition: Frequent Itemset
(refer to the market-basket transactions above)

 Itemset
   A collection of one or more items
   Example: {Milk, Bread, Diaper}
 k-itemset
   An itemset that contains k items
 Support count (σ)
   Frequency of occurrence of an itemset
   E.g. σ({Milk, Bread, Diaper}) = 2
 Support (s)
   Fraction of transactions that contain an itemset
   E.g. s({Milk, Bread, Diaper}) = 2/5
 Frequent Itemset
   An itemset whose support is greater than or equal to a minsup threshold
Definition: Association Rule

• Association Rule
  – An implication expression of the form X → Y, where X and Y are itemsets
  – Example: {Milk, Diaper} → {Beer}
• Rule Evaluation Metrics
  – Support (s): fraction of transactions that contain both X and Y
  – Confidence (c): measures how often items in Y appear in transactions that contain X

Example (using the transactions above): {Milk, Diaper} → {Beer}

  s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
  c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 ≈ 0.67
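To make the two metrics concrete, here is a minimal Python sketch (not part of the original slides) that recomputes s and c for {Milk, Diaper} → {Beer} on the five transactions above:

# Minimal sketch: support and confidence of {Milk, Diaper} -> {Beer}
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, db):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in db if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = support_count(X | Y, transactions) / len(transactions)              # 2/5 = 0.4
c = support_count(X | Y, transactions) / support_count(X, transactions) # 2/3 ≈ 0.67
print(f"support = {s:.2f}, confidence = {c:.2f}")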
Apriori: A Candidate Generation-and-Test
Approach
 Any subset of a frequent itemset must be frequent
 if {beer, diaper, nuts} is frequent, so is {beer, diaper}
 Every transaction having {beer, diaper, nuts} also contains {beer,
diaper}
 Apriori pruning principle: If there is any itemset which is
infrequent, its superset should not be generated/tested!
 Method:
 generate length (k+1) candidate itemsets from length k frequent
itemsets, and
 test the candidates against DB
Apriori Algorithm for Frequent Itemset Generation

 A two-step process is followed, consisting of join and prune actions.

Apriori – Solved Example
The Apriori Algorithm — An Example (min_sup = 50%, i.e. support count ≥ 2)

Database TDB
Tid  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1 (frequent 1-itemsets): {A}:2, {B}:3, {C}:3, {E}:3

C2 (self-join of L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3: {B,C,E}
3rd scan → L3: {B,C,E}:2

Rules with support ≥ 50% and confidence 100%: A → C, B → E, BC → E, CE → B
(BE → C is also generated from L3, but its confidence is only 2/3 ≈ 67%.)
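The level-wise procedure above can be reproduced with a short, self-contained Python sketch (an illustration of the method, not code from the slides); run on database TDB with a minimum support count of 2, it prints the same L1, L2 and L3 itemsets.

from itertools import combinations

# Transactions of database TDB (Tid 10..40); minimum support count = 2 (50%).
TDB = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
MIN_SUP = 2

def support_count(itemset, db):
    return sum(1 for t in db if itemset <= t)

def apriori(db, min_sup):
    """Return {frequent itemset: support count}, level by level."""
    items = sorted({i for t in db for i in t})
    level = {frozenset([i]) for i in items}      # C1: all 1-itemsets
    frequent = {}
    k = 1
    while level:
        counted = {c: support_count(c, db) for c in level}
        level_freq = {c: s for c, s in counted.items() if s >= min_sup}
        frequent.update(level_freq)
        # Candidate generation: join Lk with itself, keep (k+1)-itemsets
        # whose every k-subset is frequent (Apriori pruning).
        prev = list(level_freq)
        level = set()
        for a, b in combinations(prev, 2):
            cand = a | b
            if len(cand) == k + 1 and all(
                frozenset(sub) in level_freq for sub in combinations(cand, k)
            ):
                level.add(cand)
        k += 1
    return frequent

for itemset, sup in sorted(apriori(TDB, MIN_SUP).items(),
                           key=lambda x: (len(x[0]), sorted(x[0]))):
    print(sorted(itemset), sup)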
The Apriori Algorithm — Example (minimum support count = 2)

Database D
TID  Items
100  1 3 4
200  2 3 5
300  1 2 3 5
400  2 5

Scan D → C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
L1: {1}:2, {2}:3, {3}:3, {5}:3

C2: {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D → counts: {1 2}:1, {1 3}:2, {1 5}:1, {2 3}:2, {2 5}:3, {3 5}:2
L2: {1 3}:2, {2 3}:2, {2 5}:3, {3 5}:2

C3: {2 3 5}
Scan D → L3: {2 3 5}:2
Important Details of Apriori
 How to generate candidates?
 Step 1: self-joining Lk
 Step 2: pruning
 How to count supports of candidates?
 Example of Candidate-generation
 L3={abc, abd, acd, ace, bcd}
 Self-joining: L3*L3
 abcd from abc and abd
 acde from acd and ace
 Pruning:
 acde is removed because ade is not in L3
 C4={abcd}
How to Generate Candidates?
 Suppose the items in Lk-1 are listed in an order
 Step 1: self-joining Lk-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1
 Step 2: pruning
forall itemsets c in Ck do
  forall (k-1)-subsets s of c do
    if (s is not in Lk-1) then delete c from Ck
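The same join-and-prune step can be written as an illustrative Python function (the names generate_candidates and L_prev are ours, not from the slides); on the L3 of the previous slide it yields C4 = {abcd} and prunes acde:

from itertools import combinations

def generate_candidates(L_prev):
    """Ck from L(k-1): self-join on the first k-2 items, then prune.

    L_prev is a collection of frequent (k-1)-itemsets, each a tuple of
    items in a fixed sorted order (as the slide assumes)."""
    L_prev = sorted(set(L_prev))
    prev_set = set(L_prev)
    k_minus_1 = len(L_prev[0]) if L_prev else 0
    Ck = []
    # Step 1: self-join -- p and q agree on the first k-2 items and
    # p's last item is smaller than q's last item.
    for i, p in enumerate(L_prev):
        for q in L_prev[i + 1:]:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                cand = p + (q[-1],)
                # Step 2: prune -- every (k-1)-subset must be in L(k-1).
                if all(sub in prev_set for sub in combinations(cand, k_minus_1)):
                    Ck.append(cand)
    return Ck

# Example from the previous slide: L3 = {abc, abd, acd, ace, bcd}
L3 = [("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")]
print(generate_candidates(L3))   # [('a', 'b', 'c', 'd')] -- acde is pruned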
Applications of Apriori Algorithm

• Education field: extracting association rules from data on admitted students, based on their characteristics and specialties.
• Medical field: analysis of patient databases.
• Forestry: analysis of the probability and intensity of forest fires from forest-fire data.
• Recommender systems: used by companies like Amazon and Google, for example for the autocomplete feature.
Drawbacks of Apriori Algorithm

 Apriori requires the generation of candidate itemsets; these can be very numerous when the database contains many items.
 Apriori needs multiple scans of the database to check the support of each candidate generated, which leads to high costs.
Construct FP-tree from a Transaction Database

min_support = 3

TID  Items bought                (ordered) frequent items
100  {f, a, c, d, g, i, m, p}    {f, c, a, m, p}
200  {a, b, c, f, l, m, o}       {f, c, a, b, m}
300  {b, f, h, j, o, w}          {f, b}
400  {b, c, k, s, p}             {c, b, p}
500  {a, f, c, e, l, p, m, n}    {f, c, a, m, p}

1. Scan DB once, find the frequent 1-itemsets (single item patterns)
2. Sort frequent items in frequency descending order → F-list = f-c-a-b-m-p
3. Scan DB again, construct the FP-tree

Header table (item : frequency): f:4, c:4, a:3, b:3, m:3, p:3

FP-tree:
{}
  f:4
    c:3
      a:3
        m:2
          p:2
        b:1
          m:1
    b:1
  c:1
    b:1
      p:1
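For illustration (this sketch and its names are our own, not from the slides), the three construction steps can be coded directly; the F-list f-c-a-b-m-p is taken from the slide, since ties among the equally frequent items a, b, m, p can be ordered arbitrarily. Run on the five transactions above, it prints a tree matching the node counts shown.

from collections import Counter

transactions = [
    set("facdgimp"), set("abcflmo"), set("bfhjow"),
    set("bcksp"), set("afcelpmn"),
]
MIN_SUPPORT = 3

# Scan 1: count items and check that f, c, a, b, m, p are the frequent ones.
counts = Counter(i for t in transactions for i in t)
assert all(counts[i] >= MIN_SUPPORT for i in "fcabmp")
f_list = ["f", "c", "a", "b", "m", "p"]          # F-list from the slide
rank = {item: r for r, item in enumerate(f_list)}

class FPNode:
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

# Scan 2: insert each transaction's frequent items in F-list order.
root = FPNode(None)
for t in transactions:
    node = root
    for item in sorted((i for i in t if i in rank), key=rank.get):
        node = node.children.setdefault(item, FPNode(item))
        node.count += 1

def show(node, depth=0):
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

show(root)   # f:4 / c:3 / a:3 / m:2 / p:2 ... as in the tree above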
Benefits of the FP-tree Structure
 Completeness
 Preserve complete information for frequent pattern
mining
 Never break a long pattern of any transaction
 Compactness
 Reduce irrelevant info—infrequent items are gone
 Items in frequency descending order: the more
frequently occurring, the more likely to be shared
 Never larger than the original database (not counting node-links and the count field)
 For the Connect-4 DB, the compression ratio can be over 100
DECISION TREES

Example of a Decision Tree

Training Data:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree
Refund?
  Yes → NO
  No  → MarSt?
          Single, Divorced → TaxInc?
                               < 80K → NO
                               > 80K → YES
          Married → NO
Apply Model to Test Data

Test Data:
Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Start at the root of the tree and follow the branches that match the test record:
 Refund = No → take the "No" branch to MarSt
 MarSt = Married → take the "Married" branch, which reaches the leaf NO
 Assign Cheat to "No"
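A minimal sketch of the same walk down the tree (the field names and value encoding here are our own, not the slides'):

# The decision tree above, applied to the test record
# Refund=No, Marital Status=Married, Taxable Income=80K.
def classify(record):
    """Walk the tree from the root and return the predicted Cheat label."""
    if record["Refund"] == "Yes":
        return "No"
    # Refund = No: test marital status next
    if record["MaritalStatus"] == "Married":
        return "No"
    # Single or Divorced: test taxable income (in thousands).
    # The slide labels the branches "< 80K" and "> 80K"; we send 80K itself
    # left, which does not matter for this test record.
    return "No" if record["TaxableIncome"] < 80 else "Yes"

test = {"Refund": "No", "MaritalStatus": "Married", "TaxableIncome": 80}
print(classify(test))   # -> "No", as on the slide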
Each node represents a test on an attribute of the instance to be classified, and each outgoing arc a possible outcome, leading to a further test.

The leaves correspond to classification actions.
Decision tree representation (PlayTennis)

 Example instance: Outlook=Sunny, Temp=Hot, Humidity=High, Wind=Strong → No
Decision trees expressivity

 Decision trees represent a disjunction of conjunctions of constraints on the values of attributes:
(Outlook = Sunny ∧ Humidity = Normal)
∨ (Outlook = Overcast)
∨ (Outlook = Rain ∧ Wind = Weak)
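Read as a boolean function (our own rendering, using the PlayTennis attributes), the tree is equivalent to:

def play_tennis(outlook, humidity, wind):
    """The PlayTennis tree as a disjunction of conjunctions (illustrative)."""
    return ((outlook == "Sunny" and humidity == "Normal")
            or outlook == "Overcast"
            or (outlook == "Rain" and wind == "Weak"))

print(play_tennis("Sunny", "High", "Strong"))   # False -> No, as for the instance above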
When to use Decision Trees

 Problem characteristics:
 Instances can be described by attribute value pairs

 Target function is discrete valued

 Disjunctive hypothesis may be required

 Possibly noisy training data samples


 Robust to errors in training data
 Missing attribute values
 Different classification problems:
 Equipment or medical diagnosis

 Credit risk analysis

 Several tasks in natural language processing

Top-down induction of Decision Trees
 ID3 (Quinlan, 1986) is a basic algorithm for learning DTs
 Given a training set of examples, the algorithm for building the DT performs a search in the space of decision trees
 The construction of the tree is top-down; the algorithm is greedy.
 The fundamental question is “which attribute should be
tested next? Which question gives us more information?”
 Select the best attribute
 A descendent node is then created for each possible value of
this attribute and examples are partitioned according to this
value
 The process is repeated for each successor node until all the
examples are classified correctly or there are no attributes left
Which attribute is the best classifier?

 A statistical property called information gain measures how well a given attribute separates the training examples
 Information gain uses the notion of entropy, commonly used in information theory
 Information gain = expected reduction of entropy
Entropy in binary classification

 Entropy measures the impurity of a collection of examples. It depends on the distribution of the random variable p.
 S is a collection of training examples
 p+ is the proportion of positive examples in S
 p– is the proportion of negative examples in S

Entropy(S) = – p+ log2 p+ – p– log2 p–        [convention: 0 log2 0 = 0]

Entropy([14+, 0–]) = – 14/14 log2(14/14) – 0 log2(0) = 0
Entropy([9+, 5–]) = – 9/14 log2(9/14) – 5/14 log2(5/14) = 0.94
Entropy([7+, 7–]) = – 7/14 log2(7/14) – 7/14 log2(7/14) = 1/2 + 1/2 = 1    [log2(1/2) = –1]

Note: the log of a number < 1 is negative; 0 ≤ p ≤ 1, so 0 ≤ entropy ≤ 1
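These values can be checked with a small Python helper (our own sketch, not part of the slides):

from math import log2

def entropy(pos, neg):
    """Entropy of a collection with `pos` positive and `neg` negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:                      # convention: 0 * log2(0) = 0
            result -= p * log2(p)
    return result

print(entropy(14, 0))   # 0.0
print(entropy(9, 5))    # ~0.940
print(entropy(7, 7))    # 1.0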

Entropy

[Figure: entropy of a binary classification as a function of the proportion of positive examples p+]
Information gain as entropy reduction

 Information gain is the expected reduction in entropy caused by partitioning the examples on an attribute.
 The higher the information gain, the more effective the attribute in classifying training data.
 Expected reduction in entropy knowing A:

Gain(S, A) = Entropy(S) − Σ (v ∈ Values(A)) (|Sv| / |S|) · Entropy(Sv)

where Values(A) is the set of possible values for A, and Sv is the subset of S for which A has value v
Example: expected information gain

 Let
   Values(Wind) = {Weak, Strong}
   S = [9+, 5−]
   SWeak = [6+, 2−]
   SStrong = [3+, 3−]
 Information gain due to knowing Wind:
Gain(S, Wind) = Entropy(S) − 8/14 Entropy(SWeak) − 6/14 Entropy(SStrong)
              = 0.94 − 8/14 × 0.811 − 6/14 × 1.00
              = 0.048
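A short sketch (ours) that reproduces this computation:

from math import log2

def entropy(pos, neg):
    total = pos + neg
    return -sum(p * log2(p) for p in (pos / total, neg / total) if p > 0)

S        = (9, 5)   # [9+, 5-]
S_weak   = (6, 2)   # Wind = Weak
S_strong = (3, 3)   # Wind = Strong

gain = (
    entropy(*S)
    - (sum(S_weak) / sum(S)) * entropy(*S_weak)
    - (sum(S_strong) / sum(S)) * entropy(*S_strong)
)
print(round(gain, 3))   # 0.048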
Which attribute is the best classifier?

[Figure: candidate attribute splits compared by information gain]

Example

[Table: the PlayTennis training examples D1–D14]
First step: which attribute to test at the root?

 Which attribute should be tested at the root?
   Gain(S, Outlook) = 0.246
   Gain(S, Humidity) = 0.151
   Gain(S, Wind) = 0.084
   Gain(S, Temperature) = 0.029
 Outlook provides the best prediction for the target
 Let's grow the tree:
   add to the tree a successor for each possible value of Outlook
   partition the training samples according to the value of Outlook
After first step

[Figure: partial tree with Outlook at the root and branches Sunny, Overcast, Rain]
Second step

 Working on the Outlook=Sunny node:
Gain(SSunny, Humidity) = 0.970 − 3/5 × 0.0 − 2/5 × 0.0 = 0.970
Gain(SSunny, Wind) = 0.970 − 2/5 × 1.0 − 3/5 × 0.918 = 0.019
Gain(SSunny, Temp.) = 0.970 − 2/5 × 0.0 − 2/5 × 1.0 − 1/5 × 0.0 = 0.570
 Humidity provides the best prediction for the target
 Let's grow the tree:
   add to the tree a successor for each possible value of Humidity
   partition the training samples according to the value of Humidity
Second and third steps

[Resulting leaves: Humidity=High → No {D1, D2, D8}; Humidity=Normal → Yes {D9, D11}; Wind=Weak → Yes {D4, D5, D10}; Wind=Strong → No {D6, D14}]

ID3: algorithm

ID3(X, T, Attrs)
  X: training examples
  T: target attribute (e.g. PlayTennis)
  Attrs: other attributes, initially all attributes

Create Root node
If all X's are +, return Root with class +
If all X's are –, return Root with class –
If Attrs is empty, return Root with class the most common value of T in X
else
  A ← best attribute; decision attribute for Root ← A
  For each possible value vi of A:
    - add a new branch below Root, for test A = vi
    - Xi ← subset of X with A = vi
    - If Xi is empty then add a new leaf with class the most common value of T in X
      else add the subtree generated by ID3(Xi, T, Attrs – {A})
return Root
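For illustration only, here is a compact Python version of this procedure (our own simplification: categorical attributes, best attribute chosen by information gain, ties broken arbitrarily). The demo data are the five Outlook=Sunny examples with the attribute values of the classic PlayTennis table, reproduced here only to exercise the sketch.

from collections import Counter
from math import log2

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def gain(examples, target, attr):
    base = entropy([x[target] for x in examples])
    for v in {x[attr] for x in examples}:
        subset = [x for x in examples if x[attr] == v]
        base -= len(subset) / len(examples) * entropy([x[target] for x in subset])
    return base

def id3(examples, target, attrs):
    labels = [x[target] for x in examples]
    if len(set(labels)) == 1:                      # all examples share one class
        return labels[0]
    if not attrs:                                  # no attributes left
        return Counter(labels).most_common(1)[0][0]
    A = max(attrs, key=lambda a: gain(examples, target, a))   # best attribute
    tree = {A: {}}
    # (The pseudocode iterates over *all* possible values of A and adds a
    # majority-class leaf for empty partitions; this sketch only sees the
    # values that occur in the examples, so every partition is non-empty.)
    for v in sorted({x[A] for x in examples}):
        subset = [x for x in examples if x[A] == v]
        tree[A][v] = id3(subset, target, [a for a in attrs if a != A])
    return tree

sunny = [
    {"Temp": "Hot",  "Humidity": "High",   "Wind": "Weak",   "Play": "No"},   # D1
    {"Temp": "Hot",  "Humidity": "High",   "Wind": "Strong", "Play": "No"},   # D2
    {"Temp": "Mild", "Humidity": "High",   "Wind": "Weak",   "Play": "No"},   # D8
    {"Temp": "Cool", "Humidity": "Normal", "Wind": "Weak",   "Play": "Yes"},  # D9
    {"Temp": "Mild", "Humidity": "Normal", "Wind": "Strong", "Play": "Yes"},  # D11
]
print(id3(sunny, "Play", ["Temp", "Humidity", "Wind"]))
# Humidity is chosen, as on the slide: High -> No, Normal -> Yes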
