DMML Unit 2

ASSOCIATION MINING

Association Rule Mining

 Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

Market-Basket transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}
What Is Association Mining?

 Association rule mining: finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
 Frequent pattern: a pattern (set of items, sequence, etc.) that occurs frequently in a database.
Definition: Frequent Itemset
(refer to the market-basket transactions above)

 Itemset
   A collection of one or more items
   Example: {Milk, Bread, Diaper}
 k-itemset
   An itemset that contains k items
 Support count (σ)
   Frequency of occurrence of an itemset
   E.g. σ({Milk, Bread, Diaper}) = 2
 Support (s)
   Fraction of transactions that contain an itemset
   E.g. s({Milk, Bread, Diaper}) = 2/5
 Frequent Itemset
   An itemset whose support is greater than or equal to a minsup threshold
Definition: Association Rule

• Association Rule
  – An implication expression of the form X → Y, where X and Y are itemsets
  – Example: {Milk, Diaper} → {Beer}
• Rule Evaluation Metrics
  – Support (s): fraction of transactions that contain both X and Y
  – Confidence (c): measures how often items in Y appear in transactions that contain X

Example (using the transactions above): {Milk, Diaper} → {Beer}

  s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
  c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 ≈ 0.67
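To make the two metrics concrete, here is a minimal Python sketch (not part of the original slides) that recomputes s and c for {Milk, Diaper} → {Beer} on the five transactions above:

# Minimal sketch: support and confidence of {Milk, Diaper} -> {Beer}
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, db):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in db if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = support_count(X | Y, transactions) / len(transactions)              # 2/5 = 0.4
c = support_count(X | Y, transactions) / support_count(X, transactions) # 2/3 ≈ 0.67
print(f"support = {s:.2f}, confidence = {c:.2f}")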
Apriori: A Candidate Generation-and-Test
Approach
 Any subset of a frequent itemset must be frequent
 if {beer, diaper, nuts} is frequent, so is {beer, diaper}
 Every transaction having {beer, diaper, nuts} also contains {beer,
diaper}
 Apriori pruning principle: If there is any itemset which is
infrequent, its superset should not be generated/tested!
 Method:
 generate length (k+1) candidate itemsets from length k frequent
itemsets, and
 test the candidates against DB
Apriori Algorithm for Frequent Itemset Generation

 A two-step process is followed, consisting of join and prune actions.

Apriori – Solved Example
The Apriori Algorithm — An Example (min_sup = 50%, i.e. support count ≥ 2)

Database TDB
Tid  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1 (frequent 1-itemsets): {A}:2, {B}:3, {C}:3, {E}:3

C2 (self-join of L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3: {B,C,E}
3rd scan → L3: {B,C,E}:2

Rules with support ≥ 50% and confidence 100%: A → C, B → E, BC → E, CE → B
(BE → C is also generated from L3, but its confidence is only 2/3 ≈ 67%.)
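The level-wise procedure above can be reproduced with a short, self-contained Python sketch (an illustration of the method, not code from the slides); run on database TDB with a minimum support count of 2, it prints the same L1, L2 and L3 itemsets.

from itertools import combinations

# Transactions of database TDB (Tid 10..40); minimum support count = 2 (50%).
TDB = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
MIN_SUP = 2

def support_count(itemset, db):
    return sum(1 for t in db if itemset <= t)

def apriori(db, min_sup):
    """Return {frequent itemset: support count}, level by level."""
    items = sorted({i for t in db for i in t})
    level = {frozenset([i]) for i in items}      # C1: all 1-itemsets
    frequent = {}
    k = 1
    while level:
        counted = {c: support_count(c, db) for c in level}
        level_freq = {c: s for c, s in counted.items() if s >= min_sup}
        frequent.update(level_freq)
        # Candidate generation: join Lk with itself, keep (k+1)-itemsets
        # whose every k-subset is frequent (Apriori pruning).
        prev = list(level_freq)
        level = set()
        for a, b in combinations(prev, 2):
            cand = a | b
            if len(cand) == k + 1 and all(
                frozenset(sub) in level_freq for sub in combinations(cand, k)
            ):
                level.add(cand)
        k += 1
    return frequent

for itemset, sup in sorted(apriori(TDB, MIN_SUP).items(),
                           key=lambda x: (len(x[0]), sorted(x[0]))):
    print(sorted(itemset), sup)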
The Apriori Algorithm — Example (minimum support count = 2)

Database D
TID  Items
100  1 3 4
200  2 3 5
300  1 2 3 5
400  2 5

Scan D → C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
L1: {1}:2, {2}:3, {3}:3, {5}:3

C2: {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D → counts: {1 2}:1, {1 3}:2, {1 5}:1, {2 3}:2, {2 5}:3, {3 5}:2
L2: {1 3}:2, {2 3}:2, {2 5}:3, {3 5}:2

C3: {2 3 5}
Scan D → L3: {2 3 5}:2
Important Details of Apriori
 How to generate candidates?
 Step 1: self-joining Lk
 Step 2: pruning
 How to count supports of candidates?
 Example of Candidate-generation
 L3={abc, abd, acd, ace, bcd}
 Self-joining: L3*L3
 abcd from abc and abd
 acde from acd and ace
 Pruning:
 acde is removed because ade is not in L3
 C4={abcd}
How to Generate Candidates?
 Suppose the items in Lk-1 are listed in an order
 Step 1: self-joining Lk-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1
 Step 2: pruning
forall itemsets c in Ck do
  forall (k-1)-subsets s of c do
    if (s is not in Lk-1) then delete c from Ck
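The same join-and-prune step can be written as an illustrative Python function (the names generate_candidates and L_prev are ours, not from the slides); on the L3 of the previous slide it yields C4 = {abcd} and prunes acde:

from itertools import combinations

def generate_candidates(L_prev):
    """Ck from L(k-1): self-join on the first k-2 items, then prune.

    L_prev is a collection of frequent (k-1)-itemsets, each a tuple of
    items in a fixed sorted order (as the slide assumes)."""
    L_prev = sorted(set(L_prev))
    prev_set = set(L_prev)
    k_minus_1 = len(L_prev[0]) if L_prev else 0
    Ck = []
    # Step 1: self-join -- p and q agree on the first k-2 items and
    # p's last item is smaller than q's last item.
    for i, p in enumerate(L_prev):
        for q in L_prev[i + 1:]:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                cand = p + (q[-1],)
                # Step 2: prune -- every (k-1)-subset must be in L(k-1).
                if all(sub in prev_set for sub in combinations(cand, k_minus_1)):
                    Ck.append(cand)
    return Ck

# Example from the previous slide: L3 = {abc, abd, acd, ace, bcd}
L3 = [("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")]
print(generate_candidates(L3))   # [('a', 'b', 'c', 'd')] -- acde is pruned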
Applications of Apriori Algorithm

• Education field: extracting association rules from data on admitted students, based on their characteristics and specialties.
• Medical field: analysis of patient databases.
• Forestry: analysis of the probability and intensity of forest fires from forest-fire data.
• Recommender systems: used by companies like Amazon and Google, for example for the autocomplete feature.
Drawbacks of Apriori Algorithm

 Apriori requires the generation of candidate itemsets; these can be very numerous when the database contains many items.
 Apriori needs multiple scans of the database to check the support of each candidate generated, which leads to high costs.
Construct FP-tree from a Transaction Database

min_support = 3

TID  Items bought                (ordered) frequent items
100  {f, a, c, d, g, i, m, p}    {f, c, a, m, p}
200  {a, b, c, f, l, m, o}       {f, c, a, b, m}
300  {b, f, h, j, o, w}          {f, b}
400  {b, c, k, s, p}             {c, b, p}
500  {a, f, c, e, l, p, m, n}    {f, c, a, m, p}

1. Scan DB once, find the frequent 1-itemsets (single item patterns)
2. Sort frequent items in frequency descending order → F-list = f-c-a-b-m-p
3. Scan DB again, construct the FP-tree

Header table (item : frequency): f:4, c:4, a:3, b:3, m:3, p:3

FP-tree:
{}
  f:4
    c:3
      a:3
        m:2
          p:2
        b:1
          m:1
    b:1
  c:1
    b:1
      p:1
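For illustration (this sketch and its names are our own, not from the slides), the three construction steps can be coded directly; the F-list f-c-a-b-m-p is taken from the slide, since ties among the equally frequent items a, b, m, p can be ordered arbitrarily. Run on the five transactions above, it prints a tree matching the node counts shown.

from collections import Counter

transactions = [
    set("facdgimp"), set("abcflmo"), set("bfhjow"),
    set("bcksp"), set("afcelpmn"),
]
MIN_SUPPORT = 3

# Scan 1: count items and check that f, c, a, b, m, p are the frequent ones.
counts = Counter(i for t in transactions for i in t)
assert all(counts[i] >= MIN_SUPPORT for i in "fcabmp")
f_list = ["f", "c", "a", "b", "m", "p"]          # F-list from the slide
rank = {item: r for r, item in enumerate(f_list)}

class FPNode:
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

# Scan 2: insert each transaction's frequent items in F-list order.
root = FPNode(None)
for t in transactions:
    node = root
    for item in sorted((i for i in t if i in rank), key=rank.get):
        node = node.children.setdefault(item, FPNode(item))
        node.count += 1

def show(node, depth=0):
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

show(root)   # f:4 / c:3 / a:3 / m:2 / p:2 ... as in the tree above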
Benefits of the FP-tree Structure
 Completeness
 Preserve complete information for frequent pattern
mining
 Never break a long pattern of any transaction
 Compactness
 Reduce irrelevant info—infrequent items are gone
 Items in frequency descending order: the more
frequently occurring, the more likely to be shared
 Never larger than the original database (not counting node-links and the count field)
 For the Connect-4 DB, the compression ratio can be over 100
DECISION TREES

Example of a Decision Tree

Training Data:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree
Refund?
  Yes → NO
  No  → MarSt?
          Single, Divorced → TaxInc?
                               < 80K → NO
                               > 80K → YES
          Married → NO
Apply Model to Test Data

Test Data:
Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Start at the root of the tree and follow the branches that match the test record:
 Refund = No → take the "No" branch to MarSt
 MarSt = Married → take the "Married" branch, which reaches the leaf NO
 Assign Cheat to "No"
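A minimal sketch of the same walk down the tree (the field names and value encoding here are our own, not the slides'):

# The decision tree above, applied to the test record
# Refund=No, Marital Status=Married, Taxable Income=80K.
def classify(record):
    """Walk the tree from the root and return the predicted Cheat label."""
    if record["Refund"] == "Yes":
        return "No"
    # Refund = No: test marital status next
    if record["MaritalStatus"] == "Married":
        return "No"
    # Single or Divorced: test taxable income (in thousands).
    # The slide labels the branches "< 80K" and "> 80K"; we send 80K itself
    # left, which does not matter for this test record.
    return "No" if record["TaxableIncome"] < 80 else "Yes"

test = {"Refund": "No", "MaritalStatus": "Married", "TaxableIncome": 80}
print(classify(test))   # -> "No", as on the slide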
Each node represents a test on an attribute of the instance to be classified, and each outgoing arc a possible outcome, leading to a further test.

The leaves correspond to classification actions.
Decision tree representation (PlayTennis)

 Example instance: Outlook=Sunny, Temp=Hot, Humidity=High, Wind=Strong → No
Decision trees expressivity

 Decision trees represent a disjunction of conjunctions of constraints on the values of attributes:
(Outlook = Sunny ∧ Humidity = Normal)
∨ (Outlook = Overcast)
∨ (Outlook = Rain ∧ Wind = Weak)
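Read as a boolean function (our own rendering, using the PlayTennis attributes), the tree is equivalent to:

def play_tennis(outlook, humidity, wind):
    """The PlayTennis tree as a disjunction of conjunctions (illustrative)."""
    return ((outlook == "Sunny" and humidity == "Normal")
            or outlook == "Overcast"
            or (outlook == "Rain" and wind == "Weak"))

print(play_tennis("Sunny", "High", "Strong"))   # False -> No, as for the instance above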
When to use Decision Trees

 Problem characteristics:
 Instances can be described by attribute value pairs

 Target function is discrete valued

 Disjunctive hypothesis may be required

 Possibly noisy training data samples


 Robust to errors in training data
 Missing attribute values
 Different classification problems:
 Equipment or medical diagnosis

 Credit risk analysis

 Several tasks in natural language processing

Top-down induction of Decision Trees
 ID3 (Quinlan, 1986) is a basic algorithm for learning DTs
 Given a training set of examples, the algorithm for building the DT performs a search in the space of decision trees
 The construction of the tree is top-down; the algorithm is greedy.
 The fundamental question is “which attribute should be
tested next? Which question gives us more information?”
 Select the best attribute
 A descendent node is then created for each possible value of
this attribute and examples are partitioned according to this
value
 The process is repeated for each successor node until all the
examples are classified correctly or there are no attributes left
Which attribute is the best classifier?

 A statistical property called information gain measures how well a given attribute separates the training examples
 Information gain uses the notion of entropy, commonly used in information theory
 Information gain = expected reduction of entropy
Entropy in binary classification

 Entropy measures the impurity of a collection of examples. It depends on the distribution of the random variable p.
 S is a collection of training examples
 p+ is the proportion of positive examples in S
 p– is the proportion of negative examples in S

Entropy(S) = – p+ log2 p+ – p– log2 p–        [convention: 0 log2 0 = 0]

Entropy([14+, 0–]) = – 14/14 log2(14/14) – 0 log2(0) = 0
Entropy([9+, 5–]) = – 9/14 log2(9/14) – 5/14 log2(5/14) = 0.94
Entropy([7+, 7–]) = – 7/14 log2(7/14) – 7/14 log2(7/14) = 1/2 + 1/2 = 1    [log2(1/2) = –1]

Note: the log of a number < 1 is negative; 0 ≤ p ≤ 1, so 0 ≤ entropy ≤ 1
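These values can be checked with a small Python helper (our own sketch, not part of the slides):

from math import log2

def entropy(pos, neg):
    """Entropy of a collection with `pos` positive and `neg` negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:                      # convention: 0 * log2(0) = 0
            result -= p * log2(p)
    return result

print(entropy(14, 0))   # 0.0
print(entropy(9, 5))    # ~0.940
print(entropy(7, 7))    # 1.0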

Entropy

[Figure: entropy of a binary classification as a function of the proportion of positive examples p+]
Information gain as entropy reduction

 Information gain is the expected reduction in entropy caused by partitioning the examples on an attribute.
 The higher the information gain, the more effective the attribute in classifying training data.
 Expected reduction in entropy knowing A:

Gain(S, A) = Entropy(S) − Σ (v ∈ Values(A)) (|Sv| / |S|) · Entropy(Sv)

where Values(A) is the set of possible values for A, and Sv is the subset of S for which A has value v
Example: expected information gain

 Let
   Values(Wind) = {Weak, Strong}
   S = [9+, 5−]
   SWeak = [6+, 2−]
   SStrong = [3+, 3−]
 Information gain due to knowing Wind:
Gain(S, Wind) = Entropy(S) − 8/14 Entropy(SWeak) − 6/14 Entropy(SStrong)
              = 0.94 − 8/14 × 0.811 − 6/14 × 1.00
              = 0.048
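A short sketch (ours) that reproduces this computation:

from math import log2

def entropy(pos, neg):
    total = pos + neg
    return -sum(p * log2(p) for p in (pos / total, neg / total) if p > 0)

S        = (9, 5)   # [9+, 5-]
S_weak   = (6, 2)   # Wind = Weak
S_strong = (3, 3)   # Wind = Strong

gain = (
    entropy(*S)
    - (sum(S_weak) / sum(S)) * entropy(*S_weak)
    - (sum(S_strong) / sum(S)) * entropy(*S_strong)
)
print(round(gain, 3))   # 0.048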
Which attribute is the best classifier?

[Figure: candidate attribute splits compared by information gain]

Example

[Table: the PlayTennis training examples D1–D14]
First step: which attribute to test at the root?

 Which attribute should be tested at the root?
   Gain(S, Outlook) = 0.246
   Gain(S, Humidity) = 0.151
   Gain(S, Wind) = 0.084
   Gain(S, Temperature) = 0.029
 Outlook provides the best prediction for the target
 Let's grow the tree:
   add to the tree a successor for each possible value of Outlook
   partition the training samples according to the value of Outlook
After first step

[Figure: partial tree with Outlook at the root and branches Sunny, Overcast, Rain]
Second step

 Working on the Outlook=Sunny node:
Gain(SSunny, Humidity) = 0.970 − 3/5 × 0.0 − 2/5 × 0.0 = 0.970
Gain(SSunny, Wind) = 0.970 − 2/5 × 1.0 − 3/5 × 0.918 = 0.019
Gain(SSunny, Temp.) = 0.970 − 2/5 × 0.0 − 2/5 × 1.0 − 1/5 × 0.0 = 0.570
 Humidity provides the best prediction for the target
 Let's grow the tree:
   add to the tree a successor for each possible value of Humidity
   partition the training samples according to the value of Humidity
Second and third steps

[Resulting leaves: Humidity=High → No {D1, D2, D8}; Humidity=Normal → Yes {D9, D11}; Wind=Weak → Yes {D4, D5, D10}; Wind=Strong → No {D6, D14}]

ID3: algorithm

ID3(X, T, Attrs)
  X: training examples
  T: target attribute (e.g. PlayTennis)
  Attrs: other attributes, initially all attributes

Create Root node
If all X's are +, return Root with class +
If all X's are –, return Root with class –
If Attrs is empty, return Root with class the most common value of T in X
else
  A ← best attribute; decision attribute for Root ← A
  For each possible value vi of A:
    - add a new branch below Root, for test A = vi
    - Xi ← subset of X with A = vi
    - If Xi is empty then add a new leaf with class the most common value of T in X
      else add the subtree generated by ID3(Xi, T, Attrs – {A})
return Root
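For illustration only, here is a compact Python version of this procedure (our own simplification: categorical attributes, best attribute chosen by information gain, ties broken arbitrarily). The demo data are the five Outlook=Sunny examples with the attribute values of the classic PlayTennis table, reproduced here only to exercise the sketch.

from collections import Counter
from math import log2

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def gain(examples, target, attr):
    base = entropy([x[target] for x in examples])
    for v in {x[attr] for x in examples}:
        subset = [x for x in examples if x[attr] == v]
        base -= len(subset) / len(examples) * entropy([x[target] for x in subset])
    return base

def id3(examples, target, attrs):
    labels = [x[target] for x in examples]
    if len(set(labels)) == 1:                      # all examples share one class
        return labels[0]
    if not attrs:                                  # no attributes left
        return Counter(labels).most_common(1)[0][0]
    A = max(attrs, key=lambda a: gain(examples, target, a))   # best attribute
    tree = {A: {}}
    # (The pseudocode iterates over *all* possible values of A and adds a
    # majority-class leaf for empty partitions; this sketch only sees the
    # values that occur in the examples, so every partition is non-empty.)
    for v in sorted({x[A] for x in examples}):
        subset = [x for x in examples if x[A] == v]
        tree[A][v] = id3(subset, target, [a for a in attrs if a != A])
    return tree

sunny = [
    {"Temp": "Hot",  "Humidity": "High",   "Wind": "Weak",   "Play": "No"},   # D1
    {"Temp": "Hot",  "Humidity": "High",   "Wind": "Strong", "Play": "No"},   # D2
    {"Temp": "Mild", "Humidity": "High",   "Wind": "Weak",   "Play": "No"},   # D8
    {"Temp": "Cool", "Humidity": "Normal", "Wind": "Weak",   "Play": "Yes"},  # D9
    {"Temp": "Mild", "Humidity": "Normal", "Wind": "Strong", "Play": "Yes"},  # D11
]
print(id3(sunny, "Play", ["Temp", "Humidity", "Wind"]))
# Humidity is chosen, as on the slide: High -> No, Normal -> Yes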
