
MS (Data Science)

Fall 2020 Semester

CT-530 DATA MINING

Dr. Sohail Abdul Sattar


Ex-Professor & Co-Chairman
Department of Computer Science & Information Technology
NED University of Engineering & Technology

Course Teacher

Dr. Sohail Abdul Sattar


Ex-Professor & Co-Chairman
Department of Computer Science & Information Technology
NED University

PhD Computer Vision, NED + UCF (Orlando, US)
MS Computer Science, NED
MCS Computer Science, KU
BE Mechanical Engineering, NED

Books
• “Introduction to Data Mining” by Tan, Steinbach, and Kumar.
• “Mining Massive Datasets” by Anand Rajaraman, Jeff Ullman, and Jure Leskovec. Free online book; includes slides from the course.
• “Data Mining: Concepts and Techniques” by Jiawei Han and Micheline Kamber.
• “Data Mining: Practical Machine Learning Tools and Techniques” by Ian H. Witten.

Thanks to the owner of these slides!!!


DATA MINING
LECTURE 4
Market-Basket Analysis
Frequent Itemsets

This is how it all started…


• Rakesh Agrawal, Tomasz Imielinski, Arun N. Swami: Mining Association Rules between Sets of Items in Large Databases. SIGMOD Conference 1993: 207-216.
• Rakesh Agrawal, Ramakrishnan Srikant: Fast Algorithms for Mining Association Rules in Large Databases. VLDB 1994: 487-499.

• These two papers are credited with the birth of Data Mining.
• For a long time, people were fascinated with Association Rules and Frequent Itemsets.
• Some people (in industry and academia) still are.

Market-Basket Data
• A large set of items, e.g., things sold in a supermarket.
• A large set of baskets (transactions), each of which is a small subset of the items, e.g., the things one customer buys on one day.

Items: {Bread, Milk, Diaper, Beer, Eggs, Coke}

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Frequent itemsets
• Goal: find combinations of items (itemsets) that occur frequently.
• These are called Frequent Itemsets.
• Support s(I): the number of transactions that contain itemset I.

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Examples of frequent itemsets (support ≥ 3):
{Bread}: 4
{Milk}: 4
{Diaper}: 4
{Beer}: 3
{Diaper, Beer}: 3
{Milk, Bread}: 3
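To make the definition concrete, here is a minimal Python sketch (not part of the original slides; the names `transactions` and `support` are illustrative) that reproduces the support counts above:

```python
# Example transactions from the slide, one set per basket.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

print(support({"Diaper", "Beer"}, transactions))  # 3
print(support({"Milk", "Bread"}, transactions))   # 3
```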

Market-Baskets – (2)
• Really, a general many-to-many mapping (association) between two kinds of things, where one kind (the baskets) consists of sets of the other (the items).
• But we ask about connections among “items,” not “baskets.”

• The technology focuses on common/frequent events, not rare events (the “long tail”).

Applications – (1)
• Items = products; baskets = sets of products someone bought in one trip to the store.

• Example application: given that many people buy beer and diapers together:
  • Run a sale on diapers; raise the price of beer.
  • Only useful if many people buy diapers & beer.

Applications – (2)
• Baskets = Web pages; items = words.

• Example application: unusual words appearing together in a large number of documents, e.g., “Brad” and “Angelina,” may indicate an interesting relationship.

Applications – (3)
• Baskets = sentences; items = documents containing those sentences.

• Example application: items that appear together too often could represent plagiarism.

• Notice that items do not have to be “in” baskets.


Definitions
• Itemset
  • A collection of one or more items
  • Example: {Milk, Bread, Diaper}
• k-itemset
  • An itemset that contains k items
• Support (s)
  • Count: frequency of occurrence of an itemset
    • E.g. s({Milk, Bread, Diaper}) = 2
  • Fraction: fraction of transactions that contain an itemset
    • E.g. s({Milk, Bread, Diaper}) = 2/5 = 40%
• Frequent Itemset
  • An itemset whose support is greater than or equal to a minsup threshold

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Mining Frequent Itemsets task
• Input: market-basket data, a threshold minsup
• Output: all frequent itemsets with support ≥ minsup

• Problem parameters:
  • N (size): number of transactions
    • Walmart: billions of baskets per year
    • Web: billions of pages
  • d (dimension): number of (distinct) items
    • Walmart sells more than 100,000 items
    • Web: billions of words
  • w: max size of a basket
  • M: number of possible itemsets, M = 2^d

The itemset lattice

Representation of all possible itemsets and their relationships:

null
A   B   C   D   E
AB  AC  AD  AE  BC  BD  BE  CD  CE  DE
ABC  ABD  ABE  ACD  ACE  ADE  BCD  BCE  BDE  CDE
ABCD  ABCE  ABDE  ACDE  BCDE
ABCDE

Given d items, there are 2^d possible itemsets.
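As a quick check of the 2^d claim, the following sketch (illustrative, not from the slides) enumerates the full lattice over the five items A–E:

```python
from itertools import chain, combinations

# Enumerate every itemset over d = 5 items, including the empty set (null).
items = "ABCDE"
lattice = list(chain.from_iterable(
    combinations(items, k) for k in range(len(items) + 1)))
print(len(lattice))  # 32 = 2^5
```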

A Naïve Algorithm
• Brute-force approach: every itemset is a candidate:
  • Consider all itemsets in the lattice, and scan the data for each candidate to compute its support, OR
  • Scan the data and, for each transaction, generate all possible itemsets it contains; keep a count for each itemset.

• Expensive since M = 2^d !!!

• No solution that considers all candidates is acceptable!

N transactions (max width w), matched against a list of M candidates:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
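A hedged sketch of the second brute-force variant (scan the data, generate every subset of each transaction, and count), reusing the `transactions` list from the earlier sketch. A basket of width w already yields 2^w - 1 itemsets, which is why no exhaustive approach scales:

```python
from collections import Counter
from itertools import combinations

def brute_force_counts(transactions):
    """Count every non-empty itemset that appears in any transaction."""
    counts = Counter()
    for t in transactions:
        items = sorted(t)
        # 2^w - 1 itemsets per transaction of width w: exponential blow-up.
        for k in range(1, len(items) + 1):
            for itemset in combinations(items, k):
                counts[itemset] += 1
    return counts

counts = brute_force_counts(transactions)
print(counts[("Beer", "Diaper")])  # 3
```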

Computation Model
• Typically, data is kept in flat files rather than in a database system.
  • Stored on disk.
  • Stored basket-by-basket.
• We can expand a basket into pairs, triples, etc. as we read the data.
  • Use k nested loops, or recursion, to generate all itemsets of size k.

• Data is too large to be loaded into memory.

Example file: retail

Example: items are positive integers, and each basket corresponds to a line in the file of space-separated integers.

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
30 31 32
33 34 35
36 37 38 39 40 41 42 43 44 45 46
38 39 47 48
38 39 48 49 50 51 52 53 54 55 56 57 58
32 41 59 60 61 62
3 39 48
63 64 65 66 67 68
32 69
48 70 71 72
39 73 74 75 76 77 78 79
36 38 39 41 48 79 80 81
82 83 84
41 85 86 87 88
39 48 89 90 91 92 93 94 95 96 97 98 99 100 101
36 38 39 48 89
39 41 102 103 104 105 106 107 108
38 39 41 109 110
39 111 112 113 114 115 116 117 118
119 120 121 122 123 124 125 126 127 128 129 130 131 132 133
48 134 135 136
39 48 137 138 139 140 141 142 143 144 145 146 147 148 149
39 150 151 152
38 39 56 153 154 155
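A minimal reader for a file in this format might look like the sketch below (illustrative; the filename `retail.txt` is an assumption, not given in the slides). Streaming one basket at a time matches the pass-based access pattern discussed next:

```python
def read_baskets(path):
    """Yield each non-empty line of the file as a set of integer item IDs."""
    with open(path) as f:
        for line in f:
            if line.strip():
                yield {int(tok) for tok in line.split()}

# Example usage: count the baskets that contain item 39.
# n39 = sum(1 for basket in read_baskets("retail.txt") if 39 in basket)
```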

Computation Model – (2)
• The true cost of mining disk-resident data is usually the number of disk I/Os.
• In practice, association-rule algorithms read the data in passes – all baskets are read in turn.

• Thus, we measure the cost by the number of passes an algorithm takes.

Main-Memory Bottleneck
• For many frequent-itemset algorithms, main memory is the critical resource.
  • As we read baskets, we need to count something, e.g., occurrences of pairs.
  • The number of different things we can count is limited by main memory.
  • Swapping counts in/out is too slow.
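To see why pair counting is the memory-critical step, here is an illustrative single-pass sketch (an assumption-laden example, not a prescribed implementation): with n distinct items the dictionary can grow to roughly C(n, 2) ≈ n²/2 keys, which is exactly what outgrows main memory.

```python
from collections import defaultdict
from itertools import combinations

def count_pairs(baskets):
    """One pass over the baskets, counting co-occurring pairs in memory."""
    pair_counts = defaultdict(int)
    for basket in baskets:
        # Up to C(n, 2) distinct keys across all baskets: the bottleneck.
        for pair in combinations(sorted(basket), 2):
            pair_counts[pair] += 1
    return pair_counts

# Only the counts live in RAM; the baskets themselves stream from disk,
# e.g. count_pairs(read_baskets("retail.txt")) using the earlier reader.
```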

Computational Model – Summary
• There are two quantities that capture the cost of our algorithm:
  1. [Time] The number of passes we make over the data
  2. [Space] The amount of memory we use

• As is usually the case, there is a trade-off between the two.

The Apriori Principle
• Apriori principle (main observation):
  – If an itemset is frequent, then all of its subsets must also be frequent.
  – If an itemset is not frequent, then none of its supersets can be frequent.
  – The support of an itemset never exceeds the support of its subsets:
      ∀X, Y: X ⊆ Y ⇒ s(X) ≥ s(Y)
  – This is known as the anti-monotone property of support.

Illustration of the Apriori principle

[Figure: in the itemset lattice, an itemset found to be frequent, together with all of its subsets, which must also be frequent.]

Illustration of the Apriori principle

null
A   B   C   D   E
AB  AC  AD  AE  BC  BD  BE  CD  CE  DE
ABC  ABD  ABE  ACD  ACE  ADE  BCD  BCE  BDE  CDE
ABCD  ABCE  ABDE  ACDE  BCDE
ABCDE

If an itemset (e.g., AB) is found to be infrequent, all of its supersets are infrequent and are pruned from the lattice.

The Apriori algorithm

Level-wise approach, where Ck = candidate itemsets of size k and Lk = frequent itemsets of size k:

1. k = 1, C1 = all items
2. While Ck is not empty:
3.   [Frequent itemset generation] Scan the database to find which itemsets in Ck are frequent, and put them into Lk
4.   [Candidate generation] Generate the candidate itemsets Ck+1 of size k+1 using Lk
5.   k = k+1

R. Agrawal, R. Srikant: “Fast Algorithms for Mining Association Rules”, Proc. of the 20th Int'l Conference on Very Large Databases, 1994.
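A compact, runnable sketch of this level-wise loop (an illustration using the ordered-tuple representation introduced later in these slides, not the authors' original code):

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise frequent-itemset mining; itemsets are sorted tuples."""
    transactions = [set(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    Ck = [(i,) for i in items]          # step 1: C1 = all items
    frequent = {}
    while Ck:                           # step 2
        # Step 3: scan the database, keep candidates with support >= minsup.
        Lk = {}
        for c in Ck:
            s = sum(1 for t in transactions if set(c) <= t)
            if s >= minsup:
                Lk[c] = s
        frequent.update(Lk)
        # Step 4: join Lk itemsets sharing their first k-1 items, then
        # prune any candidate that has an infrequent k-subset.
        Ck = []
        for a, b in combinations(sorted(Lk), 2):
            if a[:-1] == b[:-1]:
                cand = a + (b[-1],)
                if all(sub in Lk for sub in combinations(cand, len(a))):
                    Ck.append(cand)     # step 5: continue with k+1
    return frequent

# On the example market-basket data with minsup = 3, this finds the four
# frequent items and the four frequent pairs shown earlier; the only
# candidate triplet {Bread, Milk, Diaper} has support 2 and is rejected.
```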

Candidate Generation
• Apriori principle: an itemset of size k+1 is a candidate to be frequent only if all of its subsets of size k are known to be frequent.

• Candidate generation:
  • Construct a candidate of size k+1 by combining frequent itemsets of size k.
  • If k = 1, take all pairs of frequent items.
  • If k > 1, join pairs of itemsets that differ by just one item.
  • For each generated candidate itemset, ensure that all subsets of size k are frequent.

Generate Candidates Ck+1
• Assumption: the items in an itemset are ordered
  • Integers in increasing order, strings lexicographically
  • The order ensures that if item y > x appears before x, then x is not in the itemset
• The itemsets in Lk are also ordered

Create a candidate itemset of size k+1 by joining two itemsets of size k that share the first k-1 items.

Item1  Item2  Item3
1      2      3
1      2      5
1      4      5

Generate Candidates Ck+1
Joining {1, 2, 3} and {1, 2, 5}, which share the first k-1 = 2 items, produces the candidate {1, 2, 3, 5}.

Generate Candidates Ck+1
Are we missing something? What about the candidate {1, 2, 4, 5}?
It is not produced by the join, and correctly so: if {1, 2, 4, 5} were frequent, its subset {1, 2, 4} would also have to be frequent, but {1, 2, 4} is not in Lk.

Example
• L3 = {abc, abd, acd, ace, bcd}
• Generating candidate set C4
• Self-join: L3 * L3

item1  item2  item3      item1  item2  item3
a      b      c          a      b      c
a      b      d          a      b      d
a      c      d          a      c      d
a      c      e          a      c      e
b      c      d          b      c      d


Example
• Self-join condition: p.item1 = q.item1, p.item2 = q.item2, p.item3 < q.item3
• Joining {a, b, c} with {a, b, d} gives {a, b, c, d}, so C4 = {abcd}

Example
• Joining {a, c, d} with {a, c, e} gives {a, c, d, e}, so after the self-join C4 = {abcd, acde}
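The same join-plus-prune step as a standalone sketch, run on the L3 of this example (illustrative code, not from the slides). Note that the slides show the self-join output only; the subsequent prune step removes acde, because some of its 3-subsets are not in L3:

```python
from itertools import combinations

def generate_candidates(Lk):
    """Self-join k-itemsets sharing their first k-1 items, then prune any
    candidate with an infrequent k-subset (the Apriori principle)."""
    Lk = sorted(Lk)
    Lk_set = set(Lk)
    k = len(Lk[0])
    candidates = []
    for p, q in combinations(Lk, 2):
        if p[:-1] == q[:-1]:          # p and q share the first k-1 items
            cand = p + (q[-1],)
            if all(s in Lk_set for s in combinations(cand, k)):
                candidates.append(cand)
    return candidates

L3 = [tuple("abc"), tuple("abd"), tuple("acd"), tuple("ace"), tuple("bcd")]
print(generate_candidates(L3))
# [('a', 'b', 'c', 'd')] -- abcd survives; acde is generated by the join
# but pruned, since its 3-subsets ade and cde are not in L3.
```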

Number of Combinations

[Figure: the number of possible itemsets grows exponentially with the number of items d.]


Illustration of the Apriori principle

minsup = 3

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Items (1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Pairs (2-itemsets); no need to generate candidates involving Coke or Eggs:
Itemset         Count
{Bread,Milk}    3
{Bread,Beer}    2
{Bread,Diaper}  3
{Milk,Beer}     2
{Milk,Diaper}   3
{Beer,Diaper}   3

Triplets (3-itemsets):
Itemset              Count
{Bread,Milk,Diaper}  2

Only this triplet has all of its subsets frequent, but it is below the minsup threshold.

If every subset were considered: C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates.
With support-based pruning: 6 + 6 + 1 = 13 candidates.
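These counts can be verified directly (an illustrative one-liner using Python's `math.comb`):

```python
from math import comb

print(comb(6, 1) + comb(6, 2) + comb(6, 3))  # 41 candidates without pruning
print(6 + 6 + 1)                             # 13 with support-based pruning
```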
