
Frequent Itemsets and Association Rules

Market Baskets
Frequent Itemsets
A-Priori Algorithm


The Market-Basket Model

! A large set of items, e.g., things sold in a supermarket.
! A large set of baskets, each of which is a small set of the items, e.g., the things one customer buys on one day.



Market-Baskets – (2)

! A general many-many mapping (association) between two kinds of things.
− But we ask about connections among “items,” not “baskets.”
! The technology focuses on common events, not rare events (“long tail”).



Support

! Simplest question: find sets of items that appear “frequently” in the baskets.
! Support for itemset I (s(I)) = the number of baskets containing all items in I.
! Given a support threshold s, sets of items that appear in at least s baskets are called frequent itemsets.



Example: Frequent Itemsets
! Items={milk, coke, pepsi, beer, juice}.
! Support threshold s = 3 baskets.
B1 = {m, c, b} B2 = {m, p, j}
B3 = {m, b} B4 = {c, j}
B5 = {m, p, b} B6 = {m, c, b, j}
B7 = {c, b, j} B8 = {b, c}
! Frequent itemsets: {m}, {c}, {b}, {j}, {m,b}, {b,c}, {c,j}.
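A minimal Python sketch (not from the slides) of how these support counts can be checked; the baskets and the threshold are the ones above, and `support` is just a dictionary of counts:

```python
from collections import Counter
from itertools import combinations

# The eight baskets from the slide (m=milk, c=coke, p=pepsi, b=beer, j=juice).
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]
s = 3  # support threshold

# Count every itemset of size 1 and 2 that occurs in some basket.
support = Counter()
for basket in baskets:
    for k in (1, 2):
        for itemset in combinations(sorted(basket), k):
            support[itemset] += 1

frequent = [i for i, count in support.items() if count >= s]
print(sorted(frequent, key=lambda t: (len(t), t)))
# [('b',), ('c',), ('j',), ('m',), ('b', 'c'), ('b', 'm'), ('c', 'j')]
```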



Monotonicity

! For any set of items I and any set of items J with J ⊆ I, it holds that s(I) ≤ s(J).
− In other words, an itemset can never appear in more baskets than any of its subsets; every subset of a frequent itemset is itself frequent.
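As a quick illustration, a small sketch (reusing the baskets of the previous example) of the fact that the support of an itemset never exceeds the support of any of its subsets:

```python
baskets = [{"m","c","b"}, {"m","p","j"}, {"m","b"}, {"c","j"},
           {"m","p","b"}, {"m","c","b","j"}, {"c","b","j"}, {"b","c"}]

def s(itemset):
    """Support = number of baskets containing every item of `itemset`."""
    return sum(itemset <= basket for basket in baskets)

# Monotonicity: {m} and {b} are subsets of {m, b}, so their supports are at least as large.
assert s({"m", "b"}) <= s({"m"})   # 4 <= 5
assert s({"m", "b"}) <= s({"b"})   # 4 <= 6
```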



Applications – (1)

! Items = products; baskets = sets of products someone bought in one trip to the store.
! Example application: given that many people buy beer and diapers together:
− Run a sale on diapers; raise the price of beer.
! Only useful if many buy diapers & beer.



Applications – (2)

! Items = documents; baskets = sentences: for any sentence there is a basket containing all documents in which that sentence appears.
! Items (documents) that appear together too often could represent plagiarism.



Applications – (3)

! Baskets = Web pages; items = words.
! Unusual words appearing together in a large number of documents, e.g., “Hollande” and “Merkel”, may indicate an interesting connection.



Scale of the Problem

! WalMart, Carrefour sell more than 100,000 items and can store billions of baskets.
! The Web has billions of words and many billions of pages.



Association Rules

! If-then rules I → j about the contents of baskets, where I is a set of items and j is an item.
! I → j means: “if a basket contains all the items in I, then it is likely to contain j.”
! Confidence of this association rule is the probability of j given I, i.e. conf(I → j) = s(I ∪ {j}) / s(I).




Example: Confidence
+ B1 = {m, c, b}     B2 = {m, p, j}
- B3 = {m, b}        B4 = {c, j}
- B5 = {m, p, b}   + B6 = {m, c, b, j}
  B7 = {c, b, j}     B8 = {b, c}
(+ marks baskets that contain {m, b} and also c; - marks baskets that contain {m, b} but not c.)

! An association rule: {m, b} → c.
! Confidence = # baskets containing {m, b, c} / # baskets containing {m, b} = 2/4 = 0.5.
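The same computation as a short Python sketch; `support` is an illustrative helper that counts the baskets containing a given itemset:

```python
baskets = [{"m","c","b"}, {"m","p","j"}, {"m","b"}, {"c","j"},
           {"m","p","b"}, {"m","c","b","j"}, {"c","b","j"}, {"b","c"}]

def support(itemset):
    # Number of baskets containing all items of `itemset`.
    return sum(itemset <= basket for basket in baskets)

# Confidence of {m, b} -> c  =  s({m, b, c}) / s({m, b})
conf = support({"m", "b", "c"}) / support({"m", "b"})
print(conf)  # 2 / 4 = 0.5
```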


Finding Association Rules

! Question: “find all association rules with support ≥ s and confidence ≥ c.”
− Note: the “support” of an association rule is defined as s(I → j) = s(I ∪ {j}).
! Hard part: finding the frequent itemsets.



From Frequent Itemsets to AR

! Note: if I → j has high support and confidence, then I ∪ {j} is frequent, which implies that I is frequent as well.
! Hence, association rules with high confidence are found by considering every frequent itemset I and checking whether I \ {j} → j has high confidence, for each j in I.
! Difficult part: finding the frequent itemsets.
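A hedged sketch of this rule-generation step. It assumes the frequent itemsets and their supports have already been found (e.g., by A-Priori, described later) and are given as a dict keyed by frozenset; the name `association_rules` is just illustrative:

```python
def association_rules(support, min_conf):
    """Given {frozenset: support count} for all frequent itemsets,
    emit rules of the form (I - {j}) -> j whose confidence is at least min_conf."""
    rules = []
    for itemset, s_full in support.items():
        if len(itemset) < 2:
            continue
        for j in itemset:
            left = itemset - {j}
            conf = s_full / support[left]        # s(I) / s(I - {j})
            if conf >= min_conf:
                rules.append((set(left), j, conf))
    return rules

# Supports taken from the earlier example baskets:
support = {frozenset({"m"}): 5, frozenset({"b"}): 6, frozenset({"c"}): 5,
           frozenset({"m", "b"}): 4, frozenset({"b", "c"}): 4}
for left, j, conf in association_rules(support, 0.6):
    print(left, "->", j, "confidence", round(conf, 2))
# e.g. {'m'} -> b confidence 0.8, {'c'} -> b confidence 0.8, ...
```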



Computation Model

! Typically, data is kept in raw files rather than in a database system.
− Stored on disk.
− Stored basket-by-basket.
− Expand baskets into pairs, triples, etc. as you read baskets.
! Use k nested loops to generate all sets of size k.



File Organization

[Figure: file layout, item by item, with consecutive runs of items grouped into Basket 1, Basket 2, Basket 3, etc.]
Example: every line in a text document contains all items of a given basket.
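A minimal reading loop under that assumption (one basket per line, whitespace-separated item identifiers; the file name below is hypothetical):

```python
def read_baskets(path):
    """Yield one basket (a set of item identifiers) per line of the file."""
    with open(path) as f:
        for line in f:
            items = line.split()
            if items:              # skip blank lines
                yield set(items)

# for basket in read_baskets("baskets.txt"):
#     ...expand the basket into pairs, triples, etc....
```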



Computation Model – (2)

! The true cost of mining disk-resident data is usually the number of disk I/Os.
! In practice, association-rule algorithms read the data in passes – all baskets are read in turn.
! Thus, we measure the cost by the number of passes an algorithm takes.



Finding Frequent Pairs – Main-Memory Bottleneck

! For many frequent-itemset algorithms, main memory is the critical resource.
− As we read baskets, we need to count something, e.g., occurrences of pairs.
− The number of different things we can count is limited by main memory.



! The hardest problem often turns out to be finding the frequent pairs.
− Why? Often frequent pairs are common, frequent triples are rare.
− Why? The probability of being frequent drops exponentially with size; the number of sets grows more slowly with size.
! We’ll concentrate on pairs, then extend to larger sets.


Naïve Algorithm

! Read the file once, counting in main memory the occurrences of each pair.
− From each basket of n items, generate its n(n−1)/2 pairs by two nested loops.
! Fails if (#items)^2 exceeds main memory.
− Remember: #items can be 100K (Wal-Mart) or 10B (Web pages).
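A sketch of this naïve pass, assuming the baskets are available as an iterable of sets (e.g., from a `read_baskets` generator as above); `itertools.combinations` plays the role of the two nested loops:

```python
from collections import Counter
from itertools import combinations

def count_pairs_naive(baskets):
    """Single pass over the baskets, keeping an in-memory count for every pair seen."""
    pair_counts = Counter()
    for basket in baskets:
        # A basket of n items contributes n(n-1)/2 pairs.
        for pair in combinations(sorted(basket), 2):
            pair_counts[pair] += 1
    return pair_counts

# Breaks down when (#items)^2 counters no longer fit in main memory.
```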



Example: Counting Pairs

! Suppose 10^5 items.
! Suppose counts are 4-byte integers.
! Number of pairs of items: 10^5(10^5 − 1)/2 ≈ 5·10^9.
! Therefore, about 2·10^10 bytes (20 gigabytes) of main memory are needed.
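The same back-of-the-envelope calculation:

```python
n_items = 10**5
n_pairs = n_items * (n_items - 1) // 2      # ~5 * 10^9 pairs
bytes_needed = 4 * n_pairs                  # 4-byte counts
print(bytes_needed / 10**9)                 # ~20.0 gigabytes
```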



A-Priori Algorithm – (1)

! A two-pass approach called A-Priori limits the need for main memory.
! Key idea: monotonicity: if a set of items appears at least s times, so does every subset.
− For pairs: if item i does not appear in s baskets, then no pair including i can appear in s baskets.



A-Priori Algorithm – (2)
! Pass 1: Read baskets and count in main memory the occurrences of each item.
− Requires only memory proportional to #items.
! Items that appear at least s times are the frequent items.



A-Priori Algorithm – (3)
! Pass 2: Read the baskets again and count in main memory only those pairs both of whose items were found in Pass 1 to be frequent.
! To count the item pairs, use a hash function. Requires memory proportional to the square of the number of frequent items only, plus a list of the frequent items (so you know what must be counted), plus space for hashing.
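A minimal sketch of the two passes. It simplifies the memory layout by keeping the pair counts in an ordinary Python dictionary (a hash table keyed by the pair) rather than a hand-tuned structure; `read_baskets` is assumed to return a fresh iterator over the basket sets each time it is called, since the data is read once per pass:

```python
from collections import Counter
from itertools import combinations

def apriori_pairs(read_baskets, s):
    # Pass 1: count the occurrences of each individual item.
    item_counts = Counter()
    for basket in read_baskets():
        item_counts.update(basket)
    frequent_items = {i for i, c in item_counts.items() if c >= s}

    # Pass 2: count only pairs whose two items were both frequent in Pass 1.
    pair_counts = Counter()
    for basket in read_baskets():
        kept = sorted(i for i in basket if i in frequent_items)
        for pair in combinations(kept, 2):
            pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= s}
```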



Main Memory Data in A-Priori

[Figure: main-memory layout. Pass 1: counts of all items. Pass 2: the list of frequent items plus counts of pairs of frequent items.]



Frequent k-itemsets

! For each k, we construct two sets of k-itemsets (sets of size k):
− Ck = candidate k-sets = those that might be frequent sets (support ≥ s) based on information from the (k−1)th pass.
− Fk = the set of truly frequent k-itemsets.









Frequent k-itemsets

! C1 = all items.
! Fk = members of Ck with support ≥ s.
! Ck+1 = (k+1)-sets, each of whose k-element subsets is in Fk.
− (e.g., {a,b,c,d} is in C4 only if {b,c,d}, {a,c,d}, {a,b,d}, {a,b,c} are all frequent)
! When do we stop? When Ck+1 is empty.
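A small sketch of the candidate-construction step. For clarity it enumerates all (k+1)-subsets of the items occurring in Fk and filters them with the subset test, which is simple but not the most efficient join-based construction; Fk is assumed to be a set of frozensets of size k:

```python
from itertools import combinations

def construct_candidates(F_k, k):
    """C_{k+1}: the (k+1)-item sets all of whose k-item subsets are in F_k."""
    items = sorted(set().union(*F_k)) if F_k else []
    C_next = set()
    for cand in combinations(items, k + 1):
        cand = frozenset(cand)
        if all(frozenset(sub) in F_k for sub in combinations(cand, k)):
            C_next.add(cand)
    return C_next

# {a,b,c,d} would enter C_4 only if {b,c,d}, {a,c,d}, {a,b,d}, {a,b,c} are all in F_3.
```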



A-Priori for All Frequent Itemsets

! One pass for each k.
! Needs room in main memory to count each candidate k-set.
! For typical market-basket data and reasonable support (e.g., 1%), k = 2 requires the most memory.
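Putting the pieces together, a hedged sketch of the full level-wise loop: one pass over the baskets per k, stopping when no candidates remain. As before, `read_baskets` is an assumed function returning a fresh iterator of basket sets on each call:

```python
from collections import Counter
from itertools import combinations

def apriori_all(read_baskets, s):
    """Return {frozenset: support} for every frequent itemset."""
    # Pass 1: C1 = all items; F1 = items with support >= s.
    counts = Counter()
    for basket in read_baskets():
        counts.update(frozenset([i]) for i in basket)
    F_k = {i: c for i, c in counts.items() if c >= s}

    frequent, k = {}, 1
    while F_k:
        frequent.update(F_k)
        # Candidate (k+1)-sets: every k-subset must be frequent (monotonicity).
        items = sorted({i for itemset in F_k for i in itemset})
        C_next = [frozenset(c) for c in combinations(items, k + 1)
                  if all(frozenset(sub) in F_k for sub in combinations(c, k))]
        if not C_next:
            break
        # One more pass over the data to count the surviving candidates.
        counts = Counter()
        for basket in read_baskets():
            for cand in C_next:
                if cand <= basket:
                    counts[cand] += 1
        F_k = {c: n for c, n in counts.items() if n >= s}
        k += 1
    return frequent
```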

