Handling Large Datasets

- The PCY algorithm improves A-Priori by hashing item pairs to buckets during the first pass and, on the second pass, counting only pairs that hash to frequent buckets.
- The multistage algorithm further reduces memory use by rehashing, with a different hash function, only the pairs that survive PCY's first pass; the multihash algorithm gets a similar benefit by using several independent hash tables on the first pass.
- The simple sampling, SON, and Toivonen algorithms find frequent itemsets in at most two full passes by mining a sample (or memory-sized chunks) of the transactions in main memory and then verifying the candidate itemsets in a pass over the full database.


Handling Larger Datasets in Main Memory
Improvements to A-Priori

1
PCY Algorithm (Park-Chen-Yu)
• Hash-based improvement to A-Priori.

• During Pass 1 of A-Priori, most memory is idle.

• Use that memory to keep counts of buckets into which the pairs of items are hashed.
– Just the count, not the pairs themselves.

• Gives an extra condition that candidate pairs must satisfy on Pass 2.

2
Picture of PCY

[Figure: main-memory layout. Pass 1: item counts and the hash table of bucket counts. Pass 2: frequent items, the bitmap of frequent buckets, and counts of candidate pairs.]

3
PCY Algorithm – Before Pass 1
Organize Main Memory
• Space to count each item.
– One (typically) 4-byte integer per item.

• Use the rest of the space for as many integers, representing buckets, as we can.

4
PCY Algorithm – Pass 1

FOR (each basket) {
    FOR (each item)
        add 1 to item’s count;
    FOR (each pair of items) {
        hash the pair to a bucket;
        add 1 to the count for that bucket;
    }
}

5
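
A minimal Python sketch of Pass 1, under assumptions not in the slides: baskets are lists of item IDs, and the number of buckets and the use of Python’s built-in hash on a pair are illustrative choices.

from collections import defaultdict
from itertools import combinations

NUM_BUCKETS = 1_000_003  # illustrative; in practice, as many buckets as main memory allows

def pcy_pass1(baskets):
    # Count each item, and hash every pair in every basket to a bucket,
    # keeping only the bucket's count, not the pairs themselves.
    item_counts = defaultdict(int)
    bucket_counts = [0] * NUM_BUCKETS
    for basket in baskets:
        for item in basket:
            item_counts[item] += 1
        for pair in combinations(sorted(set(basket)), 2):
            bucket_counts[hash(pair) % NUM_BUCKETS] += 1
    return item_counts, bucket_counts
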
Observations About Buckets
1. If a bucket contains a frequent pair, then the bucket is surely
frequent.
– We cannot use the hash table to eliminate any member of this
bucket.

2. Even without any frequent pair, a bucket can be frequent.


– Again, nothing in the bucket can be eliminated.

3. But in the best case, the count for a bucket is less than the
support s.
– Now, all pairs that hash to this bucket can be eliminated as
candidates, even if the pair consists of two frequent items.

6
PCY Algorithm – Between Passes
• Replace the buckets by a bit-vector:
– 1 means the bucket count exceeds the support s (a frequent bucket);
– 0 means it did not.

• 4-byte integers are replaced by bits, so the bit-vector requires 1/32 of the memory.

• Also, decide which items are frequent and list them for the second pass.

7
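
A sketch of the between-pass step, continuing the Pass 1 sketch above; a real implementation would pack the bit-vector into actual bits (32 buckets per 4-byte word) rather than a Python list.

def pcy_between_passes(item_counts, bucket_counts, s):
    # Summarize Pass 1: one entry per bucket (1 = frequent bucket), plus the frequent items.
    bitmap = [1 if count >= s else 0 for count in bucket_counts]
    frequent_items = {item for item, count in item_counts.items() if count >= s}
    return bitmap, frequent_items
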
PCY Algorithm – Pass 2
• Count all pairs {i, j} that meet the conditions:
1. Both i and j are frequent items.
2. The pair {i, j} hashes to a bucket whose bit in the bit-vector is 1.

• Notice that both these conditions are necessary for the pair to have a chance of being frequent.

8
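
A sketch of Pass 2 under the same assumptions; the pair must be hashed with the same function as in Pass 1 so it lands in the same bucket.

from collections import defaultdict
from itertools import combinations

def pcy_pass2(baskets, frequent_items, bitmap, s):
    # Count only the pairs that satisfy both PCY conditions.
    pair_counts = defaultdict(int)
    for basket in baskets:
        items = sorted(set(basket) & frequent_items)   # condition 1: both items frequent
        for pair in combinations(items, 2):
            if bitmap[hash(pair) % len(bitmap)]:       # condition 2: frequent bucket
                pair_counts[pair] += 1
    return {pair: count for pair, count in pair_counts.items() if count >= s}
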
Memory Details
• Hash table requires buckets of 2-4 bytes.
– The number of buckets is thus roughly 1/4 to 1/2 of the number of bytes of main memory.

• On the second pass, a table of (item, item, count) triples is essential.
– A triple takes about 12 bytes, versus 4 bytes per pair in a triangular matrix, so the hash table must eliminate at least 2/3 of the candidate pairs for PCY to beat A-Priori with a triangular matrix of counts.

9
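
For reference, a sketch of the triangular-matrix layout this comparison assumes: counts for all pairs {i, j} with 1 <= i < j <= n live in one flat array, 4 bytes each, whether or not the pair ever occurs; the indexing formula below is a common choice, not something fixed by the slides.

def triangular_index(i, j, n):
    # 1-based position of the count for pair {i, j}, 1 <= i < j <= n,
    # in a flat array of n*(n-1)/2 pair counts.
    return (i - 1) * (2 * n - i) // 2 + (j - i)
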
Multistage Algorithm
(Improvements to PCY)
• It might happen that even after hashing there are still too many surviving pairs, and main memory isn't sufficient to hold their counts.

• Key idea: after Pass 1 of PCY, rehash only those pairs that qualify for Pass 2 of PCY.
– Using a different hash function!

• On the middle pass, fewer pairs contribute to buckets, so there are fewer false positives (frequent buckets with no frequent pair).
10
Multistage Picture

[Figure: main-memory layout. Pass 1: item counts and the first hash table. Pass 2: frequent items, Bitmap 1, and the second hash table. Pass 3: frequent items, Bitmap 1, Bitmap 2, and counts of candidate pairs.]

11
Multistage – Pass 3

• Count only those pairs {i, j} that satisfy:
1. Both i and j are frequent items.
2. Using the first hash function, the pair hashes to a bucket whose bit in the first bit-vector is 1.
3. Using the second hash function, the pair hashes to a bucket whose bit in the second bit-vector is 1.

12
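
A sketch of the multistage middle pass and Pass 3, continuing the PCY sketches above; the second hash function is simulated by tagging the pair before hashing, which is purely an illustrative choice.

from collections import defaultdict
from itertools import combinations

def hash2(pair, num_buckets):
    # A second hash function, independent of the first (illustrative construction).
    return hash(("h2",) + pair) % num_buckets

def multistage_middle_pass(baskets, frequent_items, bitmap1, num_buckets2):
    # Rehash only the pairs that already satisfy PCY's Pass 2 conditions.
    bucket_counts2 = [0] * num_buckets2
    for basket in baskets:
        items = sorted(set(basket) & frequent_items)
        for pair in combinations(items, 2):
            if bitmap1[hash(pair) % len(bitmap1)]:
                bucket_counts2[hash2(pair, num_buckets2)] += 1
    return bucket_counts2

def multistage_pass3(baskets, frequent_items, bitmap1, bitmap2, s):
    # Count only pairs of frequent items whose buckets are frequent under BOTH hash functions.
    pair_counts = defaultdict(int)
    for basket in baskets:
        items = sorted(set(basket) & frequent_items)
        for pair in combinations(items, 2):
            if bitmap1[hash(pair) % len(bitmap1)] and bitmap2[hash2(pair, len(bitmap2))]:
                pair_counts[pair] += 1
    return {pair: count for pair, count in pair_counts.items() if count >= s}
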
Multihash
• Key idea: use several independent hash tables on the first pass.

• Risk: halving the number of buckets doubles the average count. We have to be sure most buckets will still not reach count s.

• If so, we can get a benefit like multistage, but in only 2 passes.
13
Multihash Picture

[Figure: main-memory layout. Pass 1: item counts plus the first and second hash tables. Pass 2: frequent items, Bitmap 1, Bitmap 2, and counts of candidate pairs.]

14
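
A sketch of the multihash first pass under the same assumptions; both tables are filled in a single scan over the baskets, each with roughly half the buckets a single PCY table would get.

from collections import defaultdict
from itertools import combinations

def multihash_pass1(baskets, num_buckets, s):
    item_counts = defaultdict(int)
    table1 = [0] * num_buckets      # two smaller tables instead of one large PCY table
    table2 = [0] * num_buckets
    for basket in baskets:
        for item in basket:
            item_counts[item] += 1
        for pair in combinations(sorted(set(basket)), 2):
            table1[hash(pair) % num_buckets] += 1
            table2[hash(("h2",) + pair) % num_buckets] += 1
    bitmap1 = [1 if count >= s else 0 for count in table1]
    bitmap2 = [1 if count >= s else 0 for count in table2]
    frequent_items = {item for item, count in item_counts.items() if count >= s}
    return frequent_items, bitmap1, bitmap2

Pass 2 then counts a pair only if both of its bits are 1, exactly like multistage Pass 3 above.
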
Extensions
• Either multistage or multihash can use more than two hash functions.

• In multistage, there is a point of diminishing returns, since the bit-vectors eventually consume all of main memory.

• For multihash, the bit-vectors together occupy exactly what one PCY bitmap does, but too many hash functions make all bucket counts exceed s.
15
All (Or Most) Frequent Itemsets In < 2 Passes
• Simple algorithm.
• SON (Savasere, Omiecinski, and Navathe).
• Toivonen.

16
Simple Algorithm
• Take a random sample of the market baskets.

• Run A-Priori (for sets of all sizes, not just pairs) in main memory, so you don’t pay for disk I/O each time you increase the size of the itemsets.
– Be sure you leave enough space for counts.

[Figure: main memory holds a copy of the sample baskets plus space for counts.]

• Use as support threshold a suitable, scaled-back number.
– E.g., if your sample is 1/100 of the baskets, use s/100 as your support threshold instead of s.
17
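
A sketch of the simple algorithm; here apriori_in_memory stands for a hypothetical in-memory A-Priori routine that returns all itemsets meeting the given threshold, which the slides assume but do not define.

import random

def simple_algorithm(baskets, s, sample_fraction=0.01):
    # Keep each basket with probability sample_fraction, then mine the sample
    # in main memory with a proportionally scaled-back support threshold.
    sample = [basket for basket in baskets if random.random() < sample_fraction]
    scaled_s = s * sample_fraction              # e.g. s/100 for a 1% sample
    return apriori_in_memory(sample, scaled_s)  # hypothetical in-memory A-Priori
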
Simple Algorithm – Option
• Optionally, verify that your guesses are truly frequent in the entire data set by a second pass.

• But you don’t catch sets that are frequent in the whole data but not in the sample.
– A smaller threshold, e.g., s/125, helps catch more truly frequent itemsets.
• But it requires more space.

18
SON Algorithm – (1)

• Repeatedly read small subsets of the baskets into main memory and perform the first pass of the simple algorithm on each subset.

• An itemset becomes a candidate if it is found to be frequent in any one or more of the subsets of the baskets.

• On a second pass, count all the candidate itemsets and determine which are frequent in the entire set.

• Key “monotonicity” idea: an itemset cannot be frequent in the whole set of baskets unless it is frequent in at least one subset.

19

SON Algorithm – Distributed Version

• This idea lends itself to distributed data mining.

• If baskets are distributed among many nodes, compute frequent itemsets at each node, then distribute the candidates from each node.

• Finally, accumulate the counts of all candidates.
20
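
A sketch of SON, assuming itemsets are represented as frozensets; apriori_in_memory is the same hypothetical in-memory A-Priori routine as in the earlier sampling sketch.

from collections import defaultdict

def son(baskets, s, num_chunks):
    # Pass 1: mine each memory-sized chunk with a proportionally lowered threshold;
    # an itemset is a candidate if it is frequent in at least one chunk.
    candidates = set()
    for k in range(num_chunks):
        chunk = baskets[k::num_chunks]
        candidates |= set(apriori_in_memory(chunk, s / num_chunks))  # hypothetical helper
    # Pass 2: count every candidate over the full data set and keep the truly frequent ones.
    counts = defaultdict(int)
    for basket in baskets:
        basket_set = set(basket)
        for itemset in candidates:
            if itemset <= basket_set:
                counts[itemset] += 1
    return {itemset for itemset, count in counts.items() if count >= s}

In the distributed version, the first loop runs independently at each node; only the candidates, and then their counts, need to be communicated.
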
Toivonen’s Algorithm – (1)

• Start as in the simple algorithm, but lower the threshold slightly for the sample.

• Example: if the sample is 1% of the baskets, use s/125 as the support threshold rather than s/100.

• The goal is to avoid missing any itemset that is frequent in the full set of baskets.
21
Toivonen’s Algorithm – (2)

• Add to the itemsets that are frequent in the sample the negative border of these itemsets.

• An itemset is in the negative border if it is not deemed frequent in the sample, but all its immediate subsets are.

• Example: ABCD is in the negative border if and only if it is not frequent in the sample, but all of ABC, BCD, ACD, and ABD are.
22
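
A sketch of computing the negative border, assuming the itemsets found frequent in the sample are given as frozensets and all_items is the set of all items appearing in the data.

def negative_border(sample_frequent, all_items):
    frequent = set(sample_frequent)
    border = set()
    # Single items not frequent in the sample are in the border:
    # their only immediate subset is the empty set, which counts as frequent.
    border |= {frozenset({x}) for x in all_items if frozenset({x}) not in frequent}
    # Every larger border set is some frequent itemset plus one extra item,
    # with all of its immediate subsets frequent in the sample.
    for itemset in frequent:
        for extra in set(all_items) - itemset:
            candidate = frozenset(itemset | {extra})
            if candidate not in frequent and all(candidate - {x} in frequent for x in candidate):
                border.add(candidate)
    return border
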
Toivonen’s Algorithm – (3)

• In a second pass, count all candidate frequent itemsets from the first pass, and also count their negative border.

• If no itemset from the negative border turns out to be frequent, then the candidates found to be frequent in the whole data are exactly the frequent itemsets.

23
Toivonen’s Algorithm – (4)

• What if we find that something in the negative border is actually frequent?
• We must start over again!

• Try to choose the support threshold so that the probability of failure is low, while the number of itemsets checked on the second pass still fits in main memory.

24
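
A sketch of the second pass and the restart test, assuming candidates and border are sets of frozensets produced as in the previous sketches.

from collections import defaultdict

def toivonen_second_pass(baskets, candidates, border, s):
    # Count both the candidate itemsets and their negative border over the full data.
    counts = defaultdict(int)
    to_count = candidates | border
    for basket in baskets:
        basket_set = set(basket)
        for itemset in to_count:
            if itemset <= basket_set:
                counts[itemset] += 1
    if any(counts[itemset] >= s for itemset in border):
        return None  # something in the negative border is frequent: start over with a new sample
    return {itemset for itemset in candidates if counts[itemset] >= s}
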
Theorem:

• If there is an itemset that is frequent in the whole data but not frequent in the sample,
• then there is a member of the negative border for the sample that is frequent in the whole data.

25
Proof:
• Suppose not; i.e., there is an itemset S
frequent in the whole but
– Not frequent in the sample, and
– Not present in the sample’s negative border.
• Let T be a smallest subset of S that is not
frequent in the sample.
• T is frequent in the whole (S is frequent,
monotonicity).
• T is in the negative border: all of T’s immediate subsets are frequent in the sample (otherwise T would not be a smallest subset that is not frequent in the sample), while T itself is not.
26
