Handling Large Datasets
Improvements to A-Priori
PCY Algorithm (Park-Chen-Yu)
• Hash-based improvement to A-Priori.
Picture of PCY
[Figure: main-memory layout. Pass 1: hash table; Pass 2: bitmap and counts of candidate pairs.]
PCY Algorithm – Before Pass 1
Organize Main Memory
• Space to count each item.
– One (typically) 4-byte integer per item.
PCY Algorithm – Pass 1
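Pass 1 counts each item and, in the otherwise-idle memory, hashes every pair in every basket to a bucket, incrementing that bucket's count. A minimal sketch, assuming baskets are lists of items (the function name and the modular hash are illustrative, not the slides' own code):

```python
from itertools import combinations

def pcy_pass1(baskets, num_buckets):
    """PCY Pass 1: count individual items and, on the same pass,
    hash each pair in each basket to a bucket and count the bucket."""
    item_counts = {}
    bucket_counts = [0] * num_buckets
    for basket in baskets:
        for item in basket:
            item_counts[item] = item_counts.get(item, 0) + 1
        # Every pair in the basket bumps its bucket, frequent or not.
        for pair in combinations(sorted(basket), 2):
            bucket_counts[hash(pair) % num_buckets] += 1
    return item_counts, bucket_counts
```

A bucket's count is an upper bound on the count of any pair that hashes to it, which is what the observations on the next slide exploit.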
Observations About Buckets
1. If a bucket contains a frequent pair, then the bucket is surely frequent.
– We cannot use the hash table to eliminate any member of this bucket.
2. Even without any frequent pair, a bucket can be frequent: the counts of its infrequent pairs can sum to s or more.
– Again, no pair that hashes to this bucket can be eliminated.
3. But in the best case, the count for a bucket is less than the support s.
– Now, all pairs that hash to this bucket can be eliminated as candidates, even if the pair consists of two frequent items.
PCY Algorithm – Between Passes
• Replace the buckets by a bit-vector:
– 1 means the bucket count exceeds the support s (a frequent bucket);
– 0 means it did not.
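As a sketch, the replacement is a single comprehension (assuming `bucket_counts` from Pass 1; by the usual convention a bucket is frequent when its count reaches s):

```python
def buckets_to_bitmap(bucket_counts, s):
    # One bit per bucket: 1 iff the bucket's count reaches the support s.
    return [1 if count >= s else 0 for count in bucket_counts]
```

A real implementation packs these into actual bits, 1 bit per bucket instead of a 4-byte count, which is what frees the memory for Pass 2.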
PCY Algorithm – Pass 2
• Count all pairs {i, j} that meet both conditions:
1. Both i and j are frequent items.
2. The pair {i, j} hashes to a bucket whose bit in the bit-vector is 1.
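A sketch of Pass 2, assuming `frequent_items` and `bitmap` were produced on Pass 1 and between the passes; the pair hash must be the same one used on Pass 1, and all names are illustrative:

```python
from itertools import combinations

def pcy_pass2(baskets, frequent_items, bitmap, num_buckets):
    """PCY Pass 2: count only pairs of frequent items that hash
    to a frequent bucket (bit set to 1)."""
    pair_counts = {}
    for basket in baskets:
        # Condition 1: drop infrequent items before forming pairs.
        items = sorted(i for i in basket if i in frequent_items)
        for pair in combinations(items, 2):
            # Condition 2: the pair's bucket must be frequent.
            if bitmap[hash(pair) % num_buckets]:
                pair_counts[pair] = pair_counts.get(pair, 0) + 1
    return pair_counts
```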
Memory Details
• Hash table requires buckets of 2-4 bytes.
– Number of buckets thus almost 1/4-1/2 of the
number of bytes of main memory.
Multistage Algorithm
(Improvements to PCY)
• It might happen that even after hashing there are
still too many surviving pairs and the main
memory isn't sufficient to hold their counts.
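Multistage's remedy is an extra hashing pass: after the PCY-style first pass, re-hash only the surviving pairs with a second, independent hash function, yielding a second bitmap for the final counting pass. A sketch of that middle pass, with all inputs assumed to come from a PCY-style Pass 1 (names are illustrative, and a real implementation could size the second table differently):

```python
from itertools import combinations

def multistage_pass2(baskets, frequent_items, bitmap1, num_buckets,
                     hash1, hash2):
    """Multistage Pass 2: re-hash, with a second independent hash
    function, only those pairs of frequent items whose first-pass
    bucket was frequent."""
    bucket2_counts = [0] * num_buckets
    for basket in baskets:
        items = sorted(i for i in basket if i in frequent_items)
        for pair in combinations(items, 2):
            if bitmap1[hash1(pair) % num_buckets]:
                bucket2_counts[hash2(pair) % num_buckets] += 1
    return bucket2_counts
```

These counts are then summarized into Bitmap 2, and the final pass counts only the pairs that survive both bitmaps.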
Multistage – Pass 3
Multihash
• Key idea: use several independent hash
tables on the first pass.
[Figure: main-memory layout. Pass 1: first and second hash tables; Pass 2: Bitmap 1, Bitmap 2, and counts of candidate pairs.]
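A sketch of the first pass with several tables filled on the same pass; since they share the same memory, each table is proportionally smaller (names are illustrative):

```python
from itertools import combinations

def multihash_pass1(baskets, num_buckets, hash_fns):
    """Multihash Pass 1: maintain one (smaller) bucket table per
    hash function, all updated on the same pass over the baskets."""
    item_counts = {}
    tables = [[0] * num_buckets for _ in hash_fns]
    for basket in baskets:
        for item in basket:
            item_counts[item] = item_counts.get(item, 0) + 1
        for pair in combinations(sorted(basket), 2):
            # Every pair bumps one bucket in every table.
            for table, h in zip(tables, hash_fns):
                table[h(pair) % num_buckets] += 1
    return item_counts, tables
```

On Pass 2, a pair is counted only if every table's bitmap has a 1 in that pair's bucket.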
Extensions
• Either multistage or multihash can use more
than two hash functions.
Simple Algorithm
• Take a random sample of the market baskets.
• Run A-Priori (for sets of all sizes, not just pairs) in main memory, so you don't pay for disk I/O each time you increase the size of itemsets.
– Be sure you leave enough space for counts.
[Figure: main-memory layout: a copy of the sample baskets, plus space for counts.]
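A sketch of the sampling step, assuming baskets stream from disk; keeping each basket independently with probability `fraction` yields a sample that fits in memory. (Scaling the support threshold by the same fraction for the in-memory run is a common companion step, stated here as an assumption.)

```python
import random

def sample_baskets(baskets, fraction, seed=42):
    """Keep each basket independently with probability `fraction`.
    A-Priori is then run entirely in main memory on the sample,
    with the support threshold scaled down by `fraction`."""
    rng = random.Random(seed)  # seeded for reproducibility
    return [b for b in baskets if rng.random() < fraction]
```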
SON Algorithm – (1)
Toivonen’s Algorithm – (4)
Theorem: if there is an itemset that is frequent in the whole but not frequent in the sample, then some member of the sample's negative border is frequent in the whole.
Proof:
• Suppose not; i.e., suppose that
– Some itemset S is frequent in the whole but not frequent in the sample, and
– Nothing in the sample's negative border is frequent in the whole.
• Let T be a smallest subset of S that is not frequent in the sample.
• T is frequent in the whole (S is frequent, monotonicity).
• T is in the negative border (else T would not be "smallest").
• Contradiction: T is in the negative border yet frequent in the whole.
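The proof leans on the negative border: the itemsets that are not frequent in the sample although every immediate (one-item-smaller) subset is. A small helper to compute it, assuming frequent itemsets are given as `frozenset`s (the helper name is illustrative):

```python
def negative_border(frequent_sets, items):
    """All itemsets not frequent in the sample whose every immediate
    subset IS frequent in the sample."""
    # The empty set is trivially frequent, so every non-frequent
    # singleton lands in the border.
    frequent = set(frequent_sets) | {frozenset()}
    border = set()
    # Any border set is some frequent set extended by one item,
    # since its one-smaller subsets are all frequent.
    for s in frequent:
        for item in items:
            if item in s:
                continue
            cand = s | {item}
            if cand in frequent:
                continue
            if all(cand - {x} in frequent for x in cand):
                border.add(cand)
    return border
```

For example, with frequent sets {a}, {b}, {c}, {a,b} over items a-d, the negative border is {d}, {a,c}, and {b,c}.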