
Big Data Analytics 2020

Module 5 – Big Data Mining Algorithms

1. Frequent Pattern Mining

Overview of the Chapter

◼ Handling Larger Data Sets in Main Memory
◼ Algorithm of Park, Chen and Yu
◼ The SON Algorithm and MapReduce
Market-Basket Data

◼ A large set of items, e.g., things sold in a supermarket.
◼ A large set of baskets, each of which is a small set of the items, e.g., the things one customer buys on one day.
Market-Baskets – (2)
◼ Really just a general many-to-many mapping (association) between two kinds of things, where one (the baskets) is a set of the other (the items).
▪ But we ask about connections among “items,” not “baskets.”
◼ The technology focuses on common events, not rare events (“long tail”).
Applications – (1)
◼ Items = products; baskets = sets of products someone bought in one trip to the store.
◼ Example application: given that many people buy soaps and towels together:
▪ Run a sale on soaps; raise the price of towels.
◼ Only useful if many buy soaps & towels.
Applications – (2)

◼ Baskets = Web pages; items = words.
◼ Example application: Unusual words appearing together in a large number of documents, e.g., “Priyanka” and “NBC,” may indicate an interesting relationship.

Applications – (3)
◼ Baskets = sentences; items = documents containing those sentences.
◼ Example application: Items that appear together too often could represent plagiarism.
◼ Notice items do not have to be “in” baskets.
Definition: Frequent Itemset
◼ Itemset
▪ A collection of one or more items
▪ Example: {Milk, Bread, Diaper}
▪ k-itemset: an itemset that contains k items
◼ Support (σ)
▪ Count: frequency of occurrence of an itemset
▪ E.g., σ({Milk, Bread, Diaper}) = 2 (in an example database of 5 transactions)
▪ Fraction: fraction of transactions that contain an itemset
▪ E.g., s({Milk, Bread, Diaper}) = 2/5 = 40%
◼ Frequent Itemset
▪ An itemset whose support is greater than or equal to a minsup threshold
Mining Frequent Itemsets task
◼ Input: a set of transactions T, over a set of items I
◼ Output: all itemsets with items in I having
▪ support ≥ minsup threshold
◼ Problem parameters:
▪ N = |T|: number of transactions
▪ d = |I|: number of (distinct) items
▪ w: max width of a transaction
▪ Number of possible itemsets: M = 2^d
◼ Scale of the problem:
▪ Walmart sells 100,000 items and stores billions of baskets.
▪ The Web has billions of words and many billions of pages.
The itemset lattice

Given d items, there are 2^d possible itemsets.
Computation Model
◼ Typically, data is kept in flat files rather than in a database system.
▪ Stored on disk.
▪ Stored basket-by-basket.
▪ Expand baskets into pairs, triples, etc. as you read baskets.
▪ Use k nested loops to generate all sets of size k (see the sketch below).
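A minimal sketch of this expansion step, assuming baskets are stored one per line as space-separated item IDs (as in the retail file shown next); itertools.combinations plays the role of the k nested loops, and the filename "retail.txt" is only illustrative:

    from itertools import combinations
    from collections import Counter

    def itemsets_of_size_k(basket_file, k):
        """Stream all k-item subsets of every basket in the file."""
        with open(basket_file) as f:
            for line in f:
                basket = sorted(set(line.split()))   # one basket per line
                # equivalent to k nested loops over the basket
                for subset in combinations(basket, k):
                    yield subset

    # usage: count how often each pair (k = 2) occurs
    pair_counts = Counter(itemsets_of_size_k("retail.txt", 2))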
Example file: retail

Example: items are positive integers, and each basket corresponds to a line in the file of space-separated integers:

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
30 31 32
33 34 35
36 37 38 39 40 41 42 43 44 45 46
38 39 47 48
38 39 48 49 50 51 52 53 54 55 56 57 58
32 41 59 60 61 62
3 39 48
63 64 65 66 67 68
32 69
48 70 71 72
39 73 74 75 76 77 78 79
36 38 39 41 48 79 80 81
82 83 84
41 85 86 87 88
39 48 89 90 91 92 93 94 95 96 97 98 99 100 101
36 38 39 48 89
39 41 102 103 104 105 106 107 108
38 39 41 109 110
39 111 112 113 114 115 116 117 118
119 120 121 122 123 124 125 126 127 128 129 130 131 132 133
48 134 135 136
39 48 137 138 139 140 141 142 143 144 145 146 147 148 149
File Organization

Example: items are stored basket-by-basket (Basket 1, Basket 2, Basket 3, etc.); items are positive integers, and the boundaries between baskets are marked by –1.
Computation Model – (2)

◼ The true cost of mining disk-resident data is usually the number of disk I/O's.
◼ In practice, association-rule algorithms read the data in passes – all baskets read in turn.
◼ Thus, we measure the cost by the number of passes an algorithm takes.
The Apriori Principle

◼ If an itemset is frequent, then all of its subsets must also be frequent.
◼ If an itemset is not frequent, then none of its supersets can be frequent.
◼ The support of an itemset never exceeds the support of its subsets.
◼ This is known as the anti-monotone property of support.
Illustration of the Apriori principle

Once an itemset is found to be infrequent, all of its supersets are infrequent as well and can be pruned from the search.
The Apriori algorithm (level-wise approach)
Ck = candidate itemsets of size k
Lk = frequent itemsets of size k

1. k = 1, C1 = all items
2. While Ck is not empty:
3.   Frequent-itemset generation: scan the database to find which itemsets in Ck are frequent and put them into Lk
4.   Candidate generation: use Lk to generate a collection of candidate itemsets Ck+1 of size k+1
5.   k = k+1

R. Agrawal, R. Srikant: "Fast Algorithms for Mining Association Rules", Proc. of the 20th Int'l Conference on Very Large Databases, 1994.
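A compact sketch of this level-wise loop, assuming the transactions fit in memory as a list of Python sets (a simplification of the disk-based, pass-oriented model in these slides); the function name apriori is illustrative, not from a library:

    from itertools import combinations

    def apriori(transactions, minsup):
        """Return all frequent itemsets (as frozensets) with support count >= minsup."""
        items = {item for t in transactions for item in t}
        Ck = [frozenset([i]) for i in items]          # C1 = all items
        k, frequent = 1, {}
        while Ck:
            # Frequent-itemset generation: one pass over the data
            counts = {c: 0 for c in Ck}
            for t in transactions:
                for c in Ck:
                    if c <= t:
                        counts[c] += 1
            Lk = {c: n for c, n in counts.items() if n >= minsup}
            frequent.update(Lk)
            # Candidate generation: join Lk with itself, keep (k+1)-sets
            prev = list(Lk)
            Ck = {a | b for a in prev for b in prev if len(a | b) == k + 1}
            # Prune candidates that have an infrequent k-subset (Apriori principle)
            Ck = [c for c in Ck if all(frozenset(s) in Lk for s in combinations(c, k))]
            k += 1
        return frequent

    # usage on the tiny PCY example that appears later in these notes:
    # apriori([{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}], minsup=2)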
A simple hash structure

◼ Create a dictionary (hash table) that stores the candidate itemsets as keys, and the number of appearances as the value.
▪ Initialize all counts to zero.
◼ Increment the counter for each candidate itemset that you see in the data.
Example

Suppose you have 15 candidate itemsets of length 3:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}

The hash table stores the counts of the candidate itemsets as they have been computed so far:

Key       Value
{3 6 7}   0
{3 4 5}   1
{1 3 6}   3
{1 4 5}   5
{2 3 4}   2
{1 5 9}   1
{3 6 8}   0
{4 5 7}   2
{6 8 9}   0
{5 6 7}   3
{1 2 4}   8
{3 5 7}   1
{1 2 5}   0
{3 5 6}   1
{4 5 8}   0
Example (continued)

Transaction {1, 2, 3, 5, 6} generates the following itemsets of length 3:
{1 2 3}, {1 2 5}, {1 2 6}, {1 3 5}, {1 3 6}, {1 5 6}, {2 3 5}, {2 3 6}, {2 5 6}, {3 5 6}

Increment the counters only for those itemsets that appear in the dictionary ({1 3 6}, {1 2 5}, and {3 5 6} here):

Key       Value
{3 6 7}   0
{3 4 5}   1
{1 3 6}   4
{1 4 5}   5
{2 3 4}   2
{1 5 9}   1
{3 6 8}   0
{4 5 7}   2
{6 8 9}   0
{5 6 7}   3
{1 2 4}   8
{3 5 7}   1
{1 2 5}   1
{3 5 6}   2
{4 5 8}   0
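A small sketch of this counting step, assuming the candidate counts live in a plain Python dict keyed by frozensets (the 15 candidates above, with their initial values):

    from itertools import combinations

    # candidate 3-itemsets and their running counts (initial values from the first table above)
    counts = {frozenset(c): v for c, v in [
        ((3, 6, 7), 0), ((3, 4, 5), 1), ((1, 3, 6), 3), ((1, 4, 5), 5), ((2, 3, 4), 2),
        ((1, 5, 9), 1), ((3, 6, 8), 0), ((4, 5, 7), 2), ((6, 8, 9), 0), ((5, 6, 7), 3),
        ((1, 2, 4), 8), ((3, 5, 7), 1), ((1, 2, 5), 0), ((3, 5, 6), 1), ((4, 5, 8), 0),
    ]}

    def count_basket(basket, counts, k=3):
        """Increment the counts of all candidate k-subsets present in the basket."""
        for subset in combinations(sorted(basket), k):
            key = frozenset(subset)
            if key in counts:          # only candidates already in the dictionary are counted
                counts[key] += 1

    count_basket({1, 2, 3, 5, 6}, counts)
    # counts for {1 3 6} -> 4, {1 2 5} -> 1, {3 5 6} -> 2, as in the second table above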
Association Rule anti-monotonicity

◼ Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule (or, equivalently, monotone with respect to the size of the LHS of the rule).
◼ E.g., for L = {A,B,C,D}:
  c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
Rule Generation for the Apriori Algorithm

◼ A candidate rule is generated by merging two rules that share the same prefix in the RHS.
◼ join(CD → AB, BD → AC) would produce the candidate rule D → ABC.
◼ Prune rule D → ABC if its subset rule AD → BC does not have high confidence.
◼ Essentially we are doing Apriori on the RHS.
Maximal vs Closed Itemsets
Pattern Evaluation
◼ Association rule algorithms tend to produce too many rules but
many of them are uninteresting or redundant
▪ Redundant if {A,B,C} → {D} and {A,B} → {D} have same
support & confidence
▪ Summarization techniques
▪ Uninteresting, if the pattern that is revealed does not offer
useful information.
▪ Interestingness measures: a hard problem to define
◼ Interestingness measures can be used to prune/rank the derived
patterns
▪ Subjective measures: require human analyst
▪ Objective measures: rely on the data.
◼ In the original formulation of association rules, support &
confidence are the only measures used
Main-Memory Bottleneck
◼ For many frequent-itemset algorithms,
main memory is the critical resource.
▪ As we read baskets, we need to count
something, e.g., occurrences of pairs.
▪ The number of different things we can count
is limited by main memory.
▪ Swapping counts in/out is a disaster (why?).
Finding Frequent Pairs
◼ The hardest problem often turns out to be
finding the frequent pairs.
▪ Why? Often frequent pairs are common,
frequent triples are rare.
▪ Why? Probability of being frequent drops
exponentially with size; number of sets grows more
slowly with size.
◼ We’ll concentrate on pairs, then extend to
larger sets.
Naïve Algorithm
◼ Read the file once, counting in main memory the occurrences of each pair.
▪ From each basket of n items, generate its n(n–1)/2 pairs by two nested loops.
◼ Fails if (#items)^2 exceeds main memory.
▪ Remember: #items can be 100K (Wal-Mart) or 10B (Web pages).
Example: Counting Pairs

◼ Suppose 10^5 items.
◼ Suppose counts are 4-byte integers.
◼ Number of pairs of items: 10^5 (10^5 – 1)/2 = 5*10^9 (approximately).
◼ Therefore, 2*10^10 bytes (20 gigabytes) of main memory needed.
Details of Main-Memory Counting
◼ Two approaches:
(1) Count all pairs, using a triangular matrix.
(2) Keep a table of triples [i, j, c] = “the count of
the pair of items {i, j } is c.”
◼ (1) requires only 4 bytes/pair.
▪ Note: always assume integers are 4 bytes.
◼ (2) requires 12 bytes, but only for those
pairs with count > 0.
Details of Main-Memory Counting

Method (1): the triangular matrix, 4 bytes per pair.
Method (2): the table of triples, 12 bytes per occurring pair.
Triangular-Matrix Approach – (1)

◼ Number items 1, 2, …, n.
▪ Requires a table of size O(n) to convert item names to consecutive integers.
◼ Count {i, j} only if i < j.
◼ Keep pairs in the order {1,2}, {1,3}, …, {1,n}, {2,3}, {2,4}, …, {2,n}, {3,4}, …, {3,n}, …, {n–1,n}.
Triangular-Matrix Approach – (2)

◼ Find pair {i, j} at the position (i – 1)(n – i/2) + j – i.
◼ Total number of pairs: n(n – 1)/2; total bytes needed: about 2n^2.
Triangular Matrix
◼ Pair {i, j} is at position (i – 1)(n – i/2) + j – i. With n = 5:
◼ {1,2}: 0 + 2 – 1 = 1
◼ {1,3}: 0 + 3 – 1 = 2
◼ {1,4}: 0 + 4 – 1 = 3
◼ {1,5}: 0 + 5 – 1 = 4
◼ {2,3}: (2–1)*(5–2/2) + 3 – 2 = 5
◼ {2,4}: (2–1)*(5–2/2) + 4 – 2 = 6
◼ {2,5}: (2–1)*(5–2/2) + 5 – 2 = 7
◼ {3,4}: (3–1)*(5–3/2) + 4 – 3 = 8
◼ {3,5}: (3–1)*(5–3/2) + 5 – 3 = 9
◼ {4,5}: (4–1)*(5–4/2) + 5 – 4 = 10
Triangular Method
The pairs are stored row by row:
1,2  1,3  1,4  1,5
     2,3  2,4  2,5
          3,4  3,5
               4,5

Pair:      1,2  1,3  1,4  1,5  2,3  2,4  2,5  3,4  3,5  4,5
Position:    1    2    3    4    5    6    7    8    9   10
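A one-function sketch of this layout, assuming 1-based item numbers and 1-based positions as in the table above:

    def triangular_index(i, j, n):
        """Position of pair {i, j} (with 1 <= i < j <= n) in the triangular layout."""
        assert 1 <= i < j <= n
        return int((i - 1) * (n - i / 2) + j - i)

    # reproduces the table above for n = 5:
    # [triangular_index(i, j, 5) for i in range(1, 5) for j in range(i + 1, 6)]
    # -> [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]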
Frequent Triples Approach

◼ Total bytes used is about 12p, where


p is the number of pairs that actually
occur.
▪ Beats triangular matrix if at most 1/3 of
possible pairs actually occur.
◼ May require extra space for retrieval
structure, e.g., a hash table.
A-Priori Algorithm – (1)

◼ A two-pass approach called a-priori


limits the need for main memory.
◼ Key idea: monotonicity : if a set of items
appears at least s times, so does every
subset.
▪ Contrapositive for pairs: if item i does not
appear in s baskets, then no pair including i
can appear in s baskets.
A-Priori Algorithm – (2)

◼ Pass 1: Read baskets and count in


main memory the occurrences of
each item.
▪ Requires only memory proportional
to #items.
◼ Items that appear at least s times
are the frequent items.
A-Priori Algorithm – (3)

◼ Pass 2: Read baskets again and count in


main memory only those pairs both of
which were found in Pass 1 to be
frequent.
▪ Requires memory proportional to square of
frequent items only (for counts), plus a list
of the frequent items (so you know what
must be counted).
Picture of A-Priori

Pass 1: main memory holds the item counts.
Pass 2: main memory holds the list of frequent items plus the counts of pairs of frequent items.
Detail for A-Priori

◼ You can use the triangular matrix


method with n = number of frequent
items.
▪ May save space compared with storing
triples.
◼ Trick: number frequent items 1,2,…
and keep a table relating new
numbers to original item numbers.
A-Priori Using Triangular Matrix for Counts

Pass 1: item counts.
Pass 2: a table mapping the frequent items to new consecutive numbers (and back to the old item numbers), plus the counts of pairs of frequent items.
A-Priori for All Frequent Itemsets

◼ For each k, we construct two collections of k-sets (sets of size k):
▪ Ck = candidate k-sets = those that might be frequent sets (support ≥ s) based on information from the pass for k – 1.
▪ Lk = the set of truly frequent k-sets.
A-Priori for All Frequent Itemsets

C1 (all items) → count all the items (first pass) → filter → L1 (frequent items) → construct → C2 (all pairs of items from L1) → count the pairs (second pass) → filter → L2 (frequent pairs) → construct → C3 (to be explained) → …
A-Priori for All Frequent Itemsets

◼ One pass for each k.


◼ Needs room in main memory to
count each candidate k -set.
◼ For typical market-basket data
and reasonable support (e.g., 1%),
k = 2 requires the most memory.
A-Priori for All Frequent Itemsets

◼ C1 = all items.
◼ In general, Lk = members of Ck with support ≥ s.
◼ Ck+1 = (k+1)-sets, each of whose size-k subsets is in Lk.
Finding the frequent pairs is usually the most expensive operation.

C1 (all items) → count all the items (first pass) → filter → L1 (frequent items) → construct → C2 (all pairs of items from L1) → count the pairs (second pass) → filter → L2 (frequent pairs) → construct → C3 → …
PCY (Park, Chen & Yu) Algorithm

◼ During Pass 1 of Apriori (computing frequent items), most memory is idle.
◼ Use that memory to keep counts of buckets into which pairs of items are hashed.
▪ Just the count, not the pairs themselves.
Needed Extensions
1. Pairs of items need to be generated
from the input file; they are not
present in the file.
2. We are not just interested in the
presence of a pair, but we need to
see whether it is present at least s
(support) times.
PCY Algorithm – (2)
◼ A bucket is frequent if its count is at least the support threshold.
◼ If a bucket is not frequent, no pair that hashes to that bucket could possibly be a frequent pair.
▪ The converse is not true: a bucket may be frequent but hold only infrequent pairs.
◼ On Pass 2 (frequent pairs), we only count pairs that hash to frequent buckets.
PCY Algorithm – Before Pass 1: Organize Main Memory

◼ Space to count each item.
▪ One (typically 4-byte) integer per item.
◼ Use the rest of the space for as many integers, representing buckets, as we can.
Picture of PCY – Pass 1

Main memory during Pass 1: item counts at the top, with the rest of memory filled by the hash table of bucket counts.
PCY Algorithm – Pass 1

FOR (each basket) {
    FOR (each item in the basket)
        add 1 to item's count;
    FOR (each pair of items in the basket) {
        hash the pair to a bucket;
        add 1 to the count for that bucket
    }
}
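A Python sketch of this pass, assuming baskets are iterables of item IDs and a fixed number of buckets; the hash function shown (a tuple hash modulo the bucket count) is an assumption, not prescribed by PCY:

    from collections import Counter
    from itertools import combinations

    def pcy_pass1(baskets, num_buckets):
        """Count individual items and hash each pair occurrence into a bucket counter."""
        item_counts = Counter()
        bucket_counts = [0] * num_buckets
        for basket in baskets:
            items = sorted(set(basket))
            for item in items:
                item_counts[item] += 1
            for i, j in combinations(items, 2):
                bucket = hash((i, j)) % num_buckets   # any pair-to-bucket hash works
                bucket_counts[bucket] += 1
        return item_counts, bucket_counts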
Observations About Buckets
1. A bucket that a frequent pair hashes to is surely frequent.
▪ We cannot use the hash table to eliminate any member
of this bucket.
2. Even without any frequent pair, a bucket can be frequent.
▪ Again, nothing in the bucket can be eliminated.
3. But in the best case, the count for a bucket is less than
the support s.
▪ Now, all pairs that hash to this bucket can be eliminated
as candidates, even if the pair consists of two frequent
items.
PCY Algorithm – Between Passes

◼ Replace the buckets by a bit-vector:
▪ 1 means the bucket is frequent; 0 means it is not.
◼ 4-byte integers are replaced by bits, so the bit-vector requires 1/32 of the memory.
◼ Also, find which items are frequent and list them for the second pass.
▪ Same as with Apriori.
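A minimal sketch of this bucket-counts-to-bitmap step, packing one bit per bucket into a bytearray (the threshold s and the bucket counts are assumed to come from the pass-1 sketch above):

    def buckets_to_bitmap(bucket_counts, s):
        """Return a bytearray with bit b set iff bucket b has count >= s."""
        bitmap = bytearray((len(bucket_counts) + 7) // 8)
        for b, count in enumerate(bucket_counts):
            if count >= s:
                bitmap[b // 8] |= 1 << (b % 8)
        return bitmap

    def bucket_is_frequent(bitmap, b):
        return bool(bitmap[b // 8] & (1 << (b % 8)))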
Picture of PCY

Pass 1: item counts plus the hash table of bucket counts.
Pass 2: the frequent items, the bitmap (one bit per bucket), and the counts of candidate pairs.
PCY Algorithm – Pass 2

◼ Count all pairs {i, j } that meet the conditions


for being a candidate pair:
1. Both i and j are frequent items.
2. The pair {i, j }, hashes to a bucket number whose
bit in the bit vector is 1.

◼ Notice both these conditions are necessary


for the pair to have a chance of being
frequent.
Example PCY

TID   Items
1     1, 3, 4
2     2, 3, 5
3     1, 2, 3, 5
4     2, 5
Example PCY – Pass 1

Item supports:
Itemset   Sup
{1}       2
{2}       3
{3}       3
{4}       1
{5}       3

Bucket counts:
Bucket   1   2   3   4   5
Count    3   2   4   1   3
PCY Algorithm in Big Data

PCY was developed by Park, Chen, and Yu. It is used for frequent itemset mining when the dataset is very large.

What is the PCY Algorithm?

The PCY algorithm (Park-Chen-Yu algorithm) is a data mining algorithm used to find frequent itemsets in large datasets. It is an improvement over the Apriori algorithm and was first described by Jong Soo Park, Ming-Syan Chen, and Philip S. Yu in the 1995 paper "An Effective Hash-Based Algorithm for Mining Association Rules".

The PCY algorithm uses hashing to efficiently count candidate pairs and reduce the overall computational cost.

The basic idea is to use a hash function to map pairs of items to hash buckets, and a table of bucket counts to decide which pairs can possibly be frequent.
Apply the PCY algorithm on the following transactions to find the candidate (frequent) sets, with minimum support threshold 3 and hash function (i*j) mod 10.
T1 = {1, 2, 3}
T2 = {2, 3, 4}
T3 = {3, 4, 5}
T4 = {4, 5, 6}
T5 = {1, 3, 5}
T6 = {2, 4, 6}
T7 = {1, 3, 4}
T8 = {2, 4, 5}
T9 = {3, 4, 6}
T10 = {1, 2, 4}
T11 = {2, 3, 5}
T12 = {3, 4, 6}
Step 1: Find the frequency of each item and keep only the length-1 candidates whose frequency meets the threshold.
Step 2: Going transaction by transaction, list all possible pairs and write their frequencies next to them. Note: do not repeat pairs that have already been written before.
Step 3: List all pairs whose frequency is at least the threshold, and then apply the hash function to each. This gives the bucket number, i.e., the bucket into which that particular pair is put.
Step 4: Finally, build a table with the following columns:
● Bit vector – 1 if the frequency of the candidate pair is greater than or equal to the threshold, otherwise 0 (here mostly 1).
● Bucket number – found in the previous step.
● Highest support count – the frequency of this candidate pair, found in step 2.
● Pair – the candidate pair itself.
● Candidate set – if the bit vector is 1, the pair is kept as a candidate and written here.
Question: Apply the PCY algorithm on the following transactions to find the candidate sets (frequent sets).

Given data
Threshold (minimum support) value = 3
Hash function = (i*j) mod 10

T1 = {1, 2, 3}
T2 = {2, 3, 4}
T3 = {3, 4, 5}
T4 = {4, 5, 6}
T5 = {1, 3, 5}
T6 = {2, 4, 6}
T7 = {1, 3, 4}
T8 = {2, 4, 5}
T9 = {3, 4, 6}
T10 = {1, 2, 4}
T11 = {2, 3, 5}
T12 = {3, 4, 6}
Use buckets and the concepts of MapReduce to solve the above problem.

Solution
1. Identify the frequency (number of occurrences) of each item in the given dataset.
2. Reduce to the candidate set of all length-1 items that meet the threshold.
3. Map pairs of candidates and find the frequency of each pair.
4. Apply the hash function to find the bucket number of each frequent pair.
5. Draw the candidate set table.
Step 1: Map all the elements in order to find their frequencies.
Items → {1, 2, 3, 4, 5, 6}

Key     1   2   3   4   5   6
Value   4   6   8   8   6   4

Step 2: Remove all elements having value less than 3.
Here there is no key having value less than 3.
Hence, candidate set = {1, 2, 3, 4, 5, 6}

Step 3: Map all the candidates into pairs and calculate their frequencies.
T1: {(1, 2) (1, 3) (2, 3)} = (2, 3, 3)
T2: {(2, 4) (3, 4)} = (3, 4)
T3: {(3, 5) (4, 5)} = (5, 3)
T4: {(4, 5) (5, 6)} = (3, 2)
T5: {(1, 5)} = 1
T6: {(2, 6)} = 1
T7: {(1, 4)} = 2
T8: {(2, 5)} = 2
T9: {(3, 6)} = 2
T10: ______
T11: ______
T12: ______
Note: pairs should not get repeated; avoid pairs that have already been written before.

Listing all the pairs having frequency at least the threshold value: {(1,3) (2,3) (2,4) (3,4) (3,5) (4,5) (4,6)}

Step 4: Apply the hash function (this gives the bucket number).

Hash function = (i * j) mod 10

(1, 3) = (1*3) mod 10 = 3
(2, 3) = (2*3) mod 10 = 6
(2, 4) = (2*4) mod 10 = 8
(3, 4) = (3*4) mod 10 = 2
(3, 5) = (3*5) mod 10 = 5
(4, 5) = (4*5) mod 10 = 0
(4, 6) = (4*6) mod 10 = 4

Now, arrange the pairs according to the ascending order of their bucket numbers.

Bucket no.   Pair
0            (4,5)
2            (3,4)
3            (1,3)
4            (4,6)
5            (3,5)
6            (2,3)
8            (2,4)
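A short script that carries out these steps on the transactions above (item and pair counting, the (i*j) mod 10 bucketing, and the frequent-bucket bit vector). Names such as transactions, THRESHOLD, and NUM_BUCKETS are local to this sketch:

    from collections import Counter
    from itertools import combinations

    transactions = [
        {1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {4, 5, 6}, {1, 3, 5}, {2, 4, 6},
        {1, 3, 4}, {2, 4, 5}, {3, 4, 6}, {1, 2, 4}, {2, 3, 5}, {3, 4, 6},
    ]
    THRESHOLD = 3
    NUM_BUCKETS = 10

    item_counts = Counter(i for t in transactions for i in t)
    pair_counts = Counter(p for t in transactions for p in combinations(sorted(t), 2))

    bucket_counts = Counter()
    for (i, j), c in pair_counts.items():
        bucket_counts[(i * j) % NUM_BUCKETS] += c          # hash function (i*j) mod 10

    bit_vector = {b: int(bucket_counts[b] >= THRESHOLD) for b in range(NUM_BUCKETS)}

    # candidate pairs: both items frequent AND the pair hashes to a frequent bucket
    candidates = [p for p, c in pair_counts.items()
                  if all(item_counts[i] >= THRESHOLD for i in p)
                  and bit_vector[(p[0] * p[1]) % NUM_BUCKETS]]
    print(sorted(candidates))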
Multistage Algorithm

◼ Key idea: after Pass 1 of PCY, rehash only those pairs that qualify for Pass 2 of PCY.
◼ On the middle pass, fewer pairs contribute to buckets, so there are fewer false positives (frequent buckets with no frequent pair).
Multistage 3 Passes
Multistage – Pass 3

◼ Count only those pairs {i, j } that satisfy


these candidate pair conditions:
1. Both i and j are frequent items.
2. Using the first hash function, the pair hashes
to a bucket whose bit in the first bit-vector is 1.
3. Using the second hash function, the pair
hashes to a bucket whose bit in the second
bit-vector is 1.
Important Points

1. The two hash functions have to be


independent.
2. We need to check both hashes on the
third pass.
▪ If not, we would wind up counting pairs of
frequent items that hashed first to a
frequent bucket but happened to hash
second to an infrequent bucket.
Multihash
◼ Key idea: use several independent hash
tables on the first pass.
◼ Risk: halving the number of buckets
doubles the average count. We have to
be sure most buckets will still not reach
count s.
◼ If so, we can get a benefit like
multistage, but in only 2 passes.
Multihash Picture

Pass 1: item counts plus two independent hash tables of bucket counts (a first and a second hash table).
Pass 2: the frequent items, Bitmap 1 and Bitmap 2, and the counts of candidate pairs.
Limited Pass Algorithms
◼ A-Priori, PCY, etc., take k passes to find frequent itemsets of size k.
◼ Other techniques use 2 or fewer passes for all sizes:
▪ Simple Randomized Sampling algorithm.
▪ SON (Savasere, Omiecinski, and Navathe).
Randomized Sampling Algorithm – (1)

◼ Take a random sample of the market baskets; keep a copy of the sample baskets in main memory.
◼ Run Apriori or one of its improvements (for sets of all sizes, not just pairs) in main memory, so you don't pay for disk I/O each time you increase the size of itemsets.
▪ Make sure the sample is small enough that there is also enough main-memory space left for the counts.
Randomized Sampling Algorithm – (2)

Randomized Sampling Algorithm – (3)
◼ To avoid bias, baskets should not be selected from only one part of the file.
◼ Similarly, if the file is part of a distributed file system, picking whole chunks to serve as the sample can be biased if baskets are not randomly distributed among the chunks.
◼ The algorithm requires only one pass over the data for itemsets of any size k.
◼ As frequent itemsets of each size are discovered, they can be written out to disk; this operation and the initial reading of the sample from disk are the only disk I/O's the algorithm does.
Randomized Sampling Algorithm – (4)

◼ Two kinds of errors:


▪ false negatives: itemsets that are
frequent in the full dataset but
infrequent in the sample dataset.
▪ false positives: itemsets that are
infrequent in the full dataset but
frequent in the sample dataset.
Randomized Sampling Algorithm – (5)
◼ Verify that your guesses are truly frequent in the entire data set by a second pass (this eliminates false positives).
◼ But you don't catch sets frequent in the whole dataset but not in the sample (false negatives).
▪ A smaller threshold for the sample (e.g., s/125 rather than s/100 for a 1% sample) helps catch more truly frequent itemsets.
▪ But it requires more space.
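A minimal sketch of the sampling pass, selecting each basket with probability p and lowering the threshold proportionally (the 0.8 factor corresponds to using s/125 instead of s/100 for a 1% sample). It reuses the apriori() function sketched earlier in these notes; the parameter names are illustrative:

    import random

    def sample_and_mine(baskets, s, p=0.01, seed=42):
        """Pick each basket with probability p and mine the sample in main memory."""
        rng = random.Random(seed)
        sample = [set(b) for b in baskets if rng.random() < p]
        # scale the support threshold down to the sample, slightly below p*s
        sample_threshold = max(1, int(0.8 * p * s))
        return apriori(sample, sample_threshold)   # apriori() as sketched earlier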
SON Algorithm – (1)

◼ First pass: break the data into chunks that can be processed in main memory.
◼ Read one chunk at a time.
▪ Find all frequent itemsets for each chunk.
▪ Threshold = s / number of chunks.
◼ An itemset becomes a candidate if it is found to be frequent in any one or more chunks of the baskets.
SON Algorithm – (2)

◼ Second pass: count all the candidate itemsets and determine which are frequent in the entire set.
◼ Key "monotonicity" idea: an itemset cannot be frequent in the entire set of baskets unless it is frequent in at least one subset.
▪ Why? If an itemset falls below threshold s/k in every one of the k chunks, its total count is below k*(s/k) = s, so it cannot be frequent overall.
SON Algorithm – Distributed Version

◼ This idea lends itself to distributed data


mining.
◼ If baskets are distributed among many
nodes, compute frequent itemsets at each
node, then distribute the candidates from
each node.
◼ Finally, accumulate the counts of all
candidates.
The first map-reduce step:

map(key, value):
    // value is a chunk of the full dataset
    Count occurrences of itemsets in the chunk.
    for itemset in itemsets:
        if supp(itemset) >= p*s:
            emit(itemset, null)

reduce(key, values):
    emit(key, null)
The second map-reduce step:

map(key, value):
    // value is the candidate itemsets and a chunk of the full dataset
    Count occurrences of the candidate itemsets in the chunk.
    for itemset in itemsets:
        emit(itemset, supp(itemset))

reduce(key, values):
    result = 0
    for value in values:
        result += value
    if result >= s:
        emit(key, result)
First Phase Map Reduce
◼ First Map Function:
▪ Take the assigned subset of the baskets and find
the itemsets frequent in the subset using the
simple Randomized Algorithm.
▪ Lower the support threshold from s to ps if each
Map task gets fraction p of the total input file.
▪ The output is a set of key-value pairs (F, 1),
where F is a frequent itemset from the sample.
▪ The value is always 1 and is irrelevant.
First Phase - MR

◼ First Reduce Function:


▪ Each Reduce task is assigned a set of keys,
which are itemsets.
▪ The value is ignored, and the Reduce task
simply produces those keys (itemsets) that
appear one or more times.
▪ Thus, the output of the first Reduce
function is the candidate itemsets.
Second Phase Map Reduce
◼ Second Map Function:
▪ This Map function takes all the output from the first
Reduce Function (the candidate itemsets) and a
portion of the input data file.
▪ Each Map task counts the number of occurrences of
each of the candidate itemsets among the baskets
in the portion of the dataset that it was assigned.
▪ The output is a set of key-value pairs (C, v), where C
is one of the candidate sets and v is the support for
that itemset among the baskets that were input to
this Map task.
Second Phase Map Reduce
◼ Second Reduce Function:
▪ The Reduce tasks take the itemsets they are given as
keys and sum the associated values.
▪ The result is the total support for each of the
itemsets that the Reduce task was assigned to
handle.
▪ Those itemsets whose sum of values is at least s are
frequent in the whole dataset, so the Reduce task
outputs these itemsets with their counts.
▪ Itemsets that do not have total support at least s are
not transmitted to the output of the Reduce task.
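A single-machine sketch that simulates the two phases just described (candidates from chunks, then exact counts over all baskets). It reuses the apriori() function sketched earlier in these notes, and the splitting into equal-size chunks is an assumption for illustration:

    from collections import Counter

    def son(baskets, s, num_chunks=4):
        """Two-phase SON: candidates from chunks, then exact counts over all baskets."""
        baskets = [frozenset(b) for b in baskets]
        chunk_size = max(1, len(baskets) // num_chunks)
        chunks = [baskets[i:i + chunk_size] for i in range(0, len(baskets), chunk_size)]

        # Phase 1 (first Map/Reduce): union of itemsets frequent in at least one chunk
        candidates = set()
        for chunk in chunks:
            p = len(chunk) / len(baskets)
            candidates |= set(apriori(chunk, max(1, int(p * s))))   # apriori() as sketched earlier

        # Phase 2 (second Map/Reduce): count every candidate over the full dataset
        totals = Counter()
        for basket in baskets:
            for c in candidates:
                if c <= basket:
                    totals[c] += 1
        return {c: n for c, n in totals.items() if n >= s}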
Map Reduce
Big Data Analytics 2020
Module 5 – Big Data Mining Algorithms

Clustering Approaches

Overview of the Chapter

◼ What Is Clustering?
◼ Challenges of Big Data Clustering
◼ CURE Algorithm
◼ Canopy Clustering
◼ Clustering with MapReduce
■ Clustering is an important unsupervised learning technique.
■ It deals with finding a structure in a collection of unlabeled data.
■ Clustering is "the process of organizing objects into groups whose members are similar in some way".
■ A cluster is therefore a collection of objects which are "similar" to each other and "dissimilar" to the objects belonging to other clusters.
■ Clustering Algorithms:
■ A clustering algorithm tries to analyse natural groups of data on the basis of some similarity.
■ It locates the centroid of each group of data points.
■ To carry out effective clustering, the algorithm evaluates the distance between each point and the centroid of its cluster.


The Problem of Clustering
◼ Given a set of points, with a notion of distance
between points, group the points into some
number of clusters, so that
▪ Members of a cluster are close/similar to each other
▪ Members of different clusters are dissimilar
◼ Usually:
▪ Points are in a high-dimensional space
▪ Similarity is defined using a distance measure
▪ Euclidean, Cosine, Jaccard, edit distance, …

Clustering Techniques

• Partitioning methods: k-Means algorithm [1957, 1967]; k-Medoids algorithms – PAM [1990], CLARA [1990], CLARANS [1994]; k-Modes [1998]; Fuzzy c-means algorithm [1999]
• Hierarchical methods: Divisive – DIANA [1990]; Agglomerative – AGNES [1990], BIRCH [1996], CURE [1998], ROCK [1999], Chameleon [1999]
• Density-based methods: STING [1997], DBSCAN [1996], CLIQUE [1998], DENCLUE [1998], OPTICS [1999], Wave Cluster [1998]
• Graph-based methods: MST Clustering [1999], OPOSSUM [2000], SNN Similarity Clustering [2001, 2003]
• Model-based clustering: EM Algorithm [1977], AutoClass [1996], COBWEB [1987], ANN Clustering [1982, 1989]


Example: Clusters & Outliers

A 2-D scatter of points (each marked x) that falls naturally into a few dense groups (clusters), with an isolated point far from every group marked as an outlier.
Why is it hard?
◼ Clustering in two dimensions looks easy
◼ Clustering small amounts of data looks easy
◼ Many applications involve not 2, but 10 or
10,000 dimensions
◼ High-dimensional spaces look different:
Almost all pairs of points are at about the
same distance

Clustering Problem: Books
◼ Intuitively: Books divides into categories, and
customers prefer a few categories
▪ But what are categories really?

◼ Represent a Book by a set of customers who


bought it:

◼ Similar Books have similar sets of customers,


and vice-versa

Clustering Problem: Books
Space of all Books:
◼ Think of a space with one dim. for each
customer
▪ Values in a dimension may be 0 or 1 only
▪ A book is a point in this space (x1, x2,…, xk),
where xi = 1 iff the i th customer bought the book

◼ For example, for Amazon Kindle the number of dimensions (one per customer) is in the tens of millions.
◼ Task: Find clusters of similar Books
Clustering Problem: Documents
Finding topics:
◼ Represent a document by a vector
(x1, x2,…, xk), where xi = 1 iff the i th word
(in some order) appears in the document
▪ It actually doesn’t matter if k is infinite; i.e., we
don’t limit the set of words

◼ Documents with similar sets of words


may be about the same topic

Applications

◼ Collaborative Filtering
◼ Customer Segmentation
◼ Data Summarization
◼ Location Based Analysis
◼ Multimedia Data Analysis
◼ Biological Data Analysis
◼ Social Network Analysis
Overview: Methods of Clustering
◼ Hierarchical:
▪ Agglomerative (bottom up):
▪ Initially, each point is a cluster
▪ Repeatedly combine the two
“nearest” clusters into one
▪ Divisive (top down):
▪ Start with one cluster and recursively split it

◼ Point assignment:
▪ Maintain a set of clusters
▪ Points belong to “nearest” cluster
Hierarchical Clustering
◼ Key operation:
Repeatedly combine
two nearest clusters

◼ Three important questions:


▪ 1) How do you represent a cluster of more
than one point?
▪ 2) How do you determine the “nearness” of
clusters?
▪ 3) When to stop combining clusters?

Hierarchical Clustering
◼ Key operation: Repeatedly combine two
nearest clusters
◼ (1) How to represent a cluster of many points?
▪ Key problem: As you merge clusters, how do you
represent the “location” of each cluster, to tell which
pair of clusters is closest?
◼ Euclidean case: each cluster has a
centroid = average of its (data)points
◼ (2) How to determine “nearness” of clusters?
▪ Measure cluster distances by distances of centroids

Example: Hierarchical clustering

Data points (o): (0,0), (1,1), (1,2), (2,1), (4,1), (5,0), (5,3).
Intermediate centroids (x): (1.5,1.5), (4.5,0.5), (4.7,1.3).
The accompanying dendrogram records the order in which the clusters are merged.
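A naive centroid-based agglomerative sketch over these example points (pure Python, with the O(N^3) behaviour discussed a few slides later); the stopping rule "merge down to k clusters" is an assumption for illustration:

    def agglomerative(points, k):
        """Merge the two clusters with the closest centroids until k clusters remain."""
        clusters = [[p] for p in points]                      # start: every point is its own cluster
        centroid = lambda c: tuple(sum(x) / len(c) for x in zip(*c))
        dist2 = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
        while len(clusters) > k:
            # find the pair of clusters whose centroids are nearest
            i, j = min(((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
                       key=lambda ij: dist2(centroid(clusters[ij[0]]), centroid(clusters[ij[1]])))
            clusters[i] += clusters.pop(j)                    # merge cluster j into cluster i
        return clusters

    points = [(0, 0), (1, 1), (1, 2), (2, 1), (4, 1), (5, 0), (5, 3)]
    print(agglomerative(points, k=2))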
And in the Non-Euclidean Case?
What about the Non-Euclidean case?
◼ The only “locations” we can talk about are the
points themselves
▪ i.e., there is no “average” of two points

◼ Approach 1:
▪ (1) How to represent a cluster of many points?
clustroid = (data)point “closest” to other points
▪ (2) How do you determine the “nearness” of
clusters? Treat clustroid as if it were centroid, when
computing inter-cluster distances
"Closest" Point?
◼ (1) How to represent a cluster of many points?
clustroid = point "closest" to other points
◼ Possible meanings of "closest":
▪ Smallest maximum distance to other points
▪ Smallest average distance to other points
▪ Smallest sum of squares of distances to other points
▪ For distance metric d, the clustroid c of cluster C is:
    c = argmin_{c in C} Σ_{x in C} d(x, c)^2

Datapoint, centroid, clustroid:
The centroid is the average of all (data)points in the cluster; this means the centroid is an "artificial" point.
The clustroid is an existing (data)point that is "closest" to all other points in the cluster.
Clustroid – Example
◼ Using edit distance.
◼ Cluster points: abcd, aecdb, abecb, ecdab.
◼ Their distances:

◼ Applying the three clustroid criteria to each of


the four points:
Defining “Nearness” of Clusters
◼ (2) How do you determine the
“nearness” of clusters?
▪ Approach 2:
Intercluster distance = minimum of the
distances between any two points, one from
each cluster
▪ Approach 3:
Pick a notion of “cohesion” of clusters, e.g.,
maximum distance from the clustroid
▪ Merge clusters whose union is most cohesive
Cohesion
◼ Approach 3.1: Use the diameter of the
merged cluster = maximum distance
between points in the cluster
◼ Approach 3.2: Use the average distance
between points in the cluster
◼ Approach 3.3: Use a density-based
approach
▪ Take the diameter or avg. distance, e.g., and
divide by the number of points in the cluster
Implementation
◼ Naïve implementation of hierarchical clustering:
▪ At each step, compute pairwise distances between all pairs of clusters, then merge.
▪ O(N^3)
◼ A careful implementation using a priority queue can reduce the time to O(N^2 log N).
▪ Still too expensive for really big datasets that do not fit in memory.
Mahalanobis Distance

The Mahalanobis distance of a point x = (x1, …, xd) from a cluster with centroid c = (c1, …, cd) is

    d(x, c) = sqrt( Σ_i ((x_i – c_i) / σ_i)^2 )

where σi is the standard deviation of the points in the cluster in the ith dimension.
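A tiny sketch of this normalized distance, with the centroid and per-dimension standard deviations passed in explicitly:

    from math import sqrt

    def mahalanobis(point, centroid, stdevs):
        """Axis-aligned Mahalanobis distance of a point from a cluster."""
        return sqrt(sum(((x - c) / s) ** 2 for x, c, s in zip(point, centroid, stdevs)))

    # e.g. a point two standard deviations away in each of three dimensions:
    # mahalanobis((2, 4, 6), (0, 0, 0), (1, 2, 3)) == sqrt(12)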
The CURE Algorithm
An extension of k-means to clusters of arbitrary shapes.

The CURE Algorithm
◼ Problem with k-means:
▪ Assumes clusters are normally distributed in each dimension.
▪ And the axes are fixed – ellipses at an angle are not OK.
◼ CURE (Clustering Using REpresentatives):
▪ Assumes a Euclidean distance.
▪ Allows clusters to assume any shape.
▪ Uses a collection of representative points to represent clusters.
Example: Stanford Salaries

A scatter plot of salary against age, with points labeled e and h; the natural groupings are elongated and not axis-aligned ellipses.
Overview
◼ CURE uses random sampling and partitioning to reliably find
clusters of arbitrary shape and size.
◼ Clusters a random sample of the database in an agglomerative
fashion, dynamically updating a constant number c of
well-scattered points R1, . . . , Rc per cluster to represent each
cluster’s shape.
◼ To assign the remaining, unsampled points to a cluster, these
points Ri are used in a similar manner to centroids in the k-means
algorithm – each data point that was not in the sample is assigned
to the cluster which contains the point Ri closest to the data point.
◼ To handle large sample sizes, CURE divides the random sample
into partitions which are pre-clustered independently, then the
partially-clustered sample is clustered further by the
agglomerative algorithm

Starting CURE
2 Pass algorithm. Pass 1:
◼ 0) Pick a random sample of points that fit in
main memory
◼ 1) Initial clusters:
▪ Cluster these points hierarchically – group
nearest points/clusters
◼ 2) Pick representative points:
▪ For each cluster, pick a sample of points, as
dispersed as possible
▪ From the sample, pick representatives by moving
them (say) 20% toward the centroid of the cluster
Example: Initial Clusters

The salary–age scatter of e and h points after hierarchically clustering the sample: nearby points have been grouped into initial clusters.
Example: Pick Dispersed Points

For each initial cluster, pick (say) 4 remote, well-dispersed points as representatives.
Example: Pick Dispersed Points (continued)

Move the chosen representative points (say) 20% toward the centroid of their cluster.
CURE algorithm
Step by step
◼ For each cluster, c well-scattered points within the cluster are chosen and then shrunk toward the mean of the cluster by a fraction α.
◼ The distance between two clusters is then the distance between the closest pair of representative points from each cluster.
◼ The c representative points attempt to capture the physical shape and geometry of the cluster. Shrinking the scattered points toward the mean gets rid of surface abnormalities and decreases the effects of outliers.
Finishing CURE
Pass 2:
◼ Now, rescan the whole dataset and visit each point p in the data set.
◼ Place it in the "closest cluster".
▪ Normal definition of "closest": find the closest representative to p and assign p to that representative's cluster.
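A sketch of the two CURE-specific steps described above: choosing c well-scattered representatives per cluster, shrinking them a fraction alpha toward the centroid, and (Pass 2) assigning any remaining point to the cluster of its nearest representative. Clusters are assumed to be lists of coordinate tuples; the function names are illustrative:

    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    def representatives(cluster, c=4, alpha=0.2):
        """Pick c well-scattered points of the cluster and shrink them toward the centroid."""
        centroid = tuple(sum(x) / len(cluster) for x in zip(*cluster))
        reps = [max(cluster, key=lambda p: dist2(p, centroid))]     # start with the farthest point
        while len(reps) < min(c, len(cluster)):
            # next representative: the point farthest from all representatives chosen so far
            reps.append(max(cluster, key=lambda p: min(dist2(p, r) for r in reps)))
        # shrink each representative a fraction alpha toward the centroid
        return [tuple(r_i + alpha * (c_i - r_i) for r_i, c_i in zip(r, centroid)) for r in reps]

    def assign(point, reps_per_cluster):
        """Pass 2: assign a point to the cluster whose closest representative is nearest."""
        return min(reps_per_cluster,
                   key=lambda cid: min(dist2(point, r) for r in reps_per_cluster[cid]))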
CURE algorithm
Experimental results
Shrink Factor α:
◼ 0.2 – 0.7 is a good range of values for α

CURE algorithm -Experimental results
Number of representative points c:
◼ For smaller values of c, the quality of clustering
suffered
◼ For values of c greater than 10, CURE always found
right clusters

The Canopies Algorithm

Efficient Clustering of High-Dimensional Data

Motivation
◼ In traditional clustering, n^2 evaluations must be performed.
◼ Evaluations can be expensive:
▪ Very large number of data points
▪ Many features to compare
▪ Costly metrics (e.g., string edit distance)
◼ Non-matches far outnumber matches.
◼ Can we quickly eliminate obvious non-matches to focus effort?
Canopy Clustering
◼ Canopy clustering is an unsupervised clustering
algorithm, primarily as a pre-clustering algorithm
◼ its output is given as input to classical clustering
algorithms such as k-means.
◼ Pre-clustering helps in speeding up the clustering
of actual clustering algorithm, which is very
useful for very large datasets.
▪ Cheaply partitioning the data into overlapping subsets
(called "canopies")
▪ Perform more expensive clustering, but only within
these canopies
Key idea
◼ We can greatly reduce the number of distance
computations required for clustering by first cheaply
partitioning the data into overlapping subsets, and
then only measuring distances among pairs of data
points that belong to a common subset.
◼ The canopies technique thus uses two different
sources of information to cluster items
▪ a cheap and approximate similarity measure (e.g., for
household address, the proportion of words in
common between two address)
▪ a more expensive and accurate similarity measure (e.g., detailed field-by-field string edit distance measured with tuned transformation costs and dynamic programming).
Algorithm – stage 1
◼ Use the cheap distance measure in order to create some
number of overlapping subsets, called “canopies."
◼ A canopy is simply a subset of the elements that, according to
the approximate similarity measure, are within some distance
threshold from a central point.
◼ An element may appear under more than one canopy, and
every element must appear in at least one canopy.
◼ Canopies have the property that points not appearing in any
common canopy are far enough apart that they could not
possibly be in the same cluster.
◼ Since the distance measure used to create canopies is approximate, there may not be a guarantee of this property; but by allowing canopies to overlap with each other and choosing a large enough distance threshold, the risk can be reduced.
Stage 2
◼ Now execute some traditional clustering algorithm, using the
accurate distance measure, but with the restriction that we do not
calculate the distance between two points that never appear in
the same canopy, i.e. we assume their distance to be infinite.
◼ For example, if all items are trivially placed into a single canopy,
then the second round is just normal clustering.
◼ If, however, the canopies are not too large and do not overlap too
much, then a large number of expensive distance measurements
will be avoided, and the amount of computation required for
clustering will be greatly reduced.
◼ Furthermore, if the constraints on the clustering imposed by the
canopies still include the traditional clustering solution among the
possibilities, then the canopies procedure may not lose any
clustering accuracy, while still increasing computational efficiency
significantly.
Algorithm

Algorithm
◼ Data points Preparation: The input data needs to be
converted into a format suitable for distance and
similarity measures
◼ Picking Canopy Centers – Random
◼ Assign data points to canopy centers: The canopy
assignment step would simply assign data points to
generated canopy centers.
◼ Pick K-Mean Cluster Centers & Iterate until convergence:
The computation to calculate the closest k-mean center is
greatly reduced as we only calculate the distance
between a k-center and data point if they share a canopy.
◼ Assign Points to K-Mean Centers
Algorithm Overview
◼ Let us assume that we have a list of data points named X.
◼ Decide two threshold values, T1 and T2, where T1 > T2.
◼ Randomly pick one data point from X; it becomes a canopy centroid. Let it be A.
◼ Calculate the distance d from every other point to point A:
▪ If d < T1, add the point to the canopy.
▪ If d < T2, remove the point from X.
◼ Repeat the previous two steps with new canopy centers until X is empty. (A sketch of this loop follows.)
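A minimal sketch of this loop, using the T1 > T2 convention of the slide above and a caller-supplied cheap_distance function (the thresholds and the distance function are assumptions to be tuned per dataset):

    import random

    def canopy_clustering(points, t1, t2, cheap_distance, seed=0):
        """Partition points into overlapping canopies; t1 (loose) > t2 (tight)."""
        assert t1 > t2
        remaining = list(points)
        rng = random.Random(seed)
        canopies = []
        while remaining:
            center = remaining.pop(rng.randrange(len(remaining)))   # random canopy center
            canopy = [center]
            still_remaining = []
            for p in remaining:
                d = cheap_distance(center, p)
                if d < t1:
                    canopy.append(p)           # loose threshold: joins this canopy
                if d >= t2:
                    still_remaining.append(p)  # only points outside the tight threshold stay in X
            remaining = still_remaining
            canopies.append(canopy)
        return canopies

    # usage with Euclidean distance as the "cheap" measure:
    # canopy_clustering([(0, 0), (0, 1), (5, 5), (5, 6)], t1=3.0, t2=1.5,
    #                   cheap_distance=lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5)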
Canopies
◼ A fast comparison groups the data into
overlapping “canopies”
◼ The expensive comparison for full clustering is
only performed for pairs in the same canopy
◼ No loss in accuracy if:

“For every traditional cluster,


there exists a canopy such that
all elements of the cluster are
in the canopy”
Creating Canopies
◼ Define two thresholds
▪ Tight: T1
▪ Loose: T2
◼ Put all records into a set S
◼ While S is not empty
▪ Remove any record r from S and create a canopy
centered at r
▪ For each other record ri, compute cheap distance d
from r to ri
▪ If d < T2, place ri in r’s canopy
▪ If d < T1, remove ri from S
Creating Canopies
◼ Points can be in more than one canopy
◼ Points within the tight threshold will not start
a new canopy
◼ Final number of canopies depends on
threshold values and distance metric
◼ Experimental validation suggests that T1 and
T2 should be equal
Canopies and GAC
◼ Greedy Agglomerative Clustering
▪ Make fully connected graph with a node for each
data point
▪ Edge weights are computed distances
▪ Run Kruskal’s MST algorithm, stopping when you
have a forest of k trees
▪ Each tree is a cluster
◼ With Canopies
▪ Only create edges between points in the same
canopy
▪ Run as before
T2 – Threshold
T1 – overlapping clusters
Overlap
Summary
◼ Clustering: Given a set of points, with a notion
of distance between points, group the points
into some number of clusters
◼ Algorithms:
▪ CURE
▪ Canopy
◼ MapReduce for clustering
