Computation Model – (2)
[Figure: the itemset lattice – once an itemset is found to be infrequent, all of its supersets are known to be infrequent and are pruned.]
The Apriori algorithm
Level-wise approach:
  Ck = candidate itemsets of size k
  Lk = frequent itemsets of size k
1. k = 1, C1 = all items
2. While Ck is not empty:
3.   Frequent-itemset generation: scan the database to find which itemsets in Ck are frequent and put them into Lk
4.   Candidate generation: use Lk to generate a collection of candidate itemsets Ck+1 of size k+1
5.   k = k + 1
R. Agrawal, R. Srikant: "Fast Algorithms for Mining Association Rules",
Proc. of the 20th Int'l Conference on Very Large Databases, 1994.
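As a concrete illustration of this level-wise loop, here is a minimal Python sketch of Apriori over a list of transactions (a sketch only; the function and variable names are illustrative, not from the slides):

    from itertools import combinations
    from collections import defaultdict

    def apriori(transactions, min_support):
        """Return all frequent itemsets (frozensets) with support >= min_support."""
        transactions = [set(t) for t in transactions]
        # C1 = all items; L1 = frequent single items
        counts = defaultdict(int)
        for t in transactions:
            for item in t:
                counts[frozenset([item])] += 1
        Lk = {s for s, c in counts.items() if c >= min_support}
        frequent = {s: counts[s] for s in Lk}
        k = 1
        while Lk:
            # Candidate generation: (k+1)-sets whose every size-k subset is in Lk
            items = sorted({i for s in Lk for i in s})
            Ck1 = set()
            for combo in combinations(items, k + 1):
                cand = frozenset(combo)
                if all(frozenset(sub) in Lk for sub in combinations(cand, k)):
                    Ck1.add(cand)
            # Frequent-itemset generation: scan the database and count candidates
            counts = defaultdict(int)
            for t in transactions:
                for cand in Ck1:
                    if cand <= t:
                        counts[cand] += 1
            Lk = {s for s, c in counts.items() if c >= min_support}
            frequent.update({s: counts[s] for s in Lk})
            k += 1
        return frequent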
A simple hash structure
◼ e.g., L = {A,B,C,D}: [hash-structure figure omitted]

◼ Suppose 10^5 items.
◼ Suppose counts are 4-byte integers.
◼ Number of pairs of items: 10^5 * (10^5 - 1)/2 = 5*10^9 (approximately).
◼ Therefore, 2*10^10 bytes (20 gigabytes) of main memory needed.
Details of Main-Memory Counting
◼ Two approaches:
(1) Count all pairs, using a triangular matrix.
(2) Keep a table of triples [i, j, c] = “the count of
the pair of items {i, j } is c.”
◼ (1) requires only 4 bytes/pair.
▪ Note: always assume integers are 4 bytes.
◼ (2) requires 12 bytes, but only for those
pairs with count > 0.
Details of Main-Memory Counting
Triangular-matrix layout for n = 5 items:
Pair:     {1,2} {1,3} {1,4} {1,5} {2,3} {2,4} {2,5} {3,4} {3,5} {4,5}
Position:   1     2     3     4     5     6     7     8     9    10
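The positions in this table come from the usual triangular-matrix formula; a small sketch (1-based items i < j out of n items; the function name is an illustrative assumption):

    def triangular_index(i, j, n):
        """Position (1-based) of the count for pair {i, j}, 1 <= i < j <= n,
        when pair counts are stored in a one-dimensional triangular matrix."""
        assert 1 <= i < j <= n
        return (i - 1) * (2 * n - i) // 2 + (j - i)

    # Reproduces the table above for n = 5: prints [1, 2, ..., 10]
    print([triangular_index(i, j, 5) for i in range(1, 5) for j in range(i + 1, 6)])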
Frequent Triples Approach
[Figure: main-memory layout – Pass 1 holds the item counts; Pass 2 holds the frequent items and the counts of pairs of frequent items.]
Detail for A-Priori
[Figure: Pass 1 holds the item counts; Pass 2 holds a table mapping frequent items to new item numbers (the old item #'s) and the counts of pairs of frequent items.]
A-Priori for All Frequent Itemsets
◼ C1 = all items
◼ In general, Lk = members of Ck with support ≥ s.
◼ Ck+1 = (k+1)-sets, each of whose subsets of size k is in Lk.
◼ Finding the frequent pairs is usually the most expensive operation.
[Figure: the first pass finds the frequent items; the second pass finds the frequent pairs.]
PCY (Park, Chen & Yu) Algorithm
[Figure: Pass 1 memory – item counts plus a hash table of bucket counts.]

PCY Algorithm – Pass 1
[Figure: Pass 1 keeps the item counts and the hash table; on Pass 2 the hash table is replaced by a bitmap, next to the counts of candidate pairs.]
PCY Algorithm – Pass 2
On Pass 2, count only the pairs {i, j} in which both items are frequent and the pair hashes to a frequent bucket.

Example PCY – Pass 1
Transactions:
TID  Items
1    1, 3, 4
2    2, 3, 5
3    1, 2, 3, 5
4    2, 5

Item supports after Pass 1:
Itemset  Sup
{1}      2
{2}      3
{3}      3
{4}      1
{5}      3

Bucket counts after Pass 1:
Bucket   1  2  3  4  5
Count    3  2  4  1  3
PCY Algorithm in Big Data
PCY was developed by Park, Chen, and Yu. It is used for frequent itemset mining when the
dataset is very large.
The PCY algorithm (Park-Chen-Yu algorithm) is a data mining algorithm that is used to find
frequent itemsets in large datasets. It is an improvement over the Apriori algorithm and was
first described in 1995 in the paper "An Effective Hash-Based Algorithm for Mining
Association Rules" by Jong Soo Park, Ming-Syan Chen, and Philip S. Yu.
The PCY algorithm uses hashing to efficiently count itemset frequencies and reduce the
overall computational cost.
The basic idea is to use a hash function to map each pair of items to a bucket during the
first pass and to keep a hash table of per-bucket counts; a pair that hashes to an infrequent
bucket cannot itself be frequent and is pruned on the second pass.
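A minimal Python sketch of the two PCY passes as described above (a sketch, not the authors' implementation; the helper names, the use of Python's built-in hash, and the num_buckets parameter are assumptions):

    from itertools import combinations
    from collections import defaultdict

    def pcy_pass1(transactions, num_buckets, min_support):
        """Pass 1: count single items and hash every pair into a bucket."""
        item_counts = defaultdict(int)
        bucket_counts = [0] * num_buckets
        for basket in transactions:
            basket = sorted(set(basket))
            for item in basket:
                item_counts[item] += 1
            for pair in combinations(basket, 2):
                bucket_counts[hash(pair) % num_buckets] += 1
        frequent_items = {i for i, c in item_counts.items() if c >= min_support}
        # Only a bitmap of frequent buckets is carried over to Pass 2
        bitmap = [c >= min_support for c in bucket_counts]
        return frequent_items, bitmap

    def pcy_pass2(transactions, frequent_items, bitmap, num_buckets, min_support):
        """Pass 2: count only candidate pairs (both items frequent, frequent bucket)."""
        pair_counts = defaultdict(int)
        for basket in transactions:
            items = sorted(set(basket) & frequent_items)
            for pair in combinations(items, 2):
                if bitmap[hash(pair) % num_buckets]:
                    pair_counts[pair] += 1
        return {p: c for p, c in pair_counts.items() if c >= min_support}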
Apply the PCY algorithm on the following transactions to find the candidate (frequent) sets,
with a minimum threshold value of 3 and the hash function (i*j) mod 10.
T1 = {1, 2, 3}
T2 = {2, 3, 4}
T3 = {3, 4, 5}
T4 = {4, 5, 6}
T5 = {1, 3, 5}
T6 = {2, 4, 6}
T7 = {1, 3, 4}
T8 = {2, 4, 5}
T9 = {3, 4, 6}
T10 = {1, 2, 4}
T11 = {2, 3, 5}
T12 = {2, 4, 6}
Step 1: Find the frequency of each item and remove any length-1 candidate whose
frequency is below the threshold.
Step 2: Going transaction by transaction, create all the possible pairs and write the
frequency of each pair next to it. Note: pairs should not be repeated; skip any pair
that has already been written for an earlier transaction.
Step 3: List all pairs whose frequency is at least the threshold and then apply the
Hash Function (it gives us the bucket number). It defines in what bucket this
particular pair will be put.
Step 4: This is the last step, and in this step, we have to create a table with the
following details -
● Bit vector - 1 if the frequency of the candidate pair is greater than or equal to
the threshold, otherwise 0.
● Bucket number - found in the previous step.
● Highest support count - the frequency of this candidate pair, found in Step 2.
● Correct - the candidate pair itself is listed here.
● Candidate set - if the bit vector is 1, the pair is accepted as a candidate.
Question: Apply the PCY algorithm on the following transactions to find the candidate sets
(frequent sets).
Given data:
Threshold (minimum support) value = 3
Hash function = (i*j) mod 10
T1 = {1, 2, 3}
T2 = {2, 3, 4}
T3 = {3, 4, 5}
T4 = {4, 5, 6}
T5 = {1, 3, 5}
T6 = {2, 4, 6}
T7 = {1, 3, 4}
T8 = {2, 4, 5}
T9 = {3, 4, 6}
T10 = {1, 2, 4}
T11 = {2, 3, 5}
T12= {3, 4, 6}
Use buckets and the concepts of MapReduce to solve the above problem.
Solution
1. Identify the frequency (number of occurrences) of each candidate item in the
given dataset.
2. Reduce the length-1 candidate set by removing items whose frequency is below
the threshold.
3. Map pairs of candidates and find the frequency of each pair.
4. Apply the hash function to find the bucket number.
5. Draw the candidate set table.
Step 1: Map all the items in order to find their frequencies.
Items → {1, 2, 3, 4, 5, 6}
Key:   1  2  3  4  5  6
Value: 4  6  8  8  6  4
Step 2: Remove all items having a value (frequency) less than 3.
In this example there is no item with a value less than 3.
Hence, candidate set = {1, 2, 3, 4, 5, 6}.
Step 3: Map the candidate items into pairs and calculate the frequency of each pair.
T1: {(1, 2) (1, 3) (2, 3)} = (2, 3, 3)
T2: {(2, 4) (3, 4)} = (3, 4)
T3: {(3, 5) (4, 5)} = (5, 3)
T4: {(4, 5) (5, 6)} = (3, 2)
T5: {(1, 5)} = 1
T6: {(2, 6)} = 1
T7: {(1, 4)} = 2
T8: {(2, 5)} = 2
T9: {(3, 6)} = 2
T10: ______
T11: ______
T12: ______
Note: pairs should not be repeated; skip any pair that has already been written for an earlier transaction.
Listing all the pairs whose frequency is at least the threshold value: {(1,3), (2,3), (2,4), (3,4), (3,5),
(4,5), (4,6)}
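The steps above can be mechanized; a minimal Python sketch for this exercise (the transactions, the threshold 3, and the hash (i*j) mod 10 are taken from the question; everything else is an illustrative assumption):

    from itertools import combinations
    from collections import Counter

    transactions = [
        {1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {4, 5, 6}, {1, 3, 5}, {2, 4, 6},
        {1, 3, 4}, {2, 4, 5}, {3, 4, 6}, {1, 2, 4}, {2, 3, 5}, {3, 4, 6},
    ]
    threshold = 3

    # Step 1: item frequencies
    item_counts = Counter(i for t in transactions for i in t)
    # Step 2: keep only items meeting the threshold
    frequent_items = {i for i, c in item_counts.items() if c >= threshold}

    # Step 3: frequency of each distinct pair of frequent items
    pair_counts = Counter()
    for t in transactions:
        for pair in combinations(sorted(t & frequent_items), 2):
            pair_counts[pair] += 1

    # Step 4: candidate table - bucket via (i*j) mod 10, bit vector from the threshold
    for (i, j), count in sorted(pair_counts.items()):
        bucket = (i * j) % 10
        bit = 1 if count >= threshold else 0
        print(f"pair ({i},{j})  bucket {bucket}  support {count}  bit {bit}")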
[Figure: memory layout across Pass 1 and Pass 2, showing a second hash table alongside the counts of candidate pairs.]
Limited Pass Algorithms
◼ A-Priori, PCY, etc., take k
passes to
find frequent itemsets of size k.
◼ Other techniques use 2 or fewer passes
for all sizes:
▪ Simple Randomized Sampling algorithm.
▪ SON (Savasere, Omiecinski, and
Navathe).
Randomized Sampling Algorithm – (1)

The first map-reduce step:
    map(key, value):
        # value is a chunk of the full dataset
        Count occurrences of itemsets in the chunk.
        for itemset in itemsets:
            if supp(itemset) >= p*s:
                emit(itemset, null)

    reduce(key, values):
        emit(key, null)

The second map-reduce step:
    map(key, value):
        # value is the candidate itemsets and a chunk of the full dataset
        Count occurrences of the candidate itemsets in the chunk.
        for itemset in itemsets:
            emit(itemset, supp(itemset))

    reduce(key, values):
        result = 0
        for value in values:
            result += value
        if result >= s:
            emit(key, result)
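A minimal single-machine simulation of these two map-reduce steps (the function names and the pluggable find_frequent helper are illustrative assumptions; any per-chunk miner, such as the apriori() sketch earlier, can be passed in):

    from collections import defaultdict

    def sampling_two_step(chunks, s, p, find_frequent):
        """chunks: list of lists of baskets; p: fraction of the data per chunk;
        s: global support threshold; find_frequent(chunk, support) -> dict
        mapping frequent itemsets (frozensets) to their counts in the chunk."""
        # Step 1: each "map task" mines its chunk at the lowered threshold p*s;
        # the reduce step just deduplicates the candidate itemsets.
        candidates = set()
        for chunk in chunks:
            candidates |= set(find_frequent(chunk, p * s))

        # Step 2: count every candidate over every chunk, then keep those
        # whose total support is at least s.
        totals = defaultdict(int)
        for chunk in chunks:
            for basket in chunk:
                basket = set(basket)
                for itemset in candidates:
                    if itemset <= basket:
                        totals[itemset] += 1
        return {itemset: c for itemset, c in totals.items() if c >= s}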
First Phase Map Reduce
◼ First Map Function:
▪ Take the assigned subset of the baskets and find
the itemsets frequent in the subset using the
simple Randomized Algorithm.
▪ Lower the support threshold from s to ps if each
Map task gets fraction p of the total input file.
▪ The output is a set of key-value pairs (F, 1),
where F is a frequent itemset from the sample.
▪ The value is always 1 and is irrelevant.
Clustering Approaches

Overview of the Chapter
◼ What Is Clustering?
◼ Challenges of Big Data Clustering
◼ CURE Algorithm
◼ Canopy Clustering
◼ Clustering with MapReduce
■ Clustering is an important unsupervised learning technique.
■ It deals with finding a structure in a collection of
unlabeled data.
■ Clustering is “the process of organizing
objects into groups whose members are
similar in some way”.
■ A cluster is therefore a collection of objects which are “similar” to one another and
“dissimilar” to the objects belonging to other clusters.
■ Clustering Algorithms:
■ A Clustering Algorithm tries to analyse natural
groups of data on the basis of some similarity.
■ It locates the centroid of the group of data
points.
■ To carry out effective clustering, the algorithm evaluates the distance between each point
and the centroid of the cluster.
Clustering Techniques (taxonomy):
◼ Partitioning methods: k-Means algorithm [1957, 1967], k-Medoids algorithms (PAM [1990], CLARA [1990], CLARANS [1994]), k-Modes [1998], Fuzzy c-means algorithm [1999]
◼ Hierarchical methods:
  ▪ Divisive: DIANA [1990]
  ▪ Agglomerative: AGNES [1990], BIRCH [1996], CURE [1998], ROCK [1999], Chameleon [1999]
◼ Density-based methods: DBSCAN [1996], STING [1997], CLIQUE [1998], DENCLUE [1998], OPTICS [1999], Wave Cluster [1998]
◼ Model-based clustering: EM Algorithm [1977], COBWEB [1987], Auto class [1996], ANN Clustering [1982, 1989]
[Figure: a two-dimensional scatter of points grouped into clusters, with one isolated point labelled as an outlier.]
Why is it hard?
◼ Clustering in two dimensions looks easy
◼ Clustering small amounts of data looks easy
◼ Many applications involve not 2, but 10 or
10,000 dimensions
◼ High-dimensional spaces look different:
Almost all pairs of points are at about the
same distance
Clustering Problem: Books
◼ Intuitively: Books divide into categories, and
customers prefer a few categories
▪ But what are categories really?
Clustering Problem: Books
Space of all Books:
◼ Think of a space with one dim. for each
customer
▪ Values in a dimension may be 0 or 1 only
▪ A book is a point in this space (x1, x2, …, xk),
where xi = 1 iff the i-th customer bought the book
Applications
◼ Collaborative Filtering
◼ Customer Segmentation
◼ Data Summarization
◼ Location Based Analysis
◼ Multimedia Data Analysis
◼ Biological Data Analysis
◼ Social Network Analysis
Overview: Methods of Clustering
◼ Hierarchical:
▪ Agglomerative (bottom up):
▪ Initially, each point is a cluster
▪ Repeatedly combine the two
“nearest” clusters into one
▪ Divisive (top down):
▪ Start with one cluster and recursively split it
◼ Point assignment:
▪ Maintain a set of clusters
▪ Points belong to “nearest” cluster
Hierarchical Clustering
◼ Key operation:
Repeatedly combine
two nearest clusters
Hierarchical Clustering
◼ Key operation: Repeatedly combine two
nearest clusters
◼ (1) How to represent a cluster of many points?
▪ Key problem: As you merge clusters, how do you
represent the “location” of each cluster, to tell which
pair of clusters is closest?
◼ Euclidean case: each cluster has a
centroid = average of its (data)points
◼ (2) How to determine “nearness” of clusters?
▪ Measure cluster distances by distances of centroids
Example: Hierarchical clustering
Data points (o): (0,0), (1,1), (1,2), (2,1), (4,1), (5,0), (5,3)
Centroids formed as clusters merge (x): (1.5,1.5) for {(1,2),(2,1)}; (4.5,0.5) for {(4,1),(5,0)}; (4.7,1.3) for {(4,1),(5,0),(5,3)}
Legend: o = data point, x = centroid
[Figure: the corresponding dendrogram]
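A minimal sketch of this Euclidean merging loop (the "stop at k clusters" rule and all names are illustrative assumptions):

    import math

    def centroid(points):
        """Average of a list of equal-length coordinate tuples."""
        return tuple(sum(coords) / len(points) for coords in zip(*points))

    def hierarchical(points, k):
        """Repeatedly merge the two clusters with nearest centroids until k remain."""
        clusters = [[p] for p in points]          # each point starts as its own cluster
        while len(clusters) > k:
            best = None
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                    if best is None or d < best[0]:
                        best = (d, i, j)
            _, i, j = best
            clusters[i] = clusters[i] + clusters[j]   # merge cluster j into cluster i
            del clusters[j]
        return clusters

    # The data points from the example above
    pts = [(0, 0), (1, 1), (1, 2), (2, 1), (4, 1), (5, 0), (5, 3)]
    print(hierarchical(pts, 2))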
And in the Non-Euclidean Case?
What about the Non-Euclidean case?
◼ The only “locations” we can talk about are the
points themselves
▪ i.e., there is no “average” of two points
◼ Approach 1:
▪ (1) How to represent a cluster of many points?
clustroid = (data)point “closest” to other points
▪ (2) How do you determine the “nearness” of
clusters? Treat clustroid as if it were centroid, when
computing inter-cluster distances
“Closest” Point?
◼ (1) How to represent a cluster of many points?
clustroid = point “closest” to other points
◼ Possible meanings of “closest”:
▪ Smallest maximum distance to other points
▪ Smallest average distance to other points
▪ Smallest sum of squares of distances to other points
▪ For distance metric d, the clustroid c of cluster C is the datapoint that minimizes Σx∈C d(x, c)^2.
[Figure: a cluster of datapoints, its centroid (an artificial point) and its clustroid (an actual datapoint).]
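A one-function sketch of this sum-of-squares definition (d is any distance function; the name is illustrative):

    def clustroid(cluster, d):
        """The datapoint of the cluster that minimizes the sum of squared
        distances to the other points of the cluster."""
        return min(cluster, key=lambda c: sum(d(x, c) ** 2 for x in cluster))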
The CURE Algorithm
Extension of k-means to clusters
of arbitrary shapes
The CURE Algorithm
◼ Problem with k-means:
  ▪ Assumes clusters are normally distributed in each dimension
  ▪ And axes are fixed – ellipses at an angle are not OK
[Figure: salary vs. age scatter with two groups of points (marked “e” and “h”) whose shapes are not axis-aligned ellipses.]
Overview
◼ CURE uses random sampling and partitioning to reliably find
clusters of arbitrary shape and size.
◼ Clusters a random sample of the database in an agglomerative
fashion, dynamically updating a constant number c of
well-scattered points R1, . . . , Rc per cluster to represent each
cluster’s shape.
◼ To assign the remaining, unsampled points to a cluster, these
points Ri are used in a similar manner to centroids in the k-means
algorithm – each data point that was not in the sample is assigned
to the cluster which contains the point Ri closest to the data point.
◼ To handle large sample sizes, CURE divides the random sample
into partitions which are pre-clustered independently, then the
partially-clustered sample is clustered further by the
agglomerative algorithm.
Starting CURE
2 Pass algorithm. Pass 1:
◼ 0) Pick a random sample of points that fit in
main memory
◼ 1) Initial clusters:
▪ Cluster these points hierarchically – group
nearest points/clusters
◼ 2) Pick representative points:
▪ For each cluster, pick a sample of points, as
dispersed as possible
▪ From the sample, pick representatives by moving
them (say) 20% toward the centroid of the cluster
Example: Initial Clusters
[Figure: salary vs. age scatter with the “e” and “h” points grouped hierarchically into initial clusters.]
Example: Pick Dispersed Points
[Figure: same salary vs. age scatter; for each cluster, pick (say) 4 remote points.]
Example: Pick Dispersed Points
[Figure: same scatter; move the picked points (say) 20% toward the centroid of their cluster.]
CURE algorithm – Step by step
◼ For each cluster, c well-scattered points within the cluster are chosen and then shrunk toward the mean of the cluster by a fraction α.
◼ The distance between two clusters is then the distance between the closest pair of representative points from each cluster.
◼ The c representative points attempt to capture the physical shape and geometry of the cluster. Shrinking the scattered points toward the mean gets rid of surface abnormalities and decreases the effects of outliers.
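A minimal sketch of picking and shrinking the c representative points of one cluster, and of the resulting inter-cluster distance (the farthest-point heuristic and all names are illustrative assumptions, not the paper's exact procedure):

    import math

    def cure_representatives(cluster, c=4, alpha=0.2):
        """Pick c well-scattered points of a cluster and shrink them toward its mean."""
        mean = tuple(sum(v) / len(cluster) for v in zip(*cluster))
        # Greedy farthest-point heuristic: start with the point farthest from the
        # mean, then repeatedly add the point farthest from the chosen ones.
        reps = [max(cluster, key=lambda p: math.dist(p, mean))]
        while len(reps) < min(c, len(cluster)):
            reps.append(max(cluster,
                            key=lambda p: min(math.dist(p, r) for r in reps)))
        # Shrink each representative a fraction alpha toward the cluster mean.
        return [tuple(x + alpha * (m - x) for x, m in zip(p, mean)) for p in reps]

    def cluster_distance(reps_a, reps_b):
        """CURE inter-cluster distance: closest pair of representative points."""
        return min(math.dist(a, b) for a in reps_a for b in reps_b)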
Finishing CURE
Pass 2:
◼ Now, rescan the whole dataset and visit each point p in the data set.
◼ Place p in the “closest” cluster.
  ▪ Normal definition of “closest”: find the representative point closest to p and assign p to that representative’s cluster.
CURE algorithm – Experimental results
Shrink factor α:
◼ 0.2 – 0.7 is a good range of values for α.
Number of representative points c:
◼ For smaller values of c, the quality of clustering suffered.
◼ For values of c greater than 10, CURE always found the right clusters.
The Canopies Algorithm
Algorithm – stage 1
◼ Use the cheap distance measure in order to create some
number of overlapping subsets, called “canopies."
◼ A canopy is simply a subset of the elements that, according to
the approximate similarity measure, are within some distance
threshold from a central point.
◼ An element may appear under more than one canopy, and
every element must appear in at least one canopy.
◼ Canopies have the property that points not appearing in any
common canopy are far enough apart that they could not
possibly be in the same cluster.
◼ Since the distance measure used to create canopies is approximate, this property may
not be guaranteed; however, by allowing canopies to overlap and by choosing a large
enough distance threshold, the risk of violating it can be reduced.
Stage 2
◼ Now execute some traditional clustering algorithm, using the
accurate distance measure, but with the restriction that we do not
calculate the distance between two points that never appear in
the same canopy, i.e. we assume their distance to be infinite.
◼ For example, if all items are trivially placed into a single canopy,
then the second round is just normal clustering.
◼ If, however, the canopies are not too large and do not overlap too
much, then a large number of expensive distance measurements
will be avoided, and the amount of computation required for
clustering will be greatly reduced.
◼ Furthermore, if the constraints on the clustering imposed by the
canopies still include the traditional clustering solution among the
possibilities, then the canopies procedure may not lose any
clustering accuracy, while still increasing computational efficiency
significantly.
Algorithm
◼ Data points Preparation: The input data needs to be
converted into a format suitable for distance and
similarity measures
◼ Picking Canopy Centers – Random
◼ Assign data points to canopy centers: The canopy
assignment step would simply assign data points to
generated canopy centers.
◼ Pick K-Means Cluster Centers & Iterate until convergence: the computation of the
closest k-means center is greatly reduced, since we only calculate the distance
between a k-means center and a data point if they share a canopy (see the sketch
after this list).
◼ Assign Points to K-Mean Centers
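A minimal sketch of that restricted assignment step (the dictionary-of-canopy-id-sets representation and all names are illustrative assumptions):

    def assign_with_canopies(points, centers, point_canopies, center_canopies, dist):
        """Assign each point to its nearest k-means center among the centers that
        share at least one canopy with it. point_canopies and center_canopies map
        each (hashable) point/center to the set of canopy ids it belongs to."""
        assignment = {}
        for p in points:
            candidates = [c for c in centers
                          if point_canopies[p] & center_canopies[c]]
            if not candidates:          # pragmatic fallback: no shared canopy
                candidates = centers
            assignment[p] = min(candidates, key=lambda c: dist(p, c))
        return assignment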
Algorithm Overview
◼ Let us assume that we have a list of data points named X.
◼ 1. Decide two threshold values, T1 and T2, where T1 > T2.
◼ 2. Randomly pick one data point from X; it becomes a canopy centroid. Call it A.
◼ 3. Calculate the distance d from A to every other point:
  ▪ If d < T1, add the point to A's canopy.
  ▪ If d < T2, remove the point from X (it can no longer start a new canopy).
◼ 4. Repeat steps 2 and 3 until X is empty.
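A minimal Python sketch of this selection loop (cheap_dist stands for the cheap, approximate distance measure; all names are illustrative assumptions):

    import random

    def canopies(points, t1, t2, cheap_dist):
        """Group points into overlapping canopies; requires t1 > t2 > 0."""
        remaining = list(points)
        result = []
        while remaining:
            center = random.choice(remaining)     # step 2: pick a canopy center
            # step 3: every point within T1 of the center joins this canopy
            canopy = [p for p in points if cheap_dist(p, center) < t1]
            result.append((center, canopy))
            # points within T2 of the center (including the center itself)
            # can no longer start a new canopy
            remaining = [p for p in remaining if cheap_dist(p, center) >= t2]
        return result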
Canopies
◼ A fast comparison groups the data into
overlapping “canopies”
◼ The expensive comparison for full clustering is
only performed for pairs in the same canopy
◼ No loss in accuracy if points belonging to the same true cluster always share at
least one canopy (so the canopy constraints still allow that clustering).
[Figures: canopies of radius T1 overlapping; a point may lie in the overlap region of several canopies.]
Summary
◼ Clustering: Given a set of points, with a notion
of distance between points, group the points
into some number of clusters
◼ Algorithms:
▪ CURE
▪ Canopy
◼ MapReduce for Clustering