Big Data Analytics AAM Unit 4
(20CS733)
Unit IV
Frequent Itemsets And Clustering :
Mining Frequent Itemsets – Market Basket
Model – Apriori Algorithm – Handling Large
Data Sets In Main Memory – Limited Pass
Algorithm – Counting Frequent Itemsets In A
Stream – Clustering Techniques –
Hierarchical – K- Means – Clustering High
Dimensional Data – CLIQUE And PROCLUS –
Frequent Pattern Based Clustering Methods
– Clustering In Non-Euclidean Space –
Clustering For Streams And Parallelism.
Association Rule Discovery
Supermarket shelf management – Market-
basket model:
Goal: Identify items that are bought together
For example: finding communities in graphs (e.g., Twitter)
◦ Baskets = nodes; Items = outgoing neighbors
◦ View each node i as a basket Bi of the nodes i points to
◦ How? Searching a big graph for complete bipartite subgraphs Ks,t then amounts to finding sets of t nodes that occur together in the baskets Bi of s nodes
B1 = {m, c, b}    B2 = {m, p, j}
B3 = {m, b}       B4 = {c, j}
B5 = {m, p, b}    B6 = {m, c, b, j}
B7 = {c, b, j}    B8 = {b, c}
Frequent pairs here: {m,b}, {b,c}, {c,j}
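As a quick check of the pairs listed above, here is a brute-force pair count over these baskets (our own snippet; the support threshold of 3 baskets is an assumption for this illustration):

from collections import Counter
from itertools import combinations

# Brute-force pair counting over the B1..B8 baskets above (threshold of 3 assumed).
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]
pair_counts = Counter()
for b in baskets:
    for pair in combinations(sorted(b), 2):
        pair_counts[pair] += 1

frequent_pairs = [p for p, c in pair_counts.items() if c >= 3]
print(frequent_pairs)   # [('b', 'c'), ('b', 'm'), ('c', 'j')]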
Association Rules
If-then rules about the contents of baskets
{i1, i2, …, ik} → j means: “if a basket contains all of i1, …, ik then it is likely to contain j”
In practice there are many rules, want
to find significant/interesting ones!
Confidence of this association rule is the fraction of baskets containing I that also contain j:
conf(I → j) = support(I ∪ {j}) / support(I)
Interesting Association Rules
Not all high-confidence rules are
interesting
◦ The rule X → milk may have high confidence for
many itemsets X, because milk is just purchased
very often (independent of X) and the confidence
will be high
Interest of an association rule I → j:
difference between its confidence and the
fraction of baskets that contain j
Interest(I → j) = conf(I → j) − Pr[j]
◦ Interesting rules are those with high positive or negative interest values (usually above 0.5)
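A small sketch of these two formulas in code (the function names are ours, for illustration; baskets are represented as Python sets):

# Sketch: confidence and interest of a rule I -> j over a list of baskets (sets).
def support(itemset, baskets):
    """Number of baskets containing every item of `itemset`."""
    return sum(1 for b in baskets if itemset <= b)

def confidence(I, j, baskets):
    """conf(I -> j) = support(I u {j}) / support(I)."""
    return support(I | {j}, baskets) / support(I, baskets)

def interest(I, j, baskets):
    """interest(I -> j) = conf(I -> j) - Pr[j]."""
    pr_j = support({j}, baskets) / len(baskets)
    return confidence(I, j, baskets) - pr_j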
Example: Confidence and Interest
B1 = {m, c, b} B2 = {m, p, j}
B3 = {m, b} B4= {c, j}
B5 = {m, p, b} B6 = {m, c, b, j}
B7 = {c, b, j} B8 = {b, c}
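For illustration (the rule is our own choice), consider {m, b} → c over these baskets: {m, b} appears in B1, B3, B5 and B6, so support({m, b}) = 4; of those baskets, c appears in B1 and B6, so conf({m, b} → c) = 2/4 = 0.5. Since c appears in 5 of the 8 baskets, Pr[c] = 5/8 = 0.625, and Interest({m, b} → c) = 0.5 − 0.625 = −0.125, a rule of low interest.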
A-Priori: main-memory picture
◦ Pass 1: keep a count for each item (item counts)
◦ Pass 2: keep the list of frequent items plus counts only for pairs of frequent items (candidate pairs)
◦ Pair counts take 4 bytes per pair in a triangular matrix, versus 12 bytes per occurring pair when stored as triples
Detail for A-Priori
You can use the triangular matrix method with n = number of frequent items
◦ May save space compared with storing triples
Trick: re-number the frequent items 1, 2, … and keep a table relating the new numbers to the original item numbers
[Main-memory picture: Pass 1 holds the item counts; Pass 2 holds the old item numbers, the frequent-item numbers, and the counts of pairs of frequent items.]
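A minimal sketch (ours) of the re-numbering trick and triangular-matrix indexing, assuming items are renumbered 1..n and the count for pair {i, j} with i < j lives at a fixed position of a flat array; the item letters and the renumbering table below are our own illustration:

# Triangular-matrix method: one counter per pair of frequent items.
def pair_position(i, j, n):
    """1-based position of pair {i, j}, 1 <= i < j <= n; equals (i-1)(n - i/2) + j - i."""
    if i > j:
        i, j = j, i
    return (i - 1) * (2 * n - i) // 2 + (j - i)

n = 4                                  # number of frequent items
counts = [0] * (n * (n - 1) // 2)      # n(n-1)/2 pair counters (4 bytes each in principle)

renumber = {"b": 1, "c": 2, "j": 3, "m": 4}    # original item -> new number 1..n
old_item = {v: k for k, v in renumber.items()} # table relating new numbers back to items

basket = {"m", "c", "b"}
new_ids = sorted(renumber[x] for x in basket if x in renumber)
for a in range(len(new_ids)):
    for b in range(a + 1, len(new_ids)):
        counts[pair_position(new_ids[a], new_ids[b], n) - 1] += 1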
Frequent Triples, Etc.
For each k, we construct two sets of
k-tuples (sets of size k):
◦ Ck = candidate k-tuples = those that might be
frequent sets (support > s) based on information
from the pass for k–1
◦ Lk = the set of truly frequent k-tuples
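A compact sketch (ours) of this Ck → Lk construction on the B1…B8 baskets from earlier, assuming a support threshold of s = 3 baskets and a deliberately simplified candidate-generation step (all (k+1)-subsets of the frequent items, pruned by the A-Priori condition):

from itertools import combinations

baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]
s = 3

def support(itemset):
    return sum(1 for b in baskets if itemset <= b)

# L1: the truly frequent single items
L = [frozenset([i]) for i in set().union(*baskets) if support(frozenset([i])) >= s]
k = 1
while L:
    print(k, sorted(sorted(x) for x in L))
    items = sorted(set().union(*L))
    # C_{k+1}: candidates whose every k-subset is already in L_k (A-Priori pruning)
    C = [frozenset(c) for c in combinations(items, k + 1)
         if all(frozenset(sub) in L for sub in combinations(c, k))]
    L = [c for c in C if support(c) >= s]   # L_{k+1}: count candidates, keep the frequent ones
    k += 1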
•Introduction
•Applications
•Association rules
•Support
•Confidence
•Example
Introduction:
•The name derives from the idea of customers throwing all their purchases into a shopping cart, or market basket, during grocery shopping.
•A method of data analysis for marketing and retailing.
•Determines which products are purchased together.
•The strength of this method lies in using computer tools for the mining and analysis.
Strength of Market Basket Analysis: the beer and diapers story
•A large supermarket chain in the US analysed the buying habits of its customers and found a statistically significant correlation between purchases of beer and purchases of diapers on weekends.
Association rule:
•Expresses which products are purchased together, or which entities go together.
•Support: the percentage of transactions that contain both A and B.
•Confidence: the percentage of transactions containing A which also contain B.
•Example:
Transaction ID   Products
1                Shoes, trouser, shirt, belt
2                Shoes, trouser, shirt, hat, belt, scarf
3                Shoes, shirt
4                Shoes, trouser, belt
Consider the rule Trouser => Shirt; we will check whether this rule is an interesting one or not.
Transaction   shoes   trouser   shirt   belt   hat   scarf
T1            1       1         1       1      0     0
T2            1       1         1       1      1     1
T3            1       0         1       0      0     0
T4            1       1         0       1      0     0
Minimum support: 40 %
Minimum confidence: 65%
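Working it out from the table (using the standard definitions of support and confidence): trousers and shirts appear together in T1 and T2, so support(Trouser => Shirt) = 2/4 = 50%, which meets the 40% minimum support. Trousers appear in T1, T2 and T4, so confidence = 2/3 ≈ 67%, which meets the 65% minimum confidence. The rule is therefore accepted as an interesting one.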
ID Items
1 HotDogs, Buns, Ketchup
2 HotDogs, Buns
3 HotDogs, Coke, Chips
4 Chips, Coke
5 Chips, Ketchup
6 HotDogs, Coke, Chips
Clustering
High Dimensional Data
Given a cloud of data points, we want to understand its structure
The Problem of Clustering
Given a set of points, with a notion of
distance between points, group the points
into some number of clusters, so that
◦ Members of a cluster are close/similar to each other
◦ Members of different clusters are dissimilar
Usually:
◦ Points are in a high-dimensional space
◦ Similarity is defined using a distance measure
Euclidean, Cosine, Jaccard, edit distance, …
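For reference, minimal sketches (ours; the function names are our own) of three of the distance measures named above:

import math

def euclidean(a, b):
    """Euclidean distance between two points given as coordinate sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norm

def jaccard_distance(s, t):
    """1 - Jaccard similarity of two sets."""
    s, t = set(s), set(t)
    return 1 - len(s & t) / len(s | t)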
Example: Clusters & Outliers
[Figure: a scatter of points (x) forming several clusters, with one isolated point marked as an outlier.]
Clustering is a hard problem!
Why is it hard?
◦ Clustering in two dimensions looks easy
◦ Clustering small amounts of data looks easy
◦ And in most cases, looks are not deceiving
But many applications involve high-dimensional data. Example: clustering CDs, in a space with one dimension per customer
◦ Values in a dimension may be 0 or 1 only
◦ A CD is a point in this space (x1, x2, …, xk), where xi = 1 iff the i-th customer bought the CD
Clustering methods fall into two broad families: hierarchical clustering and point assignment.
Point assignment:
◦ Maintain a set of clusters
◦ Points belong to the “nearest” cluster
Hierarchical Clustering
Key operation: repeatedly combine the two nearest clusters
(1) How do you represent a cluster of many points?
◦ Key problem: As you merge clusters, how do you represent the “location” of each cluster, to tell which pair of clusters is closest?
◦ Euclidean case: each cluster has a centroid = average of its (data) points
(2) How to determine “nearness” of clusters?
◦ Measure cluster distances by distances of centroids
Example: Hierarchical clustering
[Figure: data points o at (0,0), (1,2), (2,1), (4,1), (5,0), (5,3); successive centroids x at (1,1), (1.5,1.5), (4.5,0.5), (4.7,1.3); and the corresponding dendrogram.]
And in the Non-Euclidean Case?
What about the Non-Euclidean case?
The only “locations” we can talk about are the points themselves, i.e., there is no “average” of two points
Approach 1:
◦ (1) How to represent a cluster of many
points?
clustroid = (data)point “closest” to other points
◦ (2) How do you determine the “nearness” of
clusters? Treat clustroid as if it were centroid,
when computing inter-cluster distances
“Closest” Point?
(1) How to represent a cluster of many
points?
clustroid = point “closest” to other points
Possible meanings of “closest”: smallest maximum distance to the other points, smallest average distance to the other points, or smallest sum of squares of distances to the other points
[Figure: a cluster showing its centroid (an artificial point) and its clustroid (an actual data point).]
◦ Another, density-based approach to cluster cohesion: take the diameter or avg. distance, e.g., and divide by the number of points in the cluster
Implementation
Naïve implementation of hierarchical
clustering:
◦ At each step, compute pairwise distances
between all pairs of clusters, then merge
◦ O(N³)
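A minimal sketch (ours) of this naïve loop in the Euclidean case, with each cluster represented by its centroid as described above:

import math

# Naive agglomerative clustering: at each step compute the distance between every
# pair of cluster centroids and merge the closest pair -> O(N^3) overall.
def centroid(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def naive_hierarchical(points, stop_at):
    clusters = [[p] for p in points]              # start: each point is its own cluster
    while len(clusters) > stop_at:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]                # merge the two nearest clusters
        del clusters[j]
    return clusters

# The six data points from the figure above, merged down to two clusters:
print(naive_hierarchical([(0, 0), (1, 2), (2, 1), (4, 1), (5, 0), (5, 3)], 2))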
Example: Assigning Clusters
[Figures: the same data points (x) with their centroids marked, showing the clusters after round 1, after round 2, and at the end of the point-assignment process.]
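The rounds shown above follow the point-assignment approach; below is a minimal sketch (our own, assuming k-means-style assign-and-update rounds with Euclidean distance and random initial centroids):

import math
import random

# Point-assignment clustering sketch: in each round, assign every point to its
# nearest centroid, then recompute each centroid as the mean of its points.
def point_assignment(points, k, rounds=10, seed=0):
    random.seed(seed)
    centroids = random.sample(points, k)          # initial centroids: k of the points
    clusters = []
    for _ in range(rounds):
        clusters = [[] for _ in range(k)]
        for p in points:                          # assignment step
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        for i, cl in enumerate(clusters):         # update step
            if cl:
                centroids[i] = tuple(sum(x) / len(cl) for x in zip(*cl))
    return centroids, clusters

pts = [(0, 0), (1, 2), (2, 1), (4, 1), (5, 0), (5, 3)]
centroids, clusters = point_assignment(pts, k=2)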
Getting the k right
How to select k?
Try different k, looking at the change in the average distance to centroid as k increases; the average falls rapidly until the right k, then changes little.
[Figure: average distance to centroid plotted against k, with the best value of k at the knee of the curve.]
Example: Picking k
[Figures: the same point cloud clustered with three choices of k.
Too few clusters: many long distances to centroid.
Just right: distances rather short.
Too many clusters: little improvement in average distance.]
Clustering: a one-dimensional worked example
Data: 2 5 6 8 12 15 18 28 30
4. Find the mean of each cluster (here the clusters are {2, 5, 6}, {8, 12, 15, 18} and {28, 30}): 4.3, 13.25, 29. These will serve as the new cluster heads/centres.
Placing the new centres among the data, we get:
2 4.3 5 6 8 12 13.25 15 18 28 29 30
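Continuing the iteration in the usual assign-then-update fashion (our assumption; the notes stop at this step), each point would next be assigned to the nearest of the new centres, and the centres recomputed. A small sketch:

# Sketch (ours): one reassignment round using the new centres from step 4.
data = [2, 5, 6, 8, 12, 15, 18, 28, 30]
centres = [4.3, 13.25, 29]

clusters = {c: [] for c in centres}
for x in data:
    nearest = min(centres, key=lambda c: abs(x - c))   # assign to nearest centre
    clusters[nearest].append(x)

print(clusters)
# {4.3: [2, 5, 6, 8], 13.25: [12, 15, 18], 29: [28, 30]}
new_centres = [sum(v) / len(v) for v in clusters.values()]
print(new_centres)   # [5.25, 15.0, 29.0]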