
Big Data Analytics

(20CS733)
Unit IV
Frequent Itemsets and Clustering:
Mining Frequent Itemsets – Market-Basket Model – Apriori Algorithm – Handling Large Data Sets in Main Memory – Limited-Pass Algorithms – Counting Frequent Itemsets in a Stream – Clustering Techniques – Hierarchical – K-Means – Clustering High-Dimensional Data – CLIQUE and PROCLUS – Frequent-Pattern-Based Clustering Methods – Clustering in Non-Euclidean Space – Clustering for Streams and Parallelism.
Association Rule Discovery
Supermarket shelf management – the market-basket model:
 Goal: Identify items that are bought together by sufficiently many customers
 Approach: Process the sales data collected with barcode scanners to find dependencies among items
 A classic rule:
◦ If someone buys diaper and milk, then he/she is likely to buy beer
◦ Don't be surprised if you find six-packs next to diapers!
The Market-Basket Model
Input:
 A large set of items
◦ e.g., things sold in a supermarket
 A large set of baskets
 Each basket is a small subset of items
◦ e.g., the things one customer buys on one day

TID   Items
1     Bread, Coke, Milk
2     Beer, Bread
3     Beer, Coke, Diaper, Milk
4     Beer, Bread, Diaper, Milk
5     Coke, Diaper, Milk

Output:
 Want to discover association rules
◦ People who bought {x,y,z} tend to buy {v,w}
 Amazon!

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
Applications – (1)
 Items = products; Baskets = sets of products someone bought in one trip to the store
 Real market baskets: Chain stores keep TBs of data about what customers buy together
◦ Tells how typical customers navigate stores, lets them position tempting items
◦ Suggests tie-in "tricks", e.g., run sale on diapers and raise the price of beer
◦ Need the rule to occur frequently, or no money
 Amazon's "people who bought X also bought Y"
Applications – (2)
 Baskets = sentences; Items = documents containing those sentences
◦ Items that appear together too often could represent plagiarism
◦ Notice items do not have to be "in" baskets
 Baskets = patients; Items = drugs & side-effects
◦ Has been used to detect combinations of drugs that result in particular side-effects
◦ But requires extension: absence of an item needs to be observed as well as presence
More generally
 A general many-to-many mapping (association) between two kinds of things
◦ But we ask about connections among "items", not "baskets"
 For example:
◦ Finding communities in graphs (e.g., Twitter)
Example:
 Finding communities in graphs (e.g., Twitter)
 Baskets = nodes; Items = outgoing neighbors
◦ View each node i as a basket Bi of the nodes i points to
 How? Searching for complete bipartite subgraphs Ks,t of a big graph (a dense 2-layer graph with s nodes in one layer and t nodes in the other)
◦ Ks,t = a set Y of size t that occurs in s buckets Bi
◦ Looking for Ks,t: set support s and look at layer t, i.e., all frequent sets of size t
Outline
First: Define
 Frequent itemsets
 Association rules: confidence, support, interestingness
Then: Algorithms for finding frequent itemsets
 Finding frequent pairs
 A-Priori algorithm
 PCY algorithm + 2 refinements
Frequent Itemsets
 Simplest question: Find sets of items that appear together "frequently" in baskets
 Support for itemset I: Number of baskets containing all items in I
◦ (Often expressed as a fraction of the total number of baskets)
 Given a support threshold s, sets of items that appear in at least s baskets are called frequent itemsets

TID   Items
1     Bread, Coke, Milk
2     Beer, Bread
3     Beer, Coke, Diaper, Milk
4     Beer, Bread, Diaper, Milk
5     Coke, Diaper, Milk

Support of {Beer, Bread} = 2
Example: Frequent Itemsets
 Items = {milk, coke, pepsi, beer, juice}
 Support threshold = 3 baskets
B1 = {m, c, b}    B2 = {m, p, j}
B3 = {m, b}       B4 = {c, j}
B5 = {m, p, b}    B6 = {m, c, b, j}
B7 = {c, b, j}    B8 = {b, c}
 Frequent itemsets: {m}, {c}, {b}, {j}, {m,b}, {b,c}, {c,j}
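A minimal brute-force sketch in Python (variable names are illustrative) that counts support for these eight baskets and reports the itemsets of size 1 and 2 meeting the threshold of 3:

from itertools import combinations
from collections import Counter

baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]
support_threshold = 3

# Count the support of every itemset of size 1 and 2 by brute force
counts = Counter()
for basket in baskets:
    for size in (1, 2):
        for itemset in combinations(sorted(basket), size):
            counts[itemset] += 1

frequent = {itemset: c for itemset, c in counts.items() if c >= support_threshold}
print(frequent)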
Association Rules
 Association rules: if-then rules about the contents of baskets
 {i1, i2, …, ik} → j means: "if a basket contains all of i1, …, ik then it is likely to contain j"
 In practice there are many rules; we want to find significant/interesting ones!
 Confidence of this association rule is the probability of j given I = {i1, …, ik}:

conf(I → j) = support(I ∪ {j}) / support(I)
Interesting Association Rules
 Not all high-confidence rules are interesting
◦ The rule X → milk may have high confidence for many itemsets X, because milk is just purchased very often (independent of X) and the confidence will be high
 Interest of an association rule I → j: difference between its confidence and the fraction of baskets that contain j

Interest(I → j) = conf(I → j) − Pr[j]

◦ Interesting rules are those with high positive or negative interest values (usually above 0.5 in absolute value)
Example: Confidence and Interest
B1 = {m, c, b}    B2 = {m, p, j}
B3 = {m, b}       B4 = {c, j}
B5 = {m, p, b}    B6 = {m, c, b, j}
B7 = {c, b, j}    B8 = {b, c}
 Association rule: {m, b} → c
◦ Support of {m, b} = 4; support of {m, b, c} = 2
◦ Confidence = 2/4 = 0.5
◦ Interest = |0.5 – 5/8| = 1/8
 Item c appears in 5/8 of the baskets
 Rule is not very interesting!
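A short sketch, on the same baskets, of how the confidence and interest of {m, b} → c are computed (the helper name support is illustrative):

baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]

def support(itemset):
    # Number of baskets containing every item of the itemset
    return sum(1 for b in baskets if itemset <= b)

conf = support({"m", "b", "c"}) / support({"m", "b"})    # 2/4 = 0.5
interest = conf - support({"c"}) / len(baskets)          # 0.5 - 5/8 = -0.125
print(conf, abs(interest))                               # 0.5 and 1/8, as above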
Finding Association Rules
 Problem: Find all association rules with support ≥ s and confidence ≥ c
◦ Note: Support of an association rule is the support of the set of items on the left side
 Hard part: Finding the frequent itemsets!
◦ If {i1, i2, …, ik} → j has high support and confidence, then both {i1, i2, …, ik} and {i1, i2, …, ik, j} will be "frequent"

conf(I → j) = support(I ∪ {j}) / support(I)
Mining Association Rules
 Step 1: Find all frequent itemsets I
 Step 2: Rule generation (a sketch follows below)
◦ For every subset A of I, generate a rule A → I \ A
 Since I is frequent, A is also frequent
 Variant 1: Single pass to compute the rule confidence
 confidence(A,B → C,D) = support(A,B,C,D) / support(A,B)
 Variant 2:
 Observation: If A,B,C → D is below confidence, so is A,B → C,D
 Can generate "bigger" rules from smaller ones!
◦ Output the rules above the confidence threshold
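A minimal sketch of this rule-generation step, assuming the frequent itemsets and their support counts have already been computed (the dictionary support_counts and the function name are illustrative, not part of the slides):

from itertools import combinations

def generate_rules(support_counts, min_conf):
    """Yield rules (A, I - A, confidence). support_counts maps
    frozenset(itemset) -> count and must contain every frequent
    itemset together with all of its subsets."""
    for itemset, count in support_counts.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for lhs in combinations(itemset, r):
                lhs = frozenset(lhs)
                conf = count / support_counts[lhs]
                if conf >= min_conf:
                    yield lhs, itemset - lhs, conf

Each rule A → I \ A is emitted only if its confidence clears the threshold, matching Variant 1 above (confidence computed directly from stored support counts).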
Example
B1 = {m, c, b}       B2 = {m, p, j}
B3 = {m, c, b, n}    B4 = {c, j}
B5 = {m, p, b}       B6 = {m, c, b, j}
B7 = {c, b, j}       B8 = {b, c}
 Support threshold s = 3, confidence c = 0.75
 1) Frequent itemsets:
◦ {b,m} {b,c} {c,m} {c,j} {m,c,b}
 2) Generate rules:
◦ b→m: c=4/6    b→c: c=5/6    b,c→m: c=3/5
◦ m→b: c=4/5    …             b,m→c: c=3/4
◦ b→c,m: c=3/6
Compacting the Output
 To reduce the number of rules we can
post-process them and only output:
◦ Maximal frequent itemsets:
No immediate superset is frequent
 Gives more pruning
or
◦ Closed itemsets:
No immediate superset has the same count (> 0)
 Stores not only frequent information, but exact
counts
Example: Maximal/Closed

Itemset   Support   Maximal (s=3)   Closed
A         4         No              No
B         5         No              Yes
C         3         No              No
AB        4         Yes             Yes
AC        2         No              No
BC        3         Yes             Yes
ABC       2         No              Yes

◦ C: frequent, but its superset BC is also frequent, so C is not maximal; BC has the same count, so C is not closed.
◦ BC: frequent, and its only superset, ABC, is not frequent, so BC is maximal; ABC has a smaller count, so BC is also closed.
Finding Frequent Itemsets

Itemsets: Computation Model
 Back to finding frequent itemsets
 Typically, data is kept in flat files rather than in a database system:
◦ Stored on disk
◦ Stored basket-by-basket
◦ Baskets are small, but we have many baskets and many items
 Expand baskets into pairs, triples, etc. as you read baskets
◦ Use k nested loops to generate all sets of size k
 File layout: items are positive integers, and boundaries between baskets are –1
Note: We want to find frequent itemsets. To find them, we have to count them. To count them, we have to generate them.
Computation Model
 The true cost of mining disk-resident data is
usually the number of disk I/Os
 In practice, association-rule algorithms read
the data in passes – all baskets read in
turn
 We measure the cost by the number of
passes an algorithm makes over the data
Main-Memory Bottleneck
 For many frequent-itemset algorithms,
main-memory is the critical resource
◦ As we read baskets, we need to count
something, e.g., occurrences of pairs of items
◦ The number of different things we can count
is limited by main memory
◦ Swapping counts in/out is a disaster (why?)
Finding Frequent Pairs
 The hardest problem often turns out to
be finding the frequent pairs of items {i1,
i2}
◦ Why? Freq. pairs are common, freq. triples are rare
 Why? Probability of being frequent drops exponentially
with size; number of sets grows more slowly with size
 Let’s first concentrate on pairs, then
extend to larger sets
 The approach:
◦ We always need to generate all the itemsets
◦ But we would only like to count (keep track of) those itemsets that in the end turn out to be frequent
Naïve Algorithm
 Naïve approach to finding frequent pairs
 Read file once, counting in main memory the occurrences of each pair:
◦ From each basket of n items, generate its n(n-1)/2 pairs by two nested loops
 Fails if (#items)² exceeds main memory
◦ Remember: #items can be 100K (Wal-Mart) or 10B (Web pages)
 Suppose 10^5 items and 4-byte integer counts
 Number of pairs of items: 10^5(10^5 – 1)/2 ≈ 5*10^9
 Therefore, 2*10^10 bytes (20 gigabytes) of memory needed
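A minimal sketch of this naive pass, assuming baskets are read one per line from a hypothetical whitespace-separated file; all pair counts live in one main-memory dictionary, which is exactly what blows up when (#items)² is large:

from itertools import combinations
from collections import Counter

pair_counts = Counter()
with open("baskets.txt") as f:                 # hypothetical input file, one basket per line
    for line in f:
        basket = sorted(set(line.split()))
        for pair in combinations(basket, 2):   # n(n-1)/2 pairs per basket
            pair_counts[pair] += 1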
Counting Pairs in Memory
Two approaches:
 Approach 1: Count all pairs using a matrix
 Approach 2: Keep a table of triples [i, j, c] = "the count of the pair of items {i, j} is c"
◦ If integers and item ids are 4 bytes, we need approximately 12 bytes for pairs with count > 0
◦ Plus some additional overhead for the hashtable
Note:
 Approach 1 only requires 4 bytes per pair
 Approach 2 uses 12 bytes per pair (but only for pairs with count > 0)
Comparing the two approaches: the triangular matrix uses 4 bytes per possible pair; triples use 12 bytes per occurring pair.
Comparing the two approaches
 Approach 1: Triangular Matrix
◦ n = total number of items
◦ Count pair of items {i, j} only if i < j
◦ Keep pair counts in lexicographic order:
 {1,2}, {1,3}, …, {1,n}, {2,3}, {2,4}, …, {2,n}, {3,4}, …
◦ Pair {i, j} is at position (i – 1)(n – i/2) + j – i
◦ Total number of pairs n(n – 1)/2; total bytes ≈ 2n²
◦ Triangular Matrix requires 4 bytes per pair
 Approach 2 uses 12 bytes per occurring pair (but only for pairs with count > 0)
◦ Beats Approach 1 if less than 1/3 of possible pairs actually occur
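A small sketch of the triangular-matrix indexing (1-based item ids mapped to a 0-based array; the function name is illustrative):

def pair_index(i, j, n):
    # 0-based position of pair {i, j}, 1 <= i < j <= n, in lexicographic order.
    # (i-1)(n - i/2) + j - i is the 1-based position; subtract 1 for a 0-based array.
    return (i - 1) * (2 * n - i) // 2 + (j - i) - 1

n = 5
counts = [0] * (n * (n - 1) // 2)    # one counter per possible pair (4 bytes each in the slide's accounting)
counts[pair_index(2, 4, n)] += 1     # record one occurrence of the pair {2, 4}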
Comparing the two approaches
 The problem with the triangular matrix is that if we have too many items, the pair counts do not fit into memory.
 Can we do better?
A-Priori Algorithm
A-Priori Algorithm – (1)
 A two-pass approach called A-Priori limits the need for main memory
 Key idea: monotonicity
◦ If a set of items I appears at least s times, so does every subset J of I
 Contrapositive for pairs: If item i does not appear in s baskets, then no pair including i can appear in s baskets
 So, how does A-Priori find frequent pairs?
A-Priori Algorithm – (2)
 Pass 1: Read baskets and count in main memory the occurrences of each individual item
◦ Requires only memory proportional to #items
◦ Items that appear at least s times are the frequent items
 Pass 2: Read baskets again and count in main memory only those pairs where both elements are frequent (from Pass 1)
◦ Requires memory proportional to the square of the number of frequent items only (for counts)
◦ Plus a list of the frequent items (so you know what must be counted)
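A minimal sketch of these two passes for frequent pairs; for readability the baskets are held in a Python list, whereas in practice each pass would re-read the file from disk:

from itertools import combinations
from collections import Counter

def apriori_pairs(baskets, s):
    # Pass 1: count individual items and keep only the frequent ones
    item_counts = Counter(item for basket in baskets for item in basket)
    frequent_items = {i for i, c in item_counts.items() if c >= s}

    # Pass 2: count only pairs whose elements are both frequent
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & frequent_items)
        for pair in combinations(kept, 2):
            pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= s}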
Main-Memory: Picture of A-Priori
(Figure: in Pass 1, main memory holds the item counts; in Pass 2, it holds the frequent items plus the counts of the candidate pairs of frequent items.)
Detail for A-Priori
 You can use the triangular matrix method with n = number of frequent items
◦ May save space compared with storing triples
 Trick: re-number the frequent items 1, 2, … and keep a table relating the new numbers to the original item numbers
(Figure: in Pass 1, main memory holds the item counts; in Pass 2, it holds the table mapping old item numbers to frequent-item numbers and the counts of pairs of frequent items.)
Frequent Triples, Etc.
 For each k, we construct two sets of k-tuples (sets of size k):
◦ Ck = candidate k-tuples = those that might be frequent sets (support > s) based on information from the pass for k–1
◦ Lk = the set of truly frequent k-tuples
 Pipeline: all items → C1 → count the items → filter → L1 → construct → C2 (all pairs of items from L1) → count the pairs → filter → L2 → construct → C3 → …
Example
** Note: here we generate new candidates by generating Ck from Lk-1 and L1. But one can be more careful with candidate generation. For example, in C3 we know {b,m,j} cannot be frequent since {m,j} is not frequent.

 Hypothetical steps of the A-Priori algorithm:
◦ C1 = { {b} {c} {j} {m} {n} {p} }
◦ Count the support of itemsets in C1
◦ Prune non-frequent: L1 = { b, c, j, m }
◦ Generate C2 = { {b,c} {b,j} {b,m} {c,j} {c,m} {j,m} }
◦ Count the support of itemsets in C2
◦ Prune non-frequent: L2 = { {b,m} {b,c} {c,m} {c,j} }
◦ Generate C3 = { {b,c,m} {b,c,j} {b,m,j} {c,m,j} }
◦ Count the support of itemsets in C3
◦ Prune non-frequent: L3 = { {b,c,m} }
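A compact sketch of the full loop (Ck built from Lk-1 and L1, as in the note above); it uses brute-force support counting and is written for clarity, not speed:

def apriori(baskets, s):
    baskets = [frozenset(b) for b in baskets]

    def support(itemset):
        return sum(1 for b in baskets if itemset <= b)

    items = {i for b in baskets for i in b}
    L1 = {frozenset([i]) for i in items if support(frozenset([i])) >= s}

    all_frequent, Lk = {}, L1
    while Lk:
        for itemset in Lk:
            all_frequent[itemset] = support(itemset)
        # Construct C_{k+1} from L_k and L_1, then prune the non-frequent candidates
        Ck1 = {prev | single for prev in Lk for single in L1 if not single <= prev}
        Lk = {c for c in Ck1 if support(c) >= s}
    return all_frequent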
A-Priori for All Frequent Itemsets
 One pass for each k (itemset size)
 Needs room in main memory to count each candidate k-tuple
 For typical market-basket data and reasonable support (e.g., 1%), k = 2 requires the most memory
 Many possible extensions:
◦ Association rules with intervals:
 For example: Men over 65 have 2 cars
◦ Association rules when items are in a taxonomy
 Bread, Butter → FruitJam
 BakedGoods, MilkProduct → PreservedGoods
◦ Lower the support s as itemset gets bigger
Market Basket Model

What is the Market-Basket Model?

With large amounts of data available, analysts are interested in extracting something useful from it.

•Introduction
•Applications
•Association rules
•Support
•Confidence
•Example
Introduction:
•Name derived from the idea of customers throwing all their purchases into a shopping cart or "market basket" during grocery shopping.
•A method of data analysis for marketing and retailing.
•Determines which products are purchased together.
•The strength of this method is the use of computer tools for mining and analysis.
Strength of Market Basket Analysis: the beer and diapers story
•A large supermarket chain in the US analysed the buying habits of its customers and found a statistically significant correlation between purchases of beer and purchases of diapers on weekends.
•The supermarket decided to place the beer next to the diapers, resulting in increased sales of both.
Applications
•Cross selling: a customer buys a burger, the vendor offers a Coke
•Product placement
•Catalog design / store layout: identify "hot" areas
•Loss leader analysis: a pricing strategy that offers a slow-moving product below MRP alongside a fast-moving product sold at a higher price

Association rule:
•How do we find the products, or entities, that are purchased together?
Answer: association rules


Rule form:
Antecedent => Consequent [support, confidence]
A => B [s, c]

•Support (s): denotes the percentage of transactions that contain (A ∪ B)

Support (A => B [s, c]) = P(A ∪ B)

•Confidence (c): denotes the percentage of transactions containing A which also contain B

Confidence (A => B [s, c]) = P(B | A) = P(A ∪ B) / P(A)

•An association rule is considered interesting if it satisfies both a minimum support threshold and a minimum confidence threshold.

•Example:

Transaction ID   Products
1                Shoes, trouser, shirt, belt
2                Shoes, trouser, shirt, hat, belt, scarf
3                Shoes, shirt
4                Shoes, trouser, belt

Consider the rule Trouser => Shirt; we will check whether this rule is an interesting one or not.

Transaction   Shoes   Trouser   Shirt   Belt   Hat   Scarf
T1            1       1         1       1      0     0
T2            1       1         1       1      1     1
T3            1       0         1       0      0     0
T4            1       1         0       1      0     0

Minimum support: 40%
Minimum confidence: 65%

Support (Trouser => Shirt):
P(A ∪ B) = 2/4 = 0.5 [trouser and shirt occurring together, against the total number of transactions]

Confidence (Trouser => Shirt):
P(A ∪ B) / P(A) = 2/3 ≈ 0.66

Since the support and confidence are greater than the minimum thresholds, the rule is an interesting one.

Check Trouser => Belt, Shoes => Shirt, etc. in the same way.


Market Basket Model: Show all the rules that are interesting for the vendor for the following transactions.

ID   Items
1    HotDogs, Buns, Ketchup
2    HotDogs, Buns
3    HotDogs, Coke, Chips
4    Chips, Coke
5    Chips, Ketchup
6    HotDogs, Coke, Chips
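A short sketch that enumerates the single-item rules A => B for these transactions; the 40% support and 65% confidence thresholds are carried over from the earlier example as an assumption:

from itertools import permutations

transactions = [
    {"HotDogs", "Buns", "Ketchup"}, {"HotDogs", "Buns"},
    {"HotDogs", "Coke", "Chips"}, {"Chips", "Coke"},
    {"Chips", "Ketchup"}, {"HotDogs", "Coke", "Chips"},
]
n = len(transactions)
min_support, min_conf = 0.40, 0.65

items = set().union(*transactions)
for a, b in permutations(items, 2):
    both = sum(1 for t in transactions if {a, b} <= t)
    only_a = sum(1 for t in transactions if a in t)
    support = both / n
    confidence = (both / only_a) if only_a else 0.0
    if support >= min_support and confidence >= min_conf:
        print(f"{a} => {b}  [support={support:.2f}, confidence={confidence:.2f}]")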
Clustering
High Dimensional Data
 Given a cloud of data points we want
to understand its structure
The Problem of Clustering
 Given a set of points, with a notion of
distance between points, group the points
into some number of clusters, so that
◦ Members of a cluster are close/similar to each other
◦ Members of different clusters are dissimilar
 Usually:
◦ Points are in a high-dimensional space
◦ Similarity is defined using a distance measure
 Euclidean, Cosine, Jaccard, edit distance, …
Example: Clusters & Outliers
(Figure: a 2-D scatter of points x grouped into dense clusters, with one isolated point labelled as an outlier.)
Clustering is a hard
problem!
Why is it hard?
 Clustering in two dimensions looks easy
 Clustering small amounts of data looks easy
 And in most cases, looks are not deceiving

 Many applications involve not 2, but 10 or


10,000 dimensions
 High-dimensional spaces look

different: Almost all pairs of points are at


about the same distance
Clustering Problem: Galaxies
 A catalog of 2 billion “sky objects”
represents objects by their radiation in
7 dimensions (frequency bands)
 Problem: Cluster into similar objects,

e.g., galaxies, nearby stars, quasars,


etc.
 Sloan Digital Sky Survey
Clustering Problem: Music CDs
 Intuitively: Music divides into
categories, and customers prefer a few
categories
◦ But what are categories really?

 Represent a CD by a set of customers who


bought it:

 Similar CDs have similar sets of customers,


and vice-versa
Clustering Problem: Music
CDs
Space of all CDs:
 Think of a space with one dim. for each

customer
◦ Values in a dimension may be 0 or 1 only
◦ A CD is a point in this space (x1, x2,…, xk),
where xi = 1 iff the i th customer bought the CD

 For Amazon, the number of dimensions is in the tens of millions
 Task: Find clusters of similar CDs
Clustering Problem:
Documents
Finding topics:
 Represent a document by a vector

(x1, x2,…, xk), where xi = 1 iff the i th word


(in some order) appears in the document
◦ It actually doesn’t matter if k is infinite; i.e., we
don’t limit the set of words

 Documents with similar sets of words


may be about the same topic
Cosine, Jaccard, and
Euclidean
As with CDs we have a choice when
we think of documents as sets of
words or shingles:
◦ Sets as vectors: Measure similarity by the
cosine distance
◦ Sets as sets: Measure similarity by the
Jaccard distance
◦ Sets as points: Measure similarity by
Euclidean distance
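A small sketch of the three measures on the set-style representations above; cosine distance is taken in its usual "1 minus cosine similarity" form, and no external libraries are assumed:

import math

def cosine_distance(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1 - dot / norm

def jaccard_distance(s, t):
    return 1 - len(s & t) / len(s | t)

def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

doc1, doc2 = [1, 1, 0, 1], [1, 0, 0, 1]     # word-presence vectors
print(cosine_distance(doc1, doc2), euclidean_distance(doc1, doc2))
print(jaccard_distance({"big", "data", "milk"}, {"big", "data"}))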
Overview: Methods of
Clustering
 Hierarchical:
◦ Agglomerative (bottom up):
 Initially, each point is a cluster
 Repeatedly combine the two
“nearest” clusters into one
◦ Divisive (top down):
 Start with one cluster and recursively split it

 Point assignment:
◦ Maintain a set of clusters
◦ Points belong to “nearest” cluster
Hierarchical Clustering
 Key operation:
Repeatedly combine
two nearest clusters

 Three important questions:


◦ 1) How do you represent a cluster of more
than one point?
◦ 2) How do you determine the “nearness” of
clusters?
◦ 3) When to stop combining clusters?
Hierarchical Clustering
 Key operation: Repeatedly combine two
nearest clusters
 (1) How to represent a cluster of many

points?
◦ Key problem: As you merge clusters, how do you
represent the “location” of each cluster, to tell which
pair of clusters is closest?
 Euclidean case: each cluster has a
centroid = average of its (data)points
 (2) How to determine “nearness” of

clusters?
◦ Measure cluster distances by distances of centroids
Example: Hierarchical clustering
(Figure: data points o at (0,0), (1,2), (2,1), (4,1), (5,0) and (5,3); centroids x at (1,1), (1.5,1.5), (4.5,0.5) and (4.7,1.3) mark successive merges, summarized in a dendrogram.)
And in the Non-Euclidean Case?
What about the Non-Euclidean case?
 The only “locations” we can talk about are

the points themselves


◦ i.e., there is no “average” of two points

 Approach 1:
◦ (1) How to represent a cluster of many
points?
clustroid = (data)point “closest” to other points
◦ (2) How do you determine the “nearness” of
clusters? Treat clustroid as if it were centroid,
when computing inter-cluster distances
"Closest" Point?
 (1) How to represent a cluster of many points?
clustroid = point "closest" to other points
 Possible meanings of "closest":
◦ Smallest maximum distance to other points
◦ Smallest average distance to other points
◦ Smallest sum of squares of distances to other points
 For distance metric d, the clustroid c of cluster C is the point minimizing Σ_{x ∈ C} d(x, c)²
(Figure: the centroid is the average of all (data)points in the cluster, i.e., an "artificial" point; the clustroid is an existing (data)point that is "closest" to all other points in the cluster.)
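A minimal sketch of picking a clustroid under the sum-of-squares criterion above; the Euclidean metric is used only as a stand-in, and any distance function could be passed in:

import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def clustroid(cluster, dist=euclidean):
    # The existing data point minimizing the sum of squared distances to all other points
    return min(cluster, key=lambda c: sum(dist(c, x) ** 2 for x in cluster))

print(clustroid([(0, 0), (1, 2), (2, 1)]))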
Defining “Nearness” of
Clusters
 (2) How do you determine the
“nearness” of clusters?
◦ Approach 2:
Intercluster distance = minimum of the
distances between any two points, one from each
cluster
◦ Approach 3:
Pick a notion of “cohesion” of clusters, e.g.,
maximum distance from the clustroid
 Merge clusters whose union is most cohesive
Cohesion
 Approach 3.1: Use the diameter of the
merged cluster = maximum distance
between points in the cluster
 Approach 3.2: Use the average distance

between points in the cluster


 Approach 3.3: Use a density-based

approach
◦ Take the diameter or avg. distance, e.g., and
divide by the number of points in the cluster
Implementation
 Naïve implementation of hierarchical clustering:
◦ At each step, compute pairwise distances between all pairs of clusters, then merge
◦ O(N³)
 Careful implementation using a priority queue can reduce time to O(N² log N)
◦ Still too expensive for really big datasets that do not fit in memory
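A minimal sketch of the naïve O(N³) version for Euclidean points, repeatedly merging the two clusters with the closest centroids; stopping at k clusters is just one illustrative choice for question (3) above:

import math
from itertools import combinations

def centroid(cluster):
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def hierarchical(points, k):
    clusters = [[p] for p in points]        # initially, each point is its own cluster
    while len(clusters) > k:
        # Find the pair of clusters with the nearest centroids, then merge them
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: math.dist(centroid(clusters[ij[0]]),
                                            centroid(clusters[ij[1]])))
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

print(hierarchical([(0, 0), (1, 2), (2, 1), (4, 1), (5, 0), (5, 3)], 2))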
k-means clustering
k–means Algorithm(s)
 Assumes Euclidean space/distance
 Start by picking k, the number of clusters
 Initialize clusters by picking one point per
cluster
◦ Example: Pick one point at random, then k-1
other points, each as far away as possible from
the previous points
Populating Clusters
 1) For each point, place it in the cluster
whose current centroid it is nearest
 2) After all points are assigned, update the
locations of centroids of the k clusters
 3) Reassign all points to their closest
centroid
◦ Sometimes moves points between clusters

 Repeat 2 and 3 until convergence


◦ Convergence: Points don’t move between clusters
and centroids stabilize
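A short sketch of this assign/update loop for Euclidean points; initializing with the first k points is an assumption made for brevity, not the far-apart initialization described above:

import math

def kmeans(points, k, max_iters=100):
    centroids = list(points[:k])                    # simple (assumed) initialization
    clusters = [[] for _ in range(k)]
    for _ in range(max_iters):
        # 1) Assign each point to the cluster of its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # 2) Recompute each centroid as the average of its assigned points
        new_centroids = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
                         for i, cl in enumerate(clusters)]
        if new_centroids == centroids:              # 3) converged: centroids stable
            break
        centroids = new_centroids
    return centroids, clusters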
Example: Assigning Clusters
(Figures: the same data points x shown with their centroids as the clusters after round 1, after round 2, and at the end.)
Getting the k right
How to select k?
 Try different k, looking at the change in the average distance to centroid as k increases
 The average falls rapidly until the right k, then changes little
(Figure: average distance to centroid plotted against k; the best value of k is where the curve flattens.)
Example: Picking k
(Figures: the same scatter of points clustered with different k. Too few clusters: many long distances to centroid. Just right: distances rather short. Too many: little improvement in average distance.)
Clustering

•Clustering is the grouping of a particular set of


objects based on their characteristics, aggregating
them according to their similarities.

•It is an unsupervised learning approach


K Means clustering

1. Objective: to make groups of data points

2  5  6  8  12  15  18  28  30

2. Define the value of K (the number of clusters), say K = 3.

3. Select the cluster centres; different criteria can be followed, such as random selection or selection of the three farthest-apart numbers, and assign each point to its nearest centre.

2  5  6  8  12  15  18  28  30

4. Find the mean of each cluster: 4.3, 13.25, 29 (for the clusters {2, 5, 6}, {8, 12, 15, 18}, {28, 30}). These will serve as the new cluster centres.

5. Re-form the clusters using the new cluster centres:

2  5  6  (4.3)   |   8  12  15  18  (13.25)   |   28  30  (29)

Repeat this process until the cluster centres converge (there is no change in the cluster centres).
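A tiny sketch running this 1-D example; the initial centres passed in are an arbitrary assumption, and the loop simply repeats the assign/recompute steps above until the centres stop changing:

def kmeans_1d(data, centres, max_iters=100):
    clusters = [[] for _ in centres]
    for _ in range(max_iters):
        # Assign each point to the cluster of its nearest centre
        clusters = [[] for _ in centres]
        for x in data:
            nearest = min(range(len(centres)), key=lambda i: abs(x - centres[i]))
            clusters[nearest].append(x)
        # Recompute each centre as the mean of its cluster
        new_centres = [sum(c) / len(c) if c else centres[i] for i, c in enumerate(clusters)]
        if new_centres == centres:       # converged: no change in the cluster centres
            break
        centres = new_centres
    return centres, clusters

data = [2, 5, 6, 8, 12, 15, 18, 28, 30]
print(kmeans_1d(data, centres=[2.0, 12.0, 30.0]))   # illustrative initial centres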
