
INTRODUCTION TO CLUSTERING

Cluster analysis is a class of techniques used to classify cases into groups that are relatively homogeneous within themselves and heterogeneous between each other, on the basis of a defined set of variables. These groups are called clusters. Cluster analysis is also referred to as segmentation analysis, taxonomy analysis, or simply clustering.
• Cluster analysis is a group of multivariate techniques whose primary purpose is to group objects based on the characteristics they possess.
• Factor analysis is primarily concerned with grouping variables based on patterns of variation (correlation) in the data, whereas cluster analysis forms groupings on the basis of distance (proximity).
• Cluster analysis classifies objects (i.e., respondents, products, or other entities) on a set of user-selected characteristics, i.e., clustering variables. The resulting clusters should exhibit high internal (within-cluster) homogeneity and high external (between-cluster) heterogeneity. If the classification is successful, the objects within a cluster will be close together when plotted geometrically, and different clusters will be far apart.
Conceptual development with cluster analysis

• Data reduction – a researcher may be faced with a large number of observations that are meaningless unless classified into manageable groups. Cluster analysis can perform this data reduction procedure objectively by reducing the information from an entire population or sample to information about specific groups.

Steps to conduct a Cluster Analysis
1. Select a distance measure
2. Select a clustering algorithm
3. Determine the number of clusters
4. Validate the analysis

The primary objective of cluster analysis is to define the structure of the data by placing the most similar observations into groups. To do this, we must address three basic questions:
1. How do we measure similarity? Several methods are possible, including the correlation between objects, or a measure of their proximity in two-dimensional space such that the distance between observations indicates similarity.
2. How do we form clusters? No matter how similarity is measured, the procedure must group the observations that are most similar into a cluster.
3. How many groups do we form?
Measuring similarity

The first task is to develop some measure of similarity between each pair of objects, to be used in the clustering process.
Non-overlapping clusters
Clusters in which each observation belongs to only one cluster. Non-overlapping clustering is the most frequently used technique in practice.
Overlapping clusters
An observation may belong to more than one cluster.
Probabilistic clusters
An observation belongs to a cluster according to a probability distribution.
Hierarchical clustering
Hierarchical clustering creates subsets of the data in a tree-like structure in which the root node corresponds to the complete data set. Branches are created from the root node to split the data into heterogeneous subsets (clusters).
Euclidean Distance
Euclidean distance is one of the most frequently used distance measures when the data are on an interval or ratio scale.

The Euclidean distance between two n-dimensional observations $X_1 = (x_{11}, x_{12}, \ldots, x_{1n})$ and $X_2 = (x_{21}, x_{22}, \ldots, x_{2n})$ is given by

$$D(X_1, X_2) = \sqrt{(x_{11} - x_{21})^2 + (x_{12} - x_{22})^2 + \cdots + (x_{1n} - x_{2n})^2}$$
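As a quick illustration (a minimal NumPy sketch, not from the slides), the formula can be computed directly:

```python
import numpy as np

def euclidean_distance(x1, x2):
    """Euclidean distance between two n-dimensional observations."""
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    return np.sqrt(np.sum((x1 - x2) ** 2))

# Wines 1 and 2 from the table below: (alcohol, alkalinity of ash)
print(euclidean_distance([14.8, 28.0], [11.05, 12.0]))  # sqrt(3.75^2 + 16^2) ~ 16.43
```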


Example

The table below lists 20 wines sold in the market along with their alcohol and alkalinity of ash content.

Wine   Alcohol   Alkalinity of Ash      Wine   Alcohol   Alkalinity of Ash
 1     14.8      28                      11    10.7      12.2
 2     11.05     12                      12    14.3      27
 3     12.2      21                      13    12.4      19.5
 4     12        20                      14    14.85     29.2
 5     14.5      29.5                    15    10.9      13.6
 6     11.2      13                      16    13.9      29.7
 7     11.5      12                      17    10.4      12.2
 8     12.8      19                      18    10.8      13.6
 9     14.75     28.8                    19    14        28.8
10     10.5      14                      20    12.47     22.8

[Figure: clusters of wine based on alcohol and alkalinity of ash content.]
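The figure referenced above is a scatter of the 20 wines; as an illustrative sketch (SciPy assumed, not part of the slides), the pairwise Euclidean distances behind such a grouping can be computed directly:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# (alcohol, alkalinity of ash) for the 20 wines in the table above
wines = np.array([
    [14.8, 28.0], [11.05, 12.0], [12.2, 21.0], [12.0, 20.0], [14.5, 29.5],
    [11.2, 13.0], [11.5, 12.0], [12.8, 19.0], [14.75, 28.8], [10.5, 14.0],
    [10.7, 12.2], [14.3, 27.0], [12.4, 19.5], [14.85, 29.2], [10.9, 13.6],
    [13.9, 29.7], [10.4, 12.2], [10.8, 13.6], [14.0, 28.8], [12.47, 22.8],
])

dist = squareform(pdist(wines, metric="euclidean"))  # 20 x 20 distance matrix

# Find the most similar pair of wines (ignore the zero diagonal)
np.fill_diagonal(dist, np.inf)
i, j = np.unravel_index(np.argmin(dist), dist.shape)
print(f"closest pair: wines {i + 1} and {j + 1}, distance {dist[i, j]:.2f}")
```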
Standardized Euclidean Distance
Let X1k and X2k be two attributes of the data (where k stands for the kth observation in the data set). It is possible that the range of X1k is much smaller than that of X2k, resulting in a skewed Euclidean distance value. An easy way of handling this potential bias is to standardize the data using the following equation:

$$\text{Standardized value of the attribute} = \frac{X_{ik} - \bar{X}_i}{\sigma_{X_i}}$$

where $\bar{X}_i$ and $\sigma_{X_i}$ are, respectively, the mean and standard deviation of the ith attribute.
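A minimal sketch of this idea (assuming plain per-column z-scores followed by ordinary Euclidean distance):

```python
import numpy as np

def standardized_euclidean(X):
    """Pairwise Euclidean distances after z-scoring each attribute (column)."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # (X_ik - mean_i) / std_i
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Without standardization, alkalinity of ash (range ~17.5) would dominate
# alcohol (range ~4.45) in the wine example above.
```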
Manhattan Distance (City Block Distance)

Euclidean distance may not be appropriate when measuring the distance between different locations (for example, the distance between two shops in a city). In such cases, we use the Manhattan distance, which is given by

$$D_M(X_1, X_2) = \sum_{i=1}^{n} |X_{1i} - X_{2i}|$$

It is not based on squared (Euclidean) differences; instead it uses the sum of the absolute differences of the variables. It is simple to calculate, but may lead to invalid clusters if the clustering variables are highly correlated.
Minkowski Distance
Minkowski distance is the generalized distance measure between two cases in the data set and is given by

$$D(X_1, X_2) = \left( \sum_{i=1}^{n} |X_{1i} - X_{2i}|^p \right)^{1/p}$$

When p = 1, the Minkowski distance is the same as the Manhattan distance.
For p = 2, the Minkowski distance is the same as the Euclidean distance.
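A small sketch (my own helper, not from the slides) that makes this relationship explicit:

```python
import numpy as np

def minkowski_distance(x1, x2, p=2):
    """Minkowski distance; p=1 reduces to Manhattan, p=2 to Euclidean."""
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    return np.sum(np.abs(x1 - x2) ** p) ** (1.0 / p)

x1, x2 = [14.8, 28.0], [11.05, 12.0]    # wines 1 and 2 again
print(minkowski_distance(x1, x2, p=1))  # Manhattan: 3.75 + 16 = 19.75
print(minkowski_distance(x1, x2, p=2))  # Euclidean: ~ 16.43
```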
Jaccard Similarity Coefficient (Jaccard Index)
The Jaccard similarity coefficient (JSC) or Jaccard index (Real and Vargas, 1996) is a measure used when the data are qualitative, especially when the attributes can be represented in binary form.
The JSC for two n-dimensional observations (n attributes), X1 and X2, is given by

$$\text{Jaccard}(X_1, X_2) = \frac{n(X_1 \cap X_2)}{n(X_1 \cup X_2)}$$

where $n(X_1 \cap X_2)$ is the number of attributes that belong to both X1 and X2, and $n(X_1 \cup X_2)$ is the number of attributes that belong to either X1 or X2.
Example
Consider movie DVD purchases made by two customers, as given by the following sets:
Customer 1 = {Jungle Book (JB), Iron Man (IM), Kung Fu Panda (KFP), Before Sunrise (BS), Bridge of Spies (BoS), Forrest Gump (FG)}
Customer 2 = {Casablanca (C), Jungle Book (JB), Forrest Gump (FG), Iron Man (IM), Kung Fu Panda (KFP), Schindler’s List (SL), The Godfather (TGF)}
In this case, each movie is an attribute. The purchases made by the two customers are shown in the table below.

Movie Title   BS   BoS   C   FG   IM   JB   KFP   SL   TGF
Customer 1     1    1    0    1    1    1    1     0    0
Customer 2     0    0    1    1    1    1    1     1    1
The JSC is given by

$$\text{JSC} = \frac{n(\text{customer 1} \cap \text{customer 2})}{n(\text{customer 1} \cup \text{customer 2})} = \frac{4}{9} = 0.44$$

The higher the Jaccard coefficient, the higher the similarity between the two observations being compared. The value of the JSC lies between 0 and 1.
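The same calculation as a short Python sketch over the movie sets:

```python
def jaccard(a, b):
    """Jaccard index: |A intersection B| / |A union B|."""
    return len(a & b) / len(a | b)

customer1 = {"JB", "IM", "KFP", "BS", "BoS", "FG"}
customer2 = {"C", "JB", "FG", "IM", "KFP", "SL", "TGF"}
print(round(jaccard(customer1, customer2), 2))  # 4/9 = 0.44
```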
Cosine Similarity
The cosine similarity between X1 and X2 is given by

$$\text{Similarity}(X_1, X_2) = \cos(\theta) = \frac{X_1 \cdot X_2}{\|X_1\| \, \|X_2\|} = \frac{\sum_{i=1}^{n} X_{1i} X_{2i}}{\sqrt{\sum_{i=1}^{n} X_{1i}^2} \, \sqrt{\sum_{i=1}^{n} X_{2i}^2}}$$

In cosine similarity, X1 and X2 are two n-dimensional vectors, and the measure is the cosine of the angle between the two vectors (hence the name vector space model).

[Figure: cosine similarity for different values of θ.]
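A minimal sketch of cosine similarity, reusing the binary purchase vectors from the Jaccard example:

```python
import numpy as np

def cosine_similarity(x1, x2):
    """cos(theta) = (X1 . X2) / (||X1|| * ||X2||)."""
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    return (x1 @ x2) / (np.linalg.norm(x1) * np.linalg.norm(x2))

# Binary purchase vectors, attribute order (BS, BoS, C, FG, IM, JB, KFP, SL, TGF)
c1 = [1, 1, 0, 1, 1, 1, 1, 0, 0]
c2 = [0, 0, 1, 1, 1, 1, 1, 1, 1]
print(round(cosine_similarity(c1, c2), 3))  # 4 / sqrt(6 * 7) ~ 0.617
```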
Gower’s Similarity Coefficient
Gower’s similarity coefficient (Gower, 1971) is used when the data contain both quantitative and qualitative attributes. Gower’s coefficient between two n-dimensional observations i and j is given by

$$D_{ij} = \frac{\sum_{k=1}^{n} D_{ijk} W_{ijk}}{\sum_{k=1}^{n} W_{ijk}}$$

where $D_{ijk}$ is the score between observations i and j on the kth variable, and $W_{ijk}$ is a binary weight that captures whether the comparison between the observations is valid on the kth variable.
Example
The table below shows 5 customers and their movie downloads from a portal. The data consist of the genre of the movies downloaded, the maximum rating given by the customer, and the marital status (code 1 implies married and 0 otherwise). For example, customer 1 downloaded 23 action, 5 romance, 15 comedy, and 0 sci-fi movies, and his maximum rating was 4.

Customer   Action    Romance   Comedy    Sci-fi    Maximum Rating   Married
           (k = 1)   (k = 2)   (k = 3)   (k = 4)   (k = 5)          (k = 6)
1          23        5         15        0         4                0
2          5         18        16        2         5                1
3          25        0         0         15        5                0
4          2         30        15        0         4                1
5          45        0         0         10        5                0
Solution
The Gower scores between customers 1 and 2 can be calculated as shown in the table below:

        k = 1    k = 2    k = 3    k = 4    k = 5    k = 6   Sum
Dijk    0.5814   0.5667   0.9375   0.8667   0.0000   0       2.952
Wijk    1        1        1        1        1        1       6

$$D_{ij} = \frac{\sum_{k=1}^{n} D_{ijk} W_{ijk}}{\sum_{k=1}^{n} W_{ijk}}$$

Gower’s coefficient between customers 1 and 2 is therefore 2.952/6 = 0.492.
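A sketch that reproduces this worked example. It assumes the usual Gower convention behind the table's values: for a quantitative attribute the per-attribute score is 1 − |difference|/range, for the binary attribute it is an exact-match indicator, and all weights W_ijk are 1 here.

```python
import numpy as np

# Customers from the table: action, romance, comedy, sci-fi, max rating, married
X = np.array([
    [23,  5, 15,  0, 4, 0],
    [ 5, 18, 16,  2, 5, 1],
    [25,  0,  0, 15, 5, 0],
    [ 2, 30, 15,  0, 4, 1],
    [45,  0,  0, 10, 5, 0],
], dtype=float)
quantitative = [True, True, True, True, True, False]  # k = 6 (married) is binary

def gower(X, i, j, quantitative):
    """Gower coefficient between rows i and j, all weights W_ijk = 1."""
    ranges = X.max(axis=0) - X.min(axis=0)
    scores = []
    for k, quant in enumerate(quantitative):
        if quant:  # 1 - |x_ik - x_jk| / range_k
            scores.append(1 - abs(X[i, k] - X[j, k]) / ranges[k])
        else:      # exact-match indicator for qualitative attributes
            scores.append(1.0 if X[i, k] == X[j, k] else 0.0)
    return sum(scores) / len(scores)

print(round(gower(X, 0, 1, quantitative), 3))  # 2.952 / 6 ~ 0.492
```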
Quality and Optimal Number of Clusters
Milligan and Cooper (1985) analysed over 30 procedures for determining the optimal number of clusters and recommended the index proposed by Calinski and Harabasz (1974), which is given by

$$CH(k) = \frac{B(k)/(k-1)}{W(k)/(n-k)}$$

where CH(k) is the Calinski and Harabasz index with k clusters (k > 1), and B(k) and W(k) are the between-cluster and within-cluster sums of squared variation with k clusters.
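scikit-learn ships this index as calinski_harabasz_score; a sketch that picks k by maximizing CH(k) on synthetic data (illustrative data, not the wine example):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

X, _ = make_blobs(n_samples=200, centers=3, random_state=42)

# CH(k) is largest for well-separated, compact clusterings; here it
# should peak at k = 3
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(k, round(calinski_harabasz_score(X, labels), 1))
```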
Clustering Algorithms
Clustering algorithms group data into a finite number of mutually exclusive subsets.

Steps followed in clustering algorithms:

• Variable selection.
• Deciding the distance/similarity measure for measuring the distance/dissimilarity between observations.
• Deciding the number of clusters.
• Validation of the clusters.
Variable Selection

Ketchen and Shook (1996) suggest inductive, deductive, and cognitive approaches for variable selection.

• Inductive selection is basically an exploratory approach and starts with as many variables as possible.

• In deductive variable selection, the suitability of the variables and their theoretical basis influence the selection.

• Under cognitive variable selection, expert opinion plays a major role.
Deciding Distance/Similarity Measures
Choosing the right distance/similarity measure plays an
important role in developing clusters.

Number of Clusters

Several approaches are available for deciding the number of clusters, such as the CH index, the Hartigan statistic, the Silhouette statistic, and the elbow method, in which the ideal number of clusters is given by the position of the elbow (bend) in an L-shaped curve.
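A minimal elbow-method sketch (synthetic data assumed): the within-cluster sum of squares (scikit-learn's inertia_) drops sharply until the true number of clusters, then flattens.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=4, random_state=1)

# Plotting k vs inertia gives the L-shaped curve; the bend ("elbow")
# suggests the number of clusters
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    print(k, round(km.inertia_, 1))
```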
Cluster Validation

The clusters created should be validated for consistency using different algorithms, to ensure that the clusters represent structures that exist in the population.
Halkidi et al. (2001) suggest the following measures to validate the clusters:

• Compactness: closeness of the members of a cluster, which can be measured through the variance.

• Separation: distance between different clusters.
K-Means Clustering

• K-means clustering is one of the most frequently used clustering algorithms.

• It is a non-hierarchical clustering method in which the number of clusters (K) is decided a priori.
K-Means Clustering - Steps
1) Choose K observations from the data that are likely to be in different clusters. There are many ways of choosing these initial K values; the easiest approach is to choose observations that are farthest apart (on one of the parameters of the data).
2) The K observations chosen in step 1 become the centroids of those clusters.
3) For each remaining observation, find the closest centroid, based on an appropriate distance measure, and add the observation (say, observation j) to that cluster. Adjust the centroid after adding the new observation to the cluster.
4) Repeat step 3 until all observations are assigned to a cluster.
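A hedged sketch applying K-means to the wine data from the earlier example (assuming K = 2, and using scikit-learn's implementation, whose k-means++ initialization differs from the farthest-point scheme in step 1 above):

```python
import numpy as np
from sklearn.cluster import KMeans

# (alcohol, alkalinity of ash) for the 20 wines
wines = np.array([
    [14.8, 28.0], [11.05, 12.0], [12.2, 21.0], [12.0, 20.0], [14.5, 29.5],
    [11.2, 13.0], [11.5, 12.0], [12.8, 19.0], [14.75, 28.8], [10.5, 14.0],
    [10.7, 12.2], [14.3, 27.0], [12.4, 19.5], [14.85, 29.2], [10.9, 13.6],
    [13.9, 29.7], [10.4, 12.2], [10.8, 13.6], [14.0, 28.8], [12.47, 22.8],
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(wines)
print(km.labels_)           # cluster membership of each wine
print(km.cluster_centers_)  # final centroids
```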
Hierarchical Clustering
Hierarchical clustering is a clustering algorithm that uses the following steps to develop clusters:

1) Start with each data point in its own cluster.

2) Find the two clusters with the shortest distance between them (using an appropriate distance measure) and merge them to form a new cluster.

3) Repeat step 2 until all data points are merged into a single cluster.

The above procedure is called agglomerative hierarchical clustering.

[Figure: dendrogram for movie clustering.]
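A sketch of agglomerative clustering with SciPy (synthetic two-group data assumed; single linkage matches the "merge the closest pair" rule above):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(5, 1, (10, 2))])

Z = linkage(X, method="single", metric="euclidean")  # merge closest pairs first
labels = fcluster(Z, t=2, criterion="maxclust")      # cut the tree into 2 clusters
print(labels)

dendrogram(Z)  # the tree-like merge structure; the root is the full data set
plt.show()
```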
Summary

• Clustering is an unsupervised learning algorithm that divides the data set into mutually exclusive and exhaustive subsets (in the case of non-overlapping clustering) that are homogeneous within the group and heterogeneous between the groups.

• Clustering is one of the most frequently used techniques in practice; practitioners often first cluster the data and then develop predictive models for each cluster for better management.

• Several distance measures, such as the Euclidean distance and the Gower distance, are used in clustering algorithms. Similarity coefficients such as the Jaccard coefficient and cosine similarity are used depending on the data type.

• K-means clustering and hierarchical clustering are two popular clustering techniques.

• One of the decisions to be taken during clustering is the number of clusters. Usually this is decided using the elbow curve: the cluster number at which the elbow (bend) occurs is taken as the optimal number of clusters.
• A Kalyani Jeweler designer wishes to know if the population of teenage girls aged 13–19 can be divided into smaller groups who might be looking at jewellery very differently.
• The following questions were given to a group of 10 girls to understand what jewellery meant to them. The questionnaire used a 5-point Likert scale ranging from 1 (strongly agree) to 5 (strongly disagree).

Questionnaire

1. I like to wear jewellery that glitters.
2. My jewellery should match my dress.
3. I want everyone to admire my jewellery.
4. I take my friends with me when I go jewellery shopping.
5. Beautiful jewellery adds to a girl's beauty.
(Each item is rated: 1. Strongly agree  2. Agree  3. Neutral  4. Disagree  5. Strongly disagree)

• Cluster 1 = {5, 9, 4, 10}
• Cluster 2 = {6, 8, 3, 2, 7, 1}

• Cluster 1 seems to be a socially concerned group, as its members show a high degree of agreement with X3 and X4.

• Cluster 2 seems to be more self-driven, as its members show a high degree of agreement with X1, X2, and X3.
• The Different Types of Cluster Analysis
There are three primary methods used to perform cluster analysis:

• Hierarchical Cluster
This is the most common method of clustering. It creates a series of models with cluster solutions from 1 (all cases in one cluster) to n (each case is an individual cluster). This approach also works with variables instead of cases: hierarchical clustering can group variables together in a manner similar to factor analysis. Finally, hierarchical cluster analysis can handle nominal, ordinal, and scale data; just remember not to mix different levels of measurement in your study.

• K-Means Cluster
This method is used to quickly cluster large data sets. Here, researchers define the number of clusters prior to performing the analysis. This approach is useful when testing different models with a different assumed number of clusters.

• Two-Step Cluster
This method identifies groupings by performing pre-clustering first and then applying hierarchical methods. Two-step clustering is best for handling larger data sets that would otherwise take too long to calculate with strictly hierarchical methods. Essentially, two-step cluster analysis is a combination of hierarchical and k-means cluster analysis. It can handle both scale and ordinal data, and it automatically selects the number of clusters.
• Steps for Cluster Analysis
• Formulate the problem – Select the variables on which the clustering will be based. The variables should describe the similarity between objects in terms that are relevant to the research problem, and should be selected based on past research, theory, the hypotheses being tested, or the judgment of the researcher.
• Select a distance measure – An appropriate measure of distance needs to be selected to determine how similar or dissimilar the objects being clustered are. The most commonly used measure is the Euclidean distance.
• Select a clustering procedure – Several clustering procedures have been developed, and the one most appropriate for the problem at hand should be chosen.
• Decide on the number of clusters – The number of clusters can be based on theoretical, conceptual, or practical considerations.
• Interpret and profile the clusters – This involves examining the cluster centroids. The centroids represent the mean values of the objects contained in the cluster on each of the variables.
• Assess the validity of the clustering – Methods to validate the clusters include using different clustering methods and comparing the results, or clustering on a smaller set of variables (randomly deleted) and comparing the results with those from the entire set of variables.
