



Clustering is the process of finding meaningful groups in data.

For example, customers of a company can be grouped based on their purchase behavior. In recent years, clustering has even found use in political elections.

Clustering to describe the data

Clustering for pre-processing

Types of clustering
1. Exclusive or strict partitioning clusters
2. Overlapping clusters
3. Hierarchical clusters
4. Fuzzy or probabilistic clusters

Based on algorithmic approach

1. Prototype-based clustering
2. Density clustering
3. Hierarchical clustering
4. Model-based clustering
k-Means Clustering
k-means clustering is a prototype-based clustering method where the data set
is divided into k clusters.

Objective: find a prototype data point for each cluster. Each data point is
then assigned to its nearest prototype, and the points that share a prototype form a cluster.
k Partitions

The k-means algorithm divides the data space into k partitions, where the centroid of each partition is the prototype of that cluster.

Voronoi partition. (“Euclidean Voronoi Diagram” by Raincomplex – personal work. Licensed under Creative Commons Zero, Public Domain Dedication via Wikimedia Commons.)
Step 1: Initiate Centroids
Step 2: Assign Data Points
Step 3: Calculate New Centroids

The new centroid of each cluster is the mean of the data points assigned to it, which minimizes the sum of squared errors (SSE).
Step 4: Repeat Assignment and Calculate New Centroids
Step 5: Termination

The algorithm stops when the assignment of data points no longer changes or, in other words, when no significant change in the centroids is noted.
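
The five steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name k_means, the parameters max_iter, tol, and seed, and the choice to keep the old centroid when a cluster ends up empty are all assumptions made for the example.

```python
import numpy as np

def k_means(points, k, max_iter=100, tol=1e-6, seed=0):
    """Minimal k-means sketch; returns (centroids, labels, sse)."""
    rng = np.random.default_rng(seed)
    # Step 1: initiate centroids as k distinct random data points.
    centroids = points[rng.choice(len(points), size=k, replace=False)]

    def assign(current_centroids):
        # Distance from every point to every centroid, then pick the nearest.
        distances = np.linalg.norm(
            points[:, None, :] - current_centroids[None, :, :], axis=2
        )
        return distances.argmin(axis=1)

    for _ in range(max_iter):
        # Step 2: assign each data point to its nearest centroid.
        labels = assign(centroids)
        # Step 3: the new centroid is the mean of the assigned points
        # (assumption: an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: terminate when the centroids no longer move significantly.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
        # Step 4: otherwise repeat assignment and centroid calculation.

    labels = assign(centroids)
    # SSE: squared distance of every point to the centroid of its cluster.
    sse = float(((points - centroids[labels]) ** 2).sum())
    return centroids, labels, sse
```

Because the starting centroids are random, different seeds can lead to different final clusters; running the sketch several times and keeping the result with the lowest SSE is a common workaround.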

Evaluation of Clusters

- Minimize total SSE

- Davies-Bouldin index
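
As a rough illustration of the second measure, the sketch below computes the Davies-Bouldin index from a set of points and their cluster labels. It uses the usual definition (average distance of points to their centroid as within-cluster scatter, Euclidean distance between centroids as separation); the function name davies_bouldin is hypothetical, and the sketch assumes at least two clusters with distinct centroids. Lower values indicate more compact, better separated clusters.

```python
import numpy as np

def davies_bouldin(points, labels):
    """Davies-Bouldin index sketch; assumes at least two clusters."""
    cluster_ids = np.unique(labels)
    centroids = np.array([points[labels == c].mean(axis=0) for c in cluster_ids])
    # Scatter: average distance of a cluster's points to its centroid.
    scatter = np.array([
        np.linalg.norm(points[labels == c] - centroids[i], axis=1).mean()
        for i, c in enumerate(cluster_ids)
    ])
    k = len(cluster_ids)
    worst_ratios = []
    for i in range(k):
        # Compare cluster i with every other cluster and keep the worst case.
        ratios = [
            (scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
            for j in range(k) if j != i
        ]
        worst_ratios.append(max(ratios))
    return float(np.mean(worst_ratios))
```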
Density in the dataset
Density of a data point

The number of points within a circular space with radius ε (epsilon) around a data point A is six.
Step 1: Defining Epsilon and MinPoints

Density is measured as the number of data points inside the space defined by radius ε. If MinPoints is defined as 5, the ε-space surrounding data point A is considered a high-density region, because it contains at least MinPoints data points.
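
The density test can be illustrated with a short sketch. The function names neighbor_counts and is_high_density are hypothetical, and the sketch assumes Euclidean distance and that a data point counts itself as lying inside its own ε-space.

```python
import numpy as np

def neighbor_counts(points, eps):
    """Number of points within radius eps of each point (the point itself included)."""
    distances = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    return (distances <= eps).sum(axis=1)

def is_high_density(points, eps, min_points):
    """True where the eps-space around a point holds at least min_points points."""
    return neighbor_counts(points, eps) >= min_points
```

With ε = 1 and MinPoints = 5, a point whose ε-space holds six points passes the test, while an isolated point does not.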
Step 2: Classification of Data Points
Step 3: Clustering

Groups of core points form distinct clusters. If two core points are within ε of
each other, then both core points are within the same cluster.
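
The classification and clustering steps can be sketched as follows, in the style of the DBSCAN algorithm. The function name density_cluster and the NOISE marker are hypothetical. The sketch assumes that non-core points lying within ε of a core point are attached to that core point's cluster, and that all remaining points are labeled as noise.

```python
from collections import deque

import numpy as np

NOISE = -1  # label for points that belong to no cluster

def density_cluster(points, eps, min_points):
    """DBSCAN-style sketch; returns (labels, is_core)."""
    n = len(points)
    distances = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    neighbors = [np.flatnonzero(distances[i] <= eps) for i in range(n)]
    # Step 2: a core point has at least min_points points in its eps-space.
    is_core = np.array([len(nb) >= min_points for nb in neighbors])
    labels = np.full(n, NOISE)
    cluster_id = 0
    for start in range(n):
        if not is_core[start] or labels[start] != NOISE:
            continue
        # Step 3: grow a cluster outward from an unvisited core point.
        labels[start] = cluster_id
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for nb in neighbors[current]:
                if labels[nb] == NOISE:
                    labels[nb] = cluster_id
                    # Only core points keep expanding the cluster.
                    if is_core[nb]:
                        queue.append(nb)
        cluster_id += 1
    return labels, is_core
```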

Optimize the Parameters: ε and a minimum threshold (MinPoints)

Special Cases: Varying Densities
Powerful visual clustering technique

A self-organizing map (SOM) is a neural network whose output is an organized visual matrix: a two-dimensional grid with data objects placed next to each other based on their similarity to one another.
Step 1: Topology Specification
Step 2: Initialize Centroids

The initial centroids are values of random data objects from the data set.

Step 3: Assignment of Data Objects

Data objects are selected one by one and assigned to the nearest centroid.
Step 4: Centroid Update

Update the values of the centroid nearest to the data object by an amount proportional
to the difference between the centroid and the data object.
Step 5: Termination

Repeat assignment and centroid update until no significant centroid updates take place in a run.

Step 6: Mapping a New Data Object

A new data object is mapped to the grid based on its proximity to the centroids.
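
The six steps can be sketched as below. The names train_som and map_to_grid and all parameter defaults are hypothetical. Two simplifications are assumed: training runs for a fixed number of epochs with a decaying learning rate instead of checking for "no significant updates", and the grid neighbors of the nearest centroid are also nudged with a Gaussian weight, as is usual for SOMs, although the steps above only mention the nearest centroid.

```python
import numpy as np

def train_som(data, rows, cols, epochs=50, learning_rate=0.5, radius=1.0, seed=0):
    """Minimal SOM sketch; returns (centroids, grid coordinates)."""
    rng = np.random.default_rng(seed)
    # Step 1: topology, here a rows x cols rectangular grid.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Step 2: initialize centroids with the values of random data objects.
    n_cells = rows * cols
    picks = rng.choice(len(data), size=n_cells, replace=len(data) < n_cells)
    centroids = data[picks].astype(float)
    for epoch in range(epochs):
        # Assumption: learning rate and neighborhood width shrink linearly.
        decay = 1.0 - epoch / epochs
        lr = learning_rate * decay
        sigma = max(radius * decay, 1e-3)
        for x in data[rng.permutation(len(data))]:
            # Step 3: assign the data object to its nearest centroid.
            winner = np.linalg.norm(centroids - x, axis=1).argmin()
            # Step 4: move the winner (and, more weakly, its grid neighbors)
            # toward the data object, proportional to their difference.
            grid_dist_sq = ((grid - grid[winner]) ** 2).sum(axis=1)
            influence = np.exp(-grid_dist_sq / (2 * sigma ** 2))
            centroids += lr * influence[:, None] * (x - centroids)
    # Step 5 is approximated by the fixed epoch count above.
    return centroids, grid

def map_to_grid(x, centroids, grid):
    """Step 6: map a new data object to the grid cell of its nearest centroid."""
    winner = np.linalg.norm(centroids - x, axis=1).argmin()
    return tuple(int(v) for v in grid[winner])
```

Calling map_to_grid for every data object gives the cell positions that make up the two-dimensional output grid.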
