Clustering


7. Clustering

Clustering is the process of finding meaningful groups in data.

For example, the customers of a company can be grouped based on their purchase behavior. In recent years, clustering has even found use in political elections.

Clustering to describe the data

Clustering for pre-processing


Types of clustering
1. Exclusive or strict partitioning clusters
2. Overlapping clusters
3. Hierarchical clusters
4. Fuzzy or probabilistic clusters

Types of clustering based on algorithmic approach

1. Prototype-based clustering
2. Density clustering
3. Hierarchical clustering
4. Model-based clustering
k-Means Clustering
k-Means
k-means clustering is a prototype-based clustering method where the data set
is divided into k clusters.

Objective: find a prototype data point for each cluster. All data points are then assigned to their nearest prototype, and each prototype together with its assigned points forms a cluster.
k Partitions

The k-means algorithm divides the data space into k partitions separated by boundaries, where the centroid of each partition is the prototype of its cluster.

Voronoi partition. ("Euclidean Voronoi Diagram" by Raincomplex, personal work. Licensed under Creative Commons Zero, Public Domain Dedication, via Wikimedia Commons.)
Step 1: Initiate Centroids
Step 2: Assign Data Points
Step 3: Calculate New Centroids

Each new centroid is the mean of the data points assigned to it, which minimizes the sum of squared errors (SSE) within that cluster.
Step 4: Repeat Assignment and Calculate New Centroids
Step 5: Termination

The algorithm terminates when the assignment of data points no longer changes or, in other words, when no significant change in the centroids is noted.
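The five steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a reference implementation: the function names, the `max_iter` cap, the random choice of initial centroids from the data, and the handling of empty clusters are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iter=100, seed=0):
    """Sketch of k-means; returns (centroids, assignment, sse)."""
    rng = random.Random(seed)
    # Step 1: initiate centroids (assumption: k distinct random data points).
    centroids = [tuple(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign every data point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Step 5: terminate when no assignment changed.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[c] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
        # Step 4: the loop repeats assignment and centroid calculation.
    sse = sum(
        squared_distance(p, centroids[a]) for p, a in zip(points, assignment)
    )
    return centroids, assignment, sse


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, assignment, sse = k_means(points, k=2)
print(centroids, assignment, sse)
```

On this toy data set the first three points end up in one cluster and the last three in the other. On real data the result can depend on the initial centroids, so it is common to run the algorithm several times with different seeds and keep the run with the lowest SSE.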

Evaluation of Clusters

- Minimize total SSE


- Davies-Bouldin index
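Both measures can be computed directly from the clusters. The sketch below assumes each cluster is given as a list of points and uses the usual definition of the Davies-Bouldin index (average, over clusters, of the worst ratio of combined within-cluster scatter to centroid separation); the function names are illustrative. Lower values are better for both measures.

```python
import math


def centroid(cluster):
    """Mean point of a non-empty cluster."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))


def total_sse(clusters):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for cluster in clusters:
        c = centroid(cluster)
        total += sum(math.dist(p, c) ** 2 for p in cluster)
    return total


def davies_bouldin(clusters):
    """Davies-Bouldin index; needs at least two clusters with distinct centroids."""
    centroids = [centroid(cl) for cl in clusters]
    # Scatter: average distance of the members to their own centroid.
    scatter = [
        sum(math.dist(p, c) for p in cl) / len(cl)
        for cl, c in zip(clusters, centroids)
    ]
    k = len(clusters)
    worst_ratios = [
        max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in range(k) if j != i
        )
        for i in range(k)
    ]
    return sum(worst_ratios) / k


good = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
bad = [[(0, 0), (10, 0)], [(0, 2), (10, 2)]]
print(total_sse(good), davies_bouldin(good))
print(total_sse(bad), davies_bouldin(bad))
```

The compact, well-separated grouping scores lower on both measures than the grouping that mixes the two sides.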
DBSCAN Clustering
Density in the dataset
Density of a data point

The number of points within a circular space with radius ε (epsilon) around a data point A is six.
Step 1: Defining Epsilon and MinPoints

The density around a data point is the number of data points inside the space defined by radius ε. If MinPoints is defined as 5, the space within ε of data point A, which contains six points, is considered a high-density region.
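The density check can be expressed as a neighbor count. In this sketch the function names and the sample coordinates are made up for illustration, and the count is assumed to include the data point itself, which is one common convention; the slide does not say which convention it uses.

```python
import math


def neighbors_within(points, center, eps):
    """All points whose Euclidean distance to center is at most eps."""
    return [p for p in points if math.dist(p, center) <= eps]


def is_high_density(points, center, eps, min_points):
    """True when the eps-neighborhood of center holds at least min_points points.

    Assumption: center itself counts when it is part of points.
    """
    return len(neighbors_within(points, center, eps)) >= min_points


a = (0.0, 0.0)
points = [a, (0.5, 0.0), (0.0, 0.5), (-0.5, 0.0), (0.0, -0.5), (0.6, 0.6),
          (5.0, 5.0)]
print(len(neighbors_within(points, a, eps=1.0)))   # 6
print(is_high_density(points, a, eps=1.0, min_points=5))  # True
```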
Step 2: Classification of Data Points
Step 3: Clustering

Groups of core points form distinct clusters. If two core points are within ε of
each other, then both core points are within the same cluster.
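A compact sketch of the clustering step follows. It connects core points that lie within ε of each other, as described above. Two details are assumptions taken from the standard DBSCAN formulation rather than from the slides: non-core points within ε of a core point join that cluster as border points, and everything else is labeled noise (`-1`). The brute-force neighbor search is only suitable for small data sets.

```python
import math
from collections import deque

NOISE = -1


def dbscan(points, eps, min_points):
    """Return one cluster label per point; NOISE marks unclustered points."""
    n = len(points)
    # Neighborhood of every point (includes the point itself).
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_points for nb in neighbors]
    labels = [NOISE] * n
    cluster_id = 0
    for i in range(n):
        if not is_core[i] or labels[i] != NOISE:
            continue
        # Start a new cluster and grow it outward from this core point.
        labels[i] = cluster_id
        queue = deque([i])
        while queue:
            current = queue.popleft()  # always a core point
            for j in neighbors[current]:
                if labels[j] == NOISE:
                    labels[j] = cluster_id
                    if is_core[j]:  # only core points keep expanding
                        queue.append(j)
        cluster_id += 1
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
print(dbscan(points, eps=1.5, min_points=3))
# [0, 0, 0, 0, 1, 1, 1, 1, -1]
```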

Optimize the Parameters: ε and a minimum threshold (MinPoints)


Special Cases: Varying Densities
Self-Organizing Maps
SOM
Powerful visual clustering technique

A SOM is a neural network whose output is an organized visual matrix: a two-dimensional grid with data objects placed next to each other based on their similarity to one another.
Step 1: Topology Specification
Step 2: Initialize Centroids

The initial centroids are values of random data objects from the data set.

Step 3: Assignment of Data Objects

Data objects are selected one by one and assigned to the nearest centroid.
Step 4: Centroid Update

Update the data values of the nearest centroid of the data object, proportional
to the difference between the centroid and the data object.
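The update described above moves the nearest centroid a fraction of the way toward the data object. In this sketch that fraction is a hypothetical `learning_rate` parameter, and only the nearest centroid is updated, as the text states; fuller SOM formulations also move neighboring grid cells by a smaller amount, which is not shown here.

```python
import math


def nearest_index(centroids, data_object):
    """Index of the centroid closest to the data object."""
    return min(
        range(len(centroids)),
        key=lambda i: math.dist(centroids[i], data_object),
    )


def update_centroid(centroid, data_object, learning_rate):
    """Move the centroid toward the data object, proportional to their difference."""
    return [c + learning_rate * (x - c) for c, x in zip(centroid, data_object)]


centroids = [[0.0, 0.0], [10.0, 10.0]]
data_object = [2.0, 4.0]
winner = nearest_index(centroids, data_object)
centroids[winner] = update_centroid(centroids[winner], data_object, 0.5)
print(winner, centroids[winner])  # 0 [1.0, 2.0]
```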
Step 5: Termination

Assignment and centroid updates are repeated until no significant centroid updates take place in a run.

Step 6: Mapping a New Data Object

A new data object is placed on the grid based on its proximity to the centroids.
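Mapping a new data object then reduces to a nearest-centroid lookup on the trained grid. The sketch assumes the trained SOM is stored as a dictionary from `(row, column)` grid cells to centroid vectors; that representation and the function name are illustrative choices.

```python
import math


def map_to_grid(grid_centroids, data_object):
    """Return the grid cell whose centroid is closest to the data object."""
    return min(
        grid_centroids,
        key=lambda cell: math.dist(grid_centroids[cell], data_object),
    )


# Hypothetical 2 x 2 grid of already trained centroids.
grid = {
    (0, 0): [0.0, 0.0],
    (0, 1): [1.0, 0.0],
    (1, 0): [0.0, 1.0],
    (1, 1): [1.0, 1.0],
}
print(map_to_grid(grid, [0.9, 0.8]))  # (1, 1)
```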
