Assignment No 5 K-Means Clustering


Honours* in Data Science, Fourth Year of Engineering (Semester VII)
410502: Machine Learning and Data Science Laboratory


Dr. Girija Gireesh Chiddarwar
Assignment No 4 - Text classification for sentiment analysis using KNN. Note: Use
Twitter data

Clustering
Clustering: the process of grouping a set of objects into classes of similar
objects.

Clustering is the classification of objects into different groups, or more
precisely, the partitioning of a data set into subsets (clusters), so that the data
in each subset (ideally) share some common trait, often according to some defined
distance measure.
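
As a small illustration of a distance measure, the sketch below computes the
Euclidean distance between two points. The function name euclidean and the sample
points are assumptions made for this example; another measure (Manhattan, cosine)
could be substituted.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical sample points.
a, b, c = (1.0, 2.0), (4.0, 6.0), (1.5, 2.5)

print(euclidean(a, b))                    # 5.0
print(euclidean(a, c) < euclidean(a, b))  # True: c is closer to a than b is
```

A smaller distance is read as a higher similarity, which is the sense in which
objects in one cluster "share a common trait".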

Clustering is an unsupervised learning technique. It is the task of grouping
together a set of objects in such a way that objects in the same cluster are more
similar to each other than to objects in other clusters. Similarity is a measure
that reflects the strength of the relationship between two data objects. Clustering
is mainly used for exploratory data mining. It is used in many fields such as
machine learning, pattern recognition, image analysis, information retrieval,
bioinformatics, data compression, and computer graphics.

Clustering: Types
Clustering can be broadly divided into two subgroups:
Hard clustering: in hard clustering, each data object or point belongs to exactly
one cluster. For example, in the Uber dataset, each location belongs to exactly one
borough.
Soft clustering: in soft clustering, a data point can belong to more than one
cluster, each with some probability or likelihood value. For example, you could
identify some locations as border points belonging to two or more boroughs.
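
The difference between the two types can be sketched in a few lines of Python. The
cluster centers, the sample points, and the choice of a softmax over negative
distances as the membership weight are all assumptions made for illustration; they
are not part of the assignment.

```python
import math

def distance(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def hard_assign(point, centers):
    """Hard clustering: index of the single nearest center."""
    dists = [distance(point, c) for c in centers]
    return dists.index(min(dists))

def soft_assign(point, centers):
    """Soft clustering: one membership weight per center, summing to 1.

    Assumption: weights are a softmax over negative distances.
    """
    scores = [math.exp(-distance(point, c)) for c in centers]
    total = sum(scores)
    return [s / total for s in scores]

centers = [(0.0, 0.0), (10.0, 0.0)]  # two hypothetical cluster centers
inner = (1.0, 0.0)                   # clearly inside cluster 0
border = (5.0, 0.0)                  # halfway between the two centers

print(hard_assign(inner, centers))   # 0
print(soft_assign(border, centers))  # [0.5, 0.5]
```

The border point gets equal membership in both clusters, whereas a hard assignment
would have to pick one of them.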

K-means algorithm
K-means is one of the most widely used clustering methods. The algorithm was
published decades ago, and many improvements to it have been proposed since.
The algorithm tries to find groups by minimizing the distance between the
observations in a group and the center of that group. The result is a locally
optimal solution, which is not necessarily the best possible grouping. The
distances are measured based on the coordinates of the observations.
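
The quantity being minimized can be written explicitly. The notation here is
assumed, not taken from the assignment: observations x_i, clusters C_1, ..., C_k,
and cluster centers (centroids) mu_j. The usual k-means objective is the
within-cluster sum of squared Euclidean distances, with each centroid equal to the
mean of its cluster:

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```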

Algorithm
The algorithm works as follows:
Step 1: Choose k initial cluster centers (centroids) at random in the feature space
Step 2: Assign each observation to its nearest centroid, which minimizes the
distance between each observation and its cluster center. This yields k groups of
observations
Step 3: Shift each centroid to the mean of the coordinates within its group.
Step 4: Reassign the observations according to the new centroids. New boundaries
are created, so observations may move from one group to another
Repeat Steps 3 and 4 until no observation changes groups
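
The steps above can be sketched in plain Python as follows. This is a minimal
illustration, not a reference implementation: the function names, the seed, the
iteration cap, the toy data, and the choice to start from k randomly picked
observations are all assumptions.

```python
import math
import random

def distance(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: pick k distinct observations as the initial centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Steps 2 and 4: assign each observation to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: distance(p, centroids[j]))
            for p in points
        ]
        # Stop when no observation changes groups.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its group.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a group is empty
                centroids[j] = mean(members)
    return labels, centroids

# Hypothetical toy data: two well-separated groups of three points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Which group receives label 0 depends on the random start, so only the grouping
itself is meaningful, not the label numbers.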

[Figure: Visual representation, left points selected]

[Figure: Visual representation, random point selection]

Algorithm
Install and import the required packages.
Load the dataset.
Define K (the number of clusters).
Run K-means clustering.
Calculate the inertia (the within-cluster sum of squared distances) for the given
value of K.
Select the K that gives low inertia while K itself stays small, and use it for
prediction.
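
The last two steps can be illustrated without any external package. The sketch
below runs a plain k-means for several values of K on toy data and prints the
inertia for each. Everything here is an assumption made for illustration: the
function names, the toy data, and the deterministic farthest-point initialization
(used instead of a random start so that the run is repeatable). In a lab setting a
library such as scikit-learn would normally be used, where a fitted KMeans model
reports this value as its inertia_ attribute.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def init_centroids(points, k):
    """Assumed initialization: start at the first point, then repeatedly
    add the point farthest from its nearest chosen centroid."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids

def fit_kmeans(points, k, max_iter=100):
    """Plain k-means; returns (centroids, inertia)."""
    centroids = init_centroids(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # no observation changed groups
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    # Inertia: sum of squared distances of points to their own centroid.
    inertia = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, inertia

# Hypothetical toy data: three tight groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
          (20.0, 0.0), (20.0, 1.0), (21.0, 0.0)]

inertias = {k: fit_kmeans(points, k)[1] for k in range(1, 6)}
for k, value in inertias.items():
    print(k, round(value, 2))
```

On this toy data the inertia falls sharply up to K = 3 and only slightly after
that, so K = 3 is the small K with low inertia that the last step asks for.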
