Practical # 12
Clustering
Clustering is the task of grouping a set of objects so that objects in the same
cluster are more similar to each other than to objects in other clusters. Similarity is a measure that
reflects the strength of the relationship between two data objects. Clustering is mainly used for
exploratory data mining and is applied in many fields, such as machine learning, pattern
recognition, image analysis, information retrieval, bioinformatics, data compression, and
computer graphics.
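One common way to quantify similarity between numeric objects is through a distance: the smaller the distance, the more similar the objects. The short sketch below uses Euclidean distance, which is an assumption made here for illustration, and the three points and the helper name euclidean_distance are hypothetical.

```python
import numpy as np

# Three hypothetical 2-D points: a and b lie close together, c is far away.
a = np.array([1.0, 2.0])
b = np.array([1.5, 2.5])
c = np.array([8.0, 9.0])


def euclidean_distance(p, q):
    """Straight-line distance between two points; smaller means more similar."""
    return float(np.linalg.norm(p - q))


d_ab = euclidean_distance(a, b)
d_ac = euclidean_distance(a, c)

# a is more similar to b than to c, so a and b are candidates for one cluster.
print("a is closer to b than to c:", d_ab < d_ac)
```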
Introducing k-Means
The k-means algorithm searches for a pre-determined number of clusters within an unlabeled
multidimensional dataset. It accomplishes this using a simple conception of what the optimal
clustering looks like:
The "cluster center" is the arithmetic mean of all the points belonging to the cluster.
Each point is closer to its own cluster center than to other cluster centers.
Those two assumptions are the basis of the k-means model. We will soon dive into
exactly how the algorithm reaches this solution, but for now let's take a look at a simple dataset
and see the k-means result.
First, let's generate a two-dimensional dataset containing four distinct blobs. To emphasize that
this is an unsupervised algorithm, we will leave the labels out of the visualization.
Program
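A minimal sketch of such a program, assuming scikit-learn and Matplotlib are installed. The sample count, blob spread, random seed, and output filename kmeans_blobs.png are illustrative assumptions, not values prescribed by this practical.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate 300 two-dimensional points in four blobs (all values are assumed).
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Fit k-means with k=4; n_init=10 restarts guard against a poor initialization.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
centers = kmeans.cluster_centers_

fig, (ax_raw, ax_clustered) = plt.subplots(1, 2, figsize=(10, 4))

# Left panel: the raw data, plotted without the true labels.
ax_raw.scatter(X[:, 0], X[:, 1], s=30)
ax_raw.set_title("Unlabeled data")

# Right panel: points colored by assigned cluster, with the cluster centers.
ax_clustered.scatter(X[:, 0], X[:, 1], c=labels, s=30, cmap="viridis")
ax_clustered.scatter(centers[:, 0], centers[:, 1], c="black", s=200, alpha=0.5)
ax_clustered.set_title("k-means clusters (k=4)")

# Hypothetical output file; in an interactive session plt.show() works as well.
fig.savefig("kmeans_blobs.png")
```

The true labels y_true returned by make_blobs are deliberately never passed to KMeans, since the algorithm is unsupervised.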
The good news is that the k-means algorithm (at least in this simple case) assigns the points to
clusters very similarly to how we might assign them by eye. But you might wonder how this
algorithm finds these clusters so quickly! After all, the number of possible combinations of
cluster assignments is exponential in the number of data points, so an exhaustive search would be
very costly. Fortunately for us, such an exhaustive search is not necessary: instead, the
typical approach to k-means uses an intuitive iterative procedure known as
expectation–maximization.
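The iterative procedure can be sketched in a few lines of NumPy. This is an illustrative implementation under stated assumptions, not the code used by any particular library: the function name kmeans_em, the random initialization from data points, the iteration cap, and the two-group test data are all hypothetical choices.

```python
import numpy as np


def kmeans_em(X, n_clusters, n_iters=100, seed=0):
    """Sketch of k-means as alternating assignment and update steps."""
    rng = np.random.default_rng(seed)
    # Initialization (assumed): pick n_clusters distinct data points as centers.
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iters):
        # E-step: assign every point to its nearest cluster center.
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # M-step: move each center to the mean of the points assigned to it.
        # An empty cluster keeps its previous center.
        new_centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(n_clusters)
        ])
        # Stop once the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels


# Hypothetical data: two groups of 50 points around (0, 0) and (5, 5).
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.3, size=(50, 2)),
    rng.normal(loc=(5.0, 5.0), scale=0.3, size=(50, 2)),
])
centers, labels = kmeans_em(X, n_clusters=2)
print(centers)
```

Each pass costs time proportional to the number of points times the number of clusters, which is why this is far cheaper than an exhaustive search. The result depends on the initial centers, so in practice the procedure is often run from several starting points.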
Class Tasks
Submission Date: --
1. Perform k-means clustering on your dataset and plot the data points together with the resulting clusters.