Assignment Clustering
Cluster analysis is part of unsupervised learning. A cluster is a group of data points that share similar features. We can say that clustering analysis is more about discovery than prediction: the machine searches for similarity in the data. For instance, you can use cluster analysis for applications such as customer segmentation, which is the example developed below.
Clustering analysis is not difficult to implement, and its results are both meaningful and actionable for business.
The most striking difference between supervised and unsupervised learning lies in the results. Unsupervised learning creates a new variable, the label, while supervised learning predicts an outcome. The machine helps the practitioner label the data based on close relatedness; it is then up to the analyst to make use of the groups and give them names.
Let's work through an example to understand the concept of clustering. For simplicity, we work in two dimensions. You have data on the total spend of customers and their ages. To improve advertising, the marketing team wants to send more targeted emails to its customers.
In the following graph, you plot the total spend and the age of the customers.
library(ggplot2)
df <- data.frame(age = c(18, 21, 22, 24, 26, 26, 27, 30, 31, 35, 39, 40, 41, 42, 44, 46, 47, 48, 49, 54),
                 spend = c(10, 11, 22, 15, 12, 13, 14, 33, 39, 37, 44, 27, 29, 20, 28, 21, 30, 31, 23, 24))
ggplot(df, aes(x = age, y = spend)) +
  geom_point()
A pattern is visible at this point:
1. At the bottom-left are young people with lower purchasing power.
2. The upper-middle reflects working-age people who can afford to spend more.
3. Finally, at the right are older people with a lower budget.
In the figure above, you cluster the observations by hand and define each of the three groups. This example is somewhat straightforward and highly visual. If new observations are appended to the data set, you can label them within the circles. You define the circles based on your own judgment. Instead, you can use machine learning to group the data objectively.
K-means algorithm
The first step when using k-means clustering is to indicate the number of clusters (k) that will be generated in the final
solution.
The algorithm starts by randomly selecting k objects from the data set to serve as the initial centers for the clusters. The
selected objects are also known as cluster means or centroids.
Next, each of the remaining objects is assigned to its closest centroid, where "closest" is defined using the Euclidean distance between the object and the cluster mean. This step is called the "cluster assignment step". Note that, to use correlation distance, the data are input as z-scores.
After the assignment step, the algorithm computes the new mean value of each cluster. The term "centroid update" is used to describe this step. Now that the centers have been recalculated, every observation is checked again to see whether it might be closer to a different cluster. All the objects are then reassigned using the updated cluster means.
The cluster assignment and centroid update steps are repeated iteratively until the cluster assignments stop changing (i.e., until convergence is achieved). That is, the clusters formed in the current iteration are the same as those obtained in the previous iteration.
We'll use the demo data set "USArrests". The data should be prepared as described in chapter @ref(data-preparation-and-r-packages). The data must contain only continuous variables, as the k-means algorithm uses variable means. As we don't want the k-means algorithm to depend on an arbitrary variable unit, we start by scaling the data using the R function scale() as follows:
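# USArrests ships with base R; scale() standardizes each column to
# mean 0 and unit variance so no variable dominates the distances.
df <- scale(USArrests)
head(df, n = 3)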
The standard R function for k-means clustering is kmeans() [stats package], whose simplified format is as follows:
kmeans(x, centers, iter.max = 10, nstart = 1)
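For instance, a call on the scaled data might look like this (the values of centers and nstart below are illustrative only):
km.res <- kmeans(df, centers = 4, nstart = 25)
km.res$size     # number of observations in each cluster
km.res$centers  # matrix of the final cluster means
With nstart = 25, kmeans() tries 25 random initial configurations and keeps the one with the lowest total within-cluster variation.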
Clustering: Types
Clustering can be broadly divided into two subgroups:
Hard clustering: in hard clustering, each data object or point either belongs to a cluster completely or not at all. For example, in the Uber dataset, each location belongs to exactly one borough.
Soft clustering: in soft clustering, a data point can belong to more than one cluster with some probability or likelihood value. For example, you could identify some locations as border points belonging to two or more boroughs.
K-Means Clustering
In this section, you will work with the Uber dataset, which contains data generated by Uber for the city of New York. Uber Technologies Inc. is a peer-to-peer ride-sharing platform. Don't worry if you don't know much about Uber; all you need to know is that the Uber platform connects you with (cab) drivers who can drive you to your destination. The data is freely available on Kaggle. The dataset contains raw data on Uber pickups, with information such as the date and time of the trip along with longitude-latitude information.
New York City has five boroughs: Brooklyn, Queens, Manhattan, the Bronx, and Staten Island. At the end of this mini-project, you will apply k-means clustering to the dataset to explore it better and identify the different boroughs within New York. Along the way, you will also learn the various steps that you should take when working on a data science project in general.
Problem Understanding
There is a lot of information stored in the traffic flow of any city. This data, when mined by location, can provide information about the major attractions of the city and can help us understand its various zones, such as residential areas, office/school zones, highways, etc. This can help governments and other institutions plan the city better and enforce suitable rules and regulations accordingly: for example, different speed limits in school and residential zones compared to highway zones.
The data, when monitored over time, can help us identify rush hours, holiday seasons, the impact of weather, etc. This knowledge can be applied to better planning and traffic management. At large, this can improve the efficiency of the city and can also help avoid disasters, or at least allow faster redirection of traffic flow after accidents.
However, this is all looking at the bigger problem. This tutorial will concentrate only on the problem of identifying the five boroughs of New York City using the k-means algorithm, so as to get a better understanding of the algorithm while learning to tackle a data science problem.
You only need to use the Uber data from 2014. You will find the following .csv files in the Kaggle link mentioned above:
uber-raw-data-apr14.csv
uber-raw-data-may14.csv
uber-raw-data-jun14.csv
uber-raw-data-jul14.csv
uber-raw-data-aug14.csv
uber-raw-data-sep14.csv
This tutorial makes use of various libraries. Remember that when you work locally, you might have to install them. You can easily do so using install.packages().
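For example, the packages used in this tutorial can be installed in one call:
install.packages(c("dplyr", "lubridate", "VIM", "ggplot2", "ggmap", "DT"))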
data-apr14.csv")
data-may14.csv")
data-jun14.csv")
data-jul14.csv")
data-sep14.csv")
Let's bind all the data files into one. For this, you can use the bind_rows() function under the dplyr library in R.
library(dplyr)
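# Stack the six monthly data frames into a single data frame.
data14 <- bind_rows(apr14, may14, jun14, jul14, aug14, sep14)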
So far, so good! Let's get a summary of the data to get an idea of what you are dealing with.
summary(data14)
One column worth explaining up front: Base is the TLC base company code affiliated with the Uber pickup.
Data Preparation
This step consists of cleaning and rearranging your data so that you can work on it more easily. It's a good idea to first
think of the sparsity of the dataset and check the amount of missing data.
library(VIM)
aggr(data14)
As you can see, the dataset has no missing values. However, this might not always be the case with real datasets, and you will have to decide how you want to deal with such values. Some popular methods include deleting the particular row/column or replacing the missing entries with the column mean.
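As a sketch, on a hypothetical data frame df with a numeric column x, the two approaches might look like this:
df_complete <- df[!is.na(df$x), ]              # option 1: drop rows with a missing x
df$x[is.na(df$x)] <- mean(df$x, na.rm = TRUE)  # option 2: impute the column mean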
You can see that the first column is Date.Time. To be able to use these values, you need to separate them. Let's do that using the lubridate library. Lubridate makes it simple to identify the order in which the year, month, and day appear in your dates, and to manipulate them.
library(lubridate)
# Parse the Date.Time strings, then derive the date/time components as factors.
data14$Date.Time <- mdy_hms(data14$Date.Time)
data14$Year <- factor(year(data14$Date.Time))
data14$Month <- factor(month(data14$Date.Time))
data14$Day <- factor(day(data14$Date.Time))
data14$Weekday <- factor(wday(data14$Date.Time))
data14$Hour <- factor(hour(data14$Date.Time))
data14$Minute <- factor(minute(data14$Date.Time))
data14$Second <- factor(second(data14$Date.Time))
Let's check out the first few rows to see what our data looks like now:
head(data14, n=10)
Date.Time            Lat     Lon      Base   Year Month Day Weekday Hour Minute Second
2014-04-01 00:11:00  40.7690 -73.9549 B02512 2014 4     1   3       0    11     0
2014-04-01 00:17:00  40.7267 -74.0345 B02512 2014 4     1   3       0    17     0
2014-04-01 00:21:00  40.7316 -73.9873 B02512 2014 4     1   3       0    21     0
2014-04-01 00:28:00  40.7588 -73.9776 B02512 2014 4     1   3       0    28     0
2014-04-01 00:33:00  40.7594 -73.9722 B02512 2014 4     1   3       0    33     0
2014-04-01 00:33:00  40.7383 -74.0403 B02512 2014 4     1   3       0    33     0
2014-04-01 00:39:00  40.7223 -73.9887 B02512 2014 4     1   3       0    39     0
2014-04-01 00:45:00  40.7620 -73.9790 B02512 2014 4     1   3       0    45     0
2014-04-01 00:55:00  40.7524 -73.9960 B02512 2014 4     1   3       0    55     0
2014-04-01 01:01:00  40.7575 -73.9846 B02512 2014 4     1   3       1    1      0
Awesome!
For this case study, this is the only data manipulation you will need for a good understanding of the data and for working with k-means clustering.
Now would be a good time to divide your data into training and test sets. This is an important step in every data science project: you train the model on the training set, determine the values of any required parameters, and finally test the model on the test set. For example, when working with clustering algorithms, this division is done so that you can identify parameters such as k, the number of clusters in k-means clustering. However, for this case study, you already know the number of clusters to expect: 5, the number of boroughs in NYC. Hence, you will not be working in the traditional way; instead, the focus stays primarily on learning about k-means clustering.
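For reference, even though we skip the split here, a typical 80/20 division might look like this sketch:
set.seed(42)                                                # reproducible sampling
train_idx <- sample(nrow(data14), floor(0.8 * nrow(data14)))
train14 <- data14[train_idx, ]                              # training set
test14  <- data14[-train_idx, ]                             # test set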
Have a look at DataCamp's Python Machine Learning: Scikit-Learn Tutorial for a project that guides you through all the steps of a data science (machine learning) project using Python. You will also work with the k-means algorithm in that tutorial.
Now, before diving into the R code, let's learn about the k-means clustering algorithm.
K-means clustering is the most commonly used unsupervised machine learning algorithm for dividing a given dataset into k clusters. Here, k represents the number of clusters and must be provided by the user. You already know k in the case of the Uber dataset: it is 5, the number of boroughs. k-means is a good algorithm choice for the Uber 2014 dataset, since you do not know the target labels (making the problem unsupervised) and there is a pre-specified k value.
Here you are using clustering to classify the pickup points into the various boroughs. The general scenario in which you would use clustering is when you want to learn more about your dataset: you run clustering several times, investigate the interesting clusters, and note down some of the insights you get. Clustering is more of a tool to help you explore a dataset and should not always be used as an automatic method to classify data. Hence, you may not always deploy a clustering algorithm in a real-world production scenario. Clustering results are often too unreliable, and a single clustering alone will not give you all the information you can extract from a dataset.
The basic idea behind k-means clustering consists of defining clusters so that the total intra-cluster variation (known as the total within-cluster variation) is minimized. There are several k-means algorithms available. However, the standard algorithm defines the total within-cluster variation as the sum of squared Euclidean distances between items and the corresponding centroid:
W(C_k) = \sum_{x_i \in C_k} (x_i - \mu_k)^2
where x_i is a data point belonging to the cluster C_k, and \mu_k is the mean value of the points assigned to the cluster C_k.
Each observation x_i is assigned to a given cluster such that the sum-of-squares distance of the observation to its assigned cluster center \mu_k is minimized.
The k-means algorithm can be summarized in the following steps:
1. Specify the number of clusters (k) to be created.
2. Randomly select k objects from the dataset as the initial cluster centers.
3. Assign each observation to its closest centroid, based on the Euclidean distance between the object and the centroid.
4. For each of the k clusters, recompute the cluster centroid by calculating the new mean value of all the data points in the cluster.
5. Iteratively minimize the total within sum of squares: repeat Step 3 and Step 4 until the centroids stop changing or the maximum number of iterations is reached (R uses 10 as the default maximum).
The total within sum of squares, or total within-cluster variation, is defined as:

\sum_{k=1}^{K} W(C_k) = \sum_{k=1}^{K} \sum_{x_i \in C_k} (x_i - \mu_k)^2

This is the sum, over all K clusters, of the squared Euclidean distances between items and their corresponding centroid.
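To make Steps 2 to 5 concrete, here is a minimal from-scratch sketch of the standard (Lloyd's) algorithm. The function name kmeans_sketch is made up for this illustration; in practice you would use the built-in kmeans() shown next.
kmeans_sketch <- function(x, k, iter.max = 10) {
  x <- as.matrix(x)
  centers <- x[sample(nrow(x), k), , drop = FALSE]  # Step 2: k random points as initial centroids
  cl <- integer(nrow(x))
  for (i in seq_len(iter.max)) {
    # Step 3: Euclidean distance from every point to every centroid
    # (fine for small data; real implementations avoid the full distance matrix)
    d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k]
    new_cl <- max.col(-d)                           # index of the nearest centroid per point
    if (identical(new_cl, cl)) break                # Step 5: assignments stable, converged
    cl <- new_cl
    # Step 4: recompute each centroid as the mean of its assigned points
    for (j in seq_len(k))
      centers[j, ] <- colMeans(x[cl == j, , drop = FALSE])
  }
  list(cluster = cl, centers = centers)             # (empty clusters not handled in this sketch)
}
Note how the loop mirrors the cluster assignment and centroid update steps, stopping once the assignments no longer change.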
Now that you have seen the theory, let's run the algorithm on the Uber data and see the results!
You can use the kmeans() function in R. The k value will be set to 5. There is also an nstart option that attempts multiple initial configurations and reports the best one. Setting a seed gives the random number generator a fixed starting point, so that each time your code is run, the same answer is generated.
set.seed(20)
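# Fit k-means on the pickup coordinates; selecting the Lat and Lon columns
# is inferred from the str() output below, and nstart = 20 is an illustrative choice.
clusters <- kmeans(data14[, c("Lat", "Lon")], centers = 5, nstart = 20)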
str(clusters)
List of 9
$ centers : num [1:5, 1:2] 40.7 40.8 40.8 40.7 40.7 ...
$ iter : int 4
$ ifault : int 0
The above list is the output of the kmeans() function (abbreviated here). Let's look closely at some of the important components:
cluster: a vector of integers (from 1:k) indicating the cluster to which each point is allocated.
centers: a matrix of cluster centers.
Let's plot some graphs to visualize the data as well as the results of the k-means clustering.
library(ggmap)
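A minimal sketch that colours each pickup by its assigned cluster is shown below; plain ggplot2 is enough to see the borough structure, while ggmap can overlay the same layers on a New York basemap.
library(ggplot2)
# Attach each pickup's cluster label as a Borough factor, then plot the coordinates.
data14$Borough <- as.factor(clusters$cluster)
ggplot(data14, aes(x = Lon, y = Lat, colour = Borough)) +
  geom_point(size = 0.1, alpha = 0.3) +
  coord_quickmap()  # keep an approximately correct aspect ratio for map data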
The boroughs (clusters) formed are matched against the real boroughs. The cluster numbers correspond to the following boroughs:
1. Bronx
2. Manhattan
3. Brooklyn
4. Staten Island
5. Queens
As you can see, the results are pretty impressive. Now that you have used k-means to categorize the pickup points, you have additional knowledge attached to the dataset. Let's do something with this newfound knowledge: you can use the borough information to check out Uber's growth within each borough for each month. Here's how...
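One way to build the month_borough_14 table, assuming the Borough labels attached in the plotting step above:
library(dplyr)
# Count pickups per month and borough.
month_borough_14 <- data14 %>%
  count(Month, Borough) %>%
  arrange(Month, Borough)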
library(DT)
datatable(month_borough_14)
Finally, a note on how k-means relates to lazy learning. A lazy learner delays generalizing from the training data until a query is made to the system: it starts working only when you trigger it, so lazy methods can construct a different local approximation of the target function for each encountered query. That makes them suitable for online learning, but they require a possibly large amount of memory to store the data, and each request involves building a local model from scratch. k-means itself is not lazy in this sense: it eagerly summarizes the training data into k centroids up front, after which new points can be assigned cheaply to their nearest centroid.