Assignment Clustering
Cluster analysis is part of unsupervised learning. A cluster is a group of data points that share similar features. We can say that clustering analysis is more about discovery than prediction: the machine searches for similarity in the data. For instance, you can use cluster analysis for applications such as customer segmentation, which is the example developed below.
Clustering analysis is not difficult to implement, and its results are both meaningful and actionable for business.
The most striking difference between supervised and unsupervised learning lies in the results. Unsupervised learning creates a new variable, the label, while supervised learning predicts an outcome. The machine helps the practitioner label the data based on close relatedness; it is then up to the analyst to make use of the groups and give them names.
Let's work through an example to understand the concept of clustering. For simplicity, we work in two dimensions. You have data on the total spend of customers and their ages. To improve advertising, the marketing team wants to send more targeted emails to its customers.
In the following graph, you plot the total spend and the age of the customers.
library(ggplot2)
df <- data.frame(age = c(18, 21, 22, 24, 26, 26, 27, 30, 31, 35, 39, 40, 41, 42, 44, 46, 47, 48, 49, 54),
                 spend = c(10, 11, 22, 15, 12, 13, 14, 33, 39, 37, 44, 27, 29, 20, 28, 21, 30, 31, 23, 24))
ggplot(df, aes(x = age, y = spend)) +
  geom_point()
A pattern is visible at this point:
1. At the bottom-left are young people with lower purchasing power.
2. The upper-middle reflects working-age people who can afford to spend more.
3. Finally, at the right are older people with a lower budget.
In the figure above, you cluster the observations by hand and define each of the three groups. This example is somewhat straightforward and highly visual. If new observations are appended to the data set, you can label them within the circles. You define the circles based on your own judgment. Instead, you can use machine learning to group the data objectively.
K-means algorithm
The first step when using k-means clustering is to indicate the number of clusters (k) that will be generated in the final
solution.
The algorithm starts by randomly selecting k objects from the data set to serve as the initial centers for the clusters. The
selected objects are also known as cluster means or centroids.
Next, each of the remaining objects is assigned to its closest centroid, where "closest" is defined using the Euclidean distance between the object and the cluster mean. This step is called the "cluster assignment step". Note that, to use correlation distance, the data are input as z-scores.
After the assignment step, the algorithm computes the new mean value of each cluster. The term "centroid update" is used to describe this step. Now that the centers have been recalculated, every observation is checked again to see whether it might be closer to a different cluster. All the objects are then reassigned using the updated cluster means.
The cluster assignment and centroid update steps are repeated iteratively until the cluster assignments stop changing (i.e., until convergence is achieved). That is, the clusters formed in the current iteration are the same as those obtained in the previous iteration.
We'll use the demo data set "USArrests". The data should be prepared as described in chapter @ref(data-preparation-and-r-packages). The data must contain only continuous variables, as the k-means algorithm uses variable means. As we don't want the k-means algorithm to depend on an arbitrary variable unit, we start by scaling the data using the R function scale() as follows:
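# USArrests ships with base R; scale() standardizes each column to
# mean 0 and unit variance so no variable dominates the distances.
df <- scale(USArrests)
head(df, n = 3)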
The standard R function for k-means clustering is kmeans() [stats package], whose simplified format is as follows:
kmeans(x, centers, iter.max = 10, nstart = 1)
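For instance, a call on the scaled data might look like this (the values of centers and nstart below are illustrative only):
km.res <- kmeans(df, centers = 4, nstart = 25)
km.res$size     # number of observations in each cluster
km.res$centers  # matrix of the final cluster means
With nstart = 25, kmeans() tries 25 random initial configurations and keeps the one with the lowest total within-cluster variation.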
Clustering: Types
Clustering can be broadly divided into two subgroups:
Hard clustering: in hard clustering, each data object or point either belongs to a cluster completely or not at all. For example, in the Uber dataset, each location belongs to exactly one borough.
Soft clustering: in soft clustering, a data point can belong to more than one cluster with some probability or likelihood value. For example, you could identify some locations as border points belonging to two or more boroughs.
K-Means Clustering
In this section, you will work with the Uber dataset, which contains data generated by Uber for the city of New York. Uber Technologies Inc. is a peer-to-peer ride-sharing platform. Don't worry if you don't know much about Uber; all you need to know is that the Uber platform connects you with (cab) drivers who can drive you to your destination. The data is freely available on Kaggle. The dataset contains raw data on Uber pickups, with information such as the date and time of the trip along with longitude-latitude information.
New York City has five boroughs: Brooklyn, Queens, Manhattan, the Bronx, and Staten Island. At the end of this mini-project, you will apply k-means clustering to the dataset to explore it better and identify the different boroughs within New York. Along the way, you will also learn the various steps that you should take when working on a data science project in general.
Problem Understanding
There is a lot of information stored in the traffic flow of any city. This data, when mined by location, can provide information about the major attractions of the city and can help us understand its various zones, such as residential areas, office/school zones, highways, etc. This can help governments and other institutions plan the city better and enforce suitable rules and regulations accordingly: for example, different speed limits in school and residential zones compared to highway zones.
The data, when monitored over time, can help us identify rush hours, holiday seasons, the impact of weather, etc. This knowledge can be applied to better planning and traffic management. At large, this can improve the efficiency of the city and can also help avoid disasters, or at least allow faster redirection of traffic flow after accidents.
However, this is all looking at the bigger problem. This tutorial will concentrate only on the problem of identifying the five boroughs of New York City using the k-means algorithm, so as to get a better understanding of the algorithm while learning to tackle a data science problem.
You only need to use the Uber data from 2014. You will find the following .csv files in the Kaggle link mentioned above:
uber-raw-data-apr14.csv
uber-raw-data-may14.csv
uber-raw-data-jun14.csv
uber-raw-data-jul14.csv
uber-raw-data-aug14.csv
uber-raw-data-sep14.csv
This tutorial makes use of various libraries. Remember that when you work locally, you might have to install them. You can easily do so using install.packages().
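For example, the packages used in this tutorial can be installed in one call:
install.packages(c("dplyr", "lubridate", "VIM", "ggplot2", "ggmap", "DT"))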
data-apr14.csv")
data-may14.csv")
data-jun14.csv")
data-jul14.csv")
data-sep14.csv")
Let's bind all the data files into one. For this, you can use the bind_rows() function under the dplyr library in R.
library(dplyr)
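# Stack the six monthly data frames into a single data frame.
data14 <- bind_rows(apr14, may14, jun14, jul14, aug14, sep14)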
So far, so good! Let's get a summary of the data to get an idea of what you are dealing with.
summary(data14)
One column worth explaining up front: Base is the TLC base company code affiliated with the Uber pickup.
Data Preparation
This step consists of cleaning and rearranging your data so that you can work on it more easily. It's a good idea to first
think of the sparsity of the dataset and check the amount of missing data.
library(VIM)
aggr(data14)
As you can see, the dataset has no missing values. However, this might not always be the case with real datasets, and you will have to decide how you want to deal with such values. Some popular methods include deleting the particular row/column or replacing the missing entries with the column mean.
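As a sketch, on a hypothetical data frame df with a numeric column x, the two approaches might look like this:
df_complete <- df[!is.na(df$x), ]              # option 1: drop rows with a missing x
df$x[is.na(df$x)] <- mean(df$x, na.rm = TRUE)  # option 2: impute the column mean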
You can see that the first column is Date.Time. To be able to use these values, you need to separate them. Let's do that using the lubridate library. Lubridate makes it simple to identify the order in which the year, month, and day appear in your dates, and to manipulate them.
library(lubridate)
# Parse the Date.Time strings, then derive the date/time components as factors.
data14$Date.Time <- mdy_hms(data14$Date.Time)
data14$Year <- factor(year(data14$Date.Time))
data14$Month <- factor(month(data14$Date.Time))
data14$Day <- factor(day(data14$Date.Time))
data14$Weekday <- factor(wday(data14$Date.Time))
data14$Hour <- factor(hour(data14$Date.Time))
data14$Minute <- factor(minute(data14$Date.Time))
data14$Second <- factor(second(data14$Date.Time))
Let's check out the first few rows to see what our data looks like now:
head(data14, n=10)
Date.Time            Lat     Lon      Base   Year Month Day Weekday Hour Minute Second
2014-04-01 00:11:00  40.7690 -73.9549 B02512 2014 4     1   3       0    11     0
2014-04-01 00:17:00  40.7267 -74.0345 B02512 2014 4     1   3       0    17     0
2014-04-01 00:21:00  40.7316 -73.9873 B02512 2014 4     1   3       0    21     0
2014-04-01 00:28:00  40.7588 -73.9776 B02512 2014 4     1   3       0    28     0
2014-04-01 00:33:00  40.7594 -73.9722 B02512 2014 4     1   3       0    33     0
2014-04-01 00:33:00  40.7383 -74.0403 B02512 2014 4     1   3       0    33     0
2014-04-01 00:39:00  40.7223 -73.9887 B02512 2014 4     1   3       0    39     0
2014-04-01 00:45:00  40.7620 -73.9790 B02512 2014 4     1   3       0    45     0
2014-04-01 00:55:00  40.7524 -73.9960 B02512 2014 4     1   3       0    55     0
2014-04-01 01:01:00  40.7575 -73.9846 B02512 2014 4     1   3       1    1      0
Awesome!
For this case study, this is the only data manipulation you will need for a good understanding of the data and for working with k-means clustering.
Now would be a good time to divide your data into training and test sets. This is an important step in every data science project: you train the model on the training set, determine the values of any required parameters, and finally test the model on the test set. For example, when working with clustering algorithms, this division is done so that you can identify parameters such as k, the number of clusters in k-means clustering. However, for this case study, you already know the number of clusters to expect: 5, the number of boroughs in NYC. Hence, you will not be working in the traditional way; instead, the focus stays primarily on learning about k-means clustering.
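For reference, even though we skip the split here, a typical 80/20 division might look like this sketch:
set.seed(42)                                                # reproducible sampling
train_idx <- sample(nrow(data14), floor(0.8 * nrow(data14)))
train14 <- data14[train_idx, ]                              # training set
test14  <- data14[-train_idx, ]                             # test set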
Have a look at DataCamp's Python Machine Learning: Scikit-Learn Tutorial for a project that guides you through all the steps of a data science (machine learning) project using Python. You will also work with the k-means algorithm in that tutorial.
Now, before diving into the R code, let's learn about the k-means clustering algorithm.
K-means clustering is the most commonly used unsupervised machine learning algorithm for dividing a given dataset into k clusters. Here, k represents the number of clusters and must be provided by the user. You already know k in the case of the Uber dataset: it is 5, the number of boroughs. k-means is a good algorithm choice for the Uber 2014 dataset, since you do not know the target labels (making the problem unsupervised) and there is a pre-specified k value.
Here you are using clustering to classify the pickup points into the various boroughs. The general scenario in which you would use clustering is when you want to learn more about your dataset: you run clustering several times, investigate the interesting clusters, and note down some of the insights you get. Clustering is more of a tool to help you explore a dataset and should not always be used as an automatic method to classify data. Hence, you may not always deploy a clustering algorithm in a real-world production scenario. Clustering results are often too unreliable, and a single clustering alone will not give you all the information you can extract from a dataset.
The basic idea behind k-means clustering consists of defining clusters so that the total intra-cluster variation (known as the total within-cluster variation) is minimized. There are several k-means algorithms available. However, the standard algorithm defines the total within-cluster variation as the sum of squared Euclidean distances between items and the corresponding centroid:
W(C_k) = \sum_{x_i \in C_k} (x_i - \mu_k)^2
where x_i is a data point belonging to the cluster C_k, and \mu_k is the mean value of the points assigned to the cluster C_k.
Each observation x_i is assigned to a given cluster such that the sum-of-squares distance of the observation to its assigned cluster center \mu_k is minimized.
The k-means algorithm can be summarized in the following steps:
1. Specify the number of clusters (k) to be created.
2. Randomly select k objects from the dataset as the initial cluster centers.
3. Assign each observation to its closest centroid, based on the Euclidean distance between the object and the centroid.
4. For each of the k clusters, recompute the cluster centroid by calculating the new mean value of all the data points in the cluster.
5. Iteratively minimize the total within sum of squares: repeat Step 3 and Step 4 until the centroids stop changing or the maximum number of iterations is reached (R uses 10 as the default maximum).
The total within sum of squares, or total within-cluster variation, is defined as:

\sum_{k=1}^{K} W(C_k) = \sum_{k=1}^{K} \sum_{x_i \in C_k} (x_i - \mu_k)^2

This is the sum, over all K clusters, of the squared Euclidean distances between items and their corresponding centroid.
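To make Steps 2 to 5 concrete, here is a minimal from-scratch sketch of the standard (Lloyd's) algorithm. The function name kmeans_sketch is made up for this illustration; in practice you would use the built-in kmeans() shown next.
kmeans_sketch <- function(x, k, iter.max = 10) {
  x <- as.matrix(x)
  centers <- x[sample(nrow(x), k), , drop = FALSE]  # Step 2: k random points as initial centroids
  cl <- integer(nrow(x))
  for (i in seq_len(iter.max)) {
    # Step 3: Euclidean distance from every point to every centroid
    # (fine for small data; real implementations avoid the full distance matrix)
    d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k]
    new_cl <- max.col(-d)                           # index of the nearest centroid per point
    if (identical(new_cl, cl)) break                # Step 5: assignments stable, converged
    cl <- new_cl
    # Step 4: recompute each centroid as the mean of its assigned points
    for (j in seq_len(k))
      centers[j, ] <- colMeans(x[cl == j, , drop = FALSE])
  }
  list(cluster = cl, centers = centers)             # (empty clusters not handled in this sketch)
}
Note how the loop mirrors the cluster assignment and centroid update steps, stopping once the assignments no longer change.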
Now that you have seen the theory, let's run the algorithm on the Uber data and see the results!
You can use the kmeans() function in R. The k value will be set to 5. There is also an nstart option that attempts multiple initial configurations and reports the best one. Setting a seed gives the random number generator a fixed starting point, so that each time your code is run, the same answer is generated.
set.seed(20)
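# Fit k-means on the pickup coordinates; selecting the Lat and Lon columns
# is inferred from the str() output below, and nstart = 20 is an illustrative choice.
clusters <- kmeans(data14[, c("Lat", "Lon")], centers = 5, nstart = 20)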
str(clusters)
List of 9
$ centers : num [1:5, 1:2] 40.7 40.8 40.8 40.7 40.7 ...
$ iter : int 4
$ ifault : int 0
The above list is the output of the kmeans() function (abbreviated here). Let's look closely at some of the important components:
cluster: a vector of integers (from 1:k) indicating the cluster to which each point is allocated.
centers: a matrix of cluster centers.
Let's plot some graphs to visualize the data as well as the results of the k-means clustering.
library(ggmap)
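A minimal sketch that colours each pickup by its assigned cluster is shown below; plain ggplot2 is enough to see the borough structure, while ggmap can overlay the same layers on a New York basemap.
library(ggplot2)
# Attach each pickup's cluster label as a Borough factor, then plot the coordinates.
data14$Borough <- as.factor(clusters$cluster)
ggplot(data14, aes(x = Lon, y = Lat, colour = Borough)) +
  geom_point(size = 0.1, alpha = 0.3) +
  coord_quickmap()  # keep an approximately correct aspect ratio for map data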
The boroughs (clusters) formed are matched against the real boroughs. The cluster numbers correspond to the following boroughs:
1. Bronx
2. Manhattan
3. Brooklyn
4. Staten Island
5. Queens
As you can see, the results are pretty impressive. Now that you have used k-means to categorize the pickup points, you have additional knowledge attached to the dataset. Let's do something with this newfound knowledge: you can use the borough information to check out Uber's growth within each borough for each month. Here's how...
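One way to build the month_borough_14 table, assuming the Borough labels attached in the plotting step above:
library(dplyr)
# Count pickups per month and borough.
month_borough_14 <- data14 %>%
  count(Month, Borough) %>%
  arrange(Month, Borough)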
library(DT)
datatable(month_borough_14)
Finally, a note on how k-means relates to lazy learning. A lazy learner delays generalizing from the training data until a query is made to the system: it starts working only when you trigger it, so lazy methods can construct a different local approximation of the target function for each encountered query. That makes them suitable for online learning, but they require a possibly large amount of memory to store the data, and each request involves building a local model from scratch. k-means itself is not lazy in this sense: it eagerly summarizes the training data into k centroids up front, after which new points can be assigned cheaply to their nearest centroid.