Clustering On Boston Dataset

The document describes k-means clustering and provides R code to demonstrate it. It explains the algorithm's user-supplied inputs: the number of clusters k, the number of initializing trials nstart, and the maximum number of iterations iter.max. It then runs k-means on the Boston housing data with 3 clusters as a test case and recomputes the within-cluster sum of squares for each cluster, and overall, to validate the kmeans() results.


###

#
# Introduction to Clustering demo
#
# k-means clustering
#
# History:
#
# 2018/07/11 Initial code (copied from other example documents) walter johnston
#
###
#
# k-means algorithm description/explanation:
#
# user-supplied input parameters:
#
# k: number of clusters
# nstart: number of initializing trials
# iter.max: maximum number of iterations (repetitions)
#
# (1) randomize observations into "k" initial groups (keep best of "nstart" trials)
# (2) calculate centroid (vector of arithmetic means) for each cluster
# (3) calculate within-cluster sum of squares (WSS) for each cluster (retain value)
# (4) re-assign observations into closest cluster
# (distance from centroid; retain count of re-assignments)
# (5) if iterations <= "iter.max" and count of re-assignments > 0, go to (2)
# (6) finished
#
# (a minimal sketch of these steps follows this comment block)
#
###
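
###
#
# illustration (added sketch, base R only): a minimal, single-trial version
# of the algorithm described above. "naiveKMeans" is an invented name, not
# from any package -- the actual analysis below uses stats::kmeans(), which
# also keeps the best of "nstart" trials; this sketch runs one trial and
# assumes no cluster empties out during iteration.
#
###

naiveKMeans <- function(x, k, iter.max=10) {
  x <- as.matrix(x)
  cluster <- sample(rep_len(1:k, nrow(x)))     # (1) random initial groups
  for (iter in 1:iter.max) {                   # (5) iteration limit
    # (2) centroid (vector of column means) for each cluster: k x p matrix
    centers <- apply(x, 2, function(col) tapply(col, cluster, mean))
    # squared euclidean distance from every observation to every centroid
    d <- sapply(1:k, function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    newCluster <- max.col(-d)                  # (4) closest centroid wins
    if (all(newCluster == cluster)) break      # no re-assignments: (6) finished
    cluster <- newCluster
  }
  centers <- apply(x, 2, function(col) tapply(col, cluster, mean))
  withinss <- sapply(1:k, function(j)          # (3) WSS within each cluster
    sum(sweep(x[cluster == j, , drop=FALSE], 2, centers[j, ])^2))
  list(cluster=cluster, centers=centers,
       withinss=withinss, tot.withinss=sum(withinss))
}

# example (not run; Boston is loaded further down):
#   str( naiveKMeans(Boston, k=3, iter.max=20) )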
# code to fetch the packages if they are not present

if ( !require(MASS) )      { install.packages('MASS');      library(MASS) }
if ( !require(tidyverse) ) { install.packages('tidyverse'); library(tidyverse) }
if ( !require(broom) )     { install.packages('broom');     library(broom) }

data(Boston) # from MASS

###
#
# filter for complete cases (no missing data)
#
###
myBoston <- Boston[ complete.cases(Boston), ]

dim(Boston)
dim(myBoston)
###
#
# kmeans: fixed number of clusters
#
# cluster data based on error-sum-of-squares
# random starting points
# user specified number of clusters
# user specified number of starting trials (best one is automatically selected)
# user specified maximum number of iterations
#
###
# user choices
#
seed <- 1 # random number generator seed
minClusters <- 1 # minimum number of clusters (see code)
maxClusters <- 20 # maximum number of clusters (see code)
km.nstart <- 10 # number of starting trials
km.iter.max <- 20 # iteration limit

###
# start skip of code
###

# track the behavior (initialize to an illegal value) [ total within-ss ] (elbow method)
#sse <- rep(-1, maxClusters)

# test results for each clustering scenario


#for (i in minClusters:maxClusters) {
#  set.seed(seed)                     # reset RNG each time
#  sse[i] <- kmeans(myBoston,
#                   centers=i,
#                   nstart=km.nstart,
#                   iter.max=km.iter.max)$tot.withinss
#}

# plot results to find optimal tradeoff (num clusters v. total within-ss)


#plot( minClusters:maxClusters,
#      sse,
#      type="b",
#      xlab="Number of Clusters",
#      ylab="Aggregate within cluster SSE (sum of squared error)")

# zoom in to identify the choice more easily


#clust1 <- 3
#clust2 <- 8
#sse2 <- sse[ clust1:clust2 ]
#plot( clust1:clust2,
#      sse2,
#      type="b",
#      xlab="Number of Clusters",
#      ylab="Aggregate within cluster SSE (sum of squared error)")

###
# skip to here
###

# test case: 3 clusters


set.seed(seed)
t <- kmeans(myBoston,
            centers=3,
            nstart=km.nstart,
            iter.max=km.iter.max)

# measures of interest
table(t$cluster)
t$tot.withinss # overall
t$withinss # by cluster

# reconstruct the error measurements


t2 <- myBoston
t2$cluster <- t$cluster # add cluster number for each row (observation)

# calculate penalty function (within group error sum of squares)


wssf <- function(df) {
  t <- scale(df, center=T, scale=F)   # center each column around its mean
  return( sum(t^2) )                  # sum of squared deviations = WSS
}

# by cluster
t3 <- t2 %>%
  group_by(cluster) %>%
  do( data.frame(wss = wssf(.)) )     # the constant 'cluster' column centers to zero, so it adds nothing to the WSS

sum(t3$wss) # overall (sum(t3) would wrongly include the cluster labels)
t3 # by cluster
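
# the same per-cluster figures can be cross-checked without do() (which
# newer dplyr versions treat as superseded), using base R split()/sapply():
sapply( split(myBoston, t$cluster), wssf )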

# individual divergences
t$withinss - t3$wss

# overall divergence
t$tot.withinss - sum(t3$wss)

###
#
# k-means WSS calculations working correctly
#
# now, apply it to hclust() to select a number of clusters
#
# use: squared euclidean distance as metric for clustering
# method="complete"
#
###
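
###
#
# sketch only (added here, not run, and not part of the analysis above):
# one way the hclust() step described above might look, using complete
# linkage on squared Euclidean distances, with a cut at k=3 so the result
# can be compared against the k-means assignment
#
###

#d2 <- dist(myBoston)^2              # squared euclidean distances
#hc <- hclust(d2, method="complete") # complete-linkage hierarchical clustering
#plot(hc)                            # dendrogram: inspect to choose a cut
#table( cutree(hc, k=3), t$cluster ) # cross-tabulate with the k-means clusters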
