ISYE6501 Homework 2
2024-01-22
rm(list=ls())
setwd("~/Georgia Tech - OMSA/ISYE 6501")
if(!require(pacman)) install.packages("pacman")
library(pacman)
p_load(kernlab, tinytex, kknn, caret, magrittr, rio, tidyverse)
Question 3.1 Using the same data set as in Question 2.2, use the ksvm or kknn
function to find a good classifier
3.1.a Using cross-validation (do this for the k-nearest-neighbors model; SVM is optional)
Load the credit card data into a data frame (df) and display the first rows
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
## 1 1 30.83 0.000 1.25 1 0 1 1 202 0 1
## 2 0 58.67 4.460 3.04 1 0 6 1 43 560 1
## 3 0 24.50 0.500 1.50 1 1 0 1 280 824 1
## 4 1 27.83 1.540 3.75 1 0 5 0 100 3 1
## 5 1 20.17 5.625 1.71 1 1 0 1 120 0 1
## 6 1 32.08 4.000 2.50 1 1 0 0 360 0 1
#tail(df)
In the following code block, we run cross-validation on the full data set (df) using the
train() function from the “caret” package with 20 candidate values of k. The result
identifies k = 5 as the optimal value, with an accuracy of ~69% and a Cohen’s kappa of ~0.37
(agreement after accounting for chance).
set.seed(1)
#Define parameters
statctrl <- trainControl(
method = "repeatedcv", # Repeated cross-validation
number = 10, # Number of folds
repeats = 3 # Number of complete sets of folds
)
#Run fit
fit
## k-Nearest Neighbors
##
## 654 samples
## 10 predictor
## 2 classes: ’0’, ’1’
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 589, 589, 588, 589, 588, 589, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.6917172 0.3692819
## 7 0.6774204 0.3378301
## 9 0.6764336 0.3354226
## 11 0.6733955 0.3313479
## 13 0.6703186 0.3262664
## 15 0.6626962 0.3109202
## 17 0.6631469 0.3113559
## 19 0.6601243 0.3047398
## 21 0.6616550 0.3064370
## 23 0.6702875 0.3231283
## 25 0.6763481 0.3340599
## 27 0.6753380 0.3327768
## 29 0.6773271 0.3370988
## 31 0.6778555 0.3380353
## 33 0.6788967 0.3403241
## 35 0.6839627 0.3506076
## 37 0.6895804 0.3631481
## 39 0.6880653 0.3602042
## 41 0.6885703 0.3612929
## 43 0.6855245 0.3545281
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
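The train() call itself does not appear in the output above. A minimal sketch of what such a call might look like follows, assuming the formula interface and tuneLength = 20 (for KNN, caret's default grid then gives the 20 odd values k = 5, 7, ..., 43 seen in the table). The data frame demo_df is synthetic and hypothetical, standing in for the credit card data, so whatever it prints is not the homework's result.

```r
library(caret)

set.seed(1)

# Hypothetical stand-in for the credit card data: two numeric predictors
# and a binary response stored as a factor (needed for classification).
n <- 200
demo_df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
demo_df$y <- factor(ifelse(demo_df$x1 + demo_df$x2 + rnorm(n, sd = 0.5) > 0, 1, 0))

# Same resampling scheme as in the text: 10-fold CV repeated 3 times
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

# tuneLength = 20 asks caret to try 20 candidate values of k
knn_fit <- train(y ~ ., data = demo_df, method = "knn",
                 trControl = ctrl, tuneLength = 20)

knn_fit$bestTune   # the k with the highest cross-validated accuracy
```

The output above reports "No pre-processing". Because KNN is distance-based and columns such as V9 and V10 are on a much larger scale than the others, adding preProcess = c("center", "scale") to train() may be worth trying.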
3.1.b Splitting the data into training, validation, and test data sets (pick either KNN or SVM;
the other is optional)
In the code block below, we split the data randomly into train and test data sets, with
train containing 70% of the data (458 samples) and test containing 30% of the data (196
samples). We then perform cross-validation on the train data set (10 folds, repeated 3 times)
using train() from the “caret” package, as in the question above. The optimal value of k for
the train data set is 5, with an accuracy of ~69% and a kappa of ~0.36 (agreement after
accounting for chance).
#Split the data randomly into train (70%) and test (30%) data sets
train <- df %>% sample_frac(.70)
test <- anti_join(df, train)
## Joining with ‘by = join_by(V1, V2, V3, V4, V5, V6, V7, V8, V9, V10, V11)‘
#Define parameters
statctrl <- trainControl(
method = "repeatedcv", # Repeated cross-validation
number = 10, # Number of folds
repeats = 3 # Number of complete sets of folds
)
#Run fit
fit
## k-Nearest Neighbors
##
## 458 samples
## 10 predictor
## 2 classes: ’0’, ’1’
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 413, 412, 412, 413, 412, 412, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.6900805 0.3641546
## 7 0.6755233 0.3341961
## 9 0.6696618 0.3235942
## 11 0.6573430 0.3005527
## 13 0.6507568 0.2878012
## 15 0.6500322 0.2827889
## 17 0.6500322 0.2798894
## 19 0.6434138 0.2669186
## 21 0.6390499 0.2585996
## 23 0.6448953 0.2720385
## 25 0.6544122 0.2918998
## 27 0.6704348 0.3242520
## 29 0.6704831 0.3236226
## 31 0.6617069 0.3053828
## 33 0.6588567 0.2980426
## 35 0.6632045 0.3056490
## 37 0.6617552 0.3013254
## 39 0.6726409 0.3241827
## 41 0.6777134 0.3350740
## 43 0.6777778 0.3353016
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
Now let us apply that model to the test data set by first predicting on the test set and
then checking how well the model predicts V11. We obtained an accuracy of roughly 70%, with
a 95% confidence interval between ~0.63 and ~0.77.
## Sensitivity : 0.6923
## Specificity : 0.6582
## Pos Pred Value : 0.7500
## Neg Pred Value : 0.5909
## Prevalence : 0.5969
## Detection Rate : 0.4133
## Detection Prevalence : 0.5510
## Balanced Accuracy : 0.6753
##
## ’Positive’ Class : 0
##
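To make the statistics listed above concrete, here is a small base-R sketch of how they can be computed from a confusion matrix. The label vectors are hypothetical (they are not the homework's predictions), and class "0" is treated as the positive class to match the output above. The interval uses binom.test(), an exact binomial interval, which is one common choice for a confidence interval on accuracy.

```r
# Hypothetical true and predicted labels (not the homework's data)
actual    <- factor(c(0, 0, 0, 1, 1, 0, 1, 1, 0, 1), levels = c(0, 1))
predicted <- factor(c(0, 0, 1, 1, 1, 0, 0, 1, 0, 1), levels = c(0, 1))

# Rows are predictions, columns are true labels
cm <- table(Predicted = predicted, Actual = actual)

# Treat class "0" as the positive class, as in the output above
tp <- cm["0", "0"]; fn <- cm["1", "0"]
fp <- cm["0", "1"]; tn <- cm["1", "1"]

accuracy    <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)   # share of true "0" rows predicted as "0"
specificity <- tn / (tn + fp)   # share of true "1" rows predicted as "1"
balanced    <- (sensitivity + specificity) / 2

# Exact binomial 95% interval for the accuracy
ci <- binom.test(tp + tn, sum(cm))$conf.int

c(accuracy = accuracy, sensitivity = sensitivity,
  specificity = specificity, balanced = balanced)
```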
Question 4.1
Describe a situation or problem from your job, everyday life, current events, etc., for which
a clustering model would be appropriate. List some (up to 5) predictors that you might use.
During my time as a webmaster for a community college, the marketing department sought to enhance our
recruiting campaign through social media advertising. To optimize our outreach, we decided to employ
clustering models. This involved categorizing users based on their engagement patterns to tailor our advertising
content effectively.
We considered key predictors for clustering, including:
1. Engagement with College Posts: User interactions with posts on academic programs, events, and
campus life.
2. Exploration of Academic Pages: User visits to academic department pages, course details, and
admission information.
3. Participation in Virtual Campus Tours: Involvement in virtual tours and online events organized
by the college.
4. Field of Study Preferences: User preferences for specific fields of study or majors.
5. Extracurricular Interests: User engagement with content related to extracurricular activities and
student organizations.
Applying clustering models to these predictors allowed us to identify distinct user segments, enabling us to
tailor our social media advertising content for a more impactful recruiting campaign.
Question 4.2
Use the R function kmeans to cluster the points as well as possible. Report the best
combination of predictors, your suggested value of k, and how well your best clustering
predicts flower type.
Load the ggplot2 and datasets packages and the iris data, and display a summary
set.seed(88)
p_load(ggplot2, datasets,tidyverse)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
In the code block below, we make a dry run of kmeans() using all 4 columns (Sepal.Length,
Sepal.Width, Petal.Length, Petal.Width) as predictors and set k = 3, since we already know
from the Species column that there are 3 types of flowers. We also measure the accuracy by
comparing the resulting cluster labels to the actual species labels.
fitted_labels <- kmeans_result$cluster
correct_labels <- as.numeric(iris$Species)
## [1] 0.5
kmeans() seems to output a different result every time it is run, because it starts from
random centers and numbers its clusters arbitrarily, so the best accuracy we could find by
running the above code is around 0.5 (or 50%). The next code block graphs the kmeans result
using different shapes for Species and different colors for clusters. It displays
Petal.Length and Petal.Width on the x and y axes, but Sepal.Length and Sepal.Width would
work as well.
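Comparing cluster IDs directly with as.numeric(iris$Species) only gives credit when the arbitrary cluster numbers happen to line up with the species codes. One alternative, shown here as a hedged sketch and not as the method used above, is to label each cluster with its most common species before scoring. The names km, tab, and matched_accuracy are introduced only for this illustration, and nstart = 25 is an assumed setting that makes the result less dependent on the random start.

```r
set.seed(88)

# k-means on all four measurements; nstart = 25 keeps the best of 25 random starts
km <- kmeans(iris[, 1:4], centers = 3, nstart = 25)

# Cross-tabulate cluster IDs against the true species
tab <- table(Cluster = km$cluster, Species = iris$Species)

# Give each cluster the species that is most common inside it,
# then count the points that agree with their cluster's species
matched_accuracy <- sum(apply(tab, 1, max)) / nrow(iris)

tab
matched_accuracy
```

A majority vote can assign the same species to two clusters, so the cross-table is worth inspecting alongside the single accuracy number.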
[Figure: K-means Clustering of Iris Dataset (Accuracy: 0.5). Scatter plot of Petal Length (x-axis) versus Petal Width (y-axis); point shape shows Species (setosa, versicolor, virginica) and color shows Cluster (1, 2, 3).]
Next, we apply the Elbow method to determine a good value of k, using map_dbl to extract
the total within-cluster sum of squares from each model. Often, the last value of k before
the slope of the plot levels off suggests a good value of k, which is 2 in this case.
Credits: https://predictivehacks.com/how-to-determine-the-number-of-clusters-of-k-means-in-r/
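The elbow computation described above can be sketched as follows. This is a minimal base-R version under stated assumptions: it uses vapply() in place of purrr's map_dbl (either works), all four numeric iris columns, and an assumed nstart = 25. The names ks and wss are introduced only for this illustration.

```r
set.seed(88)

ks <- 1:10

# Total within-cluster sum of squares for each k
# (purrr::map_dbl(ks, ...) would do the same job as vapply here)
wss <- vapply(
  ks,
  function(k) kmeans(iris[, 1:4], centers = k, nstart = 25)$tot.withinss,
  numeric(1)
)

# Elbow plot: look for the k after which the curve flattens
plot(ks, wss, type = "b", xlab = "k", ylab = "tot_withinss")
```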
[Figure: Elbow plot of tot_withinss (y-axis) versus k (x-axis, 1 to 10).]
Taking k = 2 as the suggested value, we now create the set of all possible combinations of
the 4 predictors using combn(), then loop through the combinations, running kmeans() on each
and updating the best accuracy. As a result, we obtained a best accuracy of 67% with the
best combination Petal.Length + Petal.Width.
best_accuracy <- accuracy
best_combination <- current_combination
}
}
}
## [1] 0.16
## [1] 0.54
## [1] 0.006666667
## [1] 0.04666667
## [1] 0.6533333
## [1] 0.6333333
## [1] 0.07333333
## [1] 0.006666667
## [1] 0.66
## [1] 0.6666667
## [1] 0.6666667
## [1] 0.6666667
## [1] 0.02666667
## [1] 0.6666667
## [1] 0.6666667
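Only the tail of the search loop survives above. A hedged sketch of how such a search over predictor subsets might be written is given below. It is not the original code: it scores each clustering by majority vote (each cluster is labeled with its most common species), which does not depend on how kmeans() numbers its clusters, and nstart = 25 is an assumed setting. The variable names mirror those visible in the fragment above.

```r
set.seed(88)

predictors <- names(iris)[1:4]
k <- 2

best_accuracy <- 0
best_combination <- NULL

# Try every non-empty subset of the four predictors (4 + 6 + 4 + 1 = 15 subsets)
for (size in seq_along(predictors)) {
  combos <- combn(predictors, size, simplify = FALSE)
  for (current_combination in combos) {
    km <- kmeans(iris[, current_combination, drop = FALSE],
                 centers = k, nstart = 25)
    tab <- table(km$cluster, iris$Species)
    # Majority-vote accuracy: label each cluster with its most common species
    accuracy <- sum(apply(tab, 1, max)) / nrow(iris)
    if (accuracy > best_accuracy) {
      best_accuracy <- accuracy
      best_combination <- current_combination
    }
  }
}

best_combination
best_accuracy
```

With k = 2 and three species of 50 flowers each, no labeling of two clusters can get more than 100 of the 150 flowers right, so accuracy is capped at about 0.667, which is consistent with the 0.67 reported above.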
[Figure: Best K-means Clustering of Iris Dataset (Suggested Value of k: 2 | Accuracy: 0.67). Scatter plot of Petal Length (x-axis) versus Petal Width (y-axis); point shape shows Species (setosa, versicolor, virginica) and color shows Cluster on a continuous scale from 1.00 to 2.00.]