
ISYE6501 Homework 2

The document outlines the process of using k-nearest neighbors (KNN) for classification on credit card data, identifying k=5 as optimal with ~69% accuracy. It also discusses clustering models for optimizing social media advertising based on user engagement patterns, suggesting predictors like engagement with posts and academic preferences. Additionally, k-means clustering is applied to the Iris dataset with k=3, achieving clustering results that align with known flower species.

Uploaded by

vitieubao083

ISYE6501-Homework2

2024-01-22

Load dependency packages using pacman

rm(list=ls())
setwd("~/Georgia Tech - OMSA/ISYE 6501")
if(!require(pacman)) install.packages("pacman")

## Loading required package: pacman

library(pacman)
p_load(kernlab, tinytex, kknn, caret, magrittr, rio, tidyverse)

Question 3.1 Using the same data set as in Question 2.2, use the ksvm or kknn
function to find a good classifier

3.1.a Using cross-validation (do this for the k-nearest-neighbors model; SVM is optional)

Load the credit card data into a data frame (df) and show the head of the table

file_path <- "~/Georgia Tech - OMSA/ISYE 6501/hw2/data 3.1/credit_card_data.txt"


df <- read.table(file_path, stringsAsFactors = FALSE, header=FALSE)
head(df)

## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
## 1 1 30.83 0.000 1.25 1 0 1 1 202 0 1
## 2 0 58.67 4.460 3.04 1 0 6 1 43 560 1
## 3 0 24.50 0.500 1.50 1 1 0 1 280 824 1
## 4 1 27.83 1.540 3.75 1 0 5 0 100 3 1
## 5 1 20.17 5.625 1.71 1 1 0 1 120 0 1
## 6 1 32.08 4.000 2.50 1 1 0 0 360 0 1

#tail(df)

In the following code block, we run cross-validation on the full data set (df) using
train() from the “caret” package with 20 different values of k. The result identifies
k=5 as the optimal value, with an accuracy of ~69% and a Cohen's kappa of ~0.37
(agreement after correcting for chance).

set.seed(1)

###### Compute the KNN model on full data set #######

# Define parameters
statctrl <- trainControl(
  method = "repeatedcv", # Repeated cross-validation
  number = 10,           # Number of folds
  repeats = 3            # Number of complete sets of folds
)

# Define and save model


fit <- train(
  as.factor(V11)~V1+V2+V3+V4+V5+V6+V7+V8+V9+V10,
  data = df,             # use the full data set
  method = "knn",
  trControl = statctrl,
  tuneLength = 20        # 20 different values for k
)

#Run fit
fit

## k-Nearest Neighbors
##
## 654 samples
## 10 predictor
## 2 classes: ’0’, ’1’
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 589, 589, 588, 589, 588, 589, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.6917172 0.3692819
## 7 0.6774204 0.3378301
## 9 0.6764336 0.3354226
## 11 0.6733955 0.3313479
## 13 0.6703186 0.3262664
## 15 0.6626962 0.3109202
## 17 0.6631469 0.3113559
## 19 0.6601243 0.3047398
## 21 0.6616550 0.3064370
## 23 0.6702875 0.3231283
## 25 0.6763481 0.3340599
## 27 0.6753380 0.3327768
## 29 0.6773271 0.3370988
## 31 0.6778555 0.3380353
## 33 0.6788967 0.3403241
## 35 0.6839627 0.3506076
## 37 0.6895804 0.3631481
## 39 0.6880653 0.3602042
## 41 0.6885703 0.3612929
## 43 0.6855245 0.3545281
##
## Accuracy was used to select the optimal model using the largest value.

## The final value used for the model was k = 5.
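To make the resampling step less of a black box, here is a minimal base-R sketch of what 10-fold cross-validation of a k-nearest-neighbors classifier does. It is not the caret implementation: the helper knn_predict, the use of two iris species as stand-in data (instead of credit_card_data.txt), the scaling step, and the three candidate values of k are all assumptions made for illustration.

```r
set.seed(1)

# Stand-in data (assumption): two iris species instead of the credit card file
dat <- iris[iris$Species != "setosa", ]
x <- scale(dat[, 1:4])          # scaling is an addition; the caret run used no pre-processing
y <- droplevels(dat$Species)

# Hypothetical helper: majority vote among the k nearest training points
knn_predict <- function(train_x, train_y, test_x, k) {
  apply(test_x, 1, function(row) {
    d <- sqrt(colSums((t(train_x) - row)^2))   # Euclidean distance to every training point
    votes <- table(train_y[order(d)[1:k]])     # class counts among the k nearest
    names(votes)[which.max(votes)]
  })
}

# Assign every row to one of 10 folds at random
folds <- sample(rep(1:10, length.out = nrow(x)))

# For each candidate k, average the held-out accuracy over the 10 folds
candidate_k <- c(5, 7, 9)
cv_accuracy <- sapply(candidate_k, function(k) {
  mean(sapply(1:10, function(f) {
    held_out <- folds == f
    pred <- knn_predict(x[!held_out, , drop = FALSE], y[!held_out],
                        x[held_out, , drop = FALSE], k)
    mean(pred == y[held_out])
  }))
})
names(cv_accuracy) <- candidate_k
print(cv_accuracy)
```

Repeated cross-validation, as used with caret above, would simply redo the fold assignment several times and average again.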

3.1.b Splitting the data into training, validation, and test data sets (pick either KNN or SVM;
the other is optional)

In the code block below, we split the data randomly into train and test data sets, with
train containing 70% of the data (458 samples) and test containing 30% (196 samples). We
then perform cross-validation on the train data set (10 folds, repeated 3 times) using train()
from the “caret” package, as in the question above. The optimal value of k for the train data
set is 5, with an accuracy of ~69% and a kappa of ~0.36.

# Split the data randomly: 70% for train, the remaining 30% for test
train <- df %>% sample_frac(.70)
test <- anti_join(df, train)

## Joining with ‘by = join_by(V1, V2, V3, V4, V5, V6, V7, V8, V9, V10, V11)‘

###### Compute the KNN model on training data set #######

# Define parameters
statctrl <- trainControl(
  method = "repeatedcv", # Repeated cross-validation
  number = 10,           # Number of folds
  repeats = 3            # Number of complete sets of folds
)

# Define and save model


fit <- train(
  as.factor(V11)~V1+V2+V3+V4+V5+V6+V7+V8+V9+V10,
  data = train,          # use the training data set
  method = "knn",
  trControl = statctrl,
  tuneLength = 20        # 20 different values for k
)

#Run fit
fit

## k-Nearest Neighbors
##
## 458 samples
## 10 predictor
## 2 classes: ’0’, ’1’
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 413, 412, 412, 413, 412, 412, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.6900805 0.3641546
## 7 0.6755233 0.3341961

## 9 0.6696618 0.3235942
## 11 0.6573430 0.3005527
## 13 0.6507568 0.2878012
## 15 0.6500322 0.2827889
## 17 0.6500322 0.2798894
## 19 0.6434138 0.2669186
## 21 0.6390499 0.2585996
## 23 0.6448953 0.2720385
## 25 0.6544122 0.2918998
## 27 0.6704348 0.3242520
## 29 0.6704831 0.3236226
## 31 0.6617069 0.3053828
## 33 0.6588567 0.2980426
## 35 0.6632045 0.3056490
## 37 0.6617552 0.3013254
## 39 0.6726409 0.3241827
## 41 0.6777134 0.3350740
## 43 0.6777778 0.3353016
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
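One caveat about the split used above: anti_join() drops every row of df that matches a row in train, so if the data contained exact duplicate rows, all copies would leave the test set even when only one was sampled. The sample sizes reported above (458 + 196 = 654) suggest this did not happen here. As a hedged alternative, the sketch below splits by row index, which is unaffected by duplicates. The data frame toy and its values are hypothetical stand-ins for df.

```r
set.seed(1)

# Hypothetical stand-in for df; rows 1 and 2 are deliberately identical
toy <- data.frame(V1 = c(1, 1, 0, 0, 1, 0, 1, 0, 1, 0),
                  V2 = c(5, 5, 3, 8, 2, 9, 4, 7, 6, 1))

# Sample 70% of the row indices for training
train_idx <- sample(nrow(toy), size = round(0.7 * nrow(toy)))

train_toy <- toy[train_idx, ]    # rows whose index was sampled
test_toy  <- toy[-train_idx, ]   # every remaining row, duplicates included

cat("train rows:", nrow(train_toy), " test rows:", nrow(test_toy), "\n")
```

Because the split is made on indices rather than on row contents, the two pieces always add up to the full data set.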

Now let us apply that model to the test data set: first predict the test set, then check how
well the model predicts V11. We obtain an accuracy of ~68% with a 95% confidence interval
between ~0.61 and ~0.74.

#predict the test set


predict_cc <- predict(fit, newdata = test)

# accuracy of model on test data


table(
actualclass = test$V11,
predictedclass = predict_cc
) %>%
confusionMatrix() %>%
print()

## Confusion Matrix and Statistics


##
## predictedclass
## actualclass 0 1
## 0 81 27
## 1 36 52
##
## Accuracy : 0.6786
## 95% CI : (0.6083, 0.7433)
## No Information Rate : 0.5969
## P-Value [Acc > NIR] : 0.01127
##
## Kappa : 0.3442
##
## Mcnemar’s Test P-Value : 0.31350
##

## Sensitivity : 0.6923
## Specificity : 0.6582
## Pos Pred Value : 0.7500
## Neg Pred Value : 0.5909
## Prevalence : 0.5969
## Detection Rate : 0.4133
## Detection Prevalence : 0.5510
## Balanced Accuracy : 0.6753
##
## ’Positive’ Class : 0
##
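The accuracy and kappa reported above can be recomputed by hand from the four counts in the confusion matrix. The sketch below does this in base R; the variable names are illustrative, and the counts are copied from the output above.

```r
# Confusion matrix counts from the output above (rows = actual, columns = predicted)
cm <- matrix(c(81, 36, 27, 52), nrow = 2,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

n <- sum(cm)

# Observed agreement: share of cases on the diagonal
observed <- sum(diag(cm)) / n

# Agreement expected by chance, from the row and column totals
expected <- sum(rowSums(cm) * colSums(cm)) / n^2

# Cohen's kappa: agreement beyond chance, rescaled so that 1 is perfect agreement
cohen_kappa <- (observed - expected) / (1 - expected)

cat("Accuracy:", round(observed, 4), " Kappa:", round(cohen_kappa, 4), "\n")
```

This reproduces the Accuracy (0.6786) and Kappa (0.3442) lines of the caret output, which shows why kappa is the lower of the two numbers: it discounts the agreement a classifier would reach by chance.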

Question 4.1

Describe a situation or problem from your job, everyday life, current events, etc., for which
a clustering model would be appropriate. List some (up to 5) predictors that you might use.
During my time as a webmaster for a community college, the marketing department sought to enhance our
recruiting campaign through social media advertising. To optimize our outreach, we decided to employ
clustering models. This involved categorizing users based on their engagement patterns to tailor our
advertising content effectively.
We considered key predictors for clustering, including:

1. Engagement with College Posts: User interactions with posts on academic programs, events, and
campus life.

2. Exploration of Academic Pages: User visits to academic department pages, course details, and
admission information.
3. Participation in Virtual Campus Tours: Involvement in virtual tours and online events organized
by the college.
4. Field of Study Preferences: User preferences for specific fields of study or majors.

5. Extracurricular Interests: User engagement with content related to extracurricular activities and
student organizations.

Applying clustering models to these predictors allowed us to identify distinct user segments, enabling us to
tailor our social media advertising content for a more impactful recruiting campaign.

Question 4.2

Use the R function kmeans to cluster the points as well as possible. Report the best combination
of predictors, your suggested value of k, and how well your best clustering predicts flower
type.

Load the ggplot2 and datasets packages and the iris data, and display a summary

set.seed(88)
p_load(ggplot2, datasets,tidyverse)

#Load the Iris dataset


data("iris")
#head(iris)
summary(iris)

## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##

In the code block below, we make a first run of kmeans() using all 4 columns
(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) as scaled predictors, with k = 3, since
we already know from the Species column that there are 3 types of flowers. We also
measure the accuracy by comparing the resulting cluster labels to the actual species labels.

# Define predictors (features)


predictors <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]
# Scale the predictors
scaled_predictors <- as.data.frame(scale(predictors))

# Perform k-means clustering with k = 3


kmeans_result <- kmeans(scaled_predictors, centers = 3)
print(kmeans_result)

## K-means clustering with 3 clusters of sizes 50, 47, 53


##
## Cluster means:
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 -1.01119138 0.85041372 -1.3006301 -1.2507035
## 2 1.13217737 0.08812645 0.9928284 1.0141287
## 3 -0.05005221 -0.88042696 0.3465767 0.2805873
##
## Clustering vector:
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 3 3 3 2 3 3 3 3 3 3 3 3 2 3 3 3 3 2 3 3 3
## [75] 3 2 2 2 3 3 3 3 3 3 3 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 2 2 2 3 2 2 2 2
## [112] 2 2 3 3 2 2 2 2 3 2 3 2 3 2 2 3 2 2 2 2 2 2 3 3 2 2 2 3 2 2 2 3 2 2 2 3 2
## [149] 2 3
##
## Within cluster sum of squares by cluster:
## [1] 47.35062 47.45019 44.08754
## (between_SS / total_SS = 76.7 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"

fitted_labels <- kmeans_result$cluster
correct_labels <- as.numeric(iris$Species)

# Measure accuracy by comparing cluster labels to actual species labels


accuracy <- sum(fitted_labels == correct_labels) / length(correct_labels)
print(accuracy)

## [1] 0.5
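Since kmeans() numbers its clusters arbitrarily, one way to score a clustering against the species is to first map each cluster to the species that is most common inside it, and only then count matches. The following is a minimal sketch of that idea, not the method used in this homework; the variable names and the choice of nstart = 25 are assumptions.

```r
set.seed(88)

scaled <- scale(iris[, 1:4])

# nstart = 25 tries 25 random starts and keeps the best one (an added choice)
km <- kmeans(scaled, centers = 3, nstart = 25)

# Cross-tabulate clusters against species
tab <- table(cluster = km$cluster, species = iris$Species)
print(tab)

# Map each cluster number to its majority species
majority <- colnames(tab)[apply(tab, 1, which.max)]
predicted <- majority[km$cluster]

# Accuracy after label matching, next to the direct label comparison
matched_accuracy <- mean(predicted == as.character(iris$Species))
direct_accuracy  <- mean(km$cluster == as.numeric(iris$Species))

cat("Matched:", round(matched_accuracy, 3),
    " Direct:", round(direct_accuracy, 3), "\n")
```

The matched figure can never be lower than the direct one, because each cluster is credited with its largest species count rather than with whichever count happens to share its number.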

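kmeans() starts from random centers, so its output (including which cluster is numbered 1,
2 or 3) can change from run to run; the best accuracy we found by running the above code is
around 0.5 (or 50%). Because the cluster numbers are arbitrary, this direct comparison with
as.numeric(iris$Species) only counts points whose cluster number happens to equal their
species index, so it is a lower bound on how well the clusters match the species. The next
code block graphs the kmeans result using different shapes for Species and different colors
for clusters. It uses Petal.Length and Petal.Width as the x and y axes, but another pair such
as Sepal.Length and Sepal.Width would work as well.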

# Add cluster assignments to the original dataset


Clusters <- as.factor(kmeans_result$cluster)
iris_clustered <- cbind(iris, Clusters )

# Visualize the clusters using ggplot2


ggplot(iris_clustered, aes(x = Petal.Length, y = Petal.Width, color = Clusters, shape = Species)) +
geom_point(size = 4) +
labs(title = "K-means Clustering of Iris Dataset",
subtitle = paste("Accuracy:", round(accuracy, 2)),
x = "Petal Length",
y = "Petal Width",
color = "Cluster",
shape = "Species") +
theme_minimal()

[Figure: scatter plot "K-means Clustering of Iris Dataset", subtitle "Accuracy: 0.5". x-axis: Petal Length; y-axis: Petal Width; color: Cluster (1, 2, 3); shape: Species (setosa, versicolor, virginica).]

Next, we apply the elbow method to determine a good value of k, using map_dbl to
extract the total within-cluster sum of squares from each model. Often, the last value
of k before the slope of the plot levels off suggests a good k, which we read as 2 in this case.
Credits: https://predictivehacks.com/how-to-determine-the-number-of-clusters-of-k-means-in-r/

# Use map_dbl to run many models with varying value of k (centers)


tot_withinss <- map_dbl(1:10, function(k){
model <- kmeans(x = scaled_predictors, centers = k)
model$tot.withinss
})

# Generate a data frame containing both k and tot_withinss


elbow_df <- data.frame(
k = 1:10,
tot_withinss = tot_withinss
)

# Plot the elbow plot


ggplot(elbow_df, aes(x = k, y = tot_withinss)) +
geom_line() + geom_point()+
scale_x_continuous(breaks = 1:10)

[Figure: elbow plot of tot_withinss (y-axis, ticks at 200, 400, 600) against k (x-axis, 1 to 10).]
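Reading an elbow by eye is subjective, so it can help to tabulate how much the total within-cluster sum of squares falls at each step. The sketch below is one hedged way to do that in base R; the column name drop_pct and the use of nstart = 25 are assumptions, not part of the original analysis.

```r
set.seed(88)

scaled <- scale(iris[, 1:4])
ks <- 1:10

# Total within-cluster sum of squares for each k (best of 25 random starts)
wss <- sapply(ks, function(k) {
  kmeans(scaled, centers = k, nstart = 25)$tot.withinss
})

# Percentage reduction relative to the previous k (undefined for k = 1)
drop_pct <- c(NA, -diff(wss) / wss[-length(wss)] * 100)

elbow_table <- data.frame(k = ks,
                          tot_withinss = round(wss, 1),
                          drop_pct = round(drop_pct, 1))
print(elbow_table)
```

For k = 1 the value equals the total sum of squares of the scaled data, 4 x 149 = 596, which matches the top of the plot above. A large drop followed by much smaller ones marks the candidate elbow; whether that is k = 2 or k = 3 remains a judgment call.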

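Taking k=2 from the elbow plot, we now build every combination of the 4 predictors using
combn(), loop over the combinations with kmeans(), and keep the one with the highest
accuracy. The result is a best accuracy of 67% for the combination Petal.Length +
Petal.Width. Several larger combinations tie at the same value; the loop keeps the first one
it finds. Because kmeans() numbers its clusters arbitrarily and only two cluster labels exist
for three species, 67% (100 of 150) is also the highest value this comparison can reach with
k=2.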

# Initialize variables to store the best results


best_accuracy <- -1
best_combination <- NULL

# Iterate through predictor combinations


for (num_predictors in 1:4) {
combinations <- combn(4, num_predictors, simplify = FALSE)

for (current_combination in combinations) {


current_predictors <- scaled_predictors[, current_combination]

# Perform k-means clustering


kmeans_result <- kmeans(current_predictors, centers = 2)

# Measure accuracy by comparing cluster labels to actual species labels


cluster_labels <- kmeans_result$cluster
correct_labels <- as.numeric(iris$Species)
accuracy <- sum(cluster_labels == correct_labels) / length(correct_labels)
print(accuracy)
# Update best results if accuracy is higher
if (accuracy > best_accuracy) {

best_accuracy <- accuracy
best_combination <- current_combination
}
}
}

## [1] 0.16
## [1] 0.54
## [1] 0.006666667
## [1] 0.04666667
## [1] 0.6533333
## [1] 0.6333333
## [1] 0.07333333
## [1] 0.006666667
## [1] 0.66
## [1] 0.6666667
## [1] 0.6666667
## [1] 0.6666667
## [1] 0.02666667
## [1] 0.6666667
## [1] 0.6666667

# Print the best combination of predictors and accuracy


cat("Best Combination of Predictors:", colnames(predictors)[best_combination], "\n")

## Best Combination of Predictors: Petal.Length Petal.Width

cat("Best Accuracy:", round(best_accuracy, 2), "\n")

## Best Accuracy: 0.67
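The very low values in the output above (for example 0.0067) come from the arbitrary cluster numbering rather than from poor clusters. As a hedged variation on the search, the sketch below scores each predictor combination by cluster purity (each cluster is credited with its most common species) and uses k = 3 instead of k = 2. Both choices, the helper purity, and nstart = 25 are assumptions for illustration, so the ranking it prints is not a result of the original analysis.

```r
set.seed(88)

scaled <- scale(iris[, 1:4])

# Hypothetical helper: share of points that belong to their cluster's majority species
purity <- function(cluster, truth) {
  tab <- table(cluster, truth)
  sum(apply(tab, 1, max)) / length(truth)
}

results <- list()
for (num_predictors in 1:4) {
  for (cols in combn(4, num_predictors, simplify = FALSE)) {
    # drop = FALSE keeps a one-column selection as a matrix
    km <- kmeans(scaled[, cols, drop = FALSE], centers = 3, nstart = 25)
    label <- paste(colnames(scaled)[cols], collapse = " + ")
    results[[label]] <- purity(km$cluster, iris$Species)
  }
}

# Rank all 15 combinations from highest to lowest purity
scores <- sort(unlist(results), decreasing = TRUE)
print(round(scores, 3))
```

Purity does not depend on which cluster is called 1, 2 or 3, so rerunning with a different seed changes the scores far less than the direct label comparison does.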

Graph the final result.

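# Visualize the clusters using ggplot2
# Note: cluster_labels holds the numeric labels from the last loop iteration
# (all four predictors), so the color legend is continuous.


ggplot(iris_clustered, aes(x = Petal.Length, y = Petal.Width, color = cluster_labels, shape = Species)) +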
geom_point(size = 4) +
labs(title = "Best K-means Clustering of Iris Dataset",
subtitle = paste("Suggested Value of k:", 2, " | Accuracy:", round(best_accuracy, 2)),
x = "Petal Length",
y = "Petal Width",
color = "Cluster",
shape = "Species") +
theme_minimal()

[Figure: scatter plot "Best K-means Clustering of Iris Dataset", subtitle "Suggested Value of k: 2 | Accuracy: 0.67". x-axis: Petal Length; y-axis: Petal Width; color: Cluster (continuous scale from 1.00 to 2.00); shape: Species (setosa, versicolor, virginica).]
