ISYE6501 Homework 2

The document outlines the process of using k-nearest neighbors (KNN) for classification on credit card data, identifying k=5 as optimal with ~69% accuracy. It also discusses clustering models for optimizing social media advertising based on user engagement patterns, suggesting predictors like engagement with posts and academic preferences. Additionally, k-means clustering is applied to the Iris dataset with k=3, achieving clustering results that align with known flower species.

Uploaded by

vitieubao083

ISYE6501-Homework2

2024-01-22

Load dependency packages using pacman

rm(list=ls())
setwd("~/Georgia Tech - OMSA/ISYE 6501")
if(!require(pacman)) install.packages("pacman")

## Loading required package: pacman

library(pacman)
p_load(kernlab, tinytex, kknn, caret, magrittr, rio, tidyverse)

Question 3.1 Using the same data set as in Question 2.2, use the ksvm or kknn
function to find a good classifier

3.1.a Using cross-validation (do this for the k-nearest-neighbors model; SVM is optional)

Load the credit card data into the data frame df and show the head of the table

file_path <- "~/Georgia Tech - OMSA/ISYE 6501/hw2/data 3.1/credit_card_data.txt"

df <- read.table(file_path, stringsAsFactors = FALSE, header=FALSE)
head(df)

## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
## 1 1 30.83 0.000 1.25 1 0 1 1 202 0 1
## 2 0 58.67 4.460 3.04 1 0 6 1 43 560 1
## 3 0 24.50 0.500 1.50 1 1 0 1 280 824 1
## 4 1 27.83 1.540 3.75 1 0 5 0 100 3 1
## 5 1 20.17 5.625 1.71 1 1 0 1 120 0 1
## 6 1 32.08 4.000 2.50 1 1 0 0 360 0 1

#tail(df)

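In the following code block, we run cross-validation on the full data set (df) using the
train() function from the “caret” package, trying 20 different values of k. The result
identifies k = 5 as the optimal value, with a cross-validated accuracy of ~69% and a Kappa of
~0.37. Kappa is not a second accuracy figure: it measures how far the predictions agree with
the true labels beyond what chance agreement would produce (0 = chance level, 1 = perfect).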

set.seed(1)

###### Compute the KNN model on the full data set #######

# Define resampling parameters
statctrl <- trainControl(
  method = "repeatedcv", # Repeated cross-validation
  number = 10,           # Number of folds
  repeats = 3            # Number of complete sets of folds
)

# Define and save the model

fit <- train(
  as.factor(V11)~V1+V2+V3+V4+V5+V6+V7+V8+V9+V10,
  data = df,             # use the full data set
  method = "knn",
  trControl = statctrl,
  tuneLength = 20        # 20 different values of k
)

#Run fit
fit

## k-Nearest Neighbors
##
## 654 samples
## 10 predictor
## 2 classes: ’0’, ’1’
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 589, 589, 588, 589, 588, 589, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.6917172 0.3692819
## 7 0.6774204 0.3378301
## 9 0.6764336 0.3354226
## 11 0.6733955 0.3313479
## 13 0.6703186 0.3262664
## 15 0.6626962 0.3109202
## 17 0.6631469 0.3113559
## 19 0.6601243 0.3047398
## 21 0.6616550 0.3064370
## 23 0.6702875 0.3231283
## 25 0.6763481 0.3340599
## 27 0.6753380 0.3327768
## 29 0.6773271 0.3370988
## 31 0.6778555 0.3380353
## 33 0.6788967 0.3403241
## 35 0.6839627 0.3506076
## 37 0.6895804 0.3631481
## 39 0.6880653 0.3602042
## 41 0.6885703 0.3612929
## 43 0.6855245 0.3545281
##
## Accuracy was used to select the optimal model using the largest value.

## The final value used for the model was k = 5.
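To make the resampling step more concrete, here is a minimal sketch of what a single round of 10-fold cross-validation over a grid of k does. It is not the caret implementation: it uses knn() from the class package, no repeats, and the built-in iris data as a stand-in because the credit card file is not bundled with R. The names fold_id, k_values, and cv_accuracy are made up for this illustration.

```r
library(class)  # provides knn(); part of the recommended packages shipped with R

set.seed(1)

# Stand-in data (assumption): iris predictors and species labels.
x <- as.matrix(iris[, 1:4])
y <- iris$Species

n_folds <- 10
k_values <- seq(5, 43, by = 2)  # same grid of k as in the caret output

# Assign every row to one of the folds at random.
fold_id <- sample(rep(seq_len(n_folds), length.out = nrow(x)))

# For each k, average the held-out accuracy over the folds.
cv_accuracy <- sapply(k_values, function(k) {
  fold_acc <- sapply(seq_len(n_folds), function(f) {
    held_out <- fold_id == f
    pred <- knn(train = x[!held_out, ], test = x[held_out, ],
                cl = y[!held_out], k = k)
    mean(pred == y[held_out])
  })
  mean(fold_acc)
})

# Pick the k with the highest average held-out accuracy.
best_k <- k_values[which.max(cv_accuracy)]
data.frame(k = k_values, accuracy = round(cv_accuracy, 4))
best_k
```

Repeating the whole procedure with fresh fold assignments and averaging again is, roughly, what the repeats = 3 setting adds.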

3.1.b Splitting the data into training, validation, and test data sets (pick either KNN or SVM;
the other is optional)

In the code block below, we split the data randomly into train and test data sets, with
train containing 70% of the data (458 samples) and test containing 30% (196 samples). We then
perform cross-validation on the train data set (10 folds, repeated 3 times) using train()
from the “caret” package, as in the question above; cross-validation takes the place of a
separate validation set. In the output shown below, the optimal value of k for the train data
set is 5, with an accuracy of ~69% and a Kappa of ~0.36.

# Split the data randomly into train and test data sets: 70% for train and 30% for test
train <- df %>% sample_frac(.70)
test <- anti_join(df, train)

## Joining with ‘by = join_by(V1, V2, V3, V4, V5, V6, V7, V8, V9, V10, V11)‘
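One caveat about this split: anti_join() drops every row of df that matches a training row on all columns, so if the data contained exact duplicate rows, some could disappear from the test set. Here 458 + 196 = 654, so nothing was lost. As an alternative sketch, assuming only base R, the split can be made on row indices instead; the toy data frame and the names train_idx, train_toy, and test_toy are hypothetical.

```r
set.seed(1)

# Toy data with two identical rows (rows 1 and 2), standing in for df.
toy <- data.frame(x = c(1, 1, 2, 3, 4, 5, 6, 7, 8, 9),
                  y = c(0, 0, 1, 0, 1, 1, 0, 1, 0, 1))

# Draw 70% of the row indices for training; the rest form the test set.
train_idx <- sample(nrow(toy), size = round(0.7 * nrow(toy)))
train_toy <- toy[train_idx, ]
test_toy  <- toy[-train_idx, ]

# Every row lands in exactly one of the two sets, duplicates included.
nrow(train_toy) + nrow(test_toy) == nrow(toy)
```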

###### Compute the KNN model on the training data set #######

# Define resampling parameters
statctrl <- trainControl(
  method = "repeatedcv", # Repeated cross-validation
  number = 10,           # Number of folds
  repeats = 3            # Number of complete sets of folds
)

# Define and save the model

fit <- train(
  as.factor(V11)~V1+V2+V3+V4+V5+V6+V7+V8+V9+V10,
  data = train,          # use the training data set
  method = "knn",
  trControl = statctrl,
  tuneLength = 20        # 20 different values of k
)

#Run fit
fit

## k-Nearest Neighbors
##
## 458 samples
## 10 predictor
## 2 classes: ’0’, ’1’
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 413, 412, 412, 413, 412, 412, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.6900805 0.3641546
## 7 0.6755233 0.3341961

## 9 0.6696618 0.3235942
## 11 0.6573430 0.3005527
## 13 0.6507568 0.2878012
## 15 0.6500322 0.2827889
## 17 0.6500322 0.2798894
## 19 0.6434138 0.2669186
## 21 0.6390499 0.2585996
## 23 0.6448953 0.2720385
## 25 0.6544122 0.2918998
## 27 0.6704348 0.3242520
## 29 0.6704831 0.3236226
## 31 0.6617069 0.3053828
## 33 0.6588567 0.2980426
## 35 0.6632045 0.3056490
## 37 0.6617552 0.3013254
## 39 0.6726409 0.3241827
## 41 0.6777134 0.3350740
## 43 0.6777778 0.3353016
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.

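Now let us apply that model to the test data set: first predict the classes for the test set,
then check how well the predictions match V11. In the output below the accuracy is ~68%, with
a 95% confidence interval of roughly 0.61 to 0.74. Because the table passed to
confusionMatrix() has the actual classes in its rows and the predictions in its columns, the
reverse of the layout that function assumes, the per-class rows are transposed: the value
printed as Sensitivity (81/117) is really the precision for class 0, and the value printed as
Pos Pred Value (81/108) is really its sensitivity. Prevalence and the No Information Rate are
likewise based on the predicted classes. Accuracy and Kappa are unaffected.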

#predict the test set

predict_cc <- predict(fit, newdata = test)

# accuracy of the model on the test data

table(
actualclass = test$V11,
predictedclass = predict_cc
) %>%
confusionMatrix() %>%
print()

## Confusion Matrix and Statistics

##
## predictedclass
## actualclass 0 1
## 0 81 27
## 1 36 52
##
## Accuracy : 0.6786
## 95% CI : (0.6083, 0.7433)
## No Information Rate : 0.5969
## P-Value [Acc > NIR] : 0.01127
##
## Kappa : 0.3442
##
## Mcnemar’s Test P-Value : 0.31350
##

## Sensitivity : 0.6923
## Specificity : 0.6582
## Pos Pred Value : 0.7500
## Neg Pred Value : 0.5909
## Prevalence : 0.5969
## Detection Rate : 0.4133
## Detection Prevalence : 0.5510
## Balanced Accuracy : 0.6753
##
## ’Positive’ Class : 0
##

Question 4.1

Describe a situation or problem from your job, everyday life, current events, etc., for which
a clustering model would be appropriate. List some (up to 5) predictors that you might use.

During my time as a webmaster for a community college, the marketing department sought to enhance our
recruiting campaign through social media advertising. To optimize our outreach, we decided to employ
clustering models. This involved categorizing users based on their engagement patterns to tailor our
advertising content effectively.
We considered key predictors for clustering, including:

1. Engagement with College Posts: User interactions with posts on academic programs, events, and
campus life.
2. Exploration of Academic Pages: User visits to academic department pages, course details, and
admission information.
3. Participation in Virtual Campus Tours: Involvement in virtual tours and online events organized
by the college.
4. Field of Study Preferences: User preferences for specific fields of study or majors.
5. Extracurricular Interests: User engagement with content related to extracurricular activities and
student organizations.

Applying clustering models to these predictors allowed us to identify distinct user segments, enabling us to
tailor our social media advertising content for a more impactful recruiting campaign.

Question 4.2

Use the R function kmeans to cluster the points as well as possible. Report the best combination
of predictors, your suggested value of k, and how well your best clustering predicts flower
type.

Load the ggplot2 and datasets packages and the iris data, and display a summary

set.seed(88)
p_load(ggplot2, datasets,tidyverse)

#Load the Iris dataset

data("iris")
#head(iris)
summary(iris)

## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##

In the code block below, we make a first run of kmeans() using all 4 columns (Sepal.Length,
Sepal.Width, Petal.Length, Petal.Width) as predictors, scaled to zero mean and unit variance,
with k = 3, since we already know from the Species column that there are 3 types of flowers.
We also measure the accuracy by comparing the resulting cluster labels to the actual species
labels.

# Define predictors (features)

predictors <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]
# Scale the predictors
scaled_predictors <- as.data.frame(scale(predictors))

# Perform k-means clustering with k = 3

kmeans_result <- kmeans(scaled_predictors, centers = 3)
print(kmeans_result)

## K-means clustering with 3 clusters of sizes 50, 47, 53

##
## Cluster means:
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 -1.01119138 0.85041372 -1.3006301 -1.2507035
## 2 1.13217737 0.08812645 0.9928284 1.0141287
## 3 -0.05005221 -0.88042696 0.3465767 0.2805873
##
## Clustering vector:
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 3 3 3 2 3 3 3 3 3 3 3 3 2 3 3 3 3 2 3 3 3
## [75] 3 2 2 2 3 3 3 3 3 3 3 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 2 2 2 3 2 2 2 2
## [112] 2 2 3 3 2 2 2 2 3 2 3 2 3 2 2 3 2 2 2 2 2 2 3 3 2 2 2 3 2 2 2 3 2 2 2 3 2
## [149] 2 3
##
## Within cluster sum of squares by cluster:
## [1] 47.35062 47.45019 44.08754
## (between_SS / total_SS = 76.7 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"

fitted_labels <- kmeans_result$cluster
correct_labels <- as.numeric(iris$Species)

# Measure accuracy by comparing cluster labels to actual species labels

accuracy <- sum(fitted_labels == correct_labels) / length(correct_labels)
print(accuracy)

## [1] 0.5

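kmeans() starts from randomly chosen centers, so both the clustering and the numbering of the
clusters can change from run to run. The accuracy computed above compares the raw cluster
numbers with the species codes (1 = setosa, 2 = versicolor, 3 = virginica), so it depends on
that arbitrary numbering. In the run shown, cluster 1 contains all 50 setosa, but cluster 3
holds most of the versicolor (39 of 50) and cluster 2 most of the virginica (36 of 50). This
yields 0.5 (50%), even though 125 of 150 points (~83%) sit in the cluster dominated by their
own species. The next code block graphs the kmeans result, using different shapes for Species
and different colors for clusters. It displays Petal.Length and Petal.Width on the x and y
axes, but another pair such as Sepal.Length and Sepal.Width would work as well.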
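Because cluster numbers are arbitrary, one way to score a clustering against known classes is to first map each cluster to the species that dominates it. The sketch below does this with base R only. It is an alternative scoring shown for illustration, not the calculation used in this homework; nstart = 25 and the names tab, cluster_to_species, and matched_accuracy are assumptions made for the example.

```r
set.seed(88)

x <- scale(iris[, 1:4])

# nstart = 25 reruns k-means from 25 random starts and keeps the best fit.
km <- kmeans(x, centers = 3, nstart = 25)

# Cross-tabulate cluster numbers against species.
tab <- table(cluster = km$cluster, species = iris$Species)

# Map each cluster to the species that is most common inside it.
cluster_to_species <- colnames(tab)[apply(tab, 1, which.max)]
predicted_species <- cluster_to_species[km$cluster]

# Share of points whose mapped cluster label equals their species.
matched_accuracy <- mean(predicted_species == as.character(iris$Species))
tab
matched_accuracy
```

A majority mapping like this can assign two clusters to the same species, so it is best read as a purity score, not a strict one-to-one accuracy.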

# Add cluster assignments to the original dataset

Clusters <- as.factor(kmeans_result$cluster)
iris_clustered <- cbind(iris, Clusters )

# Visualize the clusters using ggplot2

ggplot(iris_clustered, aes(x = Petal.Length, y = Petal.Width, color = Clusters, shape = Species)) +
geom_point(size = 4) +
labs(title = "K-means Clustering of Iris Dataset",
subtitle = paste("Accuracy:", round(accuracy, 2)),
x = "Petal Length",
y = "Petal Width",
color = "Cluster",
shape = "Species") +
theme_minimal()

[Figure: scatter plot "K-means Clustering of Iris Dataset" (subtitle "Accuracy: 0.5"); x-axis Petal Length, y-axis Petal Width; point color = cluster (1, 2, 3), point shape = species (setosa, versicolor, virginica).]

Next, we apply the elbow method to determine a good value of k, using map_dbl() to extract
the total within-cluster sum of squares from each model. The last value of k before the slope
of the plot levels off often suggests a good value of k; we read this as k = 2 in this case.
Credits: https://predictivehacks.com/how-to-determine-the-number-of-clusters-of-k-means-in-r/

# Use map_dbl to run many models with varying value of k (centers)

tot_withinss <- map_dbl(1:10, function(k){
model <- kmeans(x = scaled_predictors, centers = k)
model$tot.withinss
})

# Generate a data frame containing both k and tot_withinss

elbow_df <- data.frame(
k = 1:10,
tot_withinss = tot_withinss
)

# Plot the elbow plot

ggplot(elbow_df, aes(x = k, y = tot_withinss)) +
geom_line() + geom_point()+
scale_x_continuous(breaks = 1:10)

[Figure: elbow plot of tot_withinss (y-axis) against k from 1 to 10 (x-axis).]
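Since each kmeans() call starts from random centers, an elbow curve built from single runs can wobble between knits. The sketch below, in base R only, keeps the best of several starts for each k and prints how much the within-cluster sum of squares drops with each extra cluster, which can make the elbow easier to read off than the plot alone. The nstart = 25 setting and the name wss are assumptions for this illustration.

```r
set.seed(88)
x <- scale(iris[, 1:4])

# Total within-cluster sum of squares for k = 1..10, best of 25 starts each.
wss <- sapply(1:10, function(k) {
  kmeans(x, centers = k, nstart = 25)$tot.withinss
})

# Drop in WSS gained by each additional cluster; the elbow is where
# these drops become small.
data.frame(k = 1:10,
           wss = round(wss, 1),
           drop = round(c(NA, -diff(wss)), 1))
```

With one cluster the value equals the total sum of squares of the scaled data, 4 columns times (150 - 1) = 596, which gives a reference point for the rest of the curve.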

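Taking k = 2 from the elbow plot, we now create all possible combinations of the 4 predictors
using combn(), loop through them with kmeans(), and keep the combination with the highest
accuracy. The result is a best accuracy of 67% with the combination Petal.Length +
Petal.Width. Note that with k = 2 the cluster labels are only 1 and 2, so no virginica point
(species code 3) can ever count as correct and this accuracy measure cannot exceed
100/150 ≈ 67%. Several other combinations reach the same value; the first one found is kept.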

# Initialize variables to store the best results

best_accuracy <- -1
best_combination <- NULL

# Iterate through predictor combinations

for (num_predictors in 1:4) {
  combinations <- combn(4, num_predictors, simplify = FALSE)

  for (current_combination in combinations) {

    current_predictors <- scaled_predictors[, current_combination]

    # Perform k-means clustering
    kmeans_result <- kmeans(current_predictors, centers = 2)

    # Measure accuracy by comparing cluster labels to actual species labels
    cluster_labels <- kmeans_result$cluster
    correct_labels <- as.numeric(iris$Species)
    accuracy <- sum(cluster_labels == correct_labels) / length(correct_labels)
    print(accuracy)

    # Update best results if accuracy is higher
    if (accuracy > best_accuracy) {
      best_accuracy <- accuracy
      best_combination <- current_combination
    }
  }
}

## [1] 0.16
## [1] 0.54
## [1] 0.006666667
## [1] 0.04666667
## [1] 0.6533333
## [1] 0.6333333
## [1] 0.07333333
## [1] 0.006666667
## [1] 0.66
## [1] 0.6666667
## [1] 0.6666667
## [1] 0.6666667
## [1] 0.02666667
## [1] 0.6666667
## [1] 0.6666667

# Print the best combination of predictors and accuracy

cat("Best Combination of Predictors:", colnames(predictors)[best_combination], "\n")

## Best Combination of Predictors: Petal.Length Petal.Width

cat("Best Accuracy:", round(best_accuracy, 2), "\n")

## Best Accuracy: 0.67
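As a variation on this search, the sketch below enumerates the same 15 predictor subsets with combn() but scores each clustering after matching clusters to their majority species, and uses k = 3 so that all three species can be represented. This is an alternative shown for comparison, not the procedure that produced the results above; the helper matched_accuracy(), the nstart = 25 setting, and the other names are hypothetical.

```r
set.seed(88)
x <- scale(iris[, 1:4])
species <- as.character(iris$Species)

# Share of points that belong to the majority species of their cluster.
matched_accuracy <- function(cluster, truth) {
  tab <- table(cluster, truth)
  sum(apply(tab, 1, max)) / length(truth)
}

# All non-empty subsets of the four predictor columns: 4 + 6 + 4 + 1 = 15.
subsets <- unlist(lapply(1:4, function(m) combn(4, m, simplify = FALSE)),
                  recursive = FALSE)

# Cluster on each subset and score it; drop = FALSE keeps one-column
# subsets as matrices.
scores <- sapply(subsets, function(cols) {
  km <- kmeans(x[, cols, drop = FALSE], centers = 3, nstart = 25)
  matched_accuracy(km$cluster, species)
})

results <- data.frame(
  predictors = sapply(subsets, function(cols) {
    paste(colnames(x)[cols], collapse = " + ")
  }),
  accuracy = round(scores, 3)
)

# Best-scoring subsets first.
results[order(-results$accuracy), ]
```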

Graph the final result.

# Visualize the clusters using ggplot2
# Note: cluster_labels holds the labels from the last loop iteration (all four
# predictors) and is numeric, so ggplot2 draws it on a continuous color scale.

ggplot(iris_clustered, aes(x = Petal.Length, y = Petal.Width, color = cluster_labels, shape = Species)) +
  geom_point(size = 4) +
  labs(title = "Best K-means Clustering of Iris Dataset",
       subtitle = paste("Suggested Value of k:", 2, " | Accuracy:", round(best_accuracy, 2)),
       x = "Petal Length",
       y = "Petal Width",
       color = "Cluster",
       shape = "Species") +
  theme_minimal()

[Figure: scatter plot "Best K-means Clustering of Iris Dataset" (subtitle "Suggested Value of k: 2 | Accuracy: 0.67"); x-axis Petal Length, y-axis Petal Width; point shape = species (setosa, versicolor, virginica), point color = cluster on a continuous scale from 1.00 to 2.00.]
