ISYE6501 Homework 1
2024-01-15
Question 2.1 Describe a situation for which a classification model would be appropriate. List some (up to 5) predictors that you might use.
Recently, I found myself shopping for new auto insurance and filled out an online intake form on an insurance company's website. What struck me was how much the predictors overlap from one auto insurance company to the next; these factors play a pivotal role in shaping premiums and coverage. Deciding whether an applicant is high-risk or low-risk (and pricing the policy accordingly) is a natural binary classification problem. Here are five key predictors that stood out during the form-filling process:
1. Driving History: The questions about past driving records, covering accidents, traffic violations, and claims history, were quite detailed. It was evident that this information significantly influences an insurance company's assessment of an applicant's risk level.
2. Age and Gender: The intake form delved into specifics about age and gender. These demographic factors consistently feed into insurers' statistical evaluations of risk, potentially leading to different premiums for younger or male drivers.
3. Vehicle Type: Notably, there was a focus on the make and model of my vehicle. This suggests that insurers use this data to gauge the overall risk associated with a particular vehicle, considering repair costs and safety features.
4. Annual Mileage: Questions about the number of miles driven annually were prominent. The emphasis on mileage indicates that insurers view higher mileage as a risk factor, likely tied to an increased likelihood of accidents and, subsequently, higher insurance rates.
5. Credit Score: Surprisingly, the intake form sought information on my credit score. Some insurers use credit history as a predictor, linking a higher credit score to responsible financial behavior and, potentially, responsible driving habits.
Question 2.2
Q 2.2.1 Using the support vector machine function ksvm contained in the R package kernlab, find a good classifier for this data...
rm(list = ls())  # clear the workspace
#install.packages("kernlab")  # handled by pacman::p_load below
if (!require(pacman)) install.packages("pacman")
library(pacman)
p_load(kernlab, tinytex)
Load credit_card_data.txt into a data frame.
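The loading chunk itself was lost when this document was rendered; the lines below are a minimal sketch of the presumed call (the bare file name assumes credit_card_data.txt sits in the working directory):

# Read the data with no header, so columns get the default names V1..V11
data <- read.table("credit_card_data.txt", header = FALSE, stringsAsFactors = FALSE)
head(data)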
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
## 1 1 30.83 0.000 1.25 1 0 1 1 202 0 1
## 2 0 58.67 4.460 3.04 1 0 6 1 43 560 1
## 3 0 24.50 0.500 1.50 1 1 0 1 280 824 1
## 4 1 27.83 1.540 3.75 1 0 5 0 100 3 1
## 5 1 20.17 5.625 1.71 1 1 0 1 120 0 1
## 6 1 32.08 4.000 2.50 1 1 0 0 360 0 1
#tail(data)
Call ksvm. The general form of the call is:
ksvm(x, y, type, kernel, ...)
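The call that produced model is not shown above; here is a minimal sketch consistent with the coefficients printed below (C = 100 is an assumed starting value, tuned later in this question):

model <- ksvm(
  x = as.matrix(data[, 1:10]),  # predictors V1-V10
  y = as.factor(data[, 11]),    # response V11
  type = "C-svc",               # classification SVM
  kernel = "vanilladot",        # linear kernel
  C = 100,                      # assumed; see the C search below
  scaled = TRUE
)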
model
#attributes(model)
Calculate a1, ..., am
# Recover the hyperplane coefficients a1..am from the support vectors and their coefficients
a <- colSums(model@xmatrix[[1]] * model@coef[[1]])
a
## V1 V2 V3 V4 V5
## -0.0010701658 -0.0010813457 -0.0016245504 0.0027619336 1.0049404299
## V6 V7 V8 V9 V10
## -0.0027308622 0.0001028633 -0.0005668291 -0.0012766361 0.1063990379
Calculate a0
a0 <- -model@b  # the intercept a0 is the negative of the stored offset b
a0
## [1] 0.08152318
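The long 0/1 vector below is the model's predicted class for every row; a sketch of the call that would produce it:

pred <- predict(model, data[, 1:10])
pred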
## [1] 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
## [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
## [223] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1
## [260] 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
## [297] 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
## [334] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [371] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [408] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [445] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [482] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [519] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
## [556] 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
## [593] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
## [630] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## Levels: 0 1
See what fraction of the model’s predictions match the actual classification
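A sketch of the computation, assuming the pred vector above:

sum(pred == data[, 11]) / nrow(data)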
## [1] 0.8639144
First observation: while an accuracy of 86.39% might be good in many contexts, it is crucial to consider the specific nature of credit card application decisions and any associated risks or consequences (for example, the cost of approving a bad applicant versus rejecting a good one).
#p_load(caret)
# Define a range of C values
C_values <- c(0.01, 0.1, 1, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100)
# Initialize variables to store best accuracy and corresponding C value
best_accuracy <- 0
best_C <- NULL
# Loop over each C value
for (C in C_values){
  # Train the model
  svm_model <- ksvm(
    x = as.matrix(data[, 1:10]),
    y = as.factor(data[, 11]),
    type = "C-svc",
    kernel = "vanilladot",
    C = C,
    scaled = TRUE
  )
  # Score on the full dataset (note: no train/test split here)
  pred_C <- predict(svm_model, data[, 1:10])
  accuracy <- sum(pred_C == data[, 11]) / nrow(data)
  # Track the best C value seen so far
  if (accuracy > best_accuracy) {
    best_accuracy <- accuracy
    best_C <- C
  }
}
# Print results
print(paste("Best C Value:", best_C))
print(paste("Test Accuracy with Best C:", best_accuracy))
Conclusion: Even though my code runs without error, it was not able to find an accuracy better than 86.39%, and the best C that I found is 0.01. I suspect that something might be wrong with my logic, or perhaps the accuracy plateaus because I did not split my dataset into training and test data.
Q 2.2.2 Try other (nonlinear) kernels as well; they can sometimes be useful and might provide better predictions than vanilladot.
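The fitting calls for the two models printed below did not survive rendering; a minimal sketch reusing the Q2.2.1 setup (C = 100 is again an assumption):

# Radial basis function (Gaussian) kernel
rbfdot_model <- ksvm(x = as.matrix(data[, 1:10]), y = as.factor(data[, 11]),
                     type = "C-svc", kernel = "rbfdot", C = 100, scaled = TRUE)
# Polynomial kernel
polydot_model <- ksvm(x = as.matrix(data[, 1:10]), y = as.factor(data[, 11]),
                      type = "C-svc", kernel = "polydot", C = 100, scaled = TRUE)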
rbfdot_model
polydot_model
Observation: the "rbfdot" kernel seems to do a better job, with a training error of 0.053517, while the "polydot" kernel produces the same statistics as "vanilladot".
Q 2.2.3 Using the k-nearest-neighbors classification function kknn contained in the R kknn package, suggest a good value of k, and show how well it classifies the data points in the full data set.
p_load(kknn)
# Pick a test row; i can be any row index in the dataset
i <- 200
# Specify the range of k values to test
k_values <- c(1, 5, 10, 15, 20, 25, 30)
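The selection loop itself is missing from the rendered chunk; one plausible reconstruction (not necessarily my original logic) evaluates each k by leave-one-out prediction over every row, assuming the response column is named V11:

best_k <- NULL
best_k_accuracy <- 0
for (k in k_values) {
  correct <- 0
  for (row in 1:nrow(data)) {
    # Train on all rows except `row`, predict the single held-out row
    knn_fit <- kknn(V11 ~ ., train = data[-row, ], test = data[row, ],
                    k = k, distance = 2, kernel = "optimal", scale = TRUE)
    # Round the continuous fitted value to the nearest class (0 or 1)
    if (round(fitted(knn_fit)) == data[row, 11]) correct <- correct + 1
  }
  accuracy <- correct / nrow(data)
  if (accuracy > best_k_accuracy) {
    best_k_accuracy <- accuracy
    best_k <- k
  }
}
cat("Best k value:", best_k, "\n")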
## Best k value: 1
kknn(formula, train, test, k = 7, distance = 2, kernel = "optimal", scale = TRUE, ...)
• Training Data (data[-i,]): the dataset used to train the model, excluding the observation at index i.
• Test Data (data[i,]): a single observation (index i) for which we want to predict the response variable.
• k (k = 10): the number of nearest neighbors to consider.
• Distance Metric (distance = 2): Euclidean distance.
• Kernel Function (kernel = "optimal"): the "optimal" kernel dynamically adjusts weights based on local data density.
• Scaling (scale = TRUE): the predictor variables are scaled to zero mean and unit variance.
• fitted.values: calling fitted.values(model_knn) returns the model's predicted values of the response variable for the observations passed in as the test set (here, row i), not for the training rows. See the example call after this list.
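For reference, the single-row call that the bullets above describe would look like this (k = 10 as stated in the list; i is the row index chosen earlier):

model_knn <- kknn(V11 ~ ., train = data[-i, ], test = data[i, ],
                  k = 10, distance = 2, kernel = "optimal", scale = TRUE)
fitted.values(model_knn)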
Conclusion: Using the for loop over the range of k values (1, 5, 10, 15, 20, 25, 30), it seems that the best k value is 1 and the best accuracy is 0.4517611. That accuracy seems low for a binary problem (always guessing the majority class would score at least 50%), so my evaluation logic may have an issue similar to the one I suspected in Q2.2.1.