
ISYE 6501 - Homework 1

2024-01-15

Question 2.1 Describe a situation for which a classification model would be appropriate. List some (up to 5) predictors that you might use.

Recently, I found myself in the process of shopping for new auto insurance and filled out an online intake
form on an insurance company’s website. What struck me was the commonality in the predictors used by
every auto insurance company. These factors play a pivotal role in shaping insurance premiums and coverage.
Here are five key aspects that stood out during the form-filling process:

1. Driving History: The questions about past driving records, encompassing accidents, traffic violations,
and claims history, were quite detailed. It was evident that this information significantly influences an
insurance company’s assessment of an applicant’s risk level.
2. Age and Gender: The intake form delved into specifics about age and gender. It appears that these
demographic factors consistently factor into insurers’ statistical evaluations of risk, potentially leading
to varied premiums for younger or male drivers.
3. Vehicle Type: Notably, there was a focus on the make and model of my vehicle. This suggests that
insurers use this data to gauge the overall risk associated with a particular vehicle, considering repair
costs and safety features.
4. Annual Mileage: Questions about the number of miles driven annually were prominent. The emphasis on mileage indicates that insurers view higher mileage as a potential risk factor, likely tied to an increased likelihood of accidents and, subsequently, higher insurance rates.
5. Credit Score: Surprisingly, the intake form sought information on my credit score. It seems that
some insurers use credit history as a predictor, linking a higher credit score to responsible financial
behavior and, potentially, responsible driving habits.
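The five predictors above are narrative observations, but the setup can be sketched as an actual classification model. Below is a minimal, purely hypothetical illustration in base R: all column names and values are invented, and a logistic regression (glm) stands in for whatever model an insurer actually uses.

```r
# Hypothetical sketch: the five predictors above as columns of a small
# synthetic data frame, fed to a logistic-regression classifier (base R).
# Every name and value here is made up for illustration only.
set.seed(42)
n <- 100
insurance <- data.frame(
  prior_claims   = rpois(n, 1),                       # driving history
  age            = sample(18:75, n, replace = TRUE),  # age (gender omitted)
  vehicle_risk   = runif(n),                          # proxy for vehicle type
  annual_mileage = rnorm(n, 12000, 3000),             # annual mileage
  credit_score   = sample(500:850, n, replace = TRUE) # credit score
)
# Synthetic label: more prior claims and mileage -> more likely "high risk"
insurance$high_risk <- as.integer(
  insurance$prior_claims + insurance$annual_mileage / 10000 + rnorm(n) > 2
)
fit  <- glm(high_risk ~ ., data = insurance, family = binomial)
pred <- as.integer(predict(fit, type = "response") > 0.5)
mean(pred == insurance$high_risk)  # training accuracy on the synthetic data
```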

Question 2.2

Q 2.2.1 Using the support vector machine function ksvm contained in the R package kernlab, find a good classifier for this data...

Install and load “kernlab”

rm(list = ls())
#install.packages("kernlab")
if(!require(pacman)) install.packages("pacman")

## Loading required package: pacman

library(pacman)
p_load(kernlab, tinytex)

Load and read credit_card_data.txt to dataframe

file_path <- "~/Georgia Tech - OMSA/ISYE 6501/hw1/data 2.2/credit_card_data.txt"


data <- read.table(file_path, stringsAsFactors = FALSE, header=FALSE)
head(data)

## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
## 1 1 30.83 0.000 1.25 1 0 1 1 202 0 1
## 2 0 58.67 4.460 3.04 1 0 6 1 43 560 1
## 3 0 24.50 0.500 1.50 1 1 0 1 280 824 1
## 4 1 27.83 1.540 3.75 1 0 5 0 100 3 1
## 5 1 20.17 5.625 1.71 1 1 0 1 120 0 1
## 6 1 32.08 4.000 2.50 1 1 0 0 360 0 1

#tail(data)

Call ksvm
ksvm(x, y, type, kernel, ...)

model <- ksvm(as.matrix(data[,1:10]), as.factor(data[,11]),
              type="C-svc", kernel="vanilladot", C=99, scaled=TRUE)

## Setting default kernel parameters

model

## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 99
##
## Linear (vanilla) kernel function.
##
## Number of Support Vectors : 192
##
## Objective Function Value : -17709.08
## Training error : 0.136086

#attributes(model)

Calculate a1...am

• model@xmatrix[[1]] contains the support vectors.
• model@coef[[1]] contains the coefficients associated with the support vectors.
• The expression colSums(model@xmatrix[[1]] * model@coef[[1]]) computes the weighted sum of the support vectors for each feature, which represents the coefficients (excluding the bias term) of the decision function.

a <- colSums(model@xmatrix[[1]] * model@coef[[1]])
a

## V1 V2 V3 V4 V5
## -0.0010701658 -0.0010813457 -0.0016245504 0.0027619336 1.0049404299
## V6 V7 V8 V9 V10
## -0.0027308622 0.0001028633 -0.0005668291 -0.0012766361 0.1063990379

Calculate a0

• model@b contains the bias term of the decision function.
• The expression a0 <- -model@b calculates the negation of the bias term.

a0 <- -model@b
a0

## [1] 0.08152318
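With a and a0 in hand, points can be classified directly via the sign of the linear decision function a · x + a0. A minimal base-R sketch with placeholder values for a, a0, and the data; with the real model, you would plug in the a and a0 computed above and apply them to the scaled predictors, since ksvm was called with scaled = TRUE.

```r
# Sketch of manual classification with a recovered linear decision function.
# X, a, and a0 below are hypothetical placeholders, not the real values.
set.seed(1)
X  <- matrix(rnorm(20), nrow = 4)   # 4 points, 5 features (made up)
a  <- c(0.5, -0.2, 0.1, 0.0, 0.3)   # placeholder coefficients
a0 <- 0.08                          # placeholder intercept
Xs <- scale(X)                      # mirror ksvm's scaled = TRUE
scores      <- as.vector(Xs %*% a) + a0
pred_manual <- ifelse(scores > 0, 1, 0)
pred_manual
```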

See what the model predicts

pred <- predict(model, data[,1:10])
pred

## [1] 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
## [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
## [223] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1
## [260] 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
## [297] 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
## [334] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [371] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [408] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [445] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [482] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [519] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
## [556] 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
## [593] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
## [630] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## Levels: 0 1

See what fraction of the model’s predictions match the actual classification

sum(pred == data[,11]) / nrow(data)

## [1] 0.8639144

First observation: while an accuracy of 86.39% might be good in many contexts, it’s crucial
to consider the specific nature of credit card application decisions and any associated risks or
consequences.
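Since approving a bad credit risk and rejecting a good applicant carry different costs, a confusion table is more informative than accuracy alone. A quick base-R sketch, with short hypothetical predicted/actual vectors standing in for pred and data[,11]:

```r
# Hypothetical predicted and actual labels (stand-ins for pred and data[,11])
actual <- c(1, 1, 0, 0, 1, 0)
pred   <- c(1, 0, 0, 0, 1, 1)
# Cross-tabulate predictions against actuals: off-diagonal cells are the
# two different kinds of errors (false approvals vs. false rejections)
table(Predicted = pred, Actual = actual)
mean(pred == actual)   # overall accuracy hides the asymmetry
```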

#p_load(caret)
# Define a range of C values
C_values <- c(0.01, 0.1, 1, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100)
# Initialize variables to store best accuracy and corresponding C value
best_accuracy <- 0
best_C <- NULL
# Loop over each C value
for (C in C_values){
  # Train the model
  svm_model <- ksvm(
    x = as.matrix(data[, 1:10]),
    y = as.factor(data[, 11]),
    type = "C-svc",
    kernel = "vanilladot",
    C = C,
    scaled = TRUE
  )

  # Evaluate on the full dataset (no separate validation set is used here)
  predictions <- predict(svm_model, newdata = as.matrix(data[, 1:10]))
  accuracy <- mean(predictions == data[, 11])

  # Check if the current C value gives a better accuracy
  if (accuracy > best_accuracy) {
    best_accuracy <- accuracy
    best_C <- C
  }
}

My attempt to use a loop to find a better C:

## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters

# Print results
print(paste("Best C Value:", best_C))

## [1] "Best C Value: 0.01"

print(paste("Test Accuracy with Best C:", best_accuracy))

## [1] "Test Accuracy with Best C: 0.863914373088685"

Conclusion: Even though my code runs without error, it was not able to find a better accuracy than 86.39%, and the best C it found was 0.01. I suspect something is wrong with my logic; most likely, because I did not split my dataset into training and test data, every C is being evaluated on the same data the model was trained on.
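One way to address that suspicion is a random train/validation split, fitting on one part and scoring C on the other. A minimal base-R sketch of the index bookkeeping (654 is the number of rows in this credit data, matching the length of the prediction vector above):

```r
# Sketch: hold out ~20% of rows for validation (indices only)
set.seed(6501)
n <- 654                                  # rows in credit_card_data.txt
train_idx <- sample(seq_len(n), size = round(0.8 * n))
val_idx   <- setdiff(seq_len(n), train_idx)
# Then, for each C, fit ksvm on data[train_idx, ] and compute accuracy on
# data[val_idx, ], keeping the C with the best *validation* accuracy.
length(train_idx)
length(val_idx)
```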

Q 2.2.2 Try other (nonlinear) kernels as well; they can sometimes be useful and might provide
better predictions than vanilladot.

#attempt with rbfdot
rbfdot_model <- ksvm(as.matrix(data[,1:10]), as.factor(data[,11]),
                     type="C-svc", kernel="rbfdot", C=95, scaled=TRUE)

rbfdot_model

## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 95
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.101540052980955
##
## Number of Support Vectors : 246
##
## Objective Function Value : -8230.715
## Training error : 0.044343

#attempt with polydot
polydot_model <- ksvm(as.matrix(data[,1:10]), as.factor(data[,11]),
                      type="C-svc", kernel="polydot", C=95, scaled=TRUE)

## Setting default kernel parameters

polydot_model

## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 95
##
## Polynomial kernel function.
## Hyperparameters : degree = 1 scale = 1 offset = 1
##
## Number of Support Vectors : 189
##
## Objective Function Value : -16993.56
## Training error : 0.136086

Observation: the “rbfdot” kernel seems to do a better job, with a training error of 0.044343, while the “polydot” kernel (with its default degree = 1) produces essentially the same statistics as “vanilladot”, since a degree-1 polynomial kernel is effectively linear.

Q 2.2.3 Using the k-nearest-neighbors classification function kknn contained in the R kknn package, suggest a good value of k, and show how well it classifies the data points in the full data set.

p_load(kknn)
# Pick an index i; it can be any row in the dataset
i <- 200
# Specify the range of k values to test
k_values <- c(1, 5, 10, 15, 20, 25, 30)

# Initialize variables to store best accuracy and corresponding k value
best_accuracy <- 0
best_k <- NULL

# Iterate over each k value
for (k in k_values) {
  # Train the model
  model_knn <- kknn(
    V11 ~ V1 + V2 + V3 + V4 + V5 + V6 + V7 + V8 + V9 + V10,
    data[-i, ],
    data[i, ],
    k = k,
    distance = 2,
    kernel = "optimal",
    scale = TRUE
  )

  # Obtain the predicted value for the held-out observation
  predicted_values <- fitted.values(model_knn)

  # Evaluate model performance (note: this compares a single prediction
  # against all 653 training responses, recycling the comparison)
  accuracy <- mean(predicted_values == data[-i, ]$V11)

  # Check if the current k value gives a better accuracy
  if (accuracy > best_accuracy) {
    best_accuracy <- accuracy
    best_k <- k
  }
}

# Print the best k value and corresponding accuracy
cat("Best k value:", best_k, "\n")

## Best k value: 1

cat("Best Accuracy:", best_accuracy, "\n")

## Best Accuracy: 0.4517611

kknn(formula, train, test, k = 7, distance = 2, kernel = "optimal", scale = TRUE, ...)

• Response Variable (V11): the variable you are trying to predict.
• Predictor Variables (V1 to V10): the variables used as features for prediction.
• Training Data (data[-i, ]): the dataset used for training the model, excluding the observation at index i.
• Test Data (data[i, ]): a single observation (index i) for which you want to predict the response variable.
• k: the number of nearest neighbors to consider (varied over k_values in the loop above).
• Distance Metric (distance = 2): Euclidean distance is used as the distance metric.
• Kernel Function (kernel = "optimal"): the “optimal” kernel weights neighbors according to their distance rather than treating them equally.
• Scaling (scale = TRUE): the predictor variables are scaled to have zero mean and unit variance.
• The fitted.values method returns the predicted value(s) of the response variable for the test observation(s); here, fitted.values(model_knn) gives the prediction for the single held-out row i.

Conclusion: Using the for loop with the range of k values (1, 5, 10, 15, 20, 25, 30), it seems that the best k value is 1 and the best accuracy is 0.4517611, which is suspiciously low and suggests a problem with how the accuracy is computed.
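Two plausible causes of the low accuracy: V11 is numeric in the formula, so kknn performs regression and returns continuous fitted values that rarely equal 0 or 1 exactly; and the single test prediction is compared against all 653 training responses, recycling the comparison. A minimal sketch of the thresholding part of the fix, with hypothetical stand-in values for kknn's fitted values:

```r
# Hypothetical continuous fitted values (stand-ins for fitted.values(model_knn))
fitted_vals <- c(0.2, 0.8, 0.6, 0.1)
actual      <- c(0, 1, 0, 0)
# Threshold at 0.5 before comparing to the 0/1 labels
pred_class <- as.integer(round(fitted_vals))
accuracy   <- mean(pred_class == actual)
accuracy   # -> 0.75 on this toy example
# With the real data, loop i over every row (leave-one-out), predict row i
# from data[-i, ], round the fitted value, and average the matches over i.
```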
