Week-2 NK
Question 3.1
Using the same data set (credit_card_data.txt or credit_card_data-headers.txt) as in
Question 2.2, use the ksvm or kknn function to find a good classifier:
(a) using cross-validation (do this for the k-nearest-neighbors model; SVM is
optional); and
(b) splitting the data
Answers:
(a)
1. Divide the entire data set into two parts: (1) Test data (20% of the data), and
(2) Training data (80% of the data).
2. Further divide the training data into k (= 10) groups to build kknn models using
cross-validation. In every iteration, 9 of the 10 groups are used to build the
model, and the remaining group is used for validation.
3. Fit the final kknn model on the entire training data, using the parameters
chosen by cross-validation.
4. Estimate the accuracy of the model on Test data.
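The four steps above can be sketched in R. This is a minimal sketch, assuming the headers file is in the working directory; caret's train with method = "kknn" performs the 10-fold cross-validation and refits the winning model on the full training data, so the variable names (idx, fit, etc.) are illustrative rather than the exact code used:

```r
# assumed setup: kknn + caret installed, data file in working directory
library(caret)
library(kknn)

set.seed(1)
ccdata <- read.table("credit_card_data-headers.txt", header = TRUE)
ccdata$R1 <- as.factor(ccdata$R1)

# Step 1: 80% training / 20% test split
idx <- createDataPartition(ccdata$R1, p = 0.80, list = FALSE)
train_data <- ccdata[idx, ]
test_data  <- ccdata[-idx, ]

# Step 2: 10-fold cross-validation on the training data to tune kknn
ctrl <- trainControl(method = "cv", number = 10)
fit <- train(R1 ~ ., data = train_data, method = "kknn",
             trControl = ctrl, preProcess = c("center", "scale"))

# Steps 3-4: caret refits the chosen model on all training data;
# estimate its accuracy on the held-out test data
pred <- predict(fit, newdata = test_data)
mean(pred == test_data$R1)
```

Cross-validating only within the training split, and touching the test data once at the end, is what keeps the final accuracy estimate honest.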
*Note: the y-axis of the plot is the misclassification rate, i.e. 1 − accuracy.
The procedure to fit this model is similar to that in the previous section; however,
in step 4 I fit an SVM model instead of a kknn model. The results are as follows:
These results illustrate the disadvantage of over-fitting the model: the model with
kernel = rbfdot has very high accuracy on the training data but poor accuracy on the test data.
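The over-fitting effect can be reproduced with a short sketch. This assumes an 80/20 split into train_data and test_data as in part (a); the C value is illustrative, chosen large to exaggerate the gap between training and test accuracy:

```r
library(kernlab)

# RBF-kernel SVM with a large C fits the training data very closely
model_rbf <- ksvm(as.matrix(train_data[, 1:10]), train_data[, 11],
                  type = "C-svc", kernel = "rbfdot", C = 1000, scaled = TRUE)

# training accuracy (typically near 100%) vs. test accuracy (noticeably lower)
mean(predict(model_rbf, train_data[, 1:10]) == train_data[, 11])
mean(predict(model_rbf, test_data[, 1:10])  == test_data[, 11])
```

A large gap between the two numbers is the signature of over-fitting: the flexible RBF kernel memorizes training points rather than learning a boundary that generalizes.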
(b)
1. Divide the entire data set into three parts:
a. Training data: 60% of the data
b. Validation data: 20% of the data
c. Test data: 20% of the data
2. Follow the sequence in the following flowchart, i.e. build SVM and kknn models using
the training data, note the performance of each model on the validation set, and
choose the best model. Then estimate the quality of the chosen model on the test data.
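The split-and-select procedure can be sketched as follows. This is a sketch, assuming ccdata is loaded as in the code section; the 60/20/20 proportions follow the list above, the kernel and k for the kknn candidate come from Table-6, and the SVM kernel and C value are illustrative:

```r
library(kknn)
library(kernlab)

set.seed(1)
n <- nrow(ccdata)
idx <- sample(n)
train_data <- ccdata[idx[1:floor(0.6 * n)], ]
val_data   <- ccdata[idx[(floor(0.6 * n) + 1):floor(0.8 * n)], ]
test_data  <- ccdata[idx[(floor(0.8 * n) + 1):n], ]

# build candidate models on training data, score each on validation data
knn_model <- kknn(R1 ~ ., train_data, val_data,
                  k = 29, kernel = "gaussian", scale = TRUE)
knn_val_acc <- mean(fitted(knn_model) == val_data$R1)

svm_model <- ksvm(as.matrix(train_data[, 1:10]), train_data[, 11],
                  type = "C-svc", kernel = "vanilladot", C = 100, scaled = TRUE)
svm_val_acc <- mean(predict(svm_model, val_data[, 1:10]) == val_data[, 11])

# choose the model with the best validation accuracy,
# then report that model's accuracy on the untouched test data
best <- kknn(R1 ~ ., train_data, test_data,
             k = 29, kernel = "gaussian", scale = TRUE)
mean(fitted(best) == test_data$R1)
```

Only the winning model is ever evaluated on the test set, so the test accuracy remains an unbiased estimate of real-world performance.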
[Table-6, partially recovered from extraction — kknn candidate models on the validation set:]
Model   kernel       k    distance   Validation accuracy
knn-2   gaussian     29   1          93.07%
knn-3   triangular   52   1          92.31%
(The remaining rows of Table-6 were lost.)
From Table-6, we can see that knn-2, svm-1, and svm-3 have the same performance on the
validation set, so based on this table alone we could choose any of these three models.
Although the model should, strictly speaking, be chosen only on validation performance,
let us also examine how these models perform on the test data.
Based on the performance of the knn-2, svm-1, and svm-3 models on the test data, I would
pick the knn-2 model: its performance is consistent across the validation and test data sets.
Question 4.1
Describe a situation or problem from your job, everyday life, current events, etc., for which a
clustering model would be appropriate. List some (up to 5) predictors that you might use.
Answer:
In most developing countries, providing subsidies to specific groups of people (usually
low-income or underprivileged) is common practice. The challenge for the government in
these countries is to identify who is, and who is not, eligible for a subsidy, and this
is harder in countries with a large middle-class population. One can apply a clustering
model to identify whether certain features are common to the groups of people who
deserve subsidies. The predictors I would consider are as follows:
Question 4.2
The iris data set iris.txt contains 150 data points, each with four predictor variables and one
categorical response. The predictors are the width and length of the sepal and petal of
flowers and the response is the type of flower.
The response values are only given to see how well a specific method performed and should
not be used to build the model.
Use the R function kmeans to cluster the points as well as possible. Report the best
combination of predictors, your suggested value of k, and how well your best clustering
predicts flower type.
Answer:
From the data visualisation, one can expect that the combination of petal length and
petal width would cluster the flowers most accurately. However, let us try different
combinations of variables. After executing the k-means algorithm, the results obtained
are as follows.
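The comparison across variable combinations can be sketched as below. This assumes iris.txt is in the working directory with the standard iris column names, and scores each clustering by matching every cluster to its majority species (a simple proxy for how well clusters predict flower type):

```r
set.seed(20)
iris_data <- read.table("iris.txt", header = TRUE)

# candidate predictor combinations
combos <- list(c("Petal.Length", "Petal.Width"),
               c("Sepal.Length", "Sepal.Width"),
               c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"))

for (vars in combos) {
  km <- kmeans(scale(iris_data[, vars]), centers = 3, nstart = 20)
  # match each cluster to its majority species, then count correct points
  tab <- table(km$cluster, iris_data$Species)
  acc <- sum(apply(tab, 1, max)) / nrow(iris_data)
  cat(paste(vars, collapse = " + "), ":", round(acc, 3), "\n")
}
```

Scaling the variables before kmeans keeps any one measurement from dominating the Euclidean distance; k = 3 matches the three known species.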
Code
For 3.1 (a)
#install required libraries and packages
install.packages('e1071', dependencies = TRUE)
library(kernlab)
library(kknn)
library(caret)
library(rpart)
#read data
ccdata <- read.table("credit_card_data-headers.txt", header = TRUE)
ccdata$R1 <- as.factor(ccdata$R1)
set.seed(1)
inTraining <- createDataPartition(ccdata$R1, p = 0.80, list = FALSE) #80% of data for training, 20% for testing
training <- ccdata[inTraining, ]
testing  <- ccdata[-inTraining, ]
#model kknn (10-fold cross-validation via caret)
ctrl <- trainControl(method = "cv", number = 10)
model <- train(R1 ~ ., data = training, method = "kknn", trControl = ctrl)
prediction <- predict(model, newdata = testing)
#compute accuracy on test data
accuracy <- mean(prediction == testing$R1)
accuracy
#model SVM
model_svm <- ksvm(as.matrix(training[, 1:10]), training[, 11], type = "C-svc", kernel = "rbfdot", C = 100, scaled = TRUE)
#summary output
model_svm
pred <- predict(model_svm, testing[, 1:10])
mean(pred == testing$R1)
set.seed(1)
# fit kknn model on training data (train_data/val_data/test_data from the 60/20/20 split)
model <- train.kknn(R1 ~ ., data = train_data, kmax = 100, scale = TRUE)
plot(model)
model
# confusion matrix on validation data
pred_val <- predict(model, val_data[, 1:10])
CM <- table(pred_val, val_data$R1)
CM
#compute accuracy
accuracy <- sum(diag(CM)) / sum(CM)
accuracy
# confusion matrix on test data
pred_test <- predict(model, test_data[, 1:10])
CM <- table(pred_test, test_data$R1)
CM
#compute accuracy
accuracy <- sum(diag(CM)) / sum(CM)
accuracy
model <- ksvm(as.matrix(train_data[, 1:10]), train_data[, 11], type = "C-svc", kernel = "splinedot", C = 1000, scaled = TRUE)
#read iris data
iris <- read.table("iris.txt", header = TRUE)
View(iris)
library(ggplot2)
set.seed(20)
# k-means on petal length and petal width with k = 3
irisCluster <- kmeans(iris[, c("Petal.Length", "Petal.Width")], centers = 3, nstart = 20)
#plot clusters
ggplot(iris, aes(Petal.Length, Petal.Width, colour = factor(irisCluster$cluster))) + geom_point()
#get output: compare clusters with the actual species
irisCluster
table(irisCluster$cluster, iris$Species)