Week-2 NK

Uploaded by Nagaraj Kulkarni

Week-2 Homework submission

Question 3.1
Using the same data set (credit_card_data.txt or credit_card_data-headers.txt) as in
Question 2.2, use the ksvm or kknn function to find a good classifier:
(a) using cross-validation (do this for the k-nearest-neighbors model; SVM is
optional); and
(b) splitting the data

Answers:

(a)

Using k-nearest neighbors model

The procedure followed to answer this section is as follows:

1. Divide the entire data set into two parts: (1) test data (20% of the data) and
(2) training data (80% of the data).
2. Further divide the training data into k (= 10) groups to build kknn models using
cross-validation. In every iteration, 9 of the 10 groups of training data are
used to build the model, and the remaining group is used to validate it.
3. To get the model coefficients, estimate the kknn model on the entire training
data.
4. Estimate the accuracy of the model on the test data.
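Step 2 can be sketched in base R. The fold-index construction below is a hypothetical illustration (the submitted code uses caret's trainControl instead), with n set to roughly 80% of the 654 credit-card records:

```r
# Hypothetical sketch of assigning training rows to 10 cross-validation
# folds (the submitted code uses caret::trainControl for this instead)
set.seed(1)
n <- 523                                      # ~80% of the 654 rows
folds <- sample(rep(1:10, length.out = n))    # each row gets a fold label 1..10
# In iteration i, rows with folds != i build the model and
# rows with folds == i validate it
table(folds)                                  # folds are balanced to within one row
```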

Results are as follows:

Table-1: Results of kknn models with 10-fold cross-validation

R-output                                Interpretation
Minimal misclassification: 0.1335878    Average accuracy of the 10 models = 1 - 0.1335878 = 86.64%
kernel: optimal                         Kernel used to estimate the model: optimal
Best k: 5                               The highest accuracy was obtained for k = 5
Accuracy on test data                   79.23%
Figure-1: Model performance* for different values of k

*Note that the y-axis is misclassification (i.e., 1 - accuracy).
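The relationship between the reported misclassification rate and accuracy is just the complement, as a quick check shows:

```r
# Accuracy implied by the minimal misclassification reported in Table-1
miss <- 0.1335878
acc  <- 1 - miss     # 0.8664122, i.e. about 86.64%
acc
```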

Using SVM model

The procedure followed to fit this model is similar to that in the previous section;
however, in step 4, I fit an SVM model instead of a kknn model. Results are as follows:

Table-3: Results of SVM models with 10-fold cross-validation

Trial-1
SV type: C-svc (classification)
C: 1
kernel: vanilladot
Classification equation: -0.04961577 - 1.719443e-05*A1 + 9.750605e-06*A2 + 1.676985e-05*A3 + 3.829653e-05*A8 + 9.996934e-01*A9 - 2.235538e-05*A10 + 2.377781e-05*A11 + 8.816228e-06*A12 - 4.663454e-05*A14 + 8.935907e-05*A15
Accuracy on training set: 87.78%
Accuracy on test set: 80%

Trial-2
SV type: C-svc (classification)
C: 1000
kernel: rbfdot
Classification equation: 0.2520074 - 36.67765*A1 - 36.94432*A2 - 64.57669*A3 + 70.25796*A8 + 118.83971*A9 - 33.75022*A10 + 93.99805*A11 - 33.23766*A12 - 132.42911*A14 + 66.11733*A15
Accuracy on training set: 98.85%
Accuracy on test set: 75.38%

These results illustrate the danger of over-fitting: the model with kernel = rbfdot
has very high accuracy on the training data set but poor accuracy on the test data.
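The same over-fitting pattern can be illustrated generically in base R with synthetic data (this is an illustration only, not the credit-card data or the SVM models above):

```r
# Generic over-fitting illustration (base R, synthetic data): a very
# flexible model fits the training data better than a simple one, but
# that advantage often does not carry over to held-out data
set.seed(1)
x <- runif(60)
y <- x + rnorm(60, sd = 0.2)
train_idx <- 1:40
test_idx  <- 41:60

simple  <- lm(y ~ x, subset = train_idx)             # low-flexibility model
complex <- lm(y ~ poly(x, 15), subset = train_idx)   # high-flexibility model

rmse <- function(m, idx) sqrt(mean((y[idx] - predict(m, data.frame(x = x[idx])))^2))
rmse(simple, train_idx); rmse(complex, train_idx)    # complex wins on training data
rmse(simple, test_idx);  rmse(complex, test_idx)     # ...but often loses on test data
```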

(b)

Splitting the data

The procedure followed to answer this section is as follows:

1. Randomly split the data into three groups:
   a. Training data: 60% of the data
   b. Validation data: 20% of the data
   c. Test data: 20% of the data
2. Build SVM and kknn models using the training data, note the performance of each
   model on the validation set, and choose the best model. Then estimate the quality
   of the chosen model on the test data.

The split of the data is as follows:

a. Training data: 491
b. Validation data: 130
c. Test data: 130
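The 60/20/20 split can also be sketched without caret. The version below is a hypothetical base-R illustration (the submitted code uses caret::createDataPartition, which stratifies on R1, so the exact counts differ):

```r
# Hypothetical base-R sketch of a 60/20/20 split of the 654 credit-card
# rows (the submission uses caret::createDataPartition instead)
set.seed(1)
n <- 654
idx <- sample(n)                                    # random permutation of row indices
train_idx <- idx[1:round(0.60 * n)]
valid_idx <- idx[(round(0.60 * n) + 1):round(0.80 * n)]
test_idx  <- idx[(round(0.80 * n) + 1):n]
c(length(train_idx), length(valid_idx), length(test_idx))
```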

Table-4: Estimating various kknn models

Model No.   Parameters                                  Accuracy on validation data set
1           kernel: rectangular, k = 20, distance = 1   88.46%
2           kernel: gaussian, k = 29, distance = 1      93.07%
3           kernel: triangular, k = 52, distance = 2    92.31%

(The model-fit plots for the training data are not reproduced here.)

Table-5: Estimating various SVM models for comparison

Model No.   Parameters                        Accuracy (validation, test)
1           kernel = "rbfdot", C = 1000       93.07%, 51.53%
2           kernel = "vanilladot", C = 100    85.38%, 51.53%
3           kernel = "splinedot", C = 1000    93.07%, 51.53%

Table-6: Comparison of models based on accuracy on the validation data set

        1        2        3
knn     88.46%   93.07%   92.31%
svm     93.07%   85.38%   93.07%

From Table-6, we can see that knn-2, svm-1, and svm-3 have the same performance on the
validation set. Based on this table alone, we could choose any of these three models.
Although I understand that the model should be chosen only based on its performance on
the validation data, let us examine the performance of these models on the test data.

Table-7: Comparison of models based on accuracy on the test data set

        1        2        3
knn     85.38%   87.69%   88.46%
svm     51.53%   51.53%   51.53%

Based on the performance of the knn-2, svm-1, and svm-3 models on the test data, I would
pick the knn-2 model. Its performance is consistent across the validation data set and
the test data set.

So my model would be as follows:

kernel: gaussian, k= 29, distance: 1

Validation set accuracy= 93.07%

Test set accuracy=87.69%

Question 4.1
Describe a situation or problem from your job, everyday life, current events, etc., for which a
clustering model would be appropriate. List some (up to 5) predictors that you might use.

Answer:

In most developing countries, providing subsidies to specific groups of people (usually
low-income or underprivileged) is a common practice. The challenge for the government in
these countries is usually to identify who is eligible for a subsidy and who is not.
This challenge is tougher in countries that have a large middle-class population.

One can apply a clustering model to identify whether certain features are common to the
group of people who deserve subsidies. The predictors I would consider are as follows:

1. Annual income of the family
2. Total assets of the family
3. Average tax paid by the family during the last three years
4. Annual credit card bill paid (as an indicator of expenditure)
5. Location of the house in which the family stays (PIN code)
Question 4.2
The iris data set iris.txt contains 150 data points, each with four predictor variables and one
categorical response. The predictors are the width and length of the sepal and petal of
flowers and the response is the type of flower.
The response values are only given to see how well a specific method performed and should
not be used to build the model.

Use the R function kmeans to cluster the points as well as possible. Report the best
combination of predictors, your suggested value of k, and how well your best clustering
predicts flower type.

Answer:

Table-8: Visualisation of data

From the data visualisation, one can expect that a combination of petal length and petal
width could cluster the flowers most accurately. However, let us try different
combinations of variables. After executing the k-means algorithm, the results obtained
are as follows.
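As a sanity check on the choice of k = 3, one can also look at the total within-cluster sum of squares for several values of k. The sketch below is a hypothetical elbow check using R's built-in iris data (the file-based iris.txt should contain the same standard measurements):

```r
# Hypothetical elbow check: total within-cluster SS for k = 1..6 on the
# petal measurements; the curve typically flattens after k = 3
set.seed(20)
wss <- sapply(1:6, function(k) kmeans(datasets::iris[, 3:4], k, nstart = 20)$tot.withinss)
wss
```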

Table-9: Summary of various k-means clustering models

Model No.   Variables                                              Cluster sizes   Accuracy
1           Petal Length, Petal Width                              50, 52, 48      94.3%
2           Sepal Length, Sepal Width, Petal Length, Petal Width   38, 62, 50      88.4%
3           Sepal Width, Petal Length, Petal Width                 53, 50, 47      91.7%
4           Sepal Length, Petal Length, Petal Width                38, 62, 50      90.3%

Based on the Accuracy, I choose Model-1.

Table-10: Model-1 parameters

Cluster No.   Petal Length   Petal Width   Within-cluster sum of squares
1             1.462000       0.246000      2.02200
2             4.269231       1.342308      13.05769
3             5.595833       2.037500      16.29167

Table-11: Classification table for Model-1

            Setosa   Versicolor   Virginica
Cluster-1   50       0            0
Cluster-2   0        48           4
Cluster-3   0        2            46

Code

For 3.1 (a)

#install and load required libraries and packages
install.packages('e1071', dependencies=TRUE)   # needed by caret for some models

library(kernlab)

library(kknn)

library(caret)

library(rpart)

#read data

ccdata <- read.delim("N:/ISyE 6501/R/ccdata.txt")

ccdata$R1 =as.factor(ccdata$R1)

#split data into test, and training datasets

set.seed(1)

inTraining <- createDataPartition(ccdata$R1, p = .80, list = FALSE) #p=0.80 means 80% of data is
for training and 20% is for testing

training <- ccdata[ inTraining,]

testing <- ccdata[-inTraining,]

#model kknn (train.kknn does not accept caret's trControl argument, so use
#caret::train with method = "kknn" for the 10-fold cross-validation described above)

fit_control <- trainControl(method="cv", number=10, savePredictions = TRUE)

model <- train(R1 ~ ., data = training, method = "kknn", trControl = fit_control)

model #summary output

prediction <- predict(model, newdata = testing)   # predict on the test rows

CM <- table(testing[,11], prediction)             # column 11 is the response R1

accuracy <- (sum(diag(CM)))/sum(CM)

accuracy

#model SVM (ksvm has no trControl argument; its 'cross' argument performs
#10-fold cross-validation; set C = 1 with vanilladot or C = 1000 with rbfdot
#for the two trials in Table-3)

model <- ksvm(as.matrix(training[,1:10]), training[,11], type="C-svc",
              kernel="rbfdot", C=1000, cross=10)

model #summary output

pred <- predict(model, as.matrix(testing[,1:10]))

pred

sum(pred == testing[,11]) / nrow(testing)   # accuracy on the test set

For 3.1 (b)


#creating training data, validation data, and test data

set.seed(1)

level_1 <- createDataPartition(ccdata$R1, p = .80, list = FALSE)

level_2 <- ccdata[ level_1,]             # 80%: training + validation

test <- ccdata[-level_1,]                # 20%: test data

level_3 <- createDataPartition(level_2$R1, p = .75, list = FALSE)   # 75% of 80% = 60%

train_data <- level_2[ level_3,]

validation_data <- level_2[-level_3,]    # 25% of 80% = 20%
# fit kknn model on training data (kernel and distance are varied to obtain
# the three models in Table-4; rectangular with distance = 1 is shown here)

model <- train.kknn(R1~., train_data, kmax = 100, ks = NULL, distance = 1,
                    kernel = "rectangular", ykernel = NULL, scale = TRUE)

#get model output

plot(model)

model

#evaluate model on validation data set (R1 is a factor, so predict() returns
#class labels directly; no rounding is needed)

validation_pred <- predict(model, validation_data[, -11])

#get confusion matrix

CM <- table(validation_data[, 11], validation_pred)

CM

#compute accuracy

accuracy <- (sum(diag(CM)))/sum(CM)

accuracy

#evaluate model on test data set

test_pred <- predict(model, test[, -11])

#get confusion matrix

CM <- table(test[, 11], test_pred)

CM

#compute accuracy

accuracy <- (sum(diag(CM)))/sum(CM)

accuracy

# fit SVM models on training data

model <- ksvm(as.matrix(train_data[,1:10]), train_data[,11], type="C-svc",
              kernel="splinedot", C=1000, scaled=TRUE)

#find accuracy on validation data set

pred_valid <- predict(model, as.matrix(validation_data[,1:10]))

sum(pred_valid == validation_data[,11]) / nrow(validation_data)

#find accuracy on test data set

pred_test <- predict(model, as.matrix(test[,1:10]))

sum(pred_test == test[,11]) / nrow(test)

For 4.2 k-means clustering


#import data

iris <- read.csv("N:/ISyE 6501/R/iris.txt", sep="")

View(iris)

#plot clusters

library(ggplot2)

ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) + geom_point()

#run clustering algorithm

set.seed(20)

irisCluster <- kmeans(iris[, 3:4], 3, nstart = 20)

#get output

irisCluster
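The accuracy figures in Table-9 require mapping each cluster to a flower type. A minimal sketch of that step (base R, using the built-in iris data in place of iris.txt, and mapping each cluster to its majority species, which may differ from the exact accuracy definition used above) is:

```r
# Map each k-means cluster to its majority species and compute accuracy
set.seed(20)
cl  <- kmeans(datasets::iris[, 3:4], 3, nstart = 20)
tab <- table(cl$cluster, datasets::iris$Species)
tab                                            # cluster-vs-species counts
accuracy <- sum(apply(tab, 1, max)) / sum(tab) # fraction in each cluster's majority species
accuracy
```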
