Lab 6
Yash Rathod
2022-12-08
library(tidyr)
library(caret)
[Truncated output: only the tail of the summary(train_main) call survived extraction, showing the Fare column (3rd Qu. 31.00, Max. 512.33).]
summary(test_main)
Summarizing the results, we can see that there are 177 NAs in the Age column across the train and test datasets.
To visualize the distribution of this column, let's plot a histogram.
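The plotting call itself is not shown; a minimal sketch that would produce a figure like the one below, assuming base R graphics and the train_main object from above:

# Histogram of passenger age (NAs are silently dropped by hist)
hist(train_main$Age,
     main = "Age Distribution",
     xlab = "Age", ylab = "Frequency")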
[Figure: histogram titled "Age Distribution"; x-axis Age (0-80), y-axis Frequency.]
We can see that the data is approximately normally distributed but slightly right-skewed, so we can replace the NA values with the median of the train data. We will replace the NAs with 28, the median Age in the train dataset.
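A minimal sketch of the imputation step, assuming (as the text implies) that both train_main and test_main are filled with the train median:

# Median of Age in the train data, ignoring NAs (28 here)
age_median <- median(train_main$Age, na.rm = TRUE)
train_main$Age[is.na(train_main$Age)] <- age_median
test_main$Age[is.na(test_main$Age)] <- age_median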
set.seed(123)
# 80/20 train/test split, stratified on Survived
split = createDataPartition(train_main$Survived, p = 0.8, list = FALSE)
train = train_main[split, ]
test = train_main[-split, ]
# Rename the Survived column (column 2) to TARGET in both splits
colnames(train)[2] <- "TARGET"
colnames(test)[2] <- "TARGET"
names(train)
4. Drop features that you think are not predictive and explain why
they are being dropped
library(dplyr)
##
## Attaching package: 'dplyr'
The features retained for modeling are:

Pclass: the passenger class (1 = first class, 2 = second class, 3 = third class).
Sex: the passenger's gender (male or female).
Age: the passenger's age.
SibSp: the number of siblings or spouses the passenger was traveling with.
Parch: the number of parents or children the passenger was traveling with.
Fare: the fare paid by the passenger.
Embarked: the port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).

The features dropped are:

PassengerId: the unique ID of the passenger.
Name: the passenger's name.
Cabin: the passenger's cabin number.
Ticket: the passenger's ticket number.

These columns are dropped because they are individualized, unique identifiers for each passenger and therefore contribute little to predicting survival.
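A minimal sketch of the drop step using dplyr::select, assuming train_selected and test_selected are the object names used later in this report:

# Drop identifier-like columns that carry no predictive signal
train_selected <- train %>% select(-PassengerId, -Name, -Cabin, -Ticket)
test_selected  <- test  %>% select(-PassengerId, -Name, -Cabin, -Ticket)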
5. Transform selected categorical features with Dummy values
We transformed two categorical variables, Sex and Embarked. Encoding Embarked initially yields four levels in the test_selected dataset because the column contains an empty class "" in addition to C, Q, and S, so we first need to remove that empty class.
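One minimal way to do this (an assumption; the exact approach is not shown above) is to drop the affected rows and the now-unused factor level:

# Remove rows whose Embarked is the empty string, then drop the unused level
test_selected <- test_selected[test_selected$Embarked != "", ]
test_selected$Embarked <- droplevels(factor(test_selected$Embarked))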
library(caret)
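The model-fitting step itself is not shown; it can be reconstructed from the Call line of summary(model) further below:

# Logistic regression of TARGET on all remaining predictors
model <- glm(TARGET ~ ., family = "binomial", data = train_selected)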
# Predicted survival probabilities on the held-out split, thresholded at 0.5
predictions = predict(model, newdata = test_selected, type = "response")
predictions_binary = ifelse(predictions > 0.5, 1, 0)
# Proportion of correct predictions
accuracy <- mean(predictions_binary == test_selected$TARGET)
print(accuracy)
## [1] 0.8248588
Accuracy: 82.48%.
Note: I was not able to use the accuracy function of the caret library because it was throwing errors on my system. I could have dug deeper into it, but due to time restrictions I calculated the accuracy manually.
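For reference, the caret route would look like the sketch below. A common cause of such errors is that confusionMatrix() expects factors with matching levels (this is an assumption about the failure, not a diagnosis):

# caret's confusionMatrix reports Accuracy among its overall statistics
cm <- confusionMatrix(factor(predictions_binary, levels = c(0, 1)),
                      factor(test_selected$TARGET, levels = c(0, 1)))
cm$overall["Accuracy"]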
summary(model)
##
## Call:
## glm(formula = TARGET ~ ., family = "binomial", data = train_selected)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5063 -0.6270 -0.4479 0.6743 2.3819
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.796595 0.603742 7.945 1.95e-15 ***
## Pclass -0.979673 0.156268 -6.269 3.63e-10 ***
## Sexmale -2.635261 0.220833 -11.933 < 2e-16 ***
## Age -0.033096 0.008459 -3.913 9.13e-05 ***
## Fare 0.001880 0.002480 0.758 0.4484
## SibSp -0.291558 0.121906 -2.392 0.0168 *
## Parch -0.061550 0.124431 -0.495 0.6208
## EmbarkedQ -0.231530 0.410737 -0.564 0.5730
## EmbarkedS -0.524459 0.254227 -2.063 0.0391 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 940.00 on 711 degrees of freedom
## Residual deviance: 644.21 on 703 degrees of freedom
## AIC: 662.21
##
## Number of Fisher Scoring iterations: 5
From the above results we can see that Fare and Parch are not significant contributors to the model, and neither is the Q level of Embarked (though EmbarkedS is significant at the 0.05 level).
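The cross-tabulation below can be produced with base R's table(); a sketch assuming the objects from the accuracy calculation above:

# Rows: predicted class, columns: actual class
table(predictions_binary, test_selected$TARGET)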
##
## predictions_binary 0 1
## 0 89 18
## 1 13 57
The model is doing fairly well, with an accuracy of about 82%. Above, I have printed the confusion matrix, which shows the model's class-wise predictions and lets us count the false positives (type I errors). For the non-survived class it made 18 incorrect predictions out of 107, whereas for the survived class it produced 13 false positives out of 70.
After Email Instructions
Generating the Survived column for test dataset
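A minimal sketch of this step, assuming test_main was prepared the same way as the training data (Age imputed, identifier columns dropped); test_main_prepped is a hypothetical name for that prepared data:

# Predict survival probabilities on the held-out test set, threshold at 0.5
test_probs <- predict(model, newdata = test_main_prepped, type = "response")
test_main_prepped$Survived <- ifelse(test_probs > 0.5, 1, 0)
table(test_main_prepped$Survived)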
##
## 0 1
## 197 134
According to the results, our model predicts that, out of 331 people, 134 survived and 197 did not.