
Lab 6

Yash Rathod

2022-12-08

1. Upload Titanic dataset

library(tidyr)
library(caret)

## Warning: package 'caret' was built under R version 4.2.2

## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 4.2.2

## Loading required package: lattice

test_main <- read.csv("test.csv")


train_main <- read.csv("train.csv")
summary(train_main)

## PassengerId Survived Pclass Name
## Min. : 1.0 Min. :0.0000 Min. :1.000 Length:891
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median :446.0 Median :0.0000 Median :3.000 Mode :character
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Sex Age SibSp Parch
## Length:891 Min. : 0.42 Min. :0.000 Min. :0.0000
## Class :character 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000
## Mode :character Median :28.00 Median :0.000 Median :0.0000
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## Ticket Fare Cabin Embarked
## Length:891 Min. : 0.00 Length:891 Length:891
## Class :character 1st Qu.: 7.91 Class :character Class :character
## Mode :character Median : 14.45 Mode :character Mode :character
## Mean : 32.20

## 3rd Qu.: 31.00
## Max. :512.33
##

summary(test_main)

## PassengerId Pclass Name Sex
## Min. : 892.0 Min. :1.000 Length:418 Length:418
## 1st Qu.: 996.2 1st Qu.:1.000 Class :character Class :character
## Median :1100.5 Median :3.000 Mode :character Mode :character
## Mean :1100.5 Mean :2.266
## 3rd Qu.:1204.8 3rd Qu.:3.000
## Max. :1309.0 Max. :3.000
##
## Age SibSp Parch Ticket
## Min. : 0.17 Min. :0.0000 Min. :0.0000 Length:418
## 1st Qu.:21.00 1st Qu.:0.0000 1st Qu.:0.0000 Class :character
## Median :27.00 Median :0.0000 Median :0.0000 Mode :character
## Mean :30.27 Mean :0.4474 Mean :0.3923
## 3rd Qu.:39.00 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :76.00 Max. :8.0000 Max. :9.0000
## NA's :86
## Fare Cabin Embarked
## Min. : 0.000 Length:418 Length:418
## 1st Qu.: 7.896 Class :character Class :character
## Median : 14.454 Mode :character Mode :character
## Mean : 35.627
## 3rd Qu.: 31.500
## Max. :512.329
## NA’s :1

# Keep only rows with a valid port of embarkation (drops the 2 rows where Embarked == "")
train_main <- train_main[train_main$Embarked == "C" | train_main$Embarked == "Q" | train_main$Embarked == "S", ]

By summarizing the results we can see that there are 177 NA's in the Age column of the train dataset (and 86 in the test dataset). To visualize the distribution of this column, let's plot a histogram.

hist(train_main$Age, main = "Age Distribution", col = "lightgreen", xlab = "Age")

[Figure: histogram "Age Distribution", Frequency (0 to 200) vs. Age (0 to 80)]

We can see that the data is roughly normally distributed but slightly right-skewed, so the median is a reasonable replacement for the missing values. We will replace the NA's with 28, which is the median Age of the train dataset.

train_main$Age <- ifelse(is.na(train_main$Age), median(train_main$Age, na.rm = TRUE),train_main$Age)

As seen above, the test dataset does not have a Survived column, so we split our train_main dataset into train and test sets (we need the Survived column in our test set to compute accuracy later in the code).

set.seed(123)
split = createDataPartition(train_main$Survived, p = 0.8, list = FALSE)
train = train_main[split,]
test = train_main[-split,]

2. Define Survived column as TARGET variable

colnames(train)[2] <- "TARGET"
colnames(test)[2] <- "TARGET"
names(train)

## [1] "PassengerId" "TARGET" "Pclass" "Name" "Sex"


## [6] "Age" "SibSp" "Parch" "Ticket" "Fare"
## [11] "Cabin" "Embarked"

3. Select features that can be predictive of the survival status

4. Drop features that you think are not predictive and explain why
they are being dropped

library(dplyr)

##
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
##
##     filter, lag

## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union

train_selected <- select(train,"Pclass","Sex","Age","Fare","SibSp","Parch","Embarked","TARGET")


test_selected <- select(test,"Pclass","Sex","Age","Fare","SibSp","Parch","Embarked","TARGET")

Above we have selected the following columns:

- Pclass: the passenger class (1 = first class, 2 = second class, 3 = third class).
- Sex: the passenger's gender (male or female).
- Age: the passenger's age.
- SibSp: the number of siblings or spouses the passenger was traveling with.
- Parch: the number of parents or children the passenger was traveling with.
- Fare: the fare paid by the passenger.
- Embarked: the port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).

I have dropped the following columns:

- PassengerId: the unique ID of the passenger.
- Name: the passenger's name.
- Cabin: the passenger's cabin number.
- Ticket: the passenger's ticket number.

These columns are dropped because they are unique (or near-unique) identifiers for each passenger, so they carry little generalizable signal about survival; a quick check below illustrates this.
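As a quick sanity check, we can count the distinct values in each dropped column; a column that is unique per row cannot help the model generalize. A minimal sketch, run on the train split created above:

# Count the number of distinct values in each dropped column of the train split
sapply(train[, c("PassengerId", "Name", "Ticket", "Cabin")],
       function(col) length(unique(col)))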

5. Transform selected categorical features with Dummy values

train_selected$Sex <- as.factor(train_selected$Sex)


train_selected$Embarked <- as.factor(train_selected$Embarked)
test_selected$Sex <- as.factor(test_selected$Sex)
test_selected$Embarked <- as.factor(test_selected$Embarked)
str(test_selected)

## 'data.frame': 177 obs. of 8 variables:


## $ Pclass : int 3 3 1 1 3 2 2 3 1 1 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 2 1 1 2 2 2 2 1 ...
## $ Age : num 22 26 54 58 14 28 34 28 19 28 ...
## $ Fare : num 7.25 7.92 51.86 26.55 7.85 ...
## $ SibSp : int 1 0 0 0 0 0 0 0 3 1 ...
## $ Parch : int 0 0 0 0 0 0 0 0 2 0 ...
## $ Embarked: Factor w/ 3 levels "C","Q","S": 3 3 3 3 3 3 3 1 3 1 ...
## $ TARGET : int 0 1 0 1 0 1 1 0 0 1 ...

We have transformed two variables, Sex and Embarked, into factors. Note that the raw Embarked column also contains an empty string "" as a fourth category; since we already removed those rows when filtering train_main above, the factor has only the three valid levels (C, Q, S) and the line below is no longer needed.

#test_selected <- test_selected[!test_selected$Embarked == "",]
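Converting to factors is sufficient here because glm() expands factors into dummy (indicator) variables internally, as the EmbarkedQ and EmbarkedS coefficients in the model summary below show. If explicit dummy columns are preferred, caret's dummyVars() can generate them; a minimal sketch, assuming the factor columns created above:

# Full-rank encoding drops one reference level per factor, matching
# the treatment contrasts glm() uses internally.
dv <- dummyVars(TARGET ~ ., data = train_selected, fullRank = TRUE)
train_dummies <- data.frame(predict(dv, newdata = train_selected))
test_dummies <- data.frame(predict(dv, newdata = test_selected))
str(train_dummies)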

6. Apply logistic regression on the train/test dataset

model = glm(TARGET ~ ., data = train_selected, family = "binomial")

7. Compute your model’s accuracy using accuracy function

library(caret)
predictions = predict(model, newdata = test_selected, type = "response")
predictions_binary = ifelse(predictions > 0.5, 1, 0)
accuracy <- mean(predictions_binary == test_selected$TARGET)
print(accuracy)

## [1] 0.8248588

Accuracy ≈ 82.49%.
Note: I was not able to use the accuracy function from the caret library as it was throwing errors on my system. I could have dug deeper into it, but due to time restrictions I calculated the accuracy manually.
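For reference, caret's confusionMatrix() reports accuracy (along with sensitivity, specificity, and other statistics) directly; it requires both inputs to be factors with matching levels, which is a common source of the kind of error mentioned above. A minimal sketch:

# Both vectors must be factors with identical levels
cm <- confusionMatrix(
  data = factor(predictions_binary, levels = c(0, 1)),
  reference = factor(test_selected$TARGET, levels = c(0, 1)),
  positive = "1" # treat "survived" as the positive class
)
cm$overall["Accuracy"]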

summary(model)

##
## Call:
## glm(formula = TARGET ~ ., family = "binomial", data = train_selected)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5063 -0.6270 -0.4479 0.6743 2.3819
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.796595 0.603742 7.945 1.95e-15 ***
## Pclass -0.979673 0.156268 -6.269 3.63e-10 ***
## Sexmale -2.635261 0.220833 -11.933 < 2e-16 ***
## Age -0.033096 0.008459 -3.913 9.13e-05 ***
## Fare 0.001880 0.002480 0.758 0.4484
## SibSp -0.291558 0.121906 -2.392 0.0168 *
## Parch -0.061550 0.124431 -0.495 0.6208
## EmbarkedQ -0.231530 0.410737 -0.564 0.5730
## EmbarkedS -0.524459 0.254227 -2.063 0.0391 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 940.00 on 711 degrees of freedom
## Residual deviance: 644.21 on 703 degrees of freedom
## AIC: 662.21
##
## Number of Fisher Scoring iterations: 5

From the above results we can see that Fare and Parch are not significant predictors (nor is EmbarkedQ, although EmbarkedS is significant at the 0.05 level), while Pclass, Sex, Age, and SibSp all contribute significantly; a reduced model dropping the clearly non-significant terms is sketched below.
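As a follow-up, one could refit the model without Fare and Parch and compare the AIC of the two fits; a minimal sketch:

# A lower AIC for the reduced model would justify dropping the two predictors
model_reduced <- glm(TARGET ~ Pclass + Sex + Age + SibSp + Embarked,
                     data = train_selected, family = "binomial")
AIC(model, model_reduced)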

confusion_Matrix <- table(predictions_binary, test_selected$TARGET)


confusion_Matrix

##
## predictions_binary 0 1
## 0 89 18
## 1 13 57

The model is doing fairly well with an accuracy of about 82%. The confusion matrix above shows the class-wise predictions of the model and lets us identify the error types. Of the 107 passengers predicted as non-survivors, 18 actually survived (false negatives), while of the 70 predicted survivors, 13 did not survive (false positives, i.e. type I errors).
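Those counts translate directly into precision and recall for the survived class; a minimal sketch computed from the table above:

# Rows of confusion_Matrix are predictions, columns are actual labels
TP <- confusion_Matrix["1", "1"] # predicted survived, actually survived (57)
FP <- confusion_Matrix["1", "0"] # predicted survived, actually did not (13)
FN <- confusion_Matrix["0", "1"] # predicted not survived, actually did (18)
c(precision = TP / (TP + FP), recall = TP / (TP + FN))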
After Email Instructions

Generating the Survived column for test dataset

predictTest = predict(model, type ="response", newdata = test_main)


test_main$Survived = as.numeric(predictTest >= 0.5)
table(test_main$Survived)

##
## 0 1
## 197 134

According to the results, our model predicts that of the 331 test passengers who received a prediction, 134 survived and 197 did not. Only 331 of the 418 test passengers get a prediction because predict() returns NA for rows with a missing Age or Fare; imputing those values first, as sketched below, would cover all 418.
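A minimal sketch of that imputation, reusing medians from the train data:

# Fill missing Age and Fare in the test set with the train medians, then
# re-predict so every one of the 418 passengers gets a label
test_main$Age[is.na(test_main$Age)] <- median(train_main$Age, na.rm = TRUE)
test_main$Fare[is.na(test_main$Fare)] <- median(train_main$Fare, na.rm = TRUE)
test_main$Survived <- as.numeric(predict(model, newdata = test_main,
                                         type = "response") >= 0.5)
table(test_main$Survived) # now covers all 418 passengers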