
1.1 Objective

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15,
1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out
of 2224 passengers and crew. This sensational tragedy shocked the international community and
led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough
lifeboats for the passengers and crew. Although there was some element of luck involved in
surviving the sinking, some groups of people were more likely to survive than others, such as
women, children, and the upper class.

In this challenge, we are going to complete the analysis of what sorts of people were likely to
survive.

1.2 Data Understanding


library('dplyr')    # data manipulation
library('ggplot2')  # data visualization
library('ggthemes') # data visualization

options(warn = -1)  # suppress warnings
# load train.csv
train <- read.csv('../input/train.csv', stringsAsFactors = F)
# load test.csv
test <- read.csv('../input/test.csv', stringsAsFactors = F)
# combine them into one data set
test$Survived <- NA
full <- rbind(train, test)
head(full)
str(full)

We have the ordinal variable PassengerId; label variables Name and Ticket; numeric variables such
as Age, SibSp, Parch, and Fare; and categorical variables such as Survived, Pclass, Sex, Cabin,
and Embarked.
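
As a quick check of how R parsed these columns, we can list the class of every variable (a small
sketch; the exact output depends on the csv files):

# list the class of each column in the combined data set
sapply(full, class)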

2. Data Preparation and Exploratory Analysis

2.1 Data Cleaning

Age:
In [4]:
# Process Age column

# work on a copy of the Age column
age <- full$Age
n = length(age)
# replace each missing value with a random sample from the observed ages
set.seed(123)
for(i in 1:n){
  if(is.na(age[i])){
    age[i] = sample(na.omit(full$Age), 1)
  }
}
# check the effect of the imputation on the age distribution
par(mfrow=c(1,2))
hist(full$Age, freq=F, main='Before Replacement',
     col='lightblue', ylim=c(0,0.04), xlab = "age")
hist(age, freq=F, main='After Replacement',
     col='darkblue', ylim=c(0,0.04), xlab = "age")

Cabin:
In [5]:
# process Cabin column to show the number of cabins each passenger has
cabin <- full$Cabin
n = length(cabin)
for(i in 1:n){
  if(nchar(cabin[i]) == 0){
    cabin[i] = 0
  } else{
    s = strsplit(cabin[i], " ")
    cabin[i] = length(s[[1]])
  }
}
table(cabin)

Fare:
In [6]:
# process Fare column

# check missing
full$PassengerId[is.na(full$Fare)]
full[1044,]
ggplot(full[full$Pclass == '3' & full$Embarked == 'S', ],
       aes(x = Fare)) +
  geom_density(fill = '#99d6ff', alpha=0.4) +
  geom_vline(aes(xintercept=median(Fare, na.rm=T)),
             colour='red', linetype='dashed', lwd=1)
# fare is clustered around the mode, so we replace the missing value with the
# median fare of the corresponding Pclass and Embarked group
full$Fare[1044] <- median(full[full$Pclass == '3' & full$Embarked == 'S', ]$Fare,
                          na.rm = TRUE)
Embarked:
In [10]:
# process Embarked column
embarked <- full$Embarked
n = length(embarked)
# replace the two missing values with 'S', the most common port of embarkation
for(i in 1:n){
  if(embarked[i] != "S" && embarked[i] != "C" && embarked[i] != "Q"){
    embarked[i] = "S"
  }
}
table(embarked)

2.2 Exploratory Analysis and Data Processing

Age vs. Survival


In [11]:
# number of survivals and non-survivals across different ages
d <- data.frame(Age = age[1:891], Survived = train$Survived)
ggplot(d, aes(Age, fill = factor(Survived))) +
  geom_histogram()
# create a bar chart to show the relationship between survival rate and age intervals
cuts <- cut(d$Age, hist(d$Age, 10, plot = F)$breaks)
rate <- tapply(d$Survived, cuts, mean)
d2 <- data.frame(age = names(rate), rate)
barplot(d2$rate, xlab = "age", ylab = "survival rate")

Sex vs. Survival


In [13]:
# create a histogram to show the effect of Sex on survival
ggplot(train, aes(Sex, fill = factor(Survived))) +
  geom_histogram(stat = "count")
# calculate survival rate
tapply(train$Survived, train$Sex, mean)

female: 0.74203821656051
male:   0.188908145580589

The survival rate of females is 0.74, while the survival rate of males is 0.19.

Name vs. Survival

We also notice that the title contained in a passenger's name (Mr, Mrs, Miss, Master, ...) is a meaningful feature.


In [15]:
# extract title from Name
# (we process the full data set here, but only plot title vs. survival on the
# train data set, because the test data set has no Survived values)
n = length(full$Survived)
title = rep(NA, n)
for (i in 1:n){
  lastname = strsplit(full$Name[i], ", ")[[1]][2]
  title[i] = strsplit(lastname, ". ")[[1]][1]
}

# make a histogram of title vs. survival
d <- data.frame(title = title[1:891], Survived = train$Survived)
ggplot(d, aes(title, fill = factor(Survived))) +
  geom_histogram(stat = "count")
# count of title
table(title)
# survival rate
tapply(d$Survived,d$title,mean)
# replace rare titles with 'Rare'
title[title != 'Mr' & title != 'Miss' & title != 'Mrs' & title != 'Master'] <- 'Rare'
table(title)

Pclass vs. Survival


In [19]:
# make a histogram
ggplot(train, aes(Pclass,fill = factor(Survived))) +
geom_histogram(stat = "count")
# calculate survival rate
tapply(train$Survived,train$Pclass,mean)
# histogram of Parch
ggplot(train, aes(Parch,fill = factor(Survived))) +
geom_histogram(stat = "count")
# histogram of SibSp
ggplot(train, aes(SibSp,fill = factor(Survived))) +
geom_histogram(stat = "count")
# combine SibSp and Parch
family <- full$SibSp + full$Parch
d <- data.frame(family = family[1:891],Survived = train$Survived)
ggplot(d, aes(family,fill = factor(Survived))) +
geom_histogram(stat = "count")
tapply(d$Survived,d$family,mean)

Cabin vs. Survival


In [25]:
# create histogram
d <- data.frame(Cabin = cabin[1:891],Survived = train$Survived)
ggplot(d, aes(Cabin,fill = factor(Survived))) +
geom_histogram(stat = "count")
# calculate survival rate
tapply(d$Survived,d$Cabin,mean)

Fare vs. Survival


In [27]:
# make a histogram
ggplot(train, aes(Fare,fill = factor(Survived))) +
geom_histogram()
# calculate
cuts <- cut(train$Fare,hist(train$Fare,10,plot = F)$breaks)
rate <- tapply(train$Survived,cuts,mean)
d <- data.frame(fare = names(rate),rate)
barplot(d$rate, xlab = "fare",ylab = "survival rate")

Embarked vs. Survival


In [29]:
# make histogram
d <- data.frame(Embarked = embarked[1:891], Survived = train$Survived)
ggplot(d, aes(Embarked,fill = factor(Survived))) +
geom_histogram(stat = "count")
# make table
tapply(train$Survived,train$Embarked,mean)

3. Modeling

3.1 Feature Engineering

In this section, we prepare the features used for training and prediction. We first choose the
features that, according to the exploratory analysis above, have a significant effect on survival.
We use the Survived column as the response variable and these eight columns as features: age
(after filling), title, Pclass, Sex, family size, Fare, cabin (cabin count), and Embarked.

Response variable (Y): Survived
Features (X): age, fare, cabin, title, family, Pclass, Sex, Embarked

# response variable
f.survived = train$Survived
In [32]:
# feature
# 1. age
f.age = age[1:891] # for training
t.age = age[892:1309] # for testing
In [33]:
# 2. fare
f.fare = full$Fare[1:891]
t.fare = full$Fare[892:1309]
In [34]:
# 3. cabin
f.cabin = cabin[1:891]
t.cabin = cabin[892:1309]
# 4. title
f.title = title[1:891]
t.title = title[892:1309]

# 5. family
family <- full$SibSp + full$Parch
f.family = family[1:891]
t.family = family[892:1309]

# 6. pclass
f.pclass = train$Pclass
t.pclass = test$Pclass

# 7. sex
f.sex = train$Sex
t.sex = test$Sex

# 8. embarked
f.embarked = embarked[1:891]
t.embarked = embarked[892:1309]

3.2 Model Training


# construct training data frame
new_train = data.frame(survived = f.survived, age = f.age, fare = f.fare,
                       sex = f.sex, embarked = f.embarked, family = f.family,
                       title = f.title, cabin = f.cabin, pclass = f.pclass)
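
Note that, depending on the R version, the character columns in new_train (sex, embarked, title,
cabin) may or may not be converted to factors automatically; data.frame() stopped doing so by
default in R 4.0. A minimal, version-proof sketch (an extra safety step, not in the original code)
is to convert them explicitly, since learners such as randomForest expect categorical predictors
to be factors:

# convert the character feature columns to factors explicitly (optional safety step)
for(col in c("sex", "embarked", "title", "cabin")){
  new_train[[col]] <- as.factor(new_train[[col]])
}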
# logistic regression
fit_logit <- glm(factor(survived) ~ age + fare + sex + embarked + family
                 + title + cabin + pclass,
                 data = new_train, family = binomial)
# predicted result of regression
ans_logit = rep(NA, 891)
for(i in 1:891){
  ans_logit[i] = round(fit_logit$fitted.values[[i]], 0)
}
# check result
mean(ans_logit == train$Survived)
table(ans_logit)

0.837261503928171
ans_logit
  0   1
566 325
# random forest
library('randomForest')

set.seed(123)
fit_rf <- randomForest(factor(survived) ~ age + fare + sex + embarked + family
                       + title + cabin + pclass, data = new_train)

# predicted result of random forest
rf.fitted = predict(fit_rf)
ans_rf = rep(NA, 891)
for(i in 1:891){
  ans_rf[i] = as.integer(rf.fitted[[i]]) - 1
}
# check result
mean(ans_rf == train$Survived)
table(ans_rf)

0.826038159371493
ans_rf
  0   1
590 301
# decision tree
library(rpart)

fit_dt <- rpart(factor(survived) ~ age + fare + sex + embarked + family
                + title + cabin + pclass, data = new_train)

# predicted result of decision tree (class probabilities)
dt.fitted = predict(fit_dt)
ans_dt = rep(NA, 891)
for(i in 1:891){
  if(dt.fitted[i,1] >= dt.fitted[i,2]){
    ans_dt[i] = 0
  } else{
    ans_dt[i] = 1
  }
}
# check result
mean(ans_dt == train$Survived)
table(ans_dt)

0.81705948372615
ans_dt
  0   1
568 323
# svm
library(e1071)

fit_svm <- svm(factor(survived) ~ age + fare + sex + embarked + family
               + title + cabin + pclass, data = new_train)

# predicted result of SVM
svm.fitted = predict(fit_svm)
ans_svm = rep(NA, 891)
for(i in 1:891){
  ans_svm[i] = as.integer(svm.fitted[[i]]) - 1
}
# check result
mean(ans_svm == train$Survived)
table(ans_svm)

0.836139169472503
ans_svm
  0   1
583 308

3.3 Model Evaluation

We built four basic learners in the last section. Now we evaluate each model's accuracy on the
training set using its confusion matrix.
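
The four counts used below (a = true positives, b = false positives, c = false negatives,
d = true negatives) can also be wrapped in a small helper that reports accuracy, precision, and
recall. This is a hypothetical convenience function, not part of the original analysis:

# confusion-matrix summary for 0/1 predictions (hypothetical helper)
confusion_summary <- function(pred, actual){
  tp <- sum(pred == 1 & actual == 1)  # true positives
  fp <- sum(pred == 1 & actual == 0)  # false positives
  fn <- sum(pred == 0 & actual == 1)  # false negatives
  tn <- sum(pred == 0 & actual == 0)  # true negatives
  c(accuracy  = (tp + tn) / length(actual),
    precision = tp / (tp + fp),
    recall    = tp / (tp + fn))
}
# example: confusion_summary(ans_logit, f.survived)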

Logistic Regression:
In [40]:
# logistic
a = sum(ans_logit == 1 & f.survived == 1)  # true positives
b = sum(ans_logit == 1 & f.survived == 0)  # false positives
c = sum(ans_logit == 0 & f.survived == 1)  # false negatives
d = sum(ans_logit == 0 & f.survived == 0)  # true negatives
data.frame(a, b, c, d)

    a  b  c   d
1 261 64 81 485
Random Forest:
In [41]:
# Random Forest
a = sum(ans_rf == 1 & f.survived == 1)
b = sum(ans_rf == 1 & f.survived == 0)
c = sum(ans_rf == 0 & f.survived == 1)
d = sum(ans_rf == 0 & f.survived == 0)
data.frame(a, b, c, d)

    a  b  c   d
1 244 57 98 492

Decision Tree:
In [42]:
# Decision Tree
a = sum(ans_dt == 1 & f.survived == 1)
b = sum(ans_dt == 1 & f.survived == 0)
c = sum(ans_dt == 0 & f.survived == 1)
d = sum(ans_dt == 0 & f.survived == 0)
data.frame(a, b, c, d)

    a  b  c   d
1 251 72 91 477

SVM:
In [43]:
# SVM
a = sum(ans_svm == 1 & f.survived == 1)
b = sum(ans_svm == 1 & f.survived == 0)
c = sum(ans_svm == 0 & f.survived == 1)
d = sum(ans_svm == 0 & f.survived == 0)
data.frame(a, b, c, d)

    a  b  c   d
1 252 56 90 493
From the matrices above, we can see that every model predicts non-survival better than survival.
Both logistic regression and SVM fit the training set well: logistic regression reaches an
accuracy of 0.837 and SVM reaches 0.836.
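
For reference, the training accuracies of all four learners can be collected in a single data
frame (a small convenience summary, not part of the original notebook):

# gather the training accuracies computed above
data.frame(model    = c("logistic", "random forest", "decision tree", "svm"),
           accuracy = c(mean(ans_logit == train$Survived),
                        mean(ans_rf    == train$Survived),
                        mean(ans_dt    == train$Survived),
                        mean(ans_svm   == train$Survived)))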

4. Prediction
Since our models have reasonable predictive power, we can apply them to the test data set to make
predictions. Here we use the SVM model as an example.
# construct testing data frame
test_data_set <- data.frame(age = t.age, fare = t.fare, sex = t.sex,
                            embarked = t.embarked, family = t.family,
                            title = t.title, cabin = t.cabin, pclass = t.pclass)
# make prediction
svm_predict = predict(fit_svm, newdata = test_data_set)
ans_svm_predict = rep(NA, 418)
for(i in 1:418){
  ans_svm_predict[i] = as.integer(svm_predict[[i]]) - 1
}
table(ans_svm_predict)

ans_svm_predict
  0   1
259 159
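
If the goal is a Kaggle submission, the predictions can be written out in the usual two-column
format (a minimal sketch; the file name is arbitrary):

# write predictions in the standard Kaggle submission format (sketch)
submission <- data.frame(PassengerId = test$PassengerId,
                         Survived    = ans_svm_predict)
write.csv(submission, file = "svm_submission.csv", row.names = FALSE)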
