1. Objective
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15,
1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out
of 2224 passengers and crew. This sensational tragedy shocked the international community and
led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough
lifeboats for the passengers and crew. Although there was some element of luck involved in
surviving the sinking, some groups of people were more likely to survive than others, such as
women, children, and the upper class.
In this challenge, we analyze what sorts of people were likely to survive.
2. Data Preparation and Exploratory Analysis
options(warn = -1)
# load packages used below (rpart and e1071 are assumed here for the decision tree and SVM models)
library(ggplot2)
library(randomForest)
library(rpart)
library(e1071)
# load train.csv
train <- read.csv('../input/train.csv', stringsAsFactors = F)
# load test.csv
test <- read.csv('../input/test.csv', stringsAsFactors = F)
# combine train and test into one data frame (test Survived is unknown)
test$Survived <- NA
full <- rbind(train, test)
head(full)
str(full)
We have the ordinal variable PassengerId; label variables Name and Ticket; numeric variables such
as Age, SibSp, Parch, and Fare; and categorical variables such as Survived, Pclass, Sex, Cabin,
and Embarked.
Age:
# Process Age Column
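The age-processing cell is not included in this extract. As a placeholder, here is a minimal sketch that produces the age vector used in Section 3, assuming simple median imputation (the original kernel may use a different method, such as title-based imputation):
# (assumed approach) fill missing ages with the overall median age
age <- full$Age
age[is.na(age)] <- median(full$Age, na.rm = TRUE)
summary(age)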
Cabin:
# Process Cabin Column to show number of cabins passenger has
cabin <- full$Cabin
n = length(cabin)
for(i in 1:n){
  if(nchar(cabin[i]) == 0){
    cabin[i] = 0
  } else{
    s = strsplit(cabin[i], " ")
    cabin[i] = length(s[[1]])
  }
}
table(cabin)
Fare:
# process fare column
# check missing
full$PassengerId[is.na(full$Fare)]
full[1044,]
ggplot(full[full$Pclass == '3' & full$Embarked == 'S', ], aes(x = Fare)) +
  geom_density(fill = '#99d6ff', alpha = 0.4) +
  geom_vline(aes(xintercept = median(Fare, na.rm = T)),
             colour = 'red', linetype = 'dashed', lwd = 1)
# Fare is clustered around the mode, so we replace the missing value with the
# median fare of the corresponding Pclass and Embarked group
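As noted in the comment, the single missing fare belongs to passenger 1044 (third class, embarked at Southampton), so it is replaced with that group's median fare. The replacement line itself is not shown above; a one-line sketch:
# replace the missing Fare with the median fare of Pclass 3 / Embarked 'S' passengers
full$Fare[1044] <- median(full$Fare[full$Pclass == '3' & full$Embarked == 'S'], na.rm = TRUE)
The title and Embarked processing cells are also missing from this extract. A minimal sketch, assuming the title is extracted from Name with a regular expression and blank Embarked entries are filled with the most common port (both are assumptions, not necessarily the original kernel's choices):
# (assumed) extract the passenger title from Name, e.g. "Braund, Mr. Owen Harris" -> "Mr"
title <- gsub('(.*, )|(\\..*)', '', full$Name)
# (assumed) fill blank Embarked values with the most common port of embarkation
embarked <- full$Embarked
embarked[embarked == ""] <- "S"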
3. Modeling
In this section, we prepare the features used for training and prediction. We first select features
that appear to have a significant effect on survival, based on the exploratory analysis above.
We use the Survived column as the response variable and the following eight columns as features:
age (after imputation), title, Pclass, Sex, family size, Fare, cabin (cabin count), and Embarked.
Response Variable (Y):
# response variable
f.survived = train$Survived
Features (X):
# feature
# 1. age
f.age = age[1:891] # for training
t.age = age[892:1309] # for testing
# 2. fare
f.fare = full$Fare[1:891]
t.fare = full$Fare[892:1309]
# 3. cabin
f.cabin = cabin[1:891]
t.cabin = cabin[892:1309]
# 4. title
f.title = title[1:891]
t.title = title[892:1309]
# 5. family
family <- full$SibSp + full$Parch
f.family = family[1:891]
t.family = family[892:1309]
# 6. pclass
f.pclass = train$Pclass
t.pclass = test$Pclass
# 7. sex
f.sex = train$Sex
t.sex = test$Sex
# 8. embarked
f.embarked = embarked[1:891]
t.embarked = embarked[892:1309]
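The combined training data frame new_train used by the models below is not constructed in this extract. A minimal sketch, assuming it mirrors the test data frame built in Section 4:
# (assumed) assemble the training data frame; stringsAsFactors turns the
# categorical columns into factors for the model-fitting functions
new_train <- data.frame(survived = f.survived, age = f.age, fare = f.fare, sex = f.sex,
                        embarked = f.embarked, family = f.family, title = f.title,
                        cabin = f.cabin, pclass = f.pclass, stringsAsFactors = TRUE)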
set.seed(123)
fit_rf <- randomForest(factor(survived) ~ age + fare + sex + embarked + family
                       + title + cabin + pclass, data = new_train)
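The other three learners are fit in cells not shown here. A minimal sketch, assuming glm for logistic regression, rpart for the decision tree, and e1071's svm for the SVM, together with the training-set predictions (ans_logit, ans_dt, ans_rf, ans_svm) evaluated below:
# (assumed) fit the remaining three learners with the same formula
fit_logit <- glm(factor(survived) ~ age + fare + sex + embarked + family
                 + title + cabin + pclass, data = new_train, family = binomial)
fit_dt <- rpart(factor(survived) ~ age + fare + sex + embarked + family
                + title + cabin + pclass, data = new_train, method = "class")
fit_svm <- svm(factor(survived) ~ age + fare + sex + embarked + family
               + title + cabin + pclass, data = new_train)
# training-set predictions as 0/1 labels (the random forest uses its out-of-bag predictions)
ans_logit <- ifelse(predict(fit_logit, type = "response") > 0.5, 1, 0)
ans_dt <- predict(fit_dt, newdata = new_train, type = "class")
ans_rf <- predict(fit_rf)
ans_svm <- predict(fit_svm, newdata = new_train)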
We built four basic learners above. We now evaluate each model's accuracy on the training set
using confusion matrices.
Logistic Regression:
# logistic regression: confusion-matrix counts on the training set
a = sum(ans_logit == 1 & f.survived == 1)  # predicted survived, actually survived
b = sum(ans_logit == 1 & f.survived == 0)  # predicted survived, actually died
c = sum(ans_logit == 0 & f.survived == 1)  # predicted died, actually survived
d = sum(ans_logit == 0 & f.survived == 0)  # predicted died, actually died
data.frame(a,b,c,d)
    a  b  c   d
1 261 64 81 485
Random Forest:
# Random Forest
a = sum(ans_rf ==1 & f.survived == 1)
b = sum(ans_rf ==1 & f.survived == 0)
c = sum(ans_rf ==0 & f.survived == 1)
d = sum(ans_rf ==0 & f.survived == 0)
data.frame(a,b,c,d)
    a  b  c   d
1 244 57 98 492
Decision Tree:
# Decision Tree
a = sum(ans_dt ==1 & f.survived == 1)
b = sum(ans_dt ==1 & f.survived == 0)
c = sum(ans_dt ==0 & f.survived == 1)
d = sum(ans_dt ==0 & f.survived == 0)
data.frame(a,b,c,d)
    a  b  c   d
1 251 72 91 477
SVM:
# SVM
a = sum(ans_svm ==1 & f.survived == 1)
b = sum(ans_svm ==1 & f.survived == 0)
c = sum(ans_svm ==0 & f.survived == 1)
d = sum(ans_svm ==0 & f.survived == 0)
data.frame(a,b,c,d)
    a  b  c   d
1 252 56 90 493
From the confusion matrices above, we can see that all models predict non-survival better than
survival, and both logistic regression and SVM fit the training set well: logistic regression has
accuracy 0.837 and SVM has accuracy 0.836.
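These accuracy figures follow directly from the confusion-matrix counts, accuracy = (a + d) / (a + b + c + d). For example:
# accuracy = correctly classified / total training passengers
(261 + 485) / (261 + 64 + 81 + 485) # logistic regression, ~0.837
(252 + 493) / (252 + 56 + 90 + 493) # SVM, ~0.836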
4. Prediction
Since we got models that have reasonable predictive power, we can perform them to our test
data set to make prediction. Here we choose SVM to perform prediction as an example.
# construct the testing data frame
test_data_set <- data.frame(age = t.age, fare = t.fare, sex = t.sex, embarked = t.embarked,
                            family = t.family, title = t.title, cabin = t.cabin,
                            pclass = t.pclass)
# make prediction
svm_predict = predict(fit_svm, newdata = test_data_set)
# convert factor predictions ('0'/'1') to integer labels 0/1
ans_svm_predict = rep(NA, 418)
for(i in 1:418){
  ans_svm_predict[i] = as.integer(svm_predict[[i]]) - 1
}
table(ans_svm_predict)
ans_svm_predict
  0   1
259 159
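A typical final step, not included in this extract, is to write the predictions in Kaggle's submission format; a minimal sketch assuming the standard PassengerId/Survived layout:
# (sketch) save the SVM predictions as a Kaggle submission file
submission <- data.frame(PassengerId = test$PassengerId, Survived = ans_svm_predict)
write.csv(submission, file = 'svm_submission.csv', row.names = FALSE)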