
1.1 Objective

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15,
1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out
of 2224 passengers and crew. This sensational tragedy shocked the international community and
led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough
lifeboats for the passengers and crew. Although there was some element of luck involved in
surviving the sinking, some groups of people were more likely to survive than others, such as
women, children, and the upper class.

In this challenge, we are going to complete the analysis of what sorts of people were likely to
survive.

1.2 Data Understanding


library('dplyr')    # data manipulation
library('ggplot2')  # data visualization
library('ggthemes') # data visualization

options(warn = -1)  # suppress warnings
# load train.csv
train <- read.csv('../input/train.csv', stringsAsFactors = F)
# load test.csv
test <- read.csv('../input/test.csv', stringsAsFactors = F)
# combine them into one data set
test$Survived <- NA
full <- rbind(train, test)
head(full)
str(full)

We have the ordinal variable PassengerId; label variables Name and Ticket; numeric variables such
as Age, SibSp, Parch, and Fare; and categorical variables such as Survived, Pclass, Sex, Cabin,
and Embarked.
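
As a quick check of how R parsed these columns, we can list the class of every variable (a small
sketch; the exact output depends on the csv files):

# list the class of each column in the combined data set
sapply(full, class)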

2. Data Preparation and Exploratory Analysis

2.1 Data Cleaning

Age:
In [4]:
# Process Age column

# work on a copy of the Age column
age <- full$Age
n = length(age)
# replace each missing value with a random sample from the observed ages
set.seed(123)
for(i in 1:n){
  if(is.na(age[i])){
    age[i] = sample(na.omit(full$Age), 1)
  }
}
# check the effect of the imputation on the age distribution
par(mfrow=c(1,2))
hist(full$Age, freq=F, main='Before Replacement',
     col='lightblue', ylim=c(0,0.04), xlab = "age")
hist(age, freq=F, main='After Replacement',
     col='darkblue', ylim=c(0,0.04), xlab = "age")

Cabin:
In [5]:
# process Cabin column to show the number of cabins each passenger has
cabin <- full$Cabin
n = length(cabin)
for(i in 1:n){
  if(nchar(cabin[i]) == 0){
    cabin[i] = 0
  } else{
    s = strsplit(cabin[i], " ")
    cabin[i] = length(s[[1]])
  }
}
table(cabin)

Fare:
In [6]:
# process Fare column

# check missing
full$PassengerId[is.na(full$Fare)]
full[1044,]
ggplot(full[full$Pclass == '3' & full$Embarked == 'S', ],
       aes(x = Fare)) +
  geom_density(fill = '#99d6ff', alpha=0.4) +
  geom_vline(aes(xintercept=median(Fare, na.rm=T)),
             colour='red', linetype='dashed', lwd=1)
# fare is clustered around the mode, so we replace the missing value with the
# median fare of the corresponding Pclass and Embarked group
full$Fare[1044] <- median(full[full$Pclass == '3' & full$Embarked == 'S', ]$Fare,
                          na.rm = TRUE)
Embarked:
In [10]:
# process Embarked column
embarked <- full$Embarked
n = length(embarked)
# replace the two missing values with 'S', the most common port of embarkation
for(i in 1:n){
  if(embarked[i] != "S" && embarked[i] != "C" && embarked[i] != "Q"){
    embarked[i] = "S"
  }
}
table(embarked)

2.2 Exploratory Analysis and Data Processing

Age vs. Survival


In [11]:
# number of survivals and non-survivals across different ages
d <- data.frame(Age = age[1:891], Survived = train$Survived)
ggplot(d, aes(Age, fill = factor(Survived))) +
  geom_histogram()
# create a bar chart to show the relationship between survival rate and age intervals
cuts <- cut(d$Age, hist(d$Age, 10, plot = F)$breaks)
rate <- tapply(d$Survived, cuts, mean)
d2 <- data.frame(age = names(rate), rate)
barplot(d2$rate, xlab = "age", ylab = "survival rate")

Sex vs. Survival


In [13]:
# create a histogram to show the effect of Sex on survival
ggplot(train, aes(Sex, fill = factor(Survived))) +
  geom_histogram(stat = "count")
# calculate survival rate
tapply(train$Survived, train$Sex, mean)

female: 0.74203821656051
male:   0.188908145580589

The survival rate of females is 0.74, while the survival rate of males is 0.19.

Name vs. Survival

We also notice that the title contained in a passenger's name (Mr, Mrs, Miss, Master, ...) is a meaningful feature.


In [15]:
# extract title from Name
# (we process the full data set here, but only plot title vs. survival on the
# train data set, because the test data set has no Survived values)
n = length(full$Survived)
title = rep(NA, n)
for (i in 1:n){
  lastname = strsplit(full$Name[i], ", ")[[1]][2]
  title[i] = strsplit(lastname, ". ")[[1]][1]
}

# make a histogram of title vs. survival
d <- data.frame(title = title[1:891], Survived = train$Survived)
ggplot(d, aes(title, fill = factor(Survived))) +
  geom_histogram(stat = "count")
# count of title
table(title)
# survival rate
tapply(d$Survived,d$title,mean)
# replace rare titles with 'Rare'
title[title != 'Mr' & title != 'Miss' & title != 'Mrs' & title != 'Master'] <- 'Rare'
table(title)

Pclass vs. Survival


In [19]:
# make a histogram
ggplot(train, aes(Pclass,fill = factor(Survived))) +
geom_histogram(stat = "count")
# calculate survival rate
tapply(train$Survived,train$Pclass,mean)
# histogram of Parch
ggplot(train, aes(Parch,fill = factor(Survived))) +
geom_histogram(stat = "count")
# histogram of SibSp
ggplot(train, aes(SibSp,fill = factor(Survived))) +
geom_histogram(stat = "count")
# combine SibSp and Parch
family <- full$SibSp + full$Parch
d <- data.frame(family = family[1:891],Survived = train$Survived)
ggplot(d, aes(family,fill = factor(Survived))) +
geom_histogram(stat = "count")
tapply(d$Survived,d$family,mean)

Cabin vs. Survival


In [25]:
# create histogram
d <- data.frame(Cabin = cabin[1:891],Survived = train$Survived)
ggplot(d, aes(Cabin,fill = factor(Survived))) +
geom_histogram(stat = "count")
# calculate survival rate
tapply(d$Survived,d$Cabin,mean)

Fare vs. Survival


In [27]:
# make a histogram
ggplot(train, aes(Fare,fill = factor(Survived))) +
geom_histogram()
# calculate
cuts <- cut(train$Fare,hist(train$Fare,10,plot = F)$breaks)
rate <- tapply(train$Survived,cuts,mean)
d <- data.frame(fare = names(rate),rate)
barplot(d$rate, xlab = "fare",ylab = "survival rate")

Embarked vs. Survival


In [29]:
# make histogram
d <- data.frame(Embarked = embarked[1:891], Survived = train$Survived)
ggplot(d, aes(Embarked,fill = factor(Survived))) +
geom_histogram(stat = "count")
# make table
tapply(train$Survived,train$Embarked,mean)

3. Modeling

3.1 Feature Engineering

In this section, we prepare the features used for training and prediction. We first choose the
features that, according to the exploratory analysis above, have a significant effect on survival.
We use the Survived column as the response variable and these eight columns as features: age
(after filling), title, Pclass, Sex, family size, Fare, cabin (cabin count), and Embarked.

Response variable (Y): Survived
Features (X): age, fare, cabin, title, family, Pclass, Sex, Embarked

# response variable
f.survived = train$Survived
In [32]:
# feature
# 1. age
f.age = age[1:891] # for training
t.age = age[892:1309] # for testing
In [33]:
# 2. fare
f.fare = full$Fare[1:891]
t.fare = full$Fare[892:1309]
In [34]:
# 3. cabin
f.cabin = cabin[1:891]
t.cabin = cabin[892:1309]
# 4. title
f.title = title[1:891]
t.title = title[892:1309]

# 5. family
family <- full$SibSp + full$Parch
f.family = family[1:891]
t.family = family[892:1309]

# 6. pclass
f.pclass = train$Pclass
t.pclass = test$Pclass

# 7. sex
f.sex = train$Sex
t.sex = test$Sex

# 8. embarked
f.embarked = embarked[1:891]
t.embarked = embarked[892:1309]

3.2 Model Training


# construct training data frame
new_train = data.frame(survived = f.survived, age = f.age, fare = f.fare,
                       sex = f.sex, embarked = f.embarked, family = f.family,
                       title = f.title, cabin = f.cabin, pclass = f.pclass)
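
Note that, depending on the R version, the character columns in new_train (sex, embarked, title,
cabin) may or may not be converted to factors automatically; data.frame() stopped doing so by
default in R 4.0. A minimal, version-proof sketch (an extra safety step, not in the original code)
is to convert them explicitly, since learners such as randomForest expect categorical predictors
to be factors:

# convert the character feature columns to factors explicitly (optional safety step)
for(col in c("sex", "embarked", "title", "cabin")){
  new_train[[col]] <- as.factor(new_train[[col]])
}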
# logistic regression
fit_logit <- glm(factor(survived) ~ age + fare + sex + embarked + family
                 + title + cabin + pclass,
                 data = new_train, family = binomial)
# predicted result of regression
ans_logit = rep(NA, 891)
for(i in 1:891){
  ans_logit[i] = round(fit_logit$fitted.values[[i]], 0)
}
# check result
mean(ans_logit == train$Survived)
table(ans_logit)

0.837261503928171
ans_logit
  0   1
566 325
# random forest
library('randomForest')

set.seed(123)
fit_rf <- randomForest(factor(survived) ~ age + fare + sex + embarked + family
                       + title + cabin + pclass, data = new_train)

# predicted result of random forest
rf.fitted = predict(fit_rf)
ans_rf = rep(NA, 891)
for(i in 1:891){
  ans_rf[i] = as.integer(rf.fitted[[i]]) - 1
}
# check result
mean(ans_rf == train$Survived)
table(ans_rf)

0.826038159371493
ans_rf
  0   1
590 301
# decision tree
library(rpart)

fit_dt <- rpart(factor(survived) ~ age + fare + sex + embarked + family
                + title + cabin + pclass, data = new_train)

# predicted result of decision tree (class probabilities)
dt.fitted = predict(fit_dt)
ans_dt = rep(NA, 891)
for(i in 1:891){
  if(dt.fitted[i,1] >= dt.fitted[i,2]){
    ans_dt[i] = 0
  } else{
    ans_dt[i] = 1
  }
}
# check result
mean(ans_dt == train$Survived)
table(ans_dt)

0.81705948372615
ans_dt
  0   1
568 323
# svm
library(e1071)

fit_svm <- svm(factor(survived) ~ age + fare + sex + embarked + family
               + title + cabin + pclass, data = new_train)

# predicted result of SVM
svm.fitted = predict(fit_svm)
ans_svm = rep(NA, 891)
for(i in 1:891){
  ans_svm[i] = as.integer(svm.fitted[[i]]) - 1
}
# check result
mean(ans_svm == train$Survived)
table(ans_svm)

0.836139169472503
ans_svm
  0   1
583 308

3.3 Model Evaluation

We built four basic learners in the last section. Now we evaluate each model's accuracy on the
training set using its confusion matrix.
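
The four counts used below (a = true positives, b = false positives, c = false negatives,
d = true negatives) can also be wrapped in a small helper that reports accuracy, precision, and
recall. This is a hypothetical convenience function, not part of the original analysis:

# confusion-matrix summary for 0/1 predictions (hypothetical helper)
confusion_summary <- function(pred, actual){
  tp <- sum(pred == 1 & actual == 1)  # true positives
  fp <- sum(pred == 1 & actual == 0)  # false positives
  fn <- sum(pred == 0 & actual == 1)  # false negatives
  tn <- sum(pred == 0 & actual == 0)  # true negatives
  c(accuracy  = (tp + tn) / length(actual),
    precision = tp / (tp + fp),
    recall    = tp / (tp + fn))
}
# example: confusion_summary(ans_logit, f.survived)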

Logistic Regression:
In [40]:
# logistic
a = sum(ans_logit == 1 & f.survived == 1)  # true positives
b = sum(ans_logit == 1 & f.survived == 0)  # false positives
c = sum(ans_logit == 0 & f.survived == 1)  # false negatives
d = sum(ans_logit == 0 & f.survived == 0)  # true negatives
data.frame(a, b, c, d)

    a  b  c   d
1 261 64 81 485
Random Forest:
In [41]:
# Random Forest
a = sum(ans_rf == 1 & f.survived == 1)
b = sum(ans_rf == 1 & f.survived == 0)
c = sum(ans_rf == 0 & f.survived == 1)
d = sum(ans_rf == 0 & f.survived == 0)
data.frame(a, b, c, d)

    a  b  c   d
1 244 57 98 492

Decision Tree:
In [42]:
# Decision Tree
a = sum(ans_dt == 1 & f.survived == 1)
b = sum(ans_dt == 1 & f.survived == 0)
c = sum(ans_dt == 0 & f.survived == 1)
d = sum(ans_dt == 0 & f.survived == 0)
data.frame(a, b, c, d)

    a  b  c   d
1 251 72 91 477

SVM:
In [43]:
# SVM
a = sum(ans_svm == 1 & f.survived == 1)
b = sum(ans_svm == 1 & f.survived == 0)
c = sum(ans_svm == 0 & f.survived == 1)
d = sum(ans_svm == 0 & f.survived == 0)
data.frame(a, b, c, d)

    a  b  c   d
1 252 56 90 493
From the matrices above, we can see that every model predicts non-survival better than survival.
Both logistic regression and SVM fit the training set well: logistic regression reaches an
accuracy of 0.837 and SVM reaches 0.836.
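
For reference, the training accuracies of all four learners can be collected in a single data
frame (a small convenience summary, not part of the original notebook):

# gather the training accuracies computed above
data.frame(model    = c("logistic", "random forest", "decision tree", "svm"),
           accuracy = c(mean(ans_logit == train$Survived),
                        mean(ans_rf    == train$Survived),
                        mean(ans_dt    == train$Survived),
                        mean(ans_svm   == train$Survived)))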

4. Prediction
Since our models have reasonable predictive power, we can apply them to the test data set to make
predictions. Here we use the SVM model as an example.
# construct testing data frame
test_data_set <- data.frame(age = t.age, fare = t.fare, sex = t.sex,
                            embarked = t.embarked, family = t.family,
                            title = t.title, cabin = t.cabin, pclass = t.pclass)
# make prediction
svm_predict = predict(fit_svm, newdata = test_data_set)
ans_svm_predict = rep(NA, 418)
for(i in 1:418){
  ans_svm_predict[i] = as.integer(svm_predict[[i]]) - 1
}
table(ans_svm_predict)

ans_svm_predict
  0   1
259 159
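
If the goal is a Kaggle submission, the predictions can be written out in the usual two-column
format (a minimal sketch; the file name is arbitrary):

# write predictions in the standard Kaggle submission format (sketch)
submission <- data.frame(PassengerId = test$PassengerId,
                         Survived    = ans_svm_predict)
write.csv(submission, file = "svm_submission.csv", row.names = FALSE)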
