Ahamed 123
This case study is based on a very famous machine learning dataset: the Titanic survival data.
The data contains information about 891 passengers and indicates whether each passenger survived the
sinking of the Titanic.
The goal is to create a predictive model that can predict whether a given person would survive if they were to
board the Titanic and the ship sinks... again! :(
In the case study below, I will discuss the step-by-step approach to creating a machine learning predictive model
in such scenarios. You can use this flow as a template to solve any supervised ML classification problem.
The flow of the case study is as below:
I know it's a long list!! Take a deep breath... and let us get started!
The data has one file, "TitanicSurvivalData.csv". This file contains the details of 891 passengers.
The goal is to learn from this data and predict whether a new person who boards the Titanic would survive
if the ship sinks again.
You can download the data required for this case study here
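A minimal sketch of loading the file with pandas, assuming "TitanicSurvivalData.csv" sits in the working directory. So that it runs without the download, the sketch reads three rows copied from the sample output into an in-memory buffer; the name sample_csv is hypothetical.

```python
import io
import pandas as pd

# Hypothetical stand-in for TitanicSurvivalData.csv: three rows copied from the
# sample output of this case study, so the sketch runs without the real file.
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
    '1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S\n'
    '3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S\n'
    '6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q\n'
)

# With the real file this would be pd.read_csv("TitanicSurvivalData.csv"),
# assuming the file is in the working directory.
TitanicSurvivalData = pd.read_csv(sample_csv)

# Rows and columns, followed by the first few records
print(TitanicSurvivalData.shape)
print(TitanicSurvivalData.head())
```

Empty fields such as the missing Age and Cabin values are read as NaN, which is how they appear in the sample rows.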
Data description
The business meaning of each column in the data is as below
In [1]:
# Suppressing the warning messages
import warnings
warnings.filterwarnings('ignore')
In [2]:
     PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0    1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1    2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2    3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3    4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4    5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S
5    6            0         3       Moran, Mr. James                                   male    NaN   0      0      330877            8.4583   NaN    Q
6    7            0         1       McCarthy, Mr. Timothy J                            male    54.0  0      0      17463             51.8625  E46    S
7    8            0         3       Palsson, Master. Gosta Leonard                     male    2.0   3      1      349909            21.0750  NaN    S
8    9            ...       ...     ... Mrs. Oscar W (Elisabeth Vilhelmina Berg)       ...     ...   ...    ...    ...               ...      ...    ...
9    10           1         2       Nasser, Mrs. Nicholas (Adele Achem)                female  14.0  1      0      237736            30.0708  NaN    C
In [3]:
%matplotlib inline
# Creating a bar chart because the target variable is categorical
GroupedData = TitanicSurvivalData.groupby('Survived').size()
GroupedData.plot(kind='bar', figsize=(4, 3))
Out[3]:
<matplotlib.axes._subplots.AxesSubplot at 0x118242890>
The distribution of the target variable is satisfactory to proceed further. There is a sufficient number of
rows in each category to learn from.
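The same check can be done numerically. The sketch below counts the rows per class and their shares; the toy values and the 10% threshold are hypothetical and only illustrate the idea. With the real data you would use TitanicSurvivalData['Survived'].

```python
import pandas as pd

# Hypothetical toy target column; with the real data this would be
# TitanicSurvivalData['Survived'].
survived = pd.Series([0, 1, 1, 1, 0, 0, 0, 0, 1, 0], name="Survived")

# Number of rows in each class, and the same as a fraction of all rows
class_counts = survived.value_counts().sort_index()
class_shares = survived.value_counts(normalize=True).sort_index()

print(class_counts)
print(class_shares)

# Illustrative rule of thumb (an assumption, not a fixed standard):
# flag the target as imbalanced if the smallest class is under 10% of rows.
min_share_threshold = 0.10
is_imbalanced = class_shares.min() < min_share_threshold
print("Imbalanced target:", is_imbalanced)
```

If one class turned out to be very rare, the model would have too few examples of it to learn from, which is what the bar chart is meant to rule out.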
This step starts the column rejection process. Look at each column carefully and ask: does this column affect
the values of the target variable? In this case study, for example, the question is whether the column affects
the survival of the passenger. If the answer is a clear "No", remove the column from the data immediately;
otherwise keep the column for further analysis.
There are four commands used for basic data exploration in Python.
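A minimal sketch of such a first look, using head(), info(), describe() and nunique(). These four pandas calls are a common choice, and treating them as the four commands meant here is an assumption. The small DataFrame df is a hypothetical miniature built from the sample rows.

```python
import pandas as pd

# Hypothetical miniature of the Titanic data, with values taken from the sample rows
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
})

print(df.head())      # sample rows
df.info()             # column types and non-null counts (prints directly)
print(df.describe())  # summary statistics of the numeric columns

# Number of distinct values in each column
unique_counts = df.nunique()
print(unique_counts)
```

A column with as many distinct values as there are rows, such as PassengerId here, behaves like an identifier and is a typical candidate for rejection. On a sample this small other columns can look unique by accident, so the check is only meaningful on the full data.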
In [4]:
# Looking at sample rows in the data
TitanicSurvivalData.head()
Out[4]:
     PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0    1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1    2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2    3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3    4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4    5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S
In [5]: