Ahamed 123

This case study uses the Titanic passenger data set to create a machine learning model that predicts whether a passenger would survive or not based on their attributes. The data contains information on 891 passengers from the Titanic including whether they survived, as well as attributes like gender, age, class, etc. The case study walks through cleaning and exploring the data, feature selection, building predictive models using different algorithms, and selecting the best performing model to predict Titanic passenger survival.

Titanic survival prediction case study in Python

This case study is based on one of the most famous datasets in machine learning: the Titanic survival data.

The data contains information about 891 passengers. It also indicates whether each passenger survived the
sinking of the Titanic.

The goal is to create a predictive model that can predict the survival of a given person, if they were to board
the Titanic and the ship sinks... again! :(

In the case study below, I will discuss the step-by-step approach to creating a machine learning predictive model in
such scenarios. You can use this flow as a template to solve any supervised ML classification problem.
The flow of the case study is as follows:

● Reading the data in Python
● Defining the problem statement
● Identifying the Target variable
● Looking at the distribution of Target variable
● Basic Data exploration
● Rejecting useless columns
● Visual Exploratory Data Analysis for data distribution (Histograms and Bar charts)
● Feature Selection based on data distribution
● Outlier treatment
● Missing Values treatment
● Visual correlation analysis
● Statistical correlation analysis (Feature Selection)
● Converting data to numeric for ML
● Sampling and K-fold cross validation
● Trying multiple classification algorithms
● Selecting the best Model
● Deploying the best model in production

I know it's a long list!! Take a deep breath... and let us get started!

Reading the data into Python


This is one of the most important steps in machine learning! You must understand the data and the domain
well before trying to apply any machine learning algorithm.

The data has one file "TitanicSurvivalData.csv". This file contains 891 passenger details.

The goal is to learn from this data and predict whether a new person who boards the Titanic would survive
if the ship were to sink again.
You can download the data required for this case study here

Data description
The business meaning of each column in the data is as follows:

● PassengerId: The id for each passenger
● Survived: Whether the passenger survived or not? 1=Survived, 0=Died
● Pclass: The travel class of the passenger
● Name: Name of the passenger
● Sex: The gender of the passenger
● Age: The Age of the passenger
● SibSp: Number of Siblings/Spouses Aboard
● Parch: Number of Parents/Children Aboard
● Ticket: The ticket number of the passenger
● Fare: The amount of fare paid by the passenger
● Cabin: The cabin number allotted
● Embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
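
The coded columns in the description above can be decoded into readable labels when exploring the data. The snippet below is only a minimal sketch: it uses three rows copied from the sample output of this case study instead of the full file, and the column names EmbarkedPort and SurvivedLabel are hypothetical helpers that are not part of the original dataset.

```python
import pandas as pd

# Three illustrative rows (PassengerId 1, 2 and 6) taken from the sample output
sample = pd.DataFrame({
    "PassengerId": [1, 2, 6],
    "Survived": [0, 1, 0],
    "Embarked": ["S", "C", "Q"],
})

# Code-to-label lookups taken from the data description
port_names = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}
survival_labels = {0: "Died", 1: "Survived"}

# Hypothetical helper columns; the original coded columns stay untouched
sample["EmbarkedPort"] = sample["Embarked"].map(port_names)
sample["SurvivedLabel"] = sample["Survived"].map(survival_labels)

print(sample)
```

Keeping the original coded columns is a deliberate choice here, since the numeric codes are what a model would eventually consume.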

In [1]:
# Suppressing the warning messages
import warnings
warnings.filterwarnings('ignore')
In [2]:
# Reading the dataset
import pandas as pd
import numpy as np
# The file path must stay on a single line, otherwise the string literal is broken
TitanicSurvivalData=pd.read_csv('/Users/farukh/Python Case Studies/TitanicSurvivalData.csv', encoding='latin')
print('Shape before deleting duplicate values:', TitanicSurvivalData.shape)

# Removing duplicate rows if any
TitanicSurvivalData=TitanicSurvivalData.drop_duplicates()
print('Shape After deleting duplicate values:', TitanicSurvivalData.shape)

# Printing sample data
# Start observing the Quantitative/Categorical/Qualitative variables
TitanicSurvivalData.head(10)
Shape before deleting duplicate values: (891, 12)
Shape After deleting duplicate values: (891, 12)
Out[2]:

   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S
5  6            0         3       Moran, Mr. James                                   male    NaN   0      0      330877            8.4583   NaN    Q
6  7            0         1       McCarthy, Mr. Timothy J                            male    54.0  0      0      17463             51.8625  E46    S
7  8            0         3       Palsson, Master. Gosta Leonard                     male    2.0   3      1      349909            21.0750  NaN    S
8  9            1         3       Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)  female  27.0  0      2      347742            11.1333  NaN    S
9  10           1         2       Nasser, Mrs. Nicholas (Adele Achem)                female  14.0  1      0      237736            30.0708  NaN    C


Defining the problem statement:


Create a predictive model that can tell whether a person will survive the Titanic crash.

● Target Variable: Survived
● Predictors: age, sex, passenger class etc.

● Survived=0 The passenger died
● Survived=1 The passenger survived

Determining the type of Machine Learning


Based on the problem statement you can understand that we need to create a supervised ML classification
model, as the target variable is categorical.
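
The reasoning above can be sketched as a quick check on the target column. This is an illustration under assumptions, not part of the original notebook: suggest_problem_type is a hypothetical helper, the cut-off of 10 distinct values is arbitrary, and the Survived values are those of the ten sample rows shown earlier.

```python
import pandas as pd

def suggest_problem_type(target: pd.Series, max_classes: int = 10) -> str:
    """Hypothetical helper: guess the supervised task from the target column."""
    # A target with only a few distinct values is treated as categorical
    if target.nunique() <= max_classes:
        return "classification"
    return "regression"

# Survived values of the first ten passengers in the sample output
survived = pd.Series([0, 1, 1, 1, 0, 0, 0, 0, 1, 1], name="Survived")

print(suggest_problem_type(survived))
```

A rule of thumb like this is only a starting point; the final decision should come from the business meaning of the target variable.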

Looking at the distribution of Target variable


● If the target variable's distribution is too skewed, predictive modeling becomes difficult, because the model sees too few examples of the rare values
● For a continuous target, a bell curve is desirable, but a slightly positive or negative skew is also fine
● When performing classification, make sure the classes are reasonably balanced,
otherwise it impairs the machine learning algorithm's ability to learn all the classes

In [3]:
%matplotlib inline
# Creating Bar chart as the Target variable is Categorical
GroupedData=TitanicSurvivalData.groupby('Survived').size()
GroupedData.plot(kind='bar', figsize=(4,3))
Out[3]:

<matplotlib.axes._subplots.AxesSubplot at 0x118242890>

The data distribution of the target variable is satisfactory to proceed further. There is a sufficient number of
rows for each category to learn from.
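
Besides the bar chart, the class balance can also be checked numerically. The following is a minimal sketch that uses only the Survived values of the ten sample rows shown earlier, not the full dataset; the 0.2 threshold is an arbitrary value chosen for illustration.

```python
import pandas as pd

# Survived values of the first ten passengers in the sample output
survived = pd.Series([0, 1, 1, 1, 0, 0, 0, 0, 1, 1], name="Survived")

# Share of rows in each class, ordered by class label
class_share = survived.value_counts(normalize=True).sort_index()
print(class_share)

# Assumption: a minority class below 20% would call for a closer look
minority_share = class_share.min()
is_balanced_enough = minority_share >= 0.2
print("Balanced enough:", is_balanced_enough)
```

On the real data, the same call would be made on TitanicSurvivalData['Survived'].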

Basic Data Exploration


This step is performed to gauge the overall data: the volume of data and the types of columns present in it.
An initial assessment of the data should be done to identify which columns are Quantitative, Categorical or
Qualitative.

This step helps to start the column rejection process. You must look at each column carefully and ask, does this
column affect the values of the Target variable? For example in this case study, you will ask, does this column
affect the survival of the passenger? If the answer is a clear "No", then remove the column immediately from
the data, otherwise keep the column for further analysis.
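
One way to support this manual review is to flag columns that look like pure identifiers, since a value that is unique for every row rarely explains the target. The sketch below is a crude heuristic and not the method of this case study: identifier_like_columns is a hypothetical helper, the sample holds only five rows copied from the output above, and every flagged column still needs a human decision.

```python
import pandas as pd

# Five illustrative rows copied from the sample output
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Survived": [0, 1, 1, 1, 0],
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Th...",
        "Heikkinen, Miss. Laina",
        "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
        "Allen, Mr. William Henry",
    ],
    "Sex": ["male", "female", "female", "female", "male"],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

def identifier_like_columns(df: pd.DataFrame) -> list:
    """Hypothetical helper: non-float columns where every value is unique."""
    flagged = []
    for col in df.columns:
        all_unique = df[col].nunique() == len(df)
        # Float columns are skipped: continuous measures are often all-unique
        if all_unique and not pd.api.types.is_float_dtype(df[col]):
            flagged.append(col)
    return flagged

candidates = identifier_like_columns(sample)
print(candidates)

# Dropping is shown only for illustration; review each candidate first
reduced = sample.drop(columns=candidates)
print(reduced.columns.tolist())
```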

There are four commands which are used for basic data exploration in Python:

● head() : This helps to see a few sample rows of the data
● info() : This provides the summarized information of the data
● describe() : This provides the descriptive statistical details of the data
● nunique(): This helps us to identify if a column is categorical or continuous
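
The four commands listed above can be tried on a small frame before running them on the full data. This is a minimal sketch using five rows and six columns copied from the sample output; on the real data, the same calls would be made on TitanicSurvivalData.

```python
import numpy as np
import pandas as pd

# Five illustrative rows copied from the sample output
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
})

# head(): a few sample rows
print(sample.head(3))

# info(): column types and non-null counts (prints directly)
sample.info()

# describe(): descriptive statistics of the numeric columns
summary = sample.describe()
print(summary)

# nunique(): distinct values per column; missing values are not counted
unique_counts = sample.nunique()
print(unique_counts)
```

A low count from nunique(), as for Survived or Sex, hints at a categorical column, while a count close to the number of rows hints at a continuous one.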

In [4]:
# Looking at sample rows in the data
TitanicSurvivalData.head()
Out[4]:

   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S

In [5]:
