Ahamed 123

This case study uses the Titanic passenger data set to create a machine learning model that predicts whether a passenger would survive or not based on their attributes. The data contains information on 891 passengers from the Titanic including whether they survived, as well as attributes like gender, age, class, etc. The case study walks through cleaning and exploring the data, feature selection, building predictive models using different algorithms, and selecting the best performing model to predict Titanic passenger survival.

Uploaded by

jbalapragashpathi2005


Titanic survival prediction

case study in python

This case study is based on one of the most famous datasets in machine learning: the Titanic survival data.

The data contains information about 891 passengers, including whether each passenger survived the
sinking of the Titanic.

The goal is to create a predictive model which can predict the survival of a given person, if they were to board
the Titanic and the ship sinks... again! :(

In the case study below, I discuss a step-by-step approach to creating a machine learning predictive model in
such scenarios. You can use this flow as a template to solve any supervised ML classification problem.
The flow of the case study is as follows:

● Reading the data in python

● Defining the problem statement
● Identifying the Target variable
● Looking at the distribution of Target variable
● Basic Data exploration
● Rejecting useless columns
● Visual Exploratory Data Analysis for data distribution (Histograms and Bar charts)
● Feature Selection based on data distribution
● Outlier treatment
● Missing Values treatment
● Visual correlation analysis
● Statistical correlation analysis (Feature Selection)
● Converting data to numeric for ML
● Sampling and K-fold cross validation
● Trying multiple classification algorithms
● Selecting the best Model
● Deploying the best model in production

I know it's a long list! Take a deep breath... and let us get started!

Reading the data into python

This is one of the most important steps in machine learning! You must understand the data and the domain
well before trying to apply any machine learning algorithm.

The data has one file "TitanicSurvivalData.csv". This file contains 891 passenger details.

The goal is to learn from this data and predict whether a new person who boards the Titanic would survive
if the ship were to sink again.
You can download the data required for this case study here

Data description
The business meaning of each column in the data is as follows:

● PassengerId: The id for each passenger

● Survived: Whether the passenger survived (1 = Survived, 0 = Died)
● Pclass: The travel class of the passenger
● Name: Name of the passenger
● Sex: The gender of the passenger
● Age: The Age of the passenger
● SibSp: Number of Siblings/Spouses Aboard
● Parch: Number of Parents/Children Aboard
● Ticket: The ticket number of the passenger
● Fare: The amount of fare paid by the passenger
● Cabin: The cabin number allotted
● Embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

In [1]:
# Suppressing the warning messages
import warnings
warnings.filterwarnings('ignore')
In [2]:

# Reading the dataset

import pandas as pd
import numpy as np
TitanicSurvivalData=pd.read_csv('/Users/farukh/Python Case Studies/TitanicSurvivalData.csv', encoding='latin')
print('Shape before deleting duplicate values:', TitanicSurvivalData.shape)

# Removing duplicate rows if any

TitanicSurvivalData=TitanicSurvivalData.drop_duplicates()
print('Shape After deleting duplicate values:', TitanicSurvivalData.shape)

# Printing sample data

# Start observing the Quantitative/Categorical/Qualitative variables
TitanicSurvivalData.head(10)
Shape before deleting duplicate values: (891, 12)
Shape After deleting duplicate values: (891, 12)
Out[2]:

   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S
5  6            0         3       Moran, Mr. James                                   male    NaN   0      0      330877            8.4583   NaN    Q
6  7            0         1       McCarthy, Mr. Timothy J                            male    54.0  0      0      17463             51.8625  E46    S
7  8            0         3       Palsson, Master. Gosta Leonard                     male    2.0   3      1      349909            21.0750  NaN    S
8  9            1         3       Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)  female  27.0  0      2      347742            11.1333  NaN    S
9  10           1         2       Nasser, Mrs. Nicholas (Adele Achem)                female  14.0  1      0      237736            30.0708  NaN    C
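
The effect of drop_duplicates() is not visible in the output above, because the shape stays at (891, 12) before and after the call. A minimal sketch on a made-up three-row sample shows what the step does when a repeated row is present. The rows and the in-memory CSV below are illustrative assumptions, not the real file:

```python
import io
import pandas as pd

# Hypothetical mini CSV using a few of the column names described above.
# The third row is an exact copy of the first one.
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Sex,Age\n"
    "1,0,3,male,22\n"
    "2,1,1,female,38\n"
    "1,0,3,male,22\n"
)

sample = pd.read_csv(sample_csv)
print('Shape before deleting duplicate values:', sample.shape)

# drop_duplicates() keeps the first occurrence of each fully identical row
deduplicated = sample.drop_duplicates()
print('Shape after deleting duplicate values:', deduplicated.shape)
```

Only rows that match in every column are treated as duplicates here, which is the default behaviour of drop_duplicates() when no subset of columns is given.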

Defining the problem statement:

Create a predictive model which can tell whether a person will survive the sinking of the Titanic.

● Target Variable: Survived

● Predictors: age, sex, passenger class etc.

● Survived=0 The passenger died

● Survived=1 The passenger survived

Determining the type of Machine Learning

Based on the problem statement you can understand that we need to create a supervised ML classification
model, as the target variable is categorical.
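
As a rough illustration of that decision, the sketch below guesses the problem type from the target column. The helper name suggest_problem_type and the max_classes threshold are hypothetical choices made for this example, and the ten Survived values are the ones from the sample rows, not the full data:

```python
import pandas as pd

def suggest_problem_type(target: pd.Series, max_classes: int = 10) -> str:
    """Hypothetical helper: guess the problem type from the target column."""
    # Non-numeric labels, or only a handful of distinct values, suggest classification
    if not pd.api.types.is_numeric_dtype(target) or target.nunique() <= max_classes:
        return 'classification'
    # Many distinct numeric values suggest a continuous target, i.e. regression
    return 'regression'

# Survived values of the ten sample rows (illustration only)
survived = pd.Series([0, 1, 1, 1, 0, 0, 0, 0, 1, 1], name='Survived')
print(suggest_problem_type(survived))
```

The threshold is only a rule of thumb; a numeric column with few distinct values can still be a genuine count, so the result should be checked against the meaning of the column.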

Looking at the distribution of Target variable

● If the target variable's distribution is too skewed, predictive modeling becomes difficult.
● For a continuous target, a bell curve is desirable, but a slight positive or negative skew is also fine.
● When performing classification, make sure there is a balance in the distribution of each class;
otherwise it impacts the machine learning algorithm's ability to learn all the classes.

In [3]:
%matplotlib inline
# Creating Bar chart as the Target variable is Categorical
GroupedData=TitanicSurvivalData.groupby('Survived').size()
GroupedData.plot(kind='bar', figsize=(4,3))
Out[3]:

<matplotlib.axes._subplots.AxesSubplot at 0x118242890>

The data distribution of the target variable is satisfactory to proceed further. There are a sufficient number of
rows for each category to learn from.
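
The balance can also be checked numerically instead of only by looking at the bar chart. The sketch below is one possible way to do it, using just the Survived values of the ten sample rows as illustrative input, so the numbers do not describe the full 891-row data:

```python
import pandas as pd

# Illustrative target values only (Survived column of the ten sample rows)
survived = pd.Series([0, 1, 1, 1, 0, 0, 0, 0, 1, 1], name='Survived')

# Row count per class, then the same counts as a fraction of all rows
class_counts = survived.value_counts()
class_share = survived.value_counts(normalize=True)
print(class_counts)
print(class_share)

# Ratio of the largest class to the smallest one; values close to 1 mean balanced
imbalance_ratio = class_counts.max() / class_counts.min()
print('Imbalance ratio:', imbalance_ratio)
```

On the real data the same three lines can be run on TitanicSurvivalData['Survived'].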

Basic Data Exploration

This step is performed to gauge the overall data: the volume of data and the types of columns present in it.
An initial assessment of the data should be done to identify which columns are Quantitative, Categorical or
Qualitative.

This step helps to start the column rejection process. You must look at each column carefully and ask, does this
column affect the values of the Target variable? For example in this case study, you will ask, does this column
affect the survival of the passenger? If the answer is a clear "No", then remove the column immediately from
the data, otherwise keep the column for further analysis.

There are four commands which are used for Basic data exploration in Python

● head() : This helps to see a few sample rows of the data

● info() : This provides the summarized information of the data
● describe() : This provides the descriptive statistical details of the data
● nunique(): This helps us to identify if a column is categorical or continuous
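
The four commands can be tried together on a small frame. The sketch below builds one from a few columns of the first five sample rows; the variable names are assumptions made for this illustration:

```python
import pandas as pd

# Small illustrative frame built from a few columns of the first five sample rows
sample = pd.DataFrame({
    'Survived': [0, 1, 1, 1, 0],
    'Pclass': [3, 1, 3, 1, 3],
    'Sex': ['male', 'female', 'female', 'female', 'male'],
    'Age': [22.0, 38.0, 26.0, 35.0, 35.0],
    'Fare': [7.25, 71.2833, 7.925, 53.1, 8.05],
})

print(sample.head(3))      # first three rows
sample.info()              # column dtypes and non-null counts (prints directly)
print(sample.describe())   # count, mean, std, min, quartiles, max for numeric columns

# Few distinct values hint at a categorical column, many at a continuous one
unique_counts = sample.nunique()
print(unique_counts)
```

With only five rows the unique counts are small for every column; on the full data the gap between columns such as Sex and Fare is much clearer.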

In [4]:
# Looking at sample rows in the data
TitanicSurvivalData.head()
Out[4]:

   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S

In [5]:
