Titanic Report ML Report
A
Project Report
On
"Who Survived the Titanic Shipwreck Prediction using Machine Learning"
CERTIFICATE
This is to certify that Swapnil Rajendra Take has successfully completed his
report on "Who Survived the Titanic Shipwreck Prediction using Machine
Learning" at Vishwabharti Academy's College of Engineering,
Ahmednagar, in partial fulfillment of the Graduate Degree course in
B.E. at the Department of Computer Engineering, in the Academic Year
2022-2023,
Semester VII, as prescribed by Savitribai Phule Pune University.
Date:
Place: Ahmednagar
Acknowledgement
We would like to extend our sincere appreciation and gratitude to
Prof. Devray R.N. of the Computer Department for providing, as our project
guide, the technical and informative support, valuable guidance, and constant
inspiration and encouragement that have brought this stage-one project
report to its present form.
We would also like to express our gratitude to Prof. Dhongade V.S. for
his constant encouragement and friendly guidance throughout the project
work. Finally, we would like to thank all the staff members who have directly
or indirectly contributed in their own way, and all our friends in the
Computer Department for their suggestions and constructive criticism.
1. Abstract
2. Introduction
3. Work Plan
4. Training and Test Data
5. Feature Engineering
6. Decision Trees
7. Conclusions
Abstract
The sinking of the RMS Titanic is one of the most infamous
shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic
sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew.
This sensational tragedy shocked the international community and led to better
safety regulations for ships.
Introduction
The goal of the project was to predict the survival of passengers based on a set of
data. We used the Kaggle competition "Titanic: Machine Learning from Disaster" (see
https://www.kaggle.com/c/titanic/data) to retrieve the necessary data and evaluate
the accuracy of our predictions. The historical data has been split into two groups, a
'training set' and a 'test set'. For the training set, we are provided with the
outcome (whether or not a passenger survived). We used this set to build our
model to generate predictions for the test set.
For each passenger in the test set, we had to predict whether or not they survived
the sinking. Our score was the percentage of correct predictions.
In our work, we learned:
- the programming language Python and its libraries NumPy (to perform matrix operations) and SciKit-Learn (to apply machine learning algorithms)
- several machine learning algorithms (decision trees, random forests, extra trees, linear regression)
- feature engineering techniques

We used:
- the online integrated development environment Cloud 9 (https://c9.io)
- Python 2.7.6 with the libraries numpy, sklearn, and matplotlib
- Microsoft Excel
Work Plan
Training and Test Data
Training and Test data come in CSV files and contain the following fields:
- Passenger ID
- Passenger Class
- Name
- Sex
- Age
- Number of passenger's siblings and spouses on board
- Number of passenger's parents and children on board
- Ticket
- Fare
- Cabin
- City where passenger embarked
Feature Engineering
Since the data can have missing fields, incomplete fields, or fields containing
hidden information, a crucial step in building any prediction system is Feature
Engineering. For instance, the fields Age, Fare, and Embarked in the training and
test data had missing values that had to be filled in. The field Name, while
useless by itself, contained the passenger's Title (Mr., Mrs., etc.); we also used
the passenger's surname to distinguish families on board the Titanic. Below is the
list of all changes that have been made to the data.
Extracting Title from Name
The field Name in the training and test data has the form "Braund, Mr. Owen
Harris". Since the name is unique for each passenger, it is not useful for our
prediction system by itself. However, a passenger's title can be extracted from
his or her name. We found 10 titles:
Index Title Number of occurrences
0 Col. 4
1 Dr. 8
2 Lady 4
3 Master 61
4 Miss 262
5 Mr. 757
6 Mrs. 198
7 Ms. 2
8 Rev. 8
9 Sir 5
We can see that a title may indicate a passenger's sex (Mr. vs Mrs.), class (Lady
vs Mrs.), age (Master vs Mr.), or profession (Col., Dr., and Rev.).
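The title extraction can be sketched in a few lines with a regular expression (`extract_title` is a hypothetical helper name; the Name format follows the example above):

```python
import re

def extract_title(name):
    """Extract the title from a Kaggle-style Name field such as
    "Braund, Mr. Owen Harris": the text between the comma and the
    first period."""
    match = re.search(r',\s*([^.]+)\.', name)
    return match.group(1).strip() if match else 'Unknown'

print(extract_title("Braund, Mr. Owen Harris"))  # Mr
print(extract_title("Cumings, Mrs. John Bradley (Florence Briggs Thayer)"))  # Mrs
```

The extracted titles can then be mapped to the indices in the table above and used as a categorical field.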
Calculating Family Size
The training and test data give the number of a passenger's siblings and spouses
on board and the number of a passenger's parents and children on board. We
combined these two counts (plus one for the passenger) into a single Family_Size
field, and used the passenger's surname to distinguish families.
Extracting Deck from Cabin
The field Cabin in the training and test data has the form "C85" or "C125", where C
refers to the deck label. We found 8 deck labels: A, B, C, D, E, F, G, T. We see the
deck label as a refinement of the passenger's class field, since the decks A and B
were intended for passengers of the first class, etc.
Extracting Ticket_Code from Ticket
The field Ticket in the training and test data has the form "A/5 21171". Although
we could not determine the meaning of the letters in front of the numbers in the
field Ticket, we extracted those letters and used them in our prediction system.
We found the following letters:
Index Ticket Code Number of occurrences
0 No Code 961
1 A 42
2 C 77
3 F 13
4 L 1
5 P 98
6 S 98
7 W 19
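Both the Deck and Ticket_Code extractions above can be sketched as follows (hypothetical helper names; an empty string stands in for a missing Cabin, which in the real CSV is simply blank):

```python
def extract_deck(cabin):
    """First letter of the Cabin field ("C85" -> "C"); a missing cabin
    gets a placeholder value."""
    if cabin and cabin[0].isalpha():
        return cabin[0]
    return 'Unknown'

def extract_ticket_code(ticket):
    """First letter of the Ticket field if it starts with a letter
    ("A/5 21171" -> "A"), otherwise "No Code"."""
    return ticket[0] if ticket and ticket[0].isalpha() else 'No Code'

print(extract_deck("C85"))               # C
print(extract_ticket_code("A/5 21171"))  # A
print(extract_ticket_code("113803"))     # No Code
```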
Since the number of missing values was small, we used the median of all Fare values
to fill in missing Fare fields, and the letter 'S' (the most frequent value) for the
field Embarked.
In the training and test data, there was a significant number of missing Age values.
To fill those in, we used a Linear Regression algorithm to predict Age based on all
other fields except Passenger_ID and Survived.
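The Age imputation step can be sketched as follows (a minimal NumPy least-squares fit standing in for SciKit-Learn's LinearRegression; the toy feature matrix and its values are invented for illustration, not taken from the project's data):

```python
import numpy as np

# Toy feature matrix [class, sex, fare] with Age as the target.
X = np.array([[1, 0, 71.3],
              [3, 1, 7.9],
              [2, 1, 13.0],
              [3, 0, 8.1],
              [1, 1, 53.1]])
age = np.array([38.0, 22.0, np.nan, 26.0, np.nan])

known = ~np.isnan(age)                       # rows where Age is present
A = np.c_[X[known], np.ones(known.sum())]    # add an intercept column
coef, *_ = np.linalg.lstsq(A, age[known], rcond=None)

missing = ~known
B = np.c_[X[missing], np.ones(missing.sum())]
age[missing] = B @ coef                      # fill missing Ages with predictions
print(age)
```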
Importance of fields
The Decision Trees algorithm in the library SciKit-Learn allows us to evaluate the
importance of each field used for prediction. Below is a chart displaying the
importance of each field.
We can see that the field Sex is the most important one for prediction, followed
by Title, Fare, Age, Class, Deck, Family_Size, etc.
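Importances of this kind come from the fitted tree's feature_importances_ attribute; a toy two-field sketch (invented data, where one column fully determines the label and the other is noise, loosely mimicking Sex vs. a weak field):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(200, 2)).astype(float)
y = X[:, 0].astype(int)  # label depends only on the first column

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
for name, imp in zip(["Informative", "Noise"], tree.feature_importances_):
    print("%-12s %.2f" % (name, imp))
# The informative column receives (almost) all of the importance.
```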
Decision Trees
Our prediction system is based on growing Decision Trees to predict the survival
status. A typical Decision Tree is pictured below.
Stopping Rules:
1. The leaf nodes are pure
2. A maximal node depth is reached
3. Splitting a node does not lead to an information gain
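These stopping rules map onto parameters of SciKit-Learn's DecisionTreeClassifier (the parameter names are from the sklearn API; the values and the tiny XOR data set are illustrative, not from the report):

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    max_depth=10,               # rule 2: a maximal node depth is reached
    min_impurity_decrease=0.0,  # rule 3: require an information gain to split
    # rule 1 (pure leaf nodes) is the default stopping condition
)
X = [[0, 0], [1, 1], [1, 0], [0, 1]]
y = [0, 0, 1, 1]  # toy XOR-like labels
tree.fit(X, y)
print(tree.get_depth())
```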
In order to measure uncertainty and information gain, we used the formula

IG(parent) = I(parent) - (N_left / N_parent) * I(left) - (N_right / N_parent) * I(right)

where
IG: Information Gain
I: Impurity (Uncertainty Measure)
N_parent, N_left, N_right: number of samples in the parent, the left child, and the right child

Two common impurity measures are Entropy, I(t) = -SUM_i p_i log2 p_i, and
Gini, I(t) = 1 - SUM_i p_i^2, where p_i is the probability of class i at node t.
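The Entropy and Gini impurity measures can be computed directly (a small NumPy sketch; p is the vector of class probabilities at a node, and the function names are our own):

```python
import numpy as np

def entropy(p):
    """Entropy impurity: -sum(p * log2(p)), with 0*log2(0) treated as 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return -np.sum(nz * np.log2(nz))

def gini(p):
    """Gini impurity: 1 - sum(p**2)."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

print(entropy([0.5, 0.5]))  # 1.0 (maximal uncertainty at p = 1/2)
print(gini([0.5, 0.5]))     # 0.5
print(gini([1.0, 0.0]))     # 0.0 (a pure node has zero impurity)
```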
We can see on the graph that when the probability of an event is 0 or 1, the
uncertainty measure equals 0, while if the probability of an event is close to ½,
the uncertainty measure is maximal.
One common issue with all machine learning algorithms is Overfitting. For a
Decision Tree, it means growing too large a tree (with low bias and high variance),
so that it loses its ability to generalize from the data and to predict the output.
In order to deal with overfitting, we can grow several decision trees and take the
average of their predictions. The library SciKit-Learn provides two such
algorithms: Random Forest and Extra Trees.
In a Random Forest, we grow N decision trees, each based on a randomly selected subset of the data and M randomly selected fields, where M = √(number of fields).
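A sketch of both ensemble algorithms on invented data (the max_features="sqrt" setting implements the M = √(number of fields) rule above; the data set, its 9 fields, and the tree count are illustrative only):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(300, 9)                       # 9 toy fields
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)  # label from the first two fields

# Each forest averages 100 trees; "sqrt" selects M = sqrt(#fields)
# candidate fields at every split, which reduces overfitting.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0).fit(X, y)
extra = ExtraTreesClassifier(n_estimators=100, max_features="sqrt",
                             random_state=0).fit(X, y)
print(forest.score(X, y), extra.score(X, y))
```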
Conclusion