Titanic Report ML Report
A
Project Report
On
"Who Survived the Titanic Shipwreck Prediction using Machine Learning"
CERTIFICATE
This is to certify that Swapnil Rajendra Take has successfully completed his
report on "Who Survived the Titanic Shipwreck Prediction using Machine
Learning" at Vishwabharti Academy's College of Engineering,
Ahmednagar, in partial fulfillment of the Graduate Degree course in
B.E. at the Department of Computer Engineering, in the Academic Year
2022-2023,
Semester VII, as prescribed by Savitribai Phule Pune University.
Date:
Place: Ahmednagar
Acknowledgement
We would like to extend our sincere appreciation and gratitude to
Prof. Devray R.N. of the Computer Department for providing, as our project
guide, the technical and informative support, valuable guidance, and constant
inspiration and encouragement that have brought this stage-one project
report to its present form.
We would also like to express our gratitude to Prof. Dhongade V.S. for
his constant encouragement and friendly guidance throughout the project
work. Finally, we would like to thank all the staff members who have directly
or indirectly contributed in their own way, and all our friends in the
Computer Department for their suggestions and constructive criticism.
1. Abstract
2. Introduction
3. Work Plan
4. Training and Test Data
5. Feature Engineering
6. Decision Trees
7. Conclusions
Abstract
The sinking of the RMS Titanic is one of the most infamous
shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic
sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew.
This sensational tragedy shocked the international community and led to better
safety regulations for ships.
Introduction
The goal of the project was to predict the survival of passengers based on a set of
data. We used the Kaggle competition "Titanic: Machine Learning from Disaster" (see
https://www.kaggle.com/c/titanic/data) to retrieve the necessary data and evaluate
the accuracy of our predictions. The historical data has been split into two groups, a
'training set' and a 'test set'. For the training set, we are provided with the
outcome (whether or not a passenger survived). We used this set to build our
model to generate predictions for the test set.
For each passenger in the test set, we had to predict whether or not they survived
the sinking. Our score was the percentage of correct predictions.
In our work, we learned:
- the programming language Python and its libraries NumPy (to perform matrix operations) and SciKit-Learn (to apply machine learning algorithms)
- several machine learning algorithms (decision trees, random forests, extra trees, linear regression)
- feature engineering techniques

We used:
- the online integrated development environment Cloud 9 (https://c9.io)
- Python 2.7.6 with the libraries numpy, sklearn, and matplotlib
- Microsoft Excel
Work Plan
Training and Test Data
Training and Test data come in CSV files and contain the following fields:
- Passenger ID
- Passenger Class
- Name
- Sex
- Age
- Number of passenger's siblings and spouses on board
- Number of passenger's parents and children on board
- Ticket
- Fare
- Cabin
- City where passenger embarked
Feature Engineering
Since the data can have missing fields, incomplete fields, or fields containing
hidden information, a crucial step in building any prediction system is Feature
Engineering. For instance, the fields Age, Fare, and Embarked in the training and
test data had missing values that had to be filled in. The field Name, while
useless by itself, contained the passenger's Title (Mr., Mrs., etc.); we also used
the passenger's surname to distinguish families on board the Titanic. Below is the
list of all changes that have been made to the data.
Extracting Title from Name
The field Name in the training and test data has the form "Braund, Mr. Owen
Harris". Since the name is unique for each passenger, it is not useful for our
prediction system by itself. However, a passenger's title can be extracted from
his or her name. We found 10 titles:
Index Title Number of occurrences
0 Col. 4
1 Dr. 8
2 Lady 4
3 Master 61
4 Miss 262
5 Mr. 757
6 Mrs. 198
7 Ms. 2
8 Rev. 8
9 Sir 5
We can see that a title may indicate a passenger's sex (Mr. vs Mrs.), class (Lady
vs Mrs.), age (Master vs Mr.), or profession (Col., Dr., and Rev.).
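The title extraction can be sketched in a few lines with a regular expression (`extract_title` is a hypothetical helper name; the Name format follows the example above):

```python
import re

def extract_title(name):
    """Extract the title from a Kaggle-style Name field such as
    "Braund, Mr. Owen Harris": the text between the comma and the
    first period."""
    match = re.search(r',\s*([^.]+)\.', name)
    return match.group(1).strip() if match else 'Unknown'

print(extract_title("Braund, Mr. Owen Harris"))  # Mr
print(extract_title("Cumings, Mrs. John Bradley (Florence Briggs Thayer)"))  # Mrs
```

The extracted titles can then be mapped to the indices in the table above and used as a categorical field.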
Calculating Family Size
The training and test data give the number of a passenger's siblings and spouses
on board and the number of a passenger's parents and children on board. We
combined these two counts (plus one for the passenger) into a single Family_Size
field, and used the passenger's surname to distinguish families.
Extracting Deck from Cabin
The field Cabin in the training and test data has the form "C85" or "C125", where C
refers to the deck label. We found 8 deck labels: A, B, C, D, E, F, G, T. We see the
deck label as a refinement of the passenger's class field, since the decks A and B
were intended for passengers of the first class, etc.
Extracting Ticket_Code from Ticket
The field Ticket in the training and test data has the form "A/5 21171". Although
we could not determine the meaning of the letters in front of the numbers in the
field Ticket, we extracted those letters and used them in our prediction system.
We found the following letters:
Index Ticket Code Number of occurrences
0 No Code 961
1 A 42
2 C 77
3 F 13
4 L 1
5 P 98
6 S 98
7 W 19
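Both the Deck and Ticket_Code extractions above can be sketched as follows (hypothetical helper names; an empty string stands in for a missing Cabin, which in the real CSV is simply blank):

```python
def extract_deck(cabin):
    """First letter of the Cabin field ("C85" -> "C"); a missing cabin
    gets a placeholder value."""
    if cabin and cabin[0].isalpha():
        return cabin[0]
    return 'Unknown'

def extract_ticket_code(ticket):
    """First letter of the Ticket field if it starts with a letter
    ("A/5 21171" -> "A"), otherwise "No Code"."""
    return ticket[0] if ticket and ticket[0].isalpha() else 'No Code'

print(extract_deck("C85"))               # C
print(extract_ticket_code("A/5 21171"))  # A
print(extract_ticket_code("113803"))     # No Code
```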
Since the number of missing values was small, we used the median of all Fare values
to fill in missing Fare fields, and the letter 'S' (the most frequent value) for the
field Embarked.
In the training and test data, there was a significant number of missing Age values.
To fill those in, we used a Linear Regression algorithm to predict Age based on all
other fields except Passenger_ID and Survived.
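The Age imputation step can be sketched as follows (a minimal NumPy least-squares fit standing in for SciKit-Learn's LinearRegression; the toy feature matrix and its values are invented for illustration, not taken from the project's data):

```python
import numpy as np

# Toy feature matrix [class, sex, fare] with Age as the target.
X = np.array([[1, 0, 71.3],
              [3, 1, 7.9],
              [2, 1, 13.0],
              [3, 0, 8.1],
              [1, 1, 53.1]])
age = np.array([38.0, 22.0, np.nan, 26.0, np.nan])

known = ~np.isnan(age)                       # rows where Age is present
A = np.c_[X[known], np.ones(known.sum())]    # add an intercept column
coef, *_ = np.linalg.lstsq(A, age[known], rcond=None)

missing = ~known
B = np.c_[X[missing], np.ones(missing.sum())]
age[missing] = B @ coef                      # fill missing Ages with predictions
print(age)
```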
Importance of fields
The Decision Trees algorithm in the library SciKit-Learn allows us to evaluate the
importance of each field used for prediction. Below is a chart displaying the
importance of each field.
We can see that the field Sex is the most important one for prediction, followed
by Title, Fare, Age, Class, Deck, Family_Size, etc.
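Importances of this kind come from the fitted tree's feature_importances_ attribute; a toy two-field sketch (invented data, where one column fully determines the label and the other is noise, loosely mimicking Sex vs. a weak field):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(200, 2)).astype(float)
y = X[:, 0].astype(int)  # label depends only on the first column

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
for name, imp in zip(["Informative", "Noise"], tree.feature_importances_):
    print("%-12s %.2f" % (name, imp))
# The informative column receives (almost) all of the importance.
```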
Decision Trees
Our prediction system is based on growing Decision Trees to predict the survival
status. A typical Decision Tree is pictured below.
Stopping Rules:
1. The leaf nodes are pure
2. A maximal node depth is reached
3. Splitting a node does not lead to an information gain
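These stopping rules map onto parameters of SciKit-Learn's DecisionTreeClassifier (the parameter names are from the sklearn API; the values and the tiny XOR data set are illustrative, not from the report):

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    max_depth=10,               # rule 2: a maximal node depth is reached
    min_impurity_decrease=0.0,  # rule 3: require an information gain to split
    # rule 1 (pure leaf nodes) is the default stopping condition
)
X = [[0, 0], [1, 1], [1, 0], [0, 1]]
y = [0, 0, 1, 1]  # toy XOR-like labels
tree.fit(X, y)
print(tree.get_depth())
```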
In order to measure uncertainty and information gain, we used the formula

IG(parent) = I(parent) - (N_left / N_parent) * I(left) - (N_right / N_parent) * I(right)

where
IG: Information Gain
I: Impurity (Uncertainty Measure)
N_parent, N_left, N_right: number of samples in the parent, the left child, and the right child

Two common impurity measures are Entropy, I(t) = -SUM_i p_i log2 p_i, and
Gini, I(t) = 1 - SUM_i p_i^2, where p_i is the probability of class i at node t.
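The Entropy and Gini impurity measures can be computed directly (a small NumPy sketch; p is the vector of class probabilities at a node, and the function names are our own):

```python
import numpy as np

def entropy(p):
    """Entropy impurity: -sum(p * log2(p)), with 0*log2(0) treated as 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return -np.sum(nz * np.log2(nz))

def gini(p):
    """Gini impurity: 1 - sum(p**2)."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

print(entropy([0.5, 0.5]))  # 1.0 (maximal uncertainty at p = 1/2)
print(gini([0.5, 0.5]))     # 0.5
print(gini([1.0, 0.0]))     # 0.0 (a pure node has zero impurity)
```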
We can see on the graph that when the probability of an event is 0 or 1, the
uncertainty measure equals 0, while if the probability of an event is close to ½,
the uncertainty measure is maximal.
One common issue with all machine learning algorithms is Overfitting. For a
Decision Tree, it means growing too large a tree (with low bias and high variance),
so that it loses its ability to generalize from the data and to predict the output.
In order to deal with overfitting, we can grow several decision trees and take the
average of their predictions. The library SciKit-Learn provides two such
algorithms: Random Forest and Extra Trees.
In a Random Forest, we grow N decision trees, each based on a randomly selected subset of the data and M randomly selected fields, where M = √(number of fields).
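A sketch of both ensemble algorithms on invented data (the max_features="sqrt" setting implements the M = √(number of fields) rule above; the data set, its 9 fields, and the tree count are illustrative only):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(300, 9)                       # 9 toy fields
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)  # label from the first two fields

# Each forest averages 100 trees; "sqrt" selects M = sqrt(#fields)
# candidate fields at every split, which reduces overfitting.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0).fit(X, y)
extra = ExtraTreesClassifier(n_estimators=100, max_features="sqrt",
                             random_state=0).fit(X, y)
print(forest.score(X, y), extra.score(X, y))
```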
Conclusion