Exploratory Data Analysis of Titanic Survival Prediction Using Machine Learning Techniques
Abstract— It is very important to find out the root causes of past human tragedies so that future crises can be avoided. The incident of 15 April 1912 is an example of a human tragedy in which around 1,500 passengers and crew lost their lives. Continuing research suggests that, had a proper statistical assessment been carried out, the human devastation could probably have been reduced. In today's era, many new and powerful technologies are available with whose help accurate statistical calculations can be performed. In this research study, Titanic survival has been studied using machine learning techniques. Out of the total entities, 891 have been used for training and 418 for the test set, and the comparative study of different machine learning algorithms gives importance to this research study, so that the accuracy can be understood in a very authentic way.

I. INTRODUCTION

The Titanic dataset is used to analyze the survival of Titanic passengers based on a statistical analysis of supervised machine learning techniques such as Logistic Regression, Random Forest, Decision Tree, K-nearest neighbor, etc. As per the record, different types of people were present on the ship, so all predictions revolve around the types of people so that maximum survival can be assured. Before starting to train the model, it is important to pre-process the dataset in all aspects, such as missing values, consistent formatting, outliers, etc. [2][3]. For a better understanding, a clear picture of the research work is given in the flowchart in Figure 2.

The scope of this paper is to explore and analyze the Titanic dataset using machine learning techniques to predict the survival of passengers. The paper focuses on the application of supervised learning algorithms such as logistic regression, random forest, stochastic gradient descent, decision tree, and k-nearest neighbor to classify the passengers into two categories, survived or not survived. The study aims to compare the performance of these algorithms based on different evaluation metrics such as accuracy, F1-score, recall, and precision. Additionally, the paper discusses the data preparation process, feature engineering, and data visualization to gain insights into the dataset. The results obtained from this study can potentially aid in improving the safety protocols and emergency procedures of future maritime transport systems.

II. LITERATURE REVIEW

The Titanic dataset has been extensively used in various studies to explore predictive modeling techniques, feature selection methods, and machine learning algorithms. The dataset has served as a benchmark for competitions and tutorials, and its analysis has identified the most significant factors for survival prediction. The studies reviewed in this literature review demonstrate the versatility and importance of the Titanic dataset in advancing the field of machine learning and predictive modeling. From the survey of data mining techniques to the investigation of sigmoid function parameters in artificial neural networks, the studies have explored a wide range of concepts and methods relevant to machine learning and predictive modeling.
Table 1: Summarized view of Literature Review

Ref. | Year | Method Used | Assessment
1 | 2019 | Data Set Provider | Kaggle.com provides the Titanic dataset and platform for the Machine Learning from Disaster competition, which serves as a popular benchmark for predictive modeling.
2 | 2013 | Data Mining | This paper provides a comprehensive survey of data mining techniques, including supervised and unsupervised learning, and their applications in various fields.
3 | 2007 | Feature Selection | The paper proposes a spectral feature selection method for both supervised and unsupervised learning tasks, which can improve the accuracy and efficiency of predictive modeling.
4 | 2018 | Predictive Modeling | This study uses the Titanic dataset to predict the survivors of the disaster and compares the performance of various machine learning algorithms.
5 | 2012 | Predictive Modeling | It proposed a predictive modeling approach using the Titanic dataset and offers a comprehensive review of related concepts and methods.
6 | 2017 | Predictive Modeling | This GitHub repository contains a predictive model for the Jack Dies competition, which is based on the Titanic dataset and serves as a similar benchmark for machine learning.
7 | 2017 | Predictive Modeling | This study uses various machine learning algorithms to analyze the Titanic disaster and identifies the most important factors for survival prediction.
8 | 1995 | Neural Networks | The paper investigates the impact of sigmoid function parameters on the backpropagation learning algorithm in artificial neural networks.
9 | 2009 | Decision Trees | This paper presents an implementation of the ID3 decision tree learning algorithm and provides a tutorial on how to apply it to predictive modeling tasks.
10 | 2018 | Predictive Modeling | This study compares the performance of different machine learning techniques on the Titanic dataset and identifies the most accurate method for survival prediction.
11 | 2014 | SVM | The paper proposes two methods for selecting Gaussian kernel parameters in one-class SVM and applies them to fault detection in industrial systems.
12 | 2014 | Predictive Modeling | This study uses various machine learning algorithms to classify Titanic passenger data and predict their chances of survival in the disaster.
13 | 2012 | Predictive Modeling | This website provides a tutorial on predictive modeling using the Titanic dataset and offers a comprehensive review of related concepts and methods.
14 | 2011 | Sentiment Analysis | The paper presents a sentiment analysis method for Twitter data and shows its effectiveness in predicting the polarity of tweets.
15 | 2020 | Predictive Modeling | This study uses a machine learning approach to predict the prognosis of breast cancer patients and identifies the most important features for outcome prediction.

III. DESCRIPTION OF DATA AND EXPERIMENTAL SETUP

The dataset utilized in this study has been obtained from the Kaggle website (www.kaggle.com). The data collection contains twelve columns: Parch, Ticket, Pclass, Name, PassengerId, Survived, Sex, Age, SibSp, Fare, Cabin, and Embarked.
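As an illustrative sketch (not the code used in this study; pandas and the file name train.csv are assumptions), the Kaggle training file can be loaded and its columns inspected as follows:

```python
# Illustrative sketch: load the Kaggle Titanic training file and inspect it.
# Assumes the file has been downloaded as "train.csv".
import pandas as pd

df = pd.read_csv("train.csv")          # 891 rows in the Kaggle training split
print(df.shape)                        # (number of rows, number of columns)
print(df.columns.tolist())             # PassengerId, Survived, Pclass, Name, Sex,
                                       # Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked
print(df["Survived"].value_counts())   # target: 0 = did not survive, 1 = survived
```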
Along with these features, our target value is Survived. The dataset categorizes family ties in the following manner: siblings include blood, adopted, and step relatives (mistresses and fiancés were not considered spouses); both the mother and the father count as parents; and a child refers to a son, daughter, stepson, or stepdaughter. Because some children traveled only with a babysitter, their Parch value was zero.

Before any form of data analytics is performed, the dataset needs to be cleaned, since variables such as Embarked, Cabin, and Age contain missing values that must be handled. Missing Age values can be imputed by randomly sampling from the existing ages or, as done here, by filling them with the column mean; the Cabin column is removed; and missing Embarked values are replaced with the mode of that column.
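A minimal sketch of these cleaning steps, assuming pandas and the Kaggle training file train.csv (the library and file name are assumptions, not taken from the paper):

```python
# Illustrative sketch of the cleaning steps described above.
import pandas as pd

df = pd.read_csv("train.csv")

# Drop the Cabin column, which is mostly missing.
df = df.drop(columns=["Cabin"])

# Fill missing Embarked values with the most frequent port (column mode).
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Fill missing Age values with the column mean (random sampling from the
# observed ages would be the alternative noted above).
df["Age"] = df["Age"].fillna(df["Age"].mean())

print(df.isnull().sum())  # confirm no missing values remain in these columns
```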
Data exploration and analysis

In the first stage, we perform an exploratory analysis of the data for our problem. The dataset is examined through exploratory data analysis to identify the characteristics that impact the survival rate. By computing the correlation between each attribute and survival, the data is thoroughly reviewed. Fig. 3 demonstrates how sex affects the survival rate.
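The survival-rate comparison described above (e.g., by sex, as in Fig. 3) could be computed along these lines; this is an illustrative sketch, not the original analysis code:

```python
# Illustrative sketch: attribute-vs-survival checks on the cleaned data.
import pandas as pd

df = pd.read_csv("train.csv")

# Mean of the 0/1 Survived column per group = survival rate by sex (cf. Fig. 3).
print(df.groupby("Sex")["Survived"].mean())

# A simple correlation check between the numeric attributes and survival.
numeric = df[["Survived", "Pclass", "Age", "SibSp", "Parch", "Fare"]]
print(numeric.corr()["Survived"].sort_values(ascending=False))
```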
A. Logistic Regression is a more straightforward and efficient solution for binary and linear classification problems.

B. Random Forest is a supervised learning technique. It can be used in machine learning to obtain solutions for both regression and classification models. As the name indicates, "a Random Forest is a classifier that incorporates numerous decision trees on different subsets of the supplied dataset and takes the average to enhance the predicted accuracy of that dataset" [6]. On the Titanic dataset, instead of relying on a single decision tree, the random forest collects predictions from each tree and forecasts the final result using the predictions that received the most votes.

C. Stochastic Gradient Descent (SGD): the authors take the Titanic dataset and an approach that optimizes gradient descent throughout each search once a random weight vector is chosen. Gradient descent is a method that searches across a vast or infinite hypothesis space where 1) hypotheses are continuously parameterized and 2) errors are differentiable with respect to the parameters. In SGD the weights are initialized from the given dataset (the Titanic dataset), and the algorithm updates the weight vector with a single data point at a time. When an error computation is finished, gradient descent progressively updates the weights to enhance convergence. A brief sketch of fitting these classifiers is given below.
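A brief illustrative sketch of fitting the classifiers compared in this study with scikit-learn; the preprocessing choices (dropped columns, one-hot encoding, 80/20 split) are assumptions for the example, not necessarily those used by the authors:

```python
# Illustrative sketch: fit the five classifiers compared in this study.
import pandas as pd
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("train.csv")
df = df.drop(columns=["Cabin", "Name", "Ticket", "PassengerId"])
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])
df["Age"] = df["Age"].fillna(df["Age"].mean())
df = pd.get_dummies(df, columns=["Sex", "Embarked"], drop_first=True)

X, y = df.drop(columns=["Survived"]), df["Survived"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Stochastic Gradient Descent": SGDClassifier(random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "K-nearest neighbor": KNeighborsClassifier(n_neighbors=5),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))  # accuracy on the held-out split
```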
B. Accuracy: Accuracy is also a parameter used to evaluate classification models [12][13]. The percentage of correct predictions made by our model is known as accuracy. In binary classification, accuracy can also be quantified in terms of positives and negatives, where TP stands for True Positives, TN for True Negatives, FP for False Positives, and FN for False Negatives. For example, if the test set had 100 samples and our model accurately predicted 80 of them, we would get a score of 80/100 = 0.8, or 80% accuracy. The machine learning accuracy formula is (True Positives + True Negatives) / (True Positives + True Negatives + False Positives + False Negatives).

C. Recall: Recall is a model's ability to locate all relevant cases within a dataset. The number of true positives divided by the sum of the true positives and the false negatives is the definition of recall [14][15]. In machine learning, the recall formula is True Positives / (True Positives + False Negatives).

D. Precision: Precision is the classification algorithm's ability to return only relevant data. Precision can be calculated by dividing the number of true positives by the number of true positives plus the number of false positives. The machine learning precision formula is True Positives / (True Positives + False Positives).

E. F1-Score: When attempting to determine the ideal balance between precision and recall, we can use the F1 score to combine the two criteria. In machine learning, the F1 score formula is 2 * (Precision * Recall) / (Precision + Recall). A sketch of computing these four metrics is given below.
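The four metrics defined above can be computed with scikit-learn as in the following sketch; the variable names y_test and y_pred are assumptions for illustration:

```python
# Illustrative sketch: compute the four evaluation metrics described above.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def report(y_test, y_pred):
    # Accuracy = (TP + TN) / (TP + TN + FP + FN)
    print("Accuracy :", accuracy_score(y_test, y_pred))
    # Precision = TP / (TP + FP)
    print("Precision:", precision_score(y_test, y_pred))
    # Recall = TP / (TP + FN)
    print("Recall   :", recall_score(y_test, y_pred))
    # F1 = 2 * (Precision * Recall) / (Precision + Recall)
    print("F1-score :", f1_score(y_test, y_pred))

# Example usage with a fitted model:  report(y_test, model.predict(X_test))
```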
Table 2: Performance of the machine learning algorithms

Algorithm | Accuracy | F1-Score | Recall | Precision
Logistic Regression | 78% | 0.78 | 0.78 | 0.79
Random Forest | 82% | 0.82 | 0.81 | 0.82
Stochastic Gradient Descent | 58% | 0.45 | 0.58 | 0.63
Decision Tree | 79% | 0.79 | 0.79 | 0.79
K-nearest neighbor | 66% | 0.64 | 0.66 | 0.67

It is made very apparent that, when using a different feature modeling approach, the models' accuracy may change. The ideal models for this classification problem are Random Forest and Decision Tree, since they provide the high levels of accuracy mentioned above. The results of our experiment, as shown in Figure 4, demonstrate the performance of the various machine learning algorithms used for the prediction of Titanic survival. We have evaluated the performance of the algorithms using accuracy, F1-score, recall, and precision. The Random Forest algorithm performed the best with an accuracy of 82%, an F1-score of 0.82, a recall of 0.81, and a precision of 0.82. The Logistic Regression and Decision Tree algorithms also performed well, with accuracies of 78% and 79%, respectively. However, the Stochastic Gradient Descent algorithm showed poor performance with an accuracy of only 58%, and the K-nearest neighbor algorithm performed moderately with an accuracy of 66%. These results indicate that the Random Forest algorithm is the most suitable for predicting the survival of Titanic passengers using machine learning techniques.
VI. RESULT AND CONCLUSION

The first step in conducting data analysis is data cleaning. Analyzing exploratory data makes comprehending the dataset and the relationships between the features easier, utilizing several graphical techniques; those used above are histograms and ggplot. A few inferences are made and facts are discovered by using exploratory data analysis. Based on the exploratory data analysis technique, the precise parameters for building the training and prediction model are identified in feature engineering. Machine learning models predict which passengers survived. To make predictions in this classification problem, the Random Forest technique is used. With an accuracy of 0.8273, a recall of 0.8135, an F1-score of 0.8237, and a precision of 0.8273 according to the confusion matrix, Random Forest emerges as the most accurate model. This indicates that Random Forest has a very high level of prediction ability on this dataset using the selected features. For a clear picture of the statistical analysis, see Table 2.
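A hypothetical sketch of deriving the confusion-matrix cells and accuracy for a fitted classifier (the function and argument names are illustrative, not the authors' code):

```python
# Illustrative sketch: confusion-matrix cells for a fitted binary classifier.
from sklearn.metrics import confusion_matrix

def confusion_report(model, X_test, y_test):
    y_pred = model.predict(X_test)
    # sklearn returns [[TN, FP], [FN, TP]] for binary labels 0/1.
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)
    print("Accuracy:", (tp + tn) / (tp + tn + fp + fn))
```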
Fig. 4: Results of the algorithms. This graph shows each algorithm's performance in terms of accuracy, F1-score, recall, and precision.
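A grouped bar chart like Fig. 4 could be reproduced from the Table 2 values as in the following sketch; matplotlib is an assumption here (the paper itself mentions ggplot):

```python
# Illustrative sketch: grouped bar chart of the Table 2 results (cf. Fig. 4).
import numpy as np
import matplotlib.pyplot as plt

algorithms = ["Logistic\nRegression", "Random\nForest", "SGD", "Decision\nTree", "KNN"]
scores = {
    "Accuracy":  [0.78, 0.82, 0.58, 0.79, 0.66],
    "F1-Score":  [0.78, 0.82, 0.45, 0.79, 0.64],
    "Recall":    [0.78, 0.81, 0.58, 0.79, 0.66],
    "Precision": [0.79, 0.82, 0.63, 0.79, 0.67],
}

x = np.arange(len(algorithms))
width = 0.2
for i, (metric, values) in enumerate(scores.items()):
    plt.bar(x + i * width, values, width, label=metric)

plt.xticks(x + 1.5 * width, algorithms)
plt.ylim(0, 0.9)
plt.ylabel("Score")
plt.legend()
plt.title("Performance of the algorithms")
plt.tight_layout()
plt.show()
```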
VII. CONCLUSION AND FUTURE SCOPE

Models created using machine learning predict which passengers survived. The Random Forest technique is applied to make predictions in this classification challenge. The correctness of each model is determined from the confusion matrix, and the Random Forest model comes out on top with an accuracy of 0.82. This indicates that Random Forest's predictive capability on this dataset with the selected features is quite strong. It is made very apparent that, when using a different feature modeling approach, the models' accuracy may change. The model that provides the best level of accuracy for this classification problem is Random Forest. Machine learning and data analytics are used in this project, and its work can serve as a model for learning how to incorporate EDA and machine learning from the very beginning. With the use of more recent libraries, such as shiny in R, the concept can be expanded in the future to create more complex graphical user interfaces. It is possible to create an interactive page where the values corresponding to an attribute's graph (such as a ggplot or histogram) would also change if the value of the attribute is modified on a scale. By integrating our results, we can also reach far more precise conclusions.
REFERENCES
[1] Kaggle.com. (n.d.). Titanic: Machine Learning from Disaster. Retrieved October 29, 2019, from http://www.kaggle.com/
[2] Jain, N., & Srivastava, V. (2013). Data mining techniques: A
survey paper. IJRET: International Journal of Research in
Engineering and Technology, 2(11), 2319-1163.
[3] Zhao, Z., & Liu, H. (2007). Spectral feature selection for
supervised and unsupervised learning. Proceedings of the 24th
international conference on Machine learning. ACM.
[4] Farag, N., & Hassan, G. (2018). Predicting the Survivors of
the Titanic Kaggle, Machine Learning From Disaster. In
ICSIE'18 Proceedings of the 7th International Conference on
Software and Information Engineering (pp. 1-7). ACM.
[5] Lam, E., & Tang, C. (2012). CS229: Titanic - Machine Learning From Disaster.
[6] Liu, J. (2017). Arkham/Jack-Dies. GitHub. Retrieved August
30, 2017, from https://github.com/Arkham/jack-dies
[7] Singh, A., Saraswat, S., & Faujdar, N. (2017). Analyzing
Titanic disaster using machine learning algorithms. 2017
International Conference on Computing, Communication and
Automation (ICCCA). IEEE.
[8] Han, J., & Moraga, C. (1995). The influence of the sigmoid function parameters on the speed of backpropagation learning. In From Natural to Artificial Neural Computation (pp. 195-201). Springer.
[9] Peng, W., Chen, J., & Zhou, H. (2009). An implementation of
ID3-decision tree learning algorithm. Retrieved from
http://web.arch.usyd.edu.au/wpeng/DecisionTree2.pdf
[10] Ekinci, E. O., & Acun, N. (2018). A comparative study on machine learning techniques using Titanic dataset. 7th International Conference on Advanced Technologies.
[11] Xiao, Y., Wang, T., & Wu, J. (2014). Two methods of selecting Gaussian kernel parameters for one-class SVM and their application to fault detection. Knowledge-Based Systems, 59, 75-84.
[12] Cicoria, S., Sherlock, J., Muniswamaiah, M., & Clarke, L. (2014). Classification of Titanic passenger data and chances of surviving the disaster. Proceedings of Student-Faculty Research Day, CSIS, 1-6.
[13] Lam, E., & Tang, C. (2012). Titanic: Machine Learning From Disaster.
[14] Agarwal, A., Xie, B., Vovsha, I., Rambow, O., & Passonneau, R. (2011). Sentiment analysis of Twitter data. Proceedings of the ACL 2011 Workshop on Languages in Social Media.
[15] Andjelkovic Cirkovic, B. R. (2020). Machine learning approach for breast cancer prognosis prediction. Computational Modeling in Bioengineering and Bioinformatics.