
A Comparative Study on Machine Learning Techniques using Titanic Dataset


Ekin Ekinci*, Sevinç İlhan Omurca*, Neytullah Acun*
*Kocaeli University, Faculty of Engineering, Computer Engineering Department
Umuttepe Campus, Kocaeli, Turkey
{ekin.ekinci, silhan}@.edu.tr, [email protected]

Abstract— The Titanic disaster, the sinking of the British passenger ship in the North Atlantic on April 15, 1912 with the loss of 722 passengers and crew, took place many years ago, yet research on understanding what impacts an individual's survival or death continues to attract researchers' attention. In this study, we propose to apply fourteen different machine learning techniques, including Logistic Regression (LR), k-Nearest Neighbors (kNN), Naïve Bayes (NB), Support Vector Machines (SVM), Decision Tree, Bagging, AdaBoost, Extra Trees, Random Forest (RF), Gradient Boosting (GB), Calibrated GB, Artificial Neural Networks (ANN), Voting (GB, ANN, kNN) and Voting (GB, RF, NB, LR, kNN), to the publicly available Titanic dataset, in order to analyze the likelihood of survival and to learn which features are correlated with the survival of passengers and crew. The F-measure scores obtained from the machine learning techniques are also compared with each other and with the F-measure scores obtained from Kaggle. As a result of this study, GB and Voting achieve more successful F-measure rates than the corresponding Kaggle scores.

Keywords— Machine learning, classification, data analysis, Titanic, Kaggle
I. INTRODUCTION

The inevitable development of technology has both facilitated our lives and brought some difficulties with it. One of the benefits brought by technology is that a wide range of data can be obtained easily when requested. However, it is not always possible to acquire the right information. Raw data that is easily accessed from internet sources does not make sense on its own, and it should be processed to serve an information retrieval system. In this regard, feature engineering methods and machine learning algorithms play an important role.
The aim of this study is to obtain results that are as reliable as possible from raw and incomplete data by using machine learning and feature engineering methods. Therefore, one of the most popular datasets in data science, the Titanic dataset, is used. This dataset records various features of passengers on the Titanic, including who survived and who did not. It is observed that some missing and uncorrelated features decrease the prediction performance. For a detailed data analysis, the effect of the features has been investigated. Thus, some new features are added to the dataset and some existing features are removed from it.
Chatterjee [1] applied multiple linear regression and logistic regression to predict whether a passenger survived. He reported performance metrics across different cases and concluded that the maximum accuracy obtained from Multiple Linear Regression is 78.426%, while the maximum accuracy obtained from Logistic Regression is 80.756%.
Datla [2] compared the results of Decision Tree and Random Forest algorithms on the Titanic dataset. The Decision Tree achieved a proportion of 0.84 correctly classified instances, while Random Forests achieved 0.81. As the feature engineering steps, they created new variables such as "survived", "child", "new_fare", "title", "Familysize" and "FamilyIdentity", which are not included in the feature list of the Titanic dataset, and also replaced missing values by the mean value of the given feature.
There are several studies in the literature that compare different classification algorithms on multiple datasets. Meyer et al. [3] compared an SVM implementation to 16 classification algorithms and, for the Titanic dataset, achieved 20.81% and 21.27% error rates with neural networks and SVM respectively as the minimum errors. Rätsch et al. [4] compared AdaBoost classifiers to SVM and RBF classifiers; for the Titanic dataset, a 22.4% error rate is obtained from SVM as the minimum error rate. Li et al. [5] used SVM as a component classifier for AdaBoost. They used the Titanic dataset as one of the experimental datasets, and the minimum error rate they obtained is 21.8%.
The rest of the paper is organized as follows: Section 2 presents the techniques employed in the experimental studies. The experimental setup and results are given in Section 3. Section 4 concludes the paper with a discussion.

II. METHODOLOGY

A. Logistic Regression

LR is one of the most popular methods used to classify binary data. LR is based on the assumption that the value of the dependent variable can be predicted by using the independent variables. In the model, Y is the dependent variable we are trying to predict by observing X, the input or set of independent variables (X1, ..., Xn). The value of Y corresponds to a person being either survived (Y=1) or not survived (Y=-1), given the observation (X=x). From this definition, the conditional probability follows a logistic distribution given by P(Y = 1 | X = x). This function, called the regression function, is what we need in order to predict Y.
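As an illustrative sketch (not taken from the authors' code), the regression function P(Y = 1 | X = x) can be fit and evaluated with scikit-learn; the feature values below are placeholders.

```python
# Minimal sketch: fitting P(Y=1 | X=x) with scikit-learn's LogisticRegression.
# The feature values are illustrative; the paper does not publish its code.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[3, 22.0, 7.25], [1, 38.0, 71.28], [3, 26.0, 7.92], [1, 35.0, 53.10]])
y = np.array([0, 1, 1, 1])  # 1 = survived, 0 = not survived

model = LogisticRegression(penalty="l2")  # l2 penalty, as used later in the experiments
model.fit(X, y)

# The regression function is the sigmoid of a linear combination of the features.
w, b = model.coef_[0], model.intercept_[0]
p_manual = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # P(Y=1 | X=x) computed by hand
p_sklearn = model.predict_proba(X)[:, 1]       # same probabilities from the fitted model
print(np.allclose(p_manual, p_sklearn))        # True
```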
B. K Nearest Neighbors

kNN is one of the most common, simplest and non-parametric classification algorithms, used when there is little or no prior knowledge about the distribution of the data. Using distance metrics to measure the closeness between the training samples and the test sample, kNN assigns to the test sample the class of its k nearest training samples. In terms of closeness, kNN is mostly based on the Euclidean distance. The Euclidean distance between a training sample X = (x1, x2, ..., xN) with N features and a test sample Y = (y1, y2, ..., yN) with N features, obtained with m = 2, is

$d(X, Y) = \left( \sum_{i=1}^{N} |x_i - y_i|^{m} \right)^{1/m}$ .  (1)

When m = 1 the distance is called the Manhattan distance, and for m > 2 it is called the Minkowski distance.
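Eq. (1) and the kNN decision rule can be written directly; the snippet below is an illustrative sketch (not the authors' code) of how the Minkowski distance reduces to Manhattan and Euclidean for m = 1 and m = 2, and how a test sample is assigned the majority class of its k nearest neighbors.

```python
# Illustrative sketch of Eq. (1) and the kNN decision rule; not taken from the paper's code.
from collections import Counter
import numpy as np

def minkowski(x, y, m=2):
    """d(x, y) = (sum_i |x_i - y_i|^m)^(1/m); m=1 Manhattan, m=2 Euclidean."""
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** m) ** (1.0 / m)

def knn_predict(X_train, y_train, x_test, k=3, m=2):
    # Rank the training samples by distance to the test sample and vote over the k nearest.
    dists = [minkowski(x, x_test, m) for x in X_train]
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

X_train = np.array([[1.0, 0.0], [0.9, 0.2], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, [4.9, 5.1], k=3))  # -> 1
```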
C. Naïve Bayes

NB, known as an effective inductive learning algorithm, achieves efficient and fast classification in machine learning applications. The algorithm is based on Bayes' theorem, assuming all features are independent given the value of the class variable [6]. This is the conditional independence assumption, which rarely holds exactly in real-world applications; nevertheless, NB performs well on high-dimensional and complex datasets.
D. Support Vector Machines

SVM, which was developed by Vapnik in 1995, is based on the principle of structural risk minimization and exhibits good generalization performance. SVM finds an optimal separating hyperplane between the classes by focusing on the support vectors [7]. This hyperplane separates the training data with a maximal margin. SVM solves nonlinear problems by mapping the data points into a high-dimensional space.
E. Decision Tree

Decision trees, with their fairly simple structure, are among the most widely used classifiers. A decision tree is a tree-structured model with decision nodes and prediction nodes. Decision nodes are used to branch, and prediction nodes specify class labels. C4.5 is a decision tree algorithm that builds a tree from the training data by using the information gain. When building decision trees, C4.5 uses a divide-and-conquer approach.
F. Bagging

Bagging is one of the oldest and simplest techniques for creating an ensemble of classifiers; it improves accuracy by resampling the training set [8]. The fundamental idea behind bagging is to use multiple training sets instead of a single one, so that the results do not depend on one particular training set. A single base classifier is applied in parallel to the generated training sets, and the resulting classification models are combined by majority voting. With bagging, high accuracy, good generalization performance and a reduction in variance and bias are achieved.

G. AdaBoost

AdaBoost is one of the most widely used and effective ensemble learning methods. The basic notion behind AdaBoost is that a strong classifier can be created by linearly combining a number of weak classifiers [9]. In the training process, AdaBoost increases the weights of misclassified data points while decreasing the weights of correctly classified data points; that is, AdaBoost reweights all training data in every iteration. The weak classifiers are applied serially, and the generated classification models are combined by weighted majority voting.
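The reweighting step can be sketched in a few lines; the snippet below is a toy discrete AdaBoost round with decision stumps, illustrative only and not the configuration used in the experiments of Section III.

```python
# Illustrative sketch of AdaBoost's reweighting step (discrete AdaBoost with stumps);
# not the exact configuration used in the paper's experiments.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1, 1, -1, -1, 1, -1])
w = np.full(len(y), 1.0 / len(y))           # start with uniform sample weights

for t in range(3):                          # a few boosting rounds
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = np.sum(w[pred != y]) / np.sum(w)  # weighted error of the weak classifier
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))  # its weight in the final vote
    # misclassified points get larger weights, correctly classified points get smaller weights
    w *= np.exp(-alpha * y * pred)
    w /= w.sum()
    print(f"round {t}: err={err:.2f}, alpha={alpha:.2f}")
```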
H. Extra Trees

Extra Trees (Extremely Randomized Trees) is a decision tree ensemble classification method based on randomization. For each node of a tree, splitting rules are drawn at random, and the best-performing rule according to a score is associated with that node [10]. Each tree composing the Extra Trees ensemble is trained on the whole dataset.

İ. Random Forest

RF is a classification algorithm developed by Breiman and Cutler that uses an ensemble of tree predictors [11]. It is one of the most accurate learning algorithms and, for many datasets, it produces a highly accurate classifier. In RF, each tree is constructed by bootstrapping the training data, and for each split a randomly selected subset of the features is used [12]. Splitting is based on a purity measure. The method can estimate missing data and maintains accuracy even when a large proportion of the data is missing.

J. Gradient Boosting

GB, developed by Friedman (2001), is a powerful machine learning algorithm that has shown considerable success in a wide range of real-world applications. GB handles boosting as a method for function estimation, in terms of numerical optimization in function space [13].

K. Artificial Neural Networks

The multilayer perceptron (MLP) is a kind of ANN that can solve nonlinear classification problems with high accuracy and good generalization performance. The MLP has been applied to a wide variety of tasks such as feature selection, pattern recognition and optimization. An MLP can be considered as a directed graph in which artificial neurons are represented by nodes and directed, weighted edges connect the nodes to each other [14]. Nodes are organized into layers: an input layer, one or more hidden layers and an output layer. MLP uses backpropagation to classify data points; the error is propagated in the backward direction to adjust the weights.
L. Voting

To obtain accurate classification results, a set of classifiers can be assembled for artificial and real-world datasets. Voting is the simplest method that combines the predictions from multiple classifiers into a single decision [15]. While majority voting selects the class with the most votes, weighted voting makes a weighted linear combination of the classifiers and selects the class with the highest aggregate score.
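A hedged illustration of hard (majority) versus soft (weighted) voting with scikit-learn's VotingClassifier follows; the member models and weights are placeholders, not the tuned models of Section III.

```python
# Illustrative sketch of majority vs. weighted voting; the member models and weights
# are placeholders, not the tuned configurations used in Section III.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

members = [("gb", GradientBoostingClassifier()),
           ("ann", MLPClassifier(max_iter=1000)),
           ("knn", KNeighborsClassifier())]

hard = VotingClassifier(members, voting="hard")                      # class with the most votes
soft = VotingClassifier(members, voting="soft", weights=[2, 1, 1])   # weighted combination of probabilities
for clf in (hard, soft):
    clf.fit(X, y)
    print(clf.voting, clf.score(X, y))
```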

III. EXPERIMENTS

A. Dataset

The Titanic: Machine Learning from Disaster competition dataset [16] was provided by Kaggle. The Titanic dataset consists of a training set of 891 passengers and a test set of 418 passengers, who are different from the passengers in the training set. A description of the features is given in Table I.

TABLE I
NUMBER OF FEATURES IN THE DATASET

Feature     | Value of Feature   | Feature Characteristic
PassengerId | 1-891              | Integer
Survived    | 0, 1               | Integer
Pclass      | 1-3                | Integer
Name        | Name of passengers | Object
Sex         | Male, female       | Object
Age         | 0-80               | Real
SibSp       | 0-8                | Integer
Parch       | 0-6                | Integer
Ticket      | Ticket number      | Object
Fare        | 0-512              | Real
Cabin       | Cabin number       | Object
Embarked    | S, C, Q            | Object
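Assuming the competition CSV files from Kaggle [16] are available locally (the file names below are the standard Kaggle ones, not stated in the paper), the dataset and the feature types of Table I can be inspected with pandas.

```python
# Sketch of loading and inspecting the Kaggle Titanic data with pandas.
# File names follow the standard Kaggle download; the paper does not list them explicitly.
import pandas as pd

train = pd.read_csv("train.csv")   # 891 passengers
test = pd.read_csv("test.csv")     # 418 different passengers

print(train.shape, test.shape)     # expected: (891, 12) (418, 11) - the test set has no Survived column
print(train.dtypes)                # integer, real (float) and object features, as in Table I
print(train.isnull().sum())        # missing values, e.g. in Age, Cabin and Embarked
```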

While features such as PassengerId, Survived, Pclass, Age, SibSp, Parch and Fare take numeric values, Name, Sex and Embarked take nominal values; features such as Ticket and Cabin can take both numeric and nominal values.
For a detailed feature engineering, we first analyzed the features.
1) Sex: When we consider the distribution of the "Sex" feature, there are 314 female and 577 male passengers. 233 of the female passengers were rescued and the others lost their lives. On the other hand, 109 of the male passengers were rescued and the others lost their lives. Analyzing these distributions, it is realized that the survival rate of women is higher than that of men. It has therefore been concluded that the effect of this feature on predicting the class label is significant.

Fig. 1 Distribution of Sex feature.

2) Embarked: When we consider the distribution of the "Embarked" feature, there are 644, 168 and 77 passengers boarding from the ports "S", "C" and "Q" respectively. The survival rates of passengers boarding from these ports are given in Fig. 2. When this figure is analyzed, C is the port with the highest survival rate, 55%. Thus, the "Embarked" feature can be interpreted as giving important clues about survival.

Fig. 2 Distribution of Embarked feature.
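The counts and survival rates quoted for Sex and Embarked can be reproduced with simple group-by aggregations; the sketch below assumes the `train` DataFrame loaded earlier.

```python
# Sketch: survival counts/rates per group, assuming the `train` DataFrame from above.
print(train["Sex"].value_counts())                             # ~577 male, 314 female
print(train.groupby("Sex")["Survived"].agg(["sum", "mean"]))   # e.g. 233 surviving female passengers
print(train["Embarked"].value_counts())                        # ~644 S, 168 C, 77 Q
print(train.groupby("Embarked")["Survived"].mean())            # C has the highest rate (~0.55)
```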
3) Pclass: The "Pclass" feature describes three different classes of passengers. There are 216 passengers in class 1, 184 passengers in class 2 and, finally, 491 passengers in class 3. The survival rates of passengers according to the "Pclass" feature are given in Fig. 3. The passengers with the highest survival rate are the first class passengers, with 63%. This ratio also suggests that wealthier passengers were more likely to survive.

Fig. 3 Distribution of Pclass feature.

4) Age: When the "Age" feature is considered, it is seen that the ages of the passengers range from 0 to 80. If we group the passengers into specific age ranges such as 0-13, 14-60 and 61-80, we realize that most of the passengers in the 0-13 age group survived, while a large majority of the passengers in the 61-80 age group lost their lives. This statistical information indicates that children were rescued first when the ship started to sink.

Fig. 4 Distribution of Age feature.

5) Fare: The "Fare" feature specifies the fare paid by the passenger and ranges between 0 and 512. If we split this feature into two groups, 0-90 and 91-512, it is seen that most of the passengers who paid between 0 and 90 lost their lives, while the majority of the passengers who paid between 91 and 512 survived.
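The age and fare groupings described above can be expressed as explicit bins; a sketch with pandas.cut follows, using the bin edges quoted in the text.

```python
# Sketch: binning Age and Fare into the groups discussed above.
age_groups = pd.cut(train["Age"], bins=[0, 13, 60, 80],
                    labels=["0-13", "14-60", "61-80"], include_lowest=True)
fare_groups = pd.cut(train["Fare"], bins=[-0.1, 90, 512], labels=["0-90", "91-512"])

print(train.groupby(age_groups)["Survived"].mean())   # children (0-13) survive most often
print(train.groupby(fare_groups)["Survived"].mean())  # higher fares survive more often
```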
6) Correlation between data: In the classification tasks of machine learning, correlation, which is often used as a preliminary technique to discover relationships between variables, can be a key to improving the accuracy of a prediction model. In classification models, the positive or negative correlation between feature values can be used to discover in which ways the independent features influence the prediction. The correlation between the features of the Titanic dataset is shown in Fig. 5. Because the Embarked and Sex features have nominal values, they are not included in this figure. When the correlation scores are evaluated, it is observed that the correlation between "Survived" and "Sex" is the highest, while the correlation between "Survived" and "Age" is the minimum. Apart from that, the "SibSp" and "Parch" features are correlated by 0.41. Accordingly, a new feature can be created by combining these two features.

Fig. 5 Correlation between data.
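The correlation matrix of Fig. 5 can be computed over the numeric features; the sketch below is illustrative, since the exact encoding used for the figure is not given in the paper.

```python
# Sketch: pairwise correlations among the numeric features, as in Fig. 5.
numeric_cols = ["Survived", "Pclass", "Age", "SibSp", "Parch", "Fare"]
corr = train[numeric_cols].corr()
print(corr["Survived"].sort_values())   # correlations of each numeric feature with Survived
print(corr.loc["SibSp", "Parch"])       # ~0.41, as noted in the text
```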
7) Family size: In machine learning applications, feature extension methods, as well as feature reduction methods, can improve the classification performance. In this study, a feature named "Family_size" is created in addition to the existing features. This feature is calculated by adding the value of the SibSp feature to the value of the Parch feature. After that, we split this feature into two groups: the first group consists of passengers whose family_size is 0, 4, 5, 6, 7 or 10, and the second group consists of passengers whose family_size is 1, 2 or 3. It is observed that most of the first group lost their lives, while the majority of the second group survived. These results show that family size affects the likelihood of survival.

Fig. 6 Distribution of Family_size feature.
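A sketch of the Family_size construction and the two-group split described above follows; the group boundaries come from the text, while the implementation details are assumptions.

```python
# Sketch: creating Family_size from SibSp + Parch and splitting it into the two groups above.
train["Family_size"] = train["SibSp"] + train["Parch"]

# second group = family sizes 1, 2 or 3; first group = everything else (0, 4, 5, 6, 7, 10)
train["Family_group"] = train["Family_size"].isin([1, 2, 3]).astype(int)
print(train.groupby("Family_group")["Survived"].mean())  # the second group survives far more often
```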
B. Preprocessing Steps

In this study, data cleaning, data integration and data transformation are applied as preprocessing steps. The missing values of the Age and Fare features are filled with the median values of these features. The missing values of the "Embarked" feature are filled with the value "C". The PassengerId, Name, Ticket and Cabin features are removed from the feature set.
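These preprocessing steps correspond to the following pandas operations (a sketch; the column names are those of Table I).

```python
# Sketch of the preprocessing steps: median imputation, Embarked filled with "C",
# and removal of identifier-like columns.
for df in (train, test):
    df["Age"] = df["Age"].fillna(df["Age"].median())
    df["Fare"] = df["Fare"].fillna(df["Fare"].median())
    df["Embarked"] = df["Embarked"].fillna("C")

train = train.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"])
test = test.drop(columns=["Name", "Ticket", "Cabin"])  # PassengerId is usually kept for the Kaggle submission
```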
C. Experimental Results

All algorithms are run in order to analyze the likelihood of survival and to learn which features are correlated with the survival of passengers and crew. When applying the algorithms to the Titanic dataset, we have seen that, to make an algorithm accurate, some adjustments of the model parameters are required.
Logistic Regression is applied with an "l2" penalty term. For kNN, the number of neighbors is selected as 8 and Minkowski is selected as the distance measure. The Naïve Bayes algorithm is used based on the Bernoulli distribution. In SVM, the penalty parameter, which controls the level of misclassification, is set to 3. Gini is used as the impurity measure in the C4.5 decision tree; in addition, the maximum depth, a parameter that makes the search space finite and prevents the decision tree from growing to an extremely large size, is set to 25. In Bagging, the maximum bag size is determined as 2.6%. A decision tree is used as the base estimator in Bagging and AdaBoost. The number of trees in the forest is selected as 15 and 200 for the Extra Trees and Random Forest classifiers respectively. In Gradient Boosting, logistic regression is used as the loss function to be optimized. Gradient Boosting is calibrated with a sigmoid function in Calibrated Gradient Boosting. An MLP with backpropagation is used as the Artificial Neural Network. In the first voting algorithm GB, ANN and kNN are voted; in the second, GB, RF, NB, LR and kNN are voted.
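One plausible scikit-learn reading of these settings is sketched below. The parameter names are our interpretation of the text, not the authors' published code; they follow recent scikit-learn releases (older versions use `base_estimator=` instead of `estimator=` and `loss="deviance"` instead of `loss="log_loss"`), and scikit-learn's DecisionTreeClassifier is CART-based, which only approximates C4.5.

```python
# Sketch: one plausible scikit-learn mapping of the settings described above (assumptions, not the authors' code).
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier, RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

lr = LogisticRegression(penalty="l2")
knn = KNeighborsClassifier(n_neighbors=8, metric="minkowski")
nb = BernoulliNB()
svm = SVC(C=3)
dt = DecisionTreeClassifier(criterion="gini", max_depth=25)   # CART stand-in for C4.5
bag = BaggingClassifier(estimator=DecisionTreeClassifier())
ada = AdaBoostClassifier(estimator=DecisionTreeClassifier())
et = ExtraTreesClassifier(n_estimators=15)
rf = RandomForestClassifier(n_estimators=200)
gb = GradientBoostingClassifier(loss="log_loss")              # logistic loss
cgb = CalibratedClassifierCV(GradientBoostingClassifier(), method="sigmoid")
ann = MLPClassifier()
vote1 = VotingClassifier([("gb", gb), ("ann", ann), ("knn", knn)])
vote2 = VotingClassifier([("gb", gb), ("rf", rf), ("nb", nb), ("lr", lr), ("knn", knn)])
```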
The algorithms are evaluated according to accuracy and F-measure. We compare our F-measure scores with the F-measure scores obtained from Kaggle. The performances of the algorithms are listed in Table II. It is observed that the best performance is provided by Voting (GB, ANN, kNN), with an F-measure score of 0.82. Compared with Kaggle, more successful F-measure rates have been obtained with GB and Voting. With calibration we expected to see better results, but our calibration does not yield an increase in the F-measure score, whereas the Kaggle score does increase.
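The two reported metrics can be computed as follows; this is a sketch that assumes the `train` DataFrame and the `vote1` model from the earlier sketches, since the paper does not specify its exact validation split.

```python
# Sketch: computing accuracy and F-measure for one model on a held-out validation split,
# assuming the `train` DataFrame and `vote1` model defined in the earlier sketches.
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

X = pd.get_dummies(train.drop(columns=["Survived"]), columns=["Sex", "Embarked"])
y = train["Survived"]
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

vote1.fit(X_tr, y_tr)
y_pred = vote1.predict(X_val)
print("accuracy :", accuracy_score(y_val, y_pred))
print("F-measure:", f1_score(y_val, y_pred))
```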
TABLE II
COMPARISON OF ACCURACY, F-MEASURE AND KAGGLE SCORES OF ALGORITHMS

Algorithm                    | Accuracy | F-measure | Kaggle
Voting (GB, ANN, kNN)        | 0.869    | 0.82      | 0.794
Gradient Boosting            | 0.869    | 0.815     | 0.789
Calibrated (GB)              | 0.866    | 0.81      | 0.813
Voting (GB, RF, NB, LR, kNN) | 0.851    | 0.79      | 0.789
Random Forest                | 0.848    | 0.781     | 0.789
Artificial Neural Networks   | 0.813    | 0.743     | 0.766
AdaBoost                     | 0.814    | 0.741     | 0.78
Decision Tree                | 0.817    | 0.738     | 0.789
Bagging                      | 0.806    | 0.731     | 0.775
Logistic Regression          | 0.802    | 0.728     | 0.766
Naive Bayes                  | 0.789    | 0.714     | 0.762
Extra Trees                  | 0.815    | 0.713     | 0.785
k Nearest Neighbors          | 0.802    | 0.712     | 0.665
Support Vector Machines      | 0.787    | 0.71      | 0.766

IV. CONCLUSIONS

Obtaining valuable results from raw and incomplete data by using machine learning and feature engineering methods is very important for a knowledge-based world. In this paper, we have proposed models for predicting whether a person survived the Titanic disaster or not. First, a detailed data analysis is conducted to investigate which features are correlated with survival and which are non-informative. As a preprocessing step, some new features, such as family_size, are added to the dataset, and some features, such as Name, Ticket and Cabin, are excluded. Secondly, in the classification step, 14 different machine learning algorithms are used to classify the dataset formed in the preprocessing step.
The proposed model can predict the survival of passengers and crew with a 0.82 F-measure score using Voting (GB, ANN, kNN).
In conclusion, this paper presents a comparative study of machine learning techniques for analyzing the Titanic dataset, to learn which features affect the classification results and which techniques are robust.
REFERENCES
[1] T. Chatterjee, "Prediction of Survivors in Titanic Dataset: A Comparative Study using Machine Learning Algorithms," International Journal of Emerging Research in Management & Technology, vol. 6, pp. 1-5, June 2017.
[2] M. V. Datla, "Bench Marking of Classification Algorithms: Decision Trees and Random Forests – A Case Study using R," in Proc. I-TACT-15, 2015, pp. 1-7.
[3] D. Meyer, F. Leisch and K. Hornik, "The support vector machine under test," Neurocomputing, vol. 55, pp. 169-186, Sept. 2003.
[4] G. Rätsch, T. Onoda and K.-R. Müller, "Soft Margins for AdaBoost," Machine Learning, vol. 42, pp. 287-320, Mar. 2001.
[5] X. Li, L. Wang and E. Sung, "AdaBoost with SVM-based component classifiers," Engineering Applications of Artificial Intelligence, vol. 21, pp. 785-795, Aug. 2008.
[6] S. İlhan Omurca and E. Ekinci, "An alternative evaluation of post traumatic stress disorder with machine learning methods," in Proc. INISTA 2015, 2015, pp. 1-7.
[7] C. Cortes and V. Vapnik, "Support-Vector Networks," Machine Learning, vol. 20, pp. 273-297, 1995.
[8] G. Liang, X. Zhu and C. Zhang, "An empirical study of bagging predictors for different learning algorithms," in Proc. AAAI'11, 2011, pp. 1802-1803.
[9] Y. Ma, X. Ding, Z. Wang and N. Wang, "Robust precise eye location under probabilistic framework," in Proc. Sixth IEEE International Conference on Automatic Face and Gesture Recognition, 2004, pp. 339-344.
[10] C. Desir, C. Petitjean, L. Heutte, M. Salaun and L. Thiberville, "Classification of Endomicroscopic Images of the Lung Based on Random Subwindows and Extra-Trees," IEEE Transactions on Biomedical Engineering, vol. 59, pp. 2677-2683, Sep. 2012.
[11] L. Breiman, "Random Forests," Machine Learning, vol. 45, pp. 5-32, 2001.
[12] R. Diaz-Uriarte and S. Alvarez de Andres, "Gene selection and classification of microarray data using random forest," BMC Bioinformatics, vol. 7, p. 3, Jan. 2006.
[13] S. B. Taieb and R. J. Hyndman, "A gradient boosting approach to the Kaggle load forecasting competition," International Journal of Forecasting, vol. 30, pp. 382-394, Apr. 2014.
[14] I. Maglogiannis, K. Karpouzis, B. A. Wallace and J. Soldatos, Eds., Supervised Machine Learning: A Review of Classification Techniques, ser. Emerging Artificial Intelligence Applications in Computer Engineering, vol. 160. IOS Press, 2007.
[15] E. Bauer and R. Kohavi, "An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants," Machine Learning, vol. 36, pp. 105-139, 1999.
[16] (2018) The Kaggle website. [Online]. Available: http://www.kaggle.com/
