A Comparative Study On Machine Learning Techniques Using Titanic Dataset
Abstract— The Titanic disaster, the sinking of the British passenger ship with the loss of 722 passengers and crew, occurred in the North Atlantic on April 15, 1912. Although many years have passed since this maritime disaster took place, research on understanding what affects an individual's survival or death continues to attract researchers' attention. In this study, we apply fourteen different machine learning techniques, namely Logistic Regression (LR), k-Nearest Neighbors (kNN), Naïve Bayes (NB), Support Vector Machines, Decision Tree, Bagging, AdaBoost, Extra Trees, Random Forest (RF), Gradient Boosting (GB), Calibrated GB, Artificial Neural Networks (ANN), Voting (GB, ANN, kNN) and Voting (GB, RF, NB, LR, kNN), to the publicly available Titanic dataset in order to analyze the likelihood of survival and to learn which features are correlated with the survival of passengers and crew. The F-measure scores obtained from the machine learning techniques are compared with each other and with the F-measure score obtained from Kaggle. As a result of this study, higher F-measure rates than the Kaggle score have been obtained with GB and Voting.

Keywords— Machine learning, classification, data analysis, Titanic, Kaggle

I. INTRODUCTION

The inevitable development of technology has both facilitated our lives and brought some difficulties with it. One of the benefits brought by technology is that a wide range of data can be obtained easily on request. However, it is not always possible to acquire the right information. Raw data that is easily accessed from internet sources does not make sense on its own; it should be processed to serve an information retrieval system. In this regard, feature engineering methods and machine learning algorithms play an important role in this process.

The aim of this study is to obtain results that are as reliable as possible from raw data with missing values by using machine learning and feature engineering methods. Therefore, one of the most popular datasets in data science, Titanic, is used. This dataset records various features of the passengers on the Titanic, including who survived and who did not. It was observed that some missing and uncorrelated features decreased the prediction performance. For a detailed data analysis, the effect of the features has been investigated. Thus, some new features are added to the dataset and some existing features are removed from it.

Chatterjee [1] applied multiple linear regression and logistic regression to predict whether a passenger survived. He reported performance metrics across different cases for comparison and concluded that the maximum accuracy obtained from Multiple Linear Regression is 78.426%, while the maximum accuracy obtained from Logistic Regression is 80.756%.

Datla [2] compared the results of the Decision Tree and Random Forests algorithms on the Titanic dataset. The Decision Tree achieved a correctly classified instance rate of 0.84, while Random Forests achieved 0.81. As feature engineering steps, they created new variables such as "survived", "child", "new_fare", "title", "Familysize" and "FamilyIdentity", which are not included in the feature list of the Titanic dataset, and also replaced missing values with the mean value of the corresponding feature.

There are several studies in the literature that compare different classification algorithms on multiple datasets. Meyer et al. [3] compared an SVM implementation to 16 classification algorithms and, for the Titanic dataset, achieved 20.81% and 21.27% error rates with neural networks and SVM, respectively, as the minimum errors. Ratsch et al. [4] compared AdaBoost classifiers to SVM and RBF classifiers; for the Titanic dataset, a 22.4% error rate was obtained with SVM as the minimum error rate. Li et al. [5] used SVM as a component classifier for AdaBoost. They used the Titanic dataset as one of their experimental datasets, and the minimum error rate they obtained is 21.8%.

The rest of the paper is organized as follows: Section 2 presents the techniques employed in the experimental studies. The experimental setup and results are given in Section 3. Section 4 concludes the paper with a discussion.

II. METHODOLOGY

A. Logistic Regression

LR is one of the most popular methods used to classify binary data. LR is based on the assumption that the value of the dependent variable can be predicted from the independent variables. In the model, Y is the dependent variable we are trying to predict by observing X, the input or set of independent variables (X1, ..., Xn). The value of Y labels a person as either survived (Y = 1) or not survived (Y = -1), and the observed input is summarized by (X = x). From this definition, the conditional probability follows a logistic distribution given by P(Y = 1 | X = x). This function, called the regression function, is what we need in order to predict Y.
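The paper does not list its implementation details here; as a minimal sketch (our own illustration, assuming the scikit-learn library, the Kaggle train.csv file, and a handful of simply encoded features rather than the authors' exact pipeline), an LR baseline for this task could look as follows:

# Minimal sketch (not the authors' exact pipeline): logistic regression on the
# Kaggle Titanic training file, using a few simply encoded features.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

train = pd.read_csv("train.csv")                             # Kaggle Titanic training set (891 rows)
train["Sex"] = train["Sex"].map({"male": 0, "female": 1})    # encode sex as 0/1
train["Age"] = train["Age"].fillna(train["Age"].median())    # simple imputation, assumed here

X = train[["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]]
y = train["Survived"]

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("F-measure:", f1_score(y_val, lr.predict(X_val)))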
B. K Nearest Neighbors

kNN is one of the most common, simplest and non-parametric classification algorithms, used when there is little or no prior knowledge about the distribution of the data. Using a distance metric to measure the closeness between the training samples and a test sample, kNN assigns to the test sample the class of its k nearest training samples. In terms of closeness, kNN is mostly based on the Euclidean distance. The Euclidean distance between a training sample x = (x_1, x_2, ..., x_N) with N features and a test sample y = (y_1, y_2, ..., y_N) with N features is obtained for m = 2 from

d(x, y) = (Σ_{i=1}^{N} (x_i − y_i)^m)^{1/m}.   (1)

When m = 1 the distance is called the Manhattan distance, and when m > 2 it is called the Minkowski distance.
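As an illustrative sketch (ours, not the paper's code; the value of k and the feature pair are assumptions), the distance in (1) and a kNN classifier built on it can be written as:

# Sketch: Minkowski distance of order m (Eq. 1) and a kNN classifier that uses it.
# k = 3 and m = 2 (Euclidean) are assumed values for illustration only.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def minkowski_distance(x, y, m=2):
    # Eq. (1): m = 1 gives the Manhattan distance, m = 2 the Euclidean distance.
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** m) ** (1.0 / m)

X_train = np.array([[1.0, 22.0], [3.0, 38.0], [3.0, 26.0], [1.0, 35.0]])  # e.g. [Pclass, Age]
y_train = np.array([0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3, metric="minkowski", p=2)  # p plays the role of m
knn.fit(X_train, y_train)
print(knn.predict([[2.0, 30.0]]))
print(minkowski_distance([1.0, 22.0], [2.0, 30.0]))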
C. Naïve Bayes

NB, which is known as an effective inductive learning algorithm, achieves efficient and fast classification in machine learning applications. The algorithm is based on Bayes' theorem, assuming all features are independent given the value of the class variable [6]. This is the conditional independence assumption, which rarely holds exactly in real-world applications. Thanks to this simplifying assumption, NB performs well on high-dimensional and complex datasets.
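A minimal sketch of such a classifier (our illustration with assumed toy data, using scikit-learn's Gaussian variant):

# Sketch: Gaussian Naive Bayes on two assumed numeric features.
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[3, 22.0], [1, 38.0], [3, 26.0], [1, 35.0], [3, 28.0]])  # [Pclass, Age], assumed
y = np.array([0, 1, 1, 1, 0])                                          # 1 = survived

nb = GaussianNB().fit(X, y)
print(nb.predict([[2, 30.0]]))         # predicted class label
print(nb.predict_proba([[2, 30.0]]))   # class probabilities under the independence assumption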
D. Support Vector Machines

SVM, which was developed by Vapnik in 1995, is based on the principle of structural risk minimization and exhibits good generalization performance. With SVM, an optimal separating hyperplane between the classes is found by focusing on the support vectors [7]. This hyperplane separates the training data by a maximal margin. SVM solves nonlinear problems by mapping the data points into a high-dimensional space.
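As a hedged sketch (our own; the RBF kernel, the toy features and the default C and gamma are assumptions), such a maximal-margin classifier could be set up as:

# Sketch: SVM with an RBF kernel, which implicitly maps the points into a
# high-dimensional feature space; hyperparameters are left at assumed defaults.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X = [[3, 7.25], [1, 71.3], [3, 7.9], [1, 53.1], [2, 13.0]]  # [Pclass, Fare], assumed features
y = [0, 1, 1, 1, 0]

svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))  # feature scaling matters for margin-based models
svm.fit(X, y)
print(svm.predict([[2, 30.0]]))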
E. Decision Tree

Decision trees, with their fairly simple structure, are among the most used classifiers. A decision tree is a tree-structured model with decision nodes and prediction nodes. Decision nodes are used to branch, and prediction nodes specify class labels. C4.5 is a decision tree algorithm that builds a decision tree from training data by using the information gain. When building decision trees, C4.5 uses a divide-and-conquer approach.
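This section describes C4.5; scikit-learn implements CART rather than C4.5, but a rough stand-in that also splits on an information-gain (entropy) criterion can be sketched as follows (toy data and tree depth are assumptions):

# Sketch: decision tree with the entropy criterion as an information-gain proxy
# (scikit-learn's CART, not C4.5 itself).
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[3, 0, 22.0], [1, 1, 38.0], [3, 1, 26.0], [1, 1, 35.0], [3, 0, 35.0]]  # [Pclass, Sex, Age], assumed
y = [0, 1, 1, 1, 0]

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["Pclass", "Sex", "Age"]))  # inspect the learned splits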
F. Bagging

Bagging is one of the oldest and easiest techniques for creating an ensemble of classifiers; it improves accuracy by resampling the training set [8]. The fundamental idea behind bagging is to use multiple training sets instead of a single one, to prevent results that depend on one particular training set. A base single classifier is applied in parallel to the generated training sets, and the resulting classification models are combined by majority voting. With bagging, high accuracy, good generalization performance, and a reduction in variance and bias are achieved.

G. AdaBoost

AdaBoost is one of the most used and effective ensemble learning methods. The basic notion behind AdaBoost is that a strong classifier can be created by linearly combining a number of weak classifiers [9]. In the training process, AdaBoost increases the weights of misclassified data points while decreasing the weights of correctly classified data points; that is, AdaBoost reweights all training data in every iteration. Weak classifiers are applied serially, and the generated classification models are combined by weighted majority voting.

H. Extra Trees

Extra Trees (the Extremely Randomized Decision Trees method) is a decision tree ensemble classification method. Extra Trees is based on randomization: for each node of a tree, splitting rules are drawn at random and the best-performing rule according to a score is associated with that node [10]. For each tree that composes the Extra Trees ensemble, the whole dataset is used for training.

İ. Random Forest

RF is a classification algorithm developed by Breiman and Cutler that uses an ensemble of tree predictors [11]. It is one of the most accurate learning algorithms, and for many datasets it achieves a highly accurate classifier. In RF, each tree is constructed by bootstrapping the training data, and for each split a randomly selected subset of features is used [12]. Splitting is based on a purity measure. This classification method estimates missing data, and even when a large proportion of the data is missing it still maintains accuracy.

J. Gradient Boosting

GB, developed by Friedman (2001), is a powerful machine learning algorithm that has shown considerable success in a wide range of real-world applications. GB handles boosting as a method for function estimation, in terms of numerical optimization in function space [13].

K. Artificial Neural Networks

The multilayer perceptron (MLP) is a kind of ANN that has the ability to solve nonlinear classification problems with high accuracy and good generalization performance. The MLP has been applied to a wide variety of tasks such as feature selection, pattern recognition, optimization and so on. An MLP can be considered as a directed graph in which artificial neurons are represented by nodes and directed, weighted edges connect the nodes to each other [14]. Nodes are organized into layers: an input layer, one or more hidden layers, and an output layer. The MLP uses backpropagation to classify data points; with backpropagation, the error is propagated in the backward direction to adjust the weights.
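The paper reports scores for these learners but not its configuration; as a hedged sketch of how the ensemble and neural models above (plus the Calibrated GB mentioned in the abstract) could be instantiated side by side with scikit-learn, with all hyperparameters being illustrative assumptions:

# Sketch: the ensemble and MLP classifiers discussed above, instantiated side by side.
# Hyperparameters are illustrative assumptions, not the values used in the paper.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

models = {
    "Bagging": BaggingClassifier(n_estimators=100, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=0),
    "Extra Trees": ExtraTreesClassifier(n_estimators=100, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=0),
    "Calibrated GB": CalibratedClassifierCV(GradientBoostingClassifier(random_state=0), cv=3),
    "ANN (MLP)": MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
}

def compare(X, y):
    # X, y would be the engineered Titanic feature matrix and the Survived labels.
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5, scoring="f1")
        print(f"{name}: mean F-measure = {scores.mean():.3f}")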
L. Voting
To obtain accurate classification results, a set of classifiers can be assembled for artificial and real-world datasets. Voting is the simplest method for combining the predictions of multiple classifiers into a single decision [15]. While majority voting returns the class with the most votes, weighted voting makes a weighted linear combination of the classifiers and decides on the class with the highest aggregate.
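The two voting ensembles named in the abstract, Voting (GB, ANN, kNN) and Voting (GB, RF, NB, LR, kNN), could be sketched as follows (an illustration using soft voting; the exact voting scheme and base-model settings used in the paper are not given here):

# Sketch: the two voting ensembles listed in the abstract, built with soft voting.
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

voting_gb_ann_knn = VotingClassifier(
    estimators=[("gb", GradientBoostingClassifier(random_state=0)),
                ("ann", MLPClassifier(max_iter=2000, random_state=0)),
                ("knn", KNeighborsClassifier())],
    voting="soft",
)

voting_gb_rf_nb_lr_knn = VotingClassifier(
    estimators=[("gb", GradientBoostingClassifier(random_state=0)),
                ("rf", RandomForestClassifier(random_state=0)),
                ("nb", GaussianNB()),
                ("lr", LogisticRegression(max_iter=1000)),
                ("knn", KNeighborsClassifier())],
    voting="soft",
)
# Both behave like a single classifier, e.g. voting_gb_rf_nb_lr_knn.fit(X, y).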
III. EXPERIMENTS

A. Dataset

The Titanic: Machine Learning from Disaster competition dataset [16] was provided by Kaggle. The Titanic dataset consists of a training set that includes 891 passengers and a test set that includes 418 passengers, who are different from the passengers in the training set. A description of the features is given in Table I.

TABLE I
NUMBER OF FEATURES IN THE DATASET

Feature      | Value of Feature   | Feature Characteristic
PassengerId  | 1-891              | Integer
Survived     | 0, 1               | Integer
Pclass       | 1-3                | Integer
Name         | Name of passengers | Object
Sex          | Male, female       | Object
Age          | 0-80               | Real
SibSp        | 0-8                | Integer
Parch        | 0-6                | Integer
Ticket       | Ticket number      | Object
Fare         | 0-512              | Real
Cabin        | Cabin number       | Object
Embarked     | S, C, Q            | Object

Fig. 1 Distribution of sex feature.

2) Embarked: When we consider the distribution of the "Embarked" feature, there are 644, 168 and 77 passengers boarding the ship from the ports "S", "C" and "Q", respectively. The survival rates of passengers boarding from these ports are given in Fig. 2. When this figure is analyzed, C is the port with the highest survival rate, at 55%. Thus, this can be interpreted as the "Embarked" feature giving important clues about survival.
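As a hedged sketch of this exploratory step (assuming the standard Kaggle file name), the Embarked counts and per-port survival rates quoted above can be reproduced with pandas:

# Sketch: distribution of the Embarked feature and survival rate per boarding port,
# as described above (assumes the Kaggle train.csv file is available locally).
import pandas as pd

train = pd.read_csv("train.csv")
print(train["Embarked"].value_counts())              # expected roughly S=644, C=168, Q=77
print(train.groupby("Embarked")["Survived"].mean())  # survival rate per port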