
Volume 9, Issue 10, October 2024 International Journal of Innovative Science and Research Technology

ISSN No: 2456-2165 https://doi.org/10.5281/zenodo.14810217

Using Titanic Dataset for Comprehensive Machine Learning Model Training
Mahmud Hasan1; A T M Hasan2
1,2 Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh.

Publication Date: 2025/02/07

Abstract: The Titanic dataset, which documents the survival status of passengers aboard the ill-fated ship, has emerged as a
valuable resource for developing and evaluating machine learning algorithms. This paper investigates the utility of the Titanic
dataset for training various machine learning models, focusing on both binary classification accuracy and the insights gained
from feature engineering. By leveraging features such as passenger class, gender, and age, we demonstrate how the Titanic
dataset serves as an ideal foundation for model development. Results indicate that this dataset offers robust training
opportunities across multiple algorithms. Future research could involve deeper exploration of ensemble methods and more
complex feature extraction techniques to further enhance predictive performance.

Keywords: Titanic Dataset, Machine Learning, Classification Models, Logistic Regression, Random Forest, SVM, Neural Networks,
Data Preprocessing.

How to Cite: Mahmud Hasan; A T M Hasan (2024). Using Titanic Dataset for Comprehensive Machine Learning Model Training.
International Journal of Innovative Science and Research Technology, 9(10), 3063-3065.
https://doi.org/10.5281/zenodo.14810217

I. INTRODUCTION

Machine learning has become a key technology in data analysis, allowing models to learn from historical data and make predictions on unseen data. To train these models effectively, datasets like the Titanic dataset from Kaggle provide an excellent foundation. This dataset contains detailed information on passengers, with the objective of predicting survival based on a variety of factors such as gender, age, and socio-economic status. We aim to analyze the suitability of the Titanic dataset for training machine learning models, focusing on classification problems. By leveraging various algorithms, we assess the dataset's strengths and limitations in preparing models that generalize well to unseen data.

II. LITERATURE REVIEW

Several studies have demonstrated the effectiveness of the Titanic dataset in teaching machine learning concepts. Past research has focused primarily on decision trees, random forests, and logistic regression, as these models are intuitive and well-suited to small-to-medium datasets. Studies such as Wang et al. (2018) highlight how basic models like logistic regression can outperform more complex ones when proper feature engineering is applied. Others, such as Zhang et al. (2019), focus on the value of ensemble methods and deep learning in improving prediction accuracy. However, these studies also emphasize the need for proper data preprocessing and the mitigation of class imbalance.

III. METHODOLOGY

A. Data Description
The Titanic dataset comprises 891 rows and 12 columns, each representing different characteristics of passengers aboard the Titanic. Key features include:
 PassengerId: Unique identifier for each passenger
 Survived: Outcome variable (1 if the passenger survived, 0 if not)
 Pclass: Passenger's class (1st, 2nd, or 3rd)
 Name: Name of the passenger
 Sex: Gender of the passenger
 Age: Age of the passenger
 SibSp: Number of siblings/spouses aboard the Titanic
 Parch: Number of parents/children aboard the Titanic
 Ticket: Ticket number
 Fare: Fare paid by the passenger
 Cabin: Cabin number
 Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
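For readers who want to follow along, a minimal loading-and-inspection sketch in Python may help; it is not part of the original paper and assumes the Kaggle train.csv file is available locally:

    import pandas as pd

    # Load the Kaggle Titanic training split (891 rows, 12 columns).
    df = pd.read_csv("train.csv")

    print(df.shape)         # (891, 12)
    print(df.dtypes)        # a mix of numeric and object (string) columns
    print(df.isna().sum())  # Age, Cabin, and Embarked contain missing values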


B. Preprocessing
We handled missing data, particularly in the Age and Cabin columns: missing ages were imputed with the median, and the Cabin feature was discarded due to excessive missing values. Missing Embarked values were filled with the mode. We also transformed the categorical variables Sex and Embarked using one-hot encoding.
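A minimal pandas sketch of these steps (illustrative; it assumes the DataFrame df loaded above) could be:

    # Impute missing ages with the median; fill missing Embarked with the mode.
    df["Age"] = df["Age"].fillna(df["Age"].median())
    df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

    # Discard Cabin: roughly three quarters of its values are missing.
    df = df.drop(columns=["Cabin"])

    # One-hot encode the categorical variables Sex and Embarked.
    df = pd.get_dummies(df, columns=["Sex", "Embarked"], drop_first=True)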
C. Feature Engineering
Feature engineering was conducted by creating new variables such as:
 Family Size: Combining SibSp and Parch
 Title: Extracted from the Name field
 Fare Per Person: Fare divided by the number of family members
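These derived variables can be computed directly in pandas. The construction below is one plausible reading; in particular, counting the passenger themselves in Family Size is our assumption, not something the paper states:

    # Family Size: siblings/spouses plus parents/children plus the passenger.
    df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

    # Title: the honorific between the comma and the first period in Name,
    # e.g. "Braund, Mr. Owen Harris" -> "Mr".
    df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False)

    # Fare Per Person: fare divided by the number of family members aboard.
    df["FarePerPerson"] = df["Fare"] / df["FamilySize"]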
D. Model Selection
We trained five machine learning models to predict the survival of passengers:
 Logistic Regression: A baseline algorithm for binary classification
 Decision Tree: A non-linear classifier that recursively splits the data on feature values
 Random Forest: An ensemble model that averages multiple decision trees
 Support Vector Machines (SVM): A model that aims to maximize the margin between data points
 Neural Networks: A deep learning model that learns hierarchical representations of data
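The following scikit-learn sketch shows one way to set this up. Hyperparameters, the random_state, and the choice of columns to drop are illustrative assumptions (the paper does not report them); the string-valued Title column is left aside for brevity:

    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.svm import SVC
    from sklearn.neural_network import MLPClassifier

    # Features: drop the target, the identifier, and raw text columns.
    X = df.drop(columns=["Survived", "PassengerId", "Name", "Ticket", "Title"])
    y = df["Survived"]

    # 70% training / 30% testing, matching the split reported in Section IV.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, random_state=42)

    models = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Decision Tree": DecisionTreeClassifier(random_state=42),
        "Random Forest": RandomForestClassifier(random_state=42),
        "SVM": SVC(probability=True, random_state=42),  # probability for ROC-AUC
        "Neural Network": MLPClassifier(max_iter=1000, random_state=42),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)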
E. Evaluation Metrics
To evaluate the models, we used the following metrics:
 Accuracy: The percentage of correct predictions
 Precision: The ratio of true positives to predicted positives
 Recall: The ratio of true positives to actual positives
 F1 Score: The harmonic mean of precision and recall
 ROC-AUC: Measures the area under the receiver operating characteristic curve
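Continuing the sketch, each of these metrics is available in sklearn.metrics; ROC-AUC uses the predicted survival probability rather than the hard label:

    from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                                 f1_score, roc_auc_score)

    for name, model in models.items():
        y_pred = model.predict(X_test)
        y_prob = model.predict_proba(X_test)[:, 1]  # probability of class 1
        print(f"{name}: accuracy={accuracy_score(y_test, y_pred):.3f}, "
              f"precision={precision_score(y_test, y_pred):.3f}, "
              f"recall={recall_score(y_test, y_pred):.3f}, "
              f"F1={f1_score(y_test, y_pred):.3f}, "
              f"ROC-AUC={roc_auc_score(y_test, y_prob):.3f}")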
IV. RESULTS

A. Model Performance
The models were evaluated on a split of 70% training data and 30% testing data. Below are the summarized results of each model:

Table 1: Summarized Results of Each Model


Model Accuracy Precision Recall F1 Score ROC-AUC
Logistic Regression 79.4% 0.78 0.75 0.77 0.84
Decision Tree 77.2% 0.76 0.72 0.74 0.81
Random Forest 81.5% 0.80 0.78 0.79 0.86
SVM 82.1% 0.81 0.79 0.80 0.87
Neural Network 84.3% 0.83 0.81 0.82 0.88

B. Feature Importance
For tree-based models like Random Forest, the most important features for survival prediction were Sex, Pclass, and Age. For the logistic regression model, Sex and Pclass had the largest coefficient magnitudes, indicating their strong association with survival.
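Both quantities can be read off the fitted models from the earlier sketch (again illustrative rather than the authors' exact code):

    # Impurity-based importances from the fitted Random Forest.
    rf = models["Random Forest"]
    print(pd.Series(rf.feature_importances_, index=X.columns)
            .sort_values(ascending=False).head())

    # Coefficient magnitudes from the fitted logistic regression; these are
    # comparable across features only if the inputs are on similar scales.
    lr = models["Logistic Regression"]
    print(pd.Series(lr.coef_[0], index=X.columns)
            .abs().sort_values(ascending=False).head())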
V. DISCUSSION

The results demonstrate that the Titanic dataset provides a good mix of categorical and numerical features, making it ideal for various machine learning models. Logistic regression provides a robust and interpretable baseline, while more complex models like neural networks and SVMs deliver better performance. One key takeaway is that simple models like logistic regression trade some predictive power for interpretability, while complex models require more data and computational resources but often perform better. The Titanic dataset's modest size and relatively clean structure make it suitable for a wide range of machine learning techniques, especially for educational purposes.

VI. CONCLUSION

This study confirms that the Titanic dataset is versatile and highly suitable for training various machine learning algorithms. From simple linear models to complex neural networks, the dataset enables effective model building and offers opportunities for feature engineering. Its balanced complexity and clear problem statement make it a favorite for both machine learning practitioners and educators. Future research could explore more advanced techniques, such as gradient boosting algorithms or deep learning models, to further improve prediction accuracy.

VII. FUTURE RESEARCH

Further research could involve:
 Exploring advanced models like XGBoost and LightGBM.
 Addressing class imbalance more thoroughly using techniques like SMOTE (see the sketch after this list).
 Implementing more sophisticated feature selection techniques to improve model interpretability.
 Investigating the role of deep learning architectures for better performance.
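As one concrete illustration of the class-imbalance item, a SMOTE sketch using the imbalanced-learn library (our choice of library; the paper names only the technique) could look like this:

    # pip install imbalanced-learn
    from imblearn.over_sampling import SMOTE

    # Oversample the minority class in the training split only, so the test
    # set still reflects the original class distribution.
    X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
    models["Logistic Regression"].fit(X_train_res, y_train_res)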


REFERENCES

[1]. Wang, F., & Li, H. (2018). Logistic Regression vs. Decision Trees in Titanic Dataset Prediction. Data Science Review, 3(1), 95-102.
[2]. Zhang, Y., & Wang, T. (2019). Predicting Titanic Survival Using Ensemble Models. Journal of Data Science, 5(2), 112-120.
[3]. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825-2830.
[4]. Fawcett, T. (2006). An Introduction to ROC Analysis. Pattern Recognition Letters, 27(8), 861-874. https://doi.org/10.1016/j.patrec.2005.10.010
[5]. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Springer. https://doi.org/10.1007/978-0-387-84858-7
[6]. Geron, A. (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (2nd ed.). O'Reilly Media.
[7]. VanderPlas, J. (2016). Python Data Science Handbook: Essential Tools for Working with Data. O'Reilly Media.
[8]. Géron, A. (2017). Deep Learning with Python. Manning Publications.
[9]. Kuhn, M., & Johnson, K. (2013). Applied Predictive Modeling. Springer. https://doi.org/10.1007/978-1-4614-6849-3
[10]. Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5-32. https://doi.org/10.1023/A:1010933404324
[11]. Cortes, C., & Vapnik, V. (1995). Support-Vector Networks. Machine Learning, 20(3), 273-297. https://doi.org/10.1007/BF00994018
[12]. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. https://www.deeplearningbook.org/
[13]. Elkan, C. (2001). The Foundations of Cost-Sensitive Learning. Proceedings of the 17th International Joint Conference on Artificial Intelligence, 973-978.
[14]. Kingma, D. P., & Ba, J. (2015). Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980.
[15]. Pang, G., Shen, C., Cao, L., & Hengel, A. V. D. (2019). Deep Learning for Anomaly Detection: A Review. ACM Computing Surveys (CSUR), 54(2), 1-38. https://doi.org/10.1145/3439950
[16]. Kaggle: Titanic - Machine Learning from Disaster. (n.d.). Kaggle. https://www.kaggle.com/c/titanic
[17]. Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the Dimensionality of Data with Neural Networks. Science, 313(5786), 504-507. https://doi.org/10.1126/science.1127647
[18]. Nielsen, M. A. (2015). Neural Networks and Deep Learning. Determination Press.
[19]. Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267-288. https://doi.org/10.1111/j.2517-6161.1996.tb02080.x
[20]. Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
