Using Titanic Dataset for Comprehensive Machine Learning Model Training
Abstract: The Titanic dataset, which documents the survival status of passengers aboard the ill-fated ship, has emerged as a
valuable resource for developing and evaluating machine learning algorithms. This paper investigates the utility of the Titanic
dataset for training various machine learning models, focusing on both binary classification accuracy and the insights gained
from feature engineering. By leveraging features such as passenger class, gender, and age, we demonstrate how the Titanic
dataset serves as an ideal foundation for model development. Results indicate that this dataset offers robust training
opportunities across multiple algorithms. Future research could involve deeper exploration of ensemble methods and more
complex feature extraction techniques to further enhance predictive performance.
Keywords: Titanic Dataset, Machine Learning, Classification Models, Logistic Regression, Random Forest, SVM, Neural Networks,
Data Preprocessing.
How to Cite: Mahmud Hasan; A T M Hasan (2024). Using Titanic Dataset for Comprehensive Machine Learning Model Training.
International Journal of Innovative Science and Research Technology, 9(10), 3063-3065.
https://doi.org/10.5281/zenodo.14810217
I. INTRODUCTION

Machine learning has become a key technology in data analysis, allowing models to learn from historical data and make predictions on unseen data. To train these models effectively, datasets like the Titanic dataset from Kaggle provide an excellent foundation. This dataset contains detailed information on passengers, with the objective of predicting survival from factors such as gender, age, and socio-economic status. We aim to analyze the suitability of the Titanic dataset for training machine learning models, focusing on classification problems. By leveraging various algorithms, we assess the dataset's strengths and limitations in preparing models that generalize well to unseen data.

II. LITERATURE REVIEW

Several studies have demonstrated the effectiveness of the Titanic dataset in teaching machine learning concepts. Past research has focused primarily on decision trees, random forests, and logistic regression, as these models are intuitive and well-suited to small-to-medium datasets. Studies such as Wang et al. (2018) highlight how basic models like logistic regression can outperform more complex ones when proper feature engineering is applied. Others, such as Zhang et al. (2019), focus on the value of ensemble methods and deep learning in improving prediction accuracy. However, these studies also emphasize the need for proper data preprocessing and the mitigation of class imbalance.

III. METHODOLOGY

A. Data Description
The Titanic dataset comprises 891 rows and 12 columns, each representing different characteristics of passengers aboard the Titanic. Key features include:

PassengerId: Unique identifier for each passenger
Survived: Outcome variable (1 if the passenger survived, 0 if not)
Pclass: Passenger's class (1st, 2nd, or 3rd)
Name: Name of the passenger
Sex: Gender of the passenger
Age: Age of the passenger
SibSp: Number of siblings/spouses aboard the Titanic
Parch: Number of parents/children aboard the Titanic
Ticket: Ticket number
Fare: Fare paid by the passenger
Cabin: Cabin number
Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
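To make the classification task concrete, the following is a minimal baseline sketch over these columns. It assumes the Kaggle training split is available as train.csv; the choice of features, imputation strategies, and hyperparameters are illustrative assumptions, not the authors' exact setup.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Load the Kaggle training split (891 rows, 12 columns).
df = pd.read_csv("train.csv")

# Columns used here; high-cardinality fields (Name, Ticket, Cabin)
# are dropped for this baseline.
numeric = ["Age", "Fare", "SibSp", "Parch"]
categorical = ["Pclass", "Sex", "Embarked"]
X = df[numeric + categorical]
y = df["Survived"]

preprocess = ColumnTransformer([
    # Median-impute missing Age/Fare values, then standardize.
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    # Mode-impute missing Embarked values, then one-hot encode.
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]),
     categorical),
])

model = Pipeline([("prep", preprocess),
                  ("clf", LogisticRegression(max_iter=1000))])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
model.fit(X_train, y_train)
print(f"Held-out accuracy: {model.score(X_test, y_test):.3f}")
```

Wrapping imputation, encoding, and the classifier in a single pipeline keeps preprocessing inside the train/test split, which avoids leaking test-set statistics into the imputers and scaler.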
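Beyond a single baseline, the model families named in the keywords (logistic regression, random forest, SVM, and a small neural network) can be compared under cross-validation. The sketch below reuses X, y, preprocess, Pipeline, and LogisticRegression from the previous snippet; all hyperparameters are illustrative assumptions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Candidate classifiers; settings are placeholders, not tuned values.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=300, random_state=42),
    "svm_rbf": SVC(kernel="rbf"),
    "mlp": MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=2000,
                         random_state=42),
}

for name, clf in candidates.items():
    # Each candidate gets its own copy of the shared preprocessing.
    pipe = Pipeline([("prep", preprocess), ("clf", clf)])
    scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Five-fold cross-validation gives a more stable estimate than a single split on a dataset of only 891 rows, which matters when comparing models whose accuracies differ by only a few points.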
Investigating the role of deep learning architectures for better performance.

REFERENCES

[1]. Wang, F., & Li, H. (2018). Logistic Regression vs. Decision Trees in Titanic Dataset Prediction. Data Science Review, 3(1), 95-102.
[2]. Zhang, Y., & Wang, T. (2019). Predicting Titanic Survival Using Ensemble Models. Journal of Data Science, 5(2), 112-120.
[3]. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825-2830.
[4]. Fawcett, T. (2006). An Introduction to ROC Analysis.
Pattern Recognition Letters, 27(8), 861-874.
https://doi.org/10.1016/j.patrec.2005.10.010
[5]. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The
Elements of Statistical Learning: Data Mining,
Inference, and Prediction (2nd ed.). Springer.
https://doi.org/10.1007/978-0-387-84858-7
[6]. Géron, A. (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (2nd ed.). O'Reilly Media.
[7]. VanderPlas, J. (2016). Python Data Science Handbook:
Essential Tools for Working with Data. O'Reilly Media.
[8]. Chollet, F. (2017). Deep Learning with Python. Manning Publications.
[9]. Kuhn, M., & Johnson, K. (2013). Applied Predictive Modeling. Springer. https://doi.org/10.1007/978-1-4614-6849-3
[10]. Breiman, L. (2001). Random Forests. Machine
Learning, 45(1), 5-32.
https://doi.org/10.1023/A:1010933404324
[11]. Cortes, C., & Vapnik, V. (1995). Support-vector
Networks. Machine Learning, 20(3), 273-297.
https://doi.org/10.1007/BF00994018
[12]. Goodfellow, I., Bengio, Y., & Courville, A. (2016).
Deep Learning. MIT Press.
https://www.deeplearningbook.org/
[13]. Elkan, C. (2001). The Foundations of Cost-sensitive Learning. Proceedings of the 17th International Joint Conference on Artificial Intelligence, 973-978.
[14]. Kingma, D. P., & Ba, J. (2015). Adam: A Method for
Stochastic Optimization. arXiv preprint
arXiv:1412.6980.
[15]. Pang, G., Shen, C., Cao, L., & van den Hengel, A. (2021). Deep Learning for Anomaly Detection: A Review. ACM Computing Surveys, 54(2), 1-38. https://doi.org/10.1145/3439950
[16]. Kaggle: Titanic - Machine Learning from Disaster. (n.d.). Kaggle. https://www.kaggle.com/c/titanic
[17]. Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the Dimensionality of Data with Neural Networks. Science, 313(5786), 504-507. https://doi.org/10.1126/science.1127647
[18]. Nielsen, M. A. (2015). Neural Networks and Deep Learning. Determination Press.
[19]. Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267-288. https://doi.org/10.1111/j.2517-6161.1996.tb02080.x
[20]. Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.