Classification of Breast Cancer Risk Using Naïve Bayes, Decision Tree, and Random Forest
Alvina Tsabitah¹, Inas Najah Zhahirah², Nadea Yiyian Salsabila³, Talidah Nur Keyesa⁴
434221012¹, 434221029², 434221016³, 434221018⁴
D4 Teknik Informatika, Faculty of Vocational Studies, Universitas Airlangga
ABSTRACT
This study investigates the effectiveness of three machine learning algorithms—Naïve Bayes, Decision
Tree, and Random Forest—in classifying breast cancer risk using the Breast Cancer Coimbra dataset, which
comprises data from 116 patients with various clinical and biochemical attributes. The primary objective is
to assess the performance of these algorithms in predicting breast cancer likelihood based on patient
characteristics, including key biomarkers like leptin and resistin. Data preprocessing steps were applied to
ensure data quality, including outlier removal and normalization. Each algorithm was trained on 80% of the
dataset, with performance evaluated on the remaining 20% using metrics such as accuracy, precision, recall,
and F1-score. Random Forest emerged as the most effective model, achieving an accuracy of 82% and an
F1-score of 0.82, demonstrating its capacity to handle complex datasets and reduce overfitting through its
ensemble approach. In contrast, Naïve Bayes, while computationally efficient, achieved a lower accuracy of
75%, likely because of its assumption of feature independence. Decision Tree performed slightly better at
78% but was more prone to overfitting, which limited its generalization.
The findings emphasize the significant role of ensemble learning in improving classification outcomes in
medical diagnoses. Future work could explore advanced machine learning techniques and diverse datasets
to enhance the robustness of breast cancer risk prediction models. This research contributes to the ongoing
efforts to improve early detection and treatment strategies in breast cancer care.
Keywords: Breast cancer classification, Naïve Bayes, Decision Tree, Random Forest, Machine learning.
CHAPTER I
INTRODUCTION
1.1 Background
Breast cancer is one of the leading causes of cancer-related morbidity and mortality among
women worldwide (Fakieh & Saleem, 2024). The early detection and accurate classification
of breast cancer risk are crucial for effective treatment and management. Recent advancements
in machine learning (ML) provide innovative methods for predicting breast cancer risk based
on various patient characteristics and biomarkers (Atoyebi et al., 2024).
The application of machine learning techniques in healthcare has shown promising
results, enabling the analysis of complex datasets to enhance diagnostic accuracy (Xiao &
Liang, 2024). This study aims to evaluate the performance of three machine learning
algorithms—Naïve Bayes, Decision Tree, and Random Forest—in classifying breast cancer
risk using the Breast Cancer Coimbra dataset. This dataset, which includes clinical and
biochemical data from 116 patients, has been widely utilized in medical diagnosis research and
offers a relevant basis for applying ML techniques (Yasserh, 2021).
1.2 Problem Statement
1. How can breast cancer be detected in its early stages using a reliable and accurate non-
invasive approach?
2. What is the effectiveness of different ML algorithms in classifying breast cancer risk?
3. Which ML algorithm is best suited for analyzing the complex nature of breast cancer data,
involving interactions among biomarkers and genetic information?
1.3 Project Objectives
1. To evaluate the performance of Naïve Bayes, Decision Tree, and Random Forest
algorithms in classifying breast cancer risk.
2. To identify the most effective ML algorithm for breast cancer risk classification.
3. To develop a data-driven method to support non-invasive approaches in breast cancer
detection.
1.4 Project Significance
1. Contribute to the development of more accessible, non-invasive breast cancer diagnostic
methods.
2. Reduce dependency on invasive diagnostic procedures by implementing ML-based
approaches.
3. Provide a foundation for future development of non-invasive diagnostic tools for breast
cancer detection.
CHAPTER II
LITERATURE REVIEW
Machine learning has become an essential tool in the field of medical diagnostics,
particularly in predicting disease outcomes based on clinical data. Previous studies have
demonstrated the effectiveness of various algorithms in healthcare-related datasets, illustrating
how machine learning can improve diagnostic accuracy and patient outcomes (Xiao & Liang,
2024; Yuan & Zhou, 2024).
For instance, Atoyebi et al. (2024) compared different variants of the Naïve Bayes
algorithm with Random Forest in the context of malaria disease diagnosis. Their findings
highlighted the advantages of ensemble methods in improving predictive accuracy, emphasizing
that such approaches could yield similar benefits in breast cancer classification. These insights are
particularly relevant given the challenges associated with diagnosing complex diseases like cancer,
where timely and accurate predictions are critical for patient care.
The Breast Cancer Coimbra dataset utilized in this study is noteworthy for its detailed
quantitative attributes, which serve as predictors in classification tasks (Yasserh, 2021). The dataset
includes clinical parameters such as age, body mass index (BMI), glucose levels, insulin levels,
and other biochemical markers that have been linked to breast cancer risk. Previous research has
emphasized the importance of preprocessing data to enhance model performance, including steps
such as normalization and outlier detection (Fakieh & Saleem, 2024).
Additionally, the effectiveness of machine learning in clinical settings often depends on the
quality of the input data. As noted by Fakieh and Saleem (2024), data preprocessing techniques
such as missing value handling and normalization play a crucial role in ensuring that machine
learning models yield reliable results. Therefore, the implementation of rigorous data preparation
protocols is essential for developing robust predictive models in medical diagnostics.
CHAPTER III
METHODOLOGY
The methodology of this study comprises key stages, including data collection, pre-
processing, and the application of machine learning algorithms. To ensure prediction accuracy, the
data were carefully collected and cleaned through various pre-processing techniques before
implementing the three machine learning models—Naïve Bayes, Decision Tree, and Random
Forest—to classify breast cancer risk. The performance of each model was evaluated using
standard metrics, including accuracy, precision, recall, and F1-score. This section outlines the
processes involved in collecting and preparing the dataset and describes the methods used to
develop and evaluate the machine learning models.
3.1 Data Collection
The dataset used for this study is the Breast Cancer Coimbra dataset, obtained from
Kaggle. This dataset contains records from 116 patients, each described by 10 quantitative
attributes, including:
Table 1. Attributes of the Breast Cancer Coimbra Dataset
No.  Quantitative attribute
1    Age (years)
2    Body Mass Index (BMI)
3    Glucose level (mg/dL)
4    Insulin level (μU/mL)
5    HOMA (Homeostasis Model Assessment)
6    Leptin (ng/mL)
7    Adiponectin (μg/mL)
8    Resistin (ng/mL)
9    MCP-1 (pg/dL)
10   Classification (0 for healthy, 1 for breast cancer)
The first nine attributes serve as predictors in the classification task, and the
tenth (Classification) is the target label. The dataset was chosen for its relevance to
breast cancer research and because it is publicly available and widely used in machine
learning research.
3.2 Data Pre-Processing
Before building the classification models, essential pre-processing steps were
undertaken to improve dataset quality and ensure model accuracy:
1. Missing Value Handling: The Breast Cancer Coimbra dataset had no missing values,
so no additional handling was required.
2. Outlier Detection and Removal: Outliers can significantly impact model performance.
Outliers were detected and removed through an iterative process to ensure data
consistency. After four iterations, all outliers were removed.
3. Normalization: To prevent features with larger scales (e.g., insulin levels) from
disproportionately influencing the model, normalization was applied to the dataset
using the Z-score method.
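This adjustment ensured that all features had a mean of zero and a standard deviation
of one, placing them on a common scale. Note that tree-based models, and Naïve Bayes
when Gaussian likelihoods are used, are largely insensitive to per-feature rescaling,
so this step mainly benefits scale-sensitive algorithms.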
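The outlier removal and normalization steps can be illustrated with a minimal sketch. The study does not state which outlier rule was used, so the 1.5 × IQR (Tukey) fence below is an assumption, as are the function names and the toy data. The population standard deviation is also an assumed choice for the Z-score.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return the (low, high) Tukey fences for one feature column."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def remove_outliers_iteratively(rows, max_iter=10):
    """Drop rows with any value outside its column fences; repeat until stable.

    Returns the cleaned rows and the number of passes that removed rows.
    The IQR rule is an assumption; the source only says the process was iterative.
    """
    passes = 0
    while passes < max_iter and len(rows) >= 4:
        bounds = [iqr_bounds(col) for col in zip(*rows)]
        kept = [row for row in rows
                if all(low <= v <= high for v, (low, high) in zip(row, bounds))]
        if len(kept) == len(rows):
            break  # nothing removed in this pass, so the data are stable
        rows, passes = kept, passes + 1
    return rows, passes

def zscore(rows):
    """Standardize each column to mean 0 and (population) standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [[(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
            for row in rows]

# Toy data (hypothetical values, not taken from the Coimbra dataset).
toy = [[1.0, 10.0], [2.0, 11.0], [3.0, 12.0],
       [4.0, 13.0], [5.0, 14.0], [100.0, 12.0]]
cleaned, passes = remove_outliers_iteratively(toy)
scaled = zscore(cleaned)
```

Because the fences are recomputed after every pass, values that were masked by a more extreme outlier can be caught in a later pass, which is one reason such a procedure may need several iterations.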
3.3 Machine Learning Algorithms
Three machine learning algorithms were applied to classify breast cancer risk:
3.3.1 Naïve Bayes
Naïve Bayes is a probabilistic classifier based on Bayes' theorem, which assumes
that the presence of any specific feature in a class is independent of other features. While
this assumption often does not hold in real-world datasets, Naïve Bayes remains popular
for its simplicity and efficiency, especially with smaller datasets. The model was trained
using 80% of the data, with the remaining 20% used for testing.
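As an illustration of the independence assumption, the sketch below implements a Gaussian Naïve Bayes classifier from scratch. The Gaussian variant is an assumption (the study does not name the variant used), and the function names, the smoothing constant eps, and the toy data are hypothetical.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y, eps=1e-9):
    """Estimate the log-prior and per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        columns = list(zip(*rows))
        means = [sum(col) / len(col) for col in columns]
        # eps avoids division by zero when a feature is constant within a class.
        variances = [sum((v - m) ** 2 for v in col) / len(col) + eps
                     for col, m in zip(columns, means)]
        model[label] = (math.log(len(rows) / len(X)), means, variances)
    return model

def predict_gaussian_nb(model, row):
    """Return the class with the highest log-posterior (up to a constant)."""
    def log_posterior(label):
        log_prior, means, variances = model[label]
        # Independence assumption: per-feature log-likelihoods are simply added.
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(row, means, variances))
        return log_prior + log_likelihood
    return max(model, key=log_posterior)

# Toy data (hypothetical): two well-separated classes with two features each.
X_toy = [[1.0, 2.0], [1.2, 1.9], [0.8, 2.1],
         [5.0, 6.0], [5.2, 5.8], [4.9, 6.1]]
y_toy = [0, 0, 0, 1, 1, 1]
nb_model = fit_gaussian_nb(X_toy, y_toy)
```

The sum inside log_posterior is where feature independence enters: correlated biomarkers would be counted as if each provided separate evidence.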
3.3.2 Decision Tree
Decision Tree is a tree-structured classifier where each internal node represents a
decision based on a feature’s value, each branch represents an outcome, and each leaf node
represents a class label. To avoid overfitting, the depth of the tree was limited to three
levels. The dataset was split into 80% training and 20% testing, similar to Naïve Bayes.
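The decision made at a single internal node can be sketched as a search for the threshold that gives the purest child nodes. The Gini impurity criterion used here is an assumption (the study does not state the splitting criterion), and the function names and toy values are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best split on one feature."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for value, label in pairs if value <= threshold]
        right = [label for value, label in pairs if value > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best

# Toy feature (hypothetical values) that separates the two classes perfectly.
split = best_threshold([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1])
```

A full tree repeats this search over all features at each node; limiting the depth to three levels caps how many such splits can be chained along any path.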
3.3.3 Random Forest
Random Forest is an ensemble learning method that constructs multiple decision
trees during training and outputs either the mode of the classes (classification) or the mean
prediction (regression) of the individual trees. The advantage of Random Forest lies in its
ability to reduce variance and enhance prediction accuracy by aggregating the results of
numerous weak learners (decision trees). For this study, a forest of 100 decision trees was
constructed.
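The aggregation step can be sketched as bootstrap sampling followed by a majority vote. This is a simplified illustration rather than the implementation used in the study: the trees are represented as arbitrary callables, per-split feature subsampling is omitted, and all names are hypothetical.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as used to train each tree."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Return the most common predicted class (the mode)."""
    # On a tie, Counter.most_common keeps the class that was seen first.
    return Counter(predictions).most_common(1)[0][0]

def forest_predict(trees, row):
    """Aggregate the predictions of individual trees for one sample."""
    return majority_vote([tree(row) for tree in trees])

# Hypothetical stand-ins for three trained trees that disagree on a sample.
toy_trees = [lambda row: 1, lambda row: 0, lambda row: 1]
rng = random.Random(0)
sample = bootstrap_sample([[1.0], [2.0], [3.0], [4.0]], rng)
```

With an even number of trees, such as the 100 used here, a tie is possible in binary classification, so a production implementation would typically average class probabilities rather than count hard votes.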
3.4 Evaluation Metrics
The performance of each model was evaluated using four key metrics:
1. Accuracy: The proportion of correct predictions made by the model.
2. Precision: The proportion of true positive predictions out of all positive predictions
made.
3. Recall: The proportion of actual positive cases correctly identified by the model.
4. F1-Score: The harmonic mean of precision and recall, providing a balanced measure
of both.
These metrics provide a comprehensive view of model performance, which is
particularly crucial in medical diagnostics, where both false positives and false negatives
have significant consequences. The models were evaluated based on these metrics to
determine the most effective algorithm for breast cancer risk classification.
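The four metrics can be computed directly from the confusion-matrix counts, as in the minimal sketch below. The function name and the example labels are hypothetical and do not reproduce the study's predictions; class 1 (breast cancer) is treated as the positive class.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Illustrative labels only: 3 true positives, 1 false negative,
# 1 false positive, and 3 true negatives.
metrics = classification_metrics(
    y_true=[1, 1, 1, 0, 0, 0, 1, 0],
    y_pred=[1, 1, 0, 0, 0, 1, 1, 0],
)
```

In this example every metric equals 0.75 because the counts are symmetric; on real predictions precision and recall usually differ, which is why both are reported.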
CHAPTER IV
RESULTS AND DISCUSSION
As shown in Table 2, Random Forest achieved the highest overall accuracy (82%)
and F1-score (0.82), making it the best-performing model. Random Forest’s ensemble
approach contributes to its ability to handle the complexity of the breast cancer dataset,
reducing the likelihood of overfitting and improving generalization to new data.
1. Random Forest
Random Forest had the fewest false negatives, an essential property in medical
diagnostics, where failing to detect cancer (a false negative) is particularly serious.
2. Naïve Bayes
Naïve Bayes showed a slightly higher number of false positives, which could lead to
unnecessary follow-up procedures.
3. Decision Tree
Decision Tree displayed a balanced performance but had more variability in correctly
predicting negative cases, likely due to its tendency to overfit on certain features.
4.3 Importance of Ensemble Learning
The superior performance of Random Forest highlights the strength of ensemble
learning for complex datasets. By aggregating the predictions of multiple decision trees,
Random Forest improves accuracy and stability, making it less susceptible to the noise and
variance that often impact single classifiers like Decision Tree. This characteristic is
particularly useful in medical diagnosis, where model consistency and robustness are
critical.
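The stabilizing effect of aggregation can be illustrated with a short calculation: if each of n classifiers were correct independently with probability p, the probability that the majority is correct follows from the binomial distribution. The independence assumption is an idealization (trees in a real forest are correlated, so the gain is smaller), and the function name and the value p = 0.6 are hypothetical.

```python
from math import comb

def majority_vote_accuracy(n_voters, p_correct):
    """P(majority is correct) for n independent voters, with n odd."""
    k_min = n_voters // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_voters, k) * p_correct ** k * (1 - p_correct) ** (n_voters - k)
        for k in range(k_min, n_voters + 1)
    )

# Accuracy of the vote as the (idealized) ensemble grows.
single = majority_vote_accuracy(1, 0.6)
three = majority_vote_accuracy(3, 0.6)
many = majority_vote_accuracy(101, 0.6)
```

Under these assumptions, three voters already improve on a single one (0.648 versus 0.6), and the advantage grows with the number of voters as long as each is better than chance.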
CHAPTER V
CONCLUSION
This study demonstrated that Random Forest outperforms both Naïve Bayes and Decision
Tree in classifying breast cancer risk. Although Naïve Bayes offers simplicity and computational
efficiency, its assumption of feature independence limits its effectiveness in complex medical
datasets. Decision Tree, while interpretable, suffers from overfitting, reducing its ability to
generalize to new data.
In contrast, Random Forest balances these trade-offs, offering higher accuracy and better
overall performance through its ensemble approach. This study underscores the potential of
ensemble methods in handling challenging classification tasks in medical diagnosis.
Future Work
Future work could expand on this analysis by testing advanced models, such as deep
learning algorithms, or by applying feature selection techniques to further improve model
accuracy. Additionally, incorporating more diverse datasets could help validate these findings
across different populations, providing greater confidence in the models' generalization abilities.
References
Atoyebi, T. O., Olanrewaju, R. F., Blamah, N. V., & Uwazie, E. C. (2024). Comparison of
Multinomial Naive Bayes (MNB), Gaussian Naive Bayes (GNB) and Random Forest (RF)
Algorithm in Malaria Disease Diagnosis. 2024 International Conference on Science,
Engineering and Business for Driving Sustainable Development Goals (SEB4SDG), Omu-
Aran, Nigeria, 1-6. doi:10.1109/SEB4SDG60871.2024.10630308.
Fakieh, B., & Saleem, F. (2024). COVID-19 from symptoms to prediction: A statistical and
machine learning approach. Computers in Biology and Medicine, 182, 109211.
https://doi.org/10.1016/j.compbiomed.2024.109211
Xiao, A. S., & Liang, Q. (2024). Spam detection for Youtube video comments using machine
learning approaches. Machine Learning with Applications, 16, 100550.
https://doi.org/10.1016/j.mlwa.2024.100550
Yasserh. (2021). Breast Cancer Coimbra Data Set. Kaggle.
https://www.kaggle.com/datasets/yasserhessein/breast-cancer-coimbra-data-set
Yuan, S., & Zhou, C. (2024). An explainable hybrid machine learning model for predictive
maintenance. Machine Learning with Applications, 10, 100550.
https://doi.org/10.1016/j.mlwa.2024.10055
Appendix
The Website Using Flask