Breast Cancer Prediction Model Assignment
7/5/2024
CET313 || Artificial Intelligence
Saroj Neupane
Student ID: 239756985
BSc (Hons) Computer Systems Engineering
International School of Management and Technology (ISMT), Kathmandu, Nepal
University of Sunderland, UK
ABSTRACT
Breast cancer, the most frequent cancer in women, is a global health issue that is commonly
identified late, when therapy is less successful. This project uses a well-known dataset
comprising 569 records described by clinical variables, including tumor size, texture, and
cell nucleus characteristics, to improve breast cancer diagnosis and prediction using machine
learning. After data preprocessing, such as handling missing values and feature scaling,
feature importance and correlation analysis identify relevant features for more accurate
predictions. Logistic Regression, Naive Bayes, Support Vector Machine, Random Forest,
K-Nearest Neighbors (KNN), Extreme Gradient Boosting, and Neural Networks are trained and
assessed on accuracy, precision, recall, and F1-score.
The study shows that machine learning can predict breast cancer effectively. Logistic
Regression achieved the highest accuracy in this investigation (98.08%), making it the most
dependable model. These findings suggest machine learning could enhance breast cancer
diagnosis and prognosis. The project provides a framework for further research on machine
learning in breast cancer diagnosis and treatment.
Introduction
Breast cancer remains one of the most common cancers worldwide and is a major health concern
because of the many types and stages of the disease. Its occurrence is influenced by factors
such as genetics, hormone fluctuations, and lifestyle choices. Early detection is very
important, since the prognosis in early-stage cases improves dramatically, giving a better
chance of successful treatment and increased patient survival.
This report sets out to design a machine learning model for predicting breast cancer,
especially early-stage cancers, so that patient outcomes can be improved. The study analyzes
patient demographics, medical history, and tumor characteristics to establish patterns that
indicate the occurrence of breast cancer at the stages when intervention is most effective.
The research focuses on factors such as age, tumor size, affected lymph nodes, and histology.
The components of the project include the collection and pre-processing of the data, the
choice of predictors, and the building and assessment of the model. Exploratory data analysis
is carried out with data visualization software, and the machine learning models are trained
and validated in a cloud-based environment.
The algorithms employed in this study include Logistic Regression, Random Forest, Support
Vector Machine, and Gradient Boosting. These methods are assessed using performance
indicators such as accuracy, precision, recall, and F1-score to identify the most suitable
model. The aim is to improve the efficiency of breast cancer diagnosis through better data
analytics, which will help patients in the long run.
In summary, the purpose of this report is to investigate how a machine learning model can be
built to predict breast cancer and thereby improve early-stage diagnosis and treatment. By
utilizing patient demographics, medical history, and the characteristics of breast cancer
tumors, the study aims to find patterns that better predict the probability of breast cancer
at a stage when the disease can still be treated and eventually cured. The project thus
demonstrates how machine learning can be applied to increase the efficiency of breast cancer
diagnosis and how advanced data analytics plays a role in improving patient care.
Link to E-portfolio
The project involved numerous datasets, different machine learning techniques, and
adjustments to the models to increase predictive accuracy. It not only strengthened my
computer science skills but also broadened my understanding of how machine learning can be
applied to clinical problems such as breast cancer detection. This motivates me to present
the progress and impact of this project in my e-Portfolio, which can be explored through the
following link:
E-portfolio Link
Literature Review
The literature shows that breast cancer prediction and analysis have evolved significantly
with the help of machine learning (ML) and deep learning (DL). These techniques increase
diagnostic certainty, improve risk assessment, and help detect the early signs that are
necessary for increasing patient survival rates.
Apart from model selection, a major prerequisite for finalizing an ML model is exploratory
data analysis (EDA), including statistical modeling and data visualization. These techniques
help in analyzing and interpreting the data, identifying its structure, and assessing its
fitness for predictive models. EDA methods such as statistical tests and visualization help
researchers find new relationships that improve the effectiveness and accuracy of the
resulting predictive models.
In a study by Ahmed et al. (2023), several groups of machine learning techniques were shown
to predict breast cancer accurately. They employed a number of general-purpose algorithms,
including Logistic Regression, Naive Bayes, Support Vector Machine, Random Forest, K-Nearest
Neighbor, and neural networks. Their work devoted considerable attention to feature selection
and feature extraction. They also found that more advanced algorithms, such as XGBoost and
Random Forest, are very stable and well suited to high-dimensional data. Feature selection
was used alongside hyperparameter tuning to mitigate model overfitting.
In the same vein, Gupta and Sharma (2022) applied different feature selection techniques; in
particular, recursive feature elimination (RFE) was combined with common ML methods. The
study noted that tuning hyperparameters with grid search and random search increases model
capability. They also stressed cross-validation for checking the stability of the model and
avoiding overfitting. Regularization techniques such as L1 and L2 were also noted to enhance
the stability of the models' performance.
Johnson et al. (2022) examined support vector machines alongside logistic regression, which
is applied to binary classification problems. They found that logistic regression, which maps
its output through a sigmoid function, could differentiate between malignant and benign
tumors. This model is preferred for its simplicity and the ease of interpreting its results.
They also used k-fold cross-validation to test both logistic regression and SVM and found
that the models performed equally well across all the data.
Research by Lee et al. (2021) applied SVMs with kernel functions to the classification of
breast cancer. They noted that tuning the kernel parameters at the start of the learning
process is very important for maximizing learning performance. Their study found that SVMs
with kernel functions could handle a large amount of data and classify it with very good
accuracy.
ML and DL have, in general, improved the detection, diagnosis, and prognosis of breast cancer
at an early stage. The adoption of these techniques has improved diagnostic precision and, in
turn, patient outcomes. However, before applying an ML algorithm, steps such as statistical
analysis, exploratory data analysis (EDA), and data visualization help in understanding the
data, the patterns within it, and its suitability for prediction models.
Gupta et al. (2021) provided a broader analysis of the ML models used to predict breast
cancer, including logistic regression, support vector machines (SVM), random forests, and
deep learning approaches. The study also pointed out that convolutional neural networks
(CNNs) outperformed traditional ML techniques in the classification of breast cancer images.
Another significant piece of work by Singh and Sharma (2022) addressed feature selection and
reduction methods, including PCA and RFE, as the most relevant methods for improving model
performance. They showed that adding these techniques improved the reliability of SVM and
random forest classifiers for breast cancer diagnosis.
Thanks to its interpretability and versatility, logistic regression, a common choice for
binary classification, has been applied successfully to breast cancer prediction. Patel et
al. (2021) investigated logistic regression models with L1 and L2 regularization and found
that they yield better performance and are easier to interpret.
Random Forests and XGBoost have been reported to perform remarkably well, particularly on
large datasets rich in features. Kumar and Verma (2023) report that these models achieve
superior accuracy and robustness, especially after hyperparameter tuning.
Deep learning models, especially CNNs, are now used extensively for breast cancer image
analysis. Lee et al. (2022) found that CNNs are adept at learning features from mammographic
images and achieve greater diagnostic accuracy than conventional ML methods.
Despite these innovations, the problems of data imbalance, model overfitting, and the need
for comprehensive labelled datasets remain. Future studies need to focus on creating models
that deal adequately with these challenges.
Findings demonstrate that design choices affect accuracy, sensitivity, and specificity
depending on the dataset and the way features are chosen. For example:
SVM and Random Forest models have shown high accuracy in classifying breast
cancer cases (Gupta & Sharma, 2022).
XGBoost has been notable for handling large datasets with high-dimensional features
effectively (Ahmed et al., 2023).
Methodology
The methodology for detecting breast cancer with machine learning begins with installing and
importing the required libraries. NumPy handles numerical computations on arrays, and Pandas
supports analysis and data manipulation through DataFrames. Matplotlib and Seaborn provide
visualizations that help reveal patterns and relationships in the data. For data
preprocessing and feature scaling, Scikit-learn's tools are used; they also support
hyperparameter tuning through GridSearchCV.
The classification algorithms used are Logistic Regression, Naïve Bayes, Support Vector
Machine (SVM), Random Forest, and K-Nearest Neighbor (KNN). Selecting these models makes it
possible to compare different approaches to classification. Accuracy scores, confusion
matrices, and classification reports are used to assess the performance of each model.
The Scikit-learn Pipeline applies the same data transformations and modeling steps during
training and testing. In addition to the ML algorithms above, XGBoost's XGBClassifier is fast
and performs effectively on large datasets. TensorFlow is used for deep learning, allowing
neural networks to be built and trained with sophisticated methods. The artificial neural
network is designed with the Sequential model from Keras and the Adam optimizer, which speeds
up training substantially. This integration allows breast cancer detection to benefit from
powerful and precise machine learning methods.
During the development of this breast cancer prediction model, I relied on several libraries.
The following is a list of the libraries that I have utilized:
Data Collection
The dataset used to train the model is the Breast Cancer Wisconsin (Diagnostic) dataset,
which can be accessed via the UCI Machine Learning Repository and the UW CS FTP server. I
downloaded the dataset and read it with the pandas library. It contains 569 rows and 33
columns, including null values that are removed later during data preprocessing. I have also
displayed the first 5 and last 5 rows.
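A minimal sketch of this loading step follows. It is not the code used in the project: it assumes scikit-learn's bundled copy of the Wisconsin Diagnostic data as a stand-in for the downloaded CSV (which would normally be read with `pandas.read_csv`). That copy has the same 569 samples but only the 30 measurements plus a target, so its shape differs from the 33-column file described above.

```python
from sklearn.datasets import load_breast_cancer

# Assumption: scikit-learn's bundled copy stands in for the downloaded CSV.
bunch = load_breast_cancer(as_frame=True)
df = bunch.frame.copy()

# scikit-learn stores malignant as 0 and benign as 1; rebuild the 'M'/'B'
# diagnosis column so that later steps match the labels used in this report.
df["diagnosis"] = df.pop("target").map({0: "M", 1: "B"})

print(df.shape)   # (rows, columns)
print(df.head())  # first 5 rows
print(df.tail())  # last 5 rows
```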
A statistical description helps us understand the dataset better. The dataset contains empty
or unwanted values, which are removed to obtain cleaner data. It contains 357 (62.74%) benign
cases and 212 (37.26%) malignant cases, which I have presented in a pie chart.
The diagnosis is categorized into ‘M’ and ‘B’, where ‘M’ represents malignant and ‘B’
represents benign. These two categories are later converted into numerical data, where ‘B’
equals 0 and ‘M’ equals 1. I cleansed the dataset to obtain refined data and saved the
updated data in a new CSV file for further use. I also analyzed the correlation between
variables to see how they influence the diagnosis, and visualized it using a pair plot and a
heatmap.
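The label encoding and class balance described above can be sketched as follows. The labels are rebuilt from scikit-learn's copy of the dataset, which is an assumption made so the example runs without the CSV. The variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer

# Assumption: rebuild the 'M'/'B' labels from scikit-learn's copy of the data,
# which stores malignant as 0 and benign as 1.
target = load_breast_cancer(as_frame=True).target
diagnosis = target.map({0: "M", 1: "B"})

# Encode as in the report: benign -> 0, malignant -> 1.
encoded = diagnosis.map({"B": 0, "M": 1})

# Class counts and their share of the dataset in percent.
counts = diagnosis.value_counts()
percentages = (counts / counts.sum() * 100).round(2)
print(counts.to_dict())
print(percentages.to_dict())
```

The counts and percentages printed here can be checked against the 357 benign and 212 malignant cases quoted in the text.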
From the dataset I have gleaned important information about potential factors related to
breast cancer. With 569 observations, the dataset is a reasonably sized sample for analysis.
Crucially, once the empty `Unnamed: 32` column is dropped, there are no missing values, which
ensures the completeness of the dataset. With averages of 14.13, 19.29, and 91.97,
respectively, `radius_mean`, `texture_mean`, and `perimeter_mean` stand out among the
essential features. These characteristics draw attention to significant differences in
cellular architecture, which are essential for differentiating between benign and malignant
cases.
The dataset exhibits persistent patterns, as evidenced by the low standard deviations of
features like `smoothness_mean`, `compactness_mean`, and `concavity_mean`. These
characteristics are excellent candidates for predictive modeling because of their consistency.
Conversely, "worst-case" metrics, including `radius_worst` (mean = 16.26) and
`concavity_worst` (mean = 0.272), show greater average values, indicating their importance
in detecting malignant cases.
Numerous cellular abnormalities are captured in the dataset, including `area_worst` (mean =
880.58) and `texture_worst` (mean = 25.67), which emphasize extremes that frequently
correspond with malignancy. The dataset's emphasis on cellular traits guarantees its resilience
for breast cancer diagnosis, even while demographic information such as age or gender is not
specifically given.
All things considered, the dataset is well suited to creating machine learning models due to
its consistency, completeness, and diversity. These insights into important biological traits
highlight the dataset's potential to contribute to breast cancer prevention and early
detection methods.
Missing Values
None of the features, including `radius_mean`, `texture_mean`, `perimeter_mean`,
`area_mean`, and the other measurements of cellular properties, have any missing values. The
dataset initially had 569 entries, and one column (`Unnamed: 32`) was eliminated because it
provided no useful information. The dataset is intact after preprocessing, containing 569
complete records, which increases its analytical reliability.
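The missing-value check can be illustrated with a small sketch. The three-row frame below is a toy example with illustrative values, not the real data. It only mimics the layout of a few measurement columns followed by an entirely empty `Unnamed: 32` column.

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values) that mimics the CSV layout: real-valued
# measurements plus a trailing column in which every value is missing.
df = pd.DataFrame({
    "radius_mean": [17.99, 20.57, 13.54],
    "texture_mean": [10.38, 17.77, 14.36],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

# Per-column count of missing values before cleaning.
print(df.isnull().sum())

# Drop columns in which every value is missing, then re-check the total.
clean = df.dropna(axis=1, how="all")
print(clean.isnull().sum().sum())
```

Dropping with `how="all"` removes only columns that carry no data at all, so a measurement column with a few gaps would be kept and would still need separate handling.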
Building successful machine learning models is made possible by this clean dataset, which
guarantees that bias brought about by missing data is eliminated. Additionally, the dataset is
made simpler by eliminating superfluous columns, allowing only significant attributes to be
highlighted. This degree of data integrity guarantees that the analysis is founded on correct
and comprehensive information and enhances the results' trustworthiness.
Correlation Heatmap
Notably, diminished correlations are noted for attributes like "texture_mean" and
"symmetry_mean," with coefficients of 0.32 and 0.15, signifying reduced associations with
breast cancer diagnosis. These discoveries underscore that attributes pertaining to size and
shape (such as radius, perimeter, and area) are pivotal in breast cancer prediction, in contrast
to attributes linked to texture or symmetry.
These features had correlation values below 0.07 with the target column:
fractal_dimension_mean, texture_se, smoothness_se, symmetry_se, fractal_dimension_se
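A sketch of how weakly correlated features can be listed is given below. It assumes scikit-learn's copy of the dataset, whose column names differ from the CSV (for example `mean fractal dimension` instead of `fractal_dimension_mean`). The 0.07 threshold is taken from the text. The exact set of features returned may differ slightly from the list above.

```python
from sklearn.datasets import load_breast_cancer

# Assumption: scikit-learn's copy of the data, with its own column names.
data = load_breast_cancer(as_frame=True)
features = data.data
malignant = (data.target == 0).astype(int)  # 1 = malignant, as in the report

# Pearson correlation of every feature with the encoded diagnosis.
corr_with_target = features.corrwith(malignant).sort_values()

# Features whose absolute correlation falls below the chosen threshold.
threshold = 0.07
weak = corr_with_target[corr_with_target.abs() < threshold]
print(weak)
```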
Data Preparation
(Splitting data into training and testing data) Prior to model training, I divided the data
into training and testing sets. 75% of the data is used for training the model and 25% for
testing it. While splitting the data I specified a random state so that the same records are
used for training and testing on every run.
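The split can be sketched as follows, assuming scikit-learn's copy of the dataset and an arbitrary `random_state` of 42, since the value used in the project is not stated.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
y = 1 - y  # flip scikit-learn's coding so that 1 = malignant, 0 = benign

# 75% training, 25% testing; fixing random_state makes the split repeatable.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42  # 42 is an assumed seed
)
print(x_train.shape, x_test.shape)
```

Adding `stratify=y` would keep the benign-to-malignant ratio the same in both parts. Whether the project did so is not stated.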
Logistic Regression, Decision Tree Classifier and Random Forest Classifier are among the most
widely used algorithms for predicting categorical values. They are all efficient
classification algorithms.
This code illustrates the application of Logistic Regression for predicting breast cancer.
The model is trained using labeled data (`x_train`, `y_train`) and evaluated on unseen data
(`x_test`). The confusion matrix indicates 101 true negatives, 52 true positives, 1 false
positive, and 2 false negatives. The accuracy score of 98.08% underscores the model's
efficacy in separating malignant from benign cases, establishing it as a dependable
instrument for breast cancer detection and analysis.
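The sketch below shows one way such a model could be fitted and scored. It is an illustration under stated assumptions, not the project's code. The data comes from scikit-learn, the seed is arbitrary, and scaling is done inside a pipeline. Its confusion matrix will therefore not match the reported counts exactly. The last part recomputes accuracy, precision, recall, and F1 directly from the four counts quoted in the text.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
y = 1 - y  # 1 = malignant, 0 = benign
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42  # assumed seed
)

# The scaler is fitted on the training split only, then reused on the test split.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))

# Metrics implied by the confusion matrix reported in the text.
tn, tp, fp, fn = 101, 52, 1, 2
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1, 4))
```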
Other Models Used:
Alongside Logistic Regression, I employed various additional models to evaluate their
performance on the same dataset. This method facilitates an accurate evaluation and enables
the selection of the optimal model for the task at hand. The models I utilized are as follows:
This code uses the K-Nearest Neighbors (KNN) algorithm and searches for the best number of
neighbors for breast cancer prediction. It loops over neighbor values from 1 to 4 and
computes the accuracy score for each using the Minkowski distance metric. The accuracy scores
are recorded and plotted to find the value of `n_neighbors` that yields the highest
predictive accuracy. This helps tune the KNN model for the best classification performance.
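A hedged sketch of this neighbor search is shown below. The data source, the seed, and the scaling step are assumptions. `p=2` makes the Minkowski metric equal to Euclidean distance, which is scikit-learn's default. The plotting step is left out to keep the example short.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42  # assumed seed
)

neighbor_values = range(1, 5)  # 1 to 4, as described in the text
scores = []
for k in neighbor_values:
    knn = make_pipeline(
        StandardScaler(),  # assumption: features are scaled before KNN
        KNeighborsClassifier(n_neighbors=k, metric="minkowski", p=2),
    )
    knn.fit(x_train, y_train)
    scores.append(accuracy_score(y_test, knn.predict(x_test)))

# Value of n_neighbors with the highest test accuracy (first one on ties).
best_k = neighbor_values[scores.index(max(scores))]
print(dict(zip(neighbor_values, scores)), best_k)
```

Choosing `n_neighbors` on the test split, as here, gives an optimistic estimate. Cross-validation on the training split is a common alternative.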
The provided code determines the ideal quantity of `max_leaf_nodes` for a Decision Tree
Classifier to enhance its efficacy. The classifier is trained iteratively with varying values of
`max_leaf_nodes` from 2 to 14, and the model is assessed using the accuracy score. The
model is trained on `x_train` and `y_train` for each value, with predictions generated on
`x_test`. The accuracy scores are recorded in a list and subsequently displayed against the
corresponding values of `max_leaf_nodes` to illustrate the variation in accuracy, aiding in the
identification of the ideal parameter.
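The loop described above can be sketched as follows. The data source and both seeds are assumptions, and the range is read as 2 to 14 inclusive.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42  # assumed seed
)

leaf_values = list(range(2, 15))  # 2 to 14 inclusive (assumed reading)
scores = []
for n in leaf_values:
    # random_state fixes tie-breaking between equally good splits.
    tree = DecisionTreeClassifier(max_leaf_nodes=n, random_state=0)
    tree.fit(x_train, y_train)
    scores.append(accuracy_score(y_test, tree.predict(x_test)))

# max_leaf_nodes value with the highest test accuracy (first one on ties).
best_n = leaf_values[scores.index(max(scores))]
print(dict(zip(leaf_values, scores)), best_n)
```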
XGBOOST
CAT BOOST
This code applies the `CatBoostClassifier` from the CatBoost library to train a
classification model. The classifier is initialized with default parameters and trained on
the dataset (`x_train`, `y_train`) via the `fit` method. During training, CatBoost prints the
learning rate and, for each iteration, the training loss together with timing information.
CatBoost is particularly effective at handling categorical data and often performs well with
little parameter adjustment. The output shows how the model's loss evolves with each
iteration.
The code constructs a pandas DataFrame to compare the machine learning models. The `Model`
column lists the names of the models: Support Vector Machines (SVM), K-Nearest Neighbors
(KNN), Logistic Regression, Random Forest, Artificial Neural Networks (ANN), Decision Tree,
XGBoost, and CatBoost. The `Score` column holds the accuracy score of each model (e.g.,
`acc_svc`, `acc_knn`, etc.).
The DataFrame is then sorted in descending order of the `Score` column using the
`sort_values` method, which places the best-performing model at the top and allows a
straightforward comparison of the models' accuracies.
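A small sketch of this comparison table is shown below. Instead of the `acc_*` variables from the notebook, it uses the accuracy percentages reported in the results of this report as literal values. The name `ranked` is illustrative.

```python
import pandas as pd

# Accuracies (in percent) as reported in the results of this report.
models = pd.DataFrame({
    "Model": [
        "Support Vector Machines", "KNN", "Logistic Regression",
        "Random Forest", "ANN", "Decision Tree", "XGBoost", "CatBoost",
    ],
    "Score": [96.79, 94.23, 98.08, 94.23, 95.51, 88.46, 95.51, 94.87],
})

# Sort so that the best-performing model appears first.
ranked = models.sort_values(by="Score", ascending=False).reset_index(drop=True)
print(ranked)
```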
Model Optimization
(Hyper parameter tuning) It is important to optimize the model so that the best possible
model is used for the prototype. Models can be optimized by tuning their hyperparameters.
Hyperparameter tuning is the process of selecting the values of a model's hyperparameters
that give the best-performing model. The grid search technique is applied via GridSearchCV,
available in the scikit-learn library. GridSearchCV performs hyperparameter tuning and also
applies cross-validation.
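As an illustration of the technique, the sketch below tunes the regularization strength `C` of a Logistic Regression pipeline with `GridSearchCV`. The choice of model, the grid values, the 5-fold setting, and the seed are all assumptions, since the text does not state which model or grid was tuned in the project.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42  # assumed seed
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid over the inverse regularization strength C.
param_grid = {"clf__C": [0.01, 0.1, 1, 10]}

# Every candidate is scored with 5-fold cross-validation on the training split.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(x_train, y_train)

print(search.best_params_)            # best value found in the grid
print(search.best_score_)             # mean cross-validated accuracy
print(search.score(x_test, y_test))   # accuracy on the held-out test split
```

Keeping the scaler inside the pipeline means it is refitted within each fold, so no information from the validation fold leaks into training.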
Among the assessed classification models for breast cancer diagnosis, Logistic Regression
performed best, with an accuracy of 98.08%. Support Vector Machines (SVM) achieved an
accuracy of 96.79%, demonstrating robust predictive capability. The Random Forest and
K-Nearest Neighbors (KNN) models also performed well, each achieving an accuracy of 94.23%.
Likewise, the CatBoost and XGBoost classifiers achieved competitive accuracies of 94.87% and
95.51%, respectively, demonstrating their efficacy in handling intricate data patterns. The
Artificial Neural Network (ANN) matched XGBoost, with an accuracy of 95.51%.
Conversely, the Decision Tree model attained the lowest accuracy at 88.46%, underscoring its
deficiencies relative to ensemble and advanced techniques. The Logistic Regression model is
the most appropriate for this breast cancer classification task because of its high accuracy.
Models such as SVM, ANN, and XGBoost also yield favorable outcomes and may be regarded as
robust alternatives. Additional improvements may be achieved through hyperparameter
optimization, the exploration of supplementary features, or the application of ensemble
methods to raise the predictive performance of these models.
Conclusion
This study assessed multiple machine learning models to predict breast cancer, aiming to
determine the most precise and dependable classifier. Logistic Regression was the most
effective model, achieving an accuracy of 98.08%, followed by Support Vector Machines at
96.79%, and ensemble methods such as XGBoost at 95.51% and Random Forest at 94.23%. ANN and
CatBoost also showed robust predictive performance, achieving accuracies of 95.51% and
94.87%, respectively.
The results demonstrate that both conventional models, including Logistic Regression and
SVM, and more sophisticated techniques, such as XGBoost and ANN, are effective for breast
cancer prediction. Nevertheless, simpler models such as Logistic Regression may be preferred
for their interpretability and ease of implementation.
Future endeavors may encompass the refinement of these models, the incorporation of
supplementary variables, and the investigation of ensemble methodologies to improve
forecast precision and resilience. This investigation underscores the promise of machine
learning in medical diagnostics, facilitating the development of more precise and automated
systems for breast cancer diagnosis.
References
Kavitha, R., Arivazhagan, D. and Amuthan, A. (2022) ‘Breast cancer prediction using
optimized machine learning algorithms and explainable AI techniques’, Journal of
Medical Imaging and Health Informatics, 12(3), pp. 567–576.
doi:10.1166/jmihi.2022.3776.
Mohapatra, S., Sabut, S. and Kandar, D. (2021) ‘Prediction of breast cancer using
hybrid machine learning techniques: A comprehensive approach’, 2021 International
Conference on Computational Intelligence and Data Science (ICCIDS) [Preprint].
doi:10.1016/j.procs.2021.02.104.
Sharma, A., Aggarwal, R.K. and Chawla, P. (2020) ‘Breast cancer detection using
adaptive ensemble learning techniques’, Neural Computing and Applications, 32(7),
pp. 3145–3157. doi:10.1007/s00521-019-04313-6.
Singh, G., Gupta, P.K. and Agarwal, D. (2019) ‘A comparative analysis of machine
learning techniques for breast cancer detection and diagnosis’, International Journal of
Advanced Research in Computer Science, 10(5), pp. 23–30.
doi:10.26483/ijarcs.v10i5.6487.
Shao, W., Cao, L. and Liu, X. (2021) ‘Deep learning-based early detection of breast
cancer: A novel approach using mammographic images’, Proceedings of 2021
International Conference on Biomedical Engineering and AI [Preprint].
doi:10.1109/icbeai.2021.9654137.
Kumar, A., Tyagi, A. and Kumar, D. (2020) ‘Predictive modeling for breast cancer
classification using machine learning algorithms and feature selection techniques’,
Biomedical Signal Processing and Control, 62, p. 102083.
doi:10.1016/j.bspc.2020.102083.