Breast Cancer Prediction Using Machine Learning Algorithm

B. Praveen1, R. Dhilipan2, S. Siva Ranjan3, Ms. U. Elakkiya4

1,2,3 Undergraduate Students, Department of Information Technology, Sri Ramakrishna Institute of Technology, Coimbatore, Tamil Nadu, India
4 Assistant Professor, Department of Information Technology, Sri Ramakrishna Institute of Technology, Coimbatore, Tamil Nadu, India

Corresponding Author: [email protected], [email protected].

Abstract

Breast cancer is a major concern for women worldwide, and early detection is crucial for effective treatment. Many women, especially in countries with limited resources, are diagnosed late because of barriers to healthcare. To address this, early diagnostic programs are vital; they focus on identifying symptoms early and ensuring prompt referral for diagnosis and treatment. Data analytics, particularly machine learning, plays a significant role in this process. By analyzing large datasets, machine learning techniques such as K-Nearest Neighbor, Naive Bayes, SVM, and Decision Tree Classifier can predict whether a tumor is cancerous. This is done through supervised learning, in which the computer is trained on tumor characteristics and their corresponding diagnoses so that it can classify new cases accurately. This approach aims to improve access to timely cancer therapy by providing accurate predictions that aid healthcare practitioners in decision-making.

Keywords: Breast cancer prediction, Data analytics, Machine learning technique, Decision tree Classifier, Healthcare practitioners

1. Introduction

Breast cancer is the most commonly diagnosed cancer among women and a leading cause of cancer-related death in women worldwide [3]. It is therefore a significant public health concern. Early diagnosis allows prompt clinical care and can greatly improve prognosis and the probability of survival. More accurate identification of benign tumours can also spare patients unnecessary medical procedures. For these reasons, much research focuses on accurately diagnosing breast cancer (BC) and classifying cases as malignant or benign. Machine learning (ML) is widely accepted as a leading approach for predictive modelling and pattern classification in breast cancer because of its ability to discover key features in complex datasets.

Breast cancer is the uncontrolled proliferation of breast cells. It is a diverse disease whose form depends on which breast cells become malignant. The breast has three main components: lobules, which produce milk; ducts, which carry milk to the nipple; and the connective tissue that surrounds and supports these structures. Most breast cancers originate in the lobules or ducts. The cancer can spread beyond the breast through lymphatic and blood vessels, which underscores the complexity of the condition. As we reflect on the past year, let us also recognize the resilience and strength exhibited by individuals facing breast cancer and the ongoing efforts in research and awareness.

2. Related Work

Several studies have advanced the prediction of breast cancer using machine learning (ML) techniques.

Deepika Verma (2017) [1] discussed the categorization of data mining approaches. The work used the WEKA interface with the Naive Bayes and MLP classification algorithms and assessed the two algorithms on hypothyroid and breast cancer datasets.

Ch. Shravya (2019) [11] used a dataset from the UCI repository to carry out a comparative analysis of predictive models, including logistic regression, support vector machines (SVM), and K-nearest neighbors (KNN). Using Python and Spyder, the Scientific Python Development Environment, the study examined the accuracy, precision, sensitivity, specificity, and false positive rate of each algorithm.

Ramik Rawal (2020) [9] compared four breast cancer prediction algorithms, Support Vector Machines (SVM), Logistic Regression, Random Forest, and K-Nearest Neighbors (KNN), on several datasets in a simulated environment on the Jupyter platform.

Wang, Haifeng (2015) [12] investigated breast cancer prediction with data mining techniques, aiming to find a reliable method for anticipating the onset of breast cancer. Analyzing a large body of patient clinical data, the study builds a predictive model with four data mining methods: support vector machine (SVM), artificial neural network (ANN), Naive Bayes classifier, and AdaBoost tree. Recognizing the role of the feature space in the learning process, the research also explores its impact on speed and efficacy.

Gaurav Singh (2020) [2] developed a predictive model for breast cancer using k-Nearest Neighbour (kNN), Support Vector Machine (SVM), Logistic Regression (LR), and Gaussian Naive Bayes (NB).
The study compares accuracy, precision, recall, F1-score, and Jaccard index across the classifiers. Using a publicly available dataset from the UCI Machine Learning Repository with an 80/20 training-testing split, it finds that k-Nearest Neighbours performs best among the four.

While predicting breast cancer with absolute certainty remains difficult, these studies together give a nuanced picture of how ML is applied to breast cancer prediction. From algorithm comparisons to analysis of the feature space, they collectively contribute to improving the accuracy and reliability of ML-based breast cancer prediction.

3. Methodology

Data Collection:
A dataset is a digitally stored collection of data that can be used to train the model. For this project, the Wisconsin dataset was used, which consists of roughly 600 rows and includes characteristics such as clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses.

Data pre-processing:
The dataset may be incomplete, have missing attribute values, or contain only aggregate data, so the medical dataset needs to be pre-processed. Its major attributes are id, diagnosis, and real-valued features computed for each cell nucleus, such as radius, texture, perimeter, smoothness, and area.

Removing Null Values: The system removes rows that contain null values.
Scaling: Feature scaling is a data preprocessing technique that transforms the values of the features (variables) in a dataset to a similar scale.

Training and Testing:
The data are split into training data and test data. In this project, 80% of the data are used for training and 20% for testing.

Prediction:
The system uses several machine learning algorithms: Logistic Regression, Support Vector Machine (SVM), Random Forest, Decision Tree Classifier, and KNN. Each algorithm's performance is compared with and without feature selection to determine how different feature subsets affect prediction accuracy.
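
The methodology steps (removing null values, scaling, an 80/20 split, and training the five classifiers) can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes scikit-learn and pandas are installed, uses scikit-learn's bundled Wisconsin Diagnostic Breast Cancer data as a stand-in for the project's dataset, and picks random seeds and hyperparameters arbitrarily.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled dataset stands in for the project's data.
data = load_breast_cancer(as_frame=True)
frame = data.frame.dropna()          # removing null values (drops incomplete rows)
X = frame.drop(columns="target")
y = frame["target"]                  # in this encoding: 0 = malignant, 1 = benign

# 80% of the rows for training, 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Hyperparameters below are illustrative defaults, not values from the paper.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

accuracies = {}
for name, model in models.items():
    # The scaler is fitted on the training split only, then applied to the test split.
    pipeline = make_pipeline(StandardScaler(), model)
    pipeline.fit(X_train, y_train)
    accuracies[name] = pipeline.score(X_test, y_test)

for name, accuracy in accuracies.items():
    print(f"{name}: {accuracy:.3f}")
```

Wrapping the scaler and the model in one pipeline keeps test data out of the scaling statistics. Scaling mainly matters for distance-based methods such as KNN and SVM; tree-based models are largely insensitive to it.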
Figure 1: Architecture diagram

4. Experimental Setup

A. Logistic Regression

The logistic function, also known as the sigmoid function, is the basic idea behind logistic regression. It maps any real-valued number to a value between 0 and 1. This function is essential for transforming the output of a linear combination of predictor variables into a probability, which can then be interpreted as the likelihood that an event will occur.

Since identifying benign or malignant tumors is frequently the aim of breast cancer prediction, logistic regression is a good fit because it models the likelihood of a binary outcome. The approach calculates the likelihood that a given instance belongs to the malignant class by applying the logistic function to a linear combination of input characteristics. Logistic regression is useful for determining the relevance of various parameters in the setting of breast cancer because of its ease of use and interpretability. It gives a clear picture of how each factor affects the probability of malignancy.

$$p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n)}}$$

Here p is the predicted probability of malignancy, x_1, ..., x_n are the input features, and \beta_0, ..., \beta_n are the fitted coefficients.

B. K-Nearest-Neighbor

K-nearest neighbors (KNN) is a supervised learning algorithm used for both regression and classification. KNN predicts the class of a test point by computing the distance between that point and all of the training points and then selecting the K training points closest to it. For classification, KNN estimates the probability that the test point belongs to each class from those K neighbors and selects the class with the highest probability. For regression, the predicted value is the mean of the K selected training points. Because KNN considers local patterns in the dataset, it is useful for predicting breast cancer: instances with similar feature profiles are likely to have similar outcomes, so KNN is sensitive to localized patterns that may indicate particular subtypes or features of breast cancer.

C. Support Vector Machine (SVM)

Support Vector Machine (SVM), a supervised machine learning technique, is used for both regression and classification, although it is best suited to classification problems. The main objective of the SVM method is to find the optimal hyperplane in an N-dimensional space that separates the data points into their classes [1]. The hyperplane is chosen to keep the greatest possible distance from the closest points of the different classes. The dimension of the hyperplane is determined by the number of features. If there are only two input features, the hyperplane is a line. If there are three input features, the hyperplane becomes a 2-D plane. With more features it becomes difficult to visualize. As an example, consider two independent variables, x1 and x2, and one dependent variable.

E. Random Forest

Random Forest is a well-known machine learning algorithm that belongs to the supervised learning family. It can be used for both regression and classification problems. It is based on ensemble learning, the process of combining multiple classifiers to solve a complex problem and improve model performance. As the name suggests, a Random Forest is a classifier that builds multiple decision trees on different subsets of the given dataset and combines them to improve predictive accuracy. Rather than relying on a single decision tree, a random forest takes the prediction from each tree and predicts the final outcome based on the majority of votes.
Random forest models can be made more accurate by tuning hyperparameters such as the number of trees, the depth of each tree, the minimum number of samples per leaf, and the maximum number of features considered at each split.

F. Decision Tree

The decision tree classifier is a popular machine learning algorithm for supervised learning tasks. The input data is recursively divided into subsets according to the values of one of the input features until a stopping criterion is met. Using a metric such as information gain, each split is made on the feature that best separates the classes in the resulting subsets. The resulting decision tree is a collection of nodes and branches that represents a set of rules for classifying new data according to feature values, and each leaf node represents a predicted class. The benefits of decision tree classifiers include their interpretability and ease of use. On the other hand, if the tree is overly complex or the data are noisy, it may overfit. Decision trees are useful in breast cancer prediction because they can highlight the significance of individual features: the characteristics most important for deciding whether a tumor is benign or malignant can be found by inspecting the tree structure. This helps with feature selection and medical interpretation.

5. Results

Our research examines the decision tree algorithm for predicting breast cancer and compares its performance with that of other algorithms. We considered various features and data optimization steps to enhance accuracy. The decision tree model, implemented in Python, showed promising results. In comparison to alternative algorithms, decision trees stand out for their ability to analyze several factors related to breast cancer at once and to handle complex relationships between these factors. This ability to consider multiple factors simultaneously makes decision trees effective at identifying patterns associated with breast cancer. Table 1 reports the evaluation metrics for all models; Random Forest, an ensemble of decision trees, achieved the highest accuracy (84%). A web-based prediction tool, built with an HTML and CSS front end and a Flask back end developed in PyCharm, improves accessibility and usability for healthcare professionals and contributes to more timely and reliable breast cancer detection.

Figure 2: Confusion matrix for decision tree
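
A confusion matrix for a decision tree can be produced along the following lines. This is only a sketch under stated assumptions: it uses scikit-learn and its bundled Wisconsin Diagnostic dataset rather than the project's own data and model, and `max_depth=4` is an arbitrary illustrative setting, so the counts will differ from those in the paper.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: stand-in dataset; in this encoding 0 = malignant, 1 = benign.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Limiting the depth is one simple way to keep the tree from overfitting.
# max_depth=4 is an illustrative choice, not a value taken from the paper.
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)

# Rows are true classes, columns are predicted classes, ordered as in `labels`.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print(cm)
```

The diagonal entries count correct predictions; the off-diagonal entries count malignant cases predicted as benign and benign cases predicted as malignant.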

Table 1: Model Evaluation Results

Model                 Accuracy   Precision   Recall   F1 Score   MAE    MSE
Linear Regression     75%        0.78        0.73     0.75       4.62   36.89
Logistic Regression   79%        0.82        0.76     0.78       4.15   31.74
KNN                   77%        0.80        0.75     0.77       4.31   33.21
Decision Tree         76%        0.79        0.74     0.76       4.45   34.78
Random Forest         84%        0.87        0.82     0.84       3.60   26.45
SVM                   81%        0.84        0.79     0.81       3.99   29.62
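
The accuracy, precision, recall, and F1 columns follow from confusion-matrix counts. The sketch below shows the standard definitions; the function name `classification_metrics` is hypothetical and the counts are made up for illustration, not taken from the paper's experiments.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, treating "malignant" as the positive class.
metrics = classification_metrics(tp=40, fp=5, fn=10, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With malignant as the positive class, recall measures the share of malignant tumours the model catches, so a false negative corresponds to a missed malignancy.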
Figure 3: Input form for breast cancer prediction

Figure 4: Result of breast cancer prediction


6. Discussion and Analysis

Handling Complex Relationships: Decision trees are adept at handling complex relationships between different variables. In breast cancer prediction many factors interact, and decision trees can work through these factors and their interactions to make accurate predictions. Some other algorithms may struggle to capture the intricate relationships between these variables, leading to less accurate predictions.

Transparency and Interpretability: Decision trees provide a clear and interpretable structure for understanding how predictions are made. Each decision node represents a feature or attribute, and each branch represents a possible outcome based on that feature. This transparency allows clinicians and researchers to easily interpret the decision-making process of the algorithm. In contrast, some other algorithms, such as neural networks, are often considered "black box" models, which makes it challenging to understand the reasoning behind their predictions.

Handling Missing Data and Categorical Variables: Decision trees can handle missing data and categorical variables efficiently. Missing data is common in medical datasets, and many decision tree implementations have mechanisms to handle it without imputation techniques that may introduce bias. Moreover, some implementations handle categorical variables without one-hot encoding or other preprocessing steps, which simplifies the modeling process and potentially preserves important information in the data.

Robustness to Overfitting: Decision trees can be easily pruned to prevent overfitting, which occurs when a model memorizes the training data rather than generalizing to unseen data. Pruning techniques such as limiting the maximum depth of the tree or setting a minimum number of samples required to split a node help prevent overfitting and improve the model's generalization performance. Other algorithms, such as deep learning models, may require more sophisticated regularization techniques to achieve similar levels of robustness.

Ensemble Methods: Decision trees can also benefit from ensemble methods such as Random Forests or Gradient Boosting, which combine multiple decision trees to improve predictive performance. These ensemble methods further enhance the accuracy of decision trees by reducing variance and bias, leading to more robust predictions.

7. Conclusion
In conclusion, decision trees offer several advantages in breast cancer prediction, including their ability to handle complex relationships, transparency, robustness to missing data and categorical variables, and flexibility in preventing overfitting. By leveraging these strengths, decision trees often outperform other algorithms in accurately predicting breast cancer outcomes. Overall, the integration of a web-based prediction tool developed using HTML, CSS, PyCharm, and Flask enhances accessibility, usability, and performance for healthcare professionals, leading to more timely and reliable breast cancer detection and ultimately improving patient outcomes.

References
[1] Deepika Verma, Nidhi Mishra, "Comparative analysis of breast cancer and hypothyroid dataset using data mining classification techniques", IEEE International Conference on Power, Control, Signals and Instrumentation Engineering (ICPCSI), 2017.

[2] Gaurav Singh, "Breast Cancer Prediction Using Machine Learning", International Journal of Scientific Research in Computer Science, Engineering and Information Technology, Volume 6, Issue 4, pp. 278-284, 2020.

[3] Hyuna Sung, Jacques Ferlay, Rebecca L. Siegel, Mathieu Laversanne, Isabelle Soerjomataram, Ahmedin Jemal, Freddie Bray, "Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries", CA Cancer J Clin, Volume 71, pp. 209–249, 2021.

[4] Khourdifi Y and Bahaj M, "Applying best machine learning algorithms for breast cancer prediction and classification", in Proc. Int. Conf. Electron., Control, Optim. Comput. Sci. (ICECOCS), pp. 1–5, Dec 2018.

[5] Muktevi Srivenkatesh, "Prediction of Breast Cancer Disease using Machine Learning Algorithms", International Journal of Innovative Technology and Exploring Engineering (IJITEE), ISSN: 2278-3075, Volume 9, Issue 4, 2020.

[6] Nikhilanand Arya, Archana Mathur, Snehanshu Saha, and Sriparna Saha, "Proposal of SVM Utility Kernel for Breast Cancer Survival Estimation", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 20, March 2023.

[7] Noreen Fatima, Li Liu, Sha Hong, Haroon Ahmed, "Prediction of Breast Cancer, Comparative Review of Machine Learning Techniques, and Their Analysis", IEEE Access, Volume 8, 2020.

[8] Prasetyo C, Kardiana A, and Yuliwulandari R, "Breast cancer diagnosis using artificial neural networks with extreme learning techniques", Volume 3, no. 7, pp. 10–14, 2014.

[9] Ramik Rawal, "Breast Cancer Prediction Using Machine Learning", Journal of Emerging Technologies and Innovative Research, Volume 7, Issue 5, 2020.

[10] Ritu Madhan, Shivani Kalariya, "Design and Development of Prosthetic Brassieres for Breast Cancer Patients", Journal of Scientific Research Institute of Science, Banaras Hindu University, Varanasi, India, Volume 65, Issue 4, 2021.

[11] Shravya Ch, Pravalika K, Shaik Subhani, "Prediction of Breast Cancer Using Supervised Machine Learning Techniques", International Journal of Innovative Technology and Exploring Engineering (IJITEE), Volume 8, Issue 6, April 2019.

[12] Wang Haifeng, Yoon Sang Won, "Breast Cancer Prediction Using Data Mining Method", IIE Annual Conference Proceedings, Norcross, pp. 818-828, 2015.
