Journal-Breast Cancer Prediction
Learning Algorithm
B. Praveen1, R. Dhilipan2, S. Siva Ranjan3, Ms. U. Elakkiya4
1,2,3 Undergraduate Students, Department of Information Technology, Sri Ramakrishna Institute of Technology, Coimbatore, Tamil Nadu, India
4 Assistant Professor, Department of Information Technology, Sri Ramakrishna Institute of Technology, Coimbatore, Tamil Nadu, India
Abstract
Breast cancer is a major concern for women worldwide, with early detection being crucial for effective treatment. Many women, especially in countries with limited resources, are diagnosed late in life due to barriers to healthcare. To address this, early diagnostic programs are vital, focusing on identifying symptoms early and ensuring prompt referral for diagnosis and treatment. Data analytics, particularly machine learning, plays a significant role in this process. By analyzing large datasets, machine learning techniques like K-Nearest Neighbor, Naive Bayes, SVM, and Decision Tree Classifier can predict whether a tumor is cancerous or not. This is done through supervised learning, where the computer is trained with data on tumor characteristics and their corresponding diagnosis, enabling it to classify new cases accurately. This approach aims to improve access to timely cancer therapy by providing accurate predictions, aiding healthcare practitioners in decision-making.

Keywords: Breast cancer prediction, Data analytics, Machine learning technique, Decision tree Classifier, Healthcare practitioners

1. Introduction
According to global cancer statistics, breast cancer is the most commonly diagnosed cancer and the leading cause of cancer-related death among women. It is one of the most prevalent illnesses impacting women globally and, as such, a significant public health risk in today's society. By promoting quick clinical care for patients, an early diagnosis of breast cancer can greatly enhance prognosis and probability of survival. An improved classification system for benign tumors might save people from needless medical procedures. Because of this, a lot of research is being done on accurately diagnosing breast cancer (BC) and classifying tumors as malignant or benign. Machine learning (ML) is widely regarded as one of the most effective approaches, offering advantages in extracting significant characteristics from complex datasets. Machine learning (ML) has [...]

[...] responsible for milk production; ducts, the conduits for transporting milk to the nipple; and the supportive connective tissue enveloping these structures. Predominantly, breast cancer originates in the lobules or ducts. The potential for its spread beyond the breast arises through lymphatic and blood vessels, underscoring the complexity of this condition. As we reflect on the past year, let us also recognize the resilience and strength exhibited by individuals facing breast cancer and the ongoing efforts in research and awareness.

2. Related Work
Several research studies have significantly advanced the understanding and prediction of breast cancer using machine learning (ML) techniques.

Deepika Verma (2017) discussed the categorization of data mining approaches. That work used the WEKA interface together with the Naive Bayes and MLP classification algorithms. To assess how well these two algorithms worked, hypothyroid and breast cancer datasets were used.

Ch. Shravya (2019) leveraged a dataset sourced from the UCI repository, [...] comparison of four distinct breast cancer prediction algorithms, Support Vector Machines (SVM), Logistic Regression, Random Forest, and K-Nearest Neighbors (KNN), utilizing diverse datasets within a simulated environment facilitated by the Jupyter platform.

Wang, Haifeng (2015) investigated breast cancer prediction through data mining techniques, aiming to find a reliable method for anticipating the onset of breast cancer. Analyzing a large body of patient clinical data, the study constructs a predictive model employing four distinct data mining methods: support vector machine (SVM), artificial neural network (ANN), Naive Bayes classifier, and AdaBoost tree. Recognizing the pivotal role of the feature space in the learning process, the research also explores its impact on speed and efficacy.

Gaurav Singh (2020) set out to develop a predictive model for breast cancer using machine learning methods such as k Nearest Neighbour (kNN), Support Vector Machine (SVM), Logistic Regression (LR), and Gaussian Naive Bayes (NB).
The study goes beyond model creation, comparing accuracy, precision, recall, F1-score, and Jaccard index across classifiers. Using a publicly available UCI Machine Learning Repository dataset, the training-testing split is set at 80-20%. k Nearest Neighbours emerges as the best performer in that study, showing its strength in breast cancer prediction and opening avenues for advancements in medical diagnostics and healthcare applications [...]

[...] with absolute certainty, the synthesis of these studies provides a nuanced understanding of the evolving landscape of ML applications in breast cancer prediction. From algorithmic advancements to the incorporation of sentiment analysis, these research endeavors collectively contribute to enhancing the accuracy and reliability of breast cancer prediction using machine learning methodologies.

3. Methodology

Data Collection:
The dataset is a digitally stored collection of various types of data that may be utilized to train the model. For this project, the Wisconsin dataset was used, which consists of roughly 600 rows of data and includes characteristics such as clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses.

Data pre-processing:
The dataset may be incomplete, have some missing attribute values, or contain only aggregate data. So, there is a need to pre-process our medical dataset, whose major attributes are id, diagnosis, and other real-valued features computed for each cell nucleus, such as radius, texture, perimeter, smoothness, area, etc.

Removing Null Values: The system removes null values from the rows of the dataset.

Scaling: Feature scaling is a data preprocessing technique used to transform the values of features or variables in a dataset to a similar scale.

The data are usually split into training data and test data. In this project, 20% of the data are used for testing and 80% of the data are used for training to obtain better accuracy.

Prediction:
Numerous machine learning algorithms, such as Logistic Regression, Support Vector Machine (SVM), Random Forest, Decision Tree Classifier, and KNN, will be used by the system. Each algorithm's performance is compared with and without feature selection in order to determine how variable subsets affect prediction accuracy.
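The pre-processing steps named in this section (removing rows with null values, feature scaling, and the 80/20 training-testing split) can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the records in `raw_rows` are hypothetical, min-max scaling is assumed as the scaling method, and the helper names are chosen only for this example.

```python
import random

def drop_rows_with_nulls(rows):
    """Keep only the rows in which no value is None."""
    return [row for row in rows if all(value is not None for value in row)]

def min_max_scale(features):
    """Rescale every feature column to the range [0, 1]."""
    columns = list(zip(*features))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(value - low) / (high - low) if high > low else 0.0
         for value, low, high in zip(row, lows, highs)]
        for row in features
    ]

def train_test_split(features, labels, test_fraction=0.2, seed=0):
    """Shuffle the row indices and hold out test_fraction of them for testing."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [features[i] for i in train_idx]
    X_test = [features[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# Hypothetical records: (radius, texture, diagnosis), with 1 = malignant.
raw_rows = [
    (17.9, 10.4, 1), (20.6, 17.8, 1), (None, 21.3, 1), (11.4, 20.4, 1),
    (20.3, 14.3, 1), (13.5, 14.4, 0), (13.1, 15.7, 0), (9.5, None, 0),
    (13.0, 18.4, 0), (8.2, 16.8, 0), (12.1, 14.6, 0), (9.0, 12.4, 0),
]

clean_rows = drop_rows_with_nulls(raw_rows)          # two incomplete rows are dropped
features = min_max_scale([row[:-1] for row in clean_rows])
labels = [row[-1] for row in clean_rows]
X_train, X_test, y_train, y_test = train_test_split(features, labels)
```

The sketch follows the order in which the steps are listed above. A common refinement is to compute the scaling minima and maxima on the training portion only, so that no information from the test rows leaks into training.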
characteristics. Logistic regression is useful in
determining the relevance of various parameters in
the setting of breast cancer because of its ease of
use and interpretability. It gives a clear picture of
how each factor affects the probability of
malignancy.
$$p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n)}}$$
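To make the logistic model above concrete, the following sketch evaluates the probability of malignancy for a single case. The coefficient values and feature values are hypothetical and chosen only for illustration; in a fitted model the coefficients would be estimated from the training data.

```python
import math

def malignancy_probability(coefficients, features):
    """Logistic model: p = 1 / (1 + exp(-(b0 + b1*x1 + ... + bn*xn)))."""
    intercept, *weights = coefficients
    if len(weights) != len(features):
        raise ValueError("expected one weight per feature")
    linear_score = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-linear_score))

# Hypothetical coefficients (b0, b1, b2) and scaled feature values (x1, x2).
coefficients = [-4.0, 3.0, 5.0]
p_low = malignancy_probability(coefficients, [0.1, 0.2])   # linear score near -2.7
p_high = malignancy_probability(coefficients, [0.9, 0.8])  # linear score near +2.7
```

A positive coefficient raises the probability as its feature grows, and a linear score of zero corresponds to a probability of 0.5, which is what makes the influence of each factor easy to read off.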
B. K-Nearest-Neighbor
Handling Complex Relationships: Decision trees are adept at handling complex relationships between different variables. In the context of breast cancer prediction, decision trees can effectively navigate through these various factors and their interactions to make accurate predictions. Other algorithms may struggle to capture the intricate [...]

[...] provide a clear and interpretable structure for [...] each branch represents a possible outcome based on that feature. This transparency allows clinicians and researchers to easily interpret the decision-making process of the algorithm. In contrast, some other algorithms, such as neural networks, are often considered "black box" models, making it challenging to understand the reasoning behind [...]

[...] handle missing data and categorical [...] decision trees can handle categorical variables without the need for one-hot [...]

[...] model learns to memorize the training data rather than generalize to unseen data. Pruning techniques such as limiting the maximum depth of the tree or setting a minimum number of samples required to split a node help prevent overfitting and improve the model's generalization [...]

[...] achieve similar levels of robustness. [...] trees can also benefit from ensemble methods such as Random Forests or Gradient Boosting, which combine multiple decision trees to improve predictive performance. These ensemble methods further enhance the accuracy of decision trees by reducing variance and [...]

[...] offer several advantages in breast cancer [...] these strengths, decision trees often outperform other algorithms in accurately [...]
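The two growth limits mentioned in this section, a maximum tree depth and a minimum number of samples required to split a node, can be illustrated with a small decision tree written in plain Python. This is a sketch under stated assumptions rather than the system's actual classifier: Gini impurity is assumed as the split criterion, and the data points, function names, and parameter names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Return the (feature index, threshold) that lowers weighted Gini most, or None."""
    best, best_score = None, gini(labels)
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[feature] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (feature, threshold), score
    return best

def build_tree(rows, labels, max_depth=3, min_samples_split=2, depth=0):
    """Grow a tree; max_depth and min_samples_split stop growth early."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth >= max_depth or len(labels) < min_samples_split:
        return majority
    split = best_split(rows, labels)
    if split is None:  # pure node, or no split reduces impurity
        return majority
    feature, threshold = split
    left_idx = [i for i, row in enumerate(rows) if row[feature] <= threshold]
    right_idx = [i for i, row in enumerate(rows) if row[feature] > threshold]

    def grow(idx):
        return build_tree([rows[i] for i in idx], [labels[i] for i in idx],
                          max_depth, min_samples_split, depth + 1)

    return {"feature": feature, "threshold": threshold,
            "left": grow(left_idx), "right": grow(right_idx)}

def predict(tree, row):
    """Follow the branches until a leaf (a plain class label) is reached."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree

def tree_depth(tree):
    """Number of decision levels below this node."""
    if not isinstance(tree, dict):
        return 0
    return 1 + max(tree_depth(tree["left"]), tree_depth(tree["right"]))

# Hypothetical scaled features (x1, x2) with labels 1 = malignant, 0 = benign.
rows = [(0.9, 0.8), (0.8, 0.3), (0.7, 0.9), (0.6, 0.2),
        (0.2, 0.1), (0.3, 0.7), (0.1, 0.4), (0.4, 0.9)]
labels = [1, 1, 1, 0, 0, 0, 0, 1]

shallow_tree = build_tree(rows, labels, max_depth=1)
deep_tree = build_tree(rows, labels, max_depth=5)
```

Tightening either limit yields a smaller tree that follows the training data less closely, which is the trade-off the pruning discussion refers to. Each internal node is a readable feature-threshold test, so the fitted tree can be inspected directly.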
[...] and hypothyroid dataset using data [...]
[...] Saha, "Proposal of SVM Utility Kernel [...]
[2]. Gaurav Singh, "Breast Cancer [...]
[7]. Noreen Fatima, Li Liu, Sha Hong, [...]
[...] 284, 2020.
[8]. Prasetyo C, Kardiana A, and [...]
[...] Jemal, Freddie Bray, "Global Cancer [...]
[...] techniques", Volume 3, no. 7, pp. 10–