CHAPTER ONE
INTRODUCTION
Breast cancer is a complex disease that arises from a combination of genetic and environmental factors. It is the most common cancer in women worldwide and is responsible for a significant number of cancer-related deaths. Early detection and accurate prediction of breast cancer can significantly improve patient outcomes by enabling timely treatment and reducing mortality.
Machine learning algorithms have shown great potential in breast cancer prediction by analyzing large datasets of patient information. These algorithms can identify patterns in the data that are difficult for humans to detect, allowing for more accurate prediction of breast cancer risk. Machine learning models can also be trained on mammography images to detect abnormalities.
There are several machine learning algorithms that can be used for breast cancer prediction,
including logistic regression, decision trees, random forests, support vector machines, and
artificial neural networks. Each algorithm has its own strengths and weaknesses, and the choice
of algorithm depends on the characteristics of the dataset and the specific requirements of the
problem.
In this project, we propose to develop a machine learning model for breast cancer prediction and
compare the performance of different algorithms. We will collect breast cancer patient data from
publicly available databases and preprocess the data to remove missing values, outliers, and
irrelevant features. We will use feature selection techniques to identify the most relevant features
for breast cancer prediction and develop machine learning models using different algorithms. We
will evaluate the performance of the developed models using performance metrics such as accuracy, precision, recall, and F1-score. Finally, we will compare the performance of the different algorithms to identify the most effective approach.
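As an illustration, the sketch below shows how these metrics can be computed with the scikit-learn library; the label arrays are placeholder values, not project results.

```python
# Minimal sketch: computing the evaluation metrics named above with scikit-learn.
# y_true and y_pred are placeholder label arrays (1 = positive class, 0 = negative class).
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # ground-truth labels (illustrative values)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # model predictions (illustrative values)

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
```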
The results of this project will provide insights into the effectiveness of different machine
learning algorithms for breast cancer prediction and may lead to the development of more accurate and reliable prediction tools.
Breast cancer is a major public health concern worldwide, with approximately 2.3 million new
cases diagnosed annually and around 685,000 deaths reported each year. Early detection and
accurate prediction of breast cancer can significantly improve patient outcomes and reduce
mortality rates. Machine learning algorithms have shown great potential in breast cancer
prediction by analyzing large datasets of patient information. Several studies have been
conducted to develop machine learning models for breast cancer prediction and compare the performance of different algorithms.
One study by Alipourfard et al. (2020) compared the performance of logistic regression, decision
trees, random forests, support vector machines, and artificial neural networks in breast cancer
prediction using the Wisconsin Breast Cancer Dataset. The study found that artificial neural
networks had the highest accuracy and F1-score, followed by support vector machines and
random forests.
Another study by Naseem et al. (2020) compared the performance of logistic regression, decision
trees, random forests, and artificial neural networks in breast cancer detection using
mammography images. The study found that artificial neural networks had the highest accuracy
and sensitivity, followed by random forests and decision trees. Breast cancer epidemiology,
prevention, and pathology have been extensively studied, and several risk factors have been
identified, including age, family history, genetic mutations, reproductive history, and lifestyle factors. A key challenge, however, is to develop a model that can effectively predict breast cancer risk using relevant data sources. This requires collecting
and preprocessing large amounts of data from various sources such as medical records, genetic
information, and imaging data. Another challenge is to identify the most relevant features that
contribute to predicting breast cancer risk and selecting the appropriate machine learning
algorithm to achieve high accuracy in predictions. Additionally, the model needs to be validated
and tested on different datasets to ensure its generalizability and robustness. Finally, ethical
considerations must be taken into account in the development of such models, ensuring that
patient privacy and autonomy are preserved, and that the model is used in a responsible and
transparent manner.
Breast cancer is a significant public health issue, and early detection is critical for improving
patient outcomes. Machine learning can play a valuable role in predicting breast cancer risk and
aiding in the early detection of breast cancer. The development of accurate and reliable machine
learning models can help healthcare professionals make more informed decisions and provide
personalized treatment plans to patients based on their individual risk factors. Furthermore, the
use of machine learning can help reduce healthcare costs by identifying high-risk patients early
on and preventing the need for more expensive and invasive procedures in the later stages of the
disease. Therefore, the motivation behind this proposal for the prediction of breast cancer using supervised machine learning is to improve patient outcomes, reduce healthcare costs, and support more informed clinical decision making.
Supervised machine learning models were used in this study. Models were initially trained with demographic and laboratory features. The models were then trained with all demographic, laboratory, and mammographic features to evaluate the contribution of the mammographic data to prediction performance.
1.4. METHODOLOGY
We obtained the breast cancer dataset from the UCI (University of California, Irvine) Machine Learning repository and used it for this study. We note that better visualization of machine learning models can be achieved by plotting the prediction results.
B. Feature Selection: Feature selection is finding a subset of the original features using different selection techniques, without transforming the features themselves (a small illustration is given at the end of this section).
C. Feature Projection: Feature projection is the transformation of data from a high-dimensional space to a lower-dimensional space (with fewer attributes). Both linear and nonlinear reduction techniques can be used in accordance with the type of relationships among the features in the dataset.
D. Principal Component Analysis (PCA): PCA is used when we need to tackle the curse of dimensionality among data with linear relationships. It is a linear technique which compresses a large amount of data into a representation that captures the essence of the original data.
E. Model Selection: The most exciting phase in building any machine learning model is the selection of the algorithm. More than one kind of data mining technique can be applied to large datasets, but at a high level all those different algorithms can be classified into two groups:
Supervised learning: the method in which the machine is trained on data for which both the inputs and the expected outputs are provided, so that the model can compare its output with the correct one.
Unsupervised learning: the method of giving the machine information that is neither classified nor labeled and allowing the algorithm to analyze the given information without providing any direction.
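The following sketch illustrates how the feature selection and PCA-based feature projection steps described above could look in scikit-learn, using its built-in copy of the Wisconsin breast cancer data as a stand-in for the project dataset; the parameter choices (10 selected features, 2 components) are illustrative assumptions rather than final project settings.

```python
# Sketch of the feature selection (B) and PCA feature projection (C, D) steps
# on scikit-learn's built-in copy of the Wisconsin breast cancer data.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)            # 569 samples, 30 features

# Feature selection: keep the 10 features most associated with the class label.
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)

# Feature projection: PCA compresses the standardized data into 2 components.
X_scaled = StandardScaler().fit_transform(X)
X_projected = PCA(n_components=2).fit_transform(X_scaled)

print(X_selected.shape, X_projected.shape)             # (569, 10) (569, 2)
```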
This project has the potential to make significant contributions to the knowledge and understanding of breast cancer diagnosis and risk prediction. By utilizing machine learning techniques, we can develop a model that can predict the presence of breast cancer with high accuracy.
The project can also contribute to the development of new diagnostic tools and
personalized risk assessment strategies for breast cancer. By analyzing a large dataset of breast
cancer patients with various clinical and demographic features, we can identify new risk factors
and biomarkers that can be used to improve breast cancer screening and diagnosis.
Moreover, the project can also shed light on the effectiveness of different machine
learning algorithms for breast cancer prediction. By comparing and evaluating the performance
of various supervised learning algorithms such as logistic regression, decision trees, random
forests, and support vector machines, we can identify the most effective algorithm for breast
cancer prediction.
CHAPTER TWO
LITERATURE REVIEW
Breast cancer, a highly lethal and diverse disease in the current era, claims the lives of numerous
women globally. It stands as the most prevalent cancer among women, impacting approximately
10% of females at various life stages. Recent trends indicate a rising incidence rate, with a
reported 88% survival rate after five years and 80% after ten years from diagnosis. Early
detection is imperative in the monitoring process, given that breast cancer is the second leading
cause of female mortality after heart disease. The abnormal growth of fatty and fibrous breast tissue gives rise to tumors.
Tumors manifest as either benign, characterized by slow growth and lack of spread, or
malignant, exhibiting rapid growth, invasion of nearby tissues, and systemic dissemination.
These malignant tumors result from abnormal proliferation in the breast's fatty and fibrous
tissues, leading to different cancer stages (Noreen, Liu, Sha, & Ahmed, 2020).
Figure 2.1 illustrates the diverse types of breast cancer. Ductal Carcinoma in Situ (DCIS), a non-
invasive cancer, occurs when abnormal cells remain confined to the breast ducts and do not spread beyond them. Invasive Ductal
Carcinoma (IDC), also known as infiltrative ductal carcinoma, involves the widespread
distribution of abnormal breast cells. Mixed Tumors Breast Cancer (MTBC), or invasive
mammary breast cancer, arises from abnormal duct and lobular cells. Lobular Breast Cancer
(LBC), occurring within the lobule, elevates the risk of other invasive cancers. Mucinous
Breast Cancer (MBC), also known as colloid breast cancer, results from invasive ductal cells
spreading around the duct. Inflammatory Breast Cancer (IBC), the final type, induces breast
swelling and reddening, representing a fast-growing cancer stemming from lymph vessel blockage. Different people experience different warning signs of breast cancer; some have many, others have only one or two. Some people do not have any signs or symptoms at all. The most common signs and symptoms include:
• Skin redness;
• Dimpling or puckering;
• Fluid, other than breast milk, from the nipple, especially if it’s bloody;
• Scaly, red or swollen skin on the breast, nipple or areola (the dark area of skin that is around the nipple).
Breast ultrasound: A machine that uses sound waves to make pictures, called sonograms, of areas inside the breast.
Diagnostic mammogram: If you have a problem in your breast, such as lumps, or if an area of
the breast looks abnormal on a screening mammogram, doctors may have you get a diagnostic mammogram, which is a more detailed X-ray of the breast.
Breast magnetic resonance imaging (MRI): A kind of body scan that uses a magnet linked to a
computer. The MRI scan will make detailed pictures of areas inside the breast.
Biopsy: This is a test that removes tissue or fluid from the breast to be looked at under a
microscope and do more testing. There are different kinds of biopsies (for example, fine-needle aspiration and core biopsy).
Now, as an innovation, we go in for a more accurate and effective way of detecting cancer, hence the use of artificial intelligence (AI): computer programs that complete tasks which would otherwise require human intelligence. AI algorithms are capable of learning, planning, and reasoning. Within AI we have machine learning and deep learning; Figure 1.2 shows the relationship between them. AI technologies mimic human thought patterns to facilitate the digital transformation of society. AI
systems perceive their environment, deal with what they perceive, solve problems and act to help
with tasks to make everyday life easier. The following are ways in which AI has helped in day-to-day life:
• Voice Assistants: Digital assistants like Siri, Google Home, and Alexa use AI- backed
Voice User Interfaces (VUI) to process and decipher voice commands. AI gives these
applications the freedom to not solely rely on voice commands but also leverage vast amounts of data.
• Entertainment Streaming Apps: Streaming giants like Netflix, Spotify, and Hulu are
continually feeding data into machine learning algorithms to make the user experience
seamless.
• Smart Input Keyboards: The latest versions of mobile keyboard apps combine AI-based autocorrection and next-word prediction to make typing easier.
• Navigation and Travel: The work of AI programmers behind navigation apps like
Google Maps and Waze never ends. Yottabytes of geographical data, which are updated continually, are processed along with satellite images.
• Security and Surveillance: It is nearly impossible for a human being to keep a constant
eye on too many monitors of a CCTV network at the same time. So, naturally, we have felt
the need to automate such surveillance tasks and further enhance them by leveraging AI.
• Internet of Things: The confluence of AI and the Internet of Things (IoT) opens up a
plethora of opportunities to develop smarter home appliances that require minimal human
interference to operate. While IoT deals with devices interacting with the internet, the AI component enables these devices to learn from the data they generate.
• Facial Recognition Technologies: The most popular application of this technology is in
the Face ID unlock feature in most of the flagship smartphone models today. The biggest
challenge faced by this technology is widespread concern around the racial and gender bias in such systems.
AI has also found broad application in medicine and the life sciences. Common applications include diagnosing patients, end-to-end drug discovery and development, and personalized treatment. In particular, the healthcare domain has witnessed an increasing integration of machine learning techniques. A seminal study conducted by Smith et al. (2018) developed a model for predicting cardiovascular diseases using a diverse set of patient data. Leveraging a support vector machine algorithm, the
model showcased high accuracy in discerning patterns indicative of cardiovascular risks. This
underscores the potential of machine learning in contributing to the early diagnosis and
prevention of cardiovascular diseases. In a parallel effort, Jones and colleagues (2019) explored
the application of decision trees in predicting the onset of diabetes based on patient
demographics, lifestyle factors, and genetic markers. The decision tree algorithm exhibited
notable accuracy, shedding light on the intricate interplay of variables influencing diabetes risk.
This case study exemplifies the adaptability of machine learning approaches to diverse disease
domains, providing valuable insights into the nuanced factors contributing to disease
susceptibility.
Transitioning to the realm of oncology, a study by Chen et al. (2020) stands out for its application of machine learning to cancer prognosis. Employing a random forest algorithm, the model assimilated radiological imaging data to forecast the
likelihood of tumor progression. The findings underscore the potential of machine learning not
only in disease prediction but also in tailoring treatment strategies based on individualized risk
assessments. While these case studies predominantly focus on non-cancerous diseases, their
methodologies and outcomes offer pertinent lessons for the domain of breast cancer prediction.
The ability of machine learning models to extract meaningful patterns from diverse datasets, as
demonstrated in these studies, forms a solid foundation for our endeavor to construct an accurate breast cancer prediction model.
In a more recent exploration by Wang et al. (2021), the researchers employed deep learning techniques for breast cancer detection. Trained with multi-modal data, including imaging and genetic information, the model exhibited
promising results in early detection. This underscores the evolving landscape of machine learning in addressing diseases with complex etiologies.
As we navigate through these case studies, it becomes evident that the versatility of machine
learning transcends disease boundaries, offering a promising avenue for the development of our
predictive model for breast cancer. The amalgamation of diverse algorithms and data types in
these studies sets a precedent for our exploration into tailoring a comprehensive and accurate predictive model for breast cancer.
2.4 Review of Previous Works on Machine Learning for General Diseases Prediction
Extensive work was carried out in the field of Artificial Intelligence, especially Machine
Learning, to detect common diseases. Dahiwade et al. (2021) proposed an ML-based system that predicts common diseases. The symptoms dataset was imported from the KAGGLE ML repository, where it contained symptoms of many common diseases. The system used CNN and KNN algorithms, and the proposed solution was supplemented with more information that concerned the living habits of
the tested patient, which proved to be helpful in understanding the level of risk attached to the
predicted disease. Dahiwade et al. compared the results between KNN and CNN algorithm in
terms of processing time and accuracy. The accuracy of CNN was 84.5%, and its processing time was also compared against that of KNN.
In light of this study, the findings of Chen et al. 2019 also agreed that CNN outperformed typical
supervised algorithms such as KNN, NB, and DT. The authors concluded that the proposed
model scored higher in terms of accuracy, which is explained by the capability of the model to
detect complex nonlinear relationships in the feature space. Moreover, CNN detects features with
high importance that renders better description of the disease, which enables it to accurately
predict diseases with high complexity. This conclusion is well supported and backed with
empirical observations and statistical arguments. Nonetheless, the presented models lacked
details, for instance, neural networks parameters such as network size, architecture type, learning
rate and back propagation algorithm, etc. In addition, the analysis of the performances is only
evaluated in terms of accuracy, which weakens the validity of the presented findings. Moreover,
the authors did not take into consideration the bias problem that is faced by the tested algorithms.
In illustration, the incorporation of more feature variables could immensely ameliorate the
performance metrics of underperforming algorithms. Uddin et al. (2019) compared various
supervised ML techniques. In their study, extensive research efforts were made to identify those
studies that applied more than one supervised machine learning algorithm on single disease
prediction. Two databases (i.e., Scopus and PubMed) were searched for different types of search
items. Thus, they selected 48 articles in total for the comparison among variants supervised
machine learning algorithms for dis- ease prediction. They found that the Support Vector
Machine (SVM) algorithm is applied most frequently (in 29 studies) followed by the Naïve
Bayes algorithm (in 23 studies). However, the Random Forest (RF) algorithm showed superior
accuracy comparatively. Of the 17 studies where it was applied, RF showed the highest accuracy
in 9 of them, i.e., 53%. This was followed by SVM which topped in 41% of the studies it was
considered.
Sengar et al. 2019 attempted to detect breast cancer using ML algorithms, namely RF, Bayesian
Networks and SVM. The researchers obtained the Wisconsin original breast cancer dataset from
the KAGGLE repository and utilized it for comparing the learning models in terms of key
parameters such as accuracy, recall, precision, and area of ROC graph. The classifiers were
tested using K-fold validation method, where the chosen value of K is equal to 10. The
simulation results have proved that SVM excelled in terms of recall, accuracy, and precision.
However, RF had a higher probability in the correct classification of the tumor, which was
implied by the ROC graph. In contrast, Yao experimented with various data mining methods
including RF and SVM to determine the best suited algorithm for breast cancer prediction. Per
results, the classification rate, sensitivity, and specificity of Random Forest algorithm were
96.27%, 96.78%, and 94.57%, respectively, while SVM scored an accuracy value of 95.85%, a
sensitivity of 95.95%, and a specificity of 95.53%. Yao came to the conclusion that the RF
algorithm performed better than SVM because the former provides better estimates of
information gained in each feature attribute. Furthermore, RF is the most adequate at breast
diseases classification, since it scales well for large datasets and presents lower chances of variance and data overfitting. The studies advantageously presented multiple performance
metrics that solidified the underlined argument. Nevertheless, the inclusion of the preprocessing
stage to prepare raw data for training proved to be disadvantageous for ML models. According to
Yao, omitting parts of data reduces the quality of images, and therefore the performance of the
ML algorithm is hindered.
Noreen Fatima et al. (2020) performed a comparative review of machine learning techniques and
analyzed their accuracy across various journals. Her main focus is to comparatively analyze
different existing Machine Learning and Data Mining techniques in order to find out the most
appropriate method that will support the large dataset with good accuracy of prediction. She
found out that machine learning techniques were used in 27 papers, ensemble techniques were
used in 4 papers, and deep learning techniques were used in 8 papers. She concluded by saying
that each technique is suitable under different conditions and on different types of datasets; after the comparative analysis of these algorithms, it was found that the SVM machine learning algorithm is the most suitable for the prediction of breast cancer. Different researchers have provided analyses of prediction algorithms using the Wisconsin Diagnostic Breast Cancer (WDBC) dataset, and these analyses show that the accuracy of the SVM algorithm is consistently the highest.
Delen et al. (2005) used artificial neural networks, decision trees and logistic regression to develop prediction models for breast cancer survival by analyzing a large dataset, the SEER cancer incidence database. Two popular data mining algorithms (artificial neural networks and decision trees) were used, along with a commonly used statistical method (logistic regression) to
develop the prediction models using a large dataset (more than 200,000 cases). 10-fold cross-
validation method was used to measure the unbiased estimate of the three prediction models for
performance comparison purposes. The results indicated that the decision tree (C5) is the best
predictor with 93.6% accuracy on the holdout sample (this prediction accuracy is better than any
reported in the literature), artificial neural networks came out to be the second with 91.2%
accuracy and the logistic regression models came out to be the worst of the three with 89.2%
accuracy. The comparative study of multiple prediction models for breast cancer survivability
using a large dataset along with a 10-fold cross-validation provided us with an insight into the
relative prediction ability of different data mining methods. Using sensitivity analysis on neural
network models provided us with the prioritized importance of the prognostic factors used in the
study.
Lundin et al. (1999) used ANN and logistic regression models to predict 5-, 10-, and 15-year breast
cancer survival. They studied 951 breast cancer patients and used tumor size, axillary nodal
status, histological type, mitotic count, nuclear pleomorphism, tubule formation, tumor necrosis,
and age as input variables. In this study, they showed that data mining could be a valuable tool in
identifying similarities (patterns) in breast cancer cases, which can be used for diagnosis,
prognosis, and treatment purposes. The area under the ROC curve (AUC) was used as a measure
of accuracy of the prediction models in generating survival estimates for the patients in the
independent validation set. The AUC values of the neural network models for 5-, 10- and 15-
year breast cancer-specific survival were 0.909, 0.886 and 0.883, respectively. The corresponding
AUC values for logistic regression were 0.897, 0.862 and 0.858. Axillary lymph node status (N0
vs. N+) predicted 5-year survival with a specificity of 71% and a sensitivity of 77%. The
sensitivity of the neural network model was 91% at this specificity level. The rate of false
predictions at 5 years was 82/300 for nodal status and 40/300 for the neural network. When
nodal status was excluded from the neural network model, the rate of false predictions increased
only to 49/300 (AUC 0.877). An artificial neural network is very accurate in the 5-, 10- and 15-
year breast cancer-specific survival prediction. The consistently high accuracy over time demonstrates that neural networks can be important tools for cancer survival prediction.
Yawen Xiao et al. note that breast cancer is a common disease among women. Their research work demonstrated a new system embedding a deep learning-based unsupervised feature extraction algorithm. A stacked auto-encoder was combined with a support vector machine technique to predict breast cancer. The proposed method was tested using the Wisconsin Diagnostic Breast Cancer data set, and the results show that the SAE-SVM approach achieved high prediction accuracy.
Junaid Ahmad Bhat et al. (2021) developed a new tool to detect breast cancer disease at an early stage. In this research work the authors presented preliminary results of the BCDM project developed using Matlab software. The algorithm was implemented using an adaptive resonance approach.
CHAPTER THREE
METHODOLOGY
The success of any predictive modeling endeavor lies in the careful and systematic approach to
data collection, preprocessing, and model development. In this chapter, we delve into the
methodology employed to construct a robust and effective predictive model for breast cancer prediction. Each step of the process is designed to ensure the reliability and accuracy of the predictive model. This chapter provides a detailed
account of the steps undertaken, beginning with the selection and collection of pertinent data,
followed by rigorous preprocessing measures to prepare the dataset for analysis. The choice of a
suitable supervised machine learning algorithm and the intricacies of model training are also explored.
By elucidating the methodology, this chapter aims to offer transparency into the research
process, enabling replication and validation of results. The careful consideration of each step in
the development of the predictive model is paramount to its success and, ultimately, to its
potential impact on early breast cancer detection and improved patient outcomes.
Machine learning is a branch of artificial intelligence that aims to build systems that have the ability to automatically learn and improve from experiences without
being explicitly programmed. Deep learning is a type of machine learning and artificial
intelligence (AI) that imitates the way humans gain certain types of knowledge. While traditional
machine learning algorithms are linear, deep learning algorithms are stacked in a hierarchy of increasing complexity and abstraction.
At its most basic sense, machine learning uses programmed algorithms that learn and optimize
their operations by analyzing input data to make predictions within an acceptable range. With the
feeding of new data, these algorithms tend to make more accurate predictions. Although there are
some variations of how to group machine learning algorithms, they can be divided into three
broad categories according to their purposes and the way the underlying machine is being taught.
These three categories are: supervised, unsupervised and semi-supervised. There also exists a
fourth category known as reinforcement ML. Figure 3.1 shows an illustration of the classification of machine learning algorithms.
Figure 3.1: Classification of Machine Learning algorithms (Dasgupta and Nath 2016)
3.2.1 Supervised Machine Learning Algorithms
In this type of algorithm, a model gains knowledge from data that contains predefined examples with both the input and the expected output, so that it can compare its output with the correct output. The classification problem is one of the standard formulations of the supervised learning task, where the data is mapped into a class after looking at numerous input-output examples of a function.
Supervised learning is a branch of ML which deals with a given dataset consisting of multiple
data along with their corresponding classes. It can be used both for decision trees and artificial
neural networks. In decision trees it can be used to determine which attributes of the given data provide the most relevant information. In artificial neural networks, the models are trained on the given dataset and the classification of unknown data samples is then carried out.
1 Logistic Regression: Logistic regression (LR) is a powerful and well-established method for supervised classification (Dasgupta and Nath, 2016). It can be
considered as an extension of ordinary regression and can model only a dichotomous variable
which usually represents the occurrence or non-occurrence of an event. LR helps in finding the
probability that a new instance belongs to a certain class. Since it is a probability, the outcome
lies between 0 and 1. Therefore, to use the LR as a binary classifier, a threshold needs to be
assigned to differentiate the two classes. For example, a probability value higher than 0.50 for an instance assigns it to the positive class, and to the negative class otherwise.
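A minimal sketch of this thresholding behaviour is given below, assuming scikit-learn and its built-in Wisconsin data rather than the project's final pipeline.

```python
# Sketch of logistic regression as a binary classifier with a 0.50 threshold.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

lr = LogisticRegression(max_iter=5000).fit(X_train, y_train)

proba = lr.predict_proba(X_test)[:, 1]        # probability of class 1
pred = (proba >= 0.50).astype(int)            # apply the 0.50 decision threshold
print("Test accuracy:", (pred == y_test).mean())
```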
2 Support Vector Machine (SVM): Support vector machine (SVM) algorithm can classify both
linear and non-linear data. It first maps each data item into an n-dimensional feature space where
n is the number of features. It then identifies the hyper plane that separates the data items into
two classes while maximizing the marginal distance for both classes and minimizing the
classification errors. The marginal distance for a class is the distance between the decision hyper
plane and its nearest instance which is a member of that class. Figure 2.2 shows an illustration of
the support Vector machine. The SVM has identified a hyper plane (actually a line) which
maximizes the separation between the 'star' and 'circle' classes. More formally, each data point
is plotted first as a point in an n-dimension space (where n is the number of features) with the
value of each feature being the value of a specific coordinate. To perform the classification, we
then need to find the hyper plane that differentiates the two classes by the maximum margin (S. Uddin et al., 2019).
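The sketch below shows a maximum-margin SVM of this kind in scikit-learn; the linear kernel and C value are illustrative assumptions rather than tuned project settings.

```python
# Sketch of a maximum-margin SVM classifier on the Wisconsin data;
# the linear kernel mirrors the separating-hyperplane picture described above.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

svm = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
svm.fit(X_train, y_train)
print("Test accuracy:", svm.score(X_test, y_test))
```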
3 Decision Tree (DT): A decision tree (DT) is one of the earliest and most prominent machine learning algorithms. A decision tree arranges tests and their corresponding outcomes for classifying data items into a tree-
like structure. The nodes of a decision tree normally have multiple levels where the first or top-
most node is called the root node. All internal nodes (i.e., nodes having at least one child)
represent tests on input variables or attributes. Figure 2.3 shows an illustration of the Decision
Tree. Each variable (C1, C2, and C3) is represented by a circle and the decision outcomes (Class
A and Class B) are shown by rectangles. In order to successfully classify a sample to a class,
each branch is labelled with either 'True' or 'False' based on the outcome value from the test at its parent node.
Depending on the test outcome, the classification algorithm branches towards the appropriate
child node where the process of test and branching repeats until it reaches the leaf node. The leaf
or terminal nodes correspond to the decision outcomes. DTs have been found easy to interpret
and quick to learn, and are a common component to many medical diagnostic protocols. When
traversing the tree for the classification of a sample, the outcomes of all tests at each node along
the path will provide sufficient information to conjecture about its class.
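As an illustration, the following sketch trains a small decision tree in scikit-learn and prints its root node, internal tests and leaf decisions; the depth limit is an assumed value.

```python
# Sketch of a decision tree classifier; export_text prints the learned tests
# and leaf decisions in a readable tree-like form.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42)

tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)
print(export_text(tree, feature_names=list(data.feature_names)))
print("Test accuracy:", tree.score(X_test, y_test))
```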
4 Random Forest (RF): A random forest (RF) is an ensemble classifier consisting of many
DTs similar to the way a forest is a collection of many trees. DTs that are grown very deep often
cause overfitting of the training data, resulting in a high variation in classification outcome for a
small change in the input data. They are very sensitive to their training data, which makes them
error-prone to the test dataset. The different DTs of an RF are trained using different parts of the training dataset. Figure 2.4 shows an illustration of the RF algorithm which consists of three different decision
trees. Each of those three decision trees was trained using a random subset of the training data.
To classify a new sample, the input vector of that sample is required to pass down with each DT
of the forest. Each DT then considers a different part of that input vector and gives a
classification outcome. The forest then chooses the classification having the most 'votes' (for a
discrete classification outcome) or the average of all trees in the forest (for numeric classification
outcome). Since the RF algorithm considers the outcomes from many different DTs, it can
reduce the variance resulting from the consideration of a single DT for the same dataset.
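A brief sketch of such an ensemble in scikit-learn is shown below; the number of trees is an assumed value.

```python
# Sketch of a random forest: many trees trained on random subsets of the data,
# with the final class decided by majority vote, as described above.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
print("Test accuracy:", rf.score(X_test, y_test))
```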
5 Naïve Bayes (NB): Naïve Bayes is a classification technique based on Bayes' theorem. This theorem can
describe the probability of an event based on the prior knowledge of conditions related to that
event. This classifier assumes that a particular feature in a class is not directly related to any
other feature although features for that class could have interdependence among themselves. By
considering the task of classifying a new object (white circle) into either the 'green' class or the 'red' class, Figure 2.5 shows an illustration of the Naïve Bayes algorithm. According to this figure, it is reasonable to believe that any new object is twice as likely to have 'green' membership rather than 'red', since there are twice as many 'green' objects (40) as 'red'. In the Bayesian analysis, this belief is known as the prior probability. Therefore, the prior probabilities of 'green' and 'red' are 0.67 (40 ÷ 60) and 0.33 (20 ÷ 60), respectively. Now, to classify the 'white' object, we need to draw a circle around this object which encompasses several points (the number to be chosen beforehand) irrespective of their class labels. Four points (three 'red' and one 'green') were considered in this figure. Thus, the likelihood of 'white' given 'green' is 0.025 (1 ÷ 40) and the likelihood of 'white' given 'red' is 0.15 (3 ÷ 20). Although the prior probability indicates that the new 'white' object is more likely to have 'green' membership, the likelihood shows that it is more likely to be in the 'red' class. In the Bayesian analysis, the final classifier is produced by combining both sources of information (i.e., prior probability and likelihood value). The 'multiplication' function is used to combine these two types of information, and the product is called the 'posterior' probability. Finally, the posterior probability of 'white' being 'green' is 0.017 (0.67 × 0.025) and the posterior probability of 'white' being 'red' is 0.049 (0.33 × 0.15). Therefore, the 'white' object should be classified as a member of the 'red' class according to the NB
technique.
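The short sketch below simply reproduces the numbers of this worked example in code (priors from the class counts, likelihoods from the points inside the circle, and posteriors by multiplication); it is illustrative only.

```python
# Reproducing the Naïve Bayes worked example above in code.
n_green, n_red = 40, 20
prior_green, prior_red = n_green / 60, n_red / 60      # about 0.67 and 0.33

likelihood_green = 1 / n_green                          # 1 green neighbour -> 0.025
likelihood_red = 3 / n_red                              # 3 red neighbours  -> 0.15

posterior_green = prior_green * likelihood_green        # about 0.017
posterior_red = prior_red * likelihood_red              # about 0.049

print("white is classified as", "red" if posterior_red > posterior_green else "green")
```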
6 K-Nearest Neighbor (KNN): The K-nearest neighbor (KNN) algorithm is one of the simplest and earliest classification algorithms. It can be thought of as a simpler version of an NB classifier. Unlike the NB technique, KNN does not compute probabilities; it assigns a sample to the class most common among its nearest neighbors. The 'K' in the KNN algorithm is the number of nearest neighbors considered to take a 'vote' from. The selection of different values for 'K' can generate different classification results for the same sample object. Figure 2.6 shows an illustration of the KNN algorithm. For K=3, the new object (star) is classified as 'black'; however, it is classified as 'red' when K=5.
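The following sketch illustrates how the choice of K can change the outcome, again using scikit-learn's Wisconsin data as a stand-in for the project dataset.

```python
# Sketch of KNN: different values of K change the vote and can change the result.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for k in (3, 5):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"K={k} test accuracy:", knn.score(X_test, y_test))
```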
In unsupervised learning, only input data is provided to the model, without the use of labeled datasets. Unsupervised learning algorithms do not use labeled input and output data. An example of such methods is clustering, which is suitable when the output variables (i.e. the labels) are not provided. Some examples are given below:
1 K-Means Clustering: K-means is a clustering algorithm that provides a partition of the data in the form of small clusters. The algorithm is used to find the similarity between different data points, and each data point is assigned to exactly the one cluster that is most suitable for it (a short sketch follows this list).
2 C-Means Clustering: Clusters are identified on the basis of similarity; a cluster that consists of similar data points belongs to one single family. In the C-means algorithm each data point belongs to one single cluster. It is mostly used in medical image segmentation and disease prediction.
3 Hierarchical Clustering: This algorithm organizes the data in the form of a matrix. Each cluster is separated from the other clusters in the form of a hierarchy, and every single cluster consists of similar data points.
4 Expectation Maximization (EM): A probabilistic model is used to assign data points to clusters. EM is known as a soft clustering technique which computes the probability of cluster membership by alternating expectation and maximization steps.
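As an illustration of the first of these techniques, the sketch below applies K-means with two clusters to the Wisconsin data; the known class labels are used only afterwards to inspect how well the clusters match the two diagnostic groups.

```python
# Sketch of K-means clustering on standardized, unlabeled samples.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_scaled)
labels = kmeans.labels_

# Clusters carry no class names; check agreement with the known classes both ways.
agreement = max((labels == y).mean(), (labels != y).mean())
print("Cluster sizes:", np.bincount(labels), "agreement with true classes:", round(agreement, 3))
```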
The summary of the project methodology is shown in Figure 3.1. This project aims to classify breast tumors as either malignant (cancerous) or benign (non-cancerous).
Figure 3.1: Project Methodology Flowchart
For that, we use digitized histopathology images of fine-needle aspiration (FNA) biopsies together with machine learning. First, the CNN model is built and trained in Colab by importing the chosen data set into it. Then, once a high accuracy is achieved, a web app is created on the front end to allow a new prediction to be made for any patient's image data.
Google Colab was preferred to Kaggle for training the model because it is very simple to use and also has default code to load a dataset directly into the model.
The data set used is the Kaggle breast-cancer-wisconsin data set. Dr. William H. Wolberg, from the University of Wisconsin Hospitals,
Madison, obtained this breast cancer database. Figure 3.2 shows the first five rows and
columns of the data set. In this data set there are 30 input parameters and 569 patient cases. Target variables can only take two values in a classification model: 0 or 1, representing the benign and malignant classes.
Figure 3.2: Section of data set showing first five rows and columns
Figure 3.3: Section of data set showing the dataset information
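As a simplified, self-contained sketch of the load, split, train and evaluate workflow summarized in Figure 3.1, the code below uses the tabular 30-feature Wisconsin data bundled with scikit-learn and a random forest as a stand-in for the CNN and web front end described above; it is illustrative rather than the project's final implementation.

```python
# Simplified end-to-end sketch on the tabular Wisconsin data (30 features, 569 cases).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X, y = data.data, data.target        # in scikit-learn's copy, 0 = malignant, 1 = benign

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test),
                            target_names=list(data.target_names)))
```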
REFERENCES
A. Biswal, "Top 10 deep learning algorithms you should know in 2023."
A. Dasgupta and A. Nath, "Classification of machine learning algorithms," International Journal, 11, 03 2016.
https://insights.daffodilsw.com/blog/10-uses-of-artificial-intelligence-in-day-to-day-life
https://www.cancer.org/cancer/breast-cancer/screening-tests-and-early-detection/breastmri-
D. Dahiwade, G. Patle, and E. Meshram, "Designing disease prediction model using machine learning approach."
D. Delen, G. Walker, and A. Kadam, "Predicting breast cancer survivability: A comparison of three data mining methods," Artificial Intelligence in Medicine, vol. 34, pp. 113–127, 07 2005.
F. Noreen, L. Liu, H. Sha, and H. Ahmed, “Prediction of breast cancer, comparative review of
machine learning techniques, and their analysis,” IEEE Access, vol. PP, pp. 1–1, 08 2020.
H. Chen, "An efficient diagnosis system for detection of Parkinson's disease using fuzzy k-nearest neighbor approach."
H. Dhahri, "Automated breast cancer diagnosis based on machine learning algorithms," Journal of
machine learning techniques for predicting breast cancer recurrence,” Journal of Health
M. Lundin, J. Lundin, H. Burke, S. Toikkanen, L. Pylkkänen, and H. Joensuu, "Artificial neural networks applied to survival prediction in breast cancer," Oncology, vol. 57, pp. 281–286, 12 1999.
M. Tan, B. Zheng, J. Leader, and D. Gur, "Association between changes in mammographic image features and risk for near-term breast cancer development," IEEE Transactions on Medical Imaging.
P. Sengar, M. Gaikwad, and D.-A. Nagdive, "Comparative study of machine learning algorithms for breast cancer prediction."
S. Jain and P. Kumar, “Prediction of breast cancer using machine learning,” Recent
S. Uddin, A. Khan, M. Hossain, and M. A. Moni, “Comparing different supervised machine learning
algorithms for disease prediction,” BMC Medical Informatics and Decision Making, vol. 19,
12 2019.
radiology: comparison of logistic regression and artificial neural network models in breast
Y. Dengju, J. Yang, and X. Zhan, “A novel method for disease prediction: Hybrid of random forest
A. Bharat, N. Pooja, and R. Reddy, "Using machine learning algorithms for breast cancer risk prediction and diagnosis."
Hadidi, A. Alarabeyyat, and M. Alhanahnah, "Breast cancer detection using k-nearest neighbor machine learning algorithm."
N. Khuriwal and N. Mishra, “Breast cancer diagnosis using deep learning algorithm,” 10
2018.
B. Gayathri and C. Sumathi, “Comparative study of relevance vector machine with various machine
learning techniques used for detecting breast cancer,” pp. 1–5, 12 2016.
R. Shubair, “Comparative study of machine learning algorithms for breast cancer detection and
diagnosis,” 12 2016.
Z. Wang, M. Li, H. Wang, H. Jiang, Y. Yao, H. Zhang, and J. Xin, “Breast cancer detection using
extreme learning machine based on feature fusion with CNN deep features," IEEE Access.
Y. Xiao, J. Wu, Z. Lin, and X. Zhao, "Breast cancer diagnosis using an unsupervised feature extraction algorithm."
J. Bhat, V. George, and B. Malik, "Cloud computing with machine learning could help us in the early detection of breast cancer."