
CHAPTER ONE

INTRODUCTION
Breast cancer is a complex disease that arises from a combination of genetic and environmental

factors. It is the most common cancer in women worldwide and is responsible for a

significant number of cancer-related deaths. Early detection and accurate prediction of breast

cancer can significantly improve patient outcomes by enabling timely treatment and reducing the

risk of disease progression.

Machine learning algorithms have shown great potential in breast cancer prediction by analyzing

large datasets of patient information. These algorithms can identify patterns in the data that are

difficult for humans to detect, allowing for more accurate prediction of breast cancer risk.

Machine learning models can also be trained on mammography images to detect abnormalities

that may be indicative of breast cancer.

There are several machine learning algorithms that can be used for breast cancer prediction,

including logistic regression, decision trees, random forests, support vector machines, and

artificial neural networks. Each algorithm has its own strengths and weaknesses, and the choice

of algorithm depends on the characteristics of the dataset and the specific requirements of the

problem.

In this project, we propose to develop a machine learning model for breast cancer prediction and

compare the performance of different algorithms. We will collect breast cancer patient data from

publicly available databases and preprocess the data to remove missing values, outliers, and

irrelevant features. We will use feature selection techniques to identify the most relevant features

for breast cancer prediction and develop machine learning models using different algorithms. We

will evaluate the performance of the developed models using performance metrics such as

accuracy, precision, recall, and F1-score. Finally, we will compare the performance of the

developed models with existing models using statistical tests.
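To illustrate how these metrics could be computed in practice, the short sketch below uses scikit-learn on a small set of made-up labels; the numbers are illustrative assumptions, not results from this project.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical ground truth and predictions (1 = malignant, 0 = benign); not project results.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
```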

The results of this project will provide insights into the effectiveness of different machine

learning algorithms for breast cancer prediction and may lead to the development of more

accurate and reliable prediction models.

1.1 BACKGROUND OF THE STUDY

Breast cancer is a major public health concern worldwide, with approximately 2.3 million new

cases diagnosed annually and around 685,000 deaths reported each year. Early detection and

accurate prediction of breast cancer can significantly improve patient outcomes and reduce

mortality rates. Machine learning algorithms have shown great potential in breast cancer

prediction by analyzing large datasets of patient information. Several studies have been

conducted to develop machine learning models for breast cancer prediction and compare the

performance of different algorithms.

One study by Alipourfard et al. (2020) compared the performance of logistic regression, decision

trees, random forests, support vector machines, and artificial neural networks in breast cancer

prediction using the Wisconsin Breast Cancer Dataset. The study found that artificial neural

networks had the highest accuracy and F1-score, followed by support vector machines and

random forests.

Another study by Naseem et al. (2020) compared the performance of logistic regression, decision

trees, random forests, and artificial neural networks in breast cancer detection using

mammography images. The study found that artificial neural networks had the highest accuracy

and sensitivity, followed by random forests and decision trees. Breast cancer epidemiology,

prevention, and pathology have been extensively studied, and several risk factors have been

identified, including age, family history, genetic mutations, reproductive history, and lifestyle

factors (Malvia et al., 2017).

1.2 PROBLEM STATEMENT


The primary challenge is to develop an accurate and reliable model

that can effectively predict breast cancer risk using relevant data sources. This requires collecting

and preprocessing large amounts of data from various sources such as medical records, genetic

information, and imaging data. Another challenge is to identify the most relevant features that

contribute to predicting breast cancer risk and selecting the appropriate machine learning

algorithm to achieve high accuracy in predictions. Additionally, the model needs to be validated

and tested on different datasets to ensure its generalizability and robustness. Finally, ethical

considerations must be taken into account in the development of such models, ensuring that

patient privacy and autonomy are preserved, and that the model is used in a responsible and

transparent manner.

Breast cancer is a significant public health issue, and early detection is critical for improving

patient outcomes. Machine learning can play a valuable role in predicting breast cancer risk and

aiding in the early detection of breast cancer. The development of accurate and reliable machine

learning models can help healthcare professionals make more informed decisions and provide

personalized treatment plans to patients based on their individual risk factors. Furthermore, the

use of machine learning can help reduce healthcare costs by identifying high-risk patients early

on and preventing the need for more expensive and invasive procedures in the later stages of the

disease. Therefore, the motivation behind this proposal for predicting breast cancer using supervised machine learning is to improve patient outcomes, reduce healthcare costs, and ultimately save lives.

1.3 AIM AND OBJECTIVES OF THE STUDY

This study aims to predict breast cancer using different machine learning approaches. The following algorithms were used in this study:

● Random forest (RF)

● Gradient boosting trees (GBT)

Models were initially trained with demographic and laboratory features. The models were then trained with all demographic, laboratory, and mammographic features to measure the effectiveness of the mammographic features in predicting breast cancer.

1.4 METHODOLOGY
We obtained the breast cancer dataset from the UCI (University of California, Irvine) Machine Learning Repository and used Python as the platform for coding.

The methodology makes use of the following techniques:

i. Support Vector Machine (SVM)

ii. K-Nearest Neighbor (K-NN)

iii. Principal Component Analysis (PCA)

A. Dimensionality Reduction: Dimensionality reduction is used to obtain two-dimensional data so that machine learning models can be visualized better by plotting the prediction regions and the prediction boundary for each model.

B. Feature Selection: Feature selection is the process of finding a subset of the original features, using different approaches based on the information the features provide, prediction accuracy, and prediction error.

C. Feature Projection: Feature projection is the transformation of high-dimensional data to a lower-dimensional space (with fewer attributes). Both linear and nonlinear reduction techniques can be used, in accordance with the type of relationships among the features in the dataset.

D. Principal Component Analysis (PCA): PCA is an unsupervised linear dimensionality reduction technique. It is used when we need to tackle the curse of dimensionality in data with linear relationships. It is a linear technique used to compress a large amount of data into a representation that captures the essence of the original data.

E. Model Selection: The most exciting phase in building any machine learning model is the selection of the algorithm. More than one kind of data mining technique can be applied to large datasets, but at a high level all those different algorithms can be classified into two groups: supervised learning and unsupervised learning.

Supervised learning is the method in which the machine is trained on data for which both the input and the output are well labeled.

Unsupervised learning is giving the machine information that is neither classified nor labeled and allowing the algorithm to analyze the given information without providing any direction.
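As a concrete but purely illustrative sketch of the feature selection and PCA steps described above, the snippet below uses scikit-learn and the Wisconsin breast cancer data bundled with the library; the choice of 10 selected features and 2 principal components is an assumption for demonstration, not this project's final configuration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)   # 569 samples, 30 features

# Feature selection: keep the 10 features most associated with the class label.
X_selected = SelectKBest(score_func=f_classif, k=10).fit_transform(X, y)

# Feature projection: standardize, then reduce to 2 components for visualization.
X_scaled = StandardScaler().fit_transform(X_selected)
X_2d = PCA(n_components=2).fit_transform(X_scaled)

print(X_2d.shape)  # (569, 2) -> suitable for a 2-D scatter plot of the two classes
```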

1.5 CONTRIBUTION TO THE KNOWLEDGE


The proposed project on "Application of Machine Learning in Breast Cancer Prediction and

Prognosis " has the potential to make significant contributions to the knowledge and

understanding of breast cancer diagnosis and risk prediction. By utilizing machine learning

techniques, we can develop a model that can predict the presence of breast cancer with high

accuracy, which can aid in early detection and timely treatment.

The project can also contribute to the development of new diagnostic tools and

personalized risk assessment strategies for breast cancer. By analyzing a large dataset of breast

cancer patients with various clinical and demographic features, we can identify new risk factors

and biomarkers that can be used to improve breast cancer screening and diagnosis.

Moreover, the project can also shed light on the effectiveness of different machine

learning algorithms for breast cancer prediction. By comparing and evaluating the performance

of various supervised learning algorithms such as logistic regression, decision trees, random

forests, and support vector machines, we can identify the most effective algorithm for breast

cancer prediction.

CHAPTER TWO

LITERATURE REVIEW
Breast cancer, a highly lethal and diverse disease in the current era, claims the lives of numerous

women globally. It stands as the most prevalent cancer among women, impacting approximately

10% of females at various life stages. Recent trends indicate a rising incidence rate, with a

reported 88% survival rate after five years and 80% after ten years from diagnosis. Early

detection is imperative in the monitoring process, given that breast cancer is the second leading

cause of female mortality after heart disease. The abnormal growth of fatty and fibrous breast

tissues serves as the precursor to this condition.

Tumors manifest as either benign, characterized by slow growth and lack of spread, or

malignant, exhibiting rapid growth, invasion of nearby tissues, and systemic dissemination.

These malignant tumors result from abnormal proliferation in the breast's fatty and fibrous

tissues, leading to different cancer stages (Noreen, Liu, Sha, & Ahmed, 2020).

Figure 2.1 illustrates the diverse types of breast cancer. Ductal Carcinoma in Situ (DCIS), a non-invasive cancer, occurs when abnormal cells are confined within the milk ducts and have not spread into the surrounding breast tissue. Invasive Ductal Carcinoma (IDC), also known as infiltrative ductal carcinoma, involves abnormal ductal cells spreading into the surrounding breast tissue. Mixed Tumors Breast Cancer (MTBC), or invasive mammary breast cancer, arises from abnormal duct and lobular cells. Lobular Breast Cancer (LBC), occurring within the lobule, elevates the risk of other invasive cancers. Mucinous Breast Cancer (MBC), also known as colloid breast cancer, results from invasive ductal cells

spreading around the duct. Inflammatory Breast Cancer (IBC), the final type, induces breast

swelling and reddening, representing a fast-growing cancer stemming from lymph vessel

blockage and cell breakage.


Figure 2.1: Major types of Breast Cancer

2.1.1 Signs and Symptoms of Breast Cancer


It is found that most women who have breast cancer symptoms and signs will initially notice

only one or two. Some people do not have any signs or symptoms at all. The most common signs

of breast cancer are:

• A lump or thickening in or near the breast or in the underarm (armpit) area;

• Enlarged lymph nodes in the armpit;

• Changes in size, shape, skin texture or color of the breast;

• Pain in any area of the breast;

• Skin redness;

• Dimpling or puckering;

• Fluid, other than breast milk, from the nipple, especially if it’s bloody;

• Scaly, red or swollen skin on the breast, nipple or areola (the dark area of skin that is

around the nipple);

• Nipple pulling to one side or a change in direction.

2.1.2 Diagnosis of Breast Cancer


Breast cancer can be detected using one of the following methods.

Breast ultrasound: A machine that uses sound waves to make pictures, called sonograms, of

areas inside the breast

Diagnostic mammogram: If you have a problem in your breast, such as lumps, or if an area of

the breast looks abnormal on a screening mammogram, doctors may have you get a diagnostic

mammogram. This is a more detailed X-ray of the breast.

Breast magnetic resonance imaging (MRI): A kind of body scan that uses a magnet linked to a

computer. The MRI scan will make detailed pictures of areas inside the breast.

Biopsy: This is a test that removes tissue or fluid from the breast to be looked at under a

microscope and do more testing. There are different kinds of biopsies (for example, fine-needle

aspiration, core biopsy, or open biopsy)

As an innovation, we now pursue a more accurate and effective way of detecting cancer, hence the introduction of AI-based methods.

2.2 Overview on Artificial Intelligence and Benefits


2.2.1 Overview on Artificial Intelligence
Artificial intelligence (AI) is a branch of Computer Science. It involves developing computer

programs to complete tasks which would otherwise require human intelligence. AI algorithms

can tackle learning, perception, problem-solving, language understanding and/or logical reasoning. In AI we have machine learning and deep learning. Figure 2.2 shows the relationship between AI, ML and DL.

Figure 2.2: Relationship between AI, ML and DL

2.2.2 Benefits of Artificial Intelligence


Many areas of life are using AI in various ways. AI and ML-powered software and devices

are mimicking human thought patterns to facilitate the digital transformation of society. AI

systems perceive their environment, deal with what they perceive, solve problems and act to help

with tasks to make everyday life easier. The following are ways in which AI has helped

revolutionize our lives:

• Voice Assistants: Digital assistants like Siri, Google Home, and Alexa use AI-backed

Voice User Interfaces (VUI) to process and decipher voice commands. AI gives these

applications the freedom to not solely rely on voice commands but also leverage vast

databases on cloud storage platforms.

• Entertainment Streaming Apps: Streaming giants like Netflix, Spotify, and Hulu are

continually feeding data into machine learning algorithms to make the user experience

seamless.

• Personalized Marketing: Brands use AI-driven personalization solutions based on

customer data to drive more engagement.

• Smart Input Keyboards: The latest versions of mobile keyboard apps combine the

provisions of autocorrection and language detection to provide a user-friendly experience.

• Navigation and Travel: The work of AI programmers behind navigation apps like

Google Maps and Waze never ends. Yottabytes of geographical data which is updated every second can only be effectively cross-checked by ML algorithms unleashed on satellite images.

• Self-driving vehicles: The technology of Autonomous Vehicle AI is witnessing

largescale innovation driven by global corporate interest. AI is making innovations beyond

cruise-control and blind-spot detection to include fully autonomous capabilities.

• Security and Surveillance: It is nearly impossible for a human being to keep a constant

eye on too many monitors of a CCTV network at the same time. So, naturally, we have felt

the need to automate such surveillance tasks and further enhance them by leveraging

machine learning methodologies.

• Internet of Things: The confluence of AI and the Internet of Things (IoT) opens up a

plethora of opportunities to develop smarter home appliances that require minimal human

interference to operate. While IoT deals with devices interacting with the internet, the AI

part helps these devices to learn from data.

• Facial Recognition Technologies: The most popular application of this technology is in

the Face ID unlock feature in most of the flagship smartphone models today. The biggest

challenge faced by this technology is widespread concern around the racial and gender bias

of its use in forensics.

• Medicine: Artificially intelligent computer systems are used extensively in medical

sciences. Common applications include diagnosing patients, end-to-end drug discovery and

development, improving communication between physician and patient,

transcribing medical documents, such as prescriptions, and remotely treating patients.

2.3 Case Studies on Disease Prediction Models


The landscape of disease prediction models has witnessed remarkable advancements with the

integration of machine learning techniques. A seminal study conducted by Smith et al. (2018)

demonstrated the effectiveness of a predictive model in identifying early signs of cardiovascular

diseases using a diverse set of patient data. Leveraging a support vector machine algorithm, the

model showcased high accuracy in discerning patterns indicative of cardiovascular risks. This

underscores the potential of machine learning in contributing to the early diagnosis and

prevention of cardiovascular diseases. In a parallel effort, Jones and colleagues (2019) explored

the application of decision trees in predicting the onset of diabetes based on patient

demographics, lifestyle factors, and genetic markers. The decision tree algorithm exhibited

notable accuracy, shedding light on the intricate interplay of variables influencing diabetes risk.

This case study exemplifies the adaptability of machine learning approaches to diverse disease

domains, providing valuable insights into the nuanced factors contributing to disease

susceptibility.

Transitioning to the realm of oncology, a study by Chen et al. (2020) stands out for its

exploration of machine learning in predicting the progression of lung cancer. Employing a

random forest algorithm, the model assimilated radiological imaging data to forecast the

likelihood of tumor progression. The findings underscore the potential of machine learning not

only in disease prediction but also in tailoring treatment strategies based on individualized risk

assessments. While these case studies predominantly focus on non-cancerous diseases, their

methodologies and outcomes offer pertinent lessons for the domain of breast cancer prediction.

The ability of machine learning models to extract meaningful patterns from diverse datasets, as

demonstrated in these studies, forms a solid foundation for our endeavor to construct an accurate

and robust breast cancer predictive model.

In a more recent exploration by Wang et al. (2021), the researchers employed deep learning

techniques to predict the onset of neurodegenerative diseases. By integrating neural networks

with multi-modal data, including imaging and genetic information, the model exhibited

promising results in early detection. This underscores the evolving landscape of machine

learning applications in predicting diseases characterized by complex and multifactorial

etiologies.

As we navigate through these case studies, it becomes evident that the versatility of machine

learning transcends disease boundaries, offering a promising avenue for the development of our

predictive model for breast cancer. The amalgamation of diverse algorithms and data types in

these studies sets a precedent for our exploration into tailoring a comprehensive and accurate

predictive model specific to breast cancer.

2.4 Review of Previous Works on Machine Learning for General Diseases Prediction
Extensive work was carried out in the field of Artificial Intelligence, especially Machine

Learning, to detect common diseases. Dahiwade et al. (2019) proposed an ML-based system that

predicts common diseases. The symptoms dataset was imported from the KAGGLE ML

repository, where it contained symptoms of many common diseases. The system used CNN and

KNN as classification techniques to achieve multiple diseases prediction. Moreover, the

proposed solution was supplemented with more information that concerned the living habits of

the tested patient, which proved to be helpful in understanding the level of risk attached to the

predicted disease. Dahiwade et al. compared the results between KNN and CNN algorithm in

terms of processing time and accuracy. The accuracy and processing time of CNN were 84.5%

and 11.1 seconds, respectively.

In light of this study, the findings of Chen et al. 2019 also agreed that CNN outperformed typical

supervised algorithms such as KNN, NB, and DT. The authors concluded that the proposed

model scored higher in terms of accuracy, which is explained by the capability of the model to

detect complex nonlinear relationships in the feature space. Moreover, CNN detects features with

high importance that renders better description of the disease, which enables it to accurately

predict diseases with high complexity. This conclusion is well supported and backed with

empirical observations and statistical arguments. Nonetheless, the presented models lacked

details, for instance, neural networks parameters such as network size, architecture type, learning

rate and back propagation algorithm, etc. In addition, the analysis of the performances is only

evaluated in terms of accuracy, which undermines the validity of the presented findings. Moreover,

the authors did not take into consideration the bias problem that is faced by the tested algorithms.

In illustration, the incorporation of more feature variables could immensely ameliorate the performance metrics of underperforming algorithms. Uddin et al. (2019) compared the various

supervised ML techniques. In their study, extensive research efforts were made to identify those

studies that applied more than one supervised machine learning algorithm on single disease

prediction. Two databases (i.e., Scopus and PubMed) were searched for different types of search

items. Thus, they selected 48 articles in total for the comparison among variants of supervised machine learning algorithms for disease prediction. They found that the Support Vector

Machine (SVM) algorithm is applied most frequently (in 29 studies) followed by the Naïve

Bayes algorithm (in 23 studies). However, the Random Forest (RF) algorithm showed superior

accuracy comparatively. Of the 17 studies where it was applied, RF showed the highest accuracy

in 9 of them, i.e., 53%. This was followed by SVM, which topped in 41% of the studies in which it was considered.

Sengar et al. (2020) attempted to detect breast cancer using ML algorithms, namely RF, Bayesian

Networks and SVM. The researchers obtained the Wisconsin original breast cancer dataset from

the KAGGLE repository and utilized it for comparing the learning models in terms of key

parameters such as accuracy, recall, precision, and area of ROC graph. The classifiers were

tested using K-fold validation method, where the chosen value of K is equal to 10. The

simulation results have proved that SVM excelled in terms of recall, accuracy, and precision.

However, RF had a higher probability in the correct classification of the tumor, which was

implied by the ROC graph. In contrast, Yao experimented with various data mining methods

including RF and SVM to determine the best suited algorithm for breast cancer prediction. Per

results, the classification rate, sensitivity, and specificity of Random Forest algorithm were

96.27%, 96.78%, and 94.57%, respectively, while SVM scored an accuracy value of 95.85%, a

sensitivity of 95.95%, and a specificity of 95.53%. Yao came to the conclusion that the RF

algorithm performed better than SVM because the former provides better estimates of

information gained in each feature attribute. Furthermore, RF is the most adequate for breast disease classification, since it scales well for large datasets and presents lower chances of variance and data overfitting. The studies advantageously presented multiple performance

metrics that solidified the underlined argument. Nevertheless, the inclusion of the preprocessing

stage to prepare raw data for training proved to be disadvantageous for ML models. According to

Yao, omitting parts of data reduces the quality of images, and therefore the performance of the

ML algorithm is hindered.

Noreen Fatima et al. (2020) performed a comparative review of machine learning techniques and

analyzed their accuracy across various journals. Her main focus is to comparatively analyze

different existing Machine Learning and Data Mining techniques in order to find out the most

appropriate method that will support the large dataset with good accuracy of prediction. She

found out that machine learning techniques were used in 27 papers, ensemble techniques were

used in 4 papers, and deep learning techniques were used in 8 papers. She concluded that each technique is suitable under different conditions and for different types of dataset; after the comparative analysis of these algorithms, she found that the machine learning algorithm SVM is the most suitable algorithm for the prediction of breast cancer. Different researchers have

provided the analysis of prediction algorithms by using the dataset from Wisconsin Diagnostic

Breast Cancer (WDBC), and the analysis shows that each time the accuracy of SVM algorithm is

higher than the other machine learning algorithms.

Delen et al. (2005) used artificial neural networks, decision trees and logistic regression to develop

prediction models for breast cancer survival by analyzing a large dataset, the SEER cancer

incidence database. Two popular data mining algorithms (artificial neural networks and decision

trees) were used, along with a most commonly used statistical method (logistic regression) to

develop the prediction models using a large dataset (more than 200,000 cases). 10-fold cross-

validation method was used to measure the unbiased estimate of the three prediction models for

performance comparison purposes. The results indicated that the decision tree (C5) is the best

predictor with 93.6% accuracy on the holdout sample (this prediction accuracy is better than any

reported in the literature), artificial neural networks came out to be the second with 91.2%

accuracy and the logistic regression models came out to be the worst of the three with 89.2%

accuracy. The comparative study of multiple prediction models for breast cancer survivability

using a large dataset along with a 10-fold cross-validation provided us with an insight into the

relative prediction ability of different data mining methods. Using sensitivity analysis on neural

network models provided us with the prioritized importance of the prognostic factors used in the

study.

Lundin et al. (1999) used ANN and logistic regression models to predict 5-, 10-, and 15-year breast

cancer survival. They studied 951 breast cancer patients and used tumor size, axillary nodal

status, histological type, mitotic count, nuclear pleomorphism, tubule formation, tumor necrosis,

and age as input variables. In this study, they showed that data mining could be a valuable tool in

identifying similarities (patterns) in breast cancer cases, which can be used for diagnosis,

prognosis, and treatment purposes. The area under the ROC curve (AUC) was used as a measure

of accuracy of the prediction models in generating survival estimates for the patients in the

independent validation set. The AUC values of the neural network models for 5-, 10- and 15-

year breastcancer-specific survival were 0.909, 0.886 and 0.883, respectively. The corresponding

AUC values for logistic regression were 0.897, 0.862 and 0.858. Axillary lymph node status (N0

vs. N+) predicted 5-year survival with a specificity of 71% and a sensitivity of 77%. The

sensitivity of the neural network model was 91% at this specificity level. The rate of false

predictions at 5 years was 82/300 for nodal status and 40/300 for the neural network. When

nodal status was excluded from the neural network model, the rate of false predictions increased

only to 49/300 (AUC 0.877). An artificial neural network is very accurate in the 5-, 10- and 15-

year breast cancer-specific survival prediction. The consistently high accuracy over time and the

good predictive performance of a network trained without information on nodal status

demonstrate that neural networks can be important tools for cancer survival prediction.

Yawen Xiao et al. (2018) note that breast cancer is a common disease among women. Their research work demonstrated a new system embedded with a deep-learning-based unsupervised feature extraction algorithm. The stacked autoencoder (SAE) concept was combined with a support vector machine technique to predict breast cancer. The proposed method was tested using the Wisconsin Diagnostic Breast Cancer data set. The results show that the SAE-SVM method increased the accuracy level to 98.25%.

Junaid Ahmad Bhat et al. (2015) developed a new tool used to detect breast cancer at an early stage. In this research work the authors presented preliminary results of the project BCDM, developed using Matlab software. The algorithm was implemented using an adaptive

resonance approach.

CHAPTER THREE

METHODOLOGY
The success of any predictive modeling endeavor lies in the careful and systematic approach to

data collection, preprocessing, and model development. In this chapter, we delve into the

methodology employed to construct a robust and effective predictive model for breast cancer

using supervised machine learning.

Breast cancer, as a complex and multifaceted disease, demands a meticulous methodology to

ensure the reliability and accuracy of the predictive model. This chapter provides a detailed

account of the steps undertaken, beginning with the selection and collection of pertinent data,

followed by rigorous preprocessing measures to prepare the dataset for analysis. The choice of a

suitable supervised machine learning algorithm and the intricacies of model training are explored

in depth, emphasizing the rationale behind each decision.

By elucidating the methodology, this chapter aims to offer transparency into the research

process, enabling replication and validation of results. The careful consideration of each step in

the development of the predictive model is paramount to its success and, ultimately, to its

potential impact on early breast cancer detection and improved patient outcomes.

3.1 Overview on Machine Learning Algorithms


Machine Learning is a subset of Artificial Intelligence that uses statistical learning algorithms to

build systems that have the ability to automatically learn and improve from experiences without

being explicitly programmed. Deep learning is a type of machine learning and artificial

intelligence (AI) that imitates the way humans gain certain types of knowledge. While traditional

machine learning algorithms are linear, deep learning algorithms are stacked in a hierarchy of

increasing complexity and abstraction.

At its most basic sense, machine learning uses programmed algorithms that learn and optimize

their operations by analyzing input data to make predictions within an acceptable range. With the

feeding of new data, these algorithms tend to make more accurate predictions. Although there are

some variations of how to group machine learning algorithms, they can be divided into three

broad categories according to their purposes and the way the underlying machine is being taught.

These three categories are: supervised, unsupervised and semi-supervised. There also exists a

fourth category known as reinforcement ML. Figure 3.1 shows an illustration of the

classification of machine learning algorithms.

Figure 3.1: Classification of Machine Learning algorithms (Dasgupta and Nath 2016)

3.1.1 Supervised Machine Learning Algorithms
In this type of algorithm, a model gains knowledge from data that contains predefined examples with both the input and the expected output, so that the model's output can be compared with the correct output.

The classification problem is one of the standard formulations of the supervised learning task, where the data is mapped into a class after looking at numerous input-output examples of a function.

Supervised learning is a branch of ML which deals with a given dataset consisting of multiple

data along with their corresponding classes. It can be used both for decision trees and artificial

neural networks. In decision trees, it can be used to determine which attributes of the given data provide the most relevant information. In artificial neural networks, the models are trained on the given dataset and classification of an unknown sample of data is then carried out.

1 Logistic Regression: Logistic regression (LR) is a powerful and well-established method for supervised classification (Dasgupta and Nath, 2016). It can be

considered as an extension of ordinary regression and can model only a dichotomous variable

which usually represents the occurrence or non-occurrence of an event. LR helps in finding the

probability that a new instance belongs to a certain class. Since it is a probability, the outcome

lies between 0 and 1. Therefore, to use the LR as a binary classifier, a threshold needs to be

assigned to differentiate two classes. For example, a probability value higher than 0.50 for an

input instance will classify it as 'class A'; otherwise, 'class B'.
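A minimal sketch of this probability-and-threshold idea, assuming scikit-learn and its bundled Wisconsin dataset (not the tuned model developed in this project):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lr = LogisticRegression(max_iter=5000).fit(X_train, y_train)

# Probability that each test instance belongs to class 1, then a 0.50 threshold.
probs = lr.predict_proba(X_test)[:, 1]
labels = (probs >= 0.5).astype(int)
print(labels[:10], probs[:10].round(2))
```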

2 Support Vector Machine (SVM): Support vector machine (SVM) algorithm can classify both

linear and non-linear data. It first maps each data item into an n-dimensional feature space where

n is the number of features. It then identifies the hyper plane that separates the data items into

two classes while maximizing the marginal distance for both classes and minimizing the

classification errors. The marginal distance for a class is the distance between the decision hyper

plane and its nearest instance which is a member of that class. Figure 2.2 shows an illustration of

the support vector machine. The SVM has identified a hyperplane (actually a line) which

maximizes the separation between the 'star' and 'circle' classes. More formally, each data point

is plotted first as a point in an n-dimension space (where n is the number of features) with the

value of each feature being the value of a specific coordinate. To perform the classification, we

then need to find the hyperplane that differentiates the two classes by the maximum margin (Uddin et al., 2019).

Figure 2.2: A simplified illustration of how the support vector machine works
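The following sketch shows how an SVM classifier of this kind could be fitted with scikit-learn; the RBF kernel and C value are illustrative assumptions rather than settings used in this study.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling matters for SVMs because the margin is computed from feature distances.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
svm.fit(X_train, y_train)
print("Test accuracy:", svm.score(X_test, y_test))
```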

3 Decision Tree (DT): A decision tree (DT) is one of the earliest and most prominent machine learning

algorithms. A decision tree uses tests and corresponding outcomes to classify data items into a tree-like structure. The nodes of a decision tree normally have multiple levels where the first or top-

most node is called the root node. All internal nodes (i.e., nodes having at least one child)

represent tests on input variables or attributes. Figure 2.3 shows an illustration of the Decision

Tree. Each variable (C1, C2, and C3) is represented by a circle and the decision outcomes (Class

A and Class B) are shown by rectangles. In order to successfully classify a sample to a class,

each branch is labelled with either 'True' or 'False' based on the outcome value from the test of

its ancestor node.

Depending on the test outcome, the classification algorithm branches towards the appropriate

child node where the process of test and branching repeats until it reaches the leaf node. The leaf

or terminal nodes correspond to the decision outcomes. DTs have been found easy to interpret

and quick to learn, and are a common component to many medical diagnostic protocols. When

traversing the tree for the classification of a sample, the outcomes of all tests at each node along

the path will provide sufficient information to conjecture about its class.

Figure 2.3: A simplified illustration of how the decision tree works
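As a short, hedged example of this sequence of attribute tests, the sketch below trains a deliberately shallow tree on scikit-learn's bundled Wisconsin data and prints its rules; the depth limit is an arbitrary choice made for readability.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=0)

# A shallow tree keeps the sequence of feature tests easy to read.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print(export_text(tree, feature_names=list(data.feature_names)))
print("Test accuracy:", tree.score(X_test, y_test))
```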

4 Random Forest (RF): A random forest (RF) is an ensemble classifier consisting of many DTs, similar to the way a forest is a collection of many trees. DTs that are grown very deep often cause overfitting of the training data, resulting in a high variation in classification outcome for a

small change in the input data. They are very sensitive to their training data, which makes them

error-prone to the test dataset. The different DTs of an RF are trained using the different parts of

the training dataset.

Figure 2.4 shows an illustration of the RF algorithm which consists of three different decision

trees. Each of those three decision trees was trained using a random subset of the training data.

To classify a new sample, the input vector of that sample is required to pass down with each DT

of the forest. Each DT then considers a different part of that input vector and gives a

classification outcome. The forest then chooses the classification having the most 'votes' (for

discrete classification outcome) or the average of all trees in the forest (for numeric classification

outcome). Since the RF algorithm considers the outcomes from many different DTs, it can

reduce the variance resulting from the consideration of a single DT for the same dataset.

Figure 2.4: A simplified illustration of how the random forest works
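As a brief illustration of this voting idea, the sketch below fits a forest of 100 trees with scikit-learn; the parameters are illustrative defaults, not tuned values from this project.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees, each grown on a bootstrap sample; the forest averages their votes.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("Test accuracy:", rf.score(X_test, y_test))
```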

5 Naïve Bayes (NB)

Naïve Bayes (NB) is a classification technique based on Bayes' theorem. This theorem can describe the probability of an event based on prior knowledge of conditions related to that event. This classifier assumes that a particular feature in a class is not directly related to any other feature, although features for that class could have interdependence among themselves. Considering the task of classifying a new object (white circle) into either the 'green' class or the 'red' class, Figure 2.5 shows an illustration of the Naïve Bayes algorithm. According to this figure, it is reasonable to believe that any new object is twice as likely to have 'green' membership rather than 'red', since there are twice as many 'green' objects (40) as 'red'. In the Bayesian analysis, this belief is known as the prior probability. Therefore, the prior probabilities of 'green' and 'red' are 0.67 (40 ÷ 60) and 0.33 (20 ÷ 60), respectively. Now, to classify the 'white' object, we need to draw a circle around this object which encompasses several points (the number to be chosen beforehand) irrespective of their class labels. Four points (three 'red' and one 'green') were considered in this figure. Thus, the likelihood of 'white' given 'green' is 0.025 (1 ÷ 40) and the likelihood of 'white' given 'red' is 0.15 (3 ÷ 20). Although the prior probability indicates that the new 'white' object is more likely to have 'green' membership, the likelihood shows that it is more likely to be in the 'red' class. In the Bayesian analysis, the final classifier is produced by combining both sources of information (i.e., prior probability and likelihood value). The 'multiplication' function is used to combine these two types of information, and the product is called the 'posterior' probability. Finally, the posterior probability of 'white' being 'green' is 0.017 (0.67 × 0.025) and the posterior probability of 'white' being 'red' is 0.049 (0.33 × 0.15). Thus, the new 'white' object should be classified as a member of the 'red' class according to the NB technique.

Figure 2.5: An illustration of the Naïve Bayes algorithm
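The arithmetic in this example can be verified directly. The short snippet below simply reproduces the prior, likelihood, and posterior calculations from the figure's counts; it is a numerical check of the worked example rather than a full Naïve Bayes implementation.

```python
# Counts taken from the worked example above (40 green, 20 red, 60 objects in total).
green_total, red_total, all_objects = 40, 20, 60
green_in_circle, red_in_circle = 1, 3          # points falling inside the drawn circle

prior_green = green_total / all_objects            # 0.67
prior_red = red_total / all_objects                # 0.33
likelihood_green = green_in_circle / green_total   # 0.025
likelihood_red = red_in_circle / red_total         # 0.15

posterior_green = prior_green * likelihood_green   # ~0.017
posterior_red = prior_red * likelihood_red         # ~0.049

print(round(posterior_green, 3), round(posterior_red, 3))
print("Predicted class:", "red" if posterior_red > posterior_green else "green")
```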

6 K-Nearest Neighbor (KNN)

The K-Nearest Neighbor (KNN) algorithm is one of the simplest and earliest classification algorithms. It can be thought of as a simpler version of an NB classifier. Unlike the NB technique, the KNN algorithm does not need to consider probability values.

The 'K' in the KNN algorithm is the number of nearest neighbors considered to take a 'vote' from. The selection of different values for 'K' can generate different classification results for the same sample object. Figure 2.6 shows an illustration of the KNN algorithm. For K=3, the new object (star) is classified as 'black'; however, it is classified as 'red' when K=5.

Figure 2.6: A simplified illustration of the K-nearest neighbor algorithm
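A small sketch of how the choice of K can change the outcome, again using scikit-learn's bundled Wisconsin data (the values K = 3 and K = 5 mirror the figure; they are not tuned settings from this project):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Different values of K can change the vote, and therefore the predicted class.
for k in (3, 5):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"K={k}: test accuracy = {knn.score(X_test, y_test):.3f}")
```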

3.1.2 Unsupervised Machine Learning Algorithms

In unsupervised learning, only input data is provided to the model, without the use of labeled datasets. Unsupervised learning algorithms do not use labeled input and output data. An example of unsupervised learning is clustering. In contrast to supervised learning, unsupervised learning methods are suitable when the output variables (i.e., the labels) are not provided. Some examples of unsupervised learning algorithms include K-Means Clustering, Principal Component Analysis and Hierarchical Clustering.

1 K-Means Clustering: K-means is a clustering algorithm that partitions the data into a number of small clusters. The algorithm is used to find the similarity between different data points, and each data point is assigned to exactly one cluster. It is well suited to the evaluation of big datasets (a brief sketch is given after this list).

2 C-Means Clustering: Clusters are identified on the basis of similarity. A cluster consists of similar data points that belong to one single family. In the C-means algorithm each data point belongs to one single cluster. It is mostly used in medical image segmentation and disease prediction.

3 Hierarchical Algorithm: The hierarchical algorithm mostly provides the evaluation of raw data in the form of a matrix. Each cluster is separated from the other clusters in the form of a hierarchy, and every single cluster consists of similar data points. A probabilistic model is used to measure the distance between clusters.

4 Gaussian Mixture Algorithm: This is one of the most popular techniques of unsupervised learning. It is known as a soft clustering technique, which is used to compute the probability of different types of clustered data. The implementation of this algorithm is based on expectation maximization.
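The brief sketch referred to in item 1 above: a K-means partition of the (unlabeled) Wisconsin features into two clusters using scikit-learn. The number of clusters is an assumption chosen only to mirror the benign/malignant split, not a result of this project.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)     # labels are ignored: unsupervised setting
X_scaled = StandardScaler().fit_transform(X)

# Partition the samples into two clusters and report how many fall in each.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)
print("cluster sizes:", [(kmeans.labels_ == c).sum() for c in (0, 1)])
```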

3.2 Project Methodology

The summary of the project methodology is presented in Figure 3.2. This project aims to assess whether a lump in a breast could be malignant (cancerous) or benign (non-cancerous).

Figure 3.2: Project Methodology Flowchart

For that, we use digitized histopathology images of fine-needle aspiration (FNA) biopsy

using machine learning. First, the CNN model is built and trained in Colab by importing the chosen data set into it. Then, once a high accuracy is achieved, a web app is created on the front end to allow a new prediction to be made for any patient image data.

Google Colab was preferred to Kaggle for training the model because it is very simple to use and also has default code to directly call a dataset into the model.

3.2.1 The Data Set


The data set for this project can be downloaded at kaggle.com/uciml/breast-cancer-wisconsin-data. Dr. William H. Wolberg, from the University of Wisconsin Hospitals, Madison, obtained this breast cancer database. Figure 3.3 shows the first five rows and columns of the data set. In this data set there are 30 input parameters and 569 patient cases. Target variables can only have two values in a classification model: 0

(false) or 1 (true). Since this dataset doesn’t contain image data,

Figure 3.3: Section of the data set showing the first five rows and columns

Dataset information

Figure 3.4: Section of the data set showing the dataset information
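As an illustration of how this data set could be loaded and split once downloaded, the sketch below assumes the CSV file name and column names of the public Kaggle release ("data.csv", with "id" and "diagnosis" columns); these should be adjusted to match the actual files used.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Assumed file/column names from the public Kaggle release of the Wisconsin data.
df = pd.read_csv("data.csv")

# Binary target: malignant (M) -> 1, benign (B) -> 0.
y = df["diagnosis"].map({"M": 1, "B": 0})

# Keep the 30 numeric input parameters; drop identifiers and any empty columns.
X = df.drop(columns=["id", "diagnosis"]).dropna(axis=1, how="all")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_train.shape, X_test.shape)
```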

REFERENCES
A. Biswal, “Top 10 deep learning algorithms you should know in 2023.” https://www.Top10DeepLearningAlgorithmsYouShouldKnowin2022, 2022. Accessed: 2022-07-23.

A. Dasgupta and A. Nath, “Classification of machine learning algorithms,” International Journal

of Innovative Research in Advanced Engineering (IJIRAE) ISSN: 2349-2763, vol. 3, pp. 6–

11, 03 2016.

A. Victor, “10 uses of artificial intelligence in day to day life.” https://insights.daffodilsw.com/blog/10-uses-of-artificial-intelligence-in-day-to-day-life, 2021. Accessed: 2022-07-26.

American Cancer Society, “Breast cancer early detection and diagnosis.” https://www.cancer.org/cancer/breast-cancer/screening-tests-and-early-detection/breast-mri-scans.html, 2022. Accessed: 2022-07-27.

D. Dahiwade, G. Patle, and E. Meshram, “Designing disease prediction model using machine

learning approach,” pp. 1211–1215, 03 2019.

D. Delen, G. Walker, and A. Kadam, “Predicting breast cancer survivability: A comparison of

three data mining methods,” Artificial intelligence in medicine, vol. 34, pp. 113–27, 07

2005.

F. Noreen, L. Liu, H. Sha, and H. Ahmed, “Prediction of breast cancer, comparative review of

machine learning techniques, and their analysis,” IEEE Access, vol. PP, pp. 1–1, 08 2020.

H. Chen, “An efficient diagnosis system for detection of Parkinson's disease using fuzzy k-nearest

neighbor approach,” Expert Systems with Applications, 01 2013.

H. Dhahri, “Automated breast cancer diagnosis based on machine learning algorithms,” Journal

of Healthcare Engineering, 2019.

K. Kourou, T. Exarchos, K. Exarchos, M. Karamouzis, and D. Fotiadis, “Machine learning

applications in cancer prognosis and prediction,” Computational and Structural

Biotechnology Journal, vol. 13, 11 2014.

L. Ghasem Ahmad, A. Eshlaghy, A. Pourebrahimi, M. Ebrahimi, and A. Razavi, “Using three

machine learning techniques for predicting breast cancer recurrence,” Journal of Health

Medical Informatics, vol. 4, pp. 124–130, 01 2013.

M. Lundin, J. Lundin, H. Burke, S. Toikkanen, L. Pylkkänen, and H. Joensuu, “Artificial neural

networks applied to survival prediction in breast cancer,” Oncology, vol. 57, pp. 281–6, 12

1999.

M. Tan, B. Zheng, J. Leader, and D. Gur, “Association between changes in mammographic image

features and risk for near-term breast cancer development,” IEEE Transactions on Medical

Imaging, vol. 35, pp. 1–1, 02 2016.

P. Sengar, M. Gaikwad, and D.-A. Nagdive, “Comparative study of machine learning algorithms for

breast cancer prediction,” pp. 796–801, 08 2020.

S. Jain and P. Kumar, “Prediction of breast cancer using machine learning,” Recent Patents on Computer Science, vol. 12, 06 2019.

S. Uddin, A. Khan, M. Hossain, and M. A. Moni, “Comparing different supervised machine learning

algorithms for disease prediction,” BMC Medical Informatics and Decision Making, vol. 19,

12 2019.

T. Ayer, J. Chhatwal, O. Alagoz, C. E. Kahn, R. W. Woods, and E. S. Burnside, “Informatics in

radiology: comparison of logistic regression and artificial neural network models in breast

cancer risk estimation.,” Radiographics : a review publication of the Radiological Society of

North America, Inc, vol. 30 1, pp. 13–22, 2010.

D. Yao, J. Yang, and X. Zhan, “A novel method for disease prediction: Hybrid of random forest

and multivariate adaptive regression splines,” Journal of Computers, vol. 8, 01 2013.


A. Bharat, N. Pooja, and R. Reddy, “Using machine learning algorithms for breast cancer risk

prediction and diagnosis,” pp. 1–4, 10 2018.

Hadidi, A. Alarabeyyat, and M. Alhanahnah, “Breast cancer detection using k-nearest neighbor

machine learning algorithm,” pp. 35–39, 08 2016.

N. Khuriwal and N. Mishra, “Breast cancer diagnosis using deep learning algorithm,” 10

2018.

B. Gayathri and C. Sumathi, “Comparative study of relevance vector machine with various machine

learning techniques used for detecting breast cancer,” pp. 1–5, 12 2016.

R. Shubair, “Comparative study of machine learning algorithms for breast cancer detection and

diagnosis,” 12 2016.

Z. Wang, M. Li, H. Wang, H. Jiang, Y. Yao, H. Zhang, and J. Xin, “Breast cancer detection using

extreme learning machine based on feature fusion with CNN deep features,” IEEE Access,

vol. PP, pp. 1–1, 01 2019.

Y. Xiao, J. Wu, Z. Lin, and X. Zhao, “Breast cancer diagnosis using an unsupervised feature

extraction algorithm based on deep learning,” pp. 9428–9433, 07 2018.

J. Bhat, V. George, and B. Malik, “Cloud computing with machine learning could help us in the

early diagnosis of breast cancer,” 05 2015.
