
A PROJECT REPORT

ON
Breast Cancer Prediction
For the partial fulfillment for the award of the degree of

BACHELOR OF TECHNOLOGY
In
COMPUTER SCIENCE AND ENGINEERING
(Artificial Intelligence and Machine Learning)

Submitted By

Akash Yadav (2001921530005)


Akhil Kumar (2001921530006)
Anurag Yadav (2001921530015)

Under the Supervision of


Mrs. Anju Chandna

G.L. BAJAJ INSTITUTE OF TECHNOLOGY AND MANAGEMENT,


GREATER NOIDA

Affiliated to
DR. APJ ABDUL KALAM TECHNICAL UNIVERSITY, LUCKNOW
2022-23
TABLE OF CONTENTS

1. Declaration ............................................................. (1)
2. Certificate ............................................................. (2)
3. Acknowledgement ......................................................... (3)
4. Abstract ................................................................ (4)

Chapters

1. Introduction ............................................................ (5)
2. Related Works ........................................................... (6)
3. Fundamentals ............................................................ (7)
4. System Requirements and Specification ................................... (11)
5. Proposed Methodology .................................................... (12)
6. Results ................................................................. (14)
7. Conclusions and Future Scope ............................................ (18)
8. Bibliography ............................................................ (19)
Declaration

We hereby declare that the project work presented in this report, "Breast Cancer
Prediction", in partial fulfillment of the requirement for the award of the degree
of Bachelor of Technology in Computer Science & Engineering, submitted to A.P.J.
Abdul Kalam Technical University, Lucknow, is based on our own work carried
out at the Department of Computer Science & Engineering, G.L. Bajaj Institute of
Technology & Management, Greater Noida. The work contained in the report is
original, and the project work reported here has not been submitted by us
for the award of any other degree or diploma.

Name: Akash Yadav Name: Akhil Kumar

Roll No.: 2001921530005 Roll No.: 2001921530006

Signature: Signature:

Name: Anurag Yadav

Roll No.: 2001921530015

Signature:

Date:

Certificate

This is to certify that the project report "Breast Cancer Prediction" done by
Akash Yadav (2001921530005), Akhil Kumar (2001921530006) and Anurag Yadav
(2001921530015) is an original work carried out by them in the Department of
Computer Science & Engineering, G.L. Bajaj Institute of Technology & Management,
Greater Noida under my guidance. The matter embodied in this project work has not
been submitted earlier for the award of any degree or diploma, to the best of my
knowledge and belief.

Date:

Mrs. Anju Chandna Dr. Sansar Singh Chauhan

Signature of the Signature of


Supervisor Head of the Department

Acknowledgement

The merciful guidance bestowed upon us by the Almighty helped us see this
project through to a successful end. We humbly pray with a sincere heart for His
guidance to continue forever.

We pay thanks to our project guide, Mrs. Anju Chandna, who has given guidance
and light to us during this project. Her versatile knowledge has helped us
through the critical times during the span of this project.

We pay special thanks to our Head of Department, Dr. Sansar Singh Chauhan,
who has always been present as a support and helped us in all possible ways
during this project.

We also take this opportunity to express our gratitude to all those people who have
been directly and indirectly with us during the completion of the project.

We want to thank our friends who have always encouraged us during this project.

Last but not least, thanks to all the faculty of the CSE Department who provided
valuable suggestions during the period of the project.

Abstract

Women are seriously threatened by breast cancer, with its high morbidity and
mortality. The lack of robust prognosis models makes it difficult for doctors to
prepare a treatment plan that may prolong patient survival time. Hence, the
requirement of the time is to develop a technique which gives minimum error and
increased accuracy. Four algorithms, SVM, Logistic Regression, Random Forest and
KNN, which predict the breast cancer outcome, have been compared in this report
using different datasets. All experiments are executed within a simulation
environment and conducted on the Jupyter platform. The aim of the research falls
into three domains. The first domain is prediction of cancer before diagnosis,
the second is prediction of diagnosis and treatment, and the third focuses on
the outcome during treatment. The proposed work can be used to predict the
outcome of the different techniques, and a suitable technique can be chosen
depending upon the requirement. This research is carried out to predict
accuracy. Future research can be carried out to predict other parameters, and
breast cancer research can be categorized on the basis of other parameters.

Keywords — Breast Cancer, machine learning, feature selection, classification,
prediction, KNN, Random Forest, ROC.

Chapter 1
INTRODUCTION

Problem Definition

Fig. 1.1: Mine Image

The second major cause of women's death is breast cancer (after lung cancer).
About 246,660 new cases of invasive breast cancer were expected to be diagnosed
among women in the US during 2016, and 40,450 deaths were estimated. Breast
cancer is a type of cancer that starts in the breast. Cancer starts when cells
begin to grow out of control. Breast cancer cells usually form a tumour that can
often be seen on an x-ray or felt as a lump.

Breast cancer can spread when the cancer cells get into the blood or lymph
system and are carried to other parts of the body. The causes of breast cancer
include changes and mutations in DNA. There are many different types of breast
cancer; common ones include ductal carcinoma in situ (DCIS) and invasive
carcinoma, while others, like phyllodes tumours and angiosarcoma, are less
common. There are many algorithms for classification of breast cancer outcomes.

The side effects of breast cancer include fatigue, headaches, pain and numbness
(peripheral neuropathy), and bone loss and osteoporosis. There are many
algorithms for classification and prediction of breast cancer outcomes. The
present report gives a comparison between the performance of four classifiers:
SVM, Logistic Regression, Random Forest and KNN, which are among the most
influential data mining algorithms. Breast cancer can be medically detected
early during a screening examination through mammography or by a portable
cancer diagnostic tool.

Cancerous breast tissues change with the progression of the disease, and these
changes can be directly linked to cancer staging. The stage of breast cancer
(I-IV) describes how far a patient's cancer has proliferated. Statistical
indicators such as tumour size, lymph node metastasis and distant metastasis are
used to determine stages. To prevent cancer from spreading, patients may have to
undergo breast cancer surgery, chemotherapy, radiotherapy and endocrine therapy.
The goal of the research is to identify and classify malignant and benign
tumours.

Typical cancer screening procedures are grounded on the "gold standard", which
consists of three tests: clinical evaluation, radiological imaging, and
pathology testing [18]. This traditional technique, which is based on
regression, detects the existence of cancer, whereas new ML techniques and
algorithms are built on model creation.

In its training and testing stages, the model is meant to forecast unknown data
and offers a satisfactory predicted outcome [19]. Preprocessing, feature
selection or extraction, and classification are the three major methodologies
used in machine learning [20]. The feature extraction part of the machine
learning method is crucial for cancer diagnosis and prediction, as this process
may differentiate between benign and malignant tumours [21].

Chapter 2
RELATED WORK

The causes of breast cancer include some changes and mutations in DNA. Cancer
starts when cells begin to grow out of control. Breast cancer cells usually form
a tumour that can often be seen on an x-ray or felt as a lump. There are many
different types of breast cancer; common ones include ductal carcinoma in situ
(DCIS) and invasive carcinoma, while others, like phyllodes tumours and
angiosarcoma, are less common.

Wang, Zhang and Huang et al. [1] used Logistic Regression and achieved an
accuracy of 96.4%. Akbugday et al. [2] performed classification on the Breast
Cancer Dataset using KNN and SVM and achieved an accuracy of 96.85%. Kaya Keles
et al. [3], in the paper titled "Breast Cancer Prediction and Detection Using
Data Mining", used Random Forest and achieved an accuracy of 92.2%. Vikas
Chaurasia and Saurabh Pal et al. [4] compared the performance of supervised
learning classifiers such as Naïve Bayes, SVM-RBF kernel, RBF neural networks,
decision trees (J48) and simple CART to find the best classifier on breast
cancer datasets. Dalen, Walker and Kadam et al. [5] used AdaBoost and achieved
an accuracy of 97.5%, better than Random Forest. Kavitha et al. [6] used
ensemble methods with neural networks and achieved an accuracy of 96.3%, lower
than previous studies. Sinthia et al. [7] used a backpropagation method with
94.2% accuracy.

The experimental results show that the SVM-RBF kernel is more accurate than the
other classifiers; it scores an accuracy of 96.84% on the Wisconsin Breast
Cancer (original) dataset. We have used classification methods such as SVM, KNN,
Random Forest, Naïve Bayes and ANN. Prediction and prognosis of cancer
development are focused on three major domains: risk assessment or prediction of
cancer susceptibility, prediction of cancer relapse, and prediction of the
cancer survival rate. The first domain comprises prediction of the probability
of developing a certain cancer prior to patient diagnostics.

The second issue is related to prediction of cancer recurrence in terms of
diagnostics and treatment, and the third case is aimed at prediction of several
possible parameters characterizing cancer development and treatment after the
diagnosis of the disease: survival time, life expectancy, progression, drug
sensitivity, etc. The survivability rate and cancer relapse depend very much on
the medical treatment and the quality of the diagnosis.

Radiology professionals frequently struggle with mammography mass lesion
labelling, which can lead to unneeded and costly breast biopsies. The paper's
implementation was evaluated using three publicly available benchmark datasets:
the DDMS, INbreast, and BCDR databases for training and testing, and the MIAS
dataset for testing only. The results showed that when PCNN is paired with CNN,
it outperforms other approaches on the same publicly available datasets.
As we know, data pre-processing is a data mining technique used to filter data
into a usable format, because real-world datasets are almost always available in
different formats. The data is rarely available as per our requirement, so it
must be filtered into an understandable format. Data pre-processing is a proven
method of resolving such issues and converts the dataset into a usable format;
for pre-processing we have used the standardization method.

Chapter 3
Fundamentals
1. Logistic regression:

This type of statistical model (also known as a logit model) is often used for
classification and predictive analytics. Logistic regression estimates the
probability of an event occurring, such as voted or didn't vote, based on a
given dataset of independent variables. Since the outcome is a probability, the
dependent variable is bounded between 0 and 1. In logistic regression, a logit
transformation is applied to the odds, that is, the probability of success
divided by the probability of failure. This is also commonly known as the log
odds, or the natural logarithm of odds, and the logistic function is represented
by the following formula:

logit(pi) = ln(pi / (1 - pi)) = β0 + β1·xi

In this logistic regression equation, logit(pi) is the dependent or response
variable and x is the independent variable. The beta parameter, or coefficient,
in this model is commonly estimated via maximum likelihood estimation (MLE).
This method tests different values of beta through multiple iterations to
optimize for the best fit of the log odds.

All of these iterations produce the log likelihood function, and logistic
regression seeks to maximize this function to find the best parameter estimate.
Once the optimal coefficient (or coefficients, if there is more than one
independent variable) is found, the conditional probabilities for each
observation can be calculated, logged, and summed together to yield a predicted
probability. For binary classification, a probability less than 0.5 predicts 0,
while a probability greater than 0.5 predicts 1.

After the model has been computed, it's best practice to evaluate how well the
model predicts the dependent variable; this is called goodness of fit. The
Hosmer-Lemeshow test is a popular method to assess model fit.

Graph for Logistic Regression
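As a minimal sketch of the fitting and thresholding steps described above (assuming scikit-learn and its bundled Wisconsin breast cancer data as a stand-in for the report's dataset), logistic regression can be applied as follows:

```python
# Illustrative sketch only: fits a logistic regression model and applies the
# 0.5 probability threshold described above. The scikit-learn breast cancer
# dataset is an assumed stand-in for the report's data.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# max_iter raised so maximum likelihood estimation converges on unscaled data
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

# predict_proba gives the estimated probability of the positive class;
# probabilities above 0.5 map to class 1, the rest to class 0
probs = model.predict_proba(X_test)[:, 1]
preds = (probs > 0.5).astype(int)
print("test accuracy:", model.score(X_test, y_test))
```

The 0.5 cut-off here mirrors the binary classification rule stated above; other thresholds can be chosen when the costs of false positives and false negatives differ.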

2. Random Forest Classifier:

Random forest, as the name implies, consists of many separate decision trees
which all work as an ensemble. Each separate tree of the random forest [19]
gives out a class forecast, and the class with the most votes becomes our
model's prediction, as shown in Fig 3.5.

Fig. 3.5: Random Forest Classification

The principal idea propelling random forest is a straightforward but powerful
one: the wisdom of crowds. In data science terms, the reason the random forest
model works so well is that a large number of mutually uncorrelated models
(trees) functioning as a committee will outperform any of its individual
constituent models.
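The majority-vote idea can be sketched as follows (a minimal illustration with scikit-learn's RandomForestClassifier on its bundled breast cancer data, not the report's exact configuration):

```python
# Illustrative sketch: each tree in the forest votes, and the majority class
# becomes the ensemble's prediction. The dataset is scikit-learn's bundled
# breast cancer data, assumed as a stand-in.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

# Inspect the individual trees' votes for the first test sample
votes = [int(tree.predict(X_test[:1])[0]) for tree in forest.estimators_]
print("forest prediction:", forest.predict(X_test[:1])[0])
print("votes for class 1:", sum(votes), "of", len(votes))
```

Inspecting `estimators_` makes the committee visible: the ensemble's output is simply the class most of the individual trees voted for.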

3. Support Vector Machine

Support Vector Machine is a supervised machine learning algorithm which performs
well in pattern recognition problems, and it is used as a training algorithm for
learning classification and regression rules from data. SVM is most usefully
applied when the number of features and the number of instances are high. The
SVM algorithm builds a binary classifier. In an SVM model, each data item is
represented as a point in an n-dimensional space, where n is the number of
features and each feature is the value of a coordinate in that space. A support
vector machine model works as follows: (1) first, it finds lines or boundaries
that correctly classify the training dataset; (2) then, from those lines or
boundaries, it picks the one that has the maximum distance from the closest data
points.
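A minimal sketch of the two steps above, assuming scikit-learn's SVC with an RBF kernel (the kernel choice and the standardization step are assumptions for illustration, not fixed by the report here):

```python
# Illustrative sketch: an SVM finds the boundary with the maximum margin to
# the closest points (the support vectors). Features are standardized first,
# since SVMs are sensitive to feature scale.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
svm.fit(X_train, y_train)
print("test accuracy:", svm.score(X_test, y_test))
print("support vectors:", svm.named_steps["svc"].n_support_.sum())
```

The points counted by `n_support_` are exactly the closest data points from step (2): the margin of the chosen boundary is measured against them.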

Chapter 4
System Requirements and Specifications

1. Hardware Requirements:

• Processor/CPU: Core i5 or above.


• RAM: Minimum 4 GB.
• Storage: Minimum 1 GB.

2. Software Requirements:

• Programming Language: Python.


• OS Version: Windows 8.1 or above.
• IDE: Visual Studio Code.

3. Supporting Python Modules:

import pandas as pd
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC                         # SVM classifier compared in this report
from sklearn.neighbors import KNeighborsClassifier  # KNN classifier compared in this report
from sklearn.metrics import accuracy_score

Chapter 5
PROPOSED METHODOLOGY

The proposed pipeline proceeds through the following stages:

1. Data Processing
2. Data Preparation
3. Feature Selection
4. Feature Projection
5. Feature Scaling
6. Model Selection
7. Prediction

Phase 1- Pre-Processing Data

The first phase we do is to collect the data that we are interested in collecting for pre-processing
and to apply classification and Regression methods. Data pre-processing is a data mining
technique that involves transforming raw data into an understandable format. Real world data
is often incomplete, inconsistent, and lacking certain to contain many errors. Data pre-
processing is a proven method of resolving such issues. Data pre-processing prepares raw data
for further processing. For pre-processing we have used standardization method to pre-process
the UCI dataset. This step is very important because the quality and quantity of data that you
gather will directly determine how good your predictive model can be. In this case we collect
the Breast Cancer samples which are Benign and Malignant. This will be our training data.

Phase 2- DATA PREPARATION

Data preparation is where we load our data into a suitable place and prepare it
for use in machine learning training. We first put all our data together, and
then randomize the ordering.
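The combine-and-shuffle step can be sketched like this (column names and values below are hypothetical, not taken from the actual dataset):

```python
# Illustrative sketch of data preparation: assemble features and labels in
# one table, encode the diagnosis, and shuffle while splitting 70/30.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "radius_mean":  [14.1, 20.6, 12.4, 18.2],   # hypothetical values
    "texture_mean": [19.3, 25.7, 17.9, 22.1],
    "diagnosis":    ["B", "M", "B", "M"],
})

X = df.drop(columns="diagnosis")
y = df["diagnosis"].map({"B": 0, "M": 1})  # encode labels numerically

# shuffle=True randomizes the row order before the split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=42)
print(len(X_train), "training rows,", len(X_test), "test rows")
```

Shuffling before the split avoids any ordering bias in the source file, for example if benign and malignant records were stored in separate blocks.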

Phase 3- FEATURE SELECTION

In machine learning and statistics, feature selection, also known as variable
selection or attribute selection, is the process of selecting a subset of
relevant features for use in model construction.

Data file and feature selection: the Breast Cancer Wisconsin (Diagnostic) data
set from the Kaggle repository was used, and out of 31 parameters we have
selected about 8-9. Our target parameter is the breast cancer diagnosis:
malignant or benign. We have used the Wrapper Method for feature selection. The
important features found by the study are: 1. Concave points worst 2. Area worst
3. Area se 4. Texture worst 5. Texture mean 6. Smoothness worst 7. Smoothness
mean 8. Radius mean 9. Symmetry mean.

Attribute information: 1) ID number, 2) Diagnosis (M = malignant, B = benign),
3-32)
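The report does not spell out the exact wrapper procedure, so the sketch below uses recursive feature elimination (RFE) as one common wrapper-style method, keeping 9 features to match the count found by the study:

```python
# Illustrative wrapper-style selection: repeatedly fit a model and drop the
# weakest features until 9 remain. RFE is an assumed stand-in for the
# report's unspecified wrapper method.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

data = load_breast_cancer()
X, y = data.data, data.target

# step=5 drops five features per elimination round to keep the loop short
selector = RFE(LogisticRegression(max_iter=5000),
               n_features_to_select=9, step=5)
selector.fit(X, y)

selected = [name for name, keep in zip(data.feature_names, selector.support_)
            if keep]
print("selected features:", selected)
```

Because a wrapper method scores feature subsets by refitting the model itself, the retained set depends on the chosen estimator; a different base model may keep a different 9 features.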

Phase 4- FEATURE PROJECTION

Feature projection is the transformation of data from a high-dimensional space
to a lower-dimensional space (with fewer attributes). Both linear and nonlinear
reduction techniques can be used, in accordance with the type of relationships
among the features in the dataset.

Phase 5- FEATURE SCALING

Most of the time, your dataset will contain features that vary widely in
magnitude, units and range. Since most machine learning algorithms use the
Euclidean distance between two data points in their computations, we need to
bring all features to the same level of magnitude. This can be achieved by
scaling.
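Standardization, the scaling method used in this report, can be sketched as follows (the feature values are illustrative):

```python
# Illustrative sketch of standardization: each column is rescaled to zero
# mean and unit variance, so features on very different scales contribute
# comparably to Euclidean distances.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1000.0, 0.1],   # hypothetical feature values
              [1500.0, 0.3],
              [2000.0, 0.2]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print("column means:", X_scaled.mean(axis=0))
print("column stds: ", X_scaled.std(axis=0))
```

In practice the scaler should be fitted on the training split only and then applied to the test split, so no information from the test set leaks into training.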

Phase 6- MODEL SELECTION

Supervised learning is the method in which the machine is trained on data for
which the inputs and outputs are well labelled. The model learns from the
training data and can process future data to predict outcomes. Supervised
techniques are grouped into regression and classification. A regression problem
is one where the result is a real or continuous value, such as "salary" or
"weight". A classification problem is one where the result is a category, such
as filtering emails into "spam" or "not spam". Unsupervised learning, in
contrast, gives the machine information that is neither classified nor labelled
and allows the algorithm to analyse the given information without any
directions.

In an unsupervised learning algorithm the machine is trained on data which is
not labelled or classified, so the algorithm works without explicit
instructions. In our dataset the outcome or dependent variable Y has only two
values, either M (malignant) or B (benign), so classification algorithms of
supervised learning are applied. We have chosen three different types of
classification algorithms in machine learning, starting from a small linear
model as a simple baseline.

Phase 7- PREDICTION

Machine learning is using data to answer questions, so prediction, or inference,
is the step where we get to answer some questions. This is the point of all this
work, where the value of machine learning is realized.

Chapter 6

RESULTS:-

The work was implemented on an i3 processor with 2.30 GHz speed, 2 GB RAM and
320 GB external storage, and all experiments on the classifiers described in
this report were conducted using libraries from the Anaconda machine learning
environment. In the experimental studies we have used a 70-30% partition for
training and testing. The Jupyter platform gives access to a collection of
machine learning algorithms for data pre-processing, classification, regression,
clustering and association rules, and these techniques are applied to a variety
of real-world problems. The results of the data analysis are reported below. To
apply our classifiers and evaluate them, we use the 10-fold cross-validation
test, a technique for evaluating predictive models that splits the original set
into a training sample to train the model and a test set to evaluate it. After
applying the pre-processing and preparation methods, we try to analyse the data
visually and figure out the distribution of values in terms of effectiveness and
efficiency.
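The 10-fold protocol described above can be sketched as follows (logistic regression and scikit-learn's bundled dataset are assumptions used for illustration, not the report's full experiment):

```python
# Illustrative sketch of 10-fold cross-validation: each fold serves once as
# the test set while the other nine folds train the model.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=10)
print("mean accuracy over 10 folds:", scores.mean())
```

Averaging over ten folds gives a more stable accuracy estimate than a single 70-30 split, since every sample is tested exactly once.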

We evaluate the effectiveness of all classifiers in terms of time to build the model, correctly
classified instances, incorrectly classified instances and accuracy.

Algorithm             Accuracy               Sensitivity       Specificity
Logistic Regression   0.9624413145539906     TP / (TP + FN)    TN / (FP + TN)
Decision Tree         1.0
Random Forest         0.9929577464788732     TP / (TP + FN)    TN / (FP + TN)

Accuracy = (TP + TN) / (TP + FP + TN + FN)

TP - True Positive
FP - False Positive
TN - True Negative
FN - False Negative

In order to better measure the performance of classifiers, simulation error is
also considered in this study. To do so, we evaluate the effectiveness of our
classifier in terms of:

• Kappa statistic (KS): a chance-corrected measure of agreement between the
classifications and the true classes
• Mean Absolute Error (MAE): how close forecasts or predictions are to the
eventual outcomes
• Root Mean Squared Error (RMSE)
• Relative Absolute Error (RAE)
• Root Relative Squared Error (RRSE)


EFFECTIVENESS

In this section, we evaluate the effectiveness of all classifiers in terms of time to build the
model, correctly classified instances, incorrectly classified instances and accuracy.

The ROC space is defined with the false positive rate and the true positive rate
as the x and y coordinates, respectively. The ROC curve summarizes performance
across all possible thresholds. The diagonal of the ROC graph can be interpreted
as random guessing, and classification models that fall below the diagonal are
considered worse than random guessing.
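A minimal sketch of computing the ROC curve and its area (AUC), assuming a logistic regression scorer on scikit-learn's bundled breast cancer data:

```python
# Illustrative sketch: roc_curve sweeps all thresholds, giving false positive
# rate (x-axis) and true positive rate (y-axis); AUC summarizes the curve,
# with 0.5 corresponding to random guessing and 1.0 to a perfect classifier.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, probs)
print("AUC:", roc_auc_score(y_test, probs))
```

Because the curve is built from predicted probabilities rather than hard labels, it captures classifier quality at every operating point, not just the default 0.5 threshold.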

Chapter 7
CONCLUSION AND FUTURE WORK:-

We can notice that SVM takes about 0.07 s to build its model, unlike k-NN, which
takes just 0.01 s. This can be explained by the fact that k-NN is a lazy learner
and does not do much during the training process, unlike the other classifiers,
which build models. On the other hand, the accuracy obtained by SVM (97.13%) is
better than the accuracy obtained by C4.5, Naïve Bayes and k-NN, which have
accuracies that vary between 95.12% and 95.28%. It can also be easily seen that
SVM has the highest number of correctly classified instances and the lowest
number of incorrectly classified instances among the classifiers.

After creating the prediction model, we can analyse the results obtained when
evaluating the efficiency of our algorithms. SVM and C4.5 got the highest value
(97%) of TP for the benign class, but k-NN correctly predicts 97% of instances
that belong to the malignant class. The FP rate is lowest when using the SVM
classifier (0.03 for the benign class and 0.02 for the malignant class),
followed by k-NN, C4.5 and NB. From these results, we can understand why SVM has
outperformed the other classifiers.

FUTURE WORK:-
The analysis of the results signifies that the integration of multidimensional
data along with different classification, feature selection and dimensionality
reduction techniques can provide promising tools for inference in this domain.
Further research in this field should be carried out to improve the performance
of the classification techniques so that they can predict on more variables. We
intend to investigate how to parametrize our classification techniques to
achieve high accuracy. We are looking into more datasets and into how further
machine learning algorithms can be used to characterize breast cancer. We want
to reduce the error rates with maximum accuracy.

Chapter 8
BIBLIOGRAPHY:-

[1] Mert A., Kilic N., Akan A., "Breast cancer classification by using support
vector machines with reduced dimension", ELMAR, IEEE, Sep 2011, pp. 37-40.

[2] Gou J., Du L., Zhang Y., Xiong T., "A new distance-weighted k-nearest
neighbor classifier", Journal of Information and Computational Science, June
2012, vol. 9(6), pp. 1429-1436.

[3] Octaviani T.L., Rustam Z., "Random forest for breast cancer prediction", AIP
Conference Proceedings, AIP Publishing, Nov 2019, vol. 2168.

[4] UCHealth, "How accurate are mammograms?", UCHealth, 2015, viewed 16 November
2019.

[5] Karabatak M., Ince M.C., "An expert system for detection of breast cancer
based on association rules and neural network", Expert Systems with
Applications, March 2009, vol. 36(2), pp. 3465-3469.

