Mini Project Report
ON
Breast Cancer Prediction
For the partial fulfillment for the award of the degree of
BACHELOR OF TECHNOLOGY
In
COMPUTER SCIENCE AND ENGINEERING
(Artificial Intelligence and Machine Learning)
Submitted By
Affiliated to
DR. APJ ABDUL KALAM TECHNICAL UNIVERSITY, LUCKNOW
2022-23
TABLE OF CONTENTS
1. Declaration
2. Certificate
3. Acknowledgement
4. Abstract
Chapters
1. Introduction
2. Related Works
3. Fundamentals
4. System Requirements and Specification
5. Proposed Methodology
6. Results
7. Conclusions and Future Scope
8. Bibliography
Declaration
We hereby declare that the project work presented in this report, "Breast Cancer
Prediction", in partial fulfillment of the requirement for the award of the degree
of Bachelor of Technology in Computer Science & Engineering, submitted to A.P.J.
Abdul Kalam Technical University, Lucknow, is based on our own work carried
out at the Department of Computer Science & Engineering, G.L. Bajaj Institute of
Technology & Management, Greater Noida. The work contained in this report is
original, and the project work reported here has not been submitted by us
for the award of any other degree or diploma.
Signature: Signature:
Signature:
Date:
Certificate
This is to certify that the project report "Breast Cancer Prediction", done by the
undersigned students, is their own work, and that the project work has not been
submitted earlier for the award of any degree or diploma.
Date:
Acknowledgement
The merciful guidance bestowed upon us by the Almighty helped us see this
project through to a successful end. We humbly pray with sincere hearts for His
guidance to continue forever.
We thank our project guide, Mrs. Anju Chandna, who gave us guidance and light
during this project. Her versatile knowledge helped us through the critical times
in the span of this project.
We pay special thanks to our Head of Department, Dr. Sansar Singh Chauhan,
who was always present to support and help us in every possible way during
this project.
We also take this opportunity to express our gratitude to all those people who
were directly or indirectly with us during the completion of the project.
We want to thank our friends, who always encouraged us during this project.
Last but not least, thanks to all the faculty of the CSE Department who provided
valuable suggestions during the period of the project.
Abstract
Breast cancer poses a serious threat to women, with high morbidity and mortality. The
lack of robust prognosis models makes it difficult for doctors to prepare a treatment plan
that may prolong patient survival time. Hence, there is a need for techniques that minimize
error and increase accuracy. In this report, four algorithms that predict breast cancer
outcomes, SVM, Logistic Regression, Random Forest and KNN, are compared using different
datasets. The experiments cover three domains: prediction of cancer before diagnosis,
prediction at diagnosis and treatment, and prediction of the outcome during treatment.
The proposed work can be used to compare the outcomes of the different techniques so that
a suitable technique can be chosen depending on the requirement. This research focuses on
prediction accuracy; future research can be carried out to predict other parameters.
Chapter 1
INTRODUCTION
Problem Definition
Breast cancer is the second leading cause of cancer death in women (after lung cancer).
An estimated 246,660 new cases of invasive breast cancer were expected to be diagnosed
among women in the US during 2016, with 40,450 deaths. Breast cancer is a type of cancer
that starts in the breast. Cancer starts when cells begin to grow out of control. Breast
cancer cells usually form a tumour that can often be seen on an x-ray or felt as a lump.
Breast cancer can spread when the cancer cells get into the blood or lymph system and are
carried to other parts of the body. The cause of Breast Cancer includes changes and mutations
in DNA. There are many different types of breast cancer and common ones include ductal
carcinoma in situ (DCIS) and invasive carcinoma. Others, like phyllodes tumours and
angiosarcoma are less common. There are many algorithms for classification of breast cancer
outcomes.
The side effects of breast cancer include fatigue, headaches, pain and numbness
(peripheral neuropathy), and bone loss and osteoporosis. There are many algorithms for
classification and prediction of breast cancer outcomes. The present report compares the
performance of four classifiers: SVM, Logistic Regression, Random Forest and KNN, which
are among the most influential data mining algorithms. Breast cancer can be detected early
during a screening examination through mammography or by a portable cancer diagnostic tool.
Cancerous breast tissues change with the progression of the disease, which can be directly
linked to cancer staging. The stage of breast cancer (I–IV) describes how far a patient’s cancer
has proliferated. Statistical indicators such as tumour size, lymph node metastasis, and distant
metastasis are used to determine the stage. To prevent the cancer from spreading, patients
may have to undergo surgery, chemotherapy, radiotherapy and endocrine therapy. The goal
of this research is to identify and classify malignant and benign tumours.
Typical cancer screening procedures are grounded in the "gold standard", which consists of
three tests: clinical evaluation, radiological imaging, and pathology testing [18]. This
traditional technique, which is based on regression, detects the existence of cancer,
whereas new ML techniques and algorithms are built on model creation.
In its training and testing stages, the model is meant to forecast unknown data and offers a
satisfactory predicted outcome [19]. Preprocessing, feature selection or extraction, and
classification are the three major methodologies used in machine learning [20]. The feature
extraction part of the machine learning method is crucial for cancer diagnosis and prediction.
This process can differentiate between benign and malignant tumours [21].
Chapter 2
RELATED WORK
The cause of breast cancer includes changes and mutations in DNA. Cancer starts
when cells begin to grow out of control. Breast cancer cells usually form a tumour that
can often be seen on an x-ray or felt as a lump. There are many different types of breast
cancer; common ones include ductal carcinoma in situ (DCIS) and invasive carcinoma.
Others, like phyllodes tumours and angiosarcoma, are less common.
Wang, Zhang and Huang (2018) [1] used Logistic Regression and achieved an accuracy of
96.4%. Akbugday et al. [2] performed classification on the Breast Cancer Dataset using
KNN and SVM and achieved an accuracy of 96.85%. Kaya Keles et al. [3], in the paper
titled "Breast Cancer Prediction and Detection Using Data Mining", used Random Forest
and achieved an accuracy of 92.2%. Vikas Chaurasia and Saurabh Pal [4] compared the
performance of supervised learning classifiers, such as Naïve Bayes, SVM with RBF kernel,
RBF neural networks, decision trees (J48) and simple CART, to find the best classifier on
breast cancer datasets. Dalen, Walker and Kadam [5] used AdaBoost and achieved an
accuracy of 97.5%, better than Random Forest. Kavitha et al. [6] used ensemble
methods with neural networks and achieved an accuracy of 96.3%, lower than previous
studies. Sinthia et al. [7] used a backpropagation method and achieved 94.2%
accuracy.
Experimental results show that the SVM with RBF kernel is more accurate than the other
classifiers; it scores an accuracy of 96.84% on the Wisconsin Breast Cancer (original)
dataset. We have used classification methods such as SVM, KNN, Random Forest, Naïve
Bayes and ANN. Prediction and prognosis of cancer development are focused on three major
domains: risk assessment or prediction of cancer susceptibility, prediction of cancer
relapse, and prediction of the cancer survival rate. The first domain comprises prediction
of the probability of developing a certain cancer prior to patient diagnosis.
The second issue is related to prediction of cancer recurrence in terms of diagnostics and
treatment, and the third case is aimed at prediction of several possible parameters
characterizing cancer development and treatment after the diagnosis of the disease:
survival time, life expectancy, progression, drug sensitivity, etc. The survivability rate
and the chance of cancer relapse depend very much on the medical treatment and the quality
of the diagnosis.
Chapter 3
Fundamentals
1. Logistic regression:
This type of statistical model (also known as logit model) is often used for classification
and predictive analytics. Logistic regression estimates the probability of an event
occurring, such as voted or didn’t vote, based on a given dataset of independent variables.
Since the outcome is a probability, the dependent variable is bounded between 0 and 1. In
logistic regression, a logit transformation is applied on the odds—that is, the probability of
success divided by the probability of failure. This is also commonly known as the log odds,
or the natural logarithm of the odds, and the logistic function is represented by the
following formula:

    logit(pi) = ln(pi / (1 - pi)) = β0 + β1·x

In this logistic regression equation, logit(pi) is the dependent or response variable and x is
the independent variable. The beta parameter, or coefficient, in this model is commonly
estimated via maximum likelihood estimation (MLE). This method tests different values
of beta through multiple iterations to optimize for the best fit of log odds.
All of these iterations produce the log likelihood function, and logistic regression seeks to
maximize this function to find the best parameter estimate. Once the optimal coefficient
(or coefficients if there is more than one independent variable) is found, the conditional
probabilities for each observation can be calculated, logged, and summed together to yield
a predicted probability. For binary classification, a probability less than 0.5 predicts
class 0, while a probability greater than 0.5 predicts class 1.
After the model has been computed, it is best practice to evaluate how well the model
predicts the dependent variable; this is called goodness of fit. The Hosmer–Lemeshow
test is a popular method to assess model fit.
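The steps above can be sketched in code. This is a minimal example, not the report's exact code: it uses the copy of the Wisconsin breast cancer data bundled with scikit-learn, so the exact accuracy may differ from the figures quoted elsewhere in this report.

```python
# Hedged sketch: logistic regression on the Wisconsin breast cancer data
# bundled with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# The solver performs the iterative maximum likelihood estimation described
# above; max_iter is raised so it can converge on unscaled features.
clf = LogisticRegression(max_iter=5000)
clf.fit(X_train, y_train)

# predict() applies the 0.5 probability threshold internally.
acc = accuracy_score(y_test, clf.predict(X_test))
print(f"Test accuracy: {acc:.3f}")
```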
2. Random Forest Classifier:
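The descriptive text for this section appears to be missing from this copy of the report. Briefly, a random forest is an ensemble of decision trees, each trained on a bootstrap sample of the data with a random subset of features considered at each split, whose votes are combined. A minimal sketch using scikit-learn's bundled Wisconsin dataset (an assumption; the report used a Kaggle/UCI download):

```python
# Hedged sketch: random forest classification on the Wisconsin data.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# 100 trees; each is fit on a bootstrap sample with random feature subsets,
# and the forest predicts by majority vote.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

acc = accuracy_score(y_test, rf.predict(X_test))
print(f"Random forest test accuracy: {acc:.3f}")
```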
3. Support Vector Machine
Support Vector Machine (SVM) is a supervised machine learning algorithm that performs
well on pattern recognition problems; it is used as a training algorithm for learning
classification and regression rules from data. SVM is most effective when the number of
features and the number of instances are both high. The SVM algorithm builds a binary
classifier. In an SVM model, each data item is represented as a point in an n-dimensional
space, where n is the number of features and each feature value is the value of a
coordinate in that space. A support vector machine works as follows: (1) first, it finds
the lines or boundaries that correctly classify the training dataset; (2) then, from those
lines or boundaries, it picks the one that has the maximum distance from the closest data
points.
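The two steps above can be sketched with scikit-learn's RBF-kernel SVM. The scaling step and dataset copy are assumptions of this sketch, not the report's exact setup; scaling is added because SVMs compare distances between points:

```python
# Hedged sketch: RBF-kernel SVM with feature scaling.
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Scale first (SVMs are distance-based), then fit the maximum-margin boundary.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
svm.fit(X_train, y_train)

acc = accuracy_score(y_test, svm.predict(X_test))
print(f"SVM test accuracy: {acc:.3f}")
```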
Chapter 4
System Requirements and Specifications
1. Hardware Requirements: Intel i3 processor (2.30 GHz), 2 GB RAM, 320 GB external
storage (the configuration used for the experiments in Chapter 6).

2. Software Requirements: Python under the Anaconda distribution with Jupyter notebooks,
and the pandas and scikit-learn libraries, imported as follows:
import pandas as pd
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
Chapter 5
PROPOSED METHODOLOGY
[Flowchart: Data Processing → Data Preparation → Feature Scaling]

Phase 1- DATA PROCESSING
In the first phase we collect the data of interest for pre-processing, in order to apply
classification and regression methods. Data pre-processing is a data mining technique that
involves transforming raw data into an understandable format. Real-world data is often
incomplete and inconsistent, and is likely to contain many errors; data pre-processing is
a proven method of resolving such issues and prepares raw data for further processing.
For pre-processing we have used the standardization method on the UCI dataset. This step
is very important because the quality and quantity of the data gathered directly determine
how good the predictive model can be. In this case we collect the breast cancer samples,
which are benign and malignant. This is our training data.
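As an illustration of this phase (assuming scikit-learn's bundled copy of the Wisconsin Diagnostic dataset rather than the Kaggle/UCI download used in the report), the samples can be loaded and inspected as follows:

```python
# Hedged sketch: load the Wisconsin Diagnostic samples into a DataFrame.
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["diagnosis"] = data.target  # in this copy, 0 = malignant, 1 = benign

print(df.shape)  # (569, 31): 569 samples, 30 features + diagnosis
print(df["diagnosis"].value_counts())  # 357 benign vs 212 malignant
```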
Phase 2- DATA PREPARATION
Data Preparation, where we load our data into a suitable place and prepare it for use in our
machine learning training. We’ll first put all our data together, and then randomize the
ordering.
In machine learning and statistics, feature selection, also known as variable selection or
attribute selection, is the process of selecting a subset of relevant features for use in
model construction. We used the Breast Cancer Wisconsin (Diagnostic) data set from the
Kaggle repository, and out of 31 parameters we selected 8-9. Our target parameter is the
breast cancer diagnosis: malignant or benign. We used the Wrapper Method for feature
selection. The important features found by the study are: concave points worst, area
worst, area se, texture worst, texture mean, smoothness worst, smoothness mean, radius
mean, and symmetry mean.
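The report does not state which wrapper method was used; recursive feature elimination (RFE) is one common wrapper-style approach, sketched below on scikit-learn's bundled copy of the dataset. The features it selects may not match the report's list exactly:

```python
# Hedged sketch: RFE as a wrapper-style feature selection method.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = StandardScaler().fit_transform(data.data)

# Repeatedly fit the model and drop the weakest feature until 9 remain.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=9)
selector.fit(X, data.target)

chosen = [name for name, keep in zip(data.feature_names, selector.support_) if keep]
print(chosen)
```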
Attribute Information: 1) ID number; 2) Diagnosis (M = malignant, B = benign);
3-32) thirty real-valued features computed for each cell nucleus (the mean, standard
error and worst value of ten measurements).
Most of the time, a dataset will contain features that vary widely in magnitude, units and
range. Since most machine learning algorithms use the Euclidean distance between two data
points in their computations, we need to bring all features to the same level of
magnitude. This can be achieved by scaling.
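The standardization mentioned above can be sketched with scikit-learn's StandardScaler (using the bundled Wisconsin data as an assumed stand-in for the report's dataset):

```python
# Hedged sketch: standardization rescales every feature to zero mean and
# unit variance, so no single feature dominates distance computations.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

print(X_std.mean(axis=0).round(6)[:3])  # each column mean is ~0
print(X_std.std(axis=0).round(6)[:3])   # each column std is ~1
```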
Supervised learning is the method in which the machine is trained on data for which the
inputs and outputs are well labelled. The model learns from the training data and can then
process future data to predict the outcome. Supervised techniques are grouped into
regression and classification. A regression problem is one where the result is a real or
continuous value, such as "salary" or "weight". A classification problem is one where the
result is a category, such as filtering emails into "spam" or "not spam". Unsupervised
learning gives the machine information that is neither classified nor labelled and allows
the algorithm to analyse the given information without any directions; the machine is
trained on unlabelled data and must work without explicit instructions. In our dataset the
outcome (dependent) variable Y has only two values, either M (malignant) or B (benign),
so a classification algorithm of supervised learning is applied. We have chosen several
classification algorithms, starting from a simple linear model.
Phase 7- PREDICTION
Machine learning is using data to answer questions. Prediction, or inference, is the step
where we get to answer those questions. This is the point of all the preceding work, where
the value of machine learning is realized.
Chapter 6
RESULTS:-
The work was implemented on an i3 processor with 2.30 GHz speed, 2 GB RAM and 320 GB
external storage, and all experiments on the classifiers described in this report were
conducted using libraries from the Anaconda machine learning environment. In the
experimental studies we used a 70-30% partition for training and testing. The scikit-learn
library, used from Jupyter notebooks, contains a collection of machine learning algorithms
for data pre-processing, classification, regression, clustering and association rules,
and these techniques are applied to a variety of real-world problems. To apply our
classifiers and evaluate them, we use the 10-fold cross-validation test, a technique for
evaluating predictive models that splits the original set into a training sample to train
the model and a test set to evaluate it. After applying the pre-processing and preparation
methods, we analyse the data visually and examine the distribution of values in terms of
effectiveness and efficiency.
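The 10-fold cross-validation procedure can be sketched as follows. The SVM-with-scaling pipeline and bundled dataset are assumptions of this sketch, not the report's exact setup:

```python
# Hedged sketch: 10-fold cross-validation of a scaled RBF-kernel SVM.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# cross_val_score splits the data into 10 folds, trains on 9 and tests
# on the held-out fold, rotating through all ten folds.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scores = cross_val_score(model, X, y, cv=10)
print(f"Mean 10-fold accuracy: {scores.mean():.3f}")
```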
We evaluate the effectiveness of all classifiers in terms of time to build the model, correctly
classified instances, incorrectly classified instances and accuracy.
[Results table omitted in this copy; the surviving entry reads "Decision Tree: 1.0"]
In order to better measure the performance of the classifiers, simulation error is also
considered in this study. To do so, we evaluate the effectiveness of each classifier in
terms of:
- Kappa statistic (KS): a chance-corrected measure of agreement between the
  classifications and the true classes
- Mean Absolute Error (MAE): how close predictions are to the eventual outcomes
- Root Mean Squared Error (RMSE)
- Relative Absolute Error (RAE)
- Root Relative Squared Error (RRSE)
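The first three of these metrics can be computed with scikit-learn as sketched below (logistic regression on the bundled dataset is an assumed stand-in for the report's classifiers; for 0/1 predictions the MAE equals the error rate):

```python
# Hedged sketch: kappa, MAE and RMSE for a binary classifier.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (cohen_kappa_score, mean_absolute_error,
                             mean_squared_error)
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

pred = LogisticRegression(max_iter=5000).fit(X_train, y_train).predict(X_test)

kappa = cohen_kappa_score(y_test, pred)           # chance-corrected agreement
mae = mean_absolute_error(y_test, pred)           # for 0/1 labels: error rate
rmse = np.sqrt(mean_squared_error(y_test, pred))  # sqrt of the error rate
print(f"KS={kappa:.3f}  MAE={mae:.3f}  RMSE={rmse:.3f}")
```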
EFFECTIVENESS
In this section, we evaluate the effectiveness of all classifiers in terms of time to build the
model, correctly classified instances, incorrectly classified instances and accuracy.
The ROC space is defined with the false positive rate and the true positive rate as the
x and y coordinates, respectively. The ROC curve summarizes performance across all
possible thresholds. The diagonal of the ROC graph can be interpreted as random guessing,
and classification models that fall below the diagonal are considered worse than random
guessing.
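An ROC analysis along these lines can be sketched as follows (logistic regression on the bundled dataset is an assumption of this sketch):

```python
# Hedged sketch: ROC curve and AUC for a probabilistic classifier.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

# roc_curve sweeps the threshold: fpr on the x axis, tpr on the y axis.
fpr, tpr, thresholds = roc_curve(y_test, probs)
auc = roc_auc_score(y_test, probs)  # 1.0 is perfect, 0.5 is random guessing
print(f"AUC: {auc:.3f}")
```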
Chapter 7
CONCLUSION AND FUTURE WORK:-
We can note that SVM takes about 0.07 s to build its model, unlike k-NN, which takes just
0.01 s. This can be explained by the fact that k-NN is a lazy learner and does not do much
during the training process, unlike the other classifiers, which build models. On the
other hand, the accuracy obtained by SVM (97.13%) is better than the accuracy obtained by
C4.5, Naïve Bayes and k-NN, which varies between 95.12% and 95.28%. It can also easily be
seen that SVM has the highest number of correctly classified instances and the lowest
number of incorrectly classified instances among the classifiers.
After creating the prediction model, we can analyse the results obtained when evaluating
the efficiency of our algorithms. SVM and C4.5 achieved the highest true positive rate
(97%) for the benign class, but k-NN correctly predicted 97% of the instances belonging to
the malignant class. The false positive rate is lowest when using the SVM classifier
(0.03 for the benign class and 0.02 for the malignant class); the other algorithms follow:
k-NN, C4.5 and NB. From these results, we can understand why SVM has outperformed the
other classifiers.
FUTURE WORK:-
The analysis of the results signifies that the integration of multidimensional data with
different classification, feature selection and dimensionality reduction techniques can
provide promising tools for inference in this domain. Further research in this field
should be carried out to improve the performance of the classification techniques so that
they can predict on more variables. We intend to study how to parametrize our
classification techniques to achieve higher accuracy. We are looking into more datasets
and into how further machine learning algorithms can be used to characterize breast
cancer. We want to reduce the error rates while maximizing accuracy.
Chapter 8
BIBLIOGRAPHY:-
[1] Mert A., Kilic N., Akan A., "Breast cancer classification by using support vector
machines with reduced dimension", ELMAR, IEEE, 14 Sep 2011, pp. 37-40.
[3] Octaviani T.L., Rustam Z., "Random forest for breast cancer prediction", AIP
Conference Proceedings, AIP Publishing, 4 Nov 2019, vol. 2168.
[5] Karabatak M., Ince M.C., "An expert system for detection of breast cancer based on
association rules and neural network", Expert Systems with Applications, March 2009,
vol. 36(2), pp. 3465-3469.