Prediction of Breast Cancer Using Supervised Machine Learning Techniques
Abstract: Breast cancer is the most frequently diagnosed cancer among women and a major reason for the increasing mortality rate among women. Because manual diagnosis of this disease takes long hours and few diagnostic systems are available, there is a need to develop an automatic diagnosis system for early detection of cancer. Data mining techniques contribute a lot to the development of such a system. For the classification of benign and malignant tumors we have used classification techniques of machine learning, in which the machine learns from past data and can predict the category of new input. This paper is a comparative study of models implemented using Logistic Regression, Support Vector Machine (SVM) and K Nearest Neighbor (KNN) on the dataset taken from the UCI repository. The efficiency of each algorithm is measured and compared with respect to accuracy, precision, sensitivity, specificity and False Positive Rate. These techniques are coded in Python and executed in Spyder, the Scientific Python Development Environment. Our experiments have shown that SVM is the best for predictive analysis, with an accuracy of 92.7%. We infer from our study that SVM is the best-suited algorithm for prediction and that, on the whole, KNN performed well, next to SVM.

Keywords: Classification, Logistic Regression, KNN, SVM.

Revised Manuscript Received on April 07, 2019.
Kuthuru Pravalika, Information Technology, Sreenidhi Institute of Science and Technology, Hyderabad, Telangana, India.
Chakinam Shravya, Information Technology, Sreenidhi Institute of Science and Technology, Hyderabad, Telangana, India.
Dr. Shaik Subhani, Information Technology, Sreenidhi Institute of Science and Technology, Hyderabad, Telangana, India.
Published By: Blue Eyes Intelligence Engineering & Sciences Publication. Retrieval Number: F3384048619/19©BEIESP.

I. INTRODUCTION

Breast cancer is the prime cause of death among women. It is the second most dangerous cancer after lung cancer. According to the statistics provided by the World Cancer Research Fund, it is estimated that over 2 million new cases were recorded in 2018, with approximately 626,679 deaths. Of all cancers, breast cancer accounts for 11.6% of new cancer cases and 24.2% of cancers among women. When any sign or symptom appears, people usually visit a doctor immediately, who may refer them to an oncologist if required. The oncologist can diagnose breast cancer by taking a thorough medical history and by physically examining both breasts, also checking for swelling or hardening of any lymph nodes in the armpit.

A. Imaging tests:

Mammogram (X-ray of the breast), magnetic resonance imaging (MRI) of the breast, and ultrasound of the breast. Tissue biopsy: removal of breast tissue for examination by a pathologist. Sentinel node biopsy: once breast cancer is confirmed, patients regularly undergo sentinel node biopsy. This helps to detect cancerous cells in the lymph nodes and so confirm metastasis of breast cancer into the lymphatic system. If required, the oncologist may also order additional tests or procedures.

In the conventional way of diagnosing breast cancer, some tests and procedures are carried out. These tests include breast exam, mammogram, breast ultrasound and biopsy. As an alternative, we can also use machine learning techniques for the classification of benign and malignant tumors. Early diagnosis of breast cancer can notably improve the prognosis and survival rate [1], so that patients can be informed to take clinical treatment at the right time. Classification of benign tumors can help patients avoid undergoing needless treatments. Thus research is to be carried out for the proper diagnosis of breast cancer and the categorization of patients into malignant and benign groups. Machine learning, with its advances in detecting critical features from complex datasets, is widely acknowledged as a method for the prediction of breast cancer. Application of data mining techniques in the medical field can help in predicting outcomes, minimizing the cost of medicines, aiding people's health, upgrading healthcare value and saving lives. The process of classifying benign and malignant tumors is best done by applying the classification techniques of machine learning. A lot of research is being conducted in this area by applying various machine learning and data mining techniques to many different breast cancer datasets. Most of this work shows that classification techniques give good accuracy in predicting the type of tumor.

II. RELATED WORK

Alireza Osarech and Bita Shadgar used the SVM classification technique on two different benchmark datasets for breast cancer, obtaining accuracies of 98.80% and 96.63% [2]. Mandeep Rana, Pooja Chandorkar and Alishiba Dsouza worked on the diagnosis and the prediction of recurrence of breast cancer by applying KNN, SVM, Naive Bayes and Logistic Regression techniques, programmed in MATLAB. The classification techniques were applied to two datasets taken from the UCI repository. One of them (WDBC) was used for identification of the disease and the other (WPBC) for recurrence prediction [3]. Vikas Chaurasia, BB Tiwari and Saurabh Pal used three well-known algorithms, J48, Naive Bayes and RBF, to build predictive models for breast cancer prediction and compared their accuracy. The results showed that Naive Bayes predicted best among them, with an accuracy of 97.36% [4]. Haifeng Wang and Sang Won Yoon compared the Naive Bayes Classifier, Support Vector Machine (SVM), AdaBoost tree and Artificial Neural Networks (ANN) to find a powerful model for breast cancer prediction. They implemented PCA for dimensionality reduction [5]. S. Kharya worked on breast cancer prediction and stated that artificial neural networks are widely used. The paper discussed the advantages and shortcomings of using machine learning methods like SVM, Naive Bayes, neural networks and decision trees [6]. Naresh Khuriwal and Nidhi Mishra took data from the Wisconsin Breast Cancer database and worked on breast cancer diagnosis. The results of their experiments showed that ANN and the logistic algorithm worked better and provided a good solution, achieving an accuracy of 98.50% [7].

III. METHODOLOGY

We obtained the breast cancer dataset from the UCI repository and used Spyder as the platform for coding. Our methodology involves the use of classification techniques, namely Support Vector Machine (SVM), K-Nearest Neighbor (K-NN) and Logistic Regression, together with a dimensionality reduction technique, Principal Component Analysis (PCA).

A. Dimensionality Reduction

Dimensionality reduction is a process in which the number of independent variables is reduced to a set of principal variables by removing those which are less significant in predicting the outcome.
Dimensionality reduction is used to obtain two-dimensional data so that machine learning models can be visualized better, by plotting the prediction regions and the prediction boundary for each model. Whatever the number of independent variables, we can often end up with two independent variables by applying a suitable dimensionality reduction technique.
There are two methods, namely feature selection and feature projection (also called feature extraction).

B. Feature Selection

Feature selection is finding a subset of the original features by different approaches based on the information they provide, accuracy and prediction errors.

C. Feature Projection

Feature projection is the transformation of data in a high-dimensional space to a lower-dimensional space (with few attributes). Both linear and nonlinear reduction techniques can be used, in accordance with the type of relationships among the features in the dataset.
The dataset used in this research is a multidimensional dataset with 32 attributes, which are related to cell parameters. Selecting features by applying feature selection is a complex task. Moreover, it cannot give the most accurate features. Therefore we have applied a feature projection technique, PCA, to derive two principal components from the dataset.

D. Principal Component Analysis (PCA)

PCA reduces data from many dimensions to 2 or 3 dimensions. It is used when we need to tackle the curse of dimensionality among data with linear relationships.
It is a linear technique which is used to compress a lot of data into something that captures the essence of the original data. Based on the variance of the data, it maps the actual data into a space with fewer attributes such that the variance is maximized. PCA extracts p independent variables from the n independent variables of our dataset (p<=n) that explain the most variance of our dataset, regardless of the dependent variable. The eigenvectors are calculated from the covariance matrix of the dataset. The principal components are those eigenvectors which have the largest eigenvalues, and these can be used to reconstruct a large portion of the variance of the actual data. These few eigenvectors (with the most important variance) span a smaller space, reducing the original space. But this process may cause some data loss, so we should make sure that the retained eigenvectors preserve the most important variance. The variances of the individual principal components sum to the total variance. The explained variance ratio of each principal component is the ratio of the variance of that component to the total variance. Applying PCA gives us two principal components, PC1 (the first principal component) and PC2 (the second principal component). PC1 explains the most variance and PC2 explains the second most variance. Now our dataset is ready, and data mining techniques can be applied to it for the classification of benign and malignant tumors.

E. Model Selection

The most exciting phase in building any machine learning model is the selection of the algorithm. We can apply more than one kind of data mining technique to large datasets. But, at a high level, all those different algorithms can be classified into two groups: supervised learning and unsupervised learning.
Supervised learning is the method in which the machine is trained on data in which the input and output are well labeled. The model learns from the training data and can process future data to predict the outcome. Supervised techniques are grouped into regression and classification techniques.
A regression problem is when the result is a real or continuous value, such as "salary" or "weight".
A classification problem is when the result is a category, such as filtering emails into "spam" or "not spam".
Unsupervised learning: Unsupervised learning is giving the machine information that is neither classified nor labeled and allowing the algorithm to analyze the given information without providing any directions. In an unsupervised learning algorithm the machine is trained on data which is not labeled or classified, so the algorithm has to work without explicit instructions.
In our dataset the outcome variable, or dependent variable, Y has only two values, either M (Malignant) or B (Benign). So a classification algorithm of supervised learning is applied to it. We have chosen three different types of classification algorithms in machine learning.
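The PCA steps described in this paper (covariance matrix, eigenvectors with the largest eigenvalues, and the share of total variance carried by each component) can be sketched in plain Python. This is only an illustrative sketch, not the authors' code: the paper does not say which library was used, the function names `covariance_matrix`, `top_eigenpair` and `pca` are hypothetical, the eigenvectors are found by power iteration with deflation as one simple option, and the `samples` values are made-up toy numbers rather than rows of the UCI dataset.

```python
import math


def covariance_matrix(rows):
    """Return (sample covariance matrix, mean-centred rows)."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centred = [[row[j] - means[j] for j in range(d)] for row in rows]
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
           for i in range(d)]
    return cov, centred


def top_eigenpair(matrix, iterations=500):
    """Power iteration: largest eigenvalue and unit eigenvector of a symmetric matrix."""
    d = len(matrix)
    vector = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vector[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            break
        vector = [x / norm for x in product]
    product = [sum(matrix[i][j] * vector[j] for j in range(d)) for i in range(d)]
    eigenvalue = sum(v * p for v, p in zip(vector, product))  # Rayleigh quotient
    return eigenvalue, vector


def pca(rows, n_components=2):
    """Project rows onto the eigenvectors with the largest eigenvalues."""
    cov, centred = covariance_matrix(rows)
    d = len(cov)
    total_variance = sum(cov[i][i] for i in range(d))
    components, ratios = [], []
    for _ in range(n_components):
        eigenvalue, vector = top_eigenpair(cov)
        components.append(vector)
        ratios.append(eigenvalue / total_variance)  # explained variance ratio
        # Deflation: subtract the direction just found so the next pass
        # converges to the eigenvector with the next-largest eigenvalue.
        cov = [[cov[i][j] - eigenvalue * vector[i] * vector[j] for j in range(d)]
               for i in range(d)]
    projected = [[sum(c[j] * comp[j] for j in range(d)) for comp in components]
                 for c in centred]
    return projected, ratios


# Made-up toy measurements (3 features per sample), for illustration only.
samples = [
    [2.0, 4.1, 1.0],
    [3.0, 6.2, 0.5],
    [4.0, 7.9, 1.5],
    [5.0, 10.1, 0.8],
    [6.0, 12.0, 1.2],
    [7.0, 13.8, 1.1],
]

projected, ratios = pca(samples, n_components=2)
print("explained variance ratios (PC1, PC2):", [round(r, 4) for r in ratios])
```

Here `projected` holds the two-dimensional representation (PC1, PC2) of each sample, which is the form of data a classifier would then be trained on. In practice a numerical library routine for eigendecomposition would normally replace the hand-written power iteration.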
International Journal of Innovative Technology and Exploring Engineering (IJITEE)
ISSN: 2278-3075, Volume-8 Issue-6, April 2019
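The performance measures named in the abstract (accuracy, precision, sensitivity, specificity and False Positive Rate) can all be derived from the four counts of a binary confusion matrix. The sketch below uses the standard definitions and assumes that malignant ("M") is treated as the positive class. The function name `binary_metrics` is hypothetical and the label lists are made-up toy values, not results from the paper.

```python
def binary_metrics(y_true, y_pred, positive="M"):
    """Compute standard binary classification measures from label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    def ratio(numerator, denominator):
        # Avoid division by zero when a class is absent.
        return numerator / denominator if denominator else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": ratio(tp, tp + fp),
        "sensitivity": ratio(tp, tp + fn),          # true positive rate (recall)
        "specificity": ratio(tn, tn + fp),          # true negative rate
        "false_positive_rate": ratio(fp, fp + tn),  # equals 1 - specificity
    }


# Made-up labels for illustration only: M = malignant, B = benign.
y_true = ["M", "M", "M", "B", "B", "B", "B", "M"]
y_pred = ["M", "M", "B", "B", "B", "M", "B", "M"]

metrics = binary_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With these toy labels there are 3 true positives, 3 true negatives, 1 false positive and 1 false negative, so every measure except the False Positive Rate evaluates to 0.75 and the False Positive Rate to 0.25.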
Table 1. Comparison of the performances of various algorithms

V. CONCLUSION

Our work mainly focused on the development of predictive models to achieve good accuracy in predicting valid disease outcomes using supervised machine learning methods. The analysis of the results signifies that the integration of multidimensional data along with different classification, feature selection and dimensionality reduction techniques can provide promising tools for inference in this domain. Further research in this field should be carried out to improve the performance of the classification techniques so that they can predict on more variables.

ACKNOWLEDGMENT

We would like to thank our Research Guide Dr. Shaik Subhani, Associate Professor in Information Technology, Sreenidhi Institute of Science and Technology, Hyderabad, for his continuous support and valuable suggestions throughout this work. The authors are also grateful to the reviewer for carefully going through the manuscript and giving valuable suggestions for its improvement. We would also like to thank the Department of Information Technology, Sreenidhi Institute of Science and Technology, Hyderabad, for providing us with the facility for carrying out the simulations.

REFERENCES

1. Yi-Sheng Sun, Zhao Zhao, Han-Ping-Zhu, "Risk factors and Preventions of Breast Cancer", International Journal of Biological Sciences.
2. Alireza Osarech, Bita Shadgar, "A Computer Aided Diagnosis System for Breast Cancer", International Journal of Computer Science Issues, Vol. 8, Issue 2, March 2011.
3. Mandeep Rana, Pooja Chandorkar, Alishiba Dsouza, "Breast cancer diagnosis and recurrence prediction using machine learning techniques", International Journal of Research in Engineering and Technology, Volume 04, Issue 04, April 2015.
4. Vikas Chaurasia, BB Tiwari and Saurabh Pal, "Prediction of benign and malignant breast cancer using data mining techniques", Journal of Algorithms and Computational Technology.
5. Haifeng Wang and Sang Won Yoon, "Breast Cancer Prediction using Data Mining Method", IEEE Conference paper.
6. D. Dubey, S. Kharya, S. Soni, "Predictive Machine Learning techniques for Breast Cancer Detection", International Journal of Computer Science and Information Technologies, Vol. 4(6), 2013, 1023-1028.
7. Nidhi Mishra, Naresh Khuriwal, "Breast cancer diagnosis using adaptive voting ensemble machine learning algorithm", 2018 IEEMA Engineer Infinite Conference (eTechNxT), 2018.
8. Chao-Ying Joanne Peng, Kuk Lida Lee, Gary M. Ingersoll, "An Introduction to Logistic Regression Analysis and Reporting", September/October 2002 [Vol. 96(No. 1)].
9. "Logistic Regression for Machine Learning", Machine Learning Mastery, https://machinelearningmastery.com/logistic-regression-for-machine-learning/
10. In Jae Myung, "Maximum Likelihood Estimation".
11. Onel Harrison, "Machine Learning Basics with the K-Nearest Neighbors Algorithm".
12. Mohammad Bolandraftar and Sadegh Bafandeh Imandoust, "Application of K-Nearest Neighbor (KNN) Approach for Predicting Economic Events: Theoretical Background", International Journal of Engineering Research and Applications, Vol. 3, Issue 5, Sep-Oct 2013.
13. Ebrahim Edriss Ebrahim Ali, Wu Zhi Feng, "Breast Cancer Classification using Support Vector Machine and Neural Network", International Journal of Science and Research (IJSR), Volume 5, Issue 3, March 2016.

AUTHORS PROFILE

Chakinam Shravya, Information Technology, Sreenidhi Institute of Science and Technology, Hyderabad, Telangana, India.

Kuthuru Pravalika, Information Technology, Sreenidhi Institute of Science and Technology, Hyderabad, Telangana, India.

Dr. Shaik Subhani, Information Technology, Sreenidhi Institute of Science and Technology, Hyderabad, Telangana, India. He received his Bachelor of Technology (B.Tech) degree from Andhra University, Visakhapatnam, and his M.Tech from JNTUH, Hyderabad. His research area is Image Processing and Data Mining. He received his Ph.D. from Acharya Nagarjuna University, Guntur. His research interests are Data Mining, Computer Networks, Cloud Computing, Machine Learning and Soft Computing techniques. He has published many research papers in national and international conferences and journals.