
International Journal of Innovative Technology and Exploring Engineering (IJITEE)

ISSN: 2278-3075, Volume-8 Issue-6, April 2019

Prediction of Breast Cancer Using Supervised


Machine Learning Techniques
Ch. Shravya, K. Pravalika, Shaik Subhani

Abstract: Breast Cancer is the most often identified cancer among women and a major reason for the increasing mortality rate among women. As the manual diagnosis of this disease takes long hours and fewer diagnostic systems are available, there is a need to develop an automatic diagnosis system for early detection of cancer. Data mining techniques contribute a lot to the development of such a system. For the classification of benign and malignant tumors we have used classification techniques of machine learning, in which the machine learns from past data and can predict the category of a new input. This paper is a comparative study on the implementation of models using Logistic Regression, Support Vector Machine (SVM) and K Nearest Neighbor (KNN) on the dataset taken from the UCI repository. The efficiency of each algorithm is measured and compared with respect to accuracy, precision, sensitivity, specificity and False Positive Rate. These techniques are coded in Python and executed in Spyder, the Scientific Python Development Environment. Our experiments have shown that SVM is the best for predictive analysis, with an accuracy of 92.7%. We infer from our study that SVM is the best suited algorithm for prediction, and on the whole KNN performed next best to SVM.

Keywords— Classification, Logistic Regression, KNN, SVM.

I. INTRODUCTION

Breast Cancer is a prime reason for the demise of women. It is the second most dangerous cancer after lung cancer. According to the statistics provided by the World Cancer Research Fund for the year 2018, over 2 million new cases were recorded, out of which 626,679 deaths were approximated. Of all the cancers, breast cancer constitutes 11.6% of new cancer cases and makes up 24.2% of cancers among women. In case of any sign or symptom, people usually visit a doctor immediately, who may refer them to an oncologist if required. The oncologist can diagnose breast cancer by undertaking a thorough medical history and a physical examination of both breasts, also checking for swelling or hardening of any lymph nodes in the armpit.

A. Imaging tests:
Mammogram, Magnetic resonance imaging (MRI) of the breast, Ultrasound of the breast, X-ray of the breast. Tissue biopsy: removal of tissue from the breast for examination by a pathologist. Sentinel node biopsy: once breast cancer is confirmed, patients regularly undergo sentinel node biopsy. This helps to detect cancerous cells in lymph nodes to confirm metastasis of breast cancer into the lymphatic system. If required, the oncologist may also order additional tests or procedures. In the conventional way of diagnosing breast cancer, tests and procedures such as a breast exam, mammogram, breast ultrasound and biopsy are carried out. As an alternative, we can also use Machine Learning techniques for the classification of benign and malignant tumors. The early diagnosis of breast cancer can enhance the prediction and survival rate notably [1], so that patients can be informed to take clinical treatment at the right time. Classification of benign tumors can help patients avoid undertaking needless treatments. Thus, research is to be carried out for the proper diagnosis of breast cancer and the categorization of patients into malignant and benign groups. Machine Learning, with its advancements in the detection of critical features from complex datasets, is largely acknowledged as a method of choice in the prediction of breast cancer. The application of data mining techniques in the medical field can help predict outcomes, minimize the cost of medicines, aid people's health, upgrade healthcare value and rescue people's lives. The process of classifying benign and malignant tumors can best be done by the application of classification techniques of machine learning. A lot of research is being conducted in this area by applying various machine learning and data mining techniques to many different breast cancer datasets. Most of them show that classification techniques give good accuracy in predicting the type of tumor.

II. RELATED WORK

Alireza Osareh and Bita Shadgar used the SVM classification technique on two different benchmark datasets for breast cancer, achieving 98.80% and 96.63% accuracies [2]. Mandeep Rana, Pooja Chandorkar and Alishiba Dsouza worked on the diagnosis and the prediction of recurrence of breast cancer by applying KNN, SVM, Naïve Bayes and Logistic Regression techniques, programmed in MATLAB. The classification techniques were applied to two datasets taken from the UCI repository: one is used for identification of the disease (WDBC) and the other for recurrence prediction (WPBC) [3]. Vikas Chaurasia, BB Tiwari and Saurabh Pal used three well-known algorithms, J48, Naive Bayes and RBF, to build predictive models for breast cancer prediction and compared their accuracy. The results showed that Naive Bayes predicted best among them, with an accuracy of 97.36% [4]. Haifeng Wang and Sang Won Yoon compared the Naive Bayes Classifier, Support Vector Machine (SVM), AdaBoost tree and Artificial Neural Networks (ANN) to find a powerful model for breast cancer prediction.

Revised Manuscript Received on April 07, 2019.
Kuthuru Pravalika, Information Technology, Sreenidhi Institute of Science and Technology, Hyderabad, Telangana, India.
Chakinam Shravya, Information Technology, Sreenidhi Institute of Science and Technology, Hyderabad, Telangana, India.
Dr. Shaik Subhani, Information Technology, Sreenidhi Institute of Science and Technology, Hyderabad, Telangana, India.

Published By:
Retrieval Number: F3384048619/19©BEIESP Blue Eyes Intelligence Engineering
1106 & Sciences Publication

They implemented PCA for dimensionality reduction [5]. S. Kharya worked on breast cancer prediction and stated that artificial neural networks are widely used; the paper featured the advantages and shortcomings of using machine learning methods like SVM, Naive Bayes, Neural Networks and Decision Trees [6]. Naresh Khuriwal and Nidhi Mishra took data from the Wisconsin Breast Cancer database and worked on breast cancer diagnosis. The results of their experiments proved that ANN and the Logistic Algorithm worked better and provided a good solution, achieving an accuracy of 98.50% [7].

III. METHODOLOGY

We obtained the breast cancer dataset from the UCI repository and used Spyder as the platform for coding. Our methodology involves the use of classification techniques such as Support Vector Machine (SVM), K-Nearest Neighbor (K-NN) and Logistic Regression, together with a dimensionality reduction technique, Principal Component Analysis (PCA).

A. Dimensionality Reduction
Dimensionality reduction is a process in which the number of independent variables is reduced to a set of principal variables by removing those which are less significant in predicting the outcome. Dimensionality reduction is used here to obtain two-dimensional data so that the machine learning models can be better visualized by plotting the prediction regions and the prediction boundary for each model. Whatever the number of independent variables, we often end up with two independent variables by applying a suitable dimensionality reduction technique. There are two methods, namely feature selection and feature extraction.

B. Feature Selection
Feature selection is finding a subset of the original features by different approaches, based on the information the features provide, their accuracy and their prediction errors.

C. Feature Projection
Feature projection is the transformation of data from a high-dimensional space to a lower-dimensional space (with fewer attributes). Both linear and nonlinear reduction techniques can be used in accordance with the type of relationships among the features in the dataset. The dataset used in this research is a multidimensional dataset with 32 attributes, which relate to cell parameters. Selecting features by the application of feature selection is a complex task; moreover, it cannot give the most accurate features. Therefore we have applied a feature projection technique, PCA, to derive two principal components from the dataset.

D. Principal Component Analysis (PCA)
PCA is an unsupervised linear dimensionality reduction algorithm used to find the strongest features based on the covariance matrix of the dataset. It flattens a large number of dimensions to 2 or 3 dimensions, and it is used when we need to tackle the curse of dimensionality in data with linear relationships. It is a linear technique which compresses a lot of data into something that gives the essence of the original data. Based on the variance of the data, it projects the actual data into a space with fewer attributes such that the retained variance is maximized. PCA extracts p independent variables from the n independent variables of our dataset (p <= n) that explain the most variance of the dataset, irrespective of the dependent variable. With the help of the covariance matrix of the dataset, the eigenvectors are calculated. The principal components are those eigenvectors which have the largest eigenvalues, and these can be used to rebuild a huge portion of the variance of the actual data. These few eigenvectors (carrying the most important variance) span a smaller space, reducing the original space, but this process may cause some data loss, so we should make sure that the retained eigenvectors preserve enough of the variance. All the individual principal components sum up to give the total variance; each individual principal component's share is the ratio of the variance of that principal component to the total variance. Applying PCA gives us two principal components, PC1 (the first principal component) and PC2 (the second principal component); PC1 captures the most variance and PC2 the second most. Now our dataset is ready, and data mining techniques can be applied to it for the classification of benign and malignant tumors.

E. Model Selection
The most exciting phase in building any machine learning model is the selection of the algorithm. We can apply more than one kind of data mining technique to large datasets, but at a high level all those different algorithms can be classified into two groups: supervised learning and unsupervised learning. Supervised learning is the method in which the machine is trained on data for which the input and output are well labeled. The model can learn from the training data and process future data to predict an outcome. Supervised techniques are grouped into regression and classification techniques. A regression problem is when the result is a real or continuous value, such as "salary" or "weight". A classification problem is when the result is a category, such as filtering emails into "spam" or "not spam". Unsupervised learning is giving the machine information that is neither classified nor labeled and allowing the algorithm to analyze the given information without any directions: the machine is trained on data which is not labeled or classified, making the algorithm work without explicit instructions. In our dataset we have an outcome (dependent) variable Y having only two values, either M (Malignant) or B (Benign). So a classification algorithm of supervised learning is applied to it. We have chosen three different types of classification algorithms in Machine Learning.
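The PCA step described above can be sketched with scikit-learn. The random matrix below merely stands in for the numeric cell-parameter features of the dataset, and the variable names are illustrative assumptions, not the paper's actual code:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in for the numeric features of the breast cancer dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(569, 30))

# PCA is scale-sensitive, so standardize each feature first
X_std = StandardScaler().fit_transform(X)

# Keep the two eigenvectors with the largest eigenvalues (PC1, PC2)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)

print(X_2d.shape)                     # (569, 2)
print(pca.explained_variance_ratio_)  # share of total variance per component
```

`explained_variance_ratio_` reports, for each retained component, the ratio of that component's variance to the total variance, which is exactly the quantity discussed in Section D.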


1. Logistic Regression
2. k-Nearest Neighbor (k-NN)
3. Support Vector Machine

1. Logistic Regression

Logistic Regression is a supervised machine learning technique employed in classification jobs (for predictions based on training data). Logistic Regression uses an equation similar to Linear Regression, but the outcome of logistic regression is a categorical variable, whereas it is a continuous value for other regression models. Binary outcomes can be predicted from the independent variables; the outcome of the dependent variable is discrete. Logistic Regression uses a simple equation which shows the linear relation between the independent variables. These independent variables, along with their coefficients, are combined linearly to form a linear equation that is used to predict the output [8].

The equation used by the basic logistic model is

ln(p / (1 - p)) = a0 + a1*x1 + a2*x2 (1)

This algorithm is entitled logistic regression because the key method behind it is the logistic function. The output can be predicted from the independent variables, which form a linear equation. The predicted output of that linear equation has no restrictions; it can be any value from negative infinity to positive infinity. But the output required is a class variable (i.e., yes or no, 1 or 0). So the outcome of the linear equation should be squashed into a small range, i.e. [0, 1]. The logistic function is used here to compress the outcome value between 0 and 1. The logistic function can also be called the sigmoid function or cost function; it is an S-shaped curve which takes a numeric input and changes it to a value between 0 and 1 [9].

Applying the antilog on both sides of the above equation gives eq. (2):

y = e^(a0 + a1*x1 + a2*x2) / (1 + e^(a0 + a1*x1 + a2*x2)) (2)

in which y is the predicted value, a0 is the y-intercept, a1 is the coefficient of the independent variable x1, a2 is the coefficient of the independent variable x2, and e is the base of the natural logarithm. This is the logistic function. In our research the principal components (PC1 and PC2) derived from the dimensionality reduction replace the independent variables x1 and x2. The y-intercept and the regression coefficients are estimated by the maximum likelihood estimation method [10] rather than the least squares method of estimation.

Fig 1. Logistic Function

2. k-Nearest Neighbor (k-NN)

K-Nearest Neighbor is a supervised machine learning algorithm, as the data given to it is labeled. It is a non-parametric method, as the classification of a test data point relies upon the nearest training data points rather than upon assumptions about the parameters of the dataset. It is employed in solving both classification and regression tasks. As a classification technique, it classifies objects based on the k closest training examples in the feature space.

The working principle behind KNN is that it presumes that similar data points lie in the same neighborhood. It avoids the burden of building a model, tuning a number of parameters, or making further assumptions. It captures the idea of proximity through a mathematical formula, the Euclidean distance, which measures the distance between two points in a plane. Suppose the two points in a plane are A(x0, y0) and B(x1, y1); then the Euclidean distance between them is calculated as follows [11]:

d(A, B) = sqrt((x1 - x0)^2 + (y1 - y0)^2) (3)

An object to be classified is allotted to the class represented by the greater number of its nearest neighbors. If k takes the value 1, the data point is classified into the category of its single nearest neighbor. Given a new input data point, the distances between that point and all the data points in the training dataset are computed. Based on these distances, the training data points with the shortest distances from the test data point are considered its nearest neighbors. Finally, the test data point is assigned to the class of the majority of its nearest neighbors. Thus the classification of the test data point hinges on the classification of its nearest neighbors [12]. Choosing the value of K is the crucial step in the implementation of the KNN algorithm. The value of K is not fixed; it varies from dataset to dataset, depending on the type of the dataset. If the value of K is small, the stability of the prediction is low. In the same manner, if we increase its value, ambiguity is reduced, which leads to smoother boundaries and increased stability. In KNN, assigning a new data point to a category entirely depends on K's value: K represents the number of nearest training data points in the proximity of a given test data point, and the test data point is allotted to the class containing the highest number of those nearest neighbors (i.e., the class with the highest frequency).

3. Support Vector Machine

Support Vector Machine is a supervised machine learning algorithm which does well in pattern recognition problems, and it is used as a training algorithm for learning classification and regression rules from data. SVM is most appropriate when the number of features and the number of instances are high. A binary classifier is built by the SVM algorithm [13]. This binary classifier is constructed using a hyperplane, which generalizes a separating line to more than 3 dimensions. The hyperplane does the work of separating the members into one of the two classes.
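The three classifiers above can be sketched on 2-D (PCA-reduced) data with scikit-learn. The synthetic data and the hyperparameters below are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Illustrative 2-D data standing in for the two principal components,
# with labels 0 (benign) and 1 (malignant)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.0, 0.7, size=(100, 2)),
               rng.normal(+1.0, 0.7, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

models = {
    "Logistic Regression": LogisticRegression(),
    "k-NN (k=5)": KNeighborsClassifier(n_neighbors=5),
    "SVM (linear kernel)": SVC(kernel="linear"),
}
for name, model in models.items():
    model.fit(X, y)
    # Two probe points, one deep inside each cluster
    print(name, model.predict(np.array([[-2.0, -2.0], [2.0, 2.0]])))
```

On data this cleanly separated all three classifiers agree; the comparison in the paper is made on the real dataset with the metrics of Table 1.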


The hyperplane of SVM is built on mathematical equations. The equation of the hyperplane is W^T X = 0, which is analogous to the line equation y = ax + b. Here W and X represent vectors, where the vector W is always normal to the hyperplane, and W^T X is the dot product of the two vectors. As SVM deals with datasets in which the number of features is large, we need to use the equation W^T X = 0 in this case instead of the line equation y = ax + b.

If a set of training data is given to the machine, where each data item is assigned to one of two categories, an SVM training algorithm builds a model that maps a new data item to one or the other category. In an SVM model, each data item is represented as a point in an n-dimensional space, where n is the number of features and each feature is the value of a particular coordinate in that space. Classification is carried out by finding a hyperplane that divides the two classes proficiently. Later, a new data item is mapped into the same space and its category is predicted based on the side of the hyperplane on which it turns up.

IV. RESULTS AND DISCUSSION

As our dataset contains 32 attributes, dimensionality reduction contributes a lot in decreasing the multidimensional data to a few dimensions. Of the three applied algorithms, Support Vector Machine, k-Nearest Neighbor and Logistic Regression, SVM gives the highest accuracy of 92.7%. So we propose that SVM is the best suited algorithm for the prediction of breast cancer occurrence with complex datasets.

Fig 2. Logistic Regression with PCA training set
Fig 3. Logistic Regression with PCA testing set
Fig 4. KNN with PCA training set
Fig 5. KNN with PCA testing set
Fig 6. SVM with PCA training set
Fig 7. SVM with PCA testing set

Table 1 shows the comparison between the algorithms in terms of Accuracy, Precision, Sensitivity, Specificity and False Positive Rate.
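The metrics compared in Table 1 all derive from the entries of a binary confusion matrix. The counts below are made-up illustrative values, not the paper's results:

```python
# Hypothetical confusion-matrix counts for a malignant-vs-benign classifier
TP, FN = 50, 5   # malignant cases classified correctly / wrongly
TN, FP = 80, 8   # benign cases classified correctly / wrongly

accuracy    = (TP + TN) / (TP + TN + FP + FN)
precision   = TP / (TP + FP)
sensitivity = TP / (TP + FN)  # recall, the true positive rate
specificity = TN / (TN + FP)  # the true negative rate
fpr         = FP / (FP + TN)  # false positive rate = 1 - specificity

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"sensitivity={sensitivity:.3f} specificity={specificity:.3f} "
      f"fpr={fpr:.3f}")
```

For a cancer screening task, sensitivity (how many malignant cases are caught) and the false positive rate (how many benign cases are flagged) matter alongside raw accuracy, which is why Table 1 reports all five.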


Table 1. Comparison of the performances of the various algorithms

V. CONCLUSION

Our work mainly focused on the advancement of predictive models to achieve good accuracy in predicting valid disease outcomes using supervised machine learning methods. The analysis of the results signifies that the integration of multidimensional data along with different classification, feature selection and dimensionality reduction techniques can provide auspicious tools for inference in this domain. Further research in this field should be carried out to improve the performance of the classification techniques so that they can predict on more variables.

ACKNOWLEDGMENT

We would like to thank our Research Guide Dr. Shaik Subhani, Associate Professor in Information Technology, Sreenidhi Institute of Science and Technology, Hyderabad, for his continuous support and valuable suggestions throughout this work. The authors are also grateful to the reviewers for critically going through the manuscript and giving valuable suggestions for its improvement. We would also like to thank the Department of Information Technology, Sreenidhi Institute of Science and Technology, Hyderabad, for providing us with the facility for carrying out the simulations.

REFERENCES
1. Yi-Sheng Sun, Zhao Zhao, Han-Ping Zhu, "Risk Factors and Preventions of Breast Cancer", International Journal of Biological Sciences.
2. Alireza Osareh, Bita Shadgar, "A Computer Aided Diagnosis System for Breast Cancer", International Journal of Computer Science Issues, Vol. 8, Issue 2, March 2011.
3. Mandeep Rana, Pooja Chandorkar, Alishiba Dsouza, "Breast cancer diagnosis and recurrence prediction using machine learning techniques", International Journal of Research in Engineering and Technology, Volume 04, Issue 04, April 2015.
4. Vikas Chaurasia, BB Tiwari and Saurabh Pal, "Prediction of benign and malignant breast cancer using data mining techniques", Journal of Algorithms and Computational Technology.
5. Haifeng Wang and Sang Won Yoon, "Breast Cancer Prediction Using Data Mining Method", IEEE conference paper.
6. D. Dubey, S. Kharya and S. Soni, "Predictive Machine Learning Techniques for Breast Cancer Detection", International Journal of Computer Science and Information Technologies, Vol. 4(6), 2013, pp. 1023-1028.
7. Naresh Khuriwal, Nidhi Mishra, "Breast cancer diagnosis using adaptive voting ensemble machine learning algorithm", 2018 IEEMA Engineer Infinite Conference (eTechNxT), 2018.
8. Chao-Ying Joanne Peng, Kuk Lida Lee, Gary M. Ingersoll, "An Introduction to Logistic Regression Analysis and Reporting", September/October 2002, Vol. 96(1).
9. "Logistic Regression for Machine Learning", Machine Learning Mastery, https://machinelearningmastery.com/logistic-regression-for-machine-learning/
10. In Jae Myung, "Maximum Likelihood Estimation".
11. Onel Harrison, "Machine Learning Basics with the K-Nearest Neighbors Algorithm".
12. Sadegh Bafandeh Imandoust and Mohammad Bolandraftar, "Application of K-Nearest Neighbor (KNN) Approach for Predicting Economic Events: Theoretical Background", International Journal of Engineering Research and Applications, Vol. 3, Issue 5, Sep-Oct 2013.
13. Ebrahim Edriss Ebrahim Ali, Wu Zhi Feng, "Breast Cancer Classification using Support Vector Machine and Neural Network", International Journal of Science and Research (IJSR), Volume 5, Issue 3, March 2016.

AUTHORS PROFILE

Chakinam Shravya, Information Technology, Sreenidhi Institute of Science and Technology, Hyderabad, Telangana, India.

Kuthuru Pravalika, Information Technology, Sreenidhi Institute of Science and Technology, Hyderabad, Telangana, India.

Dr. Shaik Subhani, Information Technology, Sreenidhi Institute of Science and Technology, Hyderabad, Telangana, India. He received his Bachelor of Technology (B.Tech) degree from Andhra University, Visakhapatnam, his M.Tech from JNTUH, Hyderabad, and his Ph.D. from Acharya Nagarjuna University, Guntur. His research areas are Image Processing and Data Mining; his research interests include Data Mining, Computer Networks, Cloud Computing, Machine Learning and Soft Computing techniques. He has published many research papers in national and international conferences and journals.
