Comparative Study of Classification Techniques On Breast Cancer FNA Biopsy Data
Haowen You¹ and George Rumbe²
¹ Department of Systems and Information Engineering, University of Virginia, Charlottesville, Virginia.
² Department of Systems Science and Industrial Engineering, Binghamton University, Binghamton, New York.
Abstract - Accurate detection of cancerous cells in a patient is critical: it may alter the subsequent treatment and increase the chances of survival. Machine learning techniques have been instrumental in disease detection and are currently used in various classification problems because of their accurate prediction performance. Different techniques may provide different accuracies, and it is therefore imperative to use the most suitable method, the one which provides the best results. This research seeks to provide a comparative analysis of the Support Vector Machine, the Bayesian classifier and other artificial neural network classifiers (backpropagation, linear programming, learning vector quantization, and k-nearest neighbor) on the Wisconsin breast cancer classification problem.

Keywords: Artificial Neural Networks, Classification, Breast cancer diagnosis

III. INTRODUCTION

The development of automated diagnostics was instigated by the need to aid the physician in decision making. Their application in healthcare has spanned from electrocardiograms to ultrasounds. The traditional set-up for error detection and monitoring of disease progression rests heavily on the technicians within healthcare. The increase in the number of patients who require continuous assessment has led to the technical development of automated systems. Transformations of qualitative information into quantitative measures are at the forefront in solving classification problems. Breast cancer has been identified as the second largest cause of cancer deaths among women between the ages of 40 and 55. The number of breast cancer diagnoses is estimated to be 1.2 million among women every year, according to projections by the World Health Organization [4]. According to statistics by the American Cancer Society in 2001, about 40,200 deaths are caused by breast cancer and 192,000 cases consist of women who are newly diagnosed [8]. Additional statistics as of 2006 estimated 214,460 new cancer diagnoses and at least 41,000 total deaths within the US [10]. Early detection and accurate diagnosis have been crucial in reducing the number of deaths, which has increased the survival rate of those diagnosed with breast cancer [8].

The identification of cancerous cells in a patient is highly subjective and reliant on the physician's expertise. This may lead to inaccurate predictions, since the examinations are prone to human and visual error and may be affected by blurred mammogram visuals [11]. These challenges call for accurate tools for the detection and classification of breast cancer cells. There have been effective systems, such as medical decision support systems (MDSS), used in aiding the detection of cancerous breast cells [8]. Machine learning techniques have been instrumental in providing evidence in support of the accuracy of the classification of breast cancer patients. Once the breast cancer diagnosis has been performed, the prognosis is subsequently determined to predict the future development and characteristics of the cancerous cells. Prognosis has been determined to be more complex due to the censoring of data [9].

Diagnosis is employed to accurately discern between malignant and benign patterns. Some of the conventional approaches for breast cancer detection and diagnosis include mammography, surgical biopsy and fine needle aspirate [9]. The sensitivity of these approaches in accurately identifying malignant lumps ranges as follows: mammography 68%-79%, fine needle aspirate 65%-98%, and surgical biopsy about 100% [9]. Surgical biopsy, despite being an effective approach, has been determined to be a costly procedure which induces negative psychological effects on patients [10]. Another effective method to diagnose breast masses is based on Fine Needle Aspiration biopsy, a technique that extracts cell samples from a lump for visual observation of the cells under a microscope [1]. A diagnosis conclusion (benign or malignant) can be
International Journal of Artificial Intelligence and Interactive Multimedia, Vol. 1, Nº 3.
drawn according to the judgment of domain experts [2]. Currently, artificial intelligence techniques, which treat the diagnosis as a pattern classification problem using the cellular nuclei shape information from cell slide images, have been introduced into this area to improve the accuracy, consistency and efficiency of the diagnosis process.

A. Research Objective

The objective of this research is to provide a comparative study of potential classification tools (linear programming, back-propagation neural network, support vector machine and Bayesian network) on this problem, using a benchmark dataset which consists of numeric cellular shape features extracted from preprocessed Fine Needle Aspiration biopsy images of cell slides.

B. Research scope

This research will first implement Support Vector Machine (SVM) and Bayesian network solutions on the benchmark dataset. Then a comparison on this benchmark dataset between the formerly adopted techniques (linear programming and back-propagation neural network) and these two newly developed modeling approaches will be conducted. The measures for this comparative study will be selected according to those proposed by the latest publication on this problem [4]. These will include classification accuracy, sensitivity, specificity, positive predictive value and negative predictive value. K-fold cross-validation [5] will also be used to evaluate the overall performance of each model built by the aforementioned approaches. The rest of this paper is organized as follows: Section 2 provides detailed information on the literature review, Section 3 introduces the strategies employed by the SVM and Bayesian network classifiers, Section 4 discusses detailed analysis of the results, the complexity of the modeling process and the computational expenditure of these approaches, and Section 5 provides the summary and conclusion of the research.

IV. LITERATURE REVIEW

The increase in the number of deaths within healthcare systems has led to the development of medical diagnostic support systems to aid medical personnel in the decision making process [10]. Various expert systems and machine learning algorithms have been utilized to provide supporting information based on the input knowledge. Some of the significant developments include 2D and 3D medical imaging, feature extraction, pattern analysis and classification, which have been used in providing solutions for edge detection and region growing among other problems [10]. According to Pena-Reyes and Sipper (1999), an effective diagnostic system should be able to provide high accuracy in identifying the disease as malignant or benign. In addition, the system should be able to attach a degree of confidence to the diagnosis, indicating its accuracy. Another important aspect is the system's interpretability, which provides information on the steps followed to reach the outcomes generated. The artificial neural network, on the other hand, has been determined to be an effective tool in classification, though the operations within the network structure are hidden.

Classification problems have generated interest among researchers. The classification approach is used in data analysis and pattern recognition problems. This approach involves classifier modeling, in which the classifier is used as a function that associates a class with different attributes. The concept of association based on similarities or trained performance has been embedded in various approaches such as neural networks, decision trees and decision graphs [14]. The methodology of neural networks can be performed in two phases, i.e. training and testing. The training phase involves feature extraction and computation utilizing the classification rules. Testing data, on the other hand, is used to evaluate the accuracy of the classification process determined by the training data [10]. Breast cancer diagnosis and prognosis has attracted research interest and has been explored utilizing various artificial neural networks such as Radial Basis Function, Multilayer Perceptron, Backpropagation, and Learning Vector Quantization networks. Other methods which have been utilized for breast cancer diagnosis include fuzzy systems and evolutionary algorithms. The fuzzy systems are used to represent different degrees of the disease (malignant or benign) a patient suffers from; the evolutionary algorithms, on the other hand, are used to search for the most suitable fuzzy systems [6].

Isotonic separation, which is a linear programming technique, is based on the underlying assumption of maintaining consistency in diagnosis. For example, in the Wisconsin breast cancer dataset, if a patient is diagnosed with a malignant tumor based on certain characteristics of the cell structures, other patients showing similar symptoms with more damage to the cells would end up receiving the same diagnosis [7]. Another technique is the rank nearest neighbor technique (k-RNN) [11]. The k-RNN is a technique used in approximating densities based on evaluations of the nearest neighbors [11]. It has been applied to univariate and multivariate data in examining various classification problems including breast cancer. In order for a patient to receive the appropriate breast cancer treatment, it is necessary that an accurate classification of the cells be determined. This has led researchers to combine and employ various machine learning techniques and select the one with the highest prediction accuracy [16].
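
Selecting a technique by prediction accuracy requires a common set of measures. The five measures named in the research scope (classification accuracy, sensitivity, specificity, positive predictive value and negative predictive value) all follow from the four counts of a two-class confusion matrix. The following is a minimal sketch in Python, not the authors' MATLAB code; the function names are hypothetical, and treating malignant ("M") as the positive class is an assumption.

```python
import math


def confusion_counts(y_true, y_pred, positive="M"):
    """Count true/false positives and negatives for a two-class problem.

    Assumption: `positive` marks the malignant class; every other label
    is treated as negative (benign).
    """
    tp = fn = fp = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive:
            if predicted == positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted == positive:
                fp += 1
            else:
                tn += 1
    return tp, fn, fp, tn


def _ratio(numerator, denominator):
    # Return NaN instead of raising when a class or prediction is absent.
    return numerator / denominator if denominator else math.nan


def diagnostic_measures(tp, fn, fp, tn):
    """Compute the five measures from the confusion matrix counts."""
    return {
        "accuracy": _ratio(tp + tn, tp + fn + fp + tn),
        "sensitivity": _ratio(tp, tp + fn),   # malignant cases found
        "specificity": _ratio(tn, tn + fp),   # benign cases cleared
        "ppv": _ratio(tp, tp + fp),           # positive predictive value
        "npv": _ratio(tn, tn + fn),           # negative predictive value
    }
```

Calling `diagnostic_measures(*confusion_counts(y_true, y_pred))` on the held-out predictions of each classifier yields values that can be compared side by side.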
A Direct Path to Intelligent Tools ISSN - 1989-1660
The comparative analyses of ANNs range from two to six or more networks being evaluated to determine the most appropriate technique. Integration of different ANNs has led to improved performance measures. The RBF properties, when applied to tuning the SVM, have been determined to provide higher prediction accuracy for breast cancer data [12].

V. METHODOLOGY

There have been numerous artificial neural network approaches used for examining the classification of breast cancer cells; some of these approaches are the Bayesian classifier and SVM. This section provides descriptive discussions of the SVM and Bayesian classifier frameworks. In addition, it examines the strategies employed and some of the parameters that are used for effective classification of patterns.

A. 3.1 Support Vector Machine Stratagem

Support Vector Machine (SVM) was introduced by Vapnik. It is a technique based on statistical learning theory and has been applied to solving classification and regression problems [15]. The objective of the SVM is to separate two classes by determining the linear classifier that maximizes the margin, which is referred to as the optimal separating hyperplane [15]. SVM has been employed in various classification problems and is of current interest in breast cancer detection due to its robustness. The regularization parameter and the kernel function are the two major components that have to be determined before conducting training. Some of the significant research using the SVM for breast cancer detection utilized heuristic SVM approaches such as the smooth SVM, the linear SVM and the general nonlinear SVM [12]. The goal of SVM is to determine a suitable hyperplane with maximum margin, which can be computed as an optimization problem [10].

B. 3.2 Bayesian Network Approach

Bayesian networks are characterized by the use of a probabilistic approach in problem solving and encompass the uncertainty of certain occurrences. Their origin is based on the probability distribution, which can be depicted graphically. The Bayesian network classifier is composed of a set of variables related to each other by directed edges. The variables represent the data attributes and the class, and the arcs, together with the conditional probability tables, depict their relationships in a visual format. Bayesian network classifiers are also referred to as directed acyclic graphs that provide information on the joint probability distribution over various random variables [14]. In the Bayesian network classifier, the connecting arcs between different nodes provide an independence assumption that is associated with the different random variables. The independence assumption provides information on the probability distribution that is represented within the network. Generally, the probability distribution within the network must initially be specified for the root nodes, followed by the conditional probabilities of the remaining non-root nodes based on the combinations of their direct predecessors [13]. The conditional probabilities can only be determined once information on some of the nodes in the network has been identified.

The Bayesian network classifier uses an unsupervised learning algorithm, where the class target is unknown though we have the inputs (attributes) [14]. The classifier learning algorithm can be structured into two parts: (i) a function for assessing a given network based on the data and (ii) an approach for searching the space of possible networks. There are various learning algorithms employed for Bayesian networks, including AD (All Dimensions) Trees, TAN (Tree Augmented Naïve Bayes) and K2. K2 has been used in breast cancer classification problems due to its fast convergence. Bayesian nets have been utilized in providing solutions to medical diagnosis, heuristic search and map learning problems among other challenges [13]. The Bayesian network is based on an independence assumption between the nodes.

C. 3.3 Data Structure

The benchmark dataset in this research was obtained from the UCI Irvine machine learning repository, http://archive.ics.uci.edu/ml/index.html. This dataset was originally created by Dr. Wolberg, Street and Mangasarian, all from the University of Wisconsin. Data items in the dataset are composed of an ID number, the diagnosis, which is classified as either malignant (M) or benign (B), and numeric shape features of the extracted cellular nuclei such as radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry and fractal dimension. The dataset is composed of a total of 569 observations, with benign and malignant cases being 357 and 212 observations respectively. Each observation in the dataset is composed of 30 variables, and 10 of the featured variables are related to the aforementioned characteristics [3].

VI. RESULTS AND DATA ANALYSIS

This section provides discussion of the results and analysis for SVM, Bayesian, LVQ, KNN and BNT_Clustering. Furthermore, a comparative analysis of the aforementioned approaches is presented. The SVM and Bayesian network classifier approaches were developed using MATLAB, and the 10 variables (see section 3.3) were used as inputs to the classifiers.
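
The K-fold cross-validation protocol named in the research scope [5] can be illustrated with a short sketch. The code below is written in Python rather than the MATLAB used in the study, and it makes several assumptions that the paper does not state: K is set to 10, the folds are stratified by class, and a nearest-centroid rule stands in for the classifier purely as a placeholder (it is not the SVM or the Bayesian network). All function names are hypothetical.

```python
import random


def stratified_k_folds(labels, k=10, seed=0):
    """Split indices into k folds that roughly keep the class ratio."""
    if k > len(labels):
        raise ValueError("k must not exceed the number of observations")
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin so each fold gets a share of every class.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def cross_validate(features, labels, fit, predict, k=10, seed=0):
    """Return the mean held-out accuracy over k folds."""
    accuracies = []
    for held_out in stratified_k_folds(labels, k, seed):
        held = set(held_out)
        train = [i for i in range(len(labels)) if i not in held]
        model = fit([features[i] for i in train], [labels[i] for i in train])
        correct = sum(predict(model, features[i]) == labels[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / len(accuracies)


def fit_centroids(rows, labels):
    """Placeholder classifier: store the mean feature vector of each class."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict_centroid(centroids, row):
    """Assign the class whose centroid is closest in squared distance."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))
```

With the 357 benign and 212 malignant observations of the benchmark dataset, this split yields folds of 56 to 58 observations, each containing 21 or 22 malignant cases. Any classifier exposing a `fit` and a `predict` function of the same shape could replace the placeholder.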
VII. DISCUSSION AND CONCLUSION

Breast cancer cells can be detected early and accurately by the use of machine learning techniques. This may result in a decrease in health costs and may reduce the time required for a patient to receive treatment. In this project the SVM and the Bayesian network have been discussed in providing diagnostic and prognostic assessment for breast cancer. The SVM has been determined to be superior to the Bayesian network since it provides higher prediction accuracy. In comparing the performance of both networks to other neural network approaches, the KNN was found to provide 100% classification accuracy. The prediction accuracy of the networks discussed in this project emphasizes the need to employ machine learning techniques not only for the prediction of breast cancer but also for other medical conditions that are difficult to diagnose.

VIII. REFERENCES

[1] McMorran, J., and Crowther, D.C., “Fine needle aspiration cytology (breast)”, General Practice Notebook – a UK medical reference on the world wide web, Feb 2009.
[2] Mangasarian, O.L., Street, W.N., and Wolberg, W.H., “Breast cancer diagnosis and prognosis via linear programming”, Operations Research, Vol.43, No.4, 1995, pp.570-577.
[3] UCI Irvine machine learning repository, “Breast Cancer Wisconsin (Diagnostic) Data Set”, http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic), Nov. 1995.
[4] Akay, M., “Support vector machines combined with feature selection for breast cancer diagnosis”, Expert Systems with Applications, Vol.36, 2009, pp.3240-3247.
[5] Kohavi, R., “A study of cross-validation and bootstrap for accuracy estimation and model selection”, Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, Vol.12, No.2, 1995, pp.1137–1143.
[6] Pena-Reyes, C., and Sipper, M., “A fuzzy approach to breast cancer diagnosis”, Artificial Intelligence in Medicine, Vol.17, 1999, pp.131-135.
[7] Ryu, Y., Chandrasekaran, R., and Jacob, V., “Breast cancer prediction using the isotonic separation technique”, European Journal of Operational Research, Vol.181, 2007, pp.842-854.
[8] West, D., Mangiameli, P., Rampal, R., and West, V., “Ensemble strategies for a medical diagnostic decision support system: A breast cancer diagnosis application”, European Journal of Operational Research, Vol.162, 2005, pp.532-551.
[9] Pantel, P., “Breast cancer diagnosis and prognosis”, University of Manitoba, 1998.
[10] Maglogiannis, I., and Zafiropoulos, “An intelligent system for automated breast cancer diagnosis and prognosis using SVM based classifiers”, Applied Intelligence, Vol.30, 2009, pp.24-36.
[11] Bagui, S., Bagui, S., Pal, K., and Pal, N., “Breast cancer detection using rank nearest neighbor classification rules”, Pattern Recognition, Vol.36, 2003, pp.25-34.
[12] Mu, T., and Nandi, A., “Breast cancer detection from FNA using SVM with different parameter tuning systems and SOM-RBF classifier”, Journal of the Franklin Institute, Vol.344, 2007, pp.285-311.
[13] Charniak, E., “Bayesian networks without tears”, Artificial Intelligence Magazine, Vol.12, No.4, 1991, pp.50-63.
[14] Friedman, N., Geiger, D., and Goldszmidt, M., “Bayesian classifier”, Machine Learning, Vol.29, 1997, pp.131-163.
[15] Gunn, S., “Support vector machines for classification and regression”, Technical paper, 1998.
[16] Ubeyli, E., “Implementing automated diagnostic systems for breast cancer detection”, Expert Systems with Applications, Vol.33, 2007, pp.1054-1062.
[17] Kim, Y., Jang, S., Cho, K., and Park, G., “Performance comparison between Backpropagation, Neuro-Fuzzy Network, and SVM”, Springer Berlin/Heidelberg, 2006.
[18] Hsu, C., Chang, C., and Lin, C., “A practical guide to support vector classification”, Technical report, Department of Computer Science and Information Engineering, National Taiwan University, Taipei, 2003. http://www.csie.ntu.edu.tw/~cjlin/libsvm/.