
BREAST CANCER PREDICTION USING MACHINE LEARNING ALGORITHMS

Dr. Jeevan Pinto1, Kushali2, Madhu3

IT, Master of Computer Applications, St. Aloysius Institute of Management and IT (AIMIT)

Beeri, Mangalore, INDIA

ABSTRACT

This study examines the ability of Independent Component Analysis (ICA) to perform feature reduction in a decision support system for breast cancer diagnosis. The Wisconsin Diagnostic Breast Cancer (WDBC) dataset was reduced to a one-dimensional feature vector. The diagnostic performance of several classifiers, k-NN, ANN, RBFNN, and SVM, is evaluated on both the original 30-feature dataset and the reduced dataset containing only the single independent component (IC). A comparative analysis of the classification results obtained with the IC and with the original feature set is conducted. Multiple validation schemes are applied, including 5-fold and 10-fold cross-validation and data partitioning of 20% to 40%. The classifiers' ability to distinguish between benign and malignant tumours is assessed using specificity, sensitivity, accuracy, F-score, Youden's index, discriminant power, and Receiver Operating Characteristic (ROC) analysis, including the area under the curve (AUC) with 95% confidence intervals. The proposed approach yields higher accuracy for diagnostic decision making at low computational complexity.

Keywords: Independent Component Analysis (ICA), Breast Cancer, WDBC Dataset, Feature Reduction, k-NN,
ANN, RBFNN, SVM, Classification, Sensitivity, Specificity, Accuracy, F-score, Youden’s Index

1. INTRODUCTION

Breast cancer remains a leading cause of mortality among women globally. Early and accurate detection is critical
for effective treatment.[1] Traditional diagnostic methods often rely heavily on the expertise and visual
assessments of physicians, which can sometimes result in errors due to human limitations. While humans excel in
pattern recognition, challenges arise when probabilistic
interpretations are required. Even when diagnostic testing is complete, reaching an appropriate diagnosis can be difficult even for experts. Hence, in recent years, demand for automated diagnostic technology has been at an all-time high in breast cancer research. [2] Computer-assisted diagnostic systems seek to enable doctors to make more accurate diagnoses. Machine learning has proven highly effective at enhancing diagnostic accuracy compared with conventional methods: one study reported that the 79.97% diagnostic accuracy achieved by experienced physicians increased to 91.1% with the help of machine learning techniques. [3-5]
Breast tumors can be benign or malignant. Benign tumors are not cancerous but can increase the risk of breast cancer. Malignant tumors are cancerous and thus harmful. Despite the progress achieved in early detection, a large proportion of women with malignant tumours still die of this disease [7]. Moreover, to improve the accuracy of tumour classification, back-propagation-based ANNs and radial basis function neural networks (RBFNNs) have been investigated. The RBFNN shows very satisfactory performance, with rapid training and good generalization combined with a simple structure, in detection tasks such as those involving microcalcifications. However, a high input dimension hampers its ability to generalize. Similarly, SVMs have emerged as strong statistical learning methods that separate data classes well using hyperplanes. Their advantages include fast training and the ability to scale to large datasets. [9-12]
Feature reduction is commonly performed with principal component analysis (PCA) and independent component analysis (ICA). The essential task of both methods is to optimize classifier performance. ICA, the more recent of the two, can extract richer features than PCA because it exploits higher-order statistical properties. Reducing the input dimension in this way lowers the complexity of the classifier, which increases both convergence speed and accuracy. [13]
This paper explores the performance of ICA for feature dimensionality reduction in classifying breast tumours as malignant or benign. The feature dimensionality of the WDBC dataset is reduced to a single feature through ICA. Several classifiers, k-NN, ANN, RBFNN, and SVM, are then compared according to their performance. The work employs multiple validation techniques: 5-fold and 10-fold cross-validation, as well as 20% data partitioning. Classifiers are compared on metrics including accuracy, specificity, sensitivity, F-score, Youden's index, and discriminant power. Finally, Receiver Operating Characteristic (ROC) analysis is used to compare overall performance, giving insight into whether ICA improves classification systems. [14-17]
2. LITERATURE SURVEY

Machine learning has revolutionized breast cancer diagnostics by enhancing accuracy beyond traditional methods.
Studies have shown that while experienced physicians may achieve approximately 80% diagnostic accuracy,
machine learning approaches can reach over 91%, highlighting their significant role in assisting clinical decision-
making. These advancements mitigate the limitations of human perception and reduce diagnostic errors through
automated analysis and pattern recognition.[2]

Several machine learning classifiers, including Artificial Neural Networks (ANN), Radial Basis Function Neural
Networks (RBFNN), and Support Vector Machines (SVM), have been shown to successfully distinguish between
benign and malignant breast tumors. In terms of ANN, there has been an emphasis on adaptability in modeling
complex patterns while also maintaining high sensitivity levels to detect malignant cases. Another widely used
classifier that performs rapid learning and can represent nonlinear relationships is RBFNN. SVMs are used because
they are very useful in identifying the optimal hyperplanes that are crucial for classifying data in high-dimensional
spaces. They are generally used for accurate tumor classification. [6]

Feature reduction is imperative for simplifying complex data and improving computational efficiency. Techniques such as PCA and ICA are widely used. ICA is the more effective of the two because it captures independent components, preserving more intricate relationships within the data than PCA. Researchers have reduced the 30-feature WDBC dataset through ICA to a one-dimensional feature vector with minimal loss of accuracy while reducing computational complexity. [18]

The effectiveness of machine learning classifiers can vary significantly depending on the dataset's dimensionality
and the feature reduction techniques applied. For instance, ICA-based feature reduction has shown the following
impacts:
• Increased accuracy of RBFNN, from 87.17% when 30 features were used to 90.49% when only one
reduced feature was used.
• Increased sensitivity of SVM classifiers, which is essential for identifying malignant tumors, highlighting
the practical applicability of feature reduction in enhancing clinical performance.

Past studies have proved that feature reduction can be combined with machine learning algorithms to increase the
accuracy of diagnosis. For instance:
• Studies based on SVMs with polynomial and RBF kernels provided diagnostic accuracies of over 92%.
• Hybrid techniques, like ICA in combination with discrete wavelet transforms, among others, have
provided more accurate results, with accuracies of over 97%. These methods show that incorporating
feature reduction with powerful classification algorithms indeed increases the accuracy of decision
support in diagnosis.[3]

3. MATERIALS AND METHODS

3.1. Dataset Information.


The WDBC dataset consists of 569 instances, with a class distribution of 357 benign and 212 malignant. Each sample consists of an ID number, a diagnosis (B = benign, M = malignant), and 30 features. The features are computed from a digitized image of a fine needle aspirate (FNA).
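The same dataset ships with scikit-learn, so the class distribution stated above can be checked directly. A minimal sketch, assuming scikit-learn is available (in this loader, target 0 is malignant and 1 is benign):

```python
# Load the WDBC dataset bundled with scikit-learn and confirm the
# class distribution reported above: 357 benign, 212 malignant.
from sklearn.datasets import load_breast_cancer
import numpy as np

data = load_breast_cancer()
X, y = data.data, data.target          # X: (569, 30) feature matrix

n_malignant = int(np.sum(y == 0))      # 0 = malignant in this loader
n_benign = int(np.sum(y == 1))         # 1 = benign
print(X.shape, n_benign, n_malignant)  # (569, 30) 357 212
```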

Figure 1: FNA biopsies of breast tissue: malignant (a) and benign (b) breast tumours [24].
3.2 Flow Chart

3.3 Machine Learning Algorithms for Breast Cancer Prediction

3.3.1 Logistic Regression: Logistic regression is a basic supervised learning technique commonly applied to binary classification. Because it always predicts probabilities between 0 and 1, it forms an S-shaped (sigmoid) curve. Logistic regression works effectively on simple datasets, but on complex data structures its performance may not match that of more advanced algorithms. [6]
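A minimal logistic regression sketch on the WDBC data, assuming scikit-learn (the split ratio and solver settings are illustrative, not the paper's exact setup):

```python
# Logistic regression on WDBC: sigmoid output gives probabilities in [0, 1].
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=5000)  # raise max_iter so lbfgs converges
clf.fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]    # S-shaped sigmoid output in [0, 1]
acc = clf.score(X_te, y_te)
```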

3.3.2 K-Nearest Neighbors (KNN): A classification technique that assigns a point the majority class among its k nearest neighbours, using a distance metric such as Euclidean distance. It is easy to use and often works well for classification problems such as medical diagnosis, though it can be susceptible to noisy values and high-dimensional attribute spaces. [7]
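A k-NN sketch under the Euclidean metric, assuming scikit-learn (k = 5 is illustrative; the paper sweeps k, as described later):

```python
# k-NN: majority vote among the k nearest neighbours under Euclidean distance.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Scaling matters for distance-based methods: without it, large-valued
# attributes (e.g. area) dominate the distance computation.
scaler = StandardScaler().fit(X_tr)
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(scaler.transform(X_tr), y_tr)
acc = knn.score(scaler.transform(X_te), y_te)
```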

3.3.3 Naïve Bayes (NB): Naïve Bayes uses probabilistic calculations and assumes independence of features. Its low computational complexity and modest training-data requirements make it very applicable to predicting breast cancer outcomes, but the independence assumption can greatly reduce its accuracy on datasets in which features are strongly correlated. [8]
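A Gaussian Naïve Bayes sketch, assuming scikit-learn (Gaussian class-conditional likelihoods are one common choice for continuous features like WDBC's; the paper does not specify a variant):

```python
# Gaussian NB: per-feature class-conditional likelihoods, features
# assumed independent; evaluated with 10-fold cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
scores = cross_val_score(GaussianNB(), X, y, cv=10)
mean_acc = scores.mean()
```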

3.3.4 Support Vector Machines (SVM): SVM is a highly effective supervised learning algorithm that separates data into classes using hyperplanes. It is particularly well suited to high-dimensional data and handles nonlinear relationships effectively. SVM is often considered one of the most accurate algorithms for breast cancer prediction, frequently outperforming methods such as Logistic Regression, KNN, and Random Forest in precision and robustness. [6]
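An SVM sketch with an RBF kernel, assuming scikit-learn (C = 1.0 and the pipeline setup are illustrative defaults, not tuned values from the paper):

```python
# RBF-kernel SVM; standardization keeps the kernel well-conditioned.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
scores = cross_val_score(model, X, y, cv=5)
mean_acc = scores.mean()
```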

3.3.5 Decision Tree: Decision trees classify data by recursively splitting it into branches based on feature values. They are intuitive and easy to interpret but are prone to overfitting, which can reduce their effectiveness in some cases. Despite this, they can serve as a strong foundation for ensemble methods. [1]
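A decision tree sketch, assuming scikit-learn (`max_depth=4` is an illustrative guard against the overfitting noted above, not a value from the paper):

```python
# Decision tree on WDBC; limiting depth is one common overfitting guard.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

tree = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_tr, y_tr)
train_acc = tree.score(X_tr, y_tr)
test_acc = tree.score(X_te, y_te)
```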

3.3.6 Random Forest: Random Forest is an ensemble technique that combines the predictions of multiple decision trees to improve accuracy and mitigate overfitting. It is reliable for breast cancer prediction but may still fall short of SVM's accuracy on some datasets. [16]
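A random forest sketch, assuming scikit-learn (100 trees is an illustrative default):

```python
# Random forest: an ensemble of decision trees, averaged to cut overfitting.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=42)
mean_acc = cross_val_score(forest, X, y, cv=5).mean()
```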

3.4 Performance Measures
There are several ways to evaluate the performance of classifiers. The confusion matrix records correct and incorrect classification results to measure the quality of a classifier. Table 2 shows a confusion matrix for binary classification, where TP, TN, FP, and FN denote true positive, true negative, false positive, and false negative counts, respectively. [15]

The most common empirical measure of classifier effectiveness is accuracy, complemented by sensitivity, which measures the proportion of actual positives correctly identified, and specificity, which measures the proportion of actual negatives correctly identified. These measures are formulated below.

The performance of classifiers is tested using metrics based on a confusion matrix that includes True Positives
(TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).
• Accuracy: proportion of correct predictions overall: Accuracy = (TP + TN) / (TP + TN + FP + FN)

• Sensitivity (Recall): proportion of actual positives correctly identified: Sensitivity = TP / (TP + FN)

• Specificity: proportion of actual negatives correctly identified: Specificity = TN / (TN + FP)

• F-score: harmonic mean of precision and recall: F = 2 · Precision · Sensitivity / (Precision + Sensitivity), where Precision = TP / (TP + FP)

• Discriminant Power (DP): measures how well the model separates the classes, with higher values indicating better performance: DP = (√3/π) · [ln(Sensitivity / (1 − Specificity)) + ln(Specificity / (1 − Sensitivity))]

• Youden's Index: evaluates a classifier's ability to avoid errors: Υ = Sensitivity + Specificity − 1

Performance was assessed using 5-fold and 10-fold cross-validation (dividing the data into subsets for iterative testing) and 20% random partitioning (20% for testing, 80% for training). Cross-validation is more reliable, while partitioning offers simplicity. [8]
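The measures above can be sketched as a small pure-Python helper (the example counts are made up for illustration):

```python
# Compute the standard confusion-matrix measures from raw counts.
import math

def metrics_from_counts(tp, tn, fp, fn):
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    youden = sensitivity + specificity - 1
    # Discriminant power (log-odds form; undefined at 0% or 100% rates).
    dp = (math.sqrt(3) / math.pi) * (
        math.log(sensitivity / (1 - specificity))
        + math.log(specificity / (1 - sensitivity))
    )
    return dict(sensitivity=sensitivity, specificity=specificity,
                accuracy=accuracy, f_score=f_score, youden=youden, dp=dp)

m = metrics_from_counts(tp=90, tn=80, fp=10, fn=5)  # illustrative counts
```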

4. METHODOLOGY

This experiment uses the WDBC dataset with its 30 original features and with one feature extracted by ICA. The classifiers are used to assess classification efficiency on the patient samples for breast cancer prediction. The model in Figure 4 is applied to the 569 patient samples, which are split into training and testing sets for performance analysis.

Using ICA, the dimensionality of the dataset is reduced. The data is split into subsets by 5-fold and 10-fold cross-
validation (CV) and 20% partitioning. These subsets are then used to train and test classifiers including ANN,
RBFNN, SVM, and k-NN. The classifiers are assessed by using metrics such as sensitivity, specificity, accuracy,
F-score, Youden's index, discriminant power (DP), and ROC curves.

During the ICA process, independent components (ICs) are computed, and the first IC, with a considerably high
eigenvalue as illustrated in Figure 5, is selected as the feature of interest. This IC manages to capture 98.205% of
the nonzero eigenvalues in the dataset, thus summarizing the original 30 features. Figure 6 demonstrates its
discriminative capability between the classes. The subsets formed during partitioning are used for training, and the
test data are applied to measure the classifiers' diagnostic accuracy.
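The reduction step can be sketched with scikit-learn's FastICA (an assumption: the paper does not state which ICA implementation it used; standardization before ICA is also an illustrative choice):

```python
# Reduce the 30 standardized WDBC features to one independent component.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import FastICA
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

ica = FastICA(n_components=1, random_state=42)
ic = ica.fit_transform(X_std)   # shape (569, 1): the single reduced feature
```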

Figure 4: The basic model of the study.

Figure 5: Corresponding eigenvalues of the WDBC data. Figure 6: The distribution of the computed IC.

The Euclidean distance formula, d(x, y) = √(∑ᵢ (xᵢ − yᵢ)²), is applied to measure differences between test and training samples. k-NN classifiers are tested across k values from 1 to 25, and performance metrics are recorded for the optimal k value. A feedforward ANN with one hidden layer is used; the number of neurons in the hidden layer is gradually increased to achieve maximum accuracy. The activation function is log-sigmoid, and the training method is gradient descent with momentum and an adaptive learning rate.
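The k sweep described above can be sketched as follows, assuming scikit-learn (5-fold CV on standardized data is an illustrative evaluation choice):

```python
# Sweep k = 1..25 for k-NN and keep the best cross-validated accuracy.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

best_k, best_acc = 0, 0.0
for k in range(1, 26):
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                          X_std, y, cv=5).mean()
    if acc > best_acc:
        best_k, best_acc = k, acc
```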

For the RBFNN, performance is checked across values of the spread parameter σ. For the SVM classifiers, different kernels, linear, quadratic, and RBF, are used to determine which is best suited to classifying breast cancer cases.
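The three kernel choices named above can be compared in a short sketch, assuming scikit-learn ("quadratic" is taken to mean a degree-2 polynomial kernel; hyperparameters are defaults, not the paper's):

```python
# Compare linear, quadratic (degree-2 polynomial), and RBF SVM kernels
# under 5-fold cross-validation on standardized data.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
kernels = {
    "linear": SVC(kernel="linear"),
    "quadratic": SVC(kernel="poly", degree=2),
    "rbf": SVC(kernel="rbf"),
}
results = {name: cross_val_score(make_pipeline(StandardScaler(), clf),
                                 X, y, cv=5).mean()
           for name, clf in kernels.items()}
```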
5. RESULTS
5.1 Classifiers
Measures     | k-NN (1F / 30F) | ANN (1F / 30F) | RBFNN (1F / 30F) | SVM, RBF kernel (1F / 30F)
F-score      | 92.98 / 94.65   | 92.76 / 98.07  | 92.61 / 80.57    | 93.04 / 96.21
DP           | 2.539 / 2.912   | 2.655 / Inf    | 2.606 / 1.131    | 2.769 / 3.267
Υ            | 0.795 / 0.839   | 0.766 / 0.934  | 0.763 / 0.284    | 0.772 / 0.899
Accuracy     | 91.03 / 93.14   | 90.5  / 95.53  | 90.49 / 87.17    | 90.86 / 97.25
Specificity  | 84.9  / 87.26   | 79.71 / 93.39  | 79.71 / 34.9     | 79.71 / 93.86
Sensitivity  | 94.67 / 96.63   | 96.91 / 100    | 96.63 / 93.55    | 97.47 / 96.07

Table 4: The comparison of the ICA algorithm's effect on the classifiers' performance measures (sensitivity, specificity, accuracy, and F-score in %).

5.2 Data Visualisation

5.3 User Interface

Fig. 7: The Flask-based user interface for breast cancer prediction.
In Fig. 7, a portion of the HTML page for the breast cancer prediction system is depicted. The application contains an input form in which users can enter details such as tumour size, texture, and other related characteristics. When the "Predict Breast Cancer Diagnosis" button is clicked, the entered data is sent to a Flask backend for processing.

The Flask application processes the input values and passes them to a trained machine learning model for analysis. Based on the data, the model predicts whether the tumour is benign or malignant. The Flask application then refreshes the webpage with the prediction result, allowing the user to view the outcome right away.
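A minimal sketch of such a backend route, assuming Flask; the route name, field handling, label mapping, and `MODEL` placeholder are illustrative and not taken from the paper's code:

```python
# Hypothetical Flask backend: receives form values, returns a prediction.
from flask import Flask, request

app = Flask(__name__)
MODEL = None  # placeholder: a trained classifier would be loaded here

@app.route("/predict", methods=["POST"])
def predict():
    # Read the numeric form fields in submission order.
    values = [float(v) for v in request.form.values()]
    if MODEL is None:
        return "Prediction: unknown (no model loaded)"
    # Label mapping is an assumption for this sketch.
    label = "malignant" if MODEL.predict([values])[0] == 1 else "benign"
    return f"Prediction: {label}"
```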

6. CONCLUSIONS

This paper discusses the effect of dimensionality reduction by Independent Component Analysis (ICA) on breast
cancer decision support systems that employ classifiers like Artificial Neural Networks (ANN), k-Nearest
Neighbours (k-NN), Radial Basis Function Neural Networks (RBFNN), and Support Vector Machines (SVM).
The performance of classifiers using the original 30 features from the Wisconsin Diagnostic Breast Cancer
(WDBC) dataset is compared with the results from a reduced, one-dimensional feature vector obtained through
ICA.

For most classifiers, accuracy decreased slightly with the ICA-reduced feature compared to the original 30 features, dropping from 93.14%, 95.53%, and 97.25% to 91.03%, 90.5%, and 90.86% for k-NN, ANN, and SVM, respectively. For the RBFNN classifier, the one-dimensional feature vector improved performance, increasing accuracy from 87.17% to 90.49%. In addition, sensitivity, which represents the true-positive percentage for malignant cases, also improved for RBFNN (from 93.55% to 96.63%) and SVM (from 96.07% to 97.47%). The other classifiers lost only a small amount of sensitivity, ranging from 0.96% to 3.09%.
In summary, ICA-based feature reduction seems to be a very promising approach to improving malignant breast
cancer detection when used with RBFNN. At the same time, it reduces the complexity without a significant loss
of accuracy.

7. REFERENCES

[1] I. Christoyianni, E. Dermatas, and G. Kokkinakis, "Fast detection of masses in computer-aided mammography," IEEE Signal Processing Magazine, vol. 17, no. 1, pp. 54–64, 2000.

[2] N. Salim, "Medical diagnosis using neural networks," Faculty of Information Technology, 2013. Available at Generation5.

[3] A. Tartar, N. Kilic, and A. Akan, "Classification of pulmonary nodules by using hybrid features," Computational and Mathematical Methods in Medicine, vol. 2013, Article ID 148363, 2013.

[4] N. Kilic, O. N. Ucan, and O. Osman, "Colonic polyp detection in CT colonography with fuzzy rule based 3D template matching," Journal of Medical Systems, vol. 33, no. 1, pp. 9–18, 2009.

[5] A. Mert, N. Kilic, and A. Akan, "Diagnosing arrhythmia beats using bagging ensemble techniques and time-domain features," Neural Computing and Applications, vol. 24, no. 2, pp. 317–326, 2014.

[6] R. W. Brause, "Medical analysis and diagnosis by neural networks," in Proceedings of the 2nd International Symposium on Medical Data Analysis (ISMDA '01), pp. 1–13, Madrid, Spain, October 2001.

[7] T. S. Subashini, V. Ramalingam, and S. Palanivel, "Breast mass classification based on cytological patterns using RBFNN and SVM," Expert Systems with Applications, vol. 36, no. 3, pp. 5284–5290, 2009.

[8] M. N. Gurcan, H.-P. Chan, B. Sahiner, L. Hadjiiski, N. Petrick, and M. A. Helvie, "Optimal neural network architecture selection: improvement in computerized detection of microcalcifications," Academic Radiology, vol. 9, no. 4, pp. 420–429, 2002.

[9] A. P. Dhawan, Y. Chitre, C. Bonasso, and K. Wheeler, "Radial-basis-function based classification of mammographic microcalcifications using texture features," in Proceedings of the 17th IEEE Engineering in Medicine and Biology Annual Conference, pp. 535–536, September 1995.

[10] A. T. Azar and S. A. El-Said, "Superior neuro-fuzzy classification systems," Neural Computing and Applications, vol. 23, no. 1, pp. 55–72, 2012.

[11] M. Jia, C. Zhao, F. Wang, and D. Niu, "A new method for decision on the structure of RBF neural network," in Proceedings of the 2006 International Conference on Computational Intelligence and Security, pp. 147–150, November 2006.

[12] J. K. Sing, S. Thakur, D. K. Basu, M. Nasipuri, and M. Kundu, "High-speed face recognition using self-adaptive radial basis function neural networks," Neural Computing and Applications, vol. 18, no. 8, pp. 979–990, 2009.

[13] R. Huang, L. Law, and Y. Cheung, "An experimental study on reducing RBF input dimension by ICA and PCA," in Proceedings of the 2002 International Conference on Machine Learning and Cybernetics, vol. 4, pp. 1941–1945, November 2002.

[14] V. N. Vapnik, Statistical Learning Theory, John Wiley & Sons, New York, NY, USA, 1998.

[15] M. F. Akay, "Support vector machines combined with feature selection for breast cancer diagnosis," Expert Systems with Applications, vol. 36, no. 2, pp. 3240–3247, 2009.

[16] B. Wang, H. Huang, and X. Wang, "A support vector machine based MSM model for financial short-term volatility forecasting," Neural Computing and Applications, vol. 22, no. 1, pp. 21–28, 2013.

[17] M. Pontil and A. Verri, "Support vector machines for 3D object recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 6, pp. 637–646, 1998.

[18] J. Zhou, G. Su, C. Jiang, Y. Deng, and C. Li, "A face and fingerprint identity authentication system based on multi-route detection," Neurocomputing, vol. 70, no. 4–6, pp. 922–931, 2007.

[19] E. Gumus, N. Kilic, A. Sertbas, and O. N. Ucan, "Evaluation of face recognition techniques using PCA, wavelets and SVM," Expert Systems with Applications, 2010.
