Cancer Detection and Analysis Using Machine Learning
2022 Second International Conference on Computer Science, Engineering and Applications (ICCSEA) | 978-1-6654-5834-4/22/$31.00 ©2022 IEEE | DOI: 10.1109/ICCSEA54677.2022.9936457
1st Abhishek Verma, Computer Science and Engineering, Lovely Professional University, Jalandhar, India
2nd Cabinet Kumar Shah, Computer Science and Engineering, Lovely Professional University, Jalandhar, India
3rd Veerpal Kaur, Computer Science and Engineering, Lovely Professional University, Jalandhar, India
Abstract—Among the various types of diseases, cancer is considered one of the deadliest in the world. Lung, prostate, and breast cancer are among the cancer types that contribute most to the mortality rate. To address this, our proposed research work includes data collection, followed by analysis and modelling using machine learning techniques. The machine learning models were evaluated and compared on performance metrics such as Accuracy, Precision, Recall, and F1-score. The best-fit algorithm, i.e. the one with the highest accuracy, is Simple Linear Regression for the prediction of lung cancer and Logistic Regression for the prediction of prostate cancer and breast cancer. The selected models have been integrated with a Python script through the Python Flask framework. This allows us to fetch user inputs/responses through forms; according to the values entered by the user, the interface predicts whether the tumor is benign (non-cancerous) or malignant (cancerous). Hence, this study would help in the early detection of cancer in humans, giving a warning and alerting them to seek treatment for the diagnosed cancer. It would also help in reducing the mortality rate of cancer patients while saving their funds.

Keywords—Lung Cancer, Prostate Cancer, Breast Cancer, Machine Learning Algorithms, Performance Metrics, Numpy, Scikit learn, Flask API

I. INTRODUCTION

The number of patients across the globe grows every year, and cancer patients account for the largest share of them. Cancer has turned out to be a major threat to human life. As per the WHO survey report, breast, lung, and prostate cancers have affected the largest number of patients and are considered especially dangerous; the mortality rate has risen rapidly because cancer is usually detected late. To improve cancer screening, this study implemented three major cancer detection machine learning models. To build the ML models, this study used many classification algorithms such as Decision Tree (DT), Random Forest (RF), Logistic Regression (LR), Simple Linear Regression (SLR), K-Nearest Neighbor (K-NN), Support Vector Machine (SVM), and Naive Bayes (NB). These are widely used algorithms for training and testing on such datasets. Among these algorithms, the most accurate one is used at the backend to make predictions, for which this study built a web page using the Python Flask API framework to gather inputs from end-users. Through the most accurate model, this study aims to differentiate benign and malignant tumors, classifying patients into non-cancerous (benign) and cancerous (malignant). In addition, the study has a module that lets users analyze their own data through the web API. For that, the end-user submits a data link and receives a textual analysis (in which missing cases are handled and all relevant information is shown) and a visualization of the data with a single click. As this study has two major modules (making predictions and making analyses), the end-user can gain insight into the data and obtain predictions for three different cancers according to the user's responses. Hence, this paper tries to detect cancer early in humans and help reduce its serious impact on human life. Moreover, this concept will also help to save lives, time, and money.

II. LITERATURE REVIEW

Ö. Günaydin, M. Günay, Ö. Şengel [1] compared lung cancer detection algorithms on the Standard Digital Image Database of the Japanese Society of Radiological Technology using five machine learning algorithms (K-Nearest Neighbors, Support Vector Machine, Naïve Bayes, Decision Tree, Artificial Neural Network) and evaluated the models on Accuracy, Precision, Recall, and the Confusion Matrix.

Dr. M. Srivenkatesh [2], in 2020, analyzed a prostate cancer dataset having 100 samples and 10 features to build a prediction model for prostate cancer by applying different supervised learning techniques (K-Nearest Neighbors, Support Vector Machine, Logistic Regression, Naïve Bayes, Random Forest) and checked the efficiency of the models using performance metrics (Confusion Matrix, Accuracy, Root Mean Square Error (RMSE), Relative Absolute Error (RAE), Mean Absolute Error (MAE), and Kappa Statistics) to find the best-fit model for prediction.

D. E. Gbenga, N. Christopher, D. C. Yetunde [3] analyzed the Wisconsin Breast Cancer dataset (Diagnostic), having 569 observations and 32 attributes, using 10-fold cross-validation and built models using the machine learning algorithms Support Vector Machine, Naïve Bayes, K-Nearest Neighbors, Simple Linear Logistic Regression, AdaBoost,
Authorized licensed use limited to: University of Sunderland. Downloaded on January 23,2024 at 11:19:31 UTC from IEEE Xplore. Restrictions apply.
Fuzzy Unordered Rule Induction, Radial Basis Function, and Decision Tree, which were evaluated on Accuracy, Precision, and F1-Score.

Radhika P R, Rakhi. A. S. Nair, Veena G [4] analyzed a lung cancer dataset obtained from the UCI Machine Learning Repository to build a lung cancer detection model using supervised learning techniques (Logistic Regression, Support Vector Machine, Naïve Bayes, Decision Tree) and obtained the Support Vector Machine model as the best-fit model for prediction.

J. Nuhić, J. Kevrić [5], in 2020, built a prostate cancer detection model on the dataset from Zhou W., 387 samples (included dataset), with the machine learning techniques Multilayer Perceptron, Simple Logistic, Naive Bayes, K-Nearest Neighbor, Logistic Model Trees, Random Forest Classifier, Random Committee, Attribute Selected Classifier for attribute selection, and AdaBoost Classifier, which were evaluated on Sensitivity and Specificity, giving the AdaBoost Classifier as the best-fit classifier.

S. Sharma, A. Aggarwal, T. Choudhury [6], in 2018, built a breast cancer detection model using the Wisconsin Breast Cancer dataset (Original), having 669 observations and 10 attributes, split with 10-fold cross-validation (90% training, 10% testing), and applied the supervised learning techniques Naïve Bayes, Random Forest, and K-Nearest Neighbor. The efficiency of the models was compared using Precision, Recall, and F1-Score to find the best-fit model for prediction.

D. Bazazeh, R. Shubair [7] carried out a comparative study of machine learning algorithms for breast cancer on the Wisconsin Breast Cancer dataset (Original), collected from the UCI Machine Learning Repository, using supervised learning techniques such as Support Vector Machine, Random Forest, and Bayesian Network. The models were evaluated on Accuracy, Specificity, and Precision, and the Support Vector Machine model was obtained as the most accurate model.

M. Amrane, S. Oukid, I. Gagaoua, T. Ensari [8], in 2018, built a breast cancer classifier on the Wisconsin Breast Cancer dataset (Original), having 669 observations and 10 attributes, using the supervised learning techniques Naïve Bayesian Classifier and K-Nearest Neighbor. Evaluated on Accuracy, the K-Nearest Neighbor model was the more accurate of the two models.

B. Sekeroglu, K. Tuncal [9] predicted cancer incidence rates for the European continent using a dataset obtained from the World Health Organization (2018) containing European incidence rates of 29 cancer types from 22 countries. They built models with Linear Regression, Support Vector Regression, Long Short-Term Memory, Back-Propagation, and Radial Basis machine learning techniques by gender category and cancer type, which would help in predicting the incidence rate for future years.

R. Hazra, M. Banerjee, L. Badia [10] built a machine learning model for breast cancer classification on the Wisconsin Breast Cancer dataset (Diagnostic), having 569 observations and 32 attributes, using the supervised learning algorithms Artificial Neural Network and Decision Tree. The best-fit model obtained is the Decision Tree model, which can be used for prediction.

After going through almost 20 research papers, we began our study on building an integrated cancer detection model that can predict lung cancer, prostate cancer, and breast cancer and can be further deployed on the cloud with the help of Python dependencies such as Flask API [11], Joblib, etc. It can be helpful in the early detection or diagnosis of cancer in humans. Ultimately, it would be beneficial in reducing the mortality rate, spreading awareness of the necessity of medical treatment, and saving people money as well.

III. METHODOLOGY

A. Data and pre-processing:

In this study, three different datasets have been used for the detection of three major cancers: lung, breast, and prostate. To model breast cancer detection, the Breast Cancer Wisconsin (Diagnostic) dataset has been used; it contains 569 observations/instances with 33 attributes. In the same way, for prostate cancer, the '100 Sample Prostate Cancer' dataset has been used; it contains 100 observations/instances with 10 attributes. Similarly, for lung cancer, the 'Lung-Cancer-Survey' dataset has been used; it contains 309 observations/instances with 16 attributes. The breast cancer dataset is divided into training and testing parts with a ratio of 7:3 (i.e. 70% train set and 30% test set), and the lung and prostate cancer datasets are divided with a ratio of 8:2 (i.e. 80% train set and 20% test set). A major aspect of this study is consulting doctors to collect the necessary parameters that may cause cancer. By predicting the occurrence of cancer so that patients can be treated in advance, this study offers a chance of reducing the mortality rate.

B. K-Nearest Neighbor (K-NN):

K-Nearest Neighbor is a supervised learning algorithm that classifies data by training on labeled inputs and measuring the similarity between examples, which is then used to predict the class of unknown samples of the dataset.

In breast cancer detection, k = 3 gives the highest accuracy of 95% over the dataset (split ratio 7:3), having 398 observations/instances for the training set and 171 observations/instances for the test set with 33 attributes each. Similarly, in prostate cancer detection, k = 5 gives the best accuracy of 80% over the dataset (split ratio 8:2), having 80 observations/instances for the training set and 20 observations/instances for the test set with 10 attributes each. In the same way, for lung cancer detection, k = 3 gives an accuracy of 89.83% over the dataset (split ratio 8:2), having 248 observations/instances for the training set and 61 observations/instances for the test set with 16 attributes each. Figure 1 compares the fit of K-NN on all three (Breast, Lung, Prostate) cancer datasets.

TABLE I. K-NN PERFORMANCE METRIC MEASURES

             Breast cancer   Lung cancer   Prostate cancer
Accuracy     97%             90%           80%
Precision    97%             98%           95%
Recall       96%             98%           75%
F1-score     97%             98%           85%
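The split-fit-score workflow described for K-NN can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's own code: it uses the copy of the Wisconsin Diagnostic data bundled with scikit-learn (30 feature columns rather than the 33-column file described in the paper), and the feature scaling step, `stratify`, and `random_state=0` are choices made here for illustration only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled Wisconsin Diagnostic data (569 rows, 30 features).
X, y = load_breast_cancer(return_X_y=True)

# 7:3 split as described for the breast cancer dataset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# k = 3 as reported for breast cancer; scaling is an assumption of this sketch,
# added because K-NN is distance based.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# In this bundled dataset label 1 means benign, so precision and recall
# below are computed for the benign class (scikit-learn's default pos_label=1).
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
}
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

The same four calls from `sklearn.metrics` would apply unchanged to the lung and prostate models; only the dataset, the split ratio (8:2), and the value of k differ.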
Fig. 1. Performance Metrics of K-Nearest Neighbors on different Cancers
[Bar chart: Accuracy, Precision, Recall, and F1-score for breast, lung, and prostate cancer]

C. Support Vector Machine (SVM):

Support Vector Machine is a supervised learning technique used for classification. It separates the samples of the dataset into classes using the best hyperplane, which is calculated using one of the different types of kernels available in machine learning.

In breast cancer detection, C=1, kernel='rbf', cache_size=200 gives the best accuracy of 97% over the dataset (split ratio 7:3), having 398 observations/instances for the training set and 171 observations/instances for the test set with 33 attributes each. Similarly, in prostate cancer detection, C=10, kernel='rbf', random_state=9 gives the best accuracy of 80% over the dataset (split ratio 8:2), having 80 observations/instances for the training set and 20 observations/instances for the test set with 10 attributes each. In the same way, for lung cancer detection, "kernel": ["rbf"], "gamma": [0.001, 0.01, 0.1, 1], "C": [1, 10, 50, 100, 200, 300, 1000], random_state=42 gives the best accuracy of 90.69% over the dataset (split ratio 8:2), having 248 observations/instances for the training set and 61 observations/instances for the test set with 16 attributes each. Figure 2 compares the fit of SVM on all three (Breast, Lung, Prostate) cancer datasets.

TABLE II. SVM PERFORMANCE METRICS MEASURES

             Breast cancer   Lung cancer   Prostate cancer
Accuracy     97%             89%           80%
Precision    97%             98%           95%
Recall       98%             97%           75%
F1-score     98%             97%           85%

D. Logistic Regression (LR):

Logistic Regression is a supervised learning algorithm applied to binary classification problems and based on the concept of probability. It predicts the output of a categorical dependent feature and gives results in the form of yes or no, 0 or 1, true or false. It maps its output to a probability using the sigmoid function, also known as the logistic function.

In breast cancer detection, solver='liblinear', C=1, max_iter=30, multi_class='auto' gives the best accuracy of 98% over the dataset (split ratio 7:3), having 398 observations/instances for the training set and 171 observations/instances for the test set with 33 attributes each. Similarly, in prostate cancer detection, C=10 gives the best accuracy of 90% over the dataset (split ratio 8:2), having 80 observations/instances for the training set and 20 observations/instances for the test set with 10 attributes each. In the same way, for lung cancer detection, "C": np.logspace(-3, 3, 7), "penalty": ["l2", "l2"] gives the best accuracy of 91.91% over the dataset (split ratio 8:2), having 248 observations/instances for the training set and 61 observations/instances for the test set with 16 attributes each. Figure 3 compares the fit of LR on all three (Breast, Lung, Prostate) cancer datasets.

TABLE III. LR PERFORMANCE METRICS MEASURES

             Breast cancer   Lung cancer   Prostate cancer
Accuracy     98%             91%           90%
Precision    98%             98%           97%
Recall       99%             98%           87%
F1-score     98%             98%           93%

Fig. 3. Performance Metrics of Logistic Regression on different Cancers
[Bar chart: Accuracy, Precision, Recall, and F1-score for breast, lung, and prostate cancer]

E. Random Forest (RF):
attributes each. In the same way, for Lung Cancer detection, "max_features": [1, 3, 10], "min_samples_split": [2, 3, 10], "min_samples_leaf": [1, 3, 10], "bootstrap": [False], "n_estimators": [100, 300], "criterion": ["gini"] gives the best accuracy of 89.06% over the dataset (split ratio 8:2), having 248 observations/instances for the training set and 61 observations/instances for the test set with 16 attributes each. Figure 4 shows the comparison over the fitting of RF on all three (Breast, Lung, Prostate) cancer datasets.

Fig. 5. Performance Metrics of Decision Tree on different Cancers
[Bar chart: Accuracy, Precision, Recall, and F1-score for breast, lung, and prostate cancer]
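The lung cancer settings for SVM, LR, and RF are quoted as lists of candidate values rather than single values, which suggests a grid search over those candidates. The following is a minimal sketch of such a search, assuming scikit-learn's `GridSearchCV` with 5-fold cross-validation; the paper does not state which search utility or fold count was used. It uses the SVM grid because it runs quickly, and synthetic data from `make_classification` stands in for the lung cancer survey data, so the scores it prints are not comparable to the reported ones.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Assumption: synthetic stand-in with 309 rows and 15 predictors,
# mirroring only the size of the lung cancer dataset, not its content.
X, y = make_classification(n_samples=309, n_features=15, random_state=42)

# 8:2 split as described for the lung cancer dataset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Candidate values as quoted for the lung cancer SVM model.
param_grid = {
    "kernel": ["rbf"],
    "gamma": [0.001, 0.01, 0.1, 1],
    "C": [1, 10, 50, 100, 200, 300, 1000],
}

# Assumption: 5-fold cross-validation scored on accuracy.
search = GridSearchCV(SVC(random_state=42), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# GridSearchCV refits the best combination on the full training set,
# so the search object can predict directly on the held-out test set.
test_accuracy = accuracy_score(y_test, search.predict(X_test))
print("best parameters:", search.best_params_)
print(f"cross-validated accuracy: {search.best_score_:.2%}")
print(f"test accuracy: {test_accuracy:.2%}")
```

The Random Forest grid would plug into the same call by swapping `SVC` for `RandomForestClassifier` and replacing `param_grid`, at the cost of a noticeably longer run because of the larger number of combinations and trees.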
(3)

4. F1 score: It gives the harmonic mean of Precision and Recall, with the following equation:

$$F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}$$

IV. RESULT

Our study has covered supervised machine learning techniques, mainly the classification algorithms Random Forest (RF), Decision Tree (DT), Support Vector Machine (SVM), Logistic Regression (LR), Simple Linear Regression (SLR), Naive Bayes (NB), and K-Nearest Neighbor (K-NN). The study was implemented on two systems, with i3 and i5 processors and 4 GB and 8 GB of RAM respectively. The dependencies used include the Flask API framework, Numpy, Plotly, Scikit-learn, joblib, jinja2, json, and others. The code was run in the open source software Visual Studio Code (VS Code) and Jupyter Notebook.

Out of the three (Breast, Lung, Prostate) cancer datasets, the breast dataset has a split ratio of 7:3 and the other two datasets have a split ratio of 8:2, as these gave the highest accuracy. In breast cancer detection, four classification algorithms were applied: LR (98%), KNN (97%), SVM (97%), and RF (95%). Among these four, LR was selected to compute user input at the backend of the Flask web API. Similarly, for prostate cancer detection, six classification algorithms were applied: LR (90%), SVM (80%), NB (80%), RF (80%), KNN (80%), and DT (75%). Among these, LR was selected to compute user input at the backend of the Flask web API. In the same way, for lung cancer detection, six algorithms were applied: SLR (96.77%), LR (91.91%), SVM (90.69%), KNN (89.83%), RF (89.06%), and DT (87.44%). So, SLR was selected to compute user input at the backend of the Flask web API.

The web page view in Figure 6 allows end-users to give specific inputs, which are collected at the backend by HTML forms and passed to the Python script through the Flask API. After computing the prediction, Python sends the results back to the web page using the render_template method.

V. CONCLUSION

Among the various data mining techniques for making predictions over medical data, our study involves widely used classification algorithms (supervised learning techniques) to make predictions for all three cancer types. Logistic Regression fits best, with the highest performance metrics, in breast cancer detection (accuracy of 98%) and prostate cancer detection (accuracy of 90%), while Simple Linear Regression fits best in lung cancer detection, with an accuracy of 96.70%. Thus, early detection will help numerous patients get early medical treatment, which will gradually reduce the mortality rate. The study could have been more effective had Deep Learning played a role.

VI. FUTURE SCOPE

Deep Learning would bring a new turn to this research work, where the interface could directly scan images of patient organs, MRI scan reports, etc. This research could also incorporate data privacy protocols to secure patient data efficiently.

REFERENCES

[1] Ö. Günaydin, M. Günay, Ö. Şengel, “Comparison of Lung Cancer Detection Algorithms”, Scientific Meeting on Electrical-Electronics & Biomedical Engineering and Computer Science (EBBT), 2019.
[2] Dr. M. Srivenkatesh, “Prediction of Prostate Cancer using Machine Learning Algorithms”, International Journal of Recent Technology and Engineering (IJRTE), 2020.
[3] D. E. Gbenga, N. Christopher, D. C. Yetunde, “Performance Comparison of Machine Learning Techniques for Breast Cancer Detection”, Nova Journal of Engineering and Applied Sciences, 2017.
[4] Radhika P R, Rakhi. A. S. Nair, Veena G, “A Comparative Study of Lung Cancer Detection”, IEEE International Conference on Electrical, Computer and Communication Technologies (ICECCT), 2018.
[5] J. Nuhić and J. Kevrić, “Prostate Cancer Detection Using Different Classification Techniques”, International Conference on Medical and Biological Engineering, 2020.
[6] S. Sharma, A. Aggarwal, T. Choudhury, “Breast Cancer Detection Using Machine Learning Algorithms”, International Conference on Computational Techniques, Electronics and Mechanical Systems (CTEMS), 2018.
[7] D. Bazazeh and R. Shubair, “Comparative Study of Machine Learning Algorithms for Breast Cancer Detection and Diagnosis”, 5th International Conference on Electronic Devices, Systems and Applications (ICEDSA), 2016.
[8] M. Amrane, S. Oukid, I. Gagaoua, T. Ensari, “Breast Cancer Classification Using Machine Learning”, Electric Electronics, Computer Science, Biomedical Engineerings' Meeting (EBBT), 2018.
[9] B. Sekeroglu and K. Tuncal, “Prediction of cancer incidence Rates for the European continent Using machine learning models”, Health Informatics Journal, 2021.
[10] R. Hazra, M. Banerjee, L. Badia, “Machine Learning for Breast Cancer Classification with ANN and Decision Tree”, 11th IEEE Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), 2021.
[11] A. Yaganteeswarudu, “Multi Disease Prediction Model by using Machine Learning and Flask API”, Fifth International Conference on Communication and Electronics Systems (ICCES), 2020.