Survival Analysis of Thyroid Cancer Patients Using Machine Learning Algorithms

Received 21 December 2023, accepted 9 April 2024, date of publication 22 April 2024, date of current version 7 May 2024.

Digital Object Identifier 10.1109/ACCESS.2024.3392275

Survival Analysis of Thyroid Cancer Patients Using Machine Learning Algorithms

SAADAT M. ALHASHMI 1, MD. SHOHIDUL ISLAM POLASH 2, AMINUL HAQUE 2, FAZLEY RABBE 3,
SHAZZAD HOSSEN 2, NURUZZAMAN FARUQUI 4, IBRAHIM ABAKER TARGIO HASHEM 5,
AND NIRASE FATHIMA ABUBACKER 6
1 Department of Information Systems, University of Sharjah, Sharjah, United Arab Emirates
2 Department of Computer Science and Engineering, Daffodil International University, Daffodil Smart City, Savar, Dhaka 1216, Bangladesh
3 Department of Information Technology, Frankfurt University of Applied Sciences, 60318 Frankfurt am Main, Germany
4 Department of Software Engineering, Daffodil International University, Daffodil Smart City, Savar, Dhaka 1216, Bangladesh
5 Department of Computer Science, University of Sharjah, Sharjah, United Arab Emirates
6 School of Computing, Asia Pacific University, Kuala Lumpur 57000, Malaysia

Corresponding author: Aminul Haque ([email protected])

ABSTRACT The medical community strives continually to improve the quality of care patients receive.
Predictions of prognosis are essential for doctors and patients to choose a course of treatment. Recent years
have witnessed the development of numerous new cancer survival prediction models. Most attempts to
predict the prognosis of people with malignant growth rely on classification techniques. We could experiment
with significantly different results using only a subset of SEER (Surveillance, Epidemiology, and End
Results) data. These models were created using machine learning techniques by selecting univariate features
and calculating correlations. We illustrated the variation in results and discrepancy of impurity that can result
from varying data quantities and critical factors. Seventeen crucial factors were identified, and a group of
classification algorithms were trained to evaluate the effectiveness of an estimation technique. In the display
mode, the accuracy of these computations ranges from 97% to 99%. Along with accuracy, the models are
further evaluated regarding the F1 score, precision, recall, and the AUC score. Compared to earlier studies,
a more accurate model has been developed, and, to the best of our knowledge, our prediction model is
superior to the models studied in the previous works.

INDEX TERMS Logistic regression, machine learning, random forest, thyroid survivability.
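
The evaluation measures named in the abstract (accuracy, precision, recall, F1 score, and AUC) can be made concrete with a small sketch. The block below is only an illustration using the Python standard library, not the authors' code: the function names, the toy labels and scores, the 0.5 decision threshold, and the convention that label 1 means "survived" are all assumptions made for the example.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for binary labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical, not SEER records): 1 = survived, 0 = did not survive.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]  # assumed model probabilities
y_pred = [1 if s >= 0.5 else 0 for s in scores]    # assumed 0.5 threshold

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```

Accuracy alone can look high on imbalanced survival data, which is why precision, recall, F1, and the threshold-free AUC are usually reported alongside it.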

The associate editor coordinating the review of this manuscript and approving it for publication
was Jenny Mahoney.

2024 The Authors. This work is licensed under a Creative Commons
Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see
https://creativecommons.org/licenses/by-nc-nd/4.0/

VOLUME 12, 2024, page 61978
S. M. Alhashmi et al.: Survival Analysis of Thyroid Cancer Patients

I. INTRODUCTION

Of the various forms of cancer, thyroid carcinoma is the most prevalent endocrine cancer, with a
constant increase in prevalence globally [1]. The Surveillance, Epidemiology, and End Results
(SEER) Program of the National Cancer Institute, USA, contains much information about thyroid
cancer. Over the past decade, progress has been made in cancer studies. Accordingly, categorizing
cancer patients into risk classes is a very active subject of study with obvious therapeutic
applications. More research on the intricacy and difficulties of treatment is required. One
direction for such study is the use of data mining approaches in medicine, as discussed in [2].
The data mining approach is a procedure that collects information from complex data through
clever tactics. Such knowledge discovery can enhance the quality of health and treatment
management. Computer science and data mining techniques are used in decision-making systems to
consider all relevant factors [2]. Because cancer therapy takes so long and costs so much money,
accurate survival projections are a must for providing affordable healthcare.
The clinical importance of thyroid nodules lies in the need to rule out thyroid cancer, which
occurs in 7% to 15% of cases depending on age, sex, radiation exposure history, and family
history, among other factors [3]. More than 90% of all thyroid cancers [4] are classified as
differentiated thyroid cancer (DTC), which includes papillary and follicular cancer. Around
63,000 new cases of thyroid cancer were identified in the United States in 2014 [5], up from
37,200 in 2009 when the most recent ATA recommendations were
issued. The number of new cases per year has climbed from 4.9 per 100,000 in 1975 to 14.3 per
100,000 in 2009.
This trend has been studied using various statistical approaches, including Cox regression,
log-logistic, log-normal, and Kaplan-Meier experimental models [6]. New, cutting-edge data mining
techniques outperform their more established counterparts in terms of versatility and efficiency.
Our investigation began by identifying the factors that affected the target attribute most. The
Sklearn tools Select K-Best and Chi-Square have been employed here. The target attribute is the
survival of the patient. Many machine-learning techniques were originally conceived to solve this
kind of classification problem. The novelty of our work:
(i) Improved Accuracy: Provided a 99.30% accurate prediction model using the Random Forest
classifier, which is superior to Mourad's work [7].
(ii) Data Cleaning: Introduced sophisticated data preprocessing techniques tailored to tackle the
dataset's specific impurities, ensuring the results' quality and reliability.
(iii) Robustness Under Impurity: The Logistic Regression model retains high accuracy even with
significant dataset impurities. This robustness could set a benchmark for future models in
similar contexts.
(iv) Feature Sensitivity Analysis: Delved deep into understanding which features are most
sensitive to impurities and how they influence the Logistic Regression model's performance.
(v) Cross-Validation Techniques: Implemented robust cross-validation strategies, ensuring that
the reported 98.77% accuracy is consistent across different dataset splits and is not a result of
overfitting.
(vi) Scalable Solutions: Provided methodologies that can scale to larger datasets with similar
impurity challenges, ensuring the applicability of our findings to broader contexts.
These methods helped us get closer to the mark when making predictions, and we also examined the
effects of both balanced and unbalanced outcomes on the evaluation. To the best of our knowledge,
our work is superior to that of others, particularly when compared to Mourad's work [7].

II. LITERATURE REVIEW
Accurate cancer survival prediction should aid doctors in making wise choices and creating
successful treatment plans [8]. Meanwhile, it can save many individuals from obtaining
unnecessary treatment and the high medical expenditures that accompany it [9]. Malignant cells
can develop in the thyroid gland's tissues, a condition known as thyroid cancer. In recent
decades, the incidence of thyroid cancer [10] has increased in several countries, including the
US. Relying on past patient treatment, doctors sometimes mispredict longevity. Doctors and
patients need survival estimates to choose the best treatment. Many researchers have attempted to
predict cancer prognosis using machine learning, aiming for estimates that are as accurate as
possible.
Mourad et al. [7] predicted thyroid cancer prognoses using machine learning, feature selection,
and the SEER dataset. Thirty-four clinical variables and 61,362 items make up the dataset. MPL1
was the most accurate of their ANN-based MLP models at 94.5%. Sonuç et al. [11] used machine
learning to classify thyroid conditions as normal, hypothyroid, or hyperthyroid. They used SVM,
RF, DT, NB, LR, K-NN, MLP, and linear discriminant analysis on a 14-attribute dataset of Iraqi
nationals. MLP reached 96.4% accuracy. Wu et al. [12] utilized machine learning to predict
central lymph node metastases. This study used 22 unique characteristics with 5-fold
cross-validation. The 7-variable gradient boosting decision tree model had the greatest ROC
(AUC = 0.731) and decision curves and was chosen as the top model. Park and Lee [13] used machine
learning to predict illness recurrence and analyzed 1040 patients with 12 characteristics. They
evaluated sex, tumor size, and disease recurrence. The Decision Tree model had 95% accuracy.
Duggal and Shukla [14] predicted thyroid conditions using machine learning. They diagnosed
thyroid disease using feature selection and classification methods. Tree-based, recursive, and
univariate feature selections are recommended. Naive Bayes, Support Vector Machines, and Random
Forest classified thyroid illnesses into four classes: Hypothyroid, Hyperthyroid, Sick Euthyroid,
and Euthyroid. The SVM classifier was best, with 96.92% accuracy.
Deep learning has also been used extensively to predict prostate cancer survival. Wen et al. [15]
studied prostate cancer prognoses and employed an artificial neural network as a form of deep
learning. They also used Naive Bayes, Decision Trees, K Nearest Neighbors, and Support Vector
Machines (SVM). They split survival into fewer than 60 months and more than 60 months. The
highest success rate, 85.64%, was achieved by the ANN. Montazeri and Beigzadeh [16] created a
rule-based survival classification system for breast cancer. They used Naive Bayes (NB), Trees
Random Forest (TRF), 1-Nearest Neighbor (1NN), AdaBoost (AD), Support Vector Machine (SVM), RBF
Network (RBFN), and Multilayer Perceptron (MLP) with 10-fold cross-validation on a small dataset
of 900 patients. They measured model accuracy, precision, sensitivity, specificity, and area
under the ROC curve. The Trees Random Forest technique was the most accurate (96%).
Liu et al. [17] investigated the SEER dataset of 107,114 thyroid cancer records to see if ETE
affected cancer prognosis and survival. Liu et al. [18] created a machine learning-based random
forest to predict poor quality of life in thyroid cancer patients. Two hundred sixty thyroid
cancer patients who had undergone thyroidectomy were studied. The training and validation cohorts
had areas under the curve of 0.834 and 0.897. Kukar et al. [19] predicted anaplastic thyroid
cancer survival using machine learning. They enrolled 126 patients and compared machine learning
to statistical studies.
Agrawal et al. [20] used SEER data to forecast lung cancer patient survival. Two of the 11
derived traits they found were highly predictive. Preprocessing, data mining
optimizations, and dataset validations commence. They selected 13 attributes using multiple
attribute selection methodologies. Ensemble voting of five decision tree-based classifiers and
meta-classifiers enhanced prediction. Lundin et al. [21] predicted breast cancer survival using
an artificial neural network. The area under the ROC curve (AUC) measured how effectively
prediction models predicted patient survival rates. The neural network models' 5-, 10-, and
15-year breast cancer-specific survival AUCs were 0.909, 0.886, and 0.883. Logistic regression
AUC values were 0.897, 0.862, and 0.858. The neural network accurately predicted breast cancer
survival after 5, 10, and 15 years. Delen et al. [28] predicted breast cancer survival using data
mining. They used logistic regression and two data mining approaches (artificial neural networks
and decision trees) to develop prediction models. Performance was compared using 10-fold
cross-validation. The decision tree (C5) model had 93.6% accuracy. Jajroudi et al. [2] used
logistic regression alongside an MLP, the best-performing ANN, for survival prediction of thyroid
cancer patients. They drew on published evidence and radiation oncologists to choose important
SEER dataset attributes. They investigated 16 attributes and 7706 data points and estimated 1-,
3-, and 5-year survival. They measured model accuracy, sensitivity, and specificity. Their work
suggested MLP for thyroid cancer survival prediction.
Although approaches and algorithms are well-developed and have an adequate logical foundation,
they frequently encounter difficulties because of the size and properties of the underlying
data [22].
Yet, the benefits of a large sample size for interpreting significant results include that it
permits more accurate estimation of the treatment impact and often makes it simpler to judge the
sample's representativeness and generalize the findings [23]. Many treatments classified as "no
difference from control" in studies with insufficient samples were unfairly examined. Planning
clinical studies should pay more attention to the possibility of missing a critical therapeutic
advancement due to limited sample numbers [24].
Since [23] and [24] make clear that a large sample size in the medical field results in a more
precise estimate of the treatment effect, we collected extensive data for accurate survival
prediction. After reviewing the research works by Jajroudi et al. [2], Park and Lee [13], Duggal
and Shukla [14], Montazeri and Beigzadeh [16], and Liu et al. [17], we noticed that they all used
smaller datasets of 7706, 1040, 7200, 900, and 286 records, respectively. Hence, to obtain the
benefit of a large sample size for interpreting significant results, this paper aims to use a
dataset with more records.
The other major issue arises in datasets that contain unevenly distributed data, also known as
imbalanced data, which gives rise to the problem of class imbalance. The class imbalance problem
occurs when instances of one class are heavily outnumbered by instances of the other classes.
Class imbalance is typical for real-world medical data, particularly cancer data [25]. The
imbalance between majority and minority classes is one of the critical issues with employing data
analysis for diagnosis and therapy [26].
In addition to the significance of data diversity, it is essential to use more methods to narrow
down the best algorithms that suit the data and the business requirements. This paper has
experimented with 14 different methods. In contrast, Jajroudi et al. [2], Mourad et al. [7],
Sonuç et al. [11], Wu et al. [12], Park and Lee [13], Duggal and Shukla [14], Wen et al. [15],
Montazeri and Beigzadeh [16], Liu et al. [17], and Lundin et al. [21] used only a few (fewer
than 10) different methods for thyroid, prostate, and breast cancer. Table 1 presents some more
recent works related to ours.
In addition, the works mentioned above have yet to deal with the problem of imbalanced data. Our
work addresses the data imbalance problem by employing the SMOTE approach based on the work
by [27]. In [27], a heavy class imbalance is examined as a critical problem for their dataset,
and multiple balancing techniques, such as weight balancing and data augmentation, were
considered. In addition to this, our proposed work also investigates a few efficient feature
selection methods that select only the prominent features to attain greater accuracy in
determining whether the patient would survive, compared to the recent research work conducted by
Lee et al. [8], where only seven elements were employed for the prediction. Likewise, Delen et
al. [28] and Thongkam et al. [29] also predicted survival for breast cancer patients. In our
experiments, the best-performing model at the low degree of impurity is Logistic Regression, with
98.77% accuracy, and the best-performing model at the high degree of impurity is the Random
Forest classifier, with 99.30% accuracy.

III. METHODS
Essential procedures include data collection, preparation, feature selection, model creation,
cross-validation, and model testing (Fig. 1).

A. DATA COLLECTION AND TRANSFORMATION
This paper aims to use a dataset with a higher number of records to obtain the benefit of a large
sample size for interpreting significant results. The SEER database from the National Cancer
Institute's SEER program is vital for understanding cancer patterns, trends, and outcomes in the
United States. Its comprehensive nature, longitudinal data collection, and broad coverage of
cancer types make it a valuable tool for research, policy development, and cancer control
efforts. We retrieved 57,155 records on thyroid cancer from the SEER database; however, we had to
drop a majority of the records because many of them contained missing values. In the end, we
considered 25,217 records in total for the analysis. To further set up the data for analysis,
other pre-processing techniques such as feature selection, class-imbalance handling, and class
encoding are applied to the dataset.
Our study attempted to determine the survivability of thyroid cancer patients. To do this, we
have employed several well-known machine learning methods. There were