
Received 21 December 2023, accepted 9 April 2024, date of publication 22 April 2024, date of current version 7 May 2024.

Digital Object Identifier 10.1109/ACCESS.2024.3392275

Survival Analysis of Thyroid Cancer Patients Using Machine Learning Algorithms
SAADAT M. ALHASHMI 1, MD. SHOHIDUL ISLAM POLASH 2, AMINUL HAQUE 2, FAZLEY RABBE 3, SHAZZAD HOSSEN 2, NURUZZAMAN FARUQUI 4, IBRAHIM ABAKER TARGIO HASHEM 5, AND NIRASE FATHIMA ABUBACKER 6
1 Department of Information Systems, University of Sharjah, Sharjah, United Arab Emirates
2 Department of Computer Science and Engineering, Daffodil International University, Daffodil Smart City, Savar, Dhaka 1216, Bangladesh
3 Department of Information Technology, Frankfurt University of Applied Sciences, 60318 Frankfurt am Main, Germany
4 Department of Software Engineering, Daffodil International University, Daffodil Smart City, Savar, Dhaka 1216, Bangladesh
5 Department of Computer Science, University of Sharjah, Sharjah, United Arab Emirates
6 School of Computing, Asia Pacific University, Kuala Lumpur 57000, Malaysia

Corresponding author: Aminul Haque ([email protected])

ABSTRACT The medical community strives continually to improve the quality of care patients receive.
Predictions of prognosis are essential for doctors and patients to choose a course of treatment. Recent years
have witnessed the development of numerous new cancer survival prediction models. Most attempts to
predict the prognosis of people with malignant growth rely on classification techniques. We could experiment
with significantly different results using only a subset of SEER (Surveillance, Epidemiology, and End
Results) data. These models were created using machine learning techniques by selecting univariate features
and calculating correlations. We illustrated the variation in results and discrepancy of impurity that can result
from varying data quantities and critical factors. Seventeen crucial factors were identified, and a group of
classification algorithms were trained to evaluate the effectiveness of an estimation technique. In the display
mode, the accuracy of these computations ranges from 97% to 99%. Along with accuracy, the models are
further evaluated regarding the F1 score, precision, recall, and the AUC score. Compared to earlier studies,
a more accurate model has been developed, and, to the best of our knowledge, our prediction model is
superior to the models studied in the previous works.

INDEX TERMS Logistic regression, machine learning, random forest, thyroid survivability.

The associate editor coordinating the review of this manuscript and approving it for publication was Jenny Mahoney.
2024 The Authors. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/
61978 VOLUME 12, 2024
S. M. Alhashmi et al.: Survival Analysis of Thyroid Cancer Patients

I. INTRODUCTION
Thyroid carcinoma is the most prevalent endocrine cancer, and its prevalence is rising steadily worldwide [1]. The Surveillance, Epidemiology, and End Results (SEER) Program of the National Cancer Institute, USA, holds extensive data on thyroid cancer. Cancer research has progressed over the past decade, and stratifying cancer patients into risk classes is an active area of study with clear therapeutic applications. More research is needed on the intricacy and difficulties of treatment, and data mining approaches are one way to pursue it [2]. Data mining extracts information from complex data, and the knowledge it uncovers can improve the quality of health and treatment management. Computer science and data mining techniques are used in decision-making systems to consider all relevant factors [2]. Because cancer therapy is long and expensive, accurate survival projections are essential for providing affordable healthcare. The clinical importance of thyroid nodules lies in the need to rule out thyroid cancer, which occurs in 7% to 15% of cases depending on age, sex, radiation exposure history, and family history, among other factors [3]. More than 90% of all thyroid cancers [4] are classified as differentiated thyroid cancer (DTC), which includes papillary and follicular cancers. Around 63,000 new cases of thyroid cancer were identified in the United States in 2014 [5], up from 37,200 in 2009, when the most recent ATA recommendations were
issued. The number of new cases per year has climbed from 4.9 per 100,000 in 1975 to 14.3 per 100,000 in 2009.
This trend has been studied using various statistical approaches, including Cox regression, log-logistic, log-normal, and Kaplan-Meier models [6]. New, cutting-edge data mining techniques outperform their more established counterparts in terms of versatility and efficiency. Our investigation began by identifying the factors that most affect the target attribute, which is the patient's survival. The Sklearn tools SelectKBest and chi-square were employed for this. Many machine learning techniques were originally conceived to solve this kind of classification problem. The novelty of our work:
(i) Improved Accuracy: Provided a 99.30% accurate prediction model using the Random Forest classifier, which is superior to Mourad's work [7].
(ii) Data Cleaning: Introduced sophisticated data preprocessing techniques tailored to tackle the dataset's specific impurities, ensuring the results' quality and reliability.
(iii) Robustness Under Impurity: The Logistic Regression model retains high accuracy even with significant dataset impurities. This robustness could set a benchmark for future models in similar contexts.
(iv) Feature Sensitivity Analysis: Examined which features are most sensitive to impurities and how they influence the Logistic Regression model's performance.
(v) Cross-Validation Techniques: Implemented robust cross-validation strategies, ensuring that the reported 98.77% accuracy is consistent across different dataset splits and is not a result of overfitting.
(vi) Scalable Solutions: Provided methodologies that can scale to larger datasets with similar impurity challenges, ensuring the applicability of our findings to broader contexts.
These methods brought our predictions closer to the mark, and we examined how both balanced and unbalanced outcome classes affect the evaluation. To the best of our knowledge, our work outperforms earlier studies, particularly Mourad's work [7].

II. LITERATURE REVIEW
Accurate cancer survival prediction should aid doctors in making wise choices and creating successful treatment plans [8]. It can also save many individuals from unnecessary treatment and the high medical expenditures that accompany it [9]. Thyroid cancer is a condition in which malignant cells develop in the tissues of the thyroid gland. In recent decades, the incidence of thyroid cancer [10] has increased in several countries, including the US. Relying on experience with past patients, doctors sometimes mispredict survival time. Doctors and patients need survival estimates to choose the best treatment. Many researchers have attempted to predict cancer prognosis using machine learning, aiming for estimates that are as accurate as possible.
Mourad et al. [7] predicted thyroid cancer prognoses using machine learning, feature selection, and the SEER dataset. The dataset comprises thirty-four clinical variables and 61,362 records. MPL1 was the most accurate of their ANN-based MLP models at 94.5%. Sonuç et al. [11] used machine learning to classify thyroid conditions as normal, hypothyroid, or hyperthyroid. They used SVM, RF, DT, NB, LR, K-NN, MLP, and linear discriminant analysis on a 14-attribute dataset of Iraqi nationals. MLP achieved 96.4% accuracy. Wu et al. [12] utilized machine learning to predict central lymph node metastases. This study used 22 unique characteristics and 5-fold cross-validation. The 7-variable gradient boosting decision tree model had the best ROC (AUC = 0.731) and decision curves and was chosen as the top model. Park and Lee [13] used machine learning to predict disease recurrence and analyzed 1040 patients with 12 characteristics. They evaluated sex, tumor size, and disease recurrence. The Decision Tree model had 95% accuracy. Duggal and Shukla [14] predicted thyroid conditions using machine learning. They diagnosed thyroid disease using feature selection and classification methods. Tree-based, recursive, and univariate feature selections are recommended. Naive Bayes, Support Vector Machines, and Random Forest classified thyroid conditions into four classes: Hypothyroid, Hyperthyroid, Sick Euthyroid, and Euthyroid. The SVM classifier was best, with 96.92% accuracy.
Deep learning has also been used extensively to predict prostate cancer survival. Wen et al. [15] studied prostate cancer prognoses and employed an artificial neural network as a form of deep learning. They also used Naive Bayes, Decision Trees, K Nearest Neighbors, and Support Vector Machines (SVM). They split survival into fewer than 60 months and more than 60 months. The highest success rate, 85.64%, was achieved by the ANN. Montazeri and Beigzadeh [16] created a rule-based survival classification system for breast cancer. They used Naive Bayes (NB), Trees Random Forest (TRF), 1-Nearest Neighbor (1NN), AdaBoost (AD), Support Vector Machine (SVM), RBF Network (RBFN), and Multilayer Perceptron (MLP) with 10-fold cross-validation on a small dataset of 900 patients. They measured model accuracy, precision, sensitivity, specificity, and area under the ROC curve. The Trees Random Forest technique was the most accurate (96%).
Liu et al. [17] investigated the SEER dataset of 107,114 thyroid cancer records to see if extrathyroidal extension (ETE) affected cancer prognosis and survival. Liu et al. [18] created a machine learning-based random forest to predict poor quality of life in thyroid cancer patients. Two hundred sixty thyroid cancer patients who had undergone thyroidectomy were studied. The training and validation cohorts had areas under the curve of 0.834 and 0.897. Kukar et al. [19] predicted anaplastic thyroid cancer survival using machine learning. They enrolled 126 patients and compared machine learning to statistical studies.
Agrawal et al. [20] used SEER data to forecast lung cancer patient survival. Two of the 11 derived traits they found were highly predictive. Preprocessing, data mining
optimizations, and dataset validations commence. They selected 13 attributes using multiple attribute selection methodologies. Ensemble voting of five decision tree-based classifiers and meta-classifiers enhanced prediction. Lundin et al. [21] predicted breast cancer survival using an artificial neural network. The area under the ROC curve (AUC) measured how effectively prediction models predicted patient survival. The neural network models' 5-, 10-, and 15-year breast cancer-specific survival AUCs were 0.909, 0.886, and 0.883. Logistic regression AUC values were 0.897, 0.862, and 0.858. The neural network thus predicted breast cancer survival after 5, 10, and 15 years more accurately. Delen et al. [28] predicted breast cancer survival using data mining. They used logistic regression and two data mining approaches (artificial neural networks and decision trees) to develop prediction models, and compared performance using 10-fold cross-validation. The decision tree (C5) model had 93.6% accuracy. Jajroudi et al. [2] used Logistic Regression and MLP, the optimal ANN, for survival prediction of thyroid cancer patients. They used evidence and radiation oncologists to choose important SEER dataset attributes. They investigated 16 attributes and 7706 data points, estimated 1-, 3-, and 5-year survival, and measured model accuracy, sensitivity, and specificity. Their work suggested MLP for thyroid cancer survival prediction.
Although approaches and algorithms are well-developed and have an adequate logical foundation, they frequently encounter difficulties because of the size and properties of the underlying data [22].
Yet, a large sample size benefits the interpretation of significant results: it permits more accurate estimation of the treatment impact and often makes it simpler to judge the sample's representativeness and generalize the findings [23]. Many treatments classified as "no difference from control" in studies with insufficient samples were unfairly examined. Planning of clinical studies should pay more attention to the possibility of missing a critical therapeutic advancement due to limited sample numbers [24].
Based on [23] and [24], it is clear that a large sample size in the medical field results in a more precise estimate of the treatment effect, so we have collected extensive data for accurate survival prediction. The research works by Jajroudi et al. [2], Park and Lee [13], Duggal and Shukla [14], Montazeri and Beigzadeh [16], and Liu et al. [17] all used smaller datasets of 7706, 1040, 7200, 900, and 286 records, respectively. To obtain the benefit of a large sample size for interpreting significant results, this paper therefore uses a dataset with more records.
The other major issue arises in datasets that contain unevenly distributed data, also known as imbalanced data, which gives rise to the problem of class imbalance. The class imbalance problem arises when instances of one class are heavily outnumbered by instances of the other. Class imbalance is typical for real-world medical data, particularly cancer data [25]. The imbalance between majority and minority classes is one of the critical issues with employing data analysis for diagnosis and therapy [26].
In addition to the significance of data diversity, it is essential to use more methods to narrow down the best algorithms that suit the data and the business requirements. This paper has experimented with 14 different methods, whereas Jajroudi et al. [2], Mourad et al. [7], Sonuç et al. [11], Yijun Wu et al. [12], Park and Lee [13], Duggal and Shukla [14], Wen et al. [15], Montazeri and Beigzadeh [16], Liu et al. [17], and Lundin et al. [21] used only a few (fewer than 10) different methods for thyroid, prostate, and breast cancer. Table 1 presents some more recent works related to ours.
In addition, the works mentioned above have yet to deal with the problem of imbalanced data. Our work addresses the data imbalance problem by employing the SMOTE approach based on the work by [27]. In that work, a heavy class imbalance is examined as a critical problem for the dataset, and multiple balancing techniques, such as weight balancing and data augmentation, were considered. Our proposed work also investigates a few efficient feature selection methods that select only the prominent features, to attain greater accuracy in determining whether the patient would survive than the recent work by Lee et al. [8], where only seven features were employed for the prediction. Likewise, Delen et al. [28] and Thongkam et al. [29] predicted survival for breast cancer patients. In our results, the best-performing model at the low degree of impurity is Logistic Regression, with 98.77% accuracy, and the best-performing model at the high degree of impurity is the Random Forest classifier, with 99.30% accuracy.

III. METHODS
Essential procedures include data collection, preparation, feature selection, model creation, cross-validation, and model testing (Fig. 1).

A. DATA COLLECTION AND TRANSFORMATION
This paper aims to use a dataset with a higher number of records to obtain the benefit of a large sample size for interpreting significant results. The SEER database from the National Cancer Institute's SEER program is vital for understanding cancer patterns, trends, and outcomes in the United States. Its comprehensive nature, longitudinal data collection, and broad coverage of cancer types make it a valuable tool for research, policy development, and cancer control efforts. We retrieved 57155 records on thyroid cancer from the SEER database; however, we had to drop a majority of the records because many had missing values. In the end, we considered 25217 records in total for the analysis. To further prepare the data for analysis, other pre-processing techniques such as feature selection, class imbalance handling, and class encoding are applied to the dataset.
Our study attempted to determine the survivability of thyroid cancer patients. To do this, we have employed several well-known machine learning methods. There were

TABLE 1. Related works.

FIGURE 1. Procedural architecture.

three separate sessions of experiments. Each session is again divided into halves, one exploring the balanced data and the other the imbalanced data. We observed that outcomes shift in both settings to varying degrees, depending on the relative relevance of the attributes. On the imbalanced data, logistic regression outperformed the other methods, with an area under the curve (AUC) of 0.96, an F1 score of 0.76, and an accuracy of 98.77%. The results improved, however, when we balanced the two target classes: the accuracy obtained with a random forest classifier was nearly 100%. Our results are reported for the 17 best features under both scenarios. Moreover, our research indicates that SMOTE is the most advantageous way of handling the imbalance.

B. FEATURE SELECTION
The first problem is that not all features have the same impact on the target attribute, so it is essential to identify the features that do have an effect. Chi-square and SelectKBest, together with a correlation measure, were employed for this purpose. These two tests have been conducted using

TABLE 2. One-hot encoding transformation.

Label Encoding, transforming the dataset into a numerical representation. Label encoding gives each label a numerical expression that computers can process. The method is straightforward: each distinct value in a column is turned into a number. Consider a dataset of bridges with the following column values for "bridge types": arch, beam, truss, cantilever, tied arch, suspension, and cable. We encode the text values by assigning a running sequence number to each: 0, 1, 2, 3, 4, 5, and 6.

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} \quad (1)$$

In equation 1, $\chi^2$ indicates the value of chi-squared. $O_i$ and $E_i$ refer to the observed value and expected value, respectively.
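As a minimal sketch of the two steps just described, the Python below label-encodes the bridge-type example and evaluates equation 1 on a small contingency table. It is an illustration under stated assumptions, not the authors' code: the helper names `label_encode` and `chi_squared` and the toy `feature`/`outcome` values are hypothetical, and the statistic is the textbook Pearson form over observed and expected cell counts, which may not match the numbers produced by the Sklearn scorer the paper reports using.

```python
from collections import Counter


def label_encode(values):
    """Map each distinct label to a running integer, in order of first appearance."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
    return [mapping[value] for value in values], mapping


def chi_squared(feature, target):
    """Pearson chi-squared statistic: sum over cells of (O - E)^2 / E."""
    n = len(feature)
    observed = Counter(zip(feature, target))   # O_i for each (feature, target) cell
    feature_totals = Counter(feature)
    target_totals = Counter(target)
    stat = 0.0
    for f, f_count in feature_totals.items():
        for t, t_count in target_totals.items():
            expected = f_count * t_count / n   # E_i under independence
            stat += (observed[(f, t)] - expected) ** 2 / expected
    return stat


# The bridge-type example from the text: seven labels become codes 0..6.
bridge_types = ["arch", "beam", "truss", "cantilever",
                "tied arch", "suspension", "cable"]
codes, mapping = label_encode(bridge_types)

# Toy data, made up for illustration: an encoded feature and a binary outcome.
feature = [0, 0, 0, 0, 1, 1, 1, 1]
outcome = [1, 1, 1, 0, 0, 0, 0, 1]
score = chi_squared(feature, outcome)
print(codes, score)
```

A feature-ranking step in the spirit of SelectKBest would compute such a score for every encoded column against the survival label and keep the k highest-scoring columns.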

C. CLASS ENCODING
The second step is to construct a prediction model using machine learning techniques. We must use one-hot encoding [30] to turn our nominal attributes into numbers. One-hot encoding encodes categorical variables so that machine learning algorithms can use them for prediction. Each category of a nominal attribute is encoded as its own binary column: a 1 is entered in the column if that category is present, and a 0 otherwise. Take a dataset with the column "Race recode" as an example. "Race recode" has the classes white, black, and other. Each category value gets a brand-new column holding a 1 or 0 (true/false) (Table 2). If one column in a row has the value 1 (true), then all other category columns in that row have the value 0 (false), and the same holds for every other row.

D. HANDLING IMBALANCED DATA
Fourteen machine learning techniques are used to create prediction models from each group's data. Each category has two datasets: one with an imbalanced proportion of the two target classes and one with a balanced distribution of the two classes. The ratio of 24605:412 is exceptionally skewed and indicative of a low level of impurities. The Synthetic Minority Over-sampling Technique (SMOTE) has been applied to address the issue [31]. To create a new sample, SMOTE picks samples that are close together in the feature space, draws a line between them, and picks a point on that line. This method helps produce a cohesive yet evenly distributed dataset. It is a standard method for balancing datasets. The same process was used to develop models for each category. Fig. 2 shows how predictive models are made.

FIGURE 2. Types of predictive models.

E. MODELS CREATION AND RESULT ANALYSIS
A total of 14 models have been created, and their results have been reviewed. The algorithms used are: Decision Tree [16], [18], [20], [23], Random Forest [18], [43], Extra Tree [33], Ada Boost [32], Gradient Boosting, Stochastic Gradient Descent, Hist Gradient Boost, Light Gradient Boost, K-Nearest Neighbor, Naive Bayes, Logistic Regression, Bagging classifier, Multilayer Perceptron, and Voting Classifier. Accuracy, F1 score, recall, precision, and AUC score [8], [18], [19], [25], [34], [35], [36], [37] have been used to analyze the models' performance. These performance measurements have been calculated on a held-out test set of 20% of the data. In addition, we employ 10-fold cross-validation to obtain more reliable accuracy estimates.

IV. RESULTS AND DISCUSSION
A. FEATURE IMPORTANCE
The features with which the model is built are determined with the help of the SelectKBest and chi-square library. For better understanding, the correlation among attributes has also been calculated. A correlation coefficient between 0.9 and 1.0 indicates a very high positive

FIGURE 3. Correlation heatmap of highly correlated features.

correlation [35]. If it is 0.7 to 0.89, it has a high positive correlation. Similarly, correlation coefficients from 0.5 to 0.69, 0.3 to 0.49, and 0.0 to 0.29 indicate a moderate positive, low positive, and negligible correlation, respectively. A negative value of the correlation coefficient indicates a negative association. The correlation heatmap of highly correlated features is shown in Fig. 3.
"CS lymph nodes" has a moderate positive correlation with "Derived AJCC M," a low positive correlation with "Derived AJCC T," "CS extension," and "CS extension/EOD extension," a negligible positive correlation with "Radiation" and "Radiation sequence with surgery," and an insignificant negative correlation with "Reason no cancer-directed surgery." "Derived AJCC T" has a moderate positive correlation with "CS extension," a low positive correlation with "CS lymph nodes" and "Derived AJCC N," a negligible positive correlation with "Derived AJCC M," "CS extension/EOD extension," "Radiation," and "Radiation sequence with surgery," and an insignificant negative correlation with "Reason no cancer-directed surgery." "Derived AJCC N" has a low positive correlation with "Derived AJCC M," "CS extension," and "Derived AJCC T," a negligible positive correlation with "CS extension/EOD extension," "Radiation," and "Radiation sequence with surgery," and an insignificant negative correlation with "Reason no cancer-directed surgery." "Derived AJCC M" has a low positive correlation with "CS extension/EOD extension" and "CS extension," and a negligible negative correlation with "Radiation," "Reason no cancer-directed surgery," and "Radiation sequence with surgery." "Radiation sequence with surgery" has a high positive correlation with "Radiation" and a negligible positive correlation with "Reason no cancer-directed surgery," "CS extension," and "CS extension/EOD extension." "Reason no cancer-directed surgery" has a negligible positive and

negative correlation with "Radiation" and "CS extension/EOD extension," respectively, and a low negative correlation with "CS extension." "Radiation" has a negligible positive correlation with "CS extension" and "CS extension/EOD extension." "CS extension" positively correlates with "CS extension/EOD extension." The correlation analysis shows that "CS lymph nodes" and "Derived AJCC N" are highly correlated, with a score of 0.84 (a high positive correlation). When building the models, we therefore keep only one of these two. The abbreviations are AJCC (American Joint Committee on Cancer), CS (Collaborative Stage), and EOD (Extent of Disease). The AJCC-TNM system is used to describe the staging of thyroid cancer, where TNM stands for:
- The extent (size) of the tumor (T): How large is the cancer? Has it grown into nearby structures?
- The spread to nearby lymph nodes (N): Has the cancer spread to nearby lymph nodes?
- The spread (metastasis) to distant sites (M): Has the cancer spread to distant organs such as the lungs or liver?

FIGURE 4. Accuracy of prediction models using 17 features.
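The correlation bands quoted earlier in this subsection, and the decision to keep only one of two highly correlated features, can be sketched in a few lines of Python. This is a hedged illustration rather than the authors' procedure: the function names `pearson`, `describe`, and `drop_redundant`, the toy columns, and the 0.7 cut-off (the lower edge of the "high" band) are assumptions made for the example.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # Constant columns (zero variance) are not handled in this sketch.
    return cov / math.sqrt(var_x * var_y)


def describe(r):
    """Name the band for r, using the thresholds quoted in the text."""
    size = abs(r)
    if size >= 0.9:
        label = "very high"
    elif size >= 0.7:
        label = "high"
    elif size >= 0.5:
        label = "moderate"
    elif size >= 0.3:
        label = "low"
    else:
        label = "negligible"
    sign = "positive" if r >= 0 else "negative"
    return f"{label} {sign}"


def drop_redundant(columns, threshold=0.7):
    """Keep a column only if it is not highly correlated with one already kept."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy columns, made up for illustration; they are not SEER values.
columns = {
    "cs_lymph_nodes": [0, 1, 2, 3, 4, 5],
    "derived_ajcc_n": [0, 1, 1, 3, 4, 4],
    "radiation": [1, 0, 1, 0, 1, 0],
}
kept = drop_redundant(columns)
print(describe(0.84), kept)
```

With these toy values the second column tracks the first closely and is dropped, which mirrors the choice between "CS lymph nodes" and "Derived AJCC N" described above; which of the pair to keep is decided here only by column order.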

TABLE 3. Feature scores.

In Table 3, features with a score higher than ten are reported. The scores are obtained using the SelectKBest and chi-squared process. Seventeen (17) attributes scored above 10, and 9 highly important attributes scored above 500. The top 5 features are "CS tumor size," "Derived AJCC Stage Group," "CS extension," "Age," and "Derived AJCC T." Tumor size is at the top, meaning that the patient's survival depends mostly on the tumor's size. The remaining important features are listed in Table 3.

B. MODELS PERFORMANCE EVALUATION
We have compared our study with existing studies that used similar datasets, and our results outperformed them (Table 5). In future research, we will explore the performance of the algorithms on additional datasets. To obtain more reliable accuracy estimates, we applied 10-fold cross-validation; these results are presented in the last column of Table 4. The results of the models are discussed below:

1) TYPE 1
In Type 1, 17 features are selected through feature selection. In each type, two data groups are available: balanced and imbalanced. The imbalanced data has been balanced using SMOTE. In each case, 14 algorithms have been used, and the results of the best three are reported here. Table 4 shows that logistic regression performed the best with imbalanced data, followed by GBC and RF, which also performed very well. In such cases, their accuracy rates were 98.55%, 98.65%, and 98.07%. When the data are balanced, the random forest classifier demonstrates maximum performance with enhanced accuracy. Each method achieved near-perfect accuracy on the balanced dataset, with RF at 99.3%, ETC at 99.23%, and LGBM at 99.24%. The accuracy, F1 score, precision, recall, and area under the curve (AUC) of the LR model applied to the imbalanced data were 98.77, 0.76, 0.87, 0.70, and 0.96, respectively. With Random Forest on balanced data (Fig. 4), these grow to 99.30, 0.99, 0.99, 0.99, and 1. The findings show that a balanced dataset gave the best outcomes, even when the largest set of 17 attributes was considered.

2) TYPE 2
In this type, we have selected nine features, those with a score greater than or equal to 500 (Table 3). In Fig. 5, the outputs of models created from balanced and unbalanced data are displayed using the nine most important attributes. When imbalanced data is considered, the LR, GBC, and ABC models perform better than any other models with the nine chosen features. Their accuracies were 98.73%, 98.69%, and 98.67%, respectively. On the balanced data, on the other hand, we found that HGB, LR, and ABC attained higher accuracy than the other models. Their accuracies were 99.2%, 98.87%, and 98.14%. The LR and HGB models rank

TABLE 4. Performance results of the prediction models.

highest. Once more, the models designed with balanced data yield the best results. However, the accuracy of the model with 17 features surpasses that of the model with only nine features. In this particular scenario, the best accuracy achieved by the LR and HGB models was 99.20%, which does not match the accuracy of the Type 1 models.

3) TYPE 3
The authors of [7] selected features using Fisher's discriminant ratio, Kruskal-Wallis analysis, and Relief-F. They suggest seven features for the model training process. We developed models using the same data restricted to their preferred features. Their best model achieved an accuracy of 94.49%, an F1 score of 0.431, and an area under the curve (AUC) of 0.988. They employed a multilayer perceptron with 19 neurons. We applied several machine learning techniques, and Fig. 6 displays the most successful findings. Here, the seven selected features of [7] have been taken. The total data points are (8256 Alive/221 COD-TC) [7]. In this instance, we investigated imbalanced data consisting of seven features. The LR, ABC, and HGB results were superior to those of the other models. Their accuracy rates were 97.75%, 97.58%, and 97.34%, respectively, with the LR model performing best. Fig. 9 demonstrates that our accuracy was 2% higher than that achieved by the authors of [7]. Additionally, our F1 score was 0.27 points higher, while they achieved an AUC of 0.98 and we achieved 0.95. Two of our three measurements are higher than theirs, indicating that our models' performance is at least comparable to that of [7].
We can observe that LR works better in this case with imbalanced data. It exhibits 98.77% accuracy with 17 features, 98.73% with nine features, and 97.75% with

TABLE 5. Comparing to previous works’ findings.

FIGURE 5. Accuracy of prediction models using 9 features.

FIGURE 6. Accuracy of prediction models using 7 features.

seven features. When we examine the balanced data, we find that RF and LGBM exhibit the highest accuracy with 17 features, at 99.3% and 99.24%, respectively. HGB and LR demonstrate 99.2% and 98.98% accuracy, respectively, with nine features. Across all three types, however, our 17-feature prediction model performs best.
From the ROC curve, we can see that RF covers more area than LR with 17 features; in this case, RF performs better than LR. Similarly, for nine features, HGB outperforms LR with an AUC of 0.999, whereas the AUC of LR is 0.959. On the other hand, LR performs better with seven features, with an AUC of 0.948. The model built with 17 features has shown the best results. Figures 7 and 8 help


FIGURE 7. ROC curves of 3 groups of features.
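The AUC values compared in this section can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, so an AUC of 1 means every positive outranks every negative. The sketch below computes AUC directly from that definition. It is an illustration only: the function name, labels, and scores are hypothetical, and the authors’ actual tooling is not specified here.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical scores: one positive is ranked below one negative.
partial = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
# Hypothetical scores: every positive is ranked above every negative.
perfect = roc_auc([0, 0, 1, 1], [0.1, 0.2, 0.7, 0.9])
print(partial, perfect)
```

The pairwise loop is quadratic in the number of samples, which is fine for illustration; a rank-based computation would be preferable on a dataset of tens of thousands of records.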

FIGURE 8. Results comparison between top models.

FIGURE 9. Results comparison between other authors’ models.
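The balanced results reported in this section (for example RF-17F-Balanced) rely on SMOTE oversampling of the minority class. The sketch below shows the core interpolation idea in plain Python. It is a simplified illustration under stated assumptions, not the authors’ pipeline: the function name, the toy data, and the choice to pick base samples at random are all hypothetical.

```python
import math
import random


def smote_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating toward near neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base sample (Euclidean distance)
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical two-feature minority class with four samples.
minority_class = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_samples = smote_oversample(minority_class, n_new=6, k=2)
print(len(new_samples))
```

Because each synthetic point lies on a segment between two real minority samples, the new points stay inside the region the minority class already occupies. Oversampling of this kind is normally applied to the training split only, so that evaluation data contains no synthetic records.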

us select the best model more clearly. In Fig. 7, we can see that the AUC score of the RF model built with 17 attributes is 1, meaning the ROC curve covers the entire area of the plot, ahead of the models built with 9 and 7 features. Finally, Fig. 8 shows that the RF (RF-17F-Balanced) model, built with 17 features, is at the highest position, with an accuracy of 99.3%, an F1 score of 0.99, and an AUC of 1.
Moreover, our feature selection framework worked well. For this kind of SEER data, logistic regression and AdaBoost performed very well across different feature sets and data sizes. Most importantly, in the balanced-dataset results for type 1 (17 attributes), three models reach 99% accuracy; not only the accuracy but the other metrics also reached 99%. With 17 attributes, two models achieved an AUC of 1. With nine attributes, the results are close to but slightly below those of type 1. More features yield better performance in this case.
Table 5 compares some recent works. From the table, we can learn about these works, the algorithms they used, and the accuracy they achieved.
Fig. 9 compares our results with other authors’ models. We used a dataset of 25,217 records and 14 methods, whereas Jajroudi et al. [2], Park and Lee [13], Duggal and Shukla [14], Montazeri and Beigzadeh [16], and Liu et al. [17] used smaller datasets of


7706, 1040, 7200, 900, and 286 records, respectively, and employed fewer than ten methods. Comparing type 3, type 1 (imbalanced), and type 2 (imbalanced) shows that the SMOTE technique helps increase performance. The authors of [7] applied machine learning and feature selection techniques for thyroid cancer prediction but did not address imbalanced data. We resolved this problem by using the SMOTE technique. Moreover, they used only seven features, whereas we used 17 features and achieved the highest accuracy with our balanced data. Previously, Delen et al. [28] and Thongkam et al. [29] also tried to predict the survivability of other cancers, but their models’ accuracy was less satisfactory. In contrast, our proposed models give 99% accuracy. Our best proposed model is a Random Forest classifier with an accuracy of 99.30% with 17 attributes and balanced data.

V. CONCLUSION AND FUTURE WORK
Our research aims to estimate the survivability of thyroid cancer patients. To do this, we employed several well-known machine learning methods. The three best-looking computations are used in nearly all the ML algorithms we have tried. We also determined which features play crucial roles, and we recommend k-best selection with the chi-squared test for this purpose. We conducted the experiments in three separate sets. Each set was further divided into two halves, one exploring the balanced data and one the imbalanced data. In both cases, we observed outcomes shifting to varying degrees, depending on the relative relevance of the attributes. Given the imbalance of the data, we found that the Random Forest prediction model outperformed the other models, with an area under the curve (AUC) of 1, an F1 score of 0.99, and an accuracy of 99.30%. The results were enhanced by using two effective techniques. Robust cross-validation strategies were implemented to ensure that the reported 98.77% accuracy remains consistent across different dataset splits, guarding against overfitting. Our results cover the 17 best features under both scenarios. We introduced sophisticated data preprocessing techniques tailored to the dataset’s specific impurities, ensuring the quality and reliability of the results. Moreover, our research indicates that the SMOTE approach is advantageous in balancing an imbalanced dataset. Notably, the Logistic Regression model demonstrates high accuracy despite significant dataset impurities, potentially setting a benchmark for future models in similar contexts.
Future research could concentrate on integrating data categories, including genomics, clinical, and lifestyle data across diverse populations [44], [45]. Furthermore, methodologies were provided to scale these solutions to larger datasets with similar impurity challenges, ensuring the broad applicability of the findings. Moreover, it is possible to do hyperparameter tuning by incorporating additional datasets and neural network technologies, thereby augmenting the precision of the obtained outcomes. This can provide a comprehensive view of a patient’s health and identify novel prognostic factors.
Temporal Modeling: We compared the performance obtained on random forests with other machine learning algorithms commonly used in survival prediction tasks, such as support vector machines, gradient boosting, or neural networks. We will expand the analysis to include long-term survival and prognosis beyond the initial survivability prediction. We will investigate how the Random Forest model or other algorithms predict survival outcomes over extended periods, such as 5 or 10 years. This investigation would provide insights into the long-term prognosis and guide treatment planning for thyroid cancer patients. Machine learning models that consider temporal patterns in patient data could provide more accurate survival predictions. These may include recurrent neural networks and temporal convolutional networks.
As the complexity of models increases, it becomes increasingly important to understand their decision-making process.
External Validation: To affirm their generalizability and reliability, models should be validated using multiple data sets.
As machine learning in healthcare uses sensitive patient data, future research must continue to address ethical considerations and data privacy concerns.

ACKNOWLEDGMENT
Saadat M. Alhashmi, Aminul Haque, and Ibrahim Abaker Targio Hashem: conceptualization, review, and editing; and Md. Shohidul Islam Polash, Fazley Rabbe, Shazzad Hossen, Nuruzzaman Faruqui, and Nirase Fathima Abubacker: literature review, experimentation, and writing.

REFERENCES
[1] X. Wu, Y. Yan, H. Li, N. Ji, T. Yu, Y. Huang, W. Shi, L. Gao, L. Ma, and Y. Hu, “DNA copy number gain-mediated lncRNA LINC01061 upregulation predicts poor prognosis and promotes papillary thyroid cancer progression,” Biochem. Biophysical Res. Commun., vol. 503, no. 3, pp. 1247–1253, Sep. 2018.
[2] M. Jajroudi, T. Baniasadi, L. Kamkar, F. Arbabi, M. Sanei, and M. Ahmadzade, “Prediction of survival in thyroid cancer using data mining technique,” Technol. Cancer Res. Treatment, vol. 13, no. 4, pp. 353–359, Aug. 2014.
[3] S. J. Mandel, “A 64-Year-Old woman with a thyroid nodule,” JAMA, vol. 292, no. 21, pp. 2632–2642, Dec. 2004.
[4] S. I. Sherman, “Thyroid carcinoma,” Lancet, vol. 361, no. 9356, pp. 501–511, 2003.
[5] R. Siegel, J. Ma, Z. Zou, and A. Jemal, “Cancer statistics, 2014,” CA, A Cancer J. Clinicians, vol. 64, no. 1, pp. 9–29, 2014.
[6] J. Llobera, M. Esteva, J. Rifa, E. Benito, J. Terrasa, C. Rojas, O. Pons, G. Catalan, and A. Avella, “Terminal cancer: Duration and prediction of survival time,” Eur. J. Cancer, vol. 36, no. 16, pp. 2036–2043, 2000.
[7] M. Mourad, S. Moubayed, A. Dezube, Y. Mourad, K. Park, A. Torreblanca-Zanca, J. S. Torrecilla, J. C. Cancilla, and J. Wang, “Machine learning and feature selection applied to SEER data to reliably assess thyroid cancer prognosis,” Sci. Rep., vol. 10, no. 1, p. 5176, Mar. 2020.
[8] S. Lee, S. Lim, T. Lee, I. Sung, and S. Kim, “Cancer subtype classification and modeling by pathway attention and propagation,” Bioinformatics, vol. 36, no. 12, pp. 3818–3824, Jun. 2020.


[9] D. Sun, M. Wang, and A. Li, “A multimodal deep neural network for human breast cancer prognosis prediction by integrating multi-dimensional data,” IEEE/ACM Trans. Comput. Biol. Bioinf., vol. 16, no. 3, pp. 841–850, May 2019.
[10] C. M. Kitahara and J. A. Sosa, “The changing incidence of thyroid cancer,” Nature Rev. Endocrinology, vol. 12, no. 11, pp. 646–653, Nov. 2016.
[11] K. Salman and E. Sonuç, “Thyroid disease classification using machine learning algorithms,” J. Phys., Conf., vol. 1963, no. 1, Jul. 2021, Art. no. 012140.
[12] Y. Wu, K. Rao, J. Liu, C. Han, L. Gong, Y. Chong, Z. Liu, and X. Xu, “Machine learning algorithms for the prediction of central lymph node metastasis in patients with papillary thyroid cancer,” Frontiers Endocrinol., vol. 11, Oct. 2020, Art. no. 577537.
[13] Y. M. Park and B.-J. Lee, “Machine learning-based prediction model using clinico-pathologic factors for papillary thyroid carcinoma recurrence,” Sci. Rep., vol. 11, no. 1, pp. 1–7, Mar. 2021.
[14] P. Duggal and S. Shukla, “Prediction of thyroid disorders using advanced machine learning techniques,” in Proc. 10th Int. Conf. Cloud Comput., Data Sci. Eng., Jan. 2020, pp. 670–675.
[15] H. Wen, S. Li, W. Li, J. Li, and C. Yin, “Comparision of four machine learning techniques for the prediction of prostate cancer survivability,” in Proc. 15th Int. Comput. Conf. Wavelet Act. Media Technol. Inf. Process. (ICCWAMTIP), Dec. 2018, pp. 112–116.
[16] M. Montazeri, M. Montazeri, M. Montazeri, and A. Beigzadeh, “Machine learning models in breast cancer survival prediction,” Technol. Health Care, vol. 24, no. 1, pp. 31–42, Jan. 2016.
[17] Z. Liu, Y. Huang, S. Chen, D. Hu, M. Wang, L. Zhou, W. Zhou, D. Chen, H. Feng, W. Wei, C. Zhang, W. Zeng, and L. Guo, “Minimal extrathyroidal extension affects the prognosis of differentiated thyroid cancer: Is there a need for change in the AJCC classification system?” PLoS ONE, vol. 14, no. 6, Jun. 2019, Art. no. e0218171.
[18] Y. H. Liu, J. Jin, and Y. J. Liu, “Machine learning-based random forest for predicting decreased quality of life in thyroid cancer patients after thyroidectomy,” Supportive Care Cancer, pp. 1–7, Mar. 2022.
[19] M. Kukar, N. Besic, I. Kononenko, M. Auersperg, and M. Robnik-Sikonja, “Prognosing the survival time of patients with anaplastic thyroid carcinoma using machine learning,” in Intelligent Data Analysis in Medicine and Pharmacology (International Series in Engineering and Computer Science), vol. 414. Boston, MA, USA: Springer, 1997, pp. 115–129.
[20] A. Agrawal, S. Misra, R. Narayanan, L. Polepeddi, and A. Choudhary, “Lung cancer survival prediction using ensemble data mining on seer data,” Sci. Program., vol. 20, no. 1, pp. 29–42, 2012.
[21] M. Lundin, J. Lundin, H. B. Burke, S. Toikkanen, L. Pylkkänen, and H. Joensuu, “Artificial neural networks applied to survival prediction in breast cancer,” Oncology, vol. 57, no. 4, pp. 281–286, 1999.
[22] D. Devi, S. K. Biswas, and B. Purkayastha, “Redundancy-driven modified tomek-link based undersampling: A solution to class imbalance,” Pattern Recognit. Lett., vol. 93, pp. 3–12, Jul. 2017.
[23] D. J. Biau, S. Kernéis, and R. Porcher, “Statistics in brief: The importance of sample size in the planning and interpretation of medical research,” Clin. Orthopaedics Rel. Res., vol. 466, no. 9, pp. 2282–2288, Sep. 2008.
[24] J. A. Freiman, T. C. Chalmers, H. A. Smith, and R. R. Kuebler, “The importance of beta, the type II error, and sample size in the design and interpretation of the randomized controlled trial,” in Medical Uses of Statistics. Boca Raton, FL, USA: CRC Press, 2019, pp. 357–389.
[25] D. A. Dablain, C. Bellinger, B. Krawczyk, D. W. Aha, and N. V. Chawla, “Interpretable ML for imbalanced data,” 2022, arXiv:2212.07743.
[26] S. E. Roshan and S. Asadi, “Improvement of bagging performance for classification of imbalanced datasets using evolutionary multi-objective optimization,” Eng. Appl. Artif. Intell., vol. 87, Jan. 2020, Art. no. 103319.
[27] A. Aldwgeri and N. F. Abubacker, “Ensemble of deep convolutional neural network for skin lesion classification in dermoscopy images,” in Proc. Int. Vis. Inform. Conf., Bangi, Malaysia, 2019, pp. 214–226.
[28] D. Delen, G. Walker, and A. Kadam, “Predicting breast cancer survivability: A comparison of three data mining methods,” Artif. Intell. Med., vol. 34, no. 2, pp. 113–127, Jun. 2005.
[29] J. Thongkam, G. Xu, Y. Zhang, and F. Huang, “Breast cancer survivability via AdaBoost algorithms,” in Proc. 2nd Australasian Workshop Health Data Knowl. Manag., vol. 80, 2008, pp. 55–64.
[30] R. Karthiga, G. Usha, N. Raju, and K. Narasimhan, “Transfer learning based breast cancer classification using one-hot encoding technique,” in Proc. Int. Conf. Artif. Intell. Smart Syst. (ICAIS), Mar. 2021, pp. 115–120.
[31] W. Satriaji and R. Kusumaningrum, “Effect of synthetic minority oversampling technique (SMOTE), feature representation, and classification algorithm on imbalanced sentiment analysis,” in Proc. 2nd Int. Conf. Informat. Comput. Sci. (ICICoS), Oct. 2018, pp. 1–5.
[32] B. R. A. Cirkovic, A. M. Cvetkovic, S. M. Ninkovic, and N. D. Filipovic, “Prediction models for estimation of survival rate and relapse for breast cancer patients,” in Proc. IEEE 15th Int. Conf. Bioinf. Bioengineering (BIBE), Nov. 2015, pp. 1–6.
[33] A. Endo, T. Shibata, and H. Tanaka, “Comparison of seven algorithms to predict breast cancer survival (<special issue> contribution to 21 century intelligent technologies and bioinformatics),” Int. J. Biomed. Comput. Hum. Sci., Off. J. Biomed. Fuzzy Syst. Assoc., vol. 13, no. 2, pp. 11–16, 2008.
[34] K. R. Pradeep and N. C. Naveen, “Lung cancer survivability prediction based on performance using classification techniques of support vector machines, C4.5 and naive Bayes algorithms for healthcare analytics,” Proc. Comput. Sci., vol. 132, pp. 412–420, Jan. 2018.
[35] Md. S. I. Polash, S. Hossen, R. K. R. Sarker, Md. A. Bhuiyan, and A. Taher, “Functionality testing of machine learning algorithms to anticipate life expectancy of stomach cancer patients,” in Proc. Int. Conf. Advancement in Electr. Electron. Eng. (ICAEEE), Feb. 2022, pp. 1–6.
[36] C.-H. Yang, S.-H. Moi, F. Ou-Yang, L.-Y. Chuang, M.-F. Hou, and Y.-D. Lin, “Identifying risk stratification associated with a cancer for overall survival by deep learning-based CoxPH,” IEEE Access, vol. 7, pp. 67708–67717, 2019.
[37] M. S. I. Polash, S. Hossen, and A. Haque, “Five-year life expectancy prediction of prostate cancer patients using machine learning algorithms,” in Soft Computing and Its Engineering Applications (Communications in Computer and Information Science), vol. 1788. Cham, Switzerland: Springer, 2023, pp. 314–326.
[38] Z. Wang, L. Qu, Q. Chen, Y. Zhou, H. Duan, B. Li, Y. Weng, J. Su, and W. Yi, “Deep learning-based multifeature integration robustly predicts central lymph node metastasis in papillary thyroid cancer,” BMC Cancer, vol. 23, no. 1, Feb. 2023, doi: 10.1186/s12885-023-10598-8.
[39] A. Abbasian Ardakani, A. Mohammadi, M. Mirza-Aghazadeh-Attari, F. Faeghi, T. J. Vogl, and U. R. Acharya, “Diagnosis of metastatic lymph nodes in patients with papillary thyroid cancer,” J. Ultrasound Med., vol. 42, no. 6, pp. 1211–1221, Jun. 2023, doi: 10.1002/jum.16131.
[40] M. D. Kate and V. Kale, “The role of machine learning in thyroid cancer diagnosis,” in Advances in Computer Science Research. The Netherlands: Atlantis Press, 2023, pp. 276–287, doi: 10.2991/978-94-6463-136-4_25.
[41] M. S. I. Polash, S. Hossen, and A. Haque, “Model analysis for predicting prostate cancer patient’s survival: A seer case study,” in Proc. 4th Int. Conf. Trends Comput. Cogn. Eng., 2023, pp. 279–290, doi: 10.1007/978-981-19-9483-8_24.
[42] R. H. Nobin, M. Rahman, and M. J. Alam, “Survivability prediction for patients with tonsil cancer utilizing machine learning algorithms,” in Proc. 2nd Int. Conf. Intell. Cybern. Technol. Appl. (ICICyTA), Dec. 2022, pp. 210–215, doi: 10.1109/ICICyTA57421.2022.10038122.
[43] H. Torkey, M. Atlam, N. El-Fishawy, and H. Salem, “Machine learning model for cancer diagnosis based on RNAseq microarray,” Menoufia J. Electron. Eng. Res., vol. 30, no. 1, pp. 65–75, Jan. 2021, doi: 10.21608/mjeer.2021.146277.
[44] H. Salem, G. Attiya, and N. El-Fishawy, “Intelligent decision support system for breast cancer diagnosis by gene expression profiles,” in Proc. 33rd Nat. Radio Sci. Conf. (NRSC), Feb. 2016, pp. 421–430, doi: 10.1109/NRSC.2016.7450870.
[45] M. Atlam, H. Torkey, H. Salem, and N. El-Fishawy, “A new feature selection method for enhancing cancer diagnosis based on DNA microarray,” in Proc. 37th Nat. Radio Sci. Conf. (NRSC), Sep. 2020, pp. 285–295.

SAADAT M. ALHASHMI received the Ph.D. degree from Sheffield Hallam University, Sheffield, U.K. He is currently an Associate Professor of information systems with the University of Sharjah, Sharjah, United Arab Emirates. He has supervised several Ph.D. students and published extensively in various high-impact journals and conferences.


MD. SHOHIDUL ISLAM POLASH received the B.Sc. degree in CSE from the Computer Science and Engineering Department, Daffodil International University, Dhaka, Bangladesh, in 2023, with a CGPA of 3.97/4.00. He is currently a Lecturer with the Computer Science and Engineering Department, Daffodil International University. Several worldwide peer-reviewed conferences have published his scientific contributions. His academic research interests include machine learning, deep learning, and computer vision.

AMINUL HAQUE received the B.Sc. degree from the Shahjalal University of Science and Technology, Bangladesh, and the Ph.D. degree from Monash University. He is currently a Professor with the Department of Computer Science and Engineering, Daffodil International University (DIU), Daffodil Smart City, Dhaka, Bangladesh. He has published his research outputs in several international peer-reviewed journals and conferences. He also contributed data science-related courses to online platforms, such as International Online University (IOU). Recently, he contributed to developing a skill-based national curriculum on big data and data science-related courses. His research interests include data mining, machine learning, and distributed computing.

FAZLEY RABBE received the bachelor’s degree in computer science and engineering from Daffodil International University, Bangladesh, in 2021. He is currently pursuing the Master of Engineering degree in information technology with Frankfurt University of Applied Sciences. His current research interests include data mining, the Internet of Things, cyber security, and mobile application authentication.

SHAZZAD HOSSEN received the Bachelor of Science degree in computer science and engineering from Daffodil International University, Daffodil Smart City, Ashulia, Dhaka, Bangladesh, in 2022. He is currently a Software Engineer, leveraging his expertise in computer science and engineering. Within the domain of web3 technologies, he has delved into blockchain development, smart contracts, and the creation of decentralized applications (dApps). His research interests span blockchain technology, machine learning, deep learning, and reinforcement learning, reflecting his passion for cutting-edge advancements in the tech industry.

NURUZZAMAN FARUQUI received the B.Sc. degree in electrical and electronics engineering from North South University and the master’s degree in information technology from the Institute of Information Technology (IIT), Jahangirnagar University (JU), Bangladesh, in 2018, with a CGPA of 4/4.
He is currently an Assistant Professor with the Department of Software Engineering (SWE), Daffodil International University, Bangladesh. He is a Research Coordinator with the Department of Software Engineering. He is also a YouTuber and an Author. He is globally recognized for his educational video content on MATLAB neural networks. He has authored three books. His research interests include artificial intelligence, machine learning, deep learning, cloud computing, and image processing. He is a member of The Institution of Engineers (IEB), Bangladesh, and Bangladesh Society for Private University Academics (BSPUA).

IBRAHIM ABAKER TARGIO HASHEM received the master’s degree in computer science from the University of Wales, Newport, and the Ph.D. degree in computer science from the University of Malaya, Kuala Lumpur, Malaysia. He is currently an Assistant Professor of computer science with the University of Sharjah, United Arab Emirates. He is an Active Member of the Center for Mobile Cloud Computing Research (C4MCCR), University of Malaya. His numerous research articles are famous and among the most downloaded in top journals. He has published several research articles in refereed international journals and magazines. His areas of research interests include big data, cloud computing, distributed computing, and machine learning. He obtained professional certificates from CISCO (CCNP, CCNA, and CCNA Security) and the APMG Group (PRINCE2 Foundation, ITIL v3 Foundation, and OBASHI Foundation).

NIRASE FATHIMA ABUBACKER was an Associate Professor with Dublin City University (DCU), Ireland, and taught a joint M.Sc. degree in computing (data analytics) with Princess Noura University, Riyadh, through collaboration with DCU. She is currently an Active Supervisor for data analytics master’s students’ capstone projects. She has good hands-on experience teaching modules at universities such as Dublin City University, the University of East London, U.K., Staffordshire University, U.K., and London School of Commerce, U.K., for both degree and master’s international students, over more than the past 20 years in India, Malaysia, and Saudi Arabia. She has developed sound teaching and research skills with an excellent grasp of the subject material covered by IT and computer science, specifically in data analytics/data science courses. She has excellent experience in developing materials for data science courses with good hands-on experience in teaching data science modules for master’s degree students, such as data mining and data analytics, applied machine learning, data management and visualization, big data analytics and technologies, artificial intelligence, data mining, and predictive modeling.
