Appendicitis Diagnosis: Ensemble Machine Learning and Explainable Artificial Intelligence-Based Comprehensive Approach

Gollapalli, Mohammed; Rahman, Atta; Kudos, Sheriff A.; Foula, Mohammed S.; Alkhalifa, Abdullah Mahmoud; Albisher, Hassan Mohammed; Al-Hariri, Mohammed Taha; Mohammad, Nazeeruddin

doi:10.3390/bdcc8090108


by Mohammed Gollapalli ¹, Atta Rahman ²,*, Sheriff A. Kudos ³, Mohammed S. Foula ⁴, Abdullah Mahmoud Alkhalifa ⁴, Hassan Mohammed Albisher ⁴, Mohammed Taha Al-Hariri ⁵ and Nazeeruddin Mohammad ⁶

¹ Department of Computer Information Systems, College of Computer Science and Information Technology, Imam Abdulrahman Bin Faisal University, P.O. Box 1982, Dammam 31441, Saudi Arabia
² Department of Computer Science, College of Computer Science and Information Technology, Imam Abdulrahman Bin Faisal University, P.O. Box 1982, Dammam 31441, Saudi Arabia
³ Department of Computer Engineering, College of Computer Science and Information Technology, Imam Abdulrahman Bin Faisal University, P.O. Box 1982, Dammam 31441, Saudi Arabia
⁴ Department of Surgery, King Fahd University Hospital, Imam Abdulrahman Bin Faisal University, P.O. Box 1982, Dammam 31441, Saudi Arabia
⁵ Department of Physiology, College of Medicine, Imam Abdulrahman Bin Faisal University, P.O. Box 1982, Dammam 31441, Saudi Arabia
⁶ Cybersecurity Center, Prince Mohammad Bin Fahd University, P.O. Box 1664, Alkhobar 31952, Saudi Arabia
* Author to whom correspondence should be addressed.

Big Data Cogn. Comput. 2024, 8(9), 108; https://doi.org/10.3390/bdcc8090108

Submission received: 4 July 2024 / Revised: 28 August 2024 / Accepted: 29 August 2024 / Published: 4 September 2024

(This article belongs to the Special Issue Machine Learning Applications and Big Data Challenges)


Abstract:

Appendicitis is a condition wherein the appendix becomes inflamed, and it can be difficult to diagnose accurately. The type of appendicitis can also be hard to determine, leading to misdiagnosis and difficulty in managing the condition. To avoid complications and reduce mortality, early diagnosis and treatment are crucial. Alvarado’s clinical scoring system alone is not sufficient, while ultrasound and computed tomography (CT) imaging are effective but have downsides such as operator dependency and radiation exposure. This study proposes the use of machine learning methods and a locally collected reliable dataset to enhance the identification of acute appendicitis while detecting the differences between complicated and non-complicated appendicitis. Machine learning can help reduce diagnostic errors and improve treatment decisions. This study conducted four different experiments using various ML algorithms, including K-nearest neighbors (KNN), decision tree (DT), bagging, and stacking. The experimental results showed that the stacking model had the highest training accuracy, test set accuracy, precision, and F1 score, which were 97.51%, 92.63%, 95.29%, and 92.04%, respectively. Feature importance and explainable AI (XAI) identified neutrophils, WBC_Count, Total_LOS, P_O_LOS, and Symptoms_Days as the principal features that significantly affected the performance of the model. Based on the outcomes and feedback from medical health professionals, the scheme is promising in terms of its effectiveness in diagnosing acute appendicitis.

Keywords:

disease diagnosis; explainable AI (XAI); machine learning; ensemble learning; stacking; bagging; acute appendicitis

1. Introduction

Appendicitis is a medical condition wherein the appendix becomes infected, causing painful inflammation in the lower right side of the abdomen [1,2,3,4]. However, diagnosing appendicitis can be challenging [5], and there are currently no clear guidelines to differentiate between complicated and uncomplicated cases [6,7]. Acute appendicitis can be categorized as complicated acute appendicitis (CA) or uncomplicated acute appendicitis (UCA). CA is often associated with complications such as appendicitis abscess, gangrenous appendicitis, perforated appendix, or phlegmon [8]. The main risk factor for appendicitis is the perforation or rupture of the appendix [9]. A delayed or missed diagnosis of acute appendicitis can lead to high morbidity and mortality, whereas early identification of suspected appendicitis is crucial for treating acute cases [10]. It is therefore important that patients with appendicitis are taken to the operating room as soon as possible, as appendiceal rupture can significantly increase the risk of morbidity and death [11]. To help speed up the diagnosis of complicated and uncomplicated acute appendicitis, and to identify the underlying differences between these two levels of appendicitis, this study aims to develop models using machine learning algorithms and a Saudi Arabian dataset. Alvarado proposed a clinical scoring system that includes symptoms, signs, and laboratory data to diagnose acute appendicitis [10]. However, studies have shown that this system is not enough, and ultrasound (US) and computed tomography (CT) imaging with rectal contrast are the best diagnostic tests for appendicitis. Nevertheless, both tests have some downsides [12,13,14].

CT imaging exposes patients to radiation, and US diagnostic performance depends on the operator and cannot be used after hours. Moreover, the image analysis technique sometimes results in a delayed diagnosis of acute appendicitis [15]. The diagnosis of appendicitis is challenging because more than half of all appendicitis patients do not exhibit typical symptoms [16]. The signs and symptoms of appendicitis are unpredictable, with only 50% of cases presenting with anorexia and periumbilical pain, followed by nausea, right lower quadrant (RLQ) pain, and vomiting. CBC tests have shown that approximately 80–85% of adults with appendicitis have WBC levels above 10,500 cells/µL, and more than 75–78% of patients have neutrophilia. Less than 4% of appendicitis patients have a WBC count of less than 10,500 cells/µL, and neutrophilia less than 75% [17]. An early diagnosis and treatment of acute appendicitis are necessary to prevent further complications.

Machine learning (ML) techniques have proven to be highly effective in improving healthcare and the accuracy of diagnosis. By using ML, clinicians can make informed decisions that lead to improved healthcare services, which includes reducing diagnostic errors and ensuring patients receive the correct treatment. The growth in accessibility to medical data and developments in technology for processing and storing data have contributed significantly to the adoption of ML in healthcare [18,19]. Although many studies have been conducted in the diagnosis of acute appendicitis using machine learning, only a few have utilized Saudi datasets and machine learning in this domain. However, there have been several published works, such as [20,21,22,23], that have implemented ML algorithms with Saudi datasets in the diagnosis and decision support for various diseases and achieved high accuracies.

In this study, we investigated five machine learning algorithms to diagnose acute appendicitis and distinguish between different levels of the disease. The algorithms we used were K-nearest neighbor (KNN), decision tree, KNN bagging, DT bagging, and a stacking model. We chose decision tree for its effectiveness in clinical decision-making, and KNN for its simplicity [24,25]. Additionally, ensemble approaches like bagging and stacking offer diversity, stability, and high performance, as demonstrated in recent research [26].

Additionally, we combined KNN, DT, KNN bagging, and DT bagging models with KNN bagging as the meta-classifier into a single, robust model. We used the permutation feature importance tool to identify the most important features contributing to model accuracy and XAI for deeper explanations and a better understanding of the underlying features. Using the SHAP and LIME techniques with the stacking model, we found the most relevant features that contributed to the model’s prediction.

The remainder of this paper is organized as follows: The literature review is covered in Section 2, while Section 3 contains a description of the proposed techniques. Section 4 outlines the empirical studies, which include the study data, statistical analysis, data preparation, experimental setup, and optimization strategy. Section 5 presents the empirical results of the experiments, while Section 6 contains the findings obtained by the explainable AI (XAI) approach. Section 7 contains further discussion and noteworthy findings, while Section 8 concludes this paper.

2. Literature Review

Numerous studies have been conducted on the use of artificial intelligence (AI) and machine learning (ML) in the diagnosis of acute appendicitis. Akmese et al. [27] proposed a gradient boost algorithm to predict the occurrence of acute appendicitis. Their algorithm was tested on 595 clinical samples of acute appendicitis patients, comprising 348 males and 247 females. The model achieved an accuracy rate of 95.31%. The authors believe that their model could be particularly beneficial for individuals who exhibit signs of appendicitis, especially in hospital emergency settings.

It is sometimes difficult for a model to perform well on an imbalanced dataset. This was addressed by the authors in [28], who proposed a pre-clustering-based ensemble learning method to tackle this issue in diagnosing acute appendicitis. They tested their model on 574 clinical samples from a university hospital in Taiwan. The model makes use of pre-clustering to group the majority class samples, and then samples from each class representative are selected to form a balanced sample. This method, called the PEL method, minimizes any information loss due to random undersampling. Compared to other techniques, their model achieved the highest AUC score of 0.619.

Park et al. [29] proposed an AI model created with support vector machines (SVMs) to diagnose acute appendicitis. They used the medical records of 760 patients to create the SVM model. The model was then compared with the Alvarado clinical scoring system (ACSS) and multilayered neural networks (MLNNs). The results showed that their model outperformed the other two with an AUC score of 99.7%.

Yoldaş et al. [30] conducted a study to evaluate the performance of artificial neural networks in diagnosing acute appendicitis in patients who complained of right lower abdomen pain. They collected data from 156 patients who visited a rural hospital over 12 months with probable appendicitis. The artificial neural network’s specificity, sensitivity, and positive and negative predictive values were 97.2%, 100%, 96.0%, and 100%, respectively. The results suggest that artificial neural networks may help prevent unnecessary surgeries and be an efficient tool for correctly identifying acute appendicitis.

Issaiy et al. [31] compared the effectiveness of traditional approaches and artificial intelligence (AI) models in the diagnosis and prognosis of acute appendicitis (AA) in adult patients. They analyzed a total of 29 studies, 7 of which focused on prognosis, 21 on diagnosis, and 1 on both. The most used diagnosis algorithms were artificial neural networks (ANNs). Both ANNs and logistic regression were used to categorize different forms of AA. ANNs showed high performance in most scenarios, with accuracy rates frequently above 80% and AUC values peaking at 0.985.

In a recent study, researchers [32] used several machine learning (ML) algorithms, such as SVM, KNN, GB, and LR, to develop an ML model that can help in detecting complicated appendicitis. The study included a cohort of 1950 patients, of whom 483 were diagnosed with complicated appendicitis. The GB model was found to have the highest accuracy and AUC values, both over 0.8, in the experiments with and without SMOTE.

Similarly, Akbulut et al. [33] used ML algorithms and XAI to predict perforated and non-perforated appendicitis. Among all the models, their CatBoost model was able to differentiate acute appendicitis (AAp) patients from non-AAp patients with an accuracy of 82%, while it could differentiate perforated AAp individuals from non-perforated ones with an accuracy of 92%. Additionally, the SHAP technique revealed that WBC, WLR, high total bilirubin, CRP, neutrophil, NLR, and WNR values, as well as low PDW, PNR, and MCV values, improved the biochemical prediction of AAp.

In a previous study [34], a 3D deep learning model called AppendiXNet was developed to detect appendicitis. The model was pre-trained on Kinetics, a large collection of YouTube videos, and then fine-tuned on a dataset of 438 CT-scanned images. It was found that pre-training the model on Kinetics improved its performance from an AUC of 72.4% to 81%.

Goswami et al. [35] compared the performance of three machine learning algorithms—decision tree, support vector machine, and K-nearest neighbor—in diagnosing acute appendicitis. The study was conducted on a dataset of 590 patients with appendicitis. The results showed that the decision tree algorithm performed better than the other two algorithms, with 73.72% accuracy.

In another study [36], authors used a grasshopper optimization technique to improve the performance of a support vector machine algorithm in distinguishing complicated and uncomplicated appendicitis. A random forest analysis was used to identify the two groups before the optimization. They used records of 298 acute appendicitis patients from Wenzhou Central Hospital and achieved an average accuracy of 83.56%. Similarly, Eddama et al. [37] used logistic regression to differentiate between patients with complicated and uncomplicated appendicitis. They used 895 samples of patients who underwent appendectomy.

This study aims to use a Saudi Arabian dataset to develop models that can detect and classify complicated and non-complicated appendicitis while identifying the underlying distinguishing factors of the two conditions. The goal of this study is to aid clinicians in diagnosing these conditions quickly and efficiently using the available resources.

3. Description of the Proposed Techniques

This section briefly describes the proposed techniques investigated in the current study. It is worth noting that the schemes have been chosen after carefully reviewing the literature. The underlying schemes were promising candidates for similar problems with affordable complexity. Additionally, the stacking models are among the prominent approaches for complex problems with large feature spaces.

3.1. Decision Tree

The decision tree is a popular and efficient algorithm used in data mining [38]. It is recognized as one of the top ten algorithms in data mining and is well established, having been explored by numerous researchers [39]. The decision tree is represented as a flowchart-like tree structure, where leaf nodes are represented by ovals and internal nodes by rectangles. Every internal node has at least two child nodes and holds a split that tests an expression over the features. The arcs connecting an internal node to its children are labeled with the different test outcomes, and a class label is assigned to each leaf node. There are various methods for selecting the best attribute at each node [40]. The Gini impurity and information gain techniques are commonly used as splitting criteria in decision tree models. They help evaluate the effectiveness of each test condition and its ability to classify samples into a category [41]. The formula for information gain is provided by Equation (1).

$$\mathrm{InformationGain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_{v}|}{|S|} \, \mathrm{Entropy}(S_{v}) \qquad (1)$$

A denotes a specific feature or class label, Entropy(S) is the entropy of the dataset, $|S_{v}|/|S|$ denotes the ratio of the number of values in $S_{v}$ to the number of values in the dataset $S$, while Entropy($S_{v}$) is the entropy of the dataset $S_{v}$.

Entropy is calculated as Equation (2):

$$\mathrm{Entropy}(S) = \sum_{j} - P_{j} \log_{2} P_{j} \qquad (2)$$

S denotes the dataset whose entropy is being calculated.
$j$ denotes the classes in set S.
$P_{j}$ denotes the ratio of data points that belong to class $j$ to the number of total data points in the set S.
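
Equations (1) and (2) can be illustrated with a short Python sketch. This is only a minimal illustration on hypothetical toy data, not the implementation used in the study; the function names `entropy` and `information_gain` and the toy feature and target values are assumptions made for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels, as in Equation (2)."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy(S) minus the size-weighted entropy of each subset S_v, as in Equation (1)."""
    total = len(labels)
    subsets = {}
    for value, label in zip(feature_values, labels):
        subsets.setdefault(value, []).append(label)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


# Hypothetical toy data: a binary feature that separates the two classes perfectly.
feature = ["high", "high", "low", "low"]
target = [1, 1, 0, 0]
gain = information_gain(feature, target)
print(f"Entropy(S) = {entropy(target):.3f}, information gain = {gain:.3f}")
```

Because the toy feature splits the classes perfectly, each subset has zero entropy and the gain equals the entropy of the full set.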

3.2. K-Nearest Neighbors (KNN)

KNN is among the simplest machine learning algorithms, yet it is regarded as one of the best data mining algorithms [39]. It selects a neighborhood of k objects in the training set that are closest to the test item and assigns the label of the majority class in this neighborhood. When dealing with a large training set, high accuracy rates can be achieved by developing a decision surface that adjusts to the structure of the data distribution [42]. KNN was first introduced in [43] to perform discriminant analysis when valid parametric estimates of probability density were unknown or impossible to calculate. KNN makes no assumptions about the underlying data distribution, which makes it a nonparametric lazy learning algorithm [42]. KNN relies on three elements to select the nearest neighbors: a set of labeled objects, a distance or similarity metric, and an odd value of k [39]. The selection of the k nearest neighbors is based on the similarity metric [44]. Apart from its simplicity, one advantage of KNN compared to other algorithms is its suitability for multi-modal classes, which allows it to be used in applications where an object can have more than one class label. The main difficulty with KNN is the choice of k. If k is too small, the outcome may be susceptible to noise points. However, if k is too large, the neighborhood may contain too many points from different classes [39].

To determine which data points are closest to a specific query point, it is necessary to calculate the distance between the query point and the other data points. Distance measures play a crucial role in creating decision boundaries that divide query points into different regions. There are several distance measures available, with the most common one being the Minkowski distance measure, represented by Equation (3):

$$d(x, y) = \left( \sum_{i=1}^{n} |x_{i} - y_{i}|^{p} \right)^{\frac{1}{p}} \qquad (3)$$

where x and y represent two data points with n features each, and p is the order of the distance. Equation (3) represents Euclidean distance when p equals 2, and Manhattan distance when p equals 1 [45,46].
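
As a small worked example of the Minkowski distance, the sketch below evaluates it for the two special cases named above. The function name `minkowski_distance` and the two sample points are hypothetical and chosen only so that the results are easy to check by hand.

```python
def minkowski_distance(x, y, p=2):
    """Minkowski distance of order p between two equal-length numeric sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of features")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


# Hypothetical points forming a 3-4-5 right triangle with the origin.
x, y = (0.0, 0.0), (3.0, 4.0)
euclidean = minkowski_distance(x, y, p=2)  # straight-line distance
manhattan = minkowski_distance(x, y, p=1)  # sum of absolute coordinate differences
print(euclidean, manhattan)
```

For these points the Euclidean distance is 5 and the Manhattan distance is 7, which shows how the choice of p can change which neighbors count as nearest.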

3.3. Bagging

Bagging is an ensemble method [47] that generates numerous versions of a model and combines them into an aggregated model. For numerical prediction, the aggregation averages over the versions, while class prediction uses a plurality vote over the versions' predictions [48]. The versions are produced by making bootstrap replicates of the learning set and using them as new learning sets. An example of the bagging technique is the random forest algorithm [49]. Bagging is a parallel ensemble technique: the base learners are generated independently of one another and can therefore be trained simultaneously. Since the learners do not depend on each other, their outputs are fused by voting or averaging. Equation (4) shows the function for bagging [48]:

$$f(x) = \frac{1}{B} \sum_{b=1}^{B} f_{b}(x) \qquad (4)$$

where $B$ represents the number of generated bootstrap learning sets, and $f_{b}(x)$ denotes the weak learner trained on the $b$-th bootstrap set.
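
The bootstrap-and-vote idea can be sketched in a few lines of plain Python. The sketch uses the plurality-vote form of aggregation for classification (the averaging in Equation (4) applies to numerical prediction). The 1-nearest-neighbor base learner, the function names, and the one-dimensional toy data are all hypothetical choices made for illustration, not the configuration used in the study.

```python
import random
from collections import Counter


def one_nn_learner(sample):
    """Return a base learner f_b: a 1-nearest-neighbour rule built on one bootstrap sample."""
    def predict(x):
        nearest = min(sample, key=lambda row: abs(row[0] - x))
        return nearest[1]
    return predict


def bagging_predict(train, x, n_learners=25, seed=0):
    """Train n_learners base learners on bootstrap replicates and take a plurality vote."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_learners):
        # Bootstrap replicate: sample len(train) rows with replacement.
        bootstrap = [rng.choice(train) for _ in train]
        votes.append(one_nn_learner(bootstrap)(x))
    return Counter(votes).most_common(1)[0][0]


# Hypothetical 1-D toy data: (feature value, class label).
train = [(1.0, 0), (1.5, 0), (2.0, 0), (8.0, 1), (8.5, 1), (9.0, 1)]
label_low = bagging_predict(train, 1.2)
label_high = bagging_predict(train, 8.8)
print(label_low, label_high)
```

Each base learner sees a slightly different replicate of the data, so individual votes can differ, but the plurality vote is far more stable than any single learner.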

3.4. Stacking

Stacking, also known as stacked generalization, is an ensemble technique that leverages predictions from multiple models to create a new model, known as a meta-model. The architecture of a stacking model comprises two or more base models, referred to as level 0 models, and a meta or level 1 model that combines the predictions from the base models. The level 0 models are fitted on the training data and produce predictions, while the level 1 model (meta-model) learns the optimal way to combine the predictions from the level 0 models. In the case of classification, the outputs of the base models used as input to the meta-model can be either probability values or class labels. Equation (5) depicts the function for stacking [50,51].

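$$f_{s}(x) = \sum_{i=1}^{n} a_{i} f_{i}(x) \qquad (5)$$

where $f_{i}(x)$ is the prediction of the $i$-th of the $n$ base models and $a_{i}$ is the weight with which the meta-model combines it.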

4. Study Data

This section provides the details about the dataset used for the proposed study, its statistical analysis and preprocessing. Table 1 presents the description of the dataset used in this study by explaining each feature.

4.1. Statistical Analysis

Statistical analysis tools facilitate the discovery of vital information about the dataset, ensuring the use of the right preprocessing techniques and modeling. Table 2 displays the statistical analysis for the numerical attributes, including the count, mean, standard deviation, and five-number summary for each attribute, while Table 3 shows the count and missing values of the nominal attributes. As illustrated in Table 2, the significant difference between the third quartile and the maximum values of the Neutrophils shows the presence of outliers which were treated in the preprocessing stage.

Moreover, during the box plot analysis for the numerical variables, the presence of outliers in the neutrophils variable was observed. Additionally, Figure 1 shows the correlation between the attributes and the target class. P_O_LOS, Total_LOS, and neutrophils are shown to be the most important features in the classification of appendicitis complication. Overall, the heatmap shows a weak correlation between the features and the target variable.

4.2. Data Preprocessing

Data preparation is one of the crucial processes carried out to transform raw data into something useful and efficient. Several preprocessing methods were applied in this study utilizing Python’s Pandas and Sklearn libraries. Categorical values were first converted to numerical values to ensure that all inputs and outputs were numerical. Samples with ‘Yes’ were converted to 1 and ‘No’ to 0, while ‘M’ was converted to 1 and ‘F’ to 0 for the Sex variable. The distribution of the dataset has a significant impact on ML algorithms. The presence of outliers can have a negative effect on a model’s performance by adding skewness and bias and reducing the statistical power of the dataset. The Neutrophils variable contained outliers, which were identified with the interquartile range (IQR) and treated by means of normalization. The last stage of preprocessing was handling the missing values. The KNNImputer class from Sklearn’s impute module, with the number of nearest neighbors set to 5, was used to impute missing values. It works by locating the closest neighbors using the Euclidean distance metric, as shown in Equation (6):

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (q_{i} - p_{i})^{2}} \qquad (6)$$

where p and q represent two points, while n is the number of dimensions (features) of the space.
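
The imputation step can be sketched with scikit-learn's `KNNImputer`. The small matrix below is hypothetical toy data, and `n_neighbors=2` is used only because the toy matrix has four rows; the study reports a value of 5. Note that `KNNImputer` computes a Euclidean distance over the coordinates that are present in both rows.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy matrix with two missing entries (np.nan).
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# n_neighbors=2 suits this tiny example; the study used 5 neighbors.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
```

Each missing entry is replaced by the mean of that feature over the nearest rows that have the feature observed, so the imputed values stay within the range of the observed data.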

5. Experimental Setup

The experiments for this study were carried out in Jupyter Notebook on an HP Notebook with an Intel(R) Core(TM) i7-7500U CPU and 8 GB RAM. Before training the model, the dataset was first preprocessed by removing columns that had almost half of their values missing. The categorical variables were then encoded using the Label Encoder. As mentioned earlier, the dataset was preprocessed by imputation, normalization, and capping. Subsequently, the best features were selected using the K-best algorithm with the chi-squared score function and k set to 11 before conducting the experiments. Then, the dataset was partitioned using a 70%–30% train/test split. The train set was used for training and hyperparameter optimization, while the test set was used for model evaluation.

The first experiment was conducted using the preprocessed dataset and two ML algorithms, namely, DT and KNN, to create two models. The second experiment involved the use of SMOTE–Tomek [52,53] to oversample the negative class, and the resampled data were then used to create two models from the same algorithms. The results of the first and second experiments were compared, and the best-performing setup was used for the subsequent experiments. In the third experiment, KNN bagging and DT bagging were used to create two models by utilizing the hyperparameters of the two models obtained in the second experiment. The final experiment was the creation of a stacking model using hyperparameters obtained from the second and third experiments. In addition, the “permutation_importance” function provided by the Sklearn library was used to rank the important features of the best model with a “random_state” of 1 and “n_repeats” of 5 using the accuracy metric. Finally, the explainable AI algorithms LIME and SHAP were used to provide further analysis. The procedure for the experimental setup is shown in Figure 2.

5.1. Performance Measures

Four performance metrics, namely, precision, recall, accuracy, and F1 score, were used to evaluate the performance of the models described in this paper. Accuracy is the most often used metric, and it evaluates the rate at which models make correct predictions. Precision measures the proportion of samples predicted as positive that are actually positive, whereas recall measures the proportion of actual positive cases that are successfully identified. The F1 score was used due to the imbalanced nature of the dataset, and it is the harmonic mean of the recall and the precision. The formulae used to calculate precision, recall, accuracy, and F1 score are given by Equations (7), (8), (9) and (10), respectively [54], where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives.

$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (7)$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (8)$$

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (9)$$

$$\mathrm{F1\text{-}Score} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (10)$$
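
Equations (7) to (10) can be computed directly from the four confusion-matrix counts. The sketch below is a minimal illustration; the function name `classification_metrics` and the counts passed to it are hypothetical and are not results from the study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute precision, recall, accuracy, and F1 score from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1": f1}


# Hypothetical counts for a binary classifier evaluated on 100 samples.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts the accuracy is 0.85 while the recall is only 0.80, which illustrates why accuracy alone can hide missed positive cases on imbalanced data.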

5.2. Optimization Strategy

Hyperparameters were tuned with the GridSearchCV approach, which exhaustively evaluates every combination of values in a specified grid and selects the combination that yields the highest accuracy. Each combination was scored using stratified 10-fold cross-validation. The hyperparameter grid for each of the algorithms tested is given below.

KNN: N_neighbors {3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39} and Metric {‘minkowski’,‘euclidean’,‘manhattan’}.
DT: Criterion {Gini, Entropy} and Max_depth {2,4,6,8,10,12,14,16,18,20,50,100,150,200}.
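
The grids above can be expressed as a scikit-learn `GridSearchCV` search, as in the following sketch. It runs on synthetic data from `make_classification`, since the study's dataset is not reproduced here; the sample size, the random seeds, and the shuffling in `StratifiedKFold` are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 11 features (the number kept by feature selection).
X, y = make_classification(n_samples=300, n_features=11, random_state=0)

# Stratified 10-fold cross-validation, as described in the text.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

knn_grid = {
    "n_neighbors": list(range(3, 40, 2)),  # 3, 5, ..., 39
    "metric": ["minkowski", "euclidean", "manhattan"],
}
dt_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 50, 100, 150, 200],
}

knn_search = GridSearchCV(KNeighborsClassifier(), knn_grid, scoring="accuracy", cv=cv)
knn_search.fit(X, y)

dt_search = GridSearchCV(DecisionTreeClassifier(random_state=0), dt_grid, scoring="accuracy", cv=cv)
dt_search.fit(X, y)

print("KNN best:", knn_search.best_params_, round(knn_search.best_score_, 4))
print("DT best:", dt_search.best_params_, round(dt_search.best_score_, 4))
```

The KNN grid contains 19 × 3 = 57 combinations and the DT grid 2 × 14 = 28, so with 10 folds the search fits 570 and 280 models, respectively. In scikit-learn, `minkowski` with the default `p=2` is the Euclidean distance, which is consistent with the two metrics giving matching curves.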

5.2.1. Experiment 1: Performance Analysis on Actual Data without Oversampling

In Figure 3, the performance of KNN was evaluated using different metrics and numbers of neighbors. The graph shows that the Minkowski and Euclidean metrics had similar scores, while the Manhattan metric had a different trend. Overall, the highest accuracy, 83.30%, was achieved using the Manhattan metric with 3 neighbors.

Figure 4a,b illustrates the effect of using different splitters and criteria at various maximum depth levels. Overall, the Gini and Entropy performance saw a gradual decline as the maximum depth increased. The Entropy criterion and a maximum depth of 2 produced the best performance score of 85.72% using the best splitter.

5.2.2. Experiment 2: Performance Analysis with Oversampling

Figure 5 shows the graph for the hyperparameter tuning for KNN classifier with the combination of different metrics and number of neighbors after SMOTE–Tomek was applied. The graph shows a decreasing trend for all the metrics. Furthermore, it can be observed that Minkowski and Euclidean portrayed the exact same results, while Manhattan showed a different result. The highest score of 84.64% was obtained with Manhattan and the number of neighbors was 3.

Figure 6a,b shows the trends for the best and random splitters when combined with different criteria and maximum depth values. From the plot, we can observe that the entropy criterion performed better than the Gini criterion with both splitters. However, both plots show that the scores remained constant beyond a certain depth. Overall, the best splitter, with the entropy criterion and a maximum depth of 14, produced the highest result of 84.62%.

A synopsis of the best hyperparameters of each experimented model is presented in Table 4. The table indicates an increase in the performance of the KNN, which moved from 86% accuracy to 91% when oversampling was applied. Likewise, the DT also saw an increase in its performance when the dataset was oversampled. Overall, the SMOTE–Tomek technique had a positive impact on the performance of the classifiers; as such, subsequent experiments were conducted with the SMOTE–Tomek technique.

5.2.3. Experiment 3: Performance Analysis of Proposed Bagging Model

Lately, ensemble techniques have gained popularity for their ability to enhance the performance of traditional algorithms. In the third experiment, ensemble algorithms were created using the two standard algorithms tested earlier, along with their optimized hyperparameters from the second experiment using the GridSearchCV technique. Subsequently, GridSearchCV was utilized to fine-tune the bagging models. The hyperparameter grid for the bagging algorithms is provided below.

n_estimators {10,50,100,150,200,250}
max_features {0.5,0.6,0.7,0.8,0.9,1}
max_samples {0.5,0.6,0.7,0.8,0.9,1}.

Table 5 presents the optimal hyperparameter values of the bagging models.

5.2.4. Experiment 4: Performance Analysis of Proposed Stacking Model

Unlike bagging, stacking combines multiple algorithms into a single model by utilizing a meta-classifier to combine their prediction abilities. The fourth experiment involves the building of a stacking model utilizing the models created in the second and third experiments along with their optimal hyperparameters. The meta-classifier chosen was the best performing model among the four created, which was DT bagging. In all, five models were used in creating the stacking model. As in the previous experiments, GridSearchCV was utilized here to obtain the optimal hyperparameters and attain the highest possible accuracy. The hyperparameter grid for the stacking model is presented below.

final_estimator__n_estimators {10,50,100,150,200,250}
final_estimator__max_features {0.5,0.6,0.7,0.8,0.9,1}
final_estimator__max_samples {0.5,0.6,0.7,0.8,0.9,1}.

Table 6 presents the optimal hyperparameter values for the stacking model.
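
A stacking model of this shape can be sketched with scikit-learn's `StackingClassifier`, as shown below. The data are synthetic, and the bagging settings (`n_estimators=10`), the internal `cv=5`, and the random seeds are placeholder assumptions rather than the tuned values reported in the tables. The KNN and DT settings follow the best values stated for the second experiment.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, StackingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the study's dataset is not reproduced here.
X, y = make_classification(n_samples=300, n_features=11, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Level 0: the two single models and their bagged counterparts.
base_models = [
    ("knn", KNeighborsClassifier(n_neighbors=3, metric="manhattan")),
    ("dt", DecisionTreeClassifier(criterion="entropy", max_depth=14, random_state=0)),
    ("knn_bag", BaggingClassifier(KNeighborsClassifier(n_neighbors=3), n_estimators=10, random_state=0)),
    ("dt_bag", BaggingClassifier(DecisionTreeClassifier(random_state=0), n_estimators=10, random_state=0)),
]

# Level 1: a DT bagging meta-classifier (placeholder hyperparameters).
meta_model = BaggingClassifier(DecisionTreeClassifier(random_state=0), n_estimators=10, random_state=0)

stack = StackingClassifier(estimators=base_models, final_estimator=meta_model, cv=5)
stack.fit(X_train, y_train)
test_accuracy = stack.score(X_test, y_test)
print(f"Test accuracy on synthetic data: {test_accuracy:.4f}")
```

The `final_estimator__` prefix in the grid follows scikit-learn's nested-parameter naming: `final_estimator__n_estimators` addresses the `n_estimators` parameter of the meta-classifier inside the stacking model, which is how `GridSearchCV` can tune it.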

6. Results and Discussion

This study utilized a stratified 70:30 percentage split to train seven machine learning algorithms. Table 7 presents the model performance of each model’s training and test set accuracy, precision, recall, and F1 score obtained with the optimal hyperparameters determined by the GridSearchCV approach. As can be seen from the table, in experiment 1, both KNN and DT had the same score of 86.75% for the training set accuracy; however, DT outperformed KNN in all other performance metrics in the test set. Apart from test set accuracy, KNN barely scored above 40% in the other performance metrics. Although DT outperformed KNN, this can be attributed to the unbalanced nature of the dataset. Experiment 2 observed a boost in the performance of the two models in all aspects of the metrics used. DT had a score of 100% in the training accuracy, while KNN achieved a 91.40% score.

On the other hand, KNN outperformed DT in all the test set metrics, particularly in recall, where KNN scored 91.20% compared with DT’s 84.74%. In experiment 3, the performances of the two bagging models were very close. DT bagging outscored KNN bagging in training accuracy, with a score of 96.15% against 95.70%. Likewise, in testing precision, DT bagging had 93.82% while KNN bagging had 92.22%. KNN bagging outperformed DT bagging in the other metrics. The fourth and final experiment was a stacking model composed of models from experiments 2 and 3. It performed better than all the other models in all metrics except testing recall, where KNN bagging had the highest score of 91.20%. The stacking model achieved 97.51%, 92.63%, 89.01%, 95.29%, and 92.04% in training accuracy, testing accuracy, recall, precision, and F1 score, respectively.
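
As a quick consistency check, the F1 score of the stacking model can be recomputed from its precision of 95.29% and recall of 89.01% using the F1 formula in Equation (10). The variable names below are illustrative only.

```python
# Reported test-set scores of the stacking model, in percent.
precision = 95.29
recall = 89.01

# F1 score is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.2f}%")
```

The recomputed value rounds to 92.04%, which agrees with the F1 score reported for the stacking model.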

In general, SMOTE–Tomek significantly enhanced the models' performance, as evidenced by the increases in the KNN and DT training and testing accuracies in experiment 2. Furthermore, the recall, precision, and F1 scores improved significantly following the implementation of SMOTE–Tomek, with large differences for both KNN and DT. The improvement can be attributed to SMOTE–Tomek's capacity to artificially create new instances that exhibit a reasonable degree of variation from the original records. This is achieved by upsampling the minority class in order to maintain class balance. As a result, it strengthens the models' capacity for generalization, which raises overall performance.

Additionally, the bagging ensemble considerably enhanced the performance of DT and KNN across all metrics. Moreover, KNN bagging produced the highest recall score of 91.20%, matching KNN from experiment 2. Overall, the proposed stacking ensemble in experiment 4 had the best performance, with a training accuracy of 97.51%, testing accuracy of 92.63%, recall of 89.01%, precision of 95.29%, and F1 score of 92.04%. The closeness of the training and test accuracies of the proposed models indicates that they avoid both overfitting and underfitting.

6.1. Effect of Feature Selection and Permutation Importance on the Dataset

Feature selection is considered critical in developing a good prediction classifier. It aims to achieve greater prediction accuracy while employing the fewest features, thereby reducing computational complexity. This study utilized the K-best feature selection algorithm provided by the Sklearn library with the chi-squared test as the score function and a k value of 11. The algorithm takes a score function for estimating the significance of each feature and the number of features to keep, denoted as k, and then simply returns the top-k features.

The key advantage of the K-best feature selection is that we can choose from a range of criteria or score functions for calculating the relevance of a feature. For example, we may use the chi-squared test to determine a feature’s independence from a target. The higher the score, the more dependent the attribute is on the target, and hence the greater the relevance of that attribute. Other methods that can be used with the K-best are the ANOVA F-value (f_classif) and mutual information (mutual_info_classif) [55].
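The K-best selection described above can be sketched with scikit-learn's `SelectKBest` and `chi2`. The data here are synthetic and purely illustrative: 23 non-negative predictors are assumed (the chi-squared score function requires non-negative inputs), and the target is constructed to depend on the first two columns only.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)

# Synthetic stand-in for the clinical data (assumption): 300 rows,
# 23 non-negative integer-valued predictors.
n_samples, n_features = 300, 23
X = rng.integers(0, 10, size=(n_samples, n_features)).astype(float)

# Hypothetical binary target driven by the first two columns.
y = (X[:, 0] + X[:, 1] > 9).astype(int)

# Keep the 11 features with the highest chi-squared scores.
selector = SelectKBest(score_func=chi2, k=11)
X_selected = selector.fit_transform(X, y)

selected_idx = selector.get_support(indices=True)
print("selected columns:", selected_idx)
print("chi2 score of column 0:", round(float(selector.scores_[0]), 1))
```

Swapping `chi2` for `f_classif` or `mutual_info_classif` changes only the scoring criterion; the top-k selection step stays the same.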

Additionally, the eleven features selected by the K-best algorithm were ranked using the permutation importance algorithm provided by the Sklearn library. Permutation importance helps explain black box models by ranking features according to their predictive power after training. The score assigned to each predictor reflects how much the model's performance degrades when that predictor's values are randomly shuffled, allowing for feature interpretation based on relative predictive power [21].
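A minimal sketch of permutation importance with scikit-learn's `permutation_importance` follows. It uses a synthetic dataset in which only the first feature carries signal, and a plain decision tree stands in for the study's models; both choices are assumptions made for illustration.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic data (assumption): four features, only feature 0 is informative.
X = rng.uniform(size=(400, 4))
y = (X[:, 0] > 0.5).astype(int)

model = DecisionTreeClassifier(random_state=0).fit(X, y)

# Shuffle each column several times and record the mean drop in accuracy.
result = permutation_importance(
    model, X, y, scoring="accuracy", n_repeats=10, random_state=0
)

# Rank features from most to least important.
ranking = np.argsort(result.importances_mean)[::-1]
for idx in ranking:
    print(f"feature {idx}: mean accuracy drop = {result.importances_mean[idx]:.3f}")
```

Running the same call on the training set and on the held-out set gives the two views of importance that the text compares.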

Figure 7 shows the values obtained for all the models' permutation importance using the training set while Figure 8 shows that of the test set. It is apparent from the graph in Figure 7 that, out of the 24 features, age, P_O_LOS, Total_LOS, Symptoms_Days, G_A_Pain, Pain_Radiation, RLQ_Mass, neutrophils, WBC_Count, diarrhea, and readmission were selected by the K-best algorithm as the eleven best features. Additionally, the graph shows how each feature contributed to the accuracy of each model.

From the graphs in Figure 8, it is apparent that age, WBC_Count, and neutrophils were the highest contributors to the performance of the KNN model, while DT selected neutrophils, Total_LOS, and WBC_Count as its highest contributors. Conversely, age, P_O_LOS, and Total_LOS were chosen by the ensemble classifiers as their highest contributors. From these results, we can conclude that the most important features in the classification of complicated and non-complicated appendicitis are age, P_O_LOS, Total_LOS, WBC_Count, and neutrophils.

6.2. Explainable AI

ML models, despite performing at a level analogous to humans, have a limited range of applications because they are considered a “black box”. This lack of transparency hinders their practical use, particularly in healthcare. To address this issue and promote the use of AI in healthcare, explainable artificial intelligence (XAI) has been developed. XAI aims to increase user confidence in a model’s predictions by providing an explanation of how the predictions were generated. XAI is described as a set of attributes that explain how a model arrived at its prediction [56]. This study utilized two XAI methodologies called Shapley additive explanations (SHAP) and local interpretable model-agnostic explanations (LIME) to explain the best model outcomes.

These methods have been effectively utilized in the literature for similar problems, which is why the current study employs them for the problem at hand.

6.2.1. SHAP

SHAP is an XAI technique based on Shapley values from cooperative game theory, which fairly allocate benefits and costs among participants working cooperatively. SHAP assigns each attribute a relevance value for a specific prediction [57], which estimates how significant a feature is within a model. Figure 9 shows the SHAP summary plot for the stacking model, which identifies P_O_LOS, neutrophils, Total_LOS, age, and WBC_Count as the five features with the most impact on the model's prediction, each positively correlated with the outcome. The red dots indicate higher values of a feature, while the blue dots indicate lower values. Higher values of P_O_LOS lead to higher chances of the condition being complicated, while lower values lead to lower chances of it being complicated. Likewise, the higher the neutrophils value, the higher the chance of the condition being complicated, and vice versa. Total_LOS, age, WBC_Count, and Symptoms_Days show the same tendency. Conversely, higher values of G_A_Pain, Pain_Radiation, and RLQ_Mass denote uncomplicated appendicitis, while their lower values have no impact on the model's prediction. Additionally, high values of readmission increase the chance of a complicated condition. By comparing these insights to the overall understanding of the problem, we may be confident that the model behaves intuitively and reaches sound conclusions. For instance, a patient diagnosed with complicated appendicitis will need to stay longer at the hospital after the operation than a patient whose condition is not complicated. So, we can conclude from the SHAP plot that the features with non-binary values have a more significant impact on the model's prediction than those with binary values.
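To make the Shapley allocation concrete, the sketch below computes exact Shapley values for a tiny hand-written model by enumerating every coalition of features. This is an illustration of the underlying game-theoretic definition, not of the SHAP library, which relies on approximations for realistic models; the toy model, the instance, and the all-zero baseline are all hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating all coalitions (small n only)."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Weight of a coalition of this size in the Shapley average.
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                # Marginal contribution of feature i to this coalition.
                phi[i] += weight * (value_fn(coalition | {i}) - value_fn(coalition))
    return phi


def toy_model(x):
    # Hypothetical model with an interaction between features 0 and 2.
    return 2.0 * x[0] + 1.0 * x[1] + 0.5 * x[0] * x[2]


x = [3.0, 1.0, 4.0]          # instance to explain (hypothetical)
baseline = [0.0, 0.0, 0.0]   # reference values for "absent" features


def value_fn(coalition):
    # Features in the coalition keep their value; the rest use the baseline.
    mixed = [x[j] if j in coalition else baseline[j] for j in range(len(x))]
    return toy_model(mixed)


phi = shapley_values(value_fn, len(x))
print("Shapley values:", phi)
print("sum:", sum(phi), "prediction minus baseline:", toy_model(x) - toy_model(baseline))
```

The contributions add up to the difference between the prediction and the baseline output, and the interaction term is split evenly between the two features involved, which is the fairness property the text refers to.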

6.2.2. LIME

Like SHAP, LIME also explains a model’s prediction through its important features. However, it does not implement the concept of game theory cooperation like SHAP does. Instead, it explains how a model’s prediction probability is affected by each feature [56].
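The idea behind LIME can be sketched as a local surrogate: perturb the sample, query the black box model for class probabilities, weight the perturbed points by their proximity to the sample, and fit a simple linear model whose coefficients act as the explanation. The code below is a simplified illustration of that idea and not the `lime` package itself; the function name, the Gaussian perturbation scale, the kernel form, and the toy model are all assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeClassifier


def local_surrogate_weights(predict_proba, x, n_samples=2000, scale=0.25,
                            kernel_width=0.75, seed=0):
    """Fit a proximity-weighted linear model around x (LIME-style sketch)."""
    rng = np.random.default_rng(seed)
    # Perturb the instance with Gaussian noise (assumed perturbation scheme).
    offsets = rng.normal(0.0, scale, size=(n_samples, x.size))
    neighbours = x + offsets
    # Black box probability of the positive class for each perturbed point.
    target = predict_proba(neighbours)[:, 1]
    # Exponential kernel: nearby points get more weight.
    distance = np.linalg.norm(offsets / scale, axis=1)
    width = kernel_width * np.sqrt(x.size)
    proximity = np.exp(-(distance ** 2) / (width ** 2))
    surrogate = Ridge(alpha=1.0).fit(offsets, target, sample_weight=proximity)
    return surrogate.coef_


# Toy black box (assumption): only feature 0 determines the class.
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 3))
y = (X[:, 0] > 0.5).astype(int)
black_box = DecisionTreeClassifier(random_state=0).fit(X, y)

sample = np.array([0.45, 0.5, 0.5])
weights = local_surrogate_weights(black_box.predict_proba, sample)
print("local feature weights:", np.round(weights, 3))
```

A positive weight means that increasing the feature near this sample pushes the predicted probability towards the positive class, which corresponds to the per-feature bars in a LIME plot.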

Figure 10 and Figure 11 show the stacking model's LIME graph for the predicted values for test set samples 0 and 11, respectively. The left graph of Figure 10 shows the prediction probabilities, stating that the sample has a 91% probability of being non-complicated and a 9% probability of being complicated. The middle graph shows the feature importance scores for this sample, with P_O_LOS having a score of 27%, followed by Symptoms_Days with 12% and Total_LOS with 8%. Neutrophils and WBC_Count were among the features contributing to the sample's tendency of being complicated. Conversely, Figure 11 shows a complicated sample with a predicted probability of 96% of being complicated and 4% of being non-complicated. Neutrophils and Total_LOS were the highest-ranking features, with a score of 14% each. They were followed by P_O_LOS, Symptoms_Days, and WBC_Count as the others among the top five features. Age, diarrhea, and readmission contributed to the sample's tendency of being non-complicated. The graph on the right shows the feature importance for each sample along with their values, with blue representing non-complicated and orange representing complicated. Apart from explaining how the stacking model arrived at its prediction, LIME also allows us to extract knowledge from the data. For non-complicated samples, P_O_LOS and Symptoms_Days had a value less than or equal to 1.

Additionally, the Total_LOS values range from 1.03 to 2 inclusive. Patients aged 21 years and below are likely to have a non-complicated condition. Moreover, conditions without diarrhea are considered non-complicated. On the other hand, a condition with a neutrophils value above 85.76 is considered complicated. Furthermore, patients with a complicated condition have a post-operative length of stay greater than 1 and less than or equal to 2. The total length of stay of these patients is likely between 2 and 2.34 days. Also, these patients exhibit a WBC_Count value greater than 15.33.

It can be concluded that Symptoms_Days, P_O_LOS, and Total_LOS are common features in diagnosing complicated and non-complicated appendicitis. Age and diarrhea are important indicators of non-complicated appendicitis, whereas neutrophils and WBC_Count are indicators of a complicated condition.

7. Further Discussion

Acute appendicitis is one of the most prevalent causes of acute abdominal pain, the most common reason for abdominal surgery in children, and the most common reason for litigation against ER doctors [58]. Acute appendicitis develops when the appendiceal lumen becomes clogged, resulting in the accumulation of fluid, inflammation, luminal distention, and, eventually, perforation [34,59,60,61]. Although the symptoms of acute appendicitis are well known [62], almost one-third of acute appendicitis patients present with atypical symptoms [63]. Furthermore, people with other abdominal illnesses may present with clinical symptoms similar to those of acute appendicitis [64]. The concern of developing perforated appendicitis, which can increase morbidity and require a lengthy hospital stay, is a key factor in this diagnostic conundrum [65]. Historically, the best strategy to reduce the rate of perforation has been to lower the threshold for admitting patients to the operating room, which results in a higher negative appendectomy rate [11,66].

However, the dawn of AI and machine learning has ushered in a paradigm shift in how healthcare decisions are made. Utilizing the potential of ML technology and pinpointing the key characteristics of acute appendicitis will greatly aid quick decision-making, thereby reducing the mortality rate. In this study, we used a Saudi Arabian clinical dataset to conduct four distinct ML experiments to predict complicated and non-complicated acute appendicitis. A dataset of 411 samples with 24 features was used to create predictive models comprising KNN, DT, KNN bagging, DT bagging, and stacking approaches. Before building the models, the nominal variables were label encoded, outliers were capped, and missing values were imputed using the KNN imputer. The first experiment involved building KNN and DT models without upsampling. The second experiment used SMOTE–Tomek to upsample the negative class. The outcomes of experiments 1 and 2 demonstrated the superiority of the balanced dataset over the unbalanced one. Because the results from the upsampled dataset were significantly better, the ensemble models were created using the SMOTE–Tomek dataset. DT bagging outperformed KNN bagging in training accuracy; however, KNN bagging performed better on the test set, with a score of 92.10% to 89.47%. The stacking model then integrated the four prior optimal models, KNN, DT, KNN bagging, and DT bagging, as demonstrated in experiment 4. It achieved a training accuracy of 97.51%, a test accuracy of 92.63%, a recall of 89.01%, a precision of 95.29%, and an F1 score of 92.04%.

Non-complicated appendicitis is characterized as an inflamed appendix that is phlegmonous in nature and does not show any signs of necrosis or perforation, as opposed to complicated appendicitis, which includes focal or transmural necrosis that may eventually result in perforation [66,67]. There are no clear guidelines on how to distinguish between non-complicated and complicated appendicitis. However, existing recommendations suggest that non-complicated appendicitis can be treated with antibiotics alone and that complicated appendicitis is an emergency that needs urgent treatment [60,61]. Nevertheless, the explainable AI implemented in this study showed neutrophils and WBC_Count to be important markers in diagnosing complicated appendicitis. Cases with neutrophil and WBC_Count values above 85.76 and 15.33, respectively, are likely to be complicated. Furthermore, people below 21 years of age are more prone to non-complicated appendicitis than to complicated appendicitis.

Moreover, non-complicated appendicitis does not exhibit symptoms of diarrhea. On the other hand, cases with symptoms persisting for more than a day are more likely to be complicated. In terms of length of stay, non-complicated appendicitis is characterized by a shorter total and post-operative length of stay than complicated appendicitis. Diagnosis of acute appendicitis is difficult. A variety of illnesses, including the two degrees of appendicitis severity, are present in patients with abdominal discomfort and suspicion of appendicitis [68,69,70]. This diagnostic dilemma might lead to high mortality if the condition is misdiagnosed or diagnosed late [10]. The proposed ML models have clearly shown potential value in not only diagnosing appendicitis but also distinguishing between complicated and non-complicated cases. Per the findings, elevated neutrophils and WBC_Count are key markers of complicated appendicitis, which non-complicated appendicitis might not present. As a result, we anticipate that clinicians and medical practitioners will be on the lookout for these markers in their diagnosis of appendicitis and its associated conditions. The least significant features might also reveal additional patterns.

8. Conclusions

This study presented a new and comprehensive machine learning approach to diagnosing appendicitis using clinical data collected from the university hospital. The experiments were conducted in close collaboration with medical doctors from the college of medicine. Four main experiments were carried out to achieve the objective. The first experiment used the preprocessed dataset and two ML algorithms, namely, DT and KNN, to create two models. The second experiment involved oversampling the negative class using SMOTE–Tomek, creating two models from the algorithms. The results of the two experiments were compared, and the best-performing experiment was used for subsequent experiments. In the third experiment, a KNN and DT bagging technique was used to create two models by utilizing the hyperparameters of the two models obtained in the second experiment. The final experiment created a stacking model using the hyperparameters obtained from the last two experiments. Finally, explainable AI algorithms LIME and SHAP were used to provide further analysis. The experimental results showed that the stacking model had the highest training accuracy, test set accuracy, precision, and F1 score. Feature importance and explainable AI identified the principal features that significantly affected the performance of the model. Based on the outcomes and feedback from medical health professionals, the scheme is promising in terms of its effectiveness in the diagnosis of acute appendicitis. In the future, the authors intend to explore other ensemble models, deep learning models, and transformers for the same problem.

Author Contributions

Conceptualization, A.M.A., H.M.A. and M.T.A.-H.; Data curation, M.S.F.; Formal analysis, A.R., M.S.F. and H.M.A.; Investigation, M.G., A.M.A. and N.M.; Methodology, M.G., S.A.K., M.T.A.-H. and N.M.; Software, S.A.K.; Supervision, M.G.; Validation, A.R., M.S.F., A.M.A., H.M.A. and M.T.A.-H.; Writing—original draft, S.A.K.; Writing—review and editing, A.R. and N.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Dataset is available from the first author and can be provided upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Lotfollahzadeh, S.; Lopez, R.A.; Deppen, J.G. Appendicitis, StatPearls Publishing. 2022. Available online: https://www.ncbi.nlm.nih.gov/books/NBK493193/ (accessed on 11 December 2022).
2. Mayo Clinic. Appendicitis. 2021. Available online: https://www.mayoclinic.org/diseases-conditions/appendicitis/symptoms-causes/syc-20369543 (accessed on 11 December 2022).
3. Johns Hopkins Medicine. Appendicitis. Available online: https://www.hopkinsmedicine.org/health/conditions-and-diseases/appendicitis (accessed on 12 November 2022).
4. Cleveland Clinic. Appendicitis. Available online: https://my.clevelandclinic.org/health/diseases/8095-appendicitis (accessed on 12 November 2022).
5. Puylaert, J.B.C.M.; Rutgers, P.H.; Lalisang, R.I.; De Vries, B.C.; van der Werf, S.D.; Dörr, J.P.; Blok, R.A. A Prospective Study of Ultrasonography in the Diagnosis of Appendicitis. N. Engl. J. Med. 1987, 317, 666–669.
6. Gorter, R.R.; Eker, H.H.; Gorter-Stam, M.A.W.; Abis, G.S.A.; Acharya, A.; Ankersmit, M.; Antoniou, S.A.; Arolfo, S.; Babic, B.; Boni, L.; et al. Diagnosis and management of acute appendicitis. EAES consensus development conference 2015. Surg. Endosc. 2016, 30, 4668–4690.
7. Di Saverio, S.; Podda, M.; De Simone, B.; Ceresoli, M.; Augustin, G.; Gori, A.; Boermeester, M.; Sartelli, M.; Coccolini, F.; Tarasconi, A.; et al. Diagnosis and treatment of acute appendicitis: 2020 update of the WSES Jerusalem guidelines. World J. Emerg. Surg. 2020, 15, 27.
8. Ozdemir, D.B.; Karayigit, A.; Dizen, H.; Unal, B. Role of hyponatremia in differentiating complicated appendicitis from uncomplicated appendicitis: A comparative study. Eur. Rev. Med. Pharmacol. Sci. 2022, 26, 8057–8063.
9. MedBroadcast. Appendicitis. Available online: https://medbroadcast.com/condition/getcondition/appendicitis (accessed on 12 November 2022).
10. Alvarado, A. A practical score for the early diagnosis of acute appendicitis. Ann. Emerg. Med. 1986, 15, 557–564.
11. Khairy, G. Acute Appendicitis: Is Removal of a Normal Appendix Still Existing and Can We Reduce Its Rate? Saudi J. Gastroenterol. 2009, 15, 167–170.
12. Kosloske, A.M.; Love, C.L.; Rohrer, J.E.; Goldthorn, J.F.; Lacey, S.R. The Diagnosis of Appendicitis in Children: Outcomes of a Strategy Based on Pediatric Surgical Evaluation. Pediatrics 2004, 113, 29–34.
13. Pritchett, C.V.; Levinsky, N.C.; Ha, Y.P.; Dembe, A.E.; Steinberg, S.M. Management of acute appendicitis: The impact of CT scanning on the bottom line. J. Am. Coll. Surg. 2010, 210, 699–705.
14. Fergusson, J.A.E.; Hitos, K.; Simpson, E. Utility of white cell count and ultrasound in the diagnosis of acute appendicitis. ANZ J. Surg. 2002, 72, 781–785.
15. Park, S.Y.; Seo, J.S.; Lee, S.C.; Kim, S.M. Application of an Artificial Intelligence Method for Diagnosing Acute Appendicitis: The Support Vector Machine. In Future Information Technology: FutureTech 2013; Springer: Berlin/Heidelberg, Germany, 2014; pp. 85–92.
16. Medical News Today. Everything You Need to Know about a Burst Appendix. 2020. Available online: https://www.medicalnewstoday.com/articles/appendix-burst (accessed on 13 November 2022).
17. Craig, S. Appendicitis. Medscape. 2022. Available online: https://emedicine.medscape.com/article/773895-overview?form=fpf (accessed on 28 August 2024).
18. Michie, D. ‘Memo’ functions and machine learning. Nature 1968, 218, 19–22.
19. Bhavsar, K.A.; Singla, J.; Alzubi, A.A. A comprehensive review on medical diagnosis using machine learning. Comput. Mater. Contin. 2021, 67, 1997.
20. Gollapalli, M.; Kudos, S.A.; Alhamad, M.A.; Alshehri, A.A.; Alyemni, H.S. Machine Learning Models Towards Prediction of COVID and Non-COVID 19 Patients in the Hospital’s Intensive Care Units (ICU). Math. Model. Eng. Probl. 2022, 9, 1471–1480.
21. Gollapalli, M.; Alansari, A.; Alkhorasani, H.; Alsubaii, M.; Sakloua, R.; Alzahrani, R.; Al-Hariri, M.; Alfares, M.; AlKhafaji, D.; Al Argan, R.; et al. A novel stacking ensemble for detecting three types of diabetes mellitus using a Saudi Arabian dataset: Pre-diabetes, T1DM, and T2DM. Comput. Biol. Med. 2022, 147, 105757.
22. Ahmed, M.S.; Rahman, A.; AlGhamdi, F.; AlDakheel, S.; Hakami, H.; AlJumah, A.; AlIbrahim, Z.; Youldash, M.; Alam Khan, M.A.; Basheer Ahmed, M.I. Joint Diagnosis of Pneumonia, COVID-19, and Tuberculosis from Chest X-ray Images: A Deep Learning Approach. Diagnostics 2023, 13, 2562.
23. Jan, F.; Rahman, A.; Busaleh, R.; Alwarthan, H.; Aljaser, S.; Al-Towailib, S.; Alshammari, S.; Alhindi, K.R.; Almogbil, A.; Bubshait, D.A.; et al. Assessing Acetabular Index Angle in Infants: A Deep Learning-Based Novel Approach. J. Imaging 2023, 9, 242.
24. Khan, T.A.; Fatima, A.; Shahzad, T.; Rahman, A.-U.; Alissa, K.; Ghazal, T.M.; Al-Sakhnini, M.M.; Abbas, S.; Khan, M.A.; Ahmed, A. Secure IoMT for Disease Prediction Empowered With Transfer Learning in Healthcare 5.0, the Concept and Case Study. IEEE Access 2023, 11, 39418–39430.
25. Mucherino, A.; Papajorgji, P.J.; Pardalos, P.M. K-Nearest Neighbor Classification. Data Min. Agric. 2009, 34, 83–106.
26. Musleh, D.A.; Olatunji, S.O.; Almajed, A.A.; Alghamdi, A.S.; Alamoudi, B.K.; Almousa, F.S.; Aleid, R.A.; Alamoudi, S.K.; Jan, F.; Al-Mofeez, K.A.; et al. Ensemble Learning Based Sustainable Approach to Carbonate Reservoirs Permeability Prediction. Sustainability 2023, 15, 14403.
27. Akmese, O.F.; Dogan, G.; Kor, H.; Erbay, H.; Demir, E. The Use of Machine Learning Approaches for the Diagnosis of Acute Appendicitis. Emerg. Med. Int. 2020, 2020, 7306435.
28. Lee, Y.; Hu, P.J.; Cheng, T.; Huang, T. Artificial Intelligence in Medicine A preclustering-based ensemble learning technique for acute appendicitis diagnoses. Artif. Intell. Med. 2022, 58, 115–124.
29. Lam, A.; Squires, E.; Tan, S.; Swen, N.J.; Barilla, A.; Kovoor, J.; Gupta, A.; Bacchi, S.; Khurana, S. Artificial intelligence for predicting acute appendicitis: A systematic review. ANZ J. Surg. 2023, 93, 2070–2078.
30. Yoldaş, Ö.; Tez, M.; Karaca, T. Artificial neural networks in the diagnosis of acute appendicitis. Am. J. Emerg. Med. 2012, 30, 1245–1247.
31. Issaiy, M.; Zarei, D.; Saghazadeh, A. Artificial Intelligence and Acute Appendicitis: A Systematic Review of Diagnostic and Prognostic Models. World J. Emerg. Surg. 2023, 18, 59.
32. Phan-Mai, T.-A.; Thai, T.T.; Mai, T.Q.; Vu, K.A.; Mai, C.C.; Nguyen, D.A. Validity of Machine Learning in Detecting Complicated Appendicitis in a Resource-Limited Setting: Findings from Vietnam. BioMed Res. Int. 2023, 2023, 5013812.
33. Akbulut, S.; Yagin, F.H.; Cicek, I.B.; Koc, C.; Colak, C.; Yilmaz, S. Prediction of Perforated and Nonperforated Acute Appendicitis Using Machine Learning-Based Explainable Artificial Intelligence. Diagnostics 2023, 13, 1173.
34. Rajpurkar, P.; Park, A.; Irvin, J.; Chute, C.; Bereket, M.; Mastrodicasa, D.; Langlotz, C.P.; Lungren, M.P.; Ng, A.Y.; Patel, B.N. AppendiXNet: Deep Learning for Diagnosis of Appendicitis from A Small Dataset of CT Exams Using Video Pretraining. Sci. Rep. 2020, 10, 3958.
35. Goswami, R.; Kour, H.; Manhas, J.; Sharma, V. Comparison and Analysis of Machine Learning Techniques for the Prediction of Acute Appendicitis. J. Appl. Inf. Sci. 2020, 8, 14–21.
36. Xia, J.; Wang, Z.; Yang, D.; Li, R.; Liang, G.; Chen, H.; Heidari, A.A.; Turabieh, H.; Mafarja, M.; Pan, Z. Performance optimization of support vector machine with oppositional grasshopper optimization for acute appendicitis diagnosis. Comput. Biol. Med. 2022, 143, 105206.
37. Eddama, M.M.R.; Fragkos, K.C.; Renshaw, S.; Aldridge, M.; Bough, G.; Bonthala, L.; Wang, A.; Cohen, R. Logistic regression model to predict acute uncomplicated and complicated appendicitis. Ann. R. Coll. Surg. Engl. 2019, 101, 107–118.
38. Phalak, P.; Bhandari, K.; Sharma, R. Analysis of Decision Tree-A Survey. Int. J. Eng. Res. 2014, 3, 149–154.
39. Wu, X.; Kumar, V.; Ross Quinlan, J.; Ghosh, J.; Yang, Q.; Motoda, H.; McLachlan, G.J.; Ng, A.; Liu, B.; Yu, P.S.; et al. Top 10 algorithms in data mining. Knowl. Inf. Syst. 2008, 14, 1–37.
40. Badr, A.; Din, E.; Elaraby, I.S. Data Mining: A prediction for Student’s Performance Using Classification Method. World J. Comput. Appl. Technol. 2014, 2, 43–47.
41. IBM. What Is a Decision Tree? Available online: https://www.ibm.com/topics/decision-trees (accessed on 27 April 2023).
42. Beckmann, M.; Ebecken, N.F.F.; De Lima, B.S.L.P. A KNN Undersampling Approach for Data Balancing. J. Intell. Learn. Syst. Appl. 2015, 7, 104–116.
43. Silverman, B.W.; Jones, M.C.; Fix, E. An Important Contribution to Nonparametric Discriminant Analysis and Density Estimation: Commentary on Fix and Hodges. Int. Stat. Rev. 1951, 57, 233–238.
44. Cunningham, P.; Delany, S.J. k-Nearest Neighbour Classifiers: 2nd Edition (with Python examples). arXiv 2020, arXiv:2004.04523.
45. IBM. What Is the k-Nearest Neighbors Algorithm? Available online: https://www.ibm.com/sa-en/topics/knn (accessed on 29 August 2022).
46. Arafat, H.; Alfeilat, A.; Hassanat, A.B.A.; Lasassmeh, O.; Tarawneh, A.S. Effects of Distance Measure Choice on K-Nearest Neighbor Classifier Performance: A Review. Big Data 2019, 7, 221–248.
47. Breiman, L. Bagging predictors. Mach. Learn. 1996, 24, 123–140.
48. Mohammed, A.; Kora, R. A comprehensive review on ensemble deep learning: Opportunities and challenges. J. King Saud Univ.-Comput. Inf. Sci. 2023, 35, 757–774.
49. Breiman, L.E.O. Random Forests. Mach. Learn. 2001, 45, 5–32.
50. Smyth, P.; Wolpert, D. Stacked density estimation. Adv. Neural Inf. Process. Syst. 1997, 10.
51. Ma, Z.; Wang, P.; Gao, Z.; Wang, R.; Khalighi, K. Ensemble of machine learning algorithms using the stacked generalization approach to estimate the warfarin dose. PLoS ONE 2018, 13, e0205872.
52. Sasada, T.; Liu, Z.; Baba, T.; Hatano, K.; Kimura, Y. A Resampling Method for Imbalanced Datasets Considering Noise and Overlap. Procedia Comput. Sci. 2020, 176, 420–429.
53. Batista, G.; Prati, R.; Monard, M.-C. A Study of the Behavior of Several Methods for Balancing machine Learning Training Data. SIGKDD Explor. 2004, 6, 20–29.
54. Alabbad, D.A.; Ajibi, S.Y.; Alotaibi, R.B.; Alsqer, N.K.; Alqahtani, R.A.; Felemban, N.M.; Rahman, A.; Aljameel, S.S.; Ahmed, M.I.B.; Youldash, M.M. Birthweight Range Prediction and Classification: A Machine Learning-Based Sustainable Approach. Mach. Learn. Knowl. Extr. 2024, 6, 770–788.
55. Peixeiro, M. A Practical Guide to Feature Selection Using Sklearn. Towards Data Science. 2022. Available online: https://towardsdatascience.com/a-practical-guide-to-feature-selection-using-sklearn-b3efa176bd96 (accessed on 16 May 2023).
56. Loh, H.W.; Ooi, C.P.; Seoni, S.; Barua, P.D.; Molinari, F.; Acharya, U.R. Application of explainable artificial intelligence for healthcare: A systematic review of the last decade (2011–2022). Comput. Methods Programs Biomed. 2022, 226, 107161.
57. Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 2017, 2017, 4766–4775.
58. Pinto Leite, N.; Pereira, J.M.; Cunha, R.; Pinto, P.; Sirlin, C. CT evaluation of appendicitis and its complications: Imaging techniques and key diagnostic findings. AJR Am. J. Roentgenol. 2005, 185, 406–417.
59. Park, S.S.; Kim, M.J.; Kim, J.W.; Park, H.-C. Analysis of treatment success with new inclusion criteria for antibiotic therapy for uncomplicated appendicitis: A multicentre cohort study. Int. J. Clin. Pract. 2021, 75, e13840.
60. Lee, H.J.; Woo, J.Y.; Byun, J. Right hydronephrosis as a sign of complicated appendicitis. Eur. J. Radiol. 2020, 131, 109241.
61. Kapral, N.M.; Pesch, A.J.; Khot, R. Abdominal Emergencies. Semin. Roentgenol. 2020, 55, 336–363.
62. Maurer, A.N.; Sayah, A.; Guzman, J.M.C.; Levy, A.D. Imaging of Elders. In Geriatric Forensic Medicine and Pathology; Collins, K.A., Byard, R.W., Eds.; Cambridge University Press: Cambridge, UK, 2020; pp. 507–535.
63. Monsonis, B.; Mandoul, C.; Millet, I.; Taourel, P. Imaging of appendicitis: Tips and tricks. Eur. J. Radiol. 2020, 130, 109165.
64. Ertan, N.; Akdağ, T.; Subaşı, I.D.; Kaya, İ.O.; Hekimoglu, B. Can appendix bending angle be an additional finding to detect acute appendicitis on MDCT? Acta Medica Alanya 2020, 4, 76–81.
65. Velanovich, V.; Satava, R. Balancing the normal appendectomy rate with the perforated appendicitis rate: Implications for quality assurance. Am. Surg. 1992, 58, 264–269.
66. Bom, W.J.; Scheijmans, J.C.G.; Salminen, P.; Boermeester, M.A. Diagnosis of Uncomplicated and Complicated Appendicitis in Adults. Scand. J. Surg. 2021, 110, 170–179.
67. Yazici, H.; Ugurlu, O.; Aygul, Y.; Ugur, M.A.; Sen, Y.K.; Yildirim, M. Predicting severity of acute appendicitis with machine learning methods: A simple and promising approach for clinicians. BMC Emerg. Med. 2024, 24, 101.
68. Wei, W.; Tongping, S.; Jiaming, W. Construction of a clinical prediction model for complicated appendicitis based on machine learning techniques. Sci. Rep. 2024, 14, 16473.
69. Marcinkevičs, R.; Wolfertstetter, P.R.; Klimiene, U.; Chin-Cheong, K.; Paschke, A.; Zerres, J.; Denzinger, M.; Niederberger, D.; Wellmann, S.; Ozkan, E.; et al. Interpretable and intervenable ultrasonography-based machine learning models for pediatric appendicitis. Med. Image Anal. 2024, 91, 103042.
70. Males, I.; Boban, Z.; Kumric, M.; Vrdoljak, J.; Berkovic, K.; Pogorelic, Z.; Bozic, J. Applying an explainable machine learning model might reduce the number of negative appendectomies in pediatric patients with a high probability of acute appendicitis. Sci. Rep. 2024, 14, 12772.

Figure 1. Variable correlation heatmap.

Figure 2. Experimental setup.

Figure 3. KNN with Different Metrics and N_neighbors.

Figure 4. (a) Best splitter and (b) random splitter for DT with different criterion and maximum depth levels.

Figure 5. KNN with different metrics and number of neighbors.

Figure 6. (a) Best splitter and (b) random splitter for DT with different criterion and Max_depth level.

Figure 7. Proposed models’ permutation importance (testing phase).

Figure 8. Proposed models’ permutation importance (training phase).

Figure 9. SHAP summary plot for stacking model.

Figure 10. Non-complicated sample’s LIME plot.

Figure 11. Complicated sample’s LIME plot.

Table 1. Description of Features.

Feature	Description
Age	Patient age in years (numeric)
Sex	Patient sex (male/female)
Comorbidities	Presence of other diseases or conditions (yes/no)
Readmission	Hospital readmission (yes/no)
Diarrhea	Diarrhea (yes/no)
Nausea	Nausea (yes/no)
Vomiting	Vomiting (yes/no)
E_P_Pain	Epigastric and periumbilical (central) abdominal pain. The epigastric region is the upper central part of the abdomen; periumbilical pain is localized around or behind the navel (yes/no)
G_A_Pain	Generalized abdominal pain (yes/no)
Pain_Migration	Abdominal pain that shifts from one region of the abdomen to another (yes/no)
Pain_Radiation	Pain that spreads from one part of the body to another (yes/no)
Lack_of_appetite	Lack of appetite (yes/no)
Dysurea	Dysuria: pain and/or a burning sensation during urination (yes/no)
RLQ_Tenderness	Pain or discomfort when the right lower quadrant (RLQ) of the abdomen is touched or pressed (yes/no)
RLQ_Rebound	RLQ rebound tenderness: pain or discomfort when pressure on the affected area is released (yes/no)
Guarding	Voluntary guarding: a conscious contraction of the abdominal wall in anticipation of an examination that will cause pain (yes/no)
RLQ_Mass	Palpable mass in the right lower quadrant of the abdomen (yes/no)
L_RLQ_Pain	Pain localized to the right lower quadrant (yes/no)
P_O_LOS	Postoperative length of stay (days)
Total_LOS	Total length of stay (days)
Symptoms_Days	Number of days since the symptoms appeared (days)
WBC_Count	White blood cell count (numeric)
Neutrophils	Neutrophils, a type of white blood cell (numeric)
Complicated	Whether the appendicitis is complicated or non-complicated (yes/no)

Table 2. Statistical analysis of numerical features.

Attributes	Count	Mean	Std	Min	25%	50%	75%	Max	Missing Values
Age	411	27.86618	10.12575	8	21	26	34	77	0
P_O_LOS	411	2.16545	2.508886	0	1	1	2	28	0
Total_LOS	411	2.552311	2.619638	0	1	2	3	28	0
Symptoms_Days	357	1.563025	1.005368	1	1	1	2	7	54
WBC_Count	411	12.57275	4.02215	3.8	9.9	12.3	14.9	27.1	0
Neutrophils	405	77.20049383	42.74949192	0.8	68.4	78.1	84	895	6

Table 3. Count and missing values of nominal features.

Attributes	Count	Category Counts	Missing Values
Sex	411	Male (146), female (265)	0
Comorbidities	411	Yes (52), no (359)	0
Readmission	411	Yes (12), no (399)	0
Diarrhea	382	Yes (351), no (31)	29
Nausea	383	Yes (198), no (185)	28
Vomiting	385	Yes (193), no (192)	26
E_P_Pain	381	Yes (164), no (191)	30
G_A_Pain	381	Yes (100), no (281)	30
Pain_Migration	377	Yes (131), no (246)	34
Pain_Radiation	377	Yes (61), no (316)	34
L_RLQ_Pain	378	Yes (289), no (89)	33
Dysurea	383	Yes (36), no (347)	28
RLQ_Tenderness	357	Yes (292), no (65)	54
Lack_of_appetite	383	Yes (112), no (271)	28
RLQ_Rebound	356	Yes (234), no (122)	55
Guarding	355	Yes (64), no (291)	56
RLQ_Mass	353	Yes (35), no (319)	58
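
The counts in Table 3 can be cross-checked with simple arithmetic: for each attribute, the non-missing count plus the missing values should equal the 411 patients, and the two category counts should sum to the non-missing count. The sketch below is only an illustrative check on the figures as they appear in this table; the names `ROWS` and `check_rows` are hypothetical and not part of the study's code. It does not say which of two disagreeing numbers is the correct one.

```python
# Each row: (attribute, non-missing count, first category, second category, missing),
# copied from Table 3 as printed above.
ROWS = [
    ("Sex", 411, 146, 265, 0),
    ("Comorbidities", 411, 52, 359, 0),
    ("Readmission", 411, 12, 399, 0),
    ("Diarrhea", 382, 351, 31, 29),
    ("Nausea", 383, 198, 185, 28),
    ("Vomiting", 385, 193, 192, 26),
    ("E_P_Pain", 381, 164, 191, 30),
    ("G_A_Pain", 381, 100, 281, 30),
    ("Pain_Migration", 377, 131, 246, 34),
    ("Pain_Radiation", 377, 61, 316, 34),
    ("L_RLQ_Pain", 378, 289, 89, 33),
    ("Dysurea", 383, 36, 347, 28),
    ("RLQ_Tenderness", 357, 292, 65, 54),
    ("Lack_of_appetite", 383, 112, 271, 28),
    ("RLQ_Rebound", 356, 234, 122, 55),
    ("Guarding", 355, 64, 291, 56),
    ("RLQ_Mass", 353, 35, 319, 58),
]
TOTAL_PATIENTS = 411  # sample size reported in Table 2


def check_rows(rows, total=TOTAL_PATIENTS):
    """Return (attribute, message) pairs for rows whose numbers disagree."""
    problems = []
    for name, count, first, second, missing in rows:
        if count + missing != total:
            problems.append((name, f"count + missing = {count + missing}, expected {total}"))
        if first + second != count:
            problems.append((name, f"categories sum to {first + second}, count is {count}"))
    return problems


problems = check_rows(ROWS)
for name, message in problems:
    print(f"{name}: {message}")
```

With the values as transcribed here, every row satisfies count + missing = 411, while the category counts for E_P_Pain and RLQ_Mass do not sum to their reported counts.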

Table 4. Classifier’s best hyperparameters in each experiment.

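Experiment	Classifier	Hyperparameter	Value	Training Accuracy
Experiment 1 (before SMOTE–Tomek)	KNN	Metric	Manhattan	86.75%
Experiment 1 (before SMOTE–Tomek)	KNN	N_neighbors	3	86.75%
Experiment 1 (before SMOTE–Tomek)	DT	Criterion	Entropy	86.75%
Experiment 1 (before SMOTE–Tomek)	DT	Splitter	Best	86.75%
Experiment 1 (before SMOTE–Tomek)	DT	Max_depth	2	86.75%
Experiment 2 (after SMOTE–Tomek)	KNN	Metric	Manhattan	91.40%
Experiment 2 (after SMOTE–Tomek)	KNN	N_neighbors	3	91.40%
Experiment 2 (after SMOTE–Tomek)	DT	Criterion	Entropy	100%
Experiment 2 (after SMOTE–Tomek)	DT	Splitter	Best	100%
Experiment 2 (after SMOTE–Tomek)	DT	Max_depth	14	100%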
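
To make the KNN settings in Table 4 concrete, the following is a minimal, dependency-free sketch of a nearest-neighbor vote with the Manhattan metric and three neighbors. It is an illustration of the technique only, not the study's implementation; the function names and the two-feature toy data (loosely styled after WBC_Count and Symptoms_Days) are hypothetical.

```python
from collections import Counter


def manhattan(a, b):
    """Manhattan (L1) distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(train_X, train_y, query, k=3):
    """Predict the majority label among the k training points nearest to query."""
    nearest = sorted(range(len(train_X)), key=lambda i: manhattan(train_X[i], query))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: 0 = non-complicated, 1 = complicated.
train_X = [(9.5, 1), (10.2, 1), (11.0, 2), (16.8, 3), (18.1, 4), (19.4, 3)]
train_y = [0, 0, 0, 1, 1, 1]

pred_low = knn_predict(train_X, train_y, (10.0, 1))
pred_high = knn_predict(train_X, train_y, (17.5, 3))
print(pred_low, pred_high)
```

An odd number of neighbors such as 3 avoids tied votes in a two-class problem, which is one practical reason small odd values are commonly tried.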

Table 5. Optimal hyperparameters for bagging models.

Bagging Model	Hyperparameters	Values	Training Accuracy
KNN bagging	n_estimators	150	95.70%
	max_features	0.5
	max_samples	0.9
DT bagging	n_estimators	150	96.15%
	max_features	0.5
	max_samples	0.6

Table 6. Optimal hyperparameters for stacking.

Base Estimators	Final Estimator	Hyperparameter	Value	Training Accuracy
KNN, KNN bagging, DT, DT bagging	KNN bagging	final_estimator__n_estimators	150	97.51%
KNN, KNN bagging, DT, DT bagging	KNN bagging	final_estimator__max_features	0.5	97.51%
KNN, KNN bagging, DT, DT bagging	KNN bagging	final_estimator__max_samples	0.6	97.51%
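
The hyperparameter names in Tables 4 to 6 follow scikit-learn conventions, so one plausible way to assemble the stacking configuration is sketched below. This is an assumption-laden illustration, not the authors' code: the function name `build_stacking_model`, the `random_state` values, the default cross-validation, and the choice to reuse the Experiment 2 KNN and DT settings inside the bagging ensembles are all hypothetical.

```python
from sklearn.ensemble import BaggingClassifier, StackingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def make_knn():
    # KNN settings from Table 4 (Manhattan metric, 3 neighbors).
    return KNeighborsClassifier(n_neighbors=3, metric="manhattan")


def make_dt(random_state):
    # DT settings from Table 4 after SMOTE-Tomek (entropy, best splitter, depth 14).
    return DecisionTreeClassifier(
        criterion="entropy", splitter="best", max_depth=14, random_state=random_state
    )


def build_stacking_model(random_state=0):
    """Assemble a stacking classifier from the values in Tables 4-6 (sketch)."""
    # Bagging settings from Table 5; the base learner is passed positionally.
    knn_bag = BaggingClassifier(
        make_knn(), n_estimators=150, max_features=0.5, max_samples=0.9,
        random_state=random_state,
    )
    dt_bag = BaggingClassifier(
        make_dt(random_state), n_estimators=150, max_features=0.5, max_samples=0.6,
        random_state=random_state,
    )
    # Final estimator from Table 6: KNN bagging with its own settings.
    final = BaggingClassifier(
        make_knn(), n_estimators=150, max_features=0.5, max_samples=0.6,
        random_state=random_state,
    )
    return StackingClassifier(
        estimators=[
            ("knn", make_knn()),
            ("knn_bag", knn_bag),
            ("dt", make_dt(random_state)),
            ("dt_bag", dt_bag),
        ],
        final_estimator=final,
    )


model = build_stacking_model()
print(model.get_params()["final_estimator__n_estimators"])
```

The double-underscore keys in Table 6, such as `final_estimator__n_estimators`, correspond to the nested parameter names that `get_params()` exposes, which is how a grid search would address the final estimator's settings.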

Table 7. Training accuracy and testing performance of each model.

Experiment	Classifier	Training Accuracy	Testing Accuracy	Testing Recall	Testing Precision	Testing F1 Score
Experiment 1 (without sampling)	KNN	86.75%	75.00%	13.79%	40.00%	20.51%
Experiment 1 (without sampling)	DT	86.75%	83.06%	41.37%	75.00%	53.33%
Experiment 2 (with upsampling)	KNN	91.40%	87.36%	91.20%	83.83%	87.36%
Experiment 2 (with upsampling)	DT	100%	84.74%	84.61%	83.69%	84.15%
Experiment 3 (bagging ensemble with upsampling)	KNN bagging	95.70%	92.10%	91.20%	92.22%	91.71%
Experiment 3 (bagging ensemble with upsampling)	DT bagging	96.15%	89.47%	83.51%	93.82%	88.37%
Experiment 4 (stacking with upsampling)	Stacking	97.51%	92.63%	89.01%	95.29%	92.04%
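
The F1 scores in Table 7 can be reproduced from the reported precision and recall using the harmonic mean, F1 = 2PR / (P + R). The short sketch below recomputes each value as a sanity check; the names `f1_from`, `REPORTED`, and the 0.01 percentage-point tolerance are illustrative assumptions rather than anything taken from the study.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (model, testing precision %, testing recall %, reported testing F1 %) from Table 7.
REPORTED = [
    ("Exp 1 KNN", 40.00, 13.79, 20.51),
    ("Exp 1 DT", 75.00, 41.37, 53.33),
    ("Exp 2 KNN", 83.83, 91.20, 87.36),
    ("Exp 2 DT", 83.69, 84.61, 84.15),
    ("Exp 3 KNN bagging", 92.22, 91.20, 91.71),
    ("Exp 3 DT bagging", 93.82, 83.51, 88.37),
    ("Exp 4 Stacking", 95.29, 89.01, 92.04),
]

TOLERANCE = 0.01  # percentage points; allows for rounding in the table

mismatches = []
for name, precision, recall, reported_f1 in REPORTED:
    computed = f1_from(precision, recall)
    if abs(computed - reported_f1) > TOLERANCE:
        mismatches.append((name, round(computed, 2), reported_f1))
    print(f"{name}: computed {computed:.2f}, reported {reported_f1:.2f}")
```

The low F1 of KNN in Experiment 1 follows directly from its 13.79% recall: because F1 is a harmonic mean, it stays close to the smaller of the two inputs even when precision is moderate.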

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
