
Received 27 April 2024, accepted 12 May 2024, date of publication 17 May 2024, date of current version 28 May 2024.

Digital Object Identifier 10.1109/ACCESS.2024.3402446

XAI-IoT: An Explainable AI Framework for


Enhancing Anomaly Detection in
IoT Systems
ANNA NAMRITA GUMMADI, JERRY C. NAPIER, AND MUSTAFA ABDALLAH, (Member, IEEE)
Computer and Information Technology Department, Purdue School of Engineering and Technology, Indiana University-Purdue University Indianapolis (IUPUI),
Indianapolis, IN 46202, USA
Corresponding author: Mustafa Abdallah ([email protected])
This work was supported in part by the Lilly Endowment through AnalytixIN and the Wabash Heartland Innovation Network (WHIN),
in part by the Enhanced Mentoring Program with Opportunities for Ways to Excel in Research (EMPOWER), and in part by the First Year
Research Immersion Program (1RIP) grants from the Office of the Vice Chancellor for Research at Indiana University-Purdue University
Indianapolis.

ABSTRACT The exponential growth of Internet of Things (IoT) systems inspires new research directions
on developing artificial intelligence (AI) techniques for detecting anomalies in these IoT systems. One
important goal in this context is to accurately detect and anticipate anomalies (or failures) in IoT devices and
identify main characteristics for such anomalies to reduce maintenance cost and minimize downtime. In this
paper, we propose an explainable AI (XAI) framework for enhancing anomaly detection in IoT systems. Our
framework has two main components. First, we propose AI-based anomaly detection of IoT systems where
we adapt two classes of AI methods (single AI methods, and ensemble methods) for anomaly detection
in smart IoT systems. Such anomaly detection aims at detecting anomaly data (from deployed sensors or
network traffic between IoT devices). Second, we conduct feature importance analysis to identify the main
features that can help AI models identify anomalies in IoT systems. For this feature analysis, we use seven
different XAI methods for extracting important features for different AI methods and different attack types.
We test our XAI framework for anomaly detection through two real-world IoT datasets. The first dataset is
collected from IoT-based manufacturing sensors and the second dataset is collected from IoT botnet attacks.
For the IoT-based manufacturing dataset, we detect the level of defect for data from IoT sensors. For the
IoT botnet attack dataset, we detect different attack classes from different kinds of botnet attacks on the IoT
network. For both datasets, we provide extensive feature importance analysis using different XAI methods
for our different AI models to extract the top features. We release our code for the community to use
for anomaly detection and feature analysis in IoT systems and to build on with new datasets and models.
Taken together, we show that accurate anomaly detection can be achieved along with understanding top
features that identify anomalies, paving the way for enhancing anomaly detection in IoT systems.

INDEX TERMS Internet of Things, anomaly detection, explainable AI, SHAP, LIME, Mirai, IoT security,
black-box AI, MEMS, and N-BaIoT.

The associate editor coordinating the review of this manuscript and approving it for publication was Kashif Sharif.

2024 The Authors. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/ (IEEE Access, Volume 12, 2024).

I. INTRODUCTION
The Internet of Things (IoT) represents a network of interconnected objects with internet capabilities, equipped with embedded sensors for data collection and exchange [1]. Any standalone device connected to the internet, capable of remote monitoring and control, is considered an IoT device. Therefore, IoT systems are becoming more complex and are rapidly deployed in different applications [2]. IoT systems pose certain salient technical challenges for the use of AI-based models for anomaly detection. Firstly, in IoT systems, various types of sensors concurrently generate data

related to the same (or overlapping) events, each possessing distinct capabilities and costs. Secondly, the characteristics of sensor data undergo changes based on the sensors' operating points, such as the Revolutions Per Minute (RPM) of the motor in IoT-inspired smart manufacturing systems [3], which is a measure of the number of complete revolutions the motor makes in one minute. Consequently, both the inference and anomaly detection processes necessitate calibration according to the specific operating point. To address these requirements, case studies on anomaly detection deployments in such systems become imperative. The importance of these deployments and the subsequent analyses has been emphasized in prior works, including those focused on digital agriculture [4], smart manufacturing [5], IoT-based health monitoring systems [6], and other IoT systems [7], [8].

While there is a considerable body of literature on anomaly detection within various IoT-based systems that focused on traditional AI models for anomaly detection and failure detection in different IoT systems [9], [10], [11], [12], [13], [14], there is a notable gap in the documentation of the use of explainable AI (XAI) methods specifically for anomaly detection in smart IoT systems [15], [16]. In particular, these previous AI-based studies focused more on the classification accuracy of various AI algorithms for detecting anomalies in IoT systems, without providing insights about their behavior and reasoning about main features. Moreover, they did not explore model-specific feature importance, combine global and local explanations, or explore a diverse set of AI models. These limitations motivate the pressing need to leverage the relatively recent XAI field for enhancing anomaly detection in IoT systems [17]. This usage of XAI can also help in the ultimate goal of building predictive maintenance frameworks for these IoT systems.

In this paper, we study the anomaly detection problem of IoT systems by detecting failures and anomalies that would have an impact on the reliability and safety of these systems. In such systems, the data are collected from different sensors via intermediate data collection points and finally aggregated to a server to further store, process, and perform useful data analytics on the sensor readings [18], [19]. We propose an XAI framework (that we call XAI-IoT) for enhancing anomaly detection in IoT systems. Our XAI framework includes loading the IoT dataset, pre-processing the data, training black-box AI models, generating global and local feature importance graphs, and extracting top features based on different XAI methods. The low-level structure of our proposed framework is shown in Figure 1.

In particular, our framework has two main components: (i) AI-based anomaly detection of the IoT system under consideration, and (ii) feature importance analysis of the main features that help AI models identify anomalies in these IoT systems. For the first component, we consider two classes of anomaly detection models, which are single AI models (including decision tree (DT) [20], deep neural network (DNN) [21], AdaBoost (ADA) [22], support vector machine (SVM) [23], and multi-layer perceptron (MLP) [24]) and ensemble methods (including random forest (RF) [25], bagging [26], blending [27], stacking [28], and voting [29]). These models are used to predict the anomalies from data collected in these IoT systems.

For the second component, we perform feature importance analysis using different XAI methods (SHAP [30], LOCO [31], CEM [32], ALE [33], PFI [34], ProfWeight [35], and LIME [36]). These XAI methods have different methodologies for generating feature importance.

A. SHAP [30]
This popular XAI method facilitates the generation of feature explanations for AI models. It uses the game-theoretic concept of Shapley values to explain an AI model. It generates Shapley values for each feature by considering the prediction using all the other features except the one under evaluation. This way, SHAP can quantify the contribution of each feature to the prediction.

B. LEAVE-ONE-COVARIATE-OUT (LOCO) [31]
This XAI strategy is used to estimate the importance of features in an AI model. When a feature is removed from the model, the LOCO technique measures the change in prediction error to assess how important the feature is. In particular, the model is retrained several times, omitting a distinct feature each time, and the effect on the model's performance is tracked. Thus, a feature is meaningful for the model's predictions if omitting it results in a notable change in the model's accuracy.

C. CONTRASTIVE EXPLANATIONS METHOD (CEM) [32]
This XAI method shows which features may be altered to obtain a different prediction output, which sheds light on how an AI method makes decisions. This method can assist in understanding not just which features are significant but also how these features could be changed to affect the predictions.

D. ACCUMULATED LOCAL EFFECTS (ALE) [33]
ALE plots are used to illustrate how features, taking into consideration their interactions with other features, impact the AI model's average prediction. They provide insight into the correlation between input features and the predicted output.

E. LIME [36]
This widely used XAI tool enables the creation of a surrogate model approximating the original AI model's behavior when assessed with local samples. In this study, we utilized LIME to generate local feature importance.

F. PERMUTATION FEATURE IMPORTANCE (PFI) [34]
This XAI method is another technique used to assess the importance of features. In PFI, a feature's importance is determined by permuting its values and tracking how the model's performance indicators change as a result. Each feature goes through this procedure several times, and the average performance change indicates how important the feature is.
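To make the PFI procedure concrete, the sketch below (our own illustration on synthetic data, not the paper's released code; the function name `permutation_feature_importance` and all hyperparameters are placeholder assumptions) permutes one feature column at a time on a held-out split and records the mean drop in accuracy:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def permutation_feature_importance(model, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy when each feature column is shuffled."""
    rng = np.random.default_rng(seed)
    base = model.score(X, y)            # accuracy on the intact data
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])   # break the feature/label link
            drops[j] += base - model.score(Xp, y)
    return drops / n_repeats

# Synthetic stand-in for an IoT dataset (features + anomaly labels)
X, y = make_classification(n_samples=600, n_features=6, n_informative=3,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
imp = permutation_feature_importance(model, X_te, y_te)
print(np.argsort(imp)[::-1])  # feature indices, most important first
```

scikit-learn ships an equivalent utility, `sklearn.inspection.permutation_importance`, which is typically preferred in practice.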


G. PROFILED WEIGHTING (PROFWEIGHT) [35]
This XAI approach assesses the significance of a feature by considering many criteria, such as the feature's weight within the model and its relationship with other features. This approach offers an indicator of the relative contribution of each attribute to the model's predictions.

We apply our framework to study two real-world IoT datasets with different characteristics. The first dataset (which we call the MEMS dataset) is a smart manufacturing dataset collected from deployed manufacturing sensors to detect anomalous data readings. The second dataset (which we call N-BaIoT) is an open-source IoT dataset whose goal is to detect IoT botnet attacks [37]. The N-BaIoT data was gathered from nine commercial IoT devices that were actually infected by two well-known botnets, Mirai [38] and Gafgyt [39]. In our evaluation, we first analyze the performance of our different AI models in detecting anomalies in these two datasets, where we measure different standard performance metrics (including accuracy, precision, recall, and F-1). We observe that the best anomaly detection model is dataset-dependent, with ensemble methods giving better performance in the anomaly detection task.

One challenge in the MEMS anomaly detection problem is the prediction from AI models using sparse data, which is often the case because of limitations of the sensors or the cost of collecting data. The choice of sensors with lower sampling rates, despite potential drawbacks, is driven by significant cost differences. For instance, MEMS sensors are much more economical ($8) compared to advanced sensors like piezoelectric sensors ($1305) [40], [41]. To tackle this challenge, we leveraged all available data collected under various operating conditions, specifically different RPMs. On the other hand, the challenge in anomaly detection in the N-BaIoT dataset is detecting the different attack classes from different kinds of botnet attacks on the IoT network. To tackle this challenge, we evaluated the performance of different AI models under different combinations of features to identify which setup is more efficient in detecting anomalies in the traffic data collected from this IoT network.

We have also provided extensive feature importance analysis using different XAI methods for our different AI models. For the MEMS dataset, we also provide feature importance for the different RPMs considered in collecting MEMS data for more in-depth understanding. In our assessment of feature importance, we validate the notion that vibration data along certain axes may not convey distinct information in normal and failure scenarios. The circular motion around the motor's center occurs along the X and Z axes, resulting in vibration values that vary with the motor's condition. Conversely, the Y-axis, representing the direction of the shaft, exhibits smaller vibrations. To investigate this, we compare the model's performance with features derived from all three data axes in the default setup against a proposed approach where features are extracted exclusively from the X-axis and Z-axis data vectors. For the feature importance analysis on the N-BaIoT dataset, our results show that the features that give precise statistics summarizing the recent traffic from the host to the destination in this IoT network can help in identifying the botnet attack class on the IoT network.

We emphasize that our paper focuses on offering a thorough analysis and comparative viewpoint of different AI models and Explainable AI (XAI) techniques. It is critical to comprehend the functionality and interpretability of these models in an era where artificial intelligence (AI) systems are becoming more and more complicated. By means of thorough experimentation and comparative analysis, our objective is to highlight the performance of various AI methodologies and XAI approaches. Our research aims to provide practitioners and researchers in the IoT security domain with essential insights for responsible AI deployment and decision-making by examining their effectiveness across a variety of datasets and use scenarios in two IoT systems.

Summary of Contributions: Based on our analysis and evaluation, we have the following contributions:
1) XAI Framework: Our contribution consists in the creation of an XAI framework specifically designed for anomaly identification in IoT systems. We offer a complete toolset for performing feature importance analysis by combining seven distinct XAI techniques, including SHAP, LIME, CEM, ALE, PFI, ProfWeight, and LOCO. This approach provides feature importance on both global and local scopes, which is important for comprehending how complicated AI models for anomaly detection in IoT applications make decisions. In our feature importance analysis, we extract the model-specific features (i.e., top important features for each AI model) and anomaly-specific features (i.e., top features for each attack type) for the different classes of AI models that we have and different types of anomalies.
2) Anomaly Detection: We expand the use of anomaly detection to IoT systems by modifying two categories of models: individual AI techniques (including DT, DNN, SVM, ADA, and MLP) and ensemble techniques (including bagging, blending, stacking, and voting). These models are made to detect abnormal data from network traffic or deployed sensors in an efficient manner, protecting IoT infrastructure security.
3) Testing: We rigorously evaluate our system on two real-world datasets from IoT botnet attacks and IoT-based manufacturing sensors. We demonstrate our approach's efficacy in identifying anomalies in various IoT scenarios through comprehensive evaluation, underscoring its potential for practical use.
4) Defect Type Classification: We further contribute to the field by offering a way to classify the degree of fault


in IoT-based smart manufacturing data, in addition to anomaly detection. The ability to classify fault degree improves the IoT systems' diagnostic capabilities, allowing for proactive maintenance and quality control procedures.
5) Benchmark Data and Codes: In order to support future work in this area, we make our database corpus available, which consists of two different datasets and all developed code scripts. We hope that releasing these tools to the community will help to accelerate progress in anomaly detection and classification for IoT systems by facilitating benchmarking, replication, and extension of our work. The code is available at https://github.com/agummadi1/XAI_for_IoT_Systems.

Paper Organization: The rest of the paper is organized as follows. We first present the related works in Section II. We then explain our framework, including anomaly detection and feature importance models, in Section III. In Section IV, we present our evaluation results of anomaly detection models and XAI feature importance on our two IoT datasets. We present the main limitations of our work and related discussions in Section VII. We conclude the paper in Section VIII.

II. RELATED WORK
A. ANOMALY AND FAILURE DETECTION MODELS IN IoT
Anomaly detection methods have been used for identifying energy consumption anomalies in IoT systems via monitoring and identifying abnormal energy consumption patterns in IoT-connected devices to detect malfunctioning or compromised devices [42]. Furthermore, anomaly detection has been explored for health monitoring IoT systems through detecting anomalies in health-related data collected by IoT devices such as wearables to identify potential health issues or irregularities [6]. Machine learning methods have been used for detecting anomalies in smart agriculture applications, with the focus of detecting anomalies in environmental data collected by IoT devices in agriculture to identify potential crop diseases, irrigation issues, or pest infestations [43]. However, these works did not explore XAI feature importance and analysis, and they considered different application domains from those in our current work.

Various studies have explored the detection of failures in IoT systems using either single or multiple sensors [44], [45], [46]. The main application in this context is monitoring the behavior of machinery and equipment in industrial settings to detect anomalies, potential faults, or performance deviations. Notably, a recent work [44] proposed a kernel principal component analysis-based anomaly detection system to identify cutting tool failures in a machining process in a smart manufacturing system. Although this study utilized multi-sensor signals to assess the cutting tool's condition, it did not address transfer learning between different sensor types. Another recent study [46] introduced a fault detection monitoring system for detecting various failures in a DC motor, including gear defects, misalignment, and looseness. However, this study relied on a single sensor (accelerometer) to collect machine condition data, and it employed several convolutional neural network architectures for targeted failure detection without considering different rotational speeds and sensors. Consequently, these techniques would need reapplication for each new sensor type. In contrast, our approach involves learning main features using diverse XAI methods and sensor types, and we conduct a comparative analysis of single and ensemble learning-based models for our anomaly detection task.

B. EXPLAINABLE AI AND FEATURE IMPORTANCE IN IoT
Many studies have been conducted in the field of Explainable Artificial Intelligence (XAI) with the goal of improving machine learning models' interpretability and transparency for different applications [47]. Numerous strategies have been investigated, from decision trees and rule-based systems to more complex methods like SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations). By demystifying complex models' decision-making processes, these techniques aim to increase end users' understanding and confidence.

Feature Selection in IoT: It becomes especially important to integrate feature importance analysis when it comes to IoT systems [48], [49], [50]. In particular, the work [48] provides an extensive overview of feature selection methods in the context of IoT-based healthcare applications. The work [49] proposes a feature selection method for intrusion detection systems (IDSs) for IoT systems, but it focuses on only two feature selection methods, which are Information Gain (IG) and Gain Ratio (GR). It also focused mainly on the detection of denial of service (DoS) attacks. The work [50] proposes a feature selection algorithm that is based on the concepts of the Cellular Automata (CA) engine and Tabu Search (TS)-based aspiration criteria, with a main focus on the Random Forest (RF) ensemble learning classifier to evaluate the fitness of the selected features. That work also focused on the TON_IoT dataset (created by UNSW in Australia).

The challenges in IoT systems include the large volumes of data that are produced by IoT devices. Thus, it is essential to comprehend the fundamental characteristics that influence model predictions in order to maximize system efficiency [51], spot abnormalities, and guarantee the accuracy of decision results. In addition to improving model interpretability for anomaly detection for IoT systems, feature importance analysis makes it easier to identify important variables [52], [53], which helps with the development and implementation of reliable and effective IoT applications. In order to enable accountable and trustworthy AI deployment in a variety of applications, researchers are working to build approaches that strike a compromise between model


precision and transparency as the convergence of XAI and IoT expands.

FIGURE 1. The proposed XAI-IoT framework. It has the following steps: loading the IoT database, preprocessing data, black-box AI training, black-box AI evaluation, XAI feature importance (including global and local scopes), and retraining AI models.

C. DATASETS AND BENCHMARKS FOR ANOMALY DETECTION IN IoT SYSTEMS
Several papers have concentrated on providing datasets for anomaly detection in IoT systems, particularly emphasizing the unsupervised anomaly detection process [54], [55]. For instance, the study by Koizumi et al. [54] presents benchmark results from the DCASE 2020 Challenge Task, focusing on unsupervised detection of anomalous sounds for machine condition monitoring in smart manufacturing. This work specifically aims to determine whether the sound emitted from a target machine is normal or anomalous, addressing the challenge of anomalous sound detection (ASD). Additionally, another work by Hsieh et al. [55] introduces an unsupervised real-time anomaly detection algorithm tailored for smart manufacturing. In contrast, the study by Fathy et al. [56] delves into learning techniques for failure prediction using imbalanced smart manufacturing datasets. However, none of these works addresses the crucial aspect of feature importance via different XAI methods, which is a focal point in our investigation. The work [57] proposed an open-source IoT framework for sensor network management in smart cities, which is a different application domain from the two IoT domains considered in our current work.

III. MATERIALS AND METHODS
We now describe our proposed framework. This framework mainly consists of algorithms for anomaly detection, defect type (or attack) classification, and feature importance. We now explain the low-level components of our XAI pipeline. The different components of our framework (shown in Figure 1) are explained below.

Loading IoT Database: The first component in our pipeline is loading the IoT data from the database as a starting point. In our work, we use two IoT datasets, which are the MEMS [43] and N-BaIoT [37] datasets.

Pre-Processing: The second component in our framework is preprocessing, in which we prepare the dataset for the anomaly classification task. In particular, such preprocessing is essential for building AI models for the anomaly detection task. We followed the prior works [43] for extracting the basic set of features for the MEMS and N-BaIoT datasets, respectively. We also emphasize that we take advantage of our XAI-based feature selection to identify top features that affect the decisions of different AI models.

Feature Normalization: In order to prevent variations in scales among different features, we apply a standard feature normalization step (min-max feature scaling) for all columns in our datasets (where we apply feature scaling for each column, one column at a time, to address inconsistencies across different features' scales). This process ensures that all features are brought to a consistent numerical scale, thereby avoiding any discrepancies in magnitude across the dataset. Such a process has been applied in several prior works [58], [59], [60], [61].

Black-Box AI Models: Once the preprocessing of the database is complete, we train the AI models, where we perform splitting of the data with a split of 70% for training while leaving the unseen 30% for testing purposes. For this part, we have built ten popular AI classification models. They can be classified into the following two classes:

• Single AI Models: In this category, we included decision tree (DT) [20], deep neural network (DNN) [21], AdaBoost (ADA) [22], support vector machine (SVM) [23], and multi-layer perceptron (MLP) [24].
• Ensemble Models: We selected five popular ensemble methods, where we included RF [25], bagging [26], blending [27], stacking [28], and voting [29].

These models are used to predict the anomalies from sensors' readings in IoT systems, along with the attack (or defect) type. These models were also used for predictive maintenance for the MEMS IoT dataset, i.e., predicting the level of defect with the MEMS sensor and whether the machine is in normal operation, near-failure (and needs maintenance), or failure (and needs replacement). On the other hand, the models were used to detect anomaly classes in the N-BaIoT dataset, i.e., predicting whether the traffic is benign or is related to one of the 10 attack classes that represent different attack tactics employed by the Gafgyt [39] and Mirai [38] botnets to infect IoT devices.

For each model, we generated multiple variants by changing the values of hyperparameters. We then chose the model variant with the best performance for each dataset. We describe the hyperparameters and the libraries used for all forecasting models in Appendix B.

Black-Box AI Evaluation: The next step in our framework is to study the performance of each model on unseen test data. For such performance analysis, we first create the confusion matrix for each model and use it to derive the following different metrics for each AI model: accuracy (Acc), precision (Prec), recall (Rec), F1-score (F1), Matthews correlation coefficient (Mcc) [62], balanced accuracy (Bacc), and the area under the ROC curve (AucRoc). We selected these metrics for two primary reasons. Firstly, they are commonly used in numerous comparable works that emphasize IoT systems, such as [37] and [43] for our two datasets. Secondly, our choice enables an examination of the impact of XAI-based feature selection on the performance of AI models in IoT datasets. This allows for direct comparisons with previous studies conducted on the two IoT datasets under consideration.

XAI Global Explanations: The aforementioned AI models operate as black-box models. Consequently, it becomes imperative to provide explanations for these models, elucidating their accompanied features (primary IoT sensors or IoT network traffic) and labels (attack types). In the subsequent phase of our framework, we incorporate Explainable AI (XAI). In the initial segment of this step, we generate global importance values for each feature, creating various graphs to analyze the influence of each feature on the AI model's decisions. This analysis aids in forming expectations about the model's behavior. For global explanations, we employ various XAI methods, including SHAP, LOCO, CEM, PFI, and ProfWeight.

XAI Local Explanations: Our framework consists of two local XAI blocks. First, we use the recent well-known local interpretable model-agnostic explanations (LIME) [36] for giving insights into what happens inside an AI algorithm by capturing feature interactions. We first generate a model that approximates the original model locally (LIME surrogate model) and then generate the LIME local explanations. Second, we leverage ALE [33] via generating ALE local graphs (named ALE plots).

Feature Explanation: The final component in our framework is extracting detailed metrics from the global explanations. In particular, we extract the model-specific features (i.e., top important features for each AI model) and anomaly-specific features (i.e., top features for each anomaly (or attack type)) for the different classes of AI models that we have and different types of anomalies. This additional feature analysis can help in providing human-understandable explanations of the decision-making of the AI model and the accompanied top features.

Feature Explanation Importance: One of the important outcomes of our framework is generating the list of top features for each IoT dataset. This can help the security analyst managing the IoT network in detecting attacks and anomalies (either from IoT sensor (MEMS) readings or network traffic between IoT devices (N-BaIoT)). In our current work, we focus on understanding in more depth the feature importance for each model for the two considered datasets (MEMS and N-BaIoT).

TABLE 1. Summary and statistics of the two IoT datasets used in this work, including the size of the dataset, the number of attack types (labels), and the number of intrusion features.

Summary and Statistics of the Datasets: Table 1 shows the number of samples for the two datasets and the distribution of samples per attack type. Note that the MEMS dataset contains only three classes (near failure, failure, and normal). On the contrary, N-BaIoT has ten attack classes (five Gafgyt botnet attacks and five Mirai botnet attacks).

Having introduced the background and the low-level details for the proposed framework, we next provide our main evaluation results for anomaly detection and feature importance tasks on our two IoT datasets.

VOLUME 12, 2024 71029


A. N. Gummadi et al.: XAI-IoT: An Explainable AI Framework for Enhancing Anomaly Detection

FIGURE 2. Motor testbed and sensors for smart manufacturing dataset. In (a), we show the testbed. In (b), we show piezoelectric and
MEMS sensors when mounted on motor testbed. In (c), we show the balancing disk to make different levels of imbalance.

• What are the main features that affect the performance of different AI models in our IoT datasets?
• What are the main features for each attack (anomaly) type in the two IoT datasets?
• How does XAI help in understanding the performance of different AI models for a single data instance?

A. DEPLOYMENT DETAILS AND DATASETS EXPLANATION
1) MEMS DATASET
a: MAIN GOAL OF USING THIS IoT DATASET
Anomalous data generally needs to be separated from machine failure, as abnormal patterns of data do not necessarily imply machine or process failure [5]. We perform anomaly detection using vibration data to identify anomalous events and then attempt to label/link these events with machine-failure information. In this way, we aim to identify abnormal data and correlate it with machine failure in IoT manufacturing sensors. To achieve this goal, we build anomaly detection models to detect anomalies and failures in the IoT sensors.

b: DATASET CONSTRUCTION AND FEATURES
To construct this dataset, an experiment was carried out on the motor testbed depicted in Figure 2a to gather machine condition data, specifically acceleration, under various health conditions. During the experiment, acceleration signals were acquired simultaneously from MEMS sensors (Figure 2b), with a sampling rate of 10 Hz for the X, Y, and Z axes. Different levels of machine health conditions were induced by affixing a mass to the balancing disk, as illustrated in Figure 2c; the resulting levels of mechanical imbalance were used to trigger failures. Failure conditions were categorized into one of three states: normal, near-failure, and failure.

c: OPERATIONAL SPEEDS
Acceleration data were captured at ten rotational speeds (100, 200, 300, 320, 340, 360, 380, 400, 500, and 600 RPM) for each condition while the motor was operational. Fifty samples were collected at 10-second intervals for each of the ten rotational speeds. The same dataset was utilized for both defect-type classification and feature importance tasks, as elaborated in Section V.

d: ANOMALY (DEFECT) CLASSES
The data contains different levels of defects (i.e., different labels indicating normal operation, near-failure, and failure for the MEMS dataset). These labels are used in our evaluation.

2) N-BaIoT DATASET
The goal of this dataset is to detect IoT botnet attacks [37]. This dataset is a useful resource for researching cybersecurity issues in the context of the Internet of Things (IoT). The data was gathered from nine commercial IoT devices that were actually infected by two well-known botnets, Mirai [38] and Gafgyt [39].
(i) Details of the Nine Devices: We first briefly describe each of the nine devices used for data collection.
Device 1 - Danmini Doorbell: A smart doorbell that integrates intercom and camera systems for maintaining home security and communications systems.
Device 2 - Ecobee Thermostat: An intelligent thermostat that learns the user's preferences and modifies the heating or cooling to maximize energy savings.
Device 3 - Ennio Doorbell: A doorbell system with video features, giving visual identification of guests for better home security.
Device 4 - Philips B120N10 Baby Monitor: A baby monitor with audio and video features so that parents can closely monitor their newborns.
Device 5 - Provision PT 737E Security Camera: A security camera with remote monitoring and motion detection capabilities that is intended for surveillance.
Device 6 - Provision PT 838 Security Camera: Another kind of security camera, with expanded functions for monitoring and surveillance purposes.
Device 7 - Samsung SNH 1011 N Webcam: A Samsung webcam utilized for video conferences or basic video recording.
Device 8 - SimpleHome XCS7 1002 WHT Security Camera: A security camera from SimpleHome, ideal for effortless home surveillance and monitoring.


Device 9 - SimpleHome XCS7 1003 WHT Security Camera: Another security camera for home surveillance with a few advanced features.
(ii) Main Features: Every data instance in the dataset is represented by a variety of features. These attributes are divided into multiple groups, detailed as follows:
(A) Stream Aggregation: These features summarize the recent traffic of the stream. This group comprises the following categories:
H: Statistics summarizing the recent traffic of the packet's host (IP).
HH: Statistics summarizing the recent traffic from the packet's host (IP) to the packet's destination host.
HpHp: Statistics summarizing the recent IP traffic from the packet's source host and port to its destination host and port.
HH-jit: Statistics summarizing the jitter of the traffic traveling from the packet's host (IP) to the packet's destination host.
(B) Time-frame (Lambda): This characteristic indicates how much of the stream's recent history is represented in the statistics. The time frames bear the designations L1, L3, L5, and so forth.
(C) Features Extracted from the Packet Stream Statistics: Among these are the following features:
Weight: The weight of the stream, i.e., the total number of items observed in recent history.
Mean: The statistical mean of the stream.
Std: The statistical standard deviation of the stream.
Radius: The square root of the variances of the two streams.
Magnitude: The square root of the means of the two streams.
Cov: An approximated covariance between two streams.
Pcc: An approximated Pearson correlation coefficient between two streams.
(iii) Classes: The dataset consists of the following 11 classes: benign traffic, defined as network activity without malicious intent, and 10 classes that represent different attack tactics employed by the Gafgyt and Mirai botnets to infect IoT devices. The classes are summarized as follows:
1. benign: There are no indications of botnet activity in this class, which reflects typical benign network traffic. It acts as the baseline for safe network operations.
2. gafgyt.combo: This class is equivalent to the "combo" attack of the Gafgyt botnet, which combines different attack techniques, such as brute-force login attempts and vulnerability exploitation, to compromise IoT devices.
3. gafgyt.junk: The "junk" attack from Gafgyt entails flooding a target device or network with excessive garbage data packets, which can impair operations and even result in a denial of service.
4. gafgyt.scan: Gafgyt uses the "scan" attack to search for IoT devices that are susceptible to penetration. The botnet then enumerates and probes these devices in an effort to locate and compromise them.
5. gafgyt.tcp: This class embodies the TCP-based attack of the Gafgyt botnet, which targets devices using TCP-based exploits and attacks.
6. gafgyt.udp: The User Datagram Protocol (UDP) is used in Gafgyt's "udp" attack, e.g., bombarding targets with UDP packets to stop them from operating.
7. mirai.ack: Mirai's "ack" attack uses the Acknowledgment (ACK) packet to take advantage of holes in IoT devices and enlist them in the Mirai botnet.
8. mirai.scan: By methodically scanning IP addresses and looking for vulnerabilities, Mirai's "scan" attack seeks to identify susceptible IoT devices.
9. mirai.syn: The Mirai "syn" attack leverages vulnerabilities in IoT devices to add them to the Mirai botnet by using the SYN packet, which is a component of the TCP handshake procedure.
10. mirai.udp: Based on the UDP protocol, Mirai's "udp" attack bombards targeted devices with UDP packets in an attempt to interfere with their ability to function properly.
11. mirai.udpplain: This class represents plain UDP attacks that aim to overload IoT devices with UDP traffic, causing service disruption.
Having explained the main features and classes of the two IoT datasets, we next explain the main experimental setup used for generating our evaluation results.

B. EXPERIMENTAL SETUP
The goal is to measure the performance of our anomaly detection models in detecting anomalies (and their types) for the two datasets considered in this work (MEMS and N-BaIoT). We show the performance of our models in terms of the accuracy of detecting the anomaly correctly (measured by precision, recall, and F1-score). For each proposed model, the training size was 70% of the total collected data, while the testing size was 30%. We emphasize that our anomaly detection problem is a multi-class classification problem, in which we try to predict the class of each sample in each IoT dataset.

1) AI MODELS
By pairing ten popular AI classification algorithms (decision tree (DT) [20], deep neural network (DNN) [21], random forest (RF) [25], AdaBoost (ADA) [22], support vector machine (SVM) [23], multi-layer perceptron (MLP) [24], bagging [26], blending [27], stacking [28], and voting [29]) with the different components of our proposed XAI framework, we evaluate black-box AI methods on our two IoT datasets. We emphasize that these ten AI algorithms can be categorized into two groups: (1) single AI models (DT, DNN, ADA, SVM, and MLP), and (2) ensemble models (RF, bagging, blending, stacking, and voting).
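As a concrete illustration of the setup above (a 70/30 split and multi-class anomaly labels), the following minimal sketch benchmarks a few of the listed classifiers with scikit-learn and reports the metrics used in our evaluation. The synthetic data and the model subset here are illustrative placeholders only, not the actual MEMS/N-BaIoT data or the full set of ten models.

```python
# Sketch of the multi-class benchmarking setup: 70/30 stratified split,
# a subset of the classifiers, and the evaluation metrics from Section IV.
# Synthetic three-feature data stands in for the real IoT datasets.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             f1_score, matthews_corrcoef,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=3, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30,
                                          stratify=y, random_state=0)

models = {
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "ADA": AdaBoostClassifier(random_state=0),
    "MLP": MLPClassifier(max_iter=500, random_state=0),
}
for name, model in models.items():
    y_pred = model.fit(X_tr, y_tr).predict(X_te)
    print(name,
          round(accuracy_score(y_te, y_pred), 3),
          round(precision_score(y_te, y_pred, average="macro"), 3),
          round(recall_score(y_te, y_pred, average="macro"), 3),
          round(f1_score(y_te, y_pred, average="macro"), 3),
          round(matthews_corrcoef(y_te, y_pred), 3),
          round(balanced_accuracy_score(y_te, y_pred), 3))
```

Macro averaging is used here because the problem is multi-class; per-class (weighted) averaging is an equally valid choice depending on class balance.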


2) CODING TOOLS
We built upon the Keras library [63], which is Python-based, for creating the variants of our models.

3) EXPLAINABLE AI (XAI) TOOLS
In addition, we employed the following XAI toolkits:

a: SHAP [30]
This toolkit facilitates the generation of feature explanations for our intrusion detection AI models. We utilized SHAP to produce both local and global explanations for our AI models.

b: LIME [36]
Another widely used XAI tool, LIME enables the creation of a surrogate model approximating the original AI model's behavior when assessed with local samples. In this study, we also utilized LIME to generate global explanations by aggregating its local explanations and averaging the importance of each feature.

c: OTHER XAI TOOLKITS
We also used the pandas, sklearn (permutation importance), and statistics libraries for the other XAI methods (LOCO, PFI, ProfWeight, ALE, and CEM).

4) COMPUTING RESOURCES
We conducted the anomaly detection experiments using a workstation equipped with an Intel i7 @ 2.60 GHz processor, 16 GB RAM, and 8 cores. The feature extraction experiments were carried out on the BigRed workstation at IUPUI, a high-performance computer (HPC) with four NVIDIA A100 GPUs, 64 GPU-accelerated nodes (each with 256 GB of memory), and a 64-core AMD EPYC 7713 processor (2.0 GHz, 225 W), achieving a peak performance of approximately 7 petaFLOPS. This supercomputer is specifically designed to support researchers in advanced AI and ML tasks [64].

5) PERFORMANCE METRICS
We first benchmark the ten AI models on each of the two datasets (described above in Section IV-A). We compare the performances of the ten models in terms of the typical performance metrics (Precision, Recall, and F1-score [65]). We then show the feature importance of the main features under the different XAI methods considered in this work.

V. MEMS RESULTS EVALUATION
In this section, we start the evaluation of the models' performances using a real dataset sourced from MEMS vibration sensors, our production-grade sensors. Through data analytics applied to the vibration sensor data, we discern one of the motor's three operational modes. The model's performance is presented in terms of defect-level detection accuracy, gauged by the classification accuracy of the AI prediction models on the test dataset. This accuracy is defined as the ratio of correctly classified samples to the total number of samples. We also explore various performance metrics, as detailed earlier in Section IV-A.

A. PERFORMANCE METRICS UNDER ALL FEATURES
Table 2 shows the performance metrics for each AI model on our MEMS dataset. We observe that DNN and MLP give the first and second best anomaly detection performances, respectively (i.e., the highest accuracy, precision, and recall). Furthermore, the stacking and voting ensemble methods give the third best performance. On the other hand, SVM gives the worst performance on MEMS under all features. This experiment suggests using neural network-based AI models (MLP or DNN) for classifying the data from the MEMS vibration-based IoT dataset. These neural network-based prediction models perform better than the traditional models because the deployments generate enough data for accurate training and because of the complex dependencies among the features of the MEMS dataset.

B. PERFORMANCE METRICS OF FEATURE COMBINATIONS
We next explore the effect of each of the three features (X, Y, and Z) by comparing the performance metrics of different AI models under different combinations of these features, where we eliminate one feature for each setup. Thus, we have the following setups: XY, XZ, and YZ, representing different feature combinations for rotation directions. Tables 3-5 show the main results for this experiment.
In Table 3, under the feature combination XY (excluding the most important feature 'z'), the models generally perform poorly. The accuracy, precision, recall, and F1-score are lower across all models compared to Table 2, which incorporates all features including 'z'. This decline suggests that excluding 'z' significantly impacts the models' ability to correctly identify anomalies, underlining the importance of feature z. Table 4 shows results for the feature combination YZ. The inclusion of 'z' leads to a notable improvement in performance metrics compared to Table 3. The Decision Tree, Random Forest, and DNN models all show increased accuracy, precision, recall, and F1-scores. This is consistent with the stated importance of feature 'z', as its presence alongside 'y' provides enough information for models to predict outcomes more accurately. However, they still have lower performance compared to having all features. Finally, in Table 5, the XZ combination is analyzed. Again, the metrics demonstrate better model performance than in Table 3, further supporting the significant role of 'z' in the model's prediction capability. All models experience an increase in accuracy and F1-score when 'z' is present, even when paired with 'x' instead of 'y'.
We summarize below the main findings of this experiment: (i) Accuracy: the combination XZ gives the highest accuracy, with the AdaBoost model at a value of 0.655. (ii) Precision: the combination XZ gives the highest precision with


TABLE 2. Anomaly detection results (the higher the better) for each AI model. DNN gives the best performance metrics.

TABLE 3. Performance metrics under combination XY.

TABLE 5. Performance metrics under combination XZ.

TABLE 4. Performance metrics under combination YZ.

FIGURE 3. SHAP global summary plot example for MEMS using the KNN AI model. It shows both the top features per anomaly class and the overall top feature for the MEMS dataset.
the AdaBoost model at a value of 0.664. (iii) Recall: the combination YZ gives the highest recall, with the Deep Neural Network model at a value of 0.648. (iv) F1-score: the combination XZ gives the highest F1-score, with the AdaBoost model at a value of 0.614.
In total, the combination XZ gives the best overall performance across all combinations. Across all tables, the consistency of 'z' in contributing to higher performance metrics suggests that it is indeed a critical feature for anomaly detection in the MEMS dataset.

C. OVERALL FEATURE IMPORTANCE (SHAP GLOBAL SUMMARY PLOT)
We next show the overall feature importance under SHAP (SHapley Additive exPlanations) values, which measure each feature's contribution to a machine learning model's prediction. Figure 3 shows this overall feature importance using the KNN model. Every horizontal bar is a unique feature, and its length signifies the mean absolute SHAP value, which reflects the average effect of the feature on the output magnitude of the model. Each bar's color (magenta for "Failure," green for "Normal," and blue for "Near-failure") represents the share of each feature's influence on the various prediction classifications. For example, feature 'z' significantly affects all three classes, while features 'x' and 'y' appear to have varying effects on the "Normal" and "Failure" predictions. Overall, feature 'z' appears to be the most influential feature in predicting all three labels (anomaly states), suggesting that it helps in recognizing normal operational states, detecting situations that are close to failure but not yet critical, and identifying potential failures.

D. FEATURE IMPORTANCE USING SHAP WITH DIFFERENT AI MODELS
Having explained the global summary plot by SHAP, we next provide the overall feature importance by SHAP for the different AI models in our work. The average effect of each feature (x, y, and z) on the model's output magnitude, as determined by the mean absolute SHAP values, is shown in Figure 4.
Main Insights: By giving each feature a value that corresponds to its importance for a certain prediction, SHAP values provide a broad overview of feature relevance for different AI models. Figure 4a shows that the most significant feature in the AdaBoost model is z, followed by


FIGURE 4. Overall feature importance for all three features (x,y, and z) using SHAP for different AI models for MEMS dataset.

FIGURE 5. Feature importance (using SHAP global summary plots) for different RPMs used in collecting data for MEMS.

x and y. As before, Figure 4b shows that feature z is the most significant feature for the Deep Neural Network model, having a significantly larger SHAP value than features x and y. Figure 4c shows that the most significant feature in the Decision Tree model is also z; for this model, y has slightly less influence than x, with the two nearly equal. According to the Random Forest model summary in Figure 4d, feature z also has the biggest average influence on model predictions, closely followed by x, with y the least significant. We have noticed similar insights for the SVM and MLP models (omitted here).

E. FEATURE IMPORTANCE FOR EACH RPM
We next show the feature importance for each RPM, where we built a separate AI model for each of the six RPM values used for collecting the MEMS data. The length of


each bar in the charts, which corresponds to a feature (labeled x, y, and z), indicates the mean absolute SHAP value for that feature over a large number of cases. The coloring of the bars (blue for "Near-failure," magenta for "Failure," and green for "Normal") indicates how well the feature predicts the various classes or possible outcomes. A feature that has a greater impact on the model's prediction for a given class is shown by a longer bar in that color.
The charts suggest that, for all RPMs, feature 'z' usually has a significant influence on the prediction of the "Normal" and "Failure" classes. Across the classes and models/scenarios, the effects of features 'x' and 'y' differ more noticeably. In particular, feature 'x' ranked second in four RPMs, while feature 'y' ranked second in only two RPMs. Furthermore, the varying bar lengths and colors across the charts indicate that the importance of each feature to the model's prediction varies based on the model and situation under study. This can provide important insights into the predictive dynamics of the AI model and possibly direct feature selection or model modification. It also shows different behaviors of features within each setting (e.g., feature 'x' affects the decision more at high and low rotational speeds of the motor compared to medium rotational speeds).

F. LOCAL FEATURE IMPORTANCE USING LIME
We next show local feature importance for the MEMS dataset. In particular, we provide three examples (one for each of the three anomaly classes in MEMS) to show how LIME, a well-known local XAI method, uses features to explain the decisions of the AI model on a local (single) instance. In these examples (shown in Figure 6), the prediction probabilities for the three classes (designated as 0 (normal), 1 (near-failure), and 2 (failure)) using an RF classifier are displayed with the feature contributions that went into the forecast for each class, with each row representing a distinct case (or example).
The anticipated probabilities for each case are shown for all three classes in the bar chart on the left. Sets of rules related to features 'x', 'y', and 'z' that influence the model's classification of the instance as "NOT 0", "NOT 1", or "NOT 2" are located next to the bar charts. As shown by the colored boxes and matching values, these rules represent the criteria by which the features' values either raise or lower the likelihood that the instance belongs to a particular anomaly class.
Actual Feature Values and LIME Rules: The actual feature values for each instance are displayed in the rightmost column. For instance, with a probability of 0.92, the instance in the top row is largely anticipated to be normal (class 0). According to the rules, the properties 'y > 0.10' and 'z ≤ 0.12' are crucial conditions that support this classification. In a similar vein, the middle and bottom rows present the influential conditions and forecasts for additional cases that are primarily categorized as classes 1 (near-failure) and 2 (failure), respectively, in this example.
Understanding the AI model's reasoning behind certain predictions of an IoT instance is made easier with the help of this LIME explanation. This is particularly helpful for high-stakes applications like IoT or smart manufacturing, where decision-making must be transparent and comprehensible to ensure safety in these systems.

G. FEATURE IMPORTANCE USING LEAVE ONE COVARIATE OUT (LOCO)
We next show the feature importance with an interpretability technique called LOCO [31]. The main idea of this method is to evaluate the effect of each feature individually on the prediction. It entails retraining the AI model several times, omitting a distinct feature each time, and tracking how the predictions of the AI model change. We used LOCO with the RF AI model on the MEMS dataset, and it shows that the prediction difference for feature 'x' is 1, which means that when we set feature 'x' to its mean value while keeping the other features unchanged, the RF model's prediction increased by 1. The prediction differences for features y and z are 0: when we set feature y (or z) to its mean value while keeping the other features unchanged, the RF model's prediction remained the same (no change).
We can also show the feature importance bar chart using LOCO for each feature. We show an example for the MEMS dataset using the MLP model in Figure 7. In this figure, the three features (x, y, and z) are displayed in a bar chart that illustrates their relative importance according to the LOCO approach. The change in prediction accuracy (or some other performance measure) when each feature is removed is used to determine how important a feature is. Feature z's highest relevance signifies that the MLP model's predictions are significantly affected by its value. On the other hand, the least important feature is y, while feature x is of intermediate importance.

H. FEATURE IMPORTANCE USING CONTRASTIVE EXPLANATIONS METHOD (CEM)
We next explore the feature importance of the MEMS dataset's features using the CEM method [32]. CEM is a method for deciphering machine learning models by examining how the model's predictions change in response to perturbations in the input features. This technique looks at how changing a feature might affect the forecast in order to help determine which features have the biggest influence on the model's output. The unique property of CEM is that it sheds light on why an AI model made a specific decision by contrasting it with alternative decisions.
We show two examples of using CEM on the MEMS dataset in Figures 8-9. The figures are bar charts that show the feature importance for two different models as evaluated by the Contrastive Explanations Method (CEM). In these figures, the three features (x, y, z) of the MEMS dataset are displayed together with their corresponding importance scores. For the RF model, CEM shows that feature z is the most significant feature, in contrast to LOCO, which shows that


FIGURE 6. Local feature importance using LIME XAI method for three data instances from the MEMS dataset.

FIGURE 7. Feature importance using LOCO for the main features (x, y, and z) of the MEMS dataset using the MLP AI model.

FIGURE 8. Feature importance using CEM for the main features (x, y, and z) of the MEMS dataset using the RF AI model.

FIGURE 9. Feature importance using CEM for the main features (x, y, and z) of the MEMS dataset using the MLP AI model.

x is the top one. For CEM, x is the second top feature for the RF model. For the MLP model, both LOCO and CEM lead to the same finding that feature z is the top feature.
Main Insights: Since feature z is consistently the most significant in both AI models under CEM, it is likely a major factor in the decisions made by the anomaly detection model for the MEMS dataset. Although still significant, feature x is not as significant as feature z. Finally, feature y has almost no influence for the MLP model, in contrast to the RF model, under the CEM XAI method.

I. FEATURE IMPORTANCE USING ACCUMULATED LOCAL EFFECT (ALE)
We next show another local feature importance illustration using the accumulated local effect (ALE) method. In this experiment, our three distinct features (x, y, and z) are represented by Accumulated Local Effects (ALE) plots in Figures 10-12. The machine learning model (the RF model in this experiment) predicts three classes: "normal," "near-failure," and "failure." ALE plots are used to illustrate how features, taking into consideration their interactions with other features, impact a machine learning model's average prediction. They provide insight into the connection between the input features and the anticipated prediction (result). In contrast to LIME, it shows the relationship


FIGURE 10. ALE plot for MEMS dataset (Feature x) for all anomaly classes (1: normal, 2: near-failure, and 3: failure).

FIGURE 11. ALE plot for MEMS dataset (Feature y) for all anomaly classes (1: normal, 2: near-failure, and 3: failure).

FIGURE 12. ALE plot for MEMS dataset (Feature z) for all anomaly classes (1: normal, 2: near-failure, and 3: failure).

between each feature individually (in each single ALE plot) and the different anomaly classes.
With the ALE value on the y-axis plotted against the feature value on the x-axis, each figure represents a single feature (Figure 10 for x, Figure 11 for y, and Figure 12 for z). The influence of the feature on each of the three classes is represented by lines in different colors. A steep slope or peak in a line, for example, suggests that the feature value has a significant impact on the model's prediction for that class. On the other hand, flat sections imply that the prediction remains constant as the feature value varies, which means that the feature is less important. The ALE plots (Figures 10-12) show that feature x is most important for predicting the 'Failure' class, feature y is most important for predicting the 'Normal' class, and feature z is most important for predicting 'Near-failure'.
Main Insights: For feature x, after a certain value of x, the effect on the prediction for class 1 ('normal') increases noticeably, whereas for class 3 ('failure') there is a noticeable drop. Class 2 ('near-failure') has an impact that varies but is less noticeable. Regarding feature y, all classes exhibit dramatic swings, suggesting that even minor y changes can have a big influence on the prediction. This implies a complex and non-linear relationship between the variable and the expected prediction. With regard to feature z, the plot illustrates notable peaks and troughs for all classes, suggesting intricate relationships where specific z ranges have an enormous effect on each class's prediction.

J. FEATURE IMPORTANCE USING PERMUTATION FEATURE IMPORTANCE (PFI) WITH DIFFERENT AI MODELS
The Permutation Feature Importance (PFI) approach was used in our evaluation experiments to determine the feature importance for several AI models. By randomly shuffling the feature values, and thereby disrupting the correlation between the feature and the target, this method evaluates the significance of each feature by monitoring the shift in the AI model's performance, typically accuracy. We show the feature importance for all three of our features (x, y, and z) for all six AI models considered in our work (i.e., AdaBoost, Deep Neural Network, Multi-Layer Perceptron, Random Forest, Support Vector Machine, and Decision Tree). The feature importances of these different models are compared in the bar charts in Figure 13. The x-axis represents the features (x, y, z), and the y-axis quantifies the feature importance score, inferred as the drop in model accuracy when each feature's values are permuted.
Main Insights: Figure 13 shows that feature z is likely a crucial predictor, as indicated by the consistent pattern in which feature z is the most important feature under PFI throughout the six AI models for the MEMS dataset. Thus, the model building and evaluation process should give close consideration to this feature. For the other two features, feature x was the second top feature in four of the six AI models (ADA, DT, RF, and SVM), while feature y was the second top feature in only two AI models (DNN and MLP). Given that feature y is the least significant, the main explanation for this behaviour of feature y is that the Y-axis represents the direction of the shaft, which exhibits smaller vibrations compared to the Z-axis and X-axis.
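The PFI procedure described above maps directly onto scikit-learn's `permutation_importance` utility. The following minimal sketch illustrates the idea on synthetic stand-in data (not the actual MEMS data or our evaluation code), reporting the mean accuracy drop per feature.

```python
# Minimal sketch of Permutation Feature Importance (PFI): shuffle one
# feature at a time on the held-out set and measure the accuracy drop.
# Synthetic three-feature data stands in for the MEMS x/y/z channels.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=3, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# n_repeats shuffles each feature several times to average out noise.
result = permutation_importance(rf, X_te, y_te, scoring="accuracy",
                                n_repeats=10, random_state=0)
for name, score in zip(["x", "y", "z"], result.importances_mean):
    print(f"{name}: mean accuracy drop = {score:.3f}")
```

Computing PFI on the test split, as here, measures importance for generalization rather than for fitting the training data.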


FIGURE 13. The feature importance for three main features (x,y, and z) in MEMS dataset using Permutation Feature Importance (PFI) for all AI models
(AdaBoost, Decision Tree, Deep Neural Network, Random Forest, Support Vector Machine, and MLP).

FIGURE 14. The feature importance for three main features (x,y, and z) in MEMS dataset using Profweight for different AI models for anomaly detection
(AdaBoost, Decision Tree, Deep Neural Network, Random Forest, and Support Vector Machine).


TABLE 6. The feature importance (given by top three features) for each anomaly type for the MEMS dataset. The feature z is the top feature across all different anomaly types.

K. FEATURE IMPORTANCE USING PROFWEIGHT WITH AI MODELS
We next explore the ProfWeight-inspired method to evaluate and display feature relevance for the MEMS dataset. The ProfWeight approach assesses the significance of a feature by considering several criteria, such as the feature's weight within the model and its relationship with other features. This approach offers an indicator of the relative contribution of each attribute to the model's predictions. We evaluated ProfWeight for the machine learning models considered in our work, including AdaBoost, Decision Tree, Random Forest, Deep Neural Network, and Support Vector Machine. Figure 14 shows the ProfWeight feature importance for the MEMS dataset. Again, the three features are displayed on the x-axis of each graph, with the y-axis representing each feature's relevance ranking.
Main Insights: Similar to PFI, Figure 14 shows that feature 'z' clearly outweighs the others in each graph in terms of relevance (importance) score across all models for the ProfWeight XAI method. The relevance scores of the other two features are noticeably lower, indicating that they have less effect on the model's functionality. This consistent ranking, regardless of model type, indicates that z is the most significant feature and that it has a strong and consistent influence on the model's predictions. These insights can be critical for activities like feature selection, model interpretation, and enhancing model performance by concentrating on the most informative properties. They can also be useful for determining which aspects influence anomaly detection model decisions, which can in turn help in preventing adversarial attacks on such AI methods.

L. FEATURE IMPORTANCE FOR EACH ANOMALY TYPE
We next show the anomaly-specific feature importance (given by the top three features for each anomaly type). This is calculated by averaging the SHAP feature importance over the different AI models. Table 6 shows such a list for the MEMS dataset. We observe that feature 'z' is the top feature for all anomaly types. For the 'near-failure' anomaly class, feature 'y' has the second rank, while for the 'failure' anomaly class, feature 'x' has the second rank. This anomaly-specific feature importance can help in tuning AI models for detecting different conditions of the IoT device (here, the condition of the motor).

M. SUMMARY OF FEATURE ANALYSIS FOR MEMS
We have provided overall feature importance analysis using five XAI methods (SHAP, LOCO, CEM, PFI, and ProfWeight). We have also provided local feature importance analysis using the popular LIME method and the ALE method. Furthermore, we provided feature importance for the different RPMs considered in collecting the MEMS data for a more in-depth understanding. All things considered, feature z (motor acceleration in the Z direction) has the most impact on the predictions across all models according to the SHAP feature importance results. Although its significance relative to z differs by model, feature x also exhibits a significant impact. Generally speaking, feature y has the least effect. As illustrated in the experimental setup (Figure 2a), the motor's rotation, coupled with the disk imbalanced by the mounted mass (eccentric weight), leads to unbalanced centripetal forces, causing repeated vibrations in multiple directions. Given the circular movement around the motor's center, the primary vibrations occur along the X and Z axes, while the Y-axis (aligned with the shaft) shows relatively minor vibrations. These variations in data patterns might not be distinctly discernible with changes in machine health. Such feature selection necessitates domain knowledge, specifically an understanding of how the motor vibrates and of the relative sensor placement. The rationale is that the selection of discriminative features, informed by this domain knowledge, can significantly influence the AI model's learning process.
The link between features and predictions appears to be highly model-dependent, given the diversity in feature impact across different models. Thus, we have used several XAI methods to confirm which features appear frequently as top important features across the different AI models. By highlighting the features that are most useful for making predictions, our analysis can help direct feature engineering and model selection.
Having provided the different results for the MEMS dataset, we next provide the main evaluation results for the N-BaIoT dataset.

VI. N-BaIoT RESULTS EVALUATION
A. PERFORMANCE METRICS WITH ALL 115 FEATURES
We first show the main performance metrics (accuracy, precision, recall, and F-1 score) for the 10 different AI models (both single AI models and ensemble methods). Table 7 shows such results. Using the aforementioned variety of machine learning models, the table gives a thorough summary of the anomaly detection outcomes for the 9 IoT devices that were discussed in Section IV-A. The performance measures (accuracy, precision, recall, and F-1 score) are displayed in the columns, with each row representing a distinct AI model. Notably, the different ensemble methods (Random Forest, Bagging, Blending, Stacking, and Voting) show consistently high scores for all measures, with recall,


TABLE 7. N-BaIoT - anomaly detection results for the nine devices (the higher the better) for each model.
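The four metrics reported in Table 7 can be computed with scikit-learn as sketched below; the labels are hypothetical stand-ins, not values from the table.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical true/predicted anomaly labels for one device.
y_true = [0, 0, 1, 1, 2, 2, 2, 1]
y_pred = [0, 0, 1, 2, 2, 2, 2, 1]

acc = accuracy_score(y_true, y_pred)
# Weighted averaging accounts for class support in multi-class settings.
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
```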


F-1 score, accuracy, and precision all approaching 0.99. These algorithms demonstrate remarkable efficacy in precisely detecting data anomalies. Among single AI models, Decision Tree provides the best results, with performance similar to the ensemble methods. The MLP and DNN models also show promising performances for seven of the nine devices. On the other hand, the ADA model has the worst performance scores, suggesting that it would be less robust for anomaly detection on this dataset. Finally, the performance of the SVM model varies across devices but is generally much lower than that of DT and MLP.
Main Insights: The main insight from this performance evaluation using all features for all nine devices of N-BaIoT is that ensemble methods work best for predicting the different anomaly classes, since combining different AI methods in these ensembles harnesses the strengths of diverse models and mitigates the weaknesses of lower-performing ones.
We next provide feature importance analysis for one of the nine devices (the Samsung webcam IoT device, "device 7"). We chose this device since it has the best average performance across the different devices; thus, the generated feature analysis would be more trustworthy compared to the other devices. We emphasize that we also generated the feature importance for the other 8 IoT devices (not shown here in the interest of space). Again, we perform different feature importance analyses using our different XAI methods (LOCO, CEM, SHAP, PFI, ProfWeight, ALE, and LIME), as detailed next.

B. OVERALL FEATURE IMPORTANCE (SHAP GLOBAL SUMMARY PLOT)
We next show the overall feature importance under SHAP (SHapley Additive exPlanations) values, which measure each feature's contribution to a machine learning model's prediction. Figure 15b shows such overall feature importance using the Random Forest model. Every horizontal bar is a unique feature, and its length signifies the mean absolute SHAP value, which reflects the average effect of the feature on the output magnitude of the model. Each bar's color (red for "benign", green for "gafgyt.udp", blue for "gafgyt.scan", purple for "gafgyt.junk", orange for "gafgyt.combo", and yellow for "gafgyt.tcp") represents the share of the feature's influence on the predictions for the different botnet attack classes. For example, feature 'HH_L1_magnitude' significantly affects the "benign", "gafgyt.tcp", and "gafgyt.combo" classes, while feature 'HH_L1_weight' has more effect on the "gafgyt.junk" and "gafgyt.udp" attack classes. Another group of features has moderate effects for specific attack classes (e.g., 'HH_L5_mean' for "gafgyt.scan" and 'HH_L5_weight' for "gafgyt.udp"). On the other hand, some features have almost no effect for most classes under the Random Forest model (e.g., 'HH_L5_pcc' and 'HH_L5_covariance').

C. FEATURE IMPORTANCE USING SHAP WITH DIFFERENT MODELS
We next provide the overall feature importance by SHAP for the different AI models for the N-BaIoT dataset. The average effect of the top-20 features on the model's output magnitude, determined by the mean absolute SHAP value, is displayed in Figure 15.
Main Insights: By giving each feature a value that corresponds to its importance for a certain prediction, SHAP values provide a broad overview of feature relevance for the different AI models. Figure 15a shows that the most significant feature in the AdaBoost model is "HH_L1_weight", followed by "HH_L1_magnitude" and "HH_L5_weight". Figure 15b shows that feature "HH_L1_magnitude" is the most significant feature for the Random Forest model, followed by "HH_L3_weight" and "HH_L3_magnitude". The feature importance pattern of Decision Tree is similar to that of Random Forest. According to the MLP model summary in Figure 15c, feature "HH_L1_magnitude" also has the biggest average influence on model predictions, closely followed by "HH_L5_magnitude", with "HH_L5_mean" in third rank. We have noticed similar insights for the SVM and DNN models (omitted in the interest of space).

D. FEATURE IMPORTANCE WITH CONTRASTIVE EXPLANATIONS METHOD (CEM)
Recall that the contrastive explanations method (CEM) shows what may be altered in the input features to obtain a different prediction output, which sheds light on how a machine learning model makes decisions. This gives complicated models like MLPs some interpretability. Figure 16 shows the feature importance using CEM for the top features of the N-BaIoT dataset using different AI models. Each bar represents a different feature, as explained in Section IV-A. The y-axis, which has values between 0 and 0.4, measures the relevance. The features with the tallest bars are the most significant for the model's performance or predictions. For CEM, the top important features are "HH-L1-weight" for AdaBoost and Random Forest and "HH-L1-magnitude" for the Decision Tree and MLP models.

E. FEATURE IMPORTANCE WITH LOCO
Recall that Leave-One-Covariate-Out (LOCO) is a strategy used to estimate the importance of features in machine learning models. When a feature is removed from the model, the LOCO technique measures the change in prediction error to assess how important the feature is. To be more precise, the model is retrained several times, omitting a distinct feature each time, and the effect on the model's performance is tracked. Figure 17 shows the main outcomes of using the LOCO approach on different AI models, namely AdaBoost, Decision Tree, Random Forest, and MLP. The x-axis lists the features by name, and the y-axis, which is scaled on a logarithmic


FIGURE 15. Overall feature importance for all top-20 features using SHAP for different AI models for N-BaIoT dataset.
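The mean-absolute SHAP importance summarized in Figure 15 is rooted in Shapley values from cooperative game theory. The minimal sketch below computes exact Shapley values for a hypothetical additive "model output" over three features; real SHAP tooling approximates this same quantity for full models, so this is an illustration of the underlying math, not the paper's pipeline.

```python
from itertools import combinations
from math import factorial

def shapley_values(value, n):
    """Exact Shapley value of each of n features for a set-value function."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(len(others) + 1):
            for S in combinations(others, k):
                # Standard Shapley weight |S|! (n-|S|-1)! / n!
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                phi[i] += w * (value(set(S) | {i}) - value(set(S)))
    return phi

# Hypothetical additive game over features {x, y, z}:
# per-feature contributions 1.0, 0.5, 3.0, mimicking z dominating.
contrib = {0: 1.0, 1: 0.5, 2: 3.0}
value = lambda S: sum(contrib[j] for j in S)
phi = shapley_values(value, 3)
print(phi)  # for an additive game, Shapley values recover the contributions
```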

FIGURE 16. Feature importance using CEM XAI method for the top features of N-BaIoT dataset using different AI models.
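The CEM idea of finding a minimal input change that flips a prediction can be illustrated with a simplified brute-force search. This is a stand-in for CEM's optimization-based pertinent-negative search; the model, data, and step size below are toy assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def minimal_contrastive_change(model, x, step=0.1, max_steps=100):
    """Smallest single-feature perturbation that flips the predicted class
    (a simplified stand-in for CEM's pertinent-negative search)."""
    base = model.predict([x])[0]
    best = None  # (|delta|, feature index, delta)
    for j in range(len(x)):
        for sign in (1.0, -1.0):
            for k in range(1, max_steps + 1):
                x_new = x.copy()
                x_new[j] += sign * step * k
                if model.predict([x_new])[0] != base:
                    cand = (abs(sign * step * k), j, sign * step * k)
                    if best is None or cand < best:
                        best = cand
                    break
    return best

# Toy model: the class depends only on feature 0 (threshold near 0.5).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([0, 1, 0, 1])
model = DecisionTreeClassifier(random_state=0).fit(X, y)
res = minimal_contrastive_change(model, np.array([0.0, 0.0]))
print(res)  # only feature 0 can flip the prediction here
```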


FIGURE 17. Feature importance using LOCO XAI method for the top features of N-BaIoT dataset using different AI models.
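The retrain-without-one-feature loop behind Figure 17 can be sketched as follows; this is an illustrative implementation on synthetic data, not the code used in the paper.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

def loco_importance(estimator, X, y):
    """Leave-One-Covariate-Out: increase in error after retraining
    the model without each feature (illustrative sketch)."""
    full_err = 1.0 - clone(estimator).fit(X, y).score(X, y)
    scores = []
    for j in range(X.shape[1]):
        X_minus = np.delete(X, j, axis=1)  # drop one covariate
        err = 1.0 - clone(estimator).fit(X_minus, y).score(X_minus, y)
        scores.append(err - full_err)  # larger = more important
    return np.array(scores)

# Synthetic data where only feature 0 drives the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = (X[:, 0] + 0.1 * rng.normal(size=400) > 0).astype(int)
imp = loco_importance(LogisticRegression(), X, y)
print(imp.argmax())  # dropping the informative feature hurts most
```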

scale (notice the "1e-5" scale factor), shows the feature importance scores. Every bar illustrates how the prediction performance of the model is affected by eliminating a single feature. Higher bars denote features that are crucial for producing accurate predictions of the different anomaly classes, because their absence dramatically reduces the model's accuracy. From the figures, we can infer that feature "HH-L1-magnitude" is most important for the AdaBoost and Decision Tree AI models, followed by "HH-L1-std", while feature "HH-L1-std" ranks most important for the RF model, followed by "HH-L3-pcc", which in turn ranks the highest for the MLP model.

F. FEATURE IMPORTANCE WITH PERMUTATION FEATURE IMPORTANCE (PFI)
Recall that the Permutation Feature Importance (PFI) method is another technique used to assess the importance of features in machine learning models. In PFI, a feature's importance is determined by permuting its values and tracking how the model's performance indicators (accuracy in our experiments) change as a result. More specifically, once the model is trained on the original dataset, the values of a given feature are randomly permuted or shuffled between instances. Each feature goes through this procedure several times, and the average performance change indicates how important the feature is. Permuting the values of a feature that is critical to the model's predictions would probably result in a notable decline in performance. Thus, the PFI technique provides information about how each feature affects the overall performance of the model. This motivates our usage of PFI for feature importance for the N-BaIoT dataset.
Figure 18 shows the outcomes of the different models under Permutation Feature Importance (PFI) for the N-BaIoT dataset, where one feature at a time is changed while all other features remain unaltered, and this is done for all features. The significance scores are displayed on the y-axis; higher values signal that the feature is more important because permuting it has a bigger detrimental effect on the model's performance. The features are listed on the x-axis and reflect various statistical attributes used by the AI models. Comparing the figures of all the models, we can observe that features


TABLE 8. N-BaIoT - anomaly detection results for the nine devices only using the top-20 features.
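The top-20 experiment in Table 8 follows a rank-then-retrain pattern that can be sketched as below. The feature counts, model, and data here are illustrative assumptions (top-5 of 30 synthetic features rather than top-20 of 115), not the paper's setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical stand-in dataset: 30 features, only 3 informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 30))
y = (X[:, 0] + X[:, 1] - X[:, 2] > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Rank features with the full model, then retrain on the top-k only.
full = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
top_k = np.argsort(full.feature_importances_)[::-1][:5]
reduced = RandomForestClassifier(random_state=0).fit(X_tr[:, top_k], y_tr)

print(sorted(top_k.tolist()))
print(round(reduced.score(X_te[:, top_k], y_te), 3))
```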


"HH-L1-weight" and "HH-L1-magnitude" visibly rank as the top two important features in all models except AdaBoost, where "HH-L1-std" is the second highest ranked feature, making "HH-L1-magnitude" the third one.

G. FEATURE IMPORTANCE USING PROFWEIGHT
ProfWeight, which stands for Profiled Weighting, is a technique designed to explain how various features affect an AI model's predictions. It entails systematically changing the weights given to various features and tracking how this affects the predictions made by the model. Through profiling the model with different weight configurations, IoT developers can learn which features have a major impact on the model's output. Evaluating the model's sensitivity to various features and their individual contributions is the fundamental concept. ProfWeight offers a sophisticated understanding of feature importance, which is very helpful when working with complex AI models.
Figure 19 shows the outcomes of the different models under ProfWeight for the N-BaIoT dataset. The significance scores of the different top features in N-BaIoT are displayed on the y-axis; higher values signal that the feature is more important because changing its weight has a bigger detrimental effect on the model's performance. The features are listed on the x-axis and reflect various statistical attributes of the dataset. Comparing all the AI models, the feature "HH-L1-magnitude" has the highest importance except with the MLP model, where "HH-L3-mean" has the first rank (i.e., the highest importance). Another interesting finding here is that the RF model depends on only three main features ("HH-L1-magnitude", "HH-L1-weight", and "HH-L1-std") for making its classification decisions.

H. FEATURE IMPORTANCE USING LOCAL INTERPRETABLE MODEL-AGNOSTIC EXPLANATIONS (LIME)
We next provide local feature importance using LIME, which is designed to enable local interpretability for any machine learning model (i.e., LIME is a model-agnostic XAI method). By fitting a straightforward, interpretable model (such as a linear model) to roughly represent the behavior of the complex black-box model around a particular data point, LIME produces locally faithful explanations. This localized approximation improves the interpretability of the model at the local level by providing insights into how it functions for a specific instance. LIME is especially helpful when working with complicated models for predicting anomalies in IoT networks, or when transparency is needed for certain predictions without having to describe the entire model.
The feature importance scores as calculated by LIME for the different AI models for the N-BaIoT dataset are displayed in Figure 20. For any AI model, each bar on the graph represents a feature, and its length and direction show how much the feature contributed to the model's prediction for that particular instance. Positive values show a positive influence on the prediction, while negative values show a negative impact. The graph thus provides a clear image of how each element affects a specific prediction, which is especially helpful for comprehending the behavior of the model on a local basis. Overall, the features "HH_L1_magnitude" and "HH_L3_std" are the top two features under LIME for most AI models, except MLP, which has "HH_L1_mean" as the top feature affecting its anomaly detection decisions.

I. PERFORMANCE METRICS WITH TOP-20 FEATURES
We next show the performance metrics under the top-20 features for the different AI methods, to test the effect of selecting top features for the N-BaIoT dataset. Table 8 shows such performance metrics under the top-20 features for all nine IoT devices in the N-BaIoT dataset. Overall, the top-20 features yield worse performance for all ensemble methods compared to using the full list of features. On the other hand, some of the models with the worst performance under all features show improvements in some performance metrics for several devices under the top-20 features. The main insight from this experiment is that feature selection can be more helpful for single AI models with lower performance than for ensemble methods. Moreover, the anomaly detection models seem to benefit from the majority of features when detecting anomalies for the N-BaIoT dataset. Furthermore, the usefulness of such feature selection would depend on the nature of the IoT dataset.

J. FEATURE IMPORTANCE FOR EACH ATTACK TYPE
We next show the attack-specific feature importance (given by the top five features for each attack type). Table 9 shows such a list for the N-BaIoT dataset. This is extracted from the average SHAP scores for each attack using the different AI models. We observe that the feature 'HH_L1_magnitude' is the common top feature across the different attack types. Furthermore, the feature 'HH_L1_weight' is the top feature for the 'gafgyt.junk' and 'gafgyt.udp' botnet attacks. There are also other features that are common across different attack types (e.g., 'HH_L3_weight', 'HH_L3_magnitude', 'HH_L5_magnitude', 'HH_L1_mean', and 'HH_L3_mean'). Overall, these results show the main features that we should look for when investigating a specific network botnet attack on IoT devices in the N-BaIoT dataset. This attack-specific feature importance can help in tuning AI models for detecting different conditions of the IoT network (here, the network traffic between host and destination IoT devices).

K. SUMMARY OF RESULTS FOR N-BaIoT
We have provided overall feature importance analysis using five XAI methods (SHAP, LOCO, CEM, PFI, and ProfWeight). We have also provided local feature importance analysis using the popular LIME method. Furthermore, we provided performance metrics under different AI methods and different combinations of features (top-20 features, and all features). All things considered, features


FIGURE 18. Feature importance using PFI XAI method for the top features of N-BaIoT dataset using different AI models.

TABLE 9. The feature importance (given by top five features) for each attack type for the N-BaIoT dataset. The feature HH_L1_magnitude is the common top feature across the different attack types.

"HH_L1_magnitude" and "HH_L1_mean" have the most impact on the predictions across all models, according to the feature importance results for the different XAI methods considered in our evaluation for the N-BaIoT dataset. Similar to MEMS, the link between features and predictions appears to be model-dependent, based on the diversity in feature impact across different models. Furthermore, the feature importance problem is more challenging in this dataset since it has 115 features compared to 3 features in the MEMS dataset. Thus, different AI models seem to have different features that affect their decision-making.
Having provided our extensive evaluation for both the MEMS and N-BaIoT datasets, we next provide a discussion of the main findings and main limitations of our work.

VII. DISCUSSION
A. OVERALL DISCUSSION OF RESULTS
We first observe that each dataset has a different best model (e.g., DNN gave the best performance for the MEMS dataset, while ensemble methods were the best for the N-BaIoT dataset). We also observe that each XAI method can help in understanding one aspect of our anomaly detection problem. In particular, SHAP can give top features for each anomaly class, LIME can explain anomaly detection decisions on local samples of the IoT dataset, LOCO and ProfWeight can further explain the importance of each single feature, and ALE can provide the correlation between each single feature and


FIGURE 19. Feature importance using Profweight XAI method for top features of N-BaIoT dataset using different AI models.

predictions. Our analysis provides the main insights about the top features for each of the two datasets (MEMS and N-BaIoT), along with the performance of anomaly detection (represented by multi-class classification) for our two IoT datasets. Recall that we focused on sensors' readings for the MEMS dataset and on network traffic between IoT devices for the N-BaIoT dataset, covering different angles of IoT systems.

B. DISCREPANCY IN MODELS' PERFORMANCES ON THE TWO DATASETS
There are several factors that explain the discrepancy in classifiers' performance between the N-BaIoT and MEMS datasets, which are detailed below.

1) DATASET SIZE AND BALANCE
One important note is that bigger datasets typically result in greater model performance since they provide more examples for the AI model to train from. This intuition is consistent with our analysis, in which performance on the N-BaIoT dataset is better than that on MEMS. This is due to the fact that the MEMS sensor dataset has only thousands of samples, due to its very low sampling rate (10 Hz), and thus its AI models' performances are much lower compared to the N-BaIoT dataset (which has more than a million and a half samples).

2) CLASS OVERLAP AND DATA DISTRIBUTION
Compared to the N-BaIoT dataset, which has more distinct class separations, the MEMS dataset may have classes that are intrinsically closer together in feature space (such as "Near Failure" and "Failure"), making it more difficult for models to distinguish between them. Thus, the better performance may also result from the fact that the N-BaIoT dataset has a more balanced distribution of classes than the MEMS dataset.

3) FEATURE IMPORTANCE
The 'z' feature was found to be the most important, and it had a major impact on model predictions in the MEMS dataset. Thus,


FIGURE 20. Feature importance using LIME XAI method for the top features of N-BaIoT dataset using different AI models.
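The local surrogate idea behind Figure 20 can be sketched from scratch: perturb around an instance, query the black-box model, and fit a proximity-weighted linear model. This is a simplified sketch, not the LIME library itself; the kernel width and perturbation scale are arbitrary assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

def lime_weights(model, x, n_samples=2000, width=0.75, seed=0):
    """Fit a locally weighted linear surrogate around x and return its
    coefficients as local feature importances (simplified LIME sketch)."""
    rng = np.random.default_rng(seed)
    Z = x + rng.normal(scale=0.5, size=(n_samples, x.size))  # perturbations
    p = model.predict_proba(Z)[:, 1]                         # black-box output
    d = np.linalg.norm(Z - x, axis=1)
    w = np.exp(-(d ** 2) / width ** 2)                       # proximity kernel
    surrogate = Ridge(alpha=1.0).fit(Z, p, sample_weight=w)
    return surrogate.coef_

# Toy black-box model where only feature 1 matters.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = (X[:, 1] > 0).astype(int)
model = RandomForestClassifier(random_state=0).fit(X, y)
coef = lime_weights(model, np.zeros(3))
print(np.abs(coef).argmax())  # the locally influential feature
```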

the MEMS dataset might be more vulnerable to changes in this crucial characteristic, resulting in notable performance variations, because it only has three features, one of which is 'z'. On the other hand, the N-BaIoT dataset might be less impacted by changes in a single feature because it has a larger feature set, leading to more consistent performance.

4) DATA QUALITY
The N-BaIoT dataset may have characteristics that are less noisy or more discriminative for the anomaly detection task, which could explain why the classifiers performed better on it than on the MEMS dataset.

C. FAIRNESS AND BIAS OF AI MODELS
We discuss here several metrics to measure the fairness and bias of the data collection technique and the accountability issues of AI models. These metrics include the Demographic Parity Score, Predictive Parity Score, and Calibration Disparity Score [66]. We explain and measure such scores for both the MEMS and N-BaIoT datasets considered in our work. These metrics and their related results are detailed below.

1) DEMOGRAPHIC PARITY SCORE
Demographic Parity, also known as statistical parity, evaluates whether outcomes are independent of protected attributes. It compares the proportion of positive outcomes across different labels from the dataset.

2) PREDICTIVE PARITY SCORE
Predictive parity examines whether the predictions made by the model are equally accurate across different labels. It assesses whether the model's performance is consistent regardless of the label.

3) CALIBRATION DISPARITY SCORE
The calibration metric assesses whether the predicted probabilities align with the true probabilities of outcomes. Miscalibration across different groups can indicate bias in the AI model.
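The first two metrics can be computed directly from group-wise rates. The sketch below expresses them as gaps (0.0 = perfect parity), rather than the ratio-style scores reported in this section; the data and grouping are hypothetical.

```python
import numpy as np

def demographic_parity_gap(y_pred, group):
    """Difference in positive-prediction rate between two groups
    (0.0 = perfect parity)."""
    g = np.asarray(group, dtype=bool)
    return abs(y_pred[g].mean() - y_pred[~g].mean())

def predictive_parity_gap(y_true, y_pred, group):
    """Difference in positive predictive value between two groups."""
    g = np.asarray(group, dtype=bool)
    ppv = lambda t, p: t[p == 1].mean()  # precision on positive predictions
    return abs(ppv(y_true[g], y_pred[g]) - ppv(y_true[~g], y_pred[~g]))

# Hypothetical binary predictions, split into a group of interest vs the rest.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 1, 0, 1, 1, 0])
group  = np.array([1, 1, 1, 1, 0, 0, 0, 0])

print(demographic_parity_gap(y_pred, group))
print(predictive_parity_gap(y_true, y_pred, group))
```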


4) METRICS PERFORMANCES ON MEMS D. CONFUSION OF AI MODELS BETWEEN ANOMALY


We start by explaining the results of these metrics for our first CLASSES (CONFUSION MATRICES)
dataset, MEMS dataset. 1) CONFUSION MATRICES COMPARISON
(a) Confusion Matrix: There are 4316 instances in the test Here, we show the Random Forest AI model’s performance
set where the model’s predictions match the true labels. confusion matrices for both IoT datasets in Figure 21. For the
(b) Demographic Parity Score: 1.0 MEMS dataset, the confusion matrix (shown in Figure 21a)
The Demographic Parity Score is 1.0, which indicates shows its three classes. Although it still tends to make
perfect demographic parity. This means that the proportion accurate predictions, there is clear confusion between Class 1
of positive predictions is the same across different groups. (normal) and Class 2 (near failure), as seen by the lighter
(c) Predictive Parity Score: 0.0 shades off the diagonal. Figure 21b shows RF model with six
The Predictive Parity Score is 0.0, which means that the classes for the N-BaIoT datset, which shows high accuracy
positive predictive value (PPV) is equal between the groups of in some classes with most predictions focused along the
interest (‘Near-failure’ class) and the other group (‘Normal’ diagonal, indicating accurate classifications. Interestingly,
and ‘Failure’ classes). In other words, the model’s predictions a higher percentage of true positives is suggested by the
are equally accurate for both groups. darker shades for classes like 1, 2, 3, and 5. Under such
(d) Calibration Disparity Score: 1.0 comparison of confusion matrices with the Random Forest
The Calibration Disparity Score is 1.0, indicating per- algorithm as the anomaly classification algorithm, we show
fect calibration disparity. This means that the calibration the differences in performances on the two datasets where
curves, which show the relationship between predicted N-BaIoT dataset have better prediction power compared to
probabilities and the true probabilities of positive out- MEMS dataset (with lower number of samples).
comes, are the same for the group of interest (‘Near-
failure’ class) and the other group (‘Normal’ and ‘Failure’ 2) MAIN EXPLANATIONS OF CONFUSION MATRIX OF MEMS
classes). DATASET
In the confusion matrix for the MEMS dataset (Figure 21a),
there are three classes. Lighter hues in the off-diagonal
5) METRICS PERFORMANCES ON N-BaIoT
cells show some misclassification across the classes, but
We finally explain the results of these metrics for our second
substantially fewer than in the N-BaIoT dataset. The darker
dataset, N-BaIoT dataset.
blue square in the upper left indicates a high number of true
(a) Confusion Matrix: There are 24,000 instances in the
positives for class 1 (Normal). The lighter blue squares in
test set where the model’s predictions match the true labels.
these cells indicate that there is some uncertainty between
(b) Demographic Parity Score: 1.0
class 2 (Near Failure) and class 3 (Failure), but this matrix still
A Demographic Parity Score of 1.0 indicates perfect
demonstrates a high degree of accuracy for class 1 (Normal)
demographic parity. This means that the proportion of
predictions.
positive predictions is the same across different demographic
groups. In this case, it suggests that the model’s predictions
are balanced across different groups.
(c) Predictive Parity Score: 0.0
The Predictive Parity Score of 0.0 suggests that there is no difference in the positive predictive value (PPV) between the group of interest (the 'gafgyt.scan' class) and the other group (the remaining anomaly classes). It means that the model's predictions are equally accurate for both groups.
(d) Calibration Disparity Score: 1.0
A Calibration Disparity Score of 1.0 indicates that the calibration curves, which show the relationship between predicted probabilities and the true probabilities of positive outcomes, are the same for the group of interest (the 'gafgyt.scan' class) and the other group (the remaining classes). Essentially, it suggests that the model's predicted probabilities are well-calibrated for both groups.
Main Insight: Overall, these results suggest that the anomaly detection models perform well in terms of fairness and bias according to the specified metrics for both datasets considered in our work.

3) MAIN EXPLANATIONS OF CONFUSION MATRIX OF N-BaIoT DATASET
Six classes are displayed in the confusion matrix for the N-BaIoT dataset, as seen in Figure 21b. The true positives for each class are shown by the darker hues along the diagonal, which indicate that the model correctly predicted each case. Class 1 (benign), for example, has 4077 true positives. Lighter hues in non-diagonal cells indicate misclassifications, i.e., erroneous predictions made by the model when it mistook one class for another. The concentration of darker hues along the diagonal and the few light spots elsewhere indicate that the RF model has a relatively low misclassification rate and high accuracy for both the benign class and the other attack classes of the N-BaIoT dataset.

4) COMPARISON BETWEEN MEMS AND N-BaIoT MATRICES
The different shades of blue show the frequency of predictions: lighter cells indicate fewer occurrences, whereas darker cells indicate a high frequency of predictions for that combination of true and predicted classes. Off-diagonal cells should be lighter, indicating fewer misclassifications, whereas diagonal cells should ideally be the darkest, indicating accurate classifications.
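The fairness checks in this subsection (predictive parity and calibration) can be computed directly. The following is a minimal NumPy sketch of the predictive-parity gap; the binary anomaly label, the one-vs-rest grouping, and the helper names are illustrative assumptions rather than our released implementation.

```python
import numpy as np

def ppv(y_true, y_pred):
    # Positive predictive value: precision of the positive (anomaly) label.
    flagged = y_pred == 1
    return float((y_true[flagged] == 1).mean()) if flagged.any() else 0.0

def predictive_parity_gap(y_true, y_pred, in_group):
    # |PPV(group of interest) - PPV(remaining samples)|. A gap of 0.0 means
    # the model's positive predictions are equally reliable for both groups,
    # matching the Predictive Parity Score of 0.0 reported above.
    return abs(ppv(y_true[in_group], y_pred[in_group])
               - ppv(y_true[~in_group], y_pred[~in_group]))
```

Here `in_group` would mark the samples of the class of interest (e.g., 'gafgyt.scan') against the remaining anomaly classes.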

VOLUME 12, 2024 71049


A. N. Gummadi et al.: XAI-IoT: An Explainable AI Framework for Enhancing Anomaly Detection

FIGURE 21. Confusion matrices for the top classes of N-BaIoT and MEMS datasets using RF AI model.

While the MEMS matrix (Figure 21a) demonstrates high performance primarily on one class, with some confusion between the other two classes, the N-BaIoT matrix (Figure 21b) reveals a model that performs well across many classes.

E. DISCUSSION ABOUT PERFORMANCE AFTER FEATURE SELECTION
In our evaluation, we notice that most of the AI models work better under all features. However, we emphasize that the main goal of our feature importance analysis is understanding the top features that affect the decisions of AI models, even when most features are used in building these AI models. Eventually, this can lead to better explainable frameworks for explaining the decision-making of AI models.

F. NEED FOR XAI AND FEATURE IMPORTANCE FOR IoT
We emphasize that random forests and KNN, which are frequently thought to be interpretable in machine learning applications, also have interpretability issues. First, random forests' ensemble nature leads to complicated decision processes that go beyond single-tree clarity [67]. Second, KNN's simplicity in low-dimensional spaces contrasts with its interpretability concerns in high-dimensional spaces, where defining the 'nearest' neighbour becomes unclear [68]. Moreover, simple decision trees with lower depth and few leaves can be used as an explainability tool, as shown in prior works for intrusion detection [69], [70], [71]. Since each level of the tree has a clear rule for each node until it reaches a leaf, such a tree exposes the exact decision process behind every prediction, which improves explainability when compared to black-box AI models. However, this comes with much lower performance in terms of accuracy, recall, and precision. On the other hand, decision trees with higher depths and more leaves usually enhance performance metrics but do not have the explainability power of simpler decision trees. This sheds light on the importance of having XAI methods to extract the main features for anomaly detection in IoT applications, which we tackle in our current work. For developers and users of these IoT devices, knowing the "why" behind a decision of an AI model is just as important as the decision itself.
There is also a compelling economic rationale for the deployment and analysis of such XAI-based systems. In IoT-based smart manufacturing, a variety of sensors (e.g., vibration, ultrasonic, and pressure sensors) are utilized for tasks like process control, automation, production planning, and equipment maintenance. For instance, in equipment maintenance, continuous monitoring of the operating equipment's condition using proxy measures (e.g., vibration and sound) is implemented to avert unplanned downtime and reduce maintenance costs [72]. Real-time analysis of data from these sensors, and building XAI models for this data, plays a pivotal role in predictive maintenance tasks employing the anomaly detection process [73], [74].
In other IoT domains with IoT networks (such as the N-BaIoT application), manual intervention in data collection (e.g., replacing a failing sensor) or mitigation actions (e.g., adjusting the position of many IoT security cameras) is a costly endeavor. Consequently, precise anomaly detection can significantly alleviate these labor costs. Therefore, our proposed XAI-based anomaly detection technique finds applicability in different IoT applications.

G. COMPARATIVE ANALYSIS WITH PRIOR RELATED WORK
We now provide a comparative analysis between our current work and similar solutions developed by other researchers for anomaly detection and feature importance in the IoT domain. Table 10 shows this comparison, highlighting the main differences between the features of our work and those of prior related works.
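As a concrete illustration of the XAI-driven feature extraction argued for above, one of our seven XAI methods, permutation feature importance (PFI) [34], can be sketched model-agnostically in a few lines of NumPy. This is a simplified, illustrative sketch (the function name and defaults are ours, not our released code):

```python
import numpy as np

def permutation_importance(model_predict, X, y, n_repeats=5, seed=0):
    # Model-agnostic PFI: the importance of feature j is the average drop in
    # accuracy when column j is randomly permuted, breaking its link to y.
    rng = np.random.default_rng(seed)
    base_acc = (model_predict(X) == y).mean()
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])  # shuffle one feature only
            drops.append(base_acc - (model_predict(Xp) == y).mean())
        importances[j] = np.mean(drops)
    return importances
```

Ranking features by decreasing importance then yields the kind of top-feature lists reported in our analysis.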


TABLE 10. A comparative analysis of the available features between the prior related works in IoT systems and our framework. Our work provides an
anomaly detection framework that incorporates detecting different attack types. Our framework also considers XAI-based feature importance analysis for
both global and local scopes for different AI models.

Our work provides an anomaly detection framework that incorporates detecting different attack types. Our framework also considers XAI-based feature importance analysis for both global and local scopes for different AI models.

H. REPRODUCIBILITY
We have made our source codes and benchmark data publicly available, facilitating the replication of our research. We are releasing our IoT database corpus, consisting of two datasets, aiming to encourage standardization in benchmarking anomaly detection and feature importance within this crucial domain. We invite the community to contribute to and expand this resource by sharing their new datasets and models. The website containing our database and source codes can be accessed at: https://github.com/agummadi1/XAI_for_IoT_Systems. Recall that detailed information about our framework and the various model categories has been provided in Section IV-A. Additionally, the descriptions of the two datasets are available in Appendix A, and details about hyper-parameter selections and the libraries used are presented in Appendix B.

VIII. CONCLUSION
This paper explored several interesting challenges in an important application area, the Internet of Things (IoT). We proposed an explainable AI framework for studying anomaly detection and failure classification for securing IoT systems. We proposed a multi-class anomaly detection technique and an efficient defect-type classification technique for IoT applications. We then performed feature importance analysis using seven XAI methods (SHAP, LIME, CEM, LOCO, PFI, ProfWeight, and ALE). We tested our framework on two real-world datasets (MEMS and N-BaIoT). We compared single AI and ensemble-based models for anomaly detection using different performance metrics. Our evaluation showed that the single AI models lead to better anomaly detection performance for the MEMS dataset, while ensemble-based models were better for the N-BaIoT dataset. We also identified the top features that affect the decisions of different AI models for both datasets using our different XAI methods.
We release our database corpus and codes for the community to build on with new datasets and models. We believe that the XAI framework is useful in the IoT domain, especially since large anomaly detection datasets can be costly to collect and are normally thought to be very specific to a single application. Future avenues of research include leveraging the data from multiple IoT sensors, exploring ensemble learning on XAI methods to enhance feature analysis, and detecting device health by merging information from multiple, potentially different, IoT sensors.

APPENDIX A
MEMS DATASET COLLECTION
Raw data was collected from the real-world sensors from August 2021 to May 2022. The setup comprised hardware (sensors mounted on motor testbeds) and software (code to collect the data from the sensors and save it to a desktop machine).

A. MOTIVATION FOR DATA RELEASE OF MEMS DATASET
The main motivation for releasing our datasets is to enable ML-based anomaly detection and feature importance analysis for IoT-based smart manufacturing systems. The manufacturing of discrete products typically involves the use of equipment termed machine tools. The health of a machine is often directly related to the health of the motors being used to drive the process. Given this dependence, health studies of manufacturing equipment may work directly with equipment in a production environment or in a more controlled environment on a "motor testbed".

APPENDIX B
BENCHMARKS: MODELS AND HYPER-PARAMETER SELECTION
A. MODELS AND HYPER-PARAMETER SELECTION
We now provide details on the models used to study the anomaly detection problem in our work. We explain the anomaly detection (prediction) algorithms and the hyper-parameters used for each classification model. This can help in reproducing our results in future related works. Following standard tuning of each model, we created different variants of the models to choose the best parameters (by comparing performance on the multi-class classification problem). The details for each model are explained below.
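As a concrete sketch, the DNN configuration described in item (1) of this appendix could be assembled in Keras roughly as follows. The feature and class counts are placeholders, and this is an illustrative reconstruction rather than our exact training script.

```python
from tensorflow import keras
from tensorflow.keras import layers

NUM_FEATURES = 115   # placeholder: set to the dataset's feature count
NUM_CLASSES = 6      # placeholder: e.g., six classes for N-BaIoT

# Input-sized ReLU layer, dropout of 0.01, a hidden layer of 16 ReLU
# neurons, and a softmax output, as described in item (1) below.
model = keras.Sequential([
    layers.Input(shape=(NUM_FEATURES,)),
    layers.Dense(NUM_FEATURES, activation="relu"),
    layers.Dropout(0.01),
    layers.Dense(16, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])
# Training in the described setup used 11 epochs with a batch size of 1024:
# model.fit(X_train, y_train_onehot, epochs=11, batch_size=1024)
```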

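The ensemble settings in items (7)-(10) below can also be sketched in code. The parameter names meta_classifier and use_probas suggest the mlxtend library was used for stacking, so the scikit-learn StackingClassifier shown here is an approximate, illustrative equivalent rather than our exact implementation; the synthetic data is a placeholder, not one of the paper's datasets.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression

# Three base learners as in items (7)-(10): bagging, AdaBoost, and random
# forest, combined by a logistic-regression meta-classifier over predicted
# probabilities (mirroring meta_classifier=LogisticRegression and
# use_probas=True in the described setup).
stack = StackingClassifier(
    estimators=[
        ("bag", BaggingClassifier(n_estimators=100, random_state=42)),
        ("ada", AdaBoostClassifier(n_estimators=50)),
        ("rf", RandomForestClassifier(n_estimators=100, max_depth=10)),
    ],
    final_estimator=LogisticRegression(),
    stack_method="predict_proba",
)

# Tiny synthetic check on placeholder data:
X, y = make_classification(n_samples=120, n_features=8, random_state=0)
stack.fit(X, y)
preds = stack.predict(X[:5])
```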

(1) Deep Neural Network (DNN): The initial classifier is a Deep Neural Network (DNN) with an architecture comprising an input layer, where the count of neurons corresponds to the number of features used, employing the Rectified Linear Unit (ReLU) activation function. This is succeeded by a dropout layer with a dropout rate of 0.01, a hidden layer featuring 16 neurons and ReLU activation, and concludes with a "softmax" layer. The loss function is set to "categorical_crossentropy," utilizing the adaptive momentum (ADAM) optimization algorithm. Training the model requires eleven epochs with a batch size of 1024, and default values are retained for the remaining parameters.
(2) Random Forest (RF): The subsequent classifier for detecting malicious samples in IoT datasets (network traffic in N-BaIoT and sensor samples in MEMS) is the Random Forest (RF). Hyperparameters include setting the number of estimators (trees) to 100, the maximum tree depth to 10, and the minimum number of samples required to split an internal node to 2. Default values are maintained for the rest of the parameters.
(3) AdaBoost (ADA): AdaBoost is employed as the next classifier, with the maximum number of estimators set at 50, the weight applied to each classifier during boosting iterations set to 1, and the base estimator being the Decision_Tree_Classifier.
(4) Decision Tree (DT): The subsequent classifier for detecting malicious samples in IoT datasets is the Decision Tree (DT). Hyperparameters include setting the maximum tree depth to 10 and the minimum number of samples required to split an internal node to 2. Default values are maintained for the rest of the parameters.
(5) Support Vector Machine (SVM): SVM is utilized with the kernel set to 'linear', gamma set to 0.5, probability set to 'True', and regularization set to 0.5.
(6) Multi-layer Perceptron (MLP): The MLP classifier adopts the same setup as the DNN.
(7) Bagging: The bagging ensemble method was used with base_estimator set to a decision tree classifier, n_estimators set to 100, and random_state = 42. The remaining parameters were set to default.
(8) Voting: The voting ensemble method was used with three estimators: the bagging classifier, the AdaBoost classifier, and the random forest classifier. The voting method was set to "hard", and the remaining parameters were set to default.
(9) Blending: The blending ensemble method was used with three base learners: the bagging classifier, the AdaBoost classifier, and the random forest classifier. The blending method chooses the base learner with the maximum prediction probability for each sample. The remaining parameters were set to default.
(10) Stacking: The stacking ensemble method was used with three base learners: the bagging classifier, the AdaBoost classifier, and the random forest classifier. The meta_classifier was set to LogisticRegression, and use_probas was set to 'True'. The remaining parameters were set to default.
(11) K-nearest Neighbour (KNN): The KNN classifier is used in one experiment (the SHAP explanation for MEMS) with default hyperparameters: the number of neighbors is set to five, the weights are uniform, and the search algorithm is set to 'auto'.

ACKNOWLEDGMENT
All the opinions, findings, and recommendations expressed in this publication are those of the authors of the article, and they do not reflect the opinions or views of the sponsors.

REFERENCES
[1] L. Atzori, A. Iera, and G. Morabito, "The Internet of Things: A survey," Comput. Netw., vol. 54, no. 15, pp. 2787–2805, 2010.
[2] J. Gubbi, R. Buyya, S. Marusic, and M. Palaniswami, "Internet of Things (IoT): A vision, architectural elements, and future directions," Future Gener. Comput. Syst., vol. 29, no. 7, pp. 1645–1660, Sep. 2013.
[3] M. Abdallah, B.-G. Joung, W. J. Lee, C. Mousoulis, N. Raghunathan, A. Shakouri, J. W. Sutherland, and S. Bagchi, "Anomaly detection and inter-sensor transfer learning on smart manufacturing datasets," Sensors, vol. 23, no. 1, p. 486, Jan. 2023.
[4] L. Barreto and A. Amaral, "Smart farming: Cyber security challenges," in Proc. Int. Conf. Intell. Syst. (IS), Sep. 2018, pp. 870–876.
[5] J. Wang, Y. Ma, L. Zhang, R. X. Gao, and D. Wu, "Deep learning for smart manufacturing: Methods and applications," J. Manuf. Syst., vol. 48, pp. 144–156, Jul. 2018.
[6] J. S. Sunny, C. P. K. Patro, K. Karnani, S. C. Pingle, F. Lin, M. Anekoji, L. D. Jones, S. Kesari, and S. Ashili, "Anomaly detection framework for wearables data: A perspective review on data concepts, data analysis algorithms and prospects," Sensors, vol. 22, no. 3, p. 756, Jan. 2022.
[7] T. E. Thomas, J. Koo, S. Chaterji, and S. Bagchi, "Minerva: A reinforcement learning-based technique for optimal scheduling and bottleneck detection in distributed factory operations," in Proc. 10th Int. Conf. Commun. Syst. Netw. (COMSNETS), Jan. 2018, pp. 129–136.
[8] L. Scime and J. Beuth, "Anomaly detection and classification in a laser powder bed additive manufacturing process using a trained computer vision algorithm," Additive Manuf., vol. 19, pp. 114–126, Jan. 2018.
[9] A. Ukil, S. Bandyoapdhyay, C. Puri, and A. Pal, "IoT healthcare analytics: The importance of anomaly detection," in Proc. IEEE 30th Int. Conf. Adv. Inf. Netw. Appl. (AINA), Mar. 2016, pp. 994–997.
[10] G. Shahzad, H. Yang, A. W. Ahmad, and C. Lee, "Energy-efficient intelligent street lighting system using traffic-adaptive control," IEEE Sensors J., vol. 16, no. 13, pp. 5397–5405, Jul. 2016.
[11] R. Mitchell and I.-R. Chen, "A survey of intrusion detection techniques for cyber-physical systems," ACM Comput. Surveys, vol. 46, no. 4, pp. 1–29, Apr. 2014.
[12] B. Chatterjee, D.-H. Seo, S. Chakraborty, S. Avlani, X. Jiang, H. Zhang, M. Abdallah, N. Raghunathan, C. Mousoulis, A. Shakouri, S. Bagchi, D. Peroulis, and S. Sen, "Context-aware collaborative intelligence with spatio-temporal in-sensor-analytics for efficient communication in a large-area IoT testbed," IEEE Internet Things J., vol. 8, no. 8, pp. 6800–6814, 2020.
[13] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly detection: A survey," ACM Comput. Surv., vol. 41, no. 3, pp. 1–58, 2009.
[14] F. Sabahi and A. Movaghar, "Intrusion detection: A survey," in Proc. 3rd Int. Conf. Syst. Netw. Commun., Oct. 2008, pp. 23–26.
[15] A. L. Bowler, S. Bakalis, and N. J. Watson, "Monitoring mixing processes using ultrasonic sensors and machine learning," Sensors, vol. 20, no. 7, p. 1813, Mar. 2020. [Online]. Available: https://www.mdpi.com/1424-8220/20/7/1813
[16] F. Lopez, M. Saez, Y. Shao, E. C. Balta, J. Moyne, Z. M. Mao, K. Barton, and D. Tilbury, "Categorization of anomalies in smart manufacturing systems to support the selection of detection mechanisms," IEEE Robot. Autom. Lett., vol. 2, no. 4, pp. 1885–1892, Oct. 2017.
[17] A. Das and P. Rad, "Opportunities and challenges in explainable artificial intelligence (XAI): A survey," 2020, arXiv:2006.11371.
[18] M. Marjani, F. Nasaruddin, A. Gani, A. Karim, I. A. T. Hashem, A. Siddiqa, and I. Yaqoob, "Big IoT data analytics: Architecture, opportunities, and open research challenges," IEEE Access, vol. 5, pp. 5247–5261, 2017.
[19] J. He, J. Wei, K. Chen, Z. Tang, Y. Zhou, and Y. Zhang, "Multitier fog computing with large-scale IoT data analytics for smart cities," IEEE Internet Things J., vol. 5, no. 2, pp. 677–686, Apr. 2018.


[20] S. R. Safavian and D. Landgrebe, "A survey of decision tree classifier methodology," IEEE Trans. Syst., Man, Cybern., vol. 21, no. 3, pp. 660–674, Jun. 1991.
[21] C. Tang, N. Luktarhan, and Y. Zhao, "SAAE-DNN: Deep learning method on intrusion detection," Symmetry, vol. 12, no. 10, p. 1695, Oct. 2020.
[22] A. Yulianto, P. Sukarno, and N. A. Suwastika, "Improving AdaBoost-based intrusion detection system (IDS) performance on CIC IDS 2017 dataset," J. Phys., Conf. Ser., vol. 1192, Mar. 2019, Art. no. 012018.
[23] P. Tao, Z. Sun, and Z. Sun, "An improved intrusion detection algorithm based on GA and SVM," IEEE Access, vol. 6, pp. 13624–13631, 2018.
[24] J. O. Mebawondu, O. D. Alowolodu, J. O. Mebawondu, and A. O. Adetunmbi, "Network intrusion detection system using supervised learning paradigm," Sci. Afr., vol. 9, Sep. 2020, Art. no. e00497.
[25] S. Waskle, L. Parashar, and U. Singh, "Intrusion detection system using PCA with random forest approach," in Proc. Int. Conf. Electron. Sustain. Commun. Syst. (ICESC), Jul. 2020, pp. 803–808.
[26] P. Bühlmann, "Bagging, boosting and ensemble methods," in Handbook of Computational Statistics: Concepts and Methods. Berlin, Germany: Springer, 2012, pp. 985–1022.
[27] Y. Wang, M. Bellus, J.-F. Geleyn, X. Ma, W. Tian, and F. Weidle, "A new method for generating initial condition perturbations in a regional ensemble prediction system: Blending," Monthly Weather Rev., vol. 142, no. 5, pp. 2043–2059, May 2014.
[28] R. Lazzarini, H. Tianfield, and V. Charissis, "A stacking ensemble of deep learning models for IoT intrusion detection," Knowl.-Based Syst., vol. 279, Nov. 2023, Art. no. 110941.
[29] H. G. Ayad and M. S. Kamel, "On voting-based consensus of cluster ensembles," Pattern Recognit., vol. 43, no. 5, pp. 1943–1953, May 2010.
[30] S. M. Lundberg and S.-I. Lee, "A unified approach to interpreting model predictions," in Proc. Adv. Neural Inf. Process. Syst., vol. 30, 2017, pp. 1–10.
[31] J. Lei, M. G'Sell, A. Rinaldo, R. J. Tibshirani, and L. Wasserman, "Distribution-free predictive inference for regression," J. Amer. Stat. Assoc., vol. 113, no. 523, pp. 1094–1111, Jul. 2018.
[32] A. Dhurandhar, P.-Y. Chen, R. Luss, C.-C. Tu, P. Ting, K. Shanmugam, and P. Das, "Explanations based on the missing: Towards contrastive explanations with pertinent negatives," in Proc. Adv. Neural Inf. Process. Syst., vol. 31, 2018, pp. 1–12.
[33] S. R. Islam, W. Eberle, S. K. Ghafoor, and M. Ahmed, "Explainable artificial intelligence approaches: A survey," 2021, arXiv:2101.09429.
[34] A. Altmann, L. Toloşi, O. Sander, and T. Lengauer, "Permutation importance: A corrected feature importance measure," Bioinformatics, vol. 26, no. 10, pp. 1340–1347, May 2010.
[35] A. Dhurandhar, K. Shanmugam, R. Luss, and P. A. Olsen, "Improving simple models with confidence profiles," in Proc. Adv. Neural Inf. Process. Syst., vol. 31, 2018, pp. 1–11.
[36] J. Dieber and S. Kirrane, "Why model why? Assessing the strengths and limitations of LIME," 2020, arXiv:2012.00093.
[37] Y. Meidan, M. Bohadana, Y. Mathov, Y. Mirsky, A. Shabtai, D. Breitenbacher, and Y. Elovici, "N-BaIoT—Network-based detection of IoT botnet attacks using deep autoencoders," IEEE Pervasive Comput., vol. 17, no. 3, pp. 12–22, Jul. 2018.
[38] M. Antonakakis et al., "Understanding the Mirai botnet," in Proc. 26th USENIX Secur. Symp. (USENIX Security), 2017, pp. 1093–1110.
[39] E. Cozzi, P.-A. Vervier, M. Dell'Amico, Y. Shen, L. Bilge, and D. Balzarotti, "The tangled genealogy of IoT malware," in Proc. Annu. Comput. Secur. Appl. Conf., Dec. 2020, pp. 1–16.
[40] R. Jeff. Considerations for Accelerometer Selection When Monitoring Complex Machinery Vibration. Accessed: Sep. 30, 2019. [Online]. Available: http://www.vibration.org/Presentation/IMI%20Sensors%20Accel%20Presentation%200116.pdf
[41] A. Albarbar, S. Mekid, A. Starr, and R. Pietruszkiewicz, "Suitability of MEMS accelerometers for condition monitoring: An experimental study," Sensors, vol. 8, no. 2, pp. 784–799, Feb. 2008.
[42] Y. Himeur, K. Ghanem, A. Alsalemi, F. Bensaali, and A. Amira, "Artificial intelligence based anomaly detection of energy consumption in buildings: A review, current trends and new perspectives," Appl. Energy, vol. 287, Apr. 2021, Art. no. 116601.
[43] M. Abdallah, W. Jae Lee, N. Raghunathan, C. Mousoulis, J. W. Sutherland, and S. Bagchi, "Anomaly detection through transfer learning in agriculture and manufacturing IoT systems," 2021, arXiv:2102.05814.
[44] W. J. Lee, G. P. Mendis, M. J. Triebe, and J. W. Sutherland, "Monitoring of a machining process using kernel principal component analysis and kernel density estimation," J. Intell. Manuf., vol. 31, no. 5, pp. 1175–1189, Jun. 2020, doi: 10.1007/s10845-019-01504-w.
[45] S.-H. G. Teng and S.-Y. M. Ho, "Failure mode and effects analysis," Int. J. Quality Rel. Manage., vol. 13, no. 5, pp. 8–26, 1996.
[46] W. J. Lee, H. Wu, A. Huang, and J. W. Sutherland, "Learning via acceleration spectrograms of a DC motor system with application to condition monitoring," Int. J. Adv. Manuf. Technol., vol. 106, nos. 3–4, pp. 803–816, Jan. 2020, doi: 10.1007/s00170-019-04563-8.
[47] O. Arreche, T. Guntur, and M. Abdallah, "XAI-IDS: Towards proposing an explainable artificial intelligence framework for enhancing network intrusion detection systems," Appl. Sci., vol. 14, no. 10, 2024, Art. no. 4170. [Online]. Available: https://www.mdpi.com/2076-3417/14/10/4170, doi: 10.3390/app14104170.
[48] H. K. Bharadwaj, A. Agarwal, V. Chamola, N. R. Lakkaniga, V. Hassija, M. Guizani, and B. Sikdar, "A review on the role of machine learning in enabling IoT based healthcare applications," IEEE Access, vol. 9, pp. 38859–38890, 2021.
[49] P. Nimbalkar and D. Kshirsagar, "Feature selection for intrusion detection system in Internet-of-Things (IoT)," ICT Exp., vol. 7, no. 2, pp. 177–181, Jun. 2021. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2405959521000588
[50] A. Nazir, Z. Memon, T. Sadiq, H. Rahman, and I. U. Khan, "A novel feature-selection algorithm in IoT networks for intrusion detection," Sensors, vol. 23, no. 19, p. 8153, Sep. 2023. [Online]. Available: https://www.mdpi.com/1424-8220/23/19/8153
[51] S. S. Udmale, S. K. Singh, R. Singh, and A. K. Sangaiah, "Multi-fault bearing classification using sensors and ConvNet-based transfer learning approach," IEEE Sensors J., vol. 20, no. 3, pp. 1433–1444, Feb. 2020.
[52] L. Torrey and J. Shavlik, "Transfer learning," in Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques. Hershey, PA, USA: IGI Global, 2010, pp. 242–264.
[53] M. Abdallah, R. Rossi, K. Mahadik, S. Kim, H. Zhao, and S. Bagchi, "AutoForecast: Automatic time-series forecasting model selection," in Proc. 31st ACM Int. Conf. Inf. Knowl. Manage., New York, NY, USA: Association for Computing Machinery, Oct. 2022, pp. 5–14, doi: 10.1145/3511808.3557241.
[54] Y. Koizumi, Y. Kawaguchi, K. Imoto, T. Nakamura, Y. Nikaido, R. Tanabe, H. Purohit, K. Suefusa, T. Endo, M. Yasuda, and N. Harada, "Description and discussion on DCASE2020 challenge task2: Unsupervised anomalous sound detection for machine condition monitoring," 2020, arXiv:2006.05822.
[55] R.-J. Hsieh, J. Chou, and C.-H. Ho, "Unsupervised online anomaly detection on multivariate sensing time series data for smart manufacturing," in Proc. IEEE 12th Conf. Service-Oriented Comput. Appl. (SOCA), Nov. 2019, pp. 90–97.
[56] Y. Fathy, M. Jaber, and A. Brintrup, "Learning with imbalanced data in smart manufacturing: A comparative analysis," IEEE Access, vol. 9, pp. 2734–2757, 2021.
[57] L. Calderoni, A. Magnani, and D. Maio, "IoT manager: An open-source IoT framework for smart cities," J. Syst. Archit., vol. 98, pp. 413–423, Sep. 2019.
[58] D. Han, Z. Wang, W. Chen, Y. Zhong, S. Wang, H. Zhang, J. Yang, X. Shi, and X. Yin, "DeepAID: Interpreting and improving deep learning-based anomaly detection in security applications," 2021, arXiv:2109.11495.
[59] K. A. Jackson, D. H. DuBois, and C. A. Stallings, "An expert system application for network intrusion detection," Los Alamos Nat. Lab. (LANL), Los Alamos, NM, USA, Tech. Rep. LA-UR-91-558; CONF-911059-1; ON: DE91008590, 1991.
[60] C. Wu, A. Qian, X. Dong, and Y. Zhang, "Feature-oriented design of visual analytics system for interpretable deep learning based intrusion detection," in Proc. Int. Symp. Theor. Aspects Softw. Eng. (TASE), Dec. 2020, pp. 73–80.
[61] Y. Mirsky, T. Doitshman, Y. Elovici, and A. Shabtai, "Kitsune: An ensemble of autoencoders for online network intrusion detection," 2018, arXiv:1802.09089.
[62] D. Chicco and G. Jurman, "The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation," BMC Genomics, vol. 21, no. 1, pp. 1–13, Dec. 2020.
[63] A. Gulli and S. Pal, Deep Learning With Keras. Berlin, Germany: Packt, 2017.
[64] C. A. Stewart, V. Welch, B. Plale, G. C. Fox, M. Pierce, and T. Sterling, "Indiana University Pervasive Technology Institute," Indiana Univ., Bloomington, IN, USA, Tech. Rep., 2017.
[65] G. O. Campos, A. Zimek, J. Sander, R. J. G. B. Campello, B. Micenková, E. Schubert, I. Assent, and M. E. Houle, "On the evaluation of unsupervised outlier detection: Measures, datasets, and an empirical study," Data Mining Knowl. Discovery, vol. 30, no. 4, pp. 891–927, Jul. 2016.


[66] C. Mougan, L. State, A. Ferrara, S. Ruggieri, and S. Staab, "Beyond demographic parity: Redefining equal treatment," 2023, arXiv:2303.08040.
[67] A. Liaw and M. Wiener, "Classification and regression by randomForest," R News, vol. 2, no. 3, pp. 18–22, 2002.
[68] C. C. Aggarwal, A. Hinneburg, and D. A. Keim, "On the surprising behavior of distance metrics in high dimensional space," in Proc. 8th Int. Conf. Database Theory, London, U.K. Cham, Switzerland: Springer, Jan. 2001, pp. 420–434.
[69] B. Ingre, A. Yadav, and A. K. Soni, "Decision tree based intrusion detection system for NSL-KDD dataset," in Proc. Inf. Commun. Technol. Intell. Syst. (ICTIS), vol. 2. Berlin, Germany: Springer, 2017, pp. 207–218.
[70] M. A. Ferrag, L. Maglaras, A. Ahmim, M. Derdour, and H. Janicke, "RDTIDS: Rules and decision tree-based intrusion detection system for Internet-of-Things networks," Future Internet, vol. 12, no. 3, p. 44, Mar. 2020.
[71] M. Al-Omari, M. Rawashdeh, F. Qutaishat, M. Alshira'H, and N. Ababneh, "An intelligent tree-based intrusion detection model for cyber security," J. Netw. Syst. Manage., vol. 29, no. 2, pp. 1–18, Apr. 2021.
[72] W. J. Lee, G. P. Mendis, and J. W. Sutherland, "Development of an intelligent tool condition monitoring system to identify manufacturing tradeoffs and optimal machining conditions," in Proc. 16th Global Conf. Sustain. Manuf., vol. 33, 2019, pp. 256–263.
[73] M. C. Garcia, M. A. Sanz-Bobi, and J. Del Pico, "SIMAP: Intelligent system for predictive maintenance: Application to the health condition monitoring of a windturbine gearbox," Comput. Ind., vol. 57, no. 6, pp. 552–568, 2006.
[74] M. De Benedetti, F. Leonardi, F. Messina, C. Santoro, and A. Vasilakos, "Anomaly detection and predictive maintenance for photovoltaic systems," Neurocomputing, vol. 310, pp. 59–68, Oct. 2018.
[75] A. L. Alfeo, M. G. C. A. Cimino, G. Manco, E. Ritacco, and G. Vaglini, "Using an autoencoder in the design of an anomaly detector for smart manufacturing," Pattern Recognit. Lett., vol. 136, pp. 272–278, Aug. 2020. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167865520302269
[76] S. Wang and N. Xi, "Calibration of haptic sensors using transfer learning," IEEE Sensors J., vol. 21, no. 2, pp. 2003–2012, Jan. 2021.
[77] K. I. Wang, X. Zhou, W. Liang, Z. Yan, and J. She, "Federated transfer learning based cross-domain prediction for smart manufacturing," IEEE Trans. Ind. Informat., vol. 18, no. 6, pp. 4088–4096, Jun. 2022.

ANNA NAMRITA GUMMADI received the bachelor's degree in computer science engineering from Jawaharlal Nehru Technological University (JNTU) Hyderabad, India. She is currently pursuing the master's degree in cybersecurity and trusted systems with the Purdue School of Engineering and Technology, Indiana University-Purdue University Indianapolis (IUPUI). She is a Research Assistant focusing on cybersecurity and explainable AI. Having spent almost three years as an Associate Software Engineer, she has hands-on experience with technology's practical applications. She aspires to leverage her expertise in creating transparent and ethically-driven technological advancements.

JERRY C. NAPIER is currently pursuing the bachelor's degree in computer information technology (CIT) with the Purdue School of Engineering and Technology, Indiana University-Purdue University Indianapolis (IUPUI). He is a current participant in IUPUI's 1st Year Research Immersion Program, where he has the opportunity to gain important research, problem-solving, and critical thinking skills. He plans on continuing his studies in CIT and hopes to gain valuable experience during his time at IUPUI. His research interests include network design, information security, and cybersecurity.

MUSTAFA ABDALLAH (Member, IEEE) received the M.Sc. degree in engineering mathematics from the Faculty of Engineering, Cairo University, and the Ph.D. degree from the Elmore Family School of Electrical and Computer Engineering, Purdue University. He is currently an Assistant Professor with the Purdue School of Engineering and Technology, Indiana University-Purdue University Indianapolis (IUPUI). He also has several industrial research experiences, including internships with Adobe Research and five years of machine learning research experience as a Principal with RDI (a leading machine learning company in Egypt). His research interests include game theory, human decision-making, explainable AI, and machine learning with applications including network security and autonomous driving systems. His research contribution is recognized by his receiving the Purdue Bilsland Dissertation Fellowship and having many publications in top IEEE/ACM journals and conferences. He was a recipient of the M.Sc. Fellowship from the Faculty of Engineering, Cairo University, in 2013.
