
iAnomaly: A Toolkit for Generating Performance Anomaly Datasets in Edge-Cloud Integrated Computing Environments

Duneesha Fernando, Maria A. Rodriguez and Rajkumar Buyya
Cloud Computing and Distributed Systems (CLOUDS) Laboratory
School of Computing and Information Systems
The University of Melbourne, Australia
Email: [email protected], {marodriguez, rbuyya}@unimelb.edu.au
arXiv:2411.02868v1 [cs.DC] 5 Nov 2024

Abstract—Microservice architectures are increasingly used to modularize IoT applications and deploy them in distributed and heterogeneous edge computing environments. Over time, these microservice-based IoT applications are susceptible to performance anomalies caused by resource hogging (e.g., CPU or memory), resource contention, etc., which can negatively impact their Quality of Service and violate their Service Level Agreements. Existing research on performance anomaly detection in edge computing environments is limited primarily due to the absence of publicly available edge performance anomaly datasets or due to the lack of accessibility of real edge setups to generate necessary data. To address this gap, we propose iAnomaly: a full-system emulator equipped with open-source tools and fully automated dataset generation capabilities to generate labeled normal and anomaly data based on user-defined configurations. We also release a performance anomaly dataset generated using iAnomaly, which captures performance data for several microservice-based IoT applications with heterogeneous QoS and resource requirements while introducing a variety of anomalies. This dataset effectively represents the characteristics found in real edge environments, and the anomalous data in the dataset adheres to the required standards of a high-quality performance anomaly dataset.

Index Terms—Edge computing, Microservices, IoT, Performance anomaly detection, Datasets, Emulators

I. INTRODUCTION

Edge-cloud integrated environments consist of devices with heterogeneous computing, storage, and networking capabilities. Microservice architectures are increasingly used to modularize IoT applications and deploy them in these distributed environments to meet the Quality of Service (QoS) requirements of each module while optimizing resource usage [1], [2]. Over time, these microservice-based IoT applications are susceptible to performance anomalies caused by resource hogging (e.g., CPU or memory) and resource contention, which can negatively impact their QoS and violate their Service Level Agreements [3]–[5]. Therefore, it is crucial to conduct performance anomaly detection on microservice-based IoT applications in edge computing environments and eventually mitigate such anomalies.

Currently, there is limited research on performance anomaly detection in edge computing environments. One of the main reasons for this is the absence of publicly available edge performance anomaly datasets, which are crucial for training and evaluating algorithms proposed in such research. The few existing studies rely on cloud datasets [6], [7] or data collected from private edge setups [3], [4], [8] to evaluate their proposed approaches. The cloud datasets have been collected from applications (mostly web applications) deployed on cloud servers. However, these cloud servers lack the heterogeneity found in edge devices in terms of computing, storage, and networking capabilities. Additionally, the microservices in cloud applications do not demonstrate the same diversity in terms of QoS and resource requirements as those in an IoT application. As a result, cloud datasets fail to capture characteristics inherent to real edge environments. On the other hand, private edge setups have not been publicly released and lack detailed information, which makes it difficult to replicate their environments, generate the necessary data, and reproduce the results of the anomaly detection experiments. It also hinders research in the field because not everyone has access to a real edge-cloud deployment for data collection purposes. Hence, relying on cloud datasets and private edge setups does not facilitate performance anomaly detection research in edge computing environments, thus posing a challenge to the progression of the field. Therefore, there is an opportunity to create a performance anomaly dataset that reflects the characteristics of edge computing environments and release the setup used for dataset generation.

Edge computing emulators are a suitable platform to generate performance anomaly datasets. They are more representative of real edge environments when compared to simulators, and are more easily accessible and cost-effective when compared to real edge deployments. The main aim of existing edge computing emulators is to create a staging environment that achieves compute and network realism similar to a real edge environment and facilitate testing of IoT applications before deploying them into production [9]–[14]. However, these general-purpose emulators do not incorporate in their design the tools and mechanisms required to autonomously and transparently generate large-scale performance anomaly datasets useful for model training and evaluation. For example, they lack adequate monitoring tools to collect performance and system-level metrics, workload generation tools to generate and capture normal performance data, and chaos engineering mechanisms to inject performance anomalies into applications.

This work addresses this gap by presenting the iAnomaly framework, a performance anomaly-enabled full-system emulator that accurately models an edge computing environment hosting microservice-based IoT applications. iAnomaly is designed with open-source tools and provides fully automated dataset generation capabilities to generate labeled normal (data collected under normal conditions without anomalies) and anomaly (data collected under anomalous conditions) data based on user-defined configurations. In addition, we present a performance anomaly dataset generated using the proposed framework. The dataset captures performance data for several microservice-based IoT applications with heterogeneous QoS and resource requirements across a wide range of domains, software architectures, service composition patterns, and communication protocols by introducing a variety of client/sensor-side as well as server-side anomalies. To the best of our knowledge, this multivariate dataset is the first open-source edge performance anomaly dataset.

The analysis of the dataset showed that the microservices within it vary in terms of their QoS and resource usage during regular operation, thus successfully capturing the characteristics of a real edge dataset. Further analysis confirmed that the anomalous data in the dataset meets the necessary standards for a high-quality performance anomaly dataset. This includes having an anomaly ratio comparable to other standard anomaly datasets and the dataset's non-triviality.

The rest of the paper is organized as follows: Section II reviews the existing related works. Section III presents the architecture of the iAnomaly toolkit, while Section IV discusses the implementation aspects of iAnomaly. Section V provides details of the generated performance anomaly dataset followed by an analysis of the dataset. Section VI concludes the paper and draws future research directions.

II. RELATED WORK

Out of the existing research studies conducted around performance anomaly detection in edge computing environments, Becker et al. [3], Soualhia et al. [4], and Skaperas et al. [8] evaluated their proposed approaches using data collected from private edge setups. However, the lack of detailed information about these private edge setups makes it challenging to reproduce their environments and generate the necessary data to replicate the results of their anomaly detection experiments. Tuli et al. [6] evaluated their proposed approaches using two publicly available cloud datasets: the Server Machine Dataset (SMD) collected from a large Internet company [15], and the Multi-source Distributed System (MSDS) dataset generated from microservices deployed on a cluster of bare-metal nodes with homogeneous computing, storage, and network capabilities [16]. As a result, these cloud datasets are unable to accurately represent the properties inherent to real edge environments. Tuli et al. also conducted a further evaluation on three self-created datasets collected from a private edge setup. However, similar to the rest of the literature, they also do not provide sufficient details required for reproducing the data generation environments. Therefore, the reliance on cloud datasets and private edge setups presents a challenge to the progression of the field of performance anomaly detection research in edge computing environments. Additionally, there is a lack of publicly available normal traces from real edge environments into which anomalies can be injected to generate synthetic datasets. We address this gap in current research by creating and releasing a performance anomaly dataset that reflects the characteristics of edge computing environments along with the setup used for dataset generation.

There are three platform options for generating a performance anomaly dataset: simulators, emulators, and real edge environments. Out of these, emulators employ real applications deployed on testbed hardware to emulate real-world infrastructure configurations [14], while simulators do not support real-world IoT protocols and services [10]. Simulations make a number of simplifications that may not always hold true, especially with an infrastructure as dynamic as edge computing [9]. Most simulators lack detailed network simulation capabilities and focus on specific aspects of edge modeling, such as service scheduling [14]. Therefore, we identify emulators as the most suitable platform for generating data as they provide a higher Degree of Realism (DoR) than simulators and because they enable the generation of large-scale data in a cost-effective manner as opposed to real edge environments.

Existing edge computing emulators can be organized under two main categories based on the level of virtualization and abstraction used to model the edge devices. They are 1) full-system emulators and 2) container-based emulators. In container-based emulators, edge devices are represented as Docker containers, while full-system emulators provide a higher granularity of emulation by allowing the deployment of multiple containerized microservices within edge devices modeled as virtual machines (VMs). Early emulators such as EmuFog [9], FogBed [10], MockFog [11], and EmuEdge [12] do not support the microservice-level granularity of IoT applications. Fogify [13] is the first edge emulator to support the microservice-level granularity of IoT applications. However, it is limited to deploying only a single microservice per edge device due to being a container-based emulator. Extending such emulators with dataset generation capabilities restricts data collection to device-level anomalies only. In contrast, iContinuum [14] is a full-system emulator with support for microservices deployment. Unlike container-based emulators, full-system emulators provide a higher level of realism and also allow the injection of both device-level and microservice-level anomalies. Consequently, our research aims to bridge the identified gap by developing a full-system emulator with performance anomaly dataset generation capabilities. A comparison of performance anomaly dataset generation capabilities in existing edge emulators [9]–[14] along with our proposed iAnomaly toolkit is shown in Table I.

TABLE I
COMPARISON OF PERFORMANCE ANOMALY DATASET GENERATION CAPABILITIES IN EDGE EMULATORS

| Edge emulator/toolkit | Main objective of work | Fault injection capabilities | Performance anomaly dataset generation capabilities | Emulation capabilities/architecture | Microservice support | Applications |
| --- | --- | --- | --- | --- | --- | --- |
| EmuFog [9] | Testing IoT apps in a staging environment before deploying into the real edge; automatic placement of fog nodes | × | × | Focused on network emulation | × | Not reported |
| FogBed [10] | Creating an environment to conduct resource management/service orchestration experiments | × | × | Focused on network emulation; container-based emulator | × | Healthcare prevention and monitoring system |
| MockFog [11] | Testing IoT apps in a staging environment before deploying into the real edge | × | × | Full-system emulator | × | Ambulance cars communicating vital measures to hospitals |
| EmuEdge [12] | Achieving compute and network realism of a real edge environment | × | × | Full-system emulator | × | Not reported |
| Fogify [13] | Testing IoT apps in a staging environment before deploying into the real edge | ✓ | × | Container-based emulator | ✓ | Smart transport applications |
| iContinuum [14] | Achieving compute and network realism of a real edge environment; intent-based emulation | ✓ | × | Full-system emulator | ✓ | Image processing application |
| iAnomaly (proposed) | Creating a toolkit to generate performance anomaly datasets | ✓ | ✓ | Full-system emulator | ✓ | Face detection/recognition application; industrial machinery predictive maintenance application; location retrieval application |

The main intention of existing emulators is to test IoT applications in a staging environment before deploying them into production. They are designed to achieve the compute and network realism of an edge environment, and the evaluation of these studies is also focused on those aspects. Modern emulators such as Fogify [13] and iContinuum [14] also have the capability to perform fault injections. However, the main intention of such fault injection capabilities is not to collect data for performance anomaly detection model training but to test the fault tolerance and availability aspects of IoT applications in the face of faults.

Although both Fogify and iContinuum have implemented and evaluated fault injection capabilities, neither of them has specified the tools used to inject anomalies. Since these emulators only conducted injections of a limited number of anomaly types, it can be inferred that they likely utilized basic tools, such as stress-ng, for this purpose. However, such mechanisms do not allow for the introduction of failures or disruptions in a controlled manner.

Both Fogify and iContinuum are capable of collecting both system and application metrics during monitoring. However, Fogify utilizes an in-house developed monitoring tool for this purpose, whereas open-source tools are preferred in emulators to support interoperability and transparency of code. iContinuum employs sFlow-RT, an open-source tool, to capture network and host-level metrics such as CPU and memory usage. It also integrates sFlow agents with Prometheus, another open-source tool, to collect application metrics. While Fogify requires explicit instrumentation, i.e., manually embedding monitoring code within the source code, in order to capture performance metrics, sFlow-RT can capture application metrics without explicit instrumentation only from IoT applications that communicate via the HTTP protocol. As most IoT applications deployed in edge computing environments are not limited to the HTTP protocol and use a variety of protocols such as MQTT, RTSP, etc., it is important to be able to collect metrics from applications communicating via such non-HTTP protocols as well.

Fogify has not provided details of its workload generation tool, while iContinuum uses Locust for workload generation. However, Locust primarily focuses on generating HTTP/HTTPS workloads, whereas it is important to incorporate a workload generation tool supporting a wide range of protocols, not just HTTP.

Consequently, it is evident that the current emulators lack the necessary tools for generating performance anomaly data. These tools include a monitoring tool for collecting metrics, a workload generation tool for creating normal performance data, and a chaos engineering tool for injecting performance anomalies. Identifying this research gap, our paper aims to address it by developing a toolkit with an emulator that incorporates a set of open-source tools for generating performance anomaly datasets.

In addition to finding the best open-source tools for generating performance anomaly datasets and creating a full-system emulator with performance anomaly dataset generation capabilities, we also integrate automated dataset generation features into our proposed iAnomaly toolkit. Moreover, as shown in Table I, most emulators have released only one IoT application which they used in their experiments. However, we generate (and release) an open-source performance anomaly dataset using iAnomaly by deploying three IoT applications consisting of microservices with varying QoS and resource requirements. These applications span a wide range of domains, software architectures, service composition patterns, and communication protocols.

III. iANOMALY ARCHITECTURE

Figure 1 depicts the architecture of iAnomaly. At the core of the framework is a full-system emulator that comprises multiple distinct layers with a set of components to build all the layers from infrastructure to applications. The infrastructure layer hosts a diverse array of computing and networking resources. While the heterogeneity of compute nodes can be emulated using Virtual Machines (VMs) with different resource capacities in the cloud, a network emulator is used to construct the network topology (by creating virtual switches and network elements) and simulate the network within the infrastructure layer. Users can define the application structure of the microservice-based IoT applications through the application layer. The middleware layer of a full-system emulator manages the deployment and operation of applications across the emulated infrastructure. Its control plane consists of a cluster manager/orchestrator that utilizes containerization and orchestration technologies to manage the computing cluster and its resources, and a network controller that uses Software-Defined Networking (SDN) technologies to manage the network (such as effectively regulating the network flow while considering resource usage conditions).

Fig. 1. System architecture of iAnomaly

In addition to a full-system emulator, iAnomaly includes components to facilitate the data generation and collection process within its middleware layer. These components consist of a monitoring module, a workload generation tool, and a tool for injecting anomalies.

The monitoring module is responsible for gathering system and application metrics from the deployed microservices. The collected data will be stored in a database and retrieved back through queries when creating the dataset. An important requirement of the monitoring tool is to be able to collect data from IoT applications communicating not only via HTTP-based protocols but also via non-HTTP protocols such as Kafka, MQTT, and RTSP. In addition, it is preferable for the tool to be capable of collecting application metrics from programs without the need for explicit instrumentation.

The workload generator is in charge of sending request/sensor data to the microservices. This tool is used during normal data generation as well as for introducing client/sensor-side anomalies such as user surges and spikes. Details of request workloads, including concurrency and duration, are specified using test plans. It is important that the workload generation tool also supports a wide range of protocols, not just HTTP. When generating data from a specific microservice, we need at least two instances of workload generators, one for generating normal workloads and another for introducing client/sensor-side anomalies.

The anomaly injection tool is responsible for injecting server-side anomalies, such as resource hogging and service failures. MockFog [11], which is one of the early edge emulators, suggested using chaos engineering tools for injecting anomalies and conducting performance testing. Chaos engineering tools are designed to test the fault tolerance of systems by deliberately introducing failures or disruptions in a controlled manner, making them suitable for inclusion in the proposed toolkit to inject performance anomalies. Moreover, by incorporating chaos engineering tools, we can inject a diverse range of anomalies, unlike with basic tools such as stress-ng.

Generating a significant amount of normal and anomaly data by using these data collection tools is a time-consuming and repetitive task necessitating human intervention. Additionally, there is a learning curve associated with using the tools, notably in terms of creating necessary test plans (incorporating varying parameters representing normal and anomalous workloads) using the workload generator, designing chaos engineering configurations, and scripting data collection for retrieving information from the monitoring tool.

To overcome these challenges, we further extend the iAnomaly framework with automated dataset generation capabilities, where users can provide the configurations of the expected dataset, and the resulting labeled normal and anomaly data will be stored in a predefined location. As depicted in Figure 1, users can define the dataset generation configurations through the application layer. We also introduce a dataset generation orchestrator to the middleware layer, which interprets the content from the dataset generation configuration/s and coordinates with the data generation tools in the toolkit to generate the normal and anomaly data required for the performance anomaly dataset. This process will be explained in detail in the next section.

IV. iANOMALY IMPLEMENTATION

This section describes the implementation details of the architecture presented in Section III. The deployment diagram of the iAnomaly toolkit is shown in Figure 2. iAnomaly relies on iContinuum [14] as a full-system emulator to accurately model edge computing environments hosting microservice-based IoT applications. iContinuum utilizes Virtual Machines (VMs) with varying resource specifications to demonstrate the heterogeneity of edge devices in terms of computing and storage. Additionally, a VM with higher resource capacities acts as the master node, hosting only the tools required for compute orchestration. Acknowledging the critical role of the master node as the system's control plane, we ensure that microservices are deployed only on VMs other than the master node. Following the implementation of iContinuum, iAnomaly uses Mininet (https://mininet.org/) as the network emulator to construct the network topology and simulate the bandwidth between edge devices. The OVS switches that form the Mininet topology are deployed in a separate VM, which is also non-deployable for microservices.

Fig. 2. Deployment diagram of the iAnomaly toolkit

In line with iContinuum's control plane implementation, iAnomaly also uses Kubernetes, specifically K3s (https://k3s.io/), a lightweight Kubernetes distribution designed for resource-constrained environments such as edge computing or IoT devices, as the cluster manager/orchestrator. Additionally, iAnomaly utilizes the Open Network Operating System (ONOS, https://opennetworking.org/onos/) as the network controller, which manages the OVS switches in the Mininet topology. The ONOS controller is also deployed in the VM where the OVS switches are deployed. Figure 2 shows how the Kubernetes orchestrator is deployed in the master node and forms a multi-node Kubernetes cluster with the worker nodes. Each worker node is configured with an OVS bridge featuring two virtual interfaces, tap0 and tap1, out of which tap1, which is configured as a GRE interface, is linked to the tap1 port of the corresponding OVS switch. This ensures that the worker nodes have a bi-directional connection with the Mininet-created network topology through Generic Routing Encapsulation (GRE) tunnelling.

The Monitoring Module is realized by using Pixie (https://docs.px.dev/), a lightweight and open-source eBPF (extended Berkeley Packet Filter)-based monitoring tool specifically designed for Kubernetes applications. eBPF-based monitoring tools, which have gained popularity recently, allow sandboxed programs to execute directly inside the Linux kernel and automatically capture telemetry data in a non-intrusive manner, i.e., without requiring modifications to user-space applications [17]. Pixie also supports monitoring a wide variety of protocols, including HTTP, Kafka, AMQP, and MySQL, making it well-suited for monitoring edge computing environments. When Pixie is deployed via the toolkit, a Pixie Edge Module (PEM) is deployed for each worker node. These modules capture monitoring data from microservice deployments on the worker nodes and send those to the Pixie Vizier deployed on the Kubernetes master node, from where they are transferred to the Pixie cloud. Pixie Vizier acts as Pixie's central collector and is also responsible for managing PEMs. When retrieving back the collected data, data retrieval queries written in the Pixie language (PxL) are executed against the Pixie cloud via a Pixie API client.
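To give a concrete sense of this retrieval step, the following is a minimal PxL sketch of the kind of query that can be executed against Pixie. The http_events table and its ctx metadata are standard Pixie-provided constructs; the specific aggregation shown here is only indicative of the style of query and is not the exact script used by iAnomaly.

```python
# Minimal PxL sketch: aggregate recent HTTP spans per pod.
# PxL is Pixie's Python-dialect query language executed by the Pixie engine.
import px

# 'http_events' is a table Pixie populates automatically via eBPF tracing.
df = px.DataFrame(table='http_events', start_time='-10m')
df.pod = df.ctx['pod']  # attach Kubernetes pod metadata to each span
df = df.groupby('pod').agg(
    latency_quantiles=('latency', px.quantiles),  # latency distribution per pod
    request_count=('latency', px.count),          # requests seen in the window
)
px.display(df)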
The Workload Generator is implemented by leveraging JMeter (https://jmeter.apache.org/) as it supports a wide range of protocols, not just HTTP. We use two separate instances of JMeter: one to generate normal workloads and the other to create client/sensor-side anomalies. Both instances are deployed outside the Kubernetes cluster to run the JMeter loads as needed and to avoid interfering with the master node's operations.

Chaos Mesh (https://chaos-mesh.org/), an open-source chaos engineering platform for Kubernetes, is used as the anomaly injection tool and is deployed in the Kubernetes master node. From there, CRD (Custom Resource Definition) YAMLs are applied to introduce server-side anomalies into the target deployments. Chaos Mesh is capable of injecting a multitude of server-side anomalies such as CPU stress, memory stress, network delay, etc.
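As an illustration of the kind of CRD that Chaos Mesh accepts, the following is a minimal StressChaos sketch that injects a CPU hog into a target deployment for a fixed duration. The namespace and label selector values are hypothetical placeholders rather than the labels used in our deployments.

```yaml
# Minimal Chaos Mesh StressChaos sketch: CPU hog on one pod of a target
# microservice for five minutes. Selector values are placeholders.
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-hog-example
  namespace: default
spec:
  mode: one                     # inject into a single matching pod
  selector:
    labelSelectors:
      app: location-retriever   # hypothetical pod label
  stressors:
    cpu:
      workers: 2                # number of stress worker threads
      load: 80                  # target CPU load percentage per worker
  duration: "5m"
```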

Fig. 3. Interactions between the dataset generation orchestrator and other components

The dataset generation orchestrator is deployed in the Kubernetes master node. As illustrated in Figure 3, it acts as the central component that interprets the content from the dataset generation configuration/s and coordinates with the data generation tools in the toolkit to produce the labeled normal and anomaly data needed for the performance anomaly dataset. To start the data generation process, the dataset generation orchestrator reads the configuration file/s to retrieve the deployment details, normal data collection parameters, and anomaly injection settings. It then remotely executes the test plans on JMeter's normal data generation instance using Paramiko's (https://docs.paramiko.org/) SSH client to initiate normal workload generation. Once the normal data has been generated for the required duration, the orchestrator executes PxL queries to collect the generated normal data. Simultaneously, while the normal workload generation process is ongoing, the orchestrator proceeds to inject anomalies. Anomalies are injected either through server-side disruptions using Chaos Mesh or via client/sensor-side anomalies through JMeter's anomalous data generation instance. For server-side disruptions, the dataset generation orchestrator applies the corresponding chaos YAMLs through the installation of Helm charts (https://helm.sh/). After the entire anomaly injection period, the orchestrator executes PxL queries to collect the corresponding anomaly data. Finally, both normal and anomaly data are funnelled back to the orchestrator, which integrates and processes the data to create a comprehensive dataset suitable for training and evaluating performance anomaly detection models.
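A minimal sketch of this remote-execution step is shown below, assuming SSH key access to the JMeter VM and a test plan already present there. The host names, paths, and test-plan property are illustrative placeholders, not the orchestrator's actual interface.

```python
# Sketch of triggering a JMeter test plan remotely over SSH, in the spirit of
# the orchestrator's normal-workload step. Hosts and paths are placeholders.
import paramiko

def run_remote_jmeter(host: str, user: str, key_file: str,
                      test_plan: str, duration_s: int) -> str:
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(hostname=host, username=user, key_filename=key_file)
    # Run JMeter in non-GUI mode; the duration is passed as a JMeter property.
    cmd = (f"jmeter -n -t {test_plan} -Jduration={duration_s} "
           f"-l /tmp/normal_results.jtl")
    _, stdout, stderr = client.exec_command(cmd)
    exit_status = stdout.channel.recv_exit_status()  # block until JMeter finishes
    output = stdout.read().decode()
    client.close()
    if exit_status != 0:
        raise RuntimeError(f"JMeter failed: {stderr.read().decode()}")
    return output

# Example with hypothetical values:
# run_remote_jmeter("jmeter-normal-vm", "ubuntu", "~/.ssh/id_rsa",
#                   "/plans/location_retrieval_normal.jmx", duration_s=3 * 3600)
```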
Therefore, the realization of the architecture proposed in Section III using the open-source tools discussed earlier has made the process of dataset generation easily accessible for researchers. We have released the source code of the iAnomaly toolkit, which includes iContinuum as its full-system emulator together with the chosen open-source data collection tools and the code for automated dataset generation, in a public repository (https://github.com/Cloudslab/iAnomaly).

V. CASE STUDY: DATASET GENERATION

This section showcases how iAnomaly was used to create an open-source labeled dataset consisting of normal and anomalous data collected for three different IoT applications, followed by an analysis of the generated performance anomaly dataset.

A. IoT Applications

The generated dataset records data for three IoT applications typically deployed in edge environments:

Fig. 4. Face detection/recognition application

a) Face detection/recognition: This application showcases the scenario of using computer vision for secure access control at a corporate office building. As illustrated in Figure 4, it comprises four microservices: 1) preprocessor, 2) face detector, 3) face recognizer, and 4) database. Cameras at entry points (e.g., doors, gates) produce an RTSP stream. The preprocessor microservice reads from this video stream, performing resizing and grayscale conversion on the images and carrying out motion detection by thresholding the difference between consecutive frames. Upon detection of motion (i.e., when an employee approaches), the frame is sent to the face detector microservice. This microservice utilizes a Multi-Task Cascaded Convolutional Neural Network (MTCNN) to detect bounding boxes and landmark points of faces in the frame. Additionally, it conducts face alignment using affine transformation and supports multi-face alignment within a single frame. Aligned faces are then input to the face recognizer microservice. The face recognizer microservice employs a ResNet-50 model to extract features from the detected faces and calculates the cosine distance between these features and the features of the authorized personnel's faces to determine the similarity between detected faces and faces of authorized personnel. Upon successful recognition, the timestamp is logged in the database, which is implemented as another microservice, marking the employee's entry into or exit from the building.

Fig. 5. Industrial machinery predictive maintenance application

b) Predictive maintenance for industrial machinery: In this application, IoT sensors that measure temperature, vibration, and pressure continuously generate multivariate time series data, which is then written to a Kafka topic. As shown in Figure 5, an orchestrator microservice subscribes to this Kafka topic, reads the raw sensor data, and forms time-series windows, which are sent to the emergency event detector microservice. Before the windowed data is sent for emergency event detection, the orchestrator imputes any missing values by calling the missing data imputer microservice and also standardizes the data using standard score normalization. The emergency event detector uses an isolation forest model to detect whether a window of standardized time series data is anomalous by detecting patterns that deviate from the norm, such as overheating, excessive vibration, or mechanical stress. When an anomaly is detected, alerts are triggered to the maintenance teams.
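As a rough illustration of the detection logic described above (not the exact implementation shipped in the iAnomaly repository), an isolation forest can be fitted on standardized windows of normal sensor readings and then used to flag deviating windows:

```python
# Sketch of isolation-forest-based window scoring for an emergency event
# detector, assuming flattened windows of standardized (z-scored) readings.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Each row is one flattened window of [temperature, vibration, pressure] samples.
normal_windows = rng.normal(0.0, 1.0, size=(500, 30))

model = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
model.fit(normal_windows)

new_window = rng.normal(3.0, 1.0, size=(1, 30))    # e.g., sustained overheating
is_emergency = model.predict(new_window)[0] == -1  # -1 marks an outlier window
print(is_emergency)
```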
Fig. 6. Location retrieval application

c) Location retrieval: This application depicts the scenario of fleet management for logistics or delivery services. It facilitates tracking the real-time location of delivery vehicles in a fleet. As shown in Figure 6, when a dispatcher or a customer requests the location of a specific vehicle, the request is directed to the location retriever microservice. This microservice checks its Least Recently Used (LRU) cache, and if the vehicle's location was recently queried, it provides the cached location for quick access. If the location is not in the cache, the microservice queries the vehicle's current location, updates the cache, and returns the most up-to-date information. In our implementation, the location simulator microservice is used to mock the location of the delivery vehicle by generating GPS locations in a trajectory.
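To make the caching behaviour concrete, the following is a small, self-contained sketch of a timed LRU cache of the kind the location retriever could use; the capacity and time-to-live values are arbitrary examples rather than our implementation's settings.

```python
# Minimal timed LRU cache sketch: entries expire after a TTL, and the least
# recently used entry is evicted when the cache is full.
import time
from collections import OrderedDict

class TimedLRUCache:
    def __init__(self, capacity: int = 128, ttl_seconds: float = 30.0):
        self.capacity = capacity
        self.ttl = ttl_seconds
        self._store = OrderedDict()  # vehicle_id -> (timestamp, location)

    def get(self, vehicle_id):
        entry = self._store.get(vehicle_id)
        if entry is None or time.monotonic() - entry[0] > self.ttl:
            self._store.pop(vehicle_id, None)  # drop missing or expired entry
            return None
        self._store.move_to_end(vehicle_id)    # mark as recently used
        return entry[1]

    def put(self, vehicle_id, location):
        self._store[vehicle_id] = (time.monotonic(), location)
        self._store.move_to_end(vehicle_id)
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)    # evict least recently used
```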

TABLE II
PROPERTIES OF IOT APPLICATIONS USED TO GENERATE PERFORMANCE ANOMALY DATASET

| Application | Microservice | Type of task performed | Software architecture | Service composition pattern | QoS properties |
| --- | --- | --- | --- | --- | --- |
| Face detection/recognition application | Preprocessor | Computer vision-based | Stream processing | Chained | LC, HTp, HCI, BI |
| | Face Detector | Computer vision-based | Request-response | Chained | LC, MTp, HCI |
| | Face Recognizer | Computer vision-based | Request-response | Chained | LC, LTp, HCI |
| | Database | General purpose | Request-response | Chained | LT, LTp |
| Industrial machinery predictive maintenance application | Orchestrator | Time-series processing | Publish-subscribe | Aggregator | LC, LTp |
| | Emergency event detection | Time-series processing | Request-response | Aggregator | LC, LTp, MCI |
| | Missing data imputation | General purpose | Request-response | Aggregator | LC, LTp, MCI |
| Location retrieval application | Location service with timed cache | Simple non-resource-intensive | Request-response | Passthrough | LC, HTp |

LC: Latency Critical, LT: Latency Tolerant, HTp: High Throughput, MTp: Moderate Throughput, LTp: Low Throughput, HCI: High Compute Intensive, MCI: Moderate Compute Intensive, BI: Bandwidth Intensive

As shown in Table II, the aforementioned applications span the properties of a wide range of IoT applications. For instance, the face detection/recognition application falls under Computer Vision (CV), the industrial machinery predictive maintenance application focuses on time-series processing, and the location retrieval application is a simple, non-resource-intensive application. Each application relies on different software architectures, including stream processing, request-response, and publish-subscribe, which are implemented using different communication protocols, including HTTP, Kafka, RTSP, and MySQL. The applications also employ different service composition patterns, such as chained, aggregator, and passthrough. Most importantly, the microservices in these applications have unique QoS and resource requirements.

The iAnomaly repository also contains the Python implementation for all three IoT applications. In addition, we have made the Docker images of the microservices accessible to the public (https://hub.docker.com/repository/docker/dtfernando/ianomaly).

B. Dataset Generation

The applications presented in Section V-A were deployed in a Kubernetes cluster, where iAnomaly was responsible for the orchestration and automation of the data generation and collection process. Specifically, the physical environment consisted of ten VMs with heterogeneous computing and storage specifications created in the Melbourne Research Cloud (https://dashboard.cloud.unimelb.edu.au/) to emulate the worker nodes. In particular, two 2vCPU/8GB VMs represented the IoT layer, four 2vCPU/8GB VMs represented Fog level 1, three 4vCPU/16GB VMs represented Fog level 2, and one 8vCPU/32GB VM represented Fog level 3. The configuration of the emulated network was determined based on existing related research [18], with the following specifications: IoT layer → Fog level 1: 5 ms/100 Gbps, Fog level 1 → Fog level 2: 20 ms/10 Gbps, Fog level 2 → Fog level 3: 50 ms/0.15 Gbps, and 2 ms latency among nodes at the same level. The applications were assigned to the worker nodes by following the QoS-aware scheduling algorithm proposed by Pallewatta et al. in their research study [19].

Fig. 7. Dataset generation configuration YAML file of the location retrieval application

Thereafter, normal and anomaly data were generated from each application by providing dataset generation configurations specified in the form of YAML files. For example, the configuration shown in Figure 7 was used to generate data from the location retrieval application. This configuration instructs iAnomaly to generate and collect normal performance data from the location retriever microservice over a duration of three hours. Additionally, it specifies the injection of five types of anomalies—CPU hog, memory stress, user surge spike, user surge step, and network delay—over a total duration of two hours. Similar configurations were used to collect performance data from all applications.
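Figure 7 itself is not reproduced here; purely as an illustration of the information such a configuration carries, a hypothetical YAML along the following lines would express the behaviour described above. All field names are invented for this sketch and do not necessarily match the schema accepted by the iAnomaly toolkit.

```yaml
# Hypothetical sketch of a dataset generation configuration (field names are
# illustrative only; consult the iAnomaly repository for the actual schema).
application: location-retrieval
target_microservice: location-retriever
normal_data:
  test_plan: location_retrieval_normal.jmx
  duration: 3h
anomalies:
  total_duration: 2h
  types:
    - cpu_hog           # server-side, injected via Chaos Mesh
    - memory_stress     # server-side
    - network_delay     # server-side
    - user_surge_spike  # client-side, injected via the anomalous JMeter instance
    - user_surge_step   # client-side
output_path: /data/location_retrieval/
```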

Fig. 8. Distribution of records by anomaly type

While five types of anomalies (two client-side and three server-side) were injected into the location retrieval application, only a subset of these anomalies was introduced into the other applications. This selection was based on the likelihood of each anomaly type occurring in real-world conditions for the respective applications. Figure 8 depicts the distribution of different types of anomalies across the dataset. It is also important to note that data for each application was collected independently to avoid anomalies caused by colocation.

Pixie's default granularity of 10 seconds was used when collecting data. For each application, data was collected across 12 metrics, covering key aspects of system and application performance: disk read and write throughput (total_disk_read_throughput, total_disk_write_throughput), memory usage (rss, vsize), CPU utilization (cpu_usage), network activity (rx_bytes_per_ns, tx_bytes_per_ns), latency percentiles (latency_p50, latency_p90, latency_p99), request throughput (request_throughput), and error rate (errors_per_ns), collectively providing a comprehensive view of resource consumption, network efficiency, and service reliability.

The final dataset comprises a total of 30240 records. Of these, 19260 records correspond to 54 hours of normal data, and 10980 records account for 31 hours of anomalous data. Within this dataset, there are 1512 records labelled as anomalous data points, resulting in an anomaly ratio of 5%. This ratio is consistent with the anomaly ratio of other standard anomaly datasets, such as SMD (5.84%) and ASD (4.61%) [20], indicating that our dataset maintains a realistic anomaly density—an important characteristic of a high-quality anomaly dataset [21]. The collected dataset is also made available in the iAnomaly repository.

C. Analysis of the Generated Dataset

Fig. 9. Collinearity among metrics in the generated dataset

Figure 9 illustrates the collinearity among the metrics in the generated dataset after excluding the errors_per_ns metric, which contains all-zero values. The plot shows that inherently related groups of metrics, such as disk read/write throughputs, as well as latency percentiles, are highly correlated. In addition, metrics such as request_throughput and tx_bytes_per_ns, as well as disk read/write throughputs and rx_bytes_per_ns, exhibit a strong positive correlation with each other. Outside of these groups, most other metrics do not show strong correlations with each other, indicating that each metric serves a unique purpose within the dataset. To simplify our analysis, we select a single metric from each identified group of correlated metrics to serve as a proxy for the others in the group. Based on this correlation analysis, we identify cpu_usage, rss, rx_bytes_per_ns, vsize, request_throughput, and latency_p50 as the subset of metrics with the lowest collinearity. Consequently, we will focus on these metrics for further analysis.
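The correlation screening described above can be reproduced with a few lines of pandas over the released records; the file name and column layout below are assumptions about how a user might load the dataset rather than a prescribed interface.

```python
# Sketch of the collinearity screening over the 12 collected metrics,
# assuming the released records have been loaded into a pandas DataFrame.
import pandas as pd

df = pd.read_csv("ianomaly_records.csv")            # hypothetical file name
metrics = df.drop(columns=["timestamp", "label"], errors="ignore")

# Drop metrics that never vary (e.g., errors_per_ns contains only zeros).
metrics = metrics.loc[:, metrics.std() > 0]

corr = metrics.corr().abs()                         # pairwise |Pearson correlation|
print(corr.round(2))

# Greedily keep one representative per highly correlated group (threshold 0.9).
selected = []
for col in corr.columns:
    if all(corr.loc[col, kept] < 0.9 for kept in selected):
        selected.append(col)
print("low-collinearity subset:", selected)
```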

Fig. 10. Distribution of normal data across shortlisted metrics

Figure 10 illustrates the distribution of normal data for each shortlisted metric, focusing on three instances of the location retriever microservice and one instance of each of the other microservices. The three instances of the location retriever correspond to different deployments in regions with varying user populations. In the latency_p50 subplot, it is evident that the preprocessor, face detector, and face recognizer microservices experience the highest latency. This increased latency is attributed to their highly compute-intensive (HCI) nature, which requires more time to process and respond to individual requests. In contrast, the other microservices demonstrate lower latency due to their relatively lower compute intensity.

The second subplot represents the request_throughput metric. Here, we can observe that the preprocessor and location retriever microservices exhibit high request throughput (HTp), while the face detector shows moderate throughput (MTp). The other microservices fall into the low throughput (LTp) category. These observations confirm the expected QoS properties of the microservices, as listed in Table II. The third subplot corresponds to the cpu_usage metric. Despite being HTp, the location retriever microservices result in low CPU usage since they are not computationally intensive. The preprocessor microservice shows the highest CPU usage due to its HCI nature and high request throughput. The face detector microservice, while also HCI, has moderate throughput and, therefore, has the second-highest CPU usage. The rest of the microservices have low CPU usage due to their LTp nature.

The rx_bytes_per_ns subplot confirms the bandwidth-intensive (BI) nature of the preprocessor microservice. The final subplot, which corresponds to the rss metric, indicates that the three computer vision microservices, together with the emergency event detector and the missing data imputer (which also utilize machine learning models for processing), have high rss values, demonstrating significant memory usage. By comparing these subplots, we can see that the diversity of the selected applications allows our collected dataset to effectively capture the variations in QoS and resource requirements of the microservices, as expected from an edge dataset.

Figure 11 contains the Probability Density Functions (PDFs) for the normal and anomalous data distributions of two randomly selected metrics from the datasets of the preprocessor and face detector microservices. Subfigure 11(a) corresponds to latency_p50 of the preprocessor microservice, while subfigure 11(b) corresponds to cpu_usage of the face detector microservice. Both subplots illustrate that the anomalous data overlaps with the distribution of normal data. This overlap proves that the anomalies present in our dataset are non-trivial and not merely outliers. Wu and Keogh [21] have identified the presence of such non-trivial anomalies as a property of a good anomaly dataset.

Furthermore, Figure 12 visualizes a few selected anomalies from the dataset. While certain anomalies are easily noticeable using their respective metrics - for example, user surge anomalies are evident from the increase in the latency_p50 metric (Figure 12(a)), and memory stress is apparent from the rss metric (Figure 12(b)) - some anomalies, such as CPU stress, cannot be detected simply by looking at the cpu_usage metric (Figure 12(c)), especially when they occur in compute-intensive microservices. During such scenarios, which are non-trivial to detect, algorithms that are capable of analyzing the higher-order relationships and behavior of several metrics are required to make an accurate detection.

Successful collection of the dataset was possible due to iAnomaly's use of an optimal set of open-source tools. In particular, leveraging Pixie as the monitoring tool allowed it to gather metric data from all three IoT applications, each using different communication protocols. In contrast, using a regular full-system emulator like iContinuum would only allow data collection from the location retrieval application, which uses HTTP for communication. Furthermore, iAnomaly's automated dataset generation capabilities led to an 87% reduction in code lines compared to using a regular full-system emulator such as iContinuum during our dataset generation. Notably, iAnomaly completely eliminated the need for human intervention during the data collection process, requiring only 31 lines of configuration per microservice, while iContinuum needs 307 lines of code and significant human involvement to generate the same dataset.

VI. CONCLUSIONS AND FUTURE WORK

Since existing research on performance anomaly detection in edge computing environments is limited due to the absence of publicly available edge performance anomaly datasets and due to the lack of accessibility of real edge setups to generate necessary data, we propose iAnomaly: a full-system emulator with performance anomaly dataset generation capabilities. Towards that, it is equipped with open-source tools such as Pixie for monitoring, JMeter for normal workload generation and client/sensor-side anomaly injection, as well as Chaos Mesh to introduce server-side anomalies. It also incorporates a dataset generation orchestrator to facilitate automatic data generation and collection based on user-defined configurations.

As a case study, we generated a performance anomaly dataset using iAnomaly. It contains performance data for various microservice-based IoT applications with different QoS and resource requirements, injected with anomalies on both the client/sensor side and the server side. Analysis of this dataset showed that it represents the characteristics of real edge environments, and the anomalous data in the dataset meets the required standards for high-quality performance anomaly datasets. We have made this dataset available to the public. Additionally, we have released the iAnomaly toolkit for other researchers who may need to collect a more extensive dataset or conduct further anomaly detection research.

The iAnomaly toolkit can easily be extended to collect trace data alongside metrics, which is particularly useful for research on root cause localization (RCL). Extending to such multi-source datasets is possible due to Pixie's support. Researchers can further enhance this framework to conduct real-time experiments on anomaly-aware resource management. The toolkit and the released dataset are not limited to anomaly detection research. Normal data from the dataset, as well as normal data generated from the iAnomaly toolkit, can be used as foundational traces for experiments in other related research areas, such as resource scheduling and resource management.

Fig. 11. PDFs for normal and anomalous data distributions of selected metrics: (a) latency_p50 of the preprocessor; (b) cpu_usage of the face detector

Fig. 12. Visualization of selected anomalies from the dataset: (a) user surges using latency_p50; (b) memory stress using rss; (c) CPU stress using cpu_usage

REFERENCES

[1] F. Al-Doghman, N. Moustafa, I. Khalil, N. Sohrabi, Z. Tari, and A. Y. Zomaya, "AI-enabled secure microservices in edge computing: Opportunities and challenges," IEEE Transactions on Services Computing, vol. 16, pp. 1485–1504, 2023.
[2] C. Wu, Q. Peng, Y. Xia, Y. Jin, and Z. Hu, "Towards cost-effective and robust AI microservice deployment in edge computing environments," Future Generation Computer Systems, vol. 141, pp. 129–142, 2023.
[3] S. Becker, F. Schmidt, A. Gulenko, A. Acker, and O. Kao, "Towards AIOps in edge computing environments," in Proceedings of the 2020 IEEE International Conference on Big Data, Atlanta, GA, USA, 2020.
[4] M. Soualhia, C. Fu, and F. Khomh, "Infrastructure fault detection and prediction in edge cloud environments," in Proceedings of the 4th ACM/IEEE Symposium on Edge Computing, Arlington, Virginia, 2019.
[5] J. Hunter, "Deep learning-based anomaly detection for edge-layer devices," Master's thesis, University of Tennessee at Chattanooga, 2022. [Online]. Available: https://scholar.utc.edu/theses/740/
[6] S. Tuli, S. Tuli, G. Casale, and N. R. Jennings, "Generative optimization networks for memory efficient data generation," 2021. [Online]. Available: https://arxiv.org/abs/2110.02912
[7] S. Tuli, F. Mirhakimi, S. Pallewatta, S. Zawad, G. Casale, B. Javadi, F. Yan, R. Buyya, and N. R. Jennings, "AI augmented edge and fog computing: Trends and challenges," Journal of Network and Computer Applications, vol. 216, 2023.
[8] S. Skaperas, G. Koukis, I. A. Kapetanidou, V. Tsaoussidis, and L. Mamatas, "A pragmatical approach to anomaly detection evaluation in edge cloud systems," in International Workshop on Intelligent Cloud Computing and Networking, ser. ICCN '24, Vancouver, Canada, 2024.
[9] R. Mayer, L. Graser, H. Gupta, E. Saurez, and U. Ramachandran, "EmuFog: Extensible and scalable emulation of large-scale fog computing infrastructures," in 2017 IEEE Fog World Congress (FWC), Santa Clara, CA, USA, 2017.
[10] A. Coutinho, F. Greve, C. Prazeres, and J. Cardoso, "Fogbed: A rapid-prototyping emulation environment for fog computing," in 2018 IEEE International Conference on Communications (ICC), Kansas City, MO, USA, 2018.
[11] J. Hasenburg, M. Grambow, E. Grünewald, S. Huk, and D. Bermbach, "MockFog: Emulating fog computing infrastructure in the cloud," in 2019 IEEE International Conference on Fog Computing (ICFC), Prague, Czech Republic, 2019.
[12] Y. Zeng, M. Chao, and R. Stoleru, "EmuEdge: A hybrid emulator for reproducible and realistic edge computing experiments," in 2019 IEEE International Conference on Fog Computing (ICFC), Prague, Czech Republic, 2019.
[13] M. Symeonides, Z. Georgiou, D. Trihinas, G. Pallis, and M. D. Dikaiakos, "Fogify: A fog computing emulation framework," in 2020 IEEE/ACM Symposium on Edge Computing, San Jose, CA, USA, 2020.
[14] N. Akbari, A. N. Toosi, J. Grundy, H. Khalajzadeh, M. S. Aslanpour, and S. Ilager, "iContinuum: An emulation toolkit for intent-based computing across the edge-to-cloud continuum," in Proceedings of the 2024 IEEE 17th International Conference on Cloud Computing, ser. CLOUD '24, Shenzhen, China, 2024.
[15] Y. Su, Y. Zhao, C. Niu, R. Liu, W. Sun, and D. Pei, "Robust anomaly detection for multivariate time series through stochastic recurrent neural network," in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD '19, Anchorage, AK, USA, 2019.
[16] S. Nedelkoski, J. Bogatinovski, A. K. Mandapati, S. Becker, J. Cardoso, and O. Kao, "Multi-source distributed system data for AI-powered analytics," in Proceedings of the 8th European Conference on Service-Oriented and Cloud Computing, ser. ESOCC '20, Crete, Greece, 2020.
[17] J. Levin and T. A. Benson, "ViperProbe: Rethinking microservice observability with eBPF," in Proceedings of the IEEE 9th International Conference on Cloud Networking, ser. CloudNet '20, 2020.
[18] S. Pallewatta, V. Kostakos, and R. Buyya, "Microservices-based IoT application placement within heterogeneous and resource constrained fog computing environments," in Proceedings of the 12th IEEE/ACM International Conference on Utility and Cloud Computing, ser. UCC '19, Auckland, New Zealand, 2019.
[19] S. Pallewatta, V. Kostakos, and R. Buyya, "QoS-aware placement of microservices-based IoT applications in fog computing environments," Future Generation Computer Systems, vol. 131, pp. 121–136, 2022.
[20] Z. Li, Y. Zhao, J. Han, Y. Su, R. Jiao, X. Wen, and D. Pei, "Multivariate time series anomaly detection and interpretation using hierarchical inter-metric and temporal embedding," in Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, ser. KDD '21, Virtual Event, Singapore, 2021.
[21] R. Wu and E. J. Keogh, "Current time series anomaly detection benchmarks are flawed and are creating the illusion of progress (extended abstract)," in Proceedings of the 2022 IEEE 38th International Conference on Data Engineering, ser. ICDE '22, 2022.
