Conference Paper · October 2021
DOI: 10.1109/IC2E52221.2021.00034

Edge MLOps: An Automation Framework for AIoT Applications

Emmanuel Raj (TietoEVRY, Helsinki, Finland), David Buffoni (TietoEVRY, Stockholm, Sweden), Magnus Westerlund (Arcada University of Applied Sciences, Helsinki, Finland), Kimmo Ahola (VTT, Espoo, Finland)

Abstract—Artificial Intelligence of Things (AIoT) is the combination of artificial intelligence (AI) technologies with the Internet of Things (IoT) infrastructure to achieve more efficient IoT operations and decision making. Edge computing is emerging to enable AIoT applications. In this paper, we develop an Edge MLOps framework for automating Machine Learning at the edge, enabling continuous model training, deployment, delivery and monitoring. To achieve this, we synergize cloud and edge environments. We experimentally validate our framework on an air quality forecasting use case. During validation, the framework remained stable and automatically retrained, integrated, and deployed models for specific environments when their performance deteriorated below a certain threshold.

Keywords—Edge Computing, Machine Learning, IoT, 5G Networks, AI, Automation, Digital Transformation, MLOps.

I. INTRODUCTION

In recent years, the Internet of Things (IoT) has grown to become ubiquitous, and this has led to a need for new process support methods to operate big data environments at the edge, as well as a demand for low-latency communication enabled by technologies such as 5G networks. This shift in infrastructure helps to digitize physical spaces and enables real-time decision making using Artificial Intelligence [1]. Artificial Intelligence of Things (AIoT) is the combination of artificial intelligence (AI) technologies with Internet of Things (IoT) infrastructure to achieve more efficient IoT applications and improved local decision making [2]. Edge computing enables generating insights and making decisions at the data source, reducing the amount of data sent to the cloud and central repository [3].

Performing Machine Learning (ML) inference at the edge has several benefits compared to traditional cloud-based machine learning [3]. In addition to reducing the amount of data exchanged, Machine Learning models tailored to specific IoT devices can be used to perform tasks customised to their respective environments [4], which is not the case with cloud-based solutions. In some use cases, consented collection and processing of sensitive personal data is required or useful for the purpose of improving an individual's daily routine [5]. For this purpose, edge computing can facilitate data- and privacy-preserving processing in accordance with privacy laws. Additionally, as we note in Section VIII, the dimensions of privacy can be further enhanced by methods such as Federated Learning, in which no edge node or centralised computer exchanges data; rather, model parameters (weights and biases) are shared [6]. Traditional ML at the edge also has some limitations. For instance, machine learning models are often trained centrally and deployed manually and individually to the edge nodes. This process of developing, deploying and monitoring Machine Learning models needs to be automated to be scalable. Security is an essential challenge in IoT and edge computing, as edge nodes are prone to attacks [7], [8]. In most cases, privacy and compliance are highly essential, and the edge-cloud setup has to comply with country-specific data privacy laws. In order to fully leverage the capabilities of AIoT and facilitate automated decision making at the data source, a framework for seamlessly integrating the two paradigms is necessary. The objective of the paper is to investigate and develop methods for enabling automated machine learning at the edge that can be used by enterprises or industry.

Hence, the focus of the research is to answer the question: How can a framework that integrates continuous delivery and continuous deployment of machine learning models at the edge be implemented using state-of-the-art tools and methods? Drawing on our experience of building large-scale, real-world machine learning pipelines utilizing Machine Learning Operations (MLOps) [9], [10], we present a new automated framework for Machine Learning Operations at the Edge (Edge MLOps). The main contributions of the paper are the proposed Edge MLOps framework and a use-case verification of a fully automated and scalable process for training, deploying and monitoring IoT data and edge node operations in real time. Finally, we discuss an extended architecture for privacy considerations utilizing federated learning.

The paper is structured so that we first review the literature in Section II and then describe the Edge MLOps framework in Section III. Section IV presents the setup for experimenting with our framework. Sections V and VI disclose the empirical choices made for carrying out the experiments. Finally, in Sections VII and VIII we discuss a future extension for improving privacy and summarize the results.

II. RELATED WORK

A large number of AI applications, such as Computer Vision, Autonomous Driving, and AIoT, can be realised in edge computing environments. Placing AI services close to
the data source can improve user experience by reducing the latency, the dependence on network connectivity, and the communication cost [11]. Sato [9] introduced MLOps as part of DevOps best practices for ML. In [12], MLOps is presented as a solution for handling and improving AutoML and feature-specific ML engineering, such as explainability and interpretability. Further studies have described how to perform inference of AI, and more specifically Machine Learning (ML), at the edge [13], [14], [15]. To improve the efficiency of ML model inference on edge devices, efforts have been made to optimize Mobile Edge Computing (MEC) workloads by optimizing operations on the cloud and edge nodes [16], [17]. For optimizing the training of ML models on edge devices, there are attempts to build a framework to train machine learning models on resource-constrained IoT edge devices [18]. However, the training and deployment of ML models is often done manually, and only some efforts have been made to scale and automate this process.

The discipline of integrating DevOps principles and ML is in its beginnings, and the proposed solutions are still often specific to particular industries or use cases [19], [20], [21]. There are also other complementary Ops-based methods being developed, such as DataOps, that consider streamlining the data flow between services, for instance within a microservice architecture [22], [23]. There exist some attempts at building a framework for the lifecycle management of AI applications [24]. Most of them are based on cloud solutions without any focus on edge environments (e.g. [25]). Our solution proposes a framework for automating AI workflows by combining cloud and edge environments. Similar work to ours is the framework proposed in [26], [27], where cloud and edge platforms are combined. The main difference with [26] is the authors' focus on training ML models on the cloud and deploying them on edge devices, while our work focuses on automating the deployment process. In [27], the authors proposed to trigger the process by model versioning, while in our work, drops in model performance or data drift are considered. Moreover, our framework can easily be extended to a Federated Learning setting where data privacy and security are mandatory.

III. EDGE MLOPS FRAMEWORK

To enable a use case of individually trained AIoT applications on edge nodes, as further discussed in Section IV, we present an Edge MLOps framework to automate and operationalize the Continuous Delivery (CD) and Continuous Integration (CI) of ML models to the nodes. Usually, deploying ML models happens in two distinct steps. The training step fits the parameters of an algorithm to a given dataset. Once training is complete, the trained model can be used for making predictions on new data points during inference [28].

Distributed edge devices may have several advantages over a conceptually centralized cloud solution. For instance, they allow for processing data in real time, specialized modelling per node, minimized communication needs, and improved data privacy protection. However, edge devices tend to have low processing power, small storage capacity, and varied specifications. Training machine learning models on edge devices is a difficult challenge, as ML can be intensive in terms of computation, memory, and storage. However, edge devices connected to a mains power source are often sufficient for ML inference.

To provision computation resources for machine learning model training, an alternative is to use a cloud platform. Cloud platforms have several advantages in that they provide powerful processing and computation resources, massive storage, and efficient management. However, they may be inadequate because of insufficient data privacy, high bandwidth costs, and reliance on a network connection. For the interested reader, a more complete description of the strengths and weaknesses of both platforms can be found in [26]. To combine the best of both worlds, we present an ML lifecycle management framework, Edge MLOps, that uses a hybrid edge and cloud solution.

The overall requirements are addressed by the framework in the following manner, using the cloud platform and edge devices:

Cloud platform
• ML pipeline for model training and retraining
• CI and CD
• Source code management
• Data storage
• Fleet Analytics (dashboard)

Edge devices
• Continuous Integration and data streaming with IoT devices
• Continuous Integration and Delivery from cloud to edge
• Hyper-personalization
• ML inference

The framework facilitates the above requirements and functionalities in a modular way. On a high level, there are two main layers in the framework. 1) The Cloud Orchestration layer, which runs on the cloud platform, is dedicated to ML model training and retraining, CI-CD jobs, centralized data storage, source code management, the analytics dashboard, and data and model governance. 2) The Edge Inference layer is dedicated to orchestrating ML inference and operations for IoT devices, such as continuous integration and streaming data from the IoT devices. To enable robust and low-latency network communication between these two layers, 5G networks or WLAN can be used for communicating continuous delivery triggers and for sending and receiving data as needed. An overview of our proposal for Edge MLOps is shown in Figure 1.

The advantages of such a framework are a high level of decoupling through modularity between the two layers, and utilization of the strengths of both platforms. In the following sub-sections we describe the two layers, Cloud Orchestration and Edge Inference.
Fig. 1. Proposed Framework Architecture for Edge MLOps for AIoT Applications.

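The two layers of Figure 1 cooperate through one recurring decision: each edge node scores its deployed model against incoming ground truth and pulls a replacement from the cloud-side registry when the error crosses a threshold. The sketch below is our own minimal illustration of that loop (the registry dictionary and function names are hypothetical, not the paper's implementation); the RMSE-versus-threshold rule itself is the one the framework uses, as detailed in Section V.

```python
def rmse(y_true, y_pred):
    """Root Mean Square Error between two equal-length sequences."""
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)) ** 0.5

# Hypothetical stand-in for the cloud-side model registry: version -> model.
MODEL_REGISTRY = {1: "model-v1", 2: "model-v2"}

def latest_model_version():
    """Cloud side: return the newest registered model version."""
    return max(MODEL_REGISTRY)

def edge_monitor_step(current_version, actuals, forecasts, threshold):
    """Edge side: keep the current model unless its forecast error meets
    or exceeds the drift threshold; then pull the latest version."""
    if rmse(actuals, forecasts) >= threshold:
        return latest_model_version()
    return current_version
```

In the actual framework this check is invoked by a periodic CI-CD trigger rather than called inline, and a replacement deployment also kicks off retraining on the cloud.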
A. Cloud Orchestration

In the Cloud Orchestration layer, multiple services run on the cloud to perform parallel jobs, which adds additional complexity to the whole process. ML modeling is used to enable fast and reasonably accurate predictions for newly generated data, which is often the case in an AIoT setting. We train our models on the cloud due to the availability of high-performance compute resources, and we use an ML pipeline to facilitate the training of the ML models on the cloud. Consequently, the machine learning steps for handling specialized ML models per device include training, testing, selecting the best model candidates for deployment, monitoring performance in production, and maintaining prediction quality. In a scalable setting these steps become tedious tasks that cannot be addressed manually. Instead, we propose to handle them with a Cloud Orchestration layer that is composed of four functionalities:

1) Machine Learning Pipelines: An ML pipeline which builds and deploys models is usually proposed by cloud vendors as a basis for cloud-based MLOps. Such a service provides computational resources and data storage on demand to enable training of machine learning models. Typically, this service also comes with an ML model repository and container storage to enable faster deployment. We use the ML pipeline to train and retrain ML models. To train and validate an ML pipeline for a specific dataset, we consider these five steps:
• Data Ingestion: the ingestion script procures data (based on parameters) and versions the data which will be used for ML model training. As a result of this step, any experiment (i.e. model training or re-training) can be audited and traced back.
• Model Training and Retraining: a script that performs all the traditional steps in ML, such as data pre-processing, feature engineering and feature scaling, before training or retraining any model. Usually, ML models have a set of hyperparameters to tune for fitting the model to the dataset (training set). This step can be done manually, but efficient and automatic solutions such as GridSearch or RandomSearch [29] exist.
• Model Evaluation: the inference of the trained model is evaluated according to different metrics on a separate set of data points named the test set. The output of this step is a report on the model performance.
• Model Packaging: once the trained model has been tested, the model is serialised and packaged into a container for standardized deployments.
• Model Registering: the model candidate, which has been serialised, is registered and stored in the model registry, from where it is ready for quick deployment into an edge device.

2) Storage: For storing structured and unstructured data, services are commonly available in a cloud vendor's catalogue. Cloud storage solutions are usually auto-scaled depending on demand, which makes them convenient and efficient to use. To be able to keep track of model versions and trace back if needed, data versioning and governance are necessary.

3) Fleet Analytics: Fleet analytics is a data analysis and monitoring tool for data generated from the edge and IoT devices and also for the predictions made by the deployed models. Cloud vendors usually propose some ready-made tools for checking the health and performance of IoT and edge devices. A custom dashboard can also be created to monitor these insights in real time [30].

4) CI/CD: The Continuous Integration and Continuous Deployment pipeline enables continuous delivery to the edge layer. The goal of the CI/CD pipeline is to deploy, retrain and maintain models while maintaining end-to-end traceability (e.g. maintaining logs, tracking versions of models deployed as well as the datasets and source code used for model training). The CI/CD pipeline also enables triggers to perform necessary jobs in parallel, build artefacts, and release them for deployment to edge devices.

B. Edge Inference

The Edge Inference layer focuses on orchestrating operations for the IoT and the edge devices. It enables real-time machine learning inference and works in sync with the Cloud Orchestration layer for deploying the ML models to the edge and transferring data from devices for training and testing purposes. More details on each capability of the Edge Inference layer are given below:

1) Continuous Integration and Delivery from cloud to edge: Continuous delivery to the edge is driven by a high-speed and low-latency network for stable and robust communication and operations. We propose using a private network (e.g. WLAN) to support the communication between the edge devices, the IoT devices, and the Cloud layer. Inside this network, edge devices communicate and infer data from the sensors using the MQTT protocol. For stability and standardization reasons, all these operations are run inside containers (e.g. Docker containers). The continuous deployment pipeline facilitates model deployments in containers to the edge, data transfer, and monitoring (of ML models and edge devices).

2) Hyper-personalization at the edge: Edge and IoT devices in a network can be used in different environments, consist of different hardware, and be located in separate geographical locations. In many cases, the devices need to perform tasks customized to their respective environments. In such cases, edge nodes can enable customization for each device using a custom ML model. This way, ML models deployed at the edge can optimize and retrain themselves when needed, while constantly learning to serve better. We train ML models on the cloud (see sub-section III-A), using data from the respective rooms or environments to train customized ML models.

However, devices often come with different hardware architectures (e.g. arm64, x86, etc.) or sensor setups, so personalizing a model for deployment demands customization for the specialized data and for the variety in the edge hardware. By using appropriate Docker containers tailored to each architecture, we improve hardware stability and standardization among the nodes in the network.

3) Automated Machine Learning at the edge: Machine Learning inference and monitoring are automated as part of Continuous Deployment operations. A periodic trigger from the CI-CD process in the Cloud Orchestration layer is implemented to invoke the monitoring feature in the edge devices, to evaluate model drift, and to replace or retrain the existing ML model with an alternative [31]. With this approach, the whole process of ML inference at the edge is automated in real time.

IV. EXPERIMENTAL SETUP

To validate the Edge MLOps framework discussed in Section III, we carried out experiments on a VTT Technical Research Centre campus in Finland, where IoT devices and edge devices were set up as illustrated in Figure 2. The aim of the experiment is to continuously monitor air quality and conditions in 3 rooms, each of them equipped with differing IoT devices.

Fig. 2. Overview of the experimental setup for 3 rooms. Each room is equipped with an IoT device and an edge device.

Below we present the different hardware, software and network solutions we used in our experiments. We chose industry-standard solutions to make our framework easily replicable by industry.

1) Hardware Tools: For edge devices we used the Raspberry Pi 4, the Nvidia Jetson Nano 2 and the Google TPU edge.

2) Software Tools: For software development, we chose common data science tools for the tech stack, covering data analysis, model training and deployment, and monitoring: Python, Linux and Docker. Docker containers are used to deploy applications at runtime and are an industry standard. Microsoft Azure is used as the cloud service because it incorporates a plethora of reliable services for Ops and edge computing.

3) IoT Devices and Networking: In this work, we use Treon Node IoT sensors to establish an office air quality monitoring system. In addition to temperature and humidity, the sensor also measures volatile organic compounds (VOCs), producing estimates of air quality. The sensors form a Wirepas sensor mesh network, which features adaptive routing and diagnostics monitoring for both the nodes and the network. A node sends measurements every five minutes, and the sensor mesh network relays the data to a gateway. The gateway then uses the Message Queue Telemetry Transport (MQTT) protocol, a robust real-time messaging protocol for IoT applications [32], to send the information to a Mosquitto MQTT broker running in a VTT Technical Research Centre test network. The MQTT protocol was chosen because it is widely used in IoT environments and was the basis of the edge environment already installed as part of a different research project. The edge nodes run behind a NAT router, also in the test network, and they have access to the IoT sensor data via the MQTT broker.
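As a concrete illustration of this sensor-to-edge path, an edge process subscribed to the broker only needs to decode each node's measurement message before inference. The JSON payload layout below is an assumption for illustration (the actual Treon/Wirepas message format is not specified here), but the fields mirror the event parameters used in this work.

```python
import json

# Assumed JSON payload layout for one sensor event; the real
# Treon message format may differ.
REQUIRED_FIELDS = ("temperature", "humidity", "pressure", "air_quality")

def parse_sensor_event(payload: bytes) -> dict:
    """Decode one MQTT message from the broker into typed readings.

    Raises ValueError if a required measurement is missing, so a
    malformed message never reaches the inference step.
    """
    event = json.loads(payload.decode("utf-8"))
    readings = {}
    for field in REQUIRED_FIELDS:
        if field not in event:
            raise ValueError(f"missing field: {field}")
        readings[field] = float(event[field])
    readings["room"] = str(event.get("room", "unknown"))
    return readings
```

In the deployed container, a function like this would sit in the MQTT client's message callback; since nodes publish only every five minutes, the handler's cost is negligible even on a Raspberry Pi class device.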
4) Central Storage: For storing all of our data, such as sensor outputs, ML models, training datasets, artifacts and logs, we provisioned auto-scalable storage on the cloud (Azure Blob Storage) to serve our storage needs. To handle data privacy, we provisioned our central storage in the region closest to us, with access restricted to authorized services and parties only. To achieve that in practice, we used a VLAN and subnet configuration. Additionally, the storage is provisioned with full end-to-end data lineage, allowing data traceability.

In the following Sections V and VI we detail our work as two separate cycles, design and empirical, in accordance with the design science method [33].

V. DESIGN CYCLE

To design a robust and scalable framework for Edge MLOps for AIoT applications, we did multiple iterations and assessments as part of the design cycle. Firstly, we mapped out the data flow as shown in Figure 3 for the use case (based on the requirements discussed in Section III). The data flow is observed in four phases: Data Processing and Model Training, Model Selection, Inference, and Fleet Analytics and Actions. In the Data Processing and Model Training phase, ML models are trained using the training data (collected from the IoT sensors over a period of 3 months). The trained models are packaged and registered into the Model Registry. For Model Selection, we select the best models using the quantitative model recommendations (as described in Section VI). The selected models are then deployed using containers on the respective edge devices. The ML model detects any anomalies (bad air quality predictions) via the container port. The predicted anomalies are reported to the Edge Fleet Analytics dashboard, which raises an alarm notifying the guardian.

Fig. 3. The dataflow of the use case.

We assessed services from Microsoft Azure with the goal of facilitating various steps in the framework, such as continuous delivery, deployment and monitoring on edge devices, and cloud-to-edge communication orchestration, to name a few. Then, suitable services were chosen based on the maturity of each service for the experiment and its efficiency for enabling orchestration and automated pipelines for synergy between edge and cloud. The selected services were Azure ML Service (for MLOps), Azure DevOps (for source code management), Azure IoT Central (for fleet analytics), Power BI (for fleet analytics) and Azure Blob Storage (for data and model storage).

1) Continuous Integration from IoT to Edge: We communicate with the IoT devices from the edge devices and establish continuous integration between the respective IoT devices and edge devices. Communication between the IoT and edge devices is established using the MQTT protocol, which has become a standard pub/sub platform for IoT applications.

2) Continuous Integration from IoT to Cloud: Two processes were implemented for Continuous Integration, delivery and deployment, as shown in Figure 4. These processes were set up to facilitate end-to-end Continuous Integration from the IoT devices to the Cloud through the edge devices. The two scripts run inside a Docker container and orchestrate the incoming data and the ML pipelines. For system resilience, we configured an automated restore-to-default firefighting measure: whenever a node failure is detected, the edge device is redeployed with the respective Docker container to restore the default state.

Fig. 4. Docker container deployed in each edge device.
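One concrete piece of the container's work is serialization: each processed event is appended, together with its 15-minute forecast, to a CSV file held temporarily in the container. A minimal sketch of that append step (the column set is our own illustration, not the paper's exact schema):

```python
import csv
import io

# Illustrative column set: processed readings plus the model's
# 15-minute air quality forecast.
COLUMNS = ["timestamp", "room", "air_quality", "forecast_15min"]

def append_event(buffer, event, forecast):
    """Append one processed event and its forecast as a CSV row,
    writing the header first if the buffer is still empty."""
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    if buffer.tell() == 0:
        writer.writeheader()
    row = {key: event[key] for key in COLUMNS[:-1]}
    row["forecast_15min"] = forecast
    writer.writerow(row)

# In-memory stand-in for the CSV file stored inside the container.
buffer = io.StringIO()
append_event(buffer, {"timestamp": "2019-10-15T00:00", "room": "A10", "air_quality": 61.9}, 63.2)
```

Keeping the file local to the container and shipping it upward in batches fits the paper's design: the cloud receives training data without the edge streaming every row individually.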
• Ingestion, Inference and Serialization Process:
– This process enables and maintains continuous integration from sensor to edge by fetching data in real time. Received data is pre-processed, cleaned and formatted for ML inference.
– A trained ML model (from the Cloud) running inside the Docker container consumes the new data and forecasts the air quality for the next 15 minutes.
– Processed data from the IoT device and the forecasts are concatenated together and appended to a Comma-Separated Values (CSV) file temporarily stored in the Docker container.
• Time Trigger Process: A scheduled time trigger is implemented for evaluating potential drift in ML model performance by comparing earlier forecasts against the actual values subsequently observed. If the evaluated performance difference is greater than or equal to a predefined threshold, the process makes a call to look for and deploy an alternative model from the Cloud.

These two processes form the bridge between the IoT sensors and the Cloud and enable Continuous Integration between the IoT devices and the Cloud. Thanks to that, Continuous Delivery and Deployment are possible.

3) Continuous Delivery and Deployment for Edge: In this section we look into important aspects of Continuous Delivery and Deployment on the edge. We rely on the Azure DevOps service for the CI-CD pipeline. After configuring a secure connection through SSH between Azure DevOps and each edge device, the CI-CD pipeline orchestrates the Cloud services for training the models (Azure ML Service) and storing them (Azure Blob Storage).

In the following, we describe each phase of the CI-CD pipeline:

1) Release: A release is triggered to monitor the edge devices and check model performance in real time. A new model is trained using both real-time data and previous data snapshots. In our experiments, time-based and manual triggers were implemented, but other options are possible, such as triggering on a source code commit or on newly trained models.

2) Monitor edge and deploy: the aim is to monitor the ML models deployed on the edge devices by evaluating their drift. Processing of the data stored on each edge device is done in parallel. In practice, we evaluate the air quality forecast (for the next 15 minutes) against the actual air quality records with the Root Mean Square Error (RMSE) [28]. In case the RMSE is greater than or equal to a fixed and empirically defined threshold, a call is made to replace the existing model with a new version from the Model Repository. The approach is generic and can be used with other evaluation metrics and other threshold values. This ensures the Continuous Deployment of ML models to the edge.

3) Model Retrain: When a call for replacing a model is made, the existing model is fine-tuned on the Cloud with the latest real-time data to keep it up to date. Then, the model is stored in the Model Repository for future deployments.

VI. EMPIRICAL CYCLE

We now describe how we empirically evaluate and validate each step of the Edge MLOps framework.

A. Machine Learning Operations with Cloud Orchestration

We first experiment with all the steps of the Machine Learning pipeline on the Cloud Orchestration layer, where the Machine Learning Operations are done (see Figure 5).

Fig. 5. Machine Learning pipeline on the Cloud Orchestration layer.

We train our models on the cloud due to the availability of high-performance compute resources, and we use an ML pipeline to facilitate the training of the ML models on the cloud. For this purpose, we use the Azure Machine Learning service as a platform where we can provision compute, storage and the needed infrastructure on request for ML Operations. Using the Azure ML service, we can train, manage, deploy, and audit models (including model traceability, with data and source code versioning support from training). Models are trained separately (using the respective rooms' datasets) to be deployed on the respective edge devices. We train a dedicated model for each room using the data from the particular room where the edge device is set up (and in which we deploy the custom-trained model). A dedicated model for each room or device enables model hyper-personalization (as introduced in Section I) in the respective environment [24].

1) Data Ingestion Step: Data ingestion and data exploration are done in this step. For the experiment, data was collected for 3 months, from 15 October 2019 to 15 January 2020, from 26 different IoT devices, generating a total of 537,873 events. Each event contains 13 parameters, which are presented in Table I.

TABLE I: Event description. IAQ stands for Indoor Air Quality.
timestamp (datetime) | name (string) | room (string) | room type (string)
floor (string) | air quality altered (float) | air quality unaltered (float) | pressure (float)
ambient light (float) | humidity (float) | IAQ accuracy altered (float) | IAQ accuracy unaltered (float)
temperature (float)

We collaborated with the Finnish Technical Research Centre, who are interested in forecasting the air quality in real time. The aim is to train ML models that are able to predict the air quality in the rooms.
deployments. our framework to train new forecasting models frequently to
maintain good prediction performance. We report in Table II ML algorithms. We experimented with these 4 supervised ML
the descriptive statistics of the selected 3 rooms. models for producing forecasts: Multiple Linear Regression
(MLR), Support Vector Regressor (SVR), Extreme Learning
Selected Rooms
Room name Room type Unhealthy air quality Avg. air quality
Machines (ELM) and Random forest Regressor (RFR) (see
frequency [28] [35] [36] [37]). The model hyperparameters are selected
Room A10 Office room 2033 61.92 through gridsearch on a Time Series 10 Fold Cross-Validation
Room A29 Meeting 2205 61.40
Room splits.
Room A30 Meeting 1085 55.45 3) Model Evaluation Step: The best model candidates are
Room
selected based on the lowest RMSE averaged over the folds.
TABLE II: Descriptive statistics for air quality in selected Detailed results are reported in Table III.
rooms. Unhealthy air quality frequency is the number of Other feature extractions or ML models could have been
times the value provided by the IoT devices is greater than used to provide more accurate Time Series forecasts but we
a threshold (e.g. 150+). are interested in the ML model lifecycle management in its
whole.
Our goal is to provide a satisfactory forecast of the air
quality of each room. To achieve this, we need to perform Model Training Results
Room name Algorithm Cross Test RMSE
a Time Series Analysis [34] to understand better the nature Validation
of the different time series and how to model them. We plot RMSE (train)
Room A10 MLR 5.020 5.875
the room time series in Figure 6 which show a non-stationary Room A10 ELM 6.325 6.208
pattern since mean, variance and covariance are observed to Room A10 RFR 10.710 9.987
be changing over time. It may be due to several factors such Room A10 SVR 6.046 5.977
Room A29 MLR 5.362 4.158
as trends, cycles, random walks or combinations of the three. Room A29 ELM 11.202 4.223
Room A29 RFR 11.676 9.208
Room A29 SVR 8.073 4.176
Room A30 MLR 3.648 3.551
Room A30 ELM 7.920 3.895
Room A30 RFR 9.686 7.720
Room A30 SVR 5.177 3.55

TABLE III: Model training results.

4) Model Packaging Step: Once the models have been


trained, they are serialised and packaged to be ready for
inferring new incoming data on the respective edge devices.
For feature engineering we store the scaling variables into
a serialized file to be reused when models are deployed
on the edge. The trained models are serialised in the Open
Neural Network Exchange (ONNX) format. ONNX is an
Fig. 6. Air quality Time Series for the 3 selected rooms. open ecosystem for interoperable AI models, it enables model
interoperability and serialisation of ML and deep learning
2) Machine Learning Step: Based on the collected Time models in a standard format. We package the scaling variable,
Series, the goal is to train ML models for each room to serialized ONNX model and monitoring script (which mon-
forecast the air quality for the next 15 minutes. To achieve itors the model drift) into a container and register/store the
this, we have to perform Feature Engineering [28] to prepare container in the model registry (where we store and version
data for the ML models. We follow the Supervised Machine our model container) which is part of the Azure ML service.
Learning paradigm for predicting the air quality. Supervised By packaging scaling variables and models into containers we
Machine Learning models learn an objective function which can pull, deploy and run these containers on edge devices such
for any input data produces a prediction. The goal of ML as Raspberry Pi 4, Jetson Nano and Google TPU edge.
algorithms is to find a solution minimising the number of 5) Model Registering Step: Packaged models are registered
prediction errors. Prediction errors are evaluated with a loss and stored in the ML model store for quick deployments.
function comparing the predictions and the targets. In our case,
the targets are the Time Series which have been shifted 3 B. Edge Inference
steps ahead (15 minutes). On the parameters generated by On each edge device, for each new event computed by the
IoT devices we analysed the ones which have a correlation IoT devices, the ML model predicts the future air quality
with the air quality 15 minutes ahead. We found that the for the current room. As a reminder each model is replaced
air quality static, ambient light, humidity, iaq accuracy static, by an updated version when a performance drop is noticed.
pressure and temperature were correlated to the targets. After We measured for 45 days the activity of the Edge MLOps
using a Z-score scaler, these parameters are provided to the framework by counting the number of deployed models and
the number of failures. A total of 23 models have been changed or continuously deployed on the respective edge devices without reporting any error. Table IV lists the models deployed by the Edge MLOps framework. We notice that for all newly deployed models the RMSE decreases, meaning that the updated model performs better than the previous one. In addition, our implementation performed successfully on the different edge devices without showing any compatibility issues with our framework.

Edge vs Cloud inference based on the experiments
                      Edge devices (10)        Cloud node (1)
Device                Raspberry Pi 4           DS2 v2 (Azure)
Computation           40 vCPUs (4x10)          2 vCPUs
RAM                   40 GB (4x10)             7 GB
Temporary storage     640 GB (64 GB/device)    14 GB
Data pruned           22 %                     0 %
ML inference/minute   1/device                 10
Avg. inference time   0.2 seconds              2.2 seconds
Total cost/month      $ 10/month               $ 93/month

TABLE V: Quantitative analysis - Edge vs cloud based on the experiments.
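Taken together with the packaging step, the per-event inference on an edge device amounts to reapplying the persisted Z-score parameters and calling the deployed model. A minimal sketch follows; the six feature names are the correlated parameters reported above, while the scaler layout and the `model_predict` callable are our own stand-ins for the serialised scaler file and the ONNX Runtime session used in the deployed containers:

```python
# Illustrative sketch of one edge inference step. The scaler layout and
# `model_predict` are assumptions, standing in for the serialised scaler
# file and the ONNX Runtime session shipped in the model container.
FEATURES = ["air_quality_static", "ambient_light", "humidity",
            "iaq_accuracy_static", "pressure", "temperature"]

def zscore(event, scaler):
    """Reapply the Z-score parameters persisted at packaging time."""
    return [(event[f] - scaler[f]["mean"]) / scaler[f]["std"] for f in FEATURES]

def infer(event, scaler, model_predict):
    """Scale the raw IoT event, then forecast air quality 15 minutes ahead."""
    return model_predict(zscore(event, scaler))
```

In the deployed container, `model_predict` would wrap a call to the ONNX Runtime session created from the registered ONNX model.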
Realtime machine learning inference at the edge
S.no  Date of model change  Edge Device      Deployed Model  Model Drift (RMSE)  Model Retrain (RMSE)
1     15-03-2020            Jetson Nano 2    ELM             16.39               4.1
2     16-03-2020            Google TPU edge  RFR             14.23               6.3
3     16-03-2020            Raspberry Pi 4   MLR             11.91               4.3
4     17-03-2020            Raspberry Pi 4   ELM             13.27               8.1
5     22-03-2020            Jetson Nano 2    SVR             22.32               6.2
6     24-03-2020            Google TPU edge  RFR             17.11               4.4
7     27-03-2020            Raspberry Pi 4   MLR             16.22               4.7
8     29-03-2020            Jetson Nano 2    ELM             30.28               8.2
9     30-03-2020            Google TPU edge  SVR             18.12               5.4
10    05-04-2020            Raspberry Pi 4   MLR             12.92               3.2
11    10-04-2020            Jetson Nano 2    SVR             17.21               5.2
12    11-04-2020            Google TPU edge  MLR             13.42               4.7
13    13-04-2020            Jetson Nano 2    ELM             27.29               5.3
14    17-04-2020            Google TPU edge  RFR             17.46               6.9
15    19-04-2020            Raspberry Pi 4   SVR             16.32               5.1
16    19-04-2020            Google TPU edge  MLR             11.91               3.4
17    21-04-2020            Jetson Nano 2    ELM             23.26               7.3
18    22-04-2020            Google TPU edge  RFR             16.92               7.2
19    24-04-2020            Raspberry Pi 4   SVR             17.87               5.2
20    25-04-2020            Google TPU edge  MLR             13.92               5.2
21    25-04-2020            Jetson Nano 2    SVR             19.21               7.9
22    26-04-2020            Raspberry Pi 4   ELM             23.57               6.4
23    26-04-2020            Google TPU edge  SVR             18.21               5.5

TABLE IV: Results of the Machine Learning inference at the edge.

We evaluate the robustness of our framework at its different stages. At a 5-minute interval, IoT devices sent data to the edge devices without any failures. During the 45 days of the experiment, a time trigger was executed every day at 12:00 to monitor the model drift of each deployed model. A total of 135 time triggers were executed successfully without any failure. In addition, 27 manual triggers were executed in a random fashion without any failure. On the Cloud layer, 23 ML models were successfully retrained when their performance degraded. An aggregate of 38 MB of data was collected without any interruptions or data leakage. The data collection and storage pipeline worked and no interruptions were detected. These results indicate that the proposed solution is feasible and stable. Edge MLOps is scalable to multiple edge devices (depending on the use case) and is confined to the infrastructure (cloud and networking) and tools used to implement the framework and perform the experiments. Our framework is generic enough to be applied to any AIoT application such as healthcare, manufacturing, telecommunication, energy, etc.
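The daily drift trigger described above reduces to a simple check: score the model's recent forecasts against the observations that have since arrived, and request cloud retraining when the error crosses a threshold. A hedged sketch, with function names and the default threshold of our own choosing rather than the paper's implementation:

```python
import math

def rmse(predictions, actuals):
    """Root mean squared error over recent (forecast, observed) pairs."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predictions, actuals))
                     / len(actuals))

def drift_check(predictions, actuals, threshold=10.0):
    """Daily drift trigger: flag the model for cloud retraining and
    redeployment when its recent RMSE exceeds the threshold."""
    score = rmse(predictions, actuals)
    return {"rmse": score, "retrain": score > threshold}
```

A monitoring script like this runs inside each model container; when `retrain` is true, the cloud orchestration pipeline retrains and redeploys, which is what produces the "Model Drift (RMSE)" versus "Model Retrain (RMSE)" pairs in Table IV.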
VII. RESULTS

For the purpose of this study, validation of the design cycle was done over several iterations. We assessed the cost, energy and operational efficiency of edge vs cloud machine learning inference based on our experiments. For benchmarking, we used a single edge device and compared it against a compute resource on the cloud. For the edge devices we used Raspberry Pi 4's, and for cloud compute we used a data science virtual machine DS2 v2 (on Azure Cloud). Table V shows the results of monitoring the models deployed on 10 edge devices (Raspberry Pi 4) compared to 1 cloud node (a data science virtual machine DS2 v2 on Azure Cloud) over a timeline of 30 days, or a month. Based on our experiments, the framework setup improved resource usage by almost 9 times compared to the same experiment performed on cloud computing using microservices. Our calculations are based on power consumption references from Raspberry Pi and Microsoft (for the data science virtual machine DS2 v2) [38], [39]. There was an overall 9-times cost reduction and a 9-times increase in inference speed. In addition, 22% of the incoming data from the IoT devices were pruned (to send only essential data to the cloud for storage), reducing storage costs in the cloud by up to an order of magnitude. Similar studies are described in [11], suggesting the energy and cost efficiency of inference at the edge.

Our framework proposes a stable and automatic way to train models on the cloud and deploy them on edge devices. This results in a stable and robust pipeline. Automation of the ML model lifecycle management is worth considering as it saves time and manual labour. Manual deployment of models can be prone to errors due to human intervention. This risk is mitigated by our approach.

VIII. FUTURE DIRECTIONS

A. Towards a Fault-tolerant Edge MLOps Framework

In our experiments, we have been able to deploy 23 models without any errors for 45 days, but it is worth noting that we may encounter exceptions in the long run. For example, model training may fail or model inference may not work as expected, which may result in unexpected behaviour of the system. To tackle this and ensure that our framework is robust, additional tests should be performed, such as end-to-end testing, integration testing, smoke testing and unit testing, as described in [10]. These tests can be performed in DEV and TEST environments before going live in PROD. To the best of our knowledge, there is no prior work on testing the robustness, fault-tolerance or system recovery of Machine Learning Operations frameworks in mixed Edge and Cloud computing settings. As future work, we would adapt work done in similar settings to our framework. For example,
[40] proposed to evaluate the response time, the availability,
throughput and reliability for the Cloud setting. In the Edge
setting, [41] measured different success rates of meeting loss-
tolerance requirements under increasing workload to evaluate
their system. In an Edge-Cloud architecture, as in [42], the authors reported measurements of latency and throughput by considering various real-case scenarios for assessing fault tolerance. An additional consideration should also be made
as Machine Learning systems are inherently different from
classical software engineering systems. In other terms, in
Machine Learning systems, there exist additional technical
debts to be paid. An exploration of several ML-specific risk factors (e.g. data dependency, data entanglement, feedback loops) is described in [43], and these can serve as robustness and fault-tolerance evaluation metrics.

Fig. 7. Edge MLOps Framework for Federated Learning
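As an illustration of the smoke-testing idea from Section VIII-A, a DEV/TEST gate for a packaged model could be as small as the following; the plausible-range bound and the `predict` interface are our assumptions, not details from the paper:

```python
import math

def smoke_test(predict, sample_event):
    """Minimal smoke test to run in DEV/TEST before a model container is
    promoted to PROD: the model must answer with a finite, plausible
    air-quality forecast for a known-good sample event. The 0-500 range
    is an assumed sanity bound, not a value from the paper."""
    forecast = predict(sample_event)
    assert isinstance(forecast, (int, float)), "prediction is not numeric"
    assert math.isfinite(forecast), "prediction is NaN or infinite"
    assert 0.0 <= forecast <= 500.0, "forecast outside plausible range"
    return forecast
```

Integration and end-to-end tests would extend the same pattern across the data ingestion, packaging and deployment pipelines rather than a single prediction call.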

B. Considerations on security and privacy

In the current version of our framework, security and privacy challenges are not addressed; they can be an interesting future direction of our work. The Cloud and Edge computing paradigms are not perfect, as they are vulnerable to attacks and raise privacy concerns. In our framework, deploying models on the edge gives the advantage of moving the computation closer to the device itself. As a result, it gives faster responsiveness and processing compared to cloud-based solutions. In addition, Edge computing allows filtering out sensitive data before it is sent to the cloud, adding a level of security for the data. However, Edge computing is also vulnerable to security and privacy attacks such as Eavesdropping, Denial of Service, Data Tampering, exploiting weak credentials for protection or using insecure communication [44]. For each attack there exist countermeasures, and additional work can be done on that basis to propose a more secure MLOps framework. To ensure the privacy of users' sensitive information, a countermeasure proposed in [45] is to distribute the sensitive information across Edge computing nodes such that no node has complete knowledge of the information. As an open research direction, the authors suggest investigating federated learning strategies for Machine Learning applications on the Edge.

C. Framework extension for Federated Learning

Data privacy may be necessary, for instance, in Healthcare, where in accordance with regulations like the GDPR [46], patient data cannot be used as it is and personally identifiable information should be anonymised. To tackle this challenge, a Federated Learning approach can be implemented by extending the proposed framework, as we show in Figure 7.

Federated Learning [47] is a way of performing Machine Learning in a collaborative fashion. The training process is distributed across multiple edge devices, each storing only a local sample of the data. Data is not exchanged nor transferred between edge devices or the cloud, to maintain data privacy and security. In our framework, data storage and model training are decentralised, and the fine-tuning of ML models is centralised on the cloud. The ML pipeline step of the Cloud Orchestration layer does not ingest data anymore; rather, the ML models are trained locally. This process calibrates the global models, which are then audited and stored in the model registry. These fine-tuned global models are the only data stored on the cloud. We leave the experimental validation of this extension as future work.

IX. CONCLUSION

Edge MLOps is a framework combining cloud and edge environments for operationalising ML models. Cloud solutions propose high computation power and storage, while edge environments reduce latency and network connectivity dependence.

We proposed a complete lifecycle for automating AIoT workloads by orchestrating two main pipelines (cloud orchestration and edge inference) to ensure modularity. Each pipeline is in turn composed of other pipelines, such as the ML pipeline, the CI-CD pipeline, the Continuous Deployment pipeline at the edge, etc. Traditionally, each pipeline is executed separately or connected together in an ad-hoc manner. In this paper, we presented an approach to synchronise all these steps for the realtime automation of cloud and edge operations. Our framework was successfully implemented in a real-life scenario where the goal was to deploy ML models to forecast the air quality of rooms directly on the edge. It integrates features such as different pipeline triggers (e.g. data drift, model versioning, model drift), is language and library agnostic (based on containers) and can easily be extended to a Federated Learning setting.

As future work, we would investigate the behaviour and performance of our framework in cases with a more time-sensitive AI setup, such as a short training time-frame. Moreover, as data privacy and security are becoming increasingly necessary, experimentally validating our framework in the Federated Learning setting is a natural way forward.

ACKNOWLEDGMENTS

This work is supported by TietoEVRY and the 5G-FORCE project "6388/31/2018 5G-FORCE" (www.5g-force.org).

REFERENCES

[1] K. Bilal, O. Khalid, A. Erbad, and S. Khan, "Potentials, trends, and prospects in edge technologies: Fog, cloudlet, mobile edge, and micro data centers," Computer Networks, vol. 130, 10 2017.
[2] Y. Wu, Y. Wu, and S. Wu, An outlook of a future smart city in Taiwan from post–Internet of things to artificial intelligence Internet of things. Elsevier, 01 2019, pp. 263–282.
[3] W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu, "Edge computing: Vision and challenges," IEEE Internet of Things Journal, vol. 3, pp. 1–1, 10 2016.
[4] G. Agarwal, "What is Edge AI and how it fills the cracks of IoT," 2019.
[5] M. Westerlund, "A study of EU data protection regulation and appropriate security for digital services and platforms," Ph.D. dissertation, Åbo Akademi University, 2018.
[6] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtarik, A. T. Suresh, and D. Bacon, "Federated learning: Strategies for improving communication efficiency," in NIPS Workshop on Private Multi-Party Machine Learning, 2016.
[7] M. Beck, M. Werner, S. Feld, and T. Schimper, "Mobile edge computing: A taxonomy," in The Sixth International Conference on Advances in Future Internet (AFIN 2014), 01 2014.
[8] J. Wickström, M. Westerlund, and G. Pulkkis, "Smart contract based distributed IoT security: A protocol for autonomous device management," in Proceedings of the 21st ACM/IEEE International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2021) (forthcoming), 2021.
[9] K. Sato, "What is ML Ops? Best practices for DevOps ML," 2018, presentation at Google Cloud Next'18. [Online]. Available: https://cloud.withgoogle.com/next18/sf/sessions/session/192579
[10] E. Raj, Engineering MLOps. Birmingham, United Kingdom: Packt, 2021.
[11] S. Teerapittayanon, B. McDanel, and H.-T. Kung, "Distributed deep neural networks over the cloud, the edge and end devices," in 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS). IEEE, 2017, pp. 328–339.
[12] D. A. Tamburri, "Sustainable mlops: Trends and challenges," in 2020 22nd International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), 2020, pp. 17–23.
[13] S. B. Calo, M. Touna, D. C. Verma, and A. Cullen, "Edge computing architecture for applying ai to iot," in 2017 IEEE International Conference on Big Data (Big Data), 2017, pp. 3012–3016.
[14] J. Chen and X. Ran, "Deep learning with edge computing: A review," Proceedings of the IEEE, vol. 107, no. 8, pp. 1655–1674, 2019.
[15] C. Wu, D. Brooks, K. Chen, D. Chen, S. Choudhury, M. Dukhan, K. Hazelwood, E. Isaac, Y. Jia, B. Jia, T. Leyvand, H. Lu, Y. Lu, L. Qiao, B. Reagen, J. Spisak, F. Sun, A. Tulloch, P. Vajda, X. Wang, Y. Wang, B. Wasti, Y. Wu, R. Xian, S. Yoo, and P. Zhang, "Machine learning at facebook: Understanding inference at the edge," in 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2019, pp. 331–344.
[16] S. Wang, C. Ding, N. Zhang, X. Liu, A. Zhou, J. Cao, and X. S. Shen, "A cloud-guided feature extraction approach for image retrieval in mobile edge computing," IEEE Transactions on Mobile Computing, 2019.
[17] Y. Lee, A. Scolari, B.-G. Chun, M. Weimer, and M. Interlandi, "From the edge to the cloud: Model serving in ml.net," IEEE Data Eng. Bull., vol. 41, no. 4, pp. 46–53, 2018.
[18] B. Sudharsan, J. G. Breslin, and M. I. Ali, "Edge2train: a framework to train machine learning models (svms) on resource-constrained iot edge devices," in Proceedings of the 10th International Conference on the Internet of Things, 2020, pp. 1–8.
[19] Gartner, Integrate DevOps and Artificial Intelligence to Accelerate IT Solution Delivery and Business Value, 2017, https://www.gartner.com/doc/3787770/integrate-devops-artificial-intelligence-accelerate.
[20] J. Hermann and M. D. Balso, Meet Michelangelo: Uber's machine learning platform, 2017, https://eng.uber.com/michelangelo.
[21] M. Ali and Y. Lee, "Crm sales prediction using continuous time-evolving classification," in AAAI, 2018.
[22] J. Ereth, "Dataops - towards a definition," LWDA, vol. 2191, pp. 104–112, 2018.
[23] A. R. Munappy, D. I. Mattos, J. Bosch, H. H. Olsson, and A. Dakkak, "From ad-hoc data analytics to dataops," in Proceedings of the International Conference on Software and System Processes, 2020, pp. 165–174.
[24] P. Agrawal and N. Rawat, "Devops, a new approach to cloud development testing," in 2019 International Conference on Issues and Challenges in Intelligent Computing Techniques (ICICT), vol. 1, 2019, pp. 1–4.
[25] H. Tian, M. Yu, and W. Wang, "Continuum: A platform for cost-aware, low-latency continual learning," in Proceedings of the ACM Symposium on Cloud Computing, ser. SoCC '18. New York, NY, USA: Association for Computing Machinery, 2018, pp. 26–40.
[26] J. Moon, S. Kum, and S. Lee, "A heterogeneous iot data analysis framework with collaboration of edge-cloud computing: Focusing on indoor pm10 and pm2.5 status prediction," Sensors, vol. 19, no. 14, p. 3038, 2019.
[27] A. Bhattacharjee, Y. Barve, S. Khare, S. Bao, Z. Kang, A. Gokhale, and T. Damiano, "Stratum: A bigdata-as-a-service for lifecycle management of iot analytics applications," in 2019 IEEE International Conference on Big Data (Big Data), 2019, pp. 1607–1612.
[28] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, ser. Springer Series in Statistics. New York, NY, USA: Springer New York Inc., 2001.
[29] J. Bergstra and Y. Bengio, "Random search for hyper-parameter optimization," J. Mach. Learn. Res., vol. 13, pp. 281–305, Feb. 2012.
[30] E. Raj, M. Westerlund, and L. Espinosa-Leal, "Reliable fleet analytics for edge iot solutions," in Cloud Computing 2020: The Eleventh International Conference on Cloud Computing, GRIDs, and Virtualization, p. 55, 2020.
[31] R. Akkiraju, V. Sinha, A. Xu, J. Mahmud, P. Gundecha, Z. Liu, X. Liu, and J. Schumacher, "Characterizing machine learning process: A maturity framework," arXiv preprint arXiv:1811.04871, 2018.
[32] A. Al-Fuqaha, M. Guizani, M. Mohammadi, M. Aledhari, and M. Ayyash, "Internet of things: A survey on enabling technologies, protocols, and applications," IEEE Communications Surveys Tutorials, vol. 17, no. 4, pp. 2347–2376, 2015.
[33] R. Wieringa, Design Science Methodology for Information Systems and Software Engineering. Springer, 2014, 10.1007/978-3-662-43839-8.
[34] R. J. Hyndman and G. Athanasopoulos, Forecasting: Principles and Practice, 2nd ed. OTexts.com, 2014.
[35] D. F. Andrews, "A robust method for multiple linear regression," Technometrics, vol. 16, no. 4, pp. 523–531, 1974.
[36] A. Akusok, K.-M. Björk, Y. Miche, and A. Lendasse, "High-performance extreme learning machines: a complete toolbox for big data applications," IEEE Access, vol. 3, pp. 1011–1025, 2015.
[37] A. Liaw, M. Wiener et al., "Classification and regression by randomforest," R News, vol. 2, no. 3, pp. 18–22, 2002.
[38] Microsoft, "Yearly running cost per rasbpi - raspberry pi forums," https://www.raspberrypi.org/forums/viewtopic.php?t=18043 (accessed on 04/12/2021).
[39] "Pricing - windows virtual machines — microsoft azure," https://azure.microsoft.com/en-us/pricing/details/virtual-machines/windows/ (accessed on 04/12/2021).
[40] P. Kumari and P. Kaur, "A survey of fault tolerance in cloud computing," Journal of King Saud University - Computer and Information Sciences, 2018.
[41] C. Wang, C. Gill, and C. Lu, "Frame: Fault tolerant and real-time messaging for edge computing," in 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS), 2019, pp. 976–985.
[42] A. Javed, J. Robert, K. Heljanko, and K. Främling, "Iotef: A federated edge-cloud architecture for fault-tolerant iot applications," Journal of Grid Computing, vol. 18, 2020.
[43] D. Sculley, G. Holt, D. Golovin, E. Davydov, T. Phillips, D. Ebner, V. Chaudhary, M. Young, J.-F. Crespo, and D. Dennison, "Hidden technical debt in machine learning systems," in Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, 2015, pp. 2503–2511.
[44] S. Parikh, D. Dave, R. Patel, and N. Doshi, "Security and privacy issues in cloud, fog and edge computing," Procedia Computer Science, vol. 160, pp. 734–739, 2019.
[45] A. Alwarafy, K. A. Al-Thelaya, M. Abdallah, J. Schneider, and M. Hamdi, "A survey on security and privacy issues in edge-computing-assisted internet of things," IEEE Internet of Things Journal, vol. 8, no. 6, pp. 4004–4022, 2021.
[46] P. Voigt and A. Von dem Bussche, "The EU General Data Protection Regulation (GDPR)," A Practical Guide, 1st ed. Cham: Springer International Publishing, 2017.
[47] Q. Yang, Y. Liu, T. Chen, and Y. Tong, "Federated machine learning: Concept and applications," ACM Trans. Intell. Syst. Technol., vol. 10, no. 2, Jan. 2019. [Online]. Available: https://doi.org/10.1145/3298981
