An Open Source Framework Based on Kafka-ML for Distributed DNN Inference over the Cloud-to-Things Continuum
Keywords: Distributed deep neural networks, Cloud computing, Fog/edge computing, Distributed processing, Low-latency fault-tolerant framework

Abstract: The current dependency of Artificial Intelligence (AI) systems on Cloud computing implies higher transmission latency and bandwidth consumption. Moreover, it challenges the real-time monitoring of physical objects, e.g., the Internet of Things (IoT). Edge systems bring computing closer to end devices and support time-sensitive applications. However, Edge systems struggle with state-of-the-art Deep Neural Networks (DNN) due to computational resource limitations. This paper proposes a technology framework that combines the Edge-Cloud architecture concept with BranchyNet advantages to support fault-tolerant and low-latency AI predictions. The implementation and evaluation of this framework allow assessing the benefits of running Distributed DNN (DDNN) in the Cloud-to-Things continuum. Compared to a Cloud-only deployment, the results obtained show an improvement of 45.34% in the response time. Furthermore, this proposal presents an extension of Kafka-ML that reduces rigidness over the Cloud-to-Things continuum when managing and deploying DDNN.
1. Introduction

Artificial Intelligence (AI) and Deep Neural Networks (DNN) [1] are key contributors to autonomous decision and prediction processes in multiple domains, ranging from manufacturing systems to self-driving cars. Currently, thousands of data sources originated during the Internet era, such as the Internet of Things (IoT), provide data streams. Streaming data [2] feeds these application domains that, in the end, depend on Cloud platforms due to the large amount of data collected and processed. Cloud platforms [3] provide infrastructures and platforms as a service (IaaS and PaaS, respectively) to access computing, storage, and connectivity. Nonetheless, this dependency means a high transmission latency since data centers are located far from the end devices. Therefore, it challenges the real-time monitoring of physical objects featured in the IoT.

Architectures based on Edge and Fog computing represent promising alternatives that complement Cloud-based systems and make them more capable of ensuring a rapid response to emergencies. The purpose of the Fog computing paradigm is to extend the Cloud capabilities (storage, network, and computation services) and bring them closer to the edge of the network [4]. Fog systems can be considered a geographically distributed computing architecture connected to multiple heterogeneous devices (mini data centers, network devices, lightweight servers) that forms a bridge between the Cloud and the Edge so as to meet the time-sensitive requirements of IoT applications. Some authors claim that Edge computing can be interchangeable with Fog computing [5]. However, the key difference between these two paradigms may be seen in the location where the data processing is performed. In Fog computing, the processing is performed as close as possible to the IoT devices, while Edge computing pushes the limits even further by allowing connected gateways and IoT devices to process data locally.

These two paradigms reduce bandwidth consumption and latency in a sequence of processing known as the Cloud-to-Things continuum [6]. The Cloud-to-Things continuum, shown in Fig. 1, is defined as a set of processing units, such as Edge devices and Fog servers. These processing units, located between the IoT and the Cloud, optimize response times and bandwidth consumption in time-sensitive applications. For instance, a deployment in this context could consist of IoT devices generating information, connected to gateways or Edge devices, and Fog servers processing data before sending the information received to the Cloud.

Distributed DNN (DDNN) [7] combine DNN for complex pattern detection with a distribution of the DNN layers over the Cloud-to-Things continuum to optimize latency. For instance, in Structural Health Monitoring (SHM) [8], DDNN can assess the global state and detect structural problems of civil infrastructures. These mission-critical scenarios require minimal response latency for real-time evaluation of civil infrastructures and population safety.
Although DDNN have evolved over the last few years, we argue that DDNN are still at a low technology readiness level since there is a lack of:

1. frameworks for the fault-tolerant distribution of DNN and the management and monitoring of DDNN over heterogeneous hardware,
2. effective communication layers to interconnect, discover, and communicate neural networks, and
3. solutions to manage AI pipelines from a DDNN model training until it is ready for inference.

To address the lacks mentioned in 1) and 2), in this article, a low-latency and fault-tolerant framework for enabling the flexible distribution of DNN over the Cloud-to-Things continuum is introduced. The purpose of this framework is to manage and distribute DNN with fault-tolerant guarantees and provide adequate communication layers with low latency to interconnect them. This framework is designed considering container technologies, such as Docker [9], to facilitate rapid deployment and mobility with lightweight scaling and reallocation of components, applications, and services. It also considers Apache Kafka, the present state-of-the-art solution for the scalable dispatching of data flows.

The framework uses models based on BranchyNet [10], which provides a novel approach that promotes fast inference through early-exit DNN branches. These branches are complementary outputs located throughout a neural network structure. BranchyNet-based branches allow intermediary predictions that, when accurate enough, can make a DNN inference stop at that point, saving time by avoiding inference over the upper layers of a DNN model. As a result, during the inference process, all the DNN-model layers are processed only if none of the early exits in the model is accurate enough, i.e., only if no early-exit prediction returns a class probability higher than a specified threshold.
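For illustration purposes only (this is not code from the proposed framework), the following minimal sketch shows this decision applied to the softmax output of an early-exit branch; the 0.8 value matches the threshold used later in the evaluation and is configurable:

```python
import numpy as np

def early_exit_prediction(branch_probs, threshold=0.8):
    """Return the predicted class if the early-exit branch is confident
    enough to stop the inference, or None to continue in the upper layers."""
    top_class = int(np.argmax(branch_probs))
    if float(branch_probs[top_class]) > threshold:
        return top_class   # reliable enough: the DNN inference stops here
    return None            # not reliable: the intermediate output is sent upwards

# Example with a softmax output over the 10 CIFAR10 classes
probs = np.array([0.02, 0.01, 0.85, 0.01, 0.02, 0.02, 0.03, 0.02, 0.01, 0.01])
print(early_exit_prediction(probs))  # -> 2
```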
In this work, an extension of Kafka-ML [11] is also proposed to address lack 3), by providing an open-source framework to manage and deploy AI pipelines using data streams.

The main contributions of this work are summarized as follows:

1. a framework that reduces location rigidness over the Cloud-to-Things continuum by managing and deploying DDNN applications,
2. effective communication layers to interconnect DDNN applications, and
3. the combination of the BranchyNet approach and the Edge-Cloud architecture concept, which shows an improvement of 45.34% in the response time compared to a Cloud-only deployment.

The rest of the paper is organized as follows. The motivation for this proposal is presented in Section 2. Section 3 introduces a background on Kafka-ML and the other main technologies used in this work. The low-latency and fault-tolerant framework is described in Section 4. In Section 5, the implementation and its evaluation are presented and discussed. Section 6 provides an outline of the related literature. Finally, Section 7 concludes the paper and explores future work lines.

2. Motivation

As was demonstrated when the Genoa (Italy) bridge collapsed in August 2018 [12], civil infrastructure failures and malfunctions can have terrible consequences for human lives and essential civil work activities. Therefore, SHM requires proper real-time monitoring and management. Detecting and predicting any damage or vulnerability is essential to protect the population before a disaster can occur. Consequently, a number of recent studies [13,14] have shown successful applications of Deep Learning (DL) techniques in this field. However, to the best of our knowledge, these techniques are normally deployed in a monolithic way (i.e., through non-distributed deep neural networks). Moreover, these techniques are not fit for use in time-sensitive systems as required in this context. One of the reasons for the absence of DDNN could reside in the lack of available architectures and frameworks for managing and deploying DDNN applications in the Cloud-to-Things continuum.

Furthermore, modern DNN require a considerable amount of computational resources [15–17]. This implies that Cloud and Edge systems must have sufficient hardware resources to allocate multiple instances of a DNN model and accept a large number of requests per minute from multiple devices. The virtually unlimited processing and storage capabilities of Cloud systems to allocate these DNN models are well known. However, the response latency present in the communications to these platforms is far from meeting the requirements of time-sensitive applications. For these reasons, DNN are being partitioned and distributed in heterogeneous hardware and multi-layered infrastructures.
DNN layers can extract complex patterns from high-volume datasets consisting of images, as notably demonstrated by Convolutional Neural Networks (CNN) [18]. However, working with images can raise several challenges. One is related to information privacy, mainly when involving sensitive data in domains like eHealth. Although some systems that deal with sensitive data only process private images without letting people access these images, their communication over the network (especially with the Cloud) can lead to security breaches, even if these communications are safely and reliably protected. To support data sensitivity, the DNN distribution allows the transmission of inferences resulting from the intermediate layers of a DNN model, which contain less sensitive information than raw images. Moreover, DNN can also notably reduce the image size [19] thanks to the image compression performed in some layers. Therefore, DDNN could also significantly reduce bandwidth consumption by exploiting this technique.

Finally, real-time applications require adequate communication layers to fully interconnect DDNN, as well as architectures for the unified management and deployment of DDNN. Moreover, as BranchyNet proposes, early exits return a response as soon as the accuracy is good enough to interact with the lowest levels of the continuum, providing real-time outcomes. Given the results of Kafka-ML [11] managing AI applications with data streams, we have envisaged an extension of this framework to address the management and distribution of DDNN in the Cloud-to-Things continuum.

3. Kafka-ML: managing AI pipelines through data streams

In this paper, an extension of Kafka-ML [11] is presented. Kafka-ML, which is available on GitHub,1 is an open-source framework that allows the management of Machine Learning (ML) and AI pipelines through data streams. Kafka-ML aims to reduce the gap between data streams and current ML and AI frameworks, providing an accessible framework to harmonize their full integration.

1 https://github.com/ertis-research/kafka-ml.

Kafka-ML makes use of the distributed message system Apache Kafka [20]. Apache Kafka is a distributed publish/subscribe messaging system that dispatches large amounts of data at low latency. Apache Kafka enables multi-consumer distribution, which allows the connection of multiple consumers to topics (e.g., for distributing information to different layers like batch and stream platforms), and a high rate of message dispatching. One of its most notable features is the consumer group, which enables the distribution and parallelism of messages in a cluster of consumers.

Contrary to many distributed queue frameworks, Apache Kafka stores messages on disk with a configurable retention policy, enabling its users to retrieve data later. This is popularly known as the distributed log, which allows consumers to search the log as they require. In some cases, such as ML training, this feature is especially useful as all data may need to be processed at once. If a failure occurs during this process, the consumer can start again without losing any data stream or having to store it in a file system (as the data are stored in the distributed log of Apache Kafka). In the case of adopting a queue framework that does not support a retention policy like Apache Kafka does, data streams would at least have to be stored in another data storage system until training is successfully performed to ensure there is no data loss. Therefore, Apache Kafka allows Kafka-ML to use a novel approach to manage data streams, which can be reused as many times as configured, so no file system or data storage is needed for datasets. Load balancing and fault tolerance among Kafka-ML tasks that require data streams, such as training and inference, are achieved through Kafka partitions and replicas of the topics. Each topic can be allocated into multiple partitions, and each partition can have multiple replicas for fault tolerance.
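As an illustration of how a task can exploit the distributed log and consumer groups, the following simplified sketch uses the kafka-python client (the same client later used in the implementation); the topic, group, and broker names are placeholders:

```python
from kafka import KafkaConsumer

# Each Kafka-ML task (e.g., training or inference) joins a consumer group, so
# Kafka balances the topic partitions among the group members and distributes
# the load. With no previously committed offsets, auto_offset_reset='earliest'
# makes the task read the retained data stream from the beginning of the
# distributed log instead of only receiving new messages.
consumer = KafkaConsumer(
    'training-data',                 # placeholder topic name
    bootstrap_servers='kafka:9092',  # placeholder broker address
    group_id='kafka-ml-training',    # placeholder consumer group
    auto_offset_reset='earliest',
)

for message in consumer:
    sample = message.value           # raw bytes of one data stream element
    # ... decode the sample and feed the training or evaluation process ...
```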
Regarding ML frameworks, Kafka-ML supports TensorFlow [21]. TensorFlow is an open-source framework with a flexible ecosystem of tools, libraries, and community resources for building and deploying ML-powered applications. Kafka-ML also offers an accessible and user-friendly Web interface (following a similar approach as AutoML initiatives) to manage ML and AI pipelines for both expert and non-expert users. As its main characteristic, Kafka-ML exploits containerization (Docker [9]) and container orchestration platforms (Kubernetes [22]) to facilitate the distribution of its components and the system load, and to provide high availability and fault tolerance.

Docker, the container runtime of choice in the industry, automates application deployment inside containers, enabling the execution of applications on multiple architectures (x86, AMD64, ARM) and reducing development time (e.g., dependency problems) for developers and deployment teams. A cluster of machines running Docker containers is usually managed through a container orchestration system, such as Kubernetes, which enables distributed management and coordination while providing fault tolerance, vertical (in federation mode) and horizontal scaling, and high availability. Kubernetes is an open-source system for managing containerized applications in a cluster of nodes, easing both the management and deployment of containers and providing automation and declarative configuration. Kubernetes also enables continuous monitoring of containers, including Docker containers and their replicas, to ensure that they continuously match the defined status.

Kafka-ML users can write a few code lines that define an ML model on the Web interface of Kafka-ML to start training, evaluating, comparing, and making inferences. The pipeline of an ML model in Kafka-ML, representing its life cycle, is shown in Fig. 2: (1) designing and defining the ML model; (2) creating a configuration of ML models, i.e., choosing a set of ML model(s) to be trained; (3) deploying the configuration for training using containers; (4) ingesting the deployed configuration with training and, optionally, evaluation data streams through Apache Kafka; (5) deploying the trained model for inference in the architecture presented in this work; and (6) feeding the deployed trained model for inference to make predictions with data streams. All the steps related to feeding the ML model (e.g., inference and training) use data streams. Each task executed is deployed in a Docker container in Kubernetes.
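As an example of step (1), a user could provide a definition similar to the following TensorFlow/Keras snippet (the layer sizes are illustrative, and the exact code format expected by the Web form is described in the Kafka-ML repository):

```python
import tensorflow as tf

# Step (1) of the pipeline: a few lines defining and compiling an ML model,
# similar to what a user could introduce in the Kafka-ML Web interface.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(32, 32, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
```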
Kafka-ML is used as the primary tool for training, evaluating, and deploying ML models. The extension of Kafka-ML to enable DDNN and how DDNN communicate through Apache Kafka are further discussed next.

4. Distributed deep neural networks over the Cloud-to-Things continuum with Kafka-ML

Apache Kafka, as a distributed message system, is responsible for providing an effective and fault-tolerant communication layer for DDNN inference in the proposed framework. This architectural decision has the following features and benefits.

First, since Apache Kafka works with data streams, it allows this framework to accept and work with data streams as its normal functioning and opens the way for the integration of new ones, such as those present in the IoT and the Internet. Furthermore, architectures that use this framework can easily scale to increase the computing capacity when required (e.g., deploying replicas of DDNN applications) thanks to the Apache Kafka capabilities for parallelism (topic partitions and consumer groups), which automatically distribute the data stream load among Apache Kafka consumers (DDNN application replicas).

Second, the fault-tolerant mechanisms provided in Apache Kafka (e.g., topic replicas) enable fine and reliable control of data streams for DDNN deployments and reduce the risk of data loss. The Apache Kafka distributed log also ensures that streaming data are available (for a predefined time or until the configured retention limit is exceeded) to DDNN applications even after they have been consumed. Thus, this also enables fault tolerance for DDNN applications.
Furthermore, IP addressing can be a challenge, especially when having a cluster of non-high-availability nodes such as those present in the continuum. In this regard, Apache Kafka facilitates the discovery of DDNN applications. They only have to know the Kafka topics where the prediction and inference results go (assuming Apache Kafka is deployed with fault tolerance and its IP is also known). Consequently, in case of a node failure (e.g., a DDNN application that waits for inference results), a new consumer subscription (along with the previous DDNN application) will be deployed in a way that is transparent to the DDNN application that sends the data stream (the producer). Thus, the Apache Kafka input and output topics have to be indicated when deploying DDNN applications. This enables high flexibility for the deployment of DDNN in this architecture, enabling the deployment of all partitioned models in one layer (e.g., the Edge or the Cloud) or in as many layers as infrastructures are available in the continuum. Moreover, the Kafka MirrorMaker functionality [23] enables a powerful and easy way to deal with topic synchronization among Kafka clusters such as those present in the Cloud-to-Things continuum. To this end, the only requirement to be considered is that the inference output of a DDNN hidden layer (non-early exit) must match the input of the next connected hidden layer in the global model definition.
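A minimal sketch of this deployment-time configuration is shown below; it assumes, purely for illustration, that the topics and broker address are passed to the container as environment variables (the variable and topic names are not those of the framework):

```python
import os
from kafka import KafkaConsumer, KafkaProducer

# Hypothetical configuration of a DDNN application: its input and output
# topics (and the broker address) are indicated at deployment time, for
# instance as container environment variables.
BOOTSTRAP = os.environ.get('KAFKA_BOOTSTRAP_SERVERS', 'kafka:9092')
INPUT_TOPIC = os.environ.get('INPUT_TOPIC', 'edge-input')
OUTPUT_TOPIC = os.environ.get('OUTPUT_TOPIC', 'edge-output')

consumer = KafkaConsumer(INPUT_TOPIC, bootstrap_servers=BOOTSTRAP,
                         group_id='ddnn-edge')
producer = KafkaProducer(bootstrap_servers=BOOTSTRAP)
```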
As a result, the combination of DDNN and the Cloud-to-Things continuum can face the challenges presented in the Motivation section. Whereas Kafka-ML was designed for a single-cluster infrastructure (e.g., a Cloud solution), this work provides a solution that allows applying this combination in the continuum. The framework has been developed to facilitate the inference of BranchyNet-based models over different heterogeneous infrastructures, such as Edge, Fog, and Cloud. Consequently, Kafka-ML now allows designing BranchyNet-based models. Therefore, Kafka-ML users can place an early exit in layers of the continuum (Fig. 3) to stop the inference early if the result is good enough (with a hit probability higher than a configurable threshold). Otherwise, the prediction is not considered reliable. In this case, the ML flow continues to the next layer in the continuum until a reliable prediction is obtained in the subsequent layers or the last layer of the continuum (e.g., the Cloud) is reached. This allows generating predictions as close as possible to the lowest layer where a time-sensitive interaction can be required, as long as the prediction is reliable according to a set threshold. The chosen threshold represents a trade-off between reliable and time-sensitive predictions. Furthermore, this framework makes the deployment of DDNN applications very flexible. The layered architecture depicted in Fig. 3 may be seen as a possible target deployment, one single layer (e.g., the Cloud), or as many layers as are available in the continuum, thanks to the abstraction of the system through Apache Kafka.
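The following simplified sketch summarizes the resulting inference flow of an Edge-level DDNN application: it consumes its input topic, applies its sub-model, and either answers through the early exit or forwards the intermediate output to the Cloud. Topic names, the serialization format, and the model path are assumptions; the threshold is the one used in the evaluation (Section 5.2):

```python
import numpy as np
import tensorflow as tf
from kafka import KafkaConsumer, KafkaProducer

THRESHOLD = 0.8   # configurable reliability threshold (value used in Section 5.2)

# The Edge sub-model returns the early-exit softmax and the intermediate
# tensor expected by the next (Cloud) sub-model.
edge_model = tf.keras.models.load_model('edge_submodel')        # placeholder path

consumer = KafkaConsumer('edge-input', bootstrap_servers='kafka:9092',
                         group_id='ddnn-edge')                   # placeholder names
producer = KafkaProducer(bootstrap_servers='kafka:9092')

for message in consumer:
    # Assumed encoding: one 32x32x3 CIFAR10 image sent as raw uint8 bytes.
    image = np.frombuffer(message.value, dtype=np.uint8)
    image = image.reshape((1, 32, 32, 3)).astype('float32') / 255.0
    early_probs, intermediate = edge_model.predict(image)

    if float(np.max(early_probs)) > THRESHOLD:
        # Reliable early exit: answer directly from the Edge; the flow ends here.
        producer.send('predictions', str(int(np.argmax(early_probs))).encode())
    else:
        # Not reliable enough: forward the intermediate output to the Cloud
        # sub-model through its input topic (kept in sync, e.g., with MirrorMaker).
        producer.send('cloud-input', intermediate.astype('float32').tobytes())
```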
To facilitate the deployment and development of DDNN applications and their portability and mobility over the continuum, this architecture, its components, and dependencies like Apache Zookeeper (required by Apache Kafka for the synchronization of brokers and topic replicas) are containerized through Docker containers. This provides a portable and lightweight solution to be distributed in the continuum. Containerization also reduces development and deployment efforts since all dependencies and the source code itself are packed in containers, which can be easily deployed without dealing with the installation steps required in non-containerized deployments. Once an ML model has been defined and trained in Kafka-ML, a Docker container will be instantiated in the continuum infrastructure for each sub-model. These instances will download their corresponding trained model from Kafka-ML to start the inference process through data streams and Apache Kafka. Therefore, Docker containers (and DDNN applications) communicate with each other through Apache Kafka and the configured topics. They can also communicate with each other across the different clusters available in the Cloud-to-Things continuum. During training, another Docker container is deployed to train all the sub-models. For further details about these algorithms, please refer to Kafka-ML [11].

Finally, the container orchestration platform Kubernetes harmonizes the deployment of the containers that compose this architecture. Moreover, Kubernetes is used to manage a cluster of nodes that can be present in the continuum and offers other suitable features for mission-critical applications in production environments, such as fault tolerance and high availability. This allows unified management of DDNN applications and continuous monitoring to ensure a desired execution state in the available nodes. Therefore, fault tolerance and high availability are warranted for the data and control planes in this framework: through Apache Kafka with its management of data streams (data plane) and through Kubernetes with its management of the infrastructure and the deployed Docker components (control plane). To avoid synchronization delays among internal clusters due to the strict requirements of consensus protocols, each available layer (e.g., Cloud, Edge, Fog) deploys an independent and isolated Kubernetes cluster. All of these clusters can be centrally managed through the Kubernetes Cluster Federation2 (KubeFed) functionality. Fig. 4 shows an overview of the architecture and its components deployed in two layers of the continuum (Edge and Cloud).

2 https://github.com/kubernetes-sigs/kubefed.

Fig. 3. Early exits and distribution of the DDNN framework in the Cloud-to-Things continuum. The Edge and Cloud contain different parts of a full neural model based on BranchyNet.

Fig. 4. Low-latency and fault-tolerant architecture for the management and deployment of DDNN applications over the Cloud-to-Things continuum based on Kafka-ML.

4.1. Time synchronization for DDNN applications

For time synchronization between DDNN layers, Apache Kafka is used. Since every message between DDNN applications is sent through Kafka (each DDNN layer has been configured with a Kafka input and output topic), all the communications in DDNN applications are available in Kafka for a predefined time or until the configured retention limit is exceeded. Therefore, in the event that a DDNN layer has not been deployed before a message for it has arrived, messages for this layer will still be available in Apache Kafka until the layer can process them.

As DDNN applications are vertically layered, if any layer fails for whatever reason, it could also stop the whole flow of processing for a while. However, the adoption of Kubernetes and the components' isolation in Docker containers provides continuous monitoring of the available infrastructure to restart any component in case of failure.
Moreover, the BranchyNet approach enables early exits for predictions, so not all layers always have to process the data streams. In this case, once a prediction is received from an early exit (i.e., its hit probability is higher than a set threshold), the result is sent to the configured Kafka output topic and the DDNN communication flow ends.
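A minimal sketch of this behavior, with placeholder names, is a consumer configured to start from the earliest retained offset:

```python
from kafka import KafkaConsumer

# A Cloud-layer DDNN application started *after* the Edge has already
# published intermediate results: with auto_offset_reset='earliest', the
# messages retained in its input topic since before the deployment are still
# delivered once the layer starts.
consumer = KafkaConsumer('cloud-input', bootstrap_servers='kafka:9092',
                         group_id='ddnn-cloud', auto_offset_reset='earliest')
```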
5. Implementation and evaluation

5.1. Implementation

We have designed a DNN model based on VGG16 [17] and the early exit concept proposed by BranchyNet [10]. VGG16 achieves a 92.7% top-5 test accuracy on ImageNet [24], and thus, we can successfully train a VGG16 model to classify images and expect more than 80% test accuracy on CIFAR10,3 which is the dataset we have used during the training and evaluation processes. Moreover, VGG16 is a well-known model and it can be easily adapted to the BranchyNet approach.

3 http://www.cs.toronto.edu/~kriz/cifar.html.

The complete resulting model is shown in Fig. 5. This model is large enough to place a sizeable workload on the Cloud and, thus, also on the Edge. In this model, we have included one early exit that corresponds to the edge_output layer. The branch for this early exit starts at the Flatten layer in the Edge and goes until the end of this branch, the edge_output layer. When placing an early exit, we have to consider the vertical traversal of the DNN stack before the start of this branch in order to add this early exit successfully. For instance, when designing a DNN model, we need to flatten the input of MaxPooling or Convolutional layers in order to work with Dense layers, since they work with different dimensionalities.

There is also another fact to consider when placing an early exit while following this framework, namely the additional memory required when adding a branch (i.e., including more layers), which makes the DNN model require more resources. Although it is not a requirement, we recommend placing one early exit per architecture layer (e.g., Edge, Fog). In this way, we have one output for each layer in the architecture (e.g., Edge, Fog, Cloud) deciding whether or not to send data to the upper layers of the architecture, thus saving communications. For instance, if a prediction made at the early exit placed in the Edge system has a probability of being a predicted class higher than a specified threshold, this will result in the Edge system sending back this prediction to the devices rather than asking the upper layers, the Cloud in this case, for a more reliable prediction. Moreover, Edge systems normally have fewer resources available to allocate to a DNN model (i.e., GPUs, memory). Therefore, having an elevated number of early exits placed at the Edge level does not guarantee a faster response time.
Those early exits that are placed too early in the model will not have enough prior layers in the model to provide a good prediction. Hence, we have to consider the DNN stack and the actual size of a model before introducing these branches for early exiting and partitioning the model. These aspects result in a trade-off between the accuracy and the size of the DNN model and of the sub-models generated once it is partitioned.

As a result, in our DDNN application we partitioned the model into two pieces, as shown in Fig. 5. Note that one of them is much smaller than the second one, so that it can be placed in the computers that comprise an Edge cluster, while the larger one is to be placed in a Cloud cluster. We can then allocate these two parts of the model (i.e., two sub-models) in the different computers that make up the evaluation environment, which will be introduced in Section 5.2, reaching a suitable accuracy level.

This model was trained using the CIFAR10 dataset, which consists of 60,000 32 × 32 color images in 10 well-balanced classes. First, the full model without the Edge exit was trained with 80 percent of the training set (40,000 32 × 32 color images; 10,000 images are used for the validation set). Then, all layers were frozen. The Edge exit, which consists of the Flatten, Dense and edge_output layers shown in Fig. 5, was added and this branch was trained in the same way as the rest of the model, without affecting the weights already trained.
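The two-stage training just described, together with the cut into the two deployable sub-models, can be summarized in the following simplified sketch. It uses a small VGG-style network rather than the actual VGG16-based definition shared in the repository; layer names and sizes are illustrative:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Main path of a toy VGG-style model, with a named intermediate 'cut' layer.
inputs = tf.keras.Input(shape=(32, 32, 3))
x = layers.Conv2D(32, 3, padding='same', activation='relu')(inputs)
x = layers.MaxPooling2D()(x)
cut = layers.Conv2D(64, 3, padding='same', activation='relu', name='cut')(x)
y = layers.MaxPooling2D(name='cloud_pool')(cut)
y = layers.Flatten(name='cloud_flatten')(y)
y = layers.Dense(128, activation='relu', name='cloud_dense')(y)
cloud_output = layers.Dense(10, activation='softmax', name='cloud_output')(y)

# Stage 1: train the full model without the Edge exit.
base = tf.keras.Model(inputs, cloud_output)
base.compile('adam', 'sparse_categorical_crossentropy', ['accuracy'])
# base.fit(x_train, y_train, validation_data=(x_val, y_val), ...)

# Stage 2: freeze the trained layers, add the early-exit branch
# (Flatten -> Dense -> edge_output) and train only this branch.
for layer in base.layers:
    layer.trainable = False
b = layers.Flatten()(cut)
b = layers.Dense(64, activation='relu')(b)
edge_output = layers.Dense(10, activation='softmax', name='edge_output')(b)
branchy = tf.keras.Model(inputs, [edge_output, cloud_output])
branchy.compile('adam', 'sparse_categorical_crossentropy', ['accuracy'])
# branchy.fit(x_train, [y_train, y_train], ...)

# Cut at the level of the early exit: the Edge sub-model returns the early-exit
# prediction and the intermediate tensor; the Cloud sub-model reuses the
# already trained upper layers starting from that tensor.
edge_sub = tf.keras.Model(inputs, [edge_output, cut])
cloud_in = tf.keras.Input(shape=cut.shape[1:])
z = cloud_in
for name in ('cloud_pool', 'cloud_flatten', 'cloud_dense', 'cloud_output'):
    z = branchy.get_layer(name)(z)
cloud_sub = tf.keras.Model(cloud_in, z)
# edge_sub.save('edge_submodel'); cloud_sub.save('cloud_submodel')
```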
Once the model is trained (e.g., through Kafka-ML or a Jupyter notebook) and cut at the level of the early exit, each DDNN part can be deployed through Kafka-ML in different instances of this program running on different machines with Zookeeper, Kafka, and its topics correctly configured. As we include Kubernetes as the container management platform, we have used Docker images for Apache Zookeeper4 and Kafka5 provided by third parties to easily configure them.

We have developed a container-based Python application to facilitate the inference of the BranchyNet-based model over different architecture levels by using the same technologies as Kafka-ML. These technologies are Kafka6 and TensorFlow7 [21]. This application is packed and executed in Kubernetes as a service, i.e., in a Docker container. Kafka-ML components (Fig. 4) are also packed in Docker containers and deployed in Kubernetes.

4 https://github.com/31z4/zookeeper-docker.
5 https://github.com/wurstmeister/kafka-docker.
6 kafka-python: https://pypi.org/project/kafka-python/.
7 For both training and development, we have used TensorFlow 2.3.0.

The code of the implementation8 and Kafka-ML9 are both open-source projects. Furthermore, in the GitHub repository of this implementation, we have also shared the trained model based on VGG16 and BranchyNet for the sake of reproducibility, as well as the Docker and Kubernetes configuration files used during the evaluation of the proposed framework.

8 https://github.com/ertis-research/DDNN.
9 https://github.com/ertis-research/kafka-ml.

5.2. Evaluation

To evaluate this framework, we have used the value of 0.8 as a threshold. Selecting a threshold is a trade-off between reliable and time-sensitive predictions. We have chosen this threshold empirically because most of the early exit responses that return a probability higher than this threshold value match the class returned in the Cloud. Hence, when the probability of the class returned by the Edge early exit is higher than the specified threshold, the DDNN application in the Edge does not continue the inference process. Therefore, it does not send any images or information to continue the process in the Cloud. However, the predictions made by the sub-model placed in the Edge system may be less reliable than those made in the Cloud, as the sub-model placed in the Cloud comprises more layers of the DDNN. Note that this threshold can also be conveniently configured in the framework according to the needs of the target application.

Results are obtained using the CIFAR10 test set, comprising 10,000 32 × 32 color images, through an environment composed of 5 devices sending the same data at the same time, an Edge infrastructure, and a Cloud system. The Cloud deploys a Kubernetes cluster with 3 nodes, 6 vCPUs, and 12 GB of memory in Google Cloud. We have placed the Edge infrastructure at the University of Malaga; it comprises 3 computing nodes connected to the external streaming devices. These 3 computers form a cluster in Kubernetes and have different hardware configurations. The first of them works with an i7-4790 at 3.60 GHz and an 8 GB memory configuration. The second one has an i5-7400 at 3.00 GHz and 16 GB of memory, and the last of them works with an i7-10700 at 2.90 GHz and 32 GB of memory.
The devices generating CIFAR10 data streams are deployed at the University of Malaga on 5 different computers, each with an i7-4790 at 3.60 GHz and 8 GB of memory. Edge computers are thus placed in the same network where the information is produced (University of Malaga) to reduce the network latency.

To sum up, this is the infrastructure used to deploy the DDNN application in the continuum:

• Devices deployed in 5 computers at the University of Malaga (Spain), sending CIFAR10 data streams (see the sketch below).
• Edge layer: 3 computers at the University of Malaga (Spain).
• Cloud layer: Google Cloud, Europe West 1 zone (Belgium).
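For illustration, a streaming device could publish the CIFAR10 test images to the Edge input topic as follows (the topic name, broker address, and raw-bytes encoding are assumptions consistent with the earlier inference sketch, not the exact ones used in the evaluation):

```python
import numpy as np
from tensorflow.keras.datasets import cifar10
from kafka import KafkaProducer

# Illustrative sketch of one streaming device: it publishes the CIFAR10 test
# images to the Edge input topic as raw uint8 bytes.
(_, _), (x_test, _) = cifar10.load_data()          # 10,000 32x32x3 images
producer = KafkaProducer(bootstrap_servers='edge-kafka:9092')

for image in x_test:
    producer.send('edge-input', image.astype(np.uint8).tobytes())
producer.flush()
```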
We compare the results for the Edge-Cloud architecture with another architecture that depends only on the Cloud. In the Cloud-only case, the model used consists of the model shown in Fig. 5 as a single piece, i.e., without the branch for the early exit. First, we show the average accuracy for two different collections of the CIFAR10 test set in Fig. 6.

Fig. 6. Resulting accuracy for the CIFAR10 test set using an Edge-Cloud architecture and a Cloud-only system.

On the one hand, the global accuracy shows the average accuracy for the whole CIFAR10 test set when using the complete model (i.e., without early exits) in the Cloud-only architecture and when using the model based on VGG16 and BranchyNet (Fig. 5), i.e., with early exits, in the Edge-Cloud architecture. Despite the fact that the model evaluated is the same in both cases, in the Cloud-only case the inference is processed using all the layers of the model, while in the Edge-Cloud case early stopping (with an early-exit branch) is considered. Therefore, as expected, the accuracy obtained in the Cloud-only case is higher than the accuracy obtained in the Edge-Cloud case. However, the results obtained by the Edge-Cloud architecture are still suitable enough.

On the other hand, we show the accuracy for the set of images that have been answered at the Edge level (early exit) of the Edge-Cloud architecture, 84.18%. This result is compared with the accuracy obtained in the Cloud-only architecture for the same set of images (92.49%). The values observed in this comparison show the main disadvantage of the BranchyNet approach. Using early exits does not increase the accuracy of the model, but decreases it while saving time during the inference process. This confirms the threshold trade-off discussed. Therefore, running a complete DNN model in the Cloud can return more reliable predictions, but considering early exits in an intermediate architecture layer (e.g., Edge) will reduce the response time while maintaining an acceptable degree of reliability (in this case, > 80%).

Next, we have evaluated the framework by comparing the Edge-Cloud environment with the Cloud-only environment in a simple Cloud-to-Things deployment, using the following configuration in both layers of the architecture (Edge on 1 computer and 1 node in Google Cloud) in order to measure the response time:

• 1× Apache Kafka broker (1 computer)
• Kafka topics configured with 1 partition and without replication

For this test, results were obtained after sending all CIFAR10 test images from a single computer and are shown in Fig. 7. As with accuracy, we show the average response time using two different sets. Using the simple deployment mentioned, the average Edge-Cloud response time is much higher than the Cloud-only architecture response time. The set of images that have been answered at the Edge corresponds to 5416 images, which are more than half of the CIFAR10 test set. Thus, the rest of the requests are much slower, since they have to go twice (for the request and response) through all the Cloud-to-Things continuum layers defined. This means that a Cloud-only architecture gives better response time results for the complete test set when using this simple deployment (with 1 computer in the Edge and a single node in Google Cloud). However, the Edge-Cloud architecture shows a significant improvement over the Cloud-only architecture regarding the response time (0.0638 s) when comparing the results for the set of images that have been answered at the Edge level. Thereby, more than 54% of the responses were provided by the Edge early exit, which has reduced the Edge response time for these prediction requests.

Fig. 7. Resulting response time of the distributed architecture versus a Cloud-only deployment ingesting CIFAR10 data streams. One Kafka broker, 1 partition, and 1 client are used.

Finally, we have evaluated the framework in a higher performance scenario. This scenario makes use of the described 3 Edge computers and the 3 nodes of Google Cloud. We have deployed a cluster of 3 Kafka brokers for each architecture level (i.e., Edge and Cloud) while varying topic replication and the number of replicas for high availability and fault tolerance. In this case, a replica of the DDNN inference module is deployed in Kubernetes with each partition to distribute the load of the system among the 5 devices sending CIFAR10 data streams. Fig. 8 shows the response time of our framework versus a Cloud-only deployment. As a result, with 1 partition, the total average Edge response time (including early exits and those that go to the Cloud) is higher due to the overhead of clients. However, with two partitions, the load is distributed among the DDNN deployments, and results show lower response latency in our distributed framework than in the Cloud-only architecture. The overload in the Cloud system caused by the increase in the number of partitions and replicas may be due to the infrastructure available in Google Cloud (6 vCPUs and 12 GB of memory). Fig. 9 shows the speed-up of our architecture (early exits and total Edge time) with regard to the Cloud deployment. The best result is obtained with 4 partitions and replicas, reaching a speed-up of 23× for early exits and 8× for the total Edge response time.

Fig. 8. Response time of the distributed architecture versus the Cloud-only deployment. Three Kafka brokers and 5 clients are used. Partitions and replicas change.

Fig. 9. The speed-up of the distributed architecture versus the Cloud-only deployment. Three Kafka brokers and 5 clients are used. Partitions and replicas change.
These results demonstrate that considering early exits in DDNN combined with continuum architectures could drastically reduce response time, which is essential in AI and time-sensitive applications, such as self-driving, civil infrastructure monitoring, and eHealth. Despite the fact that the accuracy is slightly decreased due to early stopping during the inference process across the Cloud-to-Things continuum, it can be adjusted to satisfy the requirements of different scenarios by modifying the threshold value. Time-sensitive applications require working with response times in the order of milliseconds. Thus, the faster a system gives a prediction, the better its fitness for purpose.

6. Related work

6.1. DDNN partitioning

One of the first works on DNN branches for early exits was BranchyNet [10]. An extension of that work [7] proposed the adaptation of BranchyNet to the Cloud-to-Things continuum through horizontal aggregations. This provides system fault tolerance with a 20× reduction in communication cost compared to offloading the whole computation to the Cloud. The main drawback of these approaches is that the architecture presented is statically deployed and DDNN applications do not adapt to continuous changes. For instance, in case of a node failure in these architectures, a DDNN application would stop its service, whereas in our architecture Kubernetes would automatically reallocate the DDNN application container to another available node to restart the service. An adaptive surgery [25] scheme dynamically splits DDNN between the Edge and the Cloud to optimize both the latency and the throughput under variable network conditions. In the continuum, Fog can also play a role besides Edge and Cloud, and DINA [26] presents a fine-grained solution based on matching theory for dynamic DDNN partitioning in Fog networks. These approaches do not consider the instances when the inference stops at the middle layers (early exits), which can also reduce the network traffic [27] and the computing capacity [28].
Other studies show that response time is accelerated whilst network congestion is reduced by combining Cloud and Edge environments [29]. These approaches are also considered in video and image analysis in smart city applications, which entail a great computational cost and a huge amount of data sent through the network, resulting in considerable delay reductions [30]. Yet, early exits at the middle layers are not contemplated, and they could result in a further improvement in response time and a significant decrease in the number of messages sent to the Cloud.

6.2. Cloud-to-Things continuum architectures for DDNN

As stated, Kubernetes enables the management and monitoring of all kinds of applications in a cluster of nodes. Kubeflow [31] is an open-source ML toolkit for Kubernetes. It does not provide support for data streams as the framework proposed and Kafka-ML do, but it allows for the configuration of multiple steps of AI and ML pipelines, such as hyperparameters and pre-processing. These solutions are not designed for the flexible support of DDNN over the Cloud-to-Things continuum.

EdgeLens [32] and HealthFog [33] provide frameworks to deploy deep learning-based applications in Edge-Fog-Cloud environments and improve the Quality of Service (QoS) for such applications. To reduce latency, EdgeLens scales down in resolution in order to shorten the delivery time. Both EdgeLens and HealthFog run non-distributed ML instead of adapting the DDNN themselves to the continuum.

IoTEF [34] provides a fault-tolerant architecture and unified management and monitoring for Cloud and Edge clusters; however, it does not allow the automation of DDNN in order to accomplish other QoS aspects, such as latency and inference optimization beyond fault tolerance.

7. Conclusion

This paper addresses the distribution of DNN over the Cloud-to-Things continuum for those mission-critical applications that require low-latency responses. It has been demonstrated that the partitioning of deep neural networks can be better adapted to the needs of the heterogeneous devices that host them and improve response times (early exits), security, and communication requirements. To the best of our knowledge, this work is the first to fully place the BranchyNet approach in the continuum, providing a fault-tolerant and low-latency framework, and an architecture and effective communication layers that offer better support to DDNN in this field. The proposed framework is open-source and available on GitHub.10 The Kafka-ML extension, which addresses the comprehensive integration of data streams and ML frameworks and is also part of this work, is also available on GitHub.11

10 https://github.com/ertis-research/DDNN.
11 https://github.com/ertis-research/kafka-ml.

To accomplish the DNN distribution, especially in large DNN (e.g., ResNet-152), we partition the layers of the neural network. Apart from being a non-trivial task, as it constitutes a trade-off between computation and transmission costs, the partitioning of DDNN can have implications in the prediction accuracy, and the response latency could also be affected. Moreover, infrastructure capabilities may vary largely in heterogeneous and dynamic environments, which can also affect the pre-established partitioning strategy. Network conditions can also vary, e.g., the throughput can decrease by 10 times in LTE networks during peak hours [25]. Therefore, dynamic, fault-tolerant, and auto-adaptive partitioning strategies for DDNN inference acceleration should be released to ensure and maintain flexibility. We envisage that new micro-service components could be defined in Kafka-ML to continuously monitor the available continuum infrastructure [35] and dynamically decide where to cut and where to deploy DDNN applications to optimize the response latency. This will be explored as future work in order to adapt DDNN to the current status of infrastructures (hardware + networking) by allocating DDNN layers at the right place at the right time in the Cloud-to-Things continuum.

CRediT authorship contribution statement

Daniel R. Torres: Implementation of the framework and evaluation, First manuscript draft. Cristian Martín: Supervised the research and conceptualization, Kafka-ML development and conceptualization, First manuscript draft. Bartolomé Rubio: Supervised the research, Conceptualization, Manuscript review, Funding. Manuel Díaz: Supervised the research, Conceptualization, Manuscript review, Funding.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.