Preprint
The DeepHealth HPC Infrastructure: Leveraging Heterogeneous HPC and Cloud Computing Infrastructures for AI-based Medical Solutions
Publisher: CRC Press
Published version: DOI 10.1201/9781003176664-10
Terms of use: Open Access
Anyone can freely access the full text of works made available as "Open Access". Works made available
under a Creative Commons license can be used according to the terms and conditions of said license. Use
of all other works requires consent of the right holder (author or publisher) if not exempted from copyright
protection by the applicable law.
Chapter 10: The DeepHealth HPC Infrastructure: Leveraging Heterogeneous HPC and Cloud Computing Infrastructures for AI-based Medical Solutions
(0000-0003-4274-8298)
UPV: Rafael Tornero, Jose Maria Martínez, David Rodriguez, Izan Catalán, Jorge García,
1. Introduction
Deep learning (DL) is increasingly being considered to effectively support medical diagnosis
of complex diseases. DL includes two main operations: (1) training, which refers to the
process of generating a predictive model, in the form of a Deep Neural Network (DNN),
based on large data-sets composed of biomedical information (e.g., medical images); and (2)
inference, which refers to the process of predicting a diagnosis based on a reduced data-set.
Deep learning training is the most computationally intensive operation, requiring very large memory and computing power. The training operation is an iterative process, requiring many iterations over all the data-set samples to properly adjust the model, i.e., the weights of the DNN. In general, larger datasets allow obtaining predictive models with higher accuracy, which, in turn, increases the running time needed for the training procedure.
This makes it unfeasible to run training operations on general-purpose computing systems, such as those available in hospitals, for large enough medical datasets, even when such computers are equipped with powerful processors featuring many cores or GPU acceleration devices. Distributed training methods, which exploit HPC infrastructures composed of tens or hundreds of computing nodes, allow splitting the training operation into smaller datasets upon which parallel training operations can be applied. These methods are mandatory when the samples do not fit into a single computing node. As an example, the data-set provided by one of the project use cases is 141 GB in size, which does not fit into the memory of a single computing node. Moreover, random transformations (data augmentation) are typically applied to the images to mitigate the problem of overfitting; they are applied on-the-fly, because each unique image from the dataset may require to be transformed in a different way at every iteration, making the training process even more computationally expensive. It is important to remark that medical imaging datasets are typically composed of very large, high-resolution images.
DeepHealth has developed an HPC toolkit capable of efficiently exploiting the computing capabilities of HPC and cloud infrastructures in a transparent way. To do so, the data/computer scientists only need to describe in a file (CSV, JSON, XML or similar) the set of computing nodes to be used during the training process and launch it.
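As a purely illustrative example (the chapter does not fix a schema for this file, so the field names below are hypothetical), such a node-description file in JSON could look like the following:

{
  "nodes": [
    { "hostname": "node01.hpc.example.org", "cpu_cores": 48, "gpus": 2 },
    { "hostname": "node02.hpc.example.org", "cpu_cores": 48, "gpus": 2 }
  ]
}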
HPC facilities are not well suited for every kind of application. Queue-based workload
managers that commonly orchestrate HPC centers cannot satisfy the strict time-to-solution
requirements of the inference phase, and air-gapped worker nodes are not able to expose
user-friendly web interfaces for data visualisation. To address these issues, the DeepHealth HPC toolkit also supports cloud environments, with innovative hybrid solutions spanning private and public clouds and a new Workflow Management System (WMS) capable of orchestrating hybrid workflows on top of heterogeneous HPC and cloud architectures.
Overall, the DeepHealth HPC toolkit provides the computer vision and deep learning functionalities included in the European Computer Vision Library (ECVL) and the European Distributed Deep Learning Library (EDDL)1 with the HPC capabilities needed to efficiently exploit the computing resources of HPC and Cloud infrastructures. Both libraries are also developed within the context of the DeepHealth project.
EDDL is a general-purpose deep learning library initially developed to cover the deep learning needs of the healthcare use cases within the DeepHealth project. EDDL provides the usual deep learning functionalities and the implementation of the necessary tensor operators, activation functions, regularization functions, optimization methods, as well as all the layer types needed to implement state-of-the-art neural network topologies.
In order to be compatible with existing developments and other deep learning toolkits, the EDDL uses ONNX2, the standard format for neural network interchange, to import and export neural networks including both weights and topology. As part of its design to run on
1 https://github.com/deephealthproject
2 “Open Neural Network Exchange. The open standard for machine learning interoperability,”
distributed environments, the EDDL includes specific functions to simplify the distribution of batches when training and inference processes are run on distributed computing infrastructures. The EDDL serializes networks using ONNX to transfer weights and gradients between the master node and the worker nodes. The serialization includes the network topology, the weights and the biases. To facilitate distributed learning, the serialization functions allow selecting whether the weights or the gradients of the network are included.
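To make this concrete, the following minimal sketch uses pyeddl, the Python bindings of the EDDL; the toy topology, hyper-parameters and file name are illustrative only:

# Hedged pyeddl sketch of the ONNX import/export described above (toy model).
from pyeddl import eddl

in_ = eddl.Input([784])
out = eddl.Softmax(eddl.Dense(eddl.ReLu(eddl.Dense(in_, 64)), 10))
net = eddl.Model([in_], [out])
eddl.build(net, eddl.sgd(0.01), ["soft_cross_entropy"], ["categorical_accuracy"],
           eddl.CS_CPU())

eddl.save_net_to_onnx_file(net, "model.onnx")        # serializes topology and weights
net2 = eddl.import_net_from_onnx_file("model.onnx")  # reload, e.g., on a worker node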
The next sections present the parallel strategies implemented to execute the EDDL operations and the frameworks that support them: COMPSs and StreamFlow.
2.1. COMPSs
COMPSs is a task-based programming framework, developed at the Barcelona Supercomputing Center, whose objective is to facilitate the parallelization of sequential source code written in Java or Python. The programmer identifies the candidate functions to be run as parallel tasks (the COMPSs tasks) and the data dependencies existing among them by annotating the sequential source code (using annotations in the case of Java or standard decorators in the case of Python).
Figure 1 shows a snippet (simplified for readability purposes) of the parallelisation of the EDDL training operation with COMPSs. COMPSs tasks are identified with the standard Python decorator @task (lines 1 and 5). The IN, OUT and INOUT arguments define the data directionality of function parameters; by default, parameters are IN, and so there is no need to specify them explicitly. The build task is executed on all the worker nodes for initialization purposes; otherwise, a task executes on the available computing resources. The training loop iterates over num_epochs epochs (line 13). At every epoch, num_batches batches are executed (line 14), each instantiating a new COMPSs task (line 15) with an EDDL train_batch operation. All COMPSs tasks are synchronized at line 17 with compss_wait_on, and the partial weights are collected. The gradients of the model are then aggregated in the master to update the weights of the network.
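Since the listing of Figure 1 is not reproduced here, the following sketch gives the flavour of the pattern just described, using the PyCOMPSs API; eddl_build, eddl_train_batch and update_gradients are illustrative placeholders rather than the actual DeepHealth code.

# Hedged PyCOMPSs sketch of the parallel training pattern described for Figure 1.
from pycompss.api.task import task
from pycompss.api.api import compss_wait_on
from pycompss.api.parameter import IN

@task(returns=1)
def build(net):
    # Initializes a network replica (replicated on every worker in the real code)
    return eddl_build(net)

@task(net=IN, batch=IN, returns=1)
def train_batch(net, batch):
    # Runs one EDDL train_batch on a data partition and returns the partial weights
    return eddl_train_batch(net, batch)

def train(net, dataset, num_epochs, num_batches):
    net = compss_wait_on(build(net))
    for epoch in range(num_epochs):
        partial = [train_batch(net, dataset[b]) for b in range(num_batches)]
        partial = compss_wait_on(partial)    # synchronization point (line 17 in Figure 1)
        update_gradients(net, partial)       # aggregate the partial weights on the master
    return net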
The task-based programming model of COMPSs is supported by its runtime system, which manages several aspects of the application execution and keeps the underlying infrastructure transparent to the application. The COMPSs runtime follows a master-worker structure:
● The master, executed in the computing resource where the application is launched, is
responsible for steering the distribution of the application and data management.
● The worker(s), co-located with the master or in remote computing resources, are in charge of executing the COMPSs tasks assigned to them by the master.
One key aspect is that the master maintains the internal representation of a COMPSs application as a Directed Acyclic Graph (DAG) to express the parallelism. Each node corresponds to a COMPSs task and edges represent data dependencies (and so potential data transfers). As an example, Figure 2 presents the DAG representation of the EDDL training operation of Figure 1.
Based on this DAG, the runtime can automatically detect data dependencies between
COMPSs tasks: as soon as a task becomes ready (i.e., when all its data dependencies are
honoured), the master is in charge of distributing it among the available workers, transferring
the input parameters before starting the execution. When the COMPSs task is completed, the
result is either transferred to the worker in which the destination COMPSs task executes (as dictated by the data dependencies of the DAG) or retrieved by the master when the application synchronizes on it.
One of the main features of the COMPSs framework is that it abstracts the parallel execution
model from the underlying distributed infrastructure. Hence, COMPSs programs do not
include any detail that would tie them to a particular platform, boosting portability among
diverse infrastructures and so enabling their execution both in a classical HPC environment and in the cloud, regardless of where the COMPSs tasks execute. Internally, the COMPSs runtime implements different adapters to support
the execution of COMPSs tasks in a given resource. Through a set of configuration files,
the user specifies the available computing resources, which may reside in a computing cluster
or in the cloud.
Figure 3 shows an example of this deployment. The execution starts in the Computing
Resource 1, where the COMPSs master executes. Then four workers are deployed on four different resources to distribute the workload, among which the EDDL training operations are distributed.
In the HPC case, the deployment targets the MareNostrum supercomputer3 through a cluster module, in which the COMPSs workers are executed in the different MareNostrum computing nodes, each equipped with two Intel Xeon Platinum 8160 CPUs (24 cores each at 2.10 GHz), 96 GB of main memory and 200 GB of local SSD available as temporary storage during jobs. The COMPSs runtime is then responsible for distributing the parallel version of the EDDL training operation among these nodes.
2.2. StreamFlow
StreamFlow4 is a Workflow Management System designed to orchestrate hybrid workflows on top of heterogeneous and distributed architectures, ranging from practitioners’ desktop machines to entire HPC centres. In particular, each step of a complex pipeline can be scheduled on the most efficient infrastructure, with the underlying run-time layer automatically taking care of the worker nodes’ lifecycle, data transfers and step scheduling.
The basic idea behind the StreamFlow paradigm is to easily express correspondences
between the description of a coarse-grain application workflow, i.e., a graph containing the
application steps with the related data dependencies, and the description of an execution
environment, i.e. a manifest defining the capabilities of a target infrastructure. Starting from
such a description, the StreamFlow run-time layer is able to orchestrate both the worker nodes’ lifecycle and the execution of the workflow steps on top of them.
3 https://www.bsc.es/marenostrum/marenostrum
4 I. Colonnelli, B. Cantalupo, I. Merelli, and M. Aldinucci, “StreamFlow: cross-breeding cloud with HPC,” IEEE Transactions on Emerging Topics in Computing, doi:10.1109/TETC.2020.3019202
With respect to the majority of WMSs on the market, StreamFlow gets rid of two common
design constraints:
● There is no need for a single shared data space accessible from all the worker nodes involved in a workflow execution.
● Workflow steps can be offloaded to complex, multi-agent execution environments.
In StreamFlow, such an execution environment, called a model, constitutes the unit of deployment, i.e., all its components are always co-allocated when
executing a step. Each agent in a model, called service, constitutes the unit of binding, i.e.,
each step of a workflow can be bound to a single service for execution. Finally, a resource is
a single instance of a potentially replicated service and constitutes the unit of scheduling, i.e., each workflow step is scheduled for execution on one such resource. As an example, a Helm chart describing a COMPSs master pod and four COMPSs worker pods
constitutes a model with two services, the former with one resource and the latter with four
resources.
Figure 4 shows StreamFlow’s logical stack. The Deployment Manager is the component in charge of creating and destroying models when needed. To do that, it mainly relies on a set of connectors, each one implementing the interaction with a specific kind of target infrastructure. In turn, the Scheduler component is in charge of selecting the best resource on which each workflow step should be executed, while guaranteeing that all requirements are satisfied. Finally, the Data Manager, which knows where each step’s input and output data reside, must ensure that each service has access to all the data dependencies required to complete the assigned step, performing data transfers when needed.
Instead of coming with yet another way to describe workflow models, StreamFlow relies on the Common Workflow Language (CWL), an open standard for describing analysis workflows, following a declarative JSON or YAML syntax. Being CWL a fully declarative language, it is far simpler to understand for domain
experts than its Make-like or dataflow-oriented alternatives. Moreover, the fact that many products offer support for CWL, either alongside a proprietary coordination language or at their core, eases the portability of workflow descriptions across different systems.
It is also worth noting that StreamFlow does not need any specific package or library to be
installed on the target execution architecture other than the software dependencies required
by the host application (i.e., the involved workflow step). This agentless nature allows virtually any target architecture reachable by a practitioner to become a potential execution environment for StreamFlow workflows.
3. Cloud Infrastructures
The hybrid cloud is the combination of Private Cloud and Public Cloud, allowing the
exploitation of the best of both types. There are several reasons why it may be interesting to
have a hybrid cloud. The three most typical cases are the following:
● Security: Sensitive information is kept in the private part of the cloud to minimize potential risks.
● Cost: The presence of sporadic peaks in the computing workload can be served with public cloud resources, avoiding the need to oversize the private infrastructure to absorb the peaks. By doing so, the private part of the hybrid cloud is used for the usual load, and the public part is used to absorb the peaks while keeping the service alive.
● Availability: In the same way as with a Content Delivery Network (CDN), a hybrid cloud can improve the availability of a service by replicating it across the private and public parts.
The DeepHealth HPC toolkit includes an on-premise private cloud based on a Kubernetes
cluster (and hosted by the DeepHealth partner TREE Technology), and a public cloud part in Amazon Web Services (AWS).
By doing so, the DeepHealth hybrid cloud allows provisioning computing resources in a flexible way to accommodate project requests. GPU computing capabilities can be added
both through provisioning on the on-premise Kubernetes cluster and via nodes provisioned
using Amazon Elastic Kubernetes Service (EKS) technology on Amazon Web Services
(AWS). This hybrid infrastructure has two main pieces: Rancher and API Services.
1. Rancher5 is used for the deployment and management of the clusters. Rancher allows managing the operational and security challenges of multiple Kubernetes clusters across any infrastructure. Moreover, its web interface provides control over deployments, jobs and pipelines, and includes an app catalog for deploying applications packaged as Helm6 charts.
5 https://rancher.com/docs/rancher/v2.x/en/
6 https://helm.sh/
2. An API has been developed to facilitate the use and integration of the libraries with the Kubernetes platform. Moreover, a high-level REST API has been developed on top of the Kubernetes API to abstract the user from the infrastructure itself, hiding its complexity: it covers operations ranging from simple ones, like listing Pods, to more complex ones, such as exposing Pods, abstracting the user from the potentially complex configuration of the clusters (e.g., multi-cloud, hybrid cloud, etc.), as illustrated by the sketch below. The API itself supports the addition of new Kubernetes clusters, both on-premise and in the cloud.
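The endpoints of this API are not detailed in this chapter; the following sketch is therefore only a hypothetical illustration of how such a REST layer can be consumed from Python (the base URL, paths and payload fields are invented for the example):

# Hypothetical usage of a high-level REST API sitting on top of the Kubernetes API.
import requests

API = "https://deephealth-cloud.example.org/api/v1"   # placeholder base URL
headers = {"Authorization": "Bearer <token>"}          # placeholder credentials

# Simple operation: list the Pods of a given cluster
pods = requests.get(f"{API}/clusters/on-premise/pods", headers=headers).json()

# More complex operation: expose a group of Pods behind a single service
requests.post(f"{API}/clusters/on-premise/expose",
              headers=headers,
              json={"pods": [p["name"] for p in pods], "port": 8080})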
The DeepHealth hybrid cloud supports the deployment of the COMPSs and StreamFlow parallel frameworks presented in Section 2. The next sections describe this support.
Unlike the Linux-based infrastructure, there is no need to set up the execution environment in all the computing resources: only a Docker image must be available, e.g., on Docker Hub7. Figure 6 shows an example of this deployment. The execution starts in computing resource 1, where the COMPSs master executes. Then three workers are deployed in three different containers in the cloud infrastructure (in our case, the cloud is provided by TREE), where the COMPSs application is distributed (in our case, the EDDL training operations).
7 https://www.docker.com/products/docker-hub
Moreover, the COMPSs runtime is being adapted to support the cloud infrastructure provided by TREE. The cloud is based on Kubernetes (K8S)8 and allows managing applications in a container technology environment and automating the manual processes needed to deploy and scale containerised applications. The REST API described above abstracts the user from the infrastructure itself, speeding up the processes of deployment and management of the workflows. The COMPSs runtime interacts with this API to deploy workers in the Kubernetes clusters.
The DeepHealth hybrid cloud has been integrated with the COMPSs framework (see Section
2.1), which allows accelerating the Deep Learning training operations by dividing the training data sets across a large set of computing nodes available on cloud infrastructures, upon which partial training operations are then performed. This combination is done through the REST API, allowing COMPSs to abstract from the infrastructure and perform the deployment automatically.
Figure 6 shows a possible distribution of the execution of the parallel version of the EDDL training operation presented in Section 2.1 on the DeepHealth hybrid cloud, considering three COMPSs workers (i.e., setting the number of replicas to 3). The COMPSs runtime uses the DeepHealth cloud API to automatically deploy the master and the three replicas in which the COMPSs workers will be executed. Once the deployment is completed, the parallel execution of the training operation is initiated, and the COMPSs runtime starts the distribution of the different COMPSs tasks (i.e., the build and train_batch tasks shown in Figure 2), guaranteeing the data dependencies among tasks. In this case, the update_gradients function is executed in the COMPSs master to aggregate the partial weights computed at the worker nodes.
8 https://kubernetes.io/
The ODH (OpenDeepHealth) platform combines HPC and cloud execution models. The WMS introduced in Section 2.2 is the key technology enabling a transparent offloading of AI tasks to heterogeneous sets of worker nodes, hiding all deployment and data management details. The platform is composed of two components:
● The HPC component is C3S OCCAM (Open Computing Cluster for Advanced data Manipulation)9, which includes nodes equipped with NVIDIA Tesla K40 and V100 GPUs. Workloads are orchestrated through an elastic virtual farm of hardened Docker containers, running directly on top of the bare metal layer.
● The cloud component is the HPC4AI multi-tenant Kubernetes cluster, running on top of an OpenStack cloud. The underlying physical layer consists of high-end computing nodes equipped with 80-core Intel Xeon Gold CPUs and 4 NVIDIA GPUs each.
9 M. Aldinucci and others, “OCCAM: a flexible, multi-purpose and extendable HPC cluster.”
ODH implements a novel form of multi-tenancy called “HPC Secure multi-Tenancy” (HST), specifically designed to support AI applications on critical data. HST allows resource sharing on a hybrid infrastructure while guaranteeing data segregation among different tenants, both inside the cloud and between the HPC and cloud components. Moreover, access to each tenant is restricted to its own isolated environment.
The ODH platform fully integrates the DeepHealth toolkit via Docker containers, both on bare metal and in the multi-tenant Kubernetes cluster. The DeepHealth toolkit provides functionalities to be used for both training and inference, addressing the complexity of the different available computational resources and target architectures at both stages. The training step is typically executed in environments using specialized architectures, like HPC centres with FPGAs and GPUs, because it is important to optimize the number of samples processed per second without hindering the overall accuracy. In the inference step, based on pre-trained models and deployed in production environments (and even small devices at the edge), response time for each single prediction becomes the key requirement. Each step can thus be executed on the infrastructure that is more appropriate according to the specific characteristics of the computation. For instance, the computationally heavy training step can be initially tested on a multi-GPU node and then executed on the OCCAM HPC cluster, while the much more lightweight inference step can be offloaded to the HPC4AI cloud component, which also provides, when needed, an interactive inspection of the final results. An AI expert can simply launch
the pipeline directly from his computer using StreamFlow, which orchestrates the execution
of the first step on OCCAM and the second one on HPC4AI and manages all the required
data transfers in a fully transparent way. We demonstrated this approach with the Lung
Nodule Segmentation AI pipeline, a DeepHealth project use case (see Figure 8).
Moreover, the integration between StreamFlow and COMPSs provides the ability to perform distributed training in the ODH cloud environment, by exploiting the cloud capabilities of COMPSs.
The idea behind this overall approach is that the ability to deal with hybrid workflows can be
a crucial aspect for performance optimization when working with massive amounts of input
data and different needs in the computational steps. Accelerators like GPUs, and in turn different infrastructures like HPC and clouds, can be used more efficiently by selecting for each application the execution plan that better fits the specific needs of each computational step of the pipeline.
FPGAs are devices with reconfigurable logic that can be combined and interconnected in order to process a specific algorithm. Contrary to CPUs and GPUs, in an FPGA the hardware can be tailored to the target algorithm, enabling a high degree of optimization. On the other hand, the resources of an FPGA device are limited and the device usually works at a lower frequency. Therefore, a careful and custom design is needed in order to take benefit from them. FPGAs are nevertheless well suited for inference in Deep Learning.
FPGAs are supported in the DeepHealth project in two orthogonal, yet complementary and necessary, directions. First, large FPGA infrastructures are being adapted to the project by providing the suitable interfaces and protocols specific to FPGAs. Second, specific and optimized FPGA algorithms are developed for specific use cases within the project. In the next sections we describe the adaptations and developments performed for the infrastructure and then the specific developments targeting the use cases.
The DeepHealth toolkit includes the MANGO FPGA platform, developed within the
European MANGO project. As part of DeepHealth activities we are evolving this cluster of
FPGAs from a hardware prototyping platform to a high performance and low energy compute
platform. The platform consists of two clearly differentiated subsystems: the General-purpose
Nodes (GNs) and the Heterogeneous Nodes (HNs). The former is where the host applications, as well as the low-level communication libraries, run; the latter provide the FPGA-based acceleration resources.
● Each GN is a general-purpose server equipped with an Intel Xeon E5-2600 v3 processor, 64 GB of RAM and 1 TB of SSD storage. Each GN is connected via PCIe to two HNs, so it can use both HN subsystems. GNs are also connected to the HNs via Ethernet and USB for cluster management purposes.
● An HN (see Figure 9 and Figure 10) consists of 12 FPGA modules mounted on top of 4 proFPGA motherboards and placed in an FPGA cluster. This setup is extended with a total amount of 22 GB of RAM memory, split in several DDR3 and DDR4 modules, and combines V2000T, Xilinx Zynq 7000 SoC Z100 and Intel Stratix 10 SG280H FPGA modules. The HN uses cables for the communications between its FPGAs. The cables are arranged in such a way as to maximize the communication bandwidth of the overall system while keeping the throughput balanced between the different FPGA modules. The HN also includes one PCIe extension board that enables PCIe communications with the GN. Figure 10 shows the positioning and interconnections of the different FPGAs, memories and cables within an HN.
Overall, the complete DeepHealth FPGA-based infrastructure consists of a total of 4 GNs and
HNs.
With the objective of facilitating programmability, the DeepHealth FPGA platform divides the FPGA into two partitions: a static partition, known as the shell, and a reconfigurable partition (see Figure 11). The former remains unchanged while the host is up and running. When the host is booting, this partition is loaded into the FPGA from an external memory drive. Once loaded, it provides the required interfaces to communicate efficiently with attached peripherals and with the host the FPGA is connected to. In the MANGO cluster, the connection to the host is accomplished through a PCIe Gen3 x8 bus, offering a bidirectional raw bandwidth of approximately 16 GB/s. In addition, the static partition provides the clock and reset networks to the rest of the elements in the FPGA, such as kernels. On the contrary, the reconfigurable partition consists of a placeholder, inside the FPGA, that can be changed at runtime using the features that the shell provides. This partition contains the resources in which the application kernels are implemented and run.
The MANGO cluster is compatible with the OpenCL application programming interface
(API)11 for FPGA initialization, data transfer and kernel offloading, supporting both Xilinx
and Intel FPGAs. Kernels are the parts of the application that run on the FPGA, so as to accelerate the most computationally demanding operations.
11 khronos.org/opencl
Figure 11. FPGA logical design of two FPGAs in the same cluster connected with
Chip2Chip.
Furthermore, the FPGA design has been instrumented to support communication between the FPGAs. On the one hand, connections between FPGAs within a cluster use the I/O pins available on the devices. On the other hand, connections
between FPGAs at different clusters use the so-called Multi-Gigabit Transceivers (MGT).
With these communication channels, a GN has DMA access to all the devices in an HN cluster and is able to offload and control application kernels running on the different FPGAs. At the same time, HN clusters are able to communicate with each other. To accomplish these goals, the design incorporates an IP core provided by Xilinx named Chip2Chip (C2C). This IP core works like a medium access bridge connecting two devices over a memory-mapped AXI interface.
C2C can be configured to work in master or slave mode, as depicted in Figure 11. To connect
two FPGAs with C2C one device has to be configured as master and the other device as
slave. The rest of the FPGAs in a cluster can be connected similarly, following a daisy chain.
In this architecture, the host has access to the memory on the slave FPGA through the C2C
instances. Equally, provided the required logic is available, a kernel executing on the master
FPGA can access the memory of the slave FPGA. As a result, this communication interface
enables the host to offload and control kernels not only in the master FPGA, but also in the
slave one.
Multi-FPGA support
Another feature being researched within the DeepHealth FPGA infrastructure is the use of multiple FPGAs for the implementation of a single inference network. Uniting the compute and memory resources of all the FPGAs used makes it possible to implement neural networks with higher memory requirements (weights) and/or with a higher level of parallelism.
The N2D2 (Neural Network Design & Deployment) deep learning framework made by DeepHealth partner CEA12 (see Section 4.1.3 for further details) features a technology called dNeuro13, which generates the network as a netlist in a hardware description language, suitable for synthesis and implementation on an FPGA. dNeuro is optimized to use the available compute resources (mainly DSPs) as efficiently as possible, and to use only the embedded memory resources (block RAMs), also in the scope of efficiency. dNeuro generates the network according to specified constraints, allowing more or less parallelism depending on the number of available DSP units, and a number of DSPs higher than those available in a single FPGA can be specified.
12 https://github.com/CEA-LIST/N2D2
13 https://www.cea.fr/cea-tech/leti/Documents/démonstrateurs/Flyer_DNEURO.pdf
In that case, the multi-FPGA platform can be used similarly as if it were a single, larger FPGA. For this, the netlist has to be split
into multiple netlists, one for each FPGA. The partitioning must be as efficient as possible, in
order to maintain efficient communication between the FPGAs, both in terms of critical path
preservation and control of the number of inter-FPGA signals. To achieve this goal, DeepHealth is also investigating the development of an FPGA board (see Figure 12) optimized for inference. The key component of the board is an Intel Stratix 10 MX1650 or MX2100 FPGA. Both types are package- and pin-compatible and embed FPGA-internal High Bandwidth Memory (HBM). The MX2100 provides 16 GB of HBM memory and the MX1650 provides 8 GB of HBM memory; the MX2100 is therefore the preferred choice.
Figure 12. The new DeepHealth FPGA board optimized for inference operations.
Furthermore, SODIMM connectors are used for memories and peripherals which are
connected to regular FPGA I/Os. In DeepHealth these SODIMM extension board sites can be
used to attach additional memories to the FPGA depending on the application requirements.
Examples of such memories are DDR4 memory or high-speed SRAM memories. To communicate with external devices via high-speed interfaces, four QSFP28 interfaces are provided. In addition to these capabilities for external communication, general-purpose connectors are available on the board. Finally, a board support package (BSP) is provided so that OpenCL/HLS tools can be used for application development.
Pruned and quantized models enable the use of FPGA devices for energy-efficient inference processes when compared to GPUs or CPUs. This section presents the use of these two techniques within the project.
N2D2 is a comprehensive solution for fast and accurate DNN simulation and full, automated DNN-based application building. It is particularly useful for DNN design and exploration, allowing simple and fast prototyping of DNN models. Once the trained DNN performances are satisfying, an optimized version of the network can be exported for a given hardware target. Performance benchmarking can also be performed among different hardware targets. Various
targets are currently supported by the tool-flow: from plain C code to C code tailored for
High-Level Synthesis (HLS) with Xilinx Vivado HLS and code optimized for GPU. Various
optimizations are possible in the exports:
1. DNN weights and signal data precision reduction.
2. Rescaling of the outputs to [−1.0, 1.0] for signed outputs and [0.0, 1.0] for unsigned outputs. The optimal quantization threshold value of the activation output of each layer is determined using the dataset, and implies the use of additional shifting and clipping layers.
3. Quantization: inputs, weights, biases and activations are quantized to the desired precision (a minimal sketch of this step is given below).
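The following numpy sketch illustrates the basic idea behind step 3 for a single layer (signed 8-bit weights plus a per-layer scaling factor); the rounding policy and helper names are illustrative, not the actual N2D2 implementation:

# Toy illustration of per-layer 8-bit weight quantization with a scaling factor.
import numpy as np

def quantize_weights_int8(w):
    scale = np.max(np.abs(w)) / 127.0                        # per-layer scaling factor
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(512, 512).astype(np.float32)             # toy layer weights
q, scale = quantize_weights_int8(w)
w_hat = dequantize(q, scale)                                  # ~4x smaller storage, small error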
Interworking between the N2D2 framework and the EDDL library is possible thanks to the
ONNX file exchange format. This allows any neural network designer to take advantage of both environments. This integration is illustrated with the two flows outlined hereafter.
In order to obtain inference code making an intensive usage of integer operators rather than floating-point ones, the N2D2 quantization and export flow can be applied to a network trained with the EDDL and exchanged through ONNX. This flow is schematized in Figure 13. It can be adjusted to take advantage of the various capabilities of the two environments, allowing:
1. N2D2 users to access the FPGA-based HPC inference code generation provided by the DeepHealth infrastructure;
2. EDDL users to benefit from N2D2 training capabilities not available in EDDL yet.
The performance of the ISIC classification use case generated through this flow is detailed below.
With 63% classification accuracy, the quantized network shows the same accuracy as the floating-point version, with a memory footprint divided by four. The storage memory size for the weights and biases of the standard VGG16 network is estimated at 512.28 megabytes. When performing quantization, the weights are stored as signed 8-bit integers and the biases as signed 16-bit integers. Each layer requires an additional scaling factor, stored as an unsigned 8-bit integer. Coming from 32-bit values, the required storage size is divided by almost 4: the storage memory size for the weights and biases of the quantized VGG16 network is therefore estimated at slightly more than 128 megabytes.
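A quick back-of-the-envelope check of this factor, using only the figures quoted above:

# Rough arithmetic check of the ~4x reduction reported above (values from the text).
fp32_mb = 512.28                         # VGG16 weights and biases as 32-bit floats
approx_params = fp32_mb * 1024**2 / 4    # number of 4-byte parameters (~134 million)
int8_mb = approx_params / 1024**2        # the same parameters stored as 1-byte integers
print(round(int8_mb, 2))                 # ~128.07 MB, i.e. the footprint divided by ~4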
Similar results were obtained with different versions of the MobileNet network14. The
original version of MobileNet was trained on the ISIC dataset: recognition rates ranged from
73% to 79% depending on the alpha factor (from 0.25 to 1) which drastically impacts the
required memory size. Eight bits quantization and export times were in the range of 150 to
500 seconds, and the recognition rates of the quantized networks remained in the same range
as before. The estimated memory footprint was 440 kB (alpha = 0.25), 1250 kB (0.5) and
4050 kB (1) with estimated frame rate on FPGA ranging from 250 to 1780 frames per
second.
It is well known that many DNNs, trained on some tasks, are typically over-parametrized.15
DeepHealth also includes pruning techniques. The goal of these techniques is to achieve the
highest sparsity (i.e. the maximum percentage of removed parameters) with minimal (or no)
performance loss. One possible approach relies on a regularization strategy, used at training
14 https://arxiv.org/abs/1704.04861
15 H. N. Mhaskar, T. Poggio, “Deep vs. shallow networks: An approximation theory perspective.”
time, employing the sensitivity term16 in order to penalize the parameters which are the least relevant to the output. The sensitivity of a parameter w is defined as

S_w = \frac{1}{C} \sum_{k=1}^{C} \left| \frac{\partial y_k}{\partial w} \right|

where C is the number of output classes, y_k is the k-th output of the model and w is the evaluated parameter. The lower S_w, the less a change in w perturbs the output. Through this metric, it is possible to remove the parameters which impact the model’s performance the least. It has been recently shown that iterative pruning strategies enable higher sparsity to be reached than one-shot approaches.17
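A small numpy sketch of this ranking criterion is given below; grads is assumed to already contain the gradients of each output class with respect to the parameters of one layer (how they are obtained depends on the framework), and the 50% pruning ratio is arbitrary:

# Illustrative sensitivity-based ranking and pruning mask (toy data).
import numpy as np

def sensitivity(grads):
    # S_w = (1/C) * sum_k |dy_k/dw|, computed for every parameter of the layer
    return np.mean(np.abs(grads), axis=0)

grads = np.random.randn(10, 4096)        # C = 10 output classes, 4096 parameters
s = sensitivity(grads)
threshold = np.quantile(s, 0.5)
mask = s >= threshold                    # keep only the most sensitive half of the parameters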
Recently, a lot of attention has been devoted to the problem of so-called structured pruning: instead of focusing on single parameters, which enables only a reduced gain in terms of FLOPs and memory footprint at training time, pruning algorithms should be focused on entire structures, such as neurons. The pruning techniques studied in DeepHealth range from an unstructured
16 E. Tartaglione, S. Lepsøy, A. Fiandrotti, G. Francini, “Learning sparse neural networks via sensitivity-driven regularization,” NeurIPS 2018.
pruning algorithm which removes a very large quantity of parameters, to SeReNe,19 which is a structured approach working at the level of single neurons. Structured techniques, despite removing fewer parameters than the unstructured ones, bring significant advantages in terms of memory footprint and FLOPs reduction, achieving up to 2× footprint saving and 2× FLOPs reduction on architectures like ResNet and VGG-16, trained on state-of-the-art tasks, with no performance loss.
The DL kernel developed for the EDDL is specialized on the most frequent neural network layers used in the specific domain of the project (image-related classification and segmentation processes for health); therefore, convolutions, resizing and pooling operations are the focus of the kernel. The design of the kernel is performed using high-level synthesis (HLS). With HLS, the kernel is described in a high-level language (C/C++) and synthesized into hardware, which greatly reduces design time compared to register-transfer-level development.
The kernel design follows the dataflow model by Xilinx combined with the use of streams.
With this model a pipelined design between modules can be created and data can be
processed concurrently on all the connected modules. The use of streams enables
concurrency.
19 Tartaglione, Enzo, et al., “SeReNe: Sensitivity-based Regularization of Neurons for Structured Sparsity in Neural Networks.”
Figure 14. Baseline kernel design for convolutional operations on the EDDL.
Figure 14 shows the baseline design for the kernel targeting convolutional operations. Each
box represents a different module and arrows represent streams. The kernel reads data
(activations, bias, and weights) from a DDR memory attached to the FPGA and produces
features being written back to the DDR memory. The images (activations) are pipelined
through all the modules. The padding module provides padding support to the input images
read in a streaming fashion, and then forwards the padded image to the next module. The cvt
module converts the input stream into frames of pixels that will be convolved in the mul
module and reduced in the add module. Finally, the produced features are written back to
memory.
The design supports parallel access to memory in order to process a defined number of input (CPI) and output (CPO) channels. Therefore, the design can be customized to different parallelism levels. The number of parallel operations performed in each cycle is CPI x CPO, as each CPI channel contributes to each CPO channel.
Computing nodes can also be equipped with acceleration devices, featuring many-core fabrics or a set of GPU accelerator cards. In that regard, the EDDL
includes an API to build the neural network and the associated data structures (according to
the network topology) on the following acceleration technologies (named Computing Service
in EDDL nomenclature): CPU, GPU and FPGA. Tensor operations are then performed using
the hardware devices specified by means of the Computing Service provided as a parameter
to the build function. Moreover, the number of CPU cores, GPU cards or FPGA cards to be used is indicated by the Computing Service. This section presents the many-core and GPU acceleration included in the EDDL; see Section 4.1 for FPGA acceleration.
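A minimal sketch of this mechanism through the EDDL Python bindings (pyeddl) follows; the toy topology and hyper-parameters are illustrative only:

# Hedged pyeddl sketch: selecting an EDDL Computing Service at build time.
from pyeddl import eddl

in_ = eddl.Input([784])
out = eddl.Softmax(eddl.Dense(eddl.ReLu(eddl.Dense(in_, 128)), 10))
net = eddl.Model([in_], [out])

# The Computing Service selects the hardware used for the tensor operations, e.g.:
#   eddl.CS_CPU(-1)      -> all available CPU cores
#   eddl.CS_GPU([1, 1])  -> the first two GPU cards
#   (an FPGA Computing Service is also available when EDDL is built with FPGA support)
cs = eddl.CS_GPU([1])
eddl.build(net,
           eddl.sgd(0.01),              # optimizer
           ["soft_cross_entropy"],      # losses
           ["categorical_accuracy"],    # metrics
           cs)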
On many-core CPUs, tensor operations are performed by using the Eigen20 library, which relies on OpenMP21 for parallelization. When using GPU cards, the forward and backward algorithms are designed to minimize the number of memory transfers between the CPU and the GPU cards in use, according to the configuration of the Computing Service. EDDL incorporates three modes of memory management to address the lack of memory when a given batch size does not fit in the memory of a GPU. The most efficient one tries to allocate the whole batch in the GPU memory to reduce memory transfers to the minimum; the intermediate and least efficient modes allow working with larger batch sizes at the cost of increasing the number of memory transfers needed to perform the forward and backward steps for a given batch.
In the case of using more than one GPU in a single computer, the EDDL internally creates
one replica of the network per GPU in use, and automatically splits every batch of samples
20 G. Guennebaud, B. Jacob and others, “Eigen v3,” 2010. [Online]. Available: http://eigen.tuxfamily.org
21 https://openmp.org
into sub-batches, one sub-batch per GPU, so that each sub-batch is processed by one
GPU. Every time a batch is processed, the weights stored in each GPU are different and
weight synchronization every certain number of batches is required to avoid divergence. The weight synchronisation is done by transferring all the weights from GPU memory to CPU memory, computing the average, and transferring the updated weights back to the memory of all
the GPU cards in use, i.e., to all the replicas of the network. The computing service used for
defining the use of GPU cards has an attribute to indicate the number of batches between
weight synchronizations. As memory transfers between CPU and GPU must be reduced as
much as possible, a trade-off between performance and divergence must be reached by means
of this attribute, whose optimal value will vary depending on the data set used for training.
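A simplified numpy sketch of this synchronization step is shown below, assuming the per-replica weights have already been copied from the GPUs to CPU memory (the data layout is illustrative):

# Illustrative periodic weight synchronization: average the replicas' weights
# on the CPU and broadcast the result back to every replica.
import numpy as np

def synchronize_replicas(replica_weights):
    # replica_weights: one entry per GPU, each a list of per-layer weight arrays
    averaged = [np.mean(layers, axis=0) for layers in zip(*replica_weights)]
    return [list(averaged) for _ in replica_weights]

r0 = [np.ones((4, 4)), np.zeros(4)]        # toy replica 0 (two layers)
r1 = [3 * np.ones((4, 4)), 2 * np.ones(4)] # toy replica 1
r0, r1 = synchronize_replicas([r0, r1])
print(r0[0][0, 0], r0[1][0])               # 2.0 1.0 -> both replicas now hold the average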
EDDL support for GPUs has been implemented twice: by means of CUDA kernels developed as part of the EDDL code, and by integrating the NVIDIA cuDNN library. The choice between the two implementations is transparent to the programmers that use the EDDL; they only need to create the corresponding Computing Service to use all or a subset of the available GPU cards.
5. Conclusions
The DeepHealth HPC toolkit allows the European Distributed Deep Learning Library (EDDL) to
be efficiently executed on HPC and Cloud infrastructures. On one side, it includes HPC and
cloud workflow managers to parallelise the execution of EDDL operations, including the
parallelisation of the costly training operations; on the other side, it supports the most
common hardware acceleration technologies, i.e., many-cores, GPUs and FPGAs, to further
accelerate the training and inference operations in single computing nodes. Moreover, the
DeepHealth HPC toolkit provides the data/computer scientists with the level of abstraction needed to exploit heterogeneous HPC and cloud computing infrastructures without requiring specific expertise in parallel and distributed programming.