Power Machine Learning at Scale: Mapping Parallelized Modeling-to-HPC Infrastructure on AWS
May 2019
Notices
Customers are responsible for making their own independent assessment of the
information in this document. This document: (a) is for informational purposes only, (b)
represents current AWS product offerings and practices, which are subject to change
without notice, and (c) does not create any commitments or assurances from AWS and
its affiliates, suppliers or licensors. AWS products or services are provided “as is”
without warranties, representations, or conditions of any kind, whether express or
implied. The responsibilities and liabilities of AWS to its customers are controlled by
AWS agreements, and this document is not part of, nor does it modify, any agreement
between AWS and its customers.
© 2019 Amazon Web Services, Inc. or its affiliates. All rights reserved.
Contents
Introduction
Data Management
Distributed Computation Frameworks
Build Compute Clusters to Fit the Workload
Modeling Using Hybrid Infrastructures
Conclusion
Contributors
Further Reading
Document Revisions
Appendix
Abstract
This white paper presents best practices for executing machine learning (ML) workflows
at scale on AWS. It provides an overview of end-to-end considerations, challenges, and
recommended solutions for architecting an infrastructure appropriate for ML use cases.
The intended audience for this white paper includes IT groups, enterprise architects,
data scientists, and others interested in understanding the technical recommendations
for executing parallelized modeling at scale using High Performance Computing (HPC)
infrastructure on AWS.
Introduction
Businesses are generating, storing, and analyzing more data than ever before. With the
latest advances in machine learning (ML), there is a push to use these vast datasets to drive business outcomes. Although ML algorithms have been used for more than 20 years, the recent momentum in ML adoption is due to advancements in algorithmic frameworks and in the compute infrastructure used to run the algorithms, including computational accelerators. The resulting ability to iteratively and automatically apply complex mathematical calculations to datasets at scale, within relatively short timeframes, has added a new dimension of possibilities for ML analytics.
Enterprises increasingly rely on machine learning to automate tasks, provide
personalized services to their end users, and increase efficiency of their operations by
gathering and analyzing data from a variety of deployed devices and sensors. To
leverage economies of scale and increased agility, companies are moving their ML
workloads to the AWS Cloud. For example, Marinus Analytics is using face search and
recognition technologies to combat human trafficking. Petabytes of patient data stored in the Philips HealthSuite digital platform are being analyzed using ML methodologies. TuSimple is using ML in the AWS Cloud for the perceptual distance capabilities of its L4 autonomous driving systems.
The present capabilities to scale and accelerate ML workloads are based on High
Performance Computing (HPC) methodologies and applications. Modern HPC can use
Graphics Processing Units (GPUs) for general purpose computing (GPGPUs),
massively parallel data storage, and low-latency, high-bandwidth network
communication to solve compute and memory intensive problems. These problems are
common to many scientific research domains, including climate research, fluid
dynamics, and life sciences. These disciplines share computational needs with areas of
ML, such as deep learning (DL). The computational demands of DL workloads make
them ideal candidates to benefit from HPC methods.
Data Management
Data underpins every ML project and determines not only what ML approach is best,
but also what infrastructure is most suitable to support it. Data sources, frequency of
revisions or updates to the data, and where the data will be stored, are all fundamental
questions that should be addressed before a project begins. It is not uncommon for data
to require standardization, normalization, and enrichment before it can be analyzed
using ML approaches. Although it is peripheral to traditional HPC or core ML offerings,
AWS offers ETL (extract, transform, load) processing services that can help during the
preprocessing stages of your project. There are a variety of ways that different machine
learning analytical processes might ingest, create, and transfer data. To support each
phase of these workflows, you must have storage infrastructure and data management
tools. At each phase, you must consider the durability and availability of the data. To
determine your storage requirements, you must consider the size of the dataset and the
performance requirements for reading, writing, and transferring the data.
The amount of data required for any machine learning and deep learning predictive
modeling application is directly proportional to the complexity of the problem and
algorithm. To determine whether the dataset size is sufficient, many data scientists
use learning curves, k-fold cross-validation resampling on smaller datasets, and assessments of the confidence intervals of final results. In addition, there are
statistical heuristic methods for calculating an estimate for a suitable sample size based
on the number of classes, input features, and model parameters that should be
represented adequately by the training dataset. Because of these large data requirements, the datasets for
machine learning applications are often pulled from database warehouses, streaming
IoT input, or centralized data lakes. Many customers use Amazon Simple Storage
Service (Amazon S3) as a target endpoint for their training datasets. ETL processing
services (Amazon Athena, AWS Glue, Amazon Redshift Spectrum) are functionally
complementary and can be architected to preprocess datasets stored in or targeted to
Amazon S3. In addition to transforming data with services like Amazon Athena and
Amazon Redshift Spectrum, you can use services such as AWS Glue to provide
metadata discovery and management features. The choice of ETL processing tool is
also largely dictated by the type of data you have. For example, tabular data processing
with Amazon Athena gives you the ability to manipulate your data files in Amazon S3
using Structured Query Language (SQL). If your datasets or computations are not well suited to SQL, you can use AWS Glue to seamlessly run Spark jobs (with Scala and Python support) on data stored in your Amazon S3 buckets.
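For example, the following minimal sketch uses the AWS SDK for Python (Boto3) to run a SQL query with Amazon Athena against data stored in Amazon S3. The database, table, bucket, and Region names are hypothetical placeholders for your own resources.

import time
import boto3

# Run a SQL query with Amazon Athena against data already cataloged in AWS Glue.
athena = boto3.client("athena", region_name="us-east-1")

query = "SELECT label, COUNT(*) AS n FROM training_events GROUP BY label"

response = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "ml_dataset_db"},          # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
execution_id = response["QueryExecutionId"]

# Poll until the query finishes, then print the first page of results.
while True:
    state = athena.get_query_execution(QueryExecutionId=execution_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=execution_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])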
When you analyze the data, it is helpful to make a number of plots to visualize trends
that might inform your data preprocessing and modeling choices. Visualizations can
help identify accidental correlations, sampling biases, or non-response biases that are
indicative of whether your dataset is representative enough, and will generalize well to
new, unseen data. Data quality can be quantified by assessing frequency of errors or
missing values, noise-to-signal ratio, and heterogeneity of data distribution. AWS offers
two serverless options for visualizing data; which one you choose depends on how you prefer to work. If you prefer a Graphical User Interface (GUI), you can use
Amazon QuickSight to build plots in an interactive dashboard using data stored in
Amazon S3. If you are more comfortable writing code, Amazon SageMaker offers managed Jupyter notebooks that you can connect to data in your Amazon S3 bucket for visualization tasks, and from which you can run Amazon Athena queries and AWS Glue jobs.
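As a brief illustration, the following sketch shows a typical first look at a tabular dataset from a SageMaker-managed notebook. The bucket, key, and column names are hypothetical placeholders, and reading directly from Amazon S3 with pandas assumes the s3fs package is installed on the notebook instance.

import pandas as pd
import matplotlib.pyplot as plt

# Read a curated dataset directly from Amazon S3 (hypothetical bucket and key).
df = pd.read_csv("s3://example-ml-bucket/curated/training_data.csv")

# Quick data-quality checks: fraction of missing values per column,
# and how the (hypothetical) target column "label" is distributed.
print(df.isna().mean().sort_values(ascending=False).head())
df["label"].value_counts().plot(kind="bar", title="Class distribution")
plt.show()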
Improvements to the compute layer, which include faster processors and more cores, enable the processing of larger data workloads using increasingly complex algorithms.
Unfortunately, these improvements result in storage I/O processing bottlenecks when
you scale machine learning workloads. High performance parallelized file systems,
where metadata and I/O are managed by multiple or all compute nodes, can alleviate
bottlenecks that are introduced by file systems that rely on a single admin node for
routing management. The latency and response times required to keep a parallel file system synchronized are determining factors in its overall performance.
Deep learning training applications need storage systems that can constantly feed data
at low latencies and at high throughput. The I/O profiles for different machine learning
and deep learning applications are varied and can be random, which results in
significant performance challenges that you should carefully consider when you
architect for your use case. For example, GPU-accelerated analytics often use large
thread counts, each of which requires low-latency access to segments of the data input.
In the case of convolutional neural networks (CNNs) for object detection, random access backed by streaming bandwidth and fast memory mapping improves performance. High performance, random segment I/O access optimizes the performance of recurrent neural network (RNN) and natural language processing (NLP) workloads. A common bottleneck at higher levels of compute is insufficient data flowing to GPU workers. To avoid stalls, AWS provides flexibility by offering multiple storage options to keep GPUs saturated. For applications running on previous-generation NVIDIA GPUs, such as the K80-based P2 instances, Amazon S3 is sufficient to maintain the desired concurrency of I/O processing. For P3 instances that need higher throughput, you can select a shared or parallel file system, such as Amazon Elastic File System (Amazon EFS) or Amazon FSx for Lustre, that can elastically scale to demand.
Moving a large dataset from one location to another can be an expensive process, so
before you decide to move your data, make sure you have considered all the
requirements and implications. If your read demands are not expected to exceed the read performance that Amazon S3 can deliver, you don't need to move the
data to a higher performance storage location. If your business needs don’t require
faster training speeds, you can train your models at slower speeds and use a storage
location that matches those performance requirements.
In other situations, training speed can be essential to meeting your business needs. In
these scenarios, migrating your prepared dataset to a parallel file system can unlock
greater training performance. Table 1 shows an example of some different file systems
and the relative rate that they can transfer images to a compute cluster. This
performance reflects a single measurement and is only a rough guide to which file system you could use for a given workload. The specifics of a given workload might
change these results.
Table 1 – Comparison of the relative (to Amazon EFS) images per second that each file system can load

File system      Relative images per second
Amazon S3        <1.00
Amazon EFS       1
NVMe             1.4
RAMDisk          1.44
BeeGFS           1.6
After a model is trained, the process of evaluating the model is often not as data
dependent as the training step, which means you can store the evaluation data in
Amazon S3. This enables you to leverage tools such as Amazon SageMaker Batch
Transform to evaluate a model in a serverless environment.
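The following minimal sketch shows what such an evaluation run might look like with the Amazon SageMaker Python SDK; the model name, instance type, and S3 paths are hypothetical placeholders for your own resources.

import sagemaker
from sagemaker.transformer import Transformer

session = sagemaker.Session()

# A batch transform job runs inference over a dataset in Amazon S3 and writes
# the predictions back to S3, with no persistent endpoint to manage.
transformer = Transformer(
    model_name="example-trained-model",              # a model already registered in SageMaker
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://example-ml-bucket/evaluation/output/",
    sagemaker_session=session,
)

transformer.transform(
    data="s3://example-ml-bucket/evaluation/input/",
    content_type="text/csv",
    split_type="Line",
)
transformer.wait()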
For deployment, the data needs are highly dependent on the problem. If you will use the
model for a batch transform, you can prepare the data in advance and give it to the
model using either Amazon S3 or one of the other previously mentioned file systems.
Currently, Amazon S3 caps single-object uploads (a single PUT) at 5 GB and supports object sizes of up to 5 TB. To determine which file system to use, consider the size of the data you need to process and your business requirements for processing speeds.
Real-time prediction systems are more complex because a response only counts as real-time if it is returned within a short window. In these cases, you might have to analyze and adjust each
step of the inference pipeline to ensure that inference times meet the requirements. For
real-time inference on AWS, you can use an Amazon SageMaker deployment, in which model servers are placed behind a load balancer that auto-scales to meet the API requests sent to the model endpoint. Another option is to use Amazon Elastic Inference, which attaches just the right amount of GPU acceleration to an endpoint at reduced cost, providing scaled inference without an oversized instance. Similarly, you can compile models from popular deep learning frameworks with Amazon SageMaker Neo to optimize the trained model for the target hardware platform.
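A minimal sketch of such a deployment with Boto3 follows. The container image, model artifact, IAM role, and resource names are hypothetical placeholders; the optional AcceleratorType field attaches Amazon Elastic Inference to the endpoint's instances.

import boto3

sm = boto3.client("sagemaker")

# Register the model: a serving container image plus the trained model artifact.
sm.create_model(
    ModelName="example-model",
    PrimaryContainer={
        "Image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/example-inference:latest",
        "ModelDataUrl": "s3://example-ml-bucket/models/model.tar.gz",
    },
    ExecutionRoleArn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
)

# The endpoint configuration defines the instance fleet behind the load balancer.
sm.create_endpoint_config(
    EndpointConfigName="example-endpoint-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "example-model",
        "InitialInstanceCount": 1,
        "InstanceType": "ml.c5.xlarge",
        "AcceleratorType": "ml.eia1.medium",   # optional Elastic Inference accelerator
    }],
)

sm.create_endpoint(
    EndpointName="example-realtime-endpoint",
    EndpointConfigName="example-endpoint-config",
)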
In extreme cases, a SageMaker deployment might have too much latency. In these
situations, to find the bottlenecks, you must perform an extensive audit of where the
data for inference is coming from, its path to the model, the hardware running the
model, and the response pathway. Only then can you make an assessment about
where you should deploy the model and how to perform the inference. In some cases,
getting the data you need from the disparate data sources can be limiting and you must
rearchitect your data access patterns to enable real-time inference. In other cases, you
need to bring the model closer to the data sources. In these cases, you deploy the
model at the edge to interact directly with data generators. Although AWS provides AWS IoT Greengrass to deploy Amazon SageMaker models to edge devices, you still must determine the data needs and customizations required to properly implement your inference pipeline. If you are in this situation, to make sure that your model deployment is completed on time and meets your performance needs, we recommend that you seek help from AWS Professional Services.
Distributed Computation Frameworks
When you perform deep learning (DL) at scale, datasets are commonly too
large to fit into memory and therefore require pre-processing steps to partition the
datasets. In general, a best practice is to pack the data in parallel, distributed across
multiple machines. You should do this in a single run, and split the data into a small
number of files with a uniform number of partitions. When the data is partitioned, it is
readily accessible and easily fed in as batches across multiple machines. When the
data is split into a small number of files, the preparation job can be parallelized and thus
run faster. In addition to Apache Spark clusters on Amazon EMR, you can also use other tools, such as Amazon SageMaker Pipe input mode and AWS Glue (depending on the dataset size). There are also many third-party solutions, such as BigDL3, that run on Spark or Python and that you can integrate into your existing data preparation pipelines.
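As an illustration of this packing step, the following minimal PySpark sketch, run once on an Amazon EMR cluster, repartitions a raw dataset and writes it back to Amazon S3 as a fixed number of uniformly sized files. The S3 paths and the partition count are hypothetical placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-training-data").getOrCreate()

df = spark.read.parquet("s3://example-ml-bucket/raw/")

# Choose a partition count that matches how many workers will read batches later.
num_partitions = 64
(df.repartition(num_partitions)
   .write
   .mode("overwrite")
   .parquet("s3://example-ml-bucket/packed/train/"))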
During the data loading process, new data is fed to GPU workers to train the DL
algorithm. As a general rule, DL data loaders should be able to read data continuously
from contiguous locations, store data in a compact way, load and train in different
threads, and actively conserve RAM. Ideally, all data processing should run on the CPU side, with the CPU preparing the next batch of data so that it's ready to be fed to the GPU. We recommend that you do not use GPUs for data processing. Instead, reserve your GPUs for training the DL models so that the CPU (producer) and the GPU (consumer) do not compete for the same resources. In some
situations, such as image manipulation, data preparation can be compute intensive. In
these situations, you can use an AWS Batch job, which provides access to GPU-enabled instances, to manipulate large datasets. Pairing the easy scaling and Spot Instance pricing of AWS Batch with OpenCV acceleration on NVIDIA GPUs can help you to efficiently prepare image datasets.
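A minimal sketch of registering such a GPU-enabled AWS Batch job definition with Boto3 follows; the container image, command, and resource sizes are hypothetical placeholders for your own preprocessing container.

import boto3

batch = boto3.client("batch")

# The job definition requests one GPU, so jobs are placed on GPU-enabled
# instances (for example, P3) in the associated compute environment.
batch.register_job_definition(
    jobDefinitionName="example-image-preprocessing",
    type="container",
    containerProperties={
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/example-opencv-prep:latest",
        "vcpus": 8,
        "memory": 61000,
        "command": ["python", "resize_images.py", "--size", "256"],
        "resourceRequirements": [{"type": "GPU", "value": "1"}],
    },
)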
As a part of the loading process, data is shuffled (randomly sorted) into batches. For
data fed in at scale, we recommend that you shuffle the data before the data loading
process begins, and set the buffer size to a lower number for shuffling. To make sure
that shuffling happens without negative impacts on performance, run this on small,
partitioned files. We recommend that you run the loading process in parallel, and in
some instances, it can be helpful to prefetch more than one batch, such as when the
duration of processing time varies.
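As one concrete illustration, the following minimal sketch expresses these recommendations with TensorFlow's tf.data API, assuming pre-shuffled, partitioned TFRecord files stored at a hypothetical file system path.

import tensorflow as tf

# Pre-shuffled, partitioned TFRecord files (hypothetical path, e.g., an FSx for Lustre mount).
files = tf.data.Dataset.list_files("/mnt/fsx/packed/train/*.tfrecord")

dataset = (files
    .interleave(tf.data.TFRecordDataset,
                cycle_length=8,                                   # read several files in parallel
                num_parallel_calls=tf.data.experimental.AUTOTUNE)
    .shuffle(buffer_size=10000)                                   # small buffer; data was pre-shuffled
    .batch(256)
    .prefetch(2))                                                 # keep more than one batch ready

# dataset can now be passed to the training loop or to model.fit().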
Most frameworks, such as TensorFlow, PyTorch, and MXNet, offer a data loading class that you can use to load files from disk. These classes call a data loader function each time the next chunk of data is needed, which allows you to process big data in batches. Many frameworks also include a data pipeline feature that you can use to write a custom pipeline in Python code using library helper functions. Because there is more than one option for a single framework, and no universal format (such as the Open Neural Network Exchange, ONNX4) is broadly adopted, data pipelines can be difficult to move to another DL platform. There are also several open source libraries that enable
parallel computing with task scheduling for analytics, such as Dask, Ray, PyToolz, and
ipyparallel.
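As an example of such a data loading class, the following minimal PyTorch sketch prepares image batches in parallel worker processes; the data directory (shown as a hypothetical FSx for Lustre mount) and the worker count are placeholders to tune for your hardware.

import torch
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
])

# A directory of images arranged one class per subdirectory (hypothetical path).
train_set = datasets.ImageFolder("/mnt/fsx/train", transform=transform)

loader = torch.utils.data.DataLoader(
    train_set,
    batch_size=256,
    shuffle=True,
    num_workers=8,        # CPU processes that prepare batches for the GPU
    pin_memory=True,      # speeds up host-to-GPU copies
)

for images, labels in loader:
    pass                  # feed each batch to the training step on the GPU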
The process of training ML models can be the most compute intensive step in the ML
process. Because model training is compute intensive, we suggest you complete
vertical scaling before horizontal scaling. Moving from CPUs to GPUs for model training
provides such significant performance increases that we do not recommend training
complex ML models on traditional CPUs. If the training times on a single GPU are too long for your business needs, we recommend that you try a more powerful GPU
before moving to multiple GPUs. Similarly, we recommend that you leverage multiple
GPUs on a single node before you move to a multi-node solution. This approach
generally provides the best performance and limits the complexity of the system until
that complexity is necessary.
Before you make the choice to change from a single GPU instance to more complex
training infrastructures, we recommend that you complete a step-by-step analysis
process to make sure you consider all the implications. If a single GPU instance is not
training sufficiently fast, the first thing to evaluate is GPU utilization. Currently, the most
common approach to training DL models is to send batches of data to the GPU. The
GPU runs the data through the model and then the model weights are updated. With
this approach, the GPU goes through cycles: the GPU waits for data, computes that
data, then waits for more data. If the data cannot be prepared for the GPU faster than
the GPU is consuming the data, you have a data preparation bottleneck, and a faster
GPU will not significantly increase the model training speed.
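One simple way to check for this condition is to sample GPU utilization while a training job is running. The following minimal sketch shells out to the nvidia-smi tool that ships with the NVIDIA driver; sustained low utilization during training suggests a data preparation bottleneck rather than a need for a faster GPU.

import subprocess
import time

def gpu_utilization():
    # Query the utilization of every GPU on the instance as a list of percentages.
    out = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=utilization.gpu",
        "--format=csv,noheader,nounits",
    ])
    return [int(x) for x in out.decode().splitlines()]

# Sample every few seconds while the training job runs in another process.
for _ in range(10):
    print(gpu_utilization())
    time.sleep(5)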
To help make sure that data preparation is not a bottleneck, most DL frameworks
implement a data preparation framework designed to prepare datasets in parallel,
generally using the CPUs, and storing the prepared datasets in memory for GPU
consumption. When they operate ideally, these CPU tasks prepare data faster than the
GPU can consume it, but the system cannot always keep up with modern GPUs. If GPU
utilization during training is not maximized, first verify whether the compute resources can handle more worker processes (processes in Python, but potentially threads in another language) dedicated to preparing data, because it is generally easy to increase that number.
When the CPUs cannot handle more processes, it becomes important to determine if
the data preparation processes are I/O limited or compute limited. If I/O limits the data preparation, you can consider moving the data to a different data store. If the compute
process limits the data preparation, first make sure that only necessary processes are
running. For example, if the model takes in images at a resolution of 256 x 256 pixels,
but the images are stored at a resolution of 1920 x 1080 pixels, it might be possible—
depending on the algorithm—to first reduce all images to 256 x 256 pixels before the
training process starts. This change reduces the time needed to load an image and
removes some processing steps. In extreme cases, when you cannot reduce
preprocessing, you might have to use a data preparation cluster to generate batches of
data to feed the model. A data preparation cluster allows an arbitrary number of
processes to be used to prepare the data.
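A minimal sketch of this one-time downscaling step, using a pool of CPU processes and the Pillow imaging library, is shown below; the source and destination directories and the 256 x 256 target size are hypothetical placeholders.

import glob
import os
from multiprocessing import Pool

from PIL import Image

SRC = "/mnt/fsx/raw_images"        # hypothetical source directory of full-size images
DST = "/mnt/fsx/resized_images"    # hypothetical destination for model-ready images

def resize_one(path):
    img = Image.open(path).convert("RGB")
    img = img.resize((256, 256))
    img.save(os.path.join(DST, os.path.basename(path)))

if __name__ == "__main__":
    os.makedirs(DST, exist_ok=True)
    with Pool(processes=16) as pool:
        pool.map(resize_one, glob.glob(os.path.join(SRC, "*.jpg")))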
If your single GPU is fully saturated and the training time is still not sufficient to meet
your business needs, you can try a more powerful GPU, if one is available. Currently,
the P3 instance family has the most powerful GPU (the Volta V100) on AWS. After
switching to a more powerful GPU, you should repeat the process above to make sure
that the new GPU has not moved the bottleneck from the GPU to the data preparation.
If the GPU is still limiting and the training times are insufficient, then the next step is
moving to a node with more GPUs.
A single node with multiple GPUs has some advantages over the same number of
GPUs spread over many nodes. First, the communication between GPUs on a single
node is often over NVLink and is substantially faster than traveling across the network
to another node (regardless of whether the nodes are in a placement group). Second, the
movement of information between the GPU memory and system memory is faster than
communication across the network. This allows many GPUs on the same node to more
quickly aggregate their parameter updates than multiple nodes communicating across
the network.
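The following minimal PyTorch sketch illustrates single-node, multi-GPU training with nn.DataParallel, which splits each batch across the local GPUs and aggregates gradients without crossing the network; the toy model and random tensors stand in for a real network and data loader.

import torch
import torch.nn as nn

device = torch.device("cuda")
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 256 * 256, 10)).to(device)

if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)   # replicate the model across all local GPUs

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

for _ in range(10):                  # stand-in for iterating over a DataLoader
    images = torch.randn(256, 3, 256, 256, device=device)
    labels = torch.randint(0, 10, (256,), device=device)
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()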
Build Compute Clusters to Fit the Workload
If the largest single-node GPU system is not sufficiently fast and data preparation is not
a bottleneck, a cluster of GPU compute nodes is a possible solution. Deep learning
frameworks have different methods of enabling multi-node model training, so the ideal
cluster setup depends on the framework. An overview of all frameworks is beyond the
scope of this paper, but there are some commonalities to examine. To begin with, you
must have network communication for both data loading and parameter updates, which
makes network bandwidth and latency precious resources. Important first steps to
optimizing distributed training in AWS include choosing Amazon Elastic Compute Cloud
(Amazon EC2) instances with the highest bandwidth and locating all instances in a
placement group. Frameworks that reduce network communication, such as Horovod5, and techniques that use lower precision representations of the parameters, are essential to create performance improvements because network bandwidth is limited.
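A minimal sketch of these first steps with Boto3 follows: it creates a cluster placement group and launches GPU instances into it so that inter-node training traffic stays on a low-latency, high-bandwidth path. The AMI, key pair, and subnet identifiers are hypothetical placeholders.

import boto3

ec2 = boto3.client("ec2")

# A cluster placement group packs instances close together on the network.
ec2.create_placement_group(GroupName="dl-training-pg", Strategy="cluster")

ec2.run_instances(
    ImageId="ami-0123456789abcdef0",          # for example, an AWS Deep Learning AMI
    InstanceType="p3dn.24xlarge",
    MinCount=4,
    MaxCount=4,
    KeyName="example-keypair",
    SubnetId="subnet-0123456789abcdef0",
    Placement={"GroupName": "dl-training-pg"},
)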
A CloudFormation template for a deep learning cluster includes a VPC, a master instance with
SSH connection access, Auto Scaling, SQS to configure master and workers, and security
groups. In addition, it automatically creates an EFS mount for data access and launches activity
monitoring with AWS Lambda and Amazon Simple Notification Service (Amazon SNS).6
For additional machine learning cluster architectural diagrams, see the Appendix.
With high network bandwidth and local NVMe-based SSD storage, the P3dn.24xlarge instance type is optimized for
distributed machine learning and HPC applications. The P3 family of instance types delivers up to one petaflop of mixed-precision performance per instance to significantly
accelerate performance. P3 family instances can reduce machine learning training
times from days to minutes, and can increase the number of simulations completed for
high performance computing by 3–4x.
Although they are less commonly used because of their steeper learning curve, FPGA (Field Programmable Gate Array) instance types offer specialized hardware acceleration that has shown advantages in time-to-results for machine learning applications. For less computationally intensive machine learning applications that do not require GPUs, or for high-memory inference studies, AWS also offers a variety of CPU instances with memory, storage, and networking options to fit your needs. For example, the Intel® Xeon® Scalable processors incorporated in Amazon EC2 C5 instances,
with the optimized deep learning functions in the Intel MKL-DNN library, can provide
sufficient compute capacity for some deep learning training workloads.
Advances in networking, such as the Elastic Fabric Adapter (EFA) network interface for
Amazon EC2 instances, also provide support for HPC workloads that require intense
inter-node communication, such as computational fluid dynamics, weather modeling,
and large-scale simulations.
Modeling Using Hybrid Infrastructures
AWS provides several services that support direct connectivity between on-premises
networks and the AWS Cloud. Hybrid network architectures often use VPN services and
AWS Direct Connect (DX) for connectivity to an AWS Region. In a hybrid network
architecture, data transfer to Amazon S3 can take place over a private connection
instead of using the default secure protocol (HTTPS) over the public internet. You can
also use the Amazon S3 Transfer Acceleration service to complete transfers using the
Amazon network and edge locations. You can use AWS Identity and Access Management (IAM) to manage access across the integrated networks of your hybrid architectures.
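For example, the following minimal sketch uploads a prepared dataset through the S3 Transfer Acceleration endpoint using Boto3, assuming acceleration has already been enabled on the (hypothetical) destination bucket.

import boto3
from botocore.config import Config

# Route the upload through the Transfer Acceleration endpoint (AWS edge network).
s3 = boto3.client("s3", config=Config(s3={"use_accelerate_endpoint": True}))

s3.upload_file(
    Filename="/data/prepared/train.tar.gz",
    Bucket="example-ml-bucket",
    Key="prepared/train.tar.gz",
)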
For both machine learning and deep learning applications, hybrid architectures can
provide modularization of the analytical workflow to maximize the run-time efficiency
and lower latency of your application. By running training jobs in the cloud, you can leverage its highly scalable resources. After a model is trained, it can be transferred
to an on-premises server or edge device to run inference workloads. Hybrid
architectures are also useful in cases where data privacy is a concern. In many cases,
you might not want to have all of your raw data in the cloud. Instead, you can
modularize the pre-processing steps for your models and implement an off-cloud feature extraction application that extracts only those features necessary for model training. This reduces the amount of information that is available for insecure secondary inferences7.
AWS IoT Greengrass enables edge devices to act locally on real-time data while leveraging the cloud for
model management, which allows data to be returned to the cloud for durable storage
and continuous model optimization. Data returned to the cloud and targeted to an
Amazon S3 bucket can be used as a trigger in an automated CI/CD workflow by
integrating additional AWS services such as AWS CodeBuild, AWS CodePipeline,
Amazon CloudWatch, and AWS Lambda. You can also architect Lambda functions as steps in an AWS Step Functions workflow to manage an Amazon SageMaker instance backed by HPC resources.
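A minimal sketch of such a trigger follows: an AWS Lambda handler, wired to an Amazon S3 event notification, starts a SageMaker training job on the newly arrived data. The training image, IAM role, instance type, and S3 paths are hypothetical placeholders.

import time
import boto3

sm = boto3.client("sagemaker")

def handler(event, context):
    # Location of the object that triggered the S3 event notification.
    record = event["Records"][0]["s3"]
    input_uri = "s3://{}/{}".format(record["bucket"]["name"], record["object"]["key"])

    sm.create_training_job(
        TrainingJobName="retrain-{}".format(int(time.time())),
        AlgorithmSpecification={
            "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/example-train:latest",
            "TrainingInputMode": "File",
        },
        RoleArn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
        InputDataConfig=[{
            "ChannelName": "training",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": input_uri,
            }},
        }],
        OutputDataConfig={"S3OutputPath": "s3://example-ml-bucket/models/"},
        ResourceConfig={
            "InstanceType": "ml.p3.2xlarge",
            "InstanceCount": 1,
            "VolumeSizeInGB": 100,
        },
        StoppingCondition={"MaxRuntimeInSeconds": 86400},
    )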
For DL or ML use cases at the edge that require memory-intensive processing, such as
image collation and IoT sensor stream capture, AWS Snowball Edge provides an edge storage device that can run cloud services locally, including specific Amazon EC2 instance types and AWS Lambda functions.
Conclusion
Areas of machine learning, such as deep learning, are specifically amenable to the
methods of high performance computing because they pose computationally bound
problems that benefit from model and data parallelism. The I/O profiles for different
machine learning and deep learning applications are varied, and the associated
randomness results in significant performance challenges. Improvements to the
compute layer—including faster processors and more cores—have enabled us to
process larger data workloads using increasingly complex algorithms. Advancements in high performance parallelized file systems, where metadata and I/O are managed by multiple or all compute nodes, have enabled elastic scalability to meet processing demands.
Contributors
Contributors to this document include:
Further Reading
For additional information, see:
• Eric Jonas, Qifan Pu, Shivaram Venkataraman, Ion Stoica, and Benjamin Recht.
2017. Occupy the cloud: distributed computing for the 99%. In Proceedings of
the 2017 Symposium on Cloud Computing (SoCC '17). ACM, New York, NY,
USA, 445-451. DOI: https://doi.org/10.1145/3127479.3128601
• Ben-Nun, Tal, and Torsten Hoefler. “Demystifying Parallel and Distributed Deep
Learning: An In-Depth Concurrency Analysis.” ArXiv:1802.09941 [Cs], February
26, 2018. http://arxiv.org/abs/1802.09941.
Document Revisions
Date          Description
May 2019      First publication
Appendix
The following are two additional architectural diagrams for machine learning clusters.
Notes
1 M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: cluster
computing with working sets. Proceedings of the 2nd USENIX conference on Hot
topics in cloud computing, 10:10, 2010.
2 T. Hughes and S. Stefani. Build Amazon SageMaker notebooks backed by Spark in
Amazon EMR. https://aws.amazon.com/blogs/machine-learning/build-amazon-sagemaker-notebooks-backed-by-spark-in-amazon-emr/
3 Wang, Y., Qiu, X., Ding, D. et al. “BigDL: A distributed deep learning framework for big
data.” 2018. https://arxiv.org/pdf/1804.05839.pdf
4 Whitepaper: HPC on AWS Redefines What is Possible
5 Alexander Sergeev, Mike Del Balso. “Horovod: fast and easy distributed deep learning in TensorFlow.” arXiv:1802.05799 [cs.LG], Feb 15, 2018. https://arxiv.org/abs/1802.05799
6 https://aws.amazon.com/blogs/compute/distributed-deep-learning-made-easy/
https://github.com/awslabs/deeplearning-cfn
7 Seyed Ali Osia, Ali Shahin Shamsabadi, Ali Taheri, Kleomenis Katevas, Sina
Sajadmanesh, Hamid R. Rabiee, Nicholas D. Lane, Hamed Haddadi. “A Hybrid Deep
Learning Architecture for Privacy-Preserving Mobile Analytics”. arXiv:1703.02952
[cs.LG], Apr 18, 2018. https://arxiv.org/pdf/1703.02952.pdf