Training AI Models on CPU. Revisiting CPU for ML…
The recent successes in AI are often attributed to the emergence and evolution
of the GPU. The GPU’s architecture, which typically includes thousands of multi-
processors, high-speed memory, dedicated tensor cores, and more, is particularly
well-suited to meet the intensive demands of AI/ML workloads. Unfortunately,
the rapid growth in AI development has led to a surge in the demand for GPUs,
making them difficult to obtain. As a result, ML developers are increasingly
exploring alternative hardware options for training and running their models. In
previous posts, we discussed the possibility of training on dedicated AI ASICs
such as Google Cloud TPU, Habana Gaudi, and AWS Trainium. While these options
offer significant cost-saving opportunities, they do not suit all ML models and
can, like the GPU, also suffer from limited availability. In this post we return to
the good old-fashioned CPU and revisit its relevance to ML applications. Although
CPUs are generally less suited to ML workloads compared to GPUs, they are much
easier to acquire. The ability to run (at least some of) our workloads on CPU could
have significant implications for development productivity.
Our goal will be to demonstrate that although ML development on CPU may not
be our first choice, there are ways to “soften the blow” and — in some cases —
perhaps even make it a viable alternative.
Disclaimers
Our intention in this post is to demonstrate just a few of the ML optimization
opportunities available on CPU. Contrary to most of the online tutorials on the
topic of ML optimization on CPU, we will focus on a training workload rather than
an inference workload. There are a number of optimization tools focused
specifically on inference that we will not cover (e.g., see here and here).
import torch
import torchvision
from torch.utils.data import Dataset, DataLoader
import time

# a dataset of random images and labels, standing in for real training data
# (the details of the definition are assumed, matching ResNet-50's expected input)
class FakeDataset(Dataset):
    def __len__(self):
        return 1000000

    def __getitem__(self, index):
        rand_image = torch.randn([3, 224, 224], dtype=torch.float32)
        label = torch.tensor(data=index % 1000, dtype=torch.int64)
        return rand_image, label

train_set = FakeDataset()

batch_size = 128
num_workers = 0

train_loader = DataLoader(
    dataset=train_set,
    batch_size=batch_size,
    num_workers=num_workers
)

model = torchvision.models.resnet50()
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters())
model.train()

t0 = time.perf_counter()
summ = 0
count = 0
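The training loop itself is not shown above. A minimal throughput-measurement loop, consistent with the variables defined in the snippet, might look like the following sketch (the specific warmup and step counts are arbitrary choices, not values taken from the original script):
# a minimal throughput-measurement loop (sketch)
for idx, (data, target) in enumerate(train_loader):
    optimizer.zero_grad()
    output = model(data)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()
    if idx > 10:  # skip the first steps as warmup
        summ += time.perf_counter() - t0
        count += 1
    t0 = time.perf_counter()
    if idx > 100:
        break

print(f'average throughput: {count * batch_size / summ:.2f} samples per second')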
Running this script on a c7i.2xlarge (with 8 vCPUs) and the CPU version of
PyTorch 2.4 results in a throughput of 9.12 samples per second. For the sake of
comparison, we note that the throughput of the same (unoptimized) script on an
Amazon EC2 g5.2xlarge instance (with 1 GPU and 8 vCPUs) is 340 samples per
second. Taking into account the comparative costs of these two instance types
($0.357 per hour for a c7i.2xlarge and $1.212 for a g5.2xlarge, as of the time of this
writing), we find that training on the GPU instance gives roughly eleven(!!) times
better price performance (340 / $1.212 ≈ 280 samples per dollar-hour versus
9.12 / $0.357 ≈ 25.5). Based on these results, the preference for using GPUs
to train ML models is very well founded. Let’s assess some of the possibilities for
reducing this gap.
Batch Size
Increasing the training batch size can potentially increase performance by
reducing the frequency of the model parameter updates. (On GPUs it has the
added benefit of reducing the overhead of CPU-GPU transactions such as kernel
loading). However, while on GPU we aimed for a batch size that would maximize
the utilization of the GPU memory, the same strategy might hurt performance on
CPU. For reasons beyond the scope of this post, CPU memory behavior is more
complicated, and the best approach for discovering the optimal batch size
may be through trial and error (a simple sweep is sketched below). Keep in mind
that changing the batch size could affect training convergence.
The table below summarizes the throughput of our training workload for a few
(arbitrary) choices of batch size:
Training Throughput as Function of Batch Size (by Author)
Contrary to our findings on GPU, on the c7i.2xlarge instance type our model
appears to prefer lower batch sizes.
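One simple way to run this kind of trial-and-error sweep is sketched below. The candidate batch sizes and step counts are arbitrary choices for illustration, not the settings behind the table above:
# measure training throughput (samples per second) for a given batch size
def measure_throughput(batch_size, num_steps=100, warmup=10):
    loader = DataLoader(dataset=train_set, batch_size=batch_size)
    elapsed, count = 0.0, 0
    t0 = time.perf_counter()
    for idx, (data, target) in enumerate(loader):
        optimizer.zero_grad()
        loss = criterion(model(data), target)
        loss.backward()
        optimizer.step()
        if idx >= warmup:
            elapsed += time.perf_counter() - t0
            count += 1
        t0 = time.perf_counter()
        if idx >= num_steps:
            break
    return count * batch_size / elapsed

for bs in [32, 64, 128, 256, 512]:
    print(f'batch size {bs}: {measure_throughput(bs):.2f} samples per second')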
Training Throughput as Function of the Number of Data Loading Workers (by Author)
Mixed Precision
Another popular technique is to use lower precision floating point datatypes such
as torch.float16 or torch.bfloat16, with the dynamic range of torch.bfloat16
generally considered more amenable to ML training.
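One way to apply bfloat16 mixed precision to our training step is with PyTorch’s built-in autocast context manager. The sketch below is a minimal illustration and may differ from the exact approach used to produce the results in this post:
# run the forward pass and loss computation in bfloat16 on CPU;
# the backward pass and optimizer step remain outside the autocast scope
for idx, (data, target) in enumerate(train_loader):
    optimizer.zero_grad()
    with torch.autocast(device_type='cpu', dtype=torch.bfloat16):
        output = model(data)
        loss = criterion(output, target)
    loss.backward()
    optimizer.step()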
Torch Compilation
In a previous post we covered the virtues of PyTorch’s support for graph
compilation and its potential impact on runtime performance. Contrary to the
default eager execution mode in which each operation is run independently
(a.k.a., “eagerly”), the compile API converts the model into an intermediate
computation graph which is then JIT-compiled into low-level machine code in a
manner that is optimal for the underlying training engine. The API supports
compilation via different backend libraries and with multiple configuration
options. Here we will limit our evaluation to the default (TorchInductor) backend
and the ipex backend from the Intel® Extension for PyTorch, a library with
dedicated optimizations for Intel hardware. Please see the documentation for
appropriate installation and usage instructions. The updated model definition
appears below:
model = torchvision.models.resnet50()
backend='inductor' # optionally change to 'ipex'
model = torch.compile(model, backend=backend)
In the case of our toy model, the impact of torch compilation is only apparent
when the “channels last” optimization is disabled (an increase of ~27% for each
of the backends). When “channels last” is applied, the performance actually
drops. As a result, we drop this optimization from our subsequent experiments.
The use of the run_cpu script further boosts our runtime performance to 39.05
samples per second. Note that the run_cpu script includes many controls for
further tuning performance. Be sure to check out the documentation in order to
maximize its use.
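For reference, the run_cpu script ships with PyTorch as the torch.backends.xeon.run_cpu module. A basic invocation, without any of its optional tuning flags and assuming our training script is saved as train.py, might look like this:
python -m torch.backends.xeon.run_cpu train.py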
import intel_extension_for_pytorch as ipex

model = torchvision.models.resnet50()
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters())
model.train()

# apply Intel Extension for PyTorch optimizations to the model and optimizer,
# casting to bfloat16 where supported
model, optimizer = ipex.optimize(
    model,
    optimizer=optimizer,
    dtype=torch.bfloat16
)
Combined with the memory and thread optimizations discussed above, the
resultant throughput is 40.73 samples per second. (Note that a similar result is
reached when disabling the “channels last” configuration.)
We can run data distributed training across NUMA nodes easily using the ipexrun
utility. In the following code block (loosely based on this example) we adapt our
script to run data distributed training (according to usage detailed here):
os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29500"
os.environ["RANK"] = os.environ.get("PMI_RANK", "0")
os.environ["WORLD_SIZE"] = os.environ.get("PMI_SIZE", "1")
dist.init_process_group(backend="ccl", init_method="env://")
rank = os.environ["RANK"]
world_size = os.environ["WORLD_SIZE"]
batch_size = 128
num_workers = 0
train_dataset = FakeDataset()
dist_sampler = DistributedSampler(train_dataset)
train_loader = DataLoader(
dataset=train_dataset,
batch_size=batch_size,
num_workers=num_workers,
sampler=dist_sampler
)
# configure DDP
model = torch.nn.parallel.DistributedDataParallel(model)
Unfortunately, as of the time of this writing, the Amazon EC2 c7i instance family
does not include a multi-NUMA instance type. To test our distributed training
script, we revert to an Amazon EC2 c6i.32xlarge instance with 64 vCPUs and 2
NUMA nodes. We verify the installation of Intel® oneCCL Bindings for PyTorch
and run the following command (as documented here):
# This example command utilizes all of the NUMA sockets of the processor
ipexrun cpu --nnodes 1 --omp_runtime intel train.py
In our experiment, data distribution did not boost the runtime performance.
Please see the ipexrun documentation for additional performance tuning options.
PyTorch/XLA
Another option we evaluate is PyTorch/XLA, which uses the XLA compiler to optimize model execution and includes support for running on CPU. The adapted model definition appears below:
import torch
import torchvision
import time
import torch_xla
import torch_xla.core.xla_model as xm

# create an XLA (CPU) device and move the model onto it
device = xm.xla_device()
model = torchvision.models.resnet50().to(device)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters())
model.train()
# the training loop is unchanged, except that the input batches are copied to
# the XLA device and xm.mark_step() is called at the end of each step
Unfortunately, as of the time of this writing, the XLA results on our toy model
seem far inferior to the (unoptimized) results we saw above (by as much as 7x).
We expect this to improve as PyTorch/XLA’s CPU support matures.
Results
We summarize the results of a subset of our experiments in the table below. For
the sake of comparison, we add the throughput of training our model on an Amazon
EC2 g5.2xlarge GPU instance following the optimization steps discussed in this
post. The samples-per-dollar values were calculated based on the Amazon EC2 On-demand
pricing page ($0.357 per hour for a c7i.2xlarge and $1.212 for a g5.2xlarge, as of
the time of this writing).
Performance Optimization Results (by Author)
While the runtime performance results of the optimized CPU training of our toy
model (and our chosen environment) were lower than the GPU results, it is possible
that the same optimization steps applied to other model architectures (e.g., ones
that include components that are not supported by the GPU) could result in the CPU
performance matching or beating that of the GPU. And even when the performance
gap is not bridged, there may very well be cases where the shortage of GPU compute
capacity justifies running some of our ML workloads on CPU.
Summary
Given the ubiquity of the CPU, the ability to use it effectively for training
and/or running ML workloads could have huge implications for development
productivity and for end-product deployment strategy. While the nature of the
CPU architecture is less amenable to many ML applications than that of the
GPU, there are many tools and techniques available for boosting its performance,
a select few of which we have discussed and demonstrated in this post.
In this post we focused on optimizing training on CPU. Please be sure to check out
our many other posts on Medium covering a wide variety of topics pertaining to
performance analysis and optimization of machine learning workloads.