ACCELERATING GPU-STORAGE COMMUNICATION WITH NVIDIA MAGNUM IO GPUDIRECT STORAGE
ADDRESSING THE CHALLENGES OF GPU-ACCELERATED WORKFLOWS

The datasets used in high-performance computing (HPC), artificial intelligence, and data analytics are placing increasingly high demands on the scale-out compute and storage infrastructures of today's enterprises. This, together with the shift of computation from CPUs to faster GPUs, has made input and output (IO) operations between storage and the GPU even more significant. In some cases, application performance suffers because GPU compute nodes must wait for IO to complete.

With IO bottlenecks in multi-GPU systems and supercomputers, the compute-bound problem becomes an IO-bound problem. Traditional reads and writes to GPU memory use POSIX APIs to move data through system memory as an intermediate bounce buffer, and some file systems need additional memory in the kernel page cache. This extra copy of data through system memory is the leading cause of the IO bandwidth bottleneck to the GPU, as well as of higher overall IO latency and CPU utilization, primarily because CPU cycles are spent transferring the buffer contents to the GPU. The sketch after the feature list below illustrates this traditional path.

NVIDIA Magnum IO GPUDirect Storage accelerates the data path to the GPU by eliminating IO bottlenecks.

> Supports RDMA over InfiniBand and Ethernet (RoCE)
> Supports distributed file systems: NFS, DDN EXAScaler, WekaIO, IBM Spectrum Scale
> Supports the NVMe and NVMe-oF storage protocols
> Provides a compatibility mode for non-GDS-ready platforms
> Enabled on NVIDIA DGX™ Base OS
> Supports Ubuntu and RHEL operating systems
> Can be used with multiple libraries, APIs, and frameworks: DALI, RAPIDS cuDF, PyTorch, and MXNet
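To make the bounce-buffer cost concrete, here is a minimal sketch of the traditional path described above: a POSIX read into pinned system memory followed by a copy into GPU memory. It uses only the standard CUDA runtime and POSIX calls; the file name and transfer size are illustrative assumptions, and error handling is abbreviated.

    // Traditional GPU read path: storage -> system-memory bounce buffer -> GPU memory.
    // Sketch only; "input.dat" and SIZE are assumed for illustration.
    #include <cuda_runtime.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        const size_t SIZE = 64 << 20;            // 64 MiB, arbitrary example size

        int fd = open("input.dat", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        void *host_buf = nullptr;                // bounce buffer in system memory
        cudaMallocHost(&host_buf, SIZE);         // pinned so the host-to-device copy can use DMA

        void *dev_buf = nullptr;
        cudaMalloc(&dev_buf, SIZE);

        // Copy #1: storage -> system memory (consumes CPU cycles and host memory bandwidth).
        ssize_t n = pread(fd, host_buf, SIZE, 0);

        // Copy #2: system memory -> GPU memory.
        if (n > 0) cudaMemcpy(dev_buf, host_buf, (size_t)n, cudaMemcpyHostToDevice);

        // ... launch kernels that consume dev_buf ...

        cudaFree(dev_buf);
        cudaFreeHost(host_buf);
        close(fd);
        return 0;
    }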
NVIDIA MAGNUM IO GPUDIRECT STORAGE

NVIDIA Magnum IO™ GPUDirect® Storage (GDS) was specifically designed to accelerate data transfers between GPU memory and remote or local storage in a way that avoids CPU bottlenecks. GDS creates a direct data path between local NVMe or remote storage and GPU memory. This is enabled via a direct-memory access (DMA) engine near the network adapter or storage that transfers data into or out of GPU memory, avoiding the bounce buffer in the CPU.

With GDS, third-party file systems or modified kernel driver modules available in the NVIDIA OpenFabrics Enterprise Distribution (MLNX_OFED) enable such transfers. GDS enables new capabilities that provide up to 2X peak bandwidth to the GPU while improving latency and overall system utilization of both the CPU and GPU.

BENEFITS

> Higher bandwidth: Achieves up to 2X more bandwidth available to the GPU than a standard CPU-to-GPU path.
> Lower latency: Avoids extra copies in host system memory and provides dynamic routing that optimizes paths, buffers, and mechanisms.
> Lower CPU utilization: DMA engines near storage are less invasive to CPU load and don't interfere with GPU load. At larger IO sizes, the ratio of bandwidth to fractional CPU utilization is much higher with GPUDirect Storage.
By exposing GPUDirect Storage within CUDA® via the cuFile API, DMA
engines near the network interface card (NIC) or storage device can
create a direct path between GPU memory and storage devices. The
cuFile API is integrated in the CUDA Toolkit (version 11.4 and later)
or delivered via a separate package containing a user-level library
(libcufile) and kernel module (nvidia-fs) to orchestrate IO directly from
DMA and remote DMA (RDMA) capable storage. The user-level library is
readily integrated into the CUDA Toolkit runtime and the kernel module
is installed with the NVIDIA driver. MLNX_OFED is also required and
must be installed prior to GDS installation.
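As a rough companion to the description above, the sketch below uses the cuFile API to read a file directly into GPU memory: the file is opened with O_DIRECT, imported with cuFileHandleRegister, and read with cuFileRead so the data bypasses the system-memory bounce buffer. The file name and transfer size are assumptions, error handling is abbreviated, and a real application should follow the GDS documentation for its specific file system.

    // Minimal GDS read sketch using the cuFile API (libcufile + nvidia-fs).
    // "input.dat" and SIZE are assumed for illustration; most status checks are omitted.
    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE                           // for O_DIRECT
    #endif
    #include <cufile.h>
    #include <cuda_runtime.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        const size_t SIZE = 64 << 20;             // 64 MiB, arbitrary example size

        cuFileDriverOpen();                       // initialize the GDS driver

        // O_DIRECT keeps the kernel page cache out of the data path, as GDS expects.
        int fd = open("input.dat", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        // Import the POSIX file descriptor into cuFile.
        CUfileDescr_t descr = {};
        descr.handle.fd = fd;
        descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
        CUfileHandle_t handle;
        cuFileHandleRegister(&handle, &descr);

        // Allocate and register the GPU buffer as a DMA target.
        void *dev_buf = nullptr;
        cudaMalloc(&dev_buf, SIZE);
        cuFileBufRegister(dev_buf, SIZE, 0);

        // Direct read: storage -> GPU memory, with no bounce buffer in system memory.
        ssize_t n = cuFileRead(handle, dev_buf, SIZE, 0 /*file offset*/, 0 /*device buffer offset*/);
        if (n < 0) fprintf(stderr, "cuFileRead failed\n");

        // ... launch kernels that consume dev_buf ...

        cuFileBufDeregister(dev_buf);
        cuFileHandleDeregister(handle);
        cudaFree(dev_buf);
        close(fd);
        cuFileDriverClose();
        return 0;
    }

Build it by linking against libcufile and the CUDA runtime (for example, with nvcc and -lcufile); as noted above, MLNX_OFED and the nvidia-fs kernel module must already be installed.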
GPUDIRECT STORAGE DATA PATH

GPUDirect Storage enables a direct DMA data path between GPU memory and local or remote storage, as shown in Figure 1, thus avoiding a copy to system memory through the CPU. This direct path increases system bandwidth while decreasing latency and utilization load on the CPU and GPU.

Figure 1. The direct GDS data path between local or remote storage and GPU memory, compared with the standard path through system memory
Figure 2. The benefits of using GDS with the gdsio benchmarking tool, showing bandwidth (MiB/s) versus IO size (KiB)
DEEPCAM BENCHMARK
Figure 3 demonstrates another benefit of GDS. When optimized with
GDS and the NVIDIA Data Loading Library (DALI®), DeepCAM, a deep
learning model running segmentation on high-resolution climate
simulations to identify extreme weather patterns, can achieve up to a
6.6X speedup compared to out-of-the-box NumPy, a Python library used
for working with arrays.
Figure 3. DeepCAM inference speedup with GDS and DALI compared to out-of-the-box NumPy

Performance benchmarking was done for DeepCAM inference using a standard GDS configuration on DGX A100 with Ubuntu 20.04, MLNX_OFED 5.3, GDS 1.0, and DALI 1.3. For batch sizes >= 32, the application is limited by GPU compute throughput.
© 2021 NVIDIA Corporation and affiliates. All rights reserved. NVIDIA, the NVIDIA logo, CUDA, DALI, DGX, GPUDirect, and Magnum IO
are trademarks and/or registered trademarks of NVIDIA Corporation and its affiliates in the U.S. and other countries. Other company
and product names may be trademarks of the respective owners with which they are associated. All other trademarks are property of
their respective owners. Jun21