
TECHNICAL OVERVIEW

ACCELERATING GPU-STORAGE COMMUNICATION WITH NVIDIA MAGNUM IO GPUDIRECT STORAGE
NVIDIA Magnum IO GPUDirect Storage accelerates the data path to the GPU by eliminating IO bottlenecks.

> Supports RDMA over InfiniBand and Ethernet (RoCE)
> Supports distributed file systems: NFS, DDN EXAScaler, WekaIO, and IBM Spectrum Scale
> Supports the NVMe and NVMe-oF storage protocols
> Provides a compatibility mode for non-GDS-ready platforms
> Enabled on NVIDIA DGX™ Base OS
> Supports the Ubuntu and RHEL operating systems
> Can be used with multiple libraries, APIs, and frameworks: DALI, RAPIDS cuDF, PyTorch, and MXNet

ADDRESSING THE CHALLENGES OF GPU-ACCELERATED WORKFLOWS

The datasets used in high-performance computing (HPC), artificial intelligence, and data analytics place increasingly high demands on the scale-out compute and storage infrastructures of today's enterprises. This, together with the computational shift from CPUs to faster GPUs, has given input and output (IO) operations between storage and the GPU even greater significance. In some cases, application performance suffers because GPU compute nodes must wait for IO to complete.

With IO bottlenecks in multi-GPU systems and supercomputers, the compute-bound problem becomes an IO-bound problem. Traditional reads and writes to GPU memory use POSIX APIs to move data through system memory as an intermediate bounce buffer, and some file systems need additional memory in the kernel page cache. This extra copy through system memory is the leading cause of the IO bandwidth bottleneck to the GPU, as well as of higher overall IO latency and CPU utilization, primarily because CPU cycles are spent transferring the buffer contents to the GPU.

NVIDIA MAGNUM IO GPUDIRECT STORAGE

BENEFITS

> Higher bandwidth: Achieves up to 2X more bandwidth to the GPU than the standard CPU-to-GPU path.
> Lower latency: Avoids extra copies in host system memory and provides dynamic routing that optimizes paths, buffers, and mechanisms.
> Lower CPU utilization: DMA engines near storage are less invasive to the CPU load and don't interfere with the GPU load. At larger IO sizes, the ratio of bandwidth to fractional CPU utilization is much higher with GPUDirect Storage.

NVIDIA Magnum IO™ GPUDirect® Storage (GDS) was specifically designed to accelerate data transfers between GPU memory and remote or local storage in a way that avoids CPU bottlenecks. GDS creates a direct data path between local NVMe or remote storage and GPU memory. This is enabled by a direct memory access (DMA) engine near the network adapter or storage device that transfers data into or out of GPU memory while avoiding the bounce buffer in the CPU.

With GDS, third-party file systems or modified kernel driver modules available in the NVIDIA OpenFabrics Enterprise Distribution (MLNX_OFED) allow such transfers. GDS enables new capabilities that provide up to 2X peak bandwidth to the GPU while improving latency and overall system utilization of both the CPU and GPU.
By exposing GPUDirect Storage within CUDA® via the cuFile API, DMA engines near the network interface card (NIC) or storage device can create a direct path between GPU memory and storage devices. The cuFile API is integrated into the CUDA Toolkit (version 11.4 and later) or delivered via a separate package containing a user-level library (libcufile) and a kernel module (nvidia-fs) that orchestrate IO directly from DMA- and remote DMA (RDMA)-capable storage. The user-level library is readily integrated into the CUDA Toolkit runtime, and the kernel module is installed with the NVIDIA driver. MLNX_OFED is also required and must be installed prior to GDS installation.
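As a sketch of how an application drives this path, the example below reads a file directly into GPU memory with the cuFile API. It assumes a GDS-enabled file system mounted at a hypothetical path, abbreviates error handling, and would be linked against libcufile and the CUDA runtime; it is a minimal illustration rather than a complete application.

// GPUDirect Storage read path with the cuFile API: storage -> GPU memory, no CPU bounce buffer.
// Illustrative sketch only; the file path is a placeholder and error handling is abbreviated.
#define _GNU_SOURCE                               // for O_DIRECT
#include <cuda_runtime.h>
#include <cufile.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    const size_t size = 1 << 20;                  // 1 MiB transfer
    const char *path = "/mnt/gds/sample.bin";     // hypothetical file on a GDS-enabled mount

    cuFileDriverOpen();                           // open the libcufile/nvidia-fs driver context

    // O_DIRECT keeps the IO out of the kernel page cache.
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    CUfileDescr_t descr;
    memset(&descr, 0, sizeof(descr));
    descr.handle.fd = fd;
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;

    CUfileHandle_t handle;
    cuFileHandleRegister(&handle, &descr);        // register the file with libcufile

    void *dev_buf = NULL;
    cudaMalloc(&dev_buf, size);                   // destination buffer in GPU memory
    cuFileBufRegister(dev_buf, size, 0);          // optional: pre-register the GPU buffer for DMA

    // DMA directly from storage into GPU memory; no host bounce buffer is involved.
    ssize_t nread = cuFileRead(handle, dev_buf, size, 0 /* file offset */, 0 /* buffer offset */);
    if (nread < 0) fprintf(stderr, "cuFileRead failed: %zd\n", nread);

    cuFileBufDeregister(dev_buf);
    cuFileHandleDeregister(handle);
    cudaFree(dev_buf);
    close(fd);
    cuFileDriverClose();
    return 0;
}

Compared with the bounce-buffer sketch above, the data never lands in system memory: libcufile and nvidia-fs arrange the DMA between the storage device and the registered GPU buffer.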

GPUDIRECT STORAGE DATA PATH

GPUDirect Storage enables a direct DMA data path between GPU memory and local or remote storage, as shown in Figure 1, thus avoiding a copy to system memory through the CPU. This direct path increases system bandwidth while decreasing latency and utilization load on the CPU and GPU.
[Figure: three panels, Without GPUDirect Storage, With GPUDirect Storage (Local), and With GPUDirect Storage (Remote), each showing the path among system memory, CPU, PCIe switch, NICs, storage, and GPU.]

Figure 1. GPUDirect Storage data path

EFFECTIVENESS OF GPUDIRECT STORAGE ON MICROBENCHMARKS

> GDSIO Benchmark

Figure 2 shows the benefits of using GDS with the gdsio benchmarking tool that's available as part of the installation. The figure demonstrates up to a 1.5X improvement in the bandwidth available to the GPU and up to a 2.8X improvement in CPU utilization compared to the traditional data path via the CPU bounce buffer.
[Figure: GDS vs CPU-GPU READ with 2 North-South NICs; bandwidth (MiB/s) versus IO size from 4 KiB to 16384 KiB for CPU-GPU_READ and GDS_READ, with throughput advantage, CPU utilization advantage, and breakeven curves.]

Figure 2. The benefits of using GDS with the gdsio benchmarking tool

> DeepCAM Benchmark
Figure 3 demonstrates another benefit of GDS. When optimized with
GDS and the NVIDIA Data Loading Library (DALI®), DeepCAM, a deep
learning model running segmentation on high-resolution climate
simulations to identify extreme weather patterns, can achieve up to a
6.6X speedup compared to out-of-the-box NumPy, a Python library used
for working with arrays.

[Figure: Accelerating DeepCAM Inference; effective bandwidth (GB/s) versus global batch size (8, 16, 32) for NumPy (baseline), DALI + GDS (compat), and DALI + GDS, with speedups of 3.0X, 4.7X, and 6.6X over the 1.0X NumPy baseline.]

Performance benchmarking was done for DeepCAM inference using a standard GDS configuration on DGX A100 with Ubuntu 20.04, MLNX_OFED 5.3, GDS 1.0, and DALI 1.3. For batch sizes >= 32, the application is limited by GPU compute throughput.

Figure 3. Performance of DeepCAM with DALI 1.3.0 and GDS

EVOLUTIONARY TECHNOLOGY, REVOLUTIONARY PERFORMANCE BENEFITS

As workflows shift away from the CPU in GPU-centric systems, the data path from storage to GPUs increasingly becomes a bottleneck. NVIDIA Magnum IO GPUDirect Storage enables DMA directly to and from GPU memory. With this evolutionary technology, the performance benefits can easily be seen across a variety of benchmarks and real-world applications, resulting in reduced runtimes and faster time to insight.

Learn more about GPUDirect Storage Acceleration at: developer.nvidia.com/gpudirect

© 2021 NVIDIA Corporation and affiliates. All rights reserved. NVIDIA, the NVIDIA logo, CUDA, DALI, DGX, GPUDirect, and Magnum IO
are trademarks and/or registered trademarks of NVIDIA Corporation and its affiliates in the U.S. and other countries. Other company
and product names may be trademarks of the respective owners with which they are associated. All other trademarks are property of
their respective owners. Jun21
