ACCELERATING GPU-STORAGE COMMUNICATION WITH NVIDIA MAGNUM IO GPUDIRECT STORAGE
ADDRESSING THE CHALLENGES OF GPU-ACCELERATED WORKFLOWS

The datasets used in high-performance computing (HPC), artificial intelligence, and data analytics are placing increasingly high demands on the scale-out compute and storage infrastructures of today's enterprises. This, together with the shift of computation from CPUs to faster GPUs, has made input and output (IO) operations between storage and the GPU even more significant. In some cases, application performance suffers because GPU compute nodes must wait for IO to complete.

With IO bottlenecks in multi-GPU systems and supercomputers, the compute-bound problem becomes an IO-bound problem. Traditional reads and writes to GPU memory use POSIX APIs to move data through system memory as an intermediate bounce buffer, and some file systems need additional memory in the kernel page cache. This extra copy of data through system memory is the leading cause of the IO bandwidth bottleneck to the GPU, as well as of higher overall IO latency and CPU utilization, primarily because CPU cycles are spent transferring the buffer contents to the GPU. The sketch after the feature list below illustrates this traditional path.

NVIDIA Magnum IO GPUDirect Storage accelerates the data path to the GPU by eliminating IO bottlenecks.

> Supports RDMA over InfiniBand and Ethernet (RoCE)
> Supports distributed file systems: NFS, DDN EXAScaler, WekaIO, IBM Spectrum Scale
> Supports the NVMe and NVMe-oF storage protocols
> Provides a compatibility mode for non-GDS-ready platforms
> Enabled on NVIDIA DGX™ Base OS
> Supports Ubuntu and RHEL operating systems
> Can be used with multiple libraries, APIs, and frameworks: DALI, RAPIDS cuDF, PyTorch, and MXNet
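To make the bounce-buffer cost concrete, here is a minimal sketch of the traditional path described above: a POSIX read into pinned system memory followed by a copy into GPU memory. It uses only the standard CUDA runtime and POSIX calls; the file name and transfer size are illustrative assumptions, and error handling is abbreviated.

    // Traditional GPU read path: storage -> system-memory bounce buffer -> GPU memory.
    // Sketch only; "input.dat" and SIZE are assumed for illustration.
    #include <cuda_runtime.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        const size_t SIZE = 64 << 20;            // 64 MiB, arbitrary example size

        int fd = open("input.dat", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        void *host_buf = nullptr;                // bounce buffer in system memory
        cudaMallocHost(&host_buf, SIZE);         // pinned so the host-to-device copy can use DMA

        void *dev_buf = nullptr;
        cudaMalloc(&dev_buf, SIZE);

        // Copy #1: storage -> system memory (consumes CPU cycles and host memory bandwidth).
        ssize_t n = pread(fd, host_buf, SIZE, 0);

        // Copy #2: system memory -> GPU memory.
        if (n > 0) cudaMemcpy(dev_buf, host_buf, (size_t)n, cudaMemcpyHostToDevice);

        // ... launch kernels that consume dev_buf ...

        cudaFree(dev_buf);
        cudaFreeHost(host_buf);
        close(fd);
        return 0;
    }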
NVIDIA MAGNUM IO GPUDIRECT STORAGE

NVIDIA Magnum IO™ GPUDirect® Storage (GDS) was specifically designed to accelerate data transfers between GPU memory and remote or local storage in a way that avoids CPU bottlenecks. GDS creates a direct data path between local NVMe or remote storage and GPU memory. This is enabled via a direct-memory access (DMA) engine near the network adapter or storage that transfers data into or out of GPU memory, avoiding the bounce buffer in the CPU.

With GDS, third-party file systems or modified kernel driver modules available in the NVIDIA OpenFabrics Enterprise Distribution (MLNX_OFED) enable such transfers. GDS enables new capabilities that provide up to 2X peak bandwidth to the GPU while improving latency and overall system utilization of both the CPU and GPU.

BENEFITS

> Higher bandwidth: Achieves up to 2X more bandwidth available to the GPU than a standard CPU-to-GPU path.
> Lower latency: Avoids extra copies in host system memory and provides dynamic routing that optimizes paths, buffers, and mechanisms.
> Lower CPU utilization: DMA engines near storage are less invasive to CPU load and don't interfere with GPU load. At larger IO sizes, the ratio of bandwidth to fractional CPU utilization is much higher with GPUDirect Storage.
By exposing GPUDirect Storage within CUDA® via the cuFile API, DMA
engines near the network interface card (NIC) or storage device can
create a direct path between GPU memory and storage devices. The
cuFile API is integrated in the CUDA Toolkit (version 11.4 and later)
or delivered via a separate package containing a user-level library
(libcufile) and kernel module (nvidia-fs) to orchestrate IO directly from
DMA and remote DMA (RDMA) capable storage. The user-level library is
readily integrated into the CUDA Toolkit runtime and the kernel module
is installed with the NVIDIA driver. MLNX_OFED is also required and
must be installed prior to GDS installation.
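As a rough companion to the description above, the sketch below uses the cuFile API to read a file directly into GPU memory: the file is opened with O_DIRECT, imported with cuFileHandleRegister, and read with cuFileRead so the data bypasses the system-memory bounce buffer. The file name and transfer size are assumptions, error handling is abbreviated, and a real application should follow the GDS documentation for its specific file system.

    // Minimal GDS read sketch using the cuFile API (libcufile + nvidia-fs).
    // "input.dat" and SIZE are assumed for illustration; most status checks are omitted.
    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE                           // for O_DIRECT
    #endif
    #include <cufile.h>
    #include <cuda_runtime.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        const size_t SIZE = 64 << 20;             // 64 MiB, arbitrary example size

        cuFileDriverOpen();                       // initialize the GDS driver

        // O_DIRECT keeps the kernel page cache out of the data path, as GDS expects.
        int fd = open("input.dat", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        // Import the POSIX file descriptor into cuFile.
        CUfileDescr_t descr = {};
        descr.handle.fd = fd;
        descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
        CUfileHandle_t handle;
        cuFileHandleRegister(&handle, &descr);

        // Allocate and register the GPU buffer as a DMA target.
        void *dev_buf = nullptr;
        cudaMalloc(&dev_buf, SIZE);
        cuFileBufRegister(dev_buf, SIZE, 0);

        // Direct read: storage -> GPU memory, with no bounce buffer in system memory.
        ssize_t n = cuFileRead(handle, dev_buf, SIZE, 0 /*file offset*/, 0 /*device buffer offset*/);
        if (n < 0) fprintf(stderr, "cuFileRead failed\n");

        // ... launch kernels that consume dev_buf ...

        cuFileBufDeregister(dev_buf);
        cuFileHandleDeregister(handle);
        cudaFree(dev_buf);
        close(fd);
        cuFileDriverClose();
        return 0;
    }

Build it by linking against libcufile and the CUDA runtime (for example, with nvcc and -lcufile); as noted above, MLNX_OFED and the nvidia-fs kernel module must already be installed.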
GPUDIRECT STORAGE DATA PATH

GPUDirect Storage enables a direct DMA data path between GPU memory and local or remote storage, as shown in Figure 1, thus avoiding a copy to system memory through the CPU. This direct path increases system bandwidth while decreasing latency and utilization load on the CPU and GPU.

Figure 1. The direct GDS data path between local or remote storage and GPU memory, compared with the standard path through system memory
Figure 2. The benefits of using GDS with the gdsio benchmarking tool, showing bandwidth (MiB/s) versus IO size (KiB)
DEEPCAM BENCHMARK
Figure 3 demonstrates another benefit of GDS. When optimized with
GDS and the NVIDIA Data Loading Library (DALI®), DeepCAM, a deep
learning model running segmentation on high-resolution climate
simulations to identify extreme weather patterns, can achieve up to a
6.6X speedup compared to out-of-the-box NumPy, a Python library used
for working with arrays.
Figure 3. DeepCAM inference speedup with GDS and DALI compared to out-of-the-box NumPy

Performance benchmarking was done for DeepCAM inference using a standard GDS configuration on DGX A100 with Ubuntu 20.04, MLNX_OFED 5.3, GDS 1.0, and DALI 1.3. For batch sizes >= 32, the application is limited by GPU compute throughput.
© 2021 NVIDIA Corporation and affiliates. All rights reserved. NVIDIA, the NVIDIA logo, CUDA, DALI, DGX, GPUDirect, and Magnum IO
are trademarks and/or registered trademarks of NVIDIA Corporation and its affiliates in the U.S. and other countries. Other company
and product names may be trademarks of the respective owners with which they are associated. All other trademarks are property of
their respective owners. Jun21