Virtual GPU Positioning
Technical Brief
The flexibility of the NVIDIA vGPU solution sometimes leads to the question, “How do I select
the right software license and GPU combination to best meet the needs of my workloads?”
In this technical brief, you will find guidance on how to select the best virtual GPU software
license and graphics processing unit (GPU) combination, based on your workload. This
guidance is based on variables such as performance and performance per dollar¹. Other
factors that should be considered include things like which NVIDIA vGPU certified OEM server
you’ve selected, which NVIDIA GPUs are supported in that platform, as well as any power and
cooling constraints.
Note: ¹ Performance per dollar assumes estimated GPU street price plus NVIDIA virtual GPU software license cost with 3-year subscription, divided by the number of users.
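To make this metric concrete, the following is a minimal sketch in Python of how cost per user and performance per dollar can be computed under this definition. All prices and performance values are hypothetical placeholders, and the assumption that the software license is priced per concurrent user is illustrative, not NVIDIA pricing.

```python
# Minimal sketch of the performance-per-dollar metric described above.
# All prices and performance values are hypothetical placeholders, not NVIDIA
# list prices or measured results; the license is assumed to be priced per
# concurrent user for a 3-year subscription.

def cost_per_user(gpu_street_price, license_3yr_per_user, users):
    """Total solution cost divided by the number of concurrent users."""
    return (gpu_street_price + license_3yr_per_user * users) / users

def perf_per_dollar(normalized_perf, gpu_street_price, license_3yr_per_user, users):
    """Normalized benchmark score per dollar of per-user solution cost."""
    return normalized_perf / cost_per_user(gpu_street_price, license_3yr_per_user, users)

# Hypothetical comparison for a single-user professional graphics workload.
print(perf_per_dollar(1.0, gpu_street_price=2500, license_3yr_per_user=750, users=1))
print(perf_per_dollar(1.6, gpu_street_price=6500, license_3yr_per_user=750, users=1))
```

In this toy example, the lower-priced board wins on performance per dollar even though its normalized performance is lower, which is the trade-off the tables in this brief summarize.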
It is recommended that you test your unique workloads to determine the best NVIDIA virtual
GPU solution to meet your needs. However, this technical brief provides general guidance
based on performance and price performance, for virtualized workloads using NVIDIA virtual
GPU software.
Table 1 summarizes the recommended GPU for running a specific virtualized workload, based
only on performance. For this testing, we selected a representative benchmark for each
workload, described in Table 5. For the specific benchmarks run with NVIDIA virtual GPU
software, NVIDIA® Quadro RTX™ 6000 and Quadro RTX 8000 GPUs provided the best
performance for professional graphics and rendering workloads, while the V100S provided the
best performance for artificial intelligence (AI) and high-performance computing (HPC).
In many cases, raw performance is not the only factor considered when selecting the right
virtual GPU solution for your workload. Cost is often also considered. Table 2 summarizes the
recommended GPU if only performance per dollar is considered. If the infrastructure will
support only a knowledge worker VDI workload, the M10 GPU provides the best performance
per dollar, while also providing great user density. The T4 GPU is flexible enough to run
knowledge worker VDI and professional graphics workloads, and it also provides the best
performance per dollar for professional graphics applications. Because the NVIDIA RTX™
platform was designed to accelerate photorealistic rendering, it is no surprise that it provides
the best performance per dollar for rendering workloads. For high performance computing,
the NVIDIA Volta™ architecture of the V100S has hardware to accelerate double precision
(FP64) workloads, giving it the best performance and performance per dollar. Note that for AI training workloads, time-to-solution is critical, so costs beyond infrastructure alone should be considered. When these other cost factors are taken into account, the V100S is recommended for this workload.
The NVIDIA virtual GPU (vGPU) solution provides a flexible way to accelerate virtualized
workloads – from AI to VDI. The solution includes NVIDIA virtual GPU software and NVIDIA
data center GPUs. There are three unique NVIDIA virtual GPU software licenses, each priced
and designed to address a specific use case:
NVIDIA GRID Virtual PC/Virtual Applications (NVIDIA GRID) – accelerates office productivity
applications, streaming video, Windows 10, RDSH, multiple and high-resolution monitors
and 2D electronic design automation (EDA).
NVIDIA Quadro Virtual Data Center Workstation (Quadro vDWS) – accelerates professional
design and visualization applications including Autodesk Revit, Maya, Dassault Systèmes
CATIA, Solidworks, Esri ArcGIS Pro, Petrel, and more.
NVIDIA Virtual Compute Server (vCS) – accelerates artificial intelligence (AI), deep learning
(DL), data science and high-performance computing (HPC) workloads run in a virtualized
environment.
Decoupling the GPU hardware and virtual GPU software options enables customers to benefit
from innovative features delivered in the software at a regular cadence, without a dependency
on purchasing new GPU hardware. It also provides the flexibility for IT to architect the optimal
solution to meet the specific needs of users in their environment.
The NVIDIA virtual GPU software editions (Quadro vDWS, NVIDIA GRID vPC, and NVIDIA vCS) differ across the following feature areas:
Configuration and Deployment: Windows OS support; Linux OS support; Multi-vGPU/NVLink; page retirement; maximum hardware-rendered display (Quadro vDWS: Four 5K, Two 8K; NVIDIA GRID vPC: Four QHD, Two 4K, One 5K; NVIDIA vCS: One 4K).
Advanced Professional Features: ISV certifications; NVIDIA CUDA/OpenCL.
Graphics Features and APIs: NVENC; Quadro optimizations; DirectX; Vulkan support.
Available Profiles: Quadro vDWS: 0Q, 1Q, 2Q, 3Q, 4Q, 6Q, 8Q, 12Q, 16Q, 24Q, 32Q, 48Q; NVIDIA GRID vPC: 0B, 1B, 2B; NVIDIA vCS: 4C, 6C, 8C, 12C, 16C, 24C, 32C, 48C.
Table 4 shows the NVIDIA GPUs recommended for virtualization workloads. The GPUs in this
table are tested and supported with NVIDIA virtual GPU software. Refer to the NVIDIA virtual
GPU product documentation for the full support matrix details.
V100S/V100: 1 GPU per board (Volta); RT Cores: --; Memory: 32GB/16GB HBM2; vGPU profiles: 1GB, 2GB, 4GB, 8GB, 16GB, 32GB; Form factor: PCIe 3.0 dual slot and SXM2.
Quadro RTX 8000: 1 GPU per board (Turing); RT Cores: 72; Memory: 48GB GDDR6; vGPU profiles: 1GB, 2GB, 3GB, 4GB, 6GB, 8GB, 12GB, 16GB, 24GB, 48GB; Form factor: PCIe 3.0 dual slot.
Quadro RTX 6000: 1 GPU per board (Turing); RT Cores: 72; Memory: 24GB GDDR6; vGPU profiles: 1GB, 2GB, 3GB, 4GB, 6GB, 8GB, 12GB, 24GB; Form factor: PCIe 3.0 dual slot.
T4: 1 GPU per board (Turing); RT Cores: 40; Memory: 16GB GDDR6; vGPU profiles: 1GB, 2GB, 4GB, 8GB, 16GB; Form factor: PCIe 3.0 single slot.
M10: 4 GPUs per board (Maxwell); RT Cores: --; Memory: 32GB GDDR5 (8GB per GPU); vGPU profiles: 0.5GB, 1GB, 2GB, 4GB, 8GB; Form factor: PCIe 3.0 dual slot.
P6: 1 GPU per board (Pascal); RT Cores: --; Memory: 16GB GDDR5; vGPU profiles: 1GB, 2GB, 4GB, 8GB, 16GB; Form factor: MXM (blade servers).
The NVIDIA GPUs recommended for virtualization are divided into three categories:
Performance Optimized GPUs are typically recommended for high-end virtual workstations
running professional visualization applications, artificial intelligence, deep learning, data
science or HPC workloads.
Density Optimized GPUs are typically recommended for knowledge worker virtual desktop
infrastructure (VDI) to run office productivity applications, streaming video and Windows 10.
They are designed to maximize the number of VDI users supported in a server.
Blade Optimized GPUs are designed to fit in the compact, blade server form factor and
leverage a Mobile PCI Express Module (MXM) interconnect instead of the standard PCIe
interconnect used for rack servers. Currently, NVIDIA offers just one MXM form factor GPU
for blade servers, the P6. The P6 GPU should be selected to run any workload where a
blade server form factor is preferred.
The NVIDIA T4 GPU is a compact, single slot card that consumes just 70W of power. By
comparison, the NVIDIA V100S and V100, Quadro RTX 6000, Quadro RTX 8000, and M10 GPUs
are dual slot PCIe cards, which consume twice as much space (two PCIe slots) inside the
server and more than three times the power. This means that you can fit two NVIDIA T4 GPUs in the same space as a single V100S, V100, Quadro RTX 6000, Quadro RTX 8000, or M10 GPU.
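As a rough illustration of this slot-and-power trade-off, the sketch below counts how many boards fit under hypothetical server constraints. The T4 figures (single slot, 70W) come from the text above; the roughly 250W dual-slot board power, the free slot count, and the power budget are assumptions for illustration only.

```python
# Illustrative slot/power density check. The T4 figures (single slot, 70 W)
# come from the text above; the dual-slot board power (~250 W), the free slot
# count, and the power budget are assumptions for illustration only.

def max_boards(free_slots, power_budget_w, slots_per_board, power_per_board_w):
    """Boards that fit, limited by whichever constraint (slots or power) binds first."""
    by_slots = free_slots // slots_per_board
    by_power = int(power_budget_w // power_per_board_w)
    return min(by_slots, by_power)

free_slots, power_budget_w = 4, 600  # hypothetical 2U server
print(max_boards(free_slots, power_budget_w, slots_per_board=1, power_per_board_w=70))   # T4 -> 4
print(max_boards(free_slots, power_budget_w, slots_per_board=2, power_per_board_w=250))  # dual-slot GPU -> 2
```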
Built on the innovative NVIDIA RTX platform, the Quadro RTX 6000 and Quadro RTX 8000 GPUs
are uniquely positioned to power the most demanding professional visualization workloads.
They are an integral part of the NVIDIA RTX Server solution, which can run various workloads
including powerful virtual workstations. You will find that the performance of the Quadro RTX
6000 and Quadro RTX 8000 GPUs is very comparable, and the key differences between these
two cards are the memory size and price. The Quadro RTX 8000 GPU should be selected over
the Quadro RTX 6000 GPU if there is a requirement for larger memory to power virtual
workstations that support very large animations, files, or models.
The NVIDIA V100S is the most advanced data center GPU ever built to accelerate AI, high
performance computing, and data science. Customers who train or use neural networks, use
computationally intensive applications, or run simulations requiring double precision accuracy
(FP64 performance) should be using the V100S, which provides the best time-to-solution. V100
is available in two form factors, PCIe and SXM module. The SXM module is available with servers that support NVIDIA® NVLink® and provides the best performance and strong scaling for hyperscale and HPC data centers running applications that scale to multiple GPUs, such as deep learning.
Note: ¹ The assumption is that enough frame buffer is available on all vGPUs across all GPUs.
AI Deep Learning Inference: ResNet-50 (ResNet-50 V1.5, TensorRT 6.0.1, Batch Size = 128, 19.12-py3, Precision: Mixed), run with NVIDIA vCS. NVIDIA® TensorRT™ is a platform for high-performance deep learning inference.
Professional Graphics
The Quadro RTX 6000 and Quadro RTX 8000 GPUs are based on the NVIDIA Turing™ architecture, which enables major advances in efficiency and performance and is well suited for professional graphics workloads. The significantly higher power budget of the Quadro RTX 6000 and Quadro RTX 8000 cards enables them to provide higher performance than the T4.
However, for those that do not require the highest performance, the T4 provides the best
performance per dollar for professional graphics workloads.
Figure 1 represents SPECviewperf13 results tested on a server with Intel Xeon Gold 6154 (18C,
3.0GHz), Quadro vDWS software, VMware ESXi 6.7.0 U3, host/guest driver 440.44/441.66, VM
config: Windows 10, 8 vCPU, 16GB memory.
RTX 6000 and RTX 8000 for the Best Professional Graphics Performance (Higher is Better)
[Figure 1: bar chart of normalized SPECviewperf13 geomean scores for the T4, V100S, RTX 6000, and RTX 8000.]
Figure 2 assumes estimated GPU street price plus NVIDIA Quadro vDWS software cost with 3-
year subscription.
[Figure 2: bar chart of normalized SPECviewperf13 performance per dollar for the V100S, RTX 8000, RTX 6000, and T4.]
Rendering
Quadro RTX 6000 and Quadro RTX 8000 GPUs have RT Cores, accelerator units that are
dedicated to performing ray tracing operations with extraordinary efficiency, making them the
optimal choice for providing the highest rendering performance. The Quadro RTX 6000 and
Quadro RTX 8000 GPUs also have a significantly higher power budget versus the T4, resulting
in higher performance. The Quadro RTX 8000 would be selected over Quadro RTX 6000 if there
is a requirement to support larger models or scenes. Because the scenes used in our tests did
not require the additional frame buffer of the Quadro RTX 8000, you will see that the
performance results between Quadro RTX 6000 and Quadro RTX 8000 were comparable for
this test. However, the attractive price point of the Quadro RTX 6000 makes it ideal for those
who wish to achieve the best performance per dollar.
Figure 3 represents testing on a server with Intel Xeon Gold 6154 (18C, 3.0GHz), Quadro vDWS,
VMware ESXi 6.7.0 U3, host/guest driver 440.44/441.66, VM config: Windows 10, 8 vCPU, 16GB memory.
RTX 6000 and RTX 8000 for the Best Rendering Performance (Lower is Better)
[Figure 3: bar chart of normalized Autodesk Arnold 6.0.1.0 render times for the T4, V100S, RTX 6000, and RTX 8000.]
Figure 4 assumes estimated GPU street price plus NVIDIA Quadro vDWS software cost with 3-
year subscription.
[Figure 4: bar chart of normalized Autodesk Arnold 6.0.1.0 performance per dollar for the V100S, RTX 8000, T4, and RTX 6000.]
[Figure 5: bar chart of normalized TensorFlow ResNet-50 performance for the T4, RTX 6000, RTX 8000, and V100S.]
[Figure 6: bar chart of normalized TensorRT ResNet-50 performance for the T4, RTX 8000, RTX 6000, and V100S.]
Figure 7 assumes estimated GPU street price plus NVIDIA vCS software cost.
[Figure 7: bar chart of normalized TensorRT ResNet-50 performance per dollar for the V100S, RTX 8000, T4, and RTX 6000.]
[Figure 8: bar chart of normalized LAMMPS performance for the T4, RTX 6000, RTX 8000, and V100S.]
Figure 9 assumes estimated GPU street price plus NVIDIA vCS software cost.
[Figure 9: bar chart of normalized LAMMPS performance per dollar for the RTX 8000, T4, RTX 6000, and V100S.]
Knowledge Workers
As more knowledge worker users are added to a server, the server eventually runs out of CPU resources. Adding an NVIDIA GPU for this workload offloads work from the CPU, resulting in improved user experience and performance for end users. The NVIDIA nVector knowledge
worker VDI workload was used to test user experience and performance with NVIDIA GPUs.
NVIDIA M10, T4, Quadro RTX 6000, Quadro RTX 8000 and V100S achieve similar performance
for this workload.
Customers are realizing the benefits of increased resource utilization by leveraging common virtualized, GPU-accelerated server resources to run virtual desktops and workstations, and then leveraging those same resources to run compute workloads when users are logged off. Customers who
want to be able to run compute workloads on the same infrastructure that they run VDI, might
leverage a V100S to do so. Learn more about Using NVIDIA Virtual GPUs to Power Mixed
Workloads in our whitepaper.
Despite having 48GB of frame buffer, the Quadro RTX 8000 supports a maximum of only 32
users due to reaching the context switching limit per GPU. Refer to Table 6 to see how many
VDI users can be supported for each GPU (with 1GB profile size).
Max. Users (1GB profile): M10: 32; T4: 16; Quadro RTX 6000: 24; Quadro RTX 8000: 32; V100S: 32.
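The user counts above follow from two constraints: frame buffer divided by the vGPU profile size, and the per-GPU limit of 32 vGPUs noted for the Quadro RTX 8000. A minimal sketch of that logic, using the frame buffer sizes from Table 4:

```python
# Sketch of the logic behind Table 6: VDI users per board are limited both by
# frame buffer divided by the vGPU profile size and by the maximum number of
# vGPUs that can run on one physical GPU (32, per the context-switching limit
# mentioned above).

MAX_VGPUS_PER_GPU = 32

def max_users(frame_buffer_gb, profile_gb=1, gpus_per_board=1):
    per_gpu_fb = frame_buffer_gb // gpus_per_board
    per_gpu_users = min(per_gpu_fb // profile_gb, MAX_VGPUS_PER_GPU)
    return per_gpu_users * gpus_per_board

print(max_users(16))                    # T4               -> 16
print(max_users(24))                    # Quadro RTX 6000  -> 24
print(max_users(48))                    # Quadro RTX 8000  -> 32 (capped)
print(max_users(32))                    # V100S            -> 32
print(max_users(32, gpus_per_board=4))  # M10 (4 x 8 GB)   -> 32
```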
Figure 10 assumes estimated GPU street price plus NVIDIA GRID software cost with 3-year
subscription divided by number of users.
M10 for Best Cost per User. T4 for Best Flexibility and Low Cost per User. (Lower is Better)
[Figure 10: bar chart of normalized cost per user for the V100S, RTX 8000, RTX 6000, T4, and M10.]
Organizations choose to virtualize servers and applications for various reasons (manageability, flexibility, and security to name a few) and are often willing to accept some performance trade-off in return. When a full GPU is allocated to a workload in a virtualized environment, there is a performance difference relative to bare metal. However, with NVIDIA vGPU this difference is negligible and will depend on the workload, as well as various other configuration variables. The following example illustrates 4% lower performance with NVIDIA vGPU in comparison to a bare metal server running an AI inference benchmark in a 1:1 configuration.
Figure 11 represents Resnet-50 V1.5 | TensorRT 6.0.1 | Batch Size = 128 | 19.12-py3 |
Precision: Mixed.
[Figure 11: normalized AI inference throughput on the T4, bare metal versus NVIDIA vGPU (Higher is Better).]
NVIDIA vGPU software improves overall utilization by sharing a GPU across multiple virtual machines, scheduling the time during which each virtual machine can use the GPU. NVIDIA vGPU software provides multiple GPU scheduling options to accommodate a variety of Quality of Service (QoS) levels for sharing the GPU. View the NVIDIA vGPU product documentation for more information about GPU scheduling options.
In general, the performance per virtual machine when sharing a GPU with n virtual machines
will be 1/n of the total performance of the GPU. Therefore, two virtual machines sharing a GPU
will result in approximately 50 percent of the overall performance per virtual machine and four
virtual machines will result in approximately 25 percent of the overall performance per virtual
machine.
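A quick worked example of this 1/n rule:

```python
# Worked example of the 1/n rule: n virtual machines time-slicing one GPU each
# receive roughly 1/n of the full GPU's performance.
for n_vms in (1, 2, 4):
    print(f"{n_vms} VM(s): ~{1.0 / n_vms:.0%} of full-GPU performance per VM")
# 1 VM(s): ~100%, 2 VM(s): ~50%, 4 VM(s): ~25%
```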
Figure 12 illustrates multiple virtual machines sharing a GPU, with an overall throughput increase of 16% compared to a single virtual machine.
Figure 12 represents SPECviewperf13 results tested on a server with Intel Xeon Gold (18C,
3.0GHz), Quadro vDWS with RTX 8000 with Equal Share scheduler, VMware ESXi 6.7.0 U3,
host/guest driver 440.44/441.66, VM config: Windows 10, 8 vCPU, 16GB memory.
[Figure 12: normalized SPECviewperf13 geomean per VM for RTX 6000-24Q (single VM, 1.0), RTX 6000-12Q (two VMs, about 0.55 each), and RTX 6000-6Q (four VMs, about 0.29 each), showing aggregate throughput increasing as the GPU is shared.]
However, when workloads across virtual machines are not executed at the same time, or are not always GPU bound, per-VM performance can exceed this expectation. This requires the default GPU scheduling policy, “Best Effort,” which lets a virtual machine use GPU time left unused by other virtual machines. See Figure 13 for a simplified view of how the “Best Effort” GPU scheduler works.
The scaling factor of virtual machines with vGPU aggregation is similar to the scaling factor of non-virtualized configurations. NVIDIA virtual GPU technology supports aggregating vGPUs within a virtual machine for the highest performance, via NVLink or traditional PCIe-based solutions. NVLink enables a high-speed, direct GPU-to-GPU interconnect that provides higher bandwidth for multi-GPU system configurations than traditional PCIe-based solutions.
Figure 14 represents Server Config: 2x Intel Xeon Gold (6140, 3.2GHz), VMware ESXi 6.7 U3,
NVIDIA vCS 9.1 RC, NVIDIA V100 (32C profile), Driver 430.18, TensorFlow Resnet-50 V1, NGC
19.01, FP16 BS: 256.
[Figure 14: normalized TensorFlow ResNet-50 throughput for 1x, 2x, and 4x V100 configurations, comparing NVIDIA vGPU with bare metal; values per group: 0.94/1.0 (1x), 1.74/1.8 (2x), 3.37/3.45 (4x).]
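One way to summarize Figure 14 is as a scaling efficiency, that is, throughput relative to a linear scale-up of the single-GPU result. The sketch below uses the normalized values from the figure and assumes the lower value in each pair is the NVIDIA vGPU configuration and the higher value is bare metal.

```python
# Scaling efficiency from the normalized ResNet-50 throughputs in Figure 14,
# assuming the lower value in each pair is the NVIDIA vGPU (vCS) configuration
# and the higher value is bare metal.

def scaling_efficiency(throughput, gpus, single_gpu_throughput):
    return throughput / (gpus * single_gpu_throughput)

vgpu       = {1: 0.94, 2: 1.74, 4: 3.37}
bare_metal = {1: 1.00, 2: 1.80, 4: 3.45}

for n in (2, 4):
    print(f"{n}x V100  vGPU: {scaling_efficiency(vgpu[n], n, vgpu[1]):.0%}   "
          f"bare metal: {scaling_efficiency(bare_metal[n], n, bare_metal[1]):.0%}")
# 2x: ~93% vs ~90%; 4x: ~90% vs ~86% -- vGPU scaling tracks bare metal closely.
```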
While this technical brief provides general guidance on how to select the right NVIDIA GPU for
your workload, actual results may vary depending on the specific application being virtualized.
The most successful deployments are those that balance virtual machine density (scalability)
with required performance. This is achieved when a proof of concept (POC) with production
workloads is conducted while analyzing the utilization of all resources of a system and
gathering subjective feedback from all stakeholders. Consistently analyzing resource utilization and gathering subjective feedback allows the configuration to be tuned to meet performance requirements while achieving the best scale.
Other Resources:
Try NVIDIA vGPU for free
Using NVIDIA Virtual GPUs to Power Mixed Workloads
NVIDIA Virtual GPU Software Documentation
NVIDIA vGPU Certified Servers
OpenCL
OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc.
Trademarks
NVIDIA, the NVIDIA logo, CUDA, NVIDIA GRID, NVIDIA RTX, NVIDIA Turing, NVIDIA Volta, NVLink, Quadro, Quadro RTX, and TensorRT are trademarks
and/or registered trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks of the
respective companies with which they are associated.
Copyright
© 2020 NVIDIA Corporation. All rights reserved.