
vCUDA: GPU Accelerated High Performance Computing in Virtual Machines

Lin Shi, Hao Chen and Jianhua Sun


Advanced Internet and Media Lab
School of Computer and Communications
Hunan University, Changsha, 410082, China
{linshi,haochen,jhsun}@aimlab.org

Abstract

This paper describes vCUDA, a GPGPU (General Purpose Graphics Processing Unit) computing solution for virtual machines. vCUDA allows applications executing within virtual machines (VMs) to leverage hardware acceleration, which can be beneficial to the performance of a class of high performance computing (HPC) applications. The key idea in our design is API call interception and redirection. With API interception and redirection, applications in VMs can access the graphics hardware device and achieve high performance computing in a transparent way. We carry out a detailed analysis of the performance and overhead of our framework. Our evaluation shows that GPU acceleration for HPC applications in VMs is feasible and competitive with applications running in a native, non-virtualized environment. Furthermore, our evaluation also identifies the main cause of overhead in our current framework, and we give some suggestions for future improvement.

1. Introduction

Recently, system level virtualization technology has revived, a trend that stems from the continued growth in hardware performance and the increased demand for service consolidation from business markets. Virtual machine (VM) technologies allow different guest VMs to coexist in a physical machine under the management of a virtual machine monitor (VMM). VMM technology has been applied to many areas including intrusion detection [8], high performance computing [12] and device driver reuse [21].

Over the past few years, there has been a marked increase in the performance and capabilities of the graphics processing unit (GPU). The modern GPU is not only a powerful graphics engine but also a highly parallel programmable processor featuring peak arithmetic and memory bandwidth [10, 28, 29]. The research community has successfully mapped a broad range of computationally demanding and complex problems to the GPU. The introduction of vendor-specific technologies (such as NVIDIA's CUDA [7]) is further accelerating the adoption of high performance parallel computing on commodity computers.

Although virtualization technologies provide a wide range of benefits such as system security, ease of management, isolation and live migration, VM technologies have not been widely adopted in the high performance computing area. This is mainly due to the overhead incurred by indirect access to physical resources such as the CPU, I/O devices and physical memory, which is one of the fundamental characteristics of virtual machines.

In this paper, we propose vCUDA, a framework for HPC that uses the hardware acceleration provided by GPUs to address the performance issues associated with VMs. Because of the closed and diverse nature of GPU technology, this powerful graphics processing ability cannot be directly used by applications running on virtualization platforms. To achieve hardware acceleration non-invasively for general purpose computing applications in VMs, we propose a solution based on intercepting CUDA API calls: CUDA commands and data in VMs are intercepted and redirected to a CUDA-enabled graphics device, and the real computation is carried out by the vendor-supplied GPU driver and CUDA library in the VMM. With detailed performance evaluations, we demonstrate that hardware accelerated high performance computing jobs can run as efficiently in a virtualized environment as on a native host. Although we focus on CUDA and Xen, we believe that our framework can be readily extended to other vendor-specific GPGPU solutions and other VMMs. To the best of our knowledge, this is the first study to adopt GPU accelerated HPC in virtual machines.

In summary, the main contributions of our work are:

• We propose a framework which allows high performance computing applications to benefit from hardware acceleration in virtual machines. To demonstrate the framework, we have developed a prototype system using the Xen virtual machine and CUDA.
• We present a set of extensions built with this framework, such as multiplexing and suspend/resume, without any modifications to applications.

• We carry out a detailed performance evaluation of the overhead of our framework. This evaluation shows that the vCUDA framework is practical and can deliver performance for HPC applications close to that of native environments.

The rest of the paper is organized as follows. In Section 2, we provide some necessary background for this work. We then present our framework for hardware accelerated high performance computing in VMs in Section 3 and carry out a detailed performance analysis in Section 4. In Section 5, we discuss several issues in our current implementation and how they can be addressed in future. We discuss related work in Section 6 and conclude the paper in Section 7.

2. Background

2.1. VMM and GPU Virtualization

System level virtualization technologies simulate details of the underlying hardware in software, provide a different hardware abstraction for each operating system instance, and run multiple heterogeneous operating systems concurrently. They decouple the software from the hardware by forming a level of indirection traditionally known as the VMM. There are different forms of VMM, but they all provide a complete and consistent view of the underlying hardware to the VMs running on them.

The VMM provides total mediation of all interactions between the virtual machine and the underlying hardware, thus allowing strong isolation between virtual machines and supporting the multiplexing of many virtual machines on a single hardware platform. The VMM layer can also map and remap virtual machines to available hardware resources at will, and even migrate virtual machines across physical machines. Encapsulation also means that administrators can suspend virtual machines and resume them at arbitrary times, or checkpoint them and roll them back to a previous execution state. With this general-purpose undo capability, systems can easily recover from crashes or configuration errors.

In the field of commercial software, VMware [31] has become the de facto industry standard, and in the field of open-source software, Xen [2, 3] leads the emergence of paravirtualization technology. Both Xen and VMware developed the hosted architecture, in which the virtualization layer uses the device drivers of a host operating system such as Windows or Linux to access devices.

While virtualization technology has been successfully applied to a variety of devices, it is difficult to virtualize the GPU in a VMM; one main reason is the lack of a standard interface at the hardware level. One possibility is to redirect graphics processing requests from the guestOS to the hostOS using software emulation. However, this is not practical for modern graphics hardware, because trapping requests at this level is generally too inefficient. Another choice is to intercept the graphics protocol stream at a higher (device-independent) point in the stream and redirect it to the hostOS. The approach might be to replace the system dynamic link libraries containing the APIs with protocol stubs that perform the redirection, though there are still significant issues with this approach when the methods for managing state in the original APIs were not designed with it in mind.

In the rest of this paper, the term hostOS refers to the administrative OS (domain0 in Xen's terminology). The term guestOS refers to a VM (domainU in Xen).

2.2. CUDA

CUDA (Compute Unified Device Architecture) [7] is a complete GPGPU solution that provides direct access to the hardware interface, rather than the traditional approach of relying on a graphics API. The CUDA framework uses the common C language as its programming language and provides a large number of high performance computing primitives and development capabilities, so that developers can build more efficient data-intensive computing solutions on the basis of the GPU's powerful acceleration ability.

The CUDA software stack is composed of three layers: a hardware driver, an application programming interface (API) and its runtime, and two higher-level mathematical libraries of common usage. The runtime library is split into three parts: a host component that runs on the host and provides functions to control and access one or more compute devices from the host; a device component that runs on the device and provides device-specific functions; and a common component that provides built-in vector types and a subset of the C standard library supported in both host and device code. The host runtime component is composed of two APIs: a low-level API called the CUDA driver API, and a higher-level API called the CUDA runtime API, which is implemented on top of the CUDA driver API. An important fact is that they are mutually exclusive: an application can only use one of them. NVCC is the compiler for CUDA; it simplifies the process of compiling CUDA code. Its basic workflow consists of separating device code from host code and compiling the device code into a binary form, or cubin object.
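As a concrete illustration of the runtime API that vCUDA virtualizes, the following minimal host program (our own example, not taken from the paper) uses only functions exported by the CUDA runtime library. Each one is an ordinary dynamic-library call, which is what makes interception at the library boundary possible.

// Host-side CUDA runtime API usage. Every call below is resolved from the
// CUDA runtime library, so a substitute library can interpose on all of them.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);                  // device management
    cudaSetDevice(0);                        // choice of device

    float *dev = nullptr;
    cudaMalloc((void **)&dev, 1 << 20);      // device memory allocation

    float host[256] = {0};
    cudaMemcpy(dev, host, sizeof host, cudaMemcpyHostToDevice);  // data to GPU
    cudaMemcpy(host, dev, sizeof host, cudaMemcpyDeviceToHost);  // data from GPU
    cudaFree(dev);

    std::printf("CUDA devices visible: %d\n", n);
    return 0;
}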
Figure 1. The CUDA execution model: a number of threads are batched together in blocks, which are again batched together in a grid.

For the programmer, the CUDA execution model shown in Figure 1 is a collection of threads running in parallel. The programmer decides the number of threads to be executed. A collection of threads (called a block) runs on a multiprocessor at a given time. Multiple blocks can be assigned to a single multiprocessor, and their execution is time-shared. A single execution on a device generates a number of blocks, and the collection of all blocks in a single execution is called a grid. All threads of all blocks executing on a single multiprocessor divide its resources equally amongst themselves. Each thread and block is given a unique ID that can be accessed within the thread during its execution. Each thread executes a single instruction set called the kernel, the core code to be executed on each thread. Using the thread and block IDs, each thread can perform the kernel task on a different set of data. Since the device memory is available to all the threads, a thread can access any memory location.
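As an illustration of how the IDs select each thread's data (a sketch of ours, not code from the paper):

// Each thread computes one element, selected by its block and thread IDs.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)                                       // guard the final block
        c[i] = a[i] + b[i];
}

// Launch as a grid of (n + 255) / 256 blocks of 256 threads each:
//   vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);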
3. vCUDA

The vCUDA framework is organized around two main architectural features:

• Virtualization of the CUDA API. 31 of the 56 runtime APIs were encapsulated into RPC calls. Their parameters are properly queued and redirected to the hostOS, and variables are cached and kept persistent on both the server and client sides. Through this kind of virtualization, the graphics hardware interface is decoupled from the software layer.

• Lazy RPC transmission. vCUDA uses XML-RPC [32] as the means of high-level communication between the guestOS and the hostOS, taking into account its compatibility and portability. XML-RPC is well supported by a number of third-party libraries that provide language-agnostic ways of invoking RPC calls. In addition, we adopted a lazy RPC mode to improve the efficiency of the original XML-RPC.

In addition, CUDA is currently not a fully open API; some internal details have not been documented in the official software development kit (SDK), so we do not have full knowledge of the states maintained only by the underlying hardware driver or shared between applications and hardware. We achieve the virtualization functionality through the following three aspects (a sketch combining them follows the list):

• Function parameters. The intercepting library has no access to all the internals of an application linked with it, but it can obtain all the parameters of the corresponding API calls, which can be used as inputs to the faked API calls defined in the intercepting library. These faked API calls with proper parameters can then be sent to the remote server for execution as normal calls.

• Ordering semantics. Ordering semantics are the set of rules that constrain the order in which API calls may be executed. CUDA is basically a strictly ordered interface, which means some APIs must be launched in the order in which they are specified. This behavior is essential for maintaining internal consistency. But in some cases, where possible, vCUDA uses less constrained ordering semantics when doing so increases performance.

• Device state. CUDA maintains a large amount of state in hardware, with attributes such as device pointers, symbols, pitches, textures and so on. On a workstation with hardware acceleration, the graphics hardware keeps track of most or all of the current state. However, in order to properly implement remote execution of these APIs in a virtual machine, it is necessary for the client to keep track of some of this state in software.
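Here is a minimal sketch combining the three aspects (our own hedged illustration; names such as 'rpc_send' and the opcode value are assumptions, not part of vCUDA's published code). A faked 'cudaMemcpy' captures the caller's parameters, preserves call ordering through a FIFO queue, and carries device-pointer handles whose real values are tracked on the stub side.

// Hypothetical interposed CUDA runtime call inside the vCUDA library.
// It captures parameters, preserves call order, and forwards the call.
#include <cstddef>
#include <cstring>
#include <vector>

enum Opcode : unsigned char { OP_MEMCPY = 0x21 };   // one-byte opcode (assumed value)

struct ApiRecord {                       // one entry of the global API queue
    Opcode opcode;
    std::vector<unsigned char> args;     // flattened copy of the arguments
};

static std::vector<ApiRecord> g_api_queue;   // ordering semantics: FIFO

static void rpc_send(std::vector<ApiRecord> &queue) {
    // Marshal the queued records into an XML-RPC request to the stub
    // (marshalling elided), then clear the queue.
    queue.clear();
}

extern "C" int cudaMemcpy(void *dst, const void *src, size_t count, int kind) {
    ApiRecord rec{OP_MEMCPY, {}};
    rec.args.resize(sizeof dst + sizeof src + sizeof count + sizeof kind + count);
    unsigned char *p = rec.args.data();
    std::memcpy(p, &dst, sizeof dst);     p += sizeof dst;    // device handle: remapped
    std::memcpy(p, &src, sizeof src);     p += sizeof src;    //   by the stub (state)
    std::memcpy(p, &count, sizeof count); p += sizeof count;
    std::memcpy(p, &kind, sizeof kind);   p += sizeof kind;
    std::memcpy(p, src, count);           // payload (host-to-device case shown)
    g_api_queue.push_back(std::move(rec));
    rpc_send(g_api_queue);                // an instant API flushes immediately
    return 0;                             // cudaSuccess; real code decodes the reply
}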
3.1. System Architecture

vCUDA uses a robust client-server model consisting of three user space modules: the vCUDA library and the virtual GPU (vGPU) on the client side, and the vCUDA stub on the server side. Figure 2 shows the vCUDA architecture. In the rest of this paper, the term server memory refers to the hostOS memory space, the term client memory refers to the VM memory space, and the term device memory refers to the memory space of the graphics hardware residing in the hostOS.

Figure 2. The vCUDA architecture.

3.1.1. vCUDA Library

The vCUDA library resides in the guestOS as a substitute for the standard CUDA runtime library, and it is responsible for intercepting and redirecting API calls from the client to the stub. CUDA provides two programming interfaces, the runtime API and the driver API. We chose the runtime API as the target of virtualization because it is the most widely used library in practice and also the officially recommended interface for programmers. However, we do not anticipate any major obstacle to virtualizing the driver-level API.

There are 56 runtime APIs in total described in the official programming guide. But we also found 6 other internal APIs in the dynamic linking library of the CUDA runtime that are not visible to programmers. They appear to be used to manage device execution code and the memory allocated to variables. These six internal APIs are compiled by NVCC into the final executable file and never interact with the other APIs. They are called before the choice of device and therefore have nothing to do with a specific GPU. For these six APIs, we simply wrap the corresponding parameters and send them to the stub. We do not assume any internal logic for these functions, because they might change in the future.

We use NVCC-generated intermediate code, combined with control flow analysis, to customize the virtual logic for each API. For each API we intercept in the faked library, all calls are packed into a global API queue. This queue contains a copy of the arguments to the corresponding function as well as an opcode, which is encoded into a single byte. The contents of this queue are pushed to the stub periodically according to some pre-defined strategies (Section 3.2). At the current stage, we do not support the virtualization of 3D graphics APIs, which would require a large amount of engineering effort; the discussion of these APIs is therefore beyond the scope of this paper (Section 5).

3.1.2. vGPU

vGPU is created, identified and used by the vCUDA library; in fact, it is represented as a large data structure in memory maintained by the vCUDA library. vGPU provides three main functionalities. First, vGPU abstracts some features of the real GPU to give each application a complete view of the underlying hardware. The vCUDA library creates a virtual GPU context for each application, which contains device attributes such as GPU memory usage and texture memory properties. Another important role of vGPU is local device memory management: when a CUDA application allocates device memory, vGPU returns a local virtual address to the application and notifies the remote stub to allocate the real device memory. vGPU is also responsible for maintaining the mappings between local and remote addresses to avoid unnecessary memory copies and leaks. The third function of vGPU is to store the CUDA API flow: most of the APIs' opcodes and parameters are stored in a global queue in memory, or in a file in the file system, to support suspend/resume as described in Section 3.4.
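vGPU's address bookkeeping can be pictured as follows (our own illustration; the structure and member names are hypothetical): allocation hands the application a local handle immediately, while the mapping table relates it to the real device pointer reported back by the stub.

// Hypothetical vGPU bookkeeping: local handles for the guest application,
// mapped to the real device pointers that live on the hostOS side.
#include <cstdint>
#include <unordered_map>

struct VGpuContext {
    uint64_t next_handle = 1;                          // source of local addresses
    size_t device_mem_used = 0;                        // device attribute mirrored locally
    std::unordered_map<uint64_t, uint64_t> to_remote;  // local handle -> remote pointer

    // Called by the faked cudaMalloc: return a handle now, bind it later.
    uint64_t alloc_local(size_t bytes) {
        device_mem_used += bytes;
        return next_handle++;
    }
    // Called when the stub replies with the real device pointer.
    void bind_remote(uint64_t local, uint64_t remote) { to_remote[local] = remote; }

    // Translation applied whenever a queued API call carries a device pointer.
    uint64_t translate(uint64_t local) const { return to_remote.at(local); }
};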
3.1.3. vCUDA Stub

The vCUDA stub receives and interprets remote requests, creates a corresponding execution context for the API calls from the guestOS, and then returns the results to the guestOS. The vCUDA stub manages the actual physical resources, such as the allocation of hardware resources and threads and the matching of parameters of API calls, and also keeps a consistent view of states on both the stub and client sides by periodic synchronization with vGPU.

The vCUDA stub spawns one thread for each client. The main purpose of these threads is to receive CUDA commands via the RPC channel and to execute those commands on behalf of the client application. Each stub thread receives the vCUDA API stream, decodes it, and translates it into a server-side representation. Then, for each client that the vCUDA library serves, the stub thread sets up an appropriate execution environment and finally calls the native APIs. This translation process is crucial to the coherence and success of our system.
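The core of each stub thread can be pictured as a decode-and-dispatch loop. The following is a hedged sketch (ours; the record format and helper names are assumptions carried over from the earlier sketches) of how a received record is translated into a native CUDA runtime call:

// Hypothetical stub-side dispatch: decode each received record and replay
// it against the native CUDA runtime in the hostOS.
#include <cuda_runtime.h>
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

enum Opcode : unsigned char { OP_MALLOC = 0x20, OP_MEMCPY = 0x21 };

struct ApiRecord { Opcode opcode; std::vector<unsigned char> args; };

// Per-client context: maps the guest's local handles to real device pointers.
struct ClientContext { std::vector<std::pair<uint64_t, void *>> handles; };

void dispatch(ClientContext &ctx, const ApiRecord &rec) {
    switch (rec.opcode) {
    case OP_MALLOC: {
        size_t bytes = 0;
        std::memcpy(&bytes, rec.args.data(), sizeof bytes);
        void *dev = nullptr;
        cudaMalloc(&dev, bytes);   // the real allocation happens here
        // ...record dev in ctx and report it back to vGPU in the RPC reply
        break;
    }
    case OP_MEMCPY:
        // ...unpack dst/src/count/kind, translate device handles via ctx,
        //    then call the native cudaMemcpy
        break;
    }
}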
3.2. Lazy RPC

Thousands of CUDA API calls can be involved in a single CUDA application. If vCUDA intercepted and redirected every API call in client applications, the same number of RPCs would be invoked, and the overhead of the world switch (the execution context switch between different guest OSes) would inevitably be introduced into the system. In virtual machines, a world switch is an extremely expensive operation and should be avoided whenever possible [23].

We classify the CUDA APIs into two categories. One is instant APIs, whose execution has an immediate effect on the state of the CUDA runtime system. The other category is lazy APIs, which do not have any side effects on the runtime state until the invocation of a following instant API. This classification allows vCUDA to reduce the frequency of world switches by updating states in the stub lazily, and to eliminate unnecessary RPCs by redirecting lazy APIs to the stub side in a batched manner, thus boosting the system's performance. A potential problem with the lazy mode worth mentioning is that it delays error reporting and distorts time measurement, so it is not suitable for debugging and measurement purposes.
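Concretely, the lazy mode can be pictured as follows (a simplified sketch of ours; which calls count as instant versus lazy is decided by vCUDA's own classification, not shown here): lazy calls only append to the client-side queue, and the accumulated batch is flushed in a single RPC when an instant call arrives.

// Hypothetical lazy-RPC batching in the vCUDA library: lazy calls are
// queued locally; an instant call flushes the whole batch in one RPC.
#include <vector>

struct ApiRecord { unsigned char opcode; std::vector<unsigned char> args; };

static std::vector<ApiRecord> g_batch;

static void rpc_send_batch(std::vector<ApiRecord> &batch) {
    // One XML-RPC round trip -- and thus one world switch -- for the
    // whole batch (marshalling elided).
    batch.clear();
}

// Lazy API: no side effect on runtime state until a later instant API.
void record_lazy(const ApiRecord &rec) {
    g_batch.push_back(rec);
}

// Instant API: its effect (and result) is needed now, so flush everything.
int record_instant(const ApiRecord &rec) {
    g_batch.push_back(rec);
    rpc_send_batch(g_batch);   // N queued calls cost a single RPC
    return 0;                  // status decoded from the RPC reply in reality
}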

3.3. Multiplexing

One fundamental feature of VM technologies is device multiplexing. For example, in Xen, the physical network card can be multiplexed among multiple concurrently executing guest OSes. To enable this multiplexing, the privileged driver domain (domain0) and the unprivileged guest domains (domainU) communicate by means of a split network-driver architecture: the driver domain hosts the backend of the split network driver, and the domainU hosts the frontend.

In vCUDA, we implement GPU multiplexing at the application level through the cooperation of the vCUDA library in domainU and the stub in domain0, thus allowing multiple CUDA applications to execute concurrently in the same VM or in different VMs. As described in Section 3.1.3, the vCUDA stub spawns one thread for each client, and there is an indicator (a hash of the IP address, domain ID and process ID, sketched below) for each thread to distinguish clients from different VMs. Under the coordination of the vCUDA stub, these threads allocate and manage hardware resources cooperatively to guarantee the correct execution semantics of client applications.

There are three situations to consider when mapping threads to GPU hardware. The first is one thread mapped to a single hardware device. The second is multiple threads running on a single device, and the last is one thread controlling multiple GPU devices. According to Nvidia's official guide, the third case is not supported by the current GPU hardware. Although it is not an officially recommended operation, the second case was also implemented in our framework, and some performance evaluation is given in Section 4.2.
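The per-client indicator can be as simple as the following sketch (ours; the exact combination of the fields is an assumption, as the paper only states which fields are hashed):

// Hypothetical client indicator: a hash of (IP address, Xen domain ID, PID)
// used by the stub to tell apart clients coming from different VMs.
#include <cstdint>
#include <functional>
#include <string>

uint64_t client_indicator(const std::string &ip, uint32_t domain_id, uint32_t pid) {
    std::hash<std::string> h;
    // Combine the three identity fields into one key for the stub's thread table.
    return h(ip + "/" + std::to_string(domain_id) + "/" + std::to_string(pid));
}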
3.4. Suspend and Resume

vCUDA provides support for suspend and resume, enabling client sessions to be interrupted or moved between computers [18]. Upon resume, vCUDA presents the same device state that the application observed before suspending, while retaining hardware acceleration capabilities.

The basis for implementing application suspend and resume is to store the CUDA API calls that affect device state, together with the corresponding states bound to these calls, when necessary. While the guest is running, the vCUDA stub and vGPU both snoop on the CUDA commands they forward to keep track of the device state. Upon resume, vCUDA spawns a new thread in the stub, which is initialized by synchronizing it with the application's vCUDA state stored by vCUDA. The time spent on resume depends on the RPC efficiency, the GPU computing time of the specific application, and the world switch overhead.
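Resume can thus be viewed as replaying the recorded, state-affecting API flow into a fresh stub thread. A hedged sketch (ours, reusing the names assumed in the earlier sketches):

// Hypothetical resume path: replay the recorded state-affecting API flow so
// that a fresh stub thread reconstructs the device state seen at suspension.
#include <vector>

struct ApiRecord { unsigned char opcode; std::vector<unsigned char> args; };
struct ClientContext { /* handle table, CUDA context, ... */ };

// Defined as in the Section 3.1.3 sketch.
void dispatch(ClientContext &ctx, const ApiRecord &rec);

// 'flow' is the global queue (or its on-disk copy) that vGPU kept while the
// application ran; only calls that affect device state were retained.
void resume(ClientContext &fresh_ctx, const std::vector<ApiRecord> &flow) {
    for (const ApiRecord &rec : flow)
        dispatch(fresh_ctx, rec);   // re-execute against the native runtime
}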
4. Experiments

While the previous sections have presented detailed technical descriptions of the vCUDA system, this section evaluates the efficiency of vCUDA using programs selected from the official SDK examples: a set of general-purpose algorithms from various domains. The benchmarks range from simple data management to more complex Walsh transform computation and Monte Carlo simulation. Table 1 shows the statistical characteristics of these benchmarks, such as the number of API calls, the device memory they consume, and the data volume transferred to or from the GPU device.

These applications are evaluated with respect to the following criteria:

• Performance. How close does vCUDA come to providing the performance observed in an unvirtualized environment with GPU acceleration?

• Lazy RPC and concurrency. How much can vCUDA reduce the frequency of network transmission through the lazy RPC mechanism? How well does vCUDA scale to support multiple CUDA applications running concurrently?

• Suspend and resume. What is the latency for resuming a suspended CUDA application? What is the size of an application's recorded CUDA state?

• Compatibility. How compatible is vCUDA with a wide range of applications besides the examples distributed with the CUDA SDK?
Table 1. Statistics of Benchmark Applications.

Benchmark                    Number of APIs   GPU RAM    Data Volume
AlignedTypes (AT)            1990             94.00MB    611.00MB
BinomialOptions (BO)         31               0.01MB     0.01MB
BlackScholes (BS)            5143             61.03MB    76.29MB
ConvolutionSeparable (CS)    48               108.00MB   72.00MB
FastWalshTransform (FWT)     144              128.00MB   128.00MB
MersenneTwister (MT)         24               91.56MB    91.56MB
MonteCarlo (MC)              53               187.13MB   0.00MB
ScanLargeArray (SLA)         6890             7.64MB     11.44MB

The following testbed was used for all benchmarks: a personal computer equipped with one Intel Core 2 Duo E6550 processor running at 2.33 GHz with two single-threaded cores, and provided with 2 GBytes of memory. The graphics hardware was NVIDIA's GeForce 8600 GT. As for software, the test machine ran the RHEL 5.0 Linux distribution with the 2.6.16.29 kernel and the official NVIDIA driver for Linux, version 169.09. We chose Xen 3.0.3 as our virtualization platform; all paravirtualized virtual machines were set up with 512 MB RAM, a 5 GB disk, and a bridged network configuration.

4.1. Performance

Performance evaluation refers to the execution time of the benchmarks in the virtual machine compared to the native version. Our first test measures the basic performance of vCUDA; all benchmarks were evaluated in two different configurations:

• Native: every application has direct and exclusive access to the hardware and native CUDA drivers. vCUDA was not used. This represents the upper bound on achievable performance for our experimental setup.

• vCUDA: a virtualized guest using vCUDA to provide hardware acceleration. All CUDA calls were intercepted and redirected to the hostOS.

Figure 3. vCUDA performance.

Figure 3 shows the results of running the benchmarks under the two configurations described above. The first two bars show the execution time with and without vCUDA; the third bar represents the overhead caused by the XML-RPC encode/decode procedures.

The experimental results show that the time consumption with vCUDA is one to five times greater than the native version. Further observation reveals that the main cause of the inefficiency is the encode/decode time of the XML-RPC implementation. The benchmark AlignedTypes involves about 600 MBytes of data transfer, whose encode/decode time took more than 60% of the total execution time. This indicates that XML-RPC is not an efficient enough protocol for high-capacity data transmission. We are investigating a customized inter-domain communication mechanism in the virtual machine to improve the efficiency. Other benchmarks that involve large data transfers exhibit similar characteristics. On the other hand, the less data volume is transferred, the closer the performance is to the native version, as with benchmarks BO and MC.
Figure 4. vCUDA performance - normalized view.

Figure 4 normalizes the native and encode/decode results against the results obtained in the VM (i.e. native and encode/decode time divided by vCUDA time). The purpose of normalization is to compare the results more intuitively. For example, although the execution time in vCUDA is not exactly the sum of the native and encode/decode times, we can infer a coarse performance penalty of vCUDA. In the case of benchmark AT, vCUDA itself incurred 8.92% (1 - 26.81% - 64.27%) overhead apart from the encode/decode time.

4.2. Lazy RPC and Concurrency

Figure 5. Effect of lazy RPC mode.

Figure 6. Evaluation of scalability by running two CUDA applications concurrently.

Figure 5 compares the RPC frequencies with the lazy mode turned on and off. As shown in Figure 5, the lazy mode significantly reduces the frequency of RPCs between the two domains, by 40% to 70%.

To examine vCUDA's ability to support concurrent applications, we compared the performance of two applications executing concurrently in an unvirtualized configuration to the performance of the same two applications executing concurrently in vCUDA. We launched each benchmark concurrently with a reference application to test the performance of concurrency in vCUDA. The benchmark BO was chosen as the reference application because it consumes little device memory and can run concurrently with most of the other benchmarks. The only exception was MC, which consumed too much device RAM to execute concurrently with BO.

Figure 6 presents the results of the concurrent execution of two applications, compared to the results for a single instance (taken from Figure 3), in the two configurations. The results in the unvirtualized configuration all show good scalability, with overheads for all applications below 16% (3.8% for AT, 6.3% for BO, 7.5% for BS, 13.3% for CS, 9.5% for FWT, 15.6% for MT and 1.4% for SLA). On the contrary, the counterparts in vCUDA show obvious performance degradation, with overheads ranging from 4% to 170% (4.9% for AT, 94.8% for BO, 49% for BS, 67.1% for CS, 42.4% for FWT, 99.9% for MT and 167.7% for SLA). We attribute this to the currently unoptimized implementation of our system, such as the management of concurrent accesses to the GPU device by different stub threads and the inefficient inter-domain data transfer, which also incurs significant world switch overhead. Despite the performance issue, the concurrency evaluation validates the GPU multiplexing functionality described in Section 3.3.

4.3. Suspend and Resume

To measure the performance of vCUDA's suspend and resume, we suspended the benchmarks at the end of each application's API call flow. We then resumed the guest and verified successful resumption of the CUDA application. We measured the size of the CUDA state necessary to synchronize the vCUDA stub to the current application state, and the time it took to perform the entire resume operation. The results of these experiments are shown in Figure 8 and Figure 9.

Note that since unregisterFatBinary is always the last API called in CUDA applications, we put the suspension point before the first occurrence of unregisterFatBinary. This is the worst case, because the maximum amount of state needs to be synchronized; the experimental results therefore represent an upper bound.
Figure 7. Size of suspended CUDA state.

Figure 8. Resume time.

Figure 9. Comparison of time consumption among resume operation, native and virtualized executions.

Figure 7 illustrates the data volume that needs to be restored for resumption. The resume time (Figure 8) is strongly dependent on the size of the suspended CUDA state (Figure 7), which can be as large as 65MB for the FWT benchmark. The benchmarks AT, BS, CS and FWT took more time to perform the resume operation than the others due to their larger data volumes. Furthermore, Figure 9 compares the time consumption in three different circumstances (native, vCUDA and resume). The resume operation took much less time than the execution in vCUDA for all benchmarks, but took a little more time than the native execution for benchmarks AT, BS, CS and FWT, which all involve more data transfer. However, the resume time is almost negligible for benchmarks BO (0.65s) and MT (0.36s), where the state size is much smaller (13KB for BO and 7KB for MT).

4.4. Compatibility

A well designed API interface virtualization scheme should be not only transparent but also compatible with a wide range of applications beyond the examples in the official SDK. In order to verify the compatibility of vCUDA, we chose five applications from the CUDA Zone [7] that run correctly in our testbed: the MP3 LAME encoder from the CUDA contest [25], Molecular Dynamics Simulation with GPU [22], a matrix-vector multiplication algorithm in CUDA [9], storeGPU [1], and an MRRR implementation in CUDA [20].

All five third-party applications passed the test and returned the same results as in native executions. The details of these tests are given in Table 2, which shows that when running in the vCUDA framework, these applications exhibit performance characteristics similar to those discussed in Section 4.1. For example, the performance degradation of the application MV is mainly due to its higher data transfer volume compared with the other applications.

5. Discussion

In this section we discuss several issues with our current prototype and how they can be addressed in future.

vCUDA has not yet achieved virtualization of all APIs of CUDA version 1.1. As visual computing is becoming very popular and widespread, and the virtualization of 3D graphics interfaces such as OpenGL could be beneficial in practice, we are planning to integrate existing 3D virtualization technology such as VMGL [19] into our framework.

Another aspect that needs improvement is the efficiency of network transmission.
Table 2. Statistics of the Third Party Applications.

Application   Number of APIs   GPU RAM   Data Volume   Native Time(s)   vCUDA Time(s)
GPUmg         89               294KB     2448KB        0.242            0.387
storeGPU      32               983KB     860KB         0.301            0.413
MRRR          370              1893KB    2053KB        0.591            0.686
MV            31               15761KB   61472KB       0.776            3.814
MP3encode     94               1224KB    391KB         0.252            0.515

So far our work has focused on portability across VMMs and OSes, and we have therefore avoided all performance optimizations that might compromise portability. The underlying data channels have not been fully utilized, leading to relatively low efficiency. One of our goals for the future is to develop a specific communication strategy between different domains in Xen by adopting technologies such as XWAY [17].

CUDA is currently Nvidia's private GPGPU interface standard, which means vCUDA only supports Nvidia's graphics cards. Recently the industry has announced other competitive frameworks, such as OpenCL [26], for the same purpose, and we expect that the methodologies discussed in our framework can also be applied to other interfaces.

6. Related Work

The research community has adopted various methods to expand and reuse APIs; a typical method is to replace the graphics API library with an "intercept" library that looks exactly like the original.

Depending on their specific features and practical requirements, many existing systems intercept calls to the graphics library for various purposes. VirtualGL [30] virtualizes GLX to grant remote rendering ability. WireGL [14] and its successor Chromium [13] intercept OpenGL [27] to generate different outputs such as distributed displays. Chromium provides a mechanism for implementing plug-in modules that alter the stream of GL commands, allowing distributed parallel rendering. HijackGL [24] uses the Chromium library to explore new rendering styles. On VMM platforms like Xen, this methodology has been used to achieve 3D hardware acceleration in a virtual machine: VMGL [19] deploys a fake stub in the guestOS and redirects the OpenGL flow to the hostOS, and the Blink project [11] intercepts OpenGL to multiplex the 3D displays of several client OSes. Another main category is tools that help performance analysis and debugging. IBM's ZAPdb OpenGL debugger [15] uses this interception technique to aid in debugging OpenGL programs. Intel's Graphics Performance Toolkit [16] uses a similar method to instrument graphics application performance.

High-level middleware- and language-based virtual machines have also been studied and used for high performance computing, such as HPVM [6] and Java. In [12], the authors proposed a framework for HPC applications in VMs which addresses the performance and management overhead associated with VM-based computing. They explained how to achieve high communication performance for VMs by exploiting the VMM-bypass feature of modern high speed interconnects such as InfiniBand, and how to reduce the overhead of distributing and managing VMs in large scale clusters with scalable VM image management schemes.

7. Conclusions

In this paper we proposed vCUDA, a GPGPU high performance computing solution for virtual machines. vCUDA allows applications executing within virtual machines (VMs) to leverage hardware acceleration, which can be beneficial to the performance of a class of high performance computing (HPC) applications. We explained how to access graphics hardware in VMs transparently through API call interception and redirection. Our evaluation showed that GPU acceleration for HPC applications in VMs is feasible and competitive with applications running in a native, non-virtualized environment. In the future, we will add 3D graphics virtualization to our framework and port it to newer versions of CUDA. We also plan to investigate high performance inter-domain communication schemes to improve the efficiency of data transfer in our system.

8. Acknowledgments

The authors would like to thank the anonymous reviewers for their useful suggestions and comments on this paper. This research was supported in part by the National Basic Research Program of China under grant 2007CB310900, and the National Science Foundation of China under grants 60803130 and 60703096.

References

[1] S. Al-Kiswany, A. Gharaibeh, E. Santos-Neto, G. Yuan, and M. Ripeanu. StoreGPU: Exploiting Graphics Processing Units to Accelerate Distributed Storage Systems. In Proc. ACM/IEEE International Symposium on High Performance Distributed Computing (HPDC), Boston, MA, Jun. 2008.
[2] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the Art of Virtualization. In Proc. 19th ACM Symposium on Operating Systems Principles (SOSP), pages 164-177, Bolton Landing, NY, Oct. 2003.

[3] M. Ben-Yehuda, J. Mason, O. Krieger, and J. Xenidis. Xen/IOMMU, Breaking IO in New and Interesting Ways. http://www.xensource.com/files/xs0106 xeniommu.pdf.

[4] I. Buck, G. Humphreys, and P. Hanrahan. Tracking Graphics State for Networked Rendering. In Proc. ACM SIGGRAPH/EUROGRAPHICS Workshop on Graphics Hardware, pages 87-95, Interlaken, Switzerland, Aug. 2000.

[5] I. Buck, T. Foley, D. Horn, J. Sugerman, F. Kayvon, M. Houston, and P. Hanrahan. Brook for GPUs: Stream Computing on Graphics Hardware. In Proc. ACM SIGGRAPH 2004, pages 777-786, New York, NY, 2004.

[6] A. Chien et al. Design and Evaluation of an HPVM-Based Windows NT Supercomputer. The International Journal of High Performance Computing Applications, 13(3):201-219, Fall 1999.

[7] CUDA: Compute Unified Device Architecture. http://www.nvidia.com/object/cuda home.html. (accessed September 2008).

[8] G. W. Dunlap, S. T. King, S. Cinar, M. A. Basrai, and P. M. Chen. ReVirt: Enabling Intrusion Analysis Through Virtual Machine Logging and Replay. In Proc. 5th Symposium on Operating Systems Design and Implementation (OSDI), Boston, MA, Dec. 2002.

[9] N. Fujimoto. Faster Matrix-Vector Multiplication on GeForce 8800GTX. In Proc. 22nd IEEE International Parallel and Distributed Processing Symposium (IPDPS), LSPP-402, Apr. 2008.

[10] GPGPU: General Purpose Programming on GPUs. http://www.gpgpu.org/w/index.php/FAQ#What programming APIs exist for GPGPU.3F.

[11] J. G. Hansen. Blink: 3D Display Multiplexing for Virtualized Applications. Technical Report, DIKU - University of Copenhagen, Jan. 2006. http://www.diku.dk/ jacobg/pubs/blink-techreport.pdf.

[12] W. Huang, J. Liu, B. Abali, and D. K. Panda. A Case for High Performance Computing with Virtual Machines. In Proc. 20th Annual International Conference on Supercomputing, June 28-July 01, 2006, Cairns, Queensland, Australia.

[13] G. Humphreys, M. Houston, R. Ng, R. Frank, S. Ahern, P. D. Kirchner, and J. T. Klosowski. Chromium: A Stream-processing Framework for Interactive Rendering on Clusters. In Proc. 29th Annual Conference on Computer Graphics and Interactive Techniques, pages 693-702, New York, NY, USA, 2002.

[14] G. Humphreys, M. Eldridge, I. Buck, G. Stoll, M. Everett, and P. Hanrahan. WireGL: A Scalable Graphics System for Clusters. In Proc. SIGGRAPH, pages 129-140, August 2001.

[15] IBM's ZAPdb OpenGL debugger, 1998. Computer Software.

[16] Intel Graphics Performance Toolkit. Computer Software.

[17] K. Kim, C. Kim, S. I. Jung, H. S. Shin, and J. S. Kim. Inter-domain Socket Communications Supporting High Performance and Full Binary Compatibility on Xen. In Proc. VEE 2008. ACM Press, Mar. 2008.

[18] M. Kozuch and M. Satyanarayanan. Internet Suspend/Resume. In Proc. Fourth IEEE Workshop on Mobile Computing Systems and Applications, Callicoon, New York, June 2002.

[19] H. A. Lagar-Cavilla, N. Tolia, M. Satyanarayanan, and E. de Lara. VMM-independent Graphics Acceleration. In Proc. VEE 2007. ACM Press, June 2007.

[20] C. Lessig. An Implementation of the MRRR Algorithm on a Data-Parallel Coprocessor. Technical Report, University of Toronto, 2008.

[21] J. LeVasseur, V. Uhlig, J. Stoess, and S. Gotz. Unmodified Device Driver Reuse and Improved System Dependability via Virtual Machines. In Proc. 6th Symposium on Operating Systems Design and Implementation, San Francisco, CA, Dec. 2004.

[22] MDGPU. http://www.amolf.nl/~vanmeel/mdgpu/about.html.

[23] A. Menon, J. R. Santos, Y. Turner, et al. Diagnosing Performance Overheads in the Xen Virtual Machine Environment. In Proc. 1st ACM/USENIX International Conference on Virtual Execution Environments, pages 13-23, Chicago, IL, June 2005.

[24] A. Mohr and M. Gleicher. HijackGL: Reconstructing from Streams for Stylized Rendering. In Proc. the Second International Symposium on Non-photorealistic Animation and Rendering, 2002.
[25] MP3 LAME Encoder (Nvidia’s CUDA Contest).
http://cudacontest.nvidia.com/index.cfm?action=contestdownload&contestid=2.
[26] OpenCL: Parallel Computing on the GPU and CPU. In
Beyond Programmable Shading Course of SIGGRAPH
2008, August 14, 2008.

[27] OpenGL - The Industry Standard for High Perfor-


mance Graphics. http://www.opengl.org.
[28] J. D. Owens, D. Luebke, N. Govindaraju, M. Harris,
J. Kruger, A. E. Lefohn, and T. J. Purcell. A Survey of
General-Purpose Computation on Graphics Hardware.
Journal of Computer Graphics Forum,26:21-51,2007.
[29] D. Tarditi, S. Puri, and J. Oglesby. Accelerator: Using
Data Parallelism to Program GPUs for General-Purpose
Uses. In Proc. 12th International Conference on Ar-
chitectural Support for Programming Languages and
Operating Systems (ASPLOS), New York, NY, USA,
2006.
[30] VirtualGL. http://virtualgl.sourceforge.net/.
[31] VMware Workstation.
http://www.vmware.com/products/ws/.
[32] XML-RPC. http://www.xmlrpc.com/.
