VCUDA
A collection of threads (called a block) runs on a multiprocessor at a given time. Multiple blocks can be assigned to a single multiprocessor, and their execution is time-shared. A single execution on a device generates a number of blocks; the collection of all blocks in a single execution is called a grid. All threads of all blocks executing on a single multiprocessor divide its resources equally amongst themselves. Each thread and block is given a unique ID that can be accessed within the thread during its execution. Each thread executes a single instruction set called the kernel, which is the core code to be executed on each thread. Using the thread and block IDs, each thread can perform the kernel task on a different set of data. Since the device memory is available to all the threads, any thread can access any memory location.

3. vCUDA

The vCUDA framework is organized around the following architectural features:

• Virtualization of the CUDA API. 31 of the 56 runtime APIs were encapsulated into RPC calls. Their parameters were properly queued and redirected to the hostOS, and variables were cached and kept persistent on both the server and client sides. Through this kind of virtualization, the graphics hardware interface was decoupled from the software layer.

• Lazy RPC transmission. vCUDA uses XML-RPC [32] as the means of high-level communication between the guestOS and the hostOS, chosen for its compatibility and portability. XML-RPC is well supported by a number of third-party libraries that provide language-agnostic ways of invoking RPC calls. In addition, we adopted a lazy RPC mode to improve the efficiency of the original XML-RPC.

• Ordering semantics. Ordering semantics are the set of rules that constrain the order in which API calls may be executed. CUDA is essentially a strictly ordered interface: certain APIs must be launched in the order in which they are specified, a behavior that is essential for maintaining internal consistency. In some cases, however, vCUDA uses less constrained ordering semantics when doing so improves performance.

• Device state. CUDA maintains a large amount of state in hardware, covering attributes such as device pointers, symbols, pitches, textures, and so on. On a workstation with hardware acceleration, the graphics hardware keeps track of most or all of the current state. To properly implement remote execution of these APIs in a virtual machine, however, the client must keep track of some of that state in software.

3.1. System Architecture

vCUDA uses a robust client-server model, which consists of three user-space modules: the vCUDA library and the virtual GPU (vGPU) in the client, and the vCUDA stub in the server. Figure 2 shows the vCUDA architecture. In the rest of this paper, the term server memory refers to the hostOS memory space, the term client memory refers to the VM memory space,
tion 3.2). At the current stage, we do not support the virtualization of 3D graphics APIs, which would require a large amount of engineering effort; the discussion of these APIs is thus beyond the scope of this paper (Section 5).
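As a rough illustration of the API virtualization and software state tracking described in this section, the client side can be pictured as a wrapper library that mimics the CUDA runtime entry points, mirrors state-affecting calls in software, and redirects every call to the remote stub. The sketch below is a minimal model under assumed names (`VCudaClient`, `intercept`, the `STATEFUL` set are all illustrative); it is not the actual vCUDA implementation.

```python
class VCudaClient:
    """Illustrative client-side interceptor: forwards every API call
    to a remote stub and keeps a software mirror of the calls that
    change device state (cf. the 'Device state' feature above)."""

    # Illustrative subset of state-affecting runtime APIs.
    STATEFUL = {"cudaMalloc", "cudaMemcpy", "cudaBindTexture"}

    def __init__(self, rpc):
        self.rpc = rpc        # transport: (api_name, args) -> result
        self.state_log = []   # software mirror of device state

    def intercept(self, api, *args):
        # Record calls that alter device state, so the session state
        # can later be reconstructed on the server side.
        if api in self.STATEFUL:
            self.state_log.append((api, args))
        # Redirect the call to the vCUDA stub in the hostOS.
        return self.rpc(api, args)

    def snapshot(self):
        # The tracked "device state" is the ordered log of
        # state-affecting calls.
        return list(self.state_log)
```

In this toy model, `snapshot()` returns exactly the information a client would need to rebuild device state elsewhere, which is the role the software state mirror plays in the design described above.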
3.1.2. vGPU
will be invoked, and the overhead of a world switch (an execution context switch between different guest OSes) will inevitably be introduced into the system. In virtual machines, a world switch is an extremely expensive operation and should be avoided whenever possible [23].

We classify the CUDA APIs into two categories. One is the instant APIs, whose execution has an immediate effect on the state of the CUDA runtime system. The other category is the lazy APIs, which have no side effects on the runtime state until the invocation of a following instant API. This classification allows vCUDA to reduce the frequency of world switches by updating states in the stub lazily, and to eliminate unnecessary RPCs by redirecting lazy APIs to the stub side in a batched manner, thus boosting the system's performance. A drawback of the lazy mode worth mentioning is that it delays error reporting and time counting, so it is not suitable for debugging or measurement purposes.

3.4. Suspend and Resume

vCUDA provides support for suspend and resume, enabling client sessions to be interrupted or moved between computers [18]. Upon resume, vCUDA presents the same device state that the application observed before suspending, while retaining hardware acceleration capabilities.

The basic approach to implementing application suspend and resume is to store the CUDA API calls that affect device state, together with the corresponding state bound to these calls, when necessary. While the guest is running, the vCUDA stub and the vGPU both snoop on the CUDA commands they forward in order to keep track of the device state. Upon resume, vCUDA spawns a new thread in the stub, which is initialized by synchronizing it with the application's vCUDA state stored by vCUDA. The time spent on resume depends on the RPC efficiency, the GPU computing time of the specific application, and the world-switch overhead.
Table 1. Statistics of Benchmark Applications.

Application                 Number of APIs   GPU RAM    Data Volume
AlignedTypes (AT)           1990             94.00MB    611.00MB
BinomialOptions (BO)        31               0.01MB     0.01MB
BlackScholes (BS)           5143             61.03MB    76.29MB
ConvolutionSeparable (CS)   48               108.00MB   72.00MB
FastWalshTransform (FWT)    144              128.00MB   128.00MB
MersenneTwister (MT)        24               91.56MB    91.56MB
MonteCarlo (MC)             53               187.13MB   0.00MB
ScanLargeArray (SLA)        6890             7.64MB     11.44MB
4.1. Performance
Figure 5. Effect of lazy RPC mode.

Figure 6. Evaluation of scalability by running two CUDA applications concurrently.
Figure 7. Size of suspended CUDA state.

Figure 9. Comparison of time consumption among resume operation, native and virtualized executions.
4.4. Compatibility
Table 2. Statistics of the Third Party Applications.

Application   Number of APIs   GPU RAM    Data Volume   Native Time (s)   vCUDA Time (s)
GPUmg         89               294KB      2448KB        0.242             0.387
storeGPU      32               983KB      860KB         0.301             0.413
MRRR          370              1893KB     2053KB        0.591             0.686
MV            31               15761KB    61472KB       0.776             3.814
MP3encode     94               1224KB     391KB         0.252             0.515
bility across VMMs and OSs, and therefore avoided all performance optimizations that might compromise portability. The underlying data channels have not been fully utilized, leading to relatively low efficiency. One of our goals for the future is to develop a dedicated communication strategy between different domains in Xen by adopting technologies such as XWAY [17].

CUDA is currently Nvidia's proprietary GPGPU interface standard, which means vCUDA only supports Nvidia graphics cards. Recently the industry has announced competing frameworks, such as OpenCL [26], for the same purpose, and we expect that the methodologies discussed in this paper can also be applied to other interfaces.

6. Related Work

The research community has adopted various methods to expand and reuse APIs; a typical method is to replace the graphics API library with an "intercept" library that looks exactly like the original.

Depending on their specific features and practical requirements, many existing systems intercept calls to the graphics library for various purposes. VirtualGL [30] virtualizes GLX to grant remote rendering ability. WireGL [14] and its successor Chromium [13] intercept OpenGL [27] to generate different outputs, such as distributed displays. Chromium provides a mechanism for implementing plug-in modules that alter the stream of GL commands, allowing distributed parallel rendering. HijackGL [24] uses the Chromium library to explore new rendering styles. On VMM platforms such as Xen, this methodology has been used to achieve 3D hardware acceleration in a virtual machine: [19] deploys a fake stub in the guestOS and redirects the OpenGL flow to the hostOS, and the Blink project [11] intercepts OpenGL to multiplex 3D displays among several client OSes. Another main category is tools for performance analysis and debugging. IBM's ZAPdb OpenGL debugger [15] uses this interception technique to aid in debugging OpenGL programs. Intel's Graphics Performance Toolkit [16] uses a similar method to instrument graphics application performance.

High-level middleware- and language-based virtual machines, such as HPVM [6] and Java, have been studied and used for high performance computing. In [12], the authors proposed a framework for HPC applications in VMs which addresses the performance and management overhead associated with VM-based computing. They explained how to achieve high communication performance for VMs by exploiting the VMM-bypass feature of modern high-speed interconnects such as InfiniBand, and how to reduce the overhead of distributing and managing VMs in large-scale clusters with scalable VM image management schemes.

7. Conclusions

In this paper we proposed vCUDA, a GPGPU high performance computing framework for virtual machines. vCUDA allows applications executing within virtual machines (VMs) to leverage hardware acceleration, which can benefit the performance of a class of high performance computing (HPC) applications. We explained how to access graphics hardware in VMs transparently through API call interception and redirection. Our evaluation showed that GPU acceleration for HPC applications in VMs is feasible and competitive with running in a native, non-virtualized environment. In the future, we will add 3D graphics virtualization to our framework and port it to newer versions of CUDA. We also plan to investigate high-performance inter-domain communication schemes to improve the efficiency of data transfer in our system.

8. Acknowledgments

The authors would like to thank the anonymous reviewers for their useful suggestions and comments on this paper. This research was supported in part by the National Basic Research Program of China under grant 2007CB310900, and the National Science Foundation of China under grants 60803130 and 60703096.

References

[1] S. Al-Kiswany, A. Gharaibeh, E. Santos-Neto, G. Yuan, and M. Ripeanu. StoreGPU: Exploiting Graphics Processing Units to Accelerate Distributed Storage Systems. In Proc. ACM/IEEE International Symposium on High Performance Distributed Computing (HPDC), Boston, MA, Jun. 2008.

[2] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the Art of Virtualization. In Proc. 19th ACM Symposium on Operating Systems Principles (SOSP), pages 164-177, Bolton Landing, NY, Oct. 2003.

[3] M. Ben-Yehuda, J. Mason, O. Krieger, and J. Xenidis. Xen/IOMMU, Breaking IO in New and Interesting Ways. http://www.xensource.com/files/xs0106 xeniommu.pdf.

[4] I. Buck, G. Humphreys, and P. Hanrahan. Tracking Graphics State for Networked Rendering. In Proc. ACM SIGGRAPH/EUROGRAPHICS Workshop on Graphics Hardware, pages 87-95, Interlaken, Switzerland, Aug. 2000.

[5] I. Buck, T. Foley, D. Horn, J. Sugerman, F. Kayvon, M. Houston, and P. Hanrahan. Brook for GPUs: Stream Computing on Graphics Hardware. In Proc. ACM SIGGRAPH 2004, pages 777-786, New York, NY, 2004.

[6] A. Chien et al. Design and Evaluation of an HPVM-Based Windows NT Supercomputer. The International Journal of High Performance Computing Applications, 13(3):201-219, Fall 1999.

[7] CUDA: Compute Unified Device Architecture. http://www.nvidia.com/object/cuda home.html (accessed September 2008).

[8] G. W. Dunlap, S. T. King, S. Cinar, M. A. Basrai, and P. M. Chen. ReVirt: Enabling Intrusion Analysis Through Virtual Machine Logging and Replay. In Proc. 5th Symposium on Operating Systems Design and Implementation (OSDI), Boston, MA, Dec. 2002.

[9] N. Fujimoto. Faster Matrix-Vector Multiplication on GeForce 8800GTX. In Proc. 22nd IEEE International Parallel and Distributed Processing Symposium (IPDPS), LSPP-402, Apr. 2008.

[10] GPGPU: General Purpose Programming on GPUs. http://www.gpgpu.org/w/index.php/FAQ#What programming APIs exist for GPGPU.3F.

[11] J. G. Hansen. Blink: 3D Display Multiplexing for Virtualized Applications. Technical Report, DIKU - University of Copenhagen, Jan. 2006. http://www.diku.dk/ jacobg/pubs/blink-techreport.pdf.

[12] W. Huang, J. Liu, B. Abali, and D. K. Panda. A Case for High Performance Computing with Virtual Machines. In Proc. 20th Annual International Conference on Supercomputing, June 28-July 01, 2006, Cairns, Queensland, Australia.

[13] G. Humphreys, M. Houston, R. Ng, R. Frank, S. Ahern, P. D. Kirchner, and J. T. Klosowski. Chromium: a Stream-processing Framework for Interactive Rendering on Clusters. In Proc. 29th Annual Conference on Computer Graphics and Interactive Techniques, pages 693-702, New York, NY, USA, 2002.

[14] G. Humphreys, M. Eldridge, I. Buck, G. Stoll, M. Everett, and P. Hanrahan. WireGL: A Scalable Graphics System for Clusters. In Proc. SIGGRAPH, pages 129-140, August 2001.

[15] IBM's ZAPdb OpenGL debugger, 1998. Computer software.

[16] Intel Graphics Performance Toolkit. Computer software.

[17] K. Kim, C. Kim, S. I. Jung, H. S. Shin, and J. S. Kim. Inter-domain Socket Communications Supporting High Performance and Full Binary Compatibility on Xen. In Proc. VEE 2008. ACM Press, Mar. 2008.

[18] M. Kozuch and M. Satyanarayanan. Internet Suspend/Resume. In Proc. Fourth IEEE Workshop on Mobile Computing Systems and Applications, Callicoon, New York, June 2002.

[19] H. A. Lagar-Cavilla, N. Tolia, M. Satyanarayanan, and E. de Lara. VMM-independent Graphics Acceleration. In Proc. VEE 2007. ACM Press, June 2007.

[20] C. Lessig. An Implementation of the MRRR Algorithm on a Data-Parallel Coprocessor. Technical Report, University of Toronto, 2008.

[21] J. LeVasseur, V. Uhlig, J. Stoess, and S. Gotz. Unmodified Device Driver Reuse and Improved System Dependability via Virtual Machines. In Proc. 6th Symposium on Operating Systems Design and Implementation (OSDI), San Francisco, CA, Dec. 2004.

[22] MDGPU. http://www.amolf.nl/~vanmeel/mdgpu/about.html.

[23] A. Menon, J. R. Santos, Y. Turner, et al. Diagnosing Performance Overheads in the Xen Virtual Machine Environment. In Proc. 1st ACM/USENIX International Conference on Virtual Execution Environments, pages 13-23, Chicago, IL, June 2005.

[24] A. Mohr and M. Gleicher. HijackGL: Reconstructing from Streams for Stylized Rendering. In Proc. Second International Symposium on Non-photorealistic Animation and Rendering, 2002.
[25] MP3 LAME Encoder (Nvidia's CUDA Contest). http://cudacontest.nvidia.com/index.cfm?action=contestdownload&contestid=2.

[26] OpenCL: Parallel Computing on the GPU and CPU. In Beyond Programmable Shading Course, SIGGRAPH 2008, August 14, 2008.