Understanding The Security of Discrete GPUs

Abstract
GPUs have become an integral part of modern systems, but their implications for system security are not yet clear. This paper demonstrates both that discrete GPUs cannot be used as secure co-processors and that GPUs provide a stealthy platform for malware. First, we examine a recent proposal to use discrete GPUs as secure co-processors and show that the security guarantees of the proposed system do not hold on the GPUs we investigate. Second, we demonstrate that (under certain circumstances) it is possible to bypass IOMMU protections and create stealthy, long-lived GPU-based malware. We demonstrate a novel attack that compromises the in-kernel GPU driver and one that compromises GPU microcode to gain full access to CPU physical memory. In general, we find that the highly sophisticated, but poorly documented GPU hardware architecture, hidden behind obscure closed-source device drivers and vendor-specific APIs, not only makes GPUs a poor choice for applications requiring strong security, but also makes GPUs into a security threat.

CCS Concepts •Security and privacy → Operating systems security;

ACM Reference format:
Zhiting Zhu, Sangman Kim, Yuri Rozhanski, Yige Hu, Emmett Witchel, and Mark Silberstein. 2016. Understanding The Security of Discrete GPUs. In Proceedings of GPGPU-10, Austin, TX, USA, February 04-05 2017, 11 pages.
DOI: http://dx.doi.org/10.1145/3038228.3038233

GPGPU-10, Austin, TX, USA
© 2017 Copyright held by the owner/author(s). Publication rights licensed to ACM. 978-1-4503-4915-4/17/02...$15.00.
DOI: http://dx.doi.org/10.1145/3038228.3038233

1 Introduction
GPUs have enjoyed increasing popularity over the past decade, both as hardware accelerators for graphics applications and as highly parallel general-purpose processors. With general-purpose computing on GPUs (GPGPU) diffusing into the mainstream, researchers are looking at their security implications. In this paper we analyze two related questions: can GPUs be used to enhance the security of a computing platform, and can GPUs be used to subvert the security of a computing platform?

Understanding the security of GPUs requires understanding the interplay among the GPU hardware, its software stack, and the busses and chipsets that coordinate a platform's transfer of data. The interplay of these features is complicated by GPU hardware that contains quirky features absent from CPUs, such as auxiliary embedded microprocessors, and by the GPU's deep software stack, whose boundaries and interactions with GPU hardware are deliberately blurred by the vendor. Unfortunately, the complexity of this interplay can hide vulnerabilities that attackers can use to subvert a GPU's expected behavior and break critical security properties, as we show in this paper.

Discrete GPUs have independent memory systems and computational resources that are physically partitioned from the main CPU, which makes it plausible that a GPU could function as a secure processor; it might be possible to protect computation on a discrete GPU from code executing on the CPU. While plausible in theory, we systematically analyze the shortcomings of one specific proposal to use GPU hardware registers as secure storage (called PixelVault [44]). We show that PixelVault's security depends on assumptions about GPU hardware features that do not hold, and in practice fully depends on the vulnerable GPU software interface that GPU vendors expose.

The flaws we find in PixelVault's GPU security model stem from the lack of a clear software/hardware boundary and shifting responsibilities of hardware and software across GPU generations. For example, a non-bypassable hardware feature in one version of a GPU can migrate to a bypassable software feature in another version. The problem with such a fluctuating software/hardware boundary is that it becomes hard, if not impossible, to reason about the actual security guarantees of a GPU system.

We also systematically analyze risks that originate with NVIDIA GPUs, where the GPU serves as a host for stealthy, long-lived malicious code. It is difficult to detect the execution of GPU-hosted malware and in certain cases, it is even difficult to detect its presence. We demonstrate attack code running on the NVIDIA GPU that reads secrets from CPU memory and corrupts the memory state of CPU computations by leveraging GPU Direct Memory Access (DMA) capabilities.

We demonstrate two novel attacks: one against the proprietary in-kernel closed-source GPU driver, the other against the GPU microcode running on an auxiliary microprocessor resident on the GPU card. For the driver attack, we binary-patch the proprietary NVIDIA GPU driver while it is loaded and being used by the OS kernel, and force it to map sensitive CPU memory into the address space of an unprivileged GPU program. Our second attack leverages auxiliary microprocessors [27, 32] which GPUs use for various functions like power management and video display management. These microprocessors are not exposed as part of the standard GPU programming model (e.g., CUDA or OpenCL). We implement the attack microcode running on such an auxiliary microprocessor that combines the functionality of the original microcode with malicious
performs no runtime checks. Rather, the GPU driver validates access rights at the time of mapping in software. Additional hardware protection can be provided by the IOMMU, which we describe next.

2.4 IOMMU
When a device performs direct memory access (DMA) to read or write CPU physical memory, it uses device addresses. When the IOMMU is enabled, it maps device addresses to CPU physical addresses (just as the CPU's MMU maps virtual to physical addresses). The IOTLB caches entries from the IO page table, just as the CPU's TLB caches entries from the process' page table. IO page table entries contain protection information, and the IOMMU checks each access to system memory from a peripheral device to make sure it has sufficient permissions.

The IOTLB is not kept coherent with the IO page table by hardware, similar to the TLBs in most common CPUs. Software must explicitly manage the IOTLB, flushing the cached mappings when they are removed from the IO page table. We exploit this software-managed IOTLB coherence mechanism to circumvent IOMMU protection and enable unauthorized access to system memory from the GPU, as we discuss in Section 4.

2.5 Microprocessors and MMIO registers in GPUs
GPUs expose a set of memory-mapped input/output (MMIO) registers used by the driver for GPU management [2, 27]. In addition, they contain several special-purpose microprocessors used to manage internal hardware resources. A GPU driver updates GPU microprocessor code every time a GPU is initialized. The documentation about the actual purpose of the microprocessors and MMIO registers used in NVIDIA GPUs is fairly scarce; it usually comes from unofficial sources, such as open-source driver developers who partially reverse-engineered the official driver.

We found that the GPU MMIO registers can invalidate the GPU instruction caches. Flushing the instruction caches is key to dynamically updating the code of a running kernel, which breaks the security guarantees of PixelVault (§3.2). Our microcode attack (§6) leverages an important capability of NVIDIA microprocessors that allows unrestricted access to GPU and CPU memory [17].

3 Attacking PixelVault
In this section we analyze the GPU model and security guarantees PixelVault uses to claim a GPU as a secure co-processor. We then present several attacks that clearly violate PixelVault's assumptions, and therefore its security properties. We conclude that systems developed using PixelVault's approach are insecure.

Experimental platform. The attacks described in Section 3.4 and Section 3.3 are performed on an NVIDIA Tesla C2050/C2075 GPU (Fermi) and an NVIDIA GK110GL Tesla K20c (Kepler), using NVIDIA driver versions 319.37 and 331.38 respectively, and CUDA version 5.5. The attack in Section 3.2 is performed on an NVIDIA Tesla C2050/C2075 (Fermi) with the open source nouveau [29] and gdev [21, 22] drivers.

3.1 PixelVault summary and guarantees
PixelVault proposes a GPU-based design of a security co-processor for RSA and AES encryption which is resilient to even a strong adversary with full control of CPU and/or GPU software. PixelVault stores the secret keys encrypted in GPU memory, and the master key in GPU registers. It implements a software infrastructure that strives to prevent any adversarial access to these registers from the CPU or GPU.

PixelVault threat model. PixelVault assumes that the system boots from a trusted configuration, and it can set up its execution environment on the GPU. Once PixelVault is established, the attacker may have full control over the platform. Specifically, the attacker can execute code at any privilege level and has access to all platform hardware.

To achieve this goal, PixelVault leverages several characteristics of the NVIDIA GPU architecture and execution model. While some of these characteristics are well known and have been officially confirmed, some were only assumed to be correct and others were partially validated experimentally.

Below we list only those assumptions that we later experimentally refute. Even if only one of these assumptions is not satisfied, PixelVault is no longer able to guarantee the secrecy of the master encryption key under its threat model.

1. It is impossible to replace the code of a running GPU kernel if the code is fully resident in the instruction cache. This feature is critical to ensuring that an adversary cannot replace the PixelVault GPU code without stopping the kernel, and therefore without losing the master key stored in GPU registers. We show that NVIDIA GPUs have unpublished MMIO registers that flush the instruction cache, allowing replacement of code from running kernels that are as small as 32B (4 instructions).

2. The contents of GPU registers cannot be retrieved after kernel termination. This feature is essential to PixelVault's ability to prevent a strong adversary from retrieving a master key by stopping a running PixelVault kernel. We show that under certain conditions it is possible to retrieve the contents of registers after kernel termination, and the PixelVault design satisfies these conditions.

3. A running GPU kernel cannot be stopped and debugged if it is not compiled with explicit debug support. This feature is necessary to ensure that an adversary cannot retrieve register contents by attaching a GPU debugger to the running PixelVault kernel. We show that newer versions of the NVIDIA CUDA runtime provide support for attaching a debugger to any running kernel, and it is unclear how to disable this capability. This attack requires root privileges to attach to any running process, yet this is permitted under the PixelVault threat model.

In the remainder of this section, we explain in more detail how we have invalidated the listed assumptions, thereby debunking PixelVault's security.

3.2 Replacing PixelVault as it runs
To run a kernel, a GPU needs the binary code to be resident in GPU memory. The binary is transferred from CPU memory to GPU memory by the driver prior to kernel invocation. GPUs do not support code modification while the kernel is executing. Therefore, PixelVault assumes that GPU hardware makes it impossible for an attacker to alter the execution of a running PixelVault kernel by replacing the original PixelVault binary in GPU memory, as long as the binary is entirely resident in the instruction cache. PixelVault explicitly validates that its kernel is small enough to fit in the instruction cache. Assuming the kernel is simple, PixelVault also explores all possible execution paths to make the kernel fully resident in the cache soon after starting execution.
We find that the lack of software interfaces for dynamic code update does not imply the lack of hardware support. Using the open source Envytools [13] reverse engineering toolkit, we can invalidate the instruction caches and replace the instructions of the running kernel with the attacker's modified instructions from GPU memory.

Technical details. We perform an experiment to show that we can dynamically update GPU kernel code using a matrix addition kernel and an updater process. The updater process locates the GPU kernel code in GPU physical memory by searching for the kernel's instructions. In the case of PixelVault, the binary is not secret, so it can be detected by an attacker. Once invoked, the PixelVault kernel runs indefinitely, giving the updater enough time to identify and update its code. If the code is erased from memory, the attacker can speculate on its location, methodically working through the address space.

The updater replaces the addition instructions in our test kernel with subtraction instructions. When the effective size of the code in the GPU kernel loop is larger than 32 KB, overwriting the instructions in memory causes the behavior of the kernel to change. Such a large kernel (presumably larger than the size of the last-level instruction cache) experiences instruction cache misses at runtime, yielding a result that is not a simple matrix addition.

However, overwriting the kernel's program has no effect if the size of the kernel's main loop is smaller than 32 KB. In these cases, only if we flush the instruction cache via GPU MMIO registers do we see the expected change in the kernel output.

Our prototype requires 3.1 seconds to scan and identify kernel code in GPU memory. Therefore, we can only effectively flush the cache for long-running kernels. The PixelVault kernel is intended to run continuously as it provides a runtime encryption service; it is therefore vulnerable to this cache flush attack.

MMIO registers for instruction cache flush. To invalidate the L1 instruction cache with the updated code memory, it is necessary to flush all the cache levels. The addresses in parentheses are the offsets within the MMIO region, which is referred to by the first set of the PCI base address registers (BAR0) of NVIDIA GPUs. The register that flushes the per-GPC caches is PGRAPH.GPC_BROADCAST.CCACHE.CACHE_CTRL (0x419000), specifically the first bit of the 32-bit register. For the per-SM flush, we used the first and ninth bits of PGRAPH.GPC_BROADCAST.TPC_ALL.MP.CCACHE_CTRL (0x419ea4). To the best of our knowledge, the use of the ninth bit of CCACHE_CTRL for flushing the per-SM cache is not reported or documented.

3.3 Capturing PixelVault secrets after termination
PixelVault relies on GPU registers being initialized to zero when a new kernel is loaded onto a GPU and begins execution. Initial zero values for registers are necessary to prevent an adversary from terminating the PixelVault kernel and running a new kernel that looks for PixelVault secrets in its initial register values. Because initial zero values are not a feature officially documented by NVIDIA,¹ the PixelVault developers experimentally validate that the register contents cannot be retrieved by another kernel invoked after PixelVault terminates. Yet, there is no guarantee that GPU hardware clears registers after termination of a GPU kernel.

¹ http://docs.nvidia.com/cuda/parallel-thread-execution/index.html#state-spaces-types-and-variables

We find that the cuda-gdb debugger can retrieve register values even after kernel termination. However, it requires that other GPU tasks are concurrently active with the execution of the victim kernel. Specifically, NVIDIA CUDA enables multiple GPU operations such as CPU-GPU memory transfers or GPU invocations to be invoked concurrently by the same CPU process. Each operation is invoked in its own CUDA stream, and a GPU handles the operations in different streams concurrently. We found that if GPU kernel B is invoked in parallel with running kernel A, A's register state can be retrieved using the debugger API even after A terminates, as long as B is still running.

The PixelVault implementation employs two CUDA streams, one for kernel execution and another for data transfers between a CPU and a GPU. An attacker may take advantage of the data transfer stream to invoke a long-running kernel, terminate PixelVault, and retrieve its secrets.

Technical details. We modified the cuda-gdb source code to read the registers of a terminated kernel. The modifications were necessary because by default cuda-gdb will refuse to read the registers of a terminated kernel. If we launch two kernels on different CUDA streams, cuda-gdb can read the register values from the terminated kernel so long as the other kernel is running. As soon as both kernels terminate, we cannot access either of their registers.

3.4 Stopping PixelVault with a debugger
The PixelVault version discussed in their paper runs NVIDIA CUDA version 4.2. This version of CUDA provided no support for attaching and setting a breakpoint in a running GPU kernel, unless that kernel was explicitly compiled with debug information. This property of the runtime environment disguised itself as a hardware feature. Therefore, PixelVault relied on it to ensure that an adversary cannot attach a debugger to retrieve the values of the GPU registers which store the secret keys, without stopping the running PixelVault GPU kernel and consequently without erasing the contents of those registers.

However, with the GPU system software evolving so rapidly, many desirable features like attaching a debugger to a running kernel are added in every new release. In particular, this feature was added in the cuda-gdb GPU debugger starting from CUDA version 5.0 [18]. By using the CUDA debug API it is possible to stop a kernel, and inspect all GPU registers of the executing kernel from the CPU. This ability invalidates the privacy guarantees for PixelVault's master encryption key.

We found no simple way of preventing software from being able to attach to a running kernel. When attaching, cuda-gdb needs access to certain predefined memory locations stored as symbols in libcuda.so [30], which is the main library providing CUDA driver API support to GPU applications. In an older version of the CUDA driver (we tested version 319.37), the attach information resides in the symtab section and it can be safely stripped. However, more recent versions of the library (we tested version 331.38) no longer place the attach information in symtab; they place it in the dynsym section. The dynsym section cannot be stripped from the binary because it holds important data necessary for dynamic linking. If we remove the dynsym section from the binary, the dynamic linker can no longer load libcuda.so.

It is possible to zero out the entries in the dynsym section used by cuda-gdb, which causes cuda-gdb to crash the controlling CPU process when attaching to the running GPU kernel. PixelVault could make its own copy of libcuda.so (so that other users can continue to debug their kernels) and just zero out or corrupt the attach information in dynsym. Ultimately, however, the ability to stop the running kernel is a hardware feature that PixelVault cannot disable. Because the PixelVault threat model assumes the attacker controls the host, the attacker can still attach to a running kernel and examine its register state even if the default support for how cuda-gdb attaches is removed from a version of libcuda.so.

Technical details. We use cuda-gdb to attach to a running GPU kernel, and retrieve all GPU registers via the CUDA debugger call CUDBGResult (*CUDBGAPI_st::readRegister) even if the CUDA application is compiled without debug information (i.e., without the -G flag for the NVCC compiler).

A simple experiment verifies this attack. A CUDA application launches a kernel with one thread that spins in an infinite loop and continuously changes a value stored in a register. We attach cuda-gdb to the GPU-controlling CPU process and then attach to the GPU kernel that the process spawned. All register values from the running kernel can be extracted whether or not the GPU kernel was built with debug information.

Using this technique, we can also attack a more realistic kernel. We implement the AES encryption algorithm in a GPU program, emulating part of PixelVault's operation. The AES key is stored in GPU registers. While the kernel is running, we attach to it using cuda-gdb, read the GPU registers, and expose the secret key.

3.5 Discussion
Discrete GPUs appear to have potential as secure coprocessors because they have physically distinct and complete processing resources: processor, caches, RAM, and access to I/O. They also have micro-architectural (though seemingly robust) guarantees about non-preemption and an incoherent instruction cache. The PixelVault system is an intelligent attempt at trying to build a secure system from these components.

However, our investigation yields the clear conclusion that GPUs are not appropriate as secure coprocessors and cannot contribute to the trusted computing base (TCB) of the system. GPUs are complex devices that rely on sophisticated proprietary hardware and software which is poorly (often purposefully so) publicly documented – the opposite of a firm basis for security. GPU manufacturers are not interested in exposing their architecture internals, and they can easily change the architecture in ways that invalidate the security of systems based on a GPU, e.g., by adding preemption.

We have found a variety of documented and undocumented ways of violating the security of PixelVault. We learned of the existence of many MMIO registers from the Envytools GPU reverse engineering project [13]. However, some registers that allowed us to invalidate the GPU instruction cache are not documented as flushing the cache—yet another example of an obscure GPU architectural subtlety which undermines its use as a secure coprocessor.

4 Threat model and IOMMU
We explore how GPUs might host stealthy malware. Our malware attacks compromise the privacy and integrity of system memory by reading and writing it from the GPU. First we specify our threat model.

4.1 Threat model
An attacker may load and unload kernel modules. In Linux, this can be done by briefly gaining the CAP_SYS_MODULE capability (which is a user credential stored in the process control block) using kernel exploits (e.g., [4, 6, 7]), by bypassing capability checks (e.g., [5, 8, 9]), or by exploiting kernel module loader weaknesses [12]. After loading a module, the attacker also has access to the GPU control interface, i.e., MMIO register regions, and as we explain in our microcode attack (§6), it can use these registers to load malicious microcode onto an embedded auxiliary GPU processor. Loading the microcode is done by reloading the GPU driver module into the OS kernel. After the malware is installed, the attacker loses the module loading capability and is allowed only unprivileged access.

If an attacker can load a module, why should he or she bother with the attacks we describe? The primary reason is stealthiness: all of our attacks originate with the GPU reading and writing CPU memory (e.g., sensitive operating system data structures), making them hard to detect. Detecting root-level compromise is the subject of much published work (e.g., [20], [36], [35], [3]), open source tools (e.g., chkrootkit) and commercial tools (e.g., Malwarebytes AntiRootkit, McAfee Rootkit Remover). To our knowledge, there are no tools and precious few research studies to detect GPU-based malware. Our attacks require no changes to the page table of any unprivileged process, and they bypass the CPU's MMU and memory protection settings. The GPU page table that stores the mappings into system memory is not visible from the CPU (at least not via the public API). Therefore, once mapped into the GPU, a malicious unprivileged GPU kernel may keep accessing any CPU memory without raising suspicion.

Modern systems, however, usually contain an input/output memory management unit (IOMMU) which monitors devices' Direct Memory Accesses (DMAs) to system memory in order to protect it from unauthorized accesses. The IOMMU restricts the devices to access only the CPU memory pages specified in its I/O page table. Unlike the hidden GPU page tables, the I/O page table can be monitored by security tools (though we are not aware of any that do), undermining the stealthiness of the attack.

The malware, therefore, must circumvent IOMMU protection to evade detection, and the next section details the techniques to accomplish that.

4.2 IOMMU
We exploit the subtleties of IOTLB management in the Linux kernel; our prototype is based on the Intel IOMMU. We first provide a brief overview of the IOMMU management policies.

IO device drivers strive to make all memory mappings as short-lived as possible to increase security at the expense of higher management overheads [26]. The OS, therefore, offers several IOMMU configurations that influence the IOTLB management policy and enable different tradeoffs between management cost and security. The configurations and their respective management policies are summarized in Table 1.

IOMMU disabled. Though it is detrimental to security, many systems, especially those that include discrete GPUs, disable their IOMMU by default. IOMMU support for Intel chipsets must be configured through the BIOS as part of setup for device virtualization technology (VT-d), and it also must be enabled in the Linux kernel. Some server manufacturers ship their products with VT-d disabled in the BIOS by default [14].

Several major Linux distributions (e.g., Ubuntu 15.04, CentOS 7, RHEL 7, OpenSUSE 13.2) ship with Intel's IOMMU disabled in the kernel. The primary reasons for disabling the IOMMU are reduced I/O performance [49] and the IOMMU's incompatibility with certain devices and features. For example, the peer-to-peer DMA
GPGPU-10, February 04-05 2017, Austin, TX, USA Zhu et. al.
by a network monitor. This workload uses the NIC and the graph- CPU GPU
User Process 3
ics rendering capability of the GPU. The mapping can be reliably
User Space
read 1 minute after it was erased from the IO page table. Running Kernel 5
1 2
a GPU kernel every 1 minute is sufficient to keep the stale IOTLB Attack Patch
2 6
entry resident for one hour, after which we discontinued the exper- Module GPU driver
iment. Streaming higher bandwidth videos, like the “Auto (720p)” Kernel Space
setting, cause the attack to fail, even when we refresh the stale en-
try every minute. We use round numbers like 1 minute for a stale
PCIe Bus
Chipset GPU Chipset
period to validate that our attacks are practical. Future work might
4 DMA
determine more clever ways to keep an IOTLB entry resident, but
our experiments establish a large enough window of vulnerability CPU RAM GPU RAM
to be a security concern.
Stealthy transition from deferred to strict. Keeping a stale entry
in the IOTLB is possible only if the IOMMU is configured in strict Figure 2. The driver attack, where a patched GPU driver has its
mode, because it is the only mode that invalidates each IOTLB en- memory mapping access control bypassed to allow a GPU kernel to
try separately. However, if the IOMMU is enabled, Linux uses de- access all of the CPU’s memory.
ferred mode by default, which flushes the IOTLB as a whole. These
IOTLB flushes frustrate our attacks (and form a practical counter-
measure to both of our attacks). the keylogger case access is to the keyboard buffer in OS kernel
We find that the kernel transitions between deferred memory. The primary difference is that our attack requires no long-
and strict mode based on the state of a single variable running CPU process to proceed, while the keylogger does. That
(intel iommu strict). By setting this variable to 1, the is because for the keylogger, the malicious memory mapping to the
kernel will put all devices into strict mode. Because this is a small, keyboard buffer must be installed into an unprivileged process that
legal change to kernel state, it is quite stealthy. We experimentally is running while the root-level compromise is active. The attack is
verify that it is effective at engaging strict mode, where we can lost if that unprivileged process subsequently terminates. Attack-
cache a stale IOTLB entry and launch one of the attacks we now ing the driver directly allows us to map any page to any process as
describe. many times as necessary, and without modifying sensitive kernel
We leave for future work developing a Linux IOMMU manage- data structures.
ment policy that combines the best parts of strict and deferred mode. Driver patch. Identifying the specific locations in the NVIDIA
Strict mode minimizes the chances of memory corruption by a de- driver that control access to memory would seem difficult because
vice by quickly unmapping DMA memory. Deferred mode frus- most of the driver is proprietary and undisclosed. However, the par-
trates our attacks by periodically flushing the entire IOTLB.

5 GPU driver attack

In this section we show an attack on the stock NVIDIA closed-source GPU driver. The attack enables arbitrary mapping of CPU memory into an unprivileged GPU program, concealing malware code that monitors or changes CPU memory from a GPU kernel.

The attack scenario. An attacker loads a malicious kernel module which installs a backdoor by patching the GPU driver in CPU memory (Step 1 in Figure 2). The driver continues to operate normally. To trigger the backdoor, an unprivileged GPU-controlling process performs a sequence of standard GPU API calls (a trigger sequence, Step 3), which maps the requested CPU memory into the GPU (Step 4). The driver patch bypasses the standard access control checks in the driver and allows the attacker to map any user or kernel memory page. The CPU process then invokes a malicious, unprivileged GPU kernel (Step 5) which accesses the mapped page (Step 6). The attack module may unload itself, leaving the modified driver in kernel memory to subsequently repeat the attack from another unprivileged process. If no more attacks are planned, a stealthier alternative is to reconstruct the original driver code to evade detection by kernel code integrity scanners [35]. We implement two proof-of-concept GPU malware kernels: one escalates the privileges of a given process by manipulating its cred structure, and the other diverts the execution flow of a given process by updating its code, which resides in read-only memory.

Our attack is similar to the previously reported GPU keylogger attack [24]. Both attacks exploit GPU DMA capabilities; in particular, the functions that control memory mapping reside in the wrapper of the driver that is shipped as open source and compiled at driver installation time, making it easier to determine the exact location that needs patching.

We choose to patch the NVIDIA driver in memory rather than modify its source code. Patching the driver enables the attack on a system where the GPU driver module has already been loaded and is in use by the kernel, allowing the attack to avoid unloading the driver first, which could easily be noticed. The patch is installed by a malicious module which finds the driver module in memory, overwrites some of its code, and then unloads itself.

The patch diverts the original control flow of the driver to bypass the memory permission checks when handling the cudaHostRegister() API call, which is normally used to lock CPU memory pages (os_lock_user_pages()) and map them into the GPU address space (nv_dma_map_pages()).

The driver acts normally as long as the trigger sequence has not been detected. The trigger sequence is a series of legal but erroneous calls to cudaHostRegister(), e.g., passing pointers to unallocated memory. Once triggered, the modified driver expects another call to cudaHostRegister() with the hidden buffer as a parameter. The hidden buffer contains the actual parameters of the malicious mapping request. For instance, the driver may map the virtual address of a certain process, or some known kernel data structure like the task control block (task_struct in Linux).

The patch resolves the physical address of the requested memory region, injects this address into the original control structures of the driver, and resumes the original driver, which updates the internal GPU page table with the new mapping. The GPU-controlling process then may launch the malicious GPU kernel that accesses the mapped region.
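The trigger-and-hidden-buffer protocol just described can be viewed as a small state machine inside the patched permission check. The following user-space model illustrates the control flow; it is a minimal sketch under our own assumptions: the function name, the three-call trigger length, and the hidden_request layout are illustrative only, not the actual patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the patched permission check. The backdoor
 * counts consecutive "legal but erroneous" page-lock requests; after
 * TRIGGER_LEN such calls it treats the next request's buffer as the
 * hidden mapping parameters instead of validating it. */
#define TRIGGER_LEN 3

struct hidden_request {
    uint64_t target_phys;   /* address the attacker wants mapped to the GPU */
    size_t   len;           /* length of the malicious mapping */
};

static int  trigger_count  = 0;
static bool backdoor_armed = false;

/* Returns true when the call should bypass the normal access checks;
 * lookup_failed models "the pointer did not resolve to an allocation". */
bool patched_lock_pages(const void *user_buf, bool lookup_failed,
                        struct hidden_request *out)
{
    if (backdoor_armed) {
        /* First call after the trigger: interpret the buffer as the
         * hidden mapping request and skip the permission checks. */
        *out = *(const struct hidden_request *)user_buf;
        backdoor_armed = false;
        trigger_count  = 0;
        return true;
    }
    if (lookup_failed) {
        if (++trigger_count == TRIGGER_LEN)
            backdoor_armed = true;     /* trigger sequence complete */
    } else {
        trigger_count = 0;             /* ordinary traffic resets the count */
    }
    return false;                      /* behave exactly like the stock driver */
}
```

In this model, three erroneous calls arm the backdoor, and the buffer passed to the next call supplies the physical address that the patch injects into the driver's mapping structures.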
6 GPU microcode attack

We demonstrate a novel attack that modifies GPU microcode to spy on or corrupt CPU and GPU state. The attack uses an embedded microprocessor in NVIDIA GPUs and can evade any detection mechanism that relies on evidence from CPU or GPU memory.

Figure 3. The GPU microcode attack. (1 Launch) The attacker loads the attack code into microprocessor storage. (2 Monitor) The microcode transfers data from GPU memory to its own memory in order to identify triggers or commands from the attacker. (3 Execute) Once it detects the commands, it launches the attack by writing to critical data structures in CPU memory.

6.1 Background: NVIDIA GPU microprocessors

NVIDIA GPUs contain several on-board microprocessors for power management, video display, decoding, decryption, and other purposes. The existence of multiple Falcon microprocessors in NVIDIA GPUs has been officially disclosed by NVIDIA [32], but no official public API has been released. The reverse-engineering community discovered that Falcon microprocessors are capable of issuing data transfers from CPU or GPU physical memory to microprocessor memory using dedicated memory transfer instructions [13, 17].

Falcon microprocessors expose a common set of MMIO registers that both the CPU and the microprocessor can access or update (GPU kernels normally cannot access them). These registers enable communication between privileged CPU code and the microcode. Certain MMIO registers update the code and data memory of the microprocessor and restart its execution. Linux kernel source code contains the assembly code for certain Falcon processors (in drivers/gpu/drm/nouveau/core/engine/graph/fuc).

The platform's GPU driver loads control code onto the Falcon microprocessors as part of its initialization sequence. The code is invoked in response to certain events, for example when serving requests to switch GPU control to another CPU process (called a GPU context switch [15]). It is this control code that we attack. We call this a GPU microcode attack because the microprocessor code is one of several non-user-visible code modules loaded into the GPU at its initialization, and because this is the terminology accepted by GPU driver developers [16].

6.2 The attack

The attack consists of three phases: launch, monitor, and execute, as illustrated in Figure 3. In the launch phase, an attacker with privileged access installs the attack microcode on one of the Falcon microprocessors and executes the code. The attack then enters the monitor phase, in which the microprocessor monitors regions of GPU memory to identify commands from the attacker. The commands can be located in GPU memory or MMIO registers, though our prototype monitors only specific GPU memory locations. Finally, once the commands are identified, they are executed by the microprocessor as part of the execution phase.

In our proof-of-concept attack, we use a seemingly random binary string as the trigger for the attack. An unprivileged attacker executes a CUDA program that has a large data structure containing repeated copies of the trigger string. Our modified microcode detects the trigger (with high probability) in one of its monitoring locations and transitions into the attack execution phase. In our prototype, the attacking microcode escalates the privilege of a predefined running shell process by writing into the process' credential structure in CPU memory.

Stealth and unobtrusiveness. The key benefit of this attack is stealth. We observe no evidence that the attack is occurring in CPU memory or in GPU memory. We check all of the GPU base address register (BAR) regions mapped by the CPU (BAR0 for MMIO, BAR1 for the VRAM aperture, BAR3 for kernel-accessible control memory), and none contain any sign of the microcode data or code, in either big- or little-endian representation. The only forensic tool we could find for dumping GPU memory [37] did not reveal the microcode. The microcode resides only in the GPU microprocessor memory. The Falcon processor has MMIO registers for uploading the microcode from CPU memory to microprocessor memory. It also has an MMIO register for transferring microprocessor memory back to CPU memory, but we know of no tool that uses this interface, let alone one that tries to determine whether the microcode is malicious.

Once established, the microcode attack does not require support from a kernel module or a CPU user process, unlike previously known GPU malware attacks [24]. The attack does not affect the integrity of kernel-level data structures.

6.3 Technical details

The attack code is written in C, based on the microcode assembly code in the Linux kernel. It is then compiled with the publicly available LLVM backend and envytools [17]. The Falcon's xfer instruction can initiate DMA, and it can transfer data from both GPU and CPU memory to the microprocessor's own memory, or vice versa.

Of the many Falcon microprocessors, we use one that manages context switching between multiple command streams, e.g., between the X server and a 3D application. Several microprocessors are involved in this process: microprocessors that manage the context switch of a group of SMs (graphics processing clusters, or GPCs), and a HUB microprocessor that manages these GPC microprocessors. We update the microcode of the HUB microprocessor because it has larger code and data memory, and the open-source version of the microcode is available in the Linux kernel. An official version of the microcode can also be extracted [16] and used to build modified microcode.
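The monitor phase of Section 6.2 boils down to sampling a handful of fixed GPU-memory locations and comparing each window against the trigger string. The following host-side model sketches that check under our own assumptions: the 16-byte trigger value, the offsets, and the function name are illustrative, and the real code runs on the Falcon, pulling each window into local memory with xfer.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TRIG_LEN 16

/* Illustrative 16-byte trigger; the paper's trigger is a seemingly
 * random binary string replicated throughout a large CUDA allocation. */
static const unsigned char trigger[TRIG_LEN] = {
    0x7f, 0x13, 0xa9, 0x04, 0x55, 0xee, 0x21, 0x90,
    0x3c, 0x48, 0xd2, 0x6b, 0x07, 0xb1, 0x5a, 0xfe
};

/* Monitor phase: sample a few fixed offsets in (simulated) GPU memory
 * and compare each window against the trigger. On the real Falcon,
 * the xfer instruction would first copy each window into
 * microprocessor-local memory. */
bool trigger_present(const unsigned char *gpu_mem, size_t mem_size,
                     const size_t *offsets, size_t n_offsets)
{
    for (size_t i = 0; i < n_offsets; i++) {
        if (offsets[i] + TRIG_LEN > mem_size)
            continue;           /* window would run past the sampled region */
        if (memcmp(gpu_mem + offsets[i], trigger, TRIG_LEN) == 0)
            return true;        /* trigger found: enter the execute phase */
    }
    return false;               /* keep monitoring */
}
```

Because the attacker's CUDA program fills its allocation with back-to-back copies of the trigger, any sampled window that lands on a copy boundary matches; the prototype's analogous scan samples five windows across a 3 GB allocation (Section 6.4).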
The compiled binary is loaded into the microprocessor of the NVIDIA C2075 GPU using a set of MMIO registers. The envytools [13] project contains a set of tools the open-source community has developed and used to build the open-source nouveau Linux driver for NVIDIA GPUs. RNN is the community-built knowledge base for MMIO registers in NVIDIA GPUs. The addresses in parentheses below are the offsets within the MMIO region, which is referenced by the first PCI base address register (BAR0) of NVIDIA GPUs.

Microcode is uploaded to a Falcon using the CODE_VIRT_ADDR (0x188) and CODE (0x180) registers. The user issues 32-bit stores to the CODE_VIRT_ADDR register with the index of the 256B chunk of code to be written, and to CODE with the content of the code at the address pointed to by CODE_VIRT_ADDR. DATA (0x1c4) allows upload of microcode data.

Most of the attack microcode is loaded into regions that do not overlap with the existing code or data, to have minimal effect on normal system operation. Only small patches to the code region are needed to redirect the control flow to the injected attack code.

To remain unobtrusive, we split the work done by the attack microcode into small units and execute only for a limited time, chaining together subsequent executions using continuations. The microcode is originally designed to be interrupt-driven: most of the functions are interrupt handlers for different types of interrupts, such as periodic timers or commands waiting to be handled. The microprocessor assumes that interrupt handlers will be brief.

We test the unobtrusiveness of our attack code by running glxgears from the Mesa GL utility library. This application reports the frame rate of 3D graphics rendering, and we could observe lower frame rates or even GPU lockup if the execution time of our inserted operations took too long. By fine-tuning the amount of work done at each interrupt, our attack microcode supports the same glxgears frame rate as the unmodified microcode. We also could not subjectively observe an effect on typical desktop operations.

We implement our proof-of-concept microcode attack on top of the open-source nouveau driver. The attack code for the nouveau driver includes all of the attack steps we describe in the attack scenario. We verify that we can unobtrusively inject simple code sequences into the NVIDIA microcode, but do not implement the entire attack for the NVIDIA microcode.

The embedded NVIDIA microprocessor has a periodic timer interrupt and a one-shot watchdog timer, but in our experience, the use of the periodic timer affects the graphics output. Therefore, to remain undetectable, we use the watchdog timer and have the watchdog event handler reschedule another watchdog event. Our attack code is 4 KB, which we add to the 3 KB of nouveau microcode, together fitting comfortably in the device's 16 KB capacity.

6.4 Discussion

Falcon microprocessors are relatively slow; the one we used runs at 270 MHz. We read only a small number of GPU memory locations to keep execution time short. Therefore, our trigger consists of a GPU program that fills much of GPU memory with the target string, which gives a high probability of the string being read by the attack microcode. Our proof-of-concept trigger fills 3 GB of data, and the microcode reads five memory locations at 1 GB offsets. This combination makes the microcode recognize the trigger in each of 10 trials.

Small code size. Different types of Falcon processors have different limits on code and data memory. For example, the maximum code size and data size for one microprocessor are 16 KB and 4 KB, respectively, whereas the limits for closely related microprocessors are only 8 KB and 2 KB. If multiple microprocessors communicate and launch a more complex attack than one microprocessor can handle, the attacker can distribute the work according to the memory limit of each processor.

Microcode validation. Starting with Maxwell GPUs, NVIDIA significantly strengthened the security of Falcon microprocessors by requiring their code to be signed and preventing code modifications after the code is initially loaded [32]. Unsigned microprocessor code may run in unsecure mode, but it cannot use certain hardware features (the precise set of constraints depends on the processor). These new security mechanisms are therefore likely to complicate or even entirely prevent our microcode attack, because most (but not all) Falcons on NVIDIA Pascal GPUs do not allow unsigned code to access physical memory. We leave the vulnerability analysis of Pascal GPUs for future work.

7 Related work

GPU malware. Vasiliadis et al. [45] present two GPU malware techniques, code unpacking and runtime polymorphism, used to evade malware detection. These techniques make use of the GPU's computing capacity to build more complex packing algorithms and leverage GPU direct memory access (DMA) to modify host memory.

Ladakis et al. implement a keylogger on the GPU [24], leveraging the DMA capability of GPUs to monitor the operating system keyboard buffer from a GPU kernel. The GPU-based keylogger requires an unprivileged helper process to set up the attack. It relies on a kernel module to update a page table entry of the helper, so that the process' address space contains a window on the kernel-level keyboard buffer. The keyboard buffer address is then moved to the GPU page table and erased from the CPU page table, keeping the kernel memory mapping for only a short time.

Both of these attacks require helper processes on the CPU, and these processes violate certain address-space integrity properties (though in most systems, these integrity properties are implicit). Hiding malware with unpacking and polymorphism on the GPU requires mapping a CPU memory region that is executable, writable, and IO-mapped. The GPU-based keylogger has a user-level page that maps kernel memory, which no user process should ever map. These distinctive memory regions, which clearly violate certain safety properties, make the malware easy to detect by some rootkit detectors [19, 36].

The GPU-based microcode attack described in this paper does not require any running process once it is installed in the microprocessor. It leaves no trace in CPU or GPU memory and therefore does not violate any memory integrity property. The GPU driver attack does not need a CPU helper until the malicious behavior is triggered. The attack is entirely encapsulated in the driver and does not change any kernel data structure; however, the patched driver module might still be detected by kernel integrity checkers.

Villani et al. analyze four GPU-assisted malware anti-memory-forensics techniques that require no modification of GPU microcode (unlimited code execution, process-less code execution, context-less code execution, and inconsistent memory mapping) and apply them to integrated Intel GPUs [46].
GPU as secure co-processor. PixelVault [44] proposes to use GPUs as secure co-processors for cryptographic operations. We have shown in this paper how features ranging from the official NVIDIA debugger to unofficial hardware interfaces violate the security assumptions of PixelVault.

Firmware attacks. Several firmware-based attacks target diverse devices [1, 3, 10, 41, 47, 51]. Similar to the microcode attack in this paper, these attacks embed malicious code into firmware to circumvent the platform's security while evading detection. Triulzi [42] presents a sniffer that uses a combination of a NIC and a GPU to access main memory. The GPU runs an ssh daemon that accepts packets from the NIC through PCI-to-PCI transfer. The firmware modification on the GPU is mainly due to the lack of PCI-to-PCI transfer support; with GPUDirect RDMA [31], this attack can be implemented without GPU firmware modification. To the best of our knowledge, our attack is the first GPU microcode-based attack that leverages GPU embedded microprocessors. Newer NVIDIA GPUs are expected to disallow the use of unsigned microcode, preventing the microcode attack.

Information leaks through the GPU. Recent works notice that the GPU driver does not erase device memory after kernel termination, leaking private information [25, 28]. This paper describes a different type of attack that leverages the GPU to stealthily perform unauthorized accesses to CPU memory.

Attacks using the graphics software stack. The security aspects of using GPUs in graphics applications have been the subject of much work [38, 39, 43]. Our work is complementary in that we focus on the GPU microcode and the driver, and investigate the weaknesses of using GPUs as secure co-processors.

Reverse-engineering GPU hardware. Detailed information about GPU hardware architecture is usually not disclosed by the vendors. Wong et al. [48] reverse engineer GPU internals via carefully crafted microbenchmarks. We use similar techniques in this paper to discover the size of the instruction cache. Fujii et al. [17] explain the internal organization of GPU microprocessors, which we use to implement the microcode attack.

8 Conclusion

GPUs are not an appropriate choice for a secure co-processor, and they pose a security threat to computing platforms, even those with an IOMMU. The problem with making hardware, especially hardware as complex as a GPU, into something that enhances security is that the security guarantees rely on a large set of assumptions about architectural, micro-architectural, and software features. These assumptions are difficult to verify, and they can change across versions of the product because the underlying motivation of the manufacturer is not security.

As an attack platform, GPUs combine powerful access to platform hardware with an opacity encouraged by their proprietary nature. While forensic tools for GPUs will improve, they represent another nettlesome resource for determined attackers.

9 Acknowledgements

Mark Silberstein was supported by the Israel Science Foundation (grant No. 1138/14) and the Israeli Ministry of Science. We also gratefully acknowledge funding from NSF grants CNS-1017785 and CCF-1333594.

References

[1] Brocker, M., and Checkoway, S. iSeeYou: disabling the MacBook webcam indicator LED. In Proceedings of the USENIX Security Symposium (2014), USENIX Association, pp. 337–352.
[2] Intel Corp. Intel 965 Express Chipset Family and Intel G35 Express Chipset Graphics Controller Programmer's Reference Manual. Volume 1: Graphics Core, 2008. https://01.org/sites/default/files/documentation/965_g35_vol_1_graphics_core_0.pdf.
[3] Cui, A., Costello, M., and Stolfo, S. J. When firmware modifications attack: A case study of embedded exploitation. In Proceedings of the Network and Distributed System Security Symposium (NDSS) (2013).
[4] CVE. CVE-2007-1019. https://www.cvedetails.com/cve/CVE-2007-1881/. Accessed: May 2016.
[5] CVE. CVE-2011-1019. https://www.cvedetails.com/cve/CVE-2011-1019/. Accessed: May 2016.
[6] CVE. CVE-2014-5207. https://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2014-5207. Accessed: May 2016.
[7] CVE. CVE request: ro bind mount bypass using user namespaces. http://www.openwall.com/lists/oss-security/2014/08/13/9. Accessed: May 2016.
[8] CVE. How to exploit the x32 recvmmsg() kernel vulnerability CVE 2014-0038. http://blog.includesecurity.com/2014/03/exploit-CVE-2014-0038-x32-recvmmsg-kernel-vulnerablity.html. Accessed: May 2016.
[9] CVE. Local root exploit for CVE-2014. https://github.com/saelo/cve-2014-0038. Accessed: May 2016.
[10] Duflot, L., Perez, Y.-A., and Morin, B. What if you can't trust your network card? In Proceedings of the International Symposium on Recent Advances in Intrusion Detection (RAID) (Berlin, Heidelberg, 2011), Springer-Verlag, pp. 378–397.
[11] Duflot, L., Perez, Y.-A., Valadon, G., and Levillain, O. Can you still trust your network card? CanSecWest/core10 (2010).
[12] Edge, J. A crypto module loading vulnerability. https://lwn.net/Articles/630762/. Accessed: May 2016.
[13] envytools. envytools - tools for people envious of nvidia's blob driver. https://github.com/envytools/envytools. Accessed: May 2016.
[14] Fischer, W. Activating the Intel VT-d virtualization feature. https://www.thomas-krenn.com/en/wiki/Activating_the_Intel_VT-d_Virtualization_Feature. Accessed: May 2016.
[15] freedesktop.org. Nouveau context switching. http://nouveau.freedesktop.org/wiki/ContextSwitching/. Accessed: May 2016.
[16] freedesktop.org. Nouveau context switching firmware. http://nouveau.freedesktop.org/wiki/NVC0_Firmware/. Accessed: May 2016.
[17] Fujii, Y., Azumi, T., Nishio, N., Kato, S., and Edahiro, M. Data transfer matters for GPU computing. In Proceedings of the International Conference on Parallel and Distributed Systems (ICPADS) (Washington, DC, USA, 2013), IEEE Computer Society, pp. 275–282.
[18] Gerfin, G., and Venkataraman, V. Debugging experience with CUDA-GDB and CUDA-MEMCHECK. In GPU Technology Conference (2012).
[19] Hofmann, O. S., Dunn, A., Kim, S., Roy, I., and Witchel, E. Ensuring operating system kernel integrity with OSck. In Proceedings of the ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (March 2011).
[20] Hofmann, O. S., Porter, D. E., Rossbach, C. J., Ramadan, H. E., and Witchel, E. Solving difficult HTM problems without difficult hardware. In Proceedings of the 2nd Workshop on Transactional Computing (TRANSACT) (Portland, OR, August 2007).
[21] Kato, S. Gdev: Open-source GPGPU runtime and driver software. https://github.com/shinpei0208/gdev.
[22] Kato, S., McThrow, M., Maltzahn, C., and Brandt, S. Gdev: First-class GPU resource management in the operating system. In Proceedings of the USENIX Annual Technical Conference (June 2012).
[23] Khronos Group. The OpenCL Specification, Version 2.0, 2014.
[24] Koromilas, L., Vasiliadis, G., Ioannidis, S., Ladakis, E., and Polychronakis, M. You can type, but you can't hide: A stealthy GPU-based keylogger. In Proceedings of the Sixth European Workshop on System Security (EuroSec) (2013), ACM.
[25] Lee, S., Kim, Y., Kim, J., and Kim, J. Stealing webpages rendered on your browser by exploiting GPU vulnerabilities. In Proceedings of the IEEE Symposium on Security and Privacy (Oakland) (Washington, DC, USA, 2014), IEEE Computer Society, pp. 19–33.
[26] Malka, M., Amit, N., Ben-Yehuda, M., and Tsafrir, D. rIOMMU: Efficient IOMMU for I/O devices that employ ring buffers. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '15) (New York, NY, USA, 2015), ACM.
[27] Mareck, R. AMD x86 firmware analysis. Accessed: May 2016.
[28] Maurice, C., Neumann, C., Heen, O., and Francillon, A. Confidentiality issues on a GPU in a virtualized environment. In Financial Cryptography and Data Security. Springer, 2014, pp. 119–135.
[29] Nouveau. Nouveau: Accelerated open source driver for nVidia cards. http://nouveau.freedesktop.org/wiki/. Accessed: May 2016.
[30] NVIDIA. Debugger API. http://docs.nvidia.com/cuda/debugger-api/index.html.
[31] NVIDIA. GPUDirect RDMA technology. http://docs.nvidia.com/cuda/gpudirect-rdma/index.html.