
Secure and Timely GPU Execution in Cyber-physical Systems

Jinwen Wang, Yujie Wang, and Ning Zhang
Washington University in St. Louis, St. Louis, USA
[email protected]  [email protected]  [email protected]

ABSTRACT
Graphics Processing Units (GPU) are increasingly deployed on Cyber-physical Systems (CPSs), frequently used to perform real-time safety-critical functions, such as object detection on autonomous vehicles. As a result, availability is important for GPU tasks on CPS platforms. However, existing Trusted Execution Environment (TEE) solutions with availability guarantees focus only on CPU computing.

To bridge this gap, we propose AvaGPU, a TEE that guarantees real-time availability for CPU tasks involving GPU execution under a compromised OS. There are three technical challenges. First, to prevent malicious resource contention due to separate scheduling of CPU and GPU tasks, we propose a CPU-GPU co-scheduling framework that couples the priority of CPU and GPU tasks. Second, we propose software-based secure preemption of GPU tasks to bound the degree of priority inversion on the GPU. Third, we propose a new split design of the GPU driver with a minimized Trusted Computing Base (TCB) to achieve secure and efficient GPU management for CPS. We implement a prototype of AvaGPU on the Jetson AGX Orin platform. The system is evaluated on benchmarks, synthetic tasks, and real-world applications with 15.87% runtime overhead on average.

CCS CONCEPTS
• Security and privacy → Trusted computing.

KEYWORDS
GPU; Cyber-physical System; System Security; Availability

ACM Reference Format:
Jinwen Wang, Yujie Wang, and Ning Zhang. 2023. Secure and Timely GPU Execution in Cyber-physical Systems. In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security (CCS '23), November 26–30, 2023, Copenhagen, Denmark. ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3576915.3623197

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
CCS '23, November 26–30, 2023, Copenhagen, Denmark
© 2023 Copyright held by the owner/author(s).
ACM ISBN 979-8-4007-0050-7/23/11.
https://doi.org/10.1145/3576915.3623197

1 INTRODUCTION
GPUs play an increasingly important role in real-time CPSs [1, 3, 15] as more and more AI components are integrated into safety-critical CPS, such as self-driving vehicles [4, 8, 14]. With these increasingly rich features, the software system becomes incredibly complex, making it extremely challenging, if not impossible, to keep vulnerability-free [9]. The exploitation of these vulnerabilities allows attackers to tamper with or simply deny key safety-critical functionalities in CPS, such as pedestrian detection.

GPU Execution Protection in CPS: Existing secure execution solutions [27, 32, 33, 42, 43, 48, 59, 67] for GPUs mostly focus on the assurance of confidentiality and integrity. These GPU TEE solutions can be categorized into two approaches: leveraging a CPU TEE to secure GPU access, as demonstrated in [27, 33, 42, 43, 48], or directly instantiating a TEE on the GPU [32, 59, 67]. CPU-based TEEs often result in a larger TCB, while GPU TEEs typically require hardware modifications. However, for GPU protection in CPS, availability (timeliness) is also an essential aspect.

While there has been some recent work on TEEs with availability assurance [22, 61], they primarily considered systems with CPU execution only and cannot be directly applied to GPUs. Accelerators such as GPUs have their own resource management mechanisms and structures as additional computational units. As a result, availability of GPU execution requires a holistic consideration of both CPU and GPU resource management. Given the prevalence of AI in modern CPS for safety-critical functions, such as perception and control, it is essential for availability assurance solutions to also support accelerators such as GPUs.

Secure and Timely GPU Execution with AvaGPU: To bridge this gap, we introduce AvaGPU, a TEE designed to provide real-time availability guarantees for CPU tasks involving GPU execution on CPSs in the presence of an untrusted OS. To achieve this objective, AvaGPU needs to address several challenges unique to the GPU and beyond existing availability solutions [22, 61]:

C1. CPU-GPU Task Priority Coupling: As an independent computing unit, the GPU contains its own task scheduler in the driver. The separation between CPU and GPU schedulers introduces priority decoupling between CPU and GPU tasks. Specifically, when the GPU scheduler allocates computational resources, such as GPU processing units, to the submitted GPU tasks, it is unaware of the priority of the CPU tasks associated with those GPU tasks. Thus, GPU tasks from high-priority CPU tasks, such as safety-critical secure tasks, can be delayed by other GPU tasks submitted by lower-priority CPU tasks [35, 46], resulting in priority inversion. Therefore, coupling the priorities of CPU and GPU tasks is necessary to ensure timely completion of GPU-involved secure tasks. However, neither existing GPU secure execution solutions [27, 33, 59] nor CPU availability solutions [22, 61] enforce priority coupling between CPU and GPU tasks. To solve this problem, AvaGPU introduces a real-time CPU-GPU co-scheduling framework in the TEE. This framework couples the priority of CPU and GPU tasks by prioritizing secure GPU tasks during the execution period of the corresponding secure CPU tasks.

C2. Secure Preemptive Scheduling: Preemptive scheduling is crucial in mitigating priority inversion in real-time systems, by enabling higher-priority tasks to interrupt lower-priority tasks. However, mainstream GPUs lack hardware-level support or public APIs for task preemption [30]. Existing software-based GPU task preemption solutions can be categorized into two approaches. 1) Wait-based inter-thread preemption [24, 25, 63, 66] often incurs long preemption delays, hindering the real-time responsiveness of the system. 2) Re-execution-based methods are efficient but only work for idempotent workloads [30, 37] and cannot be applied to CPS workloads where states are crucial. To tackle this challenge, AvaGPU statically instruments both secure and non-secure GPU task code to add a self-suspending capability driven by a preemption signal (a software-based flag) from the GPU scheduler.

However, non-secure GPU tasks submitted from the untrusted Rich Execution Environment (REE) can have their instrumentation-based self-suspending capability disabled in two ways. First, REE attackers can remove the self-suspending instrumentation. Second, it is difficult to verify that a non-secure GPU task is free of vulnerabilities; a REE attacker can therefore hijack control flow [39, 44] in non-secure GPU tasks to bypass the self-suspending instrumentation at runtime. Both approaches lead to non-secure GPU tasks executing without any suspension, delaying secure GPU task execution by contending for computing resources. AvaGPU proposes two defense mechanisms to prevent or eliminate these attacks as early as possible. First, AvaGPU only executes non-secure GPU tasks with correctly verified cryptographic signatures, preventing the removal of the self-suspending instrumentation before execution. Second, AvaGPU detects and eliminates self-suspending instrumentation bypassing at runtime. Specifically, AvaGPU monitors the progress of secure GPU tasks; if they do not make the expected progress within a bounded time period, AvaGPU proactively kills the unresponsive GPU tasks. However, the progress checkpoints must be selected carefully to defend against instrumentation bypassing attacks without introducing high runtime overhead: more frequent progress monitoring of secure GPU tasks can detect attacks more quickly, but it also introduces higher runtime overhead. To minimize the attack detection runtime overhead while eliminating self-suspension bypassing attacks in time to guarantee the application's real-time performance, AvaGPU formulates the trade-off between security and runtime overhead as a constrained optimization problem and solves it.

C3. Secure GPU Management: In order to guarantee availability for GPU tasks, it is necessary to leverage hardware resource isolation to control access to GPU resources from untrusted domains. This is usually accomplished by assigning the GPU to the secure domain as a secure device. However, this poses two unique challenges. The first challenge is due to the execution of non-secure GPU code on the secure device (GPU). Existing hardware-enforced memory access control in Arm TrustZone works at the granularity of a hardware device, and as a result, non-secure GPU tasks on the secure GPU can access all the secure resources. AvaGPU prevents this attack by deploying a DMA reference monitor in the TEE to validate DMA memory accesses in GPU commands, preventing malicious modification of secure memory. The second challenge is due to the sharing of the GPU between secure and non-secure CPU tasks, requiring mechanisms to harmonize the non-secure GPU driver in the REE and the secure GPU driver in the TEE. To minimize the impact on the TCB, AvaGPU leverages CPS predictability to create a template driver [61]. However, unlike regular I/O devices, the GPU requires complex resource management to enable task execution, and trapping to the TEE for every management operation significantly slows down the system, violating the real-time requirements of the CPS. To minimize this cost, AvaGPU proposes GPU management delegation, separate command buffers, and batch command buffer synchronization mechanisms based on Stage-2 memory access control.

Prototype and Evaluation: We implemented a prototype of AvaGPU on the Jetson AGX Orin platform. Two case studies are used to demonstrate the effectiveness of the defense against availability attacks, including a malicious GPU frequency reduction attack and a preemption bypassing attack. To evaluate the system performance of AvaGPU, we measure the performance overhead on the Rodinia [19] GPU and SPECrate 2017 [20] CPU benchmark suites. The real-time performance of AvaGPU is evaluated on both synthetic real-time GPU tasks and real-world applications. In summary, we make the following contributions:

• We design and implement a software-based real-time TEE solution for tasks involving both CPU and GPU execution, ensuring real-time secure GPU tasks finish correctly and timely in the presence of a compromised OS.
• To address priority inversion, we propose a secure real-time CPU-GPU co-scheduling mechanism and a fine-grained GPU task preemption mechanism to ensure real-time responsiveness. A secure GPU management system is also developed that takes advantage of CPS predictability to isolate GPU resources with minimized overhead in system runtime and TCB.
• We implement a prototype of AvaGPU and show the proposed system can defend against availability attacks with case studies. We also evaluate the system performance on Rodinia and SPECrate 2017 benchmarks, synthetic real-time tasks, as well as real-world applications.

2 BACKGROUND

2.1 Graphics Processing Unit (GPU)
Hardware: GPUs, either dedicated or integrated, are classified based on whether they share physical memory with the CPU. CPSs typically employ integrated GPUs that share memory with the CPU because of Size, Weight, and Power (SWaP) limitations. Thus, AvaGPU focuses on integrated GPUs; we use the Nvidia GPU as an example. The CPU communicates with the GPU through access to the GPU-exposed MMIO memory space. A GPU primarily consists of a copy (DMA) engine, a command processor, a computation unit, and a memory controller. The copy engine transfers data between host-device and GPU memory spaces, while the command processor receives commands from the GPU driver and dispatches them to the GPU computation unit. This unit features Graphics Processing Clusters (GPCs) sharing an L2 cache, with each GPC containing multiple Streaming Multiprocessors (SMs). Each SM includes several cores that share an L1 cache. The GPU uses different mechanisms to isolate caches between tasks. Specifically, L2 cache memory is indexed using physical addresses, while the L1 cache uses virtual addresses for indexing.

Thus, L1 caches are flushed during a context switch. A GPU task determines the number of threads it uses, organizing them as thread blocks divided into warps. The hardware scheduler utilizes warps as the scheduling unit for each SM. Modern mainstream GPUs from leading manufacturers, such as Nvidia [59], AMD [51], Intel [57], and Arm [27], utilize virtual memory to isolate memory space among GPU tasks. This is accomplished through a separate Memory Management Unit (MMU) that employs page table walkers for address translation and a hierarchy of Translation Lookaside Buffers (TLBs).

Software Stack: The GPU software stack primarily consists of a user-space runtime library (e.g., CUDA and OpenCL) and a kernel-space GPU driver. The user-space runtime library offers APIs for user-space applications (i.e., GPU tasks) to program the GPU execution unit with code and to transfer data between the host-device buffer and the GPU buffer. These API calls are converted into GPU commands, which configure the GPU and control data transfers and task launches. The kernel-space GPU driver is mainly responsible for GPU management, such as memory management and GPU command submission. Each GPU task executes in a separate virtual memory space, ensuring memory isolation between tasks. This is achieved by allocating multi-level page tables in the GPU driver for each GPU task before task loading. A command processor in the GPU fetches commands from the host device, with the GPU driver managing two buffers: a command buffer and a ring buffer. The runtime library places commands into the command buffer, which is memory-mapped to user space. The GPU's command processor utilizes the ring buffer to fetch commands. Specifically, when the runtime library pushes commands into the command buffer, the command group's location and size are added to the ring buffer. Simultaneously, a PUT register is updated with a pointer to the command group. The command processor retrieves the command group each time the PUT register is updated and uses the GET register to notify the host device of fetch completion. Once launched on the GPU, a task cannot be preempted until completion. This is because the GPU, unlike an interruptible CPU, exposes no software interface for suspending task execution [30]; consequently, the GPU cannot preserve a task's context in the same manner as a CPU. However, mainstream GPUs support concurrently executing GPU tasks from mutually untrusted processes [23, 46], such as Nvidia MPS [18] and AMD ROCm [2], making performance interference through GPU computation resource contention possible. Existing GPUs support task-killing mechanisms [16, 30] which can stop specified ongoing tasks on the GPU at process granularity without modifying any GPU configuration.

Work Flow: The entire workflow of a GPU task execution comprises four phases: memory allocation, code/data transfer between the host buffer and the GPU buffer, task dispatching and computation, and data transfer from the GPU back to the host device. During the memory allocation phase, the GPU driver allocates memory space for GPU task execution and creates PTEs. In the second phase, the host device specifies the source and destination for the data transfer, transferring input data and loading code from the host buffer to the GPU buffer. Once the data is transferred to the GPU buffer, the task is dispatched to the computing unit to execute. Finally, the results are transferred from the GPU buffer back to the host buffer.
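For concreteness, the four phases map directly onto standard CUDA host-side calls. The following minimal sketch uses only the public CUDA runtime API; the kernel and buffer sizes are illustrative, not taken from the paper:

    // Minimal CUDA host-side sketch of the four-phase workflow.
    #include <cuda_runtime.h>

    __global__ void vector_add(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    void run_task(const float *h_a, const float *h_b, float *h_c, int n) {
        float *d_a, *d_b, *d_c;
        size_t bytes = n * sizeof(float);
        // Phase 1: memory allocation (the driver builds GPU page tables/PTEs).
        cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
        // Phase 2: host-to-GPU transfer of input data (copy/DMA engine).
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);
        // Phase 3: task dispatch and computation (commands reach the GPU
        // command processor through the command buffer and ring buffer).
        vector_add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
        // Phase 4: GPU-to-host transfer of the results.
        cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    }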
2.2 Arm TrustZone and Stage-2 Translation
Arm TrustZone: Arm TrustZone is a hardware security mechanism that divides computational resources into two domains: the normal world and the secure world. The normal world cannot access the resources of the secure world, while the secure world can access all resources. This asymmetric permission arrangement allows the normal world to run a feature-rich, large commodity OS, whereas the secure world runs a smaller, secure OS with fewer features. The two OSes cannot execute concurrently on the same CPU core, thereby ensuring CPU usage isolation. The transition between the two worlds is supervised by the highest privilege level, known as the secure monitor, which is invoked by a dedicated instruction called a secure monitor call (smc). Arm TrustZone also enforces memory isolation at the bus level, preventing the normal world from accessing the secure world's memory.

Stage-2 Translation: The two-stage memory translation mechanism is commonly employed in high-end series, such as Arm Cortex-A, to support virtualization. This mechanism is responsible for mapping the Virtual Address (VA) in applications and the OS to the Physical Address (PA). Specifically, Stage-1 translates the VA into an Intermediate Physical Address (IPA), and then Stage-2 maps the IPA to the PA. When virtualization is not enabled, the VA is directly translated into the PA.

3 THREAT MODEL AND SECURITY GOALS
Threat Model: AvaGPU is designed to protect against privileged attackers capable of executing arbitrary code and reading/writing any memory in the REE on CPS platforms where the GPU is shared between the TEE and the REE. The adversarial goal is to compromise the availability of secure safety-critical GPU task execution, such as object detection in autonomous vehicles, through Denial-of-Service (DoS) attacks. It is important to clarify that AvaGPU focuses on the computational availability guarantee of GPU tasks, and is complementary to existing work [61] that provides a system availability guarantee with only the CPU as the computational unit. Concretely, there are six unique attack vectors. From the perspective of GPU task execution, (1) attackers can modify the GPU code and data stored in host buffers. (2) Attackers can manipulate the DMA controller to prevent GPU data transfers from completing correctly and promptly. (3) Attackers can manipulate CPU task and GPU task scheduling, which includes blocking GPU task command commits or submitting multiple GPU tasks to compete for shared resources, such as caches or computation units, on the GPU. In terms of GPU management, (4) attackers may either deny access to or maliciously configure the GPU, such as modifying the GPU frequency through the MMIO interface. (5) They can manipulate GPU memory address translation by modifying GPU Page Table Entries (PTEs) and the Page Directory (PD), i.e., the table pointing to page tables for more granular address translation. (6) Attackers can exploit vulnerabilities of untrusted GPU tasks to arbitrarily read, write, and execute in their memory space.

Assumptions: AvaGPU targets real-time CPS where the worst-case execution times of safety-critical tasks are well understood for schedulability analysis. We also assume the designer of the CPS system has access to the GPU tasks' source code. These GPU tasks are compiled on a trusted machine with our customized compiler.

Customized GPU Compiler 4 DESIGN


GPU Task Preemption Attack Detection &
Instrumentation Elimination Instrumentation An overview of AvaGPU is shown in Fig. 1. There are three key
components. First, a secure CPU-GPU co-scheduling framework
REE Secure GPU Real-time Execution TEE couples priority of CPU and GPU tasks, making GPU scheduler
EL0 secure secure EL0
sensitive tasks
recognize GPU tasks’ priority to guarantee their enough compu-
insensitive tasks
tation resources timely. Second, a software instrumentation-based
Runtime Library CPU-GPU RT Attack
Detecter
secure and fine-grained GPU task preemption mechanism, allows
Co-scheduling
Scheduling EL1 the GPU scheduler to reliably and effectively preempt GPU tasks,
EL1 Secure GPU Mem Replayer thereby effectively mitigating priority inversion. Third, a trusted
GPU Driver Management
GPU setup and management mechanism isolates secure tasks and
GPU
Stage-2 Translation Mediator/Ref Moni Access essential GPU management functions from untrusted software to
EL2 Secure Monitor EL3
guarantee trusted GPU hardware configuration, dynamic resource
management and GPU task execution environment.

Figure 1: AvaGPU System Overview 4.1 CPU-GPU Co-scheduling Infrastructure


The goal of the real-time CPU-GPU co-scheduler is to couple the
priority for both CPU and GPU tasks, ensuring timely GPU-enabled
task completion. A naive approach to construct such a scheduler to
are compiled on a trusted machine with our customized compiler. move the combined infrastructure to the secure world. However,
The AvaGPU’s software components, including Stage-2 memory this significantly increases the complexity of the TCB. To address
translation, the software stack in secure world, and the customized this, recent work [61] adapts hierarchical scheduling to decouple
GPU compiler are trusted and presumed vulnerability-free. Any the real-time scheduler of REE from the trusted scheduler in TEE,
software components in the TCB on the CPU are verified through and rely on real-time scheduling theory based on the world sched-
secure boot or remote attestation. Device initialization in REE is uler to ensure the allocation of processor resources is adequate
also part of the secure boot process. We also trust the hardware, for both the secure world and the non-secure world to complete
including the CPU and integrated GPU, secure timer, as well as the workload. However, directly adapting this paradigm for CPU-
the corresponding supporting firmware. The physical attacks [29], GPU co-scheduling poses two new challenges. First, GPU lacks the
cryptographic-based attacks, algorithm complexity attacks [41, 60], individual world abstraction provided by ARM TrustZone, unlike
and side-channel attacks [45] are out of the scope of this paper. the processor counterpart. This necessitates a flatten design on the
System Goals: AvaGPU aims to ensure that secure CPU and GPU scheduler. Second, the new co-scheduler has to ensure coherency
tasks can complete correctly and timely. of priority between a hierarchical CPU scheduling and flatten GPU
(R1) Integrity of GPU Task’s Code and Data: If a GPU task’s scheduler.
code and data integrity aren’t protected, it yields incorrect results.
Thus, the first goal is to protect GPU task data and code integrity. EL0 REE EL0 TEE
CPU Task GPU Task GPU Task CPU Task
(R2) Secure GPU Task Input/Output: GPUs use DMA to move
Scheduler

EL1
GPU

data to the GPU buffer. Attackers could corrupt memory content of NS CPU Scheduler S CPU Scheduler

secure GPU tasks with malicious DMA transactions or delay them EL1 CPU World Scheduler
with lengthy data transfers. Therefore, AvaGPU should prevent ma-
licious DMA write, and preemptive GPU data transfer operations.
Figure 2: AvaGPU Real-time CPU-GPU Co-scheduling
(R3) Availability of GPU Computation Resources: Without
access to properly configured GPU hardware, a GPU task cannot
execute. Therefore, AvaGPU needs to ensure that secure tasks have To tackle this challenge, we propose to decouple the scheduler
access to correctly configured GPU computation resources. infrastructure of CPU scheduling and GPU scheduling while main-
taining coherency on priority inheritance. The high level design is
(R4) Real-time Protection for Secure GPU Tasks: Some CPS
shown in Fig. 2, where the CPU scheduler is adapting the hierar-
tasks requiring GPU computation are inherently real-time in nature.
chical scheduling while the GPU scheduler is flatten. To maintain
Latency in computation results could lead to accidents. Thus, all se-
coherency between the two, GPU execution strictly follows the
cure real-time tasks utilizing GPU should have timely completions.
world switching, where the accelerators prioritize on GPU tasks
(R5) Isolation of GPU Resources: A privileged attacker can ma- from the secure world when the processor is in secure state, and
liciously manipulate GPU resource management, including GPU vice versa. Furthermore, within a single secure domain, GPU tasks
memory translation and command submission, to tamper with se- inherit the priorities of the CPU tasks that they are associating
cure GPU tasks. Thus, AvaGPU must ensure the isolation of GPU with. However, it is important to note that such tightly coupled
resource management for secure and non-secure GPU tasks. scheduling infrastructure may not provide the best performance.
(R6) Maintaining a Minimized TCB: Large TCB may introduce Since the focus of this work is not on real-time scheduling, we will
potential vulnerabilities. Therefore, AvaGPU aims to guarantee leave such exploration as future work. Yet, it is important to note
GPU availability while maintaining a minimal TCB. that AvaGPU is designed to be extensible to different types of CPU
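To make the coupling concrete, the following is a minimal sketch of a flat GPU scheduler that follows the world state and inherits CPU priorities. All names (GpuTask, World, pick_next) are hypothetical and only illustrate the design; this is not AvaGPU's implementation:

    // Illustrative sketch: a flat GPU scheduler that follows the
    // TrustZone world state and inherits priorities from CPU tasks.
    #include <vector>

    enum class World { Secure, NonSecure };

    struct GpuTask {
        World world;       // world of the submitting CPU task
        int cpu_priority;  // inherited from the associated CPU task
        // ... command group metadata ...
    };

    // Prefer tasks from the world the CPU currently executes in,
    // then order by the inherited CPU priority.
    GpuTask *pick_next(std::vector<GpuTask *> &ready, World current_world) {
        GpuTask *best = nullptr;
        for (GpuTask *t : ready) {
            if (t->world != current_world) continue;  // follow world switching
            if (!best || t->cpu_priority > best->cpu_priority) best = t;
        }
        return best;  // may be null: fall back to the other world's tasks
    }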

4.2 Secure GPU Task Preemption
Preemption is one of the most important features in real-time systems. However, the existing GPU scheduling infrastructure does not allow task preemption. To tackle this challenge, there are two key design components that enable secure GPU task preemption. First, to enable preemption of GPU tasks (kernels), AvaGPU instruments both secure and non-secure GPU tasks to enable software-based self-suspension and resumption. Second, to ensure that instrumentation bypass can be detected when tasks are compromised or intentionally modified by the REE, AvaGPU adopts a two-stage defense mechanism that prevents or eliminates suspension instrumentation bypassing attacks as early as possible.

    01.__global__ void calculate_temp(...){  //GPU Code:
    02.  if(suspended()){context_restore()}  //context restoring
    03.  while(true){
    04.    if(queue_empty()){return;}
    05.    int curr_thread_idx = trd_dequeue();
    06.    int tx = get_bx(curr_thread_idx);
    07.    int ty = get_by(curr_thread_idx);
           ...
    08.    for (int i=0; i<cnt_preempt; i++){
    09.      temp_cal(bx,by,tx,ty);
    10.      checkpoint1 = 1;  //preemption checkpoint update
    11.    }
    12.    if(check_preemption()) context_save();  //context saving
    13.    for (; i<iteration; i++){
    14.      temp_cal(bx,by,tx,ty);}}}
    15.calculate_temp<<<dimGrid, dimBlock>>>  //CPU Code

Figure 3: GPU Task Transformation in AvaGPU
defense mechanism that prevents/eliminates suspension instrumen-
tation bypassing attack as early as possible.
GPU Task Execution Suspension and Resumption: AvaGPU supports GPU task suspension by instrumenting the GPU task, enabling it to self-terminate at specified locations after saving its execution context. When a suspended GPU task is resumed, it restores the execution context and continues executing from the previous suspension point. The GPU task suspension mechanism consists of two stages: preventing the GPU hardware from queuing threads, and intra-thread GPU task self-suspension. First, AvaGPU prevents the GPU hardware scheduler from queuing task threads. Each GPU thread normally processes one input unit; however, GPUs allow applications to submit tasks requiring more threads than the GPU's total core count. Consequently, the excess threads are queued in hardware and scheduled by the GPU hardware scheduler. Hardware-queued threads cannot be terminated until they execute, causing preemption delays. AvaGPU solves this problem by transforming GPU tasks to use only the maximum number of concurrent threads the GPU supports, with each thread processing multiple thread workloads (i.e., multiple input units) sequentially. For example, Fig. 3 illustrates the transformed GPU task calculate_temp. The variables tx and ty are used in line 14 to index the data processed by each thread. Rather than deriving tx and ty from the internal variable threadIdx only once in each thread, AvaGPU instruments the GPU task to obtain multiple ones from a software queue, as shown in lines 4 to 7. Consequently, all GPU task threads can be preempted instantly, as no threads remain queued on the GPU hardware.

Second, AvaGPU proposes an event-based GPU task preemption approach, utilizing a customized compiler to instrument the GPU task. The instrumentation allows GPU tasks to self-monitor software preemption signals (i.e., shared variables) updated by the trusted GPU scheduler. The memory region containing these signals is set to read-only for the untrusted OS via Stage-2 translation to prevent tampering. When a preemption signal is received, the GPU task saves its current execution context and suspends execution. The suspended execution resumes by restoring the saved context when the scheduler resumes the suspended task. The saved context includes GPU register values, while the memory content remains unaltered for suspended tasks. As shown in Fig. 3, the function check_preemption in line 12 examines a preemption signal and invokes context_save, also in line 12, if the signal is set. When a new thread is launched, context_restore in line 2 restores the context if the task was suspended. Leveraging the predictability of real-time CPS, AvaGPU's event-based preemption approach minimizes unnecessary checks by examining preemption signals at submission time for GPU tasks. Consequently, preemption checking intervals align with the greatest common factor of all GPU task release periods. Note that long data transactions are also split into multiple smaller ones to support preemption.

Instrumentation Modification Defense: AvaGPU verifies the signature over the instrumented non-secure GPU task's hash prior to loading, with the aim of identifying any instrumentation modification from the REE. Specifically, each GPU task has to be signed by the developer, and the signature has to be checked by AvaGPU before loading the task into the GPU for execution. Subsequent loads of the same GPU task can simply reuse the previous hash checksum for verification, without the need for public-key crypto operations. However, the delay between command queue commitment and GPU code execution could potentially allow the REE to execute a Time-of-Check-Time-of-Use (TOCTOU) attack by modifying GPU task code after the signature check. To defend against this attack, AvaGPU sets the memory space storing the GPU task code as read-only in the REE via Stage-2 memory translation before verifying the GPU task signature.

Runtime Preemption Instrumentation Bypassing Defense: AvaGPU defends against runtime preemption bypassing using an attack detection and elimination mechanism instead of an attack prevention mechanism such as memory safety [55], which typically introduces high runtime overhead. Specifically, AvaGPU identifies runtime preemption instrumentation bypassing by monitoring its consequences, i.e., progress delay of secure tasks. AvaGPU detects progress delay by verifying whether the expected amount of computation is finished in a fixed amount of time. Specifically, AvaGPU first uses the customized compiler to insert progress delay checkpoints into secure GPU tasks. When the control flow of a secure GPU task reaches a delay checkpoint, the corresponding checkpoint pass variable is set, as shown in line 10 of Fig. 3. To verify the progress between two checkpoints at runtime, AvaGPU triggers the secure timer after a period from the previous delay checkpoint, checking whether the next checkpoint pass variable has been updated. If the next expected delay checkpoint pass variable is not set, a malicious GPU task is contending for shared resources and the preemption instrumentation has been bypassed. As shown in Fig. 4, the system is expected to pass through checkpoint2 at time point t2; if checkpoint2 has not been passed at t2, an attack is occurring.

[Figure 4: Preemption Bypassing Detection — at each delay checkpoint the checkpoint pass variable is set; at times t1 and t2 the system checks whether the expected variable is set, and an attack is detected if checkpoint2 has not been passed at t2.]

[Figure 5: Preemption Bypassing Defense Strategy — execution time with and without attack relative to the deadline; timely delay-attack elimination still meets the deadline, while late elimination misses it.]

When an attack is identified, AvaGPU kills the non-responsive GPU task [16, 30]. Resetting the specific hardware channel associated with a GPU task stops the corresponding ongoing computation on the Nvidia GPU. This action is specific to individual GPU tasks, erasing only the computational context of the target, thus leaving other tasks' memory untouched and preserving their correct execution. Yet, killing only the non-responsive GPU task cannot stop an attacker who turns on malicious logic in one task after another. To be conservative, AvaGPU suspends all other non-secure GPU tasks by sending software signals via secure shared memory until the delayed GPU task completes. The attack detection strategy, including the number and positions of delay checkpoints in a GPU task, is decided automatically by AvaGPU, as described in the next section. The delay checkpoint pass variables are located in the TEE; thus, neither an untrusted CPU process nor an untrusted GPU task can update them arbitrarily. Additional discussion of different strategies is in Section 9.

Defense Strategy Generation: As shown in Fig. 5, the time slack between a real-time task's execution time and its deadline is limited. Detecting progress delay only at the end of task execution may lead the task to miss its deadline. Furthermore, real-time systems often maintain a system utilization upper bound to make sure the system is schedulable [40], leaving AvaGPU even less time to eliminate progress delay. To solve this problem, AvaGPU inserts multiple delay checkpoints in a secure GPU task; this way, the longest delay of a GPU task is bounded by the duration between two checkpoints. However, inserting more checkpoints introduces more runtime overhead. With all these system constraints considered, AvaGPU generates the progress delay checkpoint strategy given a user-provided real-time system utilization upper bound. The generated strategy guarantees that real-time tasks can complete in time after detecting GPU progress delay under the system utilization upper bound, while maintaining minimized runtime overhead.

Defense Strategy Optimization Formulation: In CPSs, CPU execution often relies on the output of secure GPU task execution; as a result, GPU execution often synchronizes with CPU task execution. AvaGPU addresses the challenge of generating defense strategies by formulating the process as an optimization problem. AvaGPU assumes a set of $m$ CPU tasks, denoted as $\tau_0, \tau_1, \ldots, \tau_{m-1}$. Each task $\tau_i$ has a deadline $d_i$, an execution duration $e_i$ (where $e_i$ represents the total execution delay of both CPU and GPU execution, excluding the runtime overhead of progress delay detection in GPU tasks), and a total GPU execution duration $g_i$. The developer provides the expected system utilization upper bound $U_{up} \in [0, 1]$ as input to the model. $U_{up}$ is determined by the scheduling algorithm deployed in the system, which in turn determines the schedulability of the system. The output defense strategy consists of the number of delay checkpoints for each CPU task, $N = \{n_i \mid 0 \le i < m\}$, and the progress delay checkpoint positions in the GPU tasks invoked from each CPU task, $C = \{c_{ij} \mid 0 \le i < m,\ 0 \le j < n_i,\ 0 \le c_{ij} \le g_i\}$. To adhere to the developer's expected system utilization, the total system utilization resulting from the generated strategy must not exceed the specified upper bound:

$$\sum_{i=0}^{m-1} \frac{e_i + n_i \Delta_c}{d_i} \le U_{up}. \tag{1}$$

To guarantee that every task can complete before its deadline, the total execution time of each task $\tau_i$ after detecting a progress delay must be less than its deadline:

$$\max_{0 \le j < n_i} \{e_i + c_{i,j+1} - c_{i,j}\} + n_i \Delta_c \le d_i. \tag{2}$$

The optimization objective is to minimize the runtime overhead introduced by attack detection, which can be formulated as:

$$\min \Big( \Delta_c \sum_{i=0}^{m-1} n_i \Big), \tag{3}$$

where $\Delta_c$ represents the overhead of checking a delay checkpoint pass variable, measured through program execution profiling. We adopt a genetic algorithm [31] to solve this optimization problem. The details of the algorithm are presented in Appendix A.
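As a hedged illustration of constraints (1)-(3), with numbers invented for exposition (not from the paper's evaluation):

    % Worked example: one task (m = 1) with e_0 = 40 ms, d_0 = 50 ms,
    % g_0 = 20 ms, per-check overhead \Delta_c = 0.5 ms, U_up = 0.9,
    % and n_0 = 4 checkpoints evenly spaced 5 ms apart in the GPU code.
    \sum_i \frac{e_i + n_i\Delta_c}{d_i} = \frac{40 + 4(0.5)}{50} = 0.84 \le 0.9
        % constraint (1) holds
    \max_j\{e_0 + c_{0,j+1} - c_{0,j}\} + n_0\Delta_c = 40 + 5 + 2 = 47 \le 50
        % constraint (2) holds
    % The objective (3) value is \Delta_c \sum_i n_i = 2 ms; the solver
    % searches over n_i and c_{ij} to minimize it subject to (1)-(2).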
4.3 Trusted GPU Setup and Management
The GPU driver manages the GPU: initializing it, managing power and memory, and so on. Secure GPU tasks need a trusted driver to have an availability guarantee. However, directly migrating the GPU driver into the TEE significantly increases the TCB. Recent studies [47, 61] have replayed pre-recorded MMIO read/write sequences and values to operate I/O devices or accelerators (i.e., GPUs), offering a trusted driver without substantially increasing the TCB. However, they cannot be directly applied to guarantee GPU availability.

As shown in Fig. 6 (a), [47] ensures confidentiality and integrity by isolating the trusted GPU replayer from the untrusted GPU stack. However, this approach does not prevent attacks that operate the GPU from the REE, leaving GPU availability unprotected. Similarly, as shown in Fig. 6 (b), [61] protects the availability of I/O devices using a global replayer in the TEE, followed by an I/O reference monitor that verifies the validity of all MMIO operations. However, the existing design in [61] does not provide the dynamic memory management needed by the GPU, including runtime-allocated memory addresses and ring buffer pointer positions. Thus, as shown in Fig. 6 (c), AvaGPU adds secure dynamic resource management in front of the replayer, providing the runtime-determined resource metadata to the replayer.

Trusted GPU Management Overview: To prevent the REE from arbitrarily operating the GPU, the GPU MMIO memory space is configured as inaccessible to the REE using Stage-2 translation.

[Figure 6: Solutions for a Secure Peripheral Driver — (a) a confidentiality-protected GPU stack, with a trusted replayer isolated from the untrusted GPU stack; (b) an availability-protected stack for classic I/O devices, with a global replayer and I/O reference monitor; (c) AvaGPU's availability-protected GPU stack, adding a mediator/reference monitor and dynamic resource management in front of the replayer.]

[Figure 7: AvaGPU Secure GPU Management — the REE GPU driver's untrusted memory management and command buffer interact, via smc-based configuration requests and function calls, with the TEE's debloated driver: secure memory management (GPU page table), secure buffer management (command synchronization and command ring updates), and the operation replayer that configures the GPU.]

Any GPU configuration from the REE must be sent as a GPU configuration request to the TEE. A GPU configuration mediator in the TEE validates configurations from the REE based on OEM policies. As shown in Fig. 7, to provide a trusted GPU driver while minimizing the TCB, AvaGPU places a minimalist dynamic resource management portion (i.e., memory and command buffer management) of the GPU driver in the TEE. Dynamic memory management and GPU task commitment requests in the REE driver are trapped and verified in the TEE before GPU task execution. Once resource management is completed in the TEE, the GPU replayer operates the GPU using the resulting metadata from resource management and the recorded I/O transaction messages. However, trapping resource management from the REE into the TEE increases runtime overhead significantly. To solve this problem, AvaGPU introduces GPU PTE management delegation, separated command buffers, and batch command buffer synchronization mechanisms based on the TEE and Stage-2 memory translation.

Trusted GPU Memory Management: The GPU driver manages GPU memory to provide isolation among GPU tasks. An untrusted REE OS can manipulate GPU memory management to compromise the integrity of secure GPU task execution, for example by mapping the virtual address of secure GPU task code to the physical address of malicious GPU task code. Thus, to support trusted GPU memory management for secure GPU tasks, AvaGPU manages GPU memory in the TEE. However, naively trapping every GPU memory management function from the REE to the TEE introduces significant runtime overhead due to the context switches between the REE and TEE. To tackle this challenge, we build on the observation that most page table operations are read-only; AvaGPU therefore traps only the security-sensitive write operations to the TEE. To realize this design with minimal overhead, AvaGPU leverages Stage-2 translation to enforce the access control for the REE. For each write request that is trapped into the TEE, the memory management mediator refers back to the physical memory range list; requests for memory outside of the allowed range are rejected. The access control is similar to the Enclave Page Cache Map (EPCM) in Intel's Software Guard Extensions (SGX) and the Reverse Map Table (RMP) in AMD Secure Encrypted Virtualization (SEV). However, different from these TEE designs, AvaGPU maintains a minimal resource management to ensure the availability guarantee for secure GPU tasks.

Trusted GPU Buffer Management: The GPU runtime library generates GPU commands that are submitted to the command buffer. The location and size of each command group are stored in the ring buffer by the command buffer management. The GPU continuously fetches commands from the ring buffer to execute tasks. An untrusted OS can maliciously modify the commands of secure tasks, corrupting the execution integrity and availability of secure tasks. To address this issue, the TEE needs to take control of the command buffer and ring buffer management. AvaGPU migrates the ring buffer and replicates command buffer management within the TEE. However, switching from the REE to the TEE on every GPU command submission introduces significant runtime overhead. To address this issue, AvaGPU proposes the separated command buffer mechanism. Specifically, the REE GPU driver submits GPU commands into the command buffer located in the REE, and AvaGPU synchronizes this buffer with the one in the TEE, from which the GPU fetches commands. However, actively synchronizing the two command buffers at high frequency introduces non-negligible runtime overhead. Thus, AvaGPU proposes a batch command buffer synchronization mechanism: the two buffers are synchronized after the REE sends a request upon completion of a group of command commitments by a task. This strategy is active only when the GPU is busy, that is, when the ring buffer is not empty, thus avoiding delays caused by waiting for commands.
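The following is an illustrative sketch of the batch synchronization step; the names (CmdGroup, sync_command_buffers) are hypothetical, not AvaGPU's interface:

    // Illustrative sketch of batch command-buffer synchronization.
    #include <cstdint>
    #include <cstring>

    struct CmdGroup { uint64_t offset, size; };  // location/size in buffer

    // TEE-side handler invoked when the REE requests synchronization
    // after committing a group of commands. Batching applies only while
    // the ring buffer is non-empty (GPU busy); otherwise requests are
    // applied immediately so the GPU never stalls waiting for commands.
    void sync_command_buffers(const uint8_t *ree_buf, uint8_t *tee_buf,
                              const CmdGroup *groups, int n) {
        for (int i = 0; i < n; i++) {
            // Validate first (e.g., the DMA reference monitor), then copy
            // each committed group into the TEE buffer the GPU fetches from.
            std::memcpy(tee_buf + groups[i].offset,
                        ree_buf + groups[i].offset, groups[i].size);
        }
        // Finally, append the groups' locations/sizes to the trusted ring
        // buffer and update the PUT register so the command processor
        // can fetch the new groups.
    }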
GPU Task Code/Data Isolation and Transfer Protection: The code and data integrity of secure tasks must be protected from the untrusted OS, as they ensure correct GPU task execution. Besides going through the page table in the GPU, GPU tasks can also use DMA to transfer large chunks of data between the GPU memory space and the host memory space [27, 59]. Non-secure GPU tasks could issue DMA requests on the GPU that corrupt secure memory by making the source or destination addresses in DMA commands fall into secure GPU tasks' memory space. To prevent this, AvaGPU employs a DMA reference monitor in the TEE to validate DMA memory accesses in GPU commands before GPU task execution. Specifically, the DMA reference monitor validates that the source and destination addresses of DMA transfer commands from non-secure GPU tasks do not fall into the memory address space of secure GPU tasks before these commands are pushed into the command queue.
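The address-range test itself is straightforward; a minimal sketch with hypothetical types (only the overlap check reflects the design above):

    // Illustrative sketch of the DMA reference-monitor check.
    #include <cstdint>

    struct Range { uint64_t base, size; };

    static bool overlaps(uint64_t addr, uint64_t len, const Range &r) {
        return addr < r.base + r.size && r.base < addr + len;
    }

    // Reject a DMA transfer command whose source or destination
    // overlaps any secure GPU task's memory range.
    bool dma_allowed(uint64_t src, uint64_t dst, uint64_t len,
                     const Range *secure, int n_secure) {
        for (int i = 0; i < n_secure; i++) {
            if (overlaps(src, len, secure[i]) || overlaps(dst, len, secure[i]))
                return false;   // command is dropped before queueing
        }
        return true;
    }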
5 IMPLEMENTATION
We implemented a prototype of AvaGPU on the Jetson AGX Orin platform. This platform has 12 Arm Cortex-A78AE CPU cores, a GPU with 2048 CUDA cores, and 32GB of memory. The software stack includes the LLVM 17.0.0 compiler, the CUDA 11 runtime library, the Linux 5.10.65-tegra operating system, and the OP-TEE secure OS based on Arm TrustZone.

CPU-GPU Co-scheduling Infrastructure: The end-to-end scheduling infrastructure in AvaGPU comprises a hierarchical CPU task scheduling infrastructure and a global GPU task scheduling system.

We utilize the secure physical timer on the Cortex-A78AE to trigger the hierarchical CPU scheduling process. Both the world scheduler and the secure world scheduler employ Rate Monotonic (RM) scheduling. In our prototype, global GPU task scheduling aligns with the decisions of the CPU schedulers, allowing the GPU task of the most recently scheduled CPU process to preempt the currently executing GPU tasks. Additionally, it maintains a resource consumption table to manage each GPU task's computational resources, including registers, shared memory, and the number of threads.

GPU Task Preemption Instrumentation: The GPU task preemption method comprises two key components: identifying where to check for preemption signals, and determining the data to save and restore. Given the predictability of real-time systems, a GPU task checks for preemption signals whenever a task is released; thus, the checking period is set as the greatest common factor of all task release periods, as sketched after this paragraph. To figure out the code locations of checkpoints, we use LLVM front-end passes to record timestamps with the clock64 function for each basic block and for the statements within blocks that include preemption checkpoints. AvaGPU ensures that at least one preemption checkpoint is included within a fixed number of loop cycles prior to generating the strategy; the specific number of cycles is determined by the task release period. When a GPU task is preempted, AvaGPU saves all registers into secure memory, restoring them when the task is resumed. Memory content is preserved during suspension to maintain the task's state throughout its execution. Note that a GPU has many more registers than a CPU, leading to noticeable runtime overhead during context saving in task preemption. Thus, when multiple GPU tasks of the same priority execute, AvaGPU uses the preemption algorithm in Appendix A to preempt as few GPU tasks as possible while still freeing enough computational resources for the next task to be scheduled.
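A small sketch of the checking-period computation (the example periods are illustrative):

    // Compute the preemption-signal checking period as the greatest
    // common factor of all task release periods.
    #include <cstdint>
    #include <numeric>
    #include <vector>

    uint64_t checking_period_us(const std::vector<uint64_t> &periods_us) {
        uint64_t g = 0;
        for (uint64_t p : periods_us)
            g = std::gcd(g, p);  // gcd(0, p) == p
        return g;                // e.g., {50000, 100000, 250000} -> 50000
    }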
Preemption Bypassing Detection and Elimination: The preemption bypassing detection and elimination process consists of two components: detection strategy generation and attack checkpoint instrumentation. We use a Python script to generate an optimized delay detection strategy with Algorithm 2. We then insert delay checkpoints at the specified points within the GPU task, using the same approach as when inserting GPU task preemption checkpoints. Each checkpoint updates a unique variable in secure memory, marking the execution progress. A secure timer is set according to the strategy. Note that if two adjacent checkpoints are located in two different GPU tasks, two checks are performed. If a new checkpoint variable has not been updated when the timer fires, all non-secure GPU tasks are suspended by setting the preemption variables. Finally, AvaGPU halts any non-terminated task by resetting its hardware channel, thus eliminating the attack.
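The following is an illustrative TEE-side sketch of that secure-timer check; all names are hypothetical, and the per-task channel reset stands for the Nvidia task-killing mechanism cited above [16, 30]:

    // Illustrative sketch of progress-delay detection and elimination.
    extern volatile int checkpoint_pass[];  // set by instrumented secure tasks
    extern volatile int preempt_signal[];   // write-protected from the REE
    bool task_suspended(int task_id);       // hypothetical query
    void reset_hw_channel(int task_id);     // stops one task, memory preserved

    void on_secure_timer(int expected_checkpoint,
                         const int *nonsecure_tasks, int n) {
        if (checkpoint_pass[expected_checkpoint])
            return;                         // expected progress was made
        // Progress delay detected: first suspend every non-secure GPU task
        // through the software preemption signals in secure shared memory.
        for (int i = 0; i < n; i++)
            preempt_signal[nonsecure_tasks[i]] = 1;
        // After a grace period, any task that has not self-suspended is
        // halted by resetting its hardware channel, eliminating the attack.
        for (int i = 0; i < n; i++)
            if (!task_suspended(nonsecure_tasks[i]))
                reset_hw_channel(nonsecure_tasks[i]);
    }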
Trusted GPU Access: AvaGPU reserves memory in the secure world for GPU resource management and Stage-2 translation, providing trusted services. Specifically, a 16MB secure memory region is reserved for the GPU page table and the Stage-2 translation table, with the Stage-2 configuration registers [21] appropriately configured. AvaGPU uses a flat memory model for the normal world's Stage-2 translation, reflecting its single memory space. When page access rights change, the access permission bits of a PTE are modified, and the MMIO addresses of the GPU are configured as read-only in the normal world. Our prototype includes a GPU configuration mediator that restricts non-secure tasks from modifying the GPU frequency during secure task execution. More generally, AvaGPU supports a policy-based hardware configuration mediator; for example, OEMs can limit frequency adjustments to specific ranges.

Trusted GPU Resource Management: For secure GPU memory management, we migrated GPU virtual memory management and GPU MMU operations from the nvgpu driver to the OP-TEE secure OS kernel. We reserve a 4k-aligned memory region to store the PD/PTEs, configuring the access rights of this region in Stage-2 memory translation to be read-only for the normal world and both readable and writable for the secure world. To trap PD/PTE updates from the normal world to the secure world, we substitute the normal world's PD/PTE operation functions (such as nvgpu_pd_write) with an SMC. The secure world's memory management mediator validates these write requests, ensuring the physical addresses in PTEs from the REE are outside the secure world's physical memory, and that the PD does not point to the secure GPU tasks' page tables. We implemented trusted ring buffer and command buffer management in the secure world by migrating command/ring buffer management functions from the nvgpu driver to the secure world. To synchronize command buffers, the commands are copied between the two worlds. During function migration, we identified and removed certain functions and data structures, such as those used for the Linux kernel, debugging, and logging in the nvgpu driver, to reduce the TCB. Nvidia's use of virtual addresses in DMA transfers means our prototype's DMA reference monitor does not need to check the commands' source and destination. However, for vendors like Arm [27] that do not support virtual-address DMA transfers, the DMA reference monitor must reject commands whose source or destination addresses in the control block fall into the secure memory space.

GPU Replay Message Generation: Generating GPU replay messages requires recording MMIO read/write operations and correlating them with runtime resource management decisions. We logged these operations by instrumenting user-space CUDA API calls and MMIO read/write nvgpu driver functions such as nvgpu_os_readl/writel. Additionally, we differentiated user-space operations from interrupt handling by instrumenting all interrupt handlers in the nvgpu driver, as messages tied to interrupt handling are replayed only after an interrupt is triggered. To correlate runtime resource management decisions with MMIO operations, we first obtain decisions such as the GPU page table base address, code/data addresses, and ring buffer address from the instrumented runtime management functions in the driver. Next, we locate the recorded MMIO operations containing these decisions. At runtime, AvaGPU feeds the dynamically generated decisions into the messages to control the GPU.

TCB Analysis: AvaGPU minimizes the TCB size through driver debloating. As shown in Table 1, the increased system TCB in AvaGPU includes operations for Stage-2 memory address translation, secure GPU management, mediators, the GPU message replayer, and the scheduling framework. The prototype adds a total of 7301 Lines of Code (LoC) to the TCB, with the trusted debloated GPU driver accounting for 5331 LoC. This includes 153 LoC for the hardware configuration and memory management mediators, as well as the DMA reference monitor framework, a significant reduction from the 46K LoC in the nvgpu driver. The TCB size may further vary depending on the quantity and size of the secure tasks.
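As a rough sketch of the PD/PTE write trap described above: the SMC function ID and argument layout below are invented for illustration (only nvgpu_pd_write is a real nvgpu symbol, and its actual signature may differ):

    // Normal-world stub: PD/PTE writes are forwarded to the secure world
    // via an SMC instead of writing the (now read-only) page-table region.
    #define AVAGPU_SMC_PTE_WRITE 0x83000010UL  /* hypothetical function ID */

    static void avagpu_pd_write(unsigned long pd_pa, unsigned int idx,
                                unsigned long pte_val)
    {
        register unsigned long x0 asm("x0") = AVAGPU_SMC_PTE_WRITE;
        register unsigned long x1 asm("x1") = pd_pa;
        register unsigned long x2 asm("x2") = (unsigned long)idx;
        register unsigned long x3 asm("x3") = pte_val;
        asm volatile("smc #0"
                     : "+r"(x0)
                     : "r"(x1), "r"(x2), "r"(x3)
                     : "memory");
        /* x0 carries the mediator's verdict: nonzero if the PTE pointed
           into secure memory or the PD targeted a secure task's page
           tables, in which case the write was rejected. */
    }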

Table 1: Line of Code (LoC) of Components in AvaGPU Table 2: Effectiveness of Frequency Reduction Defense

S2 Trans. S GPU Mng Medi Repl Schd NS GPU Dri Compiler Record Exe Time Utilization Exe Time Miss Rate Exe Time Utilization
413 4861 153 317 1557 321 4212 1268 w/o atk w/o atk w atk w atk w atk w atk
w/o Ava w/o Ava w/o Ava w/o Ava w Ava w Ava
Trans.: Translation, Mng: Manager, Medi: Mediator, Repl: Replayer, Schd: Scheduler, Dri: Driver
ObjDetect 10.59ms 56.15ms 12.25ms
ImgClassify 1.64ms 47.36% 38.93ms 85.00% 1.89ms 54.88%
ImgCompress 11.45ms 54.26ms 13.3ms
Exe: Execution, w/o: without, w: with, atk: attack, Ava: AvaGPU
6 EVALUATION
In this section, we evaluate AvaGPU using the prototype described
in Sec.5. We aim to answer the following questions: (1) Can Av- Preemption Bypassing Attack Defense: In this case study, we
aGPU effectively defend against availability attacks? (2) What is evaluate the effectiveness of defending against preemption by-
the system overhead of AvaGPU on GPU tasks? (3) What is the passing attacks on Autoware [5], a widely-used open-source au-
system overhead on CPU tasks? (4) What is the real-time perfor- tonomous driving software that requires high-end platform [6].
mance of real-time CPU tasks that involve GPU executions? To Within Autoware, perception tasks are crucial for detecting, recog-
address these questions, we (1) conduct defense case studies to ex- nizing, and tracking objects, with 3D object detection and track-
amine AvaGPU’s impact on two availability attacks, i.e., malicious ing [64] utilizing the GPU for accelerated processing. The correct-
GPU frequency reduction and preemption bypassing attacks, (2) ness and availability of 3D object detection and tracking are the
measure AvaGPU’s system overhead on the Rodinia benchmark prerequisite of subsequent functions such as planning and control-
suite[19], (3) measure AvaGPU’s runtime overhead on SPECrate ling. Thus, we protect it with AvaGPU in secure world. Other tasks
2017 benchmark, and (4) evaluate the real-time performance of both such as sensor simulation run in normal world. The mission and
synthetic real-time tasks and three real-world GPU tasks, as well map we used in simulation is sample-rosbag [7] in Autoware. We
as CPU hierarchical scheduling computation cost on micro bench- have profiled the execution time of each task, setting the deadline
mark. Additionally, we evaluate AvaGPU’s efficiency in generating equal to the period, which corresponds to the time interval between
performance interference attack defense strategies. More details two consecutive task executions. We simulate the attack by exploit-
can be found in Appendix A. ing the buffer overflow vulnerability in five synthetic non-secure
GPU tasks that run arithmetic computation loop to bypass the
6.1 Defense Case Study preemption, contending computation resources with victim GPU
task. The buffer overflow vulnerability in untrusted GPU tasks is
To demonstrate AvaGPU’s effectiveness in defending against avail-
introduced at the end of the loop to overwrite the loop condition,
ability attacks, we conduct two case studies on the prototype plat-
making loop execute without stopping. To show the effectiveness
form: using AvaGPU to defend against GPU working frequency
of defense mechanism, we measure the execution time of 3D object
reduction and preemption bypassing attacks.
detection and tracking under three system deployment scenarios.
Preemption Bypassing Attack Defense: In this case study, we evaluate the effectiveness of defending against preemption bypassing attacks on Autoware [5], a widely-used open-source autonomous driving software stack that requires a high-end platform [6]. Within Autoware, perception tasks are crucial for detecting, recognizing, and tracking objects, with 3D object detection and tracking [64] utilizing the GPU for accelerated processing. The correctness and availability of 3D object detection and tracking are prerequisites of subsequent functions such as planning and control. Thus, we protect it with AvaGPU in the secure world; other tasks, such as sensor simulation, run in the normal world. The mission and map used in our simulation are from sample-rosbag [7] in Autoware. We profiled the execution time of each task and set the deadline equal to the period, which corresponds to the time interval between two consecutive task executions. We simulate the attack by exploiting a buffer overflow vulnerability in five synthetic non-secure GPU tasks that run an arithmetic computation loop to bypass preemption, contending for computation resources with the victim GPU task. The buffer overflow in the untrusted GPU tasks is introduced at the end of the loop to overwrite the loop condition, making the loop execute without stopping (a sketch of such a task is shown after Table 3). To show the effectiveness of the defense mechanism, we measure the execution time of 3D object detection and tracking under three system deployment scenarios. As illustrated in Table 3, under a preemption bypassing attack, the average GPU task execution delay is 6.99 times longer than the execution without an attack, exceeding the task deadline by 1.36 times. The resulting loss of Autoware's perception of surrounding objects can be observed in the visualization panel, impacting the availability of the subsequent functionalities. When AvaGPU is deployed, the task meets its deadline even under a preemption bypassing attack, at the cost of a 17% average increase in runtime overhead.

Table 3: Effectiveness of Preemption Bypassing Defense

      w/o Atk    w Atk w/o AvaGPU   w Atk w AvaGPU   Deadline
Min   43824 us   281123 us          51142 us         250000 us
Max   55399 us   385721 us          65094 us         250000 us
Avg   48470 us   338760 us          56710 us         250000 us
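The following is a minimal sketch of one such synthetic vulnerable task (names and sizes are illustrative, not the exact evaluation code). Whether the out-of-bounds write actually reaches the loop bound depends on how the compiler lays out the kernel's local memory; the synthetic tasks are constructed so that it does:

```cuda
// Sketch of a synthetic non-secure GPU task with a buffer overflow that
// bypasses preemption: the out-of-bounds write at the end of the loop body
// can overwrite the loop bound, so the kernel never reaches the point where
// it would yield. Illustrative only; the overflow reaching 'bound' depends
// on the compiler's local-memory layout.
__global__ void vulnerable_task(const int *attacker_idx, int attacker_val,
                                float *out) {
    int bound = 1000;      // loop condition targeted by the overflow
    int buf[8];            // adjacent overflowable buffer
    float acc = 0.0f;
    for (int i = 0; i < bound; i++) {
        acc += i * 0.5f;                  // arithmetic computation loop
        buf[i & 7] = attacker_val;        // normal in-bounds use of buf
        // vulnerability at the end of the loop body: with *attacker_idx == 8,
        // this write lands past buf and can reset 'bound', making the loop
        // execute without stopping
        buf[*attacker_idx] = attacker_val + (int)acc;
    }
    *out = acc + buf[0];
}
```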
6.2 System Overhead on GPU Benchmark
To evaluate AvaGPU's runtime overhead on GPU tasks under different workload distributions, we measure the execution times of applications in the Rodinia benchmark suite [19] on systems with and without AvaGPU under three distributions of workload (the sum of the execution time of each application in every world, divided by its period) between the secure and non-secure world, i.e., 25%/75%, 50%/50%, and 75%/25%. For each distribution, we evaluate the runtime overhead of applications in both worlds over 10 iterations. Specifically, in each measurement iteration, under a given workload distribution, both the number and choice of applications in each world are randomly selected to prevent biased results. Since no real-time GPU task benchmarks exist, we treat Rodinia applications as real-time GPU tasks. Specifically, after selecting applications in each iteration, we assign deadlines to applications to ensure that the workload distribution in each world is satisfied and the entire system utilization reaches the upper bound, i.e., 69% under the RM scheduling algorithm. Assigning deadlines to tasks is reasonable here, as we aim to measure the runtime/memory overhead of real-time applications with varied parameters rather than assess real-time performance, which would require actual application deadlines. To measure the runtime overhead of each AvaGPU component, we record starting and ending timestamps to calculate execution delays.
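As a concrete illustration of this timestamp-based methodology (not AvaGPU's internal hooks, which bracket its own components in the same spirit), GPU-side execution delays can be measured with CUDA events:

```cuda
// Sketch: measuring a GPU task's execution delay with CUDA events.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void workload(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = buf[i] * 2.0f + 1.0f;   // stand-in GPU task
}

int main() {
    const int n = 1 << 20;
    float *buf;
    cudaMalloc(&buf, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);                     // starting timestamp
    workload<<<(n + 255) / 256, 256>>>(buf, n);
    cudaEventRecord(stop);                      // ending timestamp
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    std::printf("execution delay: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(buf);
    return 0;
}
```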
[Figure 8: Runtime Overhead of Rodinia Benchmark — Exe Time (ms) per application, with and without AvaGPU]

[Figure 9: Memory Overhead of Rodinia Benchmark — Mem Size (MB) per application, with and without AvaGPU (NW and SW)]
Table 4: Runtime Overhead Breakdown of AvaGPU (NW)

      preemp   data     code    ctx             mm      cmd      config
      check    transf   verif   swch    sched   mng     buffer   media    total
Avg   3.47%    0.77%    0.66%   5.40%   1.61%   1.93%   1.94%    0.63%    15.87%
Max   3.96%    3.95%    1.18%   6.16%   1.96%   2.18%   2.19%    0.81%    17.12%
Min   3.10%    0.21%    0.21%   4.77%   0.97%   1.70%   1.75%    0.47%    14.51%

preemp: preemption, transf: transfer, verif: verification, ctx: context, sched: scheduling, mm mng: memory management, cmd: command, config: configuration, media: mediator

Runtime Overhead Analysis: The maximum runtime overhead of each application when it runs in the normal world and the secure world is shown in Fig. 8. The highest runtime overhead is 17.12% for kmeans in the normal world and 18.49% for kmeans in the secure world. The runtime overhead of each AvaGPU component in the normal world and the secure world is presented in Table 4 and Table 5. Preemption checking and context switching exhibit the highest average runtime overhead among all components. This overhead is mainly due to checking variables shared between the CPU and GPU, and saving and restoring the execution context. The data transfer overhead primarily arises from extra DMA configurations due to transaction splitting, and it increases in proportion to the size of the data. The code verification overhead, on the other hand, is solely caused by validating GPU task code integrity in the normal world, and it increases with the size of the code. The scheduler's runtime overhead is introduced by CPU-GPU co-scheduling; it fluctuates depending on the various real-time task parameters throughout the system. The overhead incurred by memory management and command buffer management is primarily attributable to page table and command/ring queue operations. The memory management overhead grows in proportion to the size of both the code and the data of GPU tasks; it also comprises the runtime overhead incurred by PTE checking in the memory management mediator for non-secure GPU tasks. The configuration mediator overhead occurs due to the verification of the validity of GPU operation requests made from the normal world. Preemption bypassing attack detection runtime overhead is only introduced for secure tasks, mainly caused by checking the progress variables shared between the CPU and GPU. The GPU context switching overhead is introduced by saving/restoring registers on GPU task preemption/resumption.
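To make the two dominant costs concrete, the sketch below shows the kind of checkpoint AvaGPU's compile-time instrumentation inserts at loop back-edges (simplified: register/context saving for later resumption is elided, and all names are hypothetical). The shared preempt_flag drives preemption checking, and the progress heartbeat is what the preemption-bypassing detector monitors:

```cuda
// Simplified sketch of an instrumented secure GPU task. 'preempt_flag' is
// written by the trusted scheduler; 'progress' is the CPU-GPU shared variable
// used for preemption-bypassing detection. Names are hypothetical.
__global__ void instrumented_task(volatile int *preempt_flag,
                                  unsigned long long *progress,
                                  float *data, int n, int resume_i) {
    for (int i = resume_i; i < n; i++) {
        data[i] = data[i] * 2.0f + 1.0f;        // original task body

        // ---- checkpoint inserted at the loop back-edge ----
        if ((i & 0xFF) == 0) {                  // amortize the checking cost
            if (threadIdx.x == 0 && blockIdx.x == 0)
                atomicAdd(progress, 1ULL);      // heartbeat for the detector
            if (*preempt_flag)
                return;                         // yield: a higher-priority task
                                                // is waiting (context saving and
                                                // the resume point are elided)
        }
    }
}
```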
Runtime Overhead Difference between NW and SW: Table 4 and Table 5 show the runtime overhead differences between the normal and secure world. These differences include (1) GPU task code verification, (2) preemption bypassing attack detection, (3) memory and command buffer management, and (4) configuration mediation. The overall runtime overhead of GPU tasks in both the normal and secure worlds is similar due to a combination of factors. (1) AvaGPU does not verify the integrity of GPU task code in the secure world. (2) Preemption bypassing attack defense is only applied to secure GPU tasks through instrumentation. (3) In the secure world, memory management must validate every PTE submitted from the normal world and configure the Stage-2 translation table to enforce read-only permission for the page tables of non-secure tasks; additionally, commands must be copied from the normal world to the secure world before non-secure GPU tasks execute. (4) The configuration mediator in the secure world must verify the validity of configuration values from normal world requests. Additionally, all GPU operation requests from the normal world incur world context switch runtime overhead.

Memory Overhead: Fig. 9 illustrates the maximum memory overhead of each application when it runs in the normal world and the secure world under different system utilizations. The memory overhead is mainly introduced by context saving when a GPU task is suspended. Among all applications in the benchmark, the two bioinformatics processing programs, heartwall (hear) and leukocyte (leuk), have the highest memory overhead, i.e., 37.98% and 31.10% in the secure world and 38.21% and 29.93% in the normal world. These programs require a large number of registers for efficient computation, which are saved during context switching, thus leading to significant memory overhead. However, the memory overhead introduced by context saving is less than 10MB, which is less than 0.04% of the total memory on our platform. Even on lower-performance platforms, such as the Nvidia Jetson Nano [17] with 2GB memory, the memory overhead accounts for less than 0.5% of the entire memory.

6.3 System Overhead on CPU Tasks
To evaluate AvaGPU's runtime overhead on CPU tasks, we measure the execution time of programs in the SPECrate 2017 benchmark running in the normal/secure world on the system with/without AvaGPU.
[Figure 10: Runtime Overhead on CPU Tasks — Exe Time (s) per SPECrate 2017 application, with and without AvaGPU]

[Figure 11: Schedulability Analysis of AvaGPU — Miss Rate (%) vs. System Utilization (%) for secure (S) and non-secure (NS) tasks, with and without AvaGPU; (a) Secure Task Prioritized, (b) Insecure Task Prioritized]

Table 5: Runtime Overhead Breakdown of AvaGPU (SW)

      preemp   data     ctx             mm      cmd      attack
      check    transf   swch    sched   mng     buffer   detect   total
Avg   2.78%    0.53%    5.53%   1.17%   1.70%   1.48%    2.39%    15.58%
Max   3.53%    3.73%    6.13%   1.58%   2.16%   1.67%    3.21%    18.49%
Min   1.62%    0.21%    4.73%   1.07%   1.22%   1.21%    2.19%    13.14%

preemp: preemption, transf: transfer, ctx: context, sched: scheduling, mm mng: memory management, cmd: command, attack detect: preemption bypassing attack detection

Table 6: RT Performance of Real-world Applications

     preemp   data     code    ctx             mm      cmd      attack
     check    transf   verif   swch    sched   mng     buffer   detect   total
OD   2.38%    0.93%    N/A     5.18%   1.78%   1.67%   1.48%    2.18%    15.6%
IC   2.57%    0.81%    N/A     5.27%   1.57%   1.53%   1.44%    2.61%    15.8%
IP   3.13%    0.63%    0.83%   5.31%   2.03%   2.18%   2.09%    N/A      16.2%

preemp: preemption, transf: transfer, verif: verification, ctx: context, sched: scheduling, mm mng: memory management, cmd: command, attack detect: preemption bypassing attack detection

CPU Task Runtime Overhead: As shown in Fig. 10, the maximum, minimum, and average runtime overhead of AvaGPU on the SPECrate 2017 benchmark in both worlds are 6.87%, 0.23%, and 2.25%, respectively. The runtime overhead on CPU tasks is introduced by the Stage-2 memory translation mechanism, which is used to enforce memory access control, preventing attackers from manipulating the GPU in the normal world without being trapped into the secure world for verification. It requires additional page table walks, introducing runtime overhead for CPU tasks. AvaGPU guarantees the correctness of dynamic GPU resource management, GPU task commitment, and GPU configuration by taking over these operations from the normal world GPU driver, i.e., relaying these operations from the normal world GPU driver to the secure world. However, these modifications are all located in the normal world GPU driver, thus having no impact on CPU tasks that involve no GPU execution.
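A minimal sketch of this trap-and-mediate policy is shown below, assuming hypothetical page-table helpers and a hypothetical secure-world relay; it is meant only to illustrate why normal-world GPU register writes cost extra page table walks and world switches:

```cuda
// Host-side sketch (hypothetical helpers, not AvaGPU's actual code): the GPU
// MMIO region is mapped without write permission at Stage-2, so normal-world
// register writes fault and are relayed to the secure-world mediator.
#define GPU_MMIO_BASE 0x17000000UL   // assumed iGPU MMIO base, for illustration
#define GPU_MMIO_SIZE 0x01000000UL
#define S2_READ  (1u << 0)           // hypothetical Stage-2 permission bits
#define S2_WRITE (1u << 1)

void stage2_map(unsigned long ipa, unsigned long size, unsigned perm);   // hypothetical
void smc_to_secure_mediator(unsigned long ipa, unsigned long value);     // hypothetical

void protect_gpu_mmio(void) {
    // read-only for the normal world: every register write traps
    stage2_map(GPU_MMIO_BASE, GPU_MMIO_SIZE, S2_READ);
}

// Invoked on a Stage-2 write fault to the GPU MMIO region.
void on_gpu_mmio_write_fault(unsigned long ipa, unsigned long value) {
    // the secure-world configuration mediator validates the request before
    // the write is actually performed on the GPU
    smc_to_secure_mediator(ipa, value);
}
```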
6.4 Real-time Performance of System
To evaluate the real-time performance of AvaGPU, we evaluate the real-time task miss rate for both synthetic real-time tasks under varying system workloads and real-world CPS GPU applications, demonstrating AvaGPU's feasibility for real-world use cases. We generate synthetic task sets by randomly creating ten benign CPU tasks that exclusively execute GPU tasks, with the GPU tasks featuring different execution times obtained through varying iterations of matrix multiplication. GPU execution durations are randomized between 10us and 20ms, including five secure tasks and five non-secure tasks. Task periods are set to yield total system utilization ranging from 0% to 100%. Each task executes 100 times. To evaluate the computation cost of hierarchical scheduling in AvaGPU, we also measure the execution time of each component in the hierarchical scheduler, including the world scheduler, the secure world scheduler, and the normal world scheduler, during synthetic real-time task execution under different system utilizations. To evaluate the influence of task priority on real-time performance, we alternate high priority between secure and non-secure benign tasks. We employ CARTS [50] to obtain root-level scheduling parameters. Real-time performance under an attacker in synthetic GPU tasks is shown by executing 10 malicious CPU tasks that have the lowest priority in the normal world, running GPU tasks to achieve 100% CPU utilization for each group. We use the same settings as the GPU working frequency reduction case study (i.e., all tasks have a 50ms deadline, and the system utilization upper bound is 69%) as the real-world workload.

Real-time Performance: Fig. 11 illustrates the real-time performance of synthetic tasks in AvaGPU. In a system without AvaGPU, all tasks miss their deadlines due to attackers submitting numerous GPU tasks in brief periods, contending for computational resources. However, with AvaGPU protection, benign tasks only begin to miss deadlines when the system utilization of benign tasks exceeds 69%, in line with the theoretical result [40]. Prioritized tasks start missing deadlines at higher utilization levels since they are scheduled first when computational resources are limited. The real-time performance of synthetic tasks shows that AvaGPU can maintain reasonable real-time performance under malicious GPU resource contention. Table 6 shows the real-time performance of a system running the three real-world applications. The highest average runtime overhead, 16.2%, occurs on IP in the normal world; it includes 0.7% configuration mediator checking overhead. Nevertheless, all applications complete their tasks before the deadline, proving the feasibility of AvaGPU for systems running real-world applications.

CPU Hierarchical Scheduling Computation Cost: We instrumented the AvaGPU hierarchical scheduler and the normal world OS scheduler in Linux to record the scheduling event count and the total overhead over the execution of the same set of synthetic real-time tasks used in the real-time performance experiments above. The total execution time of the task set is 15.79s. Table 7 shows the computation cost of the hierarchical scheduler when the system is schedulable (system utilization remains below 69%). The normal world scheduler (i.e., the Linux scheduler) has significantly higher execution time than the secure world scheduler and the world scheduler because of the complex data structure and process operations in the Linux scheduler. Furthermore, the maximum number of scheduling events in the normal world scheduler under different workloads is also higher than in the world scheduler and the secure world scheduler. This is because the Linux scheduler is tick-based, running at every fixed interval, while the world scheduler and the secure world scheduler are event-driven, working only when a new scheduling event arises, such as the arrival of a new task. The maximum percentage of execution time taken up by the hierarchical scheduling in AvaGPU across varying workloads is 0.82%.
Table 7: Hierarchical Scheduling Runtime Overhead

Scheduler         Max       Min       Avg       Max Events   Max Pct
World Scheduler   0.83 us   0.71 us   0.72 us   2863         0.01%
NW Scheduler      36 us     23 us     28 us     4309         0.80%
SW Scheduler      0.86 us   0.76 us   0.79 us   2261         0.01%

Max: Maximum Execution Time, Min: Minimum Execution Time, Avg: Average Execution Time, Max Events: Maximum number of scheduling events, Max Pct: Maximum percentage of scheduler execution time in all real-time tasks' execution time.
7 SECURITY ANALYSIS

GPU Tasks' Code and Data Integrity: GPU Task Code and Data Corruption (R1): Attackers may modify the code and data of secure GPU tasks to corrupt their executions. However, the code and data of secure GPU tasks are located within the TEE memory, shielding them from direct modification by REE attackers.

Secure GPU Task Input/Output: DMA Manipulation (R2): An attacker could maliciously send commands to the GPU to deny or manipulate the DMA transfers between the host and the GPU buffer. AvaGPU prevents such attacks by checking the destination and length of DMA data transfer transactions with the DMA reference monitor. An attacker cannot deny the allocation of a DMA buffer because the DMA buffers used by secure tasks are pre-allocated within the TEE. DMA attacks that transfer data from other untrusted devices to secure memory are prevented by Arm TrustZone.

GPU Access Availability: Denial/Malicious GPU Access/Configuration (R3): A REE attacker cannot prevent secure GPU tasks from accessing or configuring the GPU, as the GPU's MMIO memory space is accessed directly from the TEE and is typically predefined for embedded devices. AvaGPU prevents attackers from arbitrarily configuring the GPU by using Stage-2 translation-based memory control. All GPU configuration requests are trapped and sent to the TEE configuration mediator for verification before being written to the GPU.

Real-time Execution Availability: (1) Denial of Scheduling GPU Tasks (R4): All secure CPU tasks are invoked directly by the trusted scheduler within the TEE, preventing attackers from hindering their invocation. AvaGPU's GPU task scheduler, located within the TEE, is shielded from REE attackers' control flow manipulation. Attackers may delay GPU scheduling by triggering REE interrupts. To counter this, a REE interrupt handling task with a fixed periodic budget is introduced. CPU scheduling analysis guarantees that the CPU task will meet its deadline, even if the entire interrupt handling task's budget is consumed. (2) Delaying GPU Task Completion (R4): An attacker can delay a secure GPU task by statically disabling the suspension instrumentation, corrupting preemption signals at runtime, or hijacking control flow. Firstly, AvaGPU prevents suspension instrumentation disabling by verifying the instrumented task's signature, using an encrypted hash signed by a trusted compiler with the public key in the TEE. Additionally, AvaGPU mitigates potential TOCTOU attacks, which may occur after signature verification but before execution, by making GPU task code pages read-only for the normal world through Stage-2 translation prior to signature verification. Secondly, an attacker's attempts to modify preemption signals to bypass self-preemption checks are prevented by Stage-2 translation, as the variables are only readable in the REE. AvaGPU detects and eliminates any efforts to circumvent preemption by exploiting vulnerabilities such as memory safety bugs at runtime, but it does not prevent the memory safety corruption itself, due to the significant runtime overhead that would entail.
GPU Resource Isolation Denial/Malicious GPU Management (R5): AvaGPU utilizes a debloated secure driver accessible directly within the TEE, making secure GPU tasks independent of REE GPU management functions. Thus, the REE cannot deny GPU management for these tasks. Memory isolation provided by the TEE prevents attacks on resource management data structures. The memory management mediator prevents malicious attempts to corrupt trusted resource management in the TEE with invalid requests.

8 RELATED WORK

GPU Execution Environment Isolation: As shown in Table 8, existing research on building isolated GPU execution environments falls into two categories: GPU virtualization and TEE on GPU. Through the introduction of a unified additional layer of GPU memory and configuration management in the hypervisor, GPU virtualization [54, 57] enables the isolation of computation environments across VMs on a shared hardware platform, thus eliminating interference between individual VMs' memory and configuration management. GPU virtualization only provides OS-level isolation; AvaGPU complements it by providing finer-grained GPU computation isolation, i.e., between trusted and untrusted processes.

Another research line studies how to build TEEs for GPU computation. Hardware-based solutions [32, 33, 59, 67] utilize customized hardware to provide integrity and confidentiality for GPU workloads. Graviton [59] uses a customized GPU command processor to enforce GPU physical memory isolation for different enclaves. Based on Graviton, Telekine [32] transforms GPU computations into a data-oblivious form to defend against side-channel attacks. HIX [33] leverages a customized CPU MMU and enclave metadata to enforce GPU usage isolation. HETEE [67] leverages a centralized FPGA-based controller to isolate accelerators physically. AvaGPU complements these works by additionally guaranteeing availability for secure GPU tasks with software-based GPU resource management and access control solutions.

For software-based solutions, the GPU separation kernel (GSK) [65] provides a trusted display by isolating trusted GPU drivers in a separated kernel to enforce GPU access control; AvaGPU complements GSK by minimizing the TCB of the trusted GPU driver. StrongBox [27] implements a GPU TEE on the Arm platform by leveraging Stage-2 memory translation and Arm TrustZone to isolate GPU task data and code; different from StrongBox, AvaGPU uses Stage-2 memory translation to provide trusted and efficient GPU memory management. HoneyComb [43] implements a TEE by statically verifying the GPU task binary before loading to confine the behaviors of GPU tasks. CODY [48] and RT-TEE [61] adopt I/O message recording and replaying to provide a trusted peripheral driver. However, GPU access control and dynamic memory management are also necessary when guaranteeing GPU workload availability. Thus, AvaGPU extends message replaying-based drivers by providing secure and efficient GPU access control and dynamic memory management.
Table 8: GPU TEE Related Work Comparison Table

System             Conf./Inte.  CPU Avai.  GPU Avai.  OS Avai.  C.GPU T. A.  SW
GPUvm [54]         ✓ ✓ ✓ ✓ ✓
gVirt [57]         ✓ ✓ ✓ ✓ ✓
Graviton [59]      ✓
HIX [33]           ✓
HETEE [67]         ✓
Trusted Disp [65]  ✓ ✓
StrongBox [27]     ✓ ✓
CODY [48]          ✓ ✓
SecDeep [42]       ✓ ✓
RT-TEE [61]        ✓ ✓ ✓
AvaGPU             ✓ ✓ ✓ ✓ ✓

Conf: GPU Data Confidentiality, Inte.: GPU Data/Code Integrity, OS Avai: OS-level GPU Compute Availability, C.GPU T. A.: Task-level CPU-GPU Compute Availability, SW: Software Solution.

Table 9: Kernel Preemption Related Work Comparison Table

System                          Split Kernel   Re-execution   Inter Block   Intra Thread
PKM, GEPS [24, 66]              ✓
Effisha, FLEP [25, 63]                                        ✓
REEF, Lee et al. [30, 37, 38]                  ✓
AvaGPU                                                                      ✓
From security-property perspectives, all of the above software-based GPU solutions focus on the confidentiality and integrity of GPU tasks; AvaGPU supplements them by additionally guaranteeing the availability of secure GPU tasks. RT-TEE [61] presents a solution to ensure real-time availability for CPU tasks; AvaGPU complements RT-TEE by ensuring GPU task availability.

Real-time GPU Scheduling: As shown in Table 9, GPU task scheduling mechanisms are divided into non-preemptive scheduling and preemptive scheduling. The non-preemptive GPU task scheduling solutions [28, 34-36, 52] only schedule GPU tasks after a running GPU task finishes. AvaGPU proposes a preemption-based GPU scheduling mechanism that can schedule GPU tasks at each preemption checkpoint during a GPU task's execution, reducing scheduling delay. GPUs do not support a preemptive interface in hardware; thus, to support preemptive GPU scheduling, either the hardware or the software needs to be modified. Customized hardware architectures [49, 56, 62] extend existing GPU hardware to support preemption. However, compatibility issues make it challenging for these solutions to be widely adopted. Software-based GPU preemption mechanisms fall into three categories. The first approach splits a long-running GPU task into multiple short ones and schedules at the end of each split sub-task's execution [24, 66]. However, launching sub-tasks introduces high latency; AvaGPU complements this approach by implementing GPU task preemption with software instrumentation, without additional GPU task launches. The second approach, thread block-level preemption [25, 63], preempts GPU tasks at the end of a thread block. This method introduces high preemption delay if the thread block execution time is long; AvaGPU complements these methods by supporting preemption at any expected code execution point, no matter how long a thread runs. The last approach kills and restarts idempotent workloads [30, 37, 38] without saving context; AvaGPU complements this approach by supporting any kind of workload. Additionally, AvaGPU complements all the above work by securing the scheduling infrastructure and defending against compromised GPU tasks.

9 DISCUSSIONS AND LIMITATIONS

GPU Source Code Requirement: AvaGPU includes a GPU task compiling tool that requires the source code of the GPU task. For binary-only AI applications, it is possible to extend AvaGPU by instrumenting GPU binary code with GPU binary instrumentation tools [58]; however, significant reverse engineering efforts are required. Similar to other security mechanisms [26, 53], modifying the source code of an application can alter its performance. Consequently, secure applications require re-certification procedures and schedulability testing after source code modification.

Availability on Dedicated GPU: AvaGPU is primarily focused on protecting the availability of embedded real-time systems, which typically utilize integrated GPUs. The key distinction between integrated and dedicated GPUs is that integrated GPUs share the same physical memory as the CPU, while dedicated GPUs have their own dedicated physical memory. Though the design of AvaGPU is also applicable to platforms utilizing dedicated GPUs, securing the various communication channels presents additional challenges, such as addressing malicious PCIe channel configurations.

Suspending/Killing Strategies: Currently, AvaGPU suspends (responsive ones) and kills (non-responsive ones) non-secure GPU tasks upon detecting a delay of secure GPU tasks. Such an aggressive strategy prevents sequential delays from multiple non-secure GPU tasks at the cost of normal-world performance. However, depending on the tolerance level of the control system, AvaGPU also supports other strategies with different trade-offs. For example, AvaGPU can terminate only the non-responsive GPU task without suspending responsive non-secure ones.
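On the prototype platform, the kill primitive can be built on the nvgpu channel force-reset ioctl ([16]); the sketch below is illustrative, since the exact request encoding and header location vary across driver releases:

```cuda
// Sketch of the "kill" strategy: force-resetting a non-responsive task's GPU
// channel. The ioctl request encoding below is an assumption for
// illustration; see [16] for the interface this is modeled on.
#include <sys/ioctl.h>

#ifndef NVGPU_IOCTL_CHANNEL_FORCE_RESET
#define NVGPU_IOCTL_CHANNEL_FORCE_RESET _IO('H', 0x69)   // assumed encoding
#endif

// channel_fd: fd of the offending task's GPU channel. Responsive non-secure
// tasks can instead be suspended at their next instrumented checkpoint.
int kill_gpu_task_channel(int channel_fd) {
    return ioctl(channel_fd, NVGPU_IOCTL_CHANNEL_FORCE_RESET);
}
```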
Extending to Mutually Untrusted Secure GPU Tasks: While AvaGPU assumes a simplified model of trusted GPU tasks, there are systems where secure GPU tasks may be mutually untrusted. Extending AvaGPU's availability protection to such a setting requires addressing three new attack vectors. First, malicious secure tasks might disable or bypass preemption, by modifying GPU tasks or exploiting vulnerabilities, to monopolize computational resources. Second, malicious secure tasks can modify computation configurations while other secure GPU tasks are running. Last, they can compromise other tasks' memory via DMA requests targeting another task's memory [27]. To defend against these attacks, AvaGPU can be extended to additionally verify user-space requests for GPU configurations and DMA transactions from secure GPU tasks with the existing GPU configuration mediator and DMA reference monitor. Additionally, preemption bypassing attacks can be prevented by applying the existing delay detection and attack elimination mechanisms to secure tasks. To understand the cost of this extension, we implemented a prototype and measured the execution time of secure GPU tasks with the same setup as in Section 6.2 (i.e., System Overhead on GPU Benchmark). The maximum and average runtime overhead of secure GPU tasks is 19.32% and 16.40%, respectively, 0.82% higher than the baseline AvaGPU.

Remote Attestation: The implementation of AvaGPU adapts the remote attestation mechanism from the embedded GPU TEE solution [27], where the keying materials are stored in secure storage. Upon receiving a challenge from a remote verifier, a signed measurement over the software TCB is returned for verification.
Though attestation is not the focus of this paper, it is one of the most foundational techniques for TEEs and requires additional investigation in the context of AvaGPU in the future.

Solution Generality: The key features leveraged in AvaGPU include virtual memory system-based GPU memory isolation, two-stage memory translation, and GPU task killing. Virtual memory systems and two-stage memory translation are well supported by mainstream GPU vendors, including Nvidia, AMD, Intel, and Arm, and by CPU architectures such as Arm, MIPS, and RISC-V. Although GPU task killing is not clearly documented and may differ across GPU vendors, it is still possible to adopt a less efficient mechanism, such as terminating all GPU tasks and resuming only trusted ones, to ensure availability without task-level GPU process killing.

10 CONCLUSION
In this paper, we introduce AvaGPU, which ensures real-time availability guarantees for safety-critical GPU tasks. To couple the priority of secure CPU and GPU tasks, AvaGPU proposes a secure real-time CPU-GPU co-scheduling framework, effectively mitigating performance interference. To enable secure and efficient preemptive GPU scheduling, AvaGPU proposes a secure and fine-grained GPU task preemption mechanism, effectively bounding priority inversion. To provide an efficient and trusted GPU driver with a minimized TCB, AvaGPU proposes a new split design of the GPU driver. We developed a prototype on the Jetson AGX Orin platform and evaluated the system with benchmarks, synthetic tasks, and real-world applications. The source code is available at our project repository: https://github.com/WUSTL-CSPL/AvaGPU.

ACKNOWLEDGMENT
We thank the reviewers for their valuable feedback. This work was partially supported by the NSF (CNS-1916926, CNS-2038995, CNS-2154930, CNS-2238635), ARO (W911NF2010141), and Intel.

REFERENCES
[1] 2023. AMD Embedded GPU. https://www.amd.com/en/products/embedded.
[2] 2023. AMD ROCm. https://rocm.docs.amd.com/en/latest/.
[3] 2023. Arm Embedded GPU. https://www.arm.com/markets/automotive/autonomous-vehicles.
[4] 2023. Artificial Intelligence & Autopilot. https://www.tesla.com/AI.
[5] 2023. Autoware. https://autoware.org/.
[6] 2023. Autoware Hardware Expectation. https://autowarefoundation.github.io/autoware-documentation/main/installation/.
[7] 2023. Autoware Sample-Rosbag. https://autowarefoundation.github.io/autoware-documentation/main/tutorials/ad-hoc-simulation/rosbag-replay-simulation/.
[8] 2023. BYD Self Driving. https://en.byd.com/news/byd-selects-nvidia-drive-hyperion-for-next-generation-software-defined-electric-vehicles/.
[9] 2023. CPS Kernel Vulnerability. https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=cyber+physical+kernel+vulnerability.
[10] 2023. DetectNet. https://github.com/Pypearl/ObjectDetectionCuda.
[11] 2023. Drone Camera. https://www.bhphotovideo.com/c/product/802344079-USE/kolibri_xk6600_gy_hellfire_hd_camera_drone.html.
[12] 2023. Image Compression. https://github.com/adolfos94/Haar-Wavelet-Image-Compression.
[13] 2023. ImageNet. https://github.com/dusty-nv/jetson-inference/blob/master/docs/imagenet-console-2.md.
[14] 2023. NIO Self Driving. https://www.nio.com/nad.
[15] 2023. NVIDIA DRIVE. https://developer.nvidia.com/drive/hyperion.
[16] 2023. Nvidia GPU Channel Reset. https://switchbrew.org/wiki/NV_services#NVGPU_IOCTL_CHANNEL_FORCE_RESET.
[17] 2023. Nvidia Jetson Nano. https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-nano/education-projects/.
[18] 2023. Nvidia Multiple Process Service. https://docs.nvidia.com/deploy/mps/index.html.
[19] 2023. Rodinia Benchmark. https://github.com/yuhc/gpu-rodinia.
[20] 2023. SPEC CPU2017. https://www.spec.org/cpu2017/.
[21] 2023. Stage2 Translation. https://developer.arm.com/documentation/102142/0100/Stage-2-translation.
[22] Fritz Alder, Jo Van Bulck, Frank Piessens, and Jan Tobias Mühlberg. 2021. Aion: Enabling open systems through strong availability guarantees for enclaves. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security. 1357–1372.
[23] Tanya Amert, Nathan Otterness, Ming Yang, James H Anderson, and F Donelson Smith. 2017. GPU scheduling on the NVIDIA TX2: Hidden details revealed. In 2017 IEEE Real-Time Systems Symposium (RTSS). IEEE, 104–115.
[24] Can Basaran and Kyoung-Don Kang. 2012. Supporting preemptive task executions and memory copies in GPGPUs. In 2012 24th Euromicro Conference on Real-Time Systems. IEEE, 287–296.
[25] Guoyang Chen, Yue Zhao, Xipeng Shen, and Huiyang Zhou. 2017. EffiSha: A software framework for enabling efficient preemptive scheduling of GPU. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 3–16.
[26] Abraham A Clements, Naif Saleh Almakhdhub, Saurabh Bagchi, and Mathias Payer. 2018. ACES: Automatic compartments for embedded systems. In 27th USENIX Security Symposium (USENIX Security 18). 65–82.
[27] Yunjie Deng, Chenxu Wang, Shunchang Yu, Shiqing Liu, Zhenyu Ning, Kevin Leach, Jin Li, Shoumeng Yan, Zhengyu He, Jiannong Cao, et al. 2022. StrongBox: A GPU TEE on Arm Endpoints. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security. 769–783.
[28] Glenn A Elliott, Bryan C Ward, and James H Anderson. 2013. GPUSync: A framework for real-time GPU management. In 2013 IEEE 34th Real-Time Systems Symposium. IEEE, 33–44.
[29] J Alex Halderman, Seth D Schoen, Nadia Heninger, William Clarkson, William Paul, Joseph A Calandrino, Ariel J Feldman, Jacob Appelbaum, and Edward W Felten. 2009. Lest we remember: cold-boot attacks on encryption keys. Commun. ACM 52, 5 (2009), 91–98.
[30] Mingcong Han, Hanze Zhang, Rong Chen, and Haibo Chen. 2022. Microsecond-scale Preemption for Concurrent GPU-accelerated DNN Inferences. In OSDI. 539–558.
[31] John H Holland. 1992. Genetic algorithms. Scientific American 267, 1 (1992), 66–73.
[32] Tyler Hunt, Zhipeng Jia, Vance Miller, Ariel Szekely, Yige Hu, Christopher J Rossbach, and Emmett Witchel. 2020. Telekine: Secure Computing with Cloud GPUs. In 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20). 817–833.
[33] Insu Jang, Adrian Tang, Taehoon Kim, Simha Sethumadhavan, and Jaehyuk Huh. 2019. Heterogeneous isolated execution for commodity GPUs. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems. 455–468.
[34] Shinpei Kato, Karthik Lakshmanan, Aman Kumar, Mihir Kelkar, Yutaka Ishikawa, and Ragunathan Rajkumar. 2011. RGEM: A responsive GPGPU execution model for runtime engines. In 2011 IEEE 32nd RTSS. IEEE, 57–66.
[35] Shinpei Kato, Karthik Lakshmanan, Ragunathan Rajkumar, Yutaka Ishikawa, et al. 2011. TimeGraph: GPU Scheduling for Real-Time Multi-Tasking Environments. In 2011 USENIX Annual Technical Conference (USENIX ATC 11).
[36] Shinpei Kato, Michael McThrow, Carlos Maltzahn, and Scott Brandt. 2012. Gdev: First-Class GPU Resource Management in the Operating System. In 2012 USENIX Annual Technical Conference (USENIX ATC 12). 401–412.
[37] Hyeonsu Lee, Hyunjun Kim, Cheolgi Kim, Hwansoo Han, and Euiseong Seo. 2020. Idempotence-based preemptive GPU kernel scheduling for embedded systems. IEEE Trans. Comput. 70, 3 (2020), 332–346.
[38] Hyeonsu Lee, Jaehun Roh, and Euiseong Seo. 2018. A GPU kernel transactionization scheme for preemptive priority scheduling. In 2018 IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS). IEEE, 202–213.
[39] Jaewon Lee, Yonghae Kim, Jiashen Cao, Euna Kim, Jaekyu Lee, and Hyesoon Kim. 2022. Securing GPU via region-based bounds checking. In Proceedings of the 49th Annual International Symposium on Computer Architecture. 27–41.
[40] Chung Laung Liu and James W Layland. 1973. Scheduling algorithms for multiprogramming in a hard-real-time environment. Journal of the ACM (JACM) 20, 1 (1973), 46–61.
[41] Han Liu et al. 2023. SlowLiDAR: Increasing the Latency of LiDAR-Based Detection Using Adversarial Examples. In IEEE/CVF CVPR. 5146–5155.
[42] Renju Liu, Luis Garcia, Zaoxing Liu, Botong Ou, and Mani Srivastava. 2021. SecDeep: Secure and Performant On-device Deep Learning Inference Framework for Mobile and IoT Devices. In Proceedings of the International Conference on Internet-of-Things Design and Implementation. 67–79.
[43] Haohui Mai et al. 2023. Honeycomb: Secure and Efficient GPU Executions via Static Validation. In USENIX OSDI. 155–172.
[44] Andrea Miele. 2016. Buffer overflow vulnerabilities in CUDA: a preliminary analysis. Journal of Computer Virology and Hacking Techniques 12 (2016), 113–120.
[45] Hoda Naghibijouybari et al. 2018. Rendered insecure: GPU side channel attacks are practical. In CCS. 2139–2153.
[46] Nathan Otterness, Ming Yang, Sarah Rust, Eunbyung Park, James H Anderson, F Donelson Smith, Alex Berg, and Shige Wang. 2017. An evaluation of the NVIDIA TX1 for supporting real-time computer-vision workloads. In 2017 IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS). IEEE, 353–364.
[47] Heejin Park and Felix Xiaozhu Lin. 2022. GPUReplay: a 50-KB GPU stack for client ML. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. 157–170.
[48] Heejin Park and Felix Xiaozhu Lin. 2023. Safe and Practical GPU Acceleration in TrustZone. EuroSys (2023).
[49] Jason Jong Kyu Park, Yongjun Park, and Scott Mahlke. 2015. Chimera: Collaborative preemption for multitasking on a shared GPU. ACM SIGARCH Computer Architecture News 43, 1 (2015), 593–606.
[50] Linh TX Phan et al. 2011. CARTS: a tool for compositional analysis of real-time systems. In SIGBED Review. ACM.
[51] Bharath Pichai et al. 2014. Architectural support for address translation on GPUs: Designing memory management units for CPU/GPUs with unified address spaces. ACM SIGARCH Computer Architecture News 42, 1 (2014), 743–758.
[52] Christopher J Rossbach, Jon Currey, Mark Silberstein, Baishakhi Ray, and Emmett Witchel. 2011. PTask: operating system abstractions to manage GPUs as compute devices. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles. 233–248.
[53] Zhichuang Sun, Bo Feng, Long Lu, and Somesh Jha. 2020. OAT: Attesting operation integrity of embedded devices. In 2020 IEEE Symposium on Security and Privacy (SP). IEEE, 1433–1449.
[54] Yusuke Suzuki, Shinpei Kato, Hiroshi Yamada, and Kenji Kono. 2014. GPUvm: Why Not Virtualizing GPUs at the Hypervisor?. In 2014 USENIX Annual Technical Conference (USENIX ATC 14). 109–120.
[55] Laszlo Szekeres, Mathias Payer, Tao Wei, and Dawn Song. 2013. SoK: Eternal war in memory. In 2013 IEEE Symposium on Security and Privacy. IEEE, 48–62.
[56] Ivan Tanasic, Isaac Gelado, Javier Cabezas, Alex Ramirez, Nacho Navarro, and Mateo Valero. 2014. Enabling preemptive multiprogramming on GPUs. ACM SIGARCH Computer Architecture News 42, 3 (2014), 193–204.
[57] Kun Tian, Yaozu Dong, and David Cowperthwaite. 2014. A Full GPU Virtualization Solution with Mediated Pass-Through. In 2014 USENIX Annual Technical Conference (USENIX ATC 14). 121–132.
[58] Oreste Villa, Mark Stephenson, David Nellans, and Stephen W Keckler. 2019. NVBit: A dynamic binary instrumentation framework for NVIDIA GPUs. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture. 372–383.
[59] Stavros Volos, Kapil Vaswani, and Rodrigo Bruno. 2018. Graviton: Trusted Execution Environments on GPUs. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). 681–696.
[60] Jinwen Wang et al. 2023. ARI: Attestation of Real-time Mission Execution Integrity. In USENIX Security. 2761–2778.
[61] Jinwen Wang, Ao Li, Haoran Li, Chenyang Lu, and Ning Zhang. 2022. RT-TEE: Real-time System Availability for Cyber-physical Systems using ARM TrustZone. In 2022 IEEE S&P. IEEE Computer Society, 1573–1573.
[62] Zhenning Wang, Jun Yang, Rami Melhem, Bruce Childers, Youtao Zhang, and Minyi Guo. 2016. Simultaneous multikernel GPU: Multi-tasking throughput processors via fine-grained sharing. In 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 358–369.
[63] Bo Wu, Xu Liu, Xiaobo Zhou, and Changjun Jiang. 2017. FLEP: Enabling flexible and efficient preemption on GPUs. ACM SIGPLAN Notices 52, 4 (2017), 483–496.
[64] Tianwei Yin, Xingyi Zhou, and Philipp Krahenbuhl. 2021. Center-based 3d object detection and tracking. In IEEE/CVF CVPR. 11784–11793.
[65] Miao Yu, Virgil D Gligor, and Zongwei Zhou. 2015. Trusted display on untrusted commodity platforms. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. 989–1003.
[66] Husheng Zhou, Guangmo Tong, and Cong Liu. 2015. GPES: A preemptive execution system for GPGPU computing. In 21st IEEE Real-Time and Embedded Technology and Applications Symposium. IEEE, 87–97.
[67] Jianping Zhu, Rui Hou, XiaoFeng Wang, Wenhao Wang, Jiangfeng Cao, Boyan Zhao, Zhongpu Wang, Yuhui Zhang, Jiameng Ying, Lixin Zhang, et al. 2020. Enabling rack-scale confidential computing using heterogeneous trusted execution environment. In 2020 IEEE Symposium on Security and Privacy (SP). IEEE, 1450–1465.

A ADDITIONAL DESIGN DETAILS

GPU Preemption Algorithm: The algorithm, which utilizes a 4-dimensional DP table, calculates the minimum number of tasks to preempt to meet the specific resource needs (including memory, registers, and threads) of the next GPU task. It takes into account the number of running tasks with the same priority and iterates through all task and resource combinations, determining the minimum number to preempt by including or excluding the current task. Once the table is built, it backtracks to identify the specific tasks preempted in the optimal solution.

Algorithm 1: Task Preemption Algorithm
Input: Number of existing GPU tasks with the same priority n; tasks with the same priority T = {τ_i = (m_i, r_i, t_i) | 0 ≤ i < n}, where m_i, r_i, t_i are the required memory, registers, and threads of τ_i; a new task τ_w = (m_w, r_w, t_w) to be executed.
Output: Tasks to be preempted T' = {τ'_i | 0 ≤ i < n}
1   for i = 0 to n do
2       dp[i][0][0][0] ← 0
3   for i = 0 to n, m = 0 to m_w, r = 0 to r_w, t = 0 to t_w do
4       dp[i][m][r][t] = dp[i-1][m][r][t]
5       if m >= m_{i-1} and r >= r_{i-1} and t >= t_{i-1} then
6           dp[i][m][r][t] = min(dp[i][m][r][t], 1 + dp[i-1][m - m_{i-1}][r - r_{i-1}][t - t_{i-1}])
7   T' = {}
8   for i = n to 1 do
9       if dp[i][m_w][r_w][t_w] != dp[i-1][m_w][r_w][t_w] then
10          add τ_{i-1} to T'
11          m_w -= m_{i-1}, r_w -= r_{i-1}, t_w -= t_{i-1}
12  return T'
processors via fine-grained sharing. In 2016 IEEE International Symposium on Preemption Bypassing Defense Strategy Optimization: Av-
High Performance Computer Architecture (HPCA). IEEE, 358–369.
[63] Bo Wu, Xu Liu, Xiaobo Zhou, and Changjun Jiang. 2017. Flep: Enabling flexible aGPU utilizes the genetic algorithm [31] to address the preemp-
and efficient preemption on gpus. ACM SIGPLAN Notices 52, 4 (2017), 483–496. tion bypassing attack defense strategy optimization. It begins by
[64] Tianwei Yin, Xingyi Zhou, and Philipp Krahenbuhl. 2021. Center-based 3d object
detection and tracking. In IEEE/CVF CVPR. 11784–11793.
randomly selecting 𝑁 strategies and iteratively refines them for
[65] Miao Yu, Virgil D Gligor, and Zongwei Zhou. 2015. Trusted display on untrusted reduced runtime overhead. The fitness score, representing the strat-
commodity platforms. In Proceedings of the 22nd ACM SIGSAC Conference on egy’s runtime overhead, is calculated using Equations 1, 2, and 3. In
Computer and Communications Security. 989–1003.
[66] Husheng Zhou, Guangmo Tong, and Cong Liu. 2015. GPES: A preemptive each iteration, the top K strategies are chosen as offspring based on
execution system for GPGPU computing. In 21st IEEE Real-Time and Embedded their scores. New strategies are then produced via crossover, where
Technology and Applications Symposium. IEEE, 87–97. elements in 𝑁 and 𝐶 from two parent strategies are exchanged,
[67] Jianping Zhu, Rui Hou, XiaoFeng Wang, Wenhao Wang, Jiangfeng Cao, Boyan
Zhao, Zhongpu Wang, Yuhui Zhang, Jiameng Ying, Lixin Zhang, et al. 2020. followed by potential mutations with a probability of 𝑝𝑚 . This pro-
Enabling rack-scale confidential computing using heterogeneous trusted execu- cess continues until a satisfactory offspring emerges. Details can
tion environment. In 2020 IEEE Symposium on Security and Privacy (SP). IEEE,
1450–1465.
be found in Alg. 2. To assess the convergence of our optimization
algorithm, we tested it on three task sets with system utilizations of
20%, 40%, and 60%. Each set had three tasks with random execution
A ADDITIONAL DESIGN DETAILS times. Results showed the algorithm converged in fewer than 2000
iterations for all sets, and reduced runtime overhead from 43.98%
GPU Preemption Algorithm: The algorithm, which utilizes a
to under 3% of task execution time.
4-dimensional DP table, calculates the minimum number of tasks
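A compact host-side skeleton of this search is sketched below; the fitness function (Eq. 3 under the constraints of Eqs. 1 and 2) is abstracted behind fitness(), and the encoding, sampling, and mutation details are illustrative rather than AvaGPU's exact implementation:

```cuda
// Skeleton of Algorithm 2's genetic search. 'fitness' stands in for Eq. 3
// evaluated under the constraints of Eqs. 1-2 (lower is better); the strategy
// encoding and operators are illustrative.
#include <algorithm>
#include <random>
#include <vector>

struct Strategy { std::vector<int> N, C; };  // detection point counts/positions

// Placeholder fitness: replace with the Eq. 3 evaluation under Eqs. 1-2.
double fitness(const Strategy& s) {
    return (double)(s.N.size() + s.C.size());
}

Strategy optimize(std::vector<Strategy> pop, int c_max, size_t K, double p_m) {
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    for (int c = 0; c < c_max; c++) {
        // keep the K fittest strategies as parents (selection)
        std::sort(pop.begin(), pop.end(),
                  [](const Strategy& a, const Strategy& b) {
                      return fitness(a) < fitness(b);
                  });
        if (pop.size() > K) pop.resize(K);
        std::uniform_int_distribution<size_t> pick(0, pop.size() - 1);
        std::vector<Strategy> next;
        while (next.size() < 2 * K) {
            // crossover: exchange elements between two sampled parents
            Strategy child = pop[pick(rng)];
            const Strategy& other = pop[pick(rng)];
            for (size_t i = 0; i < child.C.size() && i < other.C.size(); i++)
                if (coin(rng) < 0.5) {
                    child.N[i] = other.N[i];
                    child.C[i] = other.C[i];
                }
            // mutation: perturb a detection position with probability p_m
            for (size_t i = 0; i < child.C.size(); i++)
                if (coin(rng) < p_m)
                    child.C[i] += (coin(rng) < 0.5) ? 1 : -1;
            next.push_back(child);
        }
        pop = std::move(next);
    }
    return *std::min_element(pop.begin(), pop.end(),
                             [](const Strategy& a, const Strategy& b) {
                                 return fitness(a) < fitness(b);
                             });
}
```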