Arm Cca

The document introduces the Arm Confidential Compute Architecture (CCA) which provides Realms, secure execution environments that are completely isolated from untrusted system software like hypervisors and operating systems. CCA retains the ability for system software to manage hardware resources for Realms while preventing access to Realm contents. The document describes challenges in verifying the Realm Management Monitor firmware that controls Realms, and introduces the Verification Infrastructure for Armv9-A which enables proving the security and correctness of CCA through modular verification techniques.

Uploaded by

MF Kang

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

117 views20 pages

Arm Cca

Uploaded by

MF Kang

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 20

Design and Verification of the Arm Confidential Compute Architecture

Xupeng Li Xuheng Li Christoffer Dall Ronghui Gu

Columbia University Columbia University Arm Ltd Columbia University
Jason Nieh Yousuf Sait Gareth Stockwell
Columbia University Arm Ltd Arm Ltd
Abstract To address this problem, we introduce the Arm Confidential
Compute Architecture (Arm CCA). CCA provides Realms,
The increasing use of sensitive private data in computing is
secure execution environments that are completely opaque
matched by a growing concern regarding data privacy. System
to privileged, untrusted system software such as OSes and
software such as hypervisors and operating systems are sup-
hypervisors. CCA retains the ability of existing system
posed to protect and isolate applications and their private data,
software to manage hardware resources for Realms while
but their large codebases contain many vulnerabilities that can
preventing it from violating Realm confidentiality and
risk data confidentiality and integrity. We introduce Realms, a
integrity. For example, a hypervisor should retain its ability to
new abstraction for confidential computing to protect the data
dynamically allocate memory to or free memory from a Realm
confidentiality and integrity of virtual machines. Hardware
VM, but must never be allowed to access the protected memory
creates and enforces Realm world, a new physical address
contents of a Realm VM. CCA guarantees the confidentiality
space for Realms. Firmware controls the hardware to secure
and integrity of Realm code and data in use, that is data in CPU
Realms and handles requests from untrusted system software
registers and memory, but makes no guarantees regarding
to manage Realms, including creating and running them.
their availability. Confidentiality means that any change that
Untrusted system software retains control of the dynamic
a Realm makes to its private data cannot be observed by other
allocation of memory to Realms, but cannot access Realm
Realms or untrusted system software. Integrity means that a
memory contents, even if run at a higher privileged level. To
Realm will not observe any changes to its private data that it
guarantee the security of Realms, we verified the firmware,
did not make. Because CCA does not guarantee availability,
introducing novel verification techniques that enable us to
a Realm data access is allowed to halt Realm execution.
prove, for the first time, the security and correctness of concur-
rent software with hand-over-hand locking and dynamically CCA avoids hardware complexity by only introducing core
allocated shared page tables, data races in kernel code running hardware mechanisms for attestation and basic address space
on relaxed memory hardware, integrated C and Arm assembly protection, then relying on firmware to manage the use of those
code calling one another, and untrusted software being in full mechanisms. Specifically, CCA introduces Realm world, a new
control of allocating system resources. Realms are included physical address space for Realms orthogonal to privilege lev-
in the Arm Confidential Compute Architecture. els and separate from the existing Non-Secure (NS) world used
today for running software stacks. Within each world, the nor-
mal privilege levels apply and instructions retain their existing
1 Introduction semantics, but software in NS world cannot access CPU state
and memory used by software in Realm world. CCA introduces
The use of sensitive private data in many applications from ad- a new Realm Management Monitor (RMM), firmware which
vertising to healthcare, often in the context of machine learning runs in Realm world at a higher privilege level than Realms.
models, has raised concerns regarding the privacy of data in Untrusted system software such as a hypervisor running in
computing. These applications increasingly run on commodity NS world can then make requests to RMM to manage Realms,
cloud providers. For example, data and computation may including creating and running Realms. RMM protects the
be contained in virtual machines (VMs) running on shared confidentiality and integrity of Realms while handling such
hardware in the cloud, relying on a hypervisor to preserve VM requests. System software in NS world is expected to retain
isolation to protect applications and their data in VMs. full control of the dynamic allocation of hardware resources
Software stacks generally require applications to trust to Realms, including memory allocation and CPU scheduling.
system software which they rely on, such as hypervisors and Because any compromise of RMM could violate the security
operating systems (OSes). Although hypervisors and OSes guarantees of Realms, it is crucial to formally verify its
are supposed to protect applications and their private data, security and functional correctness. However, verifying RMM
their large codebases contain vulnerabilities that can risk poses at least four significant challenges. First, RMM employs
data confidentiality and integrity. Vulnerable system software fine-grained synchronization mechanisms such as hand-over-
running at more privileged levels that can access application hand locking to improve performance. Second, RMM has data
data is a significant security issue. races and runs on Arm multiprocessor hardware with relaxed
memory behavior. Third, RMM contains both C and Arm as- This approach allows us to prove, for the first time, the integrity
sembly code integrated together which call one another freely. and confidentiality of Realms. A key feature of the proof is that
Finally, RMM must protect the confidentiality and integrity of it only needs to trust the specification of the small idealized
Realms even though untrusted system software has full control secure machine model; the much larger specification of the
over the dynamic allocation of Realm resources. Previous real system does not need to be trusted.
verification approaches have not been able to verify system We implemented, evaluated, and verified an early prototype
software with these properties [10, 11, 13, 26, 42, 43, 50, 62] of CCA firmware. Although CCA hardware is not yet available,
To verify RMM, we introduce VIA (Verification Infras- we demonstrated CCA on a functionally accurate Arm Fast
tructure for Armv9-A), which supports four key verification Model with CCA support. We modified the Linux KVM hy-
techniques. First, VIA introduces mover oracle queries to com- pervisor [19–21] to run on CCA and manage Realm VMs, and
bine a local CPU model with mover types [44]. These queries ran various VM workloads on the model. We also ported CCA
encapsulate how operations on other CPUs are interleaved firmware to current Arm hardware to obtain preliminary data
with local CPU operations and can be reordered using mover on CCA performance, which shows that KVM on CCA incurs
types to group local CPU operations together. Along with the modest overhead versus vanilla KVM on real application
local CPU model, this allows easier sequential reasoning and workloads. We verified the correctness of both the C and Arm
modular verification. This makes it possible for the first time assembly CCA firmware implementation, including RMM,
to verify hand-over-hand locking with dynamically allocated proving its implementation refines its specification through
shared multi-level page tables in system software. 43 abstraction layers. We then proved the specification is
Second, VIA decomposes concurrent code into data equivalent to the behavior of the idealized secure machine
race free (DRF) and not-DRF components, then introduces model to verify the confidentiality and integrity guarantees of
permutation conditions for the latter such that proofs on a Realms. The proof only needs to trust roughly 200 lines of Coq
sequentially consistent memory model will hold on relaxed specification, making the formal security guarantees easy to
memory hardware for all concurrent code. Instead of having read and understand. This is the first proof of the security guar-
to verify all of the code directly on relaxed memory hardware, antees of a confidential computing architecture. Realms will be
all that is required is to prove that the code satisfies the included in Armv9-A, the next version of the Arm architecture.
permutation conditions, which ensure equivalent behavior on
sequentially consistent and relaxed memory hardware. VIA
allows any permutation conditions to be defined, supporting
2 Threat Model
verification of a broad class of programs.
We consider an attacker without physical access to the
Third, VIA bridges incompatibilities between C and
machine and assume the attacker’s goal is to compromise the
assembly code due to CPU register state being hidden by the
confidentiality and integrity of VM data. Confidentiality and
former but explicitly used by the latter. To accomplish this
integrity attacks in scope include compromising the hypervisor
without dependencies on a specific compiler, VIA introduces
or any other software to read or modify private VM memory or
a register accounting mechanism to correctly verify integrated
register state, including by controlling DMA-capable devices,
C and Arm assembly code. It leverages the machine-level
or via memory remapping and aliasing attacks. We assume a
procedure call standard for the Arm instruction set to specify
VM does not voluntarily reveal its own private data whether on
how registers are potentially used when assembly code calls
purpose or by accident, but attacks from other compromised
a C function or is called by a C function. VIA tracks CPU
VMs, including confidentiality and integrity attacks, are in
register state across invocations of both C and assembly code
scope. Availability attacks by a compromised hypervisor are
primitives, capturing any information flow through CPU
out of scope. Protection against known software error injection
registers even if hidden by C semantics.
attacks and side-channel attacks require appropriate usage
Finally, VIA introduces an ideal/real paradigm for verifying of architectural mitigations and are beyond the scope of this
security properties that can be applied to Realms, even though paper. DRAM attacks, such as cold boot attacks, live probing,
untrusted system software is in full control of system resources or replay, require additional hardware and are outside of the
and can reclaim system resources such as memory without scope of the threat model.
Realm permission, breaking noninterference. VIA defines an
idealized secure machine model that supports declassification.
Realm private data is stored in physically isolated memory 3 CCA Design
and CPU registers. Data channels, governed by security
policies, are used to exchange information between Realms A key challenge with introducing Realms is how to provide
and untrusted software. We can then prove the security backwards compatibility with a widely-used existing archi-
guarantees of Realms by verifying that the implementation tecture that, like other CPU architectures, was designed based
refines its specification and the real system captured by the on the fundamental assumption that more privileged levels
specification simulates the idealized secure machine model. have greater control and access than less privileged levels of
PAS
Security State
Realm Normal NS Secure Realm Root
VM NS Allow Block Block Block
Secure Allow Allow Block Block
EL0 App App App App Realm Allow Block Allow Block
Secure Root Allow Allow Allow Allow
EL1 Guest OS
Table 1: CCA access control policy. The entity accessing a granule
Host OS
RSI
belongs to a security state, while the PAS is a property only of the
granule being accessed.
EL2 RMM Hypervisor
RMI

platform security services [4]. CCA introduces Realm world,

EL3 Monitor
which is fully compatible with NS world so that existing
software stacks that run in NS world can also run in Realm
Figure 1: Arm Confidential Compute Architecture. world. CCA provides three privilege levels in each of the NS,
Realm and Secure worlds: EL0 for user, EL1 for kernel, and
software. One issue is understanding the potential interactions EL2 for hypervisor. Because Realm and Secure worlds are
of Realms with all the features in the Arm architecture. For mutually distrusting, CCA introduces a fourth, more privileged
example, debug registers defined in the Arm architecture are Root world to manage switching between the other worlds.
explicitly designed to allow hypervisors to peer into VM state, Each world has its own Physical Address Space (PAS). Each
which is fundamentally at odds with Realms. The behavior of 4 KB frame of physical memory, which we refer to as a memory
each instruction could be redefined in the context of Realms, granule, belongs to one PAS at any given time. Individual mem-
but this would be an enormous undertaking with unclear ory granules can be dynamically transitioned from NS PAS to
compatibility implications, given that the Arm instruction set Realm PAS; there is no static partitioning of resources between
was designed over multiple decades. NS and Realm worlds. Hardware performs a PAS check on
Another issue is how to provide memory protection and each memory access against a Granule Protection Table (GPT)
isolation for Realms. The way this works for VMs is that that tracks the PAS of each memory granule and enforces the
hypervisors manage nested page tables (NPTs) [9] to isolate access control policy shown in Table 1, forbidding invalid
physical memory between VMs and protect hypervisor accesses. NS world can only access its own memory. Realm
memory from VMs. The physical addresses perceived by a and Secure worlds can access their own respective memory
VM are intermediate physical addresses (IPAs), which are and NS memory, but cannot access each other’s memory. CCA
translated by an NPT to physical addresses for the hardware. hardware requires all DMA accesses be subject to GPT checks,
Physical memory not mapped to the NPT is not accessible to protecting the Realm PAS against DMA-based attacks. We fo-
the VM. However, NPTs are under full control of the untrusted cus on the interactions between NS and Realm worlds and omit
hypervisor, providing no protection against hypervisor access further discussion of Secure world due to space constraints.
to VM data. While it would be possible to introduce an addi- CCA relies on two trusted firmware components: RMM and
tional data structure to track memory ownership for each frame the EL3 Monitor (EL3M). RMM runs at EL2 in Realm world.
of physical memory [3], this approach comes with several It controls the execution of Realms and provides services to
problems. First, the amount of information required for each untrusted system software running in NS world. It isolates
frame of memory would be substantial and significantly impact Realms from each other using existing virtualization technolo-
TLB design and performance. Second, this data structure gies such as NPTs and CPU register save/restore sequences.
would have to be managed either via a separate more privileged Because RMM only enforces the security guarantees of CCA,
software entity than the hypervisor or via complex instructions it can be orders of magnitude smaller than bare-metal hypervi-
capable of capturing measurements of data assigned to a sors which must also provide virtualization functionality. For
Realm. Such complex CISC-like instructions would almost example, to run Realm VMs, RMM protects the confidentiality
certainly require introducing extensive microcode into an and integrity of Realms while relying on existing hypervisors
architecture, which does not currently use any. for everything else, including resource allocation and schedul-
CCA avoids these problems by only introducing simple hard- ing, physical hardware support, and complex device emulation.
ware mechanisms orthogonal to existing privilege levels and EL3M runs in Root world at EL3, the highest level of priv-
then relies on firmware to manage the use of those mechanisms. ilege. It is responsible for context switching CPU execution
This reduces hardware complexity at the cost of depending among the three other worlds and managing the GPT. EL3M
on the firmware for the security guarantees of the architecture. can access memory in any PAS. Only EL3M can change the
As a result, verifying CCA firmware is of crucial importance. PAS of a granule, which involves updating its entry in the GPT.
Figure 1 shows how CCA extends the Arm architecture. Software running in the three other worlds can issue a Secure
Armv8-A provided two statically partitioned worlds, NS Monitor Call (SMC) to EL3M to request a PAS change.
world used by most software stacks and Secure world to host In the current version of CCA, the Realm isolation boundary
Command Description (GST) to track the delegation status and current use of each
Version Query RMI ABI version.
Granule.Delegate Change granule (from NS) to Delegated.
granule. RMM uses the GST to ensure that a granule is in a
Granule.Undelegate Change granule (from Delegated) to NS. valid state to perform the requested action. For example, when
Realm.Create Create Realm Descriptor (RD). the hypervisor delegates a memory granule, RMM checks its
Realm.Destroy Destroy Realm identified by RD. GST to confirm the granule has not already been delegated,
Realm.Activate Change Realm (from New) to Active.
REC.Create Create Realm Execution Context (REC). then issues an SMC to EL3M to request a change to Realm
REC.Destroy Destroy REC. PAS. EL3M checks that the granule is currently in NS PAS,
REC.Run Enter REC (i.e. run VCPU). then updates the GPT to move it to Realm PAS. Finally, RMM
Data.CreateUnknown Change granule to Data with unknown content. updates its GST to record that the granule has been delegated.
Data.Create Change granule to Data, copy NS content.
Data.Destroy Change Data granule to Delegated, zeroed. If the hypervisor attempts to delegate a granule which is al-
RTT.Create Create Realm Translation Table (RTT). ready delegated, or undelegate a granule which is in active
RTT.Destroy Destroy RTT. use by RMM, RMM returns an error code to the untrusted
RTT.MapProtected Map Data granule in RTT.
hypervisor. This pattern of checking valid states and either
RTT.UnmapProtected Remove mapping from RTT.
RTT.MapUnprotected Map NS granule in RTT. performing a discrete action or returning an error is used for all
RTT.UnmapUnprotected Remove NS mapping from RTT. RMI commands, allowing RMM to remain in overall control of
RTT.ReadEntry Return content of an RTT entry. the consistency of the system, while complex logic for policy
Table 2: RMM Realm Management Interface (RMI). and resource allocation remains in the hypervisor. Unlike the
GPT, the GST is not checked by hardware and is only a soft-
is at the level of entire VMs; applying Realms to secure other ware bookkeeping mechanism. By maintaining a separate GST
entities such as containers [59] is future work. Similar to nor- from the GPT, the GPT can be kept simple so that it only needs
mal VMs, a Realm VM can concurrently run multiple virtual to contain information required for hardware-enforced checks.
CPUs (VCPUs) and the number of Realm VMs on a system is The hypervisor creates Realms, Realm Execution Contexts
only limited by the amount of physical memory available, not (RECs), and Realm Translation Tables (RTTs) using the
by any arbitrary limits. The untrusted hypervisor always has respective commands in Table 2. RECs correspond to VCPUs
the ability to stop scheduling a Realm and can always reclaim and RTTs correspond to NPTs for normal VMs. RTTs are Arm
memory assigned to a Realm, but in no circumstances does stage 2 page tables that translate from an IPA to a physical
it have access to Realm CPU or memory state. address. RTTs use the same format and topological layout
This split of responsibility between an untrusted hypervisor in Realm world as NS stage 2 page tables, but also provide
and RMM, where the untrusted hypervisor allocates memory, a bit which allows Realms to access NS granules under the
and RMM provides integrity and confidentiality guarantees control of RMM, for example, for virtual I/O between a Realm
for the data and code stored in that memory, is accomplished and the hypervisor. On each of the Realm, REC, and RTT
through a simple but powerful delegation concept. The hy- create commands, RMM checks the GST entry for the address
pervisor delegates memory to Realm world, and undelegates provided to confirm the granule is already delegated, and
memory back to NS world. All memory used by Realms updates the GST entry to track that it is being used for Realm,
must first be delegated by the hypervisor; RMM does not REC, and RTT metadata, respectively. We refer to a Realm’s
itself manage a pool of memory for Realms. Once memory metadata as its Realm Descriptor (RD).
is delegated to Realm world, the hypervisor can request RMM A Realm provides a Protected Address Range (PAR) within
to use it for various purposes, such as storing metadata or its IPA space, which RMM ensures can only be mapped to
data for a Realm. Whenever a memory granule is delegated Realm PAS granules. For accesses within the PAR, RMM guar-
to Realm world but not used by RMM, RMM ensures that the antees confidentiality and integrity to the Realm; outside the
granule contains only zeros, reducing the risk of accidental PAR, the hypervisor is free to map NS PAS granules or emulate
information flow when a granule is reused or undelegated. accesses. This provides an OS running inside a Realm VM
RMM provides a Realm Management Interface (RMI) for with a reliable mechanism to determine whether it is accessing
the hypervisor to request RMM to delegate memory, create its own private memory, or memory which can be shared with
Realms, execute Realms, and allocate memory to Realms. untrusted agents, for example, buffers used for untrusted DMA
Each RMI command is implemented as an SMC, so when the with virtual or physical network and block devices.
hypervisor invokes the command, it traps to EL3M, which in During Realm creation, the hypervisor can assign a granule
turn switches execution to RMM in Realm world to handle to the Realm at a specific IPA and copy data to it from an NS
the command. Upon completion of the RMI command, RMM granule. The IPA and data are cryptographically hashed and
issues an SMC to EL3M, which switches execution back to the hash is included in the attestation token of the Realm. The
the hypervisor in NS world. Table 2 lists the RMI commands. attestation token allows a Realm owner to reason about its
RMM must know the state of each memory granule on the initial state and content. Once a Realm has been activated,
system to uphold the security guarantees of Realms, which it the measurement is fixed, and memory can only be added
accomplishes by maintaining its own Granule Status Table to otherwise unused IPAs with unknown content. We refer
to delegated granules used to store data for a Realm as Data its layered specification in Coq, then use the top-level
granules. The hypervisor can request that RMM maps NS specification to prove the system’s security properties hold
granules outside the PAR at any time. Physically contiguous for the implementation. To accomplish this, we developed
delegated memory can be mapped to a Realm in blocks larger the VIA verification framework, which supports layered
than 4 KB granules to optimize TLB usage. verification of CCA firmware. VIA introduces four key
The hypervisor can reclaim memory from a Realm at any verification techniques: mover oracle queries, relaxed memory
time. RMM zeros a granule before undelegating it and return- support via permutation conditions, register accounting for C
ing it to the hypervisor. Subsequent accesses from a Realm and assembly code integration, and a new ideal secure system
to the IPA where the memory was reclaimed result in a stage 2 model for proving security properties that cannot be verified
abort to RMM which prevents further execution of the Realm using traditional noninterference-based approaches.
and preserves the CCA integrity guarantee. The hypervisor
cannot subsequently map a granule to a previously-backed
IPA within a PAR without Realm permission. 4.1 Mover Oracle Queries
As a system designed to scale to many cores, RMM makes To verify RMM, it is essential to simplify reasoning about pos-
extensive use of fine-grained locking to support a high degree sible interleavings of executions of concurrent software across
of concurrent operation. For example, each memory granule multiple CPUs. For example, RMM uses hand-over-hand lock-
has its own lock so many granule operations can be done ing to synchronize access to RTTs, which are 4-level page
in parallel. Similarly, an RTT is a multi-level page table, for tables, allowing multiple CPUs to manipulate the same page
which each level has its own lock, and hand-over-hand locking table concurrently. Figure 2 shows the steps to allocate dele-
is used to support concurrent operations on RTTs, as discussed gated granules as new level T1, T2, and T3 tables of a Realm’s
in Section 4.1. For example, two Realm VCPUs can each RTT using RTT.Create and then, in step 4, allocate a delegated
cause a stage 2 page fault at the same time but at different granule to the Realm for its data and map its physical address to
IPAs, which can be resolved by the hypervisor in parallel on the leaf-level T3 table using RTT.MapProtected, which would
two CPUs to improve performance. This is a key requirement typically occur on a page fault. Figure 2 also shows how step 4
to support large Realms. Although most of RMM is written uses hand-over-hand locking, in which RMM first acquires T0’s
in C, Arm assembly code is also used to implement memory lock so it can lookup and acquire T1’s lock and release T0’s lock.
accesses with acquire/release semantics where lockless It can then lookup and acquire T2’s lock and release T1’s lock,
concurrent accesses are used for performance reasons, and so it can lookup and acquire T3’s lock and release T2’s lock,
to implement the locking primitives themselves. and finally update T3’s page entry. At the same time, RMM
CCA firmware is designed for security following best prac- running on other CPUs can do other page table operations,
tices. Systems such as Linux map all physical memory to the such as acquiring T0’s lock to work on a different level 1 table.
kernel page table. RMM and EL3M do not. RMM’s own page To verify the page table operations with hand-over-hand
table statically maps code and metadata exclusively accessed locking, we need to reason about the correctness of all
by RMM, such as the GST and locks for each granule. Addi- possible interleavings of operations. However, reasoning
tional entries in RMM’s page table are used to statically assign about all possible interleavings of all operations all at once
a virtual address range to each physical CPU in the system, re- is too difficult to do for a system as complex as RMM. To
sulting in a fixed number of virtual address slots per CPU. Mem- address this problem, VIA introduces mover oracle queries,
ory is then mapped on demand when needed. RMM maps Data a new mechanism that combines the power of local CPU
granules and metadata granules, such as RD and REC, on de- reasoning with mover types [44], building on previous work
mand, and unmaps them once the respective operation is com- on CertiKOS [24–27] and CSPEC [10].
pleted. EL3M’s own page table only statically maps the EL3M To explain how mover oracle queries work, consider first
code, a small fixed size stack, and the GPT; no other memory an explicit multiprocessor machine model, whose machine
is mapped to its page table. Furthermore, SMC parameters are state consists of per-physical CPU private state (e.g., CPU
only interpreted as values in EL3M, never as pointers used to registers) and a global logical log, a serial list of events
access memory. Even if a bug is introduced in some future generated by all CPUs throughout their execution. Instead
version of CCA firmware that is not completely verified, these of explicitly modeling shared objects, events incrementally
defense-in-depth measures make it much harder for a return- convey interactions with shared objects, whose state may be
oriented or jump-oriented programming attack to succeed. calculated by replaying the logical log. An event is emitted
by a CPU and appended to the log whenever that CPU invokes
4 VIA Framework a primitive that interacts with a shared object. Our abstract
machine is formalized as a transition system, where each
Because CCA relies on firmware to guarantee the security step models some atomic computation taking place on a
of Realms, we verify that firmware, namely RMM and single CPU; concurrency is realized by the nondeterministic
EL3M. We prove the CCA firmware implementation refines interleaving of steps across all CPUs. However, reasoning
1 Create 2 Create 3 Create 4 Map Oracle0 Right0 Oracle1 Right1 Oracle2 None2 Oracle3 Left3
Physical Reorder
T0 T1 T2 T3
Memory
Oracle0 Oracle1 Oracle2 Right0 Right1 None2 Left3 Oracle3
Step 4 Acq LD Acq Rel LD Acq Rel LD Acq Rel ST Rel
Map: T0 T0 T1 T0 T1 T2 T1 T2 T3 T2 T3 T3 Merge Oracle
Oracle’0 Right0 Right1 None2 Left3 Oracle’1

Figure 2: Page table creation and hand-over-hand locking execution. Oracle’’0 EVENT0 Oracle’’1 Event Refine

about interleavings directly with multiple CPUs is difficult. Figure 3: Log refinement with mover oracle queries.
To simplify reasoning about all possible interleavings, we events based on the local events’ mover types. By reordering,
instead lift multiprocessor execution to a local CPU model, consecutive oracle queries will be merged to one. Second, we
which distinguishes execution taking place on a particular can prove local sequences of events generated by the machine
CPU from its concurrent environment [27, 36, 42]. All effects refine an aggregate local event generated by a higher-level
coming from the environment are encapsulated by and machine. This refinement can be applied to any arbitrary CPU,
conveyed through an event oracle, which yields events emitted therefore, it applies to all CPUs, so that the entire log of events
by other CPUs when queried. Querying the event oracle can be refines the log of the higher-level aggregate events.
thought of in the context of the explicit multiprocessor machine Figure 3 shows an example of log refinement to reduce in-
model as returning events from the global log generated terleavings of events across CPUs into an atomic event. We
by all other CPUs; only new events since the last query are identify the mover type of each local event, i.e. [Right 0, Right
returned. How the event oracle synchronizes these events is 1, None 2, Left 3], and initially query the oracle before each
left abstract, its behavior constrained only by rely-guarantee event. Based on the mover types, we can reorder all oracle
conditions [35]. Since the interleaving of events is left abstract, queries before the NoneMover to the beginning, and all re-
our proofs do not rely on any particular interleaving of events maining queries to the end, such that the log before and after
and therefore hold for all possible concurrent interleavings. reordering have the same machine behavior. We then define a
A CPU captures the effects of its concurrent environment new oracle that can be queried to return the consecutive events
by querying the event oracle between local CPU steps. A CPU from the previous oracle queries [Oracle 0, Oracle 1, Oracle
only needs to query the event oracle when interacting with 2], allowing those events to be merged into a single oracle
shared objects, since its private state is not affected by these query [Oracle’ 0]. We then refine the local sequence of events
events. In other words, the CPU repeatedly performs two steps [Right 0, Right 1, None 2, Left 3] into a single higher-level
when interacting with shared objects: querying the event oracle aggregate local event EVENT 0. This can be done for all CPUs
to obtain events from other CPUs, then generating a local CPU so we can reason further only using the higher-level aggregate
event. The result is a composite log of events from other CPUs event EVENT 0 with oracle queries Oracle” 0 and Oracle” 1
interleaved with events from the local CPU. This is equivalent that also return higher-level aggregate events, instead of the
to the logical log in the explicit multiprocessor model, but with- many Left/Right/None events of lower-level machine.
out the complexity of directly reasoning about multiple CPUs.
If possible, we would like to move the interleaved event
4.2 Permutation Conditions
oracle queries out of the way of the local CPU events so we can
use sequential reasoning regarding the local execution of any To verify RMM, we must account for the relaxed memory
given CPU. By using mover types, we can identify how we can behavior of the Arm architecture on code that is not data race
reorder event oracle queries with respect to local CPU events free (DRF). For example, Figure 4 shows how a Realm’s
without changing the machine’s behavior. Thus, these queries list of RECs is updated in REC.Create, REC.Destroy, and
are mover oracle queries. We classify all local CPU events in Realm.Destroy without holding a common lock. Each
the composite log as RightMover, LeftMover, or NoneMover. Realm’s RD has a RECLIST (rd->rec_list), an array that
Mover oracle queries can be reordered before a RightMover stores the pointers to all its RECs. The RECLIST can be
and after a LeftMover. For example, acquiring a lock is a Right- referenced from both the Realm’s RD and each of the Realm’s
Mover because if other CPUs do something after acquiring the RECs (rec->rec_list). Each REC records its index in the
lock on the local CPU, they must be able to do the same thing RECLIST (rec->id). RD’s counter keeps tracking of how
before acquiring the lock. The oracle queries which capture the many RECs are in a Realm. The hypervisor must destroy all
other CPUs’ events can be reordered before acquiring the lock. RECs of a Realm before destroying its RD because once RD is
Mover oracle queries cannot be reordered with a NoneMover. destroyed, the Realm can no longer be referenced. Access
For example, an oracle query followed by a NoneMover then to the RECLIST is not synchronized by its own lock, to avoid
a LeftMover cannot be reordered after the LeftMover. potential deadlock issues due to needing to hold multiple locks.
VIA can then reduce the interleaving of events in the log that Instead, in REC.Create, the RD’s lock must be held to insert
need to be considered in two ways, which we refer to as log a new REC in RECLIST to ensure mutual exclusion. However,
refinement. First, we can reorder oracle queries with local CPU in REC.Destroy, the REC’s lock is held instead of the RD’s
Rec.Create(rd, id) { Rec.Destroy(rec) { Realm.Destroy(rd) {
acq(rd->lock) acq(rec->lock); acq(rd->lock); For example, to handle the non-DRF code in Figure 4, we
… … …
(a) if (rd->rec_list[id] == NULL) { (d) rec->rec_list[rec->id] = NULL; (f) if (rd->counter == 0) {
identify P to be when Realm.Destroy finds rd->counter
(b) rd->rec_list[id] = NEW_REC;
(c) atomic_inc(rd->counter);
(e) atomic_dec(rec->rd->counter);
rel(rec->lock);
// rec_list should be EMPTY
(g) destroy(rd->rec_list);
equals 0, rd->rec_list must be empty. This is necessary
… } … because rd->rec_list must be empty when destroying it
rel(rd->lock); rel(rd->lock);
} } in (g), otherwise the system may crash due to reclaiming
Figure 4: Pseudo code of RECLIST data races, marked in bold blue. non-empty memory. Since REC.Create and Realm.Destroy
use the same lock, data races can only occur when either
locks when clearing the REC’s entry from the RECLIST so that runs concurrently with REC.Destroy. We prove each function
multiple CPUs can destroy different RECs of the same Realm always behaves the same on SC and relaxed memory. For
concurrently. Furthermore, the RD’s counter is increased or REC.Create, since (b) and (c) cannot be reordered with (a)
checked in REC.Create and Realm.Destroy while holding due to the branch dependency, as required by Promising Arm,
RD’s lock, but it is decreased in REC.Destroy without holding its possible executions are (a)(b)(c) or (a)(c)(b). Since
any lock. As a result, data races can occur when concurrently (a) confirms that rec_list[id] is empty, all concurrent
executing REC.Destroy with REC.Create or Realm.Destroy. REC.Destroy on other CPUs must destroy slots other than
To address this problem, VIA builds on VRM [57]. VRM id because REC.Destroy will only work if the rec exists,
verifies programs on Arm relaxed memory hardware that which must be a non-empty slot in the rec_list. Therefore,
are DRF except for synchronization methods and virtual swapping (b) and (c) will never change any CPU’s behavior
memory hardware. VRM verifies a program on a sequentially and (a)(c)(b) is equivalent to (a)(b)(c), which is the order
consistent (SC) multiprocessor hardware model, defines and on SC. For REC.Destroy, if (e) executes before (d), P will
proves that a fixed set of conditions hold for the program be broken because when Realm.Destroy checks counter
running on relaxed memory hardware, and proves that the concurrently on other CPUs, it may find counter is 0 but
conditions guarantee that the program has the same behavior rec_list is not empty, as shown below:
on SC and relaxed memory hardware so that its SC proofs also (g) destroy(list)
(e) counter-- (f) counter==0 (d) list[id]=NULL
hold for relaxed memory hardware. (list is not empty)
VIA generalizes this approach for programs that are not
DRF. It ensures that such a program will have the same This was actually a real bug in the prototype implementation
behavior on SC and relaxed memory hardware by first decom- of RMM. Therefore, we must enforce that (d) always executes
posing the program into components that are DRF and not before (e) by adding a barrier between them so it must follow
DRF. Previous work already shows that the DRF components program order as on SC. For Realm.Destroy, the proof is
will have the same behavior on SC and relaxed memory trivial because the branch dependency between (f) and
hardware [57]. VIA then introduces permutation conditions (g) guarantees that they execute in program order as on SC.
P on the non-DRF components such that P can be verified Therefore, this non-DRF code will not generate more behavior
to hold for the program on relaxed memory hardware, and on relaxed memory hardware than on SC.
P can be proven to guarantee that the non-DRF components
will have the same behavior on SC and relaxed memory
4.3 Register Accounting
hardware. Our experience suggests that even for programs
that are not DRF, only a small percentage of the code in these To verify CCA firmware with both C and assembly code, we
programs is not DRF, so non-DRF programs can be verified must account for the interactions of C and assembly code
on relaxed memory hardware by only proving a small number primitives that call one another across language boundaries.
of permutation conditions in practice. This observation holds However, C code hides the details of how it uses CPU registers,
for RMM, in which almost all of the code is DRF. as the use of registers during C code execution is decided by the
VIA uses VRM’s extended Promising Arm model [57] to implementation of specific C compiler used. Although register
model Arm’s relaxed memory hardware, such that P needs to behavior is not expressed by C language semantics, ignoring it
be verified against all instruction permutations of the program causes problems when attempting to verify programs in which
allowed by VRM’s Promising Arm model. Unlike VRM which C and assembly code call one another, as shown in Figure 5,
defines a fixed set of conditions that do not all hold for RMM, which illustrates a real bug in the original prototype RMM
VIA allows any condition P to be specified for non-DRF com- implementation detected during our verification. Existing
ponents that will result in their behavior being in the same on verification approaches cannot support bidirectional calls
SC and relaxed memory hardware and that can be proven to between C and assembly code, such that the example in
hold for the program on relaxed memory hardware. The condi- Figure 5 would be erroneously verified without detecting the
tion is essentially a constraint based on the program’s seman- information leakage [10, 11, 23, 26, 37, 42, 43, 46].
tics that restricts the possible instruction reorderings that can To address this problem, VIA introduces a novel register
occur on relaxed memory hardware so that resulting program accounting mechanism to correctly verify integrated C and
behavior is the same on SC and relaxed memory hardware. Arm assembly code while making minimal assumptions
ENTRY(store_inner): void store_c() { ENTRY(store_outer): u64 sca_read64(u64 *ptr) { u64 sca_read64(u64 *ptr) {
str x5, [x1] int s = secret; mov x5, #0 u64 val; Bind: u64 val;
ret ptr -> I0
// s stored in x5 bl store_c asm volatile( init_pr();
val -> O0
ENDPROC(store_inner) store_inner(); ret “ldr %[val], %[ptr]\n” set_pr(I0, ptr);
} ENDPROC(store_outer) : [val] "=r" (val) asm volatile(
Spec: mem[%x1] = %x5; Spec: mem[%x1] = %x5; WRONG Spec: mem[%x1] = 0; : [ptr] "m" (*ptr) “ldr %O0, [%I0]\n”
); )
return val; } val = get_pr(O0);
Figure 5: An example of incorrectly combining C and assembly
return val; }:
specifications. Assembly function store_outer clears register
To asm prim
x5 to 0, then calls C function store_c. store_c calls assembly
function store_inner, which stores register x5 into memory. ENTRY(sca_read64_inline): u64 sca_read64(u64 *ptr) {
ldr O0, [I0] u64 val;
The intended behavior is that the value 0 will be stored to memory. ret init_pr();
The actual behavior is that x5 stores C temporary variable s which ENDPROC(sca_read64_inline) set_pr(I0, ptr);
sca_read64_inline();
contains secret data, resulting in undetected information leakage. val = get_pr(O0);
return val; }
regarding compiler behavior. VIA leverages the Arm64 Pro-
cedure Call Standard (AAPCS64) [7] to specify how registers Figure 6: Translation of parameterized inline assembly.
are potentially used when assembly code calls a C function atomicity or memory order semantics, as shown in the sca_-
or is called by a C function. It then conservatively marks all read64 example in Figure 6. sca_read64 implements a 64-bit
registers used by C code whose values cannot be determined single-copy-atomic read in one line of assembly code plus an
based on AAPCS64 as of Unknown value, and requires interface, which can specify a list of input registers, output reg-
assembly code to not depend on registers with Unknown values. isters and clobbered registers. VIA translates inline assembly
AAPCS64 constrains how some Arm registers are used. In code into an assembly function according to the interface con-
CCA firmware, C functions pass no more than eight integer straints; "r", "Q", and "m" constraints are currently supported.
or pointer parameters and return an integer or pointer. For such It then checks its correctness like any other assembly function.
functions, AAPCS64 specifies that a C compiler will only pass Translation is done using a set of logical registers I0-In
parameters through registers r0-r7 and save the return value for inputs and O0-On for outputs so that verification does not
in r0. It also specifies registers that must have their values depend on the specifics of GCC register assignment. Input
preserved through a function call, namely all callee-saved registers are defined read only. VIA also defines abstract
registers r19-r29 and the stack register sp. The use of other accessors init_pr, which initializes all logical registers to
general-purpose registers (GPRs) may depend on the specific UNKNOWN, set_pr, which writes to a register, and get_pr,
C compiler implementation. which reads from a register. As shown in Figure 6, the trans-
For an assembly function that calls a C function, VIA checks lated sca_read64 function first calls init_pr for initialization,
that the assembly code does not read any Unknown registers. saves parameters to input registers by calling set_pr, uses the
Legal assembly code can either keep such Unknown registers input and output registers in the assembly code, and gets the
untouched or overwrite them before using them. VIA uses return value from the output register by calling set_pr.
AAPCS64 to model the register behavior of the C function For simplicity, VIA imposes additional requirements
by identifying register r0 as containing the return value, and to guarantee GCC generates correct machine code whose
registers r19-r29 and sp as preserving the values. It marks the behavior is the same as VIA’s translated code. VIA forbids
values of other registers after the C function call as Unknown, inline assembly code from explicitly using any GPRs or goto
including caller-saved registers r1-r18 and the link register lr. labels. For inline assembly with multiple instructions, VIA
For an assembly function that can be called from a C enforces that all output registers are constrained by "&" or
function, VIA checks that its behavior does not depend on "+". Thus, an output-only register never doubles as an input
Unknown registers, and that it obeys AAPCS64 C calling register, and the same register is used for input and output
conventions so that it will not cause unexpected behavior in its of an operand. This avoids any unexpected overlap in the
caller. VIA checks that (1) callee-saved registers r19-r29 and assignment of input and output registers [53].
sp preserve the values; (2) the program counter pc after the call Finally, because assembly code functions may be at the
is equal to lr before the call so the assembly primitive returns interface to outside programs that are untrusted, VIA enforces
like a function call; (3) if the caller expects a return value, r0’s that all register values are not Unknown when returning from
value is never Unknown; and (4) the assembly code behavior those assembly functions. This ensures that there is no unin-
remains the same if we initialize all GPRs to Unknown except tentional information leakage from assembly code functions
for those carrying parameters. The last condition implies that to untrusted programs through registers with Unknown values.
the assembly code does not read any Unknown registers, except
for saving and restoring callee-saved registers.
4.4 Ideal Secure System Model
VIA also supports GNU Compiler Collection (GCC) inline
assembly extensions within a C function. This is used in CCA protects the confidentiality and integrity of Realms’
inline assembly memory accessors in RMM which guarantee private data during their lifetime. Confidentiality means any
Exclusive Type Rule
Regs & Mem
Mem When a Realm accesses an IPA within its PAR but it is Unknown, the
data copy
Realm will copy the data from a special initialization buffer in memory
Realm RMM Hypervisor Realm RMM Hypervisor to exclusive memory before accessing the IPA. This can only be done
once per granule. The buffer is populated before the Realm is activated,
NS granules Other memory Regs NS granules Other memory Regs and cannot be changed once it has been activated.
Real System Ideal System Mem When a Realm accesses an IPA outside of its PAR, it will directly access
memory, not exclusive memory.
Figure 7: The real and ideal secure system model. Reg On any trap from a Realm to the RMM, a Realm exposes the contents
of various exclusive system registers, marking them Unknown, and
change a Realm makes to its private data is only observable marks various timer-related exclusive registers Unknown.
Reg If a trap is due to system register emulation, a Realm will mark a
by that Realm. Integrity means a Realm will not observe any specified exclusive GPR as Unknown.
changes to its private data that it did not make, but does not Reg If a trap is due to a hypercall, a Realm will expose and mark the seven
imply availability; data access should either fail or return exclusive GPRs r0-r6 used for parameter passing as Unknown.
the data previously stored. The confidentiality definition is Reg If a trap is due to an RMM call, a Realm will expose and mark the four
exclusive GPRs r0-r3 used for parameter passing as Unknown.
standard, but the integrity definition allows untrusted software
Table 3: Declassification rules.
to modify a Realm’s private data as long as the Realm does
not observe the change. For example, to reclaim memory
its PAR or it accesses a granule or register that is Unknown.
from Realms, a hypervisor can unmap a Realm’s private data
If it accesses memory outside its PAR, the Realm will access
without the Realm’s permission. This is allowed because the
non-exclusive memory directly. If it accesses a granule or
Realm’s access to the unmapped data will trigger a page fault
register that is Unknown, the data will be copied from a special
so the Realm cannot observe future changes to the data content.
initialization buffer or non-exclusive register, respectively, be-
However, this breaks noninterference, which therefore cannot
fore accessing it. A granule is Unknown if it is not yet initialized.
be used to to prove security as is done for other verified
A register is Unknown if it is used by the Realm to communicate
systems [16, 23, 29, 34, 42, 49, 55].
with RMM or the hypervisor. For example, when a Realm
To address this problem, VIA introduces an ideal/real
invokes a hypercall, it exposes the arguments in registers
paradigm, shown in Figure 7, inspired by the idea from formal r0-r6, which RMM will provide to the hypervisor, then return
verification of separation kernels [22, 30]. The real system the results back in those registers. Marking a granule or register
is defined by the RMM top-layer specification, which builds
as Unknown is used to represent declassification in the model.
on and incorporates EL3M, in which all memory and CPU
We can then use this ideal system model with declassifica-
registers are shared by Realms, RMM, and the hypervisor. The
tion to verify that RMM guarantees Realm confidentiality and
ideal system is defined by an ideal system model specification,
integrity. The key is to establish a simulation relation in which
in which each Realm has its own exclusive memory, and
all machine states are equivalent between the ideal and real
each REC of the Realm has its own exclusive CPU registers,
systems and show that, at any step in the two systems satisfying
while other software can only access the same non-exclusive
the simulation relation, the same data is obtained when access-
memory and registers as in the real system.
ing memory or registers. This involves proving a one-to-one
If each Realm only accesses its exclusive memory and mapping of data between the two systems. With declassifica-
registers in the ideal system, we could then show that RMM tion, the mapping will change such that a different mapping
guarantees confidentiality and integrity by proving that the will be used depending on whether the data is declassified or
real system simulates the ideal system. This would mean that not. For example, if a granule within a Realm’s PAR is not de-
each Realm only accesses its exclusive memory and registers classified, we will want to show that accessing that granule in
in the real system as well, so nothing other than a Realm can non-exclusive memory in the real system correponds to access-
access its own data. However, such a simplistic model does not ing it in exclusive memory in the ideal system to get the same
work in practice. For CCA, we need a model that allows declas- data. On the other hand, if a granule within a Realm’s PAR is
sification so Realms can access NS granules for initialization declassified, because its contents were initialized from an NS
and I/O, and CPU registers can be used to pass parameters granule, we will want to show that first accessing that granule
between Realms and RMM, or Realms and the hypervisor. in non-exclusive memory in the real system correponds to ac-
VIA introduces a new ideal system model for Armv9-A that cessing it in non-exclusive memory in the ideal system since
supports declassification of memory and registers based on the respective exclusive memory is initially Unknown so the
a set of well-designed rules that define when declassification data is first copied from non-exclusive to exclusive memory.
is allowed. The model has six declassification rules, listed in
Table 3. In this model, Realm exclusive memory consists of all
memory in its PAR and exclusive CPU registers consists of all 5 CCA Implementation and Verification
registers accessible by a Realm or that can affect its execution,
such as system registers. A Realm will only access its exclusive We used VIA to verify an early prototype implementation
memory and registers, unless it accesses a granule outside of CCA firmware, which includes both RMM and EL3M as
Description LOC Description LOC AcqT0 Oracle0 LDT0 Oracle1 AcqT1 Oracle2 RelT0
Machine model 1.4K RMM refinement proofs 6.1K Reorder
Lock proof 1.7K Top-level specification 1.1K Oracle0 Oracle1 Oracle2 AcqT0 LDT0 AcqT1 RelT0
EL3M layer specifications .2K Ideal secure system model .2K
Refine Events
EL3M refinement proofs .9K Security simulation proofs 3.4K Walk until level 1: Oracle’0 Walk until level 1 T0 T1
RMM layer specifications 4.4K Permutation condition proofs 1.2K
Total 20.6K
Table 4: Lines of Coq code for verifying CCA firmware. Walk until level 1 T0 T1 Oracle’0 LDT1 Oracle’1 AcqT2 Oracle’2 RelT1
Reorder
Oracle’0 Oracle’1 Oracle’2 Walk until level 1 T0 T1 LDT1 AcqT2 RelT1
described in Section 3. The verification outcomes, including
Refine Events
the discovery of several latent bugs, were confirmed by Arm’s Walk until level 2: Oracle’’0 Walk until level 2 T0 T1 T2
development team and used to further improve the firmware
implementation. RMM contains 3.2K lines of code (LOC) Figure 8: Proving atomicity for page table operations.
in C and .3K LOC in assembly. The runtime critical parts of
EL3M contain .1K LOC in C and .7K LOC in assembly; all into a “destroy level 1 table” event.
of the C code is for updating the GPT. All RMM and EL3M We then refine the procedure of walking the page table until
code is verified, except for the portion of assembly code for acquiring the lock of T2 into an atomic step. We first prove that
initialization (.1K LOC in RMM and .5K LOC in EL3M). For “walk until level 1” is a RightMover because any subsequent
remote attestation, RMM also uses functions provided by a events at this layer from other CPUs can be reordered with it,
crypto library, which was not verified, though a verified crypto i.e., “create level 1 table”, “destroy level 1 table”, “walk until
library could be ported and used instead [42, 61]. level 1”, and acq/rel/LD/ST events for T2 and T3 level tables. A
“create level 1 table” from other CPUs is irrelevant to the local
Table 4 shows our proof effort, measured in LOC in Coq.
“walk until level 1” because it can only create other level 1 ta-
45 abstraction layers were used. The bottom layer machine
bles and cannot overwrite T1 since RMM only allows creating
model is based on VRM’s Promising Arm model [57] to model
a table that does not exist yet. Events “destroy level 1 table” and
Arm’s relaxed memory. Another layer was used to verified the
“walk until level 1” from other CPUs are irrelevant because they
spinlock implementation on the relaxed memory model and
cannot hold T1’s lock so can only access other level 1 tables,
lift it to an SC model. We verify the EL3M implementation
not T1. Other events are also irrelevant because they do not
refines its layered specification through three layers. On top
manipulate T0 and T1 tables. Therefore, “walk until level 1” is
of that, we verify the RMM implementation refines its layered
a RightMover and all subsequent mover oracle queries can be
specification through 39 layers. The top-level specification
reordered before it. Thus, we refine “walk until level 2” into
reflects RMM’s interface, combining both RMM and EL3M
an atomic step, as shown in the bottom of Figure 8. In a similar
functionality. Another layer defines the ideal secure system
fashion, we prove “walk until level 2” to be a RightMover and
model. We verify that the top-level specification simulates the
refine the steps of “walk until level 3.” Continuing in this man-
ideal secure system model.
ner, we eventually refine all RTT operations into atomic steps.
Proving RTT operations to be atomic allows us to prove de-
5.1 Concurrent Multi-level Page Tables sired properties about RMM’s RTT management. The key prop-
erty to prove is that each non-empty entry in the RTTs, includ-
The most challenging refinement proofs were for verifying ing both intermediate entries pointing to lower-level RTTs and
RMM’s RTT implementation. RTT primitives use hand- leaf mappings, uses a unique delegated granule. This prevents
over-hand locking to synchronize access to dynamically page remapping attacks while still allowing fine-grained access
allocated 4-level page tables, allowing fine-grain concurrent to the RTTs for improved performance. The proof is straightfor-
operation on different page table levels. This required nine ward because every operation on an RTT entry is proved to be
layers. We leverage mover oracle queries and log refinement, atomic, only the PA of a delegated granule is used to populate a
discussed in Section 4.1, to refine all of RMM’s page table previously empty RTT entry, and each such granule is guaran-
operations to atomic operations, verifying the correctness of teed to be unused and zeroed. Once a granule is used for an RTT
hand-over-hand locking in a real system for the first time. entry, its state changes from delegated to RTT or Data, prevent-
Figure 8 visualizes the proof. Since acquiring a lock is ing it from being used for other RTT entries. By using mover
a RightMover, releasing a lock is a LeftMover, and reading oracle queries and log refinement, we complete the first proof of
the page table entry is both a LeftMover and RightMover, hand-over-hand locking in a real system, and the first proof of
we can reorder mover oracle queries to refine the procedure a system with fully dynamically allocated shared page tables.
of walking the page table until acquiring the lock of T1 into
an atomic step. We group the local CPU events into a single 5.2 Relaxed Memory
higher-level aggregate “walk until level 1” event. Similarly,
we can group events together from creating a level 1 table into We prove permutation conditions as discussed in Section 4.2 to
a “create level 1 table’ event, and destroying a level 1 table verify the proofs hold on Arm relaxed memory hardware. Veri-
fying CCA firmware only requires six permutation conditions, L2: RMM C primitive invoke SMC return
the RECLIST empty condition discussed in Section 4.2, and five L1: EL3M Asm primitive
EL3M handler EL3M exit
conditions previously introduced by VRM, namely (1) N O -
BARRIER -M ISUSE, (2) T RANSACTIONAL -PAGE -TABLE, L0: EL3M C primitive
GPT Update
(3) S EQUENTIAL -TLB-I NVALIDATION, (4) W RITE -O NCE - Figure 9: Verify RMM and EL3M GPT update operations. Solid
K ERNEL -M APPING, and (5) M EMORY-I SOLATION. N O - arrows represent C code and dashed arrows represent assembly code.
BARRIER -M ISUSE requires that barriers are correctly placed.
We verified that all lock acquisitions have acquire memory se-
Rec.Run
mantics and all lock releases have release memory semantics. Hyp to Realm run Realm handle Realm exit Realm to Hyp
We also proved that memory accesses to shared objects outside run
Realm enter Realm Realm steps Realm trap exit Realm
critical sections have release semantics so that they cannot
be reordered, preserving program ordering and SC behavior. Figure 10: Verify REC.Run and its inner run_realm loop. Solid
arrows represent C code and dashed arrows represent assembly code.
T RANSACTIONAL -PAGE -TABLE requires that shared page
table writes within a critical section are transactional. This
RMM
ensures that page table writes will not result in any behavior on handler Hyp to EL3M EL3M to RMM handle_ns_smc RMM to EL3M EL3M to Hyp
relaxed memory hardware that cannot be produced on an SC
Figure 11: Verify rmm_handler in the top layer. Solid arrows
model. In RMM and EL3M, each critical section contains at
represent C code and dashed arrows represent assembly code.
most one page table write, so they are obviously transactional.
S EQUENTIAL -TLB-I NVALIDATION requires that a page
table unmap or remap be followed by a TLB invalidation, 5.3 C and Assembly Code Integration
with a barrier between them. This precludes relaxed memory
behavior in TLB management code. There are no remaps in Another key aspect of the refinement proofs was verifying the
RMM or EL3M. We verified that all page table unmaps are interactions between RMM and EL3M, RMM and Realms, and
followed by a TLB invalidation with a barrier between them. RMM and the hypervisor, which required the C and assembly
W RITE -O NCE -K ERNEL -M APPING requires that if RMM code integration techniques discussed in Section 4.3. For
or EL3M’s own page tables are shared, they can only be written RMM and EL3M, we verified the correctness of GPT updates.
once—only empty page table entries can be modified. This Figure 9 shows how to verify a C primitive in RMM which is-
precludes relaxed memory behavior due to out-of-order reads sues an SMC to EL3M to update the GPT. Layer L0 verifies the
of these page tables. For EL3M, this holds as it uses a statically C code for EL3M’s GPT operations. Layer L1 verifies EL3M’s
reserved hardcoded page table shared across all CPUs that is assembly code handler, which handles traps from RMM and
never changed after booting. For RMM, although its kernel calls the GPT operations in C. Finally, layer L2 verifies the
page table is shared across all CPUs and can be changed, we C code in RMM that traps to EL3M’s assembly code handler.
prove that it is logically partitioned into two tables, as discussed For RMM and Realms, we verified REC.Run, which runs
in Section 3. We prove one table is shared but never changed a VCPU of a Realm and required five layers. Figure 10 shows
once initialized, and the other table is not shared because it this C primitive, which calls the run_realm assembly code
is statically divided into per-CPU ranges private to each CPU. primitive, which restores the Realm’s VCPU contexts and
M EMORY-I SOLATION requires that the memory space ac- enters the Realm. We proved that all GPRs are correctly
cessible by RMM and EL3M is partially isolated with Realms restored such that there is no information leakage from RMM
and NS hypervisors. This ensures that any relaxed memory to the Realm through registers with Unknown values.
behavior of Realms or NS hypervisors cannot be propagated to For RMM and the hypervisor, we verified the RMM
RMM or EL3M. We verify that Realms and the hypervisor will handling of RMI calls from the hypervisor. Figure 11 shows
only access Data and NS granules. Realms’ memory accesses when the hypervisor invokes an RMI call, it traps to EL3M first,
are managed by RTTs, We prove RTTs will only map Data then jumps to RMM and calls the C function handle_ns_smc
granules and NS granules. A hypervisor’s memory accesses are to execute the RMI call. Eventually, RMM returns to EL3M
controlled by the GPT. We prove all delegated granules are in and then the hypervisor. We proved that when returning to the
the Realm PAS state in the GPT so the hypervisor cannot access hypervisor, there is no information leakage to the hypervisor
them. We further prove that RMM and EL3M behavior do not through GPRs with Unknown values.
rely on what Realms or the hypervisor may do with Data or NS
granules. We prove EL3M never accesses memory other than 5.4 Security
its own, RMM will not access the contents of Data granules,
and whenever RMM accesses NS granules, it may obtain arbi- We prove that the real system specified by the RMM top-level
trary data because the hypervisor can make arbitrary changes specification simulates the ideal system model with declassi-
to the data. Thus, we show RMM’s proof on SC does not rely fication, as discussed in Section 4.4. We discuss the simulation
on the concrete implementation of Realms or NS hypervisors. relation in three parts: all machine states except for Data
granules, CPU registers, and VCPU contexts stored in REC declassified. Exclusive registers are not involved in context
granules (Rel 1), Data granules (Rel 2), and CPU registers and switches. We then prove that RMM indeed correctly saves and
VCPU contexts (Rel 3). Each relation is proved by induction, restores Realms’ VCPU contexts, so that Rel 3 is preserved.
in which we assume the relation is initial true at machine Finally, we note that our simulation proofs between the
boot and prove that it is preserved during RMM, hypervisor, real system and ideal secure system model verify Realm
and Realm execution so that the same data is obtained when confidentiality and integrity without even trusting the
accessing memory or registers in both real and ideal systems. correctness of the RMM or EL3M specifications. The proofs
We prove that Rel 1 is preserved during execution and only need to trust the specification of the ideal secure system
all data accessed from memory is the same. Rel 1 concerns model, which encodes the declassification rules and consists
NS granules, delegated granules, and granules containing of only .2K LOC in Coq. Furthermore, as shown in Table 3,
Realm metadata including RTTs, none of which involve the declassification rules only allow a Realm to disclose its
declassification. We prove two invariants: (1) all RTTs only data in two ways, by writing NS granules outside of its PAR
map IPAs within the respective Realm’s PAR to Data granules or via the eight GPRs used for hypercalls, making the security
and IPAs outside its PAR to NS granules; and (2) the GPT only policy formalization easy to understand.
labels NS granules in the NS PAS while all delegated granules
are labeled in the Realm PAS. The first invariant ensures that
Realms will only access Data and NS granules, and the former
5.5 Bugs Found
will not affect Rel 1. The second invariant ensures that the We identified several bugs in the CCA firmware prototype
hypervisor can only access NS granules. Since Realms and implementation during verification. Through refinement
the hypervisor access NS granules in the same non-exclusive proofs, we detected common bugs such as incorrect boundary
memory in both real and ideal systems, they will obtain the checking for some variables and misuse of locks; some
same data. All other granules for Rel 1 can only be accessed locks were released without previously holding them. More
by RMM. Since RMM accesses NS and other granules in the importantly, verification of C and assembly code integration
same non-exclusive memory in both real and ideal systems, identified a serious security bug that neither EL3M nor RMM
it will obtain the same data; the VCPU contexts that are part clear the caller-saved registers when returning to the hypervi-
of REC granules are excluded here and considered in Rel 3. sor. These registers may carry RMM’s private execution states
We prove that Rel 2 is preserved during execution. The and leak information. For example, RMM saves and restores
invariant above ensures that the hypervisor cannot access Data Realms’ VCPU contexts, and some contexts may remain in
granules, and we prove that RMM does not access Data gran- caller-saved registers and leak to the untrusted hypervisor.
ules, so Rel 2 is preserved for both the hypervisor and RMM. Another bug identified was in the REC execution handler. The
Data granules are only accessed by Realms. From Rel 1, the hypervisor provides an NS granule to communicate entry
RTTs must be the same in both real and ideal systems. If an and exit information with RMM. RMM locks and checks
RTT maps an ipa within a Realm’s PAR to a Data granule at that the given granule is indeed an NS granule, accesses its
host physical address hpa, the Realm will access the same data contents, unlocks the granule, and enters the Realm. However,
at exclusive memory ipa in the ideal system as at hpa in the when exiting from the Realm, RMM did not lock and check
real system, so Rel 2 is preserved. To ensure that an hpa cannot the granule state before accessing it. This may lead to RMM
be mapped to ipas in different Realms, we prove an invariant unexpectedly receiving a Granule Protection Fault (GPF) from
that if an RTT maps ipa to hpa, then the Data granule at hpa the hardware when accessing the granule using the NS PAS,
inversely maps to (Realm, ipa). Because there is a one-to-one if the granule was delegated by another CPU. This could lead
mapping for each Data granule to (Realm, ipa), any changes at to a denial of service of RMM or have worse consequences
hpa can only be observed by the specific Realm at the specific if GPF handling was not properly implemented in RMM.
ipa as is the case in the ideal system, so Rel 2 is preserved for Through permutation condition proofs, we identified
all other data. If an an ipa within a Realm’s PAR is Unknown, an RMM bug that REC.Destroy does not implement
the Realm will access the same data at non-exclusive memory “counter−−” with the release semantics (instruction (e) in
hpa in the ideal and real system, so Rel 2 is preserved. Figure 4) such that it can be reordered with (d) on Arm’s relax
We prove that Rel 3 is preserved during execution. We prove hardware. This may cause Realm.Destroy to wrongly set the
if a Realm’s VCPU V is running, its register r in the real system RECLIST to be reusable before REC.Destroy clears it because
equals the corresponding exclusive register r if not Unknown or when counter is zero, all RECs in the list should have been
the non-exclusive register r if Unknown in the ideal system. We destroyed, which was not true due to this relaxed memory bug.
prove if a Realm’s VCPU V is not running, V’s REC context of r Through security proofs, we identified an RMM bug that
in the real system equals the corresponding exclusive register allows the hypervisor to create two Data granules for the same
r if not Unknown or the V’s REC context of r if Unknown in the memory address of a Realm. Thus, RMM can unmap one Data
ideal system. In the ideal system, Realm’s register data is granule from an IPA of a Realm and map another Data granule
always stored in the exclusive registers except for those being to the same IPA, violating the Realm integrity guarantee,
because the Realm could observe a change in Realm data not between NS and Realm worlds is mimicked by modifying
caused by a Realm memory access. EL3M to switch between two separate contexts within NS
world. EL3M is further modified to support the RMI as well as
5.6 CCA KVM handle GPT update requests from RMM. We did not include
EL3M code that controls GPT registers as they do not exist
CCA provides a standard application binary interface (ABI) on the N1SDP, but all data written to the GPT memory can
to allow hypervisors to communicate their intents to RMM via be done, although without any effect.
RMI commands, which is suitable for adoption by commodity This setup necessarily will have some performance differ-
hypervisors. However, existing hypervisors do require some ences from real CCA hardware, but it provides a useful approx-
modifications to use CCA to support Realm VMs. Regardless imation of actual Realm performance. The cost of GPT checks
of whether a hypervisor is modified to use CCA, it cannot by CCA hardware are not included since no GPT hardware is
compromise the confidentiality and integrity of Realms. available, but are expected to exhibit good caching behavior
Without modifications, existing hypervisors cannot run Realm and will not affect the relative performance of VMs versus
VMs, but can still run non-Realm VMs. Realm VMs since they apply equally in NS and Realm worlds.
We modified the Linux KVM hypervisor to use CCA, The cost of some hypervisor operations, such as those that
which we refer to as CCA KVM. The modifications involved require exiting to userspace, will be overly conservative as
roughly 3K LOC in C to KVM, including .5K LOC for RMI controlling timer interrupt behavior requires those operations
commands, .4K LOC for handling exits from Realms, .8K to write to the Arm Generic Interrupt Controller (GIC) on the
LOC for creating and destroying Realms, and 1.1K LOC for N1SDP which is slow, whereas real CCA hardware will have
stage 2 page table management using RMI commands. The system registers that can be used by RMM to achieve the same
modifications also required roughly .5K LOC in C to QEMU, functionality. Finally, the current prototype lacks support for
mostly related to VM boot, initialization, and exit handling. directly injecting virtual interrupts without hypervisor interven-
Finally, roughly 40 LOC in C of modifications to the virtio tion, which is expected to be available in future CCA hardware.
driver in the Linux guest kernel were required so that it uses We ran both microbenchmark and application workloads in
a bounce buffer to communicate I/O data with the hypervisor. VMs on unmodified KVM and CCA KVM in Linux 5.12 on the
This is needed because the ring buffer normally used by the N1SDP, which has two dual-core 2.6 GHz Neoverse N1 CPUs,
virtio driver in the VM is in memory not accessible to the 6 GB RAM, a 240 GB SATA3 SSD and a Intel 82574L 1 Gbps
hypervisor when using Realms. Our experience with KVM NIC. We used QEMU 4.2.0 [8] to run VMs, with the modi-
indicates that the modifications required for a commodity fications discussed in Section 5.6 to support CCA KVM. VMs
hypervisor to use CCA are quite modest and involve changes were run using KVM or CCA KVM with 4 cores and 1 GB
to a very small percentage of its existing codebase. RAM with the VM capped at 2 VCPUs and 512 MB RAM;
VCPUs were pinned to individual cores. VHOST networking
6 Performance Evaluation was used and virtual block storage devices were configured
with cache=none [28, 38, 56]. Arm VHE [6, 17, 18] was used
We have run the CCA software stack, including RMM, for all measurements. For client-server workloads, clients
EL3M, and modifications to the Linux KVM hypervisor to ran on an x86 machine with a 16-core Intel Xeon E5-2690
use Realms, on an Arm Fast Model which implements the 2.9 GHz CPU, 378 GB RAM and an Intel I350 1 Gbps NIC,
Realm Management Extensions (RME) CPU architecture. connected to the N1SDP via a Linksys LGS108 1 Gbps switch.
The Fast Model is a valid software emulation of the CPU
architecture, allowing us to demonstrate that the CCA software 6.1 Microbenchmarks
stack provides the desired security guarantees and system
functionality. However, Fast Models do not provide any cycle We ran KVM unit tests [39], which execute common
accurate measure of real performance and are too slow to run micro-level hypervisor operations, plus an additional system
real application workloads. While CCA will be available in register access microbenchmark, as listed in Table 5. For each
Armv9-A, Armv9-A hardware is not yet available. test, we ran it 216 times and report the average latency. Table
To provide a preliminary measure of CCA performance, we 6 shows the microbenchmark measurements in nanoseconds
have ported the CCA software prototype to run on currently for unmodified KVM and CCA KVM. The measurements
available Arm hardware, an Arm N1 System Development show that the security benefits of CCA design do come with
Platform (N1SDP) [5] with an Armv8.2-A Neoverse N1 SoC. a performance cost on most micro-level hypervisor operations,
This version of EL3M is based on the the Trusted Firmware-A because the cost of transitioning between a VM and the
(TFA) codebase. The N1SDP does not provide GPT or Realm hypervisor is much more expensive on CCA KVM than
world hardware, so it cannot enforce the security guarantees unmodified KVM, which is most clearly shown for Hypercall.
of Realms, but we can use it to mimic the performance costs Hypercall simply traps from the VM to the hypervisor in
of Realms by modifying the EL3M code. Context switching EL2 and returns for KVM, but involves additional operations
Name Description Benchmark Hypercall I/O Kernel I/O User Virtual IPI Sysreg
Hypercall Trap from a VM to the hypervisor and return to the VM imme- KVM 362 549 1,761 1,806 437
diately. Measures base transition cost of hypervisor operations. CCA KVM 1,865 2,060 4,049 4,324 70
I/O Kernel Trap from a VM to the emulated interrupt controller in the host Table 6: Microbenchmark performance (ns).
OS kernel and return to the VM. Measures cost of accessing I/O
devices supported in kernel space.
I/O User Trap from a VM to read the device ID of virtio mmio device then dominated by the transition cost between VM and hypervisor.
return to the VM. Measures base cost of operations that access The one microbenchmark that is much faster on CCA KVM
I/O devices emulated in user space. than KVM is Sysreg. Accessing system registers is roughly
Virtual IPI Issue virtual IPI to another VCPU on a different CPU. Measures
time from sending virtual IPI until receiving VCPU handles it.
5 times as expensive on KVM versus CCA KVM. On CCA
Sysreg Trap from a VM to emulate access to system register ID_- KVM, RMM handles this register access directly without
AA64PFR0_EL1 in the hypervisor and return to the VM. Measures returning to the hypervisor. RMM’s system register trap han-
system register access cost. dling mechanism is simpler than KVM’s because it does not
Table 5: Microbenchmarks. need to support KVM’s more general hypervisor functionality
that requires synchronizing accesses to hypervisor-related
for CCA KVM: (1) trap from VM in EL1 to RMM in EL2; data structures and additional conditional checks.
(2) map NS granule to copy exit info to NS world, unmap
granule; (3) trap from RMM to EL3M in EL3; (4) save Realm
context, restore NS context; (5) exception return from EL3M to 6.2 Application Benchmarks
hypervisor in EL2; (6) trap from hypervisor to EL3M in EL3; We next ran the application benchmarks listed in Table 7 to
(7) save NS context, restore Realm context; (8) exception return measure performance on more realistic workloads. We also ran
from EL3M to RMM in EL2; (9) map NS granule to copy entry the workloads on native hardware running the same kernel to
info from NS world, unmap granule; (10) map and read data in provide a baseline for comparison, restricting the system to use
REC and RD granules, unmap granules; (11) exception return 2 CPUs and 512 MB RAM to provide a comparable configura-
from RMM to VM in EL1. The additional operations result tion to the VMs. For each platform, we ran each workload 50
in Hypercall costing an additional 1.5 µs on CCA KVM than times and measured the average, worst, and best performance.
vanilla KVM. Roundtrip transitions between RMM and the hy- Figure 12 shows the average performance for each
pervisor take roughly 700 ns, and roundtrip transitions between benchmark for unmodified KVM versus CCA KVM, with
the VM and RMM take roughly 60 ns. Saving and restoring sys- error bars indicating worst and best performance. Performance
tem registers when transitioning between the VM and RMM was normalized to average native execution on the N1SDP
takes roughly 200 ns per transition, or 400 ns total. The four hardware; lower is better. Unlike microbenchmark perfor-
map/unmap operations take roughly 100 ns each, 400 ns total. mance, the application benchmark performance shows that
The remaining roughly 250 ns is due to other bookkeeping CCA KVM and KVM have much more modest performance
code, including saving and restoring GPRs and error checking. differences on more realistic workloads.
I/O Kernel and I/O User include the same transition from CCA KVM has less than 8% overhead versus unmodified
the VM to the hypervisor and back as the Hypercall, so they KVM for most workloads, but in the worst case, overhead
also require more than 1.5 µs to execute on CCA KVM was 18% for MongoDB, an I/O intensive workload. The I/O
than vanilla KVM. Although the difference between CCA intensive workloads have higher overhead for a couple reasons.
KVM and vanilla KVM is roughly 1.5 µs for I/O Kernel, the The main reason is because the VM exits more frequently, so
difference for I/O User is roughly 2.3 µs. This is because on the cost of exits has a more significant impact on performance.
the N1SDP, CCA KVM must write to the GIC when going Exits are more expensive on CCA KVM as shown by the
to userspace, which is quite slow and takes an extra 800 ns. Hypercall microbenchmark results in Table 6, in which an exit
Virtual IPI is more expensive on CCA KVM versus vanilla to the hypervisor costs an extra 1.5 µs. If there are many exits
KVM because it involves multiple transitions between a VM as will be case for I/O intensive workloads, this additional
and the hypervisor. Sending the virtual IPI involves the source cost can become significant. For example, Memcached incurs
vCPU writing to a system register, causing a trap to the RMM, roughly a million VM exits to the hypervisor. This results in
which forwards the operation to the hypervisor (1). The hyper- roughly 1.5 s of additional overhead, or .75 s of overhead per
visor issues a physical IPI to the CPU running the destination core if the exits are split evenly across cores for a VM with
vCPU, then returns to the source vCPU (2). The physical 2 VCPUs. Memcached takes 9 s to run on vanilla KVM, so
IPI causes an exit from the destination vCPU (3). On taking this is 8% overhead due to the extra latency for exits on CCA
this exit, the hypervisor detects that there is a pending virtual KVM, which roughly matches the actual overhead measured
IPI, and returns to the destination vCPU (4). Of these four for Memcached on CCA KVM versus vanilla KVM.
transitions, approximately two occur in parallel, so the cost is A secondary reason is because CCA KVM needs to use a
roughly twice that of a Hypercall on CCA KVM for the transi- bounce buffer while vanilla KVM does not. CCA KVM needs
tions, plus the cost of the actual operation. Because Hypercall a bounce buffer to support virtio because Realm memory is
is much faster for unmodified KVM, its Virtual IPI cost is not protected from the hypervisor. KVM uses the default virtio
Name Description
Apache Apache server v2.4.41 handling 100 concurrent requests via 1.4
KVM CCA KVM
TLS/SSL from remote ApacheBench [1] v2.3 client, serving 1.2
the index.html of the GCC 7.5.0 manual.
Hackbench Hackbench [54] using Unix domain sockets and 20 process 1.0
groups running in 500 loops. 0.8
Kernbench Compilation of the Linux kernel v4.18 using allnoconfig for
Arm with GCC 9.3.0. 0.6
Memcached Memcached v1.5.22 handling requests from a remote 0.4
memtier [51] v1.2.11 client with default parameters.
MongoDB MongoDB server v3.6.8 handling requests from a remote 0.2
YCSB [14] v0.17.0 client running workload A with 16 0.0
concurrent threads and operationcount=500000. he ch ch ed B
Apac Hackben Kernben Memcach MongoD MySQ
L Redis
MySQL MySQL v8.0.27 running sysbench v1.0.11 with 32 concurrent
threads and TLS encryption.
Redis Redis v4.0.9 server handling requests from a remote redis- Figure 12: Application benchmark performance.
benchmark client (redis-tools v5.0.7) [52] running GET/SET
with 50 parallel connections and 12 pipelined requests. level of VMs with similar threat models to CCA. The initial ver-
Table 7: Application benchmarks. sion of SEV ensured confidentiality by encrypting VM memory
at runtime, but did not ensure memory data integrity, which has
mechanism to directly access VM memory, so it does not been utilized as an attack vector such that a compromised hy-
require bounce buffers and does not need to perform the addi- pervisor can tamper with or steal private VM data [31,40,47,48,
tional data copying. Since KVM can also be configured to use a 60]. Secure Nested Paging (SNP) [3] now provides the previ-
bounce buffer, we also measured KVM with this configuration ously missing integrity protection capability. SEV-SNP allows
to isolate the impact of using a bounce buffer on performance. an untrusted hypervisor to directly manage NPTs, but checks
The overhead with versus without a bounce buffer was negligi- accesses against a reverse map table, an additional data struc-
ble in most cases, but in the worst case as high as 3-4% for the ture managed by a security co-processor. In contrast, Intel TDX
more disk I/O intensive workloads, MongoDB and MySQL. runs a TDX module in a privileged SEAM (Secure-Arbitration
We expect the overheads for I/O intensive workloads on Mode) root CPU mode. The firmware manages NPTs used by
real CCA hardware to be less than what we measured on the protected VMs in response to requests issued by the untrusted
N1SDP hardware. Exits are expected to occur less frequently hypervisor. Unlike CCA, the security of SGX, SEV, SEV-SNP
on real CCA hardware when support for direct virtual interrupt and TDX relies on complex implementations in unverified
injection is added. Exits that go to userspace are expected to microcode and firmware [12, 15]. They are difficult to update,
cost less on real CCA hardware as the expensive GIC writes either to patch security flaws or introduce new features.
required for N1SDP hardware will be eliminated, though this Komodo [23] draws on ideas from SGX, but is implemented
was not a dominant factor in our results with the use of VHOST as a software monitor in verified Arm assembly code on
networking. This cost can be further mitigated by using device top of TrustZone instead of requiring hardware to support
passthrough instead of paravirtual I/O, which will largely complex enclave-manipulation instructions. This avoids
avoid these exits and their associated performance overhead. hardware complexity and enables deployment of new enclave
Support for Realm device passthrough will be added to future features independently of CPU upgrades. Komodo does not
CCA hardware. Overall, our measurements indicate that support multiprocessor execution, largely due to the challenge
CCA’s security guarantees can be delivered with acceptable of verifying low-level concurrent code. CCA retains the
performance overheads for real application workloads. advantages of Komodo’s approach by relying on a verified
software monitor to implement Realms, but supports verified
VM protection and multiprocessor execution.
7 Related Work The idea of retrofitting a commodity hypervisor so that
its security guarantees are enforced by a small trusted core
Hardware-enforced trusted execution environments have was first explored by SeKVM [41–43, 57]. SeKVM was
become an important feature of major computer architectures. the first to show how this retrofitting approach, known as
Arm TrustZone [4] can be used to statically partition and isolate microverification, makes it possible to verify that a commodity
a memory region in Secure world, but most implementations hypervisor guarantees the confidentiality and integrity of VMs.
only support a small number of such memory regions, limiting CCA allows hypervisors to be modified to support Realm
its scalability. Intel Software Guard Extensions (SGX) [33] can VMs, whose confidentiality and integrity are protected by
be used by application developers to protect userspace memory a verified monitor, reminscient of SeKVM. While SeKVM
from other programs, including a potentially malicious OS uses existing Arm hardware, CCA introduces new hardware
or hypervisor. SGX is not suitable for securing VMs. mechanisms that protect VMs from untrusted software running
AMD Secure Encrypted Virtualization (SEV) [2] and Intel in both NS and Secure world, and allow hypervisors to make
Trust Domain Extensions (TDX) [32] provide protection at the full use of Arm virtualization features such as VHE for better
performance. Furthermore, CCA firmware is designed to be applied to RMM given the definition of data integrity and
support a higher degree of scalability and concurrent operation confidentiality supported by Realms. While most of these
by allowing data races, leveraging fine-grain synchronization, approaches rely on some static partitioning of memory to
and enabling the hypervisor to provide fully dynamic memory simplify their noninterference proofs, RMM imposes no such
allocation for all VM-related metadata. scalability limitations. The ideal/real simulation paradigm
While verifying CCA firmware required new VIA veri- has been used to verify information-flow security of a simple
fication techniques, many of them build on previous work. 750 LOC two-user uniprocessor separation kernel without
Various concurrent systems have been verified, including Cer- page tables [22], but we show for the first time how it can
tiKOS [26, 27, 45], SeKVM, and CMAIL using CSPEC [10]. be applied in the presence of declassification to verify data
CertiKOS and SeKVM support sequential reasoning with confidentiality and integrity of a real system that supports
a local CPU model and encapsulate other CPUs’ behavior modern multiprocessor and MMU hardware with page tables.
by rely/guarantee conditions, but do not support reordering
using mover types, making proving hand-over-hand locking
infeasible. Although hand-over-hand locking can theoretically 8 Conclusions
be proved using rely/guarantee reasoning [58], the approach
Arm CCA is the first confidential compute architecture backed
is not machine-checkable or scalable to a real system like
by verified firmware that is correct and secure. CCA introduces
RMM. CSPEC provides proof patterns with mover types,
Realms, secure execution environments that protect the
but lacks a local CPU model and does not verify C code;
confidentiality and integrity of VMs against untrusted system
it offers little help for RMM code not reducible by movers
software such as hypervisors. Realms are made possible by
(e.g. REC.Destroy in Figure 4) that still need rely/guarantee
hardware support for Realm world, a new physical address
reasoning to verify. VIA builds on CertiKOS, SeKVM, and
space for Realms inaccessible to untrusted system software,
CSPEC to combine a local CPU model with mover types.
and a firmware monitor that runs in Realm world to control
Some programs have been previously verified on relaxed
CCA hardware to secure and manage Realms, including
memory hardware. Armada [46] supports verifying programs
handling requests from untrusted hypervisors to create Realms,
on the x86-TSO memory model, but their approach of verifying
run Realms, and allocate memory to Realms. This design
the entire program on a relaxed memory model has not been
maintains compatibility with the Arm architecture without
shown to scale to real systems such as RMM. VRM [57] in-
introducing complex hardware mechanisms by relying on
stead allows proofs on an SC model to hold on relaxed memory
firmware, and avoids complexity in the firmware by relying
hardware by ensuring certain conditions hold, making possible
on existing hypervisors to provide virtualization functionality.
the verification of SeKVM, the first machine-checked proof
We formally verified CCA firmware, demonstrating
for concurrent systems software on Arm relaxed memory hard-
the feasibility of relying on trustworthy firmware for the
ware. VIA generalizes VRM to arbitrary non-DRF programs.
security guarantees of the architecture. We introduced various
Verifying programs with both C and assembly code has been
verification techniques to make it possible to verify for the first
done to varying degrees, but none support bidirectional calls
time concurrent firmware with data races running on relaxed
between them. seL4 [37] verifies C code, but its assembly code
memory hardware, fine-grain synchronization such as hand-
is unverified. CertiKOS relies on a verified x86 C compiler to
over-hand locking, dynamically allocated shared multi-level
verify assembly primitives invoking C primitives by compiling
page tables, and integrated C and assembly code. We also
the invoked C primitives into assembly primitives, but cannot
prove the security guarantees despite untrusted software being
verify C primitives that invoke assembly primitives. Since
in full control of resource allocation decisions. The proof only
no verified Arm C compiler exists, this approach cannot be
needs to trust roughly two hundred lines of Coq specification,
used for CCA. SeKVM verifies C and Arm assembly code
making the formal security guarantees easy to read and
separately, but does not link the proofs, in part because no
understand. CCA provides its security guarantees with only
verified Arm C compiler exists. Komodo is written entirely
modest performance overhead compared to running VMs with
in assembly code which is then verified, but this is difficult to
the Linux KVM hypervisor without verified VM protection.
scale to a large system as it is hard to write and maintain a large
codebase in assembly. Ironclad [29] conducts verification
at the assembly level by compiling programs in a high-level 9 Acknowledgments
language down to assembly. This is also difficult to scale as
it is harder to verify the much larger generated assembly code Andrew Baumann and Charles Garcia-Tobin provided helpful
than the original high-level language implementation. VIA comments on earlier drafts. This work was supported in part by
allows most proofs to be done at the C level while verifying Arm, OPPO, an Amazon Research Award, a Guggenheim Fel-
interactions between C and assembly code are safe. lowship, DARPA contract N66001-21-C-4018, and NSF grants
Noninterference has been frequently used to prove CCF-1918400, CNS-2052947, and CCF-2124080. Ronghui
information-flow security [16,23,29,34,42,49,55], but cannot Gu is the Founder of and has an equity interest in CertiK.
References 2019), pages 243–258, Huntsville, ON Canada, October
2019.
[1] ab, The Apache Software Foundation. http://
httpd.apache.org/docs/2.4/programs/ab.html, [12] Anrin Chakrabortid, Reza Curtmola, Jonathan Katz,
April 2015. Jason Nieh, Ahmad-Reza Sadeghi, Radu Sion, and
Yinqian Zhang. Cloud Computing Security: Foundations
[2] Advanced Micro Devices. Secure Encrypted Virtualiza-
and Research Directions. Foundations and Trends in
tion API Version 0.16. https://support.amd.com/
Privacy and Security, 3(2):103–213, February 2022.
TechDocs/55766_SEV-KM%20API_Spec.pdf,
February 2018. [13] Hao Chen, Xiongnan (Newman) Wu, Zhong Shao,
[3] Advanced Micro Devices. AMD SEV-SNP: Joshua Lockerman, and Ronghui Gu. Toward Compo-
Strengthening VM Isolation with Integrity Protec- sitional Verification of Interruptible OS Kernels and
tion and More. https://www.amd.com/system/ Device Drivers. In Proceedings of the 37th ACM SIG-
files/TechDocs/SEV-SNP-strengthening-vm- PLAN Conference on Programming Language Design
isolation-with-integrity-protection-and- and Implementation (PLDI 2016), pages 431–447, Santa
more.pdf, January 2020. Barbara, CA, June 2016.

[4] ARM Ltd. ARM Security Technology Build- [14] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu
ing a Secure System using TrustZone Technol- Ramakrishnan, and Russell Sears. Benchmarking Cloud
ogy. https://documentation-service.arm.com/ Serving Systems with YCSB. In Proceedings of the 1st
static/5f212796500e883ab8e74531, April 2009. ACM Symposium on Cloud Computing (SoCC 2010),
pages 143–154, Indianapolis, IN, June 2010.
[5] ARM Ltd. Arm Neoverse N1 Core Technical Ref-
erence Manual. https://developer.arm.com/ [15] Victor Costan and Srinivas Devadas. Intel SGX
documentation/100616/0400/, April 2019. Explained. Cryptology ePrint Archive, Report 2016/086,
[6] ARM Ltd. Virtualization Host Extensions. https: January 2016. https://ia.cr/2016/086.
//developer.arm.com/documentation/102142/
[16] David Costanzo, Zhong Shao, and Ronghui Gu. End-
0100/Virtualization-Host-Extensions,
to-End Verification of Information-Flow Security for
January 2019.
C and Assembly Programs. In Proceedings of the 37th
[7] ARM Ltd. Procedure Call Standard for ACM Conference on Programming Language Design
the Arm R 64-bit Architecture (AArch64). and Implementation (PLDI 2016), pages 648–664, Santa
https://github.com/ARM-software/abi- Barbara, CA, June 2016.
aa/releases/download/2022Q1/aapcs64.pdf,
April 2022. [17] Christoffer Dall, Shih-Wei Li, Jin Tack Lim, Jason
Nieh, and Georgios Koloventzos. ARM Virtualization:
[8] Fabrice Bellard. QEMU, a Fast and Portable Dynamic Performance and Architectural Implications. In
Translator. In Proceedings of the USENIX 2005 Annual Proceedings of the 43rd International Symposium on
Technical Conference, FREENIX Track (FREENIX Computer Architecture (ISCA 2016), pages 304–316,
2005), pages 41–46, Anaheim, CA, April 2005. Seoul, South Korea, June 2016.
[9] Edouard Bugnion, Jason Nieh, and Dan Tsafrir. Hard-
[18] Christoffer Dall, Shih-Wei Li, and Jason Nieh. Opti-
ware and Software Support for Virtualization. Synthesis
mizing the Design and Implementation of the Linux
Lectures on Computer Architecture. Morgan and
ARM Hypervisor. In Proceedings of the 2017 USENIX
Claypool Publishers, February 2017.
Annual Technical Conference (USENIX ATC 2017),
[10] Tej Chajed, M. Frans Kaashoek, Butler Lampson, and pages 221–234, Santa Clara, CA, July 2017.
Nickolai Zeldovich. Verifying concurrent software using
movers in CSPEC. In Proceedings of the 13th Symposium [19] Christoffer Dall and Jason Nieh. KVM/ARM: Experi-
on Operating Systems Design and Implementation (OSDI ences Building the Linux ARM Hypervisor. Technical
2018), pages 306–322, Carlsbad, CA, October 2018. Report CUCS-010-13, Department of Computer Science,
Columbia University, June 2013.
[11] Tej Chajed, Joseph Tassarotti, M. Frans Kaashoek, and
Nickolai Zeldovich. Verifying concurrent, crash-safe [20] Christoffer Dall and Jason Nieh. Supporting KVM on
systems with Perennial. In Proceedings of the 27th ACM the ARM Architecture. LWN Weekly Edition, pages
Symposium on Operating Systems Principles (SOSP 18–22, July 2013.
[21] Christoffer Dall and Jason Nieh. KVM/ARM: The De- [29] Chris Hawblitzel, Jon Howell, Jacob R. Lorch, Arjun
sign and Implementation of the Linux ARM Hypervisor. Narayan, Bryan Parno, Danfeng Zhang, and Brian Zill.
In Proceedings of the 19th International Conference on Ironclad Apps: End-to-End Security via Automated
Architectural Support for Programming Languages and Full-System Verification. In Proceedings of the 11th
Operating Systems (ASPLOS 2014), pages 333–347, Salt USENIX Symposium on Operating Systems Design
Lake City, UT, March 2014. and Implementation (OSDI 2014), pages 165–181,
Broomfield, CO, October 2014.
[22] Mads Dam, Roberto Guanciale, Narges Khakpour,
Hamed Nemati, and Oliver Schwarz. Formal Verification [30] Constance L. Heitmeyer, Myla Archer, Elizabeth I.
of Information Flow Security for a Simple ARM-Based Leonard, and John McLean. Formal Specification and
Separation Kernel. In Proceedings of the 2013 ACM Verification of Data Separation in a Separation Kernel
SIGSAC Conference on Computer & Communications for an Embedded System. In Proceedings of the 13th
Security (CCS 2013), pages 223–234, Berlin, Germany, ACM Conference on Computer and Communications
November 2013. Security (CCS 2006), pages 346–355, Alexandria,
Virginia, October 2006.
[23] Andrew Ferraiuolo, Andrew Baumann, Chris Hawblitzel,
[31] Felicitas Hetzelt and Robert Buhren. Security Analysis
and Bryan Parno. Komodo: Using verification to
of Encrypted Virtual Machines. In Proceedings of the
disentangle secure-enclave hardware from software.
13th ACM SIGPLAN/SIGOPS International Conference
In Proceedings of the 26th Symposium on Operating
on Virtual Execution Environments (VEE 2017), pages
Systems Principles (SOSP 2017), pages 287–305,
129–142, Xi’an, China, April 2017.
Shanghai, China, October 2017.
[32] Intel Corporation. Intel Trust Domain Extensions.
[24] Ronghui Gu, Jérémie Koenig, Tahina Ramananandro, https://www.intel.com/content/www/us/
Zhong Shao, Xiongnan Newman Wu, Shu-Chun Weng, en/developer/articles/technical/intel-
and Haozhong Zhang. Deep Specifications and Certified trust-domain-extensions.html, October 2014.
Abstraction Layers. In Proceedings of the 42nd ACM
Symposium on Principles of Programming Languages [33] Intel Corporation. Intel Software Guard Ex-
(POPL 2015), pages 595–608, Mumbai, India, January tensions Programming Reference. https:
2015. //software.intel.com/sites/default/files/
managed/48/88/329298-002.pdf, May 2021.
[25] Ronghui Gu, Zhong Shao, Hao Chen, Jieung Kim,
Jérémie Koenig, Xiongnan Wu, Vilhelm Sjöberg, and [34] Dongseok Jang, Zachary Tatlock, and Sorin Lerner.
David Costanzo. Building Certified Concurrent OS Establishing Browser Security Guarantees through
Kernels. Communications of the ACM, 62(10):89–99, Formal Shim Verification. In Proceedings of the 21st
September 2019. USENIX Security Symposium (USENIX Security 2012),
pages 113–128, Bellevue, WA, August 2012.
[26] Ronghui Gu, Zhong Shao, Hao Chen, Xiongnan Newman
[35] C. B. Jones. Tentative Steps toward a Development
Wu, Jieung Kim, Vilhelm Sjöberg, and David Costanzo.
Method for Interfering Programs. ACM Transactions
CertiKOS: An Extensible Architecture for Building
on Programming Languages and Systems (TOPLAS),
Certified Concurrent OS Kernels. In Proceedings of the
5(4):596–619, October 1983.
12th USENIX Symposium on Operating Systems Design
and Implementation (OSDI 2016), pages 653–669, [36] Jieung Kim, Vilhelm Sjöberg, Ronghui Gu, and Zhong
Savannah, GA, November 2016. Shao. Safety and Liveness of MCS Lock—Layer by
Layer. In Proceedings of the Asian Symposium on
[27] Ronghui Gu, Zhong Shao, Jieung Kim, Xiongnan New- Programming Languages and Systems (APLAS 2017),
man Wu, Jérémie Koenig, Vilhelm Sjöberg, Hao Chen, pages 273–297, Suzhou, China, November 2017.
David Costanzo, and Tahina Ramananandro. Certified
Concurrent Abstraction Layers. In Proceedings of [37] Gerwin Klein, Kevin Elphinstone, Gernot Heiser, June
the 39th ACM SIGPLAN Conference on Programming Andronick, David Cock, Philip Derrin, Dhammika
Language Design and Implementation (PLDI 2018), Elkaduwe, Kai Engelhardt, Rafal Kolanski, Michael
pages 646–661, Philadelphia, PA, June 2018. Norrish, Thomas Sewell, Harvey Tuch, and Simon
Winwood. seL4: Formal Verification of an OS Kernel.
[28] Stefan Hajnoczi. An Updated Overview of the QEMU In Proceedings of the ACM SIGOPS 22nd Symposium
Storage Stack. In LinuxCon Japan 2011, Yokohama, on Operating Systems Principles (SOSP 2009), pages
Japan, June 2011. 207–220, Big Sky, MT, October 2009.
[38] KVM contributors. Tuning KVM. http:// [48] Mathias Morbitzer, Manuel Huber, Julian Horsch, and
www.linux-kvm.org/page/Tuning_KVM, May 2015. Sascha Wessel. SEVered: Subverting AMD’s Virtual
Machine Encryption. In Proceedings of the 11th
[39] KVM contributors. KVM Unit Tests. http: European Workshop on Systems Security (EuroSec
//www.linux-kvm.org/page/KVM-unit-tests, 2018), pages 1–6, Porto, Portugal, April 2018.
August 2020.
[49] Toby Murray, Daniel Matichuk, Matthew Brassil, Peter
[40] Mengyuan Li, Yinqian Zhang, Zhiqiang Lin, and Yan Gammie, Timothy Bourke, Sean Seefried, Corey Lewis,
Solihin. Exploiting Unprotected I/O Operations in Xin Gao, and Gerwin Klein. seL4: from General
AMD’s Secure Encrypted Virtualization. In Proceedings Purpose to a Proof of Information Flow Enforcement.
of the 28th USENIX Security Symposium (USENIX In Proceedings of the 2013 IEEE Symposium on Security
Security 2019), pages 1257–1272, Santa Clara, CA, and Privacy (IEEE S&P 2013), pages 415–429, San
August 2019. Francisco, CA, May 2013.
[41] Shih-Wei Li, John S. Koh, and Jason Nieh. Protecting [50] Luke Nelson, James Bornholt, Ronghui Gu, Andrew Bau-
Cloud Virtual Machines from Commodity Hypervisor mann, Emina Torlak, and Xi Wang. Scaling Symbolic
and Host Operating System Exploits. In Proceedings of Evaluation for Automated Verification of Systems Code
the 28th USENIX Security Symposium (USENIX Security with Serval. In Proceedings of the 27th ACM Symposium
2019), pages 1357–1374, Santa Clara, CA, August 2019. on Operating Systems Principles (SOSP 2019), pages
[42] Shih-Wei Li, Xupeng Li, Ronghui Gu, Jason Nieh, and 225–242, Huntsville, ON Canada, October 2019.
John Zhuang Hui. A Secure and Formally Verified Linux [51] Redis Labs. Memtier Benchmark. https:
KVM Hypervisor. In Proceedings of the 2021 IEEE //github.com/RedisLabs/memtier_benchmark,
Symposium on Security and Privacy (IEEE S&P 2021), January 2018.
pages 1782–1799, San Francisco, CA, May 2021.
[52] Redis Labs. Redis Benchmark. https://redis.io/
[43] Shih-Wei Li, Xupeng Li, Ronghui Gu, Jason Nieh, docs/reference/optimization/benchmarks/,
and John Zhuang Hui. Formally Verified Memory March 2022.
Protection for a Commodity Multiprocessor Hypervisor.
In Proceedings of the 30th USENIX Security Symposium [53] Richard M. Stallman and the GCC Developer Com-
(USENIX Security 2021), pages 3953–3970, Vancouver, munity. Using the GNU Compiler Collection (GCC).
BC Canada, August 2021. https://gcc.gnu.org/onlinedocs/gcc-12.1.0/
gcc.pdf, May 2022.
[44] Richard J. Lipton. Reduction: A Method of Proving
Properties of Parallel Programs. Communications of the [54] Rusty Russell. Hackbench. http://
ACM, 18(12):717–721, December 1975. people.redhat.com/mingo/cfs-scheduler/
tools/hackbench.c, January 2008.
[45] Mengqi Liu, Lionel Rieg, Zhong Shao, Ronghui Gu,
David Costanzo, Jung-Eun Kim, and Man-Ki Yoon. [55] Helgi Sigurbjarnarson, Luke Nelson, Bruno Castro-
Virtual Timeline: A Formal Abstraction for Verifying Karney, James Bornholt, Emina Torlak, and Xi Wang.
Preemptive Schedulers with Temporal Isolation. Pro- Nickel: A Framework for Design and Verification of
ceedings of the ACM on Programming Languages, Information Flow Control Systems. In Proceedings of
4(POPL):1–31, December 2019. the 13th USENIX Symposium on Operating Systems
Design and Implementation (OSDI 2018), pages
[46] Jacob R. Lorch, Yixuan Chen, Manos Kapritsos, Bryan 287–305, Carlsbad, CA, October 2018.
Parno, Shaz Qadeer, Upamanyu Sharma, James R.
Wilcox, and Xueyuan Zhao. Armada: Low-Effort [56] SUSE. Performance Implications of Cache Modes.
Verification of High-Performance Concurrent Programs. https://www.suse.com/documentation/
In Proceedings of the 41st ACM SIGPLAN Conference sles11/book_kvm/data/sect1_3_chapter_
on Programming Language Design and Implementation book_kvm.html, September 2016.
(PLDI 2020), pages 197–210, London, UK, June 2020.
[57] Runzhou Tao, Jianan Yao, Xupeng Li, Shih-Wei Li,
[47] Mathias Morbitzer, Manuel Huber, and Julian Horsch. Jason Nieh, and Ronghui Gu. Formal Verification of a
Extracting Secrets from Encrypted Virtual Machines. In Multiprocessor Hypervisor on Arm Relaxed Memory
Proceedings of the 9th ACM Conference on Data and Hardware. In Proceedings of the 28th ACM Symposium
Application Security and Privacy (CODASPY 2019), on Operating Systems Principles (SOSP 2021), pages
pages 221–230, Dallas, TX, March 2019. 866–881, Virtual Event, Germany, October 2021.
[58] Viktor Vafeiadis, Maurice Herlihy, Tony Hoare, and Marc
Shapiro. Proving Correctness of Highly-Concurrent
Linearisable Objects. In Proceedings of the 11th ACM
SIGPLAN Symposium on Principles and Practice of
Parallel Programming (PPoPP 2006), pages 129–136,
New York, NY, March 2006.

[59] Alexander Van’t Hof and Jason Nieh. BlackBox: A

Container Security Monitor for Protecting Containers on
Untrusted Operating Systems. In Proceedings of the 16th
USENIX Symposium on Operating Systems Design and
Implementation (OSDI 2022), Carlsbad, CA, July 2022.

[60] Luca Wilke, Jan Wichelmann, Mathias Morbitzer, and

Thomas Eisenbarth. SEVurity: No Security Without
Integrity Breaking Integrity-Free Memory Encryption
with Minimal Assumptions. In Proceedings of the 2020
IEEE Symposium on Security and Privacy (IEEE S&P
2020), pages 1483–1496, San Francisco, CA, May 2020.
[61] Jean-Karim Zinzindohoué, Karthikeyan Bhargavan,
Jonathan Protzenko, and Benjamin Beurdouche.
HACL*: A Verified Modern Cryptographic Library. In
Proceedings of the 2017 ACM SIGSAC Conference on
Computer and Communications Security (CCS 2017),
pages 1789–1806, Dallas, TX, October 2017.
[62] Mo Zou, Haoran Ding, Dong Du, Ming Fu, Ronghui
Gu, and Haibo Chen. Using Concurrent Relational
Logic with Helpers for Verifying the AtomFS File
System. In Proceedings of the 27th ACM Symposium
on Operating Systems Principles (SOSP 2019), pages
259–274, Huntsville, ON Canada, October 2019.