Figure 2: Page table creation and hand-over-hand locking execution.

about interleavings directly with multiple CPUs is difficult.

To simplify reasoning about all possible interleavings, we instead lift multiprocessor execution to a local CPU model, which distinguishes execution taking place on a particular CPU from its concurrent environment [27, 36, 42]. All effects coming from the environment are encapsulated by and conveyed through an event oracle, which yields events emitted by other CPUs when queried. Querying the event oracle can be thought of, in the context of the explicit multiprocessor machine model, as returning events from the global log generated by all other CPUs; only new events since the last query are returned. How the event oracle synchronizes these events is left abstract, its behavior constrained only by rely-guarantee conditions [35]. Since the interleaving of events is left abstract, our proofs do not rely on any particular interleaving of events and therefore hold for all possible concurrent interleavings.

A CPU captures the effects of its concurrent environment by querying the event oracle between local CPU steps. A CPU only needs to query the event oracle when interacting with shared objects, since its private state is not affected by these events. In other words, the CPU repeatedly performs two steps when interacting with shared objects: querying the event oracle to obtain events from other CPUs, then generating a local CPU event. The result is a composite log of events from other CPUs interleaved with events from the local CPU. This is equivalent to the logical log in the explicit multiprocessor model, but without the complexity of directly reasoning about multiple CPUs.

If possible, we would like to move the interleaved event oracle queries out of the way of the local CPU events so we can use sequential reasoning regarding the local execution of any given CPU. By using mover types, we can identify how we can reorder event oracle queries with respect to local CPU events without changing the machine's behavior. Thus, these queries are mover oracle queries. We classify all local CPU events in the composite log as RightMover, LeftMover, or NoneMover. Mover oracle queries can be reordered before a RightMover and after a LeftMover. For example, acquiring a lock is a RightMover because if other CPUs do something after the lock is acquired on the local CPU, they must be able to do the same thing before the lock is acquired. The oracle queries which capture the other CPUs' events can therefore be reordered before acquiring the lock. Mover oracle queries cannot be reordered past a NoneMover. For example, an oracle query followed by a NoneMover then a LeftMover cannot be reordered after the LeftMover.

VIA can then reduce the interleaving of events in the log that needs to be considered in two ways, which we refer to as log refinement. First, we can reorder oracle queries with local CPU events based on the local events' mover types. By reordering, consecutive oracle queries can be merged into one. Second, we can prove that local sequences of events generated by the machine refine an aggregate local event generated by a higher-level machine. This refinement can be applied to any arbitrary CPU; it therefore applies to all CPUs, so that the entire log of events refines the log of the higher-level aggregate events.

Figure 3: Log refinement with mover oracle queries.

Figure 3 shows an example of log refinement that reduces interleavings of events across CPUs into an atomic event. We identify the mover type of each local event, i.e., [Right 0, Right 1, None 2, Left 3], and initially query the oracle before each event. Based on the mover types, we can reorder all oracle queries before the NoneMover to the beginning, and all remaining queries to the end, such that the logs before and after reordering have the same machine behavior. We then define a new oracle that can be queried to return the consecutive events from the previous oracle queries [Oracle 0, Oracle 1, Oracle 2], allowing those events to be merged into a single oracle query [Oracle' 0]. We then refine the local sequence of events [Right 0, Right 1, None 2, Left 3] into a single higher-level aggregate local event EVENT 0. This can be done for all CPUs, so we can reason further using only the higher-level aggregate event EVENT 0 with oracle queries Oracle'' 0 and Oracle'' 1 that also return higher-level aggregate events, instead of the many Left/Right/None events of the lower-level machine.
4.2 Permutation Conditions

To verify RMM, we must account for the relaxed memory behavior of the Arm architecture on code that is not data race free (DRF). For example, Figure 4 shows how a Realm's list of RECs is updated in REC.Create, REC.Destroy, and Realm.Destroy without holding a common lock. Each Realm's RD has a RECLIST (rd->rec_list), an array that stores the pointers to all its RECs. The RECLIST can be referenced from both the Realm's RD and each of the Realm's RECs (rec->rec_list). Each REC records its index in the RECLIST (rec->id). The RD's counter keeps track of how many RECs are in a Realm. The hypervisor must destroy all RECs of a Realm before destroying its RD because once the RD is destroyed, the Realm can no longer be referenced. Access to the RECLIST is not synchronized by its own lock, to avoid potential deadlock issues due to needing to hold multiple locks. Instead, in REC.Create, the RD's lock must be held to insert a new REC in the RECLIST to ensure mutual exclusion. However, in REC.Destroy, the REC's lock is held instead of the RD's lock when clearing the REC's entry from the RECLIST, so that multiple CPUs can destroy different RECs of the same Realm concurrently. Furthermore, the RD's counter is increased or checked in REC.Create and Realm.Destroy while holding the RD's lock, but it is decreased in REC.Destroy without holding any lock. As a result, data races can occur when concurrently executing REC.Destroy with REC.Create or Realm.Destroy.

Rec.Create(rd, id) {
    acq(rd->lock);
    ...
(a) if (rd->rec_list[id] == NULL) {
(b)     rd->rec_list[id] = NEW_REC;
(c)     atomic_inc(rd->counter);
    }
    ...
    rel(rd->lock);
}

Rec.Destroy(rec) {
    acq(rec->lock);
    ...
(d) rec->rec_list[rec->id] = NULL;
(e) atomic_dec(rec->rd->counter);
    ...
    rel(rec->lock);
}

Realm.Destroy(rd) {
    acq(rd->lock);
    ...
(f) if (rd->counter == 0) {
        // rec_list should be EMPTY
(g)     destroy(rd->rec_list);
    }
    ...
    rel(rd->lock);
}

Figure 4: Pseudo code of RECLIST data races, marked in bold blue.
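To make the data layout in Figure 4 concrete, the following is a minimal C sketch of the structures the pseudocode refers to. The type and field names (struct rd, struct rec, MAX_RECS, the stand-in spinlock type) are illustrative assumptions, not the actual RMM definitions.

#define MAX_RECS 512                           /* illustrative capacity, not RMM's value */
typedef struct { char locked; } spinlock_t;    /* stand-in for the real lock type */

struct rec;                                    /* forward declaration */

struct rd {
    spinlock_t  lock;                          /* held by REC.Create and Realm.Destroy */
    unsigned long counter;                     /* number of live RECs in the Realm */
    struct rec *rec_list[MAX_RECS];            /* RECLIST: one slot per REC */
};

struct rec {
    spinlock_t  lock;                          /* held by REC.Destroy */
    unsigned long id;                          /* this REC's index into the RECLIST */
    struct rd  *rd;                            /* owning Realm descriptor (RD) */
    struct rec **rec_list;                     /* back-pointer to rd->rec_list */
};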
To address this problem, VIA builds on VRM [57]. VRM verifies programs on Arm relaxed memory hardware that are DRF except for synchronization methods and virtual memory hardware. VRM verifies a program on a sequentially consistent (SC) multiprocessor hardware model, defines and proves that a fixed set of conditions hold for the program running on relaxed memory hardware, and proves that the conditions guarantee that the program has the same behavior on SC and relaxed memory hardware, so that its SC proofs also hold for relaxed memory hardware.

VIA generalizes this approach to programs that are not DRF. It ensures that such a program will have the same behavior on SC and relaxed memory hardware by first decomposing the program into components that are DRF and not DRF. Previous work already shows that the DRF components will have the same behavior on SC and relaxed memory hardware [57]. VIA then introduces permutation conditions P on the non-DRF components such that P can be verified to hold for the program on relaxed memory hardware, and P can be proven to guarantee that the non-DRF components will have the same behavior on SC and relaxed memory hardware. Our experience suggests that even for programs that are not DRF, only a small percentage of the code in these programs is not DRF, so non-DRF programs can be verified on relaxed memory hardware by only proving a small number of permutation conditions in practice. This observation holds for RMM, in which almost all of the code is DRF.

VIA uses VRM's extended Promising Arm model [57] to model Arm's relaxed memory hardware, such that P needs to be verified against all instruction permutations of the program allowed by VRM's Promising Arm model. Unlike VRM, which defines a fixed set of conditions that do not all hold for RMM, VIA allows any condition P to be specified for non-DRF components that will result in their behavior being the same on SC and relaxed memory hardware and that can be proven to hold for the program on relaxed memory hardware. The condition is essentially a constraint based on the program's semantics that restricts the possible instruction reorderings that can occur on relaxed memory hardware so that the resulting program behavior is the same on SC and relaxed memory hardware.

For example, to handle the non-DRF code in Figure 4, we identify P to be: when Realm.Destroy finds rd->counter equals 0, rd->rec_list must be empty. This is necessary because rd->rec_list must be empty when destroying it in (g); otherwise the system may crash due to reclaiming non-empty memory. Since REC.Create and Realm.Destroy use the same lock, data races can only occur when either runs concurrently with REC.Destroy. We prove each function always behaves the same on SC and relaxed memory. For REC.Create, since (b) and (c) cannot be reordered with (a) due to the branch dependency, as required by Promising Arm, its possible executions are (a)(b)(c) or (a)(c)(b). Since (a) confirms that rec_list[id] is empty, all concurrent REC.Destroy on other CPUs must destroy slots other than id, because REC.Destroy will only work if the rec exists, which must be a non-empty slot in the rec_list. Therefore, swapping (b) and (c) will never change any CPU's behavior, and (a)(c)(b) is equivalent to (a)(b)(c), which is the order on SC. For REC.Destroy, if (e) executes before (d), P will be broken because when Realm.Destroy checks counter concurrently on another CPU, it may find counter is 0 but rec_list is not empty, as shown below:

    (e) counter--    (f) counter==0    (g) destroy(list) (list is not empty)    (d) list[id]=NULL

This was actually a real bug in the prototype implementation of RMM. Therefore, we must enforce that (d) always executes before (e) by adding a barrier between them so that they follow program order as on SC. For Realm.Destroy, the proof is trivial because the branch dependency between (f) and (g) guarantees that they execute in program order as on SC. Therefore, this non-DRF code will not generate more behavior on relaxed memory hardware than on SC.
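The fix described above can be expressed as an ordering constraint between (d) and (e). The sketch below is illustrative rather than the actual RMM patch; it reuses the hypothetical struct definitions sketched earlier and uses the GCC/Clang __atomic builtins to give the counter decrement release semantics, so that a CPU observing the decrement also observes the cleared RECLIST slot.

/* Illustrative sketch of REC.Destroy's critical updates, assuming the
 * hypothetical struct rec/struct rd layout sketched above. */
static void rec_destroy_clear_and_dec(struct rec *rec)
{
    /* (d) clear this REC's slot in the RECLIST */
    rec->rec_list[rec->id] = NULL;

    /* (e) with release ordering: the store in (d) is ordered before the
     * decrement, so a concurrent Realm.Destroy that sees counter == 0
     * also sees the cleared slot. */
    __atomic_fetch_sub(&rec->rd->counter, 1, __ATOMIC_RELEASE);
}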
4.3 Register Accounting

To verify CCA firmware with both C and assembly code, we must account for the interactions of C and assembly code primitives that call one another across language boundaries. However, C code hides the details of how it uses CPU registers, as the use of registers during C code execution is decided by the implementation of the specific C compiler used. Although register behavior is not expressed by C language semantics, ignoring it causes problems when attempting to verify programs in which C and assembly code call one another, as shown in Figure 5, which illustrates a real bug in the original prototype RMM implementation detected during our verification. Existing verification approaches cannot support bidirectional calls between C and assembly code, such that the example in Figure 5 would be erroneously verified without detecting the information leakage [10, 11, 23, 26, 37, 42, 43, 46].

To address this problem, VIA introduces a novel register accounting mechanism to correctly verify integrated C and Arm assembly code while making minimal assumptions regarding compiler behavior.
ENTRY(store_inner):
    str x5, [x1]
    ret
ENDPROC(store_inner)
Spec: mem[%x1] = %x5;

void store_c() {
    int s = secret;
    // s stored in x5
    store_inner();
}
Spec: mem[%x1] = %x5;

ENTRY(store_outer):
    mov x5, #0
    bl store_c
    ret
ENDPROC(store_outer)
WRONG Spec: mem[%x1] = 0;

Figure 5: An example of incorrectly combining C and assembly specifications. Assembly function store_outer clears register x5 to 0, then calls C function store_c. store_c calls assembly function store_inner, which stores register x5 into memory. The intended behavior is that the value 0 will be stored to memory. The actual behavior is that x5 stores C temporary variable s which contains secret data, resulting in undetected information leakage.

u64 sca_read64(u64 *ptr) {
    u64 val;
    asm volatile(
        "ldr %[val], %[ptr]\n"
        : [val] "=r" (val)
        : [ptr] "m" (*ptr)
    );
    return val;
}

Bind:
    ptr -> I0
    val -> O0

u64 sca_read64(u64 *ptr) {
    u64 val;
    init_pr();
    set_pr(I0, ptr);
    asm volatile(
        "ldr %O0, [%I0]\n"
    );
    val = get_pr(O0);
    return val;
}

To asm prim:

ENTRY(sca_read64_inline):
    ldr O0, [I0]
    ret
ENDPROC(sca_read64_inline)

u64 sca_read64(u64 *ptr) {
    u64 val;
    init_pr();
    set_pr(I0, ptr);
    sca_read64_inline();
    val = get_pr(O0);
    return val;
}

Figure 6: Translation of parameterized inline assembly.
VIA leverages the Arm64 Procedure Call Standard (AAPCS64) [7] to specify how registers are potentially used when assembly code calls a C function or is called by a C function. It then conservatively marks all registers used by C code whose values cannot be determined based on AAPCS64 as of Unknown value, and requires assembly code to not depend on registers with Unknown values.

AAPCS64 constrains how some Arm registers are used. In CCA firmware, C functions pass no more than eight integer or pointer parameters and return an integer or pointer. For such functions, AAPCS64 specifies that a C compiler will only pass parameters through registers r0-r7 and save the return value in r0. It also specifies registers that must have their values preserved through a function call, namely all callee-saved registers r19-r29 and the stack register sp. The use of other general-purpose registers (GPRs) may depend on the specific C compiler implementation.

For an assembly function that calls a C function, VIA checks that the assembly code does not read any Unknown registers. Legal assembly code can either keep such Unknown registers untouched or overwrite them before using them. VIA uses AAPCS64 to model the register behavior of the C function by identifying register r0 as containing the return value, and registers r19-r29 and sp as preserving their values. It marks the values of other registers after the C function call as Unknown, including caller-saved registers r1-r18 and the link register lr.

For an assembly function that can be called from a C function, VIA checks that its behavior does not depend on Unknown registers, and that it obeys AAPCS64 C calling conventions so that it will not cause unexpected behavior in its caller. VIA checks that (1) callee-saved registers r19-r29 and sp preserve their values; (2) the program counter pc after the call is equal to lr before the call, so the assembly primitive returns like a function call; (3) if the caller expects a return value, r0's value is never Unknown; and (4) the assembly code behavior remains the same if we initialize all GPRs to Unknown except for those carrying parameters. The last condition implies that the assembly code does not read any Unknown registers, except for saving and restoring callee-saved registers.

VIA also supports GNU Compiler Collection (GCC) inline assembly extensions within a C function. This is used in inline assembly memory accessors in RMM which guarantee atomicity or memory order semantics, as shown in the sca_read64 example in Figure 6. sca_read64 implements a 64-bit single-copy-atomic read in one line of assembly code plus an interface, which can specify a list of input registers, output registers and clobbered registers. VIA translates inline assembly code into an assembly function according to the interface constraints; "r", "Q", and "m" constraints are currently supported. It then checks its correctness like any other assembly function.

Translation is done using a set of logical registers I0-In for inputs and O0-On for outputs so that verification does not depend on the specifics of GCC register assignment. Input registers are defined read only. VIA also defines abstract accessors: init_pr, which initializes all logical registers to UNKNOWN; set_pr, which writes to a register; and get_pr, which reads from a register. As shown in Figure 6, the translated sca_read64 function first calls init_pr for initialization, saves parameters to input registers by calling set_pr, uses the input and output registers in the assembly code, and gets the return value from the output register by calling get_pr.

For simplicity, VIA imposes additional requirements to guarantee GCC generates correct machine code whose behavior is the same as VIA's translated code. VIA forbids inline assembly code from explicitly using any GPRs or goto labels. For inline assembly with multiple instructions, VIA enforces that all output registers are constrained by "&" or "+". Thus, an output-only register never doubles as an input register, and the same register is used for the input and output of a "+" operand. This avoids any unexpected overlap in the assignment of input and output registers [53].

Finally, because assembly code functions may be at the interface to outside programs that are untrusted, VIA enforces that all register values are not Unknown when returning from those assembly functions. This ensures that there is no unintentional information leakage from assembly code functions to untrusted programs through registers with Unknown values.
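For reference, a memory accessor of the kind described above can be written as follows. This is a hedged, self-contained sketch in the style of Figure 6's original form, not RMM's actual source: it performs a 64-bit single-copy-atomic load with a single ldr instruction, uses only the "r" and "m" constraint classes that VIA's translation supports, and names no GPRs explicitly.

#include <stdint.h>

typedef uint64_t u64;

/* Illustrative single-copy-atomic 64-bit read (compare Figure 6). */
static inline u64 sca_read64_sketch(const u64 *ptr)
{
    u64 val;
    asm volatile("ldr %[val], %[addr]\n"
                 : [val] "=r" (val)     /* output register, compiler-chosen */
                 : [addr] "m" (*ptr)    /* memory operand for the load */
                 : "memory");
    return val;
}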
4.4 Ideal Secure System Model

CCA protects the confidentiality and integrity of Realms' private data during their lifetime. Confidentiality means any change a Realm makes to its private data is only observable by that Realm. Integrity means a Realm will not observe any changes to its private data that it did not make, but does not imply availability; data access should either fail or return the data previously stored. The confidentiality definition is standard, but the integrity definition allows untrusted software to modify a Realm's private data as long as the Realm does not observe the change. For example, to reclaim memory from Realms, a hypervisor can unmap a Realm's private data without the Realm's permission. This is allowed because the Realm's access to the unmapped data will trigger a page fault, so the Realm cannot observe future changes to the data content. However, this breaks noninterference, which therefore cannot be used to prove security as is done for other verified systems [16, 23, 29, 34, 42, 49, 55].

Figure 7: The real and ideal secure system model.

Exclusive Type  Rule
Mem   When a Realm accesses an IPA within its PAR but it is Unknown, the Realm will copy the data from a special initialization buffer in memory to exclusive memory before accessing the IPA. This can only be done once per granule. The buffer is populated before the Realm is activated, and cannot be changed once it has been activated.
Mem   When a Realm accesses an IPA outside of its PAR, it will directly access memory, not exclusive memory.
Reg   On any trap from a Realm to the RMM, a Realm exposes the contents of various exclusive system registers, marking them Unknown, and marks various timer-related exclusive registers Unknown.
Reg   If a trap is due to system register emulation, a Realm will mark a specified exclusive GPR as Unknown.
Reg   If a trap is due to a hypercall, a Realm will expose and mark the seven exclusive GPRs r0-r6 used for parameter passing as Unknown.
Reg   If a trap is due to an RMM call, a Realm will expose and mark the four exclusive GPRs r0-r3 used for parameter passing as Unknown.
Table 3: Declassification rules.
to modify a Realm’s private data as long as the Realm does
not observe the change. For example, to reclaim memory
its PAR or it accesses a granule or register that is Unknown.
from Realms, a hypervisor can unmap a Realm’s private data
If it accesses memory outside its PAR, the Realm will access
without the Realm’s permission. This is allowed because the
non-exclusive memory directly. If it accesses a granule or
Realm’s access to the unmapped data will trigger a page fault
register that is Unknown, the data will be copied from a special
so the Realm cannot observe future changes to the data content.
initialization buffer or non-exclusive register, respectively, be-
However, this breaks noninterference, which therefore cannot
fore accessing it. A granule is Unknown if it is not yet initialized.
be used to to prove security as is done for other verified
A register is Unknown if it is used by the Realm to communicate
systems [16, 23, 29, 34, 42, 49, 55].
with RMM or the hypervisor. For example, when a Realm
To address this problem, VIA introduces an ideal/real
invokes a hypercall, it exposes the arguments in registers
paradigm, shown in Figure 7, inspired by the idea from formal r0-r6, which RMM will provide to the hypervisor, then return
verification of separation kernels [22, 30]. The real system the results back in those registers. Marking a granule or register
is defined by the RMM top-layer specification, which builds
as Unknown is used to represent declassification in the model.
on and incorporates EL3M, in which all memory and CPU
We can then use this ideal system model with declassifica-
registers are shared by Realms, RMM, and the hypervisor. The
tion to verify that RMM guarantees Realm confidentiality and
ideal system is defined by an ideal system model specification,
integrity. The key is to establish a simulation relation in which
in which each Realm has its own exclusive memory, and
all machine states are equivalent between the ideal and real
each REC of the Realm has its own exclusive CPU registers,
systems and show that, at any step in the two systems satisfying
while other software can only access the same non-exclusive
the simulation relation, the same data is obtained when access-
memory and registers as in the real system.
ing memory or registers. This involves proving a one-to-one
If each Realm only accesses its exclusive memory and mapping of data between the two systems. With declassifica-
registers in the ideal system, we could then show that RMM tion, the mapping will change such that a different mapping
guarantees confidentiality and integrity by proving that the will be used depending on whether the data is declassified or
real system simulates the ideal system. This would mean that not. For example, if a granule within a Realm’s PAR is not de-
each Realm only accesses its exclusive memory and registers classified, we will want to show that accessing that granule in
in the real system as well, so nothing other than a Realm can non-exclusive memory in the real system correponds to access-
access its own data. However, such a simplistic model does not ing it in exclusive memory in the ideal system to get the same
work in practice. For CCA, we need a model that allows declas- data. On the other hand, if a granule within a Realm’s PAR is
sification so Realms can access NS granules for initialization declassified, because its contents were initialized from an NS
and I/O, and CPU registers can be used to pass parameters granule, we will want to show that first accessing that granule
between Realms and RMM, or Realms and the hypervisor. in non-exclusive memory in the real system correponds to ac-
VIA introduces a new ideal system model for Armv9-A that cessing it in non-exclusive memory in the ideal system since
supports declassification of memory and registers based on the respective exclusive memory is initially Unknown so the
a set of well-designed rules that define when declassification data is first copied from non-exclusive to exclusive memory.
is allowed. The model has six declassification rules, listed in
Table 3. In this model, Realm exclusive memory consists of all
memory in its PAR and exclusive CPU registers consists of all 5 CCA Implementation and Verification
registers accessible by a Realm or that can affect its execution,
such as system registers. A Realm will only access its exclusive We used VIA to verify an early prototype implementation
memory and registers, unless it accesses a granule outside of CCA firmware, which includes both RMM and EL3M as
described in Section 3. The verification outcomes, including the discovery of several latent bugs, were confirmed by Arm's development team and used to further improve the firmware implementation. RMM contains 3.2K lines of code (LOC) in C and .3K LOC in assembly. The runtime critical parts of EL3M contain .1K LOC in C and .7K LOC in assembly; all of the C code is for updating the GPT. All RMM and EL3M code is verified, except for the portion of assembly code for initialization (.1K LOC in RMM and .5K LOC in EL3M). For remote attestation, RMM also uses functions provided by a crypto library, which was not verified, though a verified crypto library could be ported and used instead [42, 61].

Table 4 shows our proof effort, measured in LOC in Coq. 45 abstraction layers were used. The bottom layer machine model is based on VRM's Promising Arm model [57] to model Arm's relaxed memory. Another layer was used to verify the spinlock implementation on the relaxed memory model and lift it to an SC model. We verify the EL3M implementation refines its layered specification through three layers. On top of that, we verify the RMM implementation refines its layered specification through 39 layers. The top-level specification reflects RMM's interface, combining both RMM and EL3M functionality. Another layer defines the ideal secure system model. We verify that the top-level specification simulates the ideal secure system model.

Description                 LOC     Description                     LOC
Machine model               1.4K    RMM refinement proofs           6.1K
Lock proof                  1.7K    Top-level specification         1.1K
EL3M layer specifications   .2K     Ideal secure system model       .2K
EL3M refinement proofs      .9K     Security simulation proofs      3.4K
RMM layer specifications    4.4K    Permutation condition proofs    1.2K
Total                       20.6K
Table 4: Lines of Coq code for verifying CCA firmware.

5.1 Concurrent Multi-level Page Tables

The most challenging refinement proofs were for verifying RMM's RTT implementation. RTT primitives use hand-over-hand locking to synchronize access to dynamically allocated 4-level page tables, allowing fine-grain concurrent operation on different page table levels. This required nine layers. We leverage mover oracle queries and log refinement, discussed in Section 4.1, to refine all of RMM's page table operations to atomic operations, verifying the correctness of hand-over-hand locking in a real system for the first time.
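As a concrete illustration of the locking discipline being verified, the following C sketch walks a 4-level page table with hand-over-hand (lock coupling) locking: each level's lock is taken before the parent's lock is released, so concurrent walks can proceed in disjoint subtrees. The types and helpers (struct rtt, pt_lock, pt_unlock, the index calculation assuming a 48-bit IPA with a 4KB granule) are hypothetical stand-ins, not RMM's RTT code; pt_lock/pt_unlock stand in for the verified spinlock sketched in Section 5.2 below.

typedef struct { char locked; } pt_lock_t;     /* stand-in lock type */
void pt_lock(pt_lock_t *l);
void pt_unlock(pt_lock_t *l);

struct rtt {
    pt_lock_t   lock;
    struct rtt *child[512];                    /* next-level tables, NULL if absent */
};

/* Table index of ipa at a given level (illustrative 48-bit IPA, 4KB granule). */
static unsigned int rtt_index(unsigned long ipa, int level)
{
    return (ipa >> (39 - 9 * level)) & 0x1ff;
}

/* Hand-over-hand walk: acquire the child's lock before releasing the
 * parent's. Returns with the level-3 table's lock held, or NULL if a
 * table on the path does not exist. The caller releases the final lock. */
static struct rtt *rtt_walk_locked(struct rtt *root, unsigned long ipa)
{
    struct rtt *cur = root;
    pt_lock(&cur->lock);
    for (int level = 1; level <= 3; level++) {
        struct rtt *next = cur->child[rtt_index(ipa, level - 1)];
        if (!next) {
            pt_unlock(&cur->lock);
            return NULL;
        }
        pt_lock(&next->lock);                  /* take the child's lock ... */
        pt_unlock(&cur->lock);                 /* ... before dropping the parent's */
        cur = next;
    }
    return cur;
}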
Figure 8: Proving atomicity for page table operations.

Figure 8 visualizes the proof. Since acquiring a lock is a RightMover, releasing a lock is a LeftMover, and reading a page table entry is both a LeftMover and a RightMover, we can reorder mover oracle queries to refine the procedure of walking the page table until acquiring the lock of T1 into an atomic step. We group the local CPU events into a single higher-level aggregate "walk until level 1" event. Similarly, we can group the events from creating a level 1 table into a "create level 1 table" event, and from destroying a level 1 table into a "destroy level 1 table" event.

We then refine the procedure of walking the page table until acquiring the lock of T2 into an atomic step. We first prove that "walk until level 1" is a RightMover because any subsequent events at this layer from other CPUs can be reordered with it, i.e., "create level 1 table", "destroy level 1 table", "walk until level 1", and acq/rel/LD/ST events for T2 and T3 level tables. A "create level 1 table" from other CPUs is irrelevant to the local "walk until level 1" because it can only create other level 1 tables and cannot overwrite T1, since RMM only allows creating a table that does not exist yet. Events "destroy level 1 table" and "walk until level 1" from other CPUs are irrelevant because they cannot hold T1's lock, so they can only access other level 1 tables, not T1. Other events are also irrelevant because they do not manipulate T0 and T1 tables. Therefore, "walk until level 1" is a RightMover and all subsequent mover oracle queries can be reordered before it. Thus, we refine "walk until level 2" into an atomic step, as shown at the bottom of Figure 8. In a similar fashion, we prove "walk until level 2" to be a RightMover and refine the steps of "walk until level 3." Continuing in this manner, we eventually refine all RTT operations into atomic steps.

Proving RTT operations to be atomic allows us to prove desired properties about RMM's RTT management. The key property to prove is that each non-empty entry in the RTTs, including both intermediate entries pointing to lower-level RTTs and leaf mappings, uses a unique delegated granule. This prevents page remapping attacks while still allowing fine-grained access to the RTTs for improved performance. The proof is straightforward because every operation on an RTT entry is proved to be atomic, only the PA of a delegated granule is used to populate a previously empty RTT entry, and each such granule is guaranteed to be unused and zeroed. Once a granule is used for an RTT entry, its state changes from delegated to RTT or Data, preventing it from being used for other RTT entries. By using mover oracle queries and log refinement, we complete the first proof of hand-over-hand locking in a real system, and the first proof of a system with fully dynamically allocated shared page tables.

5.2 Relaxed Memory
We prove permutation conditions as discussed in Section 4.2 to verify the proofs hold on Arm relaxed memory hardware. Verifying CCA firmware only requires six permutation conditions: the RECLIST empty condition discussed in Section 4.2, and five conditions previously introduced by VRM, namely (1) NO-BARRIER-MISUSE, (2) TRANSACTIONAL-PAGE-TABLE, (3) SEQUENTIAL-TLB-INVALIDATION, (4) WRITE-ONCE-KERNEL-MAPPING, and (5) MEMORY-ISOLATION.

NO-BARRIER-MISUSE requires that barriers are correctly placed. We verified that all lock acquisitions have acquire memory semantics and all lock releases have release memory semantics. We also proved that memory accesses to shared objects outside critical sections have release semantics so that they cannot be reordered, preserving program ordering and SC behavior.
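As an illustration of the acquire/release placement this condition checks, here is a minimal spinlock sketch using the GCC __atomic builtins. It is not the verified RMM/EL3M spinlock, just an example of a lock whose acquisition has acquire semantics and whose release has release semantics.

typedef struct { char locked; } spinlock_t;   /* illustrative lock type */

static void spin_lock(spinlock_t *l)
{
    /* Acquire semantics: accesses inside the critical section cannot be
     * reordered before the lock acquisition. */
    while (__atomic_test_and_set(&l->locked, __ATOMIC_ACQUIRE))
        ;
}

static void spin_unlock(spinlock_t *l)
{
    /* Release semantics: accesses inside the critical section cannot be
     * reordered after the lock release. */
    __atomic_clear(&l->locked, __ATOMIC_RELEASE);
}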
TRANSACTIONAL-PAGE-TABLE requires that shared page table writes within a critical section are transactional. This ensures that page table writes will not result in any behavior on relaxed memory hardware that cannot be produced on an SC model. In RMM and EL3M, each critical section contains at most one page table write, so they are obviously transactional.

SEQUENTIAL-TLB-INVALIDATION requires that a page table unmap or remap be followed by a TLB invalidation, with a barrier between them. This precludes relaxed memory behavior in TLB management code. There are no remaps in RMM or EL3M. We verified that all page table unmaps are followed by a TLB invalidation with a barrier between them.

WRITE-ONCE-KERNEL-MAPPING requires that if RMM or EL3M's own page tables are shared, they can only be written once—only empty page table entries can be modified. This precludes relaxed memory behavior due to out-of-order reads of these page tables. For EL3M, this holds as it uses a statically reserved hardcoded page table shared across all CPUs that is never changed after booting. For RMM, although its kernel page table is shared across all CPUs and can be changed, we prove that it is logically partitioned into two tables, as discussed in Section 3. We prove one table is shared but never changed once initialized, and the other table is not shared because it is statically divided into per-CPU ranges private to each CPU.

MEMORY-ISOLATION requires that the memory space accessible by RMM and EL3M is partially isolated from Realms and NS hypervisors. This ensures that any relaxed memory behavior of Realms or NS hypervisors cannot be propagated to RMM or EL3M. We verify that Realms and the hypervisor will only access Data and NS granules. Realms' memory accesses are managed by RTTs; we prove RTTs will only map Data granules and NS granules. A hypervisor's memory accesses are controlled by the GPT. We prove all delegated granules are in the Realm PAS state in the GPT so the hypervisor cannot access them. We further prove that RMM and EL3M behavior does not rely on what Realms or the hypervisor may do with Data or NS granules. We prove EL3M never accesses memory other than its own, RMM will not access the contents of Data granules, and whenever RMM accesses NS granules, it may obtain arbitrary data because the hypervisor can make arbitrary changes to the data. Thus, we show RMM's proof on SC does not rely on the concrete implementation of Realms or NS hypervisors.

5.3 C and Assembly Code Integration

Another key aspect of the refinement proofs was verifying the interactions between RMM and EL3M, RMM and Realms, and RMM and the hypervisor, which required the C and assembly code integration techniques discussed in Section 4.3. For RMM and EL3M, we verified the correctness of GPT updates.

Figure 9: Verify RMM and EL3M GPT update operations. Solid arrows represent C code and dashed arrows represent assembly code.

Figure 9 shows how to verify a C primitive in RMM which issues an SMC to EL3M to update the GPT. Layer L0 verifies the C code for EL3M's GPT operations. Layer L1 verifies EL3M's assembly code handler, which handles traps from RMM and calls the GPT operations in C. Finally, layer L2 verifies the C code in RMM that traps to EL3M's assembly code handler.
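To make the layered call path concrete, here is a hedged sketch of the L2 side of Figure 9: a C primitive in RMM that requests a GPT update by issuing an SMC to EL3M through an assembly primitive, as Section 4.3 requires. The SMC function ID, argument layout, and names are invented for illustration; they are not the actual RMM/EL3M interface.

#include <stdint.h>

typedef uint64_t u64;

/* Hypothetical SMC function ID for "set this granule's PAS in the GPT". */
#define SMC_GPT_SET_PAS 0xC4000100UL
#define GPT_PAS_REALM   1UL

/* Assembly primitive (cf. Section 4.3) assumed to load its arguments into
 * x0-x2, execute smc #0 to trap to EL3M, and return EL3M's x0. */
extern u64 monitor_call(u64 fid, u64 arg0, u64 arg1);

/* RMM-side C primitive (layer L2 in Figure 9): request a GPT update. */
static inline int gpt_set_realm_pas(u64 granule_pa)
{
    return (int)monitor_call(SMC_GPT_SET_PAS, granule_pa, GPT_PAS_REALM);
}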
Figure 10: Verify REC.Run and its inner run_realm loop. Solid arrows represent C code and dashed arrows represent assembly code.

For RMM and Realms, we verified REC.Run, which runs a VCPU of a Realm and required five layers. Figure 10 shows this C primitive, which calls the run_realm assembly code primitive, which restores the Realm's VCPU contexts and enters the Realm. We proved that all GPRs are correctly restored such that there is no information leakage from RMM to the Realm through registers with Unknown values.

Figure 11: Verify rmm_handler in the top layer. Solid arrows represent C code and dashed arrows represent assembly code.

For RMM and the hypervisor, we verified the RMM handling of RMI calls from the hypervisor. Figure 11 shows that when the hypervisor invokes an RMI call, it traps to EL3M first, then jumps to RMM and calls the C function handle_ns_smc to execute the RMI call. Eventually, RMM returns to EL3M and then the hypervisor. We proved that when returning to the hypervisor, there is no information leakage to the hypervisor through GPRs with Unknown values.

5.4 Security

We prove that the real system specified by the RMM top-level specification simulates the ideal system model with declassification, as discussed in Section 4.4. We discuss the simulation relation in three parts: all machine states except for Data
granules, CPU registers, and VCPU contexts stored in REC declassified. Exclusive registers are not involved in context
granules (Rel 1), Data granules (Rel 2), and CPU registers and switches. We then prove that RMM indeed correctly saves and
VCPU contexts (Rel 3). Each relation is proved by induction, restores Realms’ VCPU contexts, so that Rel 3 is preserved.
in which we assume the relation is initial true at machine Finally, we note that our simulation proofs between the
boot and prove that it is preserved during RMM, hypervisor, real system and ideal secure system model verify Realm
and Realm execution so that the same data is obtained when confidentiality and integrity without even trusting the
accessing memory or registers in both real and ideal systems. correctness of the RMM or EL3M specifications. The proofs
We prove that Rel 1 is preserved during execution and only need to trust the specification of the ideal secure system
all data accessed from memory is the same. Rel 1 concerns model, which encodes the declassification rules and consists
NS granules, delegated granules, and granules containing of only .2K LOC in Coq. Furthermore, as shown in Table 3,
Realm metadata including RTTs, none of which involve the declassification rules only allow a Realm to disclose its
declassification. We prove two invariants: (1) all RTTs only data in two ways, by writing NS granules outside of its PAR
map IPAs within the respective Realm’s PAR to Data granules or via the eight GPRs used for hypercalls, making the security
and IPAs outside its PAR to NS granules; and (2) the GPT only policy formalization easy to understand.
labels NS granules in the NS PAS while all delegated granules
are labeled in the Realm PAS. The first invariant ensures that
Realms will only access Data and NS granules, and the former
5.5 Bugs Found
will not affect Rel 1. The second invariant ensures that the We identified several bugs in the CCA firmware prototype
hypervisor can only access NS granules. Since Realms and implementation during verification. Through refinement
the hypervisor access NS granules in the same non-exclusive proofs, we detected common bugs such as incorrect boundary
memory in both real and ideal systems, they will obtain the checking for some variables and misuse of locks; some
same data. All other granules for Rel 1 can only be accessed locks were released without previously holding them. More
by RMM. Since RMM accesses NS and other granules in the importantly, verification of C and assembly code integration
same non-exclusive memory in both real and ideal systems, identified a serious security bug that neither EL3M nor RMM
it will obtain the same data; the VCPU contexts that are part clear the caller-saved registers when returning to the hypervi-
of REC granules are excluded here and considered in Rel 3. sor. These registers may carry RMM’s private execution states
We prove that Rel 2 is preserved during execution. The and leak information. For example, RMM saves and restores
invariant above ensures that the hypervisor cannot access Data Realms’ VCPU contexts, and some contexts may remain in
granules, and we prove that RMM does not access Data gran- caller-saved registers and leak to the untrusted hypervisor.
ules, so Rel 2 is preserved for both the hypervisor and RMM. Another bug identified was in the REC execution handler. The
Data granules are only accessed by Realms. From Rel 1, the hypervisor provides an NS granule to communicate entry
RTTs must be the same in both real and ideal systems. If an and exit information with RMM. RMM locks and checks
RTT maps an ipa within a Realm’s PAR to a Data granule at that the given granule is indeed an NS granule, accesses its
host physical address hpa, the Realm will access the same data contents, unlocks the granule, and enters the Realm. However,
at exclusive memory ipa in the ideal system as at hpa in the when exiting from the Realm, RMM did not lock and check
real system, so Rel 2 is preserved. To ensure that an hpa cannot the granule state before accessing it. This may lead to RMM
be mapped to ipas in different Realms, we prove an invariant unexpectedly receiving a Granule Protection Fault (GPF) from
that if an RTT maps ipa to hpa, then the Data granule at hpa the hardware when accessing the granule using the NS PAS,
inversely maps to (Realm, ipa). Because there is a one-to-one if the granule was delegated by another CPU. This could lead
mapping for each Data granule to (Realm, ipa), any changes at to a denial of service of RMM or have worse consequences
hpa can only be observed by the specific Realm at the specific if GPF handling was not properly implemented in RMM.
ipa as is the case in the ideal system, so Rel 2 is preserved for Through permutation condition proofs, we identified
all other data. If an an ipa within a Realm’s PAR is Unknown, an RMM bug that REC.Destroy does not implement
the Realm will access the same data at non-exclusive memory “counter−−” with the release semantics (instruction (e) in
hpa in the ideal and real system, so Rel 2 is preserved. Figure 4) such that it can be reordered with (d) on Arm’s relax
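The one-to-one mapping invariant used for Rel 2 can be summarized in notation (ours, not the Coq development's): writing rtt(R, ipa) for the host physical address an RTT of Realm R maps ipa to, and owner(hpa) for the inverse recorded at a Data granule, the invariant is

\[
\forall R,\; ipa \in \mathrm{PAR}(R):\quad rtt(R, ipa) = hpa \;\Longrightarrow\; owner(hpa) = (R, ipa),
\]

so each Data granule hpa is mapped by at most one (Realm, ipa) pair, and any change at hpa is observable only by that Realm at that ipa.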
We prove that Rel 3 is preserved during execution. We prove that if a Realm's VCPU V is running, its register r in the real system equals the corresponding exclusive register r if not Unknown, or the non-exclusive register r if Unknown, in the ideal system. We prove that if a Realm's VCPU V is not running, V's REC context of r in the real system equals the corresponding exclusive register r if not Unknown, or V's REC context of r if Unknown, in the ideal system. In the ideal system, a Realm's register data is always stored in the exclusive registers except for those being declassified. Exclusive registers are not involved in context switches. We then prove that RMM indeed correctly saves and restores Realms' VCPU contexts, so that Rel 3 is preserved.

Finally, we note that our simulation proofs between the real system and ideal secure system model verify Realm confidentiality and integrity without even trusting the correctness of the RMM or EL3M specifications. The proofs only need to trust the specification of the ideal secure system model, which encodes the declassification rules and consists of only .2K LOC in Coq. Furthermore, as shown in Table 3, the declassification rules only allow a Realm to disclose its data in two ways, by writing NS granules outside of its PAR or via the eight GPRs used for hypercalls, making the security policy formalization easy to understand.

5.5 Bugs Found

We identified several bugs in the CCA firmware prototype implementation during verification. Through refinement proofs, we detected common bugs such as incorrect boundary checking for some variables and misuse of locks; some locks were released without previously holding them. More importantly, verification of C and assembly code integration identified a serious security bug: neither EL3M nor RMM cleared the caller-saved registers when returning to the hypervisor. These registers may carry RMM's private execution state and leak information. For example, RMM saves and restores Realms' VCPU contexts, and some contexts may remain in caller-saved registers and leak to the untrusted hypervisor.
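One simple shape such a fix can take is to scrub the caller-saved GPR slots of the saved exit frame before handing control back to the hypervisor. The sketch below is illustrative only; the structure name and register range are assumptions (x1-x18 are the caller-saved GPRs named in Section 4.3), not the actual RMM/EL3M patch.

struct exit_regs {
    unsigned long gpr[31];   /* saved x0-x30 at the exit boundary (hypothetical layout) */
};

/* Zero caller-saved registers before returning to the untrusted
 * hypervisor so stale RMM/Realm values cannot leak; x0 is kept
 * because it carries the return code. */
static void scrub_caller_saved(struct exit_regs *regs)
{
    for (int i = 1; i <= 18; i++)
        regs->gpr[i] = 0;
}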
prove if a Realm’s VCPU V is not running, V’s REC context of r Through security proofs, we identified an RMM bug that
in the real system equals the corresponding exclusive register allows the hypervisor to create two Data granules for the same
r if not Unknown or the V’s REC context of r if Unknown in the memory address of a Realm. Thus, RMM can unmap one Data
ideal system. In the ideal system, Realm’s register data is granule from an IPA of a Realm and map another Data granule
always stored in the exclusive registers except for those being to the same IPA, violating the Realm integrity guarantee,
because the Realm could observe a change in Realm data not caused by a Realm memory access.

5.6 CCA KVM

CCA provides a standard application binary interface (ABI) to allow hypervisors to communicate their intents to RMM via RMI commands, which is suitable for adoption by commodity hypervisors. However, existing hypervisors do require some modifications to use CCA to support Realm VMs. Regardless of whether a hypervisor is modified to use CCA, it cannot compromise the confidentiality and integrity of Realms. Without modifications, existing hypervisors cannot run Realm VMs, but can still run non-Realm VMs.

We modified the Linux KVM hypervisor to use CCA, which we refer to as CCA KVM. The modifications involved roughly 3K LOC in C to KVM, including .5K LOC for RMI commands, .4K LOC for handling exits from Realms, .8K LOC for creating and destroying Realms, and 1.1K LOC for stage 2 page table management using RMI commands. The modifications also required roughly .5K LOC in C to QEMU, mostly related to VM boot, initialization, and exit handling. Finally, roughly 40 LOC in C of modifications to the virtio driver in the Linux guest kernel were required so that it uses a bounce buffer to communicate I/O data with the hypervisor. This is needed because the ring buffer normally used by the virtio driver in the VM is in memory not accessible to the hypervisor when using Realms. Our experience with KVM indicates that the modifications required for a commodity hypervisor to use CCA are quite modest and involve changes to a very small percentage of its existing codebase.
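The bounce-buffer change amounts to copying I/O payloads through a small region of memory the guest has made accessible to the hypervisor, instead of handing the hypervisor pointers into protected Realm memory. The sketch below shows the idea for a transmit path; the buffer setup, names, and sizes are illustrative assumptions, not the actual Linux virtio/swiotlb code.

#include <stddef.h>
#include <string.h>

#define BOUNCE_SIZE 4096

/* Assumed to be a buffer in memory shared with (accessible to) the
 * hypervisor, e.g., set up during guest boot. */
static char bounce_buf[BOUNCE_SIZE];

/* Hypothetical helper that posts a buffer address to the device ring. */
extern int queue_to_device(void *buf, size_t len);

/* Copy the payload into the shared bounce buffer and expose only the
 * bounce buffer's address, so no pointer into protected Realm memory
 * is handed to the hypervisor. */
static int send_via_bounce(const void *data, size_t len)
{
    if (len > BOUNCE_SIZE)
        return -1;
    memcpy(bounce_buf, data, len);
    return queue_to_device(bounce_buf, len);
}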
6 Performance Evaluation

We have run the CCA software stack, including RMM, EL3M, and modifications to the Linux KVM hypervisor to use Realms, on an Arm Fast Model which implements the Realm Management Extensions (RME) CPU architecture. The Fast Model is a valid software emulation of the CPU architecture, allowing us to demonstrate that the CCA software stack provides the desired security guarantees and system functionality. However, Fast Models do not provide any cycle-accurate measure of real performance and are too slow to run real application workloads. While CCA will be available in Armv9-A, Armv9-A hardware is not yet available.

To provide a preliminary measure of CCA performance, we have ported the CCA software prototype to run on currently available Arm hardware, an Arm N1 System Development Platform (N1SDP) [5] with an Armv8.2-A Neoverse N1 SoC. This version of EL3M is based on the Trusted Firmware-A (TFA) codebase. The N1SDP does not provide GPT or Realm world hardware, so it cannot enforce the security guarantees of Realms, but we can use it to mimic the performance costs of Realms by modifying the EL3M code. Context switching between NS and Realm worlds is mimicked by modifying EL3M to switch between two separate contexts within NS world. EL3M is further modified to support the RMI as well as handle GPT update requests from RMM. We did not include EL3M code that controls GPT registers as they do not exist on the N1SDP, but all writes to GPT memory are still performed, although they have no effect.

This setup necessarily will have some performance differences from real CCA hardware, but it provides a useful approximation of actual Realm performance. The cost of GPT checks by CCA hardware is not included since no GPT hardware is available, but these checks are expected to exhibit good caching behavior and will not affect the relative performance of VMs versus Realm VMs since they apply equally in NS and Realm worlds. The cost of some hypervisor operations, such as those that require exiting to userspace, will be overly conservative, as controlling timer interrupt behavior requires those operations to write to the Arm Generic Interrupt Controller (GIC) on the N1SDP, which is slow, whereas real CCA hardware will have system registers that can be used by RMM to achieve the same functionality. Finally, the current prototype lacks support for directly injecting virtual interrupts without hypervisor intervention, which is expected to be available in future CCA hardware.

We ran both microbenchmark and application workloads in VMs on unmodified KVM and CCA KVM in Linux 5.12 on the N1SDP, which has two dual-core 2.6 GHz Neoverse N1 CPUs, 6 GB RAM, a 240 GB SATA3 SSD and an Intel 82574L 1 Gbps NIC. We used QEMU 4.2.0 [8] to run VMs, with the modifications discussed in Section 5.6 to support CCA KVM. VMs were run using KVM or CCA KVM with 4 cores and 1 GB RAM, with the VM capped at 2 VCPUs and 512 MB RAM; VCPUs were pinned to individual cores. VHOST networking was used and virtual block storage devices were configured with cache=none [28, 38, 56]. Arm VHE [6, 17, 18] was used for all measurements. For client-server workloads, clients ran on an x86 machine with a 16-core Intel Xeon E5-2690 2.9 GHz CPU, 378 GB RAM and an Intel I350 1 Gbps NIC, connected to the N1SDP via a Linksys LGS108 1 Gbps switch.
6.1 Microbenchmarks

We ran KVM unit tests [39], which execute common micro-level hypervisor operations, plus an additional system register access microbenchmark, as listed in Table 5. For each test, we ran it 216 times and report the average latency. Table 6 shows the microbenchmark measurements in nanoseconds for unmodified KVM and CCA KVM. The measurements show that the security benefits of the CCA design do come with a performance cost on most micro-level hypervisor operations, because the cost of transitioning between a VM and the hypervisor is much more expensive on CCA KVM than on unmodified KVM, which is most clearly shown for Hypercall.

Name         Description
Hypercall    Trap from a VM to the hypervisor and return to the VM immediately. Measures base transition cost of hypervisor operations.
I/O Kernel   Trap from a VM to the emulated interrupt controller in the host OS kernel and return to the VM. Measures cost of accessing I/O devices supported in kernel space.
I/O User     Trap from a VM to read the device ID of a virtio MMIO device then return to the VM. Measures base cost of operations that access I/O devices emulated in user space.
Virtual IPI  Issue a virtual IPI to another VCPU on a different CPU. Measures time from sending the virtual IPI until the receiving VCPU handles it.
Sysreg       Trap from a VM to emulate access to system register ID_AA64PFR0_EL1 in the hypervisor and return to the VM. Measures system register access cost.
Table 5: Microbenchmarks.

Benchmark   Hypercall   I/O Kernel   I/O User   Virtual IPI   Sysreg
KVM         362         549          1,761      1,806         437
CCA KVM     1,865       2,060        4,049      4,324         70
Table 6: Microbenchmark performance (ns).

Hypercall simply traps from the VM to the hypervisor in EL2 and returns for KVM, but involves additional operations for CCA KVM: (1) trap from the VM in EL1 to RMM in EL2; (2) map NS granule to copy exit info to NS world, unmap granule; (3) trap from RMM to EL3M in EL3; (4) save Realm context, restore NS context; (5) exception return from EL3M to the hypervisor in EL2; (6) trap from the hypervisor to EL3M in EL3; (7) save NS context, restore Realm context; (8) exception return from EL3M to RMM in EL2; (9) map NS granule to copy entry info from NS world, unmap granule; (10) map and read data in REC and RD granules, unmap granules; (11) exception return from RMM to the VM in EL1. The additional operations result in Hypercall costing an additional 1.5 µs on CCA KVM over vanilla KVM. Roundtrip transitions between RMM and the hypervisor take roughly 700 ns, and roundtrip transitions between the VM and RMM take roughly 60 ns. Saving and restoring system registers when transitioning between the VM and RMM takes roughly 200 ns per transition, or 400 ns total. The four map/unmap operations take roughly 100 ns each, 400 ns total. The remaining roughly 250 ns is due to other bookkeeping code, including saving and restoring GPRs and error checking.

I/O Kernel and I/O User include the same transition from the VM to the hypervisor and back as Hypercall, so they also take more than 1.5 µs longer to execute on CCA KVM than on vanilla KVM. Although the difference between CCA KVM and vanilla KVM is roughly 1.5 µs for I/O Kernel, the difference for I/O User is roughly 2.3 µs. This is because on the N1SDP, CCA KVM must write to the GIC when going to userspace, which is quite slow and takes an extra 800 ns.

Virtual IPI is more expensive on CCA KVM versus vanilla KVM because it involves multiple transitions between a VM and the hypervisor. Sending the virtual IPI involves the source vCPU writing to a system register, causing a trap to the RMM, which forwards the operation to the hypervisor (1). The hypervisor issues a physical IPI to the CPU running the destination vCPU, then returns to the source vCPU (2). The physical IPI causes an exit from the destination vCPU (3). On taking this exit, the hypervisor detects that there is a pending virtual IPI, and returns to the destination vCPU (4). Of these four transitions, approximately two occur in parallel, so the cost is roughly twice that of a Hypercall on CCA KVM for the transitions, plus the cost of the actual operation. Because Hypercall is much faster for unmodified KVM, its Virtual IPI cost is not dominated by the transition cost between VM and hypervisor.

The one microbenchmark that is much faster on CCA KVM than KVM is Sysreg. Accessing system registers is roughly 5 times as expensive on KVM versus CCA KVM. On CCA KVM, RMM handles this register access directly without returning to the hypervisor. RMM's system register trap handling mechanism is simpler than KVM's because it does not need to support KVM's more general hypervisor functionality, which requires synchronizing accesses to hypervisor-related data structures and additional conditional checks.

6.2 Application Benchmarks

We next ran the application benchmarks listed in Table 7 to measure performance on more realistic workloads. We also ran the workloads on native hardware running the same kernel to provide a baseline for comparison, restricting the system to use 2 CPUs and 512 MB RAM to provide a comparable configuration to the VMs. For each platform, we ran each workload 50 times and measured the average, worst, and best performance.

Name        Description
Apache      Apache server v2.4.41 handling 100 concurrent requests via TLS/SSL from remote ApacheBench [1] v2.3 client, serving the index.html of the GCC 7.5.0 manual.
Hackbench   Hackbench [54] using Unix domain sockets and 20 process groups running in 500 loops.
Kernbench   Compilation of the Linux kernel v4.18 using allnoconfig for Arm with GCC 9.3.0.
Memcached   Memcached v1.5.22 handling requests from a remote memtier [51] v1.2.11 client with default parameters.
MongoDB     MongoDB server v3.6.8 handling requests from a remote YCSB [14] v0.17.0 client running workload A with 16 concurrent threads and operationcount=500000.
MySQL       MySQL v8.0.27 running sysbench v1.0.11 with 32 concurrent threads and TLS encryption.
Redis       Redis v4.0.9 server handling requests from a remote redis-benchmark client (redis-tools v5.0.7) [52] running GET/SET with 50 parallel connections and 12 pipelined requests.
Table 7: Application benchmarks.

Figure 12: Application benchmark performance.

Figure 12 shows the average performance for each benchmark for unmodified KVM versus CCA KVM, with error bars indicating worst and best performance. Performance was normalized to average native execution on the N1SDP hardware; lower is better. Unlike the microbenchmark performance, the application benchmark performance shows that CCA KVM and KVM have much more modest performance differences on more realistic workloads.

CCA KVM has less than 8% overhead versus unmodified KVM for most workloads, but in the worst case, overhead was 18% for MongoDB, an I/O intensive workload. The I/O intensive workloads have higher overhead for a couple of reasons. The main reason is that the VM exits more frequently, so the cost of exits has a more significant impact on performance. Exits are more expensive on CCA KVM, as shown by the Hypercall microbenchmark results in Table 6, in which an exit to the hypervisor costs an extra 1.5 µs. If there are many exits, as will be the case for I/O intensive workloads, this additional cost can become significant. For example, Memcached incurs roughly a million VM exits to the hypervisor. This results in roughly 1.5 s of additional overhead, or .75 s of overhead per core if the exits are split evenly across cores for a VM with 2 VCPUs. Memcached takes 9 s to run on vanilla KVM, so this is 8% overhead due to the extra latency for exits on CCA KVM, which roughly matches the actual overhead measured for Memcached on CCA KVM versus vanilla KVM.
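The back-of-the-envelope arithmetic behind that estimate, in our notation, is:

\[
10^6 \ \text{exits} \times 1.5\,\mu\text{s} \approx 1.5\,\text{s}, \qquad \frac{1.5\,\text{s}}{2\ \text{VCPUs}} = 0.75\,\text{s per core}, \qquad \frac{0.75\,\text{s}}{9\,\text{s}} \approx 8\%.
\]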
A secondary reason is that CCA KVM needs to use a bounce buffer while vanilla KVM does not. CCA KVM needs a bounce buffer to support virtio because Realm memory is protected from the hypervisor. KVM uses the default virtio mechanism to directly access VM memory, so it does not require bounce buffers and does not need to perform the additional data copying. Since KVM can also be configured to use a bounce buffer, we also measured KVM with this configuration to isolate the impact of using a bounce buffer on performance. The overhead with versus without a bounce buffer was negligible in most cases, but in the worst case was as high as 3-4% for the more disk I/O intensive workloads, MongoDB and MySQL.

We expect the overheads for I/O intensive workloads on real CCA hardware to be less than what we measured on the N1SDP hardware. Exits are expected to occur less frequently on real CCA hardware when support for direct virtual interrupt injection is added. Exits that go to userspace are expected to cost less on real CCA hardware as the expensive GIC writes required for N1SDP hardware will be eliminated, though this was not a dominant factor in our results with the use of VHOST networking. This cost can be further mitigated by using device passthrough instead of paravirtual I/O, which will largely avoid these exits and their associated performance overhead. Support for Realm device passthrough will be added to future CCA hardware. Overall, our measurements indicate that CCA's security guarantees can be delivered with acceptable performance overheads for real application workloads.
7 Related Work

Hardware-enforced trusted execution environments have become an important feature of major computer architectures. Arm TrustZone [4] can be used to statically partition and isolate a memory region in Secure world, but most implementations only support a small number of such memory regions, limiting its scalability. Intel Software Guard Extensions (SGX) [33] can be used by application developers to protect userspace memory from other programs, including a potentially malicious OS or hypervisor. SGX is not suitable for securing VMs.

AMD Secure Encrypted Virtualization (SEV) [2] and Intel Trust Domain Extensions (TDX) [32] provide protection at the level of VMs with similar threat models to CCA. The initial version of SEV ensured confidentiality by encrypting VM memory at runtime, but did not ensure memory data integrity, which has been utilized as an attack vector such that a compromised hypervisor can tamper with or steal private VM data [31, 40, 47, 48, 60]. Secure Nested Paging (SNP) [3] now provides the previously missing integrity protection capability. SEV-SNP allows an untrusted hypervisor to directly manage NPTs, but checks accesses against a reverse map table, an additional data structure managed by a security co-processor. In contrast, Intel TDX runs a TDX module in a privileged SEAM (Secure-Arbitration Mode) root CPU mode. The firmware manages NPTs used by protected VMs in response to requests issued by the untrusted hypervisor. Unlike CCA, the security of SGX, SEV, SEV-SNP, and TDX relies on complex implementations in unverified microcode and firmware [12, 15]. They are difficult to update, either to patch security flaws or introduce new features.

Komodo [23] draws on ideas from SGX, but is implemented as a software monitor in verified Arm assembly code on top of TrustZone instead of requiring hardware to support complex enclave-manipulation instructions. This avoids hardware complexity and enables deployment of new enclave features independently of CPU upgrades. Komodo does not support multiprocessor execution, largely due to the challenge of verifying low-level concurrent code. CCA retains the advantages of Komodo's approach by relying on a verified software monitor to implement Realms, but supports verified VM protection and multiprocessor execution.

The idea of retrofitting a commodity hypervisor so that its security guarantees are enforced by a small trusted core was first explored by SeKVM [41–43, 57]. SeKVM was the first to show how this retrofitting approach, known as microverification, makes it possible to verify that a commodity hypervisor guarantees the confidentiality and integrity of VMs. CCA allows hypervisors to be modified to support Realm VMs, whose confidentiality and integrity are protected by a verified monitor, reminiscent of SeKVM. While SeKVM uses existing Arm hardware, CCA introduces new hardware mechanisms that protect VMs from untrusted software running in both NS and Secure world, and allow hypervisors to make full use of Arm virtualization features such as VHE for better performance. Furthermore, CCA firmware is designed to support a higher degree of scalability and concurrent operation by allowing data races, leveraging fine-grain synchronization, and enabling the hypervisor to provide fully dynamic memory allocation for all VM-related metadata.

While verifying CCA firmware required new VIA verification techniques, many of them build on previous work. Various concurrent systems have been verified, including CertiKOS [26, 27, 45], SeKVM, and CMAIL using CSPEC [10]. CertiKOS and SeKVM support sequential reasoning with a local CPU model and encapsulate other CPUs' behavior by rely/guarantee conditions, but do not support reordering using mover types, making proving hand-over-hand locking infeasible. Although hand-over-hand locking can theoretically be proved using rely/guarantee reasoning [58], the approach is not machine-checkable or scalable to a real system like RMM. CSPEC provides proof patterns with mover types, but lacks a local CPU model and does not verify C code; it offers little help for RMM code not reducible by movers (e.g. REC.Destroy in Figure 4) that still needs rely/guarantee reasoning to verify. VIA builds on CertiKOS, SeKVM, and CSPEC to combine a local CPU model with mover types.
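For reference, hand-over-hand (lock-coupling) traversal acquires the lock on the next node before releasing the lock on the current one, so a walk never holds more than two locks at once while other CPUs operate concurrently on disjoint parts of the structure. The following minimal C sketch illustrates the pattern on a generic multi-level table; the node layout and walk function are hypothetical and are not the RMM page-table code:

    /* Minimal sketch of hand-over-hand (lock-coupling) traversal of a
     * multi-level table. The node layout and helpers are hypothetical;
     * this is not the RMM page-table walker. */
    #include <pthread.h>
    #include <stddef.h>

    #define ENTRIES_PER_NODE 512

    struct node {
        pthread_mutex_t lock;
        struct node *children[ENTRIES_PER_NODE];
    };

    /* Walks the given number of levels, holding at most two locks (parent
     * and child) at any time: each child's lock is acquired before the
     * parent's is released, so concurrent walkers serialize only where
     * their paths overlap. */
    struct node *walk(struct node *root, const size_t *idx, int levels)
    {
        struct node *cur = root;
        pthread_mutex_lock(&cur->lock);
        for (int i = 0; i < levels; i++) {
            struct node *next = cur->children[idx[i]];
            if (next == NULL) {
                pthread_mutex_unlock(&cur->lock);
                return NULL;
            }
            pthread_mutex_lock(&next->lock);   /* acquire child ... */
            pthread_mutex_unlock(&cur->lock);  /* ... then release parent */
            cur = next;
        }
        return cur;   /* returned with cur->lock held; caller must release it */
    }

    int main(void)
    {
        static struct node root, leaf;         /* two-level toy table */
        pthread_mutex_init(&root.lock, NULL);
        pthread_mutex_init(&leaf.lock, NULL);
        root.children[3] = &leaf;

        size_t idx[] = { 3 };
        struct node *n = walk(&root, idx, 1);
        if (n != NULL)
            pthread_mutex_unlock(&n->lock);
        return 0;
    }

Because each step releases a lock while the walk is still in progress, intermediate states are visible to other CPUs, which is why the proofs described above rely on reduction with mover types rather than rely/guarantee reasoning alone.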
Some programs have been previously verified on relaxed memory hardware. Armada [46] supports verifying programs on the x86-TSO memory model, but their approach of verifying the entire program on a relaxed memory model has not been shown to scale to real systems such as RMM. VRM [57] instead allows proofs on an SC model to hold on relaxed memory hardware by ensuring certain conditions hold, making possible the verification of SeKVM, the first machine-checked proof for concurrent systems software on Arm relaxed memory hardware. VIA generalizes VRM to arbitrary non-DRF programs.

Verifying programs with both C and assembly code has been done to varying degrees, but none support bidirectional calls between them. seL4 [37] verifies C code, but its assembly code is unverified. CertiKOS relies on a verified x86 C compiler to verify assembly primitives invoking C primitives by compiling the invoked C primitives into assembly primitives, but cannot verify C primitives that invoke assembly primitives. Since no verified Arm C compiler exists, this approach cannot be used for CCA. SeKVM verifies C and Arm assembly code separately, but does not link the proofs, in part because no verified Arm C compiler exists. Komodo is written entirely in assembly code which is then verified, but this is difficult to scale to a large system as it is hard to write and maintain a large codebase in assembly. Ironclad [29] conducts verification at the assembly level by compiling programs in a high-level language down to assembly. This is also difficult to scale as it is harder to verify the much larger generated assembly code than the original high-level language implementation. VIA allows most proofs to be done at the C level while verifying that interactions between C and assembly code are safe.

Noninterference has been frequently used to prove information-flow security [16, 23, 29, 34, 42, 49, 55], but cannot be applied to RMM given the definition of data integrity and confidentiality supported by Realms. While most of these approaches rely on some static partitioning of memory to simplify their noninterference proofs, RMM imposes no such scalability limitations. The ideal/real simulation paradigm has been used to verify information-flow security of a simple 750 LOC two-user uniprocessor separation kernel without page tables [22], but we show for the first time how it can be applied in the presence of declassification to verify data confidentiality and integrity of a real system that supports modern multiprocessor and MMU hardware with page tables.

8 Conclusions

Arm CCA is the first confidential compute architecture backed by verified firmware that is correct and secure. CCA introduces Realms, secure execution environments that protect the confidentiality and integrity of VMs against untrusted system software such as hypervisors. Realms are made possible by hardware support for Realm world, a new physical address space for Realms inaccessible to untrusted system software, and a firmware monitor that runs in Realm world to control CCA hardware to secure and manage Realms, including handling requests from untrusted hypervisors to create Realms, run Realms, and allocate memory to Realms. This design maintains compatibility with the Arm architecture without introducing complex hardware mechanisms by relying on firmware, and avoids complexity in the firmware by relying on existing hypervisors to provide virtualization functionality.

We formally verified CCA firmware, demonstrating the feasibility of relying on trustworthy firmware for the security guarantees of the architecture. We introduced various verification techniques to make it possible to verify for the first time concurrent firmware with data races running on relaxed memory hardware, fine-grain synchronization such as hand-over-hand locking, dynamically allocated shared multi-level page tables, and integrated C and assembly code. We also prove the security guarantees despite untrusted software being in full control of resource allocation decisions. The proof only needs to trust roughly two hundred lines of Coq specification, making the formal security guarantees easy to read and understand. CCA provides its security guarantees with only modest performance overhead compared to running VMs with the Linux KVM hypervisor without verified VM protection.

9 Acknowledgments

Andrew Baumann and Charles Garcia-Tobin provided helpful comments on earlier drafts. This work was supported in part by Arm, OPPO, an Amazon Research Award, a Guggenheim Fellowship, DARPA contract N66001-21-C-4018, and NSF grants CCF-1918400, CNS-2052947, and CCF-2124080. Ronghui Gu is the Founder of and has an equity interest in CertiK.
References

[1] ab, The Apache Software Foundation. http://httpd.apache.org/docs/2.4/programs/ab.html, April 2015.

[2] Advanced Micro Devices. Secure Encrypted Virtualization API Version 0.16. https://support.amd.com/TechDocs/55766_SEV-KM%20API_Spec.pdf, February 2018.

[3] Advanced Micro Devices. AMD SEV-SNP: Strengthening VM Isolation with Integrity Protection and More. https://www.amd.com/system/files/TechDocs/SEV-SNP-strengthening-vm-isolation-with-integrity-protection-and-more.pdf, January 2020.

[4] ARM Ltd. ARM Security Technology Building a Secure System using TrustZone Technology. https://documentation-service.arm.com/static/5f212796500e883ab8e74531, April 2009.

[5] ARM Ltd. Arm Neoverse N1 Core Technical Reference Manual. https://developer.arm.com/documentation/100616/0400/, April 2019.

[6] ARM Ltd. Virtualization Host Extensions. https://developer.arm.com/documentation/102142/0100/Virtualization-Host-Extensions, January 2019.

[7] ARM Ltd. Procedure Call Standard for the Arm 64-bit Architecture (AArch64). https://github.com/ARM-software/abi-aa/releases/download/2022Q1/aapcs64.pdf, April 2022.

[8] Fabrice Bellard. QEMU, a Fast and Portable Dynamic Translator. In Proceedings of the USENIX 2005 Annual Technical Conference, FREENIX Track (FREENIX 2005), pages 41–46, Anaheim, CA, April 2005.

[9] Edouard Bugnion, Jason Nieh, and Dan Tsafrir. Hardware and Software Support for Virtualization. Synthesis Lectures on Computer Architecture. Morgan and Claypool Publishers, February 2017.

[10] Tej Chajed, M. Frans Kaashoek, Butler Lampson, and Nickolai Zeldovich. Verifying concurrent software using movers in CSPEC. In Proceedings of the 13th Symposium on Operating Systems Design and Implementation (OSDI 2018), pages 306–322, Carlsbad, CA, October 2018.

[11] Tej Chajed, Joseph Tassarotti, M. Frans Kaashoek, and Nickolai Zeldovich. Verifying concurrent, crash-safe systems with Perennial. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP 2019), pages 243–258, Huntsville, ON Canada, October 2019.

[12] Anrin Chakraborti, Reza Curtmola, Jonathan Katz, Jason Nieh, Ahmad-Reza Sadeghi, Radu Sion, and Yinqian Zhang. Cloud Computing Security: Foundations and Research Directions. Foundations and Trends in Privacy and Security, 3(2):103–213, February 2022.

[13] Hao Chen, Xiongnan (Newman) Wu, Zhong Shao, Joshua Lockerman, and Ronghui Gu. Toward Compositional Verification of Interruptible OS Kernels and Device Drivers. In Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2016), pages 431–447, Santa Barbara, CA, June 2016.

[14] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. Benchmarking Cloud Serving Systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing (SoCC 2010), pages 143–154, Indianapolis, IN, June 2010.

[15] Victor Costan and Srinivas Devadas. Intel SGX Explained. Cryptology ePrint Archive, Report 2016/086, January 2016. https://ia.cr/2016/086.

[16] David Costanzo, Zhong Shao, and Ronghui Gu. End-to-End Verification of Information-Flow Security for C and Assembly Programs. In Proceedings of the 37th ACM Conference on Programming Language Design and Implementation (PLDI 2016), pages 648–664, Santa Barbara, CA, June 2016.

[17] Christoffer Dall, Shih-Wei Li, Jin Tack Lim, Jason Nieh, and Georgios Koloventzos. ARM Virtualization: Performance and Architectural Implications. In Proceedings of the 43rd International Symposium on Computer Architecture (ISCA 2016), pages 304–316, Seoul, South Korea, June 2016.

[18] Christoffer Dall, Shih-Wei Li, and Jason Nieh. Optimizing the Design and Implementation of the Linux ARM Hypervisor. In Proceedings of the 2017 USENIX Annual Technical Conference (USENIX ATC 2017), pages 221–234, Santa Clara, CA, July 2017.

[19] Christoffer Dall and Jason Nieh. KVM/ARM: Experiences Building the Linux ARM Hypervisor. Technical Report CUCS-010-13, Department of Computer Science, Columbia University, June 2013.

[20] Christoffer Dall and Jason Nieh. Supporting KVM on the ARM Architecture. LWN Weekly Edition, pages 18–22, July 2013.
[21] Christoffer Dall and Jason Nieh. KVM/ARM: The Design and Implementation of the Linux ARM Hypervisor. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2014), pages 333–347, Salt Lake City, UT, March 2014.

[22] Mads Dam, Roberto Guanciale, Narges Khakpour, Hamed Nemati, and Oliver Schwarz. Formal Verification of Information Flow Security for a Simple ARM-Based Separation Kernel. In Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security (CCS 2013), pages 223–234, Berlin, Germany, November 2013.

[23] Andrew Ferraiuolo, Andrew Baumann, Chris Hawblitzel, and Bryan Parno. Komodo: Using verification to disentangle secure-enclave hardware from software. In Proceedings of the 26th Symposium on Operating Systems Principles (SOSP 2017), pages 287–305, Shanghai, China, October 2017.

[24] Ronghui Gu, Jérémie Koenig, Tahina Ramananandro, Zhong Shao, Xiongnan Newman Wu, Shu-Chun Weng, and Haozhong Zhang. Deep Specifications and Certified Abstraction Layers. In Proceedings of the 42nd ACM Symposium on Principles of Programming Languages (POPL 2015), pages 595–608, Mumbai, India, January 2015.

[25] Ronghui Gu, Zhong Shao, Hao Chen, Jieung Kim, Jérémie Koenig, Xiongnan Wu, Vilhelm Sjöberg, and David Costanzo. Building Certified Concurrent OS Kernels. Communications of the ACM, 62(10):89–99, September 2019.

[26] Ronghui Gu, Zhong Shao, Hao Chen, Xiongnan Newman Wu, Jieung Kim, Vilhelm Sjöberg, and David Costanzo. CertiKOS: An Extensible Architecture for Building Certified Concurrent OS Kernels. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2016), pages 653–669, Savannah, GA, November 2016.

[27] Ronghui Gu, Zhong Shao, Jieung Kim, Xiongnan Newman Wu, Jérémie Koenig, Vilhelm Sjöberg, Hao Chen, David Costanzo, and Tahina Ramananandro. Certified Concurrent Abstraction Layers. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2018), pages 646–661, Philadelphia, PA, June 2018.

[28] Stefan Hajnoczi. An Updated Overview of the QEMU Storage Stack. In LinuxCon Japan 2011, Yokohama, Japan, June 2011.

[29] Chris Hawblitzel, Jon Howell, Jacob R. Lorch, Arjun Narayan, Bryan Parno, Danfeng Zhang, and Brian Zill. Ironclad Apps: End-to-End Security via Automated Full-System Verification. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2014), pages 165–181, Broomfield, CO, October 2014.

[30] Constance L. Heitmeyer, Myla Archer, Elizabeth I. Leonard, and John McLean. Formal Specification and Verification of Data Separation in a Separation Kernel for an Embedded System. In Proceedings of the 13th ACM Conference on Computer and Communications Security (CCS 2006), pages 346–355, Alexandria, Virginia, October 2006.

[31] Felicitas Hetzelt and Robert Buhren. Security Analysis of Encrypted Virtual Machines. In Proceedings of the 13th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE 2017), pages 129–142, Xi'an, China, April 2017.

[32] Intel Corporation. Intel Trust Domain Extensions. https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html, October 2014.

[33] Intel Corporation. Intel Software Guard Extensions Programming Reference. https://software.intel.com/sites/default/files/managed/48/88/329298-002.pdf, May 2021.

[34] Dongseok Jang, Zachary Tatlock, and Sorin Lerner. Establishing Browser Security Guarantees through Formal Shim Verification. In Proceedings of the 21st USENIX Security Symposium (USENIX Security 2012), pages 113–128, Bellevue, WA, August 2012.

[35] C. B. Jones. Tentative Steps toward a Development Method for Interfering Programs. ACM Transactions on Programming Languages and Systems (TOPLAS), 5(4):596–619, October 1983.

[36] Jieung Kim, Vilhelm Sjöberg, Ronghui Gu, and Zhong Shao. Safety and Liveness of MCS Lock—Layer by Layer. In Proceedings of the Asian Symposium on Programming Languages and Systems (APLAS 2017), pages 273–297, Suzhou, China, November 2017.

[37] Gerwin Klein, Kevin Elphinstone, Gernot Heiser, June Andronick, David Cock, Philip Derrin, Dhammika Elkaduwe, Kai Engelhardt, Rafal Kolanski, Michael Norrish, Thomas Sewell, Harvey Tuch, and Simon Winwood. seL4: Formal Verification of an OS Kernel. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles (SOSP 2009), pages 207–220, Big Sky, MT, October 2009.
[38] KVM contributors. Tuning KVM. http://www.linux-kvm.org/page/Tuning_KVM, May 2015.

[39] KVM contributors. KVM Unit Tests. http://www.linux-kvm.org/page/KVM-unit-tests, August 2020.

[40] Mengyuan Li, Yinqian Zhang, Zhiqiang Lin, and Yan Solihin. Exploiting Unprotected I/O Operations in AMD's Secure Encrypted Virtualization. In Proceedings of the 28th USENIX Security Symposium (USENIX Security 2019), pages 1257–1272, Santa Clara, CA, August 2019.

[41] Shih-Wei Li, John S. Koh, and Jason Nieh. Protecting Cloud Virtual Machines from Commodity Hypervisor and Host Operating System Exploits. In Proceedings of the 28th USENIX Security Symposium (USENIX Security 2019), pages 1357–1374, Santa Clara, CA, August 2019.

[42] Shih-Wei Li, Xupeng Li, Ronghui Gu, Jason Nieh, and John Zhuang Hui. A Secure and Formally Verified Linux KVM Hypervisor. In Proceedings of the 2021 IEEE Symposium on Security and Privacy (IEEE S&P 2021), pages 1782–1799, San Francisco, CA, May 2021.

[43] Shih-Wei Li, Xupeng Li, Ronghui Gu, Jason Nieh, and John Zhuang Hui. Formally Verified Memory Protection for a Commodity Multiprocessor Hypervisor. In Proceedings of the 30th USENIX Security Symposium (USENIX Security 2021), pages 3953–3970, Vancouver, BC Canada, August 2021.

[44] Richard J. Lipton. Reduction: A Method of Proving Properties of Parallel Programs. Communications of the ACM, 18(12):717–721, December 1975.

[45] Mengqi Liu, Lionel Rieg, Zhong Shao, Ronghui Gu, David Costanzo, Jung-Eun Kim, and Man-Ki Yoon. Virtual Timeline: A Formal Abstraction for Verifying Preemptive Schedulers with Temporal Isolation. Proceedings of the ACM on Programming Languages, 4(POPL):1–31, December 2019.

[46] Jacob R. Lorch, Yixuan Chen, Manos Kapritsos, Bryan Parno, Shaz Qadeer, Upamanyu Sharma, James R. Wilcox, and Xueyuan Zhao. Armada: Low-Effort Verification of High-Performance Concurrent Programs. In Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2020), pages 197–210, London, UK, June 2020.

[47] Mathias Morbitzer, Manuel Huber, and Julian Horsch. Extracting Secrets from Encrypted Virtual Machines. In Proceedings of the 9th ACM Conference on Data and Application Security and Privacy (CODASPY 2019), pages 221–230, Dallas, TX, March 2019.

[48] Mathias Morbitzer, Manuel Huber, Julian Horsch, and Sascha Wessel. SEVered: Subverting AMD's Virtual Machine Encryption. In Proceedings of the 11th European Workshop on Systems Security (EuroSec 2018), pages 1–6, Porto, Portugal, April 2018.

[49] Toby Murray, Daniel Matichuk, Matthew Brassil, Peter Gammie, Timothy Bourke, Sean Seefried, Corey Lewis, Xin Gao, and Gerwin Klein. seL4: from General Purpose to a Proof of Information Flow Enforcement. In Proceedings of the 2013 IEEE Symposium on Security and Privacy (IEEE S&P 2013), pages 415–429, San Francisco, CA, May 2013.

[50] Luke Nelson, James Bornholt, Ronghui Gu, Andrew Baumann, Emina Torlak, and Xi Wang. Scaling Symbolic Evaluation for Automated Verification of Systems Code with Serval. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP 2019), pages 225–242, Huntsville, ON Canada, October 2019.

[51] Redis Labs. Memtier Benchmark. https://github.com/RedisLabs/memtier_benchmark, January 2018.

[52] Redis Labs. Redis Benchmark. https://redis.io/docs/reference/optimization/benchmarks/, March 2022.

[53] Richard M. Stallman and the GCC Developer Community. Using the GNU Compiler Collection (GCC). https://gcc.gnu.org/onlinedocs/gcc-12.1.0/gcc.pdf, May 2022.

[54] Rusty Russell. Hackbench. http://people.redhat.com/mingo/cfs-scheduler/tools/hackbench.c, January 2008.

[55] Helgi Sigurbjarnarson, Luke Nelson, Bruno Castro-Karney, James Bornholt, Emina Torlak, and Xi Wang. Nickel: A Framework for Design and Verification of Information Flow Control Systems. In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2018), pages 287–305, Carlsbad, CA, October 2018.

[56] SUSE. Performance Implications of Cache Modes. https://www.suse.com/documentation/sles11/book_kvm/data/sect1_3_chapter_book_kvm.html, September 2016.

[57] Runzhou Tao, Jianan Yao, Xupeng Li, Shih-Wei Li, Jason Nieh, and Ronghui Gu. Formal Verification of a Multiprocessor Hypervisor on Arm Relaxed Memory Hardware. In Proceedings of the 28th ACM Symposium on Operating Systems Principles (SOSP 2021), pages 866–881, Virtual Event, Germany, October 2021.

[58] Viktor Vafeiadis, Maurice Herlihy, Tony Hoare, and Marc Shapiro. Proving Correctness of Highly-Concurrent Linearisable Objects. In Proceedings of the 11th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2006), pages 129–136, New York, NY, March 2006.