LSSEU2019 - Exploiting Race Conditions On Linux

Bug 1 is a race condition between mremap() and fallocate() that leads to a physical-page use-after-free; its impact is biggest on Linux kernels before 4.9. The exploit reallocates a freed page as page cache for privileged code, and uses preemption and scheduler control to make the race window wide enough for the disk I/O this requires. Bug 2 is a refcount overdecrement on struct file, exploited with FUSE and kcmp(). Bug 3 is a userspace race in hwservicemanager's use of getpidcon(), widened with i_mutex contention and priority inversion.

v1

Exploiting race conditions on [ancient] Linux


Jann Horn, Google Project Zero

(if this text is too small for you to read, maybe open the slides on your laptop)

slides at: https://sched.co/TynD


Confidential + Proprietary
Introduction
● bugs described here have been fixed for a long time
● all exploits against kernel 4.4
● focus on exploitation techniques, not impact of individual bugs

Agenda
● physical-page use-after-free via stale TLB [bug 1]
○ [kernel bug, PoC for Google Pixel 2]
○ buddy allocator
○ preemption and scheduler control
● refcount decrement on struct file [bug 2]
○ [kernel bug, PoC for Ubuntu 16.04]
○ userfaultfd() and FUSE
○ kcmp()
● a poor substitute for FUSE/userfaultfd [bug 3]
○ [userspace bug, PoC for Google Pixel 2]
○ i_mutex on kernel 4.4
○ priority inversion
○ repeated file mapping faults

Bug 1: mremap()+fallocate() race

Translation Lookaside Buffer (TLB)

● in-CPU cache for page table entries (PTEs)
● PTEs are essentially refcounted page pointers
● TLB borrows references from PTEs
● kernel can invalidate TLB for virtual address ranges
○ x86: IPI for remote CPUs
○ arm64: magic system-wide TLBI instruction

mremap(): moving a memory mapping

● moves associated page table entries (PTEs)
● has to flush the TLB for the old address range

[diagram: old mapping -> new mapping]
1. move PTEs (also moves references)
2. flush TLB (old mapping becomes inaccessible)
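
To make the two steps concrete, here is a minimal userspace sketch of moving a mapping with mremap(). The helper name move_mapping_demo() and the PROT_NONE reservation used as the MREMAP_FIXED target are illustrative choices, not taken from the slides or from the exploit.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Illustrative helper (hypothetical name): moves a one-page anonymous
 * mapping to a reserved address and checks that the data followed it.
 * Returns 0 on success, -1 on failure.
 */
int move_mapping_demo(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);

    /* Old mapping: one writable anonymous page with recognizable content. */
    char *old_addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (old_addr == MAP_FAILED)
        return -1;
    strcpy(old_addr, "moved");

    /* Reserve a destination range so that MREMAP_FIXED has a safe target. */
    char *target = mmap(NULL, len, PROT_NONE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (target == MAP_FAILED) {
        munmap(old_addr, len);
        return -1;
    }

    /* The kernel moves the PTEs, then flushes the TLB for the old range. */
    char *new_addr = mremap(old_addr, len, len,
                            MREMAP_MAYMOVE | MREMAP_FIXED, target);
    if (new_addr == MAP_FAILED) {
        munmap(old_addr, len);
        munmap(target, len);
        return -1;
    }

    int ok = (new_addr == target) && (strcmp(new_addr, "moved") == 0);
    munmap(new_addr, len);
    return ok ? 0 : -1;
}
```

A caller could check the result with assert(move_mapping_demo() == 0). After the call returns, the old address range is no longer mapped.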

fallocate(): (de)allocate space for a file

● interesting case: punch a hole in a file
● file pages in the hole are:
○ yanked out of all mappings (in all processes)
○ released once all references are gone

[diagram: file page referenced by a mapping]
1. delete PTEs
2. flush TLB
3. drop references
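
A hedged sketch of the hole-punching case, seen from userspace. It assumes a glibc that provides memfd_create() (2.27 or later) and uses a tmpfs-backed memfd because that backing is known to support FALLOC_FL_PUNCH_HOLE; the helper name punch_hole_demo() is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <linux/falloc.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Illustrative helper (hypothetical name): maps one page of an in-memory
 * file, punches a hole over it, and reports what the mapping reads
 * afterwards. Returns the first byte seen through the mapping after the
 * hole punch, or -1 if any step fails.
 */
int punch_hole_demo(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);

    /* memfd_create() gives a tmpfs-backed file. */
    int fd = memfd_create("punch-demo", 0);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }

    char *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }
    memset(map, 'A', len);   /* populate the page and its PTE */

    /* Punching the hole yanks the page out of the mapping and the file. */
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  0, (off_t)len) != 0) {
        munmap(map, len);
        close(fd);
        return -1;
    }

    /* The next access faults in a fresh zero-filled page. */
    int first = (unsigned char)map[0];
    munmap(map, len);
    close(fd);
    return first;
}
```

On a kernel and filesystem that support hole punching, assert(punch_hole_demo() == 0) should hold: the 'A' bytes are gone because the old page was removed from the mapping.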

Bug 1: mremap()+fallocate() race
crbug.com/project-zero/1695

● mremap() holds no relevant lock between moving PTEs and TLB flush, fallocate() possible in between
● fallocate() drops page references after its TLB flush
● stale TLB entry for old mapping permits physical-page use-after-free between dropping page reference and flushing TLB for old mapping

[diagram: old mapping -> new mapping; numbers give the racy order of events]
1. mremap(): move PTEs (also moves references)
2. fallocate(): delete PTEs
3. fallocate(): flush TLB
4. fallocate(): drop references (can free pages)
5. mremap(): flush TLB (old mapping becomes inaccessible)

Exploit plan: Basics
● biggest impact on Linux <4.9; exploiting for write access is much harder on newer kernels
● goal: Pixel 2 (Linux 4.4) exploit
● exploit idea: reallocate freed page with kernel data

[diagram: old mapping -> new mapping; TLB flush pending, pages freed => physical UAF!]

Buddy allocator
[diagram (highly simplified, not entirely correct)]
● percpu freelist, cpu X, MIGRATE_MOVABLE [with UAF page]
● page freelists, MIGRATE_MOVABLE: order 0, order 1, order 2
● page freelists, MIGRATE_UNMOVABLE: order 0, order 1, order 2
● percpu freelist, cpu X, MIGRATE_UNMOVABLE
● SLAB, e.g. kmalloc-256

Exploit plan
● biggest impact on Linux <4.9; exploiting for write access is much harder on newer kernels
● goal: Pixel 2 (Linux 4.4) exploit
● exploit idea: reallocate freed page with kernel data - ✘, looked too messy
● exploit idea: reallocate freed page as page cache for privileged code ✔
○ requires disk I/O within the race window
○ need to make the mremap() race window wide enough for disk I/O
● race window detectable through procfs

[diagram: old mapping -> new mapping; TLB flush pending, pages freed => physical UAF!]

Preemption

[diagram: CPU 0 runs task A; CPU 1 has task B (blocked) and task C (running); task A's wakeup of task B sends an IPI to CPU 1]

● waking up a task can cause a scheduler Inter-Processor Interrupt (IPI)
○ depending on policy, priority and past CPU usage [see check_preempt_wakeup()]
● Linux supports three kernel preemption models (no forced preemption, voluntary, full); relevant here:
○ "voluntary" preemption can yield the CPU at cond_resched() [called in ~1000 places]
■ used by many Linux distributions by default
○ full preemption
■ enabled on Android
■ syscall context interruptible directly via inter-processor interrupt (IPI) on task wakeup
■ no preemption in some code regions (holding a spinlock [/ preemption explicitly disabled / interrupts disabled])
■ [preemption requests in critical region are delivered on critical section exit]
■ mutexes don't block preemption!

Scheduler control

[diagram: as on the Preemption slide, with the two tasks on CPU 1 labeled SCHED_IDLE and SCHED_NORMAL]

● sched_setscheduler(): set SCHED_NORMAL / SCHED_IDLE
○ [realtime policies require CAP_SYS_NICE or RealtimeKit]
● on busy CPU, SCHED_IDLE has infrequent wakeups
● SCHED_IDLE never preempts
● sched_setaffinity(): pin task to CPU bitmask
● also affects execution in kernel mode!
➢ pin two own tasks to a single CPU
➢ set different scheduling classes
➢ interrupt kernel code execution
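
The two syscalls named above can be wrapped as follows. This is a sketch under stated assumptions: the helper names (pin_self_to_cpu, set_self_policy, first_allowed_cpu) are hypothetical, and the code only shows the unprivileged calls, not the exploit's actual task setup.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/*
 * Hypothetical helper: pin the calling thread to a single CPU.
 * Returns 0 on success, -1 on error (errno is set).
 */
int pin_self_to_cpu(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);   /* pid 0 = caller */
}

/*
 * Hypothetical helper: switch the caller to SCHED_IDLE or SCHED_OTHER
 * (the userspace name for SCHED_NORMAL). Both require priority 0.
 */
int set_self_policy(int policy)
{
    struct sched_param param = { .sched_priority = 0 };

    return sched_setscheduler(0, policy, &param);
}

/* Hypothetical helper: first CPU the caller may run on, or -1 on error. */
int first_allowed_cpu(void)
{
    cpu_set_t set;

    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}
```

In the setup the slide describes, one task would call pin_self_to_cpu(n) and set_self_policy(SCHED_IDLE), while a second task pinned to the same CPU keeps the default policy, so that waking the second task interrupts the first even while it executes kernel code.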

Android kernel exploit (app -> zygote)
[timing diagram, listed per task]
● task A [pinned to CPU 0, normal priority]: busyloop; keeps task B off the CPU once task B has been moved to CPU 0
● task B [pinned to CPU 1, idle priority]: mremap(...); once the PTE is moved, task B is preempted
● task C [pinned to CPU 1, normal priority]:
○ read(<pipe>) [yield to task B by waiting]
○ wakeup [task B yields by preemption]
○ sched_setaffinity(task B, CPU 0) so that we can block without yielding to B
○ fallocate(...) [free the pages]
○ pread(<zygote code>) [reuse one of the pages as page cache]
● task D [pinned to CPU 2, normal priority]: busyloop reading procfs stats to detect mremap(), then write(<pipe>)
● task E [pinned to CPU 3, normal priority]: busyloop attempting to write through old mapping [keep TLB entry alive]; finally overwrites the page

Bug 2: refcount decrement on struct file
(yes, the bug doesn't involve a race condition, but the exploit kinda does)

userfaultfd and FUSE
● userfaultfd() and FUSE allow userspace to synchronously handle page faults
● => userspace can block arbitrarily at copy_from_user()/copy_to_user()
● userfaultfd() and FUSE are not exposed to unprivileged Android code

=> not applicable on Android, but relevant on desktop Linux

FUSE for exploiting struct file refcount overdecrement in Linux 4.4
bug from 2016 to illustrate FUSE-based use-after-free exploitation
● file reference acquired with fdget()
● error path accidentally called fdput() twice
● struct file freed prematurely
● use-after-free
● exploited on Ubuntu 16.04

    f = fdget(insn->imm);
    map = __bpf_map_get(f);
    if (IS_ERR(map)) {
        verbose("fd %d is not pointing to valid bpf_map\n", insn->imm);
        fdput(f);    /* second fdput() if __bpf_map_get() already dropped f */
        return PTR_ERR(map);
    }

    struct bpf_map *__bpf_map_get(struct fd f)
    {
        if (!f.file)
            return ERR_PTR(-EBADF);
        if (f.file->f_op != &bpf_map_fops) {
            fdput(f);    /* first fdput() on the -EINVAL path */
            return ERR_PTR(-EINVAL);
        }

        return f.file->private_data;
    }

crbug.com/project-zero/808

kcmp() for reliable UAF
● CONFIG_CHECKPOINT_RESTORE
● smaller/equal/greater comparison between permuted kernel pointers
● intended for grouping same-object references in O(n log(n))
● works on:
○ struct file
○ struct mm_struct
○ struct files_struct
○ struct fs_struct
○ struct sighand_struct
○ struct io_context
○ struct sem_undo_list
● tag reuse oracle for Memory Tagging unless tag bits are ignored

    static long kptr_obfuscate(long v, int type)
    {
        return (v ^ cookies[type][0]) * cookies[type][1];
    }

    static int kcmp_ptr(void *v1, void *v2, enum kcmp_type type)
    {
        long t1, t2;

        t1 = kptr_obfuscate((long)v1, type);
        t2 = kptr_obfuscate((long)v2, type);

        return (t1 < t2) | ((t1 > t2) << 1);
    }
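
The following is a userspace model of the comparison above, not the kernel code itself. The cookie constants are made up for illustration (the kernel picks random values at boot and forces the multiplier to be odd), and the demo_* names are hypothetical. It shows why a result of 0 works as an equality oracle: with an odd multiplier the obfuscation is a bijection, so only identical pointers compare equal.

```c
#include <assert.h>
#include <stdint.h>

/* Made-up cookies for illustration only; the multiplier must be odd. */
static const uint64_t demo_cookie_xor = 0x5bd1e9955bd1e995ULL;
static const uint64_t demo_cookie_mul = 0x9e3779b97f4a7c15ULL;

/* Model of kptr_obfuscate(): XOR, then wrapping multiply. */
static int64_t demo_obfuscate(uint64_t v)
{
    return (int64_t)((v ^ demo_cookie_xor) * demo_cookie_mul);
}

/* Model of kcmp_ptr(): 0 = same pointer, 1 = "less", 2 = "greater". */
int demo_kcmp_ptr(const void *v1, const void *v2)
{
    int64_t t1 = demo_obfuscate((uint64_t)(uintptr_t)v1);
    int64_t t2 = demo_obfuscate((uint64_t)(uintptr_t)v2);

    return (t1 < t2) | ((t1 > t2) << 1);
}
```

The ordering of distinct pointers is scrambled by the cookies, but equality survives, which is what lets an exploit confirm that a new object landed at a freed object's address.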

FUSE for exploiting struct file refcount overdecrement in Linux 4.4
● create FUSE mapping
● open writable file (/dev/null)
● start writev() with iov in FUSE mapping
● write mode check passes
● import_iovec() stalls on page fault
● trigger bug to free the file
● open /etc/crontab as read-only
● verify that struct file was allocated at the same address with kcmp() (else re-open /etc/crontab)
● resolve FUSE page fault
● writev() writes into /etc/crontab

    ssize_t vfs_writev(struct file *file, const struct iovec __user *vec, [...]) {
        if (!(file->f_mode & FMODE_WRITE))
            return -EBADF;
        [...]
        return do_readv_writev(WRITE, file, vec, vlen, pos);
    }

    static ssize_t do_readv_writev(int type, struct file *file,
            const struct iovec __user * uvector, unsigned long nr_segs,
            loff_t *pos) {
        [...]
        ret = import_iovec(type, uvector, nr_segs,
                ARRAY_SIZE(iovstack), &iov, &iter);
        [...]
        if (iter_fn)
            ret = do_iter_readv_writev(file, &iter, pos, iter_fn);
        [...]
    }

Bug 3: use of getpidcon()

int getpidcon(pid_t pid, char **context)

● userspace daemons need to check peer SELinux contexts
● unix domain sockets: SO_PEERSEC
● Android binder: until recently no context name, only sender PID

fd = open("/proc/$pid/attr/current", O_RDONLY)
read(fd, buf, len)

Bug 3: race condition in hwservicemanager
crbug.com/project-zero/1741

● receive binder IPC call (with caller PID)
● getpidcon(pid, &context)
● ACL check for context
➢ exit and make privileged thread reuse the PID before getpidcon() runs
● race window can be widened to ~15s

i_mutex on kernel 4.4
● sys_getdents() (for readdir()) iterates directory entries and copies to userspace under inode->i_mutex
○ potentially a large amount of data if the directory has many entries
● lookup_slow() (for looking up uncached directory entries) takes parent->d_inode->i_mutex
● => blocking userspace access in the middle of sys_getdents() blocks concurrent path traversal (e.g. open()) on the same inode

(Linux >=4.7 uses a semaphore i_rwsem in read mode instead of i_mutex)

Priority Inversion
● high-priority task blocks on mutex held by low-priority task
● low-priority task is preempted by medium-priority task (same CPU)
● also applies for violating fairness between two normal-priority tasks
● kernel mutexes are vulnerable to priority inversion!
○ (unless you're on PREEMPT_RT)
● => we can artificially create a priority inversion problem
● mitigated by infrequent idle-priority scheduling

[diagram, over time: task C (low priority) takes the lock; task C is preempted by task B (normal priority), a long-running task; task A (high priority) blocks on the lock]

Major faults


[diagram: task A [idle] takes the lock and calls copy_to_user(), which page-faults (#PF) and yields during the I/O; when task A becomes runnable again it does not preempt task B [normal], which is spinning; task A continues to unlock only after a later resched]

Instead of userfaultfd():
● create an uncached writable file mapping
○ [by filling up RAM with other data to force page cache eviction]
● let A trigger copy_to_user() on the file mapping while holding a lock
● let B spinloop at the same time

Consequences:
● copy_to_user() enters disk I/O path
● I/O path sleeps until disk responds, yielding the CPU
● scheduler won't preempt B when A is runnable again

Repeated file mapping faults
● [map pages such that readahead logic can't fire]
● 83560 bytes output from sys_getdents() = 21 pages [rounded up]
● >1s delay per disk read because of scheduler policy
● => >21s total delay

[diagram: task A [idle] holds the lock during copy_to_user(); each page fault triggers I/O, a yield and a delayed resched (three rounds shown) before the unlock; task B [normal] spins throughout]
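
The arithmetic behind the 21 pages and the >21s bound can be spelled out as below. The 4096-byte page size and a page-aligned output buffer are assumptions, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Pages needed to hold `bytes` bytes in a page-aligned buffer, rounded up. */
size_t pages_spanned(size_t bytes, size_t page_size)
{
    return (bytes + page_size - 1) / page_size;
}

/* Lower bound on the total stall, in seconds, if every page costs one
 * major fault and each fault stalls for at least `per_fault_s` seconds. */
size_t min_total_delay_s(size_t bytes, size_t page_size, size_t per_fault_s)
{
    return pages_spanned(bytes, page_size) * per_fault_s;
}
```

With the slide's numbers, pages_spanned(83560, 4096) is 21 and min_total_delay_s(83560, 4096, 1) is 21, matching the ">21s" figure once each fault takes more than one second.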

Timing diagram (simplified)
● task A [pinned to CPU 0, normal priority]: busyloop; keeps task B off the CPU
● task B [pinned to CPU 0, IDLE priority]
● task C [normal priority]
● task D [normal priority]
● [remaining diagram label: open binder]
