Memory Safety

The document discusses the importance of memory safety in software development, emphasizing how unintended states can lead to security vulnerabilities. It outlines various approaches to achieve memory safety, including spatial and temporal safety, and examines different programming languages and techniques, such as garbage collection, reference counting, and Rust's ownership model. The document also highlights the challenges and trade-offs associated with these methods, including performance and complexity considerations.


Is this memory safety here

in the room with us?


Halvar Flake / Thomas Dullien
DistrictCon 0 2025
Why memory safety?

The 40,000-foot view.


Why do people write software?
● They want to solve a concrete problem by means of a (finite-state) machine
(or, more precisely, a transducer: a finite-state machine with output)
● They do not have the machine to solve it
● They write software that emulates the finite state machine they want on a
general-purpose CPU

What I need vs. what I have


Abstract view
Our software is an “intended” FSM emulated on a real-world CPU - the CPU has
many more states, but our intent is to restrict it to those that “make sense” as FSM
states.
An unintended state is entered
An event triggers a transition into a state that is “nonsensical” or “unintended”
when viewed through the FSM lens.
Trying to transition as if it were a sane state
Further events make the software attempt to transition to the next FSM state (see
red arrow), but the state is “broken”.
The weird machine
Transforming a broken state leads to a new broken state.
The weird machine
Attackers can continue driving the machine into new states, possibly reaching “all
states” (or at least many that violate expected security properties).
Nested state spaces in computing

Nested sets, from largest to smallest:
- Possible physical states of the computational device
- Observable states of the computational device
- Documented possible states of the computational device
- “Sane” states of the computational device running the software
Program execution should follow trajectories through
“intended”, “sane” states
During exploitation, a state outside the “intended”, “sane”
set of states is reached
The attacker carefully controls the trajectory through those
“weird” states
Memory safety attempts to put an extra “wall” into this
diagram.
“Memory safe” states.
Fewer states of the machine reachable ⟶ more states of the machine reachable:

Small RegExp FSM · Java/Go · Safe Rust · C/C++ / Unsafe Rust · Assembly
What does memory safety provide?
● In C/C++, pointers and array indices are exchangeable
● They are also special in a particular way:

Corrupting a pointer or array index, or writing to it after memory has been
released, throws nearly all statements about the state of the machine out of
the window.
Corrupt memory tends to let the unicorns escape

“Here be unicorns”
Why is memory corruption special?
● In C/C++, corrupting a pointer or array index risks letting the unicorns escape
and trample all over all program states.

● A corrupted pointer (or index) can alias to pretty much anything. Writing to it
breaks the relationship between source code and program states.

● Business logic variables do not behave like that.


What does memory safety provide?

If memory safety is maintained, the abstract machine that the language defines
stays intact in the presence of most other bugs.

A link between the language syntax (and the source code) and the behavior of
the machine is maintained.

A horse stays a horse and does not grow wings and a horn.
Example: Graph of variable assignments
● Take your memory-safe input program
● Draw a graph with edges derived from assignments in the source

typeof(LHS) ← typeof(RHS)

● This graph stays intact no matter how bizarre the values in the business logic
variables become.
● You can still reason about what is going on from the source
How is memory safety usually achieved?
Memory safety is commonly viewed as two components
- Spatial memory safety: You cannot access outside of array boundaries.

- Temporal memory safety: You cannot access memory after it has been released or
before it has been allocated.
In theory, you could prove for a given C/C++ program that it satisfies these properties. In
practice for most codebases, this isn’t done, so languages (or hardware) are modified to
have safety mechanisms.
These safety mechanisms can be implemented in runtime, during compile-time, or a
combination of both.
Application logic can still become arbitrarily confused. But the goal is to prevent
such confusion from ever allowing the dereference of a corrupt pointer.
Obtaining spatial safety
Spatial safety is usually (not always) obtained through the following steps:
- Array accesses go through a centralized chokepoint (e.g. an Array class or similar).
- Run-time checks on the centralized chokepoint ensure that indices into the array are within
bounds.
- Remove pointer arithmetic. You can have a pointer to an object, even to one in an array, but
you cannot modify it.
- If you need pointer arithmetic, the assumption is that you have an underlying array of
identical objects. You can have a span or a slice that you can index into which references
the same underlying memory.
- Reduce the need on the language level to juggle indices by having “for each element in
container”-style loop structures.

These safety mechanisms are implemented in a combination of runtime and compile-time.
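The steps above can be sketched in safe Rust, whose slices implement exactly this chokepoint pattern; `element_at` and `sum_all` are hypothetical helper names used only for illustration:

```rust
// Spatial safety via a run-time-checked chokepoint (sketch in safe Rust).
fn element_at(v: &[i32], i: usize) -> Option<i32> {
    // All indexing goes through the slice interface; an out-of-bounds index
    // yields None (plain `v[i]` would panic rather than read out of bounds).
    v.get(i).copied()
}

fn sum_all(v: &[i32]) -> i32 {
    // "for each element in container" removes index juggling entirely.
    v.iter().sum()
}

fn main() {
    let v = vec![10, 20, 30];
    assert_eq!(element_at(&v, 2), Some(30));
    assert_eq!(element_at(&v, 3), None); // bounds check catches the overrun
    assert_eq!(sum_all(&v), 60);
    // A slice references the same underlying memory, with its own bounds.
    let s = &v[1..3];
    assert_eq!(s, &[20, 30]);
}
```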


Obtaining temporal safety
There are different approaches for obtaining temporal safety. The common ones
are:
- Garbage collection: Remove the ability to manually free objects, and have a
garbage collection algorithm collect objects whose lifespan is over.
- Reference counting: Perform reference counting on data structures, and
free an object when its count hits zero.
- Strict ownership: Every dynamic allocation is “owned” by an object, which
enforces a tree structure on the heap. Memory gets freed when the owner goes
out of scope. Only a single mutable reference exists at a time.
All of these approaches require coordination between the compiler/interpreter
and the runtime.
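The strict-ownership approach can be sketched in Rust. The `DROPS` log below is a hypothetical helper, added only to make the deterministic release order observable:

```rust
use std::cell::RefCell;

thread_local! {
    // Log of releases, so the deterministic free order is observable.
    static DROPS: RefCell<Vec<&'static str>> = RefCell::new(Vec::new());
}

struct Owned(&'static str);

impl Drop for Owned {
    fn drop(&mut self) {
        DROPS.with(|d| d.borrow_mut().push(self.0));
    }
}

fn drop_order() -> Vec<&'static str> {
    DROPS.with(|d| d.borrow_mut().clear());
    let _outer = Owned("outer");
    {
        let _inner = Owned("inner");
    } // inner's scope ends: its memory is released here, no GC involved
    DROPS.with(|d| d.borrow().clone())
} // outer is released here, after the snapshot was taken

fn main() {
    // Only "inner" has been freed by the time the inner scope closes.
    assert_eq!(drop_order(), vec!["inner"]);
}
```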
The different flavors of memory safety
1. Memory safety by proving absence of out-of-bounds memory access using
whole-program analysis (ASTREE etc.)
2. Memory safety by using garbage collection (Go, Java, C#, Python etc.) and
runtime bounds checks
3. Memory safety through reference counting (Swift) and runtime bounds checks
4. Memory safety through a type system with strict ownership, lifetimes and
runtime bounds checks (Rust, SafeCPP proposal).
5. (Memory safety through C++ profiles, ”21st century C++”)
Flavor 1:
Whole-program
analysis
Flavors of memory safety: (1) Proving absence
- Memory safety can be proven for real-world C/C++ control systems
- Caveats:
- Subset of C/C++
- Typically no dynamic memory allocations
- In real-world use for critical control systems with hard real-time requirements
- Aerospace, Automotive, Nuclear, Space systems

- Very rarely applied to existing codebases; usually impractical to do so – the
codebase needs to be designed from scratch to pass analysis.
- Extensive use of annotations in the codebase to help the static analyzer verify
properties.
- Development cost much higher than the rest of the industry.
ASTREE and Airbus Avionics
- Memory-safe concurrent C.
- No dynamic memory allocations.
- Code co-developed along with the static analyzer ASTREE to verify absence
of runtime errors (which includes out-of-bounds memory access).
Benefits of this approach:
- Performance. All your safety properties are verified statically before
compilation. No unnecessary run-time enforcement.

- Provable (assuming known hardware semantics etc.).

- Comparatively few assumptions on correct runtime implementation.


Downsides
- Expensive. Very little real-world software can pass such verification.

- Cumbersome. Most real-world systems make significant use of dynamic
memory allocation, linked data structures etc., all of which can still make static
property proving hard.
Flavor 2: Garbage
Collection
and runtime array
checking
GC and runtime bounds checking
- Restrict or abolish the use of pointers for array indexing.
- Java: goes full reference semantics.
- Go: reference semantics for pointers; introduces slices for pointers-to-array-ranges.
- Array indexing now happens through array interfaces instead of raw pointers.
- Obtain spatial safety by introducing runtime bounds checks on these indices.
- Obtain temporal safety by removing manual release of memory.
- Run algorithms that analyze the heap graph to determine whether memory
can be released or not.
Garbage collection: Java, C#, Go, Python etc.
- 1960 - McCarthy and Collins for LISP
- Huge success story - few new languages did not adopt it - ActionScript,
Erlang, Groovy, Objective-C, Python, C#, Java, Lua, PostScript, Go, JS,
Ruby, VB.

- Until the emergence of Rust, “memory safe language” was (falsely) used as
synonymous with “garbage collected language”.

- But … TANSTAAFL
The most important wall of our time
The memory wall. Why linked lists suck. Cache rules everything around me.
Cost of garbage collection
- Hardware evolution has “moved against” GC.

- When GC was invented, memory access was ~1-5 cycles. Now it’s 100s of cycles.

- “Liveness is a global property, free’ing a local one”

- Traversal of the global heap is necessary to perform GC. This is a pessimal
workload for modern systems: lots of DRAM references, all dependent on each
other (speculatively computing ahead of DRAM becomes too hard). Expensive:
  - The cost of traversing the heap graph
  - The cost of polluting cache lines by not re-using memory more
  - The cost of consuming more memory (in terms of both $ and power)
Hertz/Berger 2005: GC heap size vs. perf tradeoff

- More than 50% more cycles at 2x RAM consumption.
- Cycle equivalence at 5x RAM consumption.
GC everywhere: Do I pay more DRAM or more cycles?
Napkin math:

- Global datacenter energy draw is ballpark ~$100bn annually (1.1 PWh × $0.083
per kWh). DRAM draws 25%-30% of it, CPU around 60%.
- If I 5x or 3x available DRAM, I’ll get close to doubling power consumption.
- Power cost of GC-based memory safety in production may be anywhere
from a few billion to tens of billions annually (not counting Capex for hardware
purchase and software rewrites).
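The napkin math above, spelled out; the DRAM share and the slide's other figures are taken as given, and the 0.27 used below is simply the midpoint of the quoted 25%-30% range:

```latex
% Dollar figure for global datacenter energy:
\[
1.1\,\mathrm{PWh} \times \$0.083/\mathrm{kWh} \approx \$91\mathrm{bn\ annually}
\]
% Scaling DRAM capacity by a factor k while the rest stays fixed:
\[
\frac{P_{\text{new}}}{P} \approx 1 + (k-1)\, s_{\mathrm{DRAM}},
\qquad k = 5,\ s_{\mathrm{DRAM}} = 0.27
\ \Rightarrow\ P_{\text{new}} \approx 2.1\, P
\]
```

So a 5x DRAM blow-up roughly doubles total power, which is where the "few billion to tens of billions annually" range comes from.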
Quick note on language design and GC: Go vs. Java
- Java has a very mature and sophisticated GC infrastructure with decades of
production experience.
- Go has a very naive and simple GC.
- Why is in-production performance approximately the same?
- Go’s language design allows programmers to explicitly perform stack-based
allocations.
- Java doesn’t, leaving the work to move allocations to the stack to the
optimizer.
- Many more objects make it to the heap.
- Go allows arrays of plain data (structs stored by value), reducing heap fragmentation.
- Language design and idiomatic use matter.
Flavor 3: Reference
counting
and runtime array
checking
Reference counting
- Apple is religious about battery life (example: ECC to extend it)
- DRAM refresh is a huge component of total power draw of idle devices
- If you are religious about battery life, you need to be religious about driving
down DRAM usage.
- Design decision for Swift: Reference counting in place of garbage collection.

- Achieves temporal safety even without a garbage collector, and sets a
performance floor comparable to manual memory management (i.e. at least as
good as the worst manual memory manager).
Pro/Cons of reference counting
Pro: Compact heap.

- Keeps memory overhead low. You don’t pay power to keep your garbage alive in
DRAM.

Con: Synchronization performance hit.

- When traversing data structures in concurrent setups, naive implementations turn
read operations into read/write operations.
- Refcount updates need to be synchronized across cores under concurrent access.
- The refcount takes up space; average objects are 20-60 bytes, so 5-10% overhead.
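A minimal Rust sketch of the mechanics. Rust's `Rc` is a single-threaded refcount; the cross-core synchronization cost above applies to its atomic sibling `Arc`. The `count_demo` function is a hypothetical illustration:

```rust
use std::rc::Rc;

// Reference counting: counts go up on clone, down on drop; the allocation
// is freed the moment the count hits zero, keeping the heap compact.
fn count_demo() -> (usize, usize, usize) {
    let a = Rc::new(vec![1, 2, 3]);
    let before = Rc::strong_count(&a); // 1: only `a` holds a reference
    let b = Rc::clone(&a);             // bumps the count, no deep copy
    let during = Rc::strong_count(&a); // 2: `a` and `b`
    drop(b);                           // decrement; memory survives, `a` remains
    let after = Rc::strong_count(&a);  // back to 1
    (before, during, after)
} // count hits zero here and the vector's memory is released immediately

fn main() {
    assert_eq!(count_demo(), (1, 2, 1));
}
```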
Flavor 4: Strict
ownership semantics,
lifetimes, and runtime
array checking (Rust)
Rust’s big contribution
- “memory safety” used to be synonymous with “garbage collection”.
- Rust introduces a number of design choices that achieve production-grade
memory-safety without the GC overhead:
- Strict ownership: Either one mutable reference to a variable exists, or multiple immutable
ones. This removes use-after-frees and, more generally, mutating a data structure while
simultaneously iterating over it.
- Lifetime annotations: When returning a reference derived from other references, the valid
lifetime of that reference depends on the lifetime of the source references.
- Game changer: Memory safety without the GC drawbacks for latency,
memory consumption, locality etc.
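A small sketch of both ideas; the `longer` function is a hypothetical example, not from the talk:

```rust
// Lifetime annotation: a reference derived from input references is only
// valid as long as the (shorter-lived of the) sources it came from.
fn longer<'a>(x: &'a str, y: &'a str) -> &'a str {
    if x.len() >= y.len() { x } else { y }
}

fn main() {
    let s = String::from("hello");
    assert_eq!(longer(&s, "hi"), "hello");
    // The aliasing rule (one mutable XOR many immutable references) rejects
    // mutation during iteration at compile time:
    // let mut v = vec![1, 2, 3];
    // for x in &v { v.push(*x); } // error[E0502]: cannot borrow `v` as mutable
}
```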
“Rewrite it in Rust”
- Rust as a language got multiple things right:
- Allow the type system to prove temporal safety at compile time.
- Runtime checks for arrays can be removed by compiler optimizations in many (though not all)
cases.
- Unsafe behavior is permitted but explicitly encapsulated.
- Great FFI allows gradual rewrite of C++ code to Rust.
- Learning curve: Borrow checker acts like a very very stringent code reviewer
and forces particular program structures that are easy to reason about at
compile-time.
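A sketch of the encapsulation point: `first_byte` is a hypothetical helper, but it shows the idiom of a safe API whose checks uphold the invariant the `unsafe` block relies on:

```rust
// Encapsulated unsafety: callers of this function stay entirely in safe Rust.
fn first_byte(bytes: &[u8]) -> Option<u8> {
    if bytes.is_empty() {
        return None;
    }
    // SAFETY: the emptiness check above guarantees index 0 is in bounds.
    Some(unsafe { *bytes.get_unchecked(0) })
}

fn main() {
    assert_eq!(first_byte(b"abc"), Some(b'a'));
    assert_eq!(first_byte(b""), None); // the safe wrapper rejects the bad case
}
```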
Industry traction
- Rust has gained a lot of industry traction in domains that stuck with C/C++.
- AWS, parts of Android, even some Linux Kernel drivers.
- Startups such as Oxide that build low-level firmware stuff in Rust.
- Initially there were even attempts to rebuild a browser engine (Servo) and a
JS engine in Rust.
- With Mozilla running out of funds, those attempts have stalled somewhat
(although they are still alive)
Pro/Cons of strict ownership semantics
Pros:

- Compact heap: Memory gets released when it ceases to be reachable.
- Little overhead: Temporal safety arises from compile-time enforcement, meaning no run-time
overhead to obtain it.
- Concurrency: Safety for concurrency can be obtained at compile time too.
- Contained un-safety: Unsafe is available, but contained.

Con:

- Enforced heap structure: Not all problems lend themselves easily to the ownership semantics.
- Directed cyclic graphs or RCU-like structures are impossible or near-impossible in safe Rust.
- Learning curve: Many developers dislike arguing with the borrow checker.

Rust forces specific architectural choices on the programmer.
These are often, but not always, the right choices for the task.
(Flavor 5: C++
safety profiles and
21st century C++)
“21st century C++”
- B. Stroustrup lays out roadmap for getting memory safety (and resource
safety) into C++.
- Extra annotations to enforce language rules per-translation-unit:
- [[profile::enforce(type)]] // no casts or uninitialized objects in this TU
- [[profile::enforce(bounds)]] // all derefs bounds-checked, no pointer arithmetic
- [[profile::enforce(lifetime)]] // …
- Ability to disable for individual lines:
- [[profile::suppress(lifetime)]] this->succ = this->succ->succ;
Pro/Cons of 21st century C++
Pro: Backward compatibility, incremental porting.

- Allows incremental porting of existing C++ to memory safety.

Con: Only exists on paper.

- Right now, the approach only exists on paper.
- Interesting direction, given the amount of existing C++ code.
- Even on paper, the approach does not appear to be technically feasible (see
criticism from the SafeCPP author, Sean Baxter).
Current hardware approaches:
MT and CHERI
Hardware approaches: MT
Historically, most memory safety approaches were software-only.
Over the last few years, memory tagging has entered the discussion (and even
implementation), which allows a limited, probabilistic form of memory-safety to be
hardware-enforced.
MT modifies malloc to “tag” memory (using special instructions) with a few tag
bits. These bits are also stored in the upper bits of 64-bit pointers that the
architecture ignores (usually bits 57 through 63).
On memory dereference, these bits are compared (by the hardware) to the tag,
and an exception is raised when they don’t match.
Relatively easily retrofitted to existing systems (but DRAM cost!)
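A toy model of the tag check, assuming a 4-bit tag stored at bit 57; real schemes such as Arm MTE differ in tag width and bit positions, and the check happens in hardware, not software:

```rust
// Toy memory-tagging model: a 4-bit tag lives in pointer bits 57..61.
fn tag_of_ptr(p: u64) -> u8 {
    ((p >> 57) & 0xF) as u8
}

fn set_tag(p: u64, tag: u8) -> u64 {
    (p & !(0xFu64 << 57)) | (((tag & 0xF) as u64) << 57)
}

// Stand-in for the hardware check on dereference: pointer tag must match
// the tag malloc assigned to the underlying memory granule.
fn check(ptr: u64, memory_tag: u8) -> Result<(), &'static str> {
    if tag_of_ptr(ptr) == memory_tag {
        Ok(())
    } else {
        Err("tag mismatch: hardware raises an exception")
    }
}

fn main() {
    let p = set_tag(0x7fff_0000_1234, 0xA);
    assert!(check(p, 0xA).is_ok());  // matching tag: access allowed
    assert!(check(p, 0xB).is_err()); // freed-and-retagged memory: fault
}
```

Because only a few tag bits exist, a stale pointer matches a retagged granule with some probability, which is why MT gives probabilistic rather than deterministic safety.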
Hardware approaches: CHERI
Custom CPU cores (historically MIPS, now RISC-V) with capabilities:
fat pointers with bounds and permissions encoded.
Honorable mention: MiraclePtr
Retrofitting UAF safety into Chrome by adding refcounting - more mitigation
than memory safety. Doesn’t help against iterator invalidation etc.
wipe sweat off brow
Observations
Local reasoning vs. global problems
- Big wins come from turning global problems into locally-solvable ones:
  - The global nature of determining liveness is a problem.
  - Security problems when mismanaged (memory corruption), or very costly
    when automatically managed (garbage collection).
- Using a more powerful type system to turn global problems into locally
checkable problems seems to work.
Local reasoning vs. global problems
If you squint, you are proving local properties on each function, and then
composing local proofs into a whole-program proof of safety.
Copious annotations (in the form of types) are needed to make the proofs work.
As the type-checker (theorem prover) becomes more powerful, fewer annotations
are needed (lifetime elision).
In the limit, the type system approach and the program analysis approach
converge, from different sides.
Rust is already adding a Prolog-style theorem prover (Chalk) to the compiler
to deal with implications in the type system.
TANSTAAFL
- None of the approaches is truly “free”.
- Computational cost when doing GC.
- Architectural restrictions on code and exclusion of certain high-performance
data structures and patterns in safe Rust.
Where does this leave us?
Building a memory-safe userspace network
application (such as an SMTP server etc.) is
a solved problem.
We can write memory safe userspace services
- Writing userspace network services can be done in a memory-safe manner.
- Examples: Code such as an SMTP server can be implemented reliably and
safely in either GC’ed languages or something like Rust.
- Caveat: Runtime bugs, and willingness to pay cost.

Great, we are safe then?

- Wait, we don’t really have a web browser, a JS engine, etc.


- Also, many device drivers lead to kernel privilege escalations in spite of being
“memory safe” when examined at the language level.
What is not (yet?) covered by existing
mechanisms?
Writing safe unsafe Rust is not easy
- The Rustonomicon starts with a big warning about being outdated.
- There is no authoritative complete source of safe unsafe Rust.
- It is very very easy to write unsafe unsafe Rust.
- A good understanding of the Rust type system is needed.

- Example of a common mistake: Making a Rust container that permits
Send/Sync (e.g. sharing between threads) but does not require the type it
contains to permit Send/Sync.
- Programmers can now use this container in an unsafe manner from safe Rust.
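A sketch of that mistake; `MyBox` and `requires_send` are hypothetical names used only for illustration:

```rust
// A hypothetical container wrapping a value of type T.
struct MyBox<T> {
    value: T,
}

// The common mistake would be:
//   unsafe impl<T> Send for MyBox<T> {}
// which claims MyBox<T> may move across threads for ANY T, including
// !Send types like Rc - letting safe code misuse it. The sound version
// forwards the requirement to the contained type:
unsafe impl<T: Send> Send for MyBox<T> {}

// Stand-in for an API (e.g. thread::spawn) that needs Send values.
fn requires_send<T: Send>(t: T) -> T {
    t
}

fn main() {
    let b = requires_send(MyBox { value: 42u32 }); // u32: Send, accepted
    assert_eq!(b.value, 42);
    // requires_send(MyBox { value: std::rc::Rc::new(1) }); // rejected: Rc is !Send
}
```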
Shared memory TOCTOU
- Memory safety through type systems is a “local” property.
- When two trust domains access the same shared memory, one side can be
memory-safe but the other side can invalidate assumptions (by modifying the
shared state nefariously or accidentally).
- Classical example: Kernel memory corruption from dereferencing userspace
pointers more than once.
- Extremely common primitive for privilege escalations.

- Also relevant when CPU, Baseband, NIC, GPU don’t trust each other.
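The kernel double-fetch pattern above can be sketched with an atomic standing in for memory shared with an untrusted domain; the names below are hypothetical, and the "other side" mutating the length is simulated by a plain store:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const BUF_LEN: usize = 8;

// Stand-in for a length field living in memory shared with an untrusted
// domain (e.g. userspace, a baseband, a GPU).
static SHARED_LEN: AtomicUsize = AtomicUsize::new(4);

// Single-fetch discipline: read the untrusted length ONCE, check it, reuse it.
fn copy_checked(buf: &[u8; BUF_LEN]) -> Vec<u8> {
    let n = SHARED_LEN.load(Ordering::SeqCst);
    if n > BUF_LEN {
        return Vec::new(); // oversized length rejected
    }
    // The TOCTOU bug pattern re-reads SHARED_LEN here instead of reusing `n`,
    // letting the other domain bump it past BUF_LEN after the check -
    // an out-of-bounds copy in C (safe Rust would panic instead).
    buf[..n].to_vec()
}

fn main() {
    let buf = [1u8; BUF_LEN];
    assert_eq!(copy_checked(&buf).len(), 4);
    SHARED_LEN.store(16, Ordering::SeqCst); // "attacker" changes the length
    assert_eq!(copy_checked(&buf).len(), 0);
}
```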
Surprising callbacks out of the type system
Classical C++ browser bug pattern:
1. Take a reference to an object.
2. Read a value from the object.
3. Perform a check on a value from the object.
4. Inadvertently cause a callback into Javascript. Javascript mutates the
underlying object / heap.
5. Return into C++ in an incoherent state, corrupt memory.
If you rely on your type system to provide memory safety, the invariants of the type
system need to be kept intact by any other language you call into.
This is conceptually a variant of TOCTOU shared-memory bugs, in some sense.
Issues around FFI (and GC and type systems)
Subtleties about FFIs and memory-safe languages can fill a book.

- Memory allocated by the GC runtime might move around.
  - What happens to pointers passed to non-GC code?
- Go also effectively disables heap randomization - does this become a problem in
mixed code with vulnerabilities?

Dynamic linking in Rust was historically a nightmare:

- BTreeMap example: https://github.com/rust-lang/rust/pull/63338


Array indices as proto pointers
- Ok, I want to implement a dynamically-changing cyclic graph in Rust that
needs to be accessed by multiple threads.
- This is a classical problem when implementing a DOM, but for many other
use-cases, too.
- What do I do?
- Two approaches:
- Allocate a pool of nodes, use indices as proto-pointers (petgraph).
- Use Rc<T> and Arc<T> (reference-counting wrappers); pay overhead and risk memory leaks.

Both have their advantages and disadvantages.


Array indices as proto pointers
The pool-of-nodes approach raises a philosophical question:

- Who tracks the lifetime of elements in the pool of nodes?


- Have I just re-introduced use-after-free, but … “typesafe” use-after-free?
- What are the implications of this?
- What does memory safety but with typesafe use-after-free even mean?
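A minimal sketch of the pool-of-nodes pattern, with hypothetical types in the spirit of petgraph. Note how it also exhibits the philosophical problem above: a stale index into the pool is a "typesafe" dangling proto-pointer.

```rust
struct Node {
    value: u32,
    next: Vec<usize>, // indices into the pool act as proto-pointers
}

struct Graph {
    nodes: Vec<Node>,
}

impl Graph {
    fn new() -> Graph {
        Graph { nodes: Vec::new() }
    }
    fn add(&mut self, value: u32) -> usize {
        self.nodes.push(Node { value, next: Vec::new() });
        self.nodes.len() - 1
    }
    fn link(&mut self, from: usize, to: usize) {
        self.nodes[from].next.push(to);
    }
}

fn cyclic_demo() -> (u32, u32) {
    let mut g = Graph::new();
    let a = g.add(1);
    let b = g.add(2);
    g.link(a, b);
    g.link(b, a); // a cycle: not expressible with plain owned references
    let via_a = g.nodes[g.nodes[a].next[0]].value; // follow a -> b
    let via_b = g.nodes[g.nodes[b].next[0]].value; // follow b -> a
    (via_a, via_b)
}

fn main() {
    assert_eq!(cyclic_demo(), (2, 1));
}
```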
JIT miscompiles
The majority of exploited browser bugs in recent years were not issues of memory
safety.

- The vast majority of issues were subtle bugs in the JIT engines.
- These often led to generation of incorrect / unsafe machine code from JIT’ed JavaScript.
- JS is special: Most other languages trust the compiler input.
- JS needs to both be blazingly fast (JIT, sophisticated optimization) and treat
input as malicious
- Unique problems: Emit fast code fast from adversarial source.
Hardware errata
If the hardware misbehaves, all bets are obviously off.

- Rowhammer-style bugs
- AMD Bobcat CPU series JIT issues
- AMD SIMD Register Information Leakage
- “Cores that don’t count” - Mercurial cores
GPU and xPU interactions
- CPUs are increasingly a bottleneck.
- Systems are mutating to network-of-systems communicating via shared
memory
- CPU <-> GPU have shared DMA communication. CPU <-> NIC too, and more
heterogeneous compute is added.
- Driver logic for managing shared memory is prone to logic bugs:
  - https://starlabs.sg/blog/2025/12-mali-cious-intent-exploiting-gpu-vulnerabilities-cve-2022-22706/
- New compute units need to respect operating system abstractions (process
isolation in GPU memory etc.)
What’s next?
The role of LLMs and AI in the process
The role of AI
- LLMs as drivers for theorem provers are improving rapidly. This opens a
number of new opportunities:
- Verification of unsafe Rust is currently done via Iris / Coq. Getting LLMs to drive a theorem
prover to validate more Rust crates.
- Retrofitting memory safety into C++ will almost certainly require more code annotations. LLMs
can help write some of these annotations.
- As LLMs-as-drivers-for-theorem-provers mature and improve, the cost of verifying code will
come down.
Research & Engineering ahead
Research & Engineering topics
- We know how to build safe applications now.
- Yet almost nothing beyond some network services in production is safe:
- We don’t have a memory-safe mainstream OS kernel.
- We don’t have memory-safe mobile phone basebands.
- We don’t have a memory-safe ffmpeg.
- We don’t have a broadly-deployed memory-safe browser engine (yet?).
- We don’t have a memory safe JS JIT engine, and don’t know how to get there.
- All kernel-bypass high-performance networking libs are unsafe (DPDK etc.)
- Outside of Oxide, few firms write firmware in Rust.
- Even Rust in the Linux Kernel is struggling (see recent discussions).
- Very few Rust crates can be verified to have written safe unsafe Rust.
- We have no viable path to convert legacy C/C++ code to memory safety.
- There’s a lot of work ahead.
Research & Engineering topics
- The language of the future will not be today’s Rust or C++
- Type systems will continue to evolve to provide more and better guarantees
- Theorem proving on top of more sophisticated type systems will become
more powerful
- Most likely, AI systems will learn much better on more stringently typed
languages, because they have a (partial) correctness oracle.
With memory corruption, anything is possible.
