SURROGATES: Enabling Near-Real-Time
Dynamic Analyses of Embedded Systems
Karl Koscher Tadayoshi Kohno David Molnar
UC San Diego University of Washington Microsoft
ABSTRACT
Embedded systems are becoming increasingly sophisticated, inter-connected, and
pervasive. Unfortunately, securing these systems remains challenging. While
powerful dynamic analysis tools have been developed for traditional software,
the unique characteristics of embedded systems make it difficult to apply these
well-known techniques; prior work has been limited either to small systems or
short segments of code. In this paper, we demonstrate a system that is capable
of emulating and instrumenting embedded systems in near-real-time, enabling a
variety of dynamic analysis techniques. Our approach uses a custom, low-latency
FPGA bridge between the host’s PCI Express bus and the system under test,
allowing the emulator full access to the system’s peripherals. This provides
the emulator with a faithful representation of the environment the firmware
normally executes in, enabling additional dynamic analysis techniques such as
concolic execution. We discuss the design decisions and engineering tradeoffs
made and evaluate our system against prior work.

1. INTRODUCTION
Embedded systems are becoming increasingly sophisticated, inter-connected, and
pervasive, making the “Internet of Things” the buzzword du jour. Unfortunately,
these systems have repeatedly been shown to be insecure, with vulnerabilities
in a diverse range of products such as automobiles [1], medical devices [2],
routers [3], and voting machines [4]. Even if we can convince manufacturers to
invest the time and resources to secure their products, the security tools
available to embedded systems developers pale in comparison to those for
traditional software.

In particular, dynamic analysis techniques are challenging to apply due to the
difficulty of instrumenting embedded systems. There may not be sufficient
storage space for an instrumented binary or its measurements. There may not be
sufficient processing power for instrumentation. There may not be a way to
provide arbitrary data to the system, a necessity for fuzzing. Even if a system
is technically capable of added instrumentation, firmware heterogeneity
requires substantial work to customize instrumentation for each device. Whereas
traditional software runs on top of a few standard OSes (with standard
facilities that support instrumentation, such as a file system and dynamic
linker), embedded systems may not even have an OS. The analyst must identify
instrumentation points and storage available for measurements, and surgically
insert code into the firmware.

An alternative to placing instrumentation on the device itself is to run the
system under emulation. However, this introduces its own set of challenges.
Embedded systems are highly intertwined with their environment, through
sensors, actuators, and other interfaces. Furthermore, the peripherals that
control these interfaces can vary a great deal from one device to another.
Faithfully emulating these peripherals requires a great deal of work building
customized solutions.

An early approach to this problem came in the form of in-circuit emulators,
which are drop-in, hardware replacements for microprocessors. These are
typically microprocessor cores identical to those being “emulated,” with extra
debugging signals bonded-out and connected to external analyzers. These
analyzers can be used to examine and control the operation of the
microprocessors. However, as processor speeds have increased, and as
microcontrollers have evolved into full Systems-on-Chip (SoCs), hardware
in-circuit emulators have been replaced by special debugging facilities built
in to most modern microcontrollers and SoCs. These facilities, while useful for
development and debugging, often do not readily lend themselves to supporting
advanced dynamic analysis techniques, such as taint tracking, fuzzing, or
concolic execution.

Another approach, as described in section 2, is to treat peripherals as
unconstrained symbolic inputs. However, this relies on the analysis using
symbolic execution. Unconstrained inputs can lead to state explosion, rendering
this technique unsuitable for all but the smallest embedded systems.

We take a different approach. Like Avatar [5] (also described in section 2), we
run the device’s firmware
under emulation, directing peripheral I/O to the actual device, giving the
emulated firmware a realistic view of its environment. This leverages the fact
that many devices rely on a relatively small set of embedded processors; SoC
manufacturers typically license a well-known CPU core and add their own custom
peripherals.

However, there are a number of challenges in making this approach work without
being prohibitively slow. Avatar attempts to overcome these challenges by
limiting the amount of firmware executed under emulation. However, this raises
a number of additional problems. The analyst must have sufficient insight into
the operation of the firmware to decide which parts are interesting enough to
run under emulation. Emulated code still executes slowly, so this technique may
not work with timing-sensitive devices (such as a medical device with a
watchdog coprocessor). Furthermore, it doesn’t provide a feasible way to do
whole-system analysis.

Instead of limiting the scope of emulated execution, we introduce a system
called SURROGATES, which can emulate entire systems in near-real-time. We
accomplish this by using custom, low-latency hardware to bridge the PCI Express
bus of the host to the device under test, as well as making a number of
optimizations. In doing so, we uncover and surmount new challenges in emulating
entire systems, such as handling interrupts, DMA, and clocking changes.

In this paper, we make the following contributions: 1) We describe new hardware
which enables near-real-time emulation of arbitrary ARM-based embedded systems,
providing a platform to build advanced dynamic analysis tools on; 2) We discuss
the engineering tradeoffs in building SURROGATES and provide comprehensive
performance evaluations of the different techniques; 3) We describe and solve
several issues that arise when emulating entire systems; and 4) We demonstrate
the practicality of using our system on a diverse set of devices.

The rest of this paper is organized as follows. Section 2 describes related
work. Section 3 discusses a number of options to improve the performance of
systems like Avatar, guiding the design of SURROGATES, which is introduced in
section 4. Section 5 evaluates the performance of our system, compares it to
prior work, and describes our experience applying our system to a variety of
embedded systems. Section 6 describes future work. Finally, we conclude in
section 7.

2. RELATED WORK
The poor state of embedded security and the seriousness of its consequences
have led researchers to propose new ways to automatically analyze embedded
systems, building on the success of traditional dynamic analysis tools.
However, there are a number of challenges in applying traditional dynamic
analysis tools to embedded systems.

Whereas traditional software is written against OS-provided APIs, the “API”
that firmware is written against is usually a hardware specification.
Peripherals typically expose their behavior through several memory-mapped
registers. These registers appear as normal memory, but reads and writes to
these addresses directly control the hardware. With the large heterogeneity of
embedded devices, faithfully reproducing hardware behavior to dynamic analysis
tools is a time-consuming and error-prone proposition.

FIE [6] symbolically executes the firmware of small, MSP430-based embedded
devices. FIE overcomes the challenges in the diversity of devices and the need
to understand peripheral semantics by treating all peripheral I/O as an
unconstrained symbolic input. Unfortunately, this can easily lead to a state
space explosion, making this technique impractical for all but the smallest
embedded systems.

Avatar [5] attempts to constrain the number of states explored by using the
actual hardware as a guide for peripheral semantics. It does so by redirecting
peripheral I/O to the real device, either by using a JTAG debugger or through
serial communication with an in-memory stub loaded onto the target in a manner
similar to SerialICE [7]. Unfortunately, with the ability to do only about five
memory operations per second, redirecting all I/O is prohibitively slow. Avatar
overcomes this limitation by migrating executing code between the emulator and
the device, and emulating only small portions of interest of the firmware.
However, this optimization is unsuitable for timing-sensitive systems. We seek
to overcome this limitation by enabling near-real-time peripheral interaction.

3. TOWARDS REAL-TIME I/O
Our system targets ARM processors, which are ubiquitous in medium-to-high
complexity embedded devices. Our system communicates over the JTAG interface
exposed on most microcontrollers. JTAG has several nice properties: 1) it is
usually present in embedded devices for programming and testing during
manufacturing, 2) JTAG pins are usually dedicated for programming and
debugging, so it provides a communications channel that is not already used for
some other purpose during normal operation, 3) JTAG interfaces tend to support
high transfer rates (e.g. ARM processors can support JTAG clock rates up to
1/6th of the core processor speed), limited primarily by off-chip factors such
as connection length, and 4) existing JTAG tools can be used to read and write
arbitrary memory addresses on a device, making it easy to rapidly develop an
Avatar-like prototype.

JTAG interfaces expose a simple, standard state machine that can be driven by a
JTAG adapter. This state machine lets the JTAG adapter select, capture, and
update either a JTAG instruction register or a data register. These registers
act like shift registers; data is shifted in and out simultaneously. While
there is only one instruction register, several different data registers
(called scan chains) can be selected using the different JTAG instructions.

As with Avatar, we first redirected emulated memory-mapped I/O to the target
over JTAG using OpenOCD [8] (an open-source JTAG program). We initially used
OpenOCD’s built-in GDB protocol interface to initiate reads and writes and
control the processor’s state. However, memory operations are extremely slow
over regular JTAG interfaces. This is because these memory operations are
typically injected into the CPU’s state. The JTAG interface must halt the CPU,
transfer the CPU’s state, update the CPU’s state to perform a memory operation
(including general purpose registers and the instruction register), single-step
the CPU, transfer out the CPU’s state again if the memory operation was a read,
restore the CPU’s original state, and resume the CPU.

While exposing the CPU’s state over JTAG gives debuggers extremely powerful
control over the system, its performance is poor for common tasks, such as
transferring large segments of memory. To improve performance of these
operations, CPU vendors have introduced additional scan chains that expose
small communications channels between the JTAG interface and a program running
on the CPU. For example, most ARM processors support the Debug Communications
Channel (DCC), which is a 32-bit register accessible over a separate JTAG scan
chain. JTAG interfaces can upload a small stub to the target and use the DCC to
transfer large portions of memory efficiently.

We leverage the relatively fast DCC by developing a custom stub that runs on
the target, accepting memory read and write commands from the host. A full
discussion of our stub and DCC protocol is in section 4.2. We modified QEMU [9]
to directly pass selected reads and writes as DCC commands to a Segger J-Link,
a commercial, off-the-shelf USB JTAG interface. Unfortunately, we then
encountered an unexpected bottleneck: USB transaction latency. USB requires all
communications to be initiated by the host. This requires the host to
periodically poll all devices for their status. The maximum polling rate is
1 kHz, which imposes a minimum latency of 1 ms on each USB transaction. While
this may sound insignificant, it is several orders of magnitude slower than the
latency of native I/O operations. Furthermore, because code execution may
depend on the result of a memory read, this effectively places an upper limit
on the number of memory operations we can perform per second. Note that while
we could continue to execute symbolically (later replacing the symbolic result
of the read with its concrete value and pruning inconsistent code paths),
further interactions with the hardware may depend on the result of the read,
and thus to ensure consistency we must wait for the read to complete. This
latency is a fundamental limitation of USB, which means that we must look at
other interfaces to overcome it.

4. OUR APPROACH: SURROGATES
We decided to avoid further unexpected bottlenecks and latencies that might be
lurking in other interfaces (such as Ethernet and Firewire) by developing a
custom JTAG adapter that connects directly to the host’s PCI Express bus. Our
goal was to transparently map the target’s entire 32-bit physical address space
into the 64-bit address space of the emulator, such that peripheral I/O is
simply a memory read or write by the emulator. While practical reasons
(explained later in this section) prevent us from achieving this goal, our JTAG
interface is directly memory-mapped into the emulator process, giving us
extremely low-latency access to the target. We still use our DCC stub to
communicate with the target processor.

The PCI Express bus is not really a bus at all, but a packet-switched network.
The root complex translates CPU reads and writes into PCI Express packets,
which get routed by address. (Alternate routing schemes can be used, e.g., for
device discovery and configuration.) Writes are posted transactions which
complete immediately, while reads are non-posted and require a completion
packet (usually with data) to be sent back to the root complex. Since PCI
Express is a packet-switched network, devices can send packets to their peers,
as well as performing DMA by sending packets to the root complex.

4.1 The Hardware
Our hardware consists of an off-the-shelf PCI Express FPGA card (a Pico
Computing E17FX70T), a custom FPGA-to-JTAG interface board, and a custom JTAG
debugging board, as shown in Figure 1. The FPGA-to-
JTAG board shifts signal voltage levels between the FPGA and the target’s JTAG
interface, and provides a standard ARM JTAG connector. It also provides a
SATA-like, high-speed serial interface that can transport JTAG signals over a
longer distance. The JTAG debugging board can convert this serial stream back
to a standard JTAG interface, and provides an easy interface for a logic
analyzer to examine the JTAG signals.

Figure 1: Hardware components of our system. Left-to-right: An off-the-shelf
FPGA ExpressCard, our JTAG adapter board, a JTAG breakout/debug board, and the
device under test (a FriendlyARM Mini2440). FPGA development and debugging is
done through another JTAG connection via the JTAG interface board, as well as a
small logic analyzer connected to the JTAG breakout/debug board.

Our implementation uses a Xilinx Virtex5 FX70T FPGA. While this FPGA is
overkill for our purposes, it was available off-the-shelf as a PCI Express
card, with the bulk of the PCI Express glue logic already developed by Xilinx
and Pico Computing. Our application logic is implemented in approximately 1,100
lines of Verilog, excluding tests (which are approximately another 1,000 lines
of Verilog). Device utilization is summarized in Table 1.

Table 1: FPGA Utilization

                       Used    Available    Utilization
  Slice Registers     6,503       44,800            14%
  Slice LUTs          6,615       44,800            14%
  Occupied Slices     3,397       11,200            30%
  BlockRAMs/FIFOs        11          148             7%
  Total Memory (KB)     306        5,328             5%

We implement two PCIe-to-JTAG bridges in the FPGA. The first is a simple set of
FIFOs for the TDI, TMS, and TDO signals, and supports generic JTAG operations,
such as manipulating the processor’s state, dumping firmware, and uploading
code. We extend OpenOCD to support this new interface and use it for some
complicated-but-infrequent operations, such as resetting the target to a known
state and uploading the stub.

The second interface is designed specifically to work with our stub. As
previously mentioned, the original intention was to provide a transparent
mapping of the target’s 32-bit physical address space somewhere in the host’s
64-bit address space. Unfortunately, the PCI Express specification requires
that all 64-bit address ranges be prefetchable, meaning that reads are
side-effect free. This is not the case for several embedded devices. For
example, a UART controller may have a single, memory-mapped character register.
A read from this register frees the UART to receive another byte. While some
chipsets do allow 64-bit PCI Express regions to not be prefetchable, others do
not.

Of course, only a portion of the target’s 32-bit address space is mapped to
peripherals. We considered transparently mapping a small view of the target’s
address space, allowing the host to pick the address range that is mapped in.
However, on a typical PC, there is a great deal of contention for address space
below the 4 GB boundary. This makes it difficult to map reasonably large 32-bit
regions. Furthermore, devices typically use large peripheral address spaces
(e.g. 320 MB on the Samsung S3C2440) even though they are sparsely populated.
Since the host may have to keep remapping different views of the target’s
address space, we decided to simply expose a few memory-mapped registers that
initiate reads and writes to the target. These registers are described below
and shown in Appendix A.

There are two address registers (one for reads and one for writes), as well as
a data register. When a write address and value are written, the FPGA initiates
a write operation on the target through its DCC interface. When an address is
written to the read address register, a read operation on the target is
initiated. We also provide two FIFOs and control registers to allow the host to
initiate optimized multiple-word transactions.

The packet-based nature of PCI Express lets us stall reads of the data register
if the target hasn’t returned data yet. However, while the root complex is
supposed to abort transactions that have timed out, our particular root complex
doesn’t. This means that if the target
device doesn’t respond (due to a bug, being powered off, etc.), the host will
freeze. Not even the NMI watchdog can recover the system. For this reason, we
typically poll the FPGA for completion.

When there are no pending read or write requests, the FPGA can be configured to
continuously poll the target’s DCC register to see if an interrupt has
occurred. Interrupts received from the stub are dispatched as interrupts to the
host’s processor. This required a small modification to the FPGA’s PCI Express
interface code. The preferred way of sending interrupts over PCI Express is to
use Message Signaled Interrupts (MSIs), which are simply memory writes of a
specific value to a specific address. Peripherals no longer have to share a
total of four interrupt signals, and can in fact request multiple interrupts.
This would appear to allow the hardware to send different interrupts to the
host based on the target’s interrupt type. Unfortunately, Linux has limited
support for multiple interrupts per peripheral, so the driver must poll the
hardware to determine the interrupt type, as described in section 4.3.

4.2 The Stub

Table 2: Our stub protocol as 32-bit hex words

  ►1SXXXXXX      Read XX words of size S (1, 2, or 4 bytes) from address YY.
  ►YYYYYYYY      XX data elements ZZ are returned.
  ◄ZZZZZZZZ …

  ►2S00XXXX      Write a single word XX of size S (1 or 2 bytes) to address
  ►YYYYYYYY      YY.

  ►3SXXXXXX      Write XX words of size S (1, 2, or 4 bytes) to address YY.
  ►YYYYYYYY      XX data elements ZZ are sent.
  ►ZZZZZZZZ …

  ►50XXXXXX      Set the CPSR register to XX. Primarily used to set and clear
                 interrupt flags.

  …              An interrupt of type XX has occurred. This word can be sent
  ◄C347A5XX      at any time, including before a read response. In the
  …              unlikely case that a word C347A5XX is the result of a read
                 operation, C347A500 is sent as an escape sequence.
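To make the word formats in Table 2 concrete, the following Python sketch packs host-to-target commands and separates interrupt notifications from read data in the return stream. It is an illustration, not the authors' implementation: the function names are hypothetical, the field layout (a 4-bit opcode, a 4-bit size S, and a 24-bit field) is read off the hex patterns in the table, the CPSR command is assumed to carry only the low 24 bits, and the escape word C347A500 is assumed to be followed by the literal data word.

```python
# Sketch of the Table 2 word formats; all helper names are illustrative only.
READ, WRITE_ONE, WRITE_MANY, SET_CPSR = 0x1, 0x2, 0x3, 0x5
IRQ_MASK, IRQ_PREFIX = 0xFFFFFF00, 0xC347A500  # C347A5XX, XX = interrupt type
ESCAPE = 0xC347A500                            # type 00 acts as the escape word


def header(opcode, size, low24):
    """Pack the opcode nibble, the size nibble S, and a 24-bit field."""
    if not 0 <= low24 < (1 << 24):
        raise ValueError("field does not fit in 24 bits")
    return (opcode << 28) | (size << 24) | low24


def encode_read(addr, count, size=4):
    """1SXXXXXX, YYYYYYYY: read `count` elements of `size` bytes."""
    if size not in (1, 2, 4):
        raise ValueError("size must be 1, 2, or 4")
    return [header(READ, size, count), addr & 0xFFFFFFFF]


def encode_write_one(addr, value, size=2):
    """2S00XXXX, YYYYYYYY: write one 8- or 16-bit value."""
    if size not in (1, 2) or not 0 <= value < (1 << (8 * size)):
        raise ValueError("single writes carry a 1- or 2-byte value")
    return [header(WRITE_ONE, size, value), addr & 0xFFFFFFFF]


def encode_write_many(addr, values, size=4):
    """3SXXXXXX, YYYYYYYY, ZZZZZZZZ...: write several elements."""
    if size not in (1, 2, 4):
        raise ValueError("size must be 1, 2, or 4")
    return [header(WRITE_MANY, size, len(values)), addr & 0xFFFFFFFF, *values]


def encode_set_cpsr(value):
    """50XXXXXX: update the target's CPSR (low 24 bits only in this sketch)."""
    return [header(SET_CPSR, 0, value)]


def split_target_words(words):
    """Separate interrupt notifications from read data in the return stream.

    Assumption: the escape word C347A500 precedes a literal data word that
    happens to look like an interrupt notification.
    """
    data, irqs = [], []
    stream = iter(words)
    for word in stream:
        if (word & IRQ_MASK) != IRQ_PREFIX:
            data.append(word)
        elif word == ESCAPE:
            literal = next(stream, None)
            if literal is None:
                raise ValueError("escape word at end of stream")
            data.append(literal)
        else:
            irqs.append(word & 0xFF)
    return data, irqs
```

For example, `encode_read(0x48000000, 3)` yields the two words `14000003` and `48000000`, matching the first row of the table with S = 4 and XX = 3.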
Our stub targets most microcontrollers based on ARMv4T or newer cores. (Some
newer ARM Cortex cores have different debugging options and capabilities.) This
covers a wide range of interesting embedded devices, including hard drives,
cellular baseband processors, medical devices, and automotive systems. The stub
is implemented in approximately 400 lines of assembly and takes up only 768
bytes, which can be easily locked into the instruction cache on processors that
support it. The stub does not use any RAM for data or a stack, allowing the
emulator to use all available RAM on the target if desired.

Our stub uses a custom word-based protocol to efficiently perform memory
operations as well as transferring status information, such as interrupts and
interrupt masks. A summary of our protocol is listed in Table 2.

The stub provides handlers for standard (IRQ) and fast (FIQ) interrupts. Unlike
Avatar, no de-multiplexing is attempted. When an interrupt is received, ARM
processors update their Current Program Status Register (CPSR) to set the IRQ
or FIQ Disable bit, preventing the handler from being interrupted itself. The
old CPSR value is stored in the Saved Program Status Register (SPSR). Normally
when the handler returns, the SPSR is copied back to the CPSR, re-enabling
interrupts. However, we adjust the SPSR to keep interrupts disabled and deliver
the interrupt type to the host. The host delivers the interrupt to the emulated
processor when its CPSR is set to allow interrupts. The emulated firmware can
then query the interrupt controller like any other peripheral to determine the
source(s) of the interrupt. Note that multiple interrupt sources may be set in
the interrupt controller: setting the IRQ or FIQ Disable flag does not mask
interrupts from being handled by the interrupt controller, but merely prevents
them from being delivered to the CPU. The firmware acknowledges any interrupts
it handles. When the emulated firmware finally re-enables interrupts, a CPSR
update command is sent to the target to re-enable its interrupts. If the
interrupt controller still has an unacknowledged interrupt active, it will once
again interrupt the target CPU. This process repeats until no interrupts are
active. The acknowledgement protocol prevents any race conditions where the
emulated processor may miss an interrupt. Since these race conditions can
appear natively, all ARM firmware must implement this type of protocol. Some
ARM SoCs provide vectored interrupts, where the firmware can specify different
handlers for each interrupt source. However, since the ARM core itself only
supports two interrupt types, these vectors are normally implemented with a
small handler in ROM, which queries the interrupt controller and jumps to the
correct vector. This ROM can be emulated by our system like any other
firmware, allowing us to support fully-vectored interrupts with no additional
work. Extracting this ROM and other per-device setup is discussed in section
5.2.

4.3 The Software
We modified QEMU [9] to pass all MMIO to our hardware. We accomplished this by
creating a new “surrogate” peripheral in QEMU, which owns the entire MMIO
address space of the target and forwards MMIO operations to the hardware. We
also created a new QEMU “system,” which selects the proper CPU, creates the
necessary address spaces, initializes the surrogate peripheral, and loads the
firmware to emulate. Note that since we build on QEMU, our system easily
integrates with tools such as S2E [10] and Avatar. (We later created interfaces
to our hardware as S2E and Avatar plugins, but found that doing so incurs a
substantial performance hit. Thus, we appear to S2E like any other virtualized
peripheral.)

Initially we ran our system under Windows to take advantage of the existing
drivers for the PCIe card. However, the drivers were optimized for streams of
data, where latency is less of a concern than throughput. For example,
transfers to the card would always use DMA, regardless of the transfer size.

We ultimately re-implemented a simplified version of the driver on Linux (which
was based on an open-source driver for Pico Computing’s other FPGA products).
To avoid syscall overhead on every MMIO operation, we allow applications to
mmap the hardware’s register space, although in practice this did not
significantly improve performance.

Finally, we extended the driver’s interrupt handler to deliver a signal to any
process that requests it whenever a non-DMA interrupt is received. A signal
handler in QEMU delivers this interrupt to the virtual CPU. This provides a
low-latency path for interrupts.

5. Evaluation
We evaluate our system against two metrics: its performance and the ease of
configuring it to work with a new target device.

5.1 Performance
One of the key motivations for SURROGATES was to overcome the performance
limitations of Avatar. While we had independently built a system very similar
to Avatar, we were unable to use it against several devices of interest because
proper operation of those devices relies on timing constraints that it could
not meet (e.g. watchdogs on co-processors of a medical device). Therefore, we
evaluate several performance aspects of our system and compare it with prior
work. All of our performance experiments were run against a FriendlyARM
Mini2440 development board, described in Section 5.2.

To test raw MMIO performance, we measure the time needed to make 1,000,000 read
or write requests to the SRAM of the FriendlyARM’s SoC, connected to our
hardware with a 4 MHz JTAG clock. We find that our raw MMIO performance is more
than three orders of magnitude faster than what the Avatar authors reported, as
shown in Table 3. We also measured the time taken to write to an FPGA register
1,000,000 times. Although accessing the FPGA through a mmap interface is about
60% faster (1.4 µs vs. 2.2 µs), the overall performance impact under real
workloads is negligible.

Table 3: Raw MMIO Performance

                            MMIO Operations Per Second
  Avatar                    ~5 (over serial debug port at 38400 bps)
  Our system w/ syscalls    17172 writes / 15761 reads (over 4 MHz JTAG)
  Our system w/ mmap        17174 writes / 15772 reads (over 4 MHz JTAG)

To evaluate whether this performance was reasonable to support near-real-time
emulation, we set out to boot Linux on the emulated processor. To accurately
measure the amount of time to boot, we replaced the init binary with one that
simply contains a special illegal instruction. This instruction shuts down QEMU
and reports performance statistics. We found that the kernel boots in about 27
seconds. 25 seconds were spent performing I/O. However, during boot the kernel
initializes all of the peripherals, so its I/O characteristics are different
from typical usage of a booted system. During this time, approximately 126,000
reads and 87,000 writes were performed.

To evaluate interactivity, we replaced the init binary with the busybox [11]
version of /bin/sh, allowing us to interact with the system over its serial
port. While file system accesses were noticeably slower than on the real
hardware, the shell maintained a subjectively good amount of responsiveness.

To get a more objective measure of responsiveness, we connected the
FriendlyARM’s Ethernet port directly to a Windows laptop and performed a ping
test against the emulated system. After 100 pings, the average response time
was 15 ms. The minimum response time was 8 ms, and the maximum was 61 ms. We
then connected the FriendlyARM to our campus network
(which has significantly more broadcast traffic) and obtained similar results.
Finally, we loaded a web page from the emulated device’s HTTP server, which
loads content off of the physical SD card and sends it over the physical NIC.
When loading a 369 KB image from the SD card, we obtained an effective
throughput of 17.3 KB/s, which includes an initial stall to read the file from
the SD card. Subsequent transfers of the same image (now in the filesystem
cache) had a throughput of about 26 KB/s. Note that neither the SD card driver
nor the NIC driver use DMA, which would allow us to exploit the multi-word
transfer mode of our system to approximately double our throughput (since we
transfer the address only once, and not on every word transfer).

While slower than running natively, we are able to emulate an entire system
with reasonable usability. In contrast, the authors of Avatar reported that it
took almost four minutes to reach the bootloader prompt of a hard drive.

5.2 Portability
This work was also motivated by our desire to build a dynamic analysis platform
that does not require a great deal of work to apply to a new target. Therefore,
we evaluate the ease of supporting new devices and discuss some of the new
challenges encountered when supporting entire systems. We look at two devices
as case studies: a FriendlyARM Mini2440 development board with a Samsung
S3C2440 SoC, and a wireless medical device with an iMX21 SoC.

When applying our system to a new target, the first task is to identify the
target’s JTAG port. These are often connected to test pads on the target’s PCB,
but sometimes they are brought out to dedicated connectors. As a development
board, the FriendlyARM features a well-identified JTAG port. The wireless
medical device, however, just has dozens of unmarked test points. We had
previously identified the JTAG test points through manual analysis; however,
today there are tools like the JTAGulator [12] that perform a brute-force
search over all test points to find the JTAG signals.

Once JTAG connectivity is established, firmware of the device is downloaded. In
some cases, the SoC itself has a small amount of firmware in ROM that is
essential to proper operation of the SoC. For example, the ROM in the iMX21
performs interrupt vectoring, so if the firmware chooses to use vectored
interrupts, the ROM must be emulated as well.

A location for the stub must be identified. Different SoCs have varying
requirements for locating interrupt and exception handlers. For example, on the
S3C2440, exception handlers must be located at 0x00000000, while on the iMX21,
we can place exception handlers anywhere in memory because the ROM at
0x00000000 uses an exception vector table stored in dedicated RAM as a level of
indirection. On the S3C2440, we place our stub in the NAND “SteppingStone” SRAM
at 0x00000000. On the iMX21, we place our stub in the dedicated exception
handler SRAM. Depending on the SoC, it may also be possible to lock the stub
into the cache, allowing it to be virtually placed over address spaces that are
normally not usable (such as ROMs at 0). MMUs, if available, may also be used
to place the stub at arbitrary locations, but this is left for future work.

Next, the layout of the target’s address space must be specified in QEMU.
Usually this is as simple as defining the address regions of RAM, Flash, and
peripherals. For the iMX21, an additional address space entry is created for
the ROM.

There are usually a few exceptions that must be carved out of the peripheral
address space. These are for registers that, when updated, cause the target to
lose sync with the host. For example, on the S3C2440, there are registers that
control the core clock speed. When the clock speed is adjusted, the CPU is
halted until the PLLs re-lock. JTAG communication fails until the CPU resumes
execution. We can use dynamic analysis techniques to easily determine these
exceptions. If we log all MMIO as the system boots, the last MMIO operation
before the system halts is usually responsible for the failure. The SoC
datasheet can be consulted for the effect of the corresponding register so that
an intelligent exception can be made.

Finally, different SoCs have wildly varying DMA controllers, some of which must
be emulated for proper emulation of the device. For example, the S3C2440 has a
general-purpose DMA controller as well as a dedicated LCD DMA controller.
Neither is required to be emulated to boot Linux. For the iMX21, we emulated
the LCD DMA controller registers in QEMU with only eight additional lines of C.
This emulated DMA controller simply copies the specified video memory from the
emulator to the same location on the target, and then passes the DMA request on
to the real DMA controller to transfer the data to the LCD.

As an alternative to emulating different DMA controllers, we can treat the
emulator’s memory as another level of cache. DMA controllers typically cannot
access the L1 or L2 caches, so any data involved in a transfer must reside in
main memory. We
can treat intentional cache invalidations as an indication that the
memory was or will be used in a DMA transfer and flush the affected
memory to or from the target. (Note that the stub always runs with the
target’s data caches off, so flushes from the emulator to the target will
go directly to main memory.) Unfortunately, this approach only works with
firmware that turns the data caches on, which was not the case with our
wireless medical device.

Overall, we find it straightforward to apply our system to different
devices, requiring far less work than building an emulator for all of the
target’s hardware. There is some manual configuration involved, but this
is true of most dynamic analysis tools.

6. Future Work

6.1 Further improving performance
While our stub protocol is relatively efficient, it still suffers from
inefficiencies in ARM’s DCC specification and limitations of JTAG
interfaces. For example, to read a debug register, we must clock 36 bits
into the EmbeddedICE interface to select the register to read, and then
clock another 36 bits out to read the value. There are two EmbeddedICE
registers we use: the DCC status register and the DCC data register.
Reading a single 32-bit value from the DCC data register involves both
registers, so at least 144 bits (two accesses of 72 bits each) need to be
transferred. While we could propose some changes to the DCC
specification, the most recent ARM processors have transitioned to
debugging interfaces that provide complete access to the SoC bus. We have
not yet examined these new interfaces in detail, as many systems of
interest do not use them yet, but it may be straightforward to adapt our
system to ARM’s new debugging interfaces.

6.2 Eliminating our dependence on hardware
While our system enables dynamic analysis of embedded systems at an
unprecedented scale, it doesn’t necessarily scale any further. Systems
like SAGE [13] and S2E [10] depend on the ability to massively
parallelize state space searches. This is easy with well-defined OS APIs,
but our approach depends on an individual physical system to guide
execution. Even worse, to ensure the hardware is in a consistent state,
we may need to reset the SoC and replay all I/O operations when another
code branch is explored. (In practice, peripherals usually have limited
state, so once they are initialized, we may be able to relax our
consistency requirements and ignore their states.) However, it may be
possible to learn models of the hardware based on execution traces
collected with our system. This would enable dynamic analysis systems to
run largely independent of physical hardware, allowing them to scale up
massively. The models do not necessarily need to be 100% accurate; as
long as they reasonably constrain the state space search, it is feasible
to explore several potentially vulnerable code paths. When a potentially
vulnerable code path is found, it can be verified against the actual
hardware using our system.

7. Conclusions
We have built and evaluated a system that enables dynamic analysis of
embedded systems at an unprecedented scale. Our approach is similar to
Avatar; we run the system under emulation in QEMU and redirect I/O to the
target hardware to guide execution and provide the firmware with a
faithful reproduction of its environment. However, by using a custom FPGA
bridge between the host and target, we enable near-real-time emulation of
the target system, allowing us to analyze systems of far greater
complexity. This will ultimately enable embedded systems developers to
take advantage of several dynamic analysis techniques that were
previously available only to traditional software developers, allowing
them to deliver safer and more secure embedded systems.

8. References
[1] Stephen Checkoway et al., "Comprehensive Experimental Analyses of
    Automotive Attack Surfaces," in USENIX Security Symposium, San
    Francisco, 2011.

[2] Daniel Halperin et al., "Pacemakers and Implantable Cardiac
    Defibrillators: Software Radio Attacks and Zero-Power Defenses," in
    IEEE Symposium on Security and Privacy, 2008.

[3] Michael Lynn, "Cisco IOS Shellcode," in Blackhat USA, Las Vegas, 2005.

[4] Ariel J. Feldman, J. Alex Halderman, and Edward W. Felten, "Security
    Analysis of the Diebold AccuVote-TS Voting Machine," in Electronic
    Voting Technology Workshop, 2007.

[5] Jonas Zaddach, Luca Bruno, Aurelien Francillon, and Davide Balzarotti,
    "Avatar: A Framework to Support Dynamic Security Analysis of Embedded
    Systems' Firmwares," in Network and Distributed System Security
    Symposium, 2014.

[6] Drew Davidson, Benjamin Moench, Somesh Jha, and Thomas Ristenpart,
    "FIE on Firmware: Finding Vulnerabilities in Embedded Systems Using
    Symbolic Execution," in USENIX Security Symposium, 2013.

[7] coresystems GmbH, SerialICE, 2009, http://www.serialice.com/.

[8] Dominic Rath, Open On-Chip Debugger: Design and Implementation of an
    On-Chip Debug Solution for Embedded Target Systems, 2005.

[9] F. Bellard et al., QEMU, http://www.qemu.org/.

[10] Vitaly Chipounov, Volodymyr Kuznetsov, and George Candea, "S2E: A
     Platform for In-Vivo Multi-Path Analysis of Software Systems," in
     16th Intl. Conference on Architectural Support for Programming
     Languages and Operating Systems (ASPLOS), Newport Beach, CA, 2011.

[11] BusyBox, http://www.busybox.net/.

[12] Joe Grand, JTAGulator, 2013,
     http://www.grandideastudio.com/portfolio/jtagulator/.

[13] Patrice Godefroid, Michael Y. Levin, and David Molnar, "Automated
     Whitebox Fuzz Testing," in The 15th Annual Network & Distributed
     System Security Conference, San Diego, 2008.

Appendix A: FPGA Register Map

Addr  Description and value specification

000   Output Control Register
        Bits 31-11  Reserved
        Bit 10      FORCEOUT: Forces JTAG output pins to the values set in
                    this register
        Bit 9       OUTEN: Enables JTAG output pins
        Bit 8       DBGACK
        Bit 7       DBGRQ
        Bit 6       nSRST
        Bit 5       TDO
        Bit 4       RTCK
        Bit 3       TCK
        Bit 2       TMS
        Bit 1       TDI
        Bit 0       nTRST

004   JTAG Stream Control Register
        Bits 31-27  Reserved
        Bit 26      Stub Interface Reset: Reinitializes the stub interface
                    logic
        Bit 25      Stub Interface Scan Enable: Causes the stub interface
                    logic to poll the target for interrupts
        Bit 24      Stream Enable: Streams arbitrary JTAG data (used for
                    non-stub communication)
        Bits 23-0   Stream Length: The number of bits to stream

008   JTAG Clock Divisor
        Bit 31      JTAG Clock Reset
        Bits 30-0   Divisor: The JTAG clock divisor. The JTAG clock speed
                    is 125 MHz / (divisor - 1).

00C   Read Stall Control
        Bit 31      Read Stall Enable: Stalls reads from the Data Register
                    until data is ready
        Bits 30-0   Read Timeout: Read stall timeout, in multiples of 8 ns

x10   Read Address
        Target address to read. X is the transfer size: 1 = byte,
        2 = 16-bit word, 4 = 32-bit word. Writes to this register initiate
        a read from the target.

x14   Write Address
        Target address to write. X is the transfer size: 1 = byte,
        2 = 16-bit word, 4 = 32-bit word.

018   Data Register
        Data returned from a read, or data to be written. Ignored in bulk
        transfer mode. Writes to this register always initiate a write to
        the target.

01C   IRQ Register
        Bits 31-8   Reserved
        Bit 7       FIQ
        Bit 6       IRQ
        Bit 5       Reserved
        Bit 4       Data Abort
        Bits 3-0    Reserved
        Reads from this register return the unacknowledged exceptions
        received from the stub. Write a 1 back to the corresponding bit to
        acknowledge the exception.

024   Target CPSR
        Writes to this register update the target’s CPSR to the given
        value.

028   Bulk Data Length
        Bits 31-25  Reserved
        Bit 24      BULKEN: If set, the stub interface logic uses the
                    bulk-optimized stub protocol, using the stub data FIFOs
                    instead of the Data Register
        Bits 23-0   Number of elements (bytes, half-words, words) to send
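
As an illustration of how host software might compose values for the registers above, the following Python sketch packs and unpacks a few of the bit fields. It is only a sketch under stated assumptions: all function and constant names are hypothetical and do not come from our implementation, the actual register access over PCI Express is not shown, and the reading of "x10"/"x14" as offsets whose leading hex digit encodes the transfer size (for example, 0x410 for a 32-bit read) is an interpretation of the table rather than a confirmed detail. The clock formula is taken verbatim from the Divisor description.

```python
# Hypothetical helpers for composing Appendix A register values.
# Names are illustrative only; bit positions follow the register map.

JTAG_BASE_CLOCK_HZ = 125_000_000  # base clock from the Divisor description

# IRQ Register bits that carry exceptions (all other bits are reserved).
IRQ_BITS = {7: "FIQ", 6: "IRQ", 4: "Data Abort"}


def field(value, width, shift):
    """Place value in a bit field, rejecting values that do not fit."""
    if not 0 <= value < (1 << width):
        raise ValueError(f"{value} does not fit in {width} bits")
    return value << shift


def address_register_offset(size_bytes, low=0x10):
    """Offset of the Read (x10) or Write (x14) Address register.

    Assumption: X is the leading hex digit of the offset.
    """
    if size_bytes not in (1, 2, 4):
        raise ValueError("transfer size must be 1, 2, or 4 bytes")
    return (size_bytes << 8) | low


def jtag_clock_hz(divisor):
    """JTAG clock speed as stated in the table: 125 MHz / (divisor - 1)."""
    if divisor < 2:
        raise ValueError("divisor must be at least 2 to avoid dividing by zero")
    return JTAG_BASE_CLOCK_HZ / (divisor - 1)


def stream_control(stream_length=0, stream_enable=False,
                   scan_enable=False, reset=False):
    """Value for the JTAG Stream Control Register (004)."""
    return (field(int(reset), 1, 26)
            | field(int(scan_enable), 1, 25)
            | field(int(stream_enable), 1, 24)
            | field(stream_length, 24, 0))


def read_stall_control(enable, timeout_ns):
    """Value for Read Stall Control (00C); the timeout counts 8 ns units."""
    return field(int(enable), 1, 31) | field(timeout_ns // 8, 31, 0)


def bulk_data_length(count, bulk_enable=True):
    """Value for Bulk Data Length (028): BULKEN plus the element count."""
    return field(int(bulk_enable), 1, 24) | field(count, 24, 0)


def pending_exceptions(irq_value):
    """Names of unacknowledged exceptions in an IRQ Register (01C) read."""
    return {name for bit, name in IRQ_BITS.items() if (irq_value >> bit) & 1}
```

For example, under these assumptions stream_control(stream_length=8, stream_enable=True) yields 0x01000008, and a host would acknowledge an exception reported by pending_exceptions by writing a 1 back to the same bit of the IRQ Register.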