Flexible Non-intrusive Dynamic Instrumentation for WebAssembly

Ben L. Titzer, Elizabeth Gilbert, Bradley Wei Jie Teo, Yash Anand, Kazuyuki Takayama, Heather Miller
Carnegie Mellon University, Pittsburgh, PA, USA

arXiv:2403.07973v1 [cs.PL] 12 Mar 2024

Abstract

A key strength of managed runtimes over hardware is the ability to gain detailed insight into the dynamic execution of programs with instrumentation. Analyses such as code coverage, execution frequency, tracing, and debugging are all made easier in a virtual setting. As a portable, low-level bytecode, WebAssembly offers inexpensive in-process sandboxing with high performance. Yet to date, Wasm engines have not offered much insight into executing programs, supporting at best bytecode-level stepping and basic source maps, but no instrumentation capabilities. In this paper, we show the first non-intrusive dynamic instrumentation system for WebAssembly in the open-source Wizard Research Engine. Our innovative design offers a flexible, complete hierarchy of instrumentation primitives that support building high-level, complex analyses in terms of low-level, programmable probes. In contrast to emulation or machine code instrumentation, injecting probes at the bytecode level increases expressiveness and vastly simplifies the implementation by reusing the engine's JIT compiler, interpreter, and deoptimization mechanism rather than building new ones. Wizard supports both dynamic instrumentation insertion and removal while providing consistency guarantees, which is key to composing multiple analyses without interference. We detail a fully-featured implementation in a high-performance multi-tier Wasm engine, show novel optimizations specifically designed to minimize instrumentation overhead, and evaluate performance characteristics under load from various analyses. This design is well-suited for production engine adoption, as probes can be implemented to have no impact on production performance when not in use.

https://github.com/titzer/wizard-engine

1 Introduction

Programs have bugs and sometimes run slow. Understanding the dynamic behavior of programs is key to debugging, profiling, and optimizing programs so that they execute correctly and efficiently. Program behavior can be extremely complex, with millions of interesting events [26] [23] [30] [19] [64]. Clearly manual inspection cannot scale and programmatic observation is required.

1.1 Monitors and M-code

Just as we use the common term application to refer to a self-contained program, we will use the term monitor to refer to a self-contained analysis which monitors an application and observes execution events and internal states. For example, a profile monitor might count the number of runtime iterations of a loop or the execution count of basic blocks. A monitor may instrument programs using various mechanisms, before or during execution, execute runtime logic during program execution, and generate a post-execution report. Of particular interest is a monitor's additional runtime storage and logic, which we refer to as monitor data and monitor code. For example, a profile monitor's data includes counters, and the monitor code includes the updates and reporting of counters. Monitor code may be written in a high-level language and compiled to a lower form. We will use M-code to refer to the actual monitor code that will be executed at runtime. M-code may take many forms, including injected bytecode, source code, or machine code, utilities in the engine itself such as tracing modes, or extensions to the engine.

While most monitors aim to observe program behavior without changing it, the implementation technique may not always guarantee this. For example, injecting M-code by overwriting machine code in a native program is fraught with peril, because native programs can observe machine-level details such as reading their own code as data. Robustly separating monitor data from program data is a common problem. Some approaches, such as emulation, avoid these low-level complexities and are inherently side-effect free, as they fully virtualize the execution model and M-code can operate outside of this virtual execution context.

1.1.1 Intrusive approaches. We say a monitor is intrusive if it alters program behavior in a semantically observable way, i.e. it has side-effects on program state. Intrusiveness is a property of the monitor together with the chosen technique
for implementing instrumentation. For example, instrumenting a native program that reads its own code as data could be intrusive if done with code injection, but non-intrusive with an emulator. The intrusiveness of a monitor is independent of whether it perturbs performance characteristics such as execution time and memory consumption¹.

¹ Often, non-intrusive implementation techniques allow measuring memory consumption of the program and monitor separately.

Instrumentation implementations that risk intrusiveness include static and dynamic code rewriting techniques.

Static rewriting. If the execution platform does not directly offer debugging and inspection services, static rewriting is often used, where source code, bytecode, or machine code is injected directly into the program before execution. Static rewriting has its advantages:

+ no support from the execution platform is necessary; will work anywhere
+ not limited by the instrumentation capabilities of the underlying execution platform; can do anything
+ inserted M-code can be small and inline; approaches minimal overhead
+ instrumentation overhead is fixed before runtime; no dynamic instrumentation costs

However, static rewriting can also have disadvantages:

− M-code intrudes on the state space of the code under test; easier to break the program
− for machine- and bytecode-level instrumentation, M-code must necessarily be low-level; more tedious to implement
− for machine code, the binary may need to be reorganized to fit instrumentation; may not always be possible
− offline instrumentation must instrument all possible events of interest; cannot dynamically adapt
− source-level locations and mappings are altered by added code; additional mapping needed
− M-code perturbs performance in subtle and potentially complex ways; unpredictable performance impacts
− pervasive instrumentation could massively increase code size; binary bloat
− some information is only dynamically discoverable; could miss libraries, indirect calls, and generated code

Given these properties, this approach is frequently used and is demonstrated in the source-level tool Oron [40], the bytecode-level tools BISM [50] and Wasabi [33], and the machine-code-level tool EEL [32], among others.

Dynamic rewriting. In contrast to static rewriting, dynamic rewriting allows a monitor to add M-code at runtime. This remedies some of the static rewriting disadvantages:

+ can discover information only available at runtime
+ can instrument 100% of the code
+ does not require recompilation or relinking of the binary
+ potentially less code bloat
+ can dynamically adapt to program behavior, instrumenting more or less
+ implementation technique may be able to preserve some of the original addresses

However, it can have its own disadvantages:

− more instrumentation cost is paid at runtime
− may make the execution platform vastly more complicated, e.g. requiring dynamic recompilation
− the monitor is heavily coupled to the framework used to implement dynamic instrumentation

This approach is very common and is demonstrated in the source-level tool Jalangi [49], the bytecode-level tool DiSL [38], the machine-code-level tools Dyninst [15], Pin [36] and DTrace [27], among others (see Section 6).

1.1.2 Non-intrusive approaches. Several techniques exist that do not alter the program code or behavior; they are side-effect free, as their logic runs non-intrusively outside of the program space. Debuggers for native binaries can use hardware-assisted techniques such as debug registers, JTAG, and process tracing APIs to debug programs directly on a CPU. Emulators can support debugging and tracing easily in their interpreters.

Typically, non-intrusive native mechanisms are slow, imposing orders of magnitude execution time overhead. Yet accurately profiling high-frequency events that happen millions of times per second (branches, method calls, loops, and memory accesses) requires a more high-performance mechanism. For limited cases such as profiling, Linux Perf [11] is a non-intrusive sampling profiler that can analyze programs directly running on the CPU. Valgrind [42] and QEMU [14] are emulators with analysis features and use dynamic binary translation via JIT compilation to reduce overheads. Intel CPUs support a tracing mode known as Intel Processor Trace [7] that emits a densely-encoded history of branches, which can be used to reconstruct program execution paths.

Managed runtime environments offer high-performance implementations of languages and bytecode. Many offer APIs for observing and interacting with a running program. Support for instrumentation can be standard tracing modes, APIs for bytecode injection, or hot code reload (i.e. swapping the entire code of a function or class at a time). For example, the Java Virtual Machine offers JVMTI [3], and the .NET platform offers the .NET profiling API [12]. JVMTI offers both intrusive (dynamic bytecode rewriting with java.lang.instrument) and non-intrusive (Agents²) mechanisms.

² In JVMTI, users can write Agents in native code that fire when Events occur in the running application. They then interface with the program to query state or control the execution itself.

The flexible sensor network simulator Avrora [57], powered by a microcontroller emulator, allows monitors to attach M-code to code and memory locations as well as clock events [58]. The M-code is written in Java and therefore runs
outside of the emulated CPU. Its cycle-accurate interpreter runs microcontroller code faster than realtime on desktop CPUs without the need for a JIT.

Emulator and VM-based approaches still have the disadvantages of dynamic rewriting, but have more advantages:

+ non-intrusive instrumentation; side-effect free
+ simplifies source-level address mapping
+ no need to reorganize binaries
+ can reuse existing JIT in engine; no instrumentation-specific JIT
+ does not require writing M-code in a low-level language
+ does not perturb memory consumption; logical separation between program and monitor memory

1.2 WebAssembly

WebAssembly [28], or Wasm for short, is a portable, low-level bytecode that serves as a compilation target for many languages, including C/C++, Rust, AssemblyScript, Java, Kotlin, OCaml, and many others. Initially released for the Web, it has since seen uptake in many new contexts such as Cloud [60] and Edge computing [1, 43], IoT [34, 35], and embedded and industrial [41] systems. Wasm is gaining momentum as the primary sandboxing mechanism in many new computing platforms, as its execution model robustly separates Wasm module instance state. The format is designed to be load- and run-time efficient, with many high-performance implementations. Wasm comes with strong safety guarantees, starting with a strict formal specification [6], a mechanically-proven sound type system [62], and implementations being subjected to verification [17].

Yet to date, no standard APIs for Wasm instrumentation exist. Only intrusive instrumentation techniques exist today. To work around the lack of standard APIs for instrumentation, several static bytecode rewriting tools have emerged [33] [47]. Wasm engines achieve near-native performance through AOT or JIT compilation. Compilation is greatly simplified (over dynamic binary translation) as Wasm's code units are modules and functions rather than unstructured, arbitrarily-addressable machine code. While Wasm JITs give excellent performance, some engines such as JavaScriptCore [4] and wasm3 [2] employ interpreters, either for startup time or memory footprint. Interpreters also help debuggability and introspection; recent work [55] outlined a fast in-place interpreter design in the Wizard Research Engine.

1.3 Our contributions

Flexible Non-Intrusive Instrumentation. In this work, we describe the first dynamic, non-intrusive (side-effect free) instrumentation framework for WebAssembly and detail its implementation in the open-source Wizard [54] Research Engine³. We show how to implement efficient support for probes in a multi-tier Wasm engine and how to build useful, complex analyses from this basic building block, including tracing, profiling, and debugging.

³ https://github.com/titzer/wizard-engine

Consistent Dynamic Instrumentation. In contrast with prior work, our system also supports the dynamic insertion and removal of individual probes. For example, Pin supports dynamic clearing of instrumentation on a region of code, but it is not probe-specific. Further, we make consistency guarantees about when insertion and removal take effect, which allows multiple analyses to be seamlessly composed.

Zero Overhead When Not Used. This framework imposes zero overhead for disabled instrumentation. To our knowledge, it is the first system that leverages dispatch table switching to implement global probes, bytecode overwriting for local probes, and specific JIT compiler support to achieve zero overhead in all execution tiers.

JIT Intrinsification. We further show novel JIT optimizations that reduce the overhead of common instrumentation tasks by intrinsifying some probes. We evaluate the effectiveness on a standard suite of benchmarks and place our system's performance in context with related work.

Engine Mechanism Reuse. In contrast to Pin [36] and DynamoRIO [18], our work makes only minor additions to the existing execution tiers of the Wasm engine (a few hundred lines of code), rather than a new, purpose-built JIT (tens of thousands of lines of code). In the Wizard multi-tier research engine, we cleverly⁴ reuse its deoptimization mechanism to achieve these consistency guarantees without needing to build a custom mechanism or resort to interpretation only.

⁴ We observe that the deoptimization (and on-stack replacement) mechanism already solves the hard problem of correct transfer between optimization levels and can be repurposed for instrumentation, saving tons of code.

Feasible Production Adoption. Together, these innovations make it feasible for production engines to provide direct support for instrumentation without adding unnecessary complexity, putting powerful capabilities into the hands of application developers.

2 Non-intrusive instrumentation in Wizard

High-performance virtual machines optimize execution time by cheating. JIT compiler optimizations skip some unobservable execution steps of the abstract machine. For example, not every update of a local variable or operand stack value is modeled at runtime, but function-local storage is virtualized and register-allocated. Yet monitoring a program for dynamic analysis inherently observes intermediate states of a program, rather than just its final outcome. Dynamic analyses typically observe states of the abstract machine, so VMs that support introspection must materialize the abstract states whenever requested. For example, a dynamic analysis can observe any of the function-level storage, such as the local variables and operand stack.
Figure 1. Illustration of instrumentation in the interpreter. Global probes can be inserted into the interpreter loop and local probes are implemented via bytecode overwriting. The FrameAccessor API allows a probe programmatic access to the state in the Wasm frame.

Where to instrument programs? Most monitors instrument code to observe the flow of execution or program data. With code instrumentation, locations in the original program code (e.g. bytecode offset, address, line number) become the natural points of reference. This makes it intuitive to use an instrumentation API to attach M-code to program locations which will fire when that point is reached during execution.

Monitoring in Wizard with probes. A dynamic analysis for Wizard involves writing a Monitor in Virgil [53] against an engine API. With this API, everything about a Wasm program's execution can be observed on demand, including any/every bytecode executed, any/every internal state computed, and all interaction with the environment. Monitors observe execution by inserting probes that fire callbacks before specified events or states occur. Callbacks are dynamic logic, but their M-code can be statically compiled into the engine. Their M-code is efficient machine code that the engine invokes directly from either the interpreter or JIT-compiled code. Since this M-code executes as part of the engine and the engine virtualizes Wasm program state, monitors are inherently non-intrusive.

Probes are maximally general. While many systems [13, 24, 37, 64] offer event traces that can be analyzed asynchronously or offline, probes are more fundamental, since they enable the insertion of arbitrary code at arbitrary locations. For example, probes can generate event traces or react to program behavior, but event traces do not offer the ability to influence execution, an inherent capability of synchronous probes. Thus, we say that probes are complete in the sense that every type of instrumentation can be built from them.

Figure 1 illustrates the probe hooks offered by Wizard and their implementation in the interpreter.

2.1 Global Probes

The simplest type of probe is a global probe, which fires a callback for every instruction executed by the program. Clearly global probes are complete, since they can execute arbitrary logic at any point in execution. Despite their inefficiency, global probes are still useful. A global probe is the easiest way to implement tracing, counting, or the step-instruction operation of a debugger.

Global probes are easy to implement in an interpreter; its main loop or dispatch sequence simply contains a check for any global probe(s) and calls them at each iteration. They are also the slowest M-code because, even with a JIT, they effectively reduce the VM to an interpreter⁵.

⁵ E.g. in a compile-only engine, a compilation mode which inserts a call to fire global probes before every instruction suffices, but bloats generated code and has marginal performance benefit over an interpreter.

Unfortunately, the simple implementation technique of an extra check per interpreted instruction imposes overhead even when not enabled, which tempts VMs to have different production and debug builds. A key innovation in Wizard (Section 4) is to implement global probes with dispatch table switching, which imposes zero overhead when global probes are not enabled, obviating the need for a separate debug build. This technique also allows efficient dynamic insertion and removal of global probes, which we have found to be a useful mechanism for implementing some analyses, shown in Section 2.6. Regardless of implementation efficiency, an engine can achieve instrumentation-completeness by adding only global probe support.

2.2 Local Probes

Many dynamic analyses are sparse, only needing to instrument a subset of code locations. For this reason, Wizard also allows local probes to be attached to specific locations in
the bytecode. At runtime, the engine fires local probes just before executing the respective instruction. Each Wasm instruction can be identified uniquely by its module, function, and byte offset from the start of the function, making the triple (module, funcdecl, pc) a natural location identifier in the API. Since local probes only fire when reaching a specific location, they are more convenient for implementing analyses such as branch profiling, call graph analysis, code coverage, breakpoints, etc.

Local probes can be significantly more efficient than global probes for several reasons:

• zero overhead for uninstrumented instructions
• efficient implementation in interpreter and compilers
• compilers can optimize around local probes

Like global probes, local probes are complete; both can be implemented in terms of each other at the cost of efficiency⁶.

⁶ Emulating local probes with global probes can be done with logic that looks up each local probe in M-state, and global probes can be emulated by inserting local probes everywhere, but this incurs overhead from data structure lookups in the engine.

2.3 The FrameAccessor API

While many analyses need only the sequence of program locations executed, more advanced dynamic analyses like taint tracking, fuzzing, and debugging observe program states. To allow probe callbacks access to program state, they receive not only the program location, but also a lazily-allocated object with an API for reading state, called the FrameAccessor. The FrameAccessor provides callbacks a façade [25] with methods to read frame state, abstracting over the machine-level details of frames, which often differ between execution tiers and engine versions. They offer a stable interface to a frame where, due to dynamic optimization and deoptimization, the engine may change the frame representation during the execution of a function.

A FrameAccessor object represents a single stack frame and is allocated when a callback first requests state other than the easily-available WasmFunction and pc. Importantly, the identity of this object is observable to probes so that they can implement higher-level analyses across multiple callbacks. At the implementation level, execution frames maintain the mapping to their accessor by storing a reference in the frame itself, called the accessor slot. The slot is not used in normal execution, but imposes a one-machine-word space overhead; its execution time impact should be negligible.

Stackwalking and callstack depth. The FrameAccessor API allows walking up the callstack to callers so monitors can implement context-sensitive analyses and stacktraces. The depth of the call stack alone is also often useful for tracing or context-sensitive profiling, so FrameAccessor objects include a depth() method, which a VM can implement slightly more efficiently.

Dangling accessor objects. FrameAccessor objects are allocated in the engine's state space (for Wizard, the managed heap), and since probes are free to store references to them across multiple callbacks, it is possible that the accessor object outlives the execution frame that it represents⁷. While the accessor object itself will be eventually reclaimed, it is problematic if M-code accesses frames that have been unwound. We identified a number of implementation mechanisms to protect the runtime system from buggy monitors. Possible solutions include:

1. Clear accessor on entry. Upon entry to a Wasm function, the accessor slot in the execution frame is unconditionally set to null.
2. Invalidate accessor on return. A dynamic check is performed on all returns from a function; if the accessor slot points to a valid FrameAccessor object, the object itself is invalidated (e.g. by setting a field in the object to false).
3. Invalidate accessors on unwind. When unwinding frames for a trap or exception thrown, which is typically done in the runtime rather than compiled code, the accessor object itself is invalidated.
4. Return guards invalidate accessor. When an accessor slot is set, the return address for the frame is also redirected to a trampoline that will invalidate the accessor object before returning to the actual caller.
5. FrameAccessor methods check frame validity. Every call to an accessor object checks that the underlying machine frame points back at the accessor object.
6. FrameAccessor methods check self validity. Every call to an accessor object checks the object's validity field.

⁷ With ownership, as in Rust, lifetime annotations can statically prevent a FrameAccessor object from escaping from a single callback, yet some monitors legitimately want to track frames across multiple callbacks.

Our solution is to minimize checks in the interpreter and compiled code and favor checks at the FrameAccessor API boundary; it corresponds to a combination of 1, 4, and 5. This relies on stack frame layout invariants: function entry clears the accessor slot, the first request for the FrameAccessor materializes the object, and subsequent accessor calls compare the accessor slot to a cached stack pointer in the object. To make these checks bulletproof to monitor bugs, FrameAccessors should be invalidated on return, e.g. with a runtime check⁸.

⁸ Or a return guard trampoline, which avoids any runtime overhead.

2.4 Consistency guarantees

Many analyses can be implemented by making use of dynamic probe insertion and removal. Other analyses, particularly debuggers, could make modifications to frames that alter program behavior. When do new probes and frame modifications take effect? Providing consistency guarantees is a
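The accessor-slot design and its dangling-accessor defenses can be mocked up as follows. All names here are invented for illustration (Wizard's actual implementation is Virgil M-code inside the engine); the mock combines lazy allocation with stable identity, a cleared accessor slot on frame entry, and invalidation on unwind so a stale accessor fails loudly instead of reading a dead frame.

```python
# Mock of the accessor-slot design (invented names): frames carry an
# accessor slot that is unused in normal execution; the FrameAccessor is
# allocated lazily and invalidated when its frame is unwound.

class Frame:
    def __init__(self, func):
        self.func = func
        self.locals = {}
        self.accessor = None            # "accessor slot", cleared on entry

class FrameAccessor:
    def __init__(self, frame):
        self.frame = frame
        self.valid = True               # validity field (cf. strategy 6)

    def get_local(self, i):
        if not self.valid:              # check at the API boundary
            raise RuntimeError("FrameAccessor used after frame was unwound")
        return self.frame.locals[i]

def get_accessor(frame):
    """Lazily allocate the accessor; identity is stable across callbacks."""
    if frame.accessor is None:
        frame.accessor = FrameAccessor(frame)
    return frame.accessor

def unwind(frame):
    """The runtime invalidates the accessor when its frame is popped."""
    if frame.accessor is not None:
        frame.accessor.valid = False
        frame.accessor = None

frame = Frame("main")
frame.locals[0] = 42
acc = get_accessor(frame)
same = acc is get_accessor(frame)   # identity observable to probes
value = acc.get_local(0)
unwind(frame)                       # any later acc.get_local(...) raises
```

This places the cost on the accessor-method call path rather than on every interpreted instruction, mirroring the paper's choice to favor checks at the FrameAccessor API boundary.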
key innovation in our system that makes composing multiple analyses reliable. With these guarantees, probes from multiple monitors do not interfere, making monitors composable and deterministic. Monitors can be used in any combination without explicit foresight in their implementation.

2.4.1 Deterministic firing order. What should happen if a probe 𝑝 at location 𝐿 fires and inserts another probe 𝑞 at the same location 𝐿? Should the new probe 𝑞 also fire before returning to the program, or not? Similarly, if probes 𝑝 and 𝑞 are inserted on the same event, is their firing order predictable?

We found that a guaranteed probe firing order is subtly important to the correctness of some monitors (e.g. the function entry/exit utility shown in Section 2.5). For this reason, we guarantee three dynamic probe consistency properties:

• Insertion order is firing order: Probes inserted on the same event 𝐸 fire in the same order as they were inserted.
• Deferred inserts on same event: When a probe fires on event 𝐸 and inserts new probes on the same 𝐸, the new probes do not fire until the next occurrence of 𝐸.
• Deferred removal on same event: When a probe fires on event 𝐸 and removes probes on the same 𝐸, the removed probes do fire on this occurrence of 𝐸 but not subsequent occurrences.

2.4.2 Frame modifications. As shown, the FrameAccessor provides a mostly read-only interface to program state. Since monitors run in the engine's state space, and not the Wasm program's state space, by construction this guarantees that monitors do not alter the program behavior. However, some monitors, such as a debugger's fix-and-continue operation, or fault-injection, intentionally change program state.

For an interpreter, modifications to program state, such as local variables, require no special support, since interpreters typically do not make assumptions across bytecode boundaries. For JIT-compiled code, any assumption about program state could potentially be violated by M-code frame modifications. Depending on the specific circumstance, continuing to run JITed code after state changes might exhibit unpredictable program behavior⁹.

⁹ True even for baseline compilers like Wizard's compiler, which perform limited optimizations like register allocation and constant propagation.

It's important for the engine to provide a consistency model for state changes made through the FrameAccessor. When monitors explicitly intend to alter the program's behavior, it is natural for them to expect state changes to take effect immediately, as if the program is running in an interpreter. Thus, our system guarantees:

• Frame modification consistency: State changes made by a probe are immediately applied, and execution after a probe resumes with those changes.

This effectively requires immediate deoptimization of a frame, also guaranteed by JVMTI. Otherwise, if execution continues in JIT-compiled code, almost any invariant the JIT relied on could be invalid, and it may appear that updates have not occurred yet, violating consistency.

2.4.3 Multi-threading. While Wizard is not currently multi-threaded, WebAssembly does have proposals to add threading capabilities, which Wizard must eventually support. That brings with it the possibility of multi-threaded instrumentation. Locks around insertion and removal of probes should maintain our consistency guarantees through serializing dynamic instrumentation requests. Our design inherently separates monitor state from program state. Thus, data races on the monitor state are the responsibility of the monitors, for example by using lock-free data structures and/or locks at the appropriate granularity. The FrameAccessor can also include synchronization to prevent data races on Wasm state¹⁰.

¹⁰ Note: frames are by definition thread-local; races can only exist if the monitor itself is multi-threaded and FrameAccessor objects are shared racily.

2.5 Function Entry/Exit Probes

Probes are a low-level, instruction-based instrumentation mechanism, which is natural and precise when interfacing with a VM. Yet many analyses focus on function-level behavior and are interested in calls and returns. Instrumentation hooks for function entry/exit make such analyses much easier to write.

At first glance, detecting function entry can be done by probing the first bytecode of a function, and exit can be detected by probing all returns, throws, and brs that target the function's outermost block. However, some special cases make this tricky. First, a function may begin with a loop; the entry probe must distinguish between the first entry to a function, a backedge of the loop, and possible (tail-)recursive calls. Second, local exits are not enough: frames can be unwound by a callee throwing an exception caught higher in the callstack.

Should the VM support function entry/exit as special hooks for probes? Interestingly, we find this is not strictly necessary. This functionality can be built from the programmability of local probes and offered as a library. There are several possible implementation strategies: 1) use entry probes that push the FrameAccessor objects onto an internal stack, with exit probes popping; 2) sampling the stack depth via the FrameAccessor's depth() method; or 3) instrumenting, and thus ignoring, loop backedges. Thus, function entry/exit reside above global/local probes in the hierarchy of instrumentation mechanisms. This is further evidence that the programmability of probes allows building higher-level instrumentation utilities for more expressive dynamic analyses.
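The three dynamic-probe consistency properties of Section 2.4.1 can be modeled by taking a snapshot of the probe list at firing time. In this Python sketch (a hypothetical Event class invented for illustration, not Wizard's API), iterating over a copy of the list means probes inserted during firing are deferred to the next occurrence, probes removed during firing still complete the current one, and list order preserves insertion order.

```python
# Model of the three probe-consistency guarantees for one event E:
# (1) insertion order is firing order, (2) inserts during firing defer to
# the next occurrence, (3) removals during firing still fire this time.

class Event:
    def __init__(self):
        self.probes = []

    def insert(self, probe):
        self.probes.append(probe)

    def remove(self, probe):
        self.probes.remove(probe)

    def fire(self):
        for probe in list(self.probes):   # snapshot taken at firing time
            probe(self)

log = []
E = Event()

def q(ev):
    log.append("q")

def p(ev):
    log.append("p")
    if q not in ev.probes:
        ev.insert(q)    # deferred insert: q misses this occurrence

def r(ev):
    log.append("r")
    ev.remove(r)        # deferred removal: r still fires this occurrence

E.insert(p)
E.insert(r)
E.fire()   # p fires and inserts q (deferred); r fires once, removes itself
E.fire()   # p then q fire, in insertion order; r is gone
```

The snapshot-per-firing discipline is one simple way to satisfy all three guarantees at once; an engine could equally implement them with versioned probe lists, so long as the observable order matches.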
2.6 After-instruction

Some analyses, such as branch profiling or dynamic call graph construction, are naturally expressed as M-code that should run after an instruction rather than before. For example, profiling which functions are targets of a call_indirect would be easiest if a probe could fire after the instruction is executed and a frame for the target function has been pushed onto the execution stack. However, the API has no such functionality.

Should the VM support an "after-instruction" hook directly? Interestingly, we find that like function entry/exit, the unlimited programmability of probes allows us to invoke M-code seemingly after instructions. For example, suppose we want to execute probe 𝑝 after a br_table (i.e. Wasm's switch instruction). We identified at least three strategies:

• A probe 𝑞𝑝 executed before the br_table can use the FrameAccessor object to read the top (i32 value) of the operand stack, determine where the branch will go, and dynamically insert probe 𝑝 at that location.
• Insert probes into all targets of the br_table. Since br_table has a fixed set of targets, we can insert probes once and use M-state to distinguish reaching each target from the br_table versus another path. This only works in limited circumstances; other instructions like call_indirect have an unlimited set of targets.
• Insert a global probe for just one instruction and remove it after. The probe will fire on the next instruction, wherever that is, then the probe will remove itself. For a use case like this, it's important that dynamically enabling global probes doesn't ruin performance, e.g. by deoptimizing all JIT-compiled code. We show in Section 4.1 how dispatch-table switching can make this use case efficient.

With multiple strategies to emulate its behavior, an after-instruction hook resides above global/local probes in the instrumentation mechanism hierarchy.

3 The Monitor Zoo

The wide variety and ease with which analyses are implemented¹¹ showcases the flexibility of having a fully programmable instrumentation mechanism in a high-level language. Users activate monitors with flags when invoking Wizard (e.g. wizeng --monitors=MyMonitor), which instrument modules at various stages of processing before execution and may generate post-execution reports. Examples of monitors we have built include a variety of useful tools.

The Trace monitor prints each instruction as it is executed. While many VMs have tracing flags and built-in modes that may be spread throughout the code, Wizard already offers the perfect mechanism: the global probe. Instruction-level tracing in Wizard simply uses one global probe. Other than a short flag to enable it, there is nothing special about this probe; it uses the standard FrameAccessor API as it prints instructions and the operand stack.

The Coverage monitor measures code coverage. It inserts a local probe at every instruction (or basic block), which, when fired, sets a bit in an internal data structure and then removes itself. By removing itself, the probe will no longer impose overhead, either in the interpreter or JITed code. Eventually, all executed paths in the program will be probe-free and JITed code will asymptotically approach zero overhead. This is a good example of a monitor using dynamic probe removal.

The Loop monitor counts loop iterations. It inserts CountProbes at every loop header and then prints a nice report. This is a good example of a counter-heavy analysis.

The Hotness monitor counts every instruction in the program. It inserts CountProbes at every instruction and then prints a summary of hot execution paths. Another example of a counter-heavy analysis.

The Branch monitor profiles the direction of all branches. It instruments all if, br_if and br_table instructions and uses the top-of-stack to predict the direction of each branch. It is a good example of non-trivial FrameAccessor usage.

The Memory monitor traces all memory accesses. It instruments all loads and stores and prints loaded and stored addresses and values. Another good example of non-trivial FrameAccessor usage.

The Debugger REPL implements a simple read-eval-print loop that allows interactive debugging at the Wasm bytecode level. It supports breakpoints, watchpoints, single-step, step-over, and changing the state of value stack slots. It primarily uses local probes but uses a global probe to implement single-step functionality. This monitor is a good example of dynamic probe insertion and removal. It is also the only monitor (so far) that modifies frames.

The Calls monitor instruments callsites in the program and records statistics on direct calls and the targets of indirect calls. Its output can be used to build a dynamic call graph from an execution.

The Call tree profiler measures execution time of function calls and prints self and nested time using the full calling-context tree. It can also produce flame graphs. It inserts local probes at all direct and indirect callsites and all return locations¹². It is a good example of a monitor that measures non-virtualized metrics like wall-clock time.

¹¹Most monitors required a dozen or two lines of instrumentation code; in fact, most lines are usually spent on making pretty visualizations of the data!
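The Coverage monitor's self-removing probe is worth sketching, since it shows how probe removal turns instrumentation cost into a one-time cost per location. The following is an illustrative Python model with hypothetical names (Wizard's monitors are not written in Python), not the engine's actual code.

```python
# Sketch of a self-removing coverage probe over a toy interpreter.

class Interp:
    def __init__(self, code):
        self.code = code
        self.probes = {}  # location -> list of probe callbacks

    def insert_probe(self, loc, probe):
        self.probes.setdefault(loc, []).append(probe)

    def remove_probe(self, loc, probe):
        self.probes[loc].remove(probe)

    def run(self):
        for loc, _insn in enumerate(self.code):
            # Copy the list: a probe may remove itself while firing.
            for probe in list(self.probes.get(loc, [])):
                probe(self, loc)

covered = set()

def coverage_probe(interp, loc):
    covered.add(loc)                          # set the coverage bit
    interp.remove_probe(loc, coverage_probe)  # never fire here again

interp = Interp(["i32.const", "i32.const", "i32.add"])
for loc in range(len(interp.code)):
    interp.insert_probe(loc, coverage_probe)
interp.run()   # first run: each probe fires once, then removes itself
interp.run()   # second run: no probes remain, no overhead
print(sorted(covered))  # [0, 1, 2]
```

After the first execution of a path, that path carries no instrumentation at all, which is exactly the asymptotic zero-overhead behavior described above.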
Figure 2. Code generated by Wizard’s baseline JIT for different types of M-code implemented with probes. The machine code
sequence for generic probes is more general than for probes that only need the top-of-stack value, versus a fully-intrinsified
counter probe.
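The three lowerings in Figure 2 can be modeled abstractly: a generic probe call that reifies frame state, a specialized call that passes only the top-of-stack value, and a fully-intrinsified counter that is just an inlined increment. The following is an illustrative Python sketch with hypothetical names, not the machine code the JIT actually emits.

```python
# Abstract model of the three probe lowerings in Figure 2.

class GenericProbe:
    """Receives a reified view of the whole frame."""
    def __init__(self):
        self.frames = []
    def fire(self, accessor):
        self.frames.append(accessor)

class TopOfStackProbe:
    """Only needs the top operand-stack value."""
    def __init__(self):
        self.seen = []
    def fire(self, value):
        self.seen.append(value)

class CountProbe:
    """Needs no arguments at all."""
    def __init__(self):
        self.count = 0

def fire_generic(probe, frame):
    accessor = dict(frame)   # stand-in for reifying a FrameAccessor (expensive)
    probe.fire(accessor)

def fire_top_of_stack(probe, tos):
    probe.fire(tos)          # direct call, passing just the operand value

def fire_counter(probe):
    probe.count += 1         # intrinsified: an inlined increment, no call at all

g, t, c = GenericProbe(), TopOfStackProbe(), CountProbe()
fire_generic(g, {"pc": 8, "stack": (1, 2)})
fire_top_of_stack(t, 42)
fire_counter(c)
fire_counter(c)
print(len(g.frames), t.seen, c.count)  # 1 [42] 2
```

Each step to the right drops work: the specialized case skips building the frame view, and the counter case skips the call entirely.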
4 Optimizing probe overhead

Optimizations in Wizard's interpreter and JIT compiler reduce overhead for both global and local probes; we evaluate the effectiveness of these techniques in Section 5.3. We define overhead as the execution time spent in neither application code nor M-code, but in transitions between application and M-code or additional work in the runtime system and compiler.

4.1 Optimizing global probes in the interpreter

Global probes, being the most heavyweight instrumentation mechanism, are supported only in the interpreter. It is straightforward to add a check to the interpreter loop that checks for any global probes at each instruction. However, this naive approach imposes overhead on all instructions executed, even if global probes are not enabled. One option to avoid overhead when global probes are disabled is to have two different interpreter loops, one with the check and one without, and dynamically switch between them. This comes at some VM code space cost, since it duplicates the entire interpreter loop and handlers. Another approach described in [55] avoids the code space cost by maintaining a pointer to the dispatch table in a hardware register. When global probes are not in use, this register points to a "normal" dispatch table without instrumentation; inserting a global probe switches the register to point to an "instrumented" dispatch table where all (256) entries point to a small stub that calls the probe(s) and then dispatches to the original handler via the "normal" dispatch table. Both code duplication and dispatch-table switching are suitable for production, as they allow the VM to support global probes while imposing no overhead when disabled.

Dynamically adding and removing global probes shouldn't ruin performance, as they might be used to implement "after-instruction" or to trace a subset of the code, such as an individual function or loop. Our design further extends [55] by supporting global probes without deoptimizing JITed code. This can be done by temporarily returning to the interpreter in the global probe mode. In global probe mode, a different dispatch table is used, which, in addition to calling probes for every instruction, can use special handlers for certain bytecodes. For example, the loop bytecode does not check for dynamic tier-up (which would cause a transfer to JITed code), call instructions reenter the interpreter (rather than entering the callee's JITed code, if any), and return returns only to the interpreter (rather than the caller's JIT code). Otherwise, JIT code remains in-place. Removing global probes leaves this mode and JIT code will naturally be reentered as normal. See Section 4.6 for how we guarantee consistency after state modifications. To our knowledge, our design is the first to support switching into a heavyweight instrumentation mode and back without discarding any JITed code, preserving performance.

¹²Wizard has preliminary support for the proposed Wasm exception handling mechanism, but does not yet have monitoring hooks for unwind events.

4.2 Optimizing local probes in the interpreter

Both Wizard's interpreter and baseline JIT support local probes. In the interpreter, local probes impose no overhead on non-probed instructions by using in-place bytecode modifications. With bytecode overwriting, inserting a local probe at a location 𝐿 overwrites its original opcode with an otherwise-illegal probe opcode. The original unmodified opcode is saved on the side. When the interpreter reaches a probe opcode, the Wasm program state (e.g. value stack) is already up-to-date; it saves the interpreter state, looks up the local probe(s) at the current bytecode location, and simply calls that M-code callback. This is somewhat reminiscent of machine code overwriting, a technique sometimes used to implement debugging or machine code instrumentation
(Pin, gdb and DynamoRIO). However, our approach is vastly simpler and more efficient as it doesn't require hardware traps or solving a nasty code layout issue: only a single bytecode is overwritten.

In Wizard, since the callback is compiled machine code, the overhead is a small number of machine instructions to exit the interpreter context and enter the callback context. After returning from M-code, 𝐿's original opcode is loaded (e.g. by consulting an unmodified copy of the function's code) and executed. Removing a probe is as simple as copying the original bytecode back; the interpreter will no longer trip over it. In contrast, Pin allows disabling by removing all instrumentation from a specified region of the original code, which effectively reinstalls the original code, an all-or-nothing approach rather than having control at the probe granularity. Overwriting has two primary advantages over bytecode injection: the original bytecode offsets are maintained, making it trivial to report locations to M-code, and insertion/removal of probes is a cheap, constant-time operation. Consistency is trivial; the bytecode is always up-to-date with the state of inserted instrumentation.

4.3 Local probes in the JIT

In a JIT compiler, local probes can be supported by injecting calls to M-code into the compiled code at the appropriate places. Since probe logic could potentially access (and even modify) the state of the program through the FrameAccessor, a call to unknown M-code must checkpoint the program and VM-level state. For baseline code from Wizard's JIT, the overhead is a few machine instructions more than a normal call between Wasm functions¹³. Compilation speed is paramount to a baseline compiler, and bytecode parsing speed actually matters. Similar to the benefits to interpreter dispatch, bytecode overwriting avoids any compilation speed overhead because the probe opcode marks instrumented instructions and additional checks aren't needed. Overall, supporting probes adds little complexity to the JIT compiler; in Wizard's JIT, it requires less than 100 lines of code.

4.4 JIT intrinsification of probes

While probes are a fully-programmable instrumentation mechanism to implement unlimited analyses, there are a number of common building blocks such as counters, switches, and samplers that many different analyses use. For logic as simple as incrementing a counter every time a location is reached, it is highly inefficient to save the entire program state and call through a generic runtime function to execute a single increment to a variable in memory. Thus, we implemented optimizations in Wizard's JIT to intrinsify counters as well as probes that access limited frame state.

Figure 2 shows how Wizard's baseline JIT optimizes different kinds of probes. At the left, we have uninstrumented code. For the generic probe case, the JIT inserts a call to a generic runtime routine that calls the user's probe. For the next more specialized case, the top-of-stack, it inserts a direct call to the probe's fire method, passing the top-of-stack value, skipping the runtime call overhead and the cost of reifying an expensive FrameAccessor object. In general, values from the frame can be directly passed from the JITed code to M-code. Lastly, for the counter probe, we see that Wizard's JIT simply inlines an increment instruction to a specific CountProbe object without looking it up.

Other systems allow building custom inline M-code. For example, Pin offers a type of macro-assembler that builds IR that it compiles into the instrumented program, which is very low-level, tedious, and error-prone.

4.5 Monitor consistency for JITed code

We just saw how a JIT can inline M-code into the compiled code. However, M-code can change as probes are inserted and removed during execution, making compiled code that has been specialized to M-code out-of-date. This problem can be addressed by standard deoptimization techniques such as on-stack-replacement back to the interpreter and invalidating relevant machine code. To our knowledge, no prior bytecode-based system has employed deoptimization to support dynamic instrumentation of an executing frame; they offer only hot code replacement.

4.6 Strategies for multi-tier consistency

There are several different strategies for guaranteeing monitor consistency in a multi-tier engine like Wizard. We identified four plausible strategies:

1. When instrumentation is enabled, disable the JIT.
2. When instrumentation is enabled, disable only relevant JIT optimizations.
3. Upon frame modification, recompile the function under different assumptions about frame state and perform on-stack-replacement from JITed to JITed code.
4. Upon frame modification, perform on-stack-replacement from JITed code to the interpreter.

Strategy 1) is the simplest to implement for engines with interpreters, but slow. A production Wasm engine could achieve functional correctness and the key consistency guarantees at little engineering cost, leaving instrumented performance as a later product improvement. Strategy 2) eliminates interpreter dispatch cost, but, ironically, is actually a lot of work in practice, since it introduces modes into the JIT compiler and optimizations must be audited for correctness. The compiler becomes littered with checks to disable optimization and ultimately the JIT emits very pessimistic code. Strategy 3) has other implications for JIT compilation, such

¹³Primarily because the calling convention models an explicit value stack.
as requiring support for arbitrary OSR locations¹⁴, which is also significant engineering work.

In Wizard, we chose strategy 4, which we believe to be not only the simplest, but most robust. Frame modifications trigger immediate deoptimization of only the modified frame¹⁵, rewriting it in place to return to the interpreter. In the dynamic tiering configuration mode, sending an execution frame back to the interpreter due to modification doesn't banish it there forever; if it remains hot, it can be recompiled under new assumptions¹⁶. This means frame modification support requires the interpreter; Wizard will not allow modifications in the JIT-only configuration.

Inserting or removing probes in a function also triggers deoptimization of JITed code for the function and sends existing frames back to the interpreter. This is different than a frame modification, because the JIT may have specialized the code to instrumentation at the time of compilation; the code is actually invalid w.r.t. the instrumentation it should execute. Like with frame modifications, hot functions will eventually be recompiled. It's likely that such highly dynamic instrumentation scenarios would perform better by using M-state to enable and disable their probes rather than repeatedly inserting and removing them, which confounds engine tiering heuristics.

Figure 3. Average relative execution time for the hotness monitor (left) and branch monitor (right), when implemented with local probes and when implemented with a global probe on the PolyBenchC suite. Points above the bars denote number of probe fires.

5 Evaluation

In this section, we evaluate performance of monitoring code using three suites of benchmarks and several different implementation strategies. We compare instrumenting Wasm code in Wizard, bytecode rewriting, bytecode injection with Wasabi, and native code instrumentation with DynamoRIO.

5.1 Evaluation setup

We evaluate the performance of Wizard by executing Wasm code under both the interpreter and JIT using different monitors and measure total execution time of the entire program, including engine startup and program load. We chose the "hotness" and "branch" monitors (described in Section 3). The hotness monitor instruments every instruction¹⁷ with a local CountProbe, which is representative of monitors with many simple probes. The branch monitor probes branch instructions and tallies each destination by accessing the top of the operand stack. Compared to the hotness monitor, probes in the branch monitor are more sparse but more complex.

These monitors were chosen because they strike a balance between being powerful enough to capture insights about the execution of a program, yet simple enough to be implemented in other systems. They are also likely to instrument a nontrivial portion of program bytecode.

Benchmark Suites. We run Wasm programs from three benchmark suites: PolyBench/C [45] with the medium dataset, Ostrich [29], and Libsodium [22], and average execution time over 5 runs.

Given instrumented execution time 𝑇𝑖 and uninstrumented execution time 𝑇𝑢, we define absolute overhead as the quantity 𝑇𝑖 − 𝑇𝑢 and relative execution time as the ratio 𝑇𝑖/𝑇𝑢. We report relative execution time for Wizard's interpreter, Wizard's JIT (with and without intrinsification), DynamoRIO, Wasabi, and bytecode rewriting in Figures 6 and 7.

5.2 Global vs local probes

Global probes can emulate the behavior of local probes, but impose a greater performance cost by introducing checks at every bytecode instruction. We compare two implementations of the branch and hotness monitors, one using a global probe and the other using local probes. Both are executed in Wizard's interpreter, since Wizard's JIT doesn't support global probes. The results can be found in Figure 3. For the hotness monitor, since the number of probe fires is the same for local and global probes, the relative overhead is similar across all programs. For the branch monitor, local probes on branch instructions have relative execution times between 1.0–2.2×, whereas it is between 7.7–16.4× for global probes.

¹⁴Most JITs that allow tier-up OSR into compiled code only do so at loop headers.
¹⁵We observe that the JIT-compiled code for a function is not invalid; it is only the state of the single frame that now differs from assumptions in the JIT code. New calls to the involved function can still legally enter the existing JIT code.
¹⁶Pathological cases can occur where hot frames are repeatedly modified, constantly transferring between interpreter and JITed code. A typical fix employed in many VMs is to simply limit the number of times a function can be optimized and offer user diagnostics.
¹⁷Obviously, it is more efficient to count basic blocks. We chose to count every instruction in order to maximize instrumentation workload.
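The two metrics defined in Section 5.1 are direct arithmetic on the measured times. A small sketch with hypothetical helper names:

```python
# Overhead metrics from Section 5.1: given instrumented time T_i and
# uninstrumented time T_u, absolute overhead is T_i - T_u and relative
# execution time is T_i / T_u.

def absolute_overhead(t_instrumented, t_uninstrumented):
    return t_instrumented - t_uninstrumented

def relative_execution_time(t_instrumented, t_uninstrumented):
    return t_instrumented / t_uninstrumented

# e.g. a benchmark taking 4.5s instrumented vs 2.0s uninstrumented:
print(absolute_overhead(4.5, 2.0), relative_execution_time(4.5, 2.0))  # 2.5 2.25
```

Note that the two metrics can diverge: the interpreter's relative execution time looks better simply because the uninstrumented baseline is slower, even when the absolute overhead is comparable (see Section 5.4).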
Figure 4. Average relative execution times for the hotness (left) and branch monitors (right), with and without probe intrinsification on the PolyBenchC suite. Ratios are relative to uninstrumented JIT execution time. Points above the bars denote number of probe fires.

Figure 5. Execution time decomposition of hotness (left) and branch monitors (right) into M-code and probe dispatch overhead with and without probe intrinsification on the PolyBenchC suite. The cross-hatched regions represent overhead saved by intrinsification.

5.3 JIT optimization of count and operand probes

Section 4 describes how Wizard's JIT intrinsifies some types of probes to reduce overhead. We evaluate JIT intrinsification in Figure 4 and report the relative execution time of instrumented over uninstrumented execution.

For the hotness monitor, which counts the execution frequency of every instruction, we observed relative execution times between 7–134×. This is due to the high cost of switching between JIT code and the engine at every instruction. With intrinsification, the same monitor has relative execution times between 2.2–7.7×.

We performed a similar experiment to evaluate the effectiveness of JIT intrinsification of top-of-stack operand probes by measuring execution times with the branch monitor. We see that intrinsification improves relative execution times from 1.0–16.6× with the base JIT to 1.0–2.8×. The improvement is smaller for branch probes than for CountProbes, because a call into the probe's M-code remains, whereas CountProbes are fully inlined (see Figure 2).

We further decompose the runtime of the benchmarks into the time spent in the program's JIT-compiled code (𝑇JIT), time in M-code (𝑇𝑀), and time in the probe dispatch logic (𝑇PD). This decomposition is done by recording:

1. The uninstrumented execution time of code in the JIT, which approximates 𝑇JIT;
2. The instrumented execution time with empty probes (probes with empty fire functions), which approximates 𝑇PD + 𝑇JIT;
3. The instrumented execution time with actual probes, which gives 𝑇PD + 𝑇𝑀 + 𝑇JIT.

The results of this analysis for the branch and hotness monitors are in Figure 5. Execution time without JIT intrinsification is shown as the entire bar for each program. The cross-hatched portions of each bar represent the execution time saved by intrinsification. For the non-intrinsified branch monitor, the overhead 𝑇PD + 𝑇𝑀 is dominated by M-code. In the intrinsified case, the overhead is dominated by probe dispatch, and the M-code overhead is reduced substantially: calling the top-of-stack operand probe's M-code still requires significant spilling on the stack and a call, contributing to runtime overhead. The M-code overhead no longer includes time for construction of the FrameAccessor as it is not necessary.

As for the non-intrinsified hotness monitor, the overhead is dominated by the probe dispatch overhead as probes are simpler but fired more frequently. In the intrinsified case, there is almost no M-code overhead as counter probes do not have custom fire functions; the counter increment is entirely inlined. The remaining probe dispatch overhead comes from the monitor setup and reporting.

5.4 Interpreter vs. JIT

We find that the relative overhead of monitors running in Wizard's interpreter is much lower than the JIT, for two reasons: the interpreter runs much slower, and less additional work is done in checkpointing state. In contrast, calls to local probes in the JIT require checkpointing to support the FrameAccessor API. Data in Figures 6 and 7 show that, for the branch monitor, the relative execution time in the interpreter is 1.0–2.2× as compared to 1.0–16.6× in Wizard's JIT. In the higher-workload hotness monitor, this difference is exacerbated: the relative execution time in the interpreter is 7.0–13.5× as compared to 7.0–134× in the JIT. Although relative execution times differ substantially, absolute overhead between the two modes is comparable: for the branch monitor, the mean overhead in the interpreter is 2.6s and 2.3s in the JIT.
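The three measurements enumerated in Section 5.3 determine each component of the decomposition by subtraction. A small sketch with hypothetical helper names:

```python
# Recovering T_JIT, T_PD, and T_M from the three measurements in
# Section 5.3 (times in seconds).

def decompose(t_uninstrumented, t_empty_probes, t_actual_probes):
    """(1) ~ T_JIT; (2) ~ T_PD + T_JIT; (3) = T_PD + T_M + T_JIT."""
    t_jit = t_uninstrumented                 # measurement (1)
    t_pd = t_empty_probes - t_uninstrumented # (2) - (1): probe dispatch time
    t_m = t_actual_probes - t_empty_probes   # (3) - (2): time in M-code
    return t_jit, t_pd, t_m

print(decompose(2.0, 3.5, 6.0))  # (2.0, 1.5, 2.5)
```

This is how Figure 5 apportions each bar between program time, probe dispatch, and M-code.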
Figure 6. Relative execution times of the hotness monitor (bottom) and branch monitor (top) in Wizard, Wasabi, and DynamoRIO across all programs on all suites, sorted by absolute execution time. Ratios are relative to uninstrumented execution time.

Figure 7. Mean relative execution times of the hotness monitor (left) and branch monitor (right) in Wizard, Wasabi, and DynamoRIO across the three suites. Ratios are relative to uninstrumented execution time.

5.5 Comparison with bytecode rewriting

Bytecode rewriting is an example of static instrumentation described in Section 1.1.1. Using Walrus [8], a Wasm transformation library written in Rust, we implemented the hotness and branch monitors by rewriting bytecode [9]. For the hotness monitor, we inject counting instructions before each instruction, and for the branch monitor, before each branching instruction. Counters are stored in memory, necessitating loads and stores. We evaluated the performance of the transformed Wasm bytecode when run in Wizard's JIT, and compared it to their respective monitors in Wizard. From Figure 7, we observe that the intrinsified JIT execution time is lower than that of bytecode rewriting for both monitors.

5.6 Comparison with Wasabi

Wasabi is a dynamic instrumentation tool that runs analyses on Wasm bytecode using a JavaScript engine. Since Wasabi instrumentation must be written in JavaScript, it requires a Wasm engine that also runs JavaScript, such as V8 [5]. For this comparison, we use V8 in its default mode (two compiler tiers)¹⁸. Figure 6 includes data for Wasabi on V8. Wasabi instrumentation is vastly slower than Wizard instrumentation due to the overhead of calling JavaScript functions. On average, a hotness monitor in Wasabi increases execution time 36.8–6350.2×, compared to 7–134× for Wizard's JIT (or 2.2–7.7× with intrinsification). The branch monitor also has a drastic performance impact of 29.9–4721.5× in Wasabi, compared to 1.0–16.6× for Wizard's JIT (or 1.0–2.8× with intrinsification).

5.7 Comparison with DynamoRIO

We also compare with machine code instrumentation. We cannot make a direct comparison, so instead, we compile the same benchmark programs to x86-64 assembly and instrument them with DynamoRIO, with analogous machine-code hotness and branch monitors.

The results are shown in Figures 6 and 7. Executing the native programs instrumented with a DynamoRIO hotness

¹⁸We also conducted experiments limiting V8 to its baseline compiler, for a closer comparison to Wizard's JIT. Results indicate that the Wasm compiler makes little difference; the overhead is dominated by JavaScript execution and transitions between JavaScript. For more JIT comparisons see [56].
monitor is about 3.9–192× slower than without instrumentation. The hotness monitor has a substantial relative overhead because, among other things, DynamoRIO inserts instructions to spill and restore EFLAGS for each counter increment. On the other hand, the DynamoRIO branch monitor slows execution time from 4.4–153×, again compared to 1.0–16.6× for Wizard's JIT (or 1.0–2.8× with intrinsification). This is likely because our DynamoRIO monitor is implemented with a function call at every basic block, which DynamoRIO can sometimes inline, since it works at the machine code level. Its default inlining heuristics seem to give rise to unpredictable overheads.

5.8 Evaluation summary

We evaluated Wizard's instrumentation overheads by measuring the relative execution times of a branch and a hotness monitor across multiple standardized benchmarks and instrumentation approaches. For monitors with sparse probes, like the branch monitor, local probes improve on the performance of global probes (relative execution time of 1.0–2.2× versus 7.7–16.4×). Running monitors in Wizard's JIT further improves this performance to a relative execution time of 1.0–16.6× without intrinsification and 1.0–2.8× with intrinsification. This greatly outperforms DynamoRIO and Wasabi, with relative execution times of 4.4–153× and 29.9–4721.5× respectively. Surprisingly, JIT intrinsification can produce instrumentation overhead even lower than intrusive bytecode rewriting, shown in Figure 7. Our results show Wizard's instrumentation architecture is both flexible and efficient.

6 Related work

Techniques for studying program behavior have been the subject of a vast amount of research. They differ in when to instrument (statically or dynamically), at which level (source, IR, bytecode, or machine code), and what mechanism is used to do so. A key issue is that the analysis and the mechanism work together to analyze a program's behavior intrusively or non-intrusively, yet some mechanisms make it easier to implement non-intrusive analyses.

6.1 Code injection

A common approach is to inject M-code directly into programs [63], either statically or dynamically. Injecting code

into Wasm bytecode that call instrumentation code provided as JavaScript.

Dynamic. FERRARI [16] statically instruments core JDK classes while dynamically instrumenting all others using java.lang.instrument. SaBRe [10] injects instrumentation at load-time, thus paying the rewriting cost once at startup rather than continuously during execution. DTrace [21, 27], inspired by Paradyn [39] and other tools, enables tracing at both the user and kernel layer of the OS by operating inside the kernel itself and uses dynamically-injected trampolines. Dyninst [15] interfaces with a program's CFG and maps modifications to concrete binary rewrites. A user can tie M-code to instructions or CFG abstractions (e.g. function entry/exit). It can do this statically or at any point during execution and changes are immediate. Recent research in this direction [20] [61] [46] focuses on reducing instrumentation overhead with a variety of low-level optimizations.

6.2 Recompilation

Compiled programs can be recompiled to inject code using several techniques.

Static. Early examples of static lifting for instrumentation include ATOM [52], followed by EEL [32] with finer-grained instrumentation. Etch [48], through observing an initial program execution, discovered dynamic program properties to inform static instrumentation. Other examples include Vulcan [51], which injects code into lifted Win32 binaries.

Dynamic. Both DynamoRIO [18] and Pin [36] use dynamic recompilation of native binaries to implement instrumentation. They differ somewhat on subtle implementation details, how M-code is injected, and performance characteristics, but fundamentally work by recompiling machine code for a given ISA to the same ISA. Their JIT compilers are purpose-built for instrumentation and are basic-block and trace-cache based. They run code in the original process and reorganize binaries, and can be intrusive, particularly if M-code is supplied as low-level native code. We are not aware of strong consistency guarantees (Section 2.4) in the face of dynamically adding and removing instrumentation.

RoadRunner [24] is a dynamic analysis system for Java based on event streams, primarily focused on race detection. It uses a custom classloader to inject calls to instrumentation.
into a program can be done inline (directly inserted into Analyses are formulated in terms of pipes and filters over
code, often requiring binary reorganization), with trampo- event streams, allowing composability. It offers some specific
lines (jumps to out-of-line instrumentation code), or both. inlining optimizations that avoid the overhead of events in
Static. Early tools for Java static bytecode instrumenta- some circumstances. Since the analysis code runs in the same
tion include Soot [59] and Bloat [44]. Later, with the rise of state space and on the same threads, it can both perturb
Aspect-Oriented Programming (AOP), tools emerged to tar- performance and alter concurrency characteristics of highly-
get joinpoints, such as DiSL [38], AspectJ [31] and BISM [50]. multithreaded programs.
Oron [40] reduced the performance overhead of JavaScript ShadowVM [37] builds on JVMTI to provide non-intrusive
source-level instrumentation by targetting AssemblyScript instrumentation with low perturbation by running the moni-
and compiling the instrumented program to Wasm for execu- tor on a separate JVM and asynchronously processing events
tion. For Wasm, tools are now emerging such as the aspect- as they occur. It is primarily suited for program observa-
oriented [47], and Wasabi [33], which injects trampolines tion, as it does not directly support state modifications. On
13
load, an instrumentation process dynamically inserts hooks through bytecode rewriting that trap to native code to asynchronously communicate event notifications to the monitor. According to published material, ShadowVM does not support dynamically inserting and removing hooks during program execution.

6.3 Emulation

QEMU [14] is a widely-used CPU emulator that virtualizes a user-space process while supporting non-intrusive instrumentation. Valgrind [42], primarily used as a memory debugger, is similar. As emulators, both can run a guest ISA on a different host ISA, and both use JIT compilers to make emulation fast. Thus, their JIT compilers are not necessarily "purpose-built" for instrumentation, but for cross-compilation. Avrora [57], a microcontroller emulator and sensor network simulator, provides an API to attach M-code to clock events, instructions, and memory locations.

6.4 Direct engine support

Runtime systems can be designed with specific support for instrumentation. In .NET [12], users build profiler DLLs that are loaded by the CLR into the same process as a target application. The CLR then notifies the profiler of events occurring in the application through a callback interface. The JVM Tool Interface [3] allows Java bytecode instrumentation and also agents to be written against a lower-level internal engine API that supports attaching callbacks to events. Examples of events are method entry and exit, but nothing as fine-grained as reaching individual bytecodes. To assess the performance overhead of handling MethodEntry events, we wrote a Calls monitor using JVMTI in C. When run on the famously indirect-call-heavy Richards benchmark, it imposes 50–100× overhead. In contrast, for the same program compiled to Wasm and running with Wizard's Calls monitor, the overhead was measured to be 2.5–3×.

7 Conclusion and Future Work

In this paper, we showed the first non-intrusive dynamic instrumentation framework for Wasm in a multi-tier Wasm research engine that imposes zero overhead when not in use. Modifications to the interpreter and compiler tiers of Wizard are minimal: just a few hundred lines of code. Novel optimizations reduce instrumentation overhead and perform well for sparse analysis and acceptably well for heavy analysis. Our robust consistency guarantees make our system the first to support composing multiple analyses seamlessly.

While probes offer a complete instrumentation mechanism for code, many analyses instrument other events, such as memory accesses, traps, etc. As we saw with function entry/exit and after-instruction hooks, libraries can implement higher-level hooks using probes; but if directly supported by the engine, these hooks can be implemented more efficiently, e.g. hardware watchpoints for memory accesses.

In this work, we showed monitors written against Wizard's engine APIs in a high-level language. Generic probes use runtime calls to compiled M-code. Massive speedups are possible from intrinsifying certain probes by inlining all or part of their M-code. What if M-code was instead supplied in an IR the JIT could inline? We plan to explore Wasm bytecode as just that IR.

Acknowledgments

This work is supported in part by NSF Grant Award #2148301, as well as funding and support from the WebAssembly Research Center. Thanks to Anthony Rowe and Arjun Ramesh for important discussions and comments on drafts of this work. Thanks to Saúl Cabrera, Erin Ren, and Jeff Charles at Shopify, Ulan Degenbaev and Yan Chen at DFinity, and Chris Woods at Siemens.

References

[1] The edge of the multi-cloud. https://www.fastly.com/cassets/6pk8mg3yh2ee/79dsHLTEfYIMgUwVVllaa4/5e5330572b8f317f72e16696256d8138/WhitePaper-Multi-Cloud.pdf, 2020. (Accessed 2021-07-06).
[2] Wasm3: The fastest WebAssembly interpreter, and the most universal runtime. https://github.com/wasm3/wasm3, 2020. (Accessed 2021-08-11).
[3] Java Virtual Machine Tools Interface. https://docs.oracle.com/javase/8/docs/technotes/guides/jvmti/, 2021. (Accessed 2021-07-29).
[4] JavaScriptCore, the built-in JavaScript engine for WebKit. https://trac.webkit.org/wiki/JavaScriptCore, 2021. (Accessed 2021-07-29).
[5] V8 development site. https://v8.dev, 2021. (Accessed 2021-07-29).
[6] WebAssembly specifications. https://webassembly.github.io/spec/, 2021. (Accessed 2021-07-29).
[7] Intel 64® and IA-32 Architectures Software Developer's Manual Volume 3 (3A, 3B, 3C & 3D): System Programming Guide, chapter 33: Intel Processor Trace. 2023.
[8] Walrus: A WebAssembly transformation library. https://github.com/rustwasm/walrus, 2023.
[9] Wasm bytecode instrumenter. https://github.com/yashanand1910/wasm-bytecode-instrumenter, 2023.
[10] Paul-Antoine Arras, Anastasios Andronidis, Luís Pina, Karolis Mituzas, Qianyi Shu, Daniel Grumberg, and Cristian Cadar. SaBRe: load-time selective binary rewriting. International Journal on Software Tools for Technology Transfer, 24(2):205–223, Apr 2022.
[11] Linux Wiki Authors. Linux perf main page. https://perf.wiki.kernel.org/index.php/Main_Page, 2012. (Accessed 2023-8-4).
[12] .NET Wiki Authors. The .NET Profiling API. https://learn.microsoft.com/en-us/dotnet/framework/unmanaged-api/profiling/profiling-overview, 2021. (Accessed 2023-8-4).
[13] David F. Bacon, Perry Cheng, and David Grove. TuningFork: A platform for visualization and analysis of complex real-time systems. In Companion to the 22nd ACM SIGPLAN Conference on Object-Oriented Programming Systems and Applications Companion, OOPSLA '07, page 854–855, New York, NY, USA, 2007. Association for Computing Machinery.
[14] Fabrice Bellard. QEMU: A generic and open source machine emulator and virtualizer. http://qemu.org, 2020. (Accessed 2023-8-07).
[15] Andrew R. Bernat and Barton P. Miller. Anywhere, any-time binary instrumentation. In Proceedings of the 10th ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools, PASTE '11, page 9–16, New York, NY, USA, 2011. Association for Computing Machinery.
[16] Walter Binder, Jarle Hulaas, and Philippe Moret. Advanced Java bytecode instrumentation. In Proceedings of the 5th International Symposium on Principles and Practice of Programming in Java, PPPJ '07, page 135–144, New York, NY, USA, 2007. Association for Computing Machinery.
[17] Jay Bosamiya, Wen Shih Lim, and Bryan Parno. Provably-safe multilingual software sandboxing using WebAssembly. In Proceedings of the USENIX Security Symposium, August 2022.
[18] D. Bruening, T. Garnett, and S. Amarasinghe. An infrastructure for adaptive dynamic optimization. In International Symposium on Code Generation and Optimization, 2003. CGO 2003. IEEE Comput. Soc, 2003.
[19] Rodrigo Bruno, Duarte Patricio, José Simão, Luis Veiga, and Paulo Ferreira. Runtime object lifetime profiler for latency sensitive big data applications. In Proceedings of the Fourteenth EuroSys Conference 2019, EuroSys '19, New York, NY, USA, 2019. Association for Computing Machinery.
[20] Buddhika Chamith, Bo Joel Svensson, Luke Dalessandro, and Ryan R. Newton. Living on the edge: Rapid-toggling probes with cross-modification on x86. In Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '16, page 16–26, New York, NY, USA, 2016. Association for Computing Machinery.
[21] Greg Cooper. DTrace: Dynamic tracing in Oracle Solaris, Mac OS X, and FreeBSD by Brendan Gregg and Jim Mauro. SIGSOFT Softw. Eng. Notes, 37(1):34, Jan 2012.
[22] Frank Denis. Libsodium, 2021.
[23] Bruno Dufour, Karel Driesen, Laurie Hendren, and Clark Verbrugge. Dynamic metrics for Java. SIGPLAN Not., 38(11):149–168, Oct 2003.
[24] Cormac Flanagan and Stephen N. Freund. The RoadRunner dynamic analysis framework for concurrent programs. In Proceedings of the 9th ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering, PASTE '10, page 1–8, New York, NY, USA, 2010. Association for Computing Machinery.
[25] E. Gamma, R. Helm, R. Johnson, and J. Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley Professional Computing Series. Pearson Education, 1994.
[26] Manuel Geffken and Peter Thiemann. Side effect monitoring for Java using bytecode rewriting. In Proceedings of the 2014 International Conference on Principles and Practices of Programming on the Java Platform: Virtual Machines, Languages, and Tools, PPPJ '14, page 87–98, New York, NY, USA, 2014. Association for Computing Machinery.
[27] Brendan Gregg and Jim Mauro. DTrace: Dynamic Tracing in Oracle Solaris, Mac OS X and FreeBSD. Prentice Hall Press, USA, 1st edition, 2011.
[28] Andreas Haas, Andreas Rossberg, Derek L. Schuff, Ben L. Titzer, Michael Holman, Dan Gohman, Luke Wagner, Alon Zakai, and JF Bastien. Bringing the web up to speed with WebAssembly. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2017, page 185–200, New York, NY, USA, 2017. Association for Computing Machinery.
[29] David Herrera, Hanfeng Chen, Erick Lavoie, and Laurie Hendren. Numerical computing on the web: Benchmarking for the future. In Proceedings of the 14th ACM SIGPLAN International Symposium on Dynamic Languages, DLS 2018, page 88–100, New York, NY, USA, 2018. Association for Computing Machinery.
[30] Saba Jamilan, Tanvir Ahmed Khan, Grant Ayers, Baris Kasikci, and Heiner Litz. APT-GET: Profile-guided timely software prefetching. In Proceedings of the Seventeenth European Conference on Computer Systems, EuroSys '22, page 747–764, New York, NY, USA, 2022. Association for Computing Machinery.
[31] Gregor Kiczales, Erik Hilsdale, Jim Hugunin, Mik Kersten, Jeffrey Palm, and William G. Griswold. An overview of AspectJ. In Proceedings of the 15th European Conference on Object-Oriented Programming, ECOOP '01, page 327–353, Berlin, Heidelberg, 2001. Springer-Verlag.
[32] James R. Larus and Eric Schnarr. EEL: Machine-independent executable editing. In Proceedings of the ACM SIGPLAN 1995 Conference on Programming Language Design and Implementation, PLDI '95, page 291–300, New York, NY, USA, 1995. Association for Computing Machinery.
[33] Daniel Lehmann and Michael Pradel. Wasabi: A framework for dynamically analyzing WebAssembly. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '19, page 1045–1058, New York, NY, USA, 2019. Association for Computing Machinery.
[34] Borui Li, Hongchang Fan, Yi Gao, and Wei Dong. ThingSpire OS: A WebAssembly-based IoT operating system for cloud-edge integration. In Proceedings of the 19th Annual International Conference on Mobile Systems, Applications, and Services, MobiSys '21, page 487–488, New York, NY, USA, 2021. Association for Computing Machinery.
[35] Renju Liu, Luis Garcia, and Mani Srivastava. Aerogel: Lightweight access control framework for WebAssembly-based bare-metal IoT devices. In 2021 IEEE/ACM Symposium on Edge Computing (SEC), pages 94–105, 2021.
[36] Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff Lowney, Steven Wallace, Vijay Janapa Reddi, and Kim Hazelwood. Pin: Building customized program analysis tools with dynamic instrumentation. SIGPLAN Not., 40(6):190–200, June 2005.
[37] Lukáš Marek, Stephen Kell, Yudi Zheng, Lubomír Bulej, Walter Binder, Petr Tůma, Danilo Ansaloni, Aibek Sarimbekov, and Andreas Sewe. ShadowVM: Robust and comprehensive dynamic program analysis for the Java platform. In Proceedings of the 12th International Conference on Generative Programming: Concepts and Experiences, GPCE '13, page 105–114, New York, NY, USA, 2013. Association for Computing Machinery.
[38] Lukáš Marek, Alex Villazón, Yudi Zheng, Danilo Ansaloni, Walter Binder, and Zhengwei Qi. DiSL: A domain-specific language for bytecode instrumentation. In Proceedings of the 11th Annual International Conference on Aspect-Oriented Software Development, AOSD '12, page 239–250, New York, NY, USA, 2012. Association for Computing Machinery.
[39] Barton P. Miller, Mark D. Callaghan, Jonathan M. Cargille, Jeffrey K. Hollingsworth, R. Bruce Irvin, Karen L. Karavanic, Krishna Kunchithapadam, and Tia Newhall. The Paradyn parallel performance measurement tool. Computer, 28(11):37–46, Nov 1995.
[40] Aäron Munsters, Angel Luis Scull Pupo, Jim Bauwens, and Elisa Gonzalez Boix. Oron: Towards a dynamic analysis instrumentation platform for AssemblyScript. In Companion Proceedings of the 5th International Conference on the Art, Science, and Engineering of Programming, Programming '21, page 6–13, New York, NY, USA, 2021. Association for Computing Machinery.
[41] Otoya Nakakaze, István Koren, Florian Brillowski, and Ralf Klamma. Retrofitting industrial machines with WebAssembly on the edge. In Richard Chbeir, Helen Huang, Fabrizio Silvestri, Yannis Manolopoulos, and Yanchun Zhang, editors, Web Information Systems Engineering – WISE 2022, pages 241–256, Cham, 2022. Springer International Publishing.
[42] Nicholas Nethercote and Julian Seward. Valgrind: A framework for heavyweight dynamic binary instrumentation. SIGPLAN Not., 42(6):89–100, Jun 2007.
[43] Manuel Nieke, Lennart Almstedt, and Rüdiger Kapitza. Edgedancer: Secure mobile WebAssembly services on the edge. In Proceedings of the 4th International Workshop on Edge Systems, Analytics and Networking, EdgeSys '21, page 13–18, New York, NY, USA, 2021. Association for Computing Machinery.
[44] Nathaniel John Nystrom. Bytecode-level analysis and optimization of Java classes. Master's thesis, Purdue University, August 1998.
[45] Louis-Noël Pouchet. PolyBench, May 2016.
[46] David Georg Reichelt, Stefan Kühne, and Wilhelm Hasselbring. Towards solving the challenge of minimal overhead monitoring. In Companion of the 2023 ACM/SPEC International Conference on Performance Engineering, ICPE '23 Companion, page 381–388, New York, NY, USA, 2023. Association for Computing Machinery.
[47] João Rodrigues and Jorge Barreiros. Aspect-oriented WebAssembly transformation. In 2022 17th Iberian Conference on Information Systems and Technologies (CISTI), pages 1–6, 2022.
[48] Ted Romer, Geoff Voelker, Dennis Lee, Alec Wolman, Wayne Wong, Hank Levy, Brian Bershad, and Brad Chen. Instrumentation and optimization of Win32/Intel executables using Etch. In Proceedings of the USENIX Windows NT Workshop on The USENIX Windows NT Workshop 1997, NT'97, page 1, USA, 1997. USENIX Association.
[49] Koushik Sen, Swaroop Kalasapur, Tasneem Brutch, and Simon Gibbs. Jalangi: A selective record-replay and dynamic analysis framework for JavaScript. In Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2013, page 488–498, New York, NY, USA, 2013. Association for Computing Machinery.
[50] Chukri Soueidi, Ali Kassem, and Yliès Falcone. BISM: bytecode-level instrumentation for software monitoring. In Runtime Verification: 20th International Conference, RV 2020, Los Angeles, CA, USA, October 6–9, 2020, Proceedings 20, pages 323–335. Springer, 2020.
[51] A. Srivastava, A. Edwards, and H. Vo. Vulcan: Binary transformation in a distributed environment. Technical report, Microsoft Research, 2001.
[52] Amitabh Srivastava and Alan Eustace. ATOM: A system for building customized program analysis tools. New York, NY, USA, 1994. Association for Computing Machinery.
[53] Ben L. Titzer. Harmonizing classes, functions, tuples, and type parameters in Virgil III. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '13, page 85–94, New York, NY, USA, 2013. Association for Computing Machinery.
[54] Ben L. Titzer. Wizard, an advanced WebAssembly engine for research. https://github.com/titzer/wizard-engine, 2021. (Accessed 2021-07-29).
[55] Ben L. Titzer. A fast in-place interpreter for WebAssembly. Proc. ACM Program. Lang., 6(OOPSLA2), October 2022.
[56] Ben L. Titzer. Whose baseline compiler is it anyway? CGO '24, New York, NY, USA, 2024. Association for Computing Machinery.
[57] Ben L. Titzer, Daniel K. Lee, and Jens Palsberg. Avrora: Scalable sensor network simulation with precise timing. In Proceedings of the 4th International Symposium on Information Processing in Sensor Networks, IPSN '05, page 67–es. IEEE Press, 2005.
[58] Ben L. Titzer and Jens Palsberg. Nonintrusive precision instrumentation of microcontroller software. In Proceedings of the 2005 ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems, LCTES '05, page 59–68, New York, NY, USA, 2005. Association for Computing Machinery.
[59] Raja Vallée-Rai, Phong Co, Etienne Gagnon, Laurie Hendren, Patrick Lam, and Vijay Sundaresan. Soot: A Java bytecode optimization framework. In CASCON First Decade High Impact Papers, CASCON '10, page 214–224, USA, 2010. IBM Corp.
[60] Kenton Varda. WebAssembly on Cloudflare Workers. https://blog.cloudflare.com/webassembly-on-cloudflare-workers/. (Accessed 2021-07-06).
[61] Mingzhe Wang, Jie Liang, Chijin Zhou, Zhiyong Wu, Xinyi Xu, and Yu Jiang. Odin: On-demand instrumentation with on-the-fly recompilation. In Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation, PLDI 2022, page 1010–1024, New York, NY, USA, 2022. Association for Computing Machinery.
[62] Conrad Watt. Mechanising and verifying the WebAssembly specification. In Proceedings of the 7th ACM SIGPLAN International Conference on Certified Programs and Proofs, CPP 2018, page 53–65, New York, NY, USA, 2018. Association for Computing Machinery.
[63] Matthias Wenzl, Georg Merzdovnik, Johanna Ullrich, and Edgar Weippl. From hack to elaborate technique – a survey on binary rewriting. ACM Comput. Surv., 52(3), Jun 2019.
[64] Zhiqiang Zuo, Kai Ji, Yifei Wang, Wei Tao, Linzhang Wang, Xuandong Li, and Guoqing Harry Xu. JPortal: Precise and efficient control-flow tracing for JVM programs with Intel Processor Trace. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, PLDI 2021, page 1080–1094, New York, NY, USA, 2021. Association for Computing Machinery.

A Artifact Appendix

A.1 Abstract

This artifact description contains information on how to reproduce all results in this paper. We describe system requirements, how to set up an environment and run our scripts that produce the data and exact figures in the paper, as well as how to modify the artifact to run your own custom experiments. Our package contains all scripts, benchmarks, monitors, and engines used. We have also provided all of our results in the package so others can make direct data comparisons.

A.2 Artifact Meta-Information

• Benchmarks: The following benchmarking suites are used in our experiments:
  – PolyBench/C [45] with the medium dataset, version 4.2.
  – Ostrich [29], version 1.0.0.
  – Libsodium [22]. There are three variations of the Libsodium benchmark:
    ∗ libsodium, the base libsodium suite, version 0.7.13.
    ∗ libsodium-2021, a variation pulled from the 2021-Q1 directory at https://github.com/jedisct1/webassembly-benchmarks.
    ∗ libsodium-no-bulk-mem, a variation of base libsodium above without bulk memory operations.
  All of these benchmark suites have been included in the suites directory of the artifact.
• Compilation: Since we provide the benchmarks compiled to Wasm, a Wasm compiler is unnecessary. However, a Rust compiler for the wasm-bytecode-instrumenter and Wasabi tools is necessary. See the next section for the required version of rustc.
• Transformations: For the bytecode rewriting experiment, we use Walrus (version 0.20.1), a Rust library for Wasm transformations, in our wasm-bytecode-rewriter to inject our Wasm instrumentation. This crate's repo is publicly available at https://github.com/rustwasm/walrus. Wasabi also does Wasm transformations to inject calls; follow the instructions in the README.md file at the root directory of the artifact for how to set up this tool.
• Binaries: We used Wizard [54] version 23a.1617 for our experimentation. Since this version did not support enabling/disabling certain features via command line flags, we directly manipulated flags in the source code, compiled each required Wizard configuration for our experiments, and made them available in the bin folder. The following describes each binary's configuration:
  – base/wizeng.x86-64-linux. This is the base compilation of Wizard with no flags modified.
  – local-global/wizeng.x86-64-linux. In order to use a monitor in Wizard, its code must be present in the binary. We extended the base Wizard to contain both the local and global implementations of the hotness and branch monitors in this binary.
  – fast-count/wizeng.x86-64-linux. This binary was compiled with the intrinsifyCountProbe and intrinsifyOperandProbe flags enabled in the src/engine/Tuning.v3 file.
  – empty-probes/wizeng.x86-64-linux. To reiterate, in order to use a monitor in Wizard, its code must be present in the binary. We extended the base Wizard to contain variations of the hotness and branch monitors with no M-code in their inserted probes.
  – empty-probes-fast-count/wizeng.x86-64-linux. This binary is a combination of the above fast-count and empty-probes configurations.
  We have also included the binary btime in the bin folder, which calculates various types of timing characteristics of a program's execution.
• Run-time environment: This project must be run on an x86_64 Linux machine. It is not necessary to have sudo access, as software can be installed/symlinked in a user's home directory. However, having sudo access would substantially simplify the installation process.
• Metrics: We report relative execution time and absolute overhead for Wizard's interpreter, Wizard's JIT (with and without intrinsification), DynamoRIO, Wasabi, and bytecode rewriting. Given instrumented execution time 𝑇𝑖 and uninstrumented execution time 𝑇𝑢, we define absolute overhead as the quantity 𝑇𝑖 − 𝑇𝑢 and relative execution time as the ratio 𝑇𝑖/𝑇𝑢.
• Output: There are two different outputs for our scripts. The experiment*.bash scripts output CSV files. The plot*.py scripts output graphs as PDF and SVG files. We have provided our own results inside the csv and figures folders for comparison.
• Experiments: Follow the instructions in the README.md file at the base directory of the artifact for how to run experiments.
• How much disk space required (approximately)?: 10 GB.
• How much time is needed to prepare workflow (approximately)?: We expect experiment preparation to take around 30 minutes (if all builds/installs work well).
• How much time is needed to complete experiments (approximately)?: To run 5 iterations per experiment, you should expect the runtime to take about 7 days when running on an Ubuntu 20.04.5 machine with 19 GiB of RAM and an Intel® Core™ i7-4790 processor running at 3.60 GHz. This is primarily due to Wasabi being significantly slower with instrumentation.
• Publicly available?: Yes, this artifact is available at the following URL: https://zenodo.org/doi/10.5281/zenodo.10795556
• Code license: Licensed under the Apache License, Version 2.0 (https://www.apache.org/licenses/LICENSE-2.0).

A.3 Description

A.3.1 How to access. This artifact can be accessed at: https://zenodo.org/doi/10.5281/zenodo.10795556

A.3.2 Software dependencies. This artifact has the following software dependencies:

• V8 [5], commit hash f200321. V8 can be downloaded from https://github.com/v8/v8. To run experiment scripts, this binary must be available on the PATH to be called with the d8 command.
• Wasabi [33], commit hash fe12347. Wasabi can be downloaded from https://github.com/danleh/wasabi. To run experiment scripts, this binary must be available on the PATH to be called with the wasabi command.
• DynamoRIO [18], commit hash fc4c25f. DynamoRIO can be downloaded from https://github.com/DynamoRIO/dynamorio. To run experiment scripts, this binary must be available on the PATH to be called with the drrun command.
• Python, version 3.8.10. To run experiment scripts, both the python3 and python (symlinked to python3) commands must be available on the PATH.
• Python's venv package, python3.8-venv for Debian/Ubuntu systems.
• wasm-bytecode-instrumenter, commit hash 3ea2003. The bytecode-instrumenter can be downloaded from the repo. To run experiment scripts, this binary must be available on the PATH to be called with the following command: wasm-bytecode-instrumenter
• rustc, version 1.71.0.

A.4 Installation and Testing

A.4.1 Installation. To install all required dependencies, follow the detailed instructions in the README.md file in the base directory of the artifact.

A.4.2 Basic Test. To verify that an environment is correctly configured to run all scripts provided in this artifact, edit the SUITES variable in the common.bash file to only contain the polybench suite, then run the following command: RUNS=2 ./experiment-all-suites.sh. We expect this initial test to run in about 1 day (as opposed to the 7 days for all experiments as mentioned above). When running experiments for polybench, a successful run should result in the CSV directory containing subfolders with CSV output files for each suite script. Refer to the README for instructions on how to run individual experiments.

A.5 Experiment workflow

The workflow of our experiments has two phases: collecting runtime data (saved to CSV files) and generating the corresponding figures (saved to PDF and SVG files). It is important to remember to save off any data/figures by copying the csv/figures folders to alternate locations prior to running scripts. If this is not done, the contents will be overwritten. To collect the runtime data, a user can run any of the experiment*.bash scripts (experiment-all-suites.bash
to run all experiments). Logging information will be output to stdout. Before plotting data, all experiments should be successfully run (as some figures require data across multiple experiments). To generate figures, run the plot-figure*.py scripts.

A.6 Evaluation and expected results

If you run these experiments, you will find the generated CSV and figure files in their respective folders. To verify our own results, a side-by-side comparison can be done with the figures in our paper.

A.7 Experiment customization

To add your own suite:

1. Compile your suite to Wasm. The binary can only contain Wasm features supported by the Wasabi and Walrus tools, which tend to be aligned with the core specification.
2. Make your new suite available in the suites folder following the conventions shown by the other available suites.
3. Update the SUITES variable in common.bash to contain your new suite.
4. Update the suites variable in plot.py to contain your new suite.

Helpful variables in common.bash:

1. RUNS: configures the number of runs used to collect average execution times.
2. SUITES: configures the suites that will run during experimentation.

A.8 JVMTI Experiment Artifact

A.8.1 Abstract. We also did a brief experiment, discussed in Related Work (Section 6), to assess the performance overhead imposed by JVMTI's [3] handling of MethodEntry events. To keep our core evaluations separate from this experiment, we have placed this artifact discussion below. It can be found in the directory jvmti at the base of the artifact located at https://zenodo.org/doi/10.5281/zenodo.10795556.

A.8.2 Meta-Information.

• Benchmarks: We leveraged the Richards benchmark for experimenting with JVMTI. An equivalent Richards benchmark for Java and Wasm has been provided as part of this artifact.
• Compilation: To compile the CallsMonitor, we require gcc
• Metrics: Our scripts report relative execution time and absolute overhead for JVMTI and Wizard. To ignore the base startup time required by the engines, we measure instrumented and uninstrumented runs of the Richards benchmark with 0 loops (𝑇𝑏𝑖 and 𝑇𝑏𝑢 below). Given:
  – instrumented execution time 𝑇𝑖
  – instrumented base execution time 𝑇𝑏𝑖
  – uninstrumented execution time 𝑇𝑢
  – uninstrumented base execution time 𝑇𝑏𝑢
  we define absolute overhead as the quantity (𝑇𝑖 − 𝑇𝑏𝑖) − (𝑇𝑢 − 𝑇𝑏𝑢) and relative execution time as the ratio (𝑇𝑖 − 𝑇𝑏𝑖)/(𝑇𝑢 − 𝑇𝑏𝑢).
• Output: We log all of our output to stdout, which should be redirected to a file for inspection. To view the summary of each Richards benchmark iteration, grep the file for the term SUMMARY. The specific iteration of the Richards benchmark is shown in the prefix of each line, e.g. [wasm-9-SUMMARY] means that this line is part of the summary of the Wasm execution of the Richards benchmark with 9 iterations. Each iteration is summarized by outputting all execution times for instrumented and uninstrumented variants of the Java and Wasm executions, then reporting the absolute overhead and relative execution time. To view the absolute overhead, grep the file for "On average, runtime with monitor took". To view the relative execution time for each Richards benchmark iteration, grep the file for the term Factor. We have included our own results inside the runs_richards directory for reference.
• Experiments: The artifact scripts run instrumented and uninstrumented variations of the Richards benchmark at 9, 99, 999, 9999, and 99999 loops. Each of these variations is run 10x to collect execution time averages across all runs.
• How much disk space required (approximately)?: Requires about 2 MB for the jvmti directory and the Wizard engine binary.
• How much time is needed to prepare workflow (approximately)?: Shouldn't take longer than 30 minutes since there are few dependencies.
• How much time is needed to complete experiments (approximately)?: The experiment takes about 3 hours when running on an Ubuntu 20.04.6 machine with 394 GB of RAM and an Intel® Xeon® Platinum 8168 processor running at 2.70 GHz.
• Publicly available?: Yes, this artifact is available at the following URL: https://zenodo.org/doi/10.5281/zenodo.10795556
• Code license: Licensed under the Apache License, Version 2.0 (https://www.apache.org/licenses/LICENSE-2.0).

A.8.3 Software dependencies. Running the JVMTI ex-
version 9.4.0 to be installed. To compile and run the Richards periment requires following software dependencies:
Java benchmark, we require Java version 1.8 to be installed.
• Binaries: We used the base Wizard binary in our experiment, • Java, version 1.8
version 23a.1617. This binary has been provided as part of • gcc, version 9.4.0
the artifact at the location bin/base/wizeng.x86-64-linux.
• Run-time environment: We ran this experiment on an x86_64
A.9 Installation and Testing
machine running Ubuntu 20.04.1. It is not necessary to have
sudo access as software can be installed/symlinked in a user’s A.9.1 Installation. To install all required dependencies
home directory. However, having sudo access would substan- to run the scripts, follow the detailed instructions in the
tially simplify the installation process. jvmti/README.md file.
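For clarity, the two overhead metrics defined above can be computed as in the following Python sketch; the function names and timing values are hypothetical placeholders, not part of the artifact or its measured results.

```python
# Sketch of the overhead metrics defined above.
def absolute_overhead(t_i, t_bi, t_u, t_bu):
    """(T_i - T_bi) - (T_u - T_bu): instrumented minus uninstrumented
    execution time, each corrected by its 0-loop base run."""
    return (t_i - t_bi) - (t_u - t_bu)

def relative_execution_time(t_i, t_bi, t_u, t_bu):
    """(T_i - T_bi) / (T_u - T_bu): slowdown factor of the
    instrumented run over the uninstrumented run."""
    return (t_i - t_bi) / (t_u - t_bu)

# Hypothetical timings in seconds.
t_i, t_bi = 12.0, 2.0   # instrumented run and its 0-loop base
t_u, t_bu = 6.0, 1.0    # uninstrumented run and its 0-loop base

print(absolute_overhead(t_i, t_bi, t_u, t_bu))        # 5.0
print(relative_execution_time(t_i, t_bi, t_u, t_bu))  # 2.0
```

Subtracting the 0-loop base run from each measurement removes engine startup cost, so both metrics reflect only the per-loop work.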
A.9.2 Basic Test. The jvmti/README.md file also describes how
to run a basic test to verify your environment setup.

A.10 Experiment workflow
The execution workflow is straightforward and outputs all
logging information to stdout. The run_richards.sh script
iterates over the different loop counts to execute (configured
with the BENCHMARKS variable). For each of these loop counts,
it calculates the average absolute overhead and average
relative execution time over 10 runs (configured with the
NUM_RUNS variable) when running the Richards benchmark for
both Java, on the JVM, and Wasm, on Wizard. If there are
issues during execution, errors or warnings will be output.

A.11 Evaluation and expected results
Evaluating the results can be done by grep-ing for the various
types of information described under "Output" in the
Meta-Information section above. The results should be similar
to what our own execution found, located in
jvmti/runs_richards. It is possible that the absolute overhead
varies due to differences in the underlying system; however,
the relative execution time should be similar.

A.12 Experiment customization
As described above, it is possible to customize the execution
by manipulating the following variables:
• NUM_RUNS: Vary the number of times each experiment is run to
  get an average across execution times.
• BENCHMARKS: Configure the loop counts to run on the Richards
  benchmark.

A.13 Methodology
Submission, reviewing and badging methodology:
• https://www.acm.org/publications/policies/artifact-review-and-badging-current
• http://cTuning.org/ae/submission-20201122.html
• http://cTuning.org/ae/reviewing-20201122.html
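The grep-based inspection described above can be sketched as follows; the log file name and the sample line contents are hypothetical stand-ins for real experiment output, while the grep terms themselves (SUMMARY, the overhead message, and Factor) are the ones described earlier.

```shell
# Write a few hypothetical log lines in the prefix format described above.
printf '%s\n' \
  '[wasm-9-SUMMARY] On average, runtime with monitor took 0.41 ms longer' \
  '[wasm-9-SUMMARY] Factor: 1.02' \
  '[jvm-9-SUMMARY] Factor: 1.35' > richards.log

# Per-iteration summaries, absolute overhead, and relative execution time.
grep 'SUMMARY' richards.log
grep 'On average, runtime with monitor took' richards.log
grep 'Factor' richards.log
```

Redirecting the experiment's stdout to a single file first makes all three queries work on the same run.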