
Tango: Extracting Higher-Order Feedback through State Inference

Ahmad Hazimeh∗ (EPFL; BugScale)    Duo Xu (EPFL)    Qiang Liu† (EPFL)
Yan Wang (Huawei)    Mathias Payer (EPFL)
ABSTRACT
Fuzzing is the de facto standard for automated testing. However, while coverage-guided fuzzing excels at code discovery, its effectiveness falters when applied to complex systems. One such class entails persistent targets whose behavior depends on the state of the system, where code coverage alone is insufficient for comprehensive testing. It is difficult for a fuzzer to optimize for state discovery when the feedback does not correlate with the objective.

We introduce Tango, an extensible framework for state-aware fuzzing. Our design incorporates “state” as a first-class citizen in all operations, enabling Tango to fuzz complex targets that otherwise remain out-of-scope. We present state inference, a cross-validation technique that distills portable coverage metrics to reveal hidden path dependencies in the target. This in turn allows us to aggregate feedback from different paths while maintaining state-specific operation. We leverage Tango to fuzz stateful targets covering network servers, language parsers, and video games, demonstrating the flexibility of our framework in exploring complex systems. Using state inference, we shrink the scheduling queue of a fuzzer by around seven times by identifying functionally equivalent paths. We extend current state-of-the-art fuzzers, i.e., AFL++ and Nyx-Net, with state feedback from Tango. During our evaluation, fuzzers using our technique uncovered two new bugs in yajl and dcmtk.

CCS CONCEPTS
• Security and privacy → Software and application security.

KEYWORDS
Network Protocol Fuzzing, State-aware Fuzzing, State Inference

ACM Reference Format:
Ahmad Hazimeh, Duo Xu, Qiang Liu, Yan Wang, and Mathias Payer. 2024. Tango: Extracting Higher-Order Feedback through State Inference. In The 27th International Symposium on Research in Attacks, Intrusions and Defenses (RAID 2024), September 30–October 02, 2024, Padua, Italy. ACM, New York, NY, USA, 16 pages. https://doi.org/10.1145/3678890.3678908

∗ Work done prior to the author joining BugScale.
† Corresponding author.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
RAID 2024, September 30–October 02, 2024, Padua, Italy
© 2024 Copyright held by the owner/author(s).
ACM ISBN 979-8-4007-0959-3/24/09
https://doi.org/10.1145/3678890.3678908

Figure 1: Exploring a maze is an example of a stateful process. (a) With labeled cells, a fuzzer can systematically explore the maze. (b) Without labels, a fuzzer can identify cells by their surroundings.

1 INTRODUCTION
Fuzzing stateful systems requires special consideration. Coverage-guided fuzzing excels at finding bugs as long as feedback on covered code is strongly tied to explored functionality in the target. While this intuition holds for simple programs, the effects of consuming an input usually persist beyond the lifetime of that input in a stateful program. An observed behavior of such a stateful system is only reproducible in the context of a specific state.

We motivate stateful fuzzing with the example in Fig. 1a. If a fuzzer had access to the last reached cell as feedback, it could leverage that knowledge to prioritize paths that uncover new cells, as is the case of coverage-guided fuzzers. A stateless fuzzer would mutate a path—from the set of interesting paths it had already found—and execute it in one shot. The mutated path may or may not introduce moves along the way that would render the rest of the path uninteresting (e.g., walking into a wall). In contrast, a stateful fuzzer would select an interesting path as a prefix, follow it, then generate a move in some direction. The key difference between the two fuzzers is that the latter explores the maze incrementally, whereas the former performs a random walk, implying that the stateful fuzzer is more likely to solve the maze earlier.

Previous work reveals the benefit of state labels on fuzzer performance. IJON [3] is an annotation framework that allows fuzzers to incorporate complex state into their feedback loop. It was used to fuzz Mario Bros. in a process not too different from exploring a maze: knowledge of the last reached location guides the fuzzer towards unexplored regions. Manually annotating a target requires

effort and is often skipped in favor of readily available feedback like code coverage. While alternative techniques attempt to extract state variables from certain types of targets, with varying success [5, 20, 25], code coverage remains the preferred mode of instrumentation.

However, in the absence of labels, a fuzzer may be misguided. Consider the example of an unlabeled maze in Fig. 1b. In a running session, we assume the fuzzer can request one move at a time, and its feedback is restricted to “whether or not the player moved”. Without knowing the label of the current cell, the fuzzer attributes the feedback only to the last generated move. For instance, if the fuzzer arrives at cell 28 through an upward move from 27, it would consider «upward» as an interesting direction and may select it more often. Yet, whether or not a player can move depends not only on the attempted move but also on its surroundings. In another iteration, if the fuzzer starts from cell 5, moving upward would yield no interesting results. Unaware of its surroundings, the fuzzer quickly exhausts the set of interesting behaviors it can observe, resulting in a random walk in exploration.

Nonetheless, we observe that such boolean feedback remains useful to model the player’s local surroundings. Having found a few initial paths, the fuzzer can extract characteristics of the cells it has arrived at by trying out, at each path, all the different interesting moves it has discovered so far. A complete exploration would yield the classification represented in Fig. 1b: cells are annotated by the possible set of paths that can be followed through one move from each cell. All paths known to the fuzzer then fall into one of 14 categories based on their surroundings. In essence, the fuzzer measures the response pattern of each cell to a set of known inputs. This allows it to group its known paths by their common characteristics, e.g., paths that lead to a cell in a vertical corridor. Increasing the number of steps yields a more accurate classification such that, in the limit, each cell maps uniquely to its label. Through this process, the fuzzer extrapolates multi-dimensional feedback from a uni-dimensional metric to guide its exploration.

On the other hand, stateful fuzzers [15, 24, 26] introduce a key feature for exploring complex systems: resumability. It entails the ability to restore the target to a certain state and use that state as a starting point for further fuzzing. They achieve that through restore points, referred to as snapshots, which span different granularities, from whole-system VM snapshots, through process restore points, to record-and-replay techniques. Essentially, at each snapshot, the target occupies an implicit state as a result of the path traversed by the input. Perfect resumability ensures the reproducibility of behaviors in their respective states, and allows the fuzzer to maintain its progress while exploring different paths.

In this paper, we propose state inference to address the challenges arising from fuzzing stateful systems. State inference is a technique to produce groupings of snapshots that occupy the same implicit system state, based on similarities between input-response pairs. The key idea is to cross-test snapshots against inputs and observe their behavior to determine a mapping between snapshots and states. It does not require any prior knowledge of specification or global variables. This novel technique offers a more hands-off and effective approach to stateful fuzzing, paving the way for improved security testing of complex systems. In practice, seed queues are often biased to a subset of the functional groups discovered by the fuzzer. With the knowledge that different snapshots share the same state, the fuzzer can better model the relations between snapshots and distribute energies more equally among the inferred states. As a result, the fuzzer can schedule from the state queue first, to ensure an even exploration, and cycle faster through discovered behavior.

We implemented state inference on top of Tango, our versatile framework for stateful fuzzing. Additionally, we extended current state-of-the-art fuzzers, i.e., AFL++ and Nyx-Net, with state feedback from Tango. Our evaluation indicates that state inference significantly improves seed scheduling by effectively mapping snapshots to states, achieving an average reduction of 86.02% to 87.76% in the size of a fuzzer’s scheduling queue when fuzzing network servers and parsers. Much like solving a maze, fuzzing DOOM (see Appendix E) also benefits from the knowledge of stateful information, e.g., the player and enemy’s positions, the remaining amount of ammo, and health points. By incorporating stateful feedback, Tango solves the E1M1 level in DOOM within 30 minutes and subsequently replays it in 3 minutes. We summarize our key contributions as follows:
• Introduction of state inference, which groups snapshots that belong to the same state based on their responses to inputs.
• Design of Tango, a modular framework for fuzzing stateful systems in a state-aware manner.
• Implementation of state-aware scheduler extensions, for AFL++ and Nyx-Net, that leverage inferred states to reduce wait times and disperse feedback, demonstrating Tango’s effectiveness in analyzing complex systems.
• Open-source access to our framework and results to foster adoption and provide value to the community. Tango is available at https://github.com/HexHive/tango.

2 CHALLENGES TO STATEFUL SCHEDULING
The path explosion in fuzzing quickly degrades the performance of a fuzzer. The performance becomes even worse when we are fuzzing a stateful target due to the dependence on the system’s state. The key to tackling path explosion in stateful fuzzing lies in more efficient scheduling, which faces three core challenges.

Figure 2: A simplified FTP server state diagram.

Feedback Attribution: The fuzzer gradually develops a model of the target through collected feedback, incorporating it into its input generation for further exploration. Consider the example of an FTP server in Fig. 2. If the fuzzer initially sends a correct sequence of USER-PASS commands, it ends up in the AUTHED state, where control and data transfer commands are accepted. Now in the AUTHED state, the fuzzer sends a control command,

e.g. PWD, and receives positive feedback reinforcing the use of PWD in future iterations. However, the PWD command is only valid in an authenticated session context. Starting the session with any command other than USER yields a completely different behavior. Thus, without accounting for state, the fuzzer’s model of the target receives conflicting or misleading feedback.

Exploration: The discovery of an interesting input can expose many new paths to the fuzzer, since that input may set up the context required to traverse those paths. Following the FTP example, if the fuzzer saves the USER-USER-PASS-PWD sequence as an interesting input, it may apply further mutations to it, leading to an input like jUnK-PASS-PWD. Unaware of the state set up by the USER command, the fuzzer mutates and destroys that part of the input, trapping itself in an error path. To avoid this loss of progress, the fuzzer should treat previous inputs as part of the state setup.

Seen differently, the history of interactions should be considered as one way to restore the current state in the target, e.g., through record-and-replay. Fuzzing is then performed incrementally along one path of exploration. Note that, while generating invalid inputs is equally important in fuzzing, this methodology does not rule out the possibility of doing so. To test for an invalid command, it suffices to start from an empty prefix path, or alternatively, only mutate past a selected prefix. Both ensure that the state set up by the prefix is not destroyed.

Soundness: If an input triggers a crash, it may not be sufficient to generate a reproducer from that input alone, since the target may have crashed due to previously accumulated state. To guarantee the soundness (reproducibility) of crashes, the fuzzer must produce an input to build up state, which requires the fuzzer to track the millions of consumed inputs within the lifetime of the target.

These core challenges can be addressed by treating “state” as a first-class citizen and anchoring the fuzzer’s operations around it. Introducing this new dimension to fuzzing requires careful consideration and handling in the form of snapshot management, state modeling, and state-aware behavior.

2.1 Snapshot Management
For stateful fuzzing, it suffices that the target persists between successive fuzzing iterations. This allows the target to build up state through processing the consumed inputs. Nevertheless, to tackle the aforementioned challenges with feedback attribution, exploration, and soundness, stateful fuzzers typically snapshot their progress incrementally. Through these snapshots, the fuzzer can then save and restore a previously discovered path.

When fuzzing stateful targets, it is common to maintain a tree of snapshots, where each node has successors representing other snapshots. The tree is constructed by appending new snapshots as children of the current node being fuzzed whenever new feedback, e.g., control-flow edges, is discovered. Stateful fuzzers built on this idea, such as Nyx-Net [27], AFLNet [24], NSFuzz [25], and SGFuzz [5], manage their snapshot tree differently. Nyx-Net maintains an implicit tree by backtracking input sequences and injecting snapshot commands along the path. Its snapshot tree is then encoded in the input corpus itself. In contrast, fuzzers like SGFuzz, AFLNet, and NSFuzz attempt to approximate the protocol state graph by explicitly constructing a tree based on observed changes in state labels. Nodes then represent unique values of the state variable or response code, and within each node, the system maintains a set of inputs that allow the fuzzer to restore the target to a snapshot of that state. The approximate state graph is then obtained by merging tree nodes sharing the same state labels.

2.2 State Modeling
State is a semantic identifier of the system’s dynamic nature. Any event or interaction with the system may update its state and modify its behavior. To measure state, three approaches exist. First, some implementations observe global variables as explicit state identifiers. However, this does not constitute a generic abstraction over states, as there are often other contributing factors. Second, since the state of a system boils down to variable memory contents which influence its responses, such as the contents of the call stack, the heap, or function-local variables, there are attempts at isolating and capturing those variables as state feedback [5, 20, 25]. However, their approaches either under-approximate or over-approximate the state. Third, it is often easier for a developer who is familiar with the target to specify their own definition of state that fits the testing goals they are trying to achieve.

Recovering the behavioral model of a system is not straightforward. To recover states and the relations between them, AFLNet requires patching the target to augment its responses such that state identifiers are explicitly indicated through the response codes. It also requires that the fuzzer is aware of how those identifiers can be extracted from the received responses. Alternatively, SGFuzz and NSFuzz instrument global variables as state indicators, but they often require manual effort to filter out noise by adding irrelevant variables to an elaborate ignore list. They also fundamentally assume that state can be consolidated to global variables, when in fact, it can span any mechanism for managing persistent memory contents, such as the call stack, function-local variables, or heap objects. On the other end of the spectrum, Nyx-Net foregoes state identification and relies solely on its high reset rates to explore more of the input space, following the blackbox fuzzing school of thought that favors execution speed over introspection.

Precise state recovery is orthogonal to fuzzing, since the implementation may introduce implicit states or diverge from the prescribed protocol. Although such divergence is interesting for fault detection, it is only discernible where the specification is available, or when a baseline is used for comparison (as is the case of differential testing [18]).

Figure 3: A snapshot tree constructed for an FTP server.
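To make the tree construction in Section 2.1 concrete, the sketch below (illustrative only; the class and method names are ours, not Tango’s) appends a child snapshot only when an input elicits new feedback, mirroring the tree in Fig. 3:

```python
from dataclasses import dataclass, field

@dataclass
class Snapshot:
    """A restore point, reachable by replaying `prefix` from the initial state."""
    prefix: tuple[bytes, ...]                       # inputs consumed so far
    label: str | None = None                        # e.g., WAIT_PASS, AUTHED
    children: dict[bytes, "Snapshot"] = field(default_factory=dict)

class SnapshotTree:
    def __init__(self) -> None:
        self.root = Snapshot(prefix=())

    def extend(self, parent: Snapshot, inp: bytes, new_feedback: bool) -> Snapshot:
        """Append a child of `parent` only if `inp` triggered new feedback
        (e.g., new control-flow edges); otherwise keep fuzzing `parent`."""
        if not new_feedback:
            return parent
        child = Snapshot(prefix=parent.prefix + (inp,))
        parent.children[inp] = child
        return child
```

Restoring a snapshot then amounts to replaying its prefix, or loading a heavier-weight restore point such as a process or whole-system VM snapshot.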

The Over-shadowed Seed: Even though several approaches have been proposed to model states, it is still hard for existing fuzzers to find the over-shadowed seed. Consider again the example of the FTP server in Fig. 3. When the fuzzer first sends the USER command, it takes a snapshot 𝐿1 (in the WAIT_PASS state), since it observes new feedback relating to the discovery of a new command. The fuzzer then follows with USER again, taking a new snapshot 𝐿2 due to executing the error handler in WAIT_PASS, and it remains in that state. Sending the PASS command then results in 𝐿3 (in the AUTHED state). The fuzzer has now discovered a path to the AUTHED state as USER-USER-PASS. Notably, the fuzzer discovered the WAIT_PASS state “twice”, but the AUTHED state only once. Depending on the type of feedback collected by the fuzzer, however, it may also be incapable of discovering the USER-PASS sequence from 𝐿1, since the feedback for discovering the PASS command had already been attributed to 𝐿3, and it is no longer considered an interesting signal for taking a snapshot. In effect, the discovery of the USER-PASS sequence was over-shadowed by USER-USER-PASS due to overlapping feedback. This phenomenon stacks up the snapshots and dilutes the scheduling pool.

Observation: Yet, despite 𝐿1 and 𝐿2 traversing different paths in the target, we know they represent the same state: the target expects a PASS in either case to transition to the next state. If we assume 𝐿3 was never created, and we send PASS at 𝐿1, we would discover AUTHED again and would create a snapshot 𝐿3′. Conveniently, through this approach, the fuzzer discovers a shortcut to AUTHED requiring the minimal number of commands to reach it.

This observation is not limited to authentication routines but generalizes to any stateful system whose behavior is primarily influenced by its inputs. By probing the system at a state with different inputs and measuring its responses, we can develop a model of the state it is in. If two snapshots share the same responses across all tested inputs, we can then consider them as belonging to the same state, as far as the fuzzer is concerned.

2.3 State-aware Operation
The behavior of a stateful target changes depending on the state of the system, implying that inputs consumed by the target must be generated by accounting for the current state. Beyond using snapshots as a prefix for incremental exploration, none of the mentioned state-of-the-art fuzzers incorporates state into its operations, such as mutator schedules or state-specific dictionaries. Incorporating feedback in a state-specific manner helps develop a more accurate model of the target in each state, better guiding the fuzzer for exploring further within each state. Granted, feedback can become overwhelming for the fuzzer [9, 11, 30]. State-dependent feedback can be even more challenging to handle, as the target’s behavior evolves dynamically, necessitating that feedback be incorporated dynamically as well. Nonetheless, to make the most of the collected feedback, a state-aware fuzzer should treat state as a distinct dimension for its operations across all phases of the fuzzing process.

3 STATE INFERENCE
State inference models the behavior of the target through its responses to a set of inputs. Specifically, we leverage the snapshot grouping to model the target’s states. At each snapshot, the target occupies a unique hidden state that drives its behavior and influences its response to inputs. For a stateful system, we posit that two snapshots of the system occupy the same state—as far as the fuzzer is concerned—if both snapshots share the same response pattern across all tested inputs, given a sufficient number of samples. We use this insight to develop a systematic approach for evaluating snapshots, grouping them by their observable response patterns. This grouping thus yields a hierarchy of states and snapshots that enables fairer and more targeted seed scheduling.

Definition 3.1. Feature and Feature Map: A feature is a measurable quantity that is influenced by state, e.g., edge hit count. A feature map is a mapping from a feature identifier, e.g., edge label, to its measured value.

Definition 3.2. Response and Response Pattern: A response is a value instance of the feature map obtained by executing the target with a given input. For example, with the edge coverage map as the feature map, if three new edges are covered due to the given input, three is the response. Importantly, in our implementation, we also consider the edge’s hit count as part of the response. A response pattern is a set of features triggered while executing an input at a snapshot. For example, the three edges newly triggered with the given input form a response pattern.

State inference has three steps. First, this grouping requires additional executions to probe the target along all the interesting paths discovered by the fuzzer. These executions are named cross-pollination (Section 3.1). Fortunately, however, the benefits can outweigh the overhead costs due to the nonlinearity of exercising new coverage: while fuzzing iterations grow in linear time, new features are only discovered in exponential time [6]. This means that, as the fuzzing campaign progresses, the cost of state inference is amortized over the time spent by the fuzzer between successive coverage findings. Meanwhile, the fuzzer can leverage the learned state model and the relations between snapshots for better input and mutator scheduling. Second, the grouping after cross-pollination is based on the capability matrix that stores the information obtained by cross-pollination. We come up with three operators, i.e., subsumption, elimination, and colorful collapse, to group snapshots into states (Section 3.2). Third, having obtained the static capability matrix, we have to update it during fuzzing so that the fuzzer continuously benefits from the refined feedback (Section 3.3).

3.1 Cross-Pollination
Cross-pollination seeks hidden capabilities by re-applying the same input at different snapshots and observing overlaps in feedback. Fuzzers usually discover and record inputs that trigger new features within the target. Working under this assumption, every recorded input is guaranteed to elicit a response in at least one snapshot that was active at the time the input was first discovered. Using the battle-tested mechanism of a cumulative global feature map, where observed features are only considered interesting the first time they are encountered, the fuzzer is incapable of rediscovering the same feature across different contexts. In other words, if the same response can be observed from different snapshots, the

fuzzer will attribute only one of those snapshots with the discovery of that response.

We illustrate this procedure with an example: consider the system with the state graph prescribed in Fig. 4a. After some simulated rounds of fuzzing, we arrive at the snapshot tree depicted in Fig. 4b. Whereas the fuzzer observed unique features—namely, state transitions in the system’s finite state machine (FSM)—that resulted in this snapshot tree, we note that multiple snapshots occupy the same state. Unaware of this overlap, the fuzzer would schedule each of these snapshots individually and would integrate feedback into state-agnostic models. This results in duplicate efforts, as the fuzzer wastes many cycles on testing the same states along different paths. However, given this initial snapshot tree, the fuzzer can leverage cross-pollination to discover hidden relations between snapshots, allowing it to better model their overlap and optimize its exploration. Specifically, given enough initial input samples, the behavior of a snapshot can be modeled by measuring its different response patterns. To discover the capabilities of all snapshots, we iterate over all inputs in the snapshot tree and apply them at every snapshot, measuring its response pattern, and recording its ability to reproduce the original response. In constructing the snapshot tree, every response pattern is associated with an edge between a parent and a child snapshot. As such, if a snapshot is capable of a response, it is equivalent to having an edge between that snapshot and the corresponding child node in the snapshot tree. Fig. 4c shows the part of the snapshot tree after cross-pollination, applying 0 or 1 to all existing snapshots.

Figure 4: A running example of cross-pollination. (a) A state diagram of a system that detects the string 𝑏’1011. (b) One example of a snapshot tree constructed by the fuzzer. Each node is annotated with the implicit state 𝐿 that the system occupies at that snapshot and its ground-truth state 𝑆. Each edge is annotated with the given input and the response pattern. (c) An example of the snapshot tree after cross-pollination. New capabilities are found, such as 𝐿0 ∈ 𝑁⁻(𝐿2), 𝐿1 ∈ 𝑁⁻(𝐿1), 𝐿2 ∈ 𝑁⁻(𝐿4), 𝐿3 ∈ 𝑁⁻(𝐿3), and 𝐿3 ∈ 𝑁⁻(𝐿4).

Definition 3.3. Capability: A snapshot 𝐿𝑖 has a capability for 𝐿𝑗 when the response to an input measured in 𝐿𝑖 matches the response of another snapshot in the in-neighborhood of 𝐿𝑗, denoted as 𝐿𝑖 ∈ 𝑁⁻(𝐿𝑗). For example, with the snapshot tree in Fig. 4c, 𝐿0 ∈ 𝑁⁻(𝐿2) holds when re-applying the same input 1 to 𝐿0 generates the same response 3 as the input 1 applied to 𝐿2’s in-neighborhood 𝐿1.¹

3.2 Snapshot Grouping
To continue grouping snapshots, we record all the capabilities in a capability matrix.

Definition 3.4. Capability Matrix: A capability matrix is a matrix of snapshots’ capabilities based on the responses to inputs; a non-empty cell ⟨𝐿𝑖, 𝐿𝑗⟩ stores the input that triggers a known response if 𝐿𝑖 ∈ 𝑁⁻(𝐿𝑗) holds. The left part of Fig. 5 shows a capability matrix with multiple capabilities that have been inferred.

3.2.1 Subsumption. The populated capability matrix provides insight into which snapshots overlap in behavior, and consequently, how distinct snapshots are. This knowledge is essential for better guiding the scheduler toward exploring unique functionality without duplicating efforts and stalling progress. To find that overlap, we develop the subsumption operator over graph vertices (≺). In short, a node 𝑢 is subsumed by 𝑣 iff 𝑣 can replace 𝑢 without affecting reachability, i.e., 𝑣 has at least all the same edges as 𝑢. Taking Fig. 5 as an example, we observe that {𝐿0, 𝐿1, 𝐿5}, {𝐿2, 𝐿3, 𝐿8}, {𝐿4, 𝐿7}, and {𝐿6} form sets of snapshots with mutually overlapping responses. Each set of snapshots is then called an equivalence state. Notably, equivalent snapshots may traverse different paths through the system and thus cannot be pruned solely through power schedules [7], since state information is not captured by code coverage alone. But, any snapshot belonging to the same equivalent state overlaps, and thus can be safely eliminated, without loss of capabilities. Alternatively, to avoid inadvertently losing quality inputs, we choose to include strictly subsumed snapshots in the equivalence states of the subsuming nodes. This ensures that the fuzzer experiences the same reduction in state counts while maintaining access to all interesting snapshots. We present the formal definitions and discussions of subsumption operators in Appendix A.

3.2.2 Elimination. In practice, a fuzzer may also encounter snapshots whose response sets are proper subsets of others’ response sets. From the fuzzer’s perspective, such snapshots are less capable and thus not worth dedicating a scheduling slot for, despite having non-conforming behavior. Any response elicited in that snapshot can be reproduced in another having at least one additional capability. By identifying such states and eliminating them, the fuzzer further reduces the size of the scheduling queue and improves the dissemination of feedback.

¹ For a directed graph G = (V, E), a vertex u is an in-neighbor of a vertex v if (u, v) ∈ E and an out-neighbor if (v, u) ∈ E.

3.2.3 Colorful Collapse. After subsumption and elimination, we obtain a reduced set 𝑆̃ of equivalence states 𝑆̃0 = {𝐿0, 𝐿1, 𝐿5}, 𝑆̃1 = {𝐿2, 𝐿3, 𝐿8}, 𝑆̃2 = {𝐿4, 𝐿7}, and 𝑆̃3 = {𝐿6}. Within each state lies a collection of snapshots that share the same behavior across all tested inputs. We model such behavior by the capability set of the state ⟨input : 𝐼, response : 𝑅⟩. For example, if ⟨𝐼𝑖,𝑗, 𝑅𝑖,𝑗⟩ leads 𝐿𝑖 to 𝐿𝑗, where 𝐿𝑗 is in 𝑆̃1, all the snapshots in 𝑆̃1 will share the capability set ⟨𝐼𝑖,𝑗, 𝑅𝑖,𝑗⟩. Furthermore, in applying the input 𝐼𝑖,𝑗 at any snapshot in 𝑆̃𝑖, if we can reproduce the response pattern 𝑅𝑖,𝑗, we can then say that all snapshots in 𝑆̃𝑖 can reach 𝐿𝑗 through 𝐼𝑖,𝑗, because they all elicit the same characteristic response of 𝐿𝑗. For example, all snapshots in 𝑆̃0 can reach 𝐿2 in 𝑆̃1, as implied by its capability set ⟨1, 3⟩. This is directly observed when the capability matrix 𝐴𝐶 is cast to an adjacency matrix of snapshots, as depicted in Fig. 5. To simplify the aggregation of snapshots, such that scheduling and feedback are performed at the state level, we can collapse the adjacency matrix through vertex contraction [23] over 𝑆̃𝑖, according to whether any snapshot in 𝑆̃𝑖 can reach any snapshot in 𝑆̃𝑗.

Figure 5: The capability matrix 𝐴𝐶 (left), where a marked cell at ⟨𝐿𝑖, 𝐿𝑗⟩ indicates that 𝐿𝑖 ∈ 𝑁⁻(𝐿𝑗). It is obtained after cross-pollination by uncovering new capabilities—marked in blue. With subsumption, we group up nodes that mutually overlap in capabilities (in rectangles) in 𝐴𝐶. Finally, we collapse the matrix onto a graph of equivalence states that models the relations between snapshots (right).

Since the capability matrix specifies the response of every snapshot to a single input, we consider those as first-order responses; recall that in the maze analogy, we also limited the inference to one step away from each cell. The behavior of the snapshot after applying the first input can only be modeled by further cross-testing, along an additional dimension, to obtain capability tensors of higher-order responses. However, going beyond the first order significantly increases the overhead of state inference and may only partition the snapshot groupings even further; its added value is higher accuracy. As a matter of fact, we find that fuzzing seems to be tolerant to the imprecision introduced by the first-order responses due to the dampening effect of random sampling.

The collapsed matrix models the first-order relations between states. In our contrived example, this matrix recovers the original state graph presented in Fig. 4a. While certainly desirable, this was mainly driven by two factors:

Lossless features: Transitions in the snapshot tree coincide with those in the state graph; there are exactly 8 of each. This implies that the feature feedback to the fuzzer was capable of capturing these state transitions, without a loss in accuracy or precision. While it is not a coincidence that our tailored example displayed this behavior, it is unlikely that features are lossless in practice. The nature of the feedback (e.g., code or data coverage) dictates the accuracy and precision of the response patterns. In the FSM example, the fuzzer also managed to explore all transitions through different generated inputs. However, completeness is not guaranteed under fuzzing.

Smooth transitions: In constructing the snapshot tree, the fuzzer encountered only new features at any active snapshot, corresponding to triggering unseen transitions in the state graph. Had the fuzzer triggered multiple transitions in the target without observing new features, then the edge to the next recorded snapshot is not guaranteed to overlap a transition in the state graph. In effect, the fuzzer would have discovered a path through the state graph, rather than a single transition; yet, it would record it as one edge in the snapshot tree. It is similarly difficult to achieve “smoothness” in practice. In the maze analogy, we enforced smoothness by limiting the fuzzer to one move at a time.

To mitigate the imprecision and inaccuracy of feedback, and to avoid misguiding the fuzzer with false assumptions about state equivalence, we maintain the original snapshot tree and color it with state labels. This ensures that implicit state build-up in equivalent snapshots is not lost during collapse, and that consequent application of the state inference process does not compound the errors, but rather reduces them as more response patterns are measured and evaluated. This also allows us to iteratively collapse the same matrix to obtain a minimal recovered model of state relations.

3.3 Successive Rounds of Inference
After applying state inference, we obtain a capability matrix that carries the fruits of cross-pollination, along with a collapsed matrix that models the relations between labeled snapshots. This labeling, however, is static and applies only to the cross-tested snapshots. As the fuzzer progresses, it will discover more snapshots that remain unclassified and do not benefit from the results of state inference. It is then necessary that the process is continuously applied throughout the fuzzing campaign.

One straightforward approach is through batching: for every batch of 𝑚 new snapshots, we re-apply the state inference, extending and updating the capability matrix from the previous round. We illustrate the state of the extended capability matrix of the second application round in Fig. 6. Note that entries in quadrant 𝑄𝐴 need not be revisited, which is to say, we do not recompute existing snapshot groups, since capabilities are assumed to be reproducible. After overlaying discovered edges from the latest snapshot tree onto the new capability matrix, we can continue to apply the state inference as prescribed. Alternatively, batching can be scheduled in time slots, e.g., every 𝑁 minutes, accommodating for the non-linear increase in coverage by performing inference on the new snapshots generated since the last run. An adaptive hybrid approach can combine the two strategies to reduce the startup cost of inference and maximize information gained throughout a fuzzing campaign.
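The collapse and the batched re-application could look roughly as follows (a sketch under our own naming; Tango’s implementation may differ), contracting snapshots that share a state label and re-running inference once 𝑚 unclassified snapshots accumulate:

```python
def collapse(adjacency: dict[int, set[int]],
             state_of: dict[int, int]) -> dict[int, set[int]]:
    """Vertex contraction: state u has an edge to state v if any snapshot
    labeled u can reach any snapshot labeled v in the snapshot adjacency."""
    contracted: dict[int, set[int]] = {}
    for src, dsts in adjacency.items():
        u = state_of[src]
        contracted.setdefault(u, set()).update(state_of[d] for d in dsts)
    return contracted

class BatchedInference:
    """Re-apply state inference for every batch of m new snapshots; a
    time-slotted or hybrid trigger could be added along the same lines."""
    def __init__(self, run_inference, m: int = 50):
        self.run_inference = run_inference   # callback that extends the capability matrix
        self.m = m
        self.pending: list[int] = []

    def on_new_snapshot(self, snapshot_id: int) -> None:
        self.pending.append(snapshot_id)
        if len(self.pending) >= self.m:
            self.run_inference(self.pending)
            self.pending.clear()
```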

Figure 6: The capability matrix of the second round, with 𝑚 = 4 new snapshots. The shaded quadrant 𝑄𝐴 contains the populated matrix from the previous round. The 𝑄𝐵 and 𝑄𝐷 quadrants are overlaid with edges—marked in green—from the latest adjacency matrix. Cross-pollinated capabilities are marked in blue. 𝑄𝐶 is always initially empty since old snapshots cannot have new predecessors.

While we present our approach as an independent post-discovery step, state inference can also benefit from the fuzzer itself since the fuzzer spends much time on generating inputs that do not trigger new features. Nonetheless, the feedback is often non-empty: these uninteresting inputs likely elicit known response patterns that may overlap those of other snapshots. This observation allows the fuzzer to dynamically extend the capability matrix at a minimal cost: the computational overhead of matching.

4 OVERHEAD, OPTIMIZATIONS, AND TRADE-OFFS
To uncover hidden capabilities, state inference relies on cross-validation, a technique that is notorious for its quadratic complexity. Applying state inference in batches of 𝑚 new snapshots requires that the fuzzer discovers 𝑚 new response patterns. But coverage growth is linear in exponential time [6]: to discover 𝑚 new response patterns, the fuzzer spends on the order of exp(𝑚) more time than for the last batch. As the fuzzer continues to progress, the overhead cost of state inference is amortized over the executions performed between rounds (see Appendix B for a more formal analysis). In the meantime, it reaps the benefits of modeling state relations in scheduling and in generation. We assess these costs and benefits through our evaluation in Section 6.2 and Section 6.3.

The ramp-up cost of state inference is, however, high. A fuzzer finds the most coverage in the first few epochs of the campaign, necessitating frequent rounds of cross-testing. Whereas the overhead tapers off at the tail, the initial costs can overwhelm the fuzzer, making it spend most of its time in the beginning just on inference, and bringing its progress to a slow halt. The cost of ramping up greatly varies with initial seed coverage, the complexity of the target, and the execution speed, among other variables.

To address the ramp-up cost, we propose several optimizations that reduce the overhead of state inference, at the cost of some accuracy in grouping snapshots. The prescribed cross-testing procedure requires that all cells of Fig. 6 outside of 𝑄𝐴 be tested for adjacency. We propose three optimizations (one for each quadrant) in the form of skipped tests, thereby reducing the number of resets and executions required for each round of inference.

4.1 State Broadcast (𝑂𝐵)
Suppose the inferred states in 𝑄𝐴 are correct once constructed; then cross-testing the capabilities of individual snapshots within the same state becomes unnecessary. Working under that assumption, equivalent snapshots always have the same capabilities, which we can evaluate as a property of the state they belong to. For a state with 𝑁 snapshots, it thus suffices to perform at most 𝑚 cross-tests, instead of 𝑚 × 𝑁. Discovered capabilities in 𝑄𝐵 are then broadcast to all snapshots within a state, reducing the number of cross-tests (see Appendix C for a formal analysis of its reduction). In applying state broadcast as an optimization, the overhead of inference is distributed among states, rather than snapshots. This results in higher prediction accuracy for “uncommon states”, i.e., those with fewer snapshots, which are arguably of more interest to the fuzzer.

4.2 Response Fingerprinting (𝑂𝐶)
After one round of state inference, we obtain a capability matrix in 𝑄𝐴. Each state and its associated capability set serve as labeled data for training a decision tree (DT) classifier. Such a classifier can optimally divide the input space, enabling us to primarily cross-test inputs in 𝑄𝐶 which maximize information gain.

We fit a DT over this training data to infer response fingerprints: the minimal subsets of capability sets which are characteristic identifiers of states. The DT classifier yields a binary tree, where each internal node tests for a capability, such that nodes closer to the root yield higher information gain. Leaf nodes consequently carry a value indicating the most likely candidate state.

Following this procedure, for each new snapshot in the inference batch (𝑄𝐶), we traverse the DT from the root and query it for the next capability to test, until we hit a leaf node. At this point, the candidate state is identified. The tested capabilities along the path form a subset of the capability set of the candidate. Nonetheless, to reduce false positives and satisfy the rules of subsumption, we extend the testing to all non-empty responses in the candidate’s capability set.

Consider the second round in Fig. 6. We fit a DT over the capability sets of 𝑆̃0: ⟨𝐿1, 𝐿2⟩ and 𝑆̃1: ⟨𝐿3⟩. In this simplistic example, a test on any of {𝐿1, 𝐿2, 𝐿3} is enough to yield a classification. With the given labels, it is thus sufficient to perform one test, instead of four, to classify a new snapshot. A snapshot that passes the test has a capability set that matches that of 𝑆̃1, making it at least equivalent. However, failing the test implies an empty capability set, yet the classifier would predict 𝑆̃0. To properly test for equivalence, we must test against 𝑆̃0’s entire capability set in the least (for a total of three tests). This allows us to properly classify 𝐿5 in Fig. 6. Note that, if the capability set of a new snapshot 𝐿𝑢 matches that of the candidate state 𝑆̃𝑣 using DT classifiers, we can only infer that 𝑆̃𝑣 ⪯+ 𝐿𝑢, since we do not have information about what we did not test. Whether or not 𝑆̃𝑣 ≺ 𝐿𝑢, the result is then the same: a state 𝑆̃𝑤 = 𝑆̃𝑣 ∪ {𝐿𝑢}. Combined with coloring, this approach ensures that state inference remains consistent under sparsity.

4.3 Test Extrapolation (𝑂𝐷)
With response fingerprinting, we can reduce the number of tests to be done in the 𝑄𝐶 quadrant of Fig. 6. However, 𝑄𝐷 remains to be fully tested. Building on top of the DT’s predictions, we can leverage 𝑄𝐵 to extrapolate which tests in 𝑄𝐷 need to be applied. Instead of cross-testing all 𝑚 new snapshots against each other, we can reduce the cost of inference by extrapolating tests from the new state capabilities.

After finding a candidate state to which each snapshot belongs, we populate 𝑄𝐵 to explore the new capabilities of pre-labeled snapshots (i.e., those in 𝑄𝐴). We follow that by testing new snapshots only against the new capabilities of their candidate states. As with response fingerprinting, this process ensures correct subsumption conditions, minimizing the risk of mislabeling snapshots.

5 TANGO: THE FRAMEWORK
State inference extends the fuzzer’s knowledge base with information about the behavior of the target, which could be leveraged to improve its exploration. Nonetheless, the technique introduces new definitions and requires components which are not explicit in existing fuzzers or frameworks, such as snapshots, states, and transitions. To assess and evaluate the feasibility and benefits of state inference, a whole re-write and restructuring of the typical fuzzing workflow is needed. The lack of an existing framework for state-aware fuzzing and the inflexibility of existing monolithic fuzzers motivated the design and development of Tango: a state-aware fuzzing framework.

Figure 7: The general workflow of the Tango framework.

5.1 Workflow
Anchored around the notion of state-sensitivity, Tango generalizes over the traditional fuzzer architecture and offers a flexible environment for developing tailor-made fuzzers of stateful systems. The workflow is presented in Fig. 7. In the context of a fuzzing Session, we ① iteratively step through a Strategy that governs the exploration and exploitation efforts of the fuzzer. Provided with knowledge of the current state of the system, ② the Strategy chooses a target state to fuzz and invokes the Generator to construct a candidate input, suitable for application under that state. The former then ③ forwards the input to the Explorer, which ④ ensures that the system occupies the target state, then ⑤ executes the input through its Driver. With the help of the Tracker, ⑥ peeking into the new state of the system enables the Explorer to record any observed changes. The latter then ⑦ forwards its findings over a callback to the Session, which ⑧ broadcasts any updates to concerned components, thus closing the feedback loop.

5.2 Implementation
Tango is implemented in Python 3.11 for Python’s flexibility and ease-of-use. We expand on the implementation details in Appendix D. Importantly, during our implementation, we have identified three kinds of non-determinism: random numbers, time, and shared resources, and concretely addressed these issues by providing the random generator with a constant seed, providing the target with normalized time [1], and recovering shared resources when the target is reset. Removing nondeterminism that unnecessarily differentiates two snapshots that should be merged will improve the overall accuracy of the snapshot grouping.

6 EVALUATION
State inference is a mechanism to identify functionally-distinct states through cross-testing its snapshots against different inputs. To assess the effectiveness of this technique and pinpoint its potential use cases, we address the following research questions through distinct evaluation campaigns:
RQ1 What is the cost of cross-testing?
RQ2 How well do fuzzers distribute cycles to functionally-distinct snapshots?
RQ3 What is the potential reduction in queue sizes?
RQ4 How does code coverage correlate to state coverage in evaluating state-aware fuzzers?

6.1 Experimental Setup
To quantify the cost and benefits of a new technique, it must be compared against a baseline where only key features are different. These features must then be tested individually. Since we implemented state inference on top of Tango, we perform the parametric analysis with Tango itself as the baseline, by toggling new features and sweeping over parameters. This enables us to measure the induced effect of each new aspect by changing one variable at a time, discounting the possible variances from runtime effects, implementation artifacts, or a richer set of mutators and schedulers, such as those in AFL++ [10].

To that end, we tackle RQ1 through a set of experiments on Tango-Infer, configured to use state inference as a strategy. Moreover, to assess the potential inefficiencies of scheduling functionally-equivalent snapshots (RQ2), we dispatch a set of campaigns to state-of-the-art state-aware fuzzers, collect their seed queues, and replay them through Tango-Infer to find functional groupings among the fuzzers’ snapshots and analyze the skew of snapshots in explored states. Additionally, for RQ3, we use the results of state inference on fuzzer queues to approximate the expected reduction ratio 𝛼 encountered across different targets. Finally, to assess the practical advantage of state inference, we set up augmented fuzzing campaigns on top of state-of-the-art baselines, where state inference is run periodically to condense the seed queue.

To answer RQ4, we measure the benefit of state inference as the proportion of equivalence states discovered uniquely by each fuzzer.

We perform evaluations against all the thirteen version-anchored targets from ProFuzzBench [21] and three stateful parsers: libexpat, yajl, and llhttp. While parsers are traditionally considered stateless, our evaluation highlights their stateful nature, as well as Tango’s flexibility in fuzzing diverse data channels. We run 24-hour campaigns, each with 3 trials to account for randomness. We conduct all experiments on four servers, each with 32 Intel Xeon Gold 5218 CPU (2.30GHz) cores, 64GB RAM, and Ubuntu 22.04.

6.2 RQ1: Empirical Overhead
During inference, the target is reset, and inputs are executed for cross-testing every cell in the capability matrix. To assess the overhead of this operation, we measure the time spent by the fuzzer and the number of tests performed, across varying settings of batch size 𝑚 and enabled optimizations 𝑂𝐵, 𝑂𝐶, and 𝑂𝐷.

Figure 8: The median overhead of state inference as the proportion of time spent on cross-testing, when optimizations are disabled (solid lines) and enabled (dotted lines), over 24 hours, under different batch sizes. Assuming the speed of discovering new snapshots is the same, the lower 𝑚 is, the more frequently state inference is performed.

Fig. 8 shows the overhead of state inference as a function of time. The beginning of a fuzzing campaign records many snapshots and frequently invokes state inference, due to the initial seeds quickly expanding coverage. As exploration speed tapers off, the fuzzer continues to generate and execute inputs, progressively spending less of its time on cross-testing. However, it leverages the knowledge gained from previous applications of state inference to schedule its exploration more evenly across the functionally distinct snapshot groups. As discussed in Section 3.3, it is essential to continuously apply inference on new snapshots. Otherwise, the fuzzer regresses in the direction of high-density regions [7]. Fig. 8 highlights that the optimizations can reduce the ramp-up cost and thus allow the fuzzer to resume regular operation faster.

To better understand the influence of the optimizations, Fig. 9 breaks down the trade-off between the introduced savings (skipped tests) and the sacrificed accuracy (the percentage of correctly labeled snapshots). First, we lack data for some targets since the number of snapshots generated during fuzzing these targets is insufficient to perform cross-testing and the validation of savings and accuracy. Second, as shown in Fig. 9a, the optimizations can skip more than half of the tests while maintaining moderate accuracy. Third, for a batch size of 50, Fig. 9b shows that various optimizations impact savings and accuracy differently.

Figure 9: The hit accuracy (the percentage of correctly labeled snapshots) of optimizations as a function of introduced savings (the skipped tests), based on the ground truth collected when doing state inference without optimizations. (a) Optimizations: All. (b) Batch size m=50.

Optimizations rely heavily on the results of previous rounds of state inference. If a grouping is incorrectly established, it could remain divergent for the lifetime of the fuzzing campaign, especially since, in our implementation, matching is not error-tolerant. If the fuzzer falsely generates distinct groupings, then future rounds of inference, and optimizations within them, could propagate the errors (unless groups are merged again under subsumption).
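For reference, the two quantities plotted in Fig. 9 could be computed along these lines (a simplified sketch; it assumes the predicted and ground-truth labelings use comparable state identifiers, with the ground truth taken from an unoptimized inference run):

```python
def savings(skipped_tests: int, total_tests: int) -> float:
    """Fraction of cross-tests skipped by the enabled optimizations."""
    return skipped_tests / total_tests

def hit_accuracy(predicted: dict[int, int], ground_truth: dict[int, int]) -> float:
    """Fraction of snapshots whose predicted state label matches the label
    assigned by state inference without optimizations."""
    hits = sum(1 for s, label in predicted.items() if ground_truth.get(s) == label)
    return hits / len(predicted)
```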

Figure 10: Normalized values of the Kullback-Leibler divergence from uniformity of observed snapshot-to-state distributions, illustrated through notched box-plots at 95% CI. Divergence is calculated individually for each seed queue, and data points are then overlaid as a scatter plot. The size of each marker is proportional to the number of snapshots 𝑁, normalized by the maximum number of snapshots observed for its target across all campaigns.
Figure 11: The number of the discovered snapshots (in blue) and the number of the inferred states (in orange). Text after each bar shows the number of the inferred states and the reduction ratio 𝛼 = 1 − states/snapshots at the end of a 24-hour fuzzing campaign without optimizations.

6.3 RQ2: Snapshots in Biased Queues
Whenever an evolutionary fuzzer encounters interesting coverage, it saves the input that caused it in its seed queue. For stateful fuzzers, these seeds serve as snapshots to restore the target to the reached state. To quantify the benefits that seed scheduling through state inference could provide, we measure the distribution of fuzzer-generated snapshots across functionally-distinct equivalence states. If we assume that a state-unaware fuzzer selects seeds from the queue with a uniform distribution (i.e., it assumes that seeds are evenly spread across target functionality), then we can assess the "surprise" [28] of sampling uniformly from a skewed population. We calculate 𝐷̂𝐾𝐿(S||U), the Kullback-Leibler divergence [14] of the observed snapshot-in-state distribution S against a uniform reference U, normalized by log(𝑁), where 𝑁 is the total number of snapshots observed in each campaign.
In this experiment, we run AFL++, Nyx-Net, and Tango-Infer against compatible targets, collect their seed queues, and apply state inference to extract snapshot groupings. We present the results in Fig. 10. A value 𝐷̂𝐾𝐿 = 0 indicates that sampling the seed queue uniformly yields results consistent with sampling a uniformly distributed population, i.e., where there are equally as many snapshots for every equivalence state discovered by the fuzzer. On the other end of the spectrum, 𝐷̂𝐾𝐿 = 1 implies that uniform sampling yields the highest surprise: whereas the fuzzer would expect to be exploring different functionalities by cycling through its queue, it is likely tunnel-visioned by a majority equivalence state. A fuzzer that samples its seed queue uniformly would be hindered by duplicate efforts and a self-reinforcing equivalence state.
On the other hand, while fuzzers generally employ more complex seed scheduling mechanisms [7, 16, 29], those cannot replace state awareness. A schedule that prioritizes snapshots unequally based on observed feedback inherently disregards the possible overlap of those snapshots with others in their equivalence state. Equivalent snapshots exercise overlapping behavior, insofar as cross-testing has not identified discrepancies that necessitate subdividing the group into distinct functionalities. However, since fuzzing is incomplete, it may be that now-equivalent snapshots diverge in the future, given that they are sufficiently scheduled. A non-uniform scheduling strategy may starve those snapshots of the time needed to explore their potential capabilities, often in favor of the first one which uncovered interesting features.
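As a concrete (if simplified) reading of this metric, the sketch below computes a normalized divergence from a per-snapshot state labeling. It assumes the uniform reference is taken over the discovered equivalence states and normalizes by log(𝑁) as described above; it is illustrative only, and the exact reference distribution used in our computation may differ.

```python
import math
from collections import Counter

def normalized_divergence(snapshot_states):
    # snapshot_states: {snapshot_id: inferred_state_label}
    n = len(snapshot_states)                    # total snapshots N
    counts = Counter(snapshot_states.values())  # snapshots per equivalence state
    k = len(counts)                             # number of discovered states
    # Observed snapshot-in-state distribution S vs. a uniform reference U
    # over the discovered states (assumed reading of the reference).
    kl = sum((c / n) * math.log((c / n) / (1.0 / k)) for c in counts.values())
    return kl / math.log(n) if n > 1 else 0.0   # normalize by log(N)
```

Values near 0 indicate seeds spread evenly across states, whereas values near 1 indicate a queue dominated by a single equivalence state.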
6.4 RQ3: Reduction Ratio 𝛼
The main advantage of state inference is condensing the seed queue into functionally distinct islands. This allows seed scheduling to be more balanced and reduces redundant exploration of the same code regions. To assess the effect of state inference on queue size reduction, we measure the number of snapshots generated by every campaign, and the corresponding number of discovered groupings. In Fig. 11, we report the reduction ratio 𝛼 as 𝛼 = 1 − states/snapshots. The average reduction ratio is 86.02% (𝑚 = 10), 86.04% (𝑚 = 20), 85.08% (𝑚 = 50), and 87.76% (𝑚 = 100), i.e., the queue is around seven times smaller. Combined with the results from Section 6.2, this suggests that, during later fuzzing stages, the fuzzer can cycle through its queue around seven times as fast, at a diminishing cost.
The skewed distribution of seeds in fuzzing queues suggests that state-of-the-art fuzzers could benefit from applying state inference to prune their queues and avoid tunnel vision towards high-density regions. Continuous incremental application also ensures that new snapshots are incorporated and that the fuzzer avoids regression towards non-uniformity.
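As a hypothetical worked example of the metric: a queue of 500 snapshots that state inference collapses into 70 equivalence states yields 𝛼 = 1 − 70/500 = 0.86, i.e., the fuzzer only needs to schedule roughly one seventh of its queue to cover the same set of inferred behaviors.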
6.5 RQ4: Case Study on Cross-Inference
We implement state inference without optimizations as a hotpluggable component to introduce state-aware scheduling to two existing fuzzers: AFL++ (for the streaming parsers) and Nyx-Net (for the network servers). The fuzzer and Tango share one physical core throughout the campaign. Tango continuously checks for new inputs in the fuzzer's seed queue and applies state inference, exporting its results for use by the fuzzer's scheduler.
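The exported groupings can be consumed by a scheduler in several ways; the sketch below shows one hypothetical two-level policy (pick an equivalence state, then a seed within it). It is not the modified AFL++/Nyx-Net code, and all names are illustrative.

```python
import random
from collections import defaultdict

def pick_seed(queue, state_of):
    """queue: list of seed ids; state_of: {seed_id: inferred state label}.
    Seeds without a label yet (not cross-tested) form their own group."""
    groups = defaultdict(list)
    for seed in queue:
        groups[state_of.get(seed, ("unlabeled", seed))].append(seed)
    state = random.choice(list(groups))   # uniform over equivalence states
    return random.choice(groups[state])   # then uniform within the state
```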
Figure 12: Edge coverage collected from Nyx-Net (for network servers) and AFL++ (for parsers) when running without (solid lines) and with (dotted lines) the state inference extension. Targets: (a) bftpd, (b) dcmtk, (c) dnsmasq, (d) expat, (e) llhttp, (f) openssh, (g) openssl, (h) tinydtls, (i) yajl.
Figure 13: Cross-inference results showing the distribution of overlapping and unique behaviors discovered by stateful fuzzers with and without state inference.

Tango is also configured to export any interesting inputs it discovers during cross-testing, further reducing its effective overhead.
We measure the code coverage achieved by both the unmodified and the augmented variants of the fuzzers, and we present those in Fig. 12. Since code coverage alone cannot capture state information, we leverage state inference as a performance metric, through a process we call cross-inference. Seeds obtained from two competing fuzzers are used to construct a snapshot tree, upon which state inference is applied. We interpret the overlap and disjunction of those snapshots in equivalence states as a measure of state coverage: states where all snapshots of fuzzer B are subsumed by snapshots of fuzzer A are considered unique to A, without loss of generality. Otherwise, we consider them overlapping. The results of this experiment are illustrated in Fig. 13.
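The classification rule can be sketched as follows. For brevity, the sketch only checks which fuzzer contributed snapshots to each equivalence state and omits the subsumption refinement described above; the data structures are hypothetical stand-ins for the output of state inference over the combined snapshot tree.

```python
def classify_states(states):
    """states: {state_label: set of (fuzzer_name, snapshot_id) members}.
    Labels each equivalence state as unique to 'A', unique to 'B', or overlapping."""
    labels = {}
    for state, members in states.items():
        owners = {fuzzer for fuzzer, _ in members}
        if owners == {"A"}:
            labels[state] = "unique to A"
        elif owners == {"B"}:
            labels[state] = "unique to B"
        else:
            labels[state] = "overlapping"
    return labels
```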
The experiments yield an interesting result: despite both fuzzers attaining similar code coverage, state inference revealed distinctions in uncovered functionality, favoring the state-guided fuzzers. The additional code covered by the unmodified fuzzers may not directly translate to state coverage: "novel" paths may belong to the same equivalence state. Through our evaluation, state-guided fuzzers uncovered two new bugs: a heap buffer overflow in dcmtk and a heap out-of-bounds read in yajl.

7 DISCUSSION
State inference introduces a new metric for a fuzzer to optimize its progress and performance on stateful systems. Consequently, it raises the question of whether such a metric ultimately improves the fuzzer's ability to find bugs. Notably, we argue that optimizing state coverage requires state-aware metrics, although not exclusively.
Code vs State coverage: Our case study on cross-inference highlights a key observation: code coverage is not sufficient for stateful exploration. Despite achieving higher code coverage, state-unaware fuzzers discovered fewer behaviorally-unique states, as seen in Fig. 13. This is further justified by our takeaway from Fig. 10, that seeds in the queue of coverage-guided fuzzers are non-uniformly distributed. Stateless programs can be seen as occupying only one state, and code coverage enables in-depth exploration of that state. Stateful programs introduce a new dimension for fuzzers. So, despite state-unaware fuzzers achieving slightly higher code coverage, cross-inference results highlight that state-aware exploration better maximizes state coverage. A scheduler may incorporate other metrics like code coverage to prioritize individual snapshots or diversify exploration within a state.
Bug-finding advantage: In dcmtk, Tango uniquely detected a clean-up bug, which was subsequently reported and fixed. To trigger the bug, sending any message and waiting for clean-up is insufficient; instead, it requires setting up a valid state, followed by a disconnection. Tango's "post-mortem tracking" was also pivotal for achieving this: it monitors decommissioned targets asynchronously until they crash or exit, without impacting performance. In yajl, 4 of 5 Tango-AFL++ campaigns reliably triggered the bug, whereas it was never triggered by the unmodified AFL++. In a later evaluation with 20 campaigns, the bug was triggered in 14 Tango-AFL++ campaigns, but only in 4 campaigns without inference. This significant difference underscores the benefit of state-aware scheduling.
Without ground-truth benchmarks, we cannot assess fuzzer performance through new bugs alone, as the total bug count is unknown. Instead, we report that fuzzers with state inference found a superset of the bugs found by fuzzers without it. Tango provides the tools necessary to perform state-aware fuzzing, and we believe it is an important foundation for future research.
Exploration vs Exploitation: State inference offers a mechanism for assigning a label to each snapshot. This enables balanced exploration among states to avoid the starvation of less frequent ones. A power schedule over states, like AFLFast [7], could then incorporate the frequency of encountering a state to prioritize the exploration of low-frequency behaviors. If state coverage is the only metric being optimized, then equivalent snapshots would be indistinguishable, and a uniform schedule within the same state would be reasonable. However, a fuzzer can still schedule equivalent snapshots non-uniformly based on code coverage and occasionally prioritize weaker ones.
Misclassification: A false grouping may occur either due to (i) insufficient cross-pollination; or (ii) state-insensitive feedback. In case (i), snapshots may remain misclassified until the next round of inference. If the new tests are sufficient to show a distinction between the snapshots, then the misclassification will be rectified. In case (ii), the fuzzer does not have sufficient information to make a distinction, since the different behaviors do not yield different observable effects. The choice of a state-sensitive feedback metric is thus important. Through sufficient testing and state-sensitive feedback, state inference always yields accurate groupings.
Comparison to LibAFL: Similar to Tango, LibAFL offers building blocks for custom fuzzers. However, LibAFL is designed for coverage-guided greybox fuzzing, not for stateful fuzzing. Tango is developed independently of and in parallel with LibAFL and was built from the ground up with state as an anchor. Statefulness could be retrofitted onto LibAFL, but only with heavyweight modifications to support target state as the context of its operations.

8 RELATED WORK
State-aware fuzzing: AFLNet [24] was among the first to tackle the problem of fuzzing network targets while allowing state to accumulate. However, its requirement to manually annotate server responses hindered adoption. Alternative techniques presented in SGFuzz [5], NSFuzz [25], and StateAFL [20] addressed this issue through more automated, albeit less precise, techniques for state extraction. Nonetheless, those works focused on extracting state, not as a way to generate protocol-compliant inputs, but as a labeling mechanism for discovered inputs. State inference in Tango makes this mechanism more explicit by exploring functional overlaps. LLM-guided fuzzing [19] has also shown its efficacy at targeting network protocols, by providing machine-readable grammars and high-quality seeds for covering states and transitions. Nonetheless, it remains limited to the information available in its training data and struggles to generalize to arbitrary or proprietary protocols.
Seed scheduling: While seed scheduling is a well-researched problem in fuzzing [12, 29, 30], state inference is the first to address it in the context of stateful systems. Existing techniques do not account for persistent effects of executing seeds; they perform their analysis retrospectively. In contrast, state inference runs a prospective analysis, finding overlaps in the traces of executed inputs.
Grammar inference: State inference overlaps with the disciplines of regular language grammars and automata theory. DFA minimization [13] proposes a technique for collapsing an FSM into a minimum number of states distinguishable by their outgoing transitions. The RPNI [22] and L* [2] algorithms present techniques for passive and active learning of deterministic finite automata for regular languages. Recent work by Luo et al. [17] also tackles grammar inference for network protocols, but it is limited to protocols that reveal stateful information in their responses to client messages.
Though similar to DFA inference, our approach differs in key assumptions, filling gaps in existing techniques. Namely, we do not assume an FSM model for the target, nor that the complete set of inputs and responses is known or enumerable. It is not the goal of state inference to extract the state model of the system, since that would require knowledge of which computational model is implemented, e.g., FSM or PDA. State inference instead aims to discover hidden capabilities of each snapshot through cross-pollination, and then finds groups of snapshots that share the same capabilities through subsumption and colorization. Compared to MACE [8], which first introduced blackbox state inference into dynamic symbolic execution, Tango observes more fine-grained feedback via instrumentation and has addressed unique challenges when combining state inference with seed scheduling.

9 CONCLUSION
Research on stateful fuzzing continued where its stateless counterpart left off. While much of the progress on the latter was of great benefit to this field, it still managed to imprint methods and assumptions that are otherwise not suited for stateful fuzzing. In this paper, we re-assess the definition of states and how they fit into the fuzzing stack. We present a method to identify semantic behavior through the use of portable metrics, in a technique we dub "State Inference". In the process, we design and implement Tango, a state-aware fuzzing framework for bootstrapping research in this domain. Through evaluation, we identify a key observation: fuzzers could potentially spend upwards of 86% of their time being tunnel-visioned or duplicating their efforts. By applying our technique, fuzzers can leverage state awareness for more optimal scheduling, at a diminishing amortized cost. State inference is also applicable in other stages of the fuzzing cycle, from seed minimization and distillation, through unsupervised state extraction and determined reproduction, to better-grounded performance evaluation.

ACKNOWLEDGMENTS
We thank the anonymous reviewers for their feedback on the paper. This work was supported, in part, by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant agreement No. 850868), SNSF PCEGP2_186974, and a gift from Huawei.

REFERENCES
[1] 2022. libfaketime modifies the system time for a single application. https://github.com/wolfcw/libfaketime Accessed: 2024-03-15.
[2] Dana Angluin. 1987. Learning Regular Sets from Queries and Counterexamples. Information and Computation 75, 2 (1987).
[3] Cornelius Aschermann, Sergej Schumilo, Ali Abbasi, and Thorsten Holz. 2020. IJON: Exploring Deep State Spaces via Fuzzing. In 2020 IEEE Symposium on Security and Privacy (SP).
[4] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. 2002. The Nonstochastic Multiarmed Bandit Problem. SIAM J. Comput. 32, 1 (2002).
[5] Jinsheng Ba, Marcel Böhme, Zahra Mirzamomen, and Abhik Roychoudhury. 2022. Stateful Greybox Fuzzing. In Proceedings of the 31st USENIX Security Symposium (USENIX Security '22). USENIX Association.
[6] Marcel Böhme and Brandon Falk. 2020. Fuzzing: On the Exponential Cost of Vulnerability Discovery. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2020). Association for Computing Machinery.
[7] Marcel Böhme, Van-Thuan Pham, and Abhik Roychoudhury. 2016. Coverage-Based Greybox Fuzzing as Markov Chain. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS '16). Association for Computing Machinery.
[8] Chia Yuan Cho, Domagoj Babić, Pongsin Poosankam, Kevin Zhijie Chen, Edward XueJun Wu, and Dawn Song. 2011. MACE: Model-inference-Assisted Concolic Exploration for Protocol and Vulnerability Discovery. In Proceedings of the 20th USENIX Security Symposium (USENIX Security '11). USENIX Association.
[9] Andrea Fioraldi, Daniele Cono D'Elia, and Davide Balzarotti. 2021. The Use of Likely Invariants as Feedback for Fuzzers. In Proceedings of the 30th USENIX Security Symposium (USENIX Security '21). USENIX Association.
[10] Andrea Fioraldi, Dominik Maier, Heiko Eißfeldt, and Marc Heuse. 2020. AFL++: Combining Incremental Steps of Fuzzing Research. In Proceedings of the 14th USENIX Workshop on Offensive Technologies (WOOT '20). USENIX Association.
[11] Andrea Fioraldi, Alessandro Mantovani, Dominik Maier, and Davide Balzarotti. 2023. Dissecting American Fuzzy Lop: A FuzzBench Evaluation. ACM Transactions on Software Engineering and Methodology (2023).
[12] Adrian Herrera, Hendra Gunadi, Shane Magrath, Michael Norrish, Mathias Payer, and Antony L. Hosking. 2021. Seed Selection for Successful Fuzzing. In 30th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2021). ACM.
[13] John E. Hopcroft, Rajeev Motwani, and Jeffrey D. Ullman. 1979. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, Section 4.4.3, Minimization of DFA's, 159–164.
[14] S. Kullback and R. A. Leibler. 1951. On Information and Sufficiency. The Annals of Mathematical Statistics 22 (1951).
[15] Junqiang Li, Senyi Li, Gang Sun, Ting Chen, and Hongfang Yu. 2022. SNPSFuzzer: A Fast Greybox Fuzzer for Stateful Network Protocols Using Snapshots. IEEE Transactions on Information Forensics and Security (2022).
[16] Chung-Yi Lin, Chun-Ying Huang, and Chia-Wei Tien. 2019. Boosting Fuzzing Performance with Differential Seed Scheduling. In 2019 14th Asia Joint Conference on Information Security (AsiaJCIS).
[17] Zhengxiong Luo, Junze Yu, Feilong Zuo, Jianzhong Liu, Yu Jiang, Ting Chen, Abhik Roychoudhury, and Jiaguang Sun. 2023. BLEEM: Packet Sequence Oriented Fuzzing for Protocol Implementations. In 32nd USENIX Security Symposium (USENIX Security 23).
[18] William M. McKeeman. 1998. Differential Testing for Software. Digital Technical Journal 10, 1 (1998).
[19] Ruijie Meng, Martin Mirchev, Marcel Böhme, and Abhik Roychoudhury. 2024. Large Language Model-guided Protocol Fuzzing. In Network and Distributed System Security Symposium (NDSS 2024). The Internet Society.
[20] Roberto Natella. 2022. StateAFL: Greybox Fuzzing for Stateful Network Servers. Empirical Software Engineering 27, 7 (2022).
[21] Roberto Natella and Van-Thuan Pham. 2021. ProFuzzBench: A Benchmark for Stateful Protocol Fuzzing. In Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis.
[22] J. Oncina and P. García. 1992. Inferring Regular Languages in Polynomial Updated Time. In Pattern Recognition and Image Analysis, Series in Machine Perception and Artificial Intelligence, Vol. 1. World Scientific, 49–61.
[23] Sriram V. Pemmaraju and Steven S. Skiena. 2003. Computational Discrete Mathematics: Combinatorics and Graph Theory with Mathematica. Cambridge University Press, Cambridge, England. Contracting Vertices, Section 6.1.1, 231–234.
[24] Van-Thuan Pham, Marcel Böhme, and Abhik Roychoudhury. 2020. AFLNet: A Greybox Fuzzer for Network Protocols. In Proceedings of the 13th IEEE International Conference on Software Testing, Verification and Validation: Testing Tools Track.
[25] Shisong Qin, Fan Hu, Bodong Zhao, Tingting Yin, and Chao Zhang. 2022. NSFuzz: Towards Efficient and State-Aware Network Service Fuzzing. International Fuzzing Workshop (FUZZING) 2022 (2022).
[26] Sergej Schumilo, Cornelius Aschermann, Ali Abbasi, Simon Wörner, and Thorsten Holz. 2021. Nyx: Greybox Hypervisor Fuzzing using Fast Snapshots and Affine Types. In 30th USENIX Security Symposium (USENIX Security 21).
[27] Sergej Schumilo, Cornelius Aschermann, Andrea Jemmett, Ali Abbasi, and Thorsten Holz. 2022. Nyx-Net: Network Fuzzing with Incremental Snapshots. In Proceedings of the Seventeenth European Conference on Computer Systems (EuroSys '22).
[28] C. E. Shannon. 1948. A Mathematical Theory of Communication. Bell System Technical Journal 27 (1948), 379–423, 623–656.
[29] Dongdong She, Abhishek Shah, and Suman Jana. 2022. Effective Seed Scheduling for Fuzzing with Graph Centrality Analysis. In 2022 IEEE Symposium on Security and Privacy (SP).
[30] Jinghan Wang, Chengyu Song, and Heng Yin. 2021. Reinforcement Learning-based Hierarchical Seed Scheduling for Greybox Fuzzing. In 28th Annual Network and Distributed System Security Symposium (NDSS 2021). The Internet Society.

A SUBSUMPTION OPERATORS
To formalize the approach, we propose the following definitions:
𝑁□(𝑢): the □-neighborhood of 𝑢, where □ ∈ {+, −} represents the outward and inward directions, respectively.
𝑢 ⪯□ 𝑣 ⇐⇒ 𝑁□(𝑢) ⊆ 𝑁□(𝑣): 𝑢 is □-subsumed by 𝑣.
𝑢 ≺□ 𝑣 ⇐⇒ (𝑢 ⪯□ 𝑣) ∧ (𝑣 ⪯̸□ 𝑢): 𝑢 is strictly □-subsumed by 𝑣.
𝑢 ⪯ 𝑣 ⇐⇒ (𝑢 ⪯+ 𝑣) ∧ (𝑢 ⪯− 𝑣): 𝑢 is subsumed by 𝑣.
𝑢 ≺ 𝑣 ⇐⇒ (𝑢 ⪯ 𝑣) ∧ (𝑢 ≺□ 𝑣): 𝑢 is strictly subsumed by 𝑣, for any □ ∈ {+, −}.
𝑢 ∼□ 𝑣 ⇐⇒ (𝑢 ⪯□ 𝑣) ∧ (𝑣 ⪯□ 𝑢): 𝑢 and 𝑣 are □-equivalent.
𝑢 ∼ 𝑣 ⇐⇒ (𝑢 ∼+ 𝑣) ∧ (𝑢 ∼− 𝑣): 𝑢 and 𝑣 are equivalent.
For each proposed operator 𝑜𝑝, we can construct the set of vertices 𝑢 ∈ 𝑈 which satisfy 𝑢 𝑜𝑝 𝑣 as: 𝑜𝑝𝑣 = {𝑢 ∈ 𝑈 | 𝑢 𝑜𝑝 𝑣}. Of particular interest are the equivalence sets ∼𝑣 and the strict subsumption sets ≺𝑣. To extract the sets of equivalence states from a capability matrix, it suffices to construct the set of out-equivalence sets:
𝑆̃ = {∼+𝑣 | 𝑣 ∈ 𝐿}
Each element in 𝑆̃ represents an equivalence state, a set of snapshots that overlap in behavior. To further reduce the number of states, the fuzzer calculates the set of non-empty strict subsumption sets:
𝑆̌ = {≺𝑣 ≠ ∅ | 𝑣 ∈ 𝐿}
Following that, any snapshot belonging to any set in 𝑆̌ can be safely eliminated from all sets in 𝑆̃ without loss of capabilities. Alternatively, to avoid inadvertently losing quality inputs, we choose to include strictly subsumed snapshots in the equivalence states of the subsuming nodes. This ensures that the fuzzer experiences the same reduction in state counts while maintaining access to all interesting snapshots.
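A direct, unoptimized reading of these definitions can be sketched in a few lines. The accessors neighbors_out and neighbors_in are hypothetical stand-ins for the +/− neighborhoods derived from the capability matrix, each returning a set; the actual bookkeeping in Tango may differ.

```python
def equivalence_states(nodes, neighbors_out, neighbors_in):
    """Group nodes by out-equivalence (identical +neighborhoods), folding
    strictly subsumed nodes into the states of their subsuming nodes."""
    def subsumed(u, v, nbrs):          # u is subsumed by v in one direction
        return nbrs(u) <= nbrs(v)
    # Out-equivalence classes: nodes with identical outward neighborhoods.
    states = {}
    for v in nodes:
        states.setdefault(frozenset(neighbors_out(v)), set()).add(v)
    # Fold strictly subsumed nodes into every subsuming node's state.
    # Note: a node may therefore appear in more than one returned state.
    for u in nodes:
        for v in nodes:
            strict = (subsumed(u, v, neighbors_out)
                      and subsumed(u, v, neighbors_in)
                      and (neighbors_out(u) != neighbors_out(v)
                           or neighbors_in(u) != neighbors_in(v)))
            if strict:
                states[frozenset(neighbors_out(v))].add(u)
    return list(states.values())
```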
B TIME SPENT ON CROSS-POLLINATION
During cross-pollination, for every batch of 𝑚 new snapshots, given 𝑛 existing batches, the fuzzer must perform additional executions on the order of O(𝑚 × 𝑛).
In particular, we define the following quantities:
Time-step 𝑡𝑘: the point in time at which the 𝑘th application round of state inference is performed.
Snapshots 𝑛𝑘 = 𝑛𝑘−1 + 𝑚 = 𝑚𝑘
States 𝑠𝑘 = 𝑠𝑘−1 + 𝛼𝑚 = 𝛼𝑚𝑘 (𝛼: the average reduction ratio)
Cross-tests 𝑐𝑘 = 𝑐𝑘−1 + 𝑚(2𝑛𝑘−1 + 𝑚 − 1) = 𝑛𝑘² − 𝑛𝑘 (the additional executions performed per round)
Since coverage growth is linear in exponential time [6], during the 𝑘th step, at time 𝑡𝑘, the total number of cross-tests performed is O(log²(𝑡𝑘)). Meanwhile, the fuzzer generates and tests new inputs in linear time, diminishing the ratio of time spent on state inference as:
lim_{𝑡→∞} log²(𝑡)/𝑡 = 0
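These recurrences are easy to check numerically. The short sketch below (not part of Tango) prints 𝑛𝑘, 𝑠𝑘, and 𝑐𝑘 for a few rounds and verifies the closed form 𝑐𝑘 = 𝑛𝑘² − 𝑛𝑘, under assumed values of the batch size 𝑚 and the ratio 𝛼.

```python
def inference_cost(rounds=5, m=50, alpha=0.14):
    # alpha here plays the role of s_k / n_k, as in the quantities above.
    n = s = c = 0
    for k in range(1, rounds + 1):
        c += m * (2 * n + m - 1)   # cross-tests added this round
        n += m                     # snapshots: n_k = m * k
        s += alpha * m             # states:    s_k = alpha * m * k
        assert c == n * n - n      # closed form c_k = n_k^2 - n_k
        print(k, n, round(s, 1), c)

inference_cost()
```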
C COST REDUCED BY STATE BROADCAST
Since discovered capabilities are then broadcast to all snapshots within a state, the number of cross-tests becomes:
𝑐̃𝑘 = 𝑐̃𝑘−1 + 𝑚(𝑛𝑘−1 + 𝑠𝑘−1 + 𝑚 − 1) = 𝑚𝑘 ((𝛼 + 1)𝑚(𝑘 + 1)/2 − 𝛼𝑚 − 1)
The approximate reduction in total cross-tests is then:
(𝑘 − 1)/(2𝑘) · (1 − 𝛼), when 1/𝑚 ≪ 1
configuration parameters are often identical across campaigns. We
D BUILT-IN EXTENSIONS therefore isolate each fuzzer instance in a Linux network namespace,
Components in Tango are left open for customization, to accom- allowing them to communicate with their target without aliasing
modate for arbitrary stateful systems beyond network services. The other instances.
framework also encourages re-usability by providing components We also leverage mount namespaces to achieve filesystem iso-
with fixed interfaces, as well as a dynamic component discovery lation. Some targets may store persistent state through local files,
subsystem to ensure that specialized co-dependent modules are influencing other instances of the target across resets. To avoid
instantiated together. Through its supplementary profiling module, that, we mount an overlay filesystem on top of a tmpfs mount
Tango also enables the instrumentation of the fuzzer’s own func- point for storing instance-local data. Then, upon reset, we clear
tions and variables, to improve the debuggability of the fuzzer’s the upper filesystem of the overlay, effectively destroying any
operations and to better attribute feature changes to improvements persistent state left by the target.
in performance.
D.6 Record-and-replay
D.1 AsyncIO Tango ships with a default loader which implements a record-and-
Async I/O is a form of cooperative scheduling, where the application replay mechanism for loading snapshots. Under the reasonable
specifies when control is returned to the scheduler. We built Tango assumption that the target is deterministic, such a loader can re-
as an asynchronous application to enable graceful suspension of liably reproduce paths by relaunching or forking the target and
the fuzzer and extend its compatibility to event-driven systems, re-applying a saved input. More sophisticated snapshot-ing meth-
such as DOOM. ods exist [27] which can be ported for use under Tango; however,
this remains out-of-scope of the current extensions.
D.2 Hotpluggable Inference
We implemented state inference both as a strategy for use in Tango D.7 SanitizerCoverage
and as a plug-in to third-party fuzzers such as AFL++ and Nyx- In our study on state inference, we primarily model features as code
Net. We slightly augment the scheduling routines of those fuzzers coverage profiles, classified into AFL-style bins. To achieve that,
to incorporate the inference results generated by Tango during a Tango implements a CoverageTracker which sets up a shared
fuzzing campaign and provide them with state-specific feedback. memory region for communicating coverage updates and extracting
feature sets. The tracker is equipped with C-based bindings for
D.3 Built-in Extensions performing the binning and hashing with minimal impact on the
Tango ships with a set of complementary modules that enable it to fuzzer’s critical path.
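To illustrate the feature classification (sketched in Python for brevity; the actual tracker uses C-based bindings), hit counts can be folded into the standard AFL-style power-of-two buckets before the resulting (edge, bucket) pairs form a feature set:

```python
def bucket(hits):
    """Classify a raw edge hit count into an AFL-style power-of-two bin."""
    if hits == 0:
        return 0
    if hits == 1:
        return 1
    if hits == 2:
        return 2
    if hits == 3:
        return 4
    if hits <= 7:
        return 8
    if hits <= 15:
        return 16
    if hits <= 31:
        return 32
    if hits <= 127:
        return 64
    return 128

def feature_set(coverage_map):
    """coverage_map: {edge_id: hit_count} read from the shared memory region."""
    return frozenset((edge, bucket(hits))
                     for edge, hits in coverage_map.items() if hits)
```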
D.8 socket+stdio
Tango includes a demonstrative set of channels to communicate with the target over network sockets, such as TCP and UDP, as well as standard input. These channels extend the ptrace functionality to synchronize the state of the socket or file in the target with its counterpart in the fuzzer. By capturing syscalls such as bind and accept, we inject a forkserver at the latest stage in initializing the target.
This achieves around a 50x increase in fuzzing throughput over the non-optimized implementation, since socket setup is often expensive and, otherwise, the fuzzer may only successfully connect to the target by reattempting the operation until it no longer fails.

D.9 Adaptive mutators
We implement an adaptive model for applying havoc mutators that balances exploration and exploitation using the Exp3 [4] algorithm for distributing rewards and assigning probabilities. For each snapshot, we maintain a set of weights describing the probability that a mutator is chosen in that state. Provided a comprehensive set of mutators, this approach accommodates an evolving target by adjusting and applying the probabilities in selecting the next mutator, based on how well each one performs in the state context.
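A minimal sketch of the Exp3 update [4] for per-state mutator selection is shown below. It is a generic textbook formulation rather than Tango's implementation, and the reward signal (e.g., whether a mutation produced new feedback) is left to the caller.

```python
import math
import random

class Exp3MutatorBandit:
    """Exp3 over a fixed list of mutators, one instance per state/snapshot."""
    def __init__(self, n_mutators, gamma=0.1):
        self.gamma = gamma
        self.weights = [1.0] * n_mutators

    def probabilities(self):
        total = sum(self.weights)
        k = len(self.weights)
        return [(1 - self.gamma) * w / total + self.gamma / k
                for w in self.weights]

    def choose(self):
        # Sample a mutator index according to the current mixed distribution.
        return random.choices(range(len(self.weights)),
                              weights=self.probabilities())[0]

    def update(self, chosen, reward):
        # reward in [0, 1]; importance-weight it by the selection probability.
        p = self.probabilities()[chosen]
        x = reward / p
        self.weights[chosen] *= math.exp(self.gamma * x / len(self.weights))
```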
E CASE STUDY: DOOM
Tango is a framework for building state-aware fuzzers, and the main witness of its merit would be using it to develop a state-aware fuzzer for a complex stateful system. In the spirit of hacker culture, we opted to answer the question "Can it run DOOM?" with a resounding "Yes!", using Tango.

E.1 Setup
To support DOOM, we extended Tango in the following aspects:
Driver: We implemented an X11Channel component which sends keystroke events to a process's window, through a public Python library, python3-xlib.
Generator: We added Activate, Kill, Move, Reach, Rotate, and Shoot instructions, extending the input base class. These govern how the player's character interacts with its environment, and are later used by the input generator and the exploration strategy to maneuver around the map and overcome obstacles.
We limited the functionality of the input generator to selecting a possible outgoing or incoming transition of the current state and passing it on to the mutator, which mainly mutates the direction and duration of movement, to explore the level. We also provided it with two helper functions that can yield the correct sequence of commands to follow a path or to aim and shoot at a moving target by incorporating live feedback.
Tracker: We implemented state feedback as a shared-memory struct populated by DOOM and accessed by Tango. The struct contains all basic user properties such as location, weapons, ammo, pickups, enemies in sight, and doors or switches within reach. Two states are considered equivalent if they have the same player position. We attached extended state variables to each state that describe the pickups collected along the current path, and state attributes describing the current location (e.g., whether it is a slime pit or a secret level). To avoid having a unique state for every single position on the map, the level is divided into a grid of cells, representing the granularity of feedback, as shown in Fig. 14a.
Loader: We extended the loader's two main functions: restarting the target and loading a state. Restarts are simple, as they are only a matter of terminating and relaunching the process. Loading a state is slightly more complex: without a means of snapshotting, actions must be replayed to reach a certain known location, given that the state graph contains at least one path to it. However, a downside of this approach is that nearby states (locations) may be reachable through a bee-line movement, whereas the input generated and discovered by the fuzzer to transition between these two states may involve redundancies that impact fuzzer throughput. Another downside is that, if the current location and the target location are close to each other yet far enough from the spawn location, restarting the level from the spawn location would be inefficient. The player may simply need to move a few steps in the direction of the target to reach it. Moreover, continuously restarting the target breaks the immersion of the fuzzer "playing" the game, and it would instead spend much of its time replaying actions from the start. To avoid that, we implemented path-finding algorithms, based on the state graph explored by the fuzzer, to move between two locations using a sequence of Reach instructions. In essence, to load a state, a path to it from the current state is calculated, and Reach instructions are performed piece-wise along every transition in the path.
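The state-loading procedure can be sketched as a shortest-path search over the explored state graph; the graph structure and the emit_reach helper below are hypothetical placeholders for Tango's state graph and Reach instruction.

```python
from collections import deque

def plan_reach_sequence(graph, current, target):
    """BFS over the explored state graph (cell -> iterable of adjacent cells).
    Returns the list of intermediate cells to visit, each of which is then
    translated into a Reach instruction executed piece-wise."""
    parents, frontier = {current: None}, deque([current])
    while frontier:
        cell = frontier.popleft()
        if cell == target:
            path = []
            while parents[cell] is not None:      # walk back to the start
                path.append(cell)
                cell = parents[cell]
            return list(reversed(path))
        for nxt in graph.get(cell, ()):
            if nxt not in parents:
                parents[nxt] = cell
                frontier.append(nxt)
    return None                                    # target not yet reachable

# Hypothetical usage:
# for cell in plan_reach_sequence(state_graph, here, there): emit_reach(cell)
```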
Strategy: Finally, we implemented a ZoomStrategy component, to tie it all together, that schedules states based on a convex hull of the explored locations and prioritizes locations on the perimeter that are furthest from the start. In addition, the strategy implements an event observer task that is responsible for reacting to urgent events such as seeing an enemy or stepping into slime pits. By preempting the fuzzer's main loop, the strategy minimizes the reaction time to increase the survivability of the player.

E.2 Results
With these extensions, Tango consistently manages to finish the E1M1 level of DOOM, on difficulty 3, in 10 to 40 wall-clock minutes. The main factor contributing to this variability is perimeter exploration. As can be seen in Fig. 14b, in one recorded fuzzing session, the fuzzer encountered a big undiscovered area on the left side of the map and dedicated a significant amount of time to exploring it. It also managed to discover the path up the stairs to the higher platform, where it found a level 1 armor pickup. In other runs, due to the stochastic nature of the fuzzing process, it may miss that area completely and continue exploring in the immediate direction of the exit door, achieving a lower overall finish time in the process.
Having found the armor pickup, Tango maintains a record of it in its later exploration stages. As can be seen in Fig. 14c, its path to the finish line includes going up the stairs, picking up the armor boost, and returning on another path to the exit room.
Regardless of the overall finish time, once Tango finds the path to the exit, it consistently manages to follow it in 3 to 4 in-game minutes. While far from typical speed-runs for this level (which are as low as 9 seconds), it remains a formidable achievement to be able to explore the state space of a DOOM level and manage to finish it in a sensible amount of time.
Figure 14: The progress of Tango in playing DOOM. Red shading implies higher hit counts. (a) Grid view of the DOOM map. (b) Heatmap of the visited cells during the first 10 minutes of fuzzing DOOM's E1M1 level. (c) Heatmap of the visited cells after 30 minutes of fuzzing. Figure 14c shows that the fuzzer had figured out a path to the finish and continues to repeat it to achieve the lowest completion time.
Figure 15: The time of cross-testing per target. Panels show, per target, the percentage of time spent on cross-testing over the course of the campaign, for batch sizes m = 10, 20, 50, and 100, with optimizations enabled (All) or disabled (None). Targets: (a) bftpd, (b) dcmtk, (c) dnsmasq, (d) exim, (e) expat, (f) kamailio, (g) lightftp, (h) live555, (i) llhttp, (j) openssh, (k) openssl, (l) proftpd, (m) pureftpd, (n) tinydtls, (o) yajl.