Tango: Extracting Higher-Order Feedback through State Inference

Ahmad Hazimeh∗   Duo Xu   Qiang Liu†
[email protected]   [email protected]   [email protected]
EPFL; BugScale   EPFL   EPFL
effort and is often skipped in favor of readily available feedback like code coverage. While alternative techniques attempt to extract state variables from certain types of targets, with varying success [5, 20, 25], code coverage remains the preferred mode of instrumentation.

However, in the absence of labels, a fuzzer may be misguided. Consider the example of an unlabeled maze in Fig. 1b. In a running session, we assume the fuzzer can request one move at a time, and its feedback is restricted to “whether or not the player moved”. Without knowing the label of the current cell, the fuzzer attributes the feedback only to the last generated move. For instance, if the fuzzer arrives at cell 28 through an upward move from 27, it would consider «upward» as an interesting direction and may select it more often. Yet, whether or not a player can move depends not only on the attempted move but also on its surroundings. In another iteration, if the fuzzer starts from cell 5, moving upward would yield no interesting results. Unaware of its surroundings, the fuzzer quickly exhausts the set of interesting behaviors it can observe, resulting in a random walk in exploration.

Nonetheless, we observe that such boolean feedback remains useful to model the player’s local surroundings. Having found a few initial paths, the fuzzer can extract characteristics of the cells it has arrived at by trying out, at each path, all the different interesting moves it has discovered so far. A complete exploration would yield the classification represented in Fig. 1b: cells are annotated by the possible set of paths that can be followed through one move from each cell. All paths known to the fuzzer then fall into one of 14 categories based on their surroundings. In essence, the fuzzer measures the response pattern of each cell to a set of known inputs. This allows it to group its known paths by their common characteristics, e.g., paths that lead to a cell in a vertical corridor. Increasing the number of steps yields a more accurate classification such that, in the limit, each cell maps uniquely to its label. Through this process, the fuzzer extrapolates multi-dimensional feedback from a uni-dimensional metric to guide its exploration.

On the other hand, stateful fuzzers [15, 24, 26] introduce a key feature for exploring complex systems: resumability. It entails the ability to restore the target to a certain state and use that state as a starting point for further fuzzing. They achieve that through restore points, referred to as snapshots, which span different granularities, from whole-system VM snapshots, through process restore points, to record-and-replay techniques. Essentially, at each snapshot, the target occupies an implicit state as a result of the path traversed by the input. Perfect resumability ensures the reproducibility of behaviors in their respective states, and allows the fuzzer to maintain its progress while exploring different paths.

In this paper, we propose state inference to address the challenges arising from fuzzing stateful systems. State inference is a technique to produce groupings of snapshots that occupy the same implicit system state, based on similarities between input-response pairs. The key idea is to cross-test snapshots against inputs and observe overlaps in their responses. By grouping snapshots that occupy the same state, the fuzzer can better model the relations between snapshots and distribute energies more equally among the inferred states. As a result, the fuzzer can schedule from the state queue first, to ensure an even exploration, and cycle faster through discovered behavior.

We implemented state inference on top of Tango, our versatile framework for stateful fuzzing. Additionally, we extended current state-of-the-art fuzzers, i.e., AFL++ and Nyx-Net, with state feedback from Tango. Our evaluation indicates that state inference significantly improves seed scheduling by effectively mapping snapshots to states, achieving an average reduction of 86.02% to 87.76% in the size of a fuzzer’s scheduling queue when fuzzing network servers and parsers. Much like solving a maze, fuzzing DOOM (see Appendix E) also benefits from the knowledge of stateful information, e.g., the player’s and enemy’s positions, the remaining amount of ammo, and health points. By incorporating stateful feedback, Tango solves the E1M1 level in DOOM within 30 minutes and subsequently replays it in 3 minutes. We summarize our key contributions as follows:
• Introduction of state inference, which groups snapshots that belong to the same state based on their responses to inputs.
• Design of Tango, a modular framework for fuzzing stateful systems in a state-aware manner.
• Implementation of state-aware scheduler extensions for AFL++ and Nyx-Net that leverage inferred states to reduce wait times and disperse feedback, demonstrating Tango’s effectiveness in analyzing complex systems.
• Open-source access to our framework and results to foster adoption and provide value to the community. Tango is available at https://github.com/HexHive/tango.

2 CHALLENGES TO STATEFUL SCHEDULING

The path explosion in fuzzing quickly degrades the performance of a fuzzer. Performance becomes even worse when fuzzing a stateful target, due to the dependence on the system’s state. The key to tackling path explosion in stateful fuzzing lies in more efficient scheduling, which faces three core challenges.

[Figure 2: state machine of an FTP server, with states CLOSED, WAIT_USER, and WAIT_PASS, a start marker, and transitions accept(), USER, and PASS (incorrect).]

Feedback Attribution: The fuzzer gradually develops a model of the target through collected feedback, incorporating it into its input generation for further exploration. Consider the example of an FTP server in Fig. 2. If the fuzzer initially sends a correct sequence of USER-PASS commands, it ends up in the AUTHED state, where control and data transfer commands are accepted. Now in the AUTHED state, the fuzzer sends a control command,
e.g. PWD, and receives positive feedback reinforcing the use of PWD in future iterations. However, the PWD command is only valid in an authenticated session context. Starting the session with any command other than USER yields a completely different behavior. Thus, without accounting for state, the fuzzer’s model of the target receives conflicting or misleading feedback.

Exploration: The discovery of an interesting input can expose many new paths to the fuzzer, since that input may set up the context required to traverse those paths. Following the FTP example, if the fuzzer saves the USER-USER-PASS-PWD sequence as an interesting input, it may apply further mutations to it, leading to an input like jUnK-PASS-PWD. Unaware of the state set up by the USER command, the fuzzer mutates and destroys that part of the input, trapping itself in an error path. To avoid this loss of progress, the fuzzer should treat previous inputs as part of the state setup. Seen differently, the history of interactions should be considered as one way to restore the current state in the target, e.g. through record-and-replay. Fuzzing is then performed incrementally along one path of exploration. Note that, while generating invalid inputs is equally important in fuzzing, this methodology does not rule out the possibility of doing so. To test for an invalid command, it suffices to start from an empty prefix path, or alternatively, only mutate past a selected prefix. Both ensure that the state set up by the prefix is not destroyed.

Soundness: If an input triggers a crash, it may not be sufficient to generate a reproducer from that input alone, since the target may have crashed due to previously accumulated state. To guarantee the soundness (reproducibility) of crashes, the fuzzer must produce an input to build up state, which requires the fuzzer to track the millions of consumed inputs within the lifetime of the target.

These core challenges can be addressed by treating “state” as a first-class citizen and anchoring the fuzzer’s operations around it. Introducing this new dimension to fuzzing requires careful consideration and handling in the form of snapshot management, state modeling, and state-aware behavior.

2.1 Snapshot Management

For stateful fuzzing, it suffices that the target persists between successive fuzzing iterations. This allows the target to build up state through processing the consumed inputs. Nevertheless, to tackle the aforementioned challenges with feedback attribution, exploration, and soundness, stateful fuzzers typically snapshot their progress incrementally. Through these snapshots, the fuzzer can then save and restore a previously discovered path.

When fuzzing stateful targets, it is common to maintain a tree of snapshots, where each node has successors representing other snapshots. The tree is constructed by appending new snapshots as children of the current node being fuzzed whenever new feedback, e.g., control-flow edges, is discovered. Stateful fuzzers built on this idea, such as Nyx-Net [27], AFLNet [24], NSFuzz [25], and SGFuzz [5], manage their snapshot tree differently. Nyx-Net maintains the tree in the input corpus itself. In contrast, fuzzers like SGFuzz, AFLNet, and NSFuzz rely on state labels. Nodes then represent unique values of the state variable or response code, and within each node, the system maintains a set of inputs that allow the fuzzer to restore the target to a snapshot of that state. The approximate state graph is then obtained by merging tree nodes sharing the same state labels.

[Figure 3: a snapshot tree for the FTP server: 𝐿0 (WAIT_USER) →USER→ 𝐿1 (WAIT_PASS) →USER→ 𝐿2 (WAIT_PASS) →PASS→ 𝐿3 (AUTHED); a further PASS edge to AUTHED is also shown.]

2.2 State Modeling

State is a semantic identifier of the system’s dynamic nature. Any event or interaction with the system may update its state and modify its behavior. To measure state, three approaches exist. First, some implementations observe global variables as explicit state identifiers. However, this does not constitute a generic abstraction over states, as there are often other contributing factors. Second, since the state of a system boils down to variable memory contents which influence its responses, such as the contents of the call stack, the heap, or function-local variables, there are attempts at isolating and capturing those variables as state feedback [5, 20, 25]. However, their approaches either under-approximate or over-approximate the state. Third, it is often easier for a developer who is familiar with the target to specify their own definition of state that fits the testing goals they are trying to achieve.

Recovering the behavioral model of a system is not straightforward. To recover states and the relations between them, AFLNet requires patching the target to augment its responses such that state identifiers are explicitly indicated through the response codes. It also requires that the fuzzer is aware of how those identifiers can be extracted from the received responses. Alternatively, SGFuzz and NSFuzz instrument global variables as state indicators, but they often require manual effort to filter out noise by adding irrelevant variables to an elaborate ignore list. They also fundamentally assume that state can be consolidated to global variables, when in fact, it can span any mechanism for managing persistent memory contents, such as the call stack, function-local variables, or heap objects. On the other end of the spectrum, Nyx-Net foregoes state identification and relies solely on its high reset rates to explore more of the input space, following the blackbox fuzzing school of thought that favors execution speed over introspection.

Precise state recovery is orthogonal to fuzzing, since the implementation may occupy implicit states or diverge from the prescribed protocol. Although such divergence is interesting for fault detection, it is only discernible where the specification is available, or when a baseline is used for comparison (as is the case of differential testing [18]).
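To make the snapshot-tree mechanics of Section 2.1 concrete, the following minimal sketch (our own illustration, not Tango's actual data structures; `Snapshot` and `extend` are hypothetical names) appends a child only when an input elicits new feedback, and keeps the input prefix that restores the snapshot:

```python
# A minimal sketch of a snapshot tree: a child snapshot is taken only when
# an input elicits new feedback, and each snapshot keeps the input prefix
# that rebuilds it (the record-and-replay flavor of resumability).
from dataclasses import dataclass, field

@dataclass
class Snapshot:
    state: str                    # implicit state label, e.g., "WAIT_PASS"
    prefix: tuple = ()            # inputs that rebuild this snapshot
    children: list = field(default_factory=list)

    def extend(self, inp, new_feedback, state):
        """Append a child snapshot iff the input triggered new feedback,
        e.g., a previously unseen control-flow edge."""
        if not new_feedback:
            return None
        child = Snapshot(state, self.prefix + (inp,))
        self.children.append(child)
        return child

# Rebuilding the USER-USER-PASS path of the FTP example:
root = Snapshot("WAIT_USER")
l1 = root.extend("USER", True, "WAIT_PASS")
l2 = l1.extend("USER", True, "WAIT_PASS")   # error handler: new edge, same state
l3 = l2.extend("PASS", True, "AUTHED")
```

Replaying `prefix` against a fresh target restores the implicit state without re-exploring the whole path, which is what lets fuzzing proceed incrementally from any node.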
The Over-shadowed Seed: Even though several approaches have been proposed to model states, existing fuzzers still struggle to find the over-shadowed seed. Consider again the example of the FTP server in Fig. 3. When the fuzzer first sends the USER command, it takes a snapshot 𝐿1 (in the WAIT_PASS state), since it observes new feedback relating to the discovery of a new command. The fuzzer then follows with USER again, taking a new snapshot 𝐿2 due to executing the error handler in WAIT_PASS, and it remains in that state. Sending the PASS command then results in 𝐿3 (in the AUTHED state). The fuzzer has now discovered a path to the AUTHED state as USER-USER-PASS. Notably, the fuzzer discovered the WAIT_PASS state “twice”, but the AUTHED state only once. Depending on the type of feedback collected by the fuzzer, however, it may also be incapable of discovering the USER-PASS sequence from 𝐿1, since the feedback for discovering the PASS command had already been attributed to 𝐿3, and it is no longer considered an interesting signal for taking a snapshot. In effect, the discovery of the USER-PASS sequence was over-shadowed by USER-USER-PASS due to overlapping feedback. This phenomenon stacks up the snapshots and dilutes the scheduling pool.

Observation: Yet, despite 𝐿1 and 𝐿2 traversing different paths in the target, we know they represent the same state: the target expects a PASS in either case to transition to the next state. If we assume 𝐿3 was never created, and we send PASS at 𝐿1, we would discover AUTHED again and would create a snapshot 𝐿3′. Conveniently, through this approach, the fuzzer discovers a shortcut to AUTHED requiring the minimal number of commands to reach it.

This observation is not limited to authentication routines but generalizes to any stateful system whose behavior is primarily influenced by its inputs. By probing the system at a state with different inputs and measuring its responses, we can develop a model of the state it is in. If two snapshots share the same responses across all tested inputs, we can then consider them as belonging to the same state, as far as the fuzzer is concerned.

2.3 State-aware Operation

The behavior of a stateful target changes depending on the state of the system, implying that inputs consumed by the target must be generated by accounting for the current state. Beyond using snapshots as a prefix for incremental exploration, none of the mentioned state-of-the-art fuzzers incorporates state into its operations, such as mutator schedules or state-specific dictionaries. Incorporating feedback in a state-specific manner helps develop a more accurate model of the target in each state, better guiding the fuzzer for exploring further within each state. Granted, feedback can become overwhelming for the fuzzer [9, 11, 30]. State-dependent feedback can be even more challenging to handle, as the target’s behavior evolves dynamically, necessitating that feedback be incorporated dynamically as well. Nonetheless, to make the most of the collected feedback, a state-aware fuzzer should treat state as a distinct dimension for its operations across all phases of the fuzzing process.

3 STATE INFERENCE

State inference models the behavior of the target through its responses to a set of inputs. Specifically, we leverage the snapshot grouping to model the target’s states. At each snapshot, the target occupies a unique hidden state that drives its behavior and influences its response to inputs. For a stateful system, we posit that two snapshots of the system occupy the same state—as far as the fuzzer is concerned—if both snapshots share the same response pattern across all tested inputs, given a sufficient number of samples. We use this insight to develop a systematic approach for evaluating snapshots, grouping them by their observable response patterns. This grouping thus yields a hierarchy of states and snapshots that enables fairer and more targeted seed scheduling.

Definition 3.1. Feature and Feature Map: A feature is a measurable quantity that is influenced by state, e.g., edge hit count. A feature map is a mapping from a feature identifier, e.g., edge label, to its measured value.

Definition 3.2. Response and Response Pattern: A response is a value instance of the feature map obtained by executing the target with a given input. For example, with the edge coverage map as the feature map, if three new edges are covered due to the given input, three is the response. Importantly, in our implementation, we also consider the edge’s hit count as part of the response. A response pattern is a set of features triggered while executing an input at a snapshot. For example, the three edges newly triggered with the given input form a response pattern.

State inference has three steps. First, this grouping requires additional executions to probe the target along all the interesting paths discovered by the fuzzer. These executions are named cross-pollination (Section 3.1). Fortunately, however, the benefits can outweigh the overhead costs due to the nonlinearity of exercising new coverage: while fuzzing iterations grow in linear time, new features are only discovered in exponential time [6]. This means that, as the fuzzing campaign progresses, the cost of state inference is amortized over the time spent by the fuzzer between successive coverage findings. Meanwhile, the fuzzer can leverage the learned state model and the relations between snapshots for better input and mutator scheduling. Second, the grouping after cross-pollination is based on the capability matrix that stores the information obtained by cross-pollination. We come up with three operators, i.e., subsumption, elimination, and colorful collapse, to group snapshots into states (Section 3.2). Third, having obtained the static capability matrix, we have to update it during fuzzing so that the fuzzer continuously benefits from the refined feedback (Section 3.3).

3.1 Cross-Pollination

Cross-pollination seeks hidden capabilities by re-applying the same input at different snapshots and observing overlaps in feedback. Fuzzers usually discover and record inputs that trigger new features within the target. Working under this assumption, every recorded input is guaranteed to elicit a response in at least one snapshot that was active at the time the input was first discovered. Using the battle-tested mechanism of a cumulative global feature map, where observed features are only considered interesting the first time they are encountered, the fuzzer is incapable of rediscovering the same feature across different contexts. In other words, if the same response can be observed from different snapshots, the fuzzer will attribute only one of those snapshots with the discovery of that response.

We illustrate this procedure with an example: consider the system with the state graph prescribed in Fig. 4a. After some simulated rounds of fuzzing, we arrive at the snapshot tree depicted in Fig. 4b. Whereas the fuzzer observed unique features—namely, state transitions in the system’s finite state machine (FSM)—that resulted in this snapshot tree, we note that multiple snapshots occupy the same state. Unaware of this overlap, the fuzzer would schedule each of these snapshots individually and would integrate feedback into state-agnostic models. This results in duplicate efforts, as the fuzzer wastes many cycles on testing the same states along different paths.

Tango: Extracting Higher-Order Feedback through State Inference RAID 2024, September 30–October 02, 2024, Padua, Italy

[Figure 4b graphic: a snapshot tree over 𝐿0–𝐿8; nodes include 𝐿4 : 𝑆2, 𝐿5 : 𝑆0, 𝐿6 : 𝑆3, and 𝐿8 : 𝑆1, with edges labeled 1/5, 0/2, and 1/3.]

(b) One example of a snapshot tree constructed by the fuzzer. Each node is annotated with the implicit state 𝐿 that the system occupies at that snapshot and its ground-truth state 𝑆. Each edge is annotated with the given input and the response pattern.

[Figure 4c graphic: the tree with nodes 𝐿0 : 𝑆0, 𝐿1 : 𝑆0, 𝐿2 : 𝑆1, 𝐿3 : 𝑆1, and 𝐿4 : 𝑆2, with edges labeled 1/3, 0/2, and 0/4.]

(c) An example of the snapshot tree after the cross-pollination. New capabilities are found, such as 𝐿0 ∈ 𝑁−(𝐿2), 𝐿1 ∈ 𝑁−(𝐿1), 𝐿2 ∈ 𝑁−(𝐿4), 𝐿3 ∈ 𝑁−(𝐿3), and 𝐿3 ∈ 𝑁−(𝐿4).

Figure 4: A running example of cross-pollination.

However, given this initial snapshot tree, the fuzzer can leverage cross-pollination to discover hidden relations between snapshots, allowing it to better model their overlap and optimize its exploration. Specifically, given enough initial input samples, the behavior of a snapshot can be modeled by measuring its different response patterns. To discover the capabilities of all snapshots, we iterate over all inputs in the snapshot tree and apply them at every snapshot, measuring its response pattern, and recording its ability to reproduce the associated response.

3.2 Snapshot Grouping

To continue grouping snapshots, we record all the capabilities in a capability matrix.

Definition 3.4. Capability Matrix: A capability matrix is a matrix of snapshots’ capabilities based on the responses to inputs, and a non-empty cell ⟨𝐿𝑖, 𝐿𝑗⟩ stores the input that triggers a known response if 𝐿𝑖 ∈ 𝑁−(𝐿𝑗)¹ holds. The left part of Fig. 5 shows a capability matrix with multiple capabilities that have been inferred.

3.2.1 Subsumption. The populated capability matrix provides insight into which snapshots overlap in behavior, and consequently, how distinct snapshots are. This knowledge is essential for better guiding the scheduler toward exploring unique functionality without duplicating efforts and stalling progress. To find that overlap, we develop the subsumption operator over graph vertices (≺). In short, a node 𝑢 is subsumed by 𝑣 iff 𝑣 can replace 𝑢 without affecting reachability, i.e., 𝑣 has at least all the same edges as 𝑢. Taking Fig. 5 as an example, we observe that {𝐿0, 𝐿1, 𝐿5}, {𝐿2, 𝐿3, 𝐿8}, {𝐿4, 𝐿7}, and {𝐿6} form sets of snapshots with mutually overlapping responses. Each set of snapshots is then called an equivalence state. Notably, equivalent snapshots may traverse different paths through the system and thus cannot be pruned solely through power schedules [7], since state information is not captured by code coverage alone. But any snapshot belonging to the same equivalent state overlaps, and thus can be safely eliminated, without loss of capabilities. Alternatively, to avoid inadvertently losing quality inputs, we choose to include strictly subsumed snapshots in the equivalence states of the subsuming nodes. This ensures that the fuzzer experiences the same reduction in state counts while maintaining access to all interesting snapshots. We present the formal definitions and discussions of subsumption operators in Appendix A.

3.2.2 Elimination. In practice, a fuzzer may also encounter snapshots whose response sets are proper subsets of others’ response sets. From the fuzzer’s perspective, such snapshots are less capable and thus not worth dedicating a scheduling slot for, despite having non-conforming behavior. Any response elicited in that snapshot can be reproduced in another having at least one additional capability. By identifying such states and eliminating them, the fuzzer further reduces the size of the scheduling queue and improves the dissemination of feedback.

¹For a directed graph G = (V, E), a vertex u is an in-neighbor of a vertex v if (u, v) ∈ E, and an out-neighbor if (v, u) ∈ E.
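The cross-pollination and grouping steps can be sketched as follows; this is a simplified model under our own assumptions (responses reduced to plain sets of feature identifiers, and `respond` a hypothetical helper that replays one input at one snapshot), not the paper's implementation:

```python
# Sketch of cross-pollination and snapshot grouping: replay every recorded
# input at every snapshot, then collapse identical capability sets into
# equivalence states and eliminate strictly less capable ones.
def cross_pollinate(snapshots, inputs, respond):
    """Record each snapshot's capability set: the union of the feature
    sets it produces across all known inputs."""
    return {s: frozenset(f for i in inputs for f in respond(s, i))
            for s in snapshots}

def group_states(capabilities):
    """Subsumption: snapshots with identical capability sets collapse into
    one equivalence state. Elimination: states whose capability set is a
    proper subset of another's are dropped from the scheduling pool."""
    states = {}
    for snapshot, caps in capabilities.items():
        states.setdefault(caps, set()).add(snapshot)
    return {caps: members for caps, members in states.items()
            if not any(caps < other for other in states)}

# Toy responses mirroring Fig. 3: L1 and L2 both await PASS, so they share
# a response pattern and occupy the same inferred state; L4 is a
# hypothetical snapshot whose capabilities are a proper subset.
RESPONSES = {
    ("L1", "PASS"): {"auth_ok"}, ("L2", "PASS"): {"auth_ok"},
    ("L1", "USER"): {"err"},     ("L2", "USER"): {"err"},
    ("L4", "USER"): {"err"},
    ("L3", "PWD"):  {"pwd_ok"},
}
respond = lambda s, i: RESPONSES.get((s, i), set())
caps = cross_pollinate(["L1", "L2", "L3", "L4"], ["USER", "PASS", "PWD"], respond)
states = group_states(caps)
```

In this toy run, {L1, L2} and {L3} survive as equivalence states, while L4 is eliminated because every response it can elicit is also available from the more capable L1/L2 group.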
𝐿0 𝐿1 𝐿2 𝐿3 𝐿4 𝐿5 𝐿6 𝐿7 formpropose
We of skipped
threetests, thereby reducing
optimizations (one forthe number
each of resets
quadrant) to inand
the
A B
𝐿0 executions required
form of skipped forthereby
tests, each round of inference.
reducing the number of resets and
executions required for each round of inference.
𝐿1
𝐿2 4.1 State Broadcast (𝑂 𝐵 )
4.1 State Broadcast (𝑂 𝐵 )
𝐿3 Suppose the inferred states in 𝑄 𝐴 are correct once constructed, then
Suppose the inferred states in 𝑄 𝐴 are correct once constructed, then
C D
cross-testing the capabilities of individual snapshots within the
𝐿4 cross-testing the capabilities of individual snapshots within the
same state becomes unnecessary. Working under that assumption,
𝐿5 same state becomes unnecessary. Working under that assumption,
equivalent snapshots always have the same capabilities, which we
equivalent snapshots always have the same capabilities, which we
𝐿6 can evaluate as a property of the state they belong to. For a state
can evaluate as a property of the state they belong to. For a state
with 𝑁 snapshots, it thus suffices to perform at most 𝑚 cross-tests,
𝐿7 with 𝑁 snapshots, it thus suffices to perform at most 𝑚 cross-tests,
instead of 𝑚 × 𝑁 . Discovered capabilities in 𝑄 𝐵 are then broadcast
instead of 𝑚 × 𝑁 . Discovered capabilities in 𝑄 𝐵 are then broadcast
to all snapshots within a state, reducing the number of cross-tests
Figure 6: The capability matrix of the second round, with to all snapshots within a state, reducing the number of cross-tests
(See Appendix C for a formal analysis of its reduction). In applying
(See Appendix C for a formal analysis of its reduction). In applying
𝑚 = 4 new snapshots. The shaded quadrant 𝑄 𝐴 contains the state broadcast as an optimization, the overhead of inference is
state broadcast as an optimization, the overhead of inference is
populated matrix from the previous round. The 𝑄 𝐵 and 𝑄 𝐷 distributed among states, rather than snapshots. This results in
distributed among states, rather than snapshots. This results in
quadrants are overlaid with edges—marked in green—from higher prediction accuracy for “uncommon states”, i.e., those with
higher prediction accuracy for “uncommon states”, i.e., those with
the latest adjacency matrix. Cross-pollinated capabilities are fewer snapshots, which are arguably of more interest to the fuzzer.
fewer snapshots, which are arguably of more interest to the fuzzer.
marked in blue. 𝑄𝐶 is always initially empty since old snap-
shots cannot have new predecessors.
4.2 Response Fingerprinting (𝑂𝐶 )
While we present our approach as an independent post-discovery After one round of state inference, we obtain a capability matrix
step, state inference can also benefit from the fuzzer itself since the in 𝑄 𝐴 . Each state and its associated capability set serve as labeled
fuzzer spends much time on generating inputs that do not trigger data for training a decision tree (DT) classifier. Such a classifier can
new features. Nonetheless, the feedback is often non-empty: these optimally divide the input space, enabling us to primarily cross-test
uninteresting inputs likely elicit known response patterns that may inputs in 𝑄𝐶 which maximize information gain.
overlap those of other snapshots. This observation allows the fuzzer We fit a DT over this training data to infer response fingerprints:
to dynamically extend the capability matrix at a minimal cost: the minimal subsets of capability sets which are characteristic identi-
computational overhead of matching. fiers of states. The DT classifier yields a binary tree, where each
internal node tests for a capability, such that nodes closer to the
OVERHEAD,
4 Overhead, OPTIMIZATIONS,
Optimizations, AND
and Trade-offs root yield higher information gain. Leaf nodes consequently carry
TRADE-OFFS
To uncover hidden capabilities, state inference relies on cross- a value indicating the most likely candidate state.
validation,
To uncovera technique that is notorious
hidden capabilities, state for its quadratic
inference reliescomplexity.
on cross- Following this procedure, for each new snapshot in the inference
Applying
validation,state inference
a technique thatinisbatches
notorious of for
𝑚 new snapshots
its quadratic requires
complexity. batch (𝑄𝐶 ), we traverse the DT from the root and query it for the
that the fuzzer
Applying state discovers
inference 𝑚 in new response
batches of 𝑚 new patterns. But coverage
snapshots requires next capability to test, until we hit a leaf node. At this point, the
growth
that theisfuzzer
linear discovers
in exponential𝑚 new time [6]: to discover
response patterns.𝑚 new response
But coverage candidate state is identified. The tested capabilities along the path
patterns,
growth isthe linearfuzzer spends on the
in exponential timeorder
[6]: toofdiscover
exp(𝑚) 𝑚 more
newtime than
response form a subset of the capability set of the candidate. Nonetheless,
for the lastthe
patterns, batch.
fuzzer Asspends
the fuzzer
on thecontinues
order oftoexp(𝑚)
progress,morethe time
overhead
than to reduce false positives and satisfy the rules of subsumption, we
cost of last
for the statebatch.
inference
As theis fuzzer
amortized over the
continues executions
to progress, theperformed
overhead extend the testing to all non-empty responses in the candidate’s
between
cost of state rounds (See Appendix
inference is amortized B for
overa more formal analysis).
the executions performed In capability set.
the meantime,
between rounds it (See
reapsAppendix
the benefits of amodeling
B for more formal stateanalysis).
relations In in Consider the second round in Fig. 6. We fit a DT over the capa-
scheduling
the meantime, anditinreapsgeneration. We assess
the benefits these costs
of modeling stateand benefits
relations in bility sets of 𝑆˜0 : ⟨𝐿1, 𝐿2 ⟩ and 𝑆˜1 : ⟨𝐿3 ⟩. In this simplistic example, a
through
scheduling ourand evaluation in Section
in generation. We 6.2 and these
assess Section 6.3.and benefits
costs test on any of {𝐿1, 𝐿2, 𝐿3 } is enough to yield a classification. With
The ramp-up
through our evaluationcost of state inference
in Section is, however,
6.2 and Section 6.3. high. A fuzzer the given labels, it is thus sufficient to perform one test, instead of
finds
The theramp-up
most coveragecost of in theinference
state first few epochs of thehigh.
is, however, campaign, ne-
A fuzzer four, to classify a new snapshot. A snapshot that passes the test has
cessitating
finds the most frequent
coverage in the first few epochs of the campaign, necessitating frequent rounds of cross-testing. Whereas the overhead tapers off at the tail, the initial costs can overwhelm the fuzzer, making it spend most of its time in the beginning just on inference, and bringing its progress to a slow halt. The cost of ramping up greatly varies with initial seed coverage, the complexity of the target, and the execution speed, among other variables.

To address the ramp-up cost, we propose several optimizations that reduce the overhead of state inference, at the cost of some accuracy in grouping snapshots. The prescribed cross-testing procedure requires that all cells of Fig. 6 outside of 𝑄𝐴 be tested for adjacency. We propose three optimizations (one for each quadrant).

a capability set that matches that of 𝑆˜1, making it at least equivalent. However, failing the test implies an empty capability set, yet the classifier would predict 𝑆˜0. To properly test for equivalence, we must test against 𝑆˜0's entire capability set in the least (for a total of three tests). This allows us to properly classify 𝐿5 in Fig. 6. Note that, if the capability set of a new snapshot 𝐿𝑢 matches that of the candidate state 𝑆˜𝑣 using DT classifiers, we can only infer that 𝑆˜𝑣 ⪯+ 𝐿𝑢, since we do not have information about what we did not test. Whether or not 𝑆˜𝑣 ≺ 𝐿𝑢, the result is then the same: a state 𝑆˜𝑤 = 𝑆˜𝑣 ∪ {𝐿𝑢}. Combined with coloring, this approach ensures that state inference remains consistent under sparsity.
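To make the grouping step concrete, the following sketch groups snapshots by their observed capability sets and tests strict subsumption (cf. the ≺ operator in Appendix A). The data layout, names, and example capabilities are our own illustration, not Tango's actual API.

```python
def group_equivalence_states(caps):
    """Group snapshots whose observed capability sets match exactly.

    caps: dict mapping a snapshot label to a set of observed capabilities
    (e.g., transitions exercised during cross-pollination). Snapshots with
    identical sets land in the same equivalence state.
    """
    states = {}
    for snap, c in caps.items():
        states.setdefault(frozenset(c), set()).add(snap)
    return list(states.values())

def strictly_subsumed(caps, u, v):
    """u is strictly subsumed by v when u's capabilities form a
    proper subset of v's."""
    return caps[u] < caps[v]

# Hypothetical snapshots and capabilities, for illustration only:
caps = {
    "L0": {"connect"},
    "L1": {"connect"},           # behaves like L0: same capability set
    "L5": {"connect", "login"},  # strict superset: subsumes L0 and L1
}
states = group_equivalence_states(caps)
assert {"L0", "L1"} in states and {"L5"} in states
assert strictly_subsumed(caps, "L0", "L5")
```

A real implementation would compare capability sets only up to what has actually been cross-tested, which is exactly why the sparsity caveat above matters.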
RAID 2024, September 30–October 02, 2024, Padua, Italy Ahmad et al.
○6 peeking into the new state of the system enables the Explorer to record any observed changes. The latter then ○7 forwards its updates to concerned components, thus closing the feedback loop.
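The Explorer's feedback loop can be sketched as a simple observer pattern; the class and method names below are our own illustration, not Tango's actual interfaces.

```python
class Explorer:
    """Minimal sketch of the loop: after an input executes, the Explorer
    peeks at the target's new state, records the change, and forwards an
    update to subscribed components (e.g., tracker, scheduler, generator)."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def observe(self, snapshot, feedback):
        update = {"snapshot": snapshot, "feedback": feedback}
        for notify in self._subscribers:  # close the feedback loop
            notify(update)

received = []
explorer = Explorer()
explorer.subscribe(received.append)
explorer.observe("S1", {"new_edges": 3})
assert received[0]["snapshot"] == "S1"
```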
[Figure 8 plots omitted: the proportion of time spent on cross-testing over 24 hours (1m to 1d), with optimizations (Optimizations = All) and without (Optimizations = None), for batch sizes m = 10, 20, 50, and 100.]

Figure 8: The median overhead of state inference as the proportion of time spent on cross-testing, when optimizations are disabled (solid lines) and enabled (dotted lines), over 24 hours, under different batch sizes. Assuming the speed of discovering new snapshots is the same, the lower 𝑚 is, the more frequently state inference is performed.

[Figure 9 plots omitted: accuracy versus savings of state inference, per panel (Optimizations: All, B, C, CD, BCD) and batch size (10, 20, 50, 100), for the targets bftpd, dcmtk, dnsmasq, exim, expat, kamailio, lightftp, live555, llhttp, openssh, openssl, proftpd, pureftpd, tinydtls, and yajl.]

… and we measure the benefit of state inference as the proportion of …

Figure 10: Normalized values of the Kullback-Leibler divergence from uniformity of observed snapshot-to-state distributions, illustrated through notched box-plots at 95% CI. Divergence is calculated individually for each seed queue, and data points are then overlaid as a scatter plot. The size of each marker is proportional to the number of snapshots 𝑁, normalized by the maximum number of snapshots observed for its target across all campaigns.

[Figure 10 box-plots omitted: Normalized Divergence (0.0 to 1.0) per target: bftpd, dcmtk, dnsmasq, exim, expat, kamailio, lightftp, live555, llhttp, openssh, openssl, proftpd, pureftpd, tinydtls, yajl.]

Figure 11: The number of the discovered snapshots (in blue) and the number of the inferred states (in orange). Text after each bar shows the number of the inferred states and the reduction ratio 𝛼 = 1 − states/snapshots at the end of a 24-hour fuzzing campaign without optimizations.

Figure 11 data (inferred states and reduction ratio 𝛼, per batch size; cells garbled in extraction are marked –):

Target     m=10            m=20            m=50            m=100
bftpd      16.0 (84.39%)   11.0 (87.15%)   16.0 (86.17%)   –
dcmtk      5.0 (84.04%)    6.0 (80.36%)    –               –
dnsmasq    2.0 (99.60%)    4.0 (99.22%)    4.0 (99.15%)    2.0 (99.58%)
exim       34.0 (77.58%)   49.0 (65.73%)   32.0 (76.34%)   47.0 (80.95%)
expat      232.0 (70.31%)  219.0 (70.15%)  197.0 (71.86%)  219.0 (73.74%)
kamailio   10.0 (96.98%)   4.0 (98.56%)    6.0 (97.59%)    3.0 (98.52%)
lightftp   3.0 (96.37%)    7.0 (93.39%)    10.0 (93.13%)   8.0 (95.80%)
live555    4.0 (72.00%)    3.0 (89.66%)    26.0 (56.50%)   –
llhttp     1.0 (98.81%)    1.0 (98.80%)    1.0 (98.82%)    –
openssh    11.0 (85.28%)   16.0 (81.95%)   13.0 (89.04%)   23.0 (82.52%)
openssl    4.0 (90.23%)    3.0 (94.39%)    –               –
proftpd    43.0 (86.80%)   73.0 (77.78%)   63.0 (81.54%)   24.0 (93.07%)
pureftpd   5.0 (94.77%)    4.0 (95.82%)    2.0 (97.64%)    3.0 (97.73%)
tinydtls   1.0 (97.20%)    1.0 (98.37%)    1.0 (98.68%)    1.0 (99.11%)
yajl       102.0 (55.92%)  109.0 (59.23%)  100.0 (59.62%)  136.0 (56.56%)

6.3 RQ2: Snapshots in Biased Queues

Whenever an evolutionary fuzzer encounters interesting coverage, it saves the input that caused it in its seed queue. For stateful
fuzzers, these seeds serve as snapshots to restore the target to
the reached state. To quantify the benefits that seed scheduling
through state inference could provide, we measure the distribution of fuzzer-generated snapshots across functionally-distinct equivalence states. If we assume that a state-unaware fuzzer selects seeds from the queue with a uniform distribution (i.e., it assumes that seeds are evenly spread across target functionality), then we can assess the "surprise" [28] of sampling uniformly from a skewed population. We calculate 𝐷ˆ𝐾𝐿(S||U), the Kullback-Leibler divergence [14], of the observed snapshot-in-state distribution against a uniform reference, normalized by log(𝑁), where 𝑁 is the total number of snapshots observed in each campaign.

In this experiment, we run AFL++, Nyx-Net, and Tango-Infer against compatible targets, collect their seed queues, and apply state inference to extract snapshot groupings. We present the results in Fig. 10. A value 𝐷ˆ𝐾𝐿 = 0 indicates that sampling the seed queue uniformly yields results consistent with sampling a uniformly distributed population, i.e., where there are equally as many snapshots for every equivalence state discovered by the fuzzer. On the other end of the spectrum, 𝐷ˆ𝐾𝐿 = 1 implies that uniform sampling yields the highest surprise: whereas the fuzzer would expect to be exploring different functionalities by cycling through its queue, it is likely tunnel-visioned by a majority equivalence state. A fuzzer that samples its seed queue uniformly would be hindered by duplicate efforts and a self-reinforcing equivalence state.

On the other hand, while fuzzers generally employ more complex seed scheduling mechanisms [7, 16, 29], those cannot replace state awareness. A schedule that prioritizes snapshots unequally based on observed feedback inherently disregards the possible overlap of those snapshots with others in their equivalence state. Equivalent snapshots exercise overlapping behavior, insofar as cross-testing has not identified discrepancies that necessitate subdividing the group into distinct functionalities. However, since fuzzing is incomplete, it may be that now-equivalent snapshots diverge in the future, given that they are sufficiently scheduled. A non-uniform scheduling strategy may starve those snapshots of the time needed to explore their potential capabilities, often in favor of the first one which uncovered interesting features.

6.4 RQ3: Reduction Ratio 𝛼

The main advantage of state inference is condensing the seed queue into functionally distinct islands. This allows seed scheduling to be more balanced and reduces redundant exploration of the same code regions. To assess the effect of state inference on queue size reduction, we measure the number of snapshots generated by every campaign, and the corresponding number of discovered groupings. In Fig. 11, we report the reduction ratio 𝛼 as 𝛼 = 1 − states/snapshots. The average reduction ratio is 86.02% (𝑚 = 10), 86.04% (𝑚 = 20), 85.08% (𝑚 = 50), and 87.76% (𝑚 = 100), i.e., the queue is around seven times smaller. Combined with the results from Section 6.2, this suggests that, during later fuzzing stages, the fuzzer can cycle through its queue around seven times as fast, at a diminishing cost.

The skewed distribution of seeds in fuzzing queues suggests that state-of-the-art fuzzers could benefit from applying state inference to prune their queues and avoid tunnel vision towards high-density regions. Continuous incremental application also ensures that new snapshots are incorporated and that the fuzzer avoids regression towards non-uniformity.

6.5 RQ4: Case Study on Cross-Inference

We implement state inference without optimizations as a hot-pluggable component to introduce state-aware scheduling to two existing fuzzers: AFL++ (for the streaming parsers) and Nyx-Net (for the network servers). The fuzzer and Tango share one physical core throughout the campaign. Tango continuously checks for new inputs in the fuzzer's seed queue and applies state inference, exporting its results for use by the fuzzer's scheduler. Tango is
Tango: Extracting Higher-Order Feedback through State Inference RAID 2024, September 30–October 02, 2024, Padua, Italy
[Figure 12 plots omitted: number of edges over time (1s to 1d) per target.]

Figure 12: Edge coverage collected from Nyx-Net (for network servers) and AFL++ (for parsers) when running without (solid lines) and with (dotted lines) the state inference extension. Panels: (a) bftpd, (b) dcmtk, (c) dnsmasq, (d) expat, (e) llhttp, (f) openssh, (g) openssl, (h) tinydtls, (i) yajl.
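The normalized divergence 𝐷ˆ𝐾𝐿 reported in Fig. 10 (Section 6.3) can be computed as in the following sketch. This is our own minimal implementation of the stated definition, which may differ from the paper's exact estimator:

```python
import math
from collections import Counter

def normalized_kl_from_uniform(state_labels):
    """D_KL(S || U) / log(N): divergence of the observed snapshot-to-state
    distribution S from a uniform reference U over the discovered states,
    normalized by log(N), with N the total number of snapshots."""
    n = len(state_labels)
    counts = Counter(state_labels)
    k = len(counts)  # number of discovered equivalence states
    if n <= 1 or k <= 1:
        return 0.0
    # sum_i p_i * log(p_i / (1/k)), with p_i the fraction of snapshots in state i
    d = sum((c / n) * math.log((c / n) * k) for c in counts.values())
    return d / math.log(n)

# Evenly spread snapshots yield zero divergence:
assert normalized_kl_from_uniform(["s1", "s1", "s2", "s2"]) == 0.0
```

Higher values indicate a more skewed queue, i.e., a majority equivalence state dominating the snapshots.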
also configured to export any interesting inputs it discovers during cross-testing, further reducing its effective overhead.

We measure the code coverage achieved by both the unmodified and the augmented variants of the fuzzers, and we present those in Fig. 12. Since code coverage alone cannot capture state information, we leverage state inference as a performance metric, through a process we call cross-inference. Seeds obtained from two competing fuzzers are used to construct a snapshot tree, upon which state inference is applied. We interpret the overlap and disjunction of those snapshots in equivalence states as a measure of state coverage: states where all snapshots of fuzzer B are subsumed by snapshots of fuzzer A are considered unique to A, without loss of generality. Otherwise, we consider them overlapping. The results of this experiment are illustrated in Fig. 13.

The experiments yield an interesting result: despite both fuzzers attaining similar code coverage, state inference revealed distinctions in uncovered functionality, favoring the state-guided fuzzers. The additional code covered by the unmodified fuzzers may not directly translate to state coverage: "novel" paths may belong to the same equivalence state. Through our evaluation, state-guided fuzzers uncovered two new bugs: a heap buffer overflow in dcmtk and a heap out-of-bounds read in yajl.

[Figure 13 chart omitted: stacked horizontal bars per target (bftpd, dcmtk, dnsmasq, expat, llhttp, openssl, tinydtls, yajl) over the percentage of inferred states, split into behaviors unique to the fuzzer without inference, overlapping, and unique to the fuzzer with inference.]

Figure 13: Cross-inference results showing the distribution of overlapping and unique behaviors discovered by stateful fuzzers with and without state inference.

7 DISCUSSION

State inference introduces a new metric for a fuzzer to optimize its progress and performance on stateful systems. Consequently, it raises the question of whether such a metric ultimately improves the fuzzer's ability to find bugs. Notably, we argue that optimizing state coverage requires state-aware metrics, although not exclusively.

Code vs State coverage: Our case study on cross-inference highlights a key observation: code coverage is not sufficient for stateful exploration. Despite achieving higher code coverage, state-unaware fuzzers discovered fewer behaviorally-unique states, as seen in Fig. 13. This is further justified by our takeaway from Fig. 10, that seeds in the queue of coverage-guided fuzzers are non-uniformly distributed. Stateless programs can be seen as occupying only one state, and code coverage enables in-depth exploration of that state. Stateful programs introduce a new dimension for fuzzers. So, despite state-unaware fuzzers achieving slightly higher code coverage, cross-inference results highlight that state-aware exploration better maximizes state coverage. A scheduler may incorporate other metrics like code coverage to prioritize individual snapshots or diversify exploration within a state.

Bug-finding advantage: In dcmtk, Tango uniquely detected a clean-up bug, which was subsequently reported and fixed. To trigger the bug, sending any message and waiting for clean-up is insufficient; instead, it requires setting up a valid state, followed by a disconnection. Tango's "post-mortem tracking" was also pivotal for achieving this: it monitors decommissioned targets asynchronously until they crash or exit, without impacting performance. In yajl, 4 of 5 Tango-AFL++ campaigns reliably triggered the bug, whereas it was never triggered by the unmodified AFL++. In a later evaluation with 20 campaigns, the bug was triggered in 14 Tango-AFL++
campaigns, but only in 4 campaigns without inference. This significant difference underscores the benefit of state-aware scheduling. Without ground-truth benchmarks, we cannot assess fuzzer performance through new bugs alone, as the total bug count is unknown. Instead, we report that fuzzers with state inference found a superset of the bugs found by fuzzers without it. Tango provides the tools necessary to perform state-aware fuzzing, and we believe it is an important foundation for future research.

Exploration vs Exploitation: State inference offers a mechanism for assigning a label to each snapshot. This enables balanced exploration among states to avoid the starvation of less frequent ones. A power schedule over states, like AFLFast [7], could then incorporate the frequency of encountering a state to prioritize the exploration of low-frequency behaviors. If state coverage is the only metric being optimized, then equivalent snapshots would be indistinguishable, and a uniform schedule within the same state would be reasonable. However, a fuzzer can still schedule equivalent snapshots non-uniformly based on code coverage and occasionally prioritize weaker ones.

Misclassification: A false grouping may occur either due to (i) insufficient cross-pollination; or (ii) state-insensitive feedback. In case (i), snapshots may remain misclassified until the next round of inference. If the new tests are sufficient to show a distinction between the snapshots, then the misclassification will be rectified. In case (ii), the fuzzer does not have sufficient information to make a distinction, since the different behaviors do not yield different observable effects. The choice of a state-sensitive feedback metric is thus important. Through sufficient testing and state-sensitive feedback, state inference always yields accurate groupings.

Comparison to LibAFL: Similar to Tango, LibAFL offers building blocks for custom fuzzers. However, LibAFL is designed for coverage-guided greybox fuzzing but not for stateful fuzzing. Tango is developed independently of and in parallel with LibAFL and was built from the ground up with state as an anchor. Statefulness could be refitted on LibAFL, but with heavyweight modification to support target state as the context of its operations.

8 RELATED WORK

State-aware fuzzing: AFLNet [24] was among the first to tackle the problem of fuzzing network targets while allowing state to accumulate. However, its requirement to manually annotate server responses hindered adoption. Alternative techniques presented in SGFuzz [5], NSFuzz [25], and StateAFL [20] addressed this issue through more automated, albeit less precise, techniques for state extraction. Nonetheless, those works focused on extracting state, not as a way to generate protocol-compliant inputs, but as a labeling mechanism for discovered inputs. State inference in Tango makes this mechanism more explicit by exploring functional overlaps. LLM-guided fuzzing [19] has also shown its efficacy at targeting network protocols, by providing machine-readable grammars and high-quality seeds for covering states and transitions. Nonetheless, it remains limited to the information available in its training data and struggles to generalize to arbitrary or proprietary protocols.

Seed scheduling: While seed scheduling is a well-researched problem in fuzzing [12, 29, 30], state inference is the first to address it in the context of stateful systems. Existing techniques do not account for persistent effects of executing seeds; they perform their analysis retrospectively. In contrast, state inference runs a prospective analysis, finding overlaps in the traces of executed inputs.

Grammar inference: State inference overlaps with the disciplines of regular language grammars and automata theory. DFA minimization [13] proposes a technique for collapsing an FSM into a minimum number of states distinguishable by their outgoing transitions. The RPNI [22] and L* [2] algorithms present techniques for passive and active learning of deterministic finite automata for regular languages. Recent work by Luo et al. [17] also tackles grammar inference for network protocols, which is limited to protocols that reveal stateful information in their responses to client messages. Though similar to DFA inference, our approach differs in key assumptions, filling gaps in existing techniques. Namely, we do not assume an FSM model for the target, nor that the complete set of inputs and responses is known or enumerable. It is not the goal of state inference to extract the state model of the system, since that would require knowledge of which computational model is implemented, e.g., FSM or PDA. State inference instead aims to discover hidden capabilities of each snapshot through cross-pollination, then finds groups of snapshots that share the same capabilities through subsumption and colorization. Compared to MACE [8], which first introduced blackbox state inference into dynamic symbolic execution, Tango observes more fine-grained feedback via instrumentation and has addressed unique challenges when combining state inference with seed scheduling.

9 CONCLUSION

Research on stateful fuzzing continued where its stateless counterpart left off. While much of the progress on the latter was of great benefit to this field, it still managed to imprint methods and assumptions that are otherwise not suited for stateful fuzzing. In this paper, we re-assess the definition of states and how they fit into the fuzzing stack. We present a method to identify semantic behavior through the use of portable metrics, in a technique we dub "State Inference". In the process, we design and implement Tango, a state-aware fuzzing framework for bootstrapping research in this domain. Our evaluation shows that fuzzers could potentially spend upwards of 86% of their time being tunnel-visioned or duplicating their efforts. By applying our technique, fuzzers can leverage state awareness for more optimal scheduling, at a diminishing amortized cost. State inference is also applicable in other stages of the fuzzing cycle, from seed minimization and distillation, through unsupervised state extraction and determined reproduction, to better-grounded performance evaluation.

ACKNOWLEDGMENTS

We thank the anonymous reviewers for their feedback on the paper. This work was supported, in part, by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant agreement No. 850868), SNSF PCEGP2_186974, and a gift from Huawei.

REFERENCES

[1] 2022. libfaketime modifies the system time for a single application. https://github.com/wolfcw/libfaketime Accessed: 2024-03-15.
[2] Dana Angluin. 1987. Learning Regular Sets from Queries and Counterexamples. Information and Computation 75, 2 (1987).
[3] Cornelius Aschermann, Sergej Schumilo, Ali Abbasi, and Thorsten Holz. 2020. IJON: Exploring Deep State Spaces via Fuzzing. In 2020 IEEE Symposium on Security and Privacy (SP).
[4] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. 2002. The Nonstochastic Multiarmed Bandit Problem. SIAM J. Comput. 32, 1 (2002).
[5] Jinsheng Ba, Marcel Böhme, Zahra Mirzamomen, and Abhik Roychoudhury. 2022. Stateful Greybox Fuzzing. In Proceedings of the 31st USENIX Security Symposium (USENIX Security '22). USENIX Association.
[6] Marcel Böhme and Brandon Falk. 2020. Fuzzing: On the Exponential Cost of Vulnerability Discovery. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2020). Association for Computing Machinery.
[7] Marcel Böhme, Van-Thuan Pham, and Abhik Roychoudhury. 2016. Coverage-Based Greybox Fuzzing as Markov Chain. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS '16). Association for Computing Machinery.
[8] Chia Yuan Cho, Domagoj Babić, Pongsin Poosankam, Kevin Zhijie Chen, Edward XueJun Wu, and Dawn Song. 2011. MACE: Model-inference-Assisted Concolic Exploration for Protocol and Vulnerability Discovery. In Proceedings of the 20th USENIX Security Symposium (USENIX Security '11). USENIX Association.
[9] Andrea Fioraldi, Daniele Cono D'Elia, and Davide Balzarotti. 2021. The Use of Likely Invariants as Feedback for Fuzzers. In Proceedings of the 30th USENIX Security Symposium (USENIX Security '21). USENIX Association.
[10] Andrea Fioraldi, Dominik Maier, Heiko Eißfeldt, and Marc Heuse. 2020. AFL++: Combining Incremental Steps of Fuzzing Research. In Proceedings of the 14th USENIX Workshop on Offensive Technologies (WOOT '20). USENIX Association.
[11] Andrea Fioraldi, Alessandro Mantovani, Dominik Maier, and Davide Balzarotti. 2023. Dissecting American Fuzzy Lop: A FuzzBench Evaluation. ACM Transactions on Software Engineering and Methodology (2023).
[12] Adrian Herrera, Hendra Gunadi, Shane Magrath, Michael Norrish, Mathias Payer, and Antony L. Hosking. 2021. Seed Selection for Successful Fuzzing. In 30th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2021). ACM.
[13] John E. Hopcroft, Rajeev Motwani, and Jeffrey D. Ullman. 1979. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, Section 4.4.3: Minimization of DFAs, 159–164.
[14] S. Kullback and R. A. Leibler. 1951. On Information and Sufficiency. The Annals of Mathematical Statistics 22 (1951).
[15] Junqiang Li, Senyi Li, Gang Sun, Ting Chen, and Hongfang Yu. 2022. SNPSFuzzer: A Fast Greybox Fuzzer for Stateful Network Protocols Using Snapshots. IEEE Transactions on Information Forensics and Security (2022).
[16] Chung-Yi Lin, Chun-Ying Huang, and Chia-Wei Tien. 2019. Boosting Fuzzing Performance with Differential Seed Scheduling. In 2019 14th Asia Joint Conference on Information Security (AsiaJCIS).
[17] Zhengxiong Luo, Junze Yu, Feilong Zuo, Jianzhong Liu, Yu Jiang, Ting Chen, Abhik Roychoudhury, and Jiaguang Sun. 2023. BLEEM: Packet Sequence Oriented Fuzzing for Protocol Implementations. In 32nd USENIX Security Symposium (USENIX Security 23).
[18] William M. McKeeman. 1998. Differential Testing for Software. Digital Technical Journal 10, 1 (1998).
[19] Ruijie Meng, Martin Mirchev, Marcel Böhme, and Abhik Roychoudhury. 2024. Large Language Model-guided Protocol Fuzzing. In Network and Distributed System Security Symposium (NDSS 2024). The Internet Society.
[20] Roberto Natella. 2022. StateAFL: Greybox Fuzzing for Stateful Network Servers. Empirical Software Engineering 27, 7 (2022).
[21] Roberto Natella and Van-Thuan Pham. 2021. ProFuzzBench: A Benchmark for Stateful Protocol Fuzzing. In Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis.
[22] J. Oncina and P. García. 1992. Inferring Regular Languages in Polynomial Updated Time. In Pattern Recognition and Image Analysis. Series in Machine Perception and Artificial Intelligence, Vol. 1. World Scientific, 49–61.
[23] Sriram V. Pemmaraju and Steven S. Skiena. 2003. Computational Discrete Mathematics: Combinatorics and Graph Theory with Mathematica. Cambridge University Press, Cambridge, England. Contracting Vertices, Section 6.1.1, 231–234.
[24] Van-Thuan Pham, Marcel Böhme, and Abhik Roychoudhury. 2020. AFLNet: A Greybox Fuzzer for Network Protocols. In Proceedings of the 13th IEEE International Conference on Software Testing, Verification and Validation: Testing Tools Track.
[25] Shisong Qin, Fan Hu, Bodong Zhao, Tingting Yin, and Chao Zhang. 2022. NSFuzz: Towards Efficient and State-Aware Network Service Fuzzing. International Fuzzing Workshop (FUZZING) 2022 (2022).
[26] Sergej Schumilo, Cornelius Aschermann, Ali Abbasi, Simon Wörner, and Thorsten Holz. 2021. Nyx: Greybox Hypervisor Fuzzing using Fast Snapshots and Affine Types. In 30th USENIX Security Symposium (USENIX Security 21).
[27] Sergej Schumilo, Cornelius Aschermann, Andrea Jemmett, Ali Abbasi, and Thorsten Holz. 2022. Nyx-Net: Network Fuzzing with Incremental Snapshots. In Proceedings of the Seventeenth European Conference on Computer Systems (EuroSys '22).
[28] C. E. Shannon. 1948. A Mathematical Theory of Communication. Bell System Technical Journal 27 (1948), 379–423, 623–656.
[29] Dongdong She, Abhishek Shah, and Suman Jana. 2022. Effective Seed Scheduling for Fuzzing with Graph Centrality Analysis. In 2022 IEEE Symposium on Security and Privacy (SP).
[30] Jinghan Wang, Chengyu Song, and Heng Yin. 2021. Reinforcement Learning-based Hierarchical Seed Scheduling for Greybox Fuzzing. In 28th Annual Network and Distributed System Security Symposium (NDSS 2021). The Internet Society.

A SUBSUMPTION OPERATORS

To formalize the approach, we propose the definitions:

𝑁□(𝑢): the □-neighborhood of 𝑢, where □ ∈ {+, −} represents outward and inward directions, respectively.
𝑢 ⪯□ 𝑣 ⇐⇒ 𝑁□(𝑢) ⊆ 𝑁□(𝑣): 𝑢 is □-subsumed by 𝑣.
𝑢 ≺□ 𝑣 ⇐⇒ (𝑢 ⪯□ 𝑣) ∧ (𝑣 ⪯̸□ 𝑢): 𝑢 is strictly □-subsumed by 𝑣.
𝑢 ⪯ 𝑣 ⇐⇒ (𝑢 ⪯+ 𝑣) ∧ (𝑢 ⪯− 𝑣): 𝑢 is subsumed by 𝑣.
𝑢 ≺ 𝑣 ⇐⇒ (𝑢 ⪯ 𝑣) ∧ (𝑢 ≺□ 𝑣): 𝑢 is strictly subsumed by 𝑣, for any □ ∈ {+, −}.
𝑢 ∼□ 𝑣 ⇐⇒ (𝑢 ⪯□ 𝑣) ∧ (𝑣 ⪯□ 𝑢): 𝑢 and 𝑣 are □-equivalent.
𝑢 ∼ 𝑣 ⇐⇒ (𝑢 ∼+ 𝑣) ∧ (𝑢 ∼− 𝑣): 𝑢 and 𝑣 are equivalent.

For each proposed operator 𝑜𝑝, we can construct a set of vertices 𝑢 ∈ 𝑈 which satisfy 𝑢 𝑜𝑝 𝑣 as: 𝑜𝑝𝑣 = {𝑢 ∈ 𝑈 | 𝑢 𝑜𝑝 𝑣}. Of particular interest are the equivalence sets ∼𝑣 and the strict subsumption sets ≺𝑣. To extract the sets of equivalence states from a capability matrix, it suffices to construct the set of out-equivalence sets:

𝑆˜ = {∼+𝑣 | 𝑣 ∈ 𝐿}

Each element in 𝑆˜ represents an equivalence state, a set of snapshots that overlap in behavior. To further reduce the number of states, the fuzzer calculates the set of non-empty strict subsumption sets:

𝑆ˇ = {≺𝑣 ≠ ∅ | 𝑣 ∈ 𝐿}

Following that, any snapshot belonging to any set in 𝑆ˇ can be safely eliminated from all sets in 𝑆˜, without loss of capabilities. Alternatively, to avoid inadvertently losing quality inputs, we choose to include strictly subsumed snapshots in the equivalence states of the subsuming nodes. This ensures that the fuzzer experiences the same reduction in state counts while maintaining access to all interesting snapshots.

B TIME SPENT ON CROSS-POLLINATION

During cross-pollination, for every batch of 𝑚 new snapshots, given 𝑛 existing batches, the fuzzer must perform additional executions on the order of O(𝑚 × 𝑛). In particular, we define the following quantities:

Time-step 𝑡𝑘: the point in time at which the 𝑘th application round of state inference is performed.
Snapshots 𝑛𝑘 = 𝑛𝑘−1 + 𝑚 = 𝑚𝑘
States 𝑠𝑘 = 𝑠𝑘−1 + 𝛼𝑚 = 𝛼𝑚𝑘, where 𝛼 is the average reduction ratio
Cross-tests 𝑐𝑘 = 𝑐𝑘−1 + 𝑚(2𝑛𝑘−1 + 𝑚 − 1) = 𝑛𝑘² − 𝑛𝑘, the additional executions per round.

Since coverage growth is linear in exponential time [6], during the 𝑘th step, at time 𝑡𝑘, the total number of cross-tests performed is O(log²(𝑡𝑘)). Meanwhile, the fuzzer generates and tests new inputs in linear time, diminishing the ratio of time spent on state inference as:

lim (𝑡→∞) log²(𝑡)/𝑡 = 0

C COST REDUCED BY STATE BROADCAST

Since discovered capabilities are then broadcast to all snapshots within a state, the number of cross-tests becomes:

𝑐˜𝑘 = 𝑐˜𝑘−1 + 𝑚(𝑛𝑘−1 + 𝑠𝑘−1 + 𝑚 − 1)

the relevant I/O syscalls where necessary, e.g., read, dup, close, and poll, among others. Synchronization allows the target to return control to the fuzzer as soon as it becomes ready, instead of busy-waiting and degrading throughput. Moreover, it guarantees reproducibility of results: by encoding the relevant sequences of syscalls into its saved inputs, the fuzzer can reliably reproduce states and coverage measurements, leaving little for the OS to influence when data is delivered to the target.

In addition, to increase fuzzer throughput and exploit the redundancy of resets, Tango leverages ptrace to dynamically inject a forkserver at runtime, just after setting up the communication channel in the target. This relieves the loader of the heavy initialization phase of many network services.
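The recurrences from Appendices B and C can be checked numerically. This sketch is our own; it verifies the closed form 𝑐𝑘 = 𝑛𝑘² − 𝑛𝑘 and the reduction from state broadcast, taking 𝛼 as the states-to-snapshots ratio used in Appendix B:

```python
def cross_tests(m, rounds):
    """c_k = c_{k-1} + m(2 n_{k-1} + m - 1); closed form n_k^2 - n_k."""
    c, n = 0, 0
    for _ in range(rounds):
        c += m * (2 * n + m - 1)  # each new snapshot vs. old ones and peers
        n += m
    return c, n

def cross_tests_broadcast(m, rounds, alpha):
    """c~_k = c~_{k-1} + m(n_{k-1} + s_{k-1} + m - 1), with s_k = alpha*m*k.

    Here alpha denotes states/snapshots: capabilities are broadcast within
    a state, so new snapshots are tested against states, not all snapshots.
    """
    c, n, s = 0, 0, 0.0
    for _ in range(rounds):
        c += m * (n + s + m - 1)
        n += m
        s += alpha * m
    return c

c, n = cross_tests(10, 5)
assert c == n * n - n          # 2450 == 50*50 - 50
assert cross_tests_broadcast(10, 5, 0.14) < c  # broadcasting saves tests
```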
target. This achieves around a 50x increase in fuzzing throughput over the non-optimized implementation, since socket setup is often expensive, and otherwise, the fuzzer may only successfully connect to the target by reattempting the operation until it no longer fails.

D.9 Adaptive mutators

We implement an adaptive model for applying havoc mutators that balances exploration and exploitation using the Exp3 [4] algorithm for distributing rewards and assigning probabilities. For each snapshot, we maintain a set of weights describing the probability that a mutator is chosen in that state. Provided a comprehensive set of mutators, this approach accommodates an evolving target by adjusting and applying the probabilities in selecting the next mutator, based on how well each one performs in the state context.

E CASE STUDY: DOOM

Tango is a framework for building state-aware fuzzers, and the main witness of its merit would be using it to develop a state-aware fuzzer for a complex stateful system. In the spirit of hacker culture, we opted to answer the question "Can it run DOOM?" with a resounding "Yes!", using Tango.

E.1 Setup

To support DOOM, we extended Tango in the following aspects:

Driver: We implemented an X11Channel component which sends keystroke events to a process's window, through a public Python library, python3-xlib.

Generator: We added Activate, Kill, Move, Reach, Rotate, and Shoot instructions, extending the input base class. These govern how the player's character interacts with its environment, and are later used by the input generator and the exploration strategy to maneuver around the map and overcome obstacles.

We limited the functionality of the input generator to selecting a possible outgoing or incoming transition of the current state and passing it on to the mutator, which mainly mutates the direction and duration of movement, to explore the level. We also provided it with two helper functions that can yield the correct sequence of commands to follow a path or to aim and shoot at a moving target by incorporating live feedback.

Tracker: We implemented state feedback as a shared-memory struct populated by DOOM and accessed by Tango. The struct contains all basic user properties such as location, weapons, ammo, pickups, enemies in sight, and doors or switches within reach. Two states are considered equivalent if they have the same player position. We attached extended state variables to each state that describe the pickups collected along the current path, and state attributes describing the current location (e.g., if it is a slime pit or a secret level). To avoid having a unique state for every single position on

One downside of this approach is that nearby states (locations) may be reachable through a bee-line movement, whereas the input generated and discovered by the fuzzer to transition between these two states may involve redundancies that impact fuzzer throughput. Another downside is that if the current location and the target location are close to each other, yet are far enough from the spawn location, restarting the level from that location would be inefficient. The player may simply need to move a few steps in the direction of the target to reach it. Moreover, continuously restarting the target breaks the immersion of the fuzzer "playing" the game, and it would instead spend much of its time replaying actions from the start. To avoid that, we implemented path-finding algorithms, based on the state graph explored by the fuzzer, to move between two locations using a sequence of Reach instructions. In essence, to load a state, a path to it from the current state is calculated, and Reach instructions are performed piece-wise along every transition in the path.

Strategy: Finally, we implemented a ZoomStrategy component to tie it all together, which schedules states based on a convex hull of the explored locations and prioritizes locations on the perimeter that are furthest from the start. In addition, the strategy implements an event observer task that is responsible for reacting to urgent events such as seeing an enemy or stepping in slime pits. By preempting the fuzzer's main loop, the strategy minimizes the reaction time to increase the survivability of the player.

E.2 Results

With these extensions, Tango consistently manages to finish the E1M1 level of DOOM, on difficulty 3, in 10 to 40 wall-clock minutes. The main factor contributing to this variability is perimeter exploration. As can be seen in Fig. 14b, in one recorded fuzzing session, the fuzzer encountered a big undiscovered area on the left side of the map, and dedicated a significant amount of time to exploring it. It also managed to discover the path up the stairs to the higher platform, where it found a level 1 armor pickup. In other runs, due to the stochastic nature of the fuzzing process, it may miss that area completely and continue exploring in the immediate direction of the exit door, achieving a lower overall finish time in the process.

Having found the armor pickup, Tango maintains a record of it in its later exploration stages. As can be seen in Fig. 14c, its path to the finish line includes going up the stairs, picking up the armor boost, and returning on another path to the exit room.

Regardless of the overall finish time, once Tango finds the path to the exit, it consistently manages to follow it in 3 to 4 in-game minutes. While far from typical speed-runs for this level (which are as low as 9 seconds), it remains a formidable achievement to be able to explore the state space of a DOOM level and manage to finish it in a sensible amount of time.
the map, the level is divided into a grid of cells, representing the
granularity of feedback, as shown in Fig. 14a.
Loader: We extended the loader’s two main functions: restarting
the target and loading a state. Restarts are simple, as they’re only a
matter of terminating and relaunching the process. Loading a state
is slightly more complex: without a means of snapshotting, actions
must be replayed to reach a certain known location, given that the
state graph contains at least one path to it. However, a downside
RAID 2024, September 30–October 02, 2024, Padua, Italy Ahmad et al.
(a) Grid view of the DOOM map. (b) Heatmap of the visited cells during the first 10 minutes of fuzzing DOOM’s E1M1 level. (c) Heatmap of the visited cells after 30 minutes of fuzzing.
Figure 14: The progress of Tango in playing DOOM. Red shading implies higher hit counts. Figure 14c shows that the fuzzer had figured out a path to the finish and continues to repeat it to achieve the lowest completion time.
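The grid-cell state abstraction used by the Tracker (Appendix E.1) — two snapshots map to the same state whenever the player position falls in the same grid cell, while extended state variables and attributes ride along without affecting equivalence — can be sketched as follows. The cell size, class name, and fields here are illustrative assumptions, not Tango’s actual implementation:

```python
CELL_SIZE = 64  # hypothetical map units per grid cell (granularity of feedback, cf. Fig. 14a)

def to_cell(x, y, cell_size=CELL_SIZE):
    """Quantize a map position into a grid cell."""
    return (int(x // cell_size), int(y // cell_size))

class PlayerState:
    """A state keyed only by the player's grid cell.

    Extended state variables (pickups collected along the current path)
    and state attributes (e.g., slime pit, secret level) are stored but
    deliberately excluded from equivalence and hashing.
    """
    def __init__(self, x, y, pickups=(), attributes=()):
        self.cell = to_cell(x, y)
        self.pickups = frozenset(pickups)        # extended state variables
        self.attributes = frozenset(attributes)  # state attributes

    def __eq__(self, other):
        return isinstance(other, PlayerState) and self.cell == other.cell

    def __hash__(self):
        return hash(self.cell)
```

Because equivalence ignores everything but the cell, the fuzzer’s state queue stays bounded by the grid size rather than growing with every distinct pixel position.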
[Figure: distributions of cross-testing time for batch sizes m=10, m=20, m=50, and m=100, with optimizations enabled (All) and disabled (None); x-axis: time (1m to 1d), y-axis: percentage.]
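The Loader’s replay-based state loading described in Appendix E.1 — compute a path through the explored state graph, then perform Reach instructions piece-wise along every transition — can be sketched as follows. The adjacency-list graph representation and the reach callback are illustrative assumptions, not Tango’s actual API:

```python
from collections import deque

def plan_path(graph, source, target):
    """Shortest sequence of transitions from source to target via BFS,
    or None if the explored state graph contains no path."""
    prev = {source: None}
    frontier = deque([source])
    while frontier:
        state = frontier.popleft()
        if state == target:
            # Walk the predecessor chain back to the source.
            path = []
            while prev[state] is not None:
                path.append(state)
                state = prev[state]
            return list(reversed(path))
        for nxt in graph.get(state, ()):
            if nxt not in prev:
                prev[nxt] = state
                frontier.append(nxt)
    return None

def load_state(graph, current, target, reach):
    """Without snapshotting, replay one Reach instruction per transition
    along the computed path to restore the target state."""
    path = plan_path(graph, current, target)
    if path is None:
        raise ValueError("target state not reachable in the state graph")
    for waypoint in path:
        reach(waypoint)
```

A restart simply relaunches the process and resets `current` to the initial state; loading then degrades gracefully to replaying the full path from the start.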