RESTler Stateful REST API Fuzzing

RESTler is a tool that automatically tests REST APIs by generating stateful test sequences of requests. It analyzes the API specification to infer dependencies between request types, such as needing the output of one request as input to another. It also learns from responses to prior tests to avoid combinations that were refused. The tool was able to find 28 bugs in GitLab and several bugs each in multiple Microsoft cloud services by generating tests in this way.


2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE)

RESTler: Stateful REST API Fuzzing


Vaggelis Atlidakis∗ (Columbia University), Patrice Godefroid (Microsoft Research), Marina Polishchuk (Microsoft Research)

∗ The work of this author was mostly done at Microsoft Research.

DOI 10.1109/ICSE.2019.00083

Abstract: This paper introduces RESTler, the first stateful REST API fuzzer. RESTler analyzes the API specification of a cloud service and generates sequences of requests that automatically test the service through its API. RESTler generates test sequences by (1) inferring producer-consumer dependencies among request types declared in the specification (e.g., inferring that "a request B should be executed after request A" because B takes as an input a resource-id x produced by A) and by (2) analyzing dynamic feedback from responses observed during prior test executions in order to generate new tests (e.g., learning that "a request C after a request sequence A;B is refused by the service" and therefore avoiding this combination in the future). We present experimental results showing that these two techniques are necessary to thoroughly exercise a service under test while pruning the large search space of possible request sequences. We used RESTler to test GitLab, an open-source Git service, as well as several Microsoft Azure and Office365 cloud services. RESTler found 28 bugs in GitLab and several bugs in each of the Azure and Office365 cloud services tested so far. These bugs have been confirmed and fixed by the service owners.

I. INTRODUCTION

Over the last decade, we have seen an explosion in cloud services for hosting software applications (Software-as-a-Service), for building distributed services and data processing (Platform-as-a-Service), and for providing general computing infrastructure (Infrastructure-as-a-Service). Today, most cloud services, such as those provided by Amazon Web Services (AWS) [2] and Microsoft Azure [29], are programmatically accessed through REST APIs [11] by third-party applications [1] and other services [31]. Meanwhile, Swagger (recently renamed OpenAPI) [40] has arguably become the most popular interface-description language for REST APIs. A Swagger specification describes how to access a cloud service through its REST API, including what requests the service can handle, what responses may be received, and the response format.

Tools for automatically testing cloud services via their REST APIs and checking whether those services are reliable and secure are still in their infancy. The most sophisticated testing tools currently available for REST APIs capture live API traffic, and then parse, fuzz and replay the traffic with the hope of finding bugs [4], [34], [7], [41], [3]. Many of these tools were born as extensions of more established website testing and scanning tools (see Section VIII). Since these REST API testing tools are all recent and not yet widely used, it is still largely unknown how effective they are in finding bugs and how security-critical those bugs are.

In this paper, we introduce RESTler, the first automatic stateful REST API fuzzing tool. Fuzzing [39] means automatic test generation and execution with the goal of finding security vulnerabilities. Unlike other REST API testing tools, RESTler performs a lightweight static analysis of an entire Swagger specification, and then generates and executes tests that exercise the corresponding cloud service in a stateful manner. By stateful, we mean that RESTler attempts to explore service states that are reachable only using sequences of multiple requests. With RESTler, each test is defined as a sequence of requests and responses. RESTler generates tests by:
1) inferring dependencies among request types declared in the Swagger specification (e.g., inferring that a resource included in the response of a request A is necessary as input argument of another request B, and therefore that A should be executed before B), and by
2) analyzing dynamic feedback from responses observed during prior test executions in order to generate new tests (e.g., learning that "a request C after a request sequence A;B is refused by the service" and therefore avoiding this combination in the future).

We present empirical evidence showing that these two techniques are necessary to thoroughly test a service, while pruning the large search space defined by all possible request sequences. RESTler also implements several search strategies (akin to those used in model-based testing [43]) and we compare their effectiveness while fuzzing GitLab [13], an open-source self-hosted Git service with a complex REST API.

During the course of our experiments, we found 28 new bugs in GitLab (see Section VI). We also ran experiments on four public cloud services in Microsoft Azure [29] and Office365 [30] and found several bugs in each service tested (see Section VII). This paper makes the following contributions:
• We introduce RESTler, the first automatic, stateful fuzzing tool for REST APIs, which analyzes a Swagger specification, automatically infers dependencies among request types, and dynamically generates tests guided by feedback from service responses.
• We present detailed experimental evidence showing that the techniques used in RESTler are necessary for effective automated stateful REST API fuzzing.
• We present experimental results obtained with three different strategies for searching the large search space defined by all possible request sequences, and discuss their strengths and weaknesses.
• We present a detailed case study with GitLab, a large popular open-source self-hosted Git service and discuss several new bugs found so far.
• We discuss preliminary experiences using RESTler to test
several Microsoft public cloud services.

The remainder of the paper is organized as follows. Section II describes how Swagger specifications are processed by RESTler. Sections III and IV present the main test-generation algorithm used in RESTler and implementation details. Section V presents an evaluation of the test-generation techniques and search strategies implemented in RESTler. Section VI discusses new bugs found in GitLab. Section VII presents our experiences fuzzing several public cloud services. Section VIII discusses related work, and Section IX concludes the paper.

II. PROCESSING API SPECIFICATIONS

In this paper, we consider services accessible through REST APIs described with a Swagger specification. A client program can send messages, called requests, to a service and receive messages back, called responses. Such messages are sent over the HTTP protocol. A Swagger specification describes how to access a service through its REST API (e.g., what requests the service can handle and what responses may be expected). Given a Swagger specification, open-source Swagger tools can automatically generate a web UI that allows users to view the documentation and interact with the API via a web browser.

A sample Swagger specification, in web-UI form, is shown in Figure 1. This specification describes the API of a simple blog posts hosting service. The API consists of five request types, specifying the endpoint, method, and required parameters. This service allows users to create, access, update, and delete blog posts. In a web browser, clicking on any of these five request types expands the description of the request type. For instance, selecting the second (POST) request, reveals text similar to the left of Figure 2. This text is in YAML format and describes the exact syntax expected for that specific request and its response. In this case, the definition part of the specification indicates that an object named body of type string is required and that an object named id of type integer is optional (since it is not required). The path part of the specification describes the HTTP-syntax for this POST request as well as the format of the expected response.

[Figure 1: web-UI rendering of the Swagger specification; image not reproduced]
Fig. 1: Swagger Specification of Blog Posts Service

Swagger specification (YAML):

    basePath: '/api'
    swagger: '2.0'
    definitions:
      "Blog Post":
        properties:
          body:
            type: string
          id:
            type: integer
        required:
          - body
        type: object
    paths:
      "/blog/posts/":
        post:
          parameters:
            - in: body
              name: payload
              required: true
              schema:
                ref: "/definitions/Blog Post"

RESTler grammar (python):

    from restler import requests
    from restler import dependencies

    def parse_posts(data):
        post_id = data["id"]
        dependencies.set_var(post_id)

    request = requests.Request(
        restler_static("POST"),
        restler_static("/api/blog/posts/"),
        restler_static("HTTP/1.1"),
        restler_static("{"),
        restler_static("body:"),
        restler_fuzzable("string"),
        restler_static("}"),
        'post_send': {
            'parser': parse_posts,
            'dependencies': [
                post_id.writer(),
            ]
        }
    )

Fig. 2: Swagger Specification and Automatically Derived RESTler Grammar. Shows a snippet of Swagger specification in YAML (left) and the corresponding grammar generated by RESTler (right).

From such a specification, RESTler automatically constructs the test-generation grammar shown on the right of Figure 2. This grammar is encoded in executable python code. It consists of code to generate an HTTP request, of type POST in this case, and code to process the expected response of this request. Each function restler_static simply appends the string it takes as argument without modifying it. In contrast, the function restler_fuzzable takes as argument a value type (like string in this example) and replaces it by one value of that type taken from a (small) dictionary of values for that type. How dictionaries are defined and how values are selected is discussed in the next section.

The response is expected to return a new dynamic object (a dynamically created resource id) named id of type integer. Using the schema shown on the left, RESTler automatically generates the function parse_posts shown on the right. By similarly analyzing the other request types described in this Swagger specification, RESTler will infer automatically that ids returned by such POST requests are necessary to generate well-formed requests of the last three request types shown in Figure 1, which each requires an id. These producer-consumer dependencies are extracted by RESTler when processing the Swagger specification and are later used for test generation, as described next.

III. TEST GENERATION ALGORITHM

The main algorithm for test generation used by RESTler is shown in Figure 3 in python-like notation. It starts (line 3) by processing a Swagger specification as discussed in the previous section. The result of this processing is a set of request types, denoted reqSet in Figure 3, and of their dependencies (more on this later).

The algorithm computes a set of request sequences, as inferred from Swagger, denoted seqSet and initially containing an empty sequence ε (line 5). A request sequence is valid if every response in the sequence has a valid return code, defined here as any code in the 200 range. At each iteration of its main loop (line 8), starting with n = 1, the algorithm computes all valid request sequences seqSet of length n before moving to n+1 and so on until a user-specified maxLength is reached. Computing seqSet is done in two steps.

First, the set of valid request sequences of length n − 1 is extended (line 9) to create a set of new sequences of length n
by appending each request with satisfied dependencies at the end of each sequence, as described in the EXTEND function (line 14). The function DEPENDENCIES (line 39) checks if all dependencies of the specified request are satisfied. This is true when every dynamic object that is a required parameter of the request, denoted by CONSUMES(req), is produced by some response to the request sequence preceding it, denoted by PRODUCES(seq). If all the dependencies are satisfied, the new sequence of length n is retained (line 19); otherwise it is discarded.

     1  Inputs: swagger_spec, maxLength
     2  # Set of requests parsed from the Swagger API spec
     3  reqSet = PROCESS(swagger_spec)
     4  # Set of request sequences (initially an empty sequence ε)
     5  seqSet = {}
     6  # Main loop: iterate up to a given maximum sequence length
     7  n = 1
     8  while (n <= maxLength):
     9      seqSet = EXTEND(seqSet, reqSet)
    10      seqSet = RENDER(seqSet)
    11      n = n + 1
    12  # Extend all sequences in seqSet by appending
    13  # new requests whose dependencies are satisfied
    14  def EXTEND(seqSet, reqSet):
    15      newSeqSet = {}
    16      for seq in seqSet:
    17          for req in reqSet:
    18              if DEPENDENCIES(seq, req):
    19                  newSeqSet = newSeqSet + concat(seq, req)
    20      return newSeqSet
    21  # Concretize all newly appended requests using dictionary values,
    22  # execute each new request sequence and keep the valid ones
    23  def RENDER(seqSet):
    24      newSeqSet = {}
    25      for seq in seqSet:
    26          req = last_request_in(seq)
    27          V = tuple_of_fuzzable_types_in(req)
    28          for v in V:
    29              newReq = concretize(req, v)
    30              newSeq = concat(seq, newReq)
    31              response = EXECUTE(newSeq)
    32              if response has a valid code:
    33                  newSeqSet = newSeqSet + newSeq
    34              else:
    35                  log error
    36      return newSeqSet
    37  # Check that all objects referenced in a request are produced
    38  # by some response in a prior request sequence
    39  def DEPENDENCIES(seq, req):
    40      if CONSUMES(req) ⊆ PRODUCES(seq):
    41          return True
    42      else:
    43          return False
    44  # Objects required in a request
    45  def CONSUMES(req):
    46      return object_types_required_in(req)
    47  # Objects produced in the responses of a sequence of requests
    48  def PRODUCES(seq):
    49      dynamicObjects = {}
    50      for req in seq:
    51          newObjs = objects_produced_in_response_of(req)
    52          dynamicObjects = dynamicObjects + newObjs
    53      return dynamicObjects

Fig. 3: Main Algorithm used in RESTler.

Second, each newly-extended request sequence whose dependencies are satisfied is rendered (line 10) one by one as described in the RENDER function (line 23). For every newly-appended request (line 26), the list of all fuzzable primitive types in the request is computed (line 27) (those are identified by restler_fuzzable in the code shown on the right of Figure 2). Then, each fuzzable primitive type in the request is concretized, which substitutes one concrete value of that type taken out of a finite, user-configurable dictionary of values. For instance, for fuzzable type integer, RESTler might use a small dictionary with the values 0, 1, and -10, while for fuzzable type string, a dictionary could be defined with the values "sampleString", the empty string, and a very long fixed string. The function RENDER generates all possible such combinations (line 28). Each combination thus corresponds to a fully-defined request newReq (line 29) which is HTTP-syntactically correct. The function RENDER then executes this new request sequence (line 31), and checks its response: if the response has a valid status code, the new request sequence is valid and retained (line 33); otherwise, it is discarded and the received error code is logged for analysis and debugging.

More precisely, the function EXECUTE executes each request in a sequence one by one, each time checking that the response is valid, extracting and memoizing dynamic objects (if any), and providing those in subsequent requests in the sequence if needed, as determined by the dependency analysis; the response returned by function EXECUTE in line 31 refers to the response received for the last, newly-appended request in the sequence. Note that if a request sequence produces more than one dynamic object of a given type, the function EXECUTE will memoize all of those objects, but will provide them later when needed by subsequent requests in the exact order in which they are produced; in other words, the function EXECUTE will not try different orderings of such objects. If a dynamic object is passed as argument to a subsequent request and is "destroyed" after that request, i.e., it becomes unusable later on, RESTler will detect this by receiving an invalid status code (outside the 200 range) when attempting to reuse that unusable object, and will then discard that request sequence.

By default, the function RENDER of Figure 3 generates all possible combinations of dictionary values for every request with several fuzzable types (see line 28). For large dictionaries, this may result in astronomical numbers of combinations. In that case, a more scalable option is to randomly sample each dictionary for one (or a few) values, or to use combinatorial-testing algorithms [10] for covering, say, every dictionary value, or every pair of values, but not every k-tuple. In the experiments reported later, we used small dictionaries and the default RENDER function shown in Figure 3.

The function EXTEND of Figure 3 generates all request sequences of length n+1 whose dependencies are satisfied. Since n is incremented at each iteration of the main loop of line 8, the overall algorithm performs a breadth-first search (BFS) in the search space defined by all possible request sequences. In Section V, we report experiments performed also with two additional search strategies: BFS-Fast and RandomWalk.

BFS-Fast. In function EXTEND, instead of appending every request to every sequence, every request is appended to at most one sequence. This results in a smaller set newSeqSet which covers (i.e., includes at least once) every request but
does not generate all valid request sequences. Like BFS, BFS-Fast still exercises every executable request type at each iteration of the main loop in line 8: it still provides full grammar coverage but with fewer request sequences, which allows it to go deeper faster than BFS.

RandomWalk. In function EXTEND, the two loops of line 16 and line 17 are eliminated; instead, the function now returns a single new request sequence whose dependencies are satisfied, and generated by randomly selecting one request sequence seq in seqSet and one request in reqSet. (The function randomly chooses such a pair until all the dependencies of that pair are satisfied.) This search strategy will therefore explore the search space of possible request sequences deeper more quickly than BFS or BFS-Fast. When RandomWalk can no longer extend the current request sequence, it restarts from scratch from an empty request sequence. (Since it does not memoize past request sequences between restarts, it might regenerate the same request sequence again in the future.)

IV. IMPLEMENTATION

We have implemented RESTler in 3,151 lines of modular python code split into: the parser and compiler module, the core fuzzing runtime module, and the garbage collector (GC) module. The parser and compiler module is used to parse a Swagger specification and to generate the RESTler grammar describing how to fuzz a target service. (In the absence of a Swagger specification, the user could directly provide the RESTler grammar.) The core fuzzing runtime module implements the algorithm of Figure 3 and its variants. It renders API requests, processes service-side responses to retrieve values of the dynamic objects created, and analyzes service-side feedback to decide which requests should be reused in future generations while composing new request sequences. Finally, the GC runs as a separate thread that tracks the creation of the dynamic objects over time and periodically deletes aging objects that exceed some user-defined limit (see Section VII).

A. Using RESTler

RESTler is a command-line tool that takes as input a Swagger specification, service access parameters (e.g., IP, port, authentication), the mutations dictionary, and the search strategy to use during fuzzing. After compiling the Swagger specification, RESTler displays the number of endpoints discovered and the list of resolved and unresolved dependencies, if any. In case of unresolved dependencies, the user may provide additional annotations or resource-specific mutations (see Section VII) and re-run this step to resolve them. Alternatively, the user may choose to start fuzzing right away and RESTler will treat unresolved dependencies in consumer parameters as restler_fuzzable string primitives. During fuzzing, RESTler reports each bug, currently defined as a 500 HTTP status code (500 "Internal Server Error") received after executing a request sequence, as soon as it is found.

B. Current Limitations

Currently, RESTler does not support requests for API endpoints with server-side redirects (e.g., 301 "Moved Permanently", 303 "See Other", and 307 "Temporary Redirect"). Furthermore, RESTler currently can only find bugs defined as unexpected HTTP status codes. Such a simple test oracle cannot detect vulnerabilities that are not visible through HTTP status codes (e.g., "Information Exposure" and others). Despite these limitations, RESTler has already found confirmed bugs in a production-scale open-source application and in several Microsoft Azure and Office365 services, as will be discussed in Sections VI and VII.

V. EVALUATION

We present experimental results obtained with RESTler that answer the following questions:
Q1: Are both inferring dependencies among request types and analyzing dynamic feedback necessary for effective automated REST API fuzzing? (Section V-B)
Q2: Are tests generated by RESTler exercising deeper service-side logic as sequence length increases? (Section V-C)
Q3: How do the three search strategies implemented in RESTler compare across various APIs? (Section V-D)

We answer the first question (Q1) using a simple blog posts service with a REST API. We answer (Q2) and (Q3) using GitLab, an open-source, production-scale¹ web service for self-hosted Git. We conclude the evaluation by discussing in Section V-E how to bucketize (i.e., group together) the numerous bugs that can be reported by RESTler in order to facilitate their analysis.

¹ GitLab [13] is used by more than 100,000 organizations, has millions of users, and has currently a 2/3 market share of the self-hosted Git market [20].

A. Experimental Setup

Blog Posts Service. We answer (Q1) using a simple blog posts service, written in 189 lines of python code using the Flask web framework [12]. Its functionality is exposed over a REST API with a Swagger specification shown in Figure 1. The API contains five request types: (i) GET on /posts: returns all blog posts currently registered; (ii) POST on /posts: creates a new blog post (body: the text of the blog post); (iii) DELETE /posts/id: deletes a blog post; (iv) GET /posts/id: returns the body and the checksum of an individual blog post; and (v) PUT /posts/id: updates the contents of a blog post (body: the new text of the blog post and the checksum of the older version of the blog post's text).

To model an imaginary subtle bug, at every update of a blog post (PUT request with body text and checksum) the service checks if the checksum provided in the request matches the recorded checksum for the current blog post, and if it does, an uncaught exception is raised. Thus, this bug will be triggered and detected only if dependencies on dynamic objects shared across requests are taken into account during test generation.

GitLab. We answer (Q2) and (Q3) using GitLab, an open-source web service for self-hosted Git. GitLab's back-end is written in over 376K lines of ruby code using ruby-on-rails [35] and its functionality is exposed through a REST
API [14]. For our deployment, we apply the following configuration settings: we use Nginx to proxypass the Unicorn web server and configure 15 Unicorn workers limited to up to 2GB of physical memory; we use postgreSQL for persistent storage configured with a pool of 10 workers; we use GitLab's default configuration for sidekiq queues and redis workers. According to GitLab's deployment recommendations, such configuration should scale up to 4,000 concurrent users [15].

Fuzzing Dictionaries. For the experiments in this section, we use the following dictionaries for fuzzable primitive types: string has possible values "sampleString" and "" (empty string); integer has possible values "0" and "1"; boolean has possible values "true" and "false".

All experiments were run on Ubuntu 16.04 Microsoft Azure VMs configured with eight Intel(R) Xeon(R) E5-2673 v3 @ 2.40GHz CPU cores and 56GB of physical memory.

B. Techniques for Effective REST API Fuzzing

In this section, we report results with our blog posts service to determine whether both (1) inferring dependencies among request types and (2) analyzing dynamic feedback are necessary for effective automated REST API fuzzing (Q1). We choose a simple service in order to clearly measure and interpret the testing capabilities of the two core techniques being evaluated. Those capabilities are evaluated by measuring service code coverage and client-visible HTTP status codes.

Specifically, we compare results obtained when exhaustively generating all possible request sequences of length up to three, with three different test-generation algorithms:
1) RESTler ignores dependencies among request types and treats dynamic objects – such as post_id and checksum – as fuzzable primitive type string objects, while still analyzing dynamic feedback.
2) RESTler ignores service-side dynamic feedback and does not eliminate invalid sequences during the search, but still infers dependencies among request types and generates request sequences satisfying those.
3) RESTler uses the algorithm of Figure 3 using both dependencies among request types and dynamic feedback.

Figure 4 shows the number of tests, i.e., request sequences, up to maximum length 3, generated by each of these three algorithms, from left to right. The top plots show the cumulative code coverage measured in lines of code over time, as well as when the sequence length increases. The bottom plots show the cumulative number of HTTP status codes received.

Code Coverage. First, we observe that without considering dependencies among request types (Figure 4, top left), code coverage is limited to up to 130 lines and there is no increase over time, despite increasing the length of request sequences. This illustrates the limitations of using a naive approach to test a service where values of dynamic objects like id and checksum cannot be randomly guessed or picked among values in a small predefined dictionary. In contrast, by inferring dependencies among requests and by processing service responses RESTler achieves an increase in code coverage up to 150 lines of code (Figure 4, top center and right).

Second, we see that without considering dynamic feedback to prune invalid request sequences in the search space (Figure 4, top center) the number of tests generated grows quickly, even for a simple API. Specifically, without considering dynamic feedback (Figure 4, top center), RESTler produces more than 4,600 tests that take 1,750 seconds and cover about 150 lines of code. In contrast, by considering dynamic feedback (Figure 4, top right), the state space is significantly reduced and RESTler achieves the same code coverage with fewer than 800 test cases and only 179 seconds.

HTTP Status Codes. We make two observations. First, focusing on 40X status codes, we notice a high number of 40X responses when ignoring dynamic feedback (Figure 4, bottom center). This indicates that without considering service-side dynamic feedback, the number of possible invalid request sequences grows quickly. In contrast, considering dynamic feedback dramatically decreases the percentage of 40X status codes from 60% to 26% without using dependencies among request types (Figure 4, bottom left) and to 20% with using dependencies among request types (Figure 4, bottom right). Moreover, when using dependencies among request types (Figure 4, bottom right), we observe the highest percentage of 20X status codes (approximately 80%), indicating that RESTler then exercises a larger part of the service logic – also confirmed by coverage data (Figure 4, top right).

Second, when ignoring dependencies among request types, we see that no 500 status codes are detected (Figure 4, bottom left), while RESTler finds a handful of 500 status codes when using dependencies among request types (see Figure 4, bottom center and bottom right). These 500 responses are triggered by the unhandled exception we planted in our blog posts service after a PUT blog update request with a checksum matching the previous blog post's body (see Section V-A). When ignoring dependencies among request types, RESTler misses this bug (Figure 4, bottom left). In contrast, when analyzing dependencies across request types and using the checksum returned by a previous GET /posts/id request in a subsequent PUT /posts/id update request with the same id, RESTler does trigger the bug. Furthermore, when additionally using dynamic feedback, the search space is pruned while preserving this bug, which is then found with the least number of tests (Figure 4, bottom right).

Overall, these experiments illustrate the complementarity between utilizing dependencies among request types and using dynamic feedback, and show that both are needed for effective REST API fuzzing.

C. Deeper Service Exploration

In this section, we use GitLab to determine whether tests generated by RESTler exercise deeper service-side logic as sequence length increases (Q2). We perform individual experiments on six groups of GitLab APIs related to usual operations with commits, branches, issues and notes, repositories and repository files, groups and group membership, and projects. Table I shows the total number of requests in each of the six target API groups and presents experimental results obtained
[Figure 4 plots, three panels from left to right. Top row: code coverage (LOC) against time (seconds) and number of test cases, with sequence-length increases marked. Bottom row: occurrences over time of each HTTP status code (200, 201, 400, 404, 500) against time (seconds) and number of test cases.]
Fig. 4: Blog Posts Service Code Coverage and HTTP Status Codes Over Time. Shows the increase in code coverage over time (top)
and the cumulative number of HTTP status codes received over time (bottom), for the simple blog posts service. Left: RESTler ignores
dependencies among request types. Center: RESTler ignores dynamic feedback. Right: RESTler utilizes both dependencies among request
types and dynamic feedback. When leveraging both techniques, RESTler achieves the best code coverage and finds the planted 500 “Internal
Server Error” bug with the least number of tests.
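The bottom plots of Figure 4 count the HTTP status codes received over time. A minimal sketch of that bookkeeping follows. It assumes a hypothetical `StatusTally` helper and hypothetical blog-post endpoints (neither is taken from RESTler or from the service used in the paper): each response is counted, a 500 is logged as a bug together with its request sequence, and only 2xx responses leave a sequence eligible for extension.

```python
from collections import Counter


class StatusTally:
    """Hypothetical helper: cumulative HTTP status-code counts for a fuzzing run."""

    def __init__(self):
        self.counts = Counter()  # status code -> number of occurrences so far
        self.bugs = []           # request sequences whose last response was a 500

    def record(self, sequence, status):
        """Count `status`, the response to the last request of `sequence`.

        Returns True if the sequence may be extended further (2xx response),
        False if it should be dropped (4xx) or was logged as a bug (500).
        """
        self.counts[status] += 1
        if status == 500:
            self.bugs.append(tuple(sequence))
            return False
        return 200 <= status < 300


# Hypothetical observations: (request sequence, status of its last request).
observed = [
    (["POST /posts"], 201),
    (["POST /posts", "GET /posts/{id}"], 200),
    (["GET /posts/{id}"], 404),
    (["POST /posts", "PUT /posts/{id}"], 500),
]

tally = StatusTally()
extendable = [seq for seq, status in observed if tally.record(seq, status)]
print(dict(tally.counts))
print(len(extendable), "extendable sequences,", len(tally.bugs), "bug(s)")
```

Dropping sequences that end in a 4xx response is a simplified stand-in for the dynamic feedback evaluated in this section; the actual pruning rule is the one given by the algorithm of Figure 3.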

with the test-generation algorithm of Figure 3 using BFS. For each experiment we run RESTler with a 5-hours timeout and limit the number of fuzzable primitive-type combinations to a maximum of 1,000 combinations per request. Between experiments, we reboot the entire GitLab service to restart from the same initial state. For each API group, as time goes by, Table I shows the increase (going down) in the sequence length, code coverage, tests executed, seqSet size, and the number of dynamic objects created, until the 5-hours timeout is reached.

API       Total     Seq.  Coverage  Tests  seqSet  Dynamic
          Requests  Len.  Increase         Size    Objects
Commits   11        1     598       1      1       1
                    2     1108      7      5       10
                    3     1196      250    46      521
                    4     1760      2220   1341    6577
                    5     1760      3667   20679   12518
Branches  7         1     598       1      1       1
                    2     1089      8      6       11
                    3     1172      58     44      107
                    4     1182      576    387     1279
                    5     1185      3644   5528    9336
Issues    22        1     816       37     37      37
                    2     1163      2444   1839    4245
                    3     1163      4156   15658   8870
Repos     10        1     598       1      1       1
                    2     1117      97     65      206
                    3     1181      5153   2194    15472
Groups    50        1     887       39     39      38
                    2     1177      3508   3360    5204
                    3     1177      4817   79518   8946
Projects  48        1     934       42     41      38
                    2     1192      1870   1781    3343
                    3     1203      3226   18173   7374

TABLE I: Testing Common GitLab APIs with RESTler. Shows the increase in sequence length, code coverage, tests executed, seqSet size, and the number of dynamic objects being created using BFS, until a 5-hours timeout is reached. Longer request sequences gradually increase service-side code coverage.

Code Coverage. We collect code coverage data by configuring Ruby’s Class: TracePoint hook to trace GitLab’s service/lib folder. Table I shows the cumulative code coverage achieved after executing all the request sequences generated by RESTler for each sequence length, or until the 5-hours timeout expires. The results are incremental on top of 16,836 lines of code executed during service boot.
From Table I, we can see that longer sequence lengths generally lead to increased service-side code coverage. The effect is most pronounced for small sequence lengths, as some of the service functionality can only be exercised after at least a few requests are executed. As an example, consider the GitLab functionality of “selecting a commit”. According to GitLab’s specification, selecting a commit requires two dynamic objects, a project-id and a commit-id, and the following dependency of requests is implicit: (1) a user needs to create a project, (2) use the respective project-id to post a new commit, and then (3) select the commit using its commit-id and the respective project-id. Clearly, this operation can only be performed by sequences of three requests or more. For the Commits API, note the gradual increase in coverage from 598 to 1,108 to 1,196 lines of code for sequence lengths of one,

                           BFS                      BFS-Fast                 RandomWalk
API       Total     Time   Len.  Coverage  seqSet   Len.  Coverage  seqSet   Len. (restarts)  Coverage  seqSet
          Requests  (hrs)
Commits   11 (*11)  1      4     1202               7     1697               13 (16)          1285
                    3      5     1760               9     1731               13 (35)          1295
                    5      5     1760      20679    12    1731      33       13 (56)          1303      1
Branches  7 (*2)    1      5     1182               21    1154               15 (24)          1182
                    3      5     1185               37    1178               19 (92)          1187
                    5      5     1185      5528     47    1178      11       22 (158)         1208      1
Issues    22 (*82)  1      2     1150               2     1086               10 (1)           770
                    3      3     1163               4     1551               10 (1)           770
                    5      3     1163      15658    5     1570      26       16 (2)           847       1
Repos     10 (*24)  1      3     1127               5     1141               10 (29)          1195
                    3      3     1127               7     1141               13 (88)          1231
                    5      3     1181      2194     8     1161      64       13 (142)         1231      1
Groups    50 (*2)   1      2     961                6     1275               19 (41)          1167
                    3      3     1177               11    1275               19 (120)         1250
                    5      3     1177      79518    14    1275      130      22 (186)         1283      1
Projects  48 (*4)   1      2     1006               5     1318               4 (3)            889
                    3      2     1053               11    1319               22 (31)          1024
                    5      3     1203      18173    15    1319      171      22 (45)          1273      1

TABLE II: Comparison of BFS, BFS-Fast and RandomWalk over Time. Shows the maximum sequence length, the increase in lines of
code covered (excluding service-boot coverage), and the seqSet size with each search strategy after 1, 3, and 5 hours. The second column
shows the total number of requests in each API along with the average feasible request renderings (*). BFS-Fast and RandomWalk reach deeper
request sequences and maintain a much smaller seqSet than BFS, while the strategy with the highest coverage varies across APIs.
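To make the seqSet sizes in Table II easier to interpret, the following sketch contrasts how the three strategies could extend the sequence set from one generation to the next. It is an illustration under stated assumptions, not the algorithm of Figure 3: dependency checks and request rendering are omitted, the round-robin choice in `extend_bfs_fast` is an arbitrary way of appending each request to at most one sequence, and `extend_random_walk` assumes that a single sequence is kept and extended with one randomly chosen request, which is one reading consistent with the seqSet size of 1 reported for RandomWalk.

```python
import random


def extend_bfs(seq_set, requests):
    """BFS: append every request to every sequence in the current set."""
    return [seq + [req] for seq in seq_set for req in requests]


def extend_bfs_fast(seq_set, requests):
    """BFS-Fast: append each request to one sequence only (round-robin, an assumption)."""
    return [seq_set[i % len(seq_set)] + [req] for i, req in enumerate(requests)]


def extend_random_walk(seq_set, requests, rng):
    """RandomWalk (assumed reading): extend one random sequence with one random request."""
    return [rng.choice(seq_set) + [rng.choice(requests)]]


def run(extend, requests, generations):
    """Return the seqSet size after each generation, starting from the empty sequence."""
    seq_set = [[]]
    sizes = []
    for _ in range(generations):
        seq_set = extend(seq_set, requests)
        sizes.append(len(seq_set))
    return sizes


REQUESTS = ["A", "B", "C"]  # hypothetical request types
rng = random.Random(0)

bfs_sizes = run(extend_bfs, REQUESTS, 3)
fast_sizes = run(extend_bfs_fast, REQUESTS, 3)
walk_sizes = run(lambda s, r: extend_random_walk(s, r, rng), REQUESTS, 3)
print(bfs_sizes, fast_sizes, walk_sizes)
```

With three request types, the BFS set grows as 3, 9, 27, while the BFS-Fast set stays at the number of request types and the RandomWalk set stays at one sequence. This is the qualitative difference visible in the seqSet columns of Table II.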

two, and three, respectively. Most notably, for the Branches API, service-side code coverage keeps gradually increasing for sequences of length up to five, and reaches 1,185 lines when the 5-hours limit expires.
Tests, Sequence Sets, and Dynamic Objects. In addition to code coverage, Table I also shows the increase in the number of tests executed, the size of seqSet after the RENDER function returns (line 10 of Figure 3), and the number of dynamic objects created by RESTler. All those numbers grow quickly since the search space also grows quickly due to the exhaustive nature of the BFS search strategy.
Nevertheless, we emphasize that without the two key techniques evaluated in Section V-B this growth would be much worse. For instance, for the Commits API, the seqSet size is 20,679 and there are 12,518 dynamic objects created by RESTler for sequences of length up to five. By comparison, since the Commits API has 11 request types with an average of 4 rendering combinations, the number of all possible rendered request sequences of up to length five is already more than 164 million (44^5 for sequences of length five alone), and a naive brute-force enumeration of those would already be intractable. Still, even with the two core techniques used in RESTler, the search space explodes quickly, and we evaluate other search strategies next.

D. Search Strategies

We now present results of experiments comparing the BFS, BFS-Fast, and RandomWalk search strategies defined in Section III (Q3). For each search strategy, Table II shows the maximum sequence length, the increase in lines of code covered (excluding service-boot coverage) after 1, 3, and 5 hours, and the size of the seqSet when the 5-hours timeout is reached. For the RandomWalk search strategy the total number of restarts is also shown in parentheses.
First, we compare BFS with BFS-Fast. We observe that after five hours, BFS achieves better coverage than BFS-Fast in Commits, Branches, and Repos. These groups of APIs have relatively fewer requests and BFS delivers better coverage by exercising all feasible request sequences. However, BFS does not scale well in APIs with relatively more requests, such as Issues, Groups, and Projects. As shown in Table II, after 5 hours for Issues, Groups, and Projects, BFS is still exploring sequences of length 3 while BFS-Fast is exploring sequences of length 5, 14, and 15, respectively. BFS-Fast scales better in APIs with many requests because, unlike BFS, it does not explore all feasible request sequences but instead appends each request to at most one sequence in each generation. BFS-Fast maintains a smaller seqSet, explores deeper sequences, and grows coverage faster in Issues, Groups, and Projects.
We now compare BFS with RandomWalk. By construction, RandomWalk does not guarantee full grammar coverage since it appends each request to one random sequence in each generation. As shown in Table II, RandomWalk maintains a small seqSet at any time, by construction. Furthermore, after 5 hours RandomWalk explores considerably deeper request sequences compared to BFS and, in most cases, compared to BFS-Fast. RandomWalk also delivers the best coverage after 5 hours in Branches, Repos, and Groups.
On the other hand, in Issues, we observe that after 5 hours RandomWalk explores sequences of length 16 and the coverage increase is 847 lines. In the same time-frame, BFS explores sequences of length 3 but the coverage increase is 1,163 and BFS-Fast explores sequences of length 5 with a

coverage increase of 1,570. Compared to all other APIs shown in Table II, Issues have 82 feasible request renderings on average. This is relatively large and in such a case, with many feasible request renderings, the breadth of the search achieved by RandomWalk is small (e.g., after 5 hours there are only 2 restarts). Consequently, the search remains focused on a very restricted subspace, which reflects poorly on coverage.
In practice, both controlling the size of seqSet, when facing broader search spaces due to large APIs with many requests or when reaching greater depths, and maintaining some breadth when extending request sequences seem key to delivering better code coverage. Nevertheless, the ultimate goal is to find bugs, and maximizing code coverage is just a heuristic to try to reach that goal.

E. Bug Bucketization

Before discussing real bugs found with RESTler, we introduce a bucketization scheme to cluster similar 500 “Internal Server Errors”. When fuzzing, different instances of the same bug are often found repeatedly. Since all the bugs found have to be inspected by the user, it is therefore important in practice to facilitate this analysis by identifying likely-redundant instances of the same unique bug.
In our context, we define a bug as a 500 HTTP status code being received after executing a request sequence. Thus, every bug found is associated with the request sequence that was executed to find it. Given this property, we use the following bucketization procedure for the bugs found by RESTler:
    Whenever a new bug is found, we compute all non-empty suffixes of its non-rendered request sequence² (starting with the smallest one) and check whether some suffix is a previously-recorded sequence leading to a bug found earlier. If there is a match, the new bug is added to the bucket of that previous bug. Otherwise, a new bucket is created with the new bug and its request sequence.
² A request sequence of length n has n suffixes of length 1, 2, ..., n.
When using BFS or BFS-Fast, this bucketization scheme will identify each bug by the shortest sequence necessary to find it.
Table III shows the sets of bug buckets found by each search strategy, after five hours, in each GitLab API group. To demonstrate the overlap between the bugs reported by each method, the last two columns show the intersection and the union of the bug buckets. In the context of these experiments, RESTler found 22 new unique bugs, after running each search strategy for 5 hours on each API group.

API       BFS  BFS-Fast  RandomWalk  Intersection  Union
Commits   5    1         5           1             5
Branches  7    7         7           5             8
Issues    0    1         1           0             1
Repos     2    3         3           2             3
Groups    0    0         2           0             2
Projects  2    1         3           1             3
Total     16   13        21          9             22

TABLE III: Bug Buckets found by BFS, BFS-Fast, and RandomWalk after Five Hours. Shows the sets of bugs found by each search strategy in each API. In total, RESTler found 22 new bugs.

RandomWalk stands out in Table III by finding the most bugs: 21 compared to 16 and 13 for BFS and BFS-Fast respectively. It is particularly intriguing that RandomWalk finds as many bugs as BFS and BFS-Fast combined in the Commits and Issues APIs, even though in these APIs RandomWalk delivers relatively little coverage. After 5 hours, in Commits, RandomWalk finds as many bugs as BFS and more than BFS-Fast. At the same time, RandomWalk delivers less code coverage than each of BFS and BFS-Fast in Commits (see Table II). Similarly, RandomWalk finds 1 bug in Issues, while BFS finds none and BFS-Fast also finds one. Yet, again, RandomWalk achieves less code coverage than each of BFS and BFS-Fast in Issues. The differences between BFS and BFS-Fast are less striking. BFS finds more bugs in Commits, while BFS-Fast finds more bugs in Issues and Repos.
Overall, within the 5-hours time-frame of our experiments, RandomWalk finds more bugs than BFS or BFS-Fast despite the fact that it does not always deliver the best coverage. It is unclear how this generalizes to longer fuzzing sessions or to other APIs. Yet, it becomes apparent that coverage increase should not always dictate the selection of a search strategy because different search strategies may be complementary within a large search space. Next, we discuss details of the bugs found with RESTler in GitLab and the total number of bugs found when running longer fuzzing experiments.

VI. NEW BUGS FOUND IN GITLAB

During all our fuzzing experiments with RESTler on our local GitLab deployment, we found a total of 28 new unique bugs. All bugs were easily reproducible, disclosed to GitLab developers, confirmed and fixed. Due to space limitations, we describe only 2 of these bugs, to give the reader a flavor of what those bugs look like and how they were found. (See [16], [17], [18], [19] for other examples of bugs found.)
Example 1: Bug in Commits API. One of the bugs found by RESTler in the Commits API is triggered when a user tries to cherry-pick a commit to a branch with an empty name. Due to incomplete input validation, an invalid branch name can be passed between two different layers of abstraction as follows: the Ruby code that checks if a target branch exists invokes a native C function whose return value is expected to be either NULL or an existing entry. However, if an unmatched entry type (e.g., an empty string) is passed to the native function, an exception is raised. This exception is unhandled by the higher-level Ruby code, and therefore it causes a 500 “Internal Server Error”. The bug can be reproduced by (1) creating a project, (2) creating a new branch (in addition to the master branch, which is created by default), (3) posting a valid commit with action “create” in the branch created in (2), and (4) cherry-picking the commit to a branch whose name is set to the empty string.
Example 2: Bug in Branches API. Another bug, found by RESTler in the Branches API, is triggered when a user

tries to edit a branch of a recently deleted project. The bug is due to invalid serialization of operations which results in a database entry update using an invalid foreign key of a deleted project. Since the project-id (foreign key) is not present in the respective “projects” table, a PG::ForeignKeyViolation exception causes a 500 “Internal Server Error”. The bug can be reproduced by (1) creating a project, (2) creating a branch, (3) deleting the project created in (1), and (4) quickly editing the branch of the deleted project.
From the above bug descriptions, we see a two-fold pattern. First, RESTler produces a sequence that exercises the target service deep enough so that it reaches a particular valid “state”. Second, while the service is in such a state, RESTler produces an additional request with an unexpected fuzzed value (e.g., an empty string) or an unexpected action (e.g., edit a branch of a recently deleted project). Most bugs found by RESTler require a combination of these two features in order to be found.

VII. EXPERIENCES WITH PUBLIC CLOUD SERVICES

In this section, we describe our preliminary experiences running RESTler on three Azure [29] services and one Microsoft Office365 [30] service. The services we fuzzed primarily perform resource management and real-time data aggregation. Swagger specifications for these services are publicly available and published by Microsoft on GitHub.
While still in an early stage of development, RESTler found new bugs in all of these services. These bugs include mis-handled invalid inputs (e.g., using a wrong ID or enum value), operations executed in invalid states (e.g., updating a resource that no longer exists), and inconsistent parameter validations (e.g., using a valid request body with incorrect metadata). Although we cannot disclose detailed descriptions of these bugs, we emphasize that all bugs found by RESTler so far have been confirmed and fixed by Microsoft service developers. Indeed, “500 Internal Server Errors” are server state corruptions that may severely damage service health and security: it is safer to fix these rather than risk a live incident with unknown consequences.
During this effort, we faced a number of challenges unique to public cloud services, including resource quota limitations, short-lived access tokens, and complex API dependencies beyond the canonical REST API structure with application-specific resource values and naming schemes. We describe the extensions made to RESTler to address these challenges.
Resource Quotas. Production services that run in public clouds are deployed with default resource quotas. Once quotas are reached, RESTler’s core algorithm will continue to try request sequences containing requests that can no longer succeed due to exceeded quotas (since these requests were valid in prior tests and generated lots of new resources), which impedes progress. This challenge is unique to public cloud deployments, contrary to self-contained deployments where one can easily control and reconfigure resource quotas. To address this problem, we implemented a garbage collector (GC) in RESTler. The GC runs as a separate thread that monitors the creation of dynamic resources over time and periodically deletes dynamic objects that are no longer used in order to avoid exceeding resource quotas. This allows RESTler to continuously test new sequences for hours or days without hitting resource-quota-related errors.
Short-lived Access Tokens. Unlike in self-contained deployments where an admin can pre-populate static or long-lived authentication tokens, public cloud services use short-lived, refreshable authentication tokens. Usually, a public endpoint, accessible with some type of static credentials (e.g., a username-password pair or a master token) and service-specific logic, generates fresh, short-lived access tokens. The latter are added in the header of HTTPS requests. Since different services may require custom logic to access their public authentication endpoints, RESTler provides an authentication hook which periodically executes a user-provided piece of code (e.g., a script) and propagates fresh values in the pool of refreshable authentication tokens.
Application-specific Naming Schemes. As discussed in Section II, RESTler performs a light-weight static analysis of a Swagger specification to infer dependencies among requests of the target REST API. However, part of a target API may not be fully REST compliant, or the specification may be incomplete, and consequently the inferred dependencies will also be incomplete. To address this challenge, RESTler supports annotations, which can be added directly to the specification (as Swagger extensions), in order to explicitly declare dependencies, as well as resource-specific mutations, which can be used for the creation of resources that require some custom format (e.g., an IP address). These two features have proven useful in practice because Azure services use PUT requests to create resources whose user-provided names are passed as URL parameters and, after successful creation, are also returned in the response. For this scenario, one can use resource-name-specific mutations to indicate that a PUT request should create a resource named in a custom format, and then use that name to identify the corresponding dynamic object in subsequent requests.

VIII. RELATED WORK

HTTP Fuzzers. Since REST API requests and responses are transmitted over the HTTP protocol, HTTP-fuzzers can be used to fuzz REST APIs. Fuzzers like Burp [8], Sulley [38], BooFuzz [7], or the commercial AppSpider [4] and Qualys’s WAS [34], can capture/replay HTTP traffic, parse HTTP requests/responses and their contents (like embedded JSON data), and then fuzz those using either pre-defined heuristics [4], [34] or user-defined rules [38], [7]. Tools to capture, parse, fuzz, and replay HTTP traffic have recently been extended to leverage Swagger specifications in order to parse HTTP requests and guide their fuzzing [4], [34], [41], [3]. Compared to those tools, the main originality of RESTler is the lightweight static analysis of Swagger specifications in order to infer dependencies among request types, which in turn allows RESTler to automatically generate sequences of requests that exercise the business logic exposed by the API in a stateful manner and without pre-recorded HTTP traffic.
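The dependency inference that sets RESTler apart from HTTP fuzzers can be illustrated with a small producer-consumer matching sketch. Everything in it is a hypothetical simplification: the `SPEC` dictionary stands in for a Swagger specification, the endpoint and field names are invented for illustration (loosely following the project and commit example of Section V-C), and `infer_dependencies` is not RESTler's actual analysis.

```python
import re

# Hypothetical, simplified specification: each request lists the
# fields that its response produces.
SPEC = {
    "POST /projects": {"produces": ["project_id"]},
    "POST /projects/{project_id}/commits": {"produces": ["commit_id"]},
    "GET /projects/{project_id}/commits/{commit_id}": {"produces": []},
}


def consumed(request):
    """Names of the path parameters a request needs, e.g. project_id."""
    return re.findall(r"\{(\w+)\}", request)


def infer_dependencies(spec):
    """Map each request to the set of requests producing the values it consumes."""
    producers = {}
    for request, info in spec.items():
        for name in info["produces"]:
            producers.setdefault(name, request)  # first producer wins
    return {
        request: {producers[name] for name in consumed(request) if name in producers}
        for request in spec
    }


deps = infer_dependencies(SPEC)
for request, needs in deps.items():
    print(request, "<-", sorted(needs))
```

In this toy specification, the GET request depends on both POST requests, so it can only appear in a sequence of length three or more. That is the kind of ordering constraint a stateful fuzzer needs before it can generate meaningful request sequences.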

Feedback-directed Test Generation. The dynamic feedback RESTler uses to prune invalid requests from the search space (line 32 in Figure 3) is similar to the feedback used in Randoop [32]. However, the Randoop search strategy (in particular, search pruning and ordering) is different from the three search strategies considered in our work, and the Randoop optimizations related to object equality and filtering are not relevant in our context. RESTler’s dependency analysis is also related to the analysis of type dependencies performed by the Randoop algorithm [32] for typed object-oriented programs. However, unlike in the Randoop work, dynamic objects in Swagger specifications are implicitly declared and untyped (e.g., authentication tokens or service-specific resources). When a Swagger specification is not complete or RESTler cannot infer object types correctly, RESTler supports annotations (see Section VII) that the user can use to fix and control RESTler’s behavior. In the future, it would be interesting to allow richer user annotations in order to easily specify complex service-specific types as well as their properties, in the spirit of code contracts [28], [5].
Model-based Testing. Our BFS-Fast search strategy is inspired by test generation algorithms used in model-based testing [42], whose goal is to generate a minimum number of tests covering, say, every state and transition of a finite-state machine model (e.g., see [43]) in order to generate a test suite to check conformance of a (blackbox) implementation with respect to that model. BFS-Fast is also related to algorithms for generating tests from an input grammar while covering all its production rules [26]. Indeed, in our context, BFS-Fast provides, by construction, a full grammar coverage up to the given current sequence length. The number of request sequences it generates is not necessarily minimal, but that number was always small, hence manageable, in our experiments so far.
Grammar-based Fuzzing. General-purpose grammar-based fuzzers like Peach [33] and SPIKE [37], among others [39], are not Swagger-specific but can also be used to fuzz REST APIs. With these tools, however, the user has to manually construct an API-specific input grammar, often encoded directly by code specifying what and how to fuzz, similar to the code shown on the right of Figure 2. By contrast, RESTler automatically generates an input grammar from a Swagger specification, and its fuzzing rules are determined separately and automatically by the algorithm of Figure 3.
Automatically learning input grammars from input samples is another complementary research area [25], [6], [24], [36]. RESTler currently relies on a Swagger specification to represent a service’s input space and it learns automatically how to prune invalid request sequences by analyzing service responses. Still, a Swagger specification could be further refined given representative unit tests or live traffic in order to focus the search towards specific areas of the input space. For REST services without a Swagger specification, it would be worth investigating how to automatically infer it by using machine learning on runtime traffic logs or static analysis on the code implementing the API.
Whitebox Fuzzing. Grammar-based fuzzing can also be combined [27], [21] with whitebox fuzzing [23], which uses dynamic symbolic execution [22], [9], constraint generation and solving in order to generate new tests exercising new code paths. In contrast, RESTler is currently purely blackbox: the inner workings of the service under test are invisible to RESTler which only sees REST API requests and responses. Since cloud services are usually complex distributed systems whose components are written in different languages, general symbolic-execution-based approaches seem problematic, but it would be worth exploring this option further. For instance, in the short term, RESTler could be extended to take into account alerts (e.g., assertion violations) reported in back-end logs in order to increase chances of finding interesting bugs and correlating them to specific request sequences.
Penetration Testing. In practice, the main technique used today to ensure the security of cloud services is the so-called “penetration testing”, or pen testing, which means security experts review the architecture, design and code of cloud services from a security perspective. Since pen testing is labor intensive, it is expensive and limited in scope and depth. Fuzzing tools like RESTler can partly automate the discovery of specific classes of security vulnerabilities, and are complementary to pen testing.

IX. CONCLUSION

RESTler is the first automatic tool for stateful fuzzing of cloud services through their REST APIs. While still in early stages of development, RESTler was able to find 28 bugs in GitLab and several bugs in each of the four Azure and Office365 cloud services tested so far. Although still preliminary, our results are encouraging. How general are these results? To find out, we need to fuzz more services through their REST APIs and check more properties to detect different kinds of bugs and security vulnerabilities. Indeed, unlike buffer overflows in binary-format parsers, use-after-free bugs in web browsers, or cross-site-scripting attacks in web-pages, it is still unclear what security vulnerabilities might hide behind REST APIs. While past human-intensive pen testing efforts targeting cloud services provide evidence that such vulnerabilities do exist, this evidence is still too anecdotal. New automated tools, like RESTler, are needed for more systematic answers. How many bugs can be found by fuzzing REST APIs? How security-critical will they be? This paper provides a clear path forward to answer these questions.

X. ACKNOWLEDGEMENTS

We thank William Blum, Dave Tamasi and David Molnar for their helpful comments, and the whole Microsoft Security Risk Detection team for their support. We also thank Albert Greenberg, Mark Russinovich and John Walton, from Microsoft Azure, for encouraging us to pursue this line of work. Finally, we thank the GitLab and Microsoft developers we interacted with, for graciously acknowledging, discussing and fixing the bugs found during this work.

REFERENCES

[1] S. Allamaraju. RESTful Web Services Cookbook. O’Reilly, 2010.
[2] Amazon. AWS. https://aws.amazon.com/.
[3] APIFuzzer. https://github.com/KissPeter/APIFuzzer.
[4] AppSpider. https://www.rapid7.com/products/appspider.
[5] M. Barnett, M. Fahndrich, and F. Logozzo. Embedded Contract Languages. In Proceedings of SAC-OOPS’2010. ACM, March 2010.
[6] O. Bastani, R. Sharma, A. Aiken, and P. Liang. Synthesizing Program Input Grammars. In Proceedings of PLDI’2017, pages 95–110. ACM, 2017.
[7] BooFuzz. https://github.com/jtpereyda/boofuzz.
[8] Burp Suite. https://portswigger.net/burp.
[9] C. Cadar, V. Ganesh, P. M. Pawlowski, D. L. Dill, and D. R. Engler. EXE: Automatically Generating Inputs of Death. In Proceedings of CCS’2006, 2006.
[10] D. M. Cohen, S. R. Dalal, J. Parelius, and G. C. Patton. The Combinatorial Design Approach to Automatic Test Generation. IEEE Software, 13(5), 1996.
[11] R. T. Fielding. Architectural Styles and the Design of Network-based Software Architectures. PhD Thesis, UC Irvine, 2000.
[12] Flask. Web development, one drop at a time. http://flask.pocoo.org/.
[13] GitLab. GitLab. https://about.gitlab.com.
[14] GitLab. GitLab API. https://docs.gitlab.com/ee/api/.
[15] GitLab. Hardware requirements. https://docs.gitlab.com/ce/install/requirements.html.
[16] GitLab. Sample Bug1. https://gitlab.com/gitlab-org/gitlab-ce/issues/50955.
[17] GitLab. Sample Bug2. https://gitlab.com/gitlab-org/gitlab-ce/issues/50265.
[18] GitLab. Sample Bug3. https://gitlab.com/gitlab-org/gitlab-ce/issues/50270.
[19] GitLab. Sample Bug4. https://gitlab.com/gitlab-org/gitlab-ce/issues/50949.
[20] GitLab. Statistics. https://about.gitlab.com/is-it-any-good/.
[21] P. Godefroid, A. Kiezun, and M. Y. Levin. Grammar-based Whitebox Fuzzing. In Proceedings of PLDI’2008, pages 206–215, 2008.
[22] P. Godefroid, N. Klarlund, and K. Sen. DART: Directed Automated Random Testing. In Proceedings of PLDI’2005, pages 213–223, 2005.
[23] P. Godefroid, M. Levin, and D. Molnar. Automated Whitebox Fuzz Testing. In Proceedings of NDSS’2008, pages 151–166, 2008.
[24] P. Godefroid, H. Peleg, and R. Singh. Learn&Fuzz: Machine Learning for Input Fuzzing. In Proceedings of ASE’2017, pages 50–59, 2017.
[25] M. Höschele and A. Zeller. Mining Input Grammars from Dynamic Taints. In Proceedings of ASE’2016, pages 720–725, 2016.
[26] R. Lämmel and W. Schulte. Controllable Combinatorial Coverage in Grammar-Based Testing. In Proceedings of TestCom’2006, 2006.
[27] R. Majumdar and R. Xu. Directed Test Generation using Symbolic Grammars. In Proceedings of ASE’2007, 2007.
[28] B. Meyer. Eiffel. Prentice-Hall, 1992.
[29] Microsoft. Azure. https://azure.microsoft.com/en-us/.
[30] Microsoft. Office. https://www.office.com/.
[31] S. Newman. Building Microservices. O’Reilly, 2015.
[32] C. Pacheco, S. Lahiri, M. D. Ernst, and T. Ball. Feedback-Directed Random Test Generation. In Proceedings of ICSE’2007. ACM, 2007.
[33] Peach Fuzzer. http://www.peachfuzzer.com/.
[34] Qualys Web Application Scanning (WAS). https://www.qualys.com/apps/web-app-scanning/.
[35] Ruby on Rails. Rails. http://rubyonrails.org.
[36] D. She, K. Pei, D. Epstein, J. Yang, B. Ray, and S. Jana. Neuzz: Efficient fuzzing with neural program learning. CoRR, abs/1807.05620, 2018.
[37] SPIKE Fuzzer. http://resources.infosecinstitute.com/fuzzer-automation-with-spike/.
[38] Sulley. https://github.com/OpenRCE/sulley.
[39] M. Sutton, A. Greene, and P. Amini. Fuzzing: Brute Force Vulnerability Discovery. Addison-Wesley, 2007.
[40] Swagger. https://swagger.io/.
[41] TnT-Fuzzer. https://github.com/Teebytes/TnT-Fuzzer.
[42] M. Utting, A. Pretschner, and B. Legeard. A Taxonomy of Model-Based Testing Approaches. Intl. Journal on Software Testing, Verification and Reliability, 22(5), 2012.
[43] M. Yannakakis and D. Lee. Testing Finite-State Machines. In Proceedings of the 23rd Annual ACM Symposium on the Theory of Computing, pages 476–485, 1991.
