Staged Information Flow For Javascript: Ravi Chugh Jeffrey A. Meister Ranjit Jhala Sorin Lerner

The document discusses staged information flow analysis for JavaScript to enforce security policies. It presents an approach to infer effects of JavaScript code on a website to ensure key properties like confidentiality and integrity are not violated. When code is dynamically loaded, information flow is propagated through known code to compute residual checks on remaining code.

Staged Information Flow for JavaScript ∗

Ravi Chugh Jeffrey A. Meister Ranjit Jhala Sorin Lerner

University of California, San Diego
{rchugh,jmeister,jhala,lerner}@cs.ucsd.edu

Abstract

Modern websites are powered by JavaScript, a flexible dynamic scripting language that executes in client browsers. A common paradigm in such websites is to include third-party JavaScript code in the form of libraries or advertisements. If this code were malicious, it could read sensitive information from the page or write to the location bar, thus redirecting the user to a malicious page, from which the entire machine could be compromised. We present an information-flow based approach for inferring the effects that a piece of JavaScript has on the website in order to ensure that key security properties are not violated. To handle dynamically loaded and generated JavaScript, we propose a framework for staging information flow properties. Our framework propagates information flow through the currently known code in order to compute a minimal set of syntactic residual checks that are performed on the remaining code when it is dynamically loaded. We have implemented a prototype framework for staging information flow. We describe our techniques for handling some difficult features of JavaScript and evaluate our system's performance on a variety of large real-world websites. Our experiments show that static information flow is feasible and efficient for JavaScript, and that our technique allows the enforcement of information-flow policies with almost no run-time overhead.

Categories and Subject Descriptors D.2.4 [Software Engineering]: Software/Program Verification – Validation; F.3.2 [Semantics of Programming Languages]: Semantics of Programming Languages – Program analysis

General Terms Languages, Reliability, Verification

Keywords Set Constraints, Flow Analysis, Web Applications, Confidentiality, Integrity

∗ This work was supported by NSF CAREER grants CCF-0644306, CCF-0644361, NSF PDOS grant CNS-0720802, NSF Collaborative grant CCF-0702603.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
PLDI'09, June 15–20, 2009, Dublin, Ireland.
Copyright © 2009 ACM 978-1-60558-392-1/09/06...$5.00

1. Introduction

JavaScript is a popular scripting language that is the foundation of Web 2.0 applications like Gmail and Facebook. The popularity of JavaScript stems from its extremely dynamic nature: libraries can be downloaded at run time from diverse sources across the web, objects and code can be sent over the network as raw strings that are dynamically parsed and executed by the receiver, and all modern web browsers provide JavaScript APIs that allow scripts executing on the page to dynamically access and modify the state associated with the page. Unfortunately, the flexibility comes at a great price: JavaScript has few protection or information hiding mechanisms, and consequently, the use of JavaScript has opened up new classes of security vulnerabilities such as cross-site scripting and code-injection attacks.

We illustrate the main issues here with a simple and glaring attack found in a study of real-world vulnerabilities carried out at Google [27]. Figure 1 shows a code snippet adapted from www.wsj.com. On the first line, the web site inserts an ad by including some JavaScript code from an ad agency. This JavaScript code runs and replaces itself on the web page with the actual ad (this is a very common way of placing ads on web pages, including Google's AdSense). Below the ad, the sample page contains a search form with a text box and a search button with an on-click event handler. The event handler redirects to the search site stored in a global searchUrl variable (which can be reassigned), with the contents of the search box appended as a URL parameter.

    <script src="http://adnetwork.com/insert-ad.js">
    <textbox id="SearchBox">
    <button id="Search" onclick="doSearch()">
    <script type="javascript">
    var doSearch = function() {
      var searchBox = document.nodes.SearchBox.value;
      var searchStr = searchUrl + searchBox;
      document.location.set(searchStr);
    }
    </script>

Figure 1. A snippet of JavaScript based on www.wsj.com. When the user clicks the Search button, the doSearch function appends the contents of the SearchBox to a base URL string searchUrl, and redirects the page to the resulting URL.

In practice, first tier ad agencies often delegate to second tier agencies, which often delegate to third tier agencies, and so on. In their study, Google found a case where ads from a reputable and non-malicious American ad agency, after several levels of indirection, eventually included JavaScript code from a malicious ad provider in Russia. In the wsj example, such a malicious JavaScript snippet could simply write to searchUrl, redirecting the user to the attacker's site the next time the search button is clicked. This malicious site could then exploit a vulnerability in the browser to compromise the client's machine. Thus, the attacker can divert the user to the malicious site, without directly changing the document's location. The Google study reports that almost all web attacks which take over a user's computer follow this pattern: use
JavaScript to change the location bar to redirect to a malicious site that then exploits a vulnerability in the browser.

Although a browser's vulnerability is the last nail in the coffin, the root cause of the problem is that the code that is included for inserting the ad should not be able to change the location bar, and the designers of the wsj page certainly never intended to give the ad code this privilege. Thus, in order to make Web 2.0 applications secure, the fundamental challenge is to devise a mechanism that can specify and enforce the designer's intentions about the effects that a piece of JavaScript code can have.

In this paper, we propose to formalize these effects using information flow. Information flow can capture the fact that a particular value in the program affects another value in the program. In the above example with the location bar, the integrity policy that we would want to enforce is that no value from within any ad should flow to the location bar, or into any child frame's location bar. Similarly, using information flow, we could specify useful confidentiality properties like the sensitive cookie value of the current page should not flow to any ad, or information filled in textboxes on the current page should not flow to any third-party widgets (such as counters inserted on the page).

Although information flow has been explored in many settings, the dynamic nature of JavaScript poses a new challenge: because JavaScript commonly evaluates complex code strings that are built at run time or read from the network, the entire code is not available until the JavaScript program is already running. As a result, techniques for statically checking information flow are not directly applicable. One solution to this challenge would be to check information flow dynamically, but unfortunately this approach has some significant drawbacks. In addition to adding a possibly large run time overhead, it would also prevent developers from catching policy errors early on in the development process.

To address the dynamic nature of JavaScript, we propose in this paper a framework for staging integrity and confidentiality information flow properties. Staging consists of statically computing as much of the information flow as possible based on the known code, and leaving the remainder of the computation until more code becomes available. Since the residual checking must be performed within the browser, and must be performed every time new code is dynamically loaded, we must ensure that the residual checks can be performed efficiently.

In our staging framework, the heavyweight flow analysis is carried out just once on the server and its results are distilled into succinct residual checks that enjoy two properties. First, they soundly describe the properties that are left to be checked on the remaining code once it becomes known; if each piece of loaded code passes the residual checks, the top level flow policy is guaranteed to hold. Second, they obviate the need for flow analysis within the browser; they are syntactically enforceable and can be efficiently discharged when the dynamically loaded code is parsed inside the browser. Thus, by performing the bulk of the analysis statically, staging allows the enforcement of information flow policies for dynamic web applications with almost no run-time overhead. To sum up, we make the following contributions in this paper.

• We present a framework for staging integrity and confidentiality information flow properties in JavaScript programs. Section 2 gives an overview of our framework using an illustrative example, and Section 3 describes the framework in more detail.

• We present an instantiation of the framework using set inclusion constraints. Section 4 describes how we use set constraints to capture direct and indirect flows through difficult-to-analyze features of JavaScript like dynamically created objects, fields, first-class functions, and prototypes. Our constraint-based flow analysis of JavaScript code is a contribution by itself, independent of the staging framework.

• We evaluate our analysis techniques and our staging framework on a variety of real-world web sites. In particular, in Section 5 we demonstrate the feasibility of staged information flow by showing that: (1) our approach scales to the Alexa top 100 [1] web sites, (2) the residual checks are orders of magnitude faster than checking the entire program and fast enough to run inside the client browser, and (3) our analysis is precise enough to correctly identify flows with a small false positive rate.

    <script type="javascript">
    var document.settings = {
      setBaseUrl = function(s) { this.baseUrl = s },
      setVersion = function(i) { this.version = i }
    }
    var initSettings = function(s, i) {
      document.settings.setBaseUrl(s);
      document.settings.setVersion(i);
    }

    initSettings("mysite.com/login.php", 1.0);

    var login = function() {
      var pwd = document.nodes.PasswordTextBox.value;
      if (readCookie("doCheck") && pwd.length < 8) {
        document.alert("Password is too short!");
      } else {
        var user = document.nodes.UsernameTextBox.value;
        var params = "u=" + user + "&p=" + pwd;
        post(document.settings.baseUrl, params);
      }
    }
    </script>

    <text id="UsernameTextBox"> <text id="PasswordTextBox">
    <button id="ButtonLogin" onclick="login()">

    <div id="AdNode">
    <script src="adserver.com/display.js">
    </div>

Figure 2. mysite.com with username and password textboxes to allow logging in. The global function initSettings is intended to be called once to initialize settings used by the page. The page also loads a third-party script, which will be evaled when the string is received from the network.

    var z1 = "evil.com"; var z2 = 1.0; initSettings(z1,z2);

Figure 3. A bad network string display.js, returned by the malicious or compromised adserver.com, which calls initSettings to overwrite the page's settings. When the user clicks the login button, her username and password are sent to evil.com instead of mysite.com.

2. Overview

We start with an example that motivates our approach of staged information flow for JavaScript. Consider the web page shown in Figure 2. In our approach, we separate a web page into two parts. The first part, which we call the context, is known and in our example consists of the entire web page except for the last three lines. The second part, which we call the hole, is loaded dynamically and is unknown. In our example, the hole is the last
three lines of the web page and consists of dynamically loaded third-party code from adserver.com.

The Context. The context contains some JavaScript code within <script>...</script> tags. This script defines a global object called settings that contains fields baseUrl and version, and associated setter methods setBaseUrl and setVersion. The function initSettings calls the setter methods to initialize the fields of the settings object. The function login reads the password from the appropriate field of the document and the document's cookie, creates a string comprised of the user's name and password, and calls post to send this string to the server whose URL is stored in settings.baseUrl. The login function is called asynchronously, on the reception of the event corresponding to the user clicking the ButtonLogin button. At the point when a user clicks on the ButtonLogin button, the settings object is already initialized with mysite.com, the intended destination of the username and password.

The Hole. Suppose that adserver.com is either malicious or compromised, and sends the JavaScript code shown in Figure 3. This code calls initSettings with values that cause the baseUrl field to point to the attacker's address. Thus, if the user presses the button after this dynamically added code is evaled, the user's name and password will get posted to evil.com instead of mysite.com.

2.1 Safety via Information Flow

Unfortunately, existing mechanisms are insufficient to prevent this kind of attack. First, dynamic techniques like stack-based access control are insufficient in the face of asynchrony and global state. After the malicious code has set the global settings.baseUrl, it is no longer on the stack when the click event occurs, and hence there is nothing on the callstack to suggest that an attack has occurred. Second, JavaScript lacks most language-based, coarse-grained, information-hiding mechanisms such as private fields and abstract datatypes. This is a large part of JavaScript's appeal for Web 2.0 programming – by eschewing such mechanisms, JavaScript makes it easier to rapidly construct applications by gluing together trusted and untrusted libraries. Thus, to reconcile safety with flexibility, we need a fine-grained enforcement mechanism that allows untrusted code access to certain parts of the webpage, but ensures that critical elements cannot be affected.

Flow Policies. In our approach, the author of the context provides a flow policy which is a set of pairs of policy elements. A policy element is a program variable or a hole. Each pair in the policy represents flow that is disallowed. For our example, the attack above is possible because the code from the untrusted hole is able to interfere with a trusted part of the system, namely the URL to which the messages are sent. To shield the application from such attacks, the page's author could provide a policy that prohibits any variable declared within the hole from affecting the first parameter of the function post. Formally, we prevent untrusted eval sites from navigating the page via an integrity policy specified as a pair

    (•, document.location)

which states that variables declared within the code loaded at any eval site must not flow into (i.e., affect) the value of document.location. Dually, we can prevent secure information from being read by untrusted holes via a confidentiality policy specified as a pair

    (document.cookie, •)

which states that the value of document.cookie must not flow into any variable within the code loaded at any eval site.

Given an information flow policy and a complete program, there are standard techniques (e.g., using type systems [23] or dataflow analysis [21]) for checking that the program satisfies the policy. Unfortunately, these techniques cannot be applied in our setting because of the extremely dynamic nature of JavaScript. At any time, additional code can be downloaded or dynamically generated using eval, which executes an arbitrary string as code. One option is to resort to fully dynamic enforcement of the flow policies. A second option is to re-analyze the entire application every time a new piece of code is loaded or evaled. However, both of these options incur a significant runtime overhead, which may make certain applications unusable.

2.2 Staged Information Flow

Instead, our approach is to stage the analysis by pre-analyzing the code from the context using the given policy, in order to compute a residual policy, which captures the requirements that the hole must satisfy in order for the entire program to satisfy the flow policy. If the hole is filled with dynamically loaded code that contains more holes, then the staging framework recursively checks the residual policy on the inner holes, and so on.

Stage 1: Computing Residual Policy. Let us see how, for the example from Figure 2, the context's code can be used to reduce the flow policy that no variable from the hole should flow into the first parameter of post, into a residual policy that must be satisfied by the code loaded from adserver.com. To do so we use a static constraint-based analysis to compute the set of values that can flow into all known variables that are in the same scope as the hole and that can affect the first parameter of post. These include document.settings.baseUrl, the s parameter of document.settings.setBaseUrl, and the s parameter of initSettings. Thus, the residual policy states that no variable declared in the hole should flow into any of the above variables (in addition to the first formal of post).

Stage 2: Checking Residual Policy. Once the code is loaded from adserver.com we can rapidly check it against the residual policy. The code from Figure 3 violates the residual policy as z1, declared in the hole, is passed in as the first parameter for initSettings. Dually, if the code satisfies the residual policy (i.e., if variables declared in the hole do not flow into variables known to affect the first parameter of post), then we are guaranteed that the original information policy holds over the complete system. For example, suppose that the code that gets loaded for the hole is that shown in Figure 4. In this case, the hole does not violate the residual policy as it only reads fields of the document.navigator object, and hence, we are guaranteed that the page is safe with respect to the given policy. Thus, the residual policy allows us to quickly check the hole when it is dynamically filled with code, without the check being slowed by tracking flows that occur within the context.

    var displayAd;
    var appName = document.navigator.appName;
    var appVersion = document.navigator.appVersion;
    if (appName == "IE" && appVersion < 7) {
      displayAd = function() { ... };
    } else {
      displayAd = function() { ... };
    }
    document.nodes.AdNode.innerHTML = displayAd();

Figure 4. A good network string display.js, which creates advertisement text depending on the browser detected from the document.navigator fields, and then displays it in the AdNode div that the page created for the ad.

Our staged information flow framework can be used to split the analysis burden across client and server. Along with the development of a JavaScript application, the developer will specify the set of information flow policies that should be enforced on the page.
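
The Stage 2 check for the Figure 2 example can be illustrated with a minimal sketch. It is not the paper's implementation: the residual policy is written out by hand rather than computed by Stage 1, and the hole is assumed to arrive as a list of simplified statement records (a hypothetical encoding) instead of parsed JavaScript. The names residualPolicy, protectedFormal, and checkHole, and the "OK" result, are illustrative only.

```javascript
// Assumed output of Stage 1 for the Figure 2 example (written by hand here).
const residualPolicy = {
  // Context location that reaches the first parameter of post.
  mustNotWrite: new Set(["document.settings.baseUrl"]),
  // Callee name -> index of the formal that reaches post's first parameter.
  protectedFormal: new Map([
    ["post", 0],
    ["initSettings", 0],
    ["document.settings.setBaseUrl", 0],
  ]),
};

// Stage 2 sketch: a single pass over the hole's statements, with no flow
// analysis. Hypothetical record shapes:
//   { kind: "var", name }  { kind: "assign", target }  { kind: "call", callee, args }
function checkHole(policy, statements) {
  const declaredInHole = new Set();
  for (const st of statements) {
    if (st.kind === "var") {
      declaredInHole.add(st.name);
    } else if (st.kind === "assign") {
      // Conservative: any write by the hole to a protected location is rejected.
      if (policy.mustNotWrite.has(st.target)) return "ERR";
    } else if (st.kind === "call") {
      const index = policy.protectedFormal.get(st.callee);
      if (index !== undefined && declaredInHole.has(st.args[index])) return "ERR";
    }
  }
  return "OK";
}

// Figure 3 (bad hole): z1 is declared in the hole and passed to initSettings.
const figure3Hole = [
  { kind: "var", name: "z1" },
  { kind: "var", name: "z2" },
  { kind: "call", callee: "initSettings", args: ["z1", "z2"] },
];

// Figure 4 (good hole): only reads document.navigator and writes the ad node.
const figure4Hole = [
  { kind: "var", name: "displayAd" },
  { kind: "var", name: "appName" },
  { kind: "var", name: "appVersion" },
  { kind: "assign", target: "displayAd" },
  { kind: "assign", target: "document.nodes.AdNode.innerHTML" },
];

console.log(checkHole(residualPolicy, figure3Hole)); // ERR
console.log(checkHole(residualPolicy, figure4Hole)); // OK
```

The check never looks at the context's code: everything it needs is in the precomputed residual policy, which is what keeps the client-side work small in this sketch.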
During the testing cycle, various configurations and controlled libraries can be tested against the policy. The remaining residual checks can then be sent to the client along with the page, and a modified browser can perform the remaining stages of the analysis, halting execution if the policy is ever violated. We assume that the residual policies are not tampered with either when transmitted over the network or within the client browser.

In this setting, residual checks are performed by the client browser, which means that their efficiency directly affects the browsing experience. Consequently, to make residual checks as fast and simple as possible, we have designed our staging framework so that all residual checks are entirely syntactic, eliminating the need to do full-scale information flow on the client side.

2.3 Challenges of Analyzing JavaScript

JavaScript poses several challenges in addition to dynamic code loading and generation.

Functions. First, functions are objects, and hence first-class values that can be bound to other variables and passed around as parameters. For example, all the functions and methods in Figure 2 are created as anonymous function objects that are bound to the variables whose names are subsequently used to invoke the functions. Further, JavaScript programs make heavy use of first-class functions for several reasons, including to attach listeners to events, and to allow different versions of the same function to be defined based on run time properties, as shown in the case of the displayAd function in Figure 4. Thus, any analysis for JavaScript must be able to handle function values – which rules out the use of the standard summary-based interprocedural analyses that are used for first-order languages.

Fields. Second, most variables refer to objects that contain fields. Unlike statically typed languages like Java, it is difficult to determine which fields an object has because there are no classes, and fields can be dynamically added to objects. In essence, objects correspond to dictionaries and fields are simply used as names to look up in the dictionary. For instance, in our example, in order to be sufficiently precise, the analysis must be able to distinguish between the different fields of the document object. Thus, any analysis for JavaScript must be field-sensitive, and not lump together values flowing into different fields of an object. However, the sensitivity must be achieved efficiently as each object can have many fields, thereby making a naïve analysis impossible to scale.

Prototypes. Third, JavaScript eschews classes in favor of a form of inheritance called prototyping. In essence, each function Foo can be used as a constructor to create objects: the expression new Foo(...) creates an object and calls the function Foo to initialize the object. This way of constructing objects also creates an implicit inheritance chain through the use of a special field called proto. In particular, each function Foo has a corresponding prototype object that is implicitly constructed and stored in the field Foo.proto – recall that functions are objects and can therefore have fields.¹ When a new object o is constructed with the expression new Foo(...), the new object implicitly inherits all the attributes of the proto field of Foo (that is, the new object essentially inherits from Foo.proto). This means that on each field read o.f, if f is not a field of o, then the prototype of the function that created o (in this case, Foo's prototype) is used to look up field f. Since Foo's prototype is itself an object, it may also have been created using a function that had a prototype field. Thus, JavaScript transitively follows the chain of proto fields until the field being looked up is resolved, or until the root of the chain is reached without finding the field. Because the proto field can be read and written at any time, just like any other field, a JavaScript analysis must carefully track the prototype objects corresponding to each constructor function, and must track for each object, the attributes that the object implicitly inherits via the prototype chain.

¹ We use proto throughout as shorthand for prototype, which is the actual field name used in JavaScript. Also, each prototype object has a constructor field that holds the function value itself. Although we model this in our implementation, we omit the details for clarity of exposition.

3. Framework

We now describe our framework by formalizing the language, the notions of flow and residual policies, and finally describing how we stage the information flow analysis.

3.1 Core JavaScript

Figure 5 summarizes the syntax of Core JavaScript, which captures the essence of JavaScript.

    e ::=                              Expressions:
        | c                            constant
        | x                            variable
        | x.f                          field-read
        | e_1 op e_2                   bin-op
        | {..., f : e, ...}_i          object literal
        | this                         this
        | fun_i(this_i, p_i){ s }      fun-def
        | f(e)                         fun-call
        | y.f(e)                       method-call
        | new_i f(e)                   constructor-call

    s ::=                              Statements:
        | skip                         skip
        | var x                        var-def
        | x := e                       assign
        | x.f := e                     field-assign
        | s_1 ; s_2                    sequence
        | if_i x then s_1 else s_2     branch
        | while_i x do s               while
        | return e                     return
        | eval_i(x)                    eval

    P  ::= 2^((X × •) ∪ (• × X))       Policies
    RP ::= 2^X × 2^X                   Res. Policies

Figure 5. Syntax

Expressions of Core JavaScript include: basic constants c which include integers, 0, 1, ..., strings, etc.; variable reads x; field reads x.f, where f is a field name; binary operations e_1 op e_2, where op includes primitive operations like addition, string concatenation etc.; function declarations, where each function is labeled by a unique identifier i, and has two formal parameters this_i and p_i; function, method, and constructor calls, where exactly one parameter is passed to the callee; objects, which are a sequence of field-expression bindings; and the this expression.

Statements of Core JavaScript include: variable declarations, variable and field assignments, statement sequencing, branching and while-loops, and return statements. To model dynamic code loading, we include an eval statement. A statement is open if it contains an eval site, and closed otherwise.

Conventions. We assume, without loss of generality, that the program satisfies certain syntactically enforceable conventions. First, we assume that all functions are declared anonymously, and that each declaration has a unique label i. Further, each function has exactly two parameters. The first parameter this_i corresponds to
the object that will be referred to as this inside the body of the function. We use this_i to model JavaScript's semantics for this. The second parameter p_i corresponds to the argument passed to the function. When a function is called directly as f(e), the this variable is set to the global object, and the value that e evaluates to is passed in as the second parameter. When a function is called indirectly as a method call x.f(e), the object x is passed in as the first parameter, and the value that e evaluates to is passed as the second parameter. Second, we assume that each eval statement is uniquely labeled. Further, we assume that each branch and loop is uniquely labeled – we will use these labels to compute indirect flows, which are the flows that occur from the branch value to locations being assigned in the branch. Third, we assume that each variable is declared, and that all variables are renamed so that local variables in different scopes have unique names. Fourth, rather than explicitly modeling the webpage (i.e., encoding the DOM), we assume that there is an object document in the global namespace that can be accessed and manipulated via the appropriate fields and methods.

Dynamic Semantics. The dynamic semantics of Core JavaScript are standard – we refer the reader to [31, 36, 6, 17] for a detailed formalization via small-step operational semantics.

3.2 Information Flow Policies

Our information flow policies are expressed as sets of pairs, where each pair represents a must-not-flow requirement. In its most general form, such policies would include pairs where each element is either a variable or the label of an eval site. Although our unstaged information flow analysis will handle such general flow policies, these policies make residual checks difficult to perform syntactically, and require sending a large amount of state across stages. Consider for example a policy stating that a variable x should not flow to another variable y, and a context containing a single assignment b = a. If the hole executes a = x and y = b, then the flow policy is violated. Since the hole can add flow between any variables a and b that it chooses, in the most general case the first stage must send to the second stage all possible flows of variables in the context. This would require sending a large amount of data to the second stage, and would also require performing a full flow analysis in the client browser.

Thus, to make our policies more amenable to syntactic residual checks, we restrict ourselves to confidentiality and integrity policies: confidentiality policies state that sensitive information should not be leaked, whereas integrity policies state that the attacker cannot compromise sensitive information. As a result, these policies include pairs where one element of the pair is a variable and the other is a hole, which would not allow the problematic policy (x, y) to be expressed.

Even with this restriction, policies that grant one hole access to a sensitive variable but not another pose problems for staging. In particular, consider the case where the client receives a hole that is allowed to access the sensitive variable. This hole may induce flow that must be taken into account in the residual checks for the remaining holes that cannot access the variable. To update the residual policy would again require a complex computation on the client. As a result, instead of allowing must-not-flow pairs to be specific to particular holes, we require that if the policy restricts access to one hole, it restricts access to all holes.

Policies. Formally, we define a must-not-flow policy as a pair of the form (x, •), which states the confidentiality policy that the value of the variable x must not flow to any variable within a hole, or (•, x), which states the integrity policy that values from variables within [...]

[...] write set respectively. Intuitively, the code loaded at any hole must not read (resp. write) any variable or field in MNR (resp. MNW).

Even though policies are restricted to variables, notice that we can stipulate a policy like (x.f, •) (resp. (•, y.g)) by (a) creating a new variable x′ (resp. y′), (b) adding a new assignment x′ := x.f (resp. y.g := y′), and (c) specifying a policy (x′, •) (resp. (•, y′)). Similarly, to prevent flows from constants we assume that dynamically loaded code is rewritten so that all constants that appear in the hole are bound to new variables declared within the hole.

Flows. Informally, we say that a variable x flows into y if the value of y can be affected by the value of x, either directly, via a sequence of assignments, or indirectly, due to conditional dependences. To formalize when a flow policy P is violated by a program s we rewrite the program to Rewrite(P, s), which is an instrumented version of s that: (1) has auxiliary taint fields that track flows, and (2) calls a special Core JavaScript function flowDetected() as soon as a flow is detected during execution from x to a hole, for some (x, •) ∈ P, or from a hole to x, for some (•, x) ∈ P. Figure 7 formalizes the most important cases of the rewriting function Rewrite. Notice that the rewriting function is recursively invoked every time an eval is executed.

Flow Policy Satisfaction. We say that a program s violates a policy P if there is an execution of Rewrite(P, s) along which the function flowDetected() is called. Otherwise, we say that s satisfies the policy P.

3.3 Staged Policy Verification

    SIF(P, s) =
      RP ← Stage(P, s)
      c ← Initialize(s)
      do
        (s, c) ← Execute(c)
      while (c ≠ EXIT and Check(RP, s) ≠ ERR)
      if c = EXIT then return SAFE else return ERR

Figure 6. Staged Information Flow Framework

Procedure SIF. Figure 6 formalizes our Staged Information Flow framework as a procedure SIF that takes as input a policy P and an open Core JavaScript program s and returns either SAFE, indicating that the policy P was satisfied, or ERR, indicating that the policy was violated. At a high level, our framework stages information flow checking as follows. First, the framework calls Stage with the top-level policy P and the context s, i.e., the known part of the program, to compute the flows that occur in the context. Stage uses the computed flows and the (top-level) policy to pre-compute and return a residual policy RP that is a projection of the (top-level) policy to the eval sites. Second, the framework initializes the variable c with a snapshot of the entire initial state of the executing program. Third, the framework enters a loop where it invokes Execute on the snapshot c to run the program until it reaches the next eval site, or terminates. In the former case, Execute returns a pair (s, c) where s is the code to be loaded at the next site, and c is the current snapshot of the program. In the latter case, Execute returns a pair where c is simply EXIT, indicating the program has terminated. The loop is repeated until the program terminates or Check determines that the loaded code s violates the residual policy RP. After breaking out of the loop, we check whether we exited because the program terminated or because some call to Check returned ERR. In the former case SIF returns SAFE, and in the latter ERR.
a hole must not flow into x. A policy P is a set of must-not-flow Procedures Stage and Check. The framework is parameterized by
policies. A residual policy RP is a pair of two sets of variables or two procedures Stage and Check. Stage takes as input a policy P
fields MNR and MNW called the a must not read set and must not and a statement s corresponding to a context, and returns a residual
policy RP corresponding to the projection of P to the eval sites in s. Check takes as input a residual policy RP and a statement s corresponding to code loaded in at a hole, and returns ERR or SAFE.

Soundness. For two programs s, s′ and hole label i, let s[i ↦ s′] be the closed program obtained by replacing the eval site i in s with s′ and all other eval sites with skip. To ensure soundness, the procedures Stage and Check must meet the following requirements which state that the procedures must overapproximate the flows that occur in concrete executions.

    ∀P, s, i, s′. if s[i ↦ s′] violates P
                  then Check(Stage(P, s), s′) = ERR

    ∀P, s, i, s′. if Check(Stage(P, s), s′) = SAFE
                  then Stage(P, s) = Stage(P, s[i ↦ s′])

We can show that if Check and Stage meet the above criteria, then for all policies P and programs s, if P is violated by s then SIF(P, s) returns ERR. A sketch of the proof is as follows: assume that P is violated at some point during the execution of s, and suppose that at the point of failure, sites i_1 through i_k in s had been loaded with s_1 through s_k. Thus, we know that s[i_1 ↦ s_1, ..., i_k ↦ s_k] violates the policy, and then by the first property above, we know that Check(Stage(P, s[i_1 ↦ s_1, ..., i_{k−1} ↦ s_{k−1}]), s_k) = ERR. Then, using k − 2 applications of the second property above, we can show that Stage(P, s[i_1 ↦ s_1, ..., i_{k−1} ↦ s_{k−1}]) = Stage(P, s), and therefore Check(Stage(P, s), s_k) = ERR, which means that SIF(P, s) would return ERR when it performs the residual check on s_k.

The second condition above enables our framework to compute the flows and residual policies once, without having to recompute them each time that a hole is filled. In essence, the conditions on Stage and Check ensure that the dynamically loaded code does not induce any new flows for the variables described in the top-level policy P. If any new flows would be induced by the hole, then Check would return ERR and execution would be halted.

4. Static Instantiation

We now describe how we have instantiated our framework by presenting our implementations of Stage and Check. Stage takes a policy P and program s and returns the residual policy for the eval sites in s. Check takes a residual policy RP comprising a must-not-read and must-not-write set, and a statement corresponding to code to be loaded at a hole, and verifies that the statement satisfies the residual policy, by verifying that the statement does not read (resp. write) the variables or fields listed in the must not read (resp. must not write) sets.

Next, we describe our flow-insensitive, field-sensitive, set-constraint based instantiation of the procedure Stage. First, we present the different elements constituting the constraints, constants, constructors, and terms. Second, we describe our syntax-directed constraint generation procedure. Third, we discuss some optimizations required to analyze JavaScript with sufficient precision. Fourth, we show how Stage combines policies and constraints to compute residual policies.

Set Constraints. A term is either a constraint variable X, a constant, or a constructed term C(t_1, ..., t_n), where C is a constructor of arity n and t_1, ..., t_n are terms. A set constraint is a constraint of the form t_1 ⊆ t_2, where t_1 and t_2 are terms. A satisfying solution for a finite set of constraints maps each constraint variable to a set of constants and constructed terms, such that all of the inclusion constraints are satisfied. For details, we refer the reader to [20]. For two unary constructors C, D, we write the constraint t_1 ⊆_{C,D} t_2 as an abbreviation for the pair of constraints t_1 ⊆ C(X), D(X) ⊆ t_2, where X is a fresh constraint variable that is distinct from all other variables.

    SRC(P, z) =
        if (z, •) ∈ P or z is a hole var
        then [z] else []

    DST(P, z) =
        if ∃a s.t. a ∈ z.taint and
           [((a, •) ∈ P and z is a hole var) or
            ((•, z) ∈ P and a is a hole var)]
        then flowDetected()

    Rewrite(P, c) = {data : c, taint : I}
    Rewrite(P, var x) = var x
    Rewrite(P, return x) = return x
    Rewrite(P, x) =
        {data : x.data,
         taint : I + x.taint + SRC(P, x)}
    Rewrite(P, x.f) =
        {data : x.data.f.data,
         taint : I + x.data.f.taint}
    Rewrite(P, x op y) =
        {data : x.data op y.data,
         taint : I + x.taint + y.taint}
    Rewrite(P, fun(this, p){ s }) =
        {data : fun(this, x, I){ Rewrite(P, s) },
         taint : I}
    Rewrite(P, {f_1 : x_1, ...}) =
        {data : {f_1 : {data : x_1.data, taint : I + x_1.taint}, ...},
         taint : I}
    Rewrite(P, x := e) =
        var tmp := Rewrite(P, e);
        x.data := tmp.data;
        x.taint := tmp.taint;
        DST(P, x)
    Rewrite(P, x.f := e) =
        var tmp := Rewrite(P, e);
        x.data.f.data := tmp.data;
        x.data.f.taint := tmp.taint
    Rewrite(P, x := f(z)) =
        var tmp := f.data;
        x := tmp(this, z, I + f.taint);
        DST(P, x)
    Rewrite(P, x := y.f(z)) =
        x := y.data.f.data(y, z, I + y.data.f.taint);
        DST(P, x)
    Rewrite(P, s_1; s_2) =
        Rewrite(P, s_1); Rewrite(P, s_2)
    Rewrite(P, if_i x then s_1 else s_2) =
        var tmp := I;
        I := I + x.taint;
        if_i x.data then Rewrite(P, s_1) else Rewrite(P, s_2);
        I := tmp
    Rewrite(P, while x do s) =
        var tmp := I;
        I := I + x.taint;
        while x.data do Rewrite(P, s);
        I := tmp
    Rewrite(P, x := eval_i(y)) =
        var tmp := eval_i(Rewrite(P, y.data));
        x.data := tmp.data;
        x.taint := I + tmp.taint + y.taint

Figure 7. Dynamic Information Flow Rewriting. We assume complex expressions are bound to fresh temporary variables. The global variable I, initially the empty set, stores the set of indirect taints.
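To make the rewriting of Figure 7 more concrete, the following is a minimal sketch of a few of its cases (constants, variable reads, binary operations, assignment, and branches) written as a small runtime monitor. It is an illustration under stated assumptions, not the authors' implementation: the names makeMonitor, read, assign, branch, and the representation of the policy as an array of pairs are hypothetical, and functions, fields, loops, and eval are left out.

```javascript
// Hypothetical monitor mirroring a few Rewrite cases from Figure 7.
const HOLE = "•"; // stands for "any hole" in a must-not-flow pair

function makeMonitor(policy, holeVars) {
  let I = new Set();      // global set of indirect taints, initially empty
  const env = new Map();  // variable name -> { data, taint }
  let detected = false;   // set where flowDetected() would be called

  const inPolicy = (a, b) => policy.some(([p, q]) => p === a && q === b);
  const union = (...sets) => new Set(sets.flatMap((s) => [...s]));
  const lookup = (x) => env.get(x) || { data: undefined, taint: new Set() };

  // SRC(P, z): z taints itself if (z, •) ∈ P or z is a hole variable.
  const src = (z) =>
    inPolicy(z, HOLE) || holeVars.has(z) ? new Set([z]) : new Set();

  // DST(P, z): flag a flow from a protected variable into a hole variable,
  // or from a hole variable into a protected variable.
  const dst = (z) => {
    for (const a of lookup(z).taint) {
      if ((inPolicy(a, HOLE) && holeVars.has(z)) ||
          (inPolicy(HOLE, z) && holeVars.has(a))) {
        detected = true;
      }
    }
  };

  return {
    constant: (c) => ({ data: c, taint: union(I) }),
    read: (x) => ({
      data: lookup(x).data,
      taint: union(I, lookup(x).taint, src(x)),
    }),
    op: (f, l, r) => ({
      data: f(l.data, r.data),
      taint: union(I, l.taint, r.taint),
    }),
    assign: (x, box) => { env.set(x, box); dst(x); },
    branch: (cond, thenFn, elseFn) => {
      const saved = I;            // var tmp := I
      I = union(I, cond.taint);   // I := I + x.taint
      (cond.data ? thenFn : elseFn)();
      I = saved;                  // I := tmp
    },
    flowDetected: () => detected,
  };
}

// Confidentiality policy (cookie, •) with a single hole variable h.
const policy = [["cookie", HOLE]];

const direct = makeMonitor(policy, new Set(["h"]));
direct.assign("cookie", direct.constant("secret"));
direct.assign("h", direct.read("cookie")); // direct flow cookie -> h

const indirect = makeMonitor(policy, new Set(["h"]));
indirect.assign("cookie", indirect.constant("secret"));
indirect.branch(
  indirect.read("cookie"),
  () => indirect.assign("h", indirect.constant(1)), // assignment under a tainted guard
  () => {}
);

console.log(direct.flowDetected(), indirect.flowDetected());
```

Both runs are flagged: the first because the taint of cookie reaches h through the assignment, the second because the guard's taint is added to I while the branch body executes, which is the role the global variable I plays in the figure.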
4.1 Constraint Elements

We set up a system of constraints over variables X_e for each sub-expression e of the program. The constraints use several kinds of constructors to model various aspects of JavaScript code. The first two constructors are standard ways of encoding functions and fields using set constraints [12]. The last three are novel mechanisms required to capture information flow and the semantics of JavaScript: they are used to distinguish between fields that are directly contained in an object, fields that are reachable by transitively following a prototype chain, values that directly reach a particular point, and values that indirectly reach a particular point.

1. Function Constructor. JavaScript programs have first-class functions in that functions can be created and passed around like any other value. We model the flow of function values via a constructor Fun() of arity 5. The first argument corresponds to the objects constructed by the function object, a special feature of JavaScript that we will describe in the sequel. This argument is treated as contravariant. The second argument corresponds to the function's implicit parameter this. As the argument corresponds to an input of the function, it is treated as contravariant. The third argument corresponds to the explicit formal parameter of the function. As this argument also corresponds to an input of the function, it is treated as contravariant. The fourth argument corresponds to an implicit parameter that holds the values corresponding to indirect flows into the points where the function is invoked. This parameter is used as the initial set of indirect flows into the body of the function, and as it corresponds to an input, the argument is also treated as contravariant. The fifth argument corresponds to the return value, and hence the argument is covariant.

2. Field Constructors. JavaScript programs make heavy use of fields and any precise analysis must track flows in a field-sensitive manner. The classical way to model fields is to view them as a pair of functions: a setter that updates the contents of the field, and a getter that returns the contents of the field. Following this intuition, we encode a field f via a constructor Fld_f() of arity 2. The first parameter corresponds to the set of values written into the field, i.e., the inputs to the setter, and hence, is treated as contravariant. The second parameter corresponds to the set of values read from the field, i.e., the outputs of the getter, and hence, is treated as covariant. When initializing an object's fields, we use the same set variable in both places so that all values that flow into the first argument flow out of the second argument. When writing a field, we pad the second argument with the set variable Ω, which collects everything that flows from the field by covariance. When reading a field, we pad the first argument with the set variable ∅, so that nothing flows into the field by contravariance.

3. Real and Prototype Flow Constructors. In order to determine what a field read returns in the presence of prototyping, we must track, for each object, the values for all fields that can be read directly from the object or transitively via following its prototype chain. To distinguish between the fields of an object and the fields reachable via the prototype chain of an object, we use a special constructor Real() of arity 1 to wrap the fields that an object directly contains, and a special constructor Pro() of arity 1 to wrap the fields that are transitively reachable by following the object's prototype chain.

In general, if c can be reached by following the prototype chain of an expression e, then our constraints will ensure that Pro(c) flows into X_e. For example, x.f can return the constant c if the object has a field f that has the value c, or if an object in its prototype chain has a field f that has the value c. In the former case, our constraints ensure that the term Real(Fld_f(c, c)) flows into X_x. In the latter case, our constraints ensure that the term Pro(Fld_f(c, c)) flows into X_x.

Statements

    Gen(k, X_I, skip) = ∅
    Gen(k, X_I, var x) =
        {c_x ⊆ X_x} ∪ {Ind(c_x) ⊆ X_x}
    Gen(k, X_I, x := e) =
        Gen(k, X_I, e) ∪ {X_e ⊆ X_x} ∪ {X_I ⊆ X_x}
    Gen(k, X_I, x.f := e) =
        Gen(k, X_I, e) ∪
        {X_x ⊆ Real(Fld_f(X_e, Ω))} ∪ {X_x ⊆ Real(Fld_f(X_I, Ω))}
    Gen(k, X_I, s_1; s_2) =
        Gen(k, X_I, s_1) ∪ Gen(k, X_I, s_2)
    Gen(k, X_I, if_i x then s_1 else s_2) =
        Gen(k, X_i, s_1) ∪ Gen(k, X_i, s_2) ∪
        {X_I ⊆ X_i} ∪ {X_x ⊆_{Ind,Ind} X_i}
    Gen(k, X_I, while_i x do s) =
        Gen(k, X_i, s) ∪ {X_I ⊆ X_i} ∪ {X_x ⊆_{Ind,Ind} X_i}
    Gen(k, X_I, return e) =
        Gen(k, X_I, e) ∪ {X_e ⊆ X_{ret_k}} ∪ {X_I ⊆ X_{ret_k}}
    Gen(k, X_I, eval_i(e)) = ∅

Expressions

    Gen(k, X_I, c as e) =
        {c ⊆ X_e} ∪ {Ind(c) ⊆ X_e}
    Gen(k, X_I, x) = ∅
    Gen(k, X_I, x.f as e) =
        {X_x ⊆ Real(Fld_f(∅, X_e))} ∪ {X_x ⊆ Pro(Fld_f(∅, X_e))}
    Gen(k, X_I, e_1 op e_2 as e) =
        Gen(k, X_I, e_1) ∪ Gen(k, X_I, e_2) ∪
        {X_{e_1} ⊆ X_e} ∪ {X_{e_2} ⊆ X_e}
    Gen(k, X_I, {..., f_j : e_j, ...}_i as e) =
        (∪_j Gen(k, X_I, e_j)) ∪ {X_i ⊆ X_e} ∪
        (∪_j {X_{e_j} ⊆ X_{i.f_j}}) ∪ (∪_j {X_I ⊆ X_{i.f_j}}) ∪
        (∪_j {Real(Fld_{f_j}(X_{i.f_j}, X_{i.f_j})) ⊆ X_i})
    Gen(k, X_I, this as e) = {this_k ⊆ X_e}
    Gen(k, X_I, fun_i(this_i, p_i){ s } as e) =
        Gen(i, X_{ind_i}, s) ∪ {X_I ⊆ X_{ind_i}} ∪ {X_i ⊆ X_e} ∪
        {Fun(X_{cons_i}, X_{this_i}, X_{p_i}, X_{ind_i}, X_{ret_i}) ⊆ X_i} ∪
        (∪_j {Real(Fld_{f_j}(X_{proto_i.f_j}, X_{proto_i.f_j})) ⊆ X_{proto_i}}) ∪
        {Real(Fld_proto(X_{proto_i}, X_{proto_i})) ⊆ X_i} ∪
        {X_{cons_i} ⊆ Real(Fld_proto(X_{proto_i}, Ω))}
    Gen(k, X_I, f(e0) as e) =
        Gen(k, X_I, e0) ∪ {X_f ⊆ Fun(∅, X_{o_g}, X_{e0}, X_I, X_e)}
    Gen(k, X_I, x.f(e0) as e) =
        Gen(k, X_I, e0) ∪
        {X_x ⊆ Real(Fld_f(∅, Fun(∅, X_x, X_{e0}, X_I, X_e)))} ∪
        {X_x ⊆ Pro(Fld_f(∅, Fun(∅, X_x, X_{e0}, X_I, X_e)))}
    Gen(k, X_I, new_i f(e0) as e) =
        Gen(k, X_I, e0) ∪ {X_i ⊆ X_e} ∪
        (∪_j {Real(Fld_{f_j}(X_{i.f_j}, X_{i.f_j})) ⊆ X_i}) ∪
        {X_f ⊆ Fun(X_i, X_i, X_{e0}, X_I, Ω)} ∪
        {X_{i.proto} ⊆_{Real,Pro} X_i} ∪ {X_{i.proto} ⊆_{Pro,Pro} X_i}

Figure 8. Constraint generation
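As a rough illustration of how a few rules of Figure 8 turn into solvable constraints, the sketch below generates constraints for variable declarations, assignments of constants, sequencing, and branches, and solves them by naive fixpoint iteration. Everything here is an assumption made for illustration: the AST shape, the names solve, genStmt, and genExpr, the string encoding of terms such as "Ind(c_x)", and the treatment of ⊆_{C,D} as a simple filter. The paper's implementation uses the Banshee solver and covers functions, fields, and prototypes, none of which appear here.

```javascript
// Solve inclusion constraints by naive fixpoint iteration.
// Terms are strings: a constant "c" or a wrapped constant "Ind(c)".
function solve(constraints) {
  const sol = new Map();
  let changed = true;
  const get = (x) => {
    if (!sol.has(x)) sol.set(x, new Set());
    return sol.get(x);
  };
  const add = (x, t) => {
    if (!get(x).has(t)) { get(x).add(t); changed = true; }
  };
  while (changed) {
    changed = false;
    for (const c of constraints) {
      if (c.kind === "const") {
        add(c.into, c.term);                       // term ⊆ X
      } else if (c.kind === "sub") {
        for (const t of [...get(c.from)]) add(c.into, t); // X ⊆ Y
      } else if (c.kind === "filter") {
        // X ⊆_{C,D} Y: keep only C-wrapped terms and rewrap them under D.
        const prefix = c.C + "(";
        for (const t of [...get(c.from)]) {
          if (t.startsWith(prefix)) add(c.into, c.D + "(" + t.slice(prefix.length));
        }
      }
    }
  }
  return sol;
}

// Constraint variable of an expression: variable references share X_<name>,
// each constant occurrence gets its own variable.
const exprVar = (e) => (e.type === "ref" ? "X_" + e.name : "X_expr#" + e.id);

function genExpr(e) {
  if (e.type === "ref") return [];                 // Gen(k, X_I, x) = ∅
  const c = String(e.value);
  return [
    { kind: "const", term: c, into: exprVar(e) },
    { kind: "const", term: "Ind(" + c + ")", into: exprVar(e) },
  ];
}

function genStmt(XI, s) {
  switch (s.type) {
    case "var":
      return [
        { kind: "const", term: "c_" + s.name, into: "X_" + s.name },
        { kind: "const", term: "Ind(c_" + s.name + ")", into: "X_" + s.name },
      ];
    case "assign":
      return [
        ...genExpr(s.expr),
        { kind: "sub", from: exprVar(s.expr), into: "X_" + s.name },
        { kind: "sub", from: XI, into: "X_" + s.name },
      ];
    case "seq":
      return s.stmts.flatMap((t) => genStmt(XI, t));
    case "if": {
      const Xi = "X_branch#" + s.label;            // fresh indirect-flow variable
      return [
        ...genStmt(Xi, s.then),
        ...genStmt(Xi, s.else),
        { kind: "sub", from: XI, into: Xi },
        { kind: "filter", from: "X_" + s.cond, C: "Ind", D: "Ind", into: Xi },
      ];
    }
    default:
      throw new Error("unsupported statement: " + s.type);
  }
}

// var x; var y; if_1 x then y := 1 else y := 2
const program = { type: "seq", stmts: [
  { type: "var", name: "x" },
  { type: "var", name: "y" },
  { type: "if", label: 1, cond: "x",
    then: { type: "assign", name: "y", expr: { type: "const", value: 1, id: 1 } },
    else: { type: "assign", name: "y", expr: { type: "const", value: 2, id: 2 } } },
]};

const solution = solve(genStmt("X_ind#0", program));
console.log([...solution.get("X_y")].sort());
```

In the solution for this program, X_y contains Ind(c_x) but not c_x: the guard x influences y only indirectly, which is the separation between direct and indirect flows that the Ind() wrapper and the ⊆_{Ind,Ind} filter are meant to preserve.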
4. Indirect Flow Constructor. When tracking information flow, we must track both direct value flows, as well as indirect flows that arise when assignments take place under particular branch conditions. However, due to the presence of higher-order functions and dynamic dispatch, we must take care to separate direct flows (which affect which functions get executed at different program points), from indirect flows (which have no effect on the execution). To achieve this separation, we use a covariant constructor Ind() of arity 1 to wrap constants and convert them into ground terms that participate in indirect flows.

In general, if a constant c directly flows into an expression e, then the constraints ensure that the term c flows into X_e. If the constant c indirectly flows into an expression e, however, then our constraints ensure that Ind(c) flows into X_e. For example, if c indirectly flows into the expression x.f, then our constraints will ensure that the term Real(Fld_f(Ind(c), Ind(c))) flows into X_x.

4.2 Constraint Generation

Figure 8 shows the constraint generation procedure Gen. The procedure takes as input a label k corresponding to the identifier of the function currently being analyzed, a constraint variable X_I representing the indirect flows into the program location being analyzed, and either e or s, respectively the expression or statement being analyzed, and it returns as output a set of inclusion constraints. Gen traverses the AST of the program and generates constraints between variables of the form X_e for each subexpression e, that capture the set of values that flow directly or indirectly into e. We maintain the invariants that: (1) only values wrapped with the Ind() constructor flow into the indirect flow variables X_I, (2) for every value that directly flows into e, there is a corresponding term that flows into X_e, and (3) for every value that is reachable from e after transitively following the prototype chain rooted at e, there is a corresponding term wrapped under Pro() that flows to X_e. Next, we discuss how constraints are generated for a representative subset of expressions and statements.

Assignments. For each assignment x := e, we generate constraints on the subexpression e, and then constraints that capture the direct flow from e into x as well as the indirect flow from the current location's indirect flow variable into x. Notice that a return statement is treated as an assignment to the return variable of the function to which the statement belongs.

Branches. Each branch statement and loop is labeled with a unique label i that can be generated with a syntactic pass over the source. For each branch or loop labeled i, we create a new indirect flow variable X_i. We flow the values in X_I and the indirect values from the expression used in the branch condition into X_i, and then use X_i as the indirect flow variable when generating the constraints for the statements that depend on the branch. To preserve the invariant that indirect flows are wrapped under the Ind() constructor (and hence, not used to affect computation), we filter flows to wrapped terms, using X_x ⊆_{Ind,Ind} X_i.

Object Literals. We use the set variable X_{i.f} to track the contents of field f of object i. For every field f_j, we flow the initial value e_j into the field with the constraint X_{e_j} ⊆ X_{i.f_j}, and the current indirect taint into the field with the constraint X_I ⊆ X_{i.f_j}. Finally, we add the constraint Real(Fld_{f_j}(X_{i.f_j}, X_{i.f_j})) ⊆ X_i to treat f_j as a field that object i directly contains, where X_{i.f_j} is used as both the setter and getter for the field.

Fields. For each field x.f read in (resp. written in) an expression e, we create the appropriate flow constraint between X_x and Fld_f(∅, X_e) (resp. Fld_f(X_e, Ω)), which by virtue of the constructor matching and variance, has the effect of flowing the values from (resp. into) the f field of x into (resp. from) X_e. We must account for prototype chains when reading and writing to fields. For field reads from x.f, the result can flow from either the object x (if it has a field f), or from some object in its prototype chain with a field f. To model these semantics, the values returned from reads are the values from objects that directly flow into x (i.e., Real()-wrapped terms that flow into X_x) as well as objects that flow into x after following the prototype chain (i.e., Pro()-wrapped terms that flow into X_x). For field writes to x.f, only the actual object itself may be updated (as opposed to some object along the prototype chain). To model these semantics, we only carry out the assignment on those objects that directly reach x, i.e., are wrapped under Real().

Function Definitions. Each anonymous function declaration is labeled with a unique label i. For each function i, we create a term Fun(X_{cons_i}, X_{this_i}, X_{p_i}, X_{ind_i}, X_{ret_i}) corresponding to the function's value, and create the appropriate constraints from the constraint variables representing the "inputs" X_{this_i}, X_{p_i}, and X_{ind_i} into the body of the function, and from the return statement of the function to X_{ret_i}. The contravariant argument cons_i in the first position of the function term corresponds to the objects constructed by the function. The last three constraints deal with prototypes. First, we create a fresh prototype object, namely X_{proto_i}, and set up its fields in the same way we do for object literals. Second, we add the constraint Real(Fld_proto(X_{proto_i}, X_{proto_i})) ⊆ X_i that stores the prototype object in the proto field of the function object. Third, we add the constraint X_{cons_i} ⊆ Real(Fld_proto(X_{proto_i}, Ω)), which has the effect of writing the prototype object into the proto field of any objects that flow to cons_i.

Function Calls. For each function call, we generate a constraint that uses constructor matching to pull out the set of actual functions reaching the callsite, and uses variance to flow the actuals (both explicit, and implicit due to indirect flow) into the formals, and the return out to the callsite respectively [11]. The values flowed in for the cons and this parameters differ depending on the three kinds of function calls.

• For direct calls of the form f(e0), we use ∅ for the constructed object and the flow variable corresponding to the global object o_g for the this parameter.

• For method calls of the form x.f(e0), we use constructor matching in a manner similar to field-reads, to pull out the appropriate functions that flow to the callsite, and we use the receiver object x as the this parameter.

• For constructor calls of the form new_i f(e0), which create a fresh object and call f with this bound to the fresh object, we introduce a new variable X_i that represents the objects created at the callsite and set up its fields as we do for object literals. Next, we use constructor matching to pull out the functions that flow to the callsite, and (via contravariance) flow X_i into the constructed object parameter and this parameter of the callees. Finally, the last two constraints "flatten" the constructed object's prototype chain. Intuitively, the constraints add all fields of an object's prototype chain into the object directly, while at the same time keeping track of which fields actually belong to the object and which do not. To achieve this, we take each object that flows into the prototype field of the constructed object (X_{i.proto}) either directly (i.e., wrapped under Real()) or via prototype chains (i.e., wrapped under Pro()), and rewrap those objects under Pro() and flow them into the constructed object X_i.

4.3 Analyzing Real JavaScript

Multi-parameter Functions. Functions in JavaScript can be invoked with any number of arguments, regardless of how many parameters the function is defined to accept; missing arguments are
set to undefined and additional arguments are ignored.² To model this in the implementation, we define a common set constructor Fun_n() that, in addition to cons, this, return, and taint parameters, takes n arguments, where n is the maximum number of arguments across all function definitions and applications in the program. When a definition or call in the program uses fewer than n arguments, we pad the remaining arguments with fresh constraint variables. We omit these details from the presentation since they are straightforward.

² Whenever a function is called in JavaScript, the list of actuals provided is bound to a special variable called arguments available within the body. We model this behavior in our implementation.

Iterative Field-Sensitivity. Due to the flexibility of JavaScript objects, we must assume that every object can have a binding for every possible static field. However, this naïve approach does not scale, as the product of object count and field count is often very large. Instead, we perform an iterative field-sensitive analysis that tracks fields on a per-object basis as needed. For each object, we begin by tracking only the fields that we are certain will exist based on the object definition. After solving the set constraints under these assumptions, we check whether any objects flow into accesses of fields that were not being tracked. We add constructed terms for missing fields as appropriate, and incrementally solve for the constraints again.

Current Limitations. Tracking reads and writes of dynamic fields, i.e., array or dictionary lookups, is significantly harder than tracking statically known field names. Modeling these accesses with a Fld_⊤ set constructor, where ⊤ represents unknown fields, makes the analysis unscalable, due to large numbers of accesses through integer fields (for array objects) and complex string expressions that compute precise names of HTML elements on the page. For the purposes of our analysis, we make the dynamically checkable assumption that dynamically created field names, i.e., dynamically created array or hash table indices, do not clash with statically known field names.

Our current implementation also does not support several other features of JavaScript, but these can be directly captured within our constraint-based SIF framework. These include the with statement, which allows its body to be evaluated with a given object's fields temporarily brought into scope, and call and apply forms that allow the programmer to explicitly set the this parameter of a function call, which can be used to implement closure-based inheritance.

4.4 Residual Policy Generation

We now describe how we use our constraint-based flow analysis to compute residual policies for holes. Recall that the residual policy comprises a set of must-not-read (MNR) and must-not-write (MNW) variables and fields. Thus, at a high level, for confidentiality (resp. integrity) policies, our goal is to find variables to which (resp. from which) the sensitive information may flow, and then prohibit the client from reading (resp. writing) those variables.

However, it turns out that several subtleties arise due to the combination of higher-order functions, aliasing and the requirement that residual policy checking be efficient. We illustrate these issues using examples that motivate our algorithm for generating residual policies. For the following examples, suppose we wish to enforce the confidentiality policy stating that the document's cookie should not flow into a hole.

Functions. Intuitively, the residual policy needs to prevent the hole from reading any variable in the context that is tainted by the cookie. However, the residual policy must also prevent the hole from writing certain variables. To illustrate, consider the following context:

    f = function(x) { ... }
    f(document.cookie);

That is, the context contains a function f that is called with the cookie as a parameter. Next, suppose that a hole is filled with:

    f = function(x) { post(x); }

That is, the hole redefines the function f with another function that broadcasts the argument x. If this code is dynamically loaded, it can overwrite the original "trusted" function f, and a policy violation will occur if the new function is called from the context. If we could re-analyze the entire source of the context and the hole at the client, then we would deduce that the call in the context flows the cookie into the formal x of the new function defined in the hole. However, performing client-side flow analysis on the entire code each time a hole is loaded would make residual policy checking prohibitively inefficient. Instead, we observe that when a function's arguments receive confidential values, we can guarantee confidentiality by ensuring the function itself is not overwritten by the hole. Thus, even for confidentiality policies, certain variables, namely those holding function values whose arguments have been tainted, should not be written by the hole. Dually, for integrity policies, the residual policy must also include both must-not-read and must-not-write sets.

Aliasing. Consider a context that contains the following code snippet, where tmp is not aliased to document:

    z = tmp.cookie;

Hence, in the context, the value that flows into z is not sensitive. Next, suppose that a hole is filled with

    tmp = document;
    ...
    post(z);

That is, the hole aliases tmp and document and as a result, the assignment in the context can flow the confidential cookie into the variable z, thereby leaking the confidential information. Again, although re-analyzing the entire source on the client would detect this leak, it would be prohibitively expensive. One option is to treat the object x as confidential if x.f is confidential. Thus, we could treat document as confidential since document.cookie is confidential and prohibit any hole from reading document. However, this is far too restrictive as it is perfectly safe and common for the hole to read document and its non-tainted fields. Instead, when generating a residual policy, we conservatively assume that once field f of some object contains confidential information, then all fields named f contain confidential information, no matter what object they belong to. For our example, since document.cookie is confidential, we assume that tmp.cookie is confidential, and hence z is confidential, and so in the residual policy, we prevent the hole from reading z. Similarly, in this example, we would also prevent the hole from directly reading any field called cookie. Thus, we can make the residual policy checking robust and efficient, even in the presence of aliasing, by unifying the taint information of each field f across all the objects that contain f, and preventing the hole from accessing f.

We now describe our constraint-based algorithm that analyzes a context in order to compute the MNR and MNW sets corresponding to the residual policy. For clarity of exposition, we omit the indirect, real, and prototype wrappers, and we only describe how residual policies are generated for confidentiality policies; it is straightforward to extend the method to integrity policies.

Taint Propagation. To compute the MNR and MNW sets, we use two new covariant unary constructors NR() and NW(), which correspond to not-read and not-write taints. We seed the analysis
by adding constraints that flow these new taint constructors into the variables of the confidentiality policy. In particular:

    for each (x, •) in the policy,   NR(c_x) ⊆ X_x

where c_x is a special constant associated with x. We use unary constructors with these special constants as arguments so that we can later define filter (i.e., ⊆_{A,B}) constraints. In addition to the basic flow constraints from Figure 8, which propagate these taint seeds throughout the context code, we add new constraints to account for the subtleties described above. In particular, to handle higher-order functions, we contravariantly (resp. covariantly) propagate the taints from the function arguments (resp. return values) to function definitions:

    for each fun-def labeled i,   X_{p_i} ⊆_{NR,NW} X_i
                                  X_{p_i} ⊆_{NW,NR} X_i
                                  X_{ret_i} ⊆_{NR,NR} X_i
                                  X_{ret_i} ⊆_{NW,NW} X_i

where X_i, X_{p_i}, and X_{ret_i} are the constraint variables representing the flows into the function labeled i, its formal parameter, and its return value respectively. Finally, to handle aliasing, we unify the taints across all objects containing a field f by creating a special variable X_f and generating the following constraints:

    for each object labeled i,   X_{i.f} ⊆_{NR,NR} X_f
                                 X_f ⊆_{NR,NR} X_{i.f}
                                 X_{i.f} ⊆_{NW,NW} X_f
                                 X_f ⊆_{NW,NW} X_{i.f}

Residual Policy Generation. To compute the residual policy, we solve the entire set of constraints, that is, the basic flow constraints augmented with the constraints above. Intuitively, if an NR() (resp. NW()) constructor flows into the constraint variable X_x corresponding to the program variable x, then the variable is added to the MNR (resp. MNW) set of the residual policy. Let S be the constraint solution. We write S ⊢ t ∈ X if the solution S maps the constraint variable X to a set containing the term t. The must-not-read and must-not-write sets of the residual policy are defined:

    MNR ≐ { x | S ⊢ NR(·) ∈ X_x } ∪    (Not-Read-Variables)
          { f | S ⊢ NR(·) ∈ X_f }      (Not-Read-Fields)

    MNW ≐ { x | S ⊢ NW(·) ∈ X_x } ∪    (Not-Write-Variables)
          { f | S ⊢ NW(·) ∈ X_f }      (Not-Write-Fields)

5.1 Experiments

We have implemented a static, constraint-based instantiation of the SIF framework for JavaScript. Our analysis is currently a standalone tool, not yet integrated within a browser. As a result, we do not have automatic support for staging when a script is loaded dynamically. Instead, for the purposes of evaluation, we have implemented a Firefox browser extension that intercepts all dynamic code loading calls, and inlines the new code in the surrounding context. The subsequent static analysis proceeds as if the dynamic content had been there originally. Once our analysis engine is combined with the browser, a dynamically loaded script will instead trigger the staged analysis.

We use the JSure parser [3] as a front-end to parse JavaScript source into OCaml abstract syntax trees, over which we generate constraints. We use the Banshee [20] constraint solver to build and solve constraints. The Firefox extension is written in approximately 500 lines of JavaScript, the Banshee bindings in 400 lines of C and OCaml, and the staged information flow tool in about 6,000 lines of OCaml.

Benchmarks, policies and holes. We used our Firefox extension to collect the closed-program source for all the web sites from the Alexa top 100 list [1]. Alexa is a company that tracks web page traffic, and generates the lists of the most popular 100 web sites by country and by language. We ran our staging engine on all 100 sites in the top 100 English pages.

We checked two policies on each web site: (1) a confidentiality policy stating that the cookie value should not flow into the hole, and (2) an integrity policy stating that no values from the hole should flow into the location bar. These policies are general enough that they apply to any web site, making it easier to systematically run on all the Alexa web sites.

For each web site, we systematically identified holes as any scripts originating from a different domain than the site's. Each closed-program we collected is a snapshot of whatever JavaScript executed on that particular run; subsequent visits to the same page would likely contain different dynamic code to populate the hole. For each benchmark, we first ran our information flow analysis on the entire program. We then generated the residual policy for the holes that we identified, and performed the residual checks on the code in the hole. This simulated the situation where holes are not available at the first stage, but are made available at the second stage.

Summary of results. Of the 100 sites in the Alexa list, 97 had JavaScript, 64 had holes in them, and of the ones with holes in them, we were able to parse 63. Our full unstaged analysis successfully completed on all 63, and our staged analysis successfully
completed on 62 of these. The one benchmark that our staged anal-
That is, a variable or field must not be read (resp. written) if the not- ysis failed on (by running out of memory while generating the
read taint (resp. not-write taint) constructor flows to the constraint residual policy) is the largest benchmark in the Alexa top 100,
variable corresponding to the variable or field. namely wsj with 43,698 lines of JavaScript (which is twice as large
Residual Policy Checking. To verify that a hole satisfies a residual as the next largest benchmark).
policy, we perform a syntactic check that none of the variables or
fields in MNR (resp. MNW ) are read (resp. written) in the hole. 5.2 Performance of Unstaged Analysis
Figure 9 plots lines of code vs. running time of the unstaged
full analysis for the cookie confidentiality policy; the plot for the
5. Evaluation integrity policy follows similar trends. Our data shows that, for
In this section, we describe experiments (Section 5.1) that validate benchmarks up to 13,000 lines of code (which accounts for about
three hypotheses about our approach: our information flow analysis 80% of the benchmarks) the running time does not grow very fast,
using set constraints scales to real world JavaScript (Section 5.2); and stays under twelve seconds. Nevertheless, these times are too
our staged information flow approach creates residual checks that slow to run on the client side each time that a new hole is filled.
are much smaller and faster than the full analysis, making them Beyond 13,000 lines of code, even though the running time grows
practical for running on the client side (Section 5.3); and our in- much faster, our unstaged full analysis still scales to the largest of
formation flow analysis is precise enough to track useful properties JavaScript programs in the Alexa top 100 (76.0 seconds for 43,698
(Section 5.4). lines of JavaScript). Most of the benchmarks that take over a minute
to analyze make heavy use of prototypes. This observation points to an area of possible performance tuning for future work.

[Figure 9. Analysis time (seconds) of unstaged analysis for cookie policy vs. lines of code. Plot not reproduced; x-axis: lines of code (0 to 45,000), y-axis: analysis time in seconds (0 to 80).]

5.3 Performance and Benefits of Staging

Figure 10 shows some of the results from the 62 benchmarks on which our staged analysis ran. The last line of the table, however, averages over all benchmarks. The columns in the table are as follows: “Site and rank” gives the name of the web site and its rank in the top 100 list; “Total LOC” gives the number of lines of code on the web site, including the hole, as formatted by our own JavaScript pretty-printer; “Hole LOC” gives the number of lines of code in the hole; “Full” gives the time it took to run our unstaged information flow analysis on the entire JavaScript program; “GenRes” gives the time it took our analysis to generate the residual checks for the holes that we identified; “ChkRes” gives the time it took to perform the residual checks on the code from the holes; “FF” states whether or not the unstaged information flow analysis found any flow for the given policy (X indicates flow, × indicates no-flow); “RF” states whether or not the residual checks found any flow for the given policy. All times are in seconds.

In general, the time for computing residual policies is on the same order of magnitude as the time for running the unstaged full analysis. Even though the residual policy generation uses more kinds of constraints, it does not always take longer to solve, because only the context is being analyzed, which is smaller than the entire code that the unstaged analysis ran on.

Our data shows that all residual checks run under one second, and most run under one tenth of a second. The residual checks are performed in a single pass along with parsing, so in fact most of the time for performing residual checks lies in the parsing, which must be done anyway. Our current implementation uses a parser generator that is not optimized for speed, which leaves room for performance improvements.

As a result, residual checks would add only minimal overhead to run in a client browser, especially if we further tune the performance of our checker. Our data also shows that residual checks are about two orders of magnitude faster than the full analysis, which on average runs in 9.9 seconds, and thus would be too slow to run in a client browser. These observations together point to the benefit of staging: the full analysis would be too slow to run on the client browser, but if the developer could run the first stage of the analysis, then the remaining checks are fast enough to be run in the client browser.

5.4 Precision

In order to assess the precision of our unstaged analysis, we randomly sampled 17 of the benchmarks for the cookie-flow policy, of which 5 reported that the cookie does not flow into a hole. A manual inspection of these examples reveals that this is indeed the case. By looking at the code of the remaining 12 benchmarks, we determined that 8 of them contained holes that read, and even modified, the cookie. Many of these sites included scripts from popular ad services, such as GoogleSyndication and QuantServe, and data tracking services, like GoogleAnalytics. These services make use of cookies as a persistent storage for statistics across multiple page visits.

The reported flows on the remaining 4 benchmarks were false positives in our unstaged analysis, which were all caused by the lack of context-sensitivity. For example, if the cookie and an unrelated string both flow into the same function, and this function flows its argument to its return value, then both strings will flow to the returning call sites, smearing the actual flows. Techniques for extending set constraint-based program analysis with context sensitivity would help in this situation.

We evaluate the precision of our staged analysis by comparing its results (“RF” column in Figure 10) with the unstaged version (“FF” column). In general, the answer to whether the policy is violated should be the same in both unstaged and staged modes, and this is indeed the case for most of our benchmarks.

However, there are several benchmarks on which this is not the case. For confidentiality policies, there are 4 benchmarks for which the unstaged analysis finds no flow, but the staged one reports flow. For integrity, there are 8 such benchmarks. In each of these cases, the residual analysis reports a spurious flow because of how we conservatively taint fields when generating residual policies.

As expected, there are no cases in which the unstaged analysis reports flow but the staged analysis reports no flow.

6. Related Work

Static Information Flow. There is a rich literature on modeling security properties using information flow [15]. Many of these ideas are manifested as static language-based techniques for ensuring that high-security values do not flow into low-security outputs. These include type systems [33, 25], Hoare-logics [5], and safety (model) checking [30]. Dually, there are techniques for checking that low-security (i.e., tainted) values do not flow to safety-critical operations. These include the use of type qualifiers [28] and dataflow analysis [21]. Unfortunately, these techniques work on closed programs (or require summaries or stubs for missing code), and further, often rely on underlying structure like types, and hence cannot be applied directly to JavaScript.

Dynamic Information Flow. Several dynamic techniques for information flow control have been proposed at the language, operating system and architecture levels. The type system of [23] allows the specification and dynamic enforcement of richer flow and access control policies including the dynamic creation of principals and declassification of high-security information. These ideas cannot yet be applied in our setting as they require a closed system, written in a statically typed language (Java), and further, annotations must be provided to specify and verify the policies. There are several projects that use dynamic tainting, either via binary rewriting [24], at the architecture level [29, 32], or using virtual machines [8]. We leave the implementation of a dynamic instantiation of our framework, possibly for enforcing the residual policy checks, as an avenue for future work. However, we conjecture that dynamically tracking flows is likely to incur a significant run time overhead, and hence, is not a likely candidate for client-side deployment. Several recent projects [9, 37] propose expressive OS mechanisms for information flow control. Here the goal is to provide abstractions that allow application developers to specify policies about where
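Column order in each row: Site and rank; Total LOC; Hole LOC; then Full, GenRes, ChkRes, FF, RF for "Flow from cookie to hole?"; then Full, GenRes, ChkRes, FF, RF for "Flow from hole to location bar?".
Site and rank   Total LOC   Hole LOC   Full GenRes ChkRes FF RF   Full GenRes ChkRes FF RF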
3. myspace 22,469 3,484 77.4 27.4 0.52 X X 105.3 37.2 0.52 × ×
4. youtube 7,187 779 3.7 4.4 0.20 × × 3.6 4.8 0.18 × X
10. aol 4,714 255 2.1 2.9 0.06 X X 2.1 3.4 0.06 × ×
13. go 904 60 0.5 0.9 0.03 × × 0.5 0.9 0.03 × ×
15. cnn 15,445 3,472 71.4 18.0 0.52 X X 83.1 30.4 0.54 × X
16. espn.go 7,155 28 4.0 7.0 0.03 × × 4.0 8.2 0.03 × ×
18. flickr 747 713 0.3 0.1 0.12 × × 0.3 0.2 0.12 × ×
24. imdb 556 13 0.3 0.5 0.02 × × 0.3 0.6 0.02 × ×
28. weather 20,104 232 76.8 106.5 0.12 X X 72.5 200.6 0.09 X X
35. foxnews 13,589 70 14.7 30.7 0.10 X X 15.0 50.6 0.04 × ×
42. doubleclick 3,259 1,203 1.5 1.2 0.21 X X 1.4 1.4 0.21 × ×
43. bbc.co.uk 8,639 41 3.9 7.5 0.03 X X 3.9 8.6 0.02 × ×
44. walmart 13,174 101 7.0 22.0 0.09 X X 7.2 55.5 0.07 X X
46. rr 2,545 70 1.1 1.8 0.05 X X 1.1 3.3 0.03 × ×
47. target 10,532 61 4.0 7.0 0.04 X X 4.0 8.1 0.04 × ×
48. netflix 9,879 27 4.4 8.4 0.03 × × 4.5 9.8 0.02 × ×
49. nfl 10,485 170 8.4 16.8 0.03 × × 8.4 21.2 0.03 × ×
57. hulu 14,476 545 25.9 42.1 0.11 X X 28.2 131.4 0.12 × ×
58. verizon.net 3,456 167 1.5 2.1 0.05 X X 1.5 2.4 0.04 × ×
62. disney.go 3,383 6 1.9 3.3 0.03 × × 1.9 3.8 0.02 × ×
63. bestbuy 10,975 3,916 8.2 10.6 0.76 X X 8.7 290.0 0.80 × X
64. msn.foxsports 6,838 490 4.2 7.0 0.14 X X 4.2 16.1 0.18 × X
67. cnet 10,598 242 7.2 22.6 0.17 X X 7.3 29.2 0.06 × ×
71. linkedin 7,964 1,816 3.9 3.4 0.32 × × 3.9 3.6 0.29 X X
75. gamespot 13,041 1,491 11.8 23.6 0.32 X X 11.5 30.6 0.28 × ×
77. veoh 9,742 86 6.5 13.1 0.07 × X 6.6 32.6 0.04 × ×
79. latimes 8,225 55 6.8 9.9 0.04 X X 6.8 11.7 0.03 × X
80. nbc 7,644 74 5.8 8.5 0.04 × × 5.8 10.4 0.04 × ×
87. reuters 4,049 258 1.7 2.4 0.06 X X 1.7 2.6 0.06 × ×
88. imeem 12,050 194 4.6 7.9 0.04 × × 4.6 8.7 0.04 × ×
89. gamefaqs 365 77 0.2 0.3 0.03 × × 0.2 0.3 0.03 × ×
90. tinypic 6,658 64 3.9 6.0 0.03 × × 4.0 6.6 0.03 × ×
92. abcnews.go 14,330 246 9.9 18.0 0.07 × X 9.8 21.4 0.08 × X
99. dailymotion 11,709 379 10.9 19.3 0.08 × X 10.8 30.4 0.08 × ×
100. people 6,152 261 3.4 4.8 0.07 X X 3.4 6.8 0.06 × ×
Average 7,979 597 9.9 14.0 0.13 10.7 28.4 0.12
Figure 10. Sample results from Alexa web sites with holes. Average numbers are for all benchmarks (including those not in the table), and
times reported are in seconds.
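
The FF/RF comparison discussed in Section 5.4 can be spelled out mechanically. The following is a minimal JavaScript sketch, not part of the authors' tool: it encodes five rows transcribed from Figure 10 (the object layout and the helper name `compareStaging` are assumptions made for illustration), lists the sites where the residual check reports a flow that the unstaged analysis does not, and confirms that the opposite case never occurs on these rows.

```javascript
// Five rows transcribed from Figure 10. `ff` / `rf` are true when the
// unstaged analysis (FF) or the residual check (RF) reports a flow.
const rows = [
  { site: "myspace",    cookie: { ff: true,  rf: true  }, locationBar: { ff: false, rf: false } },
  { site: "youtube",    cookie: { ff: false, rf: false }, locationBar: { ff: false, rf: true  } },
  { site: "weather",    cookie: { ff: true,  rf: true  }, locationBar: { ff: true,  rf: true  } },
  { site: "veoh",       cookie: { ff: false, rf: true  }, locationBar: { ff: false, rf: false } },
  { site: "abcnews.go", cookie: { ff: false, rf: true  }, locationBar: { ff: false, rf: true  } },
];

// Hypothetical helper: classify every row for one policy
// ("cookie" or "locationBar").
function compareStaging(rows, policy) {
  const spurious = []; // RF reports a flow, FF does not (precision loss)
  const missed = [];   // FF reports a flow, RF does not (would be unsound)
  for (const row of rows) {
    const { ff, rf } = row[policy];
    if (rf && !ff) spurious.push(row.site);
    if (ff && !rf) missed.push(row.site);
  }
  return { spurious, missed };
}

const cookieResult = compareStaging(rows, "cookie");
const locationResult = compareStaging(rows, "locationBar");

console.log("cookie policy, spurious residual flows:", cookieResult.spurious);
console.log("location-bar policy, spurious residual flows:", locationResult.spurious);
console.log("missed flows:", cookieResult.missed.length + locationResult.missed.length);
```

On these five rows the sketch lists veoh and abcnews.go as spurious for the cookie policy and youtube and abcnews.go for the location-bar policy, with no missed flows, which agrees with the corresponding table entries.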

data generated by the process should be allowed to flow. These approaches are too coarse-grained to be applicable to our setting.

Analyzing JavaScript. Several authors have studied the problem of analyzing JavaScript. Some of the idiosyncratic features of JavaScript are described in [31], which also presents a type system for statically checking JavaScript programs. Further, [6] describes an algorithm for inferring types for JavaScript programs. However, it is unclear whether JavaScript programs in the wild satisfy the typing disciplines described in these works. Neither approach deals with dynamically generated code, and hence cannot directly be applied to our setting. The interaction of JavaScript and web browsers is studied in [36], which presents a formal semantics of the interaction, and uses it to describe a general framework for dynamically verifying arbitrary safety properties inside the browser. Gatekeeper [22] is a static analysis framework for JavaScript that focuses on performing analysis in a single stage (e.g., on the server). In contrast, our primary focus is on developing residual checks that specify how dynamically loaded code should behave in order for the system to satisfy high-level flow policies.

Web and Browser Security. Several recent projects have considered the problem of securing web applications via browser and language mechanisms. Many vulnerabilities arise from not appropriately sanitizing user-generated content on the server side. Several server-side tools apply static analysis to determine whether user-generated content has been properly vetted [19, 35, 34]. To ensure safety on the client side, one simple and elegant approach is to only allow previously known and authorized scripts to run on a web page [18]. Unfortunately, this makes it harder to use dynamically generated third-party content, and hence is not applicable in our setting. Finally, there have been several proposals for redesigning the ecosystem within which web-applications are built and deployed [4, 2, 7]. In essence these approaches advocate that web-applications be built in higher-level languages like C#, Java and JIF respectively, thereby availing of the protection mechanisms available in those languages. It remains to be seen whether web-application developers are willing to trade the flexibility and rapid-prototyping strengths of JavaScript for the security benefits offered by strongly typed languages.

Set Constraint-based Program Analysis. Set constraints provide an expressive framework for many kinds of program analyses, including points-to analyses [14, 16], type qualifier inference [13], race detection [26], and uncaught exceptions [10]. Our contribution is to show that this expressive framework is especially suited to capturing the complexities of JavaScript including fields and higher-order functions, and that after using optimizations like the optimistic field analysis the resulting analysis scales to the JavaScript that powers most popular websites.

References

[1] English: Alexa top 100 sites, November 2008. http://www.alexa.com/.
[2] Google web toolkit, November 2008. http://code.google.com/webtoolkit/.
[3] Jsure, November 2008. http://www.jsure.org/.
[4] Volta, November 2008. http://live.labs.com/volta.
[5] T. Amtoft and A. Banerjee. Information flow analysis in logical form. In SAS, pages 100–115, 2004.
[6] C. Anderson, P. Giannini, and S. Drossopoulou. Towards type inference for JavaScript. In ECOOP, pages 428–452, 2005.
[7] S. Chong, J. Liu, A. C. Myers, X. Qi, K. Vikram, L. Zheng, and X. Zheng. Secure web application via automatic partitioning. In SOSP, pages 31–44, 2007.
[8] J. Chow, B. Pfaff, T. Garfinkel, K. Christopher, and M. Rosenblum. Understanding data lifetime via whole system simulation. In USENIX Security Symposium, pages 321–336, 2004.
[9] P. Efstathopoulos, M. Krohn, S. VanDeBogart, C. Frey, D. Ziegler, E. Kohler, D. Mazières, F. Kaashoek, and R. Morris. Labels and event processes in the Asbestos operating system. In SOSP. ACM, 2005.
[10] M. Fähndrich and A. Aiken. Program analysis using mixed term and set constraints. In SAS, pages 114–126, 1997.
[11] M. Fähndrich, J. S. Foster, A. Aiken, and J. Cu. Tracking down exceptions in Standard ML programs. Technical report, EECS Department, UC Berkeley, 1998.
[12] C. Flanagan and M. Felleisen. Componential set-based analysis. ACM Trans. Program. Lang. Syst., 21(2):370–416, 1999.
[13] J. S. Foster, M. Fähndrich, and A. Aiken. A theory of type qualifiers. In PLDI. ACM, 1999.
[14] J. S. Foster, M. Fähndrich, and A. Aiken. Polymorphic versus monomorphic flow-insensitive points-to analysis for C. In SAS, 2000.
[15] J. A. Goguen and J. Meseguer. Security policies and security models. In IEEE Symposium on Security and Privacy, pages 11–20, 1982.
[16] B. Hardekopf and C. Lin. The ant and the grasshopper: fast and accurate pointer analysis for millions of lines of code. In PLDI, 2007.
[17] D. Herman and C. Flanagan. Status report: specifying JavaScript with ML. In ML, pages 47–52, 2007.
[18] T. Jim, N. Swamy, and M. Hicks. Defeating script injection attacks with browser-enforced embedded policies. In WWW, 2007.
[19] N. Jovanovic, C. Krügel, and E. Kirda. Pixy: A static analysis tool for detecting web application vulnerabilities (short paper). In IEEE Symposium on Security and Privacy, 2006.
[20] J. Kodumal and A. Aiken. Banshee: A scalable constraint-based analysis toolkit. In SAS, pages 218–234, 2005.
[21] M. S. Lam, M. Martin, V. B. Livshits, and J. Whaley. Securing web applications with static and dynamic information flow tracking. In PEPM, pages 3–12, 2008.
[22] B. Livshits and S. Guarnieri. Gatekeeper: Mostly static enforcement of security and reliability policies for JavaScript code. Technical Report MSR-TR-2009-16, Microsoft Research, Feb. 2009.
[23] A. C. Myers. Programming with explicit security policies. In ESOP, pages 1–4, 2005.
[24] J. Newsome and D. X. Song. Dynamic taint analysis for automatic detection, analysis, and signature generation of exploits on commodity software. In NDSS, 2005.
[25] F. Pottier and V. Simonet. Information flow inference for ML. In POPL, pages 319–330, 2002.
[26] P. Pratikakis, J. S. Foster, and M. Hicks. Locksmith: context-sensitive correlation analysis for race detection. In PLDI. ACM, 2006.
[27] N. Provos, D. McNamee, P. Mavrommatis, K. Wang, and N. Modadugu. The ghost in the browser: analysis of web-based malware. In HotBots, 2007.
[28] U. Shankar, K. Talwar, J. S. Foster, and D. Wagner. Detecting format string vulnerabilities with type qualifiers. In USENIX Security, 2001.
[29] G. E. Suh, J. W. Lee, D. Zhang, and S. Devadas. Secure program execution via dynamic information flow tracking. In ASPLOS, 2004.
[30] T. Terauchi and A. Aiken. Secure information flow as a safety problem. In SAS, pages 352–367, 2005.
[31] P. Thiemann. Towards a type system for analyzing JavaScript programs. In ESOP, pages 408–422, 2005.
[32] N. Vachharajani, M. J. Bridges, J. Chang, R. Rangan, G. Ottoni, J. A. Blome, G. Reis, M. Vachharajani, and D. I. August. Rifle: An architectural framework for user-centric information-flow security. In MICRO, 2004.
[33] D. Volpano and G. Smith. Verifying secrets and relative secrecy. In POPL, 2000.
[34] G. Wassermann and Z. Su. Static detection of cross-site scripting vulnerabilities. In ICSE, pages 171–180, 2008.
[35] Y. Xie and A. Aiken. Scalable error detection using Boolean satisfiability. In POPL, pages 351–363, 2005.
[36] D. Yu, A. Chander, N. Islam, and I. Serikov. JavaScript instrumentation for browser security. In POPL, pages 237–249, 2007.
[37] N. Zeldovich, S. Boyd-Wickizer, and D. Mazières. Securing distributed systems with information flow control. In NSDI, 2008.