Staged Information Flow for JavaScript: Ravi Chugh, Jeffrey A. Meister, Ranjit Jhala, Sorin Lerner
We can show that if Check and Stage meet the above criteria, then for all policies P and programs s, if P is violated by s then SIF(P, s) returns ERR. A sketch of the proof is as follows: assume that P is violated at some point during the execution of s, and suppose that at the point of failure, sites i_1 through i_k in s had been loaded with s_1 through s_k. Thus, we know that s[i_1 ↦ s_1, ..., i_k ↦ s_k] violates the policy, and then by the first property above, we know that Check(Stage(P, s[i_1 ↦ s_1, ..., i_{k−1} ↦ s_{k−1}]), s_k) = ERR. Then, using k − 2 applications of the second property above, we can show that Stage(P, s[i_1 ↦ s_1, ..., i_{k−1} ↦ s_{k−1}]) = Stage(P, s), and therefore Check(Stage(P, s), s_k) = ERR, which means that SIF(P, s) would return ERR when it performs the residual check on s_k.

The second condition above enables our framework to compute the flows and residual policies once, without having to recompute them each time that a hole is filled. In essence, the conditions on Stage and Check ensure that the dynamically loaded code does not induce any new flows for the variables described in the top-level policy P. If any new flows would be induced by the hole, then Check would return ERR and execution would be halted.

Rewrite(P, fun(this, p){ s }) =
    {data : fun(this, x, I) Rewrite(P, s),
     taint : I}

Rewrite(P, {f_1 : x_1, ...}) =
    {data : {f_1 : {data : x_1.data, taint : I + x_1.taint}, ...},
     taint : I}

Rewrite(P, x := e) =
    var tmp := Rewrite(P, e);
    x.data := tmp.data;
    x.taint := tmp.taint;
    DST(P, x)

Rewrite(P, x.f := e) =
    var tmp := Rewrite(P, e);
    x.data.f.data := tmp.data;
    x.data.f.taint := tmp.taint

Rewrite(P, x := f(z)) =
    var tmp := f.data;
    x := tmp(this, z, I + f.taint);
    DST(P, x)

Rewrite(P, x := y.f(z)) =
    x := y.data.f.data(y, z, I + y.data.f.taint);
    DST(P, x)

Rewrite(P, s_1; s_2) =
    Rewrite(P, s_1); Rewrite(P, s_2)

Rewrite(P, if_i x then s_1 else s_2) =
    var tmp := I;
    I := I + x.taint;
    if_i x.data then Rewrite(P, s_1) else Rewrite(P, s_2);
    I := tmp

Rewrite(P, while x do s) =
    var tmp := I;
    I := I + x.taint;
    while x.data do Rewrite(P, s);
    I := tmp

Rewrite(P, x := eval_i(y)) =
    var tmp := eval_i(Rewrite(P, y.data));
    x.data := tmp.data;
    x.taint := I + tmp.taint + y.taint

Figure 7. Dynamic Information Flow Rewriting. We assume complex expressions are bound to fresh temporary variables. The global variable I, initially the empty set, stores the set of indirect taints.

4. Static Instantiation
We now describe how we have instantiated our framework by presenting our implementations of Stage and Check. Stage takes a policy P and program s and returns the residual policy for the eval sites in s. Check takes a residual policy RP comprising a must-not-read and must-not-write set, and a statement corresponding to code to be loaded at a hole, and verifies that the statement satisfies the residual policy, by verifying that the statement does not read (resp. write) the variables or fields listed in the must not read (resp. must not write) sets.

Next, we describe our flow-insensitive, field-sensitive, set-constraint based instantiation of the procedure Stage. First, we present the different elements constituting the constraints, constants, constructors, and terms. Second, we describe our syntax-directed constraint generation procedure. Third, we discuss some optimizations required to analyze JavaScript with sufficient precision. Fourth, we show how Stage combines policies and constraints to compute residual policies.

Set Constraints. A term is either a constraint variable X, a constant, or a constructed term C(t_1, ..., t_n), where C is a constructor of arity n and t_1, ..., t_n are terms. A set constraint is a constraint of the form t_1 ⊆ t_2, where t_1 and t_2 are terms. A satisfying solution for a finite set of constraints maps each constraint variable to a set of constants and constructed terms, such that all of the inclusion constraints are satisfied. For details, we refer the reader to [20]. For two unary constructors C, D, we write the constraint t_1 ⊆_{C,D} t_2 as an abbreviation for the pair of constraints t_1 ⊆ C(X), D(X) ⊆ t_2, where X is a fresh constraint variable that is distinct from all other variables.
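The inclusion and filter constraints defined above can be illustrated with a small fixpoint computation. The following is only a minimal sketch, not the solver used by the authors: the names `solve`, `wrap`, and `unwrap`, the string encoding of terms, and the three constraint kinds (`seed`, `sub`, `filter`) are all assumptions made for illustration. It also assumes that a filter X ⊆_{C,D} Y silently drops terms whose head constructor is not C, and it handles only unary constructors.

```javascript
// A term is a string: a constant such as "c", or a unary constructed term such as "Ind(c)".
// (Hypothetical encoding, chosen only to keep the sketch short.)
const wrap = (ctor, term) => `${ctor}(${term})`;

// If `term` has the form ctor(t), return t; otherwise return null.
function unwrap(ctor, term) {
  const prefix = `${ctor}(`;
  if (term.startsWith(prefix) && term.endsWith(")")) {
    return term.slice(prefix.length, -1);
  }
  return null;
}

// Naive fixpoint: re-apply every constraint until no solution set grows.
function solve(constraints) {
  const solution = new Map();
  const setOf = (v) => {
    if (!solution.has(v)) solution.set(v, new Set());
    return solution.get(v);
  };
  let changed = true;
  const add = (v, term) => {
    const s = setOf(v);
    if (!s.has(term)) {
      s.add(term);
      changed = true;
    }
  };
  while (changed) {
    changed = false;
    for (const c of constraints) {
      if (c.kind === "seed") {
        // term ⊆ X
        add(c.to, c.term);
      } else if (c.kind === "sub") {
        // X ⊆ Y
        for (const t of [...setOf(c.from)]) add(c.to, t);
      } else if (c.kind === "filter") {
        // X ⊆_{C,D} Y: unwrap C(t) found in X and flow D(t) into Y
        for (const t of [...setOf(c.from)]) {
          const inner = unwrap(c.C, t);
          if (inner !== null) add(c.to, wrap(c.D, inner));
        }
      }
    }
  }
  return solution;
}

// Usage: only the Ind-wrapped term passes the Ind,Ind filter; the bare constant d does not.
const demo = solve([
  { kind: "seed", term: "Ind(c)", to: "X_x" },
  { kind: "seed", term: "d", to: "X_x" },
  { kind: "filter", from: "X_x", C: "Ind", D: "Ind", to: "X_i" },
  { kind: "sub", from: "X_i", to: "X_y" },
]);
console.log([...demo.get("X_i")], [...demo.get("X_y")]);
```

Choosing different constructors for C and D rewraps a term as it flows, for example turning NR(c) into NW(c).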
4.1 Constraint Elements
We set up a system of constraints over variables X_e for each subexpression e of the program. The constraints use several kinds of constructors to model various aspects of JavaScript code. The first two constructors are standard ways of encoding functions and fields using set constraints [12]. The last three are novel mechanisms required to capture information flow and the semantics of JavaScript: they are used to distinguish between fields that are directly contained in an object, fields that are reachable by transitively following a prototype chain, values that directly reach a particular point, and values that indirectly reach a particular point.

1. Function Constructor. JavaScript programs have first-class functions in that functions can be created and passed around like any other value. We model the flow of function values via a constructor Fun() of arity 5. The first argument corresponds to the objects constructed by the function object, a special feature of JavaScript that we will describe in the sequel. This argument is treated as contravariant. The second argument corresponds to the function’s implicit parameter this. As the argument corresponds to an input of the function, it is treated as contravariant. The third argument corresponds to the explicit formal parameter of the function. As this argument also corresponds to an input of the function, it is treated as contravariant. The fourth argument corresponds to an implicit parameter that holds the values corresponding to indirect flows into the points where the function is invoked. This parameter is used as the initial set of indirect flows into the body of the function, and as it corresponds to an input, the argument is also treated as contravariant. The fifth argument corresponds to the return value, and hence the argument is covariant.

2. Field Constructors. JavaScript programs make heavy use of fields and any precise analysis must track flows in a field-sensitive manner. The classical way to model fields is to view them as a pair of functions: a setter that updates the contents of the field, and a getter that returns the contents of the field. Following this intuition, we encode a field f via a constructor Fld_f() of arity 2. The first parameter corresponds to the set of values written into the field, i.e., the inputs to the setter, and hence, is treated as contravariant. The second parameter corresponds to the set of values read from the field, i.e., the outputs of the getter, and hence, is treated as covariant. When initializing an object’s fields, we use the same set variable in both places so that all values that flow into the first argument flow out of the second argument. When writing a field, we pad the second argument with the set variable Ω, which collects everything that flows from the field by covariance. When reading a field, we pad the first argument with the set variable ∅, so that nothing flows into the field by contravariance.

3. Real and Prototype Flow Constructors. In order to determine what a field read returns in the presence of prototyping, we must track, for each object, the values for all fields that can be read directly from the object or transitively via following its prototype chain. To distinguish between the fields of an object and the fields reachable via the prototype chain of an object, we use a special constructor Real() of arity 1 to wrap the fields that an object directly contains, and a special constructor Pro() of arity 1 to wrap the fields that are transitively reachable by following the object’s prototype chain.

In general, if c can be reached by following the prototype chain of an expression e, then our constraints will ensure that Pro(c) flows into X_e. For example, x.f can return the constant c if the object has a field f that has the value c, or if an object in its prototype chain has a field f that has the value c. In the former case, our constraints ensure that the term Real(Fld_f(c, c)) flows into X_x. In the latter case, our constraints ensure that the term Pro(Fld_f(c, c)) flows into X_x.

Statements

Gen(k, X_I, skip) = ∅

Gen(k, X_I, var x) =
    {c_x ⊆ X_x} ∪ {Ind(c_x) ⊆ X_x}

Gen(k, X_I, x := e) =
    Gen(k, X_I, e) ∪ {X_e ⊆ X_x} ∪ {X_I ⊆ X_x}

Gen(k, X_I, x.f := e) =
    Gen(k, X_I, e) ∪
    {X_x ⊆ Real(Fld_f(X_e, Ω))} ∪ {X_x ⊆ Real(Fld_f(X_I, Ω))}

Gen(k, X_I, s_1; s_2) =
    Gen(k, X_I, s_1) ∪ Gen(k, X_I, s_2)

Gen(k, X_I, if_i x then s_1 else s_2) =
    Gen(k, X_i, s_1) ∪ Gen(k, X_i, s_2) ∪
    {X_I ⊆ X_i} ∪ {X_x ⊆_{Ind,Ind} X_i}

Gen(k, X_I, while_i x do s) =
    Gen(k, X_i, s) ∪ {X_I ⊆ X_i} ∪ {X_x ⊆_{Ind,Ind} X_i}

Gen(k, X_I, return e) =
    Gen(k, X_I, e) ∪ {X_e ⊆ X_{ret_k}} ∪ {X_I ⊆ X_{ret_k}}

Gen(k, X_I, eval_i(e)) = ∅

Expressions

Gen(k, X_I, c as e) =
    {c ⊆ X_e} ∪ {Ind(c) ⊆ X_e}

Gen(k, X_I, x) = ∅

Gen(k, X_I, x.f as e) =
    {X_x ⊆ Real(Fld_f(∅, X_e))} ∪ {X_x ⊆ Pro(Fld_f(∅, X_e))}

Gen(k, X_I, e_1 op e_2 as e) =
    Gen(k, X_I, e_1) ∪ Gen(k, X_I, e_2) ∪
    {X_{e_1} ⊆ X_e} ∪ {X_{e_2} ⊆ X_e}

Gen(k, X_I, {..., f_j : e_j, ...}_i as e) =
    (∪_j Gen(k, X_I, e_j)) ∪ {X_i ⊆ X_e} ∪
    (∪_j {X_{e_j} ⊆ X_{i.f_j}}) ∪ (∪_j {X_I ⊆ X_{i.f_j}}) ∪
    (∪_j {Real(Fld_{f_j}(X_{i.f_j}, X_{i.f_j})) ⊆ X_i})

Gen(k, X_I, this as e) = {this_k ⊆ X_e}

Gen(k, X_I, fun_i(this_i, p_i){ s } as e) =
    Gen(i, X_{ind_i}, s) ∪ {X_I ⊆ X_{ind_i}} ∪ {X_i ⊆ X_e} ∪
    {Fun(X_{cons_i}, X_{this_i}, X_{p_i}, X_{ind_i}, X_{ret_i}) ⊆ X_i} ∪
    (∪_j {Real(Fld_{f_j}(X_{proto_i.f_j}, X_{proto_i.f_j})) ⊆ X_{proto_i}}) ∪
    {Real(Fld_{proto}(X_{proto_i}, X_{proto_i})) ⊆ X_i} ∪
    {X_{cons_i} ⊆ Real(Fld_{proto}(X_{proto_i}, Ω))}

Gen(k, X_I, f(e_0) as e) =
    Gen(k, X_I, e_0) ∪ {X_f ⊆ Fun(∅, X_{o_g}, X_{e_0}, X_I, X_e)}

Gen(k, X_I, x.f(e_0) as e) =
    Gen(k, X_I, e_0) ∪
    {X_x ⊆ Real(Fld_f(∅, Fun(∅, X_x, X_{e_0}, X_I, X_e)))} ∪
    {X_x ⊆ Pro(Fld_f(∅, Fun(∅, X_x, X_{e_0}, X_I, X_e)))}

Gen(k, X_I, new_i f(e_0) as e) =
    Gen(k, X_I, e_0) ∪ {X_i ⊆ X_e} ∪
    (∪_j {Real(Fld_{f_j}(X_{i.f_j}, X_{i.f_j})) ⊆ X_i}) ∪
    {X_f ⊆ Fun(X_i, X_i, X_{e_0}, X_I, Ω)} ∪
    {X_{i.proto} ⊆_{Real,Pro} X_i} ∪ {X_{i.proto} ⊆_{Pro,Pro} X_i}

Figure 8. Constraint generation
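A few of the rules in Figure 8 can be sketched as a recursive traversal. This is a hedged illustration covering only skip, assignment, sequencing, branches, constants, and variable references; the AST shape, the function names `gen`, `genExpr`, and `exprVar`, and the string rendering of constraints (including `⊆[Ind,Ind]` for the filter) are assumptions, not the authors' implementation. The function label k is omitted because return and this are not modeled here.

```javascript
// Hypothetical AST shapes (not from the paper):
//   statements:  {type:"skip"} | {type:"assign", x, e} | {type:"seq", s1, s2}
//                | {type:"if", label, x, s1, s2}
//   expressions: {type:"var", name} | {type:"const", c, id}

// Constraint variable for an expression: X_<name> for a variable, X_e<id> for a constant.
function exprVar(e) {
  return e.type === "var" ? `X_${e.name}` : `X_e${e.id}`;
}

// Expression rules: a constant c flows into X_e both directly and wrapped as Ind(c);
// a variable reference generates no constraints.
function genExpr(e) {
  if (e.type === "const") {
    return [`${e.c} ⊆ ${exprVar(e)}`, `Ind(${e.c}) ⊆ ${exprVar(e)}`];
  }
  return [];
}

// Statement rules: XI names the indirect-flow variable of the current location.
function gen(XI, s) {
  switch (s.type) {
    case "skip":
      return [];
    case "assign":
      // direct flow from e into x, plus the current indirect flow into x
      return [...genExpr(s.e), `${exprVar(s.e)} ⊆ X_${s.x}`, `${XI} ⊆ X_${s.x}`];
    case "seq":
      return [...gen(XI, s.s1), ...gen(XI, s.s2)];
    case "if": {
      // the branch label i gets a fresh indirect-flow variable X_i
      const Xi = `X_${s.label}`;
      return [
        ...gen(Xi, s.s1),
        ...gen(Xi, s.s2),
        `${XI} ⊆ ${Xi}`,
        `X_${s.x} ⊆[Ind,Ind] ${Xi}`,
      ];
    }
    default:
      throw new Error(`unsupported statement: ${s.type}`);
  }
}

// Usage: if_b1 h then l := 1 else skip
const program = {
  type: "if", label: "b1", x: "h",
  s1: { type: "assign", x: "l", e: { type: "const", c: "1", id: 0 } },
  s2: { type: "skip" },
};
console.log(gen("X_I", program).join("\n"));
```

In the generated set, the constraint `X_b1 ⊆ X_l` is what lets the Ind-wrapped terms of the branch condition h reach l, even though h is never assigned to l directly.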
4. Indirect Flow Constructor. When tracking information flow, we must track both direct value flows, as well as indirect flows that arise when assignments take place under particular branch conditions. However, due to the presence of higher-order functions and dynamic dispatch, we must take care to separate direct flows (which affect which functions get executed at different program points), from indirect flows (which have no effect on the execution). To achieve this separation, we use a covariant constructor Ind() of arity 1 to wrap constants and convert them into ground terms that participate in indirect flows.

In general, if a constant c directly flows into an expression e, then the constraints ensure that the term c flows into X_e. If the constant c indirectly flows into an expression e, however, then our constraints ensure that Ind(c) flows into X_e. For example, if c indirectly flows into the expression x.f, then our constraints will ensure that the term Real(Fld_f(Ind(c), Ind(c))) flows into X_x.

4.2 Constraint Generation
Figure 8 shows the constraint generation procedure Gen. The procedure takes as input a label k corresponding to the identifier of the function currently being analyzed, a constraint variable X_I representing the indirect flows into the program location being analyzed, and either e or s, respectively the expression or statement being analyzed, and it returns as output a set of inclusion constraints. Gen traverses the AST of the program and generates constraints between variables of the form X_e for each subexpression e, that capture the set of values that flow directly or indirectly into e. We maintain the invariants that: (1) only values wrapped with the Ind() constructor flow into the indirect flow variables X_I, (2) for every value that directly flows into e, there is a corresponding term that flows into X_e, and, (3) for every value that is reachable from e after transitively following the prototype chain rooted at e, there is a corresponding term wrapped under Pro() that flows to X_e. Next, we discuss how constraints are generated for a representative subset of expressions and statements.

Assignments. For each assignment x := e, we generate constraints on the subexpression e, and then constraints that capture the direct flow from e into x as well as the indirect flow from the current location’s indirect flow variable into x. Notice that a return statement is treated as an assignment to the return variable of the function to which the statement belongs.

Branches. Each branch statement and loop is labeled with a unique label i that can be generated with a syntactic pass over the source. For each branch or loop labeled i, we create a new indirect flow variable X_i. We flow the values in X_I and the indirect values from the expression used in the branch condition into X_i, and then use X_i as the indirect flow variable when generating the constraints for the statements that depend on the branch. To preserve the invariant that indirect flows are wrapped under the Ind() constructor (and hence, not used to affect computation), we filter flows to wrapped terms, using X_x ⊆_{Ind,Ind} X_i.

Object Literals. We use the set variable X_{i.f} to track the contents of field f of object i. For every field f_j, we flow the initial value e_j into the field with the constraint X_{e_j} ⊆ X_{i.f_j}, and the current indirect taint into the field with the constraint X_I ⊆ X_{i.f_j}. Finally, we add the constraint Real(Fld_{f_j}(X_{i.f_j}, X_{i.f_j})) ⊆ X_i to treat f_j as a field that object i directly contains, where X_{i.f_j} is used as both the setter and getter for the field.

Fields. For each field x.f read in (resp. written in) an expression e, we create the appropriate flow constraint between X_x and Fld_f(∅, X_e) (resp. Fld_f(X_e, Ω)), which by virtue of the constructor matching and variance, has the effect of flowing the values from (resp. into) the f field of x into (resp. from) X_e. We must account for prototype chains when reading and writing to fields. For field reads from x.f, the result can flow from either the object x (if it has a field f), or from some object in its prototype chain with a field f. To model these semantics, the values returned from reads are the values from objects that directly flow into x (i.e., Real()-wrapped terms that flow into X_x) as well as objects that flow into x after following the prototype chain (i.e., Pro()-wrapped terms that flow into X_x). For field writes to x.f, only the actual object itself may be updated (as opposed to some object along the prototype chain). To model these semantics, we only carry out the assignment on those objects that directly reach x, i.e., are wrapped under Real().

Function Definitions. Each anonymous function declaration is labeled with a unique label i. For each function i, we create a term Fun(X_{cons_i}, X_{this_i}, X_{p_i}, X_{ind_i}, X_{ret_i}) corresponding to the function’s value, and create the appropriate constraints from the constraint variables representing the “inputs” X_{this_i}, X_{p_i}, and X_{ind_i} into the body of the function, and from the return statement of the function to X_{ret_i}. The contravariant argument cons_i in the first position of the function term corresponds to the objects constructed by the function. The last three constraints deal with prototypes. First, we create a fresh prototype object, namely X_{proto_i}, and set up its fields in the same way we do for object literals. Second, we add the constraint Real(Fld_{proto}(X_{proto_i}, X_{proto_i})) ⊆ X_i that stores the prototype object in the proto field of the function object. Third, we add the constraint X_{cons_i} ⊆ Real(Fld_{proto}(X_{proto_i}, Ω)), which has the effect of writing the prototype object into the proto field of any objects that flow to cons_i.

Function Calls. For each function call, we generate a constraint that uses constructor matching to pull out the set of actual functions reaching the callsite, and uses variance to flow the actuals (both explicit, and implicit due to indirect flow) into the formals, and the return out to the callsite respectively [11]. The values flowed in for the cons and this parameters differ depending on the three kinds of function calls.
• For direct calls of the form f(e_0), we use ∅ for the constructed object and the flow variable corresponding to the global object o_g for the this parameter.
• For method calls of the form x.f(e_0), we use constructor matching in a manner similar to field-reads, to pull out the appropriate functions that flow to the callsite, and we use the receiver object x as the this parameter.
• For constructor calls of the form new_i f(e_0), which create a fresh object and call f with this bound to the fresh object, we introduce a new variable X_i that represents the objects created at the callsite and set up its fields as we do for object literals. Next, we use constructor matching to pull out the functions that flow to the callsite, and (via contravariance), flow X_i into the constructed object parameter and this parameter of the callees. Finally, the last two constraints “flatten” the constructed object’s prototype chain. Intuitively, the constraints add all fields of an object’s prototype chain into the object directly, while at the same time keeping track of which fields actually belong to the object and which do not. To achieve this, we take each object that flows into the prototype field of the constructed object (X_{i.proto}) either directly (i.e., wrapped under Real()) or via prototype-chains (i.e., wrapped under Pro()), and rewrap those objects under Pro() and flow them into the constructed object X_i.

4.3 Analyzing Real JavaScript
Multi-parameter Functions. Functions in JavaScript can be invoked with any number of arguments, regardless of how many parameters the function is defined to accept; missing arguments are
set to undefined and additional arguments are ignored.² To model this in the implementation, we define a common set constructor Fun_n() that, in addition to cons, this, return, and taint parameters, takes n arguments, where n is the maximum number of arguments across all function definitions and applications in the program. When a definition or call in the program uses fewer than n arguments, we pad the remaining arguments with fresh constraint variables. We omit these details from the presentation since they are straightforward.

² Whenever a function is called in JavaScript, the list of actuals provided is bound to a special variable called arguments available within the body. We model this behavior in our implementation.

Iterative Field-Sensitivity. Due to the flexibility of JavaScript objects, we must assume that every object can have a binding for every possible static field. However, this naïve approach does not scale, as the product of object count and field count is often very large. Instead, we perform an iterative field-sensitive analysis that tracks fields on a per-object basis as needed. For each object, we begin by tracking only the fields that we are certain will exist based on the object definition. After solving the set constraints under these assumptions, we check whether any objects flow into accesses of fields that were not being tracked. We add constructed terms for missing fields as appropriate, and incrementally solve for the constraints again.

Current Limitations. Tracking reads and writes of dynamic fields, i.e., array or dictionary lookups, is significantly harder than tracking statically known field names. Modeling these accesses with a Fld_⊤ set constructor, where ⊤ represents unknown fields, makes the analysis unscalable, due to large numbers of accesses through integer fields (for array objects) and complex string expressions that compute precise names of HTML elements on the page. For the purposes of our analysis, we make the dynamically checkable assumption that dynamically created field names, i.e., dynamically created array or hash table indices, do not clash with statically known field names.

Our current implementation also does not support several other features of JavaScript, but these can be directly captured within our constraint-based SIF framework. These include the with statement, which allows its body to be evaluated with a given object’s fields temporarily brought into scope, and call and apply forms that allow the programmer to explicitly set the this parameter of a function call, which can be used to implement closure-based inheritance.

4.4 Residual Policy Generation
We now describe how we use our constraint-based flow analysis to compute residual policies for holes. Recall that the residual policy comprises a set of must-not-read (MNR) and must-not-write (MNW) variables and fields. Thus, at a high level, for confidentiality (resp. integrity) policies, our goal is to find variables to which (resp. from which) the sensitive information may flow, and then prohibit the client from reading (resp. writing) those variables.

However, it turns out that several subtleties arise due to the combination of higher-order functions, aliasing and the requirement that residual policy checking be efficient. We illustrate these issues using examples that motivate our algorithm for generating residual policies. For the following examples, suppose we wish to enforce the confidentiality policy stating that the document’s cookie should not flow into a hole.

Functions. Intuitively, the residual policy needs to prevent the hole from reading any variable in the context that is tainted by the cookie. However, the residual policy must also prevent the hole from writing certain variables. To illustrate, consider the following context:

    f = function(x) { ... }
    f(document.cookie);

That is, the context contains a function f that is called with the cookie as a parameter. Next, suppose that a hole is filled with:

    f = function(x) { post(x); }

That is, the hole redefines the function f with another function that broadcasts the argument x. If this code is dynamically loaded, it can overwrite the original “trusted” function f, and a policy violation will occur if the new function is called from the context. If we could re-analyze the entire source of the context and the hole at the client, then we would deduce that the call in the context flows the cookie into the formal x of the new function defined in the hole. However, performing client-side flow analysis on the entire code each time a hole is loaded would make residual policy checking prohibitively inefficient. Instead, we observe that when a function’s arguments receive confidential values, we can guarantee confidentiality by ensuring the function itself is not overwritten by the hole. Thus, even for confidentiality policies, certain variables, namely those holding function values whose arguments have been tainted, should not be written by the hole. Dually, for integrity policies, the residual policy must also include both must-not-read and must-not-write sets.

Aliasing. Consider a context that contains the following code snippet, where tmp is not aliased to document:

    z = tmp.cookie;

Hence, in the context, the value that flows into z is not sensitive. Next, suppose that a hole is filled with

    tmp = document;
    ...
    post(z);

That is, the hole aliases tmp and document and as a result, the assignment in the context can flow the confidential cookie into the variable z, thereby leaking the confidential information. Again, although re-analyzing the entire source on the client would detect this leak, it would be prohibitively expensive. One option is to treat the object x as confidential if x.f is confidential. Thus, we could treat document as confidential since document.cookie is confidential and prohibit any hole from reading document. However, this is far too restrictive as it is perfectly safe and common for the hole to read document and its non-tainted fields. Instead, when generating a residual policy, we conservatively assume that once field f of some object contains confidential information, then all fields named f contain confidential information, no matter what object they belong to. For our example, since document.cookie is confidential, we assume that tmp.cookie is confidential, and hence z is confidential, and so in the residual policy, we prevent the hole from reading z. Similarly, in this example, we would also prevent the hole from directly reading any field called cookie. Thus, we can make the residual policy checking robust and efficient, even in the presence of aliasing, by unifying the taint information of each field f across all the objects that contain f, and preventing the hole from accessing f.

We now describe our constraint-based algorithm that analyzes a context in order to compute the MNR and MNW sets corresponding to the residual policy. For clarity of exposition, we omit the indirect, real, and prototype wrappers, and we only describe how residual policies are generated for confidentiality policies; it is straightforward to extend the method to integrity policies.

Taint Propagation. To compute the MNR and MNW sets, we use two new covariant unary constructors NR() and NW(), which correspond to not-read and not-write taints. We seed the analysis
by adding constraints that flow these new taint constructors into the variables of the confidentiality policy. In particular:

    for each (x, •) in the policy,    NR(c_x) ⊆ X_x

where c_x is a special constant associated with x. We use unary constructors with these special constants as arguments so that we can later define filter (i.e., ⊆_{A,B}) constraints. In addition to the basic flow constraints from Figure 8, which propagate these taint seeds throughout the context code, we add new constraints to account for the subtleties described above. In particular, to handle higher-order functions, we contravariantly (resp. covariantly) propagate the taints from the function arguments (resp. return values) to function definitions:

    for each fun-def labeled i,    X_{p_i} ⊆_{NR,NW} X_i
                                   X_{p_i} ⊆_{NW,NR} X_i
                                   X_{ret_i} ⊆_{NR,NR} X_i
                                   X_{ret_i} ⊆_{NW,NW} X_i

where X_i, X_{p_i}, and X_{ret_i} are the constraint variables representing the flows into the function labeled i, its formal parameter, and its return value respectively. Finally, to handle aliasing, we unify the taints across all objects containing a field f by creating a special variable X_f and generating the following constraints:

    for each object labeled i,    X_{i.f} ⊆_{NR,NR} X_f
                                  X_f ⊆_{NR,NR} X_{i.f}
                                  X_{i.f} ⊆_{NW,NW} X_f
                                  X_f ⊆_{NW,NW} X_{i.f}

Residual Policy Generation. To compute the residual policy, we solve the entire set of constraints, that is, the basic flow constraints augmented with the constraints above. Intuitively, if a NR() (resp. NW()) constructor flows into the constraint variable X_x corresponding to the program variable x, then the variable is added to the MNR (resp. MNW) set of the residual policy. Let S be the constraint solution. We write S ⊢ t ∈ X if the solution S maps the constraint variable X to a set containing the term t. The must-not-read and must-not-write sets of the residual policy are defined:

    MNR ≐ { x | S ⊢ NR(·) ∈ X_x }    (Not-Read-Variables)
        ∪ { f | S ⊢ NR(·) ∈ X_f }    (Not-Read-Fields)

    MNW ≐ { x | S ⊢ NW(·) ∈ X_x }    (Not-Write-Variables)
        ∪ { f | S ⊢ NW(·) ∈ X_f }    (Not-Write-Fields)

That is, a variable or field must not be read (resp. written) if the not-read taint (resp. not-write taint) constructor flows to the constraint variable corresponding to the variable or field.

Residual Policy Checking. To verify that a hole satisfies a residual policy, we perform a syntactic check that none of the variables or fields in MNR (resp. MNW) are read (resp. written) in the hole.

5. Evaluation
In this section, we describe experiments (Section 5.1) that validate three hypotheses about our approach: our information flow analysis using set constraints scales to real-world JavaScript (Section 5.2); our staged information flow approach creates residual checks that are much smaller and faster than the full analysis, making them practical for running on the client side (Section 5.3); and our information flow analysis is precise enough to track useful properties (Section 5.4).

5.1 Experiments
We have implemented a static, constraint-based instantiation of the SIF framework for JavaScript. Our analysis is currently a standalone tool, not yet integrated within a browser. As a result, we do not have automatic support for staging when a script is loaded dynamically. Instead, for the purposes of evaluation, we have implemented a Firefox browser extension that intercepts all dynamic code loading calls, and inlines the new code in the surrounding context. The subsequent static analysis proceeds as if the dynamic content had been there originally. Once our analysis engine is combined with the browser, a dynamically loaded script will instead trigger the staged analysis.

We use the JSure parser [3] as a front-end to parse JavaScript source into OCaml abstract syntax trees, over which we generate constraints. We use the Banshee [20] constraint solver to build and solve constraints. The Firefox extension is written in approximately 500 lines of JavaScript, the Banshee bindings in 400 lines of C and OCaml, and the staged information flow tool in about 6,000 lines of OCaml.

Benchmarks, policies and holes. We used our Firefox extension to collect the closed-program source for all the web sites from the Alexa top 100 list [1]. Alexa is a company that tracks web page traffic, and generates the lists of the most popular 100 web sites by country and by language. We ran our staging engine on all 100 sites in the top 100 English pages.

We checked two policies on each web site: (1) a confidentiality policy stating that the cookie value should not flow into the hole, and (2) an integrity policy stating that no values from the hole should flow into the location bar. These policies are general enough that they apply to any web site, making it easier to systematically run on all the Alexa web sites.

For each web site, we systematically identified holes as any scripts originating from a different domain than the site’s. Each closed-program we collected is a snapshot of whatever JavaScript executed on that particular run; subsequent visits to the same page would likely contain different dynamic code to populate the hole. For each benchmark, we first ran our information flow analysis on the entire program. We then generated the residual policy for the holes that we identified, and performed the residual checks on the code in the hole. This simulated the situation where holes are not available at the first stage, but are made available at the second stage.

Summary of results. Of the 100 sites in the Alexa list, 97 had JavaScript, 64 had holes in them, and of the ones with holes in them, we were able to parse 63. Our full unstaged analysis successfully completed on all 63, and our staged analysis successfully completed on 62 of these. The one benchmark that our staged analysis failed on (by running out of memory while generating the residual policy) is the largest benchmark in the Alexa top 100, namely wsj with 43,698 lines of JavaScript (which is twice as large as the next largest benchmark).

5.2 Performance of Unstaged Analysis
Figure 9 plots lines of code vs. running time of the unstaged full analysis for the cookie confidentiality policy; the plot for the integrity policy follows similar trends. Our data shows that, for benchmarks up to 13,000 lines of code (which accounts for about 80% of the benchmarks), the running time does not grow very fast, and stays under twelve seconds. Nevertheless, these times are too slow to run on the client side each time that a new hole is filled. Beyond 13,000 lines of code, even though the running time grows much faster, our unstaged full analysis still scales to the largest of JavaScript programs in the Alexa top 100 (76.0 seconds for 43,698 lines of JavaScript). Most of the benchmarks that take over a minute
to analyze make heavy use of prototypes. This observation points to an area of possible performance tuning for future work.

[Plot: analysis time in seconds (0 to 80) against lines of code (0 to 45,000).]
Figure 9. Analysis time (seconds) of unstaged analysis for cookie policy vs. lines of code.

5.3 Performance and Benefits of Staging

Figure 10 shows some of the results from the 62 benchmarks on which our staged analysis ran. The last line of the table, however, averages over all benchmarks. The columns in the table are as follows: “Site and rank” gives the name of the web site and its rank in the top 100 list; “Total LOC” gives the number of lines of code on the web site, including the hole, as formatted by our own JavaScript pretty-printer; “Hole LOC” gives the number of lines of code in the hole; “Full” gives the time it took to run our unstaged information flow analysis on the entire JavaScript program; “GenRes” gives the time it took our analysis to generate the residual checks for the holes that we identified; “ChkRes” gives the time it took to perform the residual checks on the code from the holes; “FF” states whether or not the unstaged information flow analysis found any flow for the given policy (✓ indicates flow, × indicates no flow); “RF” states whether or not the residual checks found any flow for the given policy. All times are in seconds.

In general, the time for computing residual policies is on the same order of magnitude as the time for running the unstaged full analysis. Even though the residual policy generation uses more kinds of constraints, it does not always take longer to solve, because only the context is being analyzed, which is smaller than the entire code that the unstaged analysis ran on.

Our data shows that all residual checks run under one second, and most run under one tenth of a second. The residual checks are performed in a single pass along with parsing, so in fact most of the time for performing residual checks lies in the parsing, which must be done anyway. Our current implementation uses a parser generator that is not optimized for speed, which leaves room for performance improvements.

As a result, residual checks would add only minimal overhead to run in a client browser, especially if we further tune the performance of our checker. Our data also shows that residual checks are about two orders of magnitude faster than the full analysis, which on average runs in 9.9 seconds, and thus would be too slow to run in a client browser. These observations together point to the benefit of staging: the full analysis would be too slow to run on the client browser, but if the developer could run the first stage of the analysis, then the remaining checks are fast enough to be run in the client browser.

5.4 Precision

In order to assess the precision of our unstaged analysis, we randomly sampled 17 of the benchmarks for the cookie-flow policy, of which 5 reported that the cookie does not flow into a hole. A manual inspection of these examples reveals that this is indeed the case. By looking at the code of the remaining 12 benchmarks, we determined that 8 of them contained holes that read, and even modified, the cookie. Many of these sites included scripts from popular ad services, such as GoogleSyndication and QuantServe, and data tracking services, like GoogleAnalytics. These services make use of cookies as a persistent storage for statistics across multiple page visits.

The reported flows on the remaining 4 benchmarks were false positives in our unstaged analysis, all caused by the lack of context-sensitivity. For example, if the cookie and an unrelated string both flow into the same function, and this function flows its argument to its return value, then both strings will flow to the returning call sites, smearing the actual flows. Techniques for extending set constraint-based program analysis with context sensitivity would help in this situation.

We evaluate the precision of our staged analysis by comparing its results (“RF” column in Figure 10) with the unstaged version (“FF” column). In general, the answer to whether the policy is violated should be the same in both unstaged and staged modes, and this is indeed the case for most of our benchmarks.

However, there are several benchmarks on which this is not the case. For confidentiality policies, there are 4 benchmarks for which the unstaged analysis finds no flow, but the staged one reports flow. For integrity, there are 8 such benchmarks. In each of these cases, the residual analysis reports a spurious flow because of how we conservatively taint fields when generating residual policies.

As expected, there are no cases in which the unstaged analysis reports flow but the staged analysis reports no flow.

6. Related Work

Static Information Flow. There is a rich literature on modeling security properties using information flow [15]. Many of these ideas are manifested as static language-based techniques for ensuring that high-security values do not flow into low-security outputs. These include type systems [33, 25], Hoare logics [5], and safety (model) checking [30]. Dually, there are techniques for checking that low-security (i.e., tainted) values do not flow to safety-critical operations. These include the use of type qualifiers [28] and dataflow analysis [21]. Unfortunately, these techniques work on closed programs (or require summaries or stubs for missing code), and further, often rely on underlying structure like types, and hence cannot be applied directly to JavaScript.

Dynamic Information Flow. Several dynamic techniques for information flow control have been proposed at the language, operating system and architecture levels. The type system of [23] allows the specification and dynamic enforcement of richer flow and access control policies including the dynamic creation of principals and declassification of high-security information. These ideas cannot yet be applied in our setting as they require a closed system, written in a statically typed language (Java), and further, annotations must be provided to specify and verify the policies. There are several projects that use dynamic tainting, either via binary rewriting [24], at the architecture level [29, 32], or using virtual machines [8]. We leave the implementation of a dynamic instantiation of our framework, possibly for enforcing the residual policy checks, as an avenue for future work. However, we conjecture that dynamically tracking flows is likely to incur a significant run time overhead, and hence, is not a likely candidate for client-side deployment. Several recent projects [9, 37] propose expressive OS mechanisms for information flow control. Here the goal is to provide abstractions that allow application developers to specify policies about where
                                        Flow from cookie to hole?        Flow from hole to location bar?
Site and rank     Total LOC  Hole LOC   Full  GenRes  ChkRes  FF  RF     Full  GenRes  ChkRes  FF  RF
3. myspace           22,469     3,484   77.4    27.4    0.52  ✓   ✓     105.3    37.2    0.52  ×   ×
4. youtube            7,187       779    3.7     4.4    0.20  ×   ×       3.6     4.8    0.18  ×   ✓
10. aol               4,714       255    2.1     2.9    0.06  ✓   ✓       2.1     3.4    0.06  ×   ×
13. go                  904        60    0.5     0.9    0.03  ×   ×       0.5     0.9    0.03  ×   ×
15. cnn              15,445     3,472   71.4    18.0    0.52  ✓   ✓      83.1    30.4    0.54  ×   ✓
16. espn.go           7,155        28    4.0     7.0    0.03  ×   ×       4.0     8.2    0.03  ×   ×
18. flickr              747       713    0.3     0.1    0.12  ×   ×       0.3     0.2    0.12  ×   ×
24. imdb                556        13    0.3     0.5    0.02  ×   ×       0.3     0.6    0.02  ×   ×
28. weather          20,104       232   76.8   106.5    0.12  ✓   ✓      72.5   200.6    0.09  ✓   ✓
35. foxnews          13,589        70   14.7    30.7    0.10  ✓   ✓      15.0    50.6    0.04  ×   ×
42. doubleclick       3,259     1,203    1.5     1.2    0.21  ✓   ✓       1.4     1.4    0.21  ×   ×
43. bbc.co.uk         8,639        41    3.9     7.5    0.03  ✓   ✓       3.9     8.6    0.02  ×   ×
44. walmart          13,174       101    7.0    22.0    0.09  ✓   ✓       7.2    55.5    0.07  ✓   ✓
46. rr                2,545        70    1.1     1.8    0.05  ✓   ✓       1.1     3.3    0.03  ×   ×
47. target           10,532        61    4.0     7.0    0.04  ✓   ✓       4.0     8.1    0.04  ×   ×
48. netflix           9,879        27    4.4     8.4    0.03  ×   ×       4.5     9.8    0.02  ×   ×
49. nfl              10,485       170    8.4    16.8    0.03  ×   ×       8.4    21.2    0.03  ×   ×
57. hulu             14,476       545   25.9    42.1    0.11  ✓   ✓      28.2   131.4    0.12  ×   ×
58. verizon.net       3,456       167    1.5     2.1    0.05  ✓   ✓       1.5     2.4    0.04  ×   ×
62. disney.go         3,383         6    1.9     3.3    0.03  ×   ×       1.9     3.8    0.02  ×   ×
63. bestbuy          10,975     3,916    8.2    10.6    0.76  ✓   ✓       8.7   290.0    0.80  ×   ✓
64. msn.foxsports     6,838       490    4.2     7.0    0.14  ✓   ✓       4.2    16.1    0.18  ×   ✓
67. cnet             10,598       242    7.2    22.6    0.17  ✓   ✓       7.3    29.2    0.06  ×   ×
71. linkedin          7,964     1,816    3.9     3.4    0.32  ×   ×       3.9     3.6    0.29  ✓   ✓
75. gamespot         13,041     1,491   11.8    23.6    0.32  ✓   ✓      11.5    30.6    0.28  ×   ×
77. veoh              9,742        86    6.5    13.1    0.07  ×   ✓       6.6    32.6    0.04  ×   ×
79. latimes           8,225        55    6.8     9.9    0.04  ✓   ✓       6.8    11.7    0.03  ×   ✓
80. nbc               7,644        74    5.8     8.5    0.04  ×   ×       5.8    10.4    0.04  ×   ×
87. reuters           4,049       258    1.7     2.4    0.06  ✓   ✓       1.7     2.6    0.06  ×   ×
88. imeem            12,050       194    4.6     7.9    0.04  ×   ×       4.6     8.7    0.04  ×   ×
89. gamefaqs            365        77    0.2     0.3    0.03  ×   ×       0.2     0.3    0.03  ×   ×
90. tinypic           6,658        64    3.9     6.0    0.03  ×   ×       4.0     6.6    0.03  ×   ×
92. abcnews.go       14,330       246    9.9    18.0    0.07  ×   ✓       9.8    21.4    0.08  ×   ✓
99. dailymotion      11,709       379   10.9    19.3    0.08  ×   ✓      10.8    30.4    0.08  ×   ×
100. people           6,152       261    3.4     4.8    0.07  ✓   ✓       3.4     6.8    0.06  ×   ×
Average               7,979       597    9.9    14.0    0.13             10.7    28.4    0.12

Figure 10. Sample results from Alexa web sites with holes. Average numbers are for all benchmarks (including those not in the table), and times reported are in seconds. In the FF and RF columns, ✓ indicates flow and × indicates no flow.
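
The ChkRes and RF columns report on the residual check described under Residual Policy Checking: a syntactic pass confirming that a hole reads no name in the must-not-read set (MNR) and writes no name in the must-not-write set (MNW). The following is a minimal sketch of such a check, assuming the hole has already been parsed into a flat list of read and write accesses. The function names `makeResidualPolicy` and `checkHole`, the shape of the access records, and the example variable names are illustrative assumptions, not the implementation evaluated here.

```javascript
// Hypothetical residual policy: names (variables or fields) that a hole
// must not read (MNR) and must not write (MNW).
function makeResidualPolicy(mustNotRead, mustNotWrite) {
  return { mnr: new Set(mustNotRead), mnw: new Set(mustNotWrite) };
}

// Assumed input: accesses extracted from the hole by a parser, each of the
// form { kind: "read" | "write", name: string }.
// Returns the violating accesses; an empty array means the hole passes.
function checkHole(policy, accesses) {
  const violations = [];
  for (const access of accesses) {
    // Reads are checked against MNR, writes against MNW.
    const forbidden = access.kind === "read" ? policy.mnr : policy.mnw;
    if (forbidden.has(access.name)) {
      violations.push(access);
    }
  }
  return violations;
}

// Example: the hole may neither read `cookie` nor write `location`.
const policy = makeResidualPolicy(["cookie"], ["location"]);
const benignHole = [
  { kind: "read", name: "title" },
  { kind: "write", name: "adSlot" },
];
const leakyHole = [
  { kind: "read", name: "cookie" },
  { kind: "write", name: "adSlot" },
];

console.log(checkHole(policy, benignHole)); // no violations reported
console.log(checkHole(policy, leakyHole));  // reports the read of `cookie`
```

Because the check is a single membership test per access, its cost is linear in the size of the hole, which is consistent with the observation in Section 5.3 that most of the residual checking time lies in parsing.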
data generated by the process should be allowed to flow. These approaches are too coarse-grained to be applicable to our setting.

Analyzing JavaScript. Several authors have studied the problem of analyzing JavaScript. Some of the idiosyncratic features of JavaScript are described in [31], which also presents a type system for statically checking JavaScript programs. Further, [6] describes an algorithm for inferring types for JavaScript programs. However, it is unclear whether JavaScript programs in the wild satisfy the typing disciplines described in these works. Neither approach deals with dynamically generated code, and hence cannot directly be applied to our setting. The interaction of JavaScript and web browsers is studied in [36], which presents a formal semantics of the interaction, and uses it to describe a general framework for dynamically verifying arbitrary safety properties inside the browser. Gatekeeper [22] is a static analysis framework for JavaScript that focuses on performing analysis in a single stage (e.g., on the server). In contrast, our primary focus is on developing residual checks that specify how dynamically loaded code should behave in order for the system to satisfy high-level flow policies.

Web and Browser Security. Several recent projects have considered the problem of securing web applications via browser and language mechanisms. Many vulnerabilities arise from not appropriately sanitizing user generated content on the server side. Several server-side tools apply static analysis to determine whether user generated content has been properly vetted [19, 35, 34]. To ensure safety on the client side, one simple and elegant approach is to only allow previously known and authorized scripts to run on a web page [18]. Unfortunately, this makes it harder to use dynamically generated third-party content, and hence is not applicable in our setting. Finally, there have been several proposals for redesigning the ecosystem within which web applications are built and deployed [4, 2, 7]. In essence these approaches advocate that web applications be built in higher-level languages like C#, Java and JIF respectively, thereby availing of the protection mechanisms available in those languages. It remains to be seen whether web application developers are willing to trade the flexibility and rapid-prototyping strengths of JavaScript for the security benefits offered by strongly typed languages.

Set Constraint-based Program Analysis. Set constraints provide an expressive framework within which many kinds of program analyses can be expressed, including points-to analyses [14, 16], type qualifier inference [13], race detection [26], and uncaught exceptions [10]. Our contribution is to show that this expressive framework is especially suited to capturing the complexities of JavaScript including
fields and higher-order functions, and that, after using optimizations like the optimistic field analysis, the resulting analysis scales to the JavaScript that powers most popular websites.

References

[1] English: Alexa top 100 sites, November 2008. http://www.alexa.com/.
[2] Google web toolkit, November 2008. http://code.google.com/webtoolkit/.
[3] Jsure, November 2008. http://www.jsure.org/.
[4] Volta, November 2008. http://live.labs.com/volta.
[5] T. Amtoft and A. Banerjee. Information flow analysis in logical form. In SAS, pages 100–115, 2004.
[6] C. Anderson, P. Giannini, and S. Drossopoulou. Towards type inference for JavaScript. In ECOOP, pages 428–452, 2005.
[7] S. Chong, J. Liu, A. C. Myers, X. Qi, K. Vikram, L. Zheng, and X. Zheng. Secure web application via automatic partitioning. In SOSP, pages 31–44, 2007.
[8] J. Chow, B. Pfaff, T. Garfinkel, K. Christopher, and M. Rosenblum. Understanding data lifetime via whole system simulation. In USENIX Security Symposium, pages 321–336, 2004.
[9] P. Efstathopoulos, M. Krohn, S. VanDeBogart, C. Frey, D. Ziegler, E. Kohler, D. Mazières, F. Kaashoek, and R. Morris. Labels and event processes in the Asbestos operating system. In SOSP. ACM, 2005.
[10] M. Fähndrich and A. Aiken. Program analysis using mixed term and set constraints. In SAS, pages 114–126, 1997.
[11] M. Fähndrich, J. S. Foster, A. Aiken, and J. Cu. Tracking down exceptions in Standard ML programs. Technical report, EECS Department, UC Berkeley, 1998.
[12] C. Flanagan and M. Felleisen. Componential set-based analysis. ACM Trans. Program. Lang. Syst., 21(2):370–416, 1999.
[13] J. S. Foster, M. Fähndrich, and A. Aiken. A theory of type qualifiers. In PLDI. ACM, 1999.
[14] J. S. Foster, M. Fähndrich, and A. Aiken. Polymorphic versus monomorphic flow-insensitive points-to analysis for C. In SAS, 2000.
[15] J. A. Goguen and J. Meseguer. Security policies and security models. In IEEE Symposium on Security and Privacy, pages 11–20, 1982.
[16] B. Hardekopf and C. Lin. The ant and the grasshopper: fast and accurate pointer analysis for millions of lines of code. In PLDI, 2007.
[17] D. Herman and C. Flanagan. Status report: specifying JavaScript with ML. In ML, pages 47–52, 2007.
[18] T. Jim, N. Swamy, and M. Hicks. Defeating script injection attacks with browser-enforced embedded policies. In WWW, 2007.
[19] N. Jovanovic, C. Krügel, and E. Kirda. Pixy: A static analysis tool for detecting web application vulnerabilities (short paper). In IEEE Symposium on Security and Privacy, 2006.
[20] J. Kodumal and A. Aiken. Banshee: A scalable constraint-based analysis toolkit. In SAS, pages 218–234, 2005.
[21] M. S. Lam, M. Martin, V. B. Livshits, and J. Whaley. Securing web applications with static and dynamic information flow tracking. In PEPM, pages 3–12, 2008.
[22] B. Livshits and S. Guarnieri. Gatekeeper: Mostly static enforcement of security and reliability policies for JavaScript code. Technical Report MSR-TR-2009-16, Microsoft Research, Feb. 2009.
[23] A. C. Myers. Programming with explicit security policies. In ESOP, pages 1–4, 2005.
[24] J. Newsome and D. X. Song. Dynamic taint analysis for automatic detection, analysis, and signature generation of exploits on commodity software. In NDSS, 2005.
[25] F. Pottier and V. Simonet. Information flow inference for ML. In POPL, pages 319–330, 2002.
[26] P. Pratikakis, J. S. Foster, and M. Hicks. Locksmith: context-sensitive correlation analysis for race detection. In PLDI. ACM, 2006.
[27] N. Provos, D. McNamee, P. Mavrommatis, K. Wang, and N. Modadugu. The ghost in the browser: analysis of web-based malware. In HotBots, 2007.
[28] U. Shankar, K. Talwar, J. S. Foster, and D. Wagner. Detecting format string vulnerabilities with type qualifiers. In USENIX Security, 2001.
[29] G. E. Suh, J. W. Lee, D. Zhang, and S. Devadas. Secure program execution via dynamic information flow tracking. In ASPLOS, 2004.
[30] T. Terauchi and A. Aiken. Secure information flow as a safety problem. In SAS, pages 352–367, 2005.
[31] P. Thiemann. Towards a type system for analyzing JavaScript programs. In ESOP, pages 408–422, 2005.
[32] N. Vachharajani, M. J. Bridges, J. Chang, R. Rangan, G. Ottoni, J. A. Blome, G. Reis, M. Vachharajani, and D. I. August. RIFLE: An architectural framework for user-centric information-flow security. In MICRO, 2004.
[33] D. Volpano and G. Smith. Verifying secrets and relative secrecy. In POPL, 2000.
[34] G. Wassermann and Z. Su. Static detection of cross-site scripting vulnerabilities. In ICSE, pages 171–180, 2008.
[35] Y. Xie and A. Aiken. Scalable error detection using Boolean satisfiability. In POPL, pages 351–363, 2005.
[36] D. Yu, A. Chander, N. Islam, and I. Serikov. JavaScript instrumentation for browser security. In POPL, pages 237–249, 2007.
[37] N. Zeldovich, S. Boyd-Wickizer, and D. Mazières. Securing distributed systems with information flow control. In NSDI, 2008.