5.2.2 JavaScript context-specific generation
JavaScript context-specific generation is necessary whenever a data flow ends within a sink that interprets a string as JavaScript code. This is the case for functions such as eval and Function, for event handlers (such as onload and onerror), and for DOM properties such as script.textContent, script.text and script.innerText. While browsers are very forgiving when parsing and executing syntactically incorrect HTML, they are quite strict when it comes to JavaScript code execution. If the JavaScript parser encounters a syntax error, it cancels the script execution for the complete block/function. Therefore, the big challenge for the exploit generator is to generate a syntactically correct exploit that will not cause the parser to cancel the execution. In order to do so, the system again has to determine the exact location of the tainted bytes.

Listing 3 shows a very simple vulnerable piece of JavaScript code. In the first step, the code constructs a string of benign/hard-coded and tainted (location.href) parts. In a second step, it executes the code using eval. Thereby, this code can be exploited in slightly different ways. Either the attacker could break out of the variable x and inject his code into the function named test, or he could break out of both the variable x and the function test and inject his code into the top-level JavaScript space. While the first method requires an additional invocation of the test function, the second exploit executes as soon as eval is called with syntactically correct code. However, for the latter case, the complexity of the break out sequence grows with the complexity of the constructed code. Nevertheless, we do not want to rely on any behavior of other non-controllable code or wait for a user interaction to trigger an invocation of the manipulated code.
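Listing 3 itself is not reproduced in this excerpt; what follows is a hedged sketch of the described pattern, reconstructed from the eval result shown further below (the function name test, the helper doSomething and the exact string layout are assumptions):

// Sketch of a Listing-3-style vulnerable pattern: a string is concatenated
// from hard-coded parts and the tainted location.href, then passed to eval.
var code = "function test(){";
code += "var x = \"" + location.href + "\";";   // tainted part ends up inside a string literal
code += "doSomething(x);}";
eval(code);                                     // JavaScript-executing sink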
Therefore, we always seek to break out to the top level of the JavaScript execution. In order to do so, our system first parses the JavaScript string and creates a syntax tree of the code. Based on this tree and the taint information, we extract the branches that contain tainted values. Listing 4 shows the resulting syntax tree for our example code and the extracted branch (in gray). For each of the extracted branches the generator creates one break out sequence by traversing the branch from top to bottom and adding a fixed sequence of closing/break out characters for each node. So in our example the following steps are taken:

1. FunctionDeclaration: ';'
2. FunctionConstructor: ''
3. Block: '}'
4. Declaration: ';'
5. StringLiteral: '"'
6. Resulting Breakout Sequence: '";};'
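The following is a minimal sketch of how such a break out sequence could be assembled from the tainted branch; the node-type names mirror the list above, while the character table, the function names and the reverse-order concatenation are assumptions about the general idea rather than the authors' actual implementation:

// Map each AST node type on the tainted branch to the character that closes it.
var CLOSING_CHARS = {
  FunctionDeclaration: ";",
  FunctionConstructor: "",
  Block: "}",
  Declaration: ";",
  StringLiteral: "\""
};

// branch: node types from the root down to the tainted string literal.
function breakoutSequence(branch) {
  return branch
    .map(function (nodeType) { return CLOSING_CHARS[nodeType] || ""; })
    .reverse()   // the innermost context (the string literal) must be closed first
    .join("");
}

// breakoutSequence(["FunctionDeclaration", "FunctionConstructor",
//                   "Block", "Declaration", "StringLiteral"])  ->  '";};'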
To trigger the exploit we can simply construct the test case as follows: based on the source (location.href), the system simply adds the break out sequence, an arbitrary payload and the escape sequence to the URL of the page:

http://example.org/#";};__reportingFunction__();//

When executed within a browser, the string construction process from Listing 3 is conducted and the following string flows into the eval call (note: line breaks are only inserted for readability reasons):

function test(){
var x = "http://example.org/#";
};
__reportingFunction__();
// doSomething(x);}
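Put differently, the test URL above is the page URL followed by three components; a small sketch of this assembly (variable names are illustrative, __reportingFunction__ is the verification payload mentioned in the text):

// The fragment appended to the page URL consists of the break out sequence,
// the verification payload and an escape sequence that comments out the
// remainder of the original code.
var breakout = '";};';
var payload = "__reportingFunction__();";
var escapeSeq = "//";
var testCase = "http://example.org/#" + breakout + payload + escapeSeq;
// -> http://example.org/#";};__reportingFunction__();//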
6. EMPIRICAL STUDY
As mentioned earlier, an important motivation for our work was to gain insight into the prevalence and nature of potentially insecure data flows in current JavaScript applications leading to DOM-based XSS. For this reason, we created a Web crawling infrastructure capable of automatically applying our vulnerability detection and validation techniques to a large set of real-world Web sites.
[Figure: overview of the crawling infrastructure — Browser 1 … Browser m, each running Tab 1 … Tab n, with one Web page loaded per tab]

… browser features that were needed for the crawling and analyzing processes were realized in the form of a browser extension. Following the general architecture of Chrome's extension …
… potentially vulnerable data flows, based on the following criteria:

(C1) The data flow ended in a sink that allows, if no further sanitization steps were taken, direct JavaScript execution. Hence, all flows into cookies, Web Storage, or DOM attribute values were excluded.

(C2) The data flow originates from a source that can immediately be controlled by the adversary, without programmatic preconditions or assumptions in respect to the processing code. This criterion effectively excluded all flows that come from second-order sources, such as cookies or Web Storage, as well as flows from the postMessage API.

(C3) Only data flows without any built-in escaping methods and data flows with non-matching escaping methods were considered. Data flows for which the observed built-in escaping methods indeed provide appropriate protection for the flow's final sink were excluded.

(C4) For each of the remaining data flows we generated exploits. However, many flows led to the generation of exactly the same exploit payloads for exactly the same URL, e.g., when a Web page inserts three scripts via document.write and always includes location.hash at a similar location. In order to decrease the overhead for testing the exploits, our system only validates one of these exploits.
Starting from the initial 24,474,306 flows, we successively applied the outlined criteria to establish the set of relevant flows:

24,474,306  --C1-->  4,948,264  --C2-->  1,825,598  --C3-->  313,794  --C4-->  181,238        (5)
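The reduction in (5) amounts to successively filtering the flow set with the four criteria. The following is a minimal sketch under the assumption that each criterion can be expressed as a predicate over a recorded flow; the function and variable names are illustrative, not the authors' code:

// Successively apply the criteria C1-C4 to the set of recorded flows.
function applyCriteria(flows, criteria) {
  return criteria.reduce(function (remaining, criterion) {
    return remaining.filter(criterion);   // keep only flows that satisfy the criterion
  }, flows);
}

// e.g. applyCriteria(allFlows, [c1, c2, c3, c4])
// 24,474,306 -> 4,948,264 -> 1,825,598 -> 313,794 -> 181,238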
Thus, in total we generated 181,238 test payloads, out of which a total of 69,987 successfully caused the injected JavaScript to execute. We discuss the specifics of these results in the next section.
6.4 Found vulnerabilities
In total, we generated a dataset of 181,238 test payloads utilizing several combinations of sources and sinks. As discussed in Section 6.3 (C3), all flows which are encoded are filtered early on. For Google Chromium, which we used in our testing infrastructure, adhering to this rule means we must also filter all those exploits that use either location.search or document.referrer to carry the payloads. This is due to the fact that both these values are automatically encoded by Chromium. Hence, we chose to test these vulnerabilities in Internet Explorer 10, whereas the rest of the URLs were verified using our aforementioned crawling infrastructure. Since the number of exploits utilizing search vulnerabilities amounts to 38,329 and the sum for referrer reached 5,083, the total number of exploits tested in Chromium was reduced to 137,826, whereas the remaining 43,412 exploits were tested using Internet Explorer.
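A hedged sketch of this routing rule; the function and property names are assumptions, only the two automatically encoded sources are taken from the text:

// Exploits whose payload travels via a source that Chromium URL-encodes
// automatically can only fire in Internet Explorer 10.
function targetBrowser(exploit) {
  if (exploit.source === "location.search" || exploit.source === "document.referrer") {
    return "Internet Explorer 10";
  }
  return "Chromium";
}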
Out of these, a total number of 58,066 URLs tested in Chromium triggered our verification payload. Additionally, we could exploit 11,921 URLs visited in Internet Explorer. This corresponds to a success rate of 38.61% in total ((58,066 + 11,921) / 181,238) and a success rate of 42.13% (58,066 / 137,826) when only considering vulnerabilities exploitable in Chromium.
As we discussed earlier, we crawled down one level from the entry page. We assume that a high number of Web sites utilize content management systems and thus include the same client-side code in each of their sub-pages. Hence, to zero in on the number of actual vulnerabilities we decided to reduce the data set by applying a uniqueness criterion. For any finding that triggered an exploit, we therefore retrieved the URL, the used break out sequence, the type of code (inline, eval or external) and the exact location. Next, we normalized the URL to its corresponding second-level domain. To be consistent in regards to our selection of domains, we used the search feature on alexa.com to determine the corresponding second-level domain for each URL. We then determined for each of the results the tuple:

{domain, break out sequence, code type, code location}

In regards to the code location, we chose to implement the uniqueness to be the exact line and column offset in case of external scripts and evals, and the column offset in inline scripts.
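A minimal sketch of this uniqueness criterion; the record field names, the key format, the helper and the findings array are assumptions, not the authors' implementation:

// Build the deduplication key {domain, break out sequence, code type, code location}.
function uniquenessKey(finding) {
  // Inline scripts are keyed by column offset only; external scripts and
  // evals by exact line and column offset.
  var codeLocation = finding.codeType === "inline"
    ? String(finding.column)
    : finding.line + ":" + finding.column;
  return [finding.domain, finding.breakoutSequence, finding.codeType, codeLocation].join("|");
}

var seen = {};
var uniqueFindings = findings.filter(function (f) {   // findings: assumed array of exploit records
  var key = uniquenessKey(f);
  if (seen[key]) { return false; }
  seen[key] = true;
  return true;
});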
Applying the uniqueness filter to the complete dataset, including those pages only exploitable on Internet Explorer, we found a total of 8,163 unique exploits on 701 different domains, whereas a domain corresponds to the aforementioned normalized domain. Due to the nature of our approach, among these were also domains not contained in the top 5000 domains. Thus, we applied another filter, removing all exploits from these domains outside the top 5000. This reduced the number of unique exploits to 6,167, stemming from 480 different domains. In respect to the number of domains we originally crawled, this means that our infrastructure found working exploits on 9.6% of the 5000 most frequented Web sites and their sub-domains.

When considering only exploits that work in Chromium, we found 8,065 working exploits on 617 different domains, including those outside the top 5000. Again filtering out domains not contained in the 5000 most visited sites, we found 6,093 working exploits on 432 of the top 5000 domains or their sub-domains.

Among the domains we exploited were several online banking sites, a popular social networking site as well as governmental domains and a large Internet service provider running a bug bounty program. Furthermore, we found vulnerabilities …
6.5 Selected Case Studies
During the analysis of our findings, we encountered several vulnerabilities which exposed interesting characteristics. In the following subsections, we provide additional insight into these cases.
6.5.1 JSONP + HTTP Parameter Pollution
As stated in Section 2, flows into URL sinks are not easily exploitable. Only if the attacker controls the complete string can he make use of data and javascript URLs to execute JavaScript code. However, in our dataset we found a particularly interesting coding pattern that allows script execution despite the fact that the attacker only controls parts of the injected URLs. In order to abuse this pattern, a Web page must assign a partly tainted string to a script.src attribute that includes a JSONP script with a callback parameter (see Listing 5).
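Listing 5 is not reproduced in this excerpt; the following is a hedged sketch of the kind of pattern described, where the endpoint, the parameter names and the use of location.hash are illustrative assumptions:

// A JSONP script URL is built from a hard-coded prefix, a tainted part and a
// callback parameter, and then assigned to script.src.
var script = document.createElement("script");
var category = location.hash.slice(1);            // attacker-controllable part
script.src = "https://api.example.org/data?category=" + category
  + "&callback=handleData";                       // JSONP callback parameter
document.head.appendChild(script);

// With a fragment such as  #news&callback=__reportingFunction__  the attacker
// pollutes the query string; if the endpoint honors the duplicate parameter,
// the JSONP response is wrapped in an attacker-chosen function name.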
… examples where code fragments seemed to extract automatically encoded values (and hence no sanitization is needed) but, due to non-standard parsing, also extracted unencoded parts in malicious cases.

1. Task: Extract host from URL
2. What it really does: Extract everything between www. and .com (e.g., the whole URL)
3. e.g. http://www.example.com/#notEncoded.com

var regex = new RegExp("/www\..*\.com/g");
var result = regex.exec(location.href);

1. Task: Extract GET parameter foo
2. What it really does: Extracts something that starts with foo=
3. e.g. http://www.example.com/#?foo=notEncoded