Analyzing The Ecosystem of Malicious URL Redirection Through Longitudinal Observation From Honeypots

computers & security 69 (2017) 155–173
Available online at www.sciencedirect.com
ScienceDirect
journal homepage: www.elsevier.com/locate/cose
Mitsuaki Akiyama a,*, Takeshi Yagi a, Takeshi Yada a, Tatsuya Mori b, Youki Kadobayashi c
a NTT Secure Platform Laboratories, Tokyo, Japan
b Waseda University, Tokyo, Japan
c Nara Institute of Science and Technology, Ikoma, Nara, Japan
ARTICLE INFO

Article history:
Available online 11 January 2017

Keywords:
Honeypot
Compromised website
URL redirection
Drive-by download
Domain generation algorithm

ABSTRACT

Today, websites are exposed to various threats that exploit their vulnerabilities. A compromised website will be used as a stepping-stone and will serve attackers' evil purposes. For instance, URL redirection mechanisms have been widely used as a means to perform web-based attacks covertly; i.e., an attacker injects a redirect code into a compromised website so that a victim who visits the site will be automatically navigated to a malware distribution site. Although many defense operations against malicious websites have been developed, we still encounter many active malicious websites today. As we will show in the paper, we infer that the reason is associated with the evolution of the ecosystem of malicious redirection. Given this background, we aim to understand the evolution of the ecosystem through long-term measurement. To this end, we developed a honeypot-based monitoring system, which specializes in monitoring the behavior of URL redirections. We deployed the monitoring system across four years and collected more than 100K malicious redirect URLs, which were extracted from 776 distinct websites. Our chief findings can be summarized as follows: (1) Click-fraud has become another motivation for attackers to employ URL redirection, (2) The use of web-based domain generation algorithms (DGAs) has become popular as a means to increase the entropy of redirect URLs to thwart URL blacklisting, and (3) Both domain-flux and IP-flux are concurrently used for deploying the intermediate sites of redirect chains to ensure robustness of redirection.
Based on the results, we also present practical countermeasures against malicious URL redirections. Security/network operators can leverage useful information obtained from the honeypot-based monitoring system. For instance, they can disrupt infrastructures of web-based attack by taking down domain names extracted from the monitoring system. They can also collect web advertising/tracking IDs, which can be used to identify the criminals behind attacks.

© 2017 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
* Corresponding author.
E-mail address: [email protected] (M. Akiyama).
http://dx.doi.org/10.1016/j.cose.2017.01.003
0167-4048/© 2017 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
1. Introduction

Attack campaigns targeting websites, e.g., Beladen, Gumblar, and Nine-ball, successfully affected tens of thousands of websites in 2009 (Websense Security Labs, 2009). In addition to this mass compromising, we faced serious server-side vulnerabilities such as Heartbleed, ShellShock, and Poodle in 2014 (Durumeric et al., 2014). The above threats exploit websites to expose many of them to the risk of data tampering. If such data tampering is applied to a website, it can generate web-based attacks for accomplishing an attacker's purposes. One such purpose of inflicting serious damage on victims is to compel a compromised website to serve as a stepping-stone of various web-based attacks, e.g., drive-by download exploits. Once a victim visits such a compromised website, he/she will be redirected to another website, which exploits the vulnerabilities of web browsers to automatically download and install malware. URL redirection is used as a fundamental mechanism to broadly collect web accesses across websites. In addition, injecting redirect codes into various compromised websites is one strategy to surreptitiously conduct attacks.

As many defense operations against malicious websites have been developed (Grier et al., 2012; Moshchuk et al., 2006; Provos et al., 2008), the web-based attack, which consists of the aforementioned purposes, mechanisms, and strategies, must also have substantially evolved. We infer that the reason we still encounter a large number of active malicious websites used for web-based attacks today may originate from the evolution of its core mechanism: malicious URL redirection.

On the basis of the discussion above, our goal in this study was to characterize the ecosystem of malicious URL redirection, which plays a vital role in the attack vector. While many new mechanisms have been incorporated into attacks, one interesting observation we found through the long-term study of the ecosystem was that URL redirection has always been used since the emergence of attacks. Such an invariant should play a key role in controlling the success of attacks.

With this background in mind, we pose the following research questions:

RQ1: What are the key characteristics of URL redirection mechanisms?
RQ2: Have their purposes been changed over time?

To answer the research questions, we developed a honeypot-based monitoring system, which specializes in monitoring the behavior of URL redirections. We deployed the monitoring system across four years, resulting in the collection of more than 100K malicious redirect URLs extracted from 776 websites compromised by fraudulent accesses with stolen credentials. We conducted an in-depth analysis of collected URL redirections.

Our key findings can be summarized as follows:

• Although the main purpose of URL redirection caused by redirect code injections has been drive-by download-based malware infection, click-fraud has become a new purpose in addition to malware infection in recent years.
• The use of domain generation algorithms (DGAs), which were originally used for communication between a bot and a command and control (C&C) server, has become popular as a means to increase the entropy of redirect URLs to thwart URL blacklisting.
• Both domain-flux and IP-flux are concurrently used for deploying the intermediate sites of redirect chains to ensure robustness of redirection.

We unveiled the above change through four years of measurement. The insight obtained from analyzing the observed ecosystem helps network and security operators disrupt attack campaigns at appropriate points and layers corresponding to the distinct purpose, mechanism, and strategy. For this purpose, it is essential to reveal the purposes of attacks, mechanisms of redirection, and strategies to conduct attacks to gain security knowledge in a timely manner as well as protocol-level measurements such as passive DNS and web proxy. To the best of our knowledge, there have been no studies on the analysis of malicious websites introduced from the longitudinal viewpoint, i.e., even though some studies used a huge volume of datasets, these datasets were obtained by just one-time inspection or repeated inspections during a short period. Our study is a pioneer example of honeypot-based measurement to observe the ecosystem.

Based on the obtained knowledge, we also investigated the operational difference between types of DGAs and the effectiveness of conventional methods, and discussed practical mitigation strategies against the ecosystem of malicious URL redirections: countering DGAs, discovering unknown compromised websites, and disabling web advertising and tracking IDs of attackers.

The rest of the paper is organized as follows. In Section 2, we detail our monitoring system and give a summary of the data we collected. In Section 3, we analyze the URL redirection mechanism in detail. In Section 4, we present an analyzed complex URL redirection structure, which uses DGAs. In Section 5, we discuss countermeasures to malicious URL redirection and domain-flux techniques. Then we evaluate the generality and impact of our observation in Section 6. We introduce related work in Section 7 and conclude this paper in Section 8.

2. Extracting URL redirection

2.1. Definition of URL redirection

Redirection refers to automatically replacing access destinations, and it is generally controlled over the HTTP protocol on the web. In addition to this conventional method, other methods for automatically accessing external web content, e.g., the iframe tag, have often been used, particularly for web-based attacks. In this paper, we define URL redirection to additionally include automatically occurring web accesses to URLs that follow from an initially accessed URL, and we assume that URL redirection methods are tag redirections (iframe, script, meta, etc.), script redirections (JavaScript location's methods), and HTTP redirection (HTTP-3xx status code).
2.2. Monitoring system

Discovering specific compromised websites in web space is required in advance of starting observation to understand the actual circumstances of URL redirect injection. Since web space is extensive (the number of URLs is increasing daily) and deep (dynamic pages are less easily indexed in search engines), it is not easy to discover malicious websites in the wild in a timely and efficient manner. Guided approaches for efficiently discovering unknown malicious websites in web space have been proposed (Invernizzi et al., 2012; Zhang et al., 2012). Additionally, a decoy system using honeypots to attract attackers has been proposed (Canali and Balzarotti, 2013). Our approach also involved honeypots to monitor compromised websites and analyze malicious redirect code injected into them. We previously presented a prototype of our measurement system (Akiyama et al., 2013). In addition, we extend the scope of the above studies to continually track and dissect websites that are redirect destinations over the long term.

Our monitoring system is composed of several types of honeypots and can efficiently collect information on malicious URL redirection. The key method is purposely leaking the bait credentials, called honeytokens (Spitzner, 2003), of our web content management system (Web CMS), which is also a honeypot, to attackers to incur masquerade attacks. Fig. 1 gives an overview of our monitoring system, and we describe the analytical steps as follows:

1. We set up decoy Web Content Management Systems (CMSs) on the server, called the Web CMS honeypot, and make it accessible from the open Internet by the FTP protocol. We deploy famous CMSs such as Wordpress and Joomla. The CMS contents, e.g., HTML, JavaScript, and PHP, are used as decoys with little modification and do not include external web content. The Web CMS honeypot has a domain name and requires a username and password to be able to access any content.
2. We set up a malware sandbox, running malware from our daily crawling of public websites and blacklisted websites, i.e., websites listed in http://malwaredomainlist.com, in order to purposely leak honeytokens. The sandbox is designed to run malware which gathers usernames and passwords from the host system and leaks them to a machine elsewhere on the Internet under the control of an attacker. The sandbox generates honeytokens, i.e., the IP address/domain name of the server and the usernames and passwords of the CMSs created in the previous step, and sets them into a configuration file or registry of several client applications, e.g., FTP clients. In each analysis, the sandbox randomly generates unique credentials.
3. The Web CMS honeypot actually behaves as an FTP server and waits for attackers to infect the CMSs and inject URL redirection attack code and perhaps other things besides. It creates a user directory for each account corresponding to a potentially leaked credential. After an attacker has successfully logged in to the server over FTP with honeytokens, the attacker is given full access to the server as an FTP user and can therefore insert HTML tags, JavaScript code, PHP code, and .htaccess files.
4. We run a honeyclient, which is a web browser-type honeypot, performing as a vulnerable web browser, and it accesses the content of each infected CMS on a daily basis to determine where the URL redirections lead. In our observation, all observed redirections toward external websites must result from redirect code injection either directly or indirectly because the original decoy content does not include external web content.

Our monitoring system restricts web access from outside benign web users to eliminate the risk of secondary attacks from our compromised websites, while it receives masquerade attacks and URL redirect injections. This means that our monitoring system permits only our honeyclient to access our compromised websites over the web, and publicly opens FTP to incur masquerade attacks with stolen credentials.

[Fig. 1 – Monitoring system overview. The diagram shows the attacker, the Web CMS honeypot (compromised websites), malicious websites, the malware sandbox, and the honeyclients with detectors inside our monitoring system, connected by honeytokens and four steps: 1. Setup Web CMS honeypot; 2. Setup malware sandbox and analyze malware; 3. Accept fraudulent login with honeytoken; 4. Inspect web content and detect.]

2.3. Extraction method

In general, benign websites include benign redirects derived from original web content made by a website owner. Furthermore, an owner sometimes changes original web content legitimately. This type of redirect becomes noise in analyzing benign websites. In contrast, since all redirects on our

compromised decoy websites are caused by malicious redirect injections, we can safely assume that all the collected redirects are derived from malicious ones.

3. Analysis of URL redirection

To address RQ1 and RQ2, we analyzed the methods and destination websites/URLs of injected URL redirection. First, we explain the basic statistics and give an overview of observed data. Next, the most significant security concern is how many web injections are related to malware infection, so we identify which redirect destinations are exploit sites. Then, we survey redirection methods and errors. Finally, we investigate domain and URL features such as URL popularity on individual websites.

3.1. Basic statistics

Table 1 summarizes the collected data. In our observation period, 776 websites were compromised in our monitoring system, and we confirmed that 96.5% of them were actually injected with redirect codes, over 57K times in total. Note that some redirect URLs directly use IP addresses for the <host> part of the URL instead of domain names, e.g., http://10.0.0.1/exploit.php. In consideration of this hostname variation, we simply express a <host> part as the name of a website in the following analysis and statistics. Emerging redirects indicate 11,235 websites, of which 95.4% use domain names and the rest directly use IP addresses.

Table 1 – Summary of data.
Type                    #
Period                  39 months (Mar. 2012–May 2015)
Compromised websites    776 sites
Masquerader's login     59,462 logins
Content modification    57,009 logins
Inspections             323,581 times
Redirects (unique)      11,235 websites, 109,991 URLs

In our experiment, there were 1.4M+ URL accesses derived from URL redirect injections on compromised websites across the four years of inspections. Table 2 shows the URL status of the accesses. Nearly half the accesses failed to establish a connection with websites due to DNS error or TCP connection error, while only about 30% of accesses received a successful (HTTP-2xx) response. In the following sections, we analyze this large amount of accesses and clarify what they were aimed for, how they worked, and the reasons for access failure.

Table 2 – URL status when they were accessed.
Protocol   Status                   # of accesses (%)
DNS/TCP    DNS resource not found   731,272 (49.7)
           TCP connection error     56,206 (3.8)
           Other error              51,732 (3.5)
HTTP       HTTP-2xx successful      453,143 (30.8)
           HTTP-3xx redirection     83,047 (5.6)
           HTTP-4xx client error    91,371 (6.2)
           HTTP-5xx server error    4,436 (0.3)
Total                               1,471,197

Fig. 2 shows how actively events occurred in our monitoring system. It includes the monthly numbers of masquerader's login events to compromised websites in our monitoring system, websites injected with URL redirects, and redirect destination websites/URLs. After that, the number of masquerade accesses reduced; however, redirect destinations continued to be observed. The main reasons a number of redirect destinations remained are web advertisements and automatically generated redirect destinations. We give detailed explanations of such redirect chains in Sections 3.4.4 and 4.

Fig. 2 – Injection activity.
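
The status breakdown of Table 2 amounts to assigning each URL access to the first layer at which it failed and then tallying. The following is a minimal sketch under stated assumptions: the record layout `(dns_ok, tcp_ok, http_status)` and the function names `classify_access` and `summarize` are hypothetical and are not taken from the monitoring system.

```python
from collections import Counter


def classify_access(dns_ok, tcp_ok, http_status):
    """Map one URL access to a Table 2 style status label."""
    if not dns_ok:
        return "DNS resource not found"
    if not tcp_ok:
        return "TCP connection error"
    if http_status is None:
        return "Other error"  # connected, but no usable HTTP response
    if 200 <= http_status < 600:
        return f"HTTP-{http_status // 100}xx"
    return "Other error"


def summarize(accesses):
    """Return {label: (count, percentage)} over an iterable of access records."""
    counts = Counter(classify_access(*a) for a in accesses)
    total = sum(counts.values())
    return {label: (n, round(100.0 * n / total, 1)) for label, n in counts.items()}
```

For example, `summarize([(False, False, None), (True, True, 200)])` assigns half of the accesses to "DNS resource not found" and half to "HTTP-2xx".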

3.2. Malware-infection related websites

3.2.1. Detectors and detection coverage

Due to the known limitation of detection coverages of static and dynamic analysis, we combine several detection methods that independently operate to broadly identify malicious websites related to malware infection. First, our honeyclient accesses compromised content and occasionally causes redirect destinations. Next, detectors analyze URLs or web content from the following viewpoints.

• URL-based detector statically parses URL strings by using YARA rules¹ embedded in Sandy (fb1h2s, 2014) and detects URLs based on the characteristic <path> part of a URL generated by known exploit-kits.
• Content-based detector statically scans web content, which is collected by our honeyclient, by using multiple types of anti-malware software² and detects URLs indicating scanned content when the content is flagged by at least more than one of the URLs.
• Exploit-based detector uses our honeyclient to dynamically analyze web content and detect malicious web content on the basis of anomalous browser/system behavior caused by browser exploit (i.e., drive-by download).

Our honeyclient has a vulnerable browser and its plugins installed and inspects an input URL together with the corresponding URLs that are automatically accessed using redirect methods. In the case of exploit-based detection, our honeyclient detects exploits on the basis of anomalous behavior in the system. To detect known vulnerabilities, the detection modules, called HoneyPatches (Akiyama et al., 2010), are applied to vulnerable functions on our honeyclient like a patch. They transparently check the data flow of vulnerable functions and detect exploits when input data triggers vulnerabilities.³ To detect unknown vulnerabilities, our honeyclient also performs integrity checking of the file system, registry, and process, which are also implemented in Capture-HPC (Honeynet Project, 2008).

In addition, it extracts all redirect destinations including intermediate websites that have emerged during inspection.

A general limitation of a honeyclient is that false negatives are possibly raised by environment-dependent exploits in exploit-based detection. For example, to detect an exploit code targeting CVE-2010-0188, the environment of the honeyclient must first install a specific version of Adobe Reader, which should be version 9.3.1 or earlier.

The URL-based detector works even if access fails because it only scans URL strings and detects characteristic URL-path patterns. Generally, exploit kits are widely used to build exploit sites, and these sites have characteristic URL paths since default templates for web content are used. Therefore, URL string signatures can be used to easily detect typical unchanged exploit kits.

URL access failure is out of the scope of detectors that use static or dynamic features of web content, i.e., content- and exploit-based detectors, while URL-based detectors are able to detect characteristic URLs. On the contrary, content-based and exploit-based detectors are able to detect URLs regardless of characteristic features in the URL-path part.

3.2.2. Detection and trend

The detection overlaps between detectors are listed in Table 3. The URL-based detector identified 12,841 URLs owned by 7296 websites. The content-based detector identified web content corresponding to 18,273 URLs owned by 3694 websites. The exploit-based detector actually detected drive-by download exploits in 2841 inspections that included 6584 URLs owned by 1080 websites. Websites suspected of exploits were 74.1% (8333/11,235), and websites that definitely attempted to exploit were 9.61% (1080/11,235).

Table 3 – Detection overlap.
Sets                              # of websites
U ∪ C ∪ E (All)                   8311
U                                 7296
C                                 3694
E                                 1080
U ∩ C                             2830
U ∩ E                             788
C ∩ E                             825
U − C − E (included only in U)    4350
C − U − E (included only in C)    711
E − U − C (included only in E)    139
U ∩ C ∩ E                         672
U: URL-based detection; C: content-based detection; E: exploit-based detection.

Although the URL-based detector identified a number of websites/URLs that appeared over an entire period, many such websites/URLs were access failures, and we did not obtain their actual web content. A number of websites identified by content-based and exploit-based detectors, i.e., websites related to drive-by downloads, appeared in 2012 and 2013 according to the sharp increase in the volume of attacker activity from the end of 2012 to Oct. 2013, as shown in Fig. 2. However, such drive-by downloads were not so active in 2014 and 2015 according to the drop in the volume of attacker activity.

3.2.3. Correlative redirections in drive-by download sequence

We analyzed how many URLs and websites are correlatively involved in drive-by download sequences (Fig. 3). We simply extracted all URLs and websites in each detected inspection. In inspections detecting drive-by downloads, all compromised websites have redirects leading to an exploit URL on an external website. The figure indicates that almost all sequences involved two or more external URLs and websites. This means that exploit codes are inevitably placed at external websites.

¹ The latest rules (Apr. 6, 2014) can be found at https://github.com/fb1h2s/sandy/tree/master/yara-ctypes/yara/rules/urlclassifier. To reduce obvious false positives, we excluded g01pack.yar, which aggressively detects benign URLs.
² Eleven distinct kinds of anti-malware software, that is, Avast, AVG, Avira, ClamAV, ESET, Forefront, Kaspersky, McAfee, Sophos, Symantec, and TrendMicro.
³ Our HoneyPatch was presented in 2010 (Akiyama et al., 2010), and a similar but sophisticated implementation was presented by Araujo et al. (2014) in 2014.
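
The rows of a detection-overlap table such as Table 3 are plain set operations over the websites flagged by each detector. The sketch below uses hypothetical toy sets rather than the paper's data, and the function name `overlap_table` is an assumption made for illustration.

```python
def overlap_table(U, C, E):
    """Compute Table 3 style overlap counts for three sets of flagged websites.

    U, C, E are sets of website names flagged by the URL-based,
    content-based, and exploit-based detectors, respectively.
    """
    return {
        "U | C | E (all)": len(U | C | E),
        "U": len(U),
        "C": len(C),
        "E": len(E),
        "U & C": len(U & C),
        "U & E": len(U & E),
        "C & E": len(C & E),
        "only U": len(U - C - E),
        "only C": len(C - U - E),
        "only E": len(E - U - C),
        "U & C & E": len(U & C & E),
    }


# Toy input (hypothetical website names, not measured data).
rows = overlap_table({"a", "b", "c"}, {"b", "c", "d"}, {"c", "e"})
```

By inclusion-exclusion, the union row always equals |U| + |C| + |E| minus the three pairwise intersections plus the triple intersection, which gives a quick consistency check on any such table.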
[Fig. 3 – Redirection count in drive-by download exploits. The plot shows the fraction of inspections (y-axis, 0 to 1) against the number of redirect websites or URLs (x-axis, log scale, 1 to 100), with one series for websites and one for URLs. Note that a landing URL that is on our compromised websites is excluded in this graph.]
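
The per-inspection counts behind Fig. 3 can be sketched as follows. This is an illustration under assumptions rather than the authors' procedure: `redirect_counts` is a hypothetical helper, a website is identified by the host part of the URL, and every URL on the landing host is treated as part of the compromised site and excluded.

```python
from urllib.parse import urlsplit


def redirect_counts(landing_url, observed_urls):
    """Count distinct external URLs and websites seen in one inspection."""
    landing_host = urlsplit(landing_url).hostname
    # Drop anything served from the compromised (landing) website itself.
    external = {u for u in observed_urls if urlsplit(u).hostname != landing_host}
    websites = {urlsplit(u).hostname for u in external}
    return len(external), len(websites)


# Hypothetical inspection: one intermediate site and one exploit-kit site.
n_urls, n_sites = redirect_counts(
    "http://decoy.example/index.php",
    [
        "http://decoy.example/index.php",
        "http://mid.example/in.cgi?2",
        "http://kit.example/fp.html",
        "http://kit.example/exploit.jar",
    ],
)
```

Here the inspection involves three external URLs on two external websites.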
Exploit sites are built using various exploit kits, and typical exploit kits have several URLs that have different roles, such as browser fingerprinting, exploiting browser vulnerability, and forcing a compromised browser to download a malware executable. First, a page for browser fingerprinting identifies a target host's environment and redirects it to the appropriate exploit pages. If an exploit is successful, the target host is forced to access a malware download page.

Several inspections include two or more external websites because of an intermediate website. In typical exploit sequences observed during our observation, an exploit site was accessed via an intermediate website. The role of this intermediate website is controlling web accesses and redirecting them to appropriate destinations. We explain the analysis of this intermediate website further in Section 4.2.

3.3. Redirect features

3.3.1. Methods and utilization

We classified the observed redirect methods into three categories: tag redirect, script redirect, and HTTP redirect. The most standard method is tag redirect using script, iframe, or meta tags. The iframe and meta tag insertions were only 1.0% and less than 0.1%, respectively. In contrast, the script tag insertion became a major injection method over time, accounting for 70.8% of all observed file modifications. In this case, almost all script tag insertions were script tags without a src attribute. This means that obfuscated JavaScript is inside script tags and dynamically outputs iframe tags or executes location redirects after de-obfuscating itself. Due to this obfuscation, malicious redirect code conceals specific redirect destinations (i.e., URLs).

In comparison, HTTP redirects are implemented in server-side content as header() insertion in .php files (9 modifications) and .htaccess files (174 modifications). When the above redirect methods controlled in server-side content are used, it is impossible, by client-side measurement, to recognize the reason a redirection occurs, while a decoy website can collect such server-side .php and .htaccess files. Table 4 lists the classified redirect methods and the breakdown of their usage in actual injected redirect codes. We statically analyzed injected redirect codes, but this static analysis to find specific strings could not enable us to precisely identify D, E, and F because they are obfuscated and dynamically executed. Therefore, we counted script tag insertions without src attributes as the integrated number of D, E, and F.

Table 4 – Redirect methods on compromised websites.
Category          Subcategory                        Code example                                         # of modification events
Tag redirect      A. iframe tag                      <iframe src=URL>                                     574
                  B. script tag                      <script src=URL>                                     436
                  C. meta tag                        <meta http-equiv="Refresh" content="0; URL=URL">     7
                  D. iframe tag by document.write    document.write("<iframe src=URL>")                   39,949 (D, E, and F combined)
                  E. script tag by document.write    document.write("<script src=URL>")
Script redirect   F. JavaScript location             location.href = URL
                                                     location.replace(URL)
HTTP redirect     G. PHP header()                    header("Location: URL")                              9
                  H. .htaccess RewriteRule           RewriteRule ^.*$ URL [R=302,L]                       136
                  I. .htaccess ErrorDocument         ErrorDocument 404 URL                                14

Fig. 4 – Time-series of access status.

3.3.2. Errors

Many redirections fail to access redirect destinations in some layers, e.g., DNS or HTTP. We counted the unique statuses of websites and indicated the time-series accessibility of redirect destinations in regard to websites, which are shown as
percentages of DNS lookup failure, connection errors, and HTTP errors in Fig. 4. The percentages of DNS and TCP connection errors gradually increased.

During our observation period, 4488 domains responded with non-existent domain (NXDomain) more than once, and 32.9% of domains responded with NXDomain to all DNS queries. In addition, 20.7% of URLs responded with HTTP error (not HTTP-200) more than once, and 16.0% responded with HTTP error to all HTTP requests. These include automatically generated domains, and they often failed to resolve their domain name.

3.4. Domain and URL features

3.4.1. Domain depth and suffix

Table 5a shows the depth of domain, which means the number of labels in the <host> part of a URL, e.g., the domain depth of www.example.co.jp is 4. Domains of depth 2 and 3 were dominant among all domains and accounted for 46.8% and 46.5%, respectively.

Table 5b lists the statistics for the (FQDN-1) label, which means the domain suffix without a hostname, e.g., the (FQDN-1) label of www.example.com indicates example.com. Although mynumber.org and .pro are not popular domain suffixes, they are placed higher in rank. The reason for this unusual distribution of domain suffixes is that almost all domains under these two domain suffixes are automatically generated domains (AGDs), which are possibly produced using a DGA, and a vast majority are sometimes not accessible because of DNS lookup failure. Detailed analysis of the DGAs discovered during our observation is described in Section 4.

Table 5 – Domain depth and suffix statistics.
(a) Domain depth
Domain depth    % domain
2               46.8
3               46.5
4               6.4
5               0.17
6               0.02
7               0.01

(b) Distribution of domain suffix (Top 6)
Domain suffix (FQDN-1)    Domain depth    % domain
.mynumber.org             3               17.95
.com                      2               16.29
.pro                      2               14.60
.ru                       2               3.44
.metric.gstatic.com       4               2.87
.net                      2               2.51

3.4.2. Emerging redirect destinations from distinct compromised website

Attackers frequently change redirect codes to switch redirect destinations to thwart blacklisting or change attack campaigns. Taking this feature into account, our monitoring system can efficiently obtain various redirect URLs through periodically monitoring compromised websites performing as a landing site of web-based attacks. In contrast to benign websites, decoy websites are suitable for long-term observation and can maximize the yield of redirect codes and redirect destinations. To indicate how many redirect destinations our monitoring system extracted from each compromised website, we counted observed URLs and websites derived from each compromised website over the entire measurement period (Fig. 5). The compromised websites had a median number of 91 domains and a median number of 127 URLs in terms of redirections in total.

3.4.3. URL popularity on distinct redirect website

From the viewpoint of redirect destination, Fig. 6 shows the URL popularity, which means how many URLs each redirect website has. A number of malicious websites, which are obviously exploit websites identified by the exploit-based detector mentioned in Section 3.2.1, are distributed between two and tens of URLs on the X-axis in the figure. A malicious website built using a certain exploit kit generally has several URLs (pages) such as a fingerprinting page for identifying a target's environment and an exploit page targeting specific vulnerabilities. Websites at the bottom right of this figure, which own a large volume of URLs, are used for web advertising and tracking, and we mention them in Section 3.4.4 in detail. The leftmost websites, which own only one URL, account for 69.3%, and most are the aforementioned AGDs. In the exploit-detected inspections mentioned in Section 3.2.2, these AGDs often emerged as intermediate websites; in other words, they fulfill the role of bridging an initial website, i.e., compromised website, and a final website, e.g., an exploit site. We further discuss the analysis of AGDs and unveil their nature in Section 4.

[Fig. 5 – Number of emerging redirect destinations on distinct compromised website over entire observation period. The plot shows the CDF of websites or URLs redirected from each compromised website (y-axis, 0 to 1) against the number of websites or URLs (x-axis, log scale, 1 to 100000), with one series for websites and one for URLs.]

[Fig. 6 – URL popularity on redirect website. X-axis (log scale, 1 to 100000) indicates cumulative number of URL on distinct redirect website through our observation; the y-axis is the cumulative number of redirect websites (log scale, 1 to 10000). The top 10 and top 100 websites cover 36.7 and 67.9% of all observed URLs. Websites owning under 10 URLs are 93.9%; in particular, websites owning one URL are 69.1%.]

3.4.4. Web advertising and tracking

Even if an attacker compromises websites and injects redirect code, a redirect destination is not necessarily related to malware infection but rather web advertising and tracking
in some cases. The aim of web advertising and tracking is to induce click-fraud monetization such as fraudulently utilizing pay-per-click advertising (PPC); therefore, this seems to be another aim of malicious URL redirection. Formerly, an attacker needed to recruit a large number of malware-infected hosts, e.g., DNSChanger, Koobface, and ZeroAccess, to accomplish click-fraud monetization (Blizard and Livic, 2013). In contrast, injecting redirect code into compromised websites is an alternative way to recruit general public web users for PPC. The more popular the website an attacker compromises, the more web users an attacker can simultaneously recruit without much effort. Therefore, injecting redirect code is more cost effective and scalable than recruiting click bots.

Table 6 – Number of URLs on individual websites (Top 10).
Website                                         # of URLs    Prevalence (%)
user99[dot]freewebhostingarea[dot]com           10,127       0.6
googleads[dot]g[dot]doubleclick[dot]net         5,250        23.7
ad[dot]yieldmanager[dot]com                     4,091        1.0
ads[dot]yahoo[dot]com                           3,987        0.8
ib[dot]adnxs[dot]com                            3,787        1.2
www[dot]google-analytics[dot]com                3,628        59.9
g[dot]adnxs[dot]com                             3,115        1.0
pagead2[dot]googlesyndication[dot]com           2,454        28.4
s[dot]ad125m[dot]com                            2,052        0.5
content[dot]yieldmanager[dot]edgesuite[dot]net  1,949        0.8

Websites having a large number of URLs in our measurement (the bottom right of Fig. 6) are often used for web advertising or tracking. These URLs usually contain one-time tokens or encoded individual data, e.g., timestamps, client information, and origin URLs (referrer URLs), in the <path> parts so that unique URLs are newly generated for every access. Table 6 lists the top websites owning a large number of URLs. These 10 websites alone had 40,440 URLs, which accounts for 36.7% of all redirect URLs. We manually surveyed the usages of these websites and confirmed that almost all the websites are used for web advertising and tracking. To estimate how widely and commonly a certain URL is used, we calculate
prevalence as shown in the table. This value is calculated by $S_{\mathrm{emerge}} / S_{\mathrm{exist}}$, where $S_{\mathrm{exist}}$ is the number of all websites, i.e., decoy compromised websites, and $S_{\mathrm{emerge}}$ is the number of websites whose redirect destination is set. While the domains of doubleclick, google-analytics, and googlesyndication provide popular web advertising and tracking services, they are also widely used by redirect codes injected into compromised websites. We mention the use of google-analytics for web tracking in Section 4.4.
The notable point about web advertising and tracking is that their numbers increased in 2014 and 2015 while the number of drive-by downloads gradually decreased.

3.5. Summary
To summarize, almost all injected redirect codes were developed in obfuscated JavaScript to conceal redirect destinations. Gradually increasing DNS lookup failures resulted from the large number of domains that are obviously AGDs. The drive-by download sequences often involve an intermediate website to control redirections. To address RQ1, we describe the in-depth analysis of AGDs and controlling redirections in Section 4.
Domains related to drive-by downloads actively emerged on URL redirections in 2012 to 2013. While they were not so active in 2014 and later, the number of domains/URLs related to web advertising or tracking was increasing. This is the answer to RQ2, which indicates the change in the purpose of redirection based on redirect code injections, that is, click-fraud monetization is a new purpose in addition to malware infection.

4. In-depth analysis of evasive redirection techniques
To address RQ1 regarding the key characteristics of the URL redirection mechanisms, we unveil and dissect evasive redirection techniques deployed by attackers in our observation.
The malicious techniques described below, which are dedicated to the URL redirection ecosystem, are deployed in collusion with each other. Our analysis revealed that attackers use the following techniques: domain-flux by web-based DGAs, IP-flux by fast-flux service networks (FFSNs), redirection controlling by traffic distribution systems (TDSs), and target profiling by tracking services. We give a clear overview of the URL redirection ecosystem revealed through our measurement in Fig. 7 and explain its characteristics as follows.

4.1. Web-based domain generation algorithm
The use of DGAs has become popular since 2008 due to the emergence of Conficker (Leder and Werner, 2009). Originally, DGAs were used for C&C in the post-infectious phase, for instance, malware-infected hosts communicate with specific generated domains, i.e., C&C server, for only a very short period of time. Although our analysis was focused on the pre-infectious phase, to our surprise, we discovered many suspicious AGDs in the observed redirect URLs. We inspect how DGA mechanisms have been used in the ecosystem of URL redirection. Our study revealed that DGAs are also used as a key component of the redirection mechanism. The use of the DGAs we observed for URL redirection is broadly classified into two categories: client-side domain generation (CDG) and server-side domain generation (SDG). We give details on CDG and SDG below.

4.1.1. Client-side domain generation
Web-based CDG is implemented as JavaScript. When a web browser accesses a page containing a JavaScript DGA, it runs on the browser and pseudo-randomly generates a domain name and its URL. After that, the JavaScript DGA outputs redirect code, e.g., an iframe tag whose src attribute is set to the generated URL.
The JavaScript DGA is highly obfuscated to obstruct analysis. We used a browser emulator to fully execute obfuscated JavaScript and extract input/output values of eval() and document.write() as candidates of de-obfuscated human-readable code. We then manually analyzed this extracted
Fig. 7 – Observed malicious redirect ecosystem. (Components: compromised websites; web-based DGA domains, client-side and server-side; traffic distribution system (TDS); fast-flux service network (FFSN) IP addresses; tracking sites; popular sites; exploit sites. Arrows: redirection, including redirection by cloaking.)

JavaScript and identified the following parameters used for building redirect URLs including the pseudo-randomly generated domain names. We describe the parameters used in the JavaScript DGA as follows and show the procedure to generate a redirect URL in Fig. 8.

• length: This length determines the length of a randomly generated hostname.
• letters: These letters are candidates to be pseudo-randomly selected to construct a hostname.
• timestamp: The current timestamp is used as a seed of the random function. It is not directly used, but month, date, and hour values extracted from the timestamp are used as input of the random function.
• period: A quotient of the hour and period is used for one of the inputs of the random function. This means that the output of the DGA changes every period (hours). For instance, if the period is set to 6, at most four (24 hours / 6 = 4) AGDs would be generated in a day.
• zone: A generated hostname is added as a prefix of strings of zone, which we call an “upper-level domain” throughout this work. For instance, given an AGD, 123abc.example.dga, its upper-level domain is example.dga. Note that, in some cases, an upper-level domain of an AGD could be a top-level domain (TLD), such as .ru or .pro. Concatenated strings stand for a specific domain.
• schema and URL-path: URL-path is added to the above generated domain as its suffix. These strings mean the <host> and <path> parts of a URL. Finally, after schema, i.e., http:// or https://, is added to these strings as its prefix, a specific URL is built.

The discovered instances were 16-character hostnames with .info, .ru, .pro, and .mynumber.org.

Fig. 8 – DGA: parameters and calculation. (Inputs: letters, length, timestamp (month, date, hour / period), and period feed a shuffle step that yields the hostname; the generated URL has the form schema://<hostname>.zone/URL-path.)

4.1.2. Server-side domain generation
In addition to web-based CDG, evidence of another type of web-based DGA, that is SDG, was found through our measurement. In the observed redirections, characteristic URL sets with random character hostnames with a fixed length and similar URL path (i.e., count[1-9][0-9]?\.php), which differs from the sets mentioned in Section 4.1.1, often emerged. However, only hard-coded URLs were set in the injected redirect code instead of a JavaScript DGA. We assume that an attacker executes a DGA beforehand to generate a domain name at the server-side and injects redirect code with the generated domain name into his or her own compromised website. The discovered instances were 7-character hostnames with .ru and .com, and 8-character ones with .ru and .us.

4.1.3. DGA-type comparison and feasibility of conventional detection methods
Countermeasures against DGAs, such as detecting and extracting AGDs and sinkholing, have already been implemented and contributed to disrupting C&C (Conficker Working Group, 2011; Shadowserver, 2014). We discuss to what extent the conventional countermeasures are effective for newly dissected DGAs. First, we organize the main difference between conventional DGAs and web-based DGAs and the difference between CDG and SDG. Fig. 9 gives an overview of C&C-based and web-based DGAs. A C&C-based DGA, which is conventionally used, is built in malware binary and used for opportunistically communicating with C&C servers. A web-based DGA is used for redirecting web users to malicious websites such as exploit sites.
The most significant difference is the deployment point, which means where the DGA actually runs. Table 7 shows a comparison between the types of DGAs in regards to algorithm modification, code availability, and operation cost. In a C&C-based DGA, an attacker implements a specific algorithm inside malware binary. After distributing malware, an attacker behind a C&C server waits to receive a callback from the malware-infected host. Therefore, an attacker cannot change the algorithms or parameters of the DGA before establishing C&C with the malware-infected host. In contrast, a web-based DGA is more flexible than a C&C-based DGA because an attacker can change the algorithm or its parameters on a compromised website whenever he or she wants.

Table 7 – Comparison between C&C and web-based DGAs.
        DP             AM          CA            OC
C&C     Client (CDG)   Difficult   Available     Low
Web     Client (CDG)   Easy        Available     Low
        Server (SDG)   Easy        Unavailable   High
DP: deployment point; AM: algorithm modification; CA: code availability; OC: operation cost.

It is generally known that DGAs use current time information, e.g., hour, day, and month, for the seed of the random function. Since an attacker can confirm when and what potential domain names will be generated, he or she registers potentially generated domains beforehand so that the domains work during specific times. A specific DGA is inside the client-side program, that is, the malware binary, so that security researchers can analyze it by using a malware sandbox or reverse engineering and predict potential domain names that will be generated in the future (Conficker Working Group, 2011; Yadav et al., 2010). Therefore, a C&C-based DGA is potentially vulnerable to enumerating AGDs. Because web-based CDG has the same weakness, we can enumerate AGDs serving web-based attacks when we obtain that DGA code.
Although it should be very costly for this SDG operation to frequently update redirect code, potential domain names of SDG are unpredictable, so security measures, such as domain takedown, are more difficult than those of CDG.
Fig. 9 – Overview of C&C-based and web-based DGAs. C&C-based DGA: (1) the attacker infects a host with malware containing DGA code; (2) the malware generates a domain name; (3) the infected host accesses the C&C server (AGD). Web-based client-side domain generation (CDG): (1) the attacker injects a JavaScript DGA into a compromised website; (2) the victim's browser accesses the website; (3) the website replies with the JavaScript DGA; (4) the browser generates a domain name and redirect code; (5) the browser accesses (is redirected to) the malicious website (AGD). Web-based server-side domain generation (SDG): (1) the attacker generates a domain name; (2) the attacker injects redirect code pointing to the AGD; (3) the victim's browser accesses the compromised website; (4) the website replies with the redirect code; (5) the browser accesses (is redirected to) the malicious website (AGD).
In a C&C-based DGA, a large volume of DNS queries and failures generally occurs from each infected host as time progresses. Many detection methods have been proposed that are based on these characteristics (Antonakakis et al., 2012; Yadav and Reddy, 2011). However, these conventional methods may fail to detect web-based DGAs. In contrast to a malware-infected host accessing C&C-based DGAs, a web user unexpectedly accesses web-based DGAs when he or she accesses compromised websites with injected redirect code. In web-based DGAs, accessing AGDs occurs only when a web user happens to access a landing site injected with redirect code. Therefore, the conventional viewpoints toward detection, such as using a large volume of NXDomain derived from speculative DNS queries and failover behavior, do not effectively work. While the volumes of DNS query and failure are not so large, the characteristic of domain name randomness still remains. Therefore, while conventional DGA detection methods based on linguistic features were implicitly targeting C&C-based DGAs, they may also be effective for web-based DGAs.

4.1.4. Lifespan
Generally, the longer malicious websites are used, the more effective blacklisting is against them. To circumvent this blacklisting, an attacker accordingly changes redirect code on compromised websites to switch redirect destinations. To understand the effectiveness of typical blacklisting, we estimated the lifespan of redirect destinations. The durations of CDG, SDG, and other non-DGA websites are separately shown in Fig. 10. We calculated domain A’s lifespan $D(A) = T_l(A) - T_f(A)$, where $T_f(A)$ is the timestamp of the first
Fig. 10 – Lifespans of redirect websites. (X-axis: lifespan (days); Y-axis: CDF of emerging website; series: CDG, SDG, Other.)

emergence of A and $T_l(A)$ is the timestamp of the last emergence.
In the case of CDG, the two large gaps at 365 days (a year) and 730 days (two years) on the X-axis were caused by domains that emerge annually. The reason that annually emerging domains are also AGDs is explained in Section 4.1.1. The discovered instances of JavaScript DGA, i.e., specific JavaScript code, use only month, date, and hour information extracted from a current timestamp and do not use year information. Therefore, the same domain name is generated yearly on the same hour, date, and month, and we actually observed this phenomenon. The lifespans of SDG instances were distributed from a few to tens of days, substantially different from those of CDG instances. Among the other websites, 82.6% vanished from the redirections within 10 days, while 7.5% were used over 100 days.

4.1.5. DGA instances
We heuristically classified AGDs on the basis of hostname length, upper-level domain, and URL-path similarity into several groups as DGA instances, each of which is a set of AGDs derived from a specific DGA. Table 8 shows summarized data corresponding to each instance. Instances A1–A4 are of SDG, and all have a similar URL path. The remarkable point in these instances is that a massively large number of IP addresses were resolved from their domain names. Instances B1–B3 and C are of CDG, and all also have similar URL paths. Instances B3 and C have a massively large number of domain names, 1564 and 1919, respectively. Despite this, almost all domains of B3 and C are unresolvable to IP addresses. Instance C has domains under a private suffix domain, so an attacker can freely create valid domains. In comparison, a domain registration to a registrar is required to validly use B3’s domain since B3 is directly deployed under the TLD. Instances B1–B3 had been used for only two or three weeks, then a resolvable domain never appeared. Instance C newly emerged after B1–B3 operations. We manually checked and found that B1–B3 and C were deployed by the same DGA with the same parameters except for the upper-level domain. On the basis of this evidence, we assumed that the same attacker uses B1–B3 and C. B1 and B2 were used in short periods. However, redirections using the JavaScript DGA of B3 and C are actually discarded by an attacker and remain on compromised websites even if generated domains are unresolvable.
The domains and URLs of DGA instances account for only 33.8 and 3.4% of all redirect destinations, respectively; however, the domains of DGA instances account for 74.5% out of 2841 inspections detected as successful drive-by exploits. This means that web-based DGAs play an important role in drive-by exploits.

4.1.6. Discarded redirect code
In most cases, an attacker carefully maintains redirect code on a compromised website to change redirect destinations and obfuscation algorithms. However, we also discovered injected redirect codes and corresponding redirect destinations discarded by an attacker, i.e., never maintained or changed by the attacker after a certain time. If a web user accesses a website containing discarded redirect code, the redirection inevitably fails; therefore, it is actually harmless. In our observed objects, redirect codes toward 5 domains of A1 and 25 domains of A2 were discarded on several compromised websites, and these ghost redirections continued through each inspection. In addition, the JavaScript DGAs of B1–B3 and C were also discarded and continued generating redirections toward unresolvable websites. Therefore, redirection failures continuously occur because of discarded redirect code that an attacker has already finished using.

4.2. Traffic distribution systems
Our observed AGDs have two patterns of characteristic strings in the URL path: count[1-9][0-9]?\.php and in.cgi\?[1-9][0-9]?. One (in.cgi) is called Sutra-TDS in a previous report (Symantec Security Response Blog, 2011), which is a toolkit for building a traffic distribution system (TDS).
A TDS is an intermediate website placed between an initial website (i.e., compromised website) and final destination website. The primary aim is to control the final redirect destinations to obscure them. In the observed data, we confirmed that final websites are frequently changed by a TDS as time progresses.
All our discovered DGA instances performed as TDSs and mediated between an initial website and an exploit site. Almost all redirect methods on the AGDs based on A1–A4 were HTTP-200 with JavaScript location.href or HTTP-302.

4.3. Fast-flux service network under AGDs
Domains generated by SDG (A1–A4) respond with different IP addresses in each DNS resolution. Consequently, they had an extremely large number of IP addresses through our
Table 8 – Discovered DGA instances.
Instance   First seen   Last resolvable   Last seen   UD              HL   URL-path      DGA type   # of domains (IP resolvable)   # of URLs   # of IPs   Prevalence (%)
A1         ’12-04-10    ’12-08-01         Ongoing     .ru             7    count**.php   Server     36 (36)                        36          1,275      3.9
A2         ’12-12-08    ’13-08-05         Ongoing     .ru             8    count**.php   Server     131 (128)                      131         35,789     69.8
A3         ’13-05-20    ’13-06-21         ’13-06-21   .us             8    count**.php   Server     39 (39)                        49          1,510      54.2
A4         ’13-05-31    ’13-06-17         ’13-06-20   .com            7    count**.php   Server     16 (16)                        16          2,111      54.8
B1         ’12-07-11    ’12-09-28         ’12-10-03   .info           16   in.cgi?**     Client     43 (23)                        43          8          1.9
B2         ’12-08-02    ’12-08-27         ’12-09-12   .ru             16   in.cgi?**     Client     53 (32)                        53          7          1.9
B3         ’12-10-03    ’12-10-20         Ongoing     .pro            16   in.cgi?**     Client     1,564 (16)                     1,564       2          1.9
C          ’12-11-08    ’14-12-05         Ongoing     .mynumber.org   16   in.cgi?**     Client     1,919 (11)                     1,919       5          1.4
UD: upper-level domain; HL: hostname length.
measurement, as mentioned in Section 4.1.5 and shown in Table 8. We observed over 1K IP addresses for each instance and 39,131 in total. The following evidence indicates that A1–A4 are deployed on a fast-flux service network (FFSN). As seen from the GeoIP information, these IP addresses are widely distributed across the globe; for example, A1–A4 include 425 autonomous system numbers (ASNs) in 95 countries, 1849 ASNs in 117 countries, 374 ASNs in 48 countries, and 414 ASNs in 56 countries, respectively. We calculated the ASNs of all IP addresses on FFSNs and manually checked the several top ASNs and their usages. Most IP addresses are located on residential networks so that end user PCs are the most dominant. While the IP addresses are globally distributed on hundreds of ASes, Ukraine’s telecom/mobile networks (AS15895 and AS25229) account for over 22%. The report published by RiskAnalytics (2016) mentioned that their originally observed IP addresses of an FFSN are also located at the same ASes we observed, so we recognized that the same bot-infected PCs are commonly used for FFSNs. The emerging periods of these instances are temporally diversified; however, 1557 IP addresses out of all resolved IP addresses are shared by multiple instances.
Hosts of resolved IP addresses successfully respond with HTTP replies in most cases; in other words, they have valid IP addresses and run as a website. The percentages of IP addresses to which HTTP successfully responded in A1–A4 are 79.5, 91.0, 92.6, and 80.6%, respectively. Therefore, in most cases, the authoritative DNS servers of FFSN domains in A1–A4 faithfully answer valid IP addresses controlled by attackers.
All reply headers of HTTP include the following four commonly used specific header fields: Server:Apache, Content-Type: without a specific content type, Server:nginx/1.2.6, and X-Powered-By:PHP/5.4.11. These reply headers are slightly unusual because Server: fields are defined more than once, and the Content-Type: field is always null. In consideration of this evidence, FFSN agents seem to be commonly installed on a specific server or use blind proxy redirection (BPR). It is said that BPR is typically used for an FFSN (Honeynet Project, 2007), which works as a reverse proxy to transparently send received HTTP requests to a backend server directly controlled by an attacker. In this way, an attacker directly knows accessed hosts’ information and collectively switches HTTP replies at a single point without exposure.

4.4. Double-crop redirecting
One notable point is that Google Analytics emerged on redirect chains derived from about 60% of compromised websites. Google Analytics is one of the most well-known web access analysis services. The emerging Google Analytics are from advertising pages and drive-by related pages. The former is normal usage that aims to collect web users’ profiles to prepare to deploy more effective personalized ads. The latter’s aim seems to be profiling targeted web users and further strategizing effective exploits or monetization. In our content analysis, the web content of 19 domains in A1 contained Google Analytics’ JavaScript and redirected web users to www.google-analytics.com. These 19 domains were set as redirect destinations on 43.4% of compromised websites. In addition, the same tracking ID, obviously an attacker’s, was commonly used in all these domains. This tracking ID is embedded as plain text in web content; therefore, we can easily identify it.

4.5. Thwarting security inspection
One of the reasons for HTTP access failure seems to be cloaking. Although cloaking was originally used for SEO poisoning, it is currently used for web-based malware infection. To thwart security inspection, malicious websites respond with harmless content or redirect to a legitimate website if they recognize an access as a security inspection. The most popular cloaking technique is IP cloaking, which identifies client IP addresses, e.g., detected repeated accesses or blacklisted IP addresses, and flexibly responds (Rajab et al., 2011). It has been reported that almost all exploit kits have IP cloaking functionalities (Eshete and Venkatakrishnan, 2014). In addition, a security vendor’s report mentioned that TDSs also conduct cloaking (TrendMicro, 2011). To bypass IP cloaking, we make an effort to obfuscate the IP addresses of a honeyclient by using web proxies distributed on various ASes. User-agent information and referrers are also usually used for cloaking. Our honeyclient is a high interaction system, so the user-agent information and referrer are naturally set by an actual web browser. Therefore, we did not make any special effort to obfuscate this information. Note that our monitoring system did not access websites through a different browser, so it possibly fails to access websites which change redirect destinations based on user-agent information.
In our inspection, our honeyclient encountered several types of cloaking. We summarize probable cloaking situations in HTTP as follows.

• HTTP error: HTTP-404 or HTTP-500 responses often occur in inspections of exploit sites. These types of response are typical cloaking behavior of exploit kits.
• HTTP-200 without meaningful content: Malicious websites respond with HTTP-200 without meaningful content, for instance, 0-byte content or OK. When a web browser receives such a reply, no redirection or exploit occurs. More than half of the HTTP responses from A1–A4 were HTTP-200 with 0-byte content.
• HTTP-302 redirect to benign websites: Malicious websites respond with HTTP-302 with specific redirect URLs, which are Google sites, Microsoft sites, or localhost. In addition, on some exploit sites, both exploits and HTTP-302 redirects whose redirect destinations are popular websites simultaneously occurred. This seems to intentionally confuse security inspection. In many cases of redirecting to benign websites, the top pages of popular websites, such as http://www.google.com/, http://www.google.se/, and http://www.bing.com/, are used for HTTP-302 redirect destinations.

Some redirections of HTTP-302 were not sophisticated because their redirect destinations were mistakenly set to http://google.com/ or http://bing.com/, which are not actually accessible. When a web user accesses such a non-existent URL, it redirects the user to the effective top-page URL such as http://www.google.com/ or http://www.bing.com/.
4.6. Summary
To summarize, our in-depth analysis unveiled evasive techniques to ensure significantly tolerant operation of URL redirections: domain-flux, IP-flux, redirection controlling, and target profiling. Domain-flux by web-based DGAs generated over three thousand AGDs over the entire period. We found two uses of DGAs, CDG and SDG, and discussed the operational difference between the types of DGAs and the effectiveness of conventional methods. These observed AGDs performed as TDSs, which are intermediate websites placed between an initial website and a final destination website to control final redirect destinations and obscure them. IP-flux by an FFSN and the above domain-flux (especially SDG) are concurrently used for deploying TDSs. In addition, to profile victim web users and to further strategize effective exploits or monetization, intermediate websites (i.e., TDSs) use a web tracking service while redirecting to exploit sites. The characteristics of these techniques to ensure tolerant operation of URL redirections are the answers to RQ1.

5. Mitigation
We introduce simple and practical countermeasures against URL redirect injection based on our knowledge obtained from our large-scale and long-term observation.

5.1. Countering DGAs
For countering CDG, AGDs generated in the future are predictable once we acquire the DGA code. Therefore, information obtained from our honeypot-based monitoring system can be used for existing blacklisting and domain takeover operation. In contrast, the code of SDG is not available because it is executed beforehand in an attacker-side environment.
Many methods for detecting AGDs require training data that contain a set of actual malicious domain names. The conventional method to acquire training data is analyzing a malware binary to extract AGDs generated by a C&C-based DGA. Similarly, our measurement system can acquire a substantial volume of AGDs by analyzing the injected code of web-based DGAs in web content.

5.2. Discovering unknown compromised websites via sinkholed HTTP requests
A domain takeover can observe the accesses from malware-infected hosts to a sinkholed domain of a C&C server and harvest stolen sensitive information (Stone-Gross et al., 2009). An HTTP request generally carries its origin URL in the referrer field; therefore, we infer that a domain takeover against malicious redirect destination websites can collect victims’ HTTP requests and discover unknown origin websites that are compromised websites.
The HTTP request header of a redirection carries referrer information, which is the origin of the redirection, i.e., the URL of a compromised website injected with redirect code. We confirmed that all HTTP requests toward websites of DGA instances carry correct referrer fields. If we know the URL of a specific malicious website in advance, it is possible to discover unknown compromised websites by checking the referrer information contained in an HTTP request toward the malicious website. Conceivable methods of monitoring such requests are using web proxy logs or website takeovers by domain sinkholing. We can enumerate potential domains used in the future by reversing client-side DGAs. Given that almost all domains are not registered and are unused, we can legitimately register potential domains to own a part of the AGDs instead of an attacker. If we successfully complete domain registration, we can receive HTTP requests from potential victim web users on our website for observation when our DNS server responds with the A record set as an IP address of our website.
The phenomenon that discarded redirect code permanently remains on compromised websites is mentioned in Section 4.1.6. Although discarded redirect code is not currently harmful, there is a potential risk that valid redirect code will be injected some day. Therefore, compromised websites should be fixed immediately, even if redirect code seems to be discarded.

5.3. Disabling attackers’ advertising and tracking IDs
Redirections for web ads and tracking are not inherently malicious, but all redirections injected into compromised websites obviously reflect an attacker’s malicious intention. Injected redirect code for web advertising and tracking sometimes contains IDs, e.g., Google Analytics’ tracking ID represented as UA-000000-01. We found some IDs in injected codes across different compromised websites, and these IDs obviously belong to certain attackers. One action we can take is reporting such attackers’ IDs and the evidence of compromising, i.e., fraudulent access log and injected code, as violating terms of use to advertising or tracking service providers to ban the IDs.

5.4. Suspicious redirect path detection based on cloaking behavior
We assume that accessing the top pages of prominent websites by HTTP-302 redirect is not a common occurrence, and this notion supports focusing attention on suspicious redirects caused by cloaking behavior. We surveyed benign redirections from popular websites with our honeyclient. It inspected Alexa 607K websites’ top pages with corresponding URLs (22M URL accesses, 15M unique URLs). In these inspections, HTTP-3xx redirects occurred 693K times. Regarding www.google.com, although there are many redirects pointing to variable URL-paths under the domain, only 11 redirects (origin URLs) point to the top page. The reason for these redirections in 6 out of the 11 origin URLs is closed or moved websites. We conclude that a redirect pointing to a top page of a prominent website is extremely rare. We recommend that HTTP-3xx redirects to the top pages of prominent websites should be carefully investigated.
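The recommendation in Section 5.4 amounts to a check on the status code and on the host and path of the redirect target. The Python sketch below is one possible way to express it, under stated assumptions: `PROMINENT_HOSTS` is a hypothetical two-entry stand-in for a real popularity list, and a "top page" is taken to mean an empty or "/" path with no query string.

```python
from urllib.parse import urlparse

# Hypothetical stand-in for a list of prominent websites.
PROMINENT_HOSTS = {"www.google.com", "www.bing.com"}


def is_top_page_redirect(status, location, prominent_hosts=PROMINENT_HOSTS):
    """Return True for an HTTP-3xx redirect to the top page of a prominent site."""
    if not 300 <= status < 400 or not location:
        return False
    target = urlparse(location)
    is_top_page = target.path in ("", "/") and not target.query
    return target.hostname in prominent_hosts and is_top_page


if __name__ == "__main__":
    print(is_top_page_redirect(302, "http://www.google.com/"))
    print(is_top_page_redirect(302, "http://www.google.com/search?q=x"))
```

Redirects to deeper URL-paths under the same domain are deliberately not flagged, since the survey above found them to be common among benign websites, whereas redirects to the top page were rare.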
5.5. Producing dataset for security research and operation
We statistically analyzed observed data to give an overview of the ecosystem of malicious URL redirection and changes throughout a prolonged period. Countering in real time is out of the scope of this study; however, it can contribute to generating a dataset for supporting the development/evaluation of detection tools and security operations. The MALICIA project (MALICIA Project, 2013; Nappa et al., 2015) and malware-traffic-analysis.net (http://www.malware-traffic-analysis.net/) are known to provide datasets of web-based attacks. The MALICIA dataset has been provided to many research organizations. malware-traffic-analysis.net also presents training exercises to analyze malicious content for security researchers/engineers.
Related work, e.g., crawling benign websites (Lee and Kim, 2012; Li et al., 2012) or monitoring merged HTTP traffic of end users (Stringhini et al., 2013), had the difficulty of finding malicious entities from a large amount of data that mainly contain benign entities. In contrast, our observed data originated from a malicious activity (i.e., redirect injection) and contain less noise (i.e., benign redirection) than those of the above studies.

Table 9 – Overlap with compromised legitimate websites.
Instance   # emerged in decoy site   # emerged in legitimate sites   # of overlaps   SimS   SimJ
A1         36                        34                              28              0.88   0.83
A2         131                       85                              67              0.78   0.51
A3         39                        4                               4               1.00   0.10
A4         16                        11                              7               0.63   0.43
B1         43                        41                              6               0.14   0.13
B2         53                        47                              6               0.12   0.11
B3         1564                      24                              13              0.54   0.00
C          1919                      90                              49              0.53   0.04

$$\mathrm{Sim}_S(X, Y) = \frac{|X \cap Y|}{\min(|X|, |Y|)}, \qquad \mathrm{Sim}_J(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}$$

These indices are usually used to measure the similarity of sets.
As the AGDs of all DGA instances emerged on redirect-injected compromised websites, there was only a small gap between our observed URL redirect injections and those of the real world. In other words, our observation was not focused on a narrow or peculiar space. In fact, there were various AGDs discovered with only our system. We assumed
6. Discussion that decoy-based observation effectively works to approxi-
mately understand overall circumstances. Although it is easy
6.1. Generality of observed data to discover compromised legitimate websites by using decoy
websites, we can observe more AGDs than those of compro-
We believe that collected malware executables for malware mised legitimate websites. The observation stability of our decoy
analysis and induced redirect injection have generality websites contributes to such broad coverage.
for the following reason. When we leak honeytokens to
various attackers, we repeatedly collected initial malware 6.2. Impact on real world
executables from various blacklisted URLs, which are in-
cluded in the public blacklist (malwaredomainlist.com) widely We surveyed how many times discovered redirect websites
used by security engineers, and a huge number of general public were actually accessed by the general public users to examine
websites. the impact on the real world. To conduct this survey, we used
In addition, to examine how our observation results reflect the DNSDB (DNSDB) gathering passive DNS logs collected
the real world, we surveyed how many redirect websites ac- from globally distributed distinct organizations. The passive
tually emerged on compromised websites in the wild. We used DNS logs included a DNS query toward domains from users
another source by crawling data related to the compromised under specific cache DNS servers. Therefore, we could deter-
websites that were originally benign. The data were obtained mine how many users actually accessed a specific domain
from our honeyclient that inspected the top pages of about under a specific cache DNS server by counting the various
160,000 websites in two- or three-day intervals between Aug. replies with the A records. Impact estimation with passive
2011 and Mar. 2015. It discovered 126 compromised websites DNS logs has been extensively investigated (Grier et al., 2012;
performing as landing sites of drive-by downloads and con- Schiavoni et al., 2014). Similarly, we conducted the aforemen-
taining 2512 websites (16,038 URLs) as redirect destinations. tioned survey in July 2015. A user of DNSDB can retrieve a
To concentrate on important malicious websites, we focused summarized resource record, resource data, and the access
on DGA instances. We extracted redirect URLs that were number with timestamps, which are the first seen and last
matched with the patterns of DGA instances on the basis of seen ones. We counted the number of accesses in the A
a specific combination of hostname length, upper-level domain, record except for unresolvable DNS queries. We retrieved
and URL-path similarity previously indicated in Table 8. The the number of DNS queries toward specific domains from
numbers of emerged websites for each DGA instance are listed the DNSDB, which is a database indexing DNS queries col-
in Table 9. To understand the overlap between them, in addi- lected from cache DNS servers and authoritative DNS servers
tion to the numbers of simply overlapped AGDs, we calculated used by various organizations.
two types of similarity; Simpson index, SimS ( X, Y ) and Jaccard Even though a user who receives an A record of a spe-
index, Sim J ( X, Y ) , which are computed as follows: cific domain does not always access that domain, we simply
counted the number of valid queries in the DNS layer. Table 10
4
This project has stopped distributing a dataset due to the aging shows the number of successful DNS queries (authoritative DNS
of the dataset and the students in this project graduating. server replying with an A record) in each DGA instance and
overlap comparisons between each set of IP addresses. Instances A1–A4 were surprisingly accessed about 252K times, B1–B3 were accessed about 11K times, and C was accessed only 241 times. The same sets of IP addresses in B1–B3 and C appeared in the DNSDB.

Table 10 – Query to AGD in DNSDB.

Instance  # of domains (IP resolvable)  # accessed in DNSDB  # accessed IPs in DNSDB  SimS of IP  SimJ of IP
A1        36 (36)                       90,507               28,354                   0.57        0.02
A2        131 (128)                     147,855              34,502                   0.35        0.34
A3        39 (39)                       8,748                2,346                    0.44        0.28
A4        16 (16)                       4,926                1,697                    0.40        0.32
B1        43 (23)                       8,759                19                       0.87        0.36
B2        53 (32)                       1,887                9                        1.00        0.77
B3        1,564 (16)                    1,024                2                        1.00        1.00
C         1,919 (11)                    241                  5                        1.00        1.00

A large number of IP addresses of A1–A4 were also resolved in the DNSDB. Although overlaps indicate high similarity with the set of IP addresses observed in the decoy, there were unique entities, i.e., IP addresses, observed using distinct methods. We had further interest in the question, "How many IP addresses (agents) does this FFSN actually use?". To answer this question, we used mark and recapture estimation, which is a method for estimating a population's size and is commonly used in ecology. The Lincoln–Petersen estimator (Schwarz and Seber, 1982) is given by $\hat{N} = Kn/k$, where $\hat{N}$ is the estimated number of entities in the population, K is the observed number of entities captured using the 1st method, i.e., decoy in this case, n is the observed number of entities captured using the 2nd method, i.e., the DNSDB in this case, and k is the observed number of recaptured entities that were marked. By substituting the observed number of IP addresses, i.e., K = 39,131, n = 65,356, and k = 14,022, into the estimator, we get $\hat{N} = (39{,}131 \times 65{,}356) / 14{,}022 \approx 182\mathrm{K}$ IP addresses, which is the estimated population of our observed FFSN agents. The estimated population is much larger than that of known FFSNs discovered in previous surveys (Holz et al., 2008; Passerini et al., 2008).
We additionally surveyed the high impact websites involved in the exploit-detected inspections shown in Table 11. We manually analyzed inspection logs and found that all these websites were actually attributed to either exploit sites or redirectors. All these domains, except for warpdriveactive and laurendavidstyle, were accessed via the TDS of A2.

Table 11 – Top 10 high impact domains of exploit sites.

Domain                        Role          # accessed in DNSDB
warpdriveactive[dot]com       Exploit site  1,109,311
stevebeam[dot]com             Exploit site  333,166
bluefuse[dot]com              Exploit site  265,228
prepaidphoneguy[dot]com       Exploit site  142,540
rompnroll[dot]com             Exploit site  127,733
vistaclues[dot]com            Exploit site  114,418
lovedbaby[dot]com             Exploit site  109,606
jimmyhophotography[dot]com    Exploit site  108,620
chelmsfordlibrary[dot]org     Exploit site  104,727
laurendavidstyle[dot]com      Redirector    102,630

6.3. Attack automation

Attackers try to automate each malicious activity to conduct a sequence of attacks in a timely and scalable manner. Several types of code to steal credentials are publicly available. Metasploit provides modules to steal credentials from various client applications (https://github.com/rapid7/metasploit-framework/tree/master/modules/post/windows/gather/credentials). One malware program called Pony, also known as Fareit, has the functionality of information stealing (Trustwave, 2013), and its source code was leaked and is now publicly available (https://github.com/nyx0/Pony). Due to the availability of credential-stealing code, various types of malware can be easily implemented with such functionality. In the injection activity, an attacker first downloads web content from a Web CMS then uploads modified web content with redirect code to it. Attackers probably use certain automatic tools instead of manual operation because the execution time of code injection is mostly extremely short; for example, about 80% of code injections take less than 2 seconds. The websites that are redirect destinations obviously have exploit-kit-derived URLs. In our previous study, we manually analyzed the data in the first year and confirmed five popular exploit-kits at that time, i.e., Blackhole, Redkit, Phoenix, Incognito, and Neosploit, used on redirect destinations (Akiyama et al., 2013). Our monitoring system does not depend on specific detection signatures, so it has the potential for collecting the latest malicious entities regardless of the type of toolkit.

6.4. Adaptivity

Our monitoring system is applicable to various types of applications/services as long as they require ID/password authentication. Secure FTP (SFTP) and secure shell (SSH) are possible expansions of our monitoring system. Only two settings need to change to apply them: preparing corresponding honeytokens (e.g., putting a configuration file of SFTP or SSH on a malware sandbox), and running the services on a honeypot awaiting fraudulent login. However, to apply our monitoring system to public online services (e.g., social networks, online banking, online shopping, Webmail), we must cooperate with each service provider to monitor behind the service.

7. Related work

7.1. URL redirection analysis

Many studies have been conducted to examine, detect, and discover URL redirects on specific services, on the server side, on the client side, and with honeypots.
WarningBird reveals that redirect networks start from Twitter URLs (Lee and Kim, 2012). MadTracer inspects the top 90K popular websites for several months and reveals the ad network structure and characteristics (Li et al., 2012). These monitoring systems start from benign websites; therefore, they require large-scale crawling. In contrast, our monitoring system starts from decoy websites with injected malicious redirect code so
that it can consistently track each URL redirection without large-scale crawling.
SpiderWeb (Stringhini et al., 2013) analyzes the web access data of actual web users on client hosts and detects malicious websites on the basis of different web users' redirect structures. This system requires web accesses from various client environments and uses web access data from anti-virus programs installed on computers; such data are not publicly available and are affected by privacy issues. Our data are obtained from our managed decoy servers and are not affected by privacy issues.
While our monitoring system focuses on server-side web injection, Hulk is an analysis system that detects web injection on web browsers (not websites), which is caused by malicious browser extensions (Kapravelos et al., 2014). Thomas et al. focused on the aforementioned client-side web injection and conducted large-scale observation (Thomas et al., 2015).
Canali et al. developed decoy websites with vulnerable web applications to incur intrusions via vulnerabilities and surveyed how intruders compromise websites (Canali and Balzarotti, 2013). Akiyama et al. developed a decoy system that prompts malware to exfiltrate bait credentials and lures attackers into decoy websites with stolen credentials to effectively collect compromised web content (Akiyama et al., 2013). However, understanding the ecosystem of URL redirects on compromised websites was out of the scope of these studies.

7.2. DGA detection and analysis

Malicious domain names are often generated using DGAs to build a resilient infrastructure for conducting malicious activities. Many researchers have reported that various notorious malware families, such as Kraken/Bobax, Conficker, Murofet, Mebroot/Torpig, Srizbi, Bonnana, and Zeus, usually use DGAs for generating their C&C servers' domain names (Antonakakis et al., 2012; Bilge et al., 2011; Damballa, 2012; Schiavoni et al., 2014).
Pseudo-randomly generated domain names obviously have a specific linguistic feature that differs from that of human-generated domain names. Many researchers have proposed DGA-detection methods based on the linguistic features of domain names (Schiavoni et al., 2014; Yadav et al., 2012). In contrast, detection methods based on DNS traffic features, such as a large volume of NXDomains, have also been proposed (Antonakakis et al., 2012; Yadav and Reddy, 2011).
The weakness of a conventional DGA is that potential domain names generated in the future will be easily exposed if the algorithm is analyzed. Many botnet takedown operations have been conducted using domain sinkholing based on extracted potential domains. In addition, Stone-Gross et al. observed botnet communication toward C&C by using sinkholed domains and collected information about bot-infected hosts and stolen data (Stone-Gross et al., 2009).
Research on DGAs focusing on a botnet's C&C, i.e., post-infectious phase, has been extensive; however, little DGA research has been focused on the pre-infectious phase. Our study is the first to conduct longitudinal observation in the pre-infectious phase. Through our observation, we reveal the actual condition of using web-based DGAs and organize the main differences between conventional DGAs and web-based DGAs in Section 4.1.3.

7.3. Secure web-application framework against credential theft

Strong authentication, such as two-factor authentication (TFA), is effective for protecting a system from a fraudulent login using stolen credentials. Drupal (Drupal, 2011), a well-known CMS, has the functionality of TFA, so a server using this CMS can provide secure authentication instead of simple ID/password authentication such as FTP.

8. Conclusion

In this work, we attempted to shed light on the ecosystem of malicious redirections starting from compromised websites. In particular, we focused our attention on the core mechanism of web-based attacks – URL redirection. We investigated the following research questions: RQ1: What are the key characteristics of URL redirection mechanisms?, and RQ2: Have their purposes been changed over time? The main contribution of this work is the deployment of a honeypot-based monitoring system to track the ecosystem of URL redirections for a long period. Through the extensive analysis of longitudinal data collected with our monitoring system across four years, we derived the following findings. The findings corresponding to RQ1 are (A) the URL redirection mechanism exhibited intrinsic change and new trends, (B) the use of web-based DGAs has become popular as a means to increase the entropy of redirect URLs, and (C) both domain-flux and IP-flux are concurrently used for deploying the intermediate sites of redirect chains to ensure robustness of redirection. The findings corresponding to RQ2 are (D) click-fraud monetizing has recently become a new purpose of attackers in addition to malware infection, and (E) interestingly, we found that web tracking services that track the statistics of visitors, i.e., victims, are installed onto redirect URLs. To evaluate the generality of our observation, we quantified the impact of the malicious URL redirection mechanism in the real world by correlating locally and globally collected data.
On the basis of these findings originating from the unveiled URL redirection ecosystem, we also presented practical countermeasures against malicious URL redirections. Security/network operators can leverage information obtained from the honeypot-based measurement method for conventional security operations: disrupting infrastructures of web-based attacks by using domain blacklisting/takedown, reporting web advertising/tracking IDs, and discovering victims such as unknown compromised websites in the web by using domain takeover.

REFERENCES

Akiyama M, Aoki K, Kawakoya Y, Iwamura M, Itoh M. Design and implementation of high interaction client honeypot for drive-by-download attacks. IEICE Trans Commun 2010;E93-B:1131–9.
Akiyama M, Yagi T, Aoki K, Hariu T, Kadobayashi Y. Active credential leakage for observing web-based attack cycle. In: Proceedings of the 16th international symposium on research in attacks, intrusions, and defenses (RAID); 2013.
Antonakakis M, Perdisci R, Nadji Y, Vasiloglou N, Abu-Nimeh S, Lee W, et al. From throw-away traffic to bots: detecting the rise of DGA-based malware. In: Proceedings of the 21st USENIX security symposium (Security); 2012.
Araujo F, Hamlen KW, Biedermann S, Katzenbeisser S. From patches to honey-patches: lightweight attacker misdirection, deception, and disinformation. In: Proceedings of the 20th ACM conference on computer and communication security (CCS); 2014.
Bilge L, Kirda E, Kruegel C, Balduzzi M. EXPOSURE: finding malicious domains using passive DNS analysis. In: Proceedings of the 2011 network and distributed system security symposium (NDSS); 2011.
Blizard T, Livic N. Click-fraud monetizing malware: a survey and case study. In: Proceedings of the 7th international conference on malicious and unwanted software (MALWARE); 2013.
Canali D, Balzarotti D. Behind the scenes of online attacks: an analysis of exploitation behaviors on the web. In: Proceedings of the 2013 network and distributed system security symposium (NDSS); 2013.
Conficker Working Group. Lessons learned June 2010. Published January. 2011. Available from: http://www.confickerworkinggroup.org/wiki/uploads/Conficker_Working_Group_Lessons_Learned_17_June_2010_final.pdf.
Damballa. DGAs in the hands of cyber-criminals; 2012. Available from: https://www.damballa.com/downloads/r_pubs/WP_DGAs-in-the-Hands-of-Cyber-Criminals.pdf.
DNSDB. Farsight security. Available from: https://www.dnsdb.info.
Drupal. Two-factor authentication (TFA); 2011. Available from: https://www.drupal.org/project/tfa.
Durumeric Z, Kasten J, Adrian D, Halderman JA, Bailey M, Li F, et al. The matter of heartbleed. In: Proceedings of the 2014 conference on internet measurement conference (IMC); 2014.
Eshete B, Venkatakrishnan VN. WebWinnow: leveraging exploit kit workflows to detect malicious URLs. In: Proceedings of the 4th ACM conference on data and application security and privacy (CODASPY); 2014.
fb1h2s. Sandy: opensource exploit analysis framework; 2014. Available from: https://github.com/fb1h2s/sandy.
Grier C, Ballard L, Caballero J, Chachra N, Dietrich CJ, Levchenko K, et al. Manufacturing compromise: the emergence of exploit-as-a-service. In: Proceedings of the 19th ACM conference on computer and communication security (CCS); 2012.
Holz T, Gorecki C, Rieck K, Freiling FC. Measuring and detecting fast-flux service networks. In: Proceedings of the 2008 network and distributed system security symposium (NDSS); 2008.
Honeynet Project. 2008. Capture-HPC.
Honeynet Project. Know your enemy: fast-flux service networks; 2007. Available from: http://www.honeynet.org/papers/ff.
Invernizzi L, Benvenuti S, Cova M, Comparetti PM, Kruegel C, Vigna G. EvilSeed: a guided approach to finding malicious web pages. In: Proceedings of the 2012 IEEE symposium on security and privacy (SP); 2012.
Kapravelos A, Grier C, Chachra N, Kruegel C, Vigna G, Paxson V. Hulk: eliciting malicious behavior in browser extensions. In: Proceedings of the 23rd USENIX security symposium (Security); 2014.
Leder F, Werner T. Know your enemy: containing conficker; 2009.
Lee S, Kim J. WarningBird: detecting suspicious URLs in Twitter stream. In: Proceedings of the 2012 network and distributed system security symposium (NDSS); 2012.
Li Z, Zhang K, Xie Y, Yu F, Wang X. Knowing your enemy: understanding and detecting malicious web advertising. In: Proceedings of the 2012 ACM SIGSAC conference on computer and communications security (CCS); 2012.
MALICIA Project; 2013. Available from: http://malicia-project.com.
Moshchuk A, Bragin T, Gribble SD, Levy HM. A crawler-based study of spyware on the web. In: Proceedings of the 2006 network and distributed system security symposium (NDSS); 2006.
Nappa A, Rafique MZ, Caballero J. The MALICIA dataset: identification and analysis of drive-by download operations. Int J Inf Secur 2015;14(1):15–33.
Passerini E, Paleari R, Martignoni L, Bruschi D. FluXOR: detecting and monitoring fast-flux service networks. In: Proceedings of the 5th international conference on detection of intrusions and malware, and vulnerability assessment (DIMVA); 2008.
Provos N, Mavrommatis P, Rajab MA, Monrose F. All your iFRAMEs point to us. In: Proceedings of the 17th conference on security symposium (Security); 2008.
Rajab MA, Ballard L, Jagpal N, Mavrommatis P, Nojiri D, Provos N, et al. Trends in circumventing web-malware detection; 2011. Available from: http://static.googleusercontent.com/media/research.google.com/ja//archive/papers/rajab-2011a.pdf.
RiskAnalytics. Dark cloud network facilitates crimeware; 2016. Available from: https://www.riskanalytics.com/blog/post.php?s=2016-08-17-dark-cloud-network-facilitates-crimeware.
Schiavoni S, Maggi F, Cavallaro L, Zanero S. Phoenix: DGA-based botnet tracking and intelligence. In: Proceedings of the 11th international conference on detection of intrusions and malware, and vulnerability assessment (DIMVA); 2014.
Schwarz CJ, Seber GAF. The estimation of animal abundance and related parameters; 1982.
Shadowserver. Gameover Zeus; 2014. Available from: https://goz.shadowserver.org/.
Spitzner L. Honeytokens: the other honeypot, 2003. Available from: http://www.symantec.com/connect/articles/honeytokens-other-honeypot.
Stone-Gross B, Cova M, Cavallaro L, Gilbert B, Szydlowski M, Kemmerer R, et al. Your botnet is my botnet: analysis of a botnet takeover. In: Proceedings of the 16th ACM conference on computer and communications security (CCS); 2009.
Stringhini G, Kruegel C, Vigna G. Shady paths: leveraging surfing crowds to detect malicious web pages. In: Proceedings of the 2013 ACM SIGSAC conference on computer and communications security (CCS); 2013.
Symantec Security Response Blog. Web-based malware distribution channels: a look at traffic redistribution systems, 2011. Available from: http://www.symantec.com/connect/blogs/web-based-malware-distribution-channels-look-traffic-redistribution-systems.
Thomas K, Bursztein E, Grier C, Ho G, Jagpal N, Kapravelos A, et al. Ad injection at scale: assessing deceptive advertisement modifications. In: Proceedings of the IEEE symposium on security and privacy (SP); 2015.
TrendMicro. Traffic direction systems as malware distribution tools, 2011. Available from: http://www.trendmicro.com/cloud-content/us/pdfs/security-intelligence/reports/rpt_malware-distribution-tools.pdf.
Trustwave. Look what I found: Moar Pony! 2013. Available from: https://www.trustwave.com/Resources/SpiderLabs-Blog/Look-What-I-Found--Moar-Pony!/.
Websense Security Labs. Mass injection – Nine-Ball compromises more than 40,000 Legitimate Web sites, 2009.
Yadav S, Reddy ALN. Winning with DNS failures: strategies for faster botnet detection. In: Proceedings of the 7th international ICST conference on security and privacy in communication networks (SecureComm); 2011.
Yadav S, Reddy AKK, Reddy ALN, Ranjan S. Detecting algorithmically generated malicious domain names. In: Proceedings of the 2010 conference on internet measurement conference (IMC); 2010.
Yadav S, Reddy AKK, Reddy ALN, Ranjan S. Detecting algorithmically generated domain-flux attacks with DNS traffic analysis. IEEE/ACM Trans Netw 2012;20(5):1663–77. http://dx.doi.org/10.1109/TNET.2012.2184552.
Zhang J, Yang C, Xu Z, Gu G. PoisonAmplifier: a guided approach of discovering compromised websites through reversing search poisoning attacks. In: Proceedings of the 15th international symposium on research in attacks, intrusions and defenses (RAID); 2012.

Mitsuaki Akiyama received the M.E. degree and the Ph.D. degree in Information Science from Nara Institute of Science and Technology, Japan, in 2007 and 2013, respectively. He has joined NTT R&D division in Tokyo from 2007, and he is now a member of NTT Secure Platform Laboratories. He has been engaged in research and development of network security, especially client honeypot and malware analysis.

Takeshi Yagi is a research engineer at NTT Corporation. He received his M.E. in Science and Technology from Chiba University. Since joining NTT in 2002, he has been engaged in research and design of network architecture, traffic engineering, honeypots, and security-data analysis based on machine learning.

Takeshi Yada is a Senior Research Engineer, Supervisor, Cyber Security Project, NTT Secure Platform Laboratories. He received the M.S. degree in engineering from Tokyo Institute of Technology, Tokyo, 1991. Since joining NTT in 1991, he has been engaged in research and development of network architecture, measurement and inference for network traffic, and network management.

Tatsuya Mori is currently an Associate Professor at Waseda University, Tokyo, Japan. He received B.E. and M.E. degrees in applied physics, and Ph.D. degree in information science from the Waseda University, in 1997, 1999 and 2005, respectively. He joined NTT lab in 1999. Since then, he has been engaged in the research of measurement and analysis of networks and cyber security.

Youki Kadobayashi received the Ph.D. degree in computer science from Osaka University, Osaka, Japan, in 1997. From 1997 to 2000, he was with the Computation Center of Osaka University as an Assistant Professor. Since 2000, he has been an Associate Professor with the Graduate School of Information Science, Nara Institute of Science and Technology, Nara, Japan.
