Analyzing The Ecosystem of Malicious URL Redirection Through Longitudinal Observation From Honeypots
ScienceDirect
journal homepage: www.elsevier.com/locate/cose
ARTICLE INFO

Article history:
Available online 11 January 2017

Keywords:
Honeypot
Compromised website
URL redirection
Drive-by download
Domain generation algorithm

ABSTRACT

Today, websites are exposed to various threats that exploit their vulnerabilities. A compromised website will be used as a stepping-stone and will serve attackers’ evil purposes. For instance, URL redirection mechanisms have been widely used as a means to perform web-based attacks covertly; i.e., an attacker injects a redirect code into a compromised website so that a victim who visits the site will be automatically navigated to a malware distribution site. Although many defense operations against malicious websites have been developed, we still encounter many active malicious websites today. As we will show in the paper, we infer that the reason is associated with the evolution of the ecosystem of malicious redirection. Given this background, we aim to understand the evolution of the ecosystem through long-term measurement. To this end, we developed a honeypot-based monitoring system, which specializes in monitoring the behavior of URL redirections. We deployed the monitoring system across four years and collected more than 100K malicious redirect URLs, which were extracted from 776 distinct websites. Our chief findings can be summarized as follows: (1) Click-fraud has become another motivation for attackers to employ URL redirection, (2) The use of web-based domain generation algorithms (DGAs) has become popular as a means to increase the entropy of redirect URLs to thwart URL blacklisting, and (3) Both domain-flux and IP-flux are concurrently used for deploying the intermediate sites of redirect chains to ensure robustness of redirection.
Based on the results, we also present practical countermeasures against malicious URL redirections. Security/network operators can leverage useful information obtained from the honeypot-based monitoring system. For instance, they can disrupt infrastructures of web-based attack by taking down domain names extracted from the monitoring system. They can also collect web advertising/tracking IDs, which can be used to identify the criminals behind attacks.
© 2017 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
* Corresponding author.
E-mail address: [email protected] (M. Akiyama).
http://dx.doi.org/10.1016/j.cose.2017.01.003
0167-4048/© 2017 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
156 computers & security 69 (2017) 155–173
2.2. Monitoring system

Discovering specific compromised websites in web space is required in advance of starting observation to understand the actual circumstances of URL redirect injection. Since web space is extensive (the number of URLs is increasing daily) and deep (dynamic pages are less easily indexed in search engines), it is not easy to discover malicious websites in the wild in a timely and efficient manner. Guided approaches for efficiently discovering unknown malicious websites in web space have been proposed (Invernizzi et al., 2012; Zhang et al., 2012). Additionally, a decoy system using honeypots to attract attackers has been proposed (Canali and Balzarotti, 2013). Our approach also uses honeypots to monitor compromised websites and to analyze the malicious redirect code injected into them. We previously presented a prototype of our measurement system (Akiyama et al., 2013). In addition, we extend the scope of the above studies to continually track and dissect websites that are redirect destinations over the long term.
Our monitoring system is composed of several types of honeypots and can efficiently collect information on malicious URL redirection. The key method is purposely leaking the bait credentials, called honeytokens (Spitzner, 2003), of our web content management system (Web CMS), which is also a honeypot, to attackers to incur masquerade attacks. Fig. 1 gives an overview of our monitoring system, and we describe the analytical steps as follows:

1. We set up decoy Web Content Management Systems (CMSs) on the server, called Web CMS honeypot, and make it accessible from the open Internet by FTP. We deploy famous CMSs such as Wordpress and Joomla. The CMS contents, e.g., HTML, JavaScript, and PHP, are used as decoys with little modification and do not include external web content. The Web CMS honeypot has a domain name and requires a username and password to be able to access any content.
2. We set up a malware sandbox, running malware from our daily crawling of public websites and blacklisted websites, i.e., websites listed in http://malwaredomainlist.com, in order to purposely leak honeytokens. The sandbox is designed to run malware which gathers usernames and passwords from the host system and leaks them to a machine elsewhere on the Internet under the control of an attacker. The sandbox generates honeytokens, i.e., IP address/domain name of the server, usernames and passwords of the CMSs created in the previous step, and sets them into a configuration file or registry of several client applications, e.g., FTP clients. In each analysis, the sandbox randomly generates unique credentials.
3. The Web CMS honeypot actually behaves as an FTP server and waits for attackers to infect the CMSs and inject URL redirection attack code and perhaps other things besides. It creates a user directory for each account corresponding to a potentially leaked credential. After an attacker successfully logs in to the server over FTP with honeytokens, the attacker is given full access to the server as an FTP user and can therefore insert HTML tags, JavaScript code, PHP code, and .htaccess files.
4. We run a honeyclient, which is a web browser-type honeypot that performs as a vulnerable web browser and accesses each infected CMS content on a daily basis to determine where the URL redirections lead. In our observation, all observed redirections toward external websites must result from redirect code injection either directly or indirectly because original decoy content does not include external web content.

Our monitoring system restricts web access from outside benign web users to eliminate the risk of secondary attacks from our compromised websites, while it receives masquerade attacks and URL redirect injections. That is, our monitoring system permits only our honeyclient to access our compromised websites over the web, while FTP is publicly open to incur masquerade attacks with stolen credentials.

2.3. Extraction method

In general, benign websites basically include benign redirects derived from original web content made by a website owner. Furthermore, an owner sometimes changes original web content legitimately. This type of redirect becomes noise in analyzing benign websites. In contrast, since all redirects on our
Footnotes:
1 The latest rules (Apr. 6, 2014) can be found at https://github.com/fb1h2s/sandy/tree/master/yara-ctypes/yara/rules/urlclassifier. To reduce obvious false positives, we excluded g01pack.yar, which aggressively detects benign URLs.
2 Eleven distinct kinds of anti-malware software, that is, Avast, AVG, Avira, ClamAV, ESET, Forefront, Kaspersky, McAfee, Sophos, Symantec, and TrendMicro.
3 Our HoneyPatch was presented in 2010 (Akiyama et al., 2010), and a similar but sophisticated implementation was presented by Araujo et al. (2014) in 2014.

3.2.3. Correlative redirections in drive-by download sequence
We analyzed how many URLs and websites are correlatively involved in drive-by download sequences (Fig. 3). We simply extracted all URLs and websites in each detected inspection. In inspection detecting drive-by downloads, all compromised websites have redirects leading to an exploit URL on an external website. The figure indicates that almost all sequences involved two or more external URLs and websites. This means that exploit codes are inevitably placed at external websites.
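The counting described in Section 3.2.3 can be sketched as follows. This is a minimal illustration, not the authors' tooling: it assumes that a "website" is identified by the hostname of a URL, that an inspection is available as an ordered list of visited URLs, and that the landing (compromised) site is excluded, as in Fig. 3. The function name and the example hosts are hypothetical.

```python
from urllib.parse import urlsplit


def count_external(chain_urls, landing_host):
    """Count distinct external URLs and websites (hostnames) in one inspection.

    Assumption: a website is identified by its hostname; the landing
    (compromised) site is excluded from both counts.
    """
    external_urls = set()
    external_sites = set()
    for url in chain_urls:
        host = urlsplit(url).hostname
        if host is None or host == landing_host:
            continue  # skip malformed URLs and the landing site itself
        external_urls.add(url)
        external_sites.add(host)
    return len(external_urls), len(external_sites)


# Hypothetical redirect chain: landing site, intermediate site, exploit site.
chain = [
    "http://decoy.example/index.php",
    "http://tds.example/in.cgi?5",
    "http://exploit.example/fingerprint.html",
    "http://exploit.example/exploit.html",
]
n_urls, n_sites = count_external(chain, "decoy.example")
```

For this hypothetical chain the sketch counts three external URLs on two external websites, which matches the typical case of one intermediate website plus one exploit site.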
[Fig. 3: y-axis "Fraction of inspections" (0 to 1); x-axis "Number of redirect websites or URLs" (log scale, 1 to 100); curves: Website, URL.]
Fig. 3 – Redirection count in drive-by download exploits. Note that a landing URL that is on our compromised websites is excluded in this graph.
Exploit sites are built using various exploit kits, and typical exploit kits have several URLs that have different roles, such as browser fingerprinting, exploiting browser vulnerability, and forcing a compromised browser to download a malware executable. First, a page for browser fingerprinting identifies a target host’s environment and redirects it to the appropriate exploit pages. If an exploit is successful, the target host is forced to access a malware download page.
Several inspections include two or more external websites because of an intermediate website. In typical exploit sequences observed during our observation, an exploit site was accessed via an intermediate website. The role of this intermediate website is controlling web accesses and redirecting them to appropriate destinations. We explain the analysis of this intermediate website further in Section 4.2.

3.3. Redirect features

3.3.1. Methods and utilization
We classified the observed redirect methods into three categories: tag redirect, script redirect, and HTTP redirect. The most standard method is tag redirect using script, iframe, or meta tags. The iframe and meta tag insertions were only 1.0% and less than 0.1%, respectively. In contrast, the script tag insertion became a major injection method over time, which accounted for 70.8% of all observed file modifications. In this case, almost all script tag insertions were script tags without a src attribute. This means that obfuscated JavaScript is inside script tags and dynamically outputs iframe tags or executes location redirects after de-obfuscating itself. Due to this obfuscation, malicious redirect code conceals specific redirect destinations (i.e., URLs).
In comparison, HTTP redirects are implemented in server-side content as header() insertion in .php files (9 modifications) and .htaccess files (174 modifications). When these redirect methods controlled in server-side content are used, it is impossible, by client-side measurement, to recognize the reason a redirection occurs, while a decoy website can collect such server-side .php and .htaccess files. Table 4 lists the classified redirect methods and the breakdown of their usage on actual injected redirect codes. We statically analyzed injected redirect codes, but this static analysis to find specific strings could not enable us to precisely identify D, E, and F because they are obfuscated and dynamically executed. Therefore, we counted script tag insertions without src attributes as the integrated number of D, E, and F.

3.3.2. Errors
Many redirections fail to access redirect destinations in some layers, e.g., DNS or HTTP. We counted the unique statuses of websites and indicated the time-series accessibility of redirect destinations in regard to websites, which are shown as
percentages of DNS lookup failure, connection errors, and HTTP errors in Fig. 4. The percentages of DNS and TCP connection errors gradually increased.
During our observation period, 4488 domains responded with non-existent domain (NXDomain) more than once, and 32.9% of domains responded with NXDomain to all DNS queries. In addition, 20.7% of URLs responded with HTTP error (not HTTP-200) more than once, and 16.0% responded with HTTP error to all HTTP requests. These include automatically generated domains, and they often failed to resolve their domain name.

3.4. Domain and URL features

3.4.1. Domain depth and suffix
Table 5a shows the depth of domain, which means the number of labels in the <host> part of a URL, e.g., the domain depth of www.example.co.jp is 4. Domains of depth 2 and 3 were dominant among all domains and accounted for 46.8% and 46.5%, respectively.
Table 5b lists the statistics for the (FQDN-1) label, which means domain suffix without a hostname, e.g., the (FQDN-1) label of www.example.com indicates example.com. Although mynumber.org and .pro are not popular domain suffixes, they are placed higher in rank. The reason for this unusual distribution of domain suffixes is that almost all domains under these two domain suffixes are automatically generated domains (AGDs), which are possibly produced using a DGA, and a vast majority are sometimes not accessible because of DNS lookup failure. Detailed analysis of the DGAs discovered during our observation is described in Section 4.

Table 5 – Domain depth and suffix statistics.

(a) Domain depth
Domain depth    % domain
2               46.8
3               46.5
4               6.4
5               0.17
6               0.02
7               0.01

(b) Distribution of domain suffix (Top 6)
Domain suffix (FQDN-1)    Domain depth    % domain
.mynumber.org             3               17.95
.com                      2               16.29
.pro                      2               14.60
.ru                       2               3.44
.metric.gstatic.com       4               2.87
.net                      2               2.51

3.4.2. Emerging redirect destinations from distinct compromised website
Attackers frequently change redirect codes to switch redirect destinations to thwart blacklisting or change attack campaigns. Taking this feature into account, our monitoring system can efficiently obtain various redirect URLs through periodically monitoring compromised websites performing as a landing site of web-based attacks. In contrast to benign websites, decoy websites are suitable for long-term observation and can maximize the yield of redirect codes and redirect destinations. To indicate how many redirect destinations our monitoring system extracted from each compromised website, we counted observed URLs and websites derived from each compromised website over an entire measurement period (Fig. 5). The compromised websites had a median number of 91 domains and a median number of 127 URLs in terms of redirections in total.

3.4.3. URL popularity on distinct redirect website
From the viewpoint of redirect destination, Fig. 6 shows the URL popularity, which means how many URLs each redirect website has. A number of malicious websites, which are obviously exploit websites identified by the exploit-based detector mentioned in Section 3.2.1, are distributed between two and tens of URLs on the X-axis in the figure. A malicious website built using a certain exploit kit generally has several URLs (pages) such as a fingerprinting page for identifying a target’s environment and an exploit page targeting specific vulnerabilities. Websites at the bottom right of this figure, which own a large volume of URLs, are used for web advertising and tracking, and we mention them in Section 3.4.4 in detail. The leftmost websites, which own only one URL, account for 69.3%, and most are the aforementioned AGDs. In the exploit-detected inspections mentioned in Section 3.2.2, these AGDs often emerged as intermediate websites; in other words, they fulfill the role of bridging an initial website, i.e., compromised website, and a final website, e.g., an exploit site. We further discuss the analysis of AGDs and unveil their nature in Section 4.

3.4.4. Web advertising and tracking
Even if an attacker compromises websites and injects redirect code, a redirect destination is not necessarily related to malware infection but rather web advertising and tracking
[Fig. 5: y-axis from 0 to 1; x-axis "# websites or URLs" (log scale, 1 to 100000); legend: Website.]
Fig. 5 – Number of emerging redirect destinations on distinct compromised website over entire observation period.
[Fig. 6: y-axis "Cumulative number of redirect website" (log scale, 1 to 10000); x-axis "# of URLs on distinct redirect website" (log scale, 1 to 100000).]
Fig. 6 – URL popularity on redirect website. X-axis indicates cumulative number of URL on distinct redirect website through our observation. The top 10 and top 100 websites cover 36.7 and 67.9% of all observed URLs. Websites owning under 10 URLs are 93.9%; in particular, websites owning one URL are 69.1%.
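The two measures defined in Section 3.4.1 can be computed from a URL as in the following minimal sketch. It assumes that domain depth is simply the number of dot-separated labels in the host part and that the (FQDN-1) label is the host with its leftmost label removed; the function names are illustrative and are not taken from the paper.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_labels(url):
    """Return the dot-separated labels of the host part of a URL."""
    host = urlsplit(url).hostname or ""
    return [label for label in host.split(".") if label]


def domain_depth(url):
    """Number of labels in the host part, e.g. www.example.co.jp -> 4."""
    return len(host_labels(url))


def fqdn_minus_one(url):
    """Host with its leftmost label removed, e.g. www.example.com -> example.com."""
    return ".".join(host_labels(url)[1:])


# Tally depths over a list of hypothetical redirect URLs.
urls = ["http://www.example.co.jp/a", "http://www.example.com/", "http://example.net/x"]
depth_histogram = Counter(domain_depth(u) for u in urls)
```

A tally such as `depth_histogram` corresponds to the kind of distribution reported in Table 5a, although the percentages there come from the authors' own data.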
prevalence as shown in the table. This value is calculated by $S_{emerge} / S_{exist}$, where $S_{exist}$ is the number of all websites, i.e., decoy compromised websites, and $S_{emerge}$ is the number of websites whose redirect destination is set. While the domains of doubleclick, google-analytics, and googlesyndication provide popular web advertising and tracking services, they are also widely used by redirect codes injected into compromised websites. We mention the use of google-analytics for web tracking in Section 4.4.
The notable point about web advertising and tracking is that their numbers increased in 2014 and 2015 while the number of drive-by downloads gradually decreased.

3.5. Summary

To summarize, almost all injected redirect codes were developed in obfuscated JavaScript to conceal redirect destinations. Gradually increasing DNS lookup failures resulted from the large number of domains that are obviously AGDs. The drive-by download sequences often involve an intermediate website to control redirections. To address RQ1, we describe the in-depth analysis of AGDs and controlling redirections in Section 4.
Domains related to drive-by downloads actively emerged on URL redirections in 2012 to 2013. While they were not so active in 2014 and later, the number of domains/URLs related to web advertising or tracking was increasing. This is the answer to RQ2, which indicates the change in the purpose of redirection based on redirect code injections, that is, click-fraud monetization is a new purpose in addition to malware injection.

4. In-depth analysis of evasive redirection techniques

To address RQ1 regarding the key characteristics of the URL redirection mechanisms, we unveil and dissect evasive redirection techniques deployed by attackers in our observation. The malicious techniques described below, which are dedicated to the URL redirection ecosystem, are deployed in collusion with each other. Our analysis revealed that attackers use the following techniques: domain-flux by web-based DGAs, IP-flux by fast-flux service networks (FFSNs), redirection controlling by traffic distribution systems (TDSs), and target profiling by tracking services. We give an overview of the URL redirection ecosystem revealed through our measurement in Fig. 7 and explain its characteristics as follows.

4.1. Web-based domain generation algorithm

The use of DGAs has become popular since 2008 due to the emergence of Conficker (Leder and Werner, 2009). Originally, DGAs were used for C&C in the post-infectious phase, for instance, malware-infected hosts communicate with specific generated domains, i.e., C&C server, for only a very short period of time. Although our analysis was focused on the pre-infectious phase, to our surprise, we discovered many suspicious AGDs in the observed redirect URLs. We inspect how DGA mechanisms have been used in the ecosystem of URL redirection. Our study revealed that DGAs are also used as a key component of the redirection mechanism. The use of the DGAs we observed for URL redirection is broadly classified into two categories: client-side domain generation (CDG) and server-side domain generation (SDG). We give details on CDG and SDG below.

4.1.1. Client-side domain generation
Web-based CDG is implemented as JavaScript. When a web browser accesses a page containing a JavaScript DGA, it runs on the browser and pseudo-randomly generates a domain name and its URL. After that, the JavaScript DGA outputs redirect code, e.g., an iframe tag whose src attribute is set to a generated URL.
The JavaScript DGA is highly obfuscated to obstruct analysis. We used a browser emulator to fully execute obfuscated JavaScript and extract input/output values of eval() and document.write() as candidates of de-obfuscated human-readable code. We then manually analyzed this extracted
[Figure residue, ecosystem overview diagram labels: popular sites; tracking sites; IP addresses; fast-flux service network (FFSN).]
[Figure residue, DGA comparison diagram.
C&C-based DGA (attacker, malware DGA code, infected host): 1. Infect with malware; (step number lost) Generate domain name; 3. Access C&C server (AGD).
Web-based DGA, client-side domain generation (CDG) (attacker, compromised website, victim host browser, malicious website): 1. Inject DGA JavaScript; 2. Access; 3. Reply DGA JavaScript; 4. Generate domain name and redirect code; 5. Access (redirect) malicious website (AGD).]
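Section 4.1.4 notes that the discovered JavaScript DGAs derive their output only from the month, date, and hour of the current timestamp, so the same domain name recurs every year. The following toy sketch illustrates that property only. It is not the observed attacker code: the hash-based construction, the function name, and the example.org suffix are all assumptions made for illustration.

```python
import hashlib
from datetime import datetime


def toy_time_seeded_host(ts, suffix="example.org", length=12):
    """Derive a hostname from month, day, and hour only (the year is ignored).

    Hypothetical construction for illustration; real instances differ.
    """
    seed = f"{ts.month:02d}{ts.day:02d}{ts.hour:02d}"
    digest = hashlib.sha256(seed.encode("ascii")).hexdigest()
    # Map each hex digit (0-15) to a letter a-p to form one label.
    label = "".join(chr(ord("a") + int(c, 16)) for c in digest[:length])
    return f"{label}.{suffix}"


first = toy_time_seeded_host(datetime(2012, 5, 1, 10))
one_year_later = toy_time_seeded_host(datetime(2013, 5, 1, 10))
one_hour_later = toy_time_seeded_host(datetime(2012, 5, 1, 11))
```

Because the year never enters the seed, `first` and `one_year_later` are identical, while a change of hour yields a different name. This is consistent with the gaps at 365 and 730 days that the authors report for CDG lifespans.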
In a C&C-based DGA, a large volume of DNS queries and failures generally occurs from each infected host as time progresses. Many detection methods have been proposed that are based on these characteristics (Antonakakis et al., 2012; Yadav and Reddy, 2011). However, these conventional methods may fail to detect web-based DGAs. In contrast to a malware-infected host accessing C&C-based DGAs, a web user unexpectedly accesses web-based DGAs when he or she accesses compromised websites with injected redirect code. In web-based DGAs, accessing AGDs occurs when a web user occasionally accesses a landing site injected with redirect code. Therefore, the conventional viewpoints toward detection, such as using a large volume of NXDomain derived from speculative DNS queries and failover behavior, do not effectively work. While the volumes of DNS query and failure are not so large, the characteristic of domain name randomness still remains. Therefore, while conventional DGA detection methods based on linguistic features were implicitly targeting C&C-based DGAs, they may also be effective for web-based DGAs.

4.1.4. Lifespan
Generally, the longer malicious websites are used, the more effective blacklisting is against them. To circumvent this blacklisting, an attacker accordingly changes redirect code on compromised websites to switch redirect destinations. To understand the effectiveness of typical blacklisting, we estimated the lifespan of redirect destinations. The durations of CDG, SDG, and other non-DGA websites are separately shown in Fig. 10. We calculated domain A’s lifespan $D(A) = T_l(A) - T_f(A)$, where $T_f(A)$ is the timestamp of the first
[Fig. 10: y-axis "CDF of emerging website" (0 to 1); x-axis "Lifespan (days)" (100 to 1000); curves: CDG, SDG, Other.]
emergence of A and $T_l(A)$ is the timestamp of the last emergence.
In the case of CDG, the two large gaps at 365 days (a year) and 730 days (two years) on the X-axis were caused by domains that emerge annually. The reason that annually emerging domains are also AGDs is explained in Section 4.1.1. The discovered instances of JavaScript DGA, i.e., specific JavaScript code, use only month, date, and hour information extracted from a current timestamp and do not use year information. Therefore, the same domain name is generated yearly on the same hour, date, and month, and we actually observed this phenomenon. The lifespans of SDG instances were distributed from a few to tens of days, substantially different from those of CDG instances. Among the other websites, 82.6% vanished from the redirections within 10 days, while 7.5% were used over 100 days.

4.1.5. DGA instances
We heuristically classified AGDs on the basis of hostname length, upper-level domain, and URL-path similarity into several groups as DGA instances, each of which is a set of AGDs derived from a specific DGA. Table 8 shows summarized data corresponding to each instance. Instances A1–A4 are of SDG, and all have a similar URL path. The remarkable point in these instances is that a massively large number of IP addresses were resolved from their domain names. Instances B1–B3 and C are of CDG, and all also have similar URL paths. Instances B3 and C have a massively large number of domain names, 1564 and 1919, respectively. Despite this, almost all domains of B3 and C are unresolvable to IP addresses. Instance C has domains under a private suffix domain, so an attacker can freely create valid domains. In comparison, a domain registration to a registrar is required to validly use B3’s domain since B3 is directly deployed under the TLD. Instances B1–B3 had been used for only two or three weeks, then a resolvable domain never appeared. Instance C newly emerged after B1–B3 operations. We manually checked to see whether B1–B3 and C were deployed by the same DGA with the same parameters except for the upper-level domain. On the basis of this evidence, we assumed that the same attacker uses B1–B3 and C. B1 and B2 were used in short periods. However, redirections using the JavaScript DGA of B3 and C are actually discarded by an attacker and remain on compromised websites even if generated domains are unresolvable.
The domains and URLs of DGA instances account for only 33.8 and 3.4% of all redirect destinations, respectively; however, the domains of DGA instances account for 74.5% out of 2841 inspections detected as successful drive-by exploits. This means that web-based DGAs play an important role in drive-by exploits.

4.1.6. Discarded redirect code
In most cases, an attacker carefully maintains redirect code on a compromised website to change redirect destinations and obfuscation algorithms. However, we also discovered injected redirect codes and corresponding redirect destinations discarded by an attacker that have never been maintained or changed by the attacker from a certain time. If a web user accesses a website containing discarded redirect code, the redirection inevitably fails; therefore, it is actually harmless. In our observed objects, redirect codes toward 5 domains of A1 and 25 domains of A2 were discarded on several compromised websites, and these ghost redirections continued through each inspection. In addition, the JavaScript DGAs of B1–B3 and C were also discarded and continued generating redirections toward unresolvable websites. Therefore, redirection failures continuously occur despite a discarded redirect code that has already been used by an attacker.

4.2. Traffic distribution systems

Our observed AGDs have two patterns of characteristic strings in the URL path: count[1-9][0-9]?\.php and in.cgi\?[1-9][0-9]?. One (in.cgi) is called Sutra-TDS in a previous report (Symantec Security Response Blog, 2011), which is a toolkit for building a traffic distribution system (TDS).
A TDS is an intermediate website placed between an initial website (i.e., compromised website) and final destination website. The primary aim is to control the final redirect destinations to obscure them. In the observed data, we confirmed that final websites are frequently changed by a TDS as time progresses.
All our discovered DGA instances performed as TDSs and mediated between an initial website and an exploit site. Almost all redirect methods on the AGDs based on A1–A4 were HTTP-200 with JavaScript location.href or HTTP-302.

4.3. Fast-flux service network under AGDs

Domains generated by SDG (A1–A4) respond with different IP addresses in each DNS resolution. Consequently, they had an extremely large number of IP addresses through our
measurement, as mentioned in Section 4.1.5 and shown in as a plain text in web content; therefore, we can easily iden-
Table 8. We observed over 1K IP addresses for each instance tify it.
and 39,131 in total. The following evidence indicates that A1–
A4 are deployed on a fast-flux service network (FFSN). As seen
from the GeoIP information, these IP addresses are mas- 4.5. Thwarting security inspection
sively globally distributed, for example, A1–A4 include 425
autonomous system numbers (ASNs) in 95 countries, 1849 ASNs One of the reasons for HTTP access failure seems to be cloak-
in 117 countries, 374 ASNs in 48 countries, and 414 ASNs in ing. Although cloaking was originally used for SEO poisoning,
56 countries. We calculated the ASNs of all IP addresses on it is currently used for web-based malware infection. To thwart
FFSNs and manually checked the several top ASNs and their security inspection, malicious websites respond with harm-
usages. Most IP addresses are located on residential net- less content or redirect to a legitimate website if they recognized
works so that end user PCs are the most dominant. While the an access as a security inspection. The most popular cloak-
IP addresses are globally distributed on hundreds of ASes, ing technique is IP cloaking, which identifies client IP addresses,
Ukraine’s telecom/mobile networks (AS15895 and AS25229) e.g., detected repeated accesses or blacklisted IP addresses, and
account for over 22%. The report published by RiskAnalytics flexibly responds (Rajab et al., 2011). It has been reported that
(2016) mentioned that their originally observed IP addresses almost all exploit kits have IP cloaking functionalities (Eshete
of an FFSN are also located at the same ASes we observed so and Venkatakrishnan, 2014). In addition, a security vendor’s
that we recognized the commonly used bot-infected PCs for report mentioned that TDSs also conduct cloaking (TrendMicro,
FFSNs. The emerging periods of these instances are tempo- 2011). To bypass IP cloaking, we make an effort to obfuscate
rally diversified; however, 1557 IP addresses out of all resolved the IP addresses of a honeyclient by using web proxies dis-
IP addresses are multiply used by them. tributed on various ASes. User-agent information and referrers
Hosts of resolved IP addresses successfully respond with are also usually used for cloaking. Our honeyclient is a high
HTTP replies in most cases; in other words, they have valid IP addresses and run as a website. The percentages of IP addresses that successfully responded to HTTP requests in A1–A4 are 79.5, 91.0, 92.6, and 80.6%, respectively. Therefore, in most cases, the authoritative DNS servers of FFSN domains in A1–A4 faithfully answer with valid IP addresses controlled by attackers.

All HTTP reply headers include the following four commonly used header fields: Server:Apache, Content-Type: without a specific content type, Server:nginx/1.2.6, and X-Powered-By:PHP/5.4.11. These reply headers are anomalous because the Server: field is defined more than once and the Content-Type: field is always empty. In consideration of this evidence, FFSN agents seem to be commonly installed on a specific server or to use blind proxy redirection (BPR). BPR is reportedly typical of FFSNs (Honeynet Project, 2007): the agent works as a reverse proxy that transparently forwards received HTTP requests to a backend server directly controlled by an attacker. In this way, an attacker directly learns information about the accessing hosts and can switch HTTP replies for all of them at a single point without being exposed.

4.4. Double-crop redirecting

One notable point is that Google Analytics emerged on redirect chains derived from about 60% of compromised websites. Google Analytics is one of the most well-known web access analysis services. The observed Google Analytics instances come from advertising pages and drive-by-related pages. The former is normal usage that aims to collect web users’ profiles in order to deploy more effective personalized ads. The latter seems to aim at profiling targeted web users in order to further strategize effective exploits or monetization. In our content analysis, the web content of 19 domains in A1 contained Google Analytics’ JavaScript and redirected web users to www.google-analytics.com. These 19 domains were set as redirect destinations on 43.4% of compromised websites. In addition, the same tracking ID, obviously an attacker’s, was commonly used in all these domains. This tracking ID is embedded

interaction system, so the user-agent information and referrer are naturally set by an actual web browser. Therefore, we did not make any special effort to obfuscate this information. Note that our monitoring system did not access websites with different browsers, so it may fail to reach websites that change redirect destinations based on user-agent information.

In our inspection, our honeyclient encountered several types of cloaking. We summarize probable cloaking situations in HTTP as follows.

• HTTP error: HTTP-404 or HTTP-500 responses often occur in inspections of exploit sites. These types of response are typical cloaking behavior of exploit kits.
• HTTP-200 without meaningful content: Malicious websites respond with HTTP-200 without meaningful content, for instance, 0-byte content or OK. When a web browser receives such a reply, no redirection or exploit occurs. More than half of the HTTP responses from A1–A4 were HTTP-200 with 0-byte content.
• HTTP-302 redirect to benign websites: Malicious websites respond with HTTP-302 with specific redirect URLs, which are Google sites, Microsoft sites, or localhost. In addition, on some exploit sites, exploits and HTTP-302 redirects to popular websites occurred simultaneously. This seems intended to confuse security inspection. In many cases of redirecting to benign websites, the top pages of popular websites, such as http://www.google.com/, http://www.google.se/, and http://www.bing.com/, are used as HTTP-302 redirect destinations.

Some HTTP-302 redirections were unsophisticated because their destinations were mistakenly set to http://google.com/ or http://bing.com/, which are not actually accessible as such. When a web user accesses such a URL, it redirects the user to the effective top URL, such as http://www.google.com/ or http://www.bing.com/.
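The three cloaking situations listed above can be expressed as a small response classifier. The following is a minimal sketch, not the monitoring system's actual logic: the function names, the returned labels, and the `BENIGN_SUFFIXES` list are hypothetical, and the suffix list only approximates "Google sites, Microsoft sites, or localhost".

```python
from urllib.parse import urlsplit

# Hypothetical list of benign destinations (assumption, not taken from the paper's data).
BENIGN_SUFFIXES = ("google.com", "google.se", "bing.com", "microsoft.com")


def is_benign_destination(location: str) -> bool:
    """Return True if a redirect target is localhost or a listed popular site."""
    host = (urlsplit(location).hostname or "").lower()
    if host == "localhost":
        return True
    return any(host == s or host.endswith("." + s) for s in BENIGN_SUFFIXES)


def classify_cloaking(status: int, headers: dict, body: bytes) -> str:
    """Map one HTTP response to one of the cloaking situations described above."""
    # Header names are case-insensitive, so normalize them first.
    lowered = {k.lower(): v for k, v in headers.items()}
    if status in (404, 500):
        return "http_error"
    if status == 200 and body.strip() in (b"", b"OK"):
        return "empty_200"
    if status == 302 and is_benign_destination(lowered.get("location", "")):
        return "redirect_to_benign"
    return "no_cloaking_pattern"
```

A label such as `redirect_to_benign` only marks a response as consistent with cloaking; deciding whether a site is actually malicious still requires the surrounding redirect chain.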
168 computers & security 69 (2017) 155–173
4.6. Summary

To summarize, our in-depth analysis unveiled evasive techniques to ensure significantly tolerant operation of URL redirections: domain-flux, IP-flux, redirection controlling, and target profiling. Domain-flux by web-based DGAs generated over three thousand AGDs over the entire period. We found two uses of DGAs, CDG and SDG, and discussed the operational difference between the types of DGAs and the effectiveness of conventional methods. These observed AGDs acted as TDSs, which are intermediate websites placed between an initial website and a final destination website to control and obscure final redirect destinations. IP-flux by an FFSN and the above domain-flux (especially CDG) are concurrently used for deploying TDSs. In addition, to profile victim web users and to further strategize effective exploits or monetization, intermediate websites (i.e., TDSs) use a web tracking service while redirecting to exploit sites. The characteristics of these techniques to ensure tolerant operation of URL redirections are the answers to RQ1.

5. Mitigation

the URL of a compromised website injected with redirect code. We confirmed that all HTTP requests toward websites of DGA instances carry correct referrer fields. If we know the URL of a specific malicious website in advance, it is possible to discover unknown compromised websites by checking the referrer information contained in HTTP requests toward the malicious website. Conceivable methods of monitoring such requests are using web proxy logs or taking over websites by domain sinkholing. We can enumerate potential domains to be used in the future by reversing client-side DGAs. Given that almost all of these domains are unregistered and unused, we can legitimately register potential domains to own a part of the AGDs instead of an attacker. If we successfully complete domain registration, we can receive HTTP requests from potential victim web users on our observation website when our DNS server responds with an A record set to the IP address of our website.

The phenomenon that discarded redirect code permanently remains on compromised websites is mentioned in Section 4.1.6. Although discarded redirect code is not currently harmful, there is a potential risk that valid redirect code will be injected some day. Therefore, compromised websites should be fixed immediately, even if redirect code seems to be discarded.
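The referrer-based discovery described in this section can be sketched as a scan over web proxy or sinkhole logs. This is a minimal illustration under stated assumptions: the `(request_url, referer)` record layout and the function name `candidate_compromised_sites` are hypothetical, and real proxy logs would need their own parsing step.

```python
from collections import Counter
from urllib.parse import urlsplit


def candidate_compromised_sites(records, malicious_hosts):
    """Count referrer hosts seen on requests toward known malicious hosts.

    records: iterable of (request_url, referer) tuples (assumed log layout).
    malicious_hosts: host names already known to be malicious, e.g. AGDs.
    """
    malicious = {h.lower() for h in malicious_hosts}
    counts = Counter()
    for request_url, referer in records:
        target = (urlsplit(request_url).hostname or "").lower()
        # Only requests toward known malicious hosts are of interest.
        if target not in malicious or not referer:
            continue
        ref_host = (urlsplit(referer).hostname or "").lower()
        # Skip hops between malicious hosts; keep the external referrers.
        if ref_host and ref_host not in malicious:
            counts[ref_host] += 1
    return counts
```

Hosts with high counts are candidates for websites injected with redirect code; they still need manual or automated confirmation, because a referrer alone does not prove compromise.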
that it can consistently track each URL redirection without large-scale crawling.

SpiderWeb (Stringhini et al., 2013) analyzes the web access data of actual web users on client hosts and detects malicious websites on the basis of different web users’ redirect structures. This system requires web accesses from various client environments and uses web access data from anti-virus programs installed on computers; such data are not publicly available and are affected by privacy issues. Our data are obtained from our managed decoy servers and are not affected by privacy issues.

While our monitoring system focuses on server-side web injection, Hulk is an analysis system that detects web injection on web browsers (not websites), which is caused by malicious browser extensions (Kapravelos et al., 2014). Thomas et al. focused on the aforementioned client-side web injection and conducted large-scale observation (Thomas et al., 2015). Canali et al. developed decoy websites with vulnerable web applications to attract intrusions via vulnerabilities and surveyed how intruders compromise websites (Canali and Balzarotti, 2013). Akiyama et al. developed a decoy system that prompts malware to exfiltrate bait credentials and lures attackers into decoy websites with stolen credentials to effectively collect compromised web content (Akiyama et al., 2013). However, understanding the ecosystem of URL redirects on compromised websites was out of the scope of these studies.

7.2. DGA detection and analysis

Malicious domain names are often generated using DGAs to build a resilient infrastructure for conducting malicious activities. Many researchers have reported that various notorious malware families, such as Kraken/Bobax, Conficker, Murofet, Mebroot/Torpig, Srizbi, Bonnana, and Zeus, usually use DGAs for generating their C&C servers’ domain names (Antonakakis et al., 2012; Bilge et al., 2011; Damballa, 2012; Schiavoni et al., 2014).

Pseudo-randomly generated domain names have specific linguistic features that differ from those of human-generated domain names. Many researchers have proposed DGA-detection methods based on the linguistic features of domain names (Schiavoni et al., 2014; Yadav et al., 2012). In contrast, detection methods based on DNS traffic features, such as a large volume of NXDomains, have also been proposed (Antonakakis et al., 2012; Yadav and Reddy, 2011).

The weakness of a conventional DGA is that potential domain names generated in the future will be easily exposed if the algorithm is analyzed. Many botnet takedown operations have been conducted using domain sinkholing based on extracted potential domains. In addition, Stone-Gross et al. observed botnet communication toward C&C by using sinkholed

differences between conventional DGAs and web-based DGAs in Section 4.1.3.

7.3. Secure web-application framework against credential theft

Strong authentication, such as two-factor authentication (TFA), is effective for protecting a system from a fraudulent login using stolen credentials. Drupal (Drupal, 2011), a well-known CMS, has the functionality of TFA, so a server using this CMS can provide secure authentication instead of simple ID/password authentication such as that of FTP.

8. Conclusion

In this work, we attempted to shed light on the ecosystem of malicious redirections starting from compromised websites. In particular, we focused our attention on the core mechanism of web-based attacks – URL redirection. We investigated the following research questions: RQ1: What are the key characteristics of URL redirection mechanisms?, and RQ2: Have their purposes been changed over time? The main contribution of this work is the deployment of a honeypot-based monitoring system to track the ecosystem of URL redirections for a long period. Through the extensive analysis of longitudinal data collected with our monitoring system across four years, we derived the following findings. The findings corresponding to RQ1 are (A) the URL redirection mechanism exhibited intrinsic change and new trends, (B) the use of web-based DGAs has become popular as a means to increase the entropy of redirect URLs, and (C) both domain-flux and IP-flux are concurrently used for deploying the intermediate sites of redirect chains to ensure robustness of redirection. The findings corresponding to RQ2 are (D) click-fraud monetizing has recently become a new purpose of attackers in addition to malware infection, and (E) interestingly, we found that web tracking services that track the statistics of visitors, i.e., victims, are installed onto redirect URLs. To evaluate the generality of our observation, we quantified the impact of the malicious URL redirection mechanism in the real world by correlating locally and globally collected data.

On the basis of these findings originating from the unveiled URL redirection ecosystem, we also presented practical countermeasures against malicious URL redirections. Security/network operators can apply information obtained from the honeypot-based measurement method to conventional security operations: disrupting the infrastructures of web-based attacks by using domain blacklisting/takedown, reporting web advertising/tracking IDs, and discovering victims such as unknown compromised websites on the web by using domain takeover.

REFERENCES
Antonakakis M, Perdisci R, Nadji Y, Vasiloglou N, Abu-Nimeh S, Lee W, et al. From throw-away traffic to bots: detecting the rise of DGA-based malware. In: Proceedings of the 21st USENIX security symposium (Security); 2012.
Araujo F, Hamlen KW, Biedermann S, Katzenbeisser S. From patches to honey-patches: lightweight attacker misdirection, deception, and disinformation. In: Proceedings of the 20th ACM conference on computer and communication security (CCS); 2014.
Bilge L, Kirda E, Kruegel C, Balduzzi M. EXPOSURE: finding malicious domains using passive DNS analysis. In: Proceedings of the 2011 network and distributed system security symposium (NDSS); 2011.
Blizard T, Livic N. Click-fraud monetizing malware: a survey and case study. In: Proceedings of the 7th international conference on malicious and unwanted software (MALWARE); 2013.
Canali D, Balzarotti D. Behind the scenes of online attacks: an analysis of exploitation behaviors on the web. In: Proceedings of the 2013 network and distributed system security symposium (NDSS); 2013.
Conficker Working Group. Lessons learned June 2010. Published January 2011. Available from: http://www.confickerworkinggroup.org/wiki/uploads/Conficker_Working_Group_Lessons_Learned_17_June_2010_final.pdf.
Damballa. DGAs in the hands of cyber-criminals; 2012. Available from: https://www.damballa.com/downloads/r_pubs/WP_DGAs-in-the-Hands-of-Cyber-Criminals.pdf.
DNSDB. Farsight security. Available from: https://www.dnsdb.info.
Drupal. Two-factor authentication (TFA); 2011. Available from: https://www.drupal.org/project/tfa.
Durumeric Z, Kasten J, Adrian D, Halderman JA, Bailey M, Li F, et al. The matter of heartbleed. In: Proceedings of the 2014 conference on internet measurement conference (IMC); 2014.
Eshete B, Venkatakrishnan VN. WebWinnow: leveraging exploit kit workflows to detect malicious URLs. In: Proceedings of the 4th ACM conference on data and application security and privacy (CODASPY); 2014.
fb1h2s. Sandy: opensource exploit analysis framework; 2014. Available from: https://github.com/fb1h2s/sandy.
Grier C, Ballard L, Caballero J, Chachra N, Dietrich CJ, Levchenko K, et al. Manufacturing compromise: the emergence of exploit-as-a-service. In: Proceedings of the 19th ACM conference on computer and communication security (CCS); 2012.
Holz T, Gorecki C, Rieck K, Freiling FC. Measuring and detecting fast-flux service networks. In: Proceedings of the 2008 network and distributed system security symposium (NDSS); 2008.
Honeynet Project. Capture-HPC; 2008.
Honeynet Project. Know your enemy: fast-flux service networks; 2007. Available from: http://www.honeynet.org/papers/ff.
Invernizzi L, Benvenuti S, Cova M, Comparetti PM, Kruegel C, Vigna G. EvilSeed: a guided approach to finding malicious web pages. In: Proceedings of the 2012 IEEE symposium on security and privacy (SP); 2012.
Kapravelos A, Grier C, Chachra N, Kruegel C, Vigna G, Paxson V. Hulk: eliciting malicious behavior in browser extensions. In: Proceedings of the 23rd USENIX security symposium (Security); 2014.
Leder F, Werner T. Know your enemy: containing conficker; 2009.
Lee S, Kim J. WarningBird: detecting suspicious URLs in Twitter stream. In: Proceedings of the 2012 network and distributed system security symposium (NDSS); 2012.
Li Z, Zhang K, Xie Y, Yu F, Wang X. Knowing your enemy: understanding and detecting malicious web advertising. In: Proceedings of the 2012 ACM SIGSAC conference on computer and communications security (CCS); 2012.
MALICIA Project; 2013. Available from: http://malicia-project.com.
Moshchuk A, Bragin T, Gribble SD, Levy HM. A crawler-based study of spyware on the web. In: Proceedings of the 2006 network and distributed system security symposium (NDSS); 2006.
Nappa A, Rafique MZ, Caballero J. The MALICIA dataset: identification and analysis of drive-by download operations. Int J Inf Secur 2015;14(1):15–33.
Passerini E, Paleari R, Martignoni L, Bruschi D. FluXOR: detecting and monitoring fast-flux service networks. In: Proceedings of the 5th international conference on detection of intrusions and malware, and vulnerability assessment (DIMVA); 2008.
Provos N, Mavrommatis P, Rajab MA, Monrose F. All your iFRAMEs point to us. In: Proceedings of the 17th conference on security symposium (Security); 2008.
Rajab MA, Ballard L, Jagpal N, Mavrommatis P, Nojiri D, Provos N, et al. Trends in circumventing web-malware detection; 2011. Available from: http://static.googleusercontent.com/media/research.google.com/ja//archive/papers/rajab-2011a.pdf.
RiskAnalytics. Dark cloud network facilitates crimeware; 2016. Available from: https://www.riskanalytics.com/blog/post.php?s=2016-08-17-dark-cloud-network-facilitates-crimeware.
Schiavoni S, Maggi F, Cavallaro L, Zanero S. Phoenix: DGA-based botnet tracking and intelligence. In: Proceedings of the 11th international conference on detection of intrusions and malware, and vulnerability assessment (DIMVA); 2014.
Schwarz CJ, Seber GAF. The estimation of animal abundance and related parameters; 1982.
Shadowserver. Gameover Zeus; 2014. Available from: https://goz.shadowserver.org/.
Spitzner L. Honeytokens: the other honeypot; 2003. Available from: http://www.symantec.com/connect/articles/honeytokens-other-honeypot.
Stone-Gross B, Cova M, Cavallaro L, Gilbert B, Szydlowski M, Kemmerer R, et al. Your botnet is my botnet: analysis of a botnet takeover. In: Proceedings of the 16th ACM conference on computer and communications security (CCS); 2009.
Stringhini G, Kruegel C, Vigna G. Shady paths: leveraging surfing crowds to detect malicious web pages. In: Proceedings of the 2013 ACM SIGSAC conference on computer and communications security (CCS); 2013.
Symantec Security Response Blog. Web-based malware distribution channels: a look at traffic redistribution systems; 2011. Available from: http://www.symantec.com/connect/blogs/web-based-malware-distribution-channels-look-traffic-redistribution-systems.
Thomas K, Bursztein E, Grier C, Ho G, Jagpal N, Kapravelos A, et al. Ad injection at scale: assessing deceptive advertisement modifications. In: Proceedings of the IEEE symposium on security and privacy (SP); 2015.
TrendMicro. Traffic direction systems as malware distribution tools; 2011. Available from: http://www.trendmicro.com/cloud-content/us/pdfs/security-intelligence/reports/rpt_malware-distribution-tools.pdf.
Trustwave. Look what I found: Moar Pony!; 2013. Available from: https://www.trustwave.com/Resources/SpiderLabs-Blog/Look-What-I-Found--Moar-Pony!/.
Websense Security Labs. Mass injection – Nine-Ball compromises more than 40,000 Legitimate Web sites; 2009.
Yadav S, Reddy ALN. Winning with DNS failures: strategies for faster botnet detection. In: Proceedings of the 7th international ICST conference on security and privacy in communication networks (SecureComm); 2011.
Yadav S, Reddy AKK, Reddy ALN, Ranjan S. Detecting algorithmically generated malicious domain names. In: Proceedings of the 2010 conference on internet measurement conference (IMC); 2010.
Yadav S, Reddy AKK, Reddy ALN, Ranjan S. Detecting algorithmically generated domain-flux attacks with DNS traffic analysis. IEEE/ACM Trans Netw 2012;20(5):1663–77. http://dx.doi.org/10.1109/TNET.2012.2184552.
Zhang J, Yang C, Xu Z, Gu G. PoisonAmplifier: a guided approach of discovering compromised websites through reversing search poisoning attacks. In: Proceedings of the 15th international symposium on research in attacks, intrusions and defenses (RAID); 2012.

Mitsuaki Akiyama received the M.E. degree and the Ph.D. degree in Information Science from Nara Institute of Science and Technology, Japan, in 2007 and 2013, respectively. He joined the NTT R&D division in Tokyo in 2007, and he is now a member of NTT Secure Platform Laboratories. He has been engaged in research and development of network security, especially client honeypots and malware analysis.

Takeshi Yagi is a research engineer at NTT Corporation. He received his M.E. in Science and Technology from Chiba University. Since joining NTT in 2002, he has been engaged in research and design of network architecture, traffic engineering, honeypots, and security-data analysis based on machine learning.

Takeshi Yada is a Senior Research Engineer, Supervisor, Cyber Security Project, NTT Secure Platform Laboratories. He received the M.S. degree in engineering from Tokyo Institute of Technology, Tokyo, in 1991. Since joining NTT in 1991, he has been engaged in research and development of network architecture, measurement and inference for network traffic, and network management.

Tatsuya Mori is currently an Associate Professor at Waseda University, Tokyo, Japan. He received B.E. and M.E. degrees in applied physics, and a Ph.D. degree in information science from Waseda University, in 1997, 1999 and 2005, respectively. He joined NTT lab in 1999. Since then, he has been engaged in research on measurement and analysis of networks and cyber security.

Youki Kadobayashi received the Ph.D. degree in computer science from Osaka University, Osaka, Japan, in 1997. From 1997 to 2000, he was with the Computation Center of Osaka University as an Assistant Professor. Since 2000, he has been an Associate Professor with the Graduate School of Information Science, Nara Institute of Science and Technology, Nara, Japan.