Isc09 Spyware
1 Introduction
P. Samarati et al. (Eds.): ISC 2009, LNCS 5735, pp. 202–217, 2009.
© Springer-Verlag Berlin Heidelberg 2009
Automated Spyware Collection and Analysis 203
program differently. In this paper, we use the term spyware in a more narrow
sense – as browser-based software that records privacy-sensitive information and
transmits it to a third party without the user’s knowledge and consent. This def-
inition is more faithful to the “original” purpose of spyware, which is to record
the activity of a user while surfing the web.
A host can become infected with spyware in various ways. For example, the
spyware component might come bundled with shareware, such as a peer-to-
peer client or a supposed Internet “accelerator.” It is common practice that
small software companies, unable to sell their products in retail, cooperate with
spyware/adware distributors to fund the development of their products [1]. Most
of the time, however, users cannot “unselect” the installation of the
piggybacked nuisance without disrupting the desired software functionality.
In this paper, we are interested in the extent to which executables on the web
present a spyware threat. This allows us to compare our results to a previous
study [2]. For our analysis, we focus on spyware that uses Microsoft’s Internet
Explorer to monitor the actions of a user. Typically, this is done either by using
the Browser Helper Object (BHO) interface or by acting as a browser toolbar.
We feel that this focus is justified by the fact that the overwhelming majority of
spyware uses a component based on one of these two technologies, a fact that is
confirmed by a number of previous papers [3,4,5,6].
As mentioned above, the authors of a previous measurement study [2] at-
tempted to assess the prevalence of spyware on the Internet. To this end, they
crawled the web for executables, downloaded them, and installed the programs
in an automated fashion. Then, the authors executed a single anti-spyware pro-
gram, Lavasoft’s Ad-Aware [7], to assess the threat level of each program.
Unfortunately, in the previous study, little attention was devoted to the fact
that relying on the output and correctness of a single tool can significantly mis-
represent the true problem, and thus, the perception of the spyware threat. Obvi-
ously, scanner-based systems cannot detect novel threats for which no signature
exists. Also, such systems are often trivial to evade by using techniques such as
obfuscation or code substitution. Hence, scanner-based systems may introduce
false negatives, and as a result, cause the threat to be underestimated. However,
it is also possible that a detection tool mislabels programs as being more dan-
gerous than they actually are. Such false positives may cause an overestimation
of the threat.
In our work, one of the aims was to show the bias that is introduced by
deriving statistics from the results of a single tool. Obviously, we did not simply
want to re-run the experiments with more anti-spyware tools (although we did
employ a second, scanner-based application). Instead, we wanted to perform our
analysis using substantially different approaches that aim to detect spyware. To
this end, we checked for spyware-related identifiers in the Windows registry, using
a popular, publicly-available database [8]. Moreover, we employed a behavior-
based approach [3] that monitors the execution of a component in a sandbox and
checks for signs of suspicious behavior. By combining multiple techniques and
employing further, manual analysis in cases for which different tools disagree,
204 A. Stamminger et al.
we aimed to establish a level of “ground truth” for our sample set. Based on this
ground truth, we identify the weaknesses of each technique when exposed to a
large set of real-world, malicious code.
In summary, the contributions of this paper are the following:
– In about ten months, we crawled over 15 million URLs on the Internet and
analyzed 35,853 executables for the presence of spyware components.
– We employed three different analysis techniques and devoted additional man-
ual effort to identify the true nature of the components that we obtained.
This allows us to expose the weaknesses of individual analysis approaches.
– We compare our results to a previous study that attempted to measure the
spyware threat on the Internet and critically review their results.
2 Methodology
In this section, we describe our approach to analyze the extent of spyware on
the Internet. In order to keep a consistent terminology within the rest of the
paper, we first define the behavior that constitutes spyware activity. Then, we
explain how we crawl the web for executables and briefly discuss our approach
to automatically install these programs. Finally, we describe how we identify a
program as spyware.
As mentioned previously, the term spyware is overloaded. For example, it
is not uncommon that a component that displays advertisements is considered
spyware, even if it neither reads nor leaks any privacy-related information. Other
examples of mislabeled spyware are toolbars that provide search fields that send
input to a search engine of the user’s choice. Clearly, information that is entered
into the search field is forwarded to the search engine. Hence, the component is
not malicious, as it informs the user where the data is sent to.
Because of the ambiguous use of the spyware term, it is possible that the actual
risk of downloading a spyware-infected executable is overstated. Consequently,
we need a more precise discrimination between different classes of activity. As
mentioned previously, we focus in our study on browser extensions (BEs) for the
Microsoft Internet Explorer (from now on, we use the term browser extension to
refer to both BHOs and toolbars). To make our discussion of browser extensions
more precise, we propose the following taxonomy:
Benign. An extension is benign when it does not perform any function that
might be undesirable from a privacy-related point of view, nor exposes the user
to unwanted content.
Adware. Adware is benign software with the purpose of advertising a certain
product, e.g., via pop-up windows. These components do not leak any sensitive
information, though.
We also consider a toolbar as adware when it provides a search field to the
user that sends the input to a particular (often, less well-known) search engine.
The reason is that the toolbar promotes, or advertises, the use of a particular
search engine. Of course, the user is free to use the toolbar or not.
to [2], we identify such content by examining two properties for each candidate
URL. If either (1) the URL’s file extension is .exe, or (2) the “Content-type”
HTTP header of the corresponding web resource is application/octet-stream,
we download the file. We then check the first bytes of the file header and com-
pare it with the “magic” value that denotes a Windows executable. We perform
similar checks for zip, cabinet (.cab), and MS Installer files (.msi).
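The download filter and the header check described above can be sketched as follows. This is a minimal illustration and not the crawler's actual code: the function names and the MAGIC table are assumptions made for this example, while the byte signatures are the well-known leading bytes of each format ("MZ" for Windows executables, "PK\x03\x04" for zip archives, "MSCF" for cabinet files, and the OLE compound-file header used by .msi files).

```python
from typing import Optional
from urllib.parse import urlparse

# Well-known leading byte signatures. The table layout is illustrative only.
MAGIC = {
    "exe": b"MZ",
    "zip": b"PK\x03\x04",
    "cab": b"MSCF",
    "msi": bytes.fromhex("d0cf11e0a1b11ae1"),  # OLE compound file header
}


def is_candidate(url: str, content_type: Optional[str]) -> bool:
    """Return True if either download property (1) or (2) holds."""
    # Property (1): the path component of the URL ends in .exe.
    if urlparse(url).path.lower().endswith(".exe"):
        return True
    if content_type is None:
        return False
    # Property (2): compare only the media type, ignoring any parameters.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "application/octet-stream"


def sniff_type(header: bytes) -> Optional[str]:
    """Map the first bytes of a downloaded file to a known type, if any."""
    for name, magic in MAGIC.items():
        if header.startswith(magic):
            return name
    return None
```

Checking the header after the download guards against servers that deliver an HTML error page under an .exe URL or with a generic content type.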
To determine whether Internet users with a specific field of interest are more
likely to encounter spyware on the web, we defined eight categories, similar to [2]:
adult, games, kids, music, desktop (office), pirate, shareware, and toolbar. For
each category, we fed the Google search engine with category-specific keywords
and used the fifty most relevant search results as a seed for our crawler. We
consider this a reasonable approach, because these are the pages that users would
most likely encounter when searching for content in the categories mentioned
above. To focus our crawling on those web sites that are found by the Google
search, we perform a breadth-first crawl only up to a depth of three links away from
the seed.
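A depth-limited breadth-first crawl of this kind can be sketched as shown below. The sketch is an assumption-laden illustration rather than the crawler used in the study: get_links is a hypothetical callback that returns the outgoing links of a page, and no politeness, URL normalization, or error handling is modeled.

```python
from collections import deque
from typing import Callable, Iterable, List


def crawl(seeds: Iterable[str],
          get_links: Callable[[str], Iterable[str]],
          max_depth: int = 3) -> List[str]:
    """Visit pages breadth-first, at most max_depth links away from a seed."""
    visited = set()
    order = []
    queue = deque()
    for seed in seeds:
        if seed not in visited:
            visited.add(seed)
            queue.append((seed, 0))
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return order


# Toy link graph (hypothetical): "e" is four links away from the seed "s".
graph = {"s": ["a", "b"], "a": ["c"], "c": ["d"], "d": ["e"], "b": []}
pages = crawl(["s"], lambda u: graph.get(u, []))
```

With the toy graph, the page "e" is never visited because it lies beyond the depth limit of three.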
2.3 Analysis
the same CLSID to register their component (possibly because the developers
were lazy or use the same code base). Thus, the value of the identifier can provide
some insight into the nature of the corresponding program. Moreover, the
file name of the extension component can also be revealing. Of course, both identifiers
can be easily modified by miscreants.
CastleCops [8] is¹ a community of security professionals that provides a free
and searchable database of BHOs and toolbars. At the time of writing, it con-
tained 41,144 entries. For each BE, the database lists various information, includ-
ing the BE’s CLSID and its file name. Furthermore, a classification is provided.
This classification includes X for spyware and malware, O for programs that are
open to debate (such as grayware and adware), and L for legitimate items.
To perform identifier-based detection, we use HijackThis [11], a free util-
ity that scans a computer for installed browser extensions, reporting both the
CLSIDs and path names of the identified components. Based on the informa-
tion provided by HijackThis, we consult the CastleCops database. Using the
classification provided by this database, we can classify the browser extension
accordingly. The absence of any entry results in the BE being classified as legit-
imate.
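The lookup step can be illustrated with a short sketch. Everything here is hypothetical except the three class letters (X, O, L) and the rule that a missing entry counts as legitimate: the database is modeled as a plain dictionary, the CLSIDs are made-up placeholders, and the normalization to upper case is an assumption about how keys would be compared.

```python
from typing import Dict

LEGITIMATE = "L"  # default when the database has no entry for a CLSID


def normalize(clsid: str) -> str:
    """Compare CLSIDs case-insensitively and ignore surrounding whitespace."""
    return clsid.strip().upper()


def classify_be(clsid: str, database: Dict[str, str]) -> str:
    """Return X, O, or L for a browser extension identified by its CLSID."""
    return database.get(normalize(clsid), LEGITIMATE)


# Made-up entries standing in for the online database.
toy_db = {
    normalize("{00000000-0000-0000-0000-000000000001}"): "X",
    normalize("{00000000-0000-0000-0000-000000000002}"): "O",
}

verdict_known = classify_be("{00000000-0000-0000-0000-000000000001}", toy_db)
verdict_unknown = classify_be("{ffffffff-0000-0000-0000-000000000000}", toy_db)
```

The default makes the weakness of the approach explicit: a component that uses a CLSID unknown to the database is classified as legitimate.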
Scanner-based Detection. Our scanner-based detection was based on two
commercial anti-spyware products, Ad-Aware [7] and Spybot [12] – both popular
and well-known spyware scanners.
Ad-Aware uses a number of threat categories to specify the precise nature of
a sample. During our analysis, we encountered the following categories:
– Adware: Programs displaying advertising on the user’s computer, without
leaking sensitive information.
– Data miner: Programs designed to collect and transmit private user
information to a third party. This behavior may be disclosed to the user through
the EULA. This is equivalent to our spyware definition.
– Malware: A generic category for harmful programs, equivalent to our mal-
ware class.
To ensure that we had the newest signatures, we always updated Ad-Aware’s
signature database before launching a scan. To check for suspicious code, we
perform a full system scan. Once the tool is finished, we check the report for the
presence of any component that is recognized as being suspicious. If this is the
case, we record the corresponding threat category.
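The correspondence between Ad-Aware's categories and our taxonomy can be written down as a small mapping. This is only a sketch: the function names are hypothetical, the report is assumed to be available as a list of category labels, and parsing the tool's real report format is not modeled.

```python
from typing import Iterable, List, Optional

# Ad-Aware category (lower case) -> class in our taxonomy.
AD_AWARE_TO_TAXONOMY = {
    "adware": "adware",
    "data miner": "spyware",
    "malware": "malware",
}


def to_taxonomy(label: str) -> Optional[str]:
    """Translate one reported category; None if it is not one listed above."""
    return AD_AWARE_TO_TAXONOMY.get(label.strip().lower())


def recorded_categories(report_labels: Iterable[str]) -> List[str]:
    """Collect the taxonomy classes for all recognized labels in a report."""
    classes = {to_taxonomy(label) for label in report_labels}
    classes.discard(None)
    return sorted(classes)


# Hypothetical labels as they might appear in one scan report.
example = recorded_categories(["Adware", "Data miner", "Adware"])
```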
Spybot is a spyware scanner that attempts to detect threats on the user’s com-
puter by comparing registry entries and files against a database with signatures
of well-known malware samples. This tool lets the user choose the threat categories
to check for. For our study, we chose those categories that
we assumed to be most closely related to spyware: hijackers, keyloggers, malware,
potentially unwanted programs, and spyware. After we run a system scan,
Spybot lists each detected threat, without providing any further classification.
¹ Unfortunately, CastleCops has recently ceased its operations, but was still active
while we performed our analysis.
the most popular search terms. Besides search engine sites, we also entered some
of these keywords directly into the browser’s address bar. To trigger BEs that
hijack error pages, we also entered misspelled URLs.
3 Results
In this section, we discuss the results of our measurement study. More precisely,
we show the prevalence of spyware-infected executables for a number of differ-
ent “regions” on the web. Moreover, we compare the effectiveness of different
detection techniques, examining their strengths and limitations. In particular,
we are interested in the possible bias that Ad-Aware introduces, since this was
the sole tool used in a previous attempt to quantify the extent of spyware on
the Internet [2]. Finally, we compare the findings in the previous study to the
results of our analysis.
Table 3 shows the overall analysis results. It can be seen that about 6.6%
of all executables contain non-benign BEs. However, most of these programs
belong to the adware category, while the fraction of executables that contain
malicious components (spyware and malware) is significantly smaller: only 0.3%, or
117 executables. This clearly underlines that the spyware threat might appear
much more dramatic when the analysis does not distinguish precisely between
non-invasive adware and malicious spyware. A breakdown of the non-benign BEs
according to our taxonomy is presented in Table 4.
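As a quick sanity check, the 0.3% figure is consistent with the sample size if we assume that the denominator is the 35,853 executables analyzed in total. The variable names below are illustrative.

```python
TOTAL_EXECUTABLES = 35_853  # executables analyzed (assumed denominator)
MALICIOUS = 117             # executables with spyware or malware components

# Share of malicious executables, in percent.
malicious_pct = 100 * MALICIOUS / TOTAL_EXECUTABLES
print(f"{malicious_pct:.2f}%")  # rounds to the reported 0.3%
```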
desktop (office), pirate, shareware, and toolbar). The results for the prevalence
of non-benign components on pages of these categories are shown in Table 6. As
the numbers demonstrate, we encountered spyware in all categories.
Before analyzing the results in detail, we conjectured that most spyware would
be found on shareware or freeware sites. This is not only because of the large
number of executables hosted on those sites, but also because shareware is often
offered together with dubious adware to finance its development. Our results
confirm the initial intuition: The shareware category is not only the richest
source for executables in general, but also holds the largest number of executables
that install a BE. Interestingly, although over 15% of the shareware applications
come with a non-benign BE, the actual fraction causing a spyware or malware
infection is comparatively low (0.4%). The categories of the sites where BEs
are most likely misused for malicious purposes are adult, desktop (office), and
games, as indicated by the highest fraction of spyware BEs (last row in Table 6).
large number of CLSIDs (106) used by adware BEs that we could not find in the
online database. This is mainly due to Softomate components, discussed in the
following paragraph.
In general, it can be seen that identifier-based detection works surprisingly
well. Unfortunately, this kind of detection can be easily evaded, and certain
spyware variants (e.g., Win32.Stud.A) already use randomly-generated CLSIDs.
Scanner-based Detection. Table 8 shows our comparison with the reports
provided by Ad-Aware. When we consider the similarity of our definition of spy-
ware and Ad-Aware’s description of a data miner, our results show a surprising
mismatch in the number of detected samples. During our analysis, Ad-Aware
(mis)labeled 130 unique adware components as data miners. None of the other
techniques could confirm these threats.
Closer examination of Ad-Aware’s report showed that 92% of these mislabeled
components are toolbars. To determine whether these components only track
user data that is entered into the toolbar, we additionally performed manual
testing. Some of these toolbars provide search results for paid advertisers, but
only when we use the search function of the toolbar. Clearly, this is the expected
behavior, and thus, should not be classified as data miner. We also contacted
Lavasoft to resolve this issue. We were told that one possible cause for their
classification might be the fact that the installation routine does not clearly
state the purpose of an adware program, and thus, it is labeled as data miner.
Additionally, they admitted that some samples were misclassified.
One particular problem was caused by the Softomate Toolbar, which is a devel-
oper aid for creating customized Internet Explorer components. A few malicious
samples are created using this tool. However, Ad-Aware tags all toolbars that
are developed with the help of Softomate as data miner. This is unfortunate,
because we observed that over 50% of all executables with browser extensions
were using a component produced by Softomate. However, only a tiny fraction
is recognized as malicious by all other detection techniques. Given the significant
number of adware BEs that were tagged as data miners by Ad-Aware, we
recognize a significant bias that overstates the actual threat.
On the other hand, we also found four actual spyware threats not reported
by Ad-Aware. Three of these threats were revealed by the behavior-based
detection technique (as we show below), and three could also be identified
using Spybot. This demonstrates the limitations of signature-based detection
and the possibility of underestimating the threat because of novel, malicious code
instances. However, four cases are still a relatively small number. In two
additional cases, a spyware threat was misclassified as adware.
Table 9 shows our comparison with Spybot. At first glance, it appears that
Spybot misses a considerable number of adware samples. On further examination,
93% of these BEs are Softomate Toolbars, a popular type of extension. As
we discussed previously, we labeled these BEs as (mildly annoying) adware, but
one could also argue that they are benign. Therefore, we consider this mismatch
as negligible.
Behavior-based Detection. Table 10 shows the performance of our taint anal-
ysis with respect to ground truth. As expected, those BEs leaking sensitive user
information, such as URLs surfed by the user, could be found in the categories
grayware and spyware. Since benign software and adware do not disclose pri-
vate user information to a remote server, we cannot distinguish between these
components.
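A comparison of this kind amounts to cross-tabulating the ground-truth class of each BE against the detector's verdict. The sketch below shows one way to build such a table. The sample list is invented purely for illustration and does not reproduce any number from our tables, and the function name is hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def cross_tabulate(samples: Iterable[Tuple[str, bool]]) -> Counter:
    """Count BEs per (ground-truth class, behavioral verdict) pair.

    Each sample is (ground_truth_class, leaks_sensitive_data).
    """
    table = Counter()
    for truth, leaks in samples:
        table[(truth, "leak" if leaks else "no leak")] += 1
    return table


# Invented toy samples, NOT data from the study.
toy_samples = [
    ("benign", False),
    ("adware", False),
    ("adware", False),
    ("grayware", True),
    ("spyware", True),
]
table = cross_tabulate(toy_samples)
```

In such a table, benign and adware BEs both fall into the "no leak" column, which is why the behavior-based technique alone cannot tell them apart.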
Table 9. Ground truth vs. Spybot
Table 10. Ground truth vs. behavior-based
Table 11. BEs detected by behavioral analysis but not Ad-Aware
Table 12. False positives raised by behavior-based detection
4 Related Work
5 Conclusion
In this paper, we present the results of a measurement study that attempts to
quantify the extent of spyware-infected executables on the Internet. Inspired
by previous work, we crawled the web for executables that were then installed
and analyzed. In total, our experiment lasted around ten months. We crawled
over 15 million URLs and downloaded more than 35 thousand executables. An
important difference to previous work is the fact that we used three different
analysis techniques. By combining the views from different vantage points, we
were able to identify the limitations of each individual technique. In particular,
we found that Ad-Aware, the tool used for the previous study, significantly over-
estimates the severity of many samples. As a result, previous work might have
overestimated the prevalence of privacy-invasive spyware. While we did find a
non-negligible number of spyware-infested executables, the vast majority of non-
benign browser extensions were not stealing private information but displaying
annoying advertisements.
References
1. Good, N., Dhamija, R., Grossklags, J., Thaw, D., Aronowitz, S., Mulligan, D.,
Konstan, J.: Stopping Spyware at the Gate: A User Study of Privacy, Notice and
Spyware. In: Symposium On Usable Privacy and Security, SOUPS (2005)
2. Moshchuk, A., Bragin, T., Gribble, S.D., Levy, H.M.: A Crawler-based Study of
Spyware on the Web. In: Network and Distributed Systems Security Symposium,
NDSS (2006)
3. Egele, M., Kruegel, C., Kirda, E., Yin, H., Song, D.: Dynamic Spyware Analysis.
In: Usenix Annual Technical Conference (2007)
4. Kirda, E., Kruegel, C., Banks, G., Vigna, G., Kemmerer, R.: Behavior-based Spy-
ware Detection. In: Usenix Security Symposium (2006)
5. Wang, Y., Roussev, R., Verbowski, C., Johnson, A., Wu, M., Huang, Y., Kuo,
S.: Gatekeeper: Monitoring Auto-Start Extensibility Points (ASEPs) for Spyware
Management. In: Large Installation System Administration Conference (2004)
6. Hackworth, A.: Spyware. US-CERT Publication (2005)
7. Lavasoft: Ad-Aware, http://www.lavasoftusa.com/software/adaware
8. Castlecops: The CLSID / BHO List / Toolbar Master List,
http://www.castlecops.com/CLSID.html
9. Mohr, G., Stack, M., Ranitovic, I., Avery, D., Kimpton, M.: Introduction to Heritrix.
In: 4th International Web Archiving Workshop (2004)
10. Bellard, F.: QEMU, a Fast and Portable Dynamic Translator. In: Usenix Annual
Technical Conference, Freenix Track (2005)
11. Trendmicro: HijackThis,
http://www.trendsecure.com/portal/en-US/tools/security_tools/hijackthis
12. Spybot: Spybot Search & Destroy, http://www.safer-networking.org/
13. Moser, A., Kruegel, C., Kirda, E.: Exploring Multiple Execution Paths for Malware
Analysis. In: Symposium on Security and Privacy (2007)
14. Yin, H., Song, D., Egele, M., Kruegel, C., Kirda, E.: Panorama: Capturing System-
wide Information Flow for Malware Detection and Analysis. In: ACM Conference
on Computer and Communication Security, CCS (2007)
15. Christodorescu, M., Jha, S., Seshia, S., Song, D., Bryant, R.: Semantics-Aware
Malware Detection. In: Symposium on Security and Privacy (2005)
16. Wang, H., Jha, S., Ganapathy, V.: NetSpy: Automatic Generation of Spyware Sig-
natures for NIDS. In: Annual Computer Security Applications Conference, ACSAC
(2006)