
Purdue University
Purdue e-Pubs

Department of Computer Science Technical Reports
Department of Computer Science

2012

CobWeb: A System for Automated In-Network Cobbling of Web Service Traffic
Hitesh Khandelwal
Purdue University, [email protected]

Fang Hao
Bell Labs Alcatel-Lucent

Sarit Mukherjee
Bell Labs Alcatel-Lucent

Ramana Rao Kompella


Purdue University, [email protected]

T.V. Lakshman
Bell Labs Alcatel-Lucent

Report Number:
12-005

Khandelwal, Hitesh; Hao, Fang; Mukherjee, Sarit; Kompella, Ramana Rao; and Lakshman, T.V., "CobWeb: A
System for Automated In-Network Cobbling of Web Service Traffic" (2012). Department of Computer
Science Technical Reports. Paper 1755.
https://docs.lib.purdue.edu/cstech/1755

This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries.
Please contact [email protected] for additional information.
CobWeb: A System for Automated In-Network Cobbling of Web Service Traffic

Hitesh Khandelwal†, Fang Hao‡, Sarit Mukherjee‡, Ramana Kompella†, T.V. Lakshman‡
†Purdue University, ‡Bell Labs Alcatel-Lucent
ABSTRACT

We consider the problem of in-network categorization of all traffic associated with a given set of web services. While this problem can be viewed as a generalization of per-session traffic monitoring, a key difficulty is that we have to construct the entire session tree that represents the transitive closure of all traffic downloaded as a result of a user accessing a given web service. Such in-network session tree construction and monitoring is useful for many measurement, monitoring, and new types of billing services such as 'reverse billing', where usage charges are paid for by either the service provider or the ISP itself as an incentive to the user. Automated construction of the session tree based on network traffic observation is challenging and, to our knowledge, unaddressed. The challenges arise due to the complexities inherent in today's web services and the lack of universal standards that are followed when designing web services. This necessitates the use of heuristics that rely upon prevalent web service design practices. In this paper, we present a system, called CobWeb, that performs this automated in-network cobbling and monitoring of web service traffic. We evaluate the classification accuracy of CobWeb by extensive experimentation using controlled downloads and by analysis of about 100 popular web sites using large traffic traces (over 700 GB) collected at a major university's gateway. Our experiments suggest that CobWeb can achieve good accuracy with low (< 5%) false positive and negative rates.

1. INTRODUCTION

The design of network-based mechanisms for per-flow traffic measurement has been a topic of much research interest (e.g., [6, 13, 11, 21, 18]). Here, the challenges have been in designing low-cost mechanisms for collecting statistics of millions of flows or sessions at speeds of 10 Gbps and beyond. Another topic of much interest has been in-network mechanisms for application identification (e.g., [8]). Here the challenge is in the design of network-based mechanisms to identify the applications that are generating the observed traffic in the network. This identification is needed for per-application network usage reports, application-aware traffic management, service-level-agreement conformance, subscriber management and billing, etc.

In this paper, we focus on a new measurement problem, namely that of associating with a particular web service all the traffic that is generated upon each access to that web service. This entails identifying all the traffic that is due to downloads from the original host for the web service, its content delivery network (CDN), and downloads of embedded objects from third-party services (e.g., advertisements). Unlike per-flow traffic measurement, for this more general problem it is necessary to construct a "session tree" that represents the transitive closure of all web service accesses that happen as a consequence of accessing a given root web service. Note that the session tree may be quite dynamic and the internal nodes can change across users as well as across time. To our knowledge, this general measurement problem is largely unaddressed.

The motivations for this new class of measurements are largely similar to those for existing traffic measurement solutions. Today, per-flow traffic measurement and application identification are extensively used by ISPs to gain insight into their network operations. These are often done using specialized monitoring boxes currently on the market [3, 1] that provide extensive per-session, per-application or per-flow usage and performance reports. These boxes also identify traffic to various popular web sites by time-of-day, region, etc. However, these solutions operate at a coarse granularity (e.g., IP address, protocol signature); efficient solutions for traffic monitoring of more meaningful aggregates, such as the web session trees that we consider in this paper, can significantly add to the usefulness of existing monitoring and measurement equipment.

A more recent and emerging potential application is 'reverse billing', where any charges associated with accessing a web service are billed back to the web service provider rather than to the users accessing the service. Reverse billing is motivated by the growing shift from flat-rate to tiered pricing in wireless networks and, in some countries, wired networks as well. Examples include plans such as those of AT&T in the United States, which permit 250 MB of data usage for about $15 per month and 2 GB for $25. Thus, web service providers may want to make it attractive to customers by providing 'toll-free' access to their services. Another
alternative could be a subsidized service model where the ISP may provide access to web services such as ESPN, Facebook, or CNN for some nominal fee per month. For example, Vodafone already offers unlimited access to a few sites (e.g., Facebook, Twitter, FourSquare and Myspace) in every new contract in Australia.

For such applications, it is important to construct the session tree in the network, which is challenging for many reasons. Widely used web services are a complex mashup of content from several supporting services, including CDNs, third-party advertisement platforms (e.g., ads.doubleclick.com), and other third-party services (e.g., CNN web services using Facebook for friend recommendations). This makes the structure of the session tree difficult to infer. Also, with the inherent flexibility in designing web services, the session tree is rarely static. Moreover, web services are largely personalized—the content served varies with the user even for the same URI. Thus, one cannot use a unique set of URIs to identify a service. Yet another issue is that web services are usually hosted across many data centers, causing IP addresses to change based on user location. Also, since a CDN such as Akamai's can service many different web services, the use of IP addresses itself is not sufficient.

In this paper, we describe a system, CobWeb, for this general measurement problem. It automatically performs in-network cobbling of different web services—we use the term "cobbling" for the identification and measurement of all traffic associated with a given web service. To the best of our knowledge, CobWeb is the first system developed for addressing this measurement problem. We assume that CobWeb has access to both upstream and downstream traffic flows, since CobWeb is a system that observes network traffic (hence deployed at the network edge, with port mirroring used for access to traffic flows). We use online mechanisms for the cobbling done by CobWeb since we need to effectively handle web service personalization. Also, online mechanisms need less storage and raise fewer privacy concerns. However, they require more processing power, which is not a limiting factor with current multicore processors.

CobWeb works in two stages. It identifies any supporting CDN used by a web service and then identifies all the embedded objects downloaded for that web service. The total data usage consists of all the traffic that is associated with access to this service—be it from the original access, from the CDNs, or from all the related chain of accesses that are triggered by the embedded links. As pointed out earlier, designing a system that tracks all the traffic belonging to a web service with 100% accuracy is a big challenge given the lack of any uniform methodology or standards in the composition of web services. The mechanisms that we use necessarily rely on heuristics based on prevalent practices in the provision of web services.

We evaluate our system based on two web traffic traces, totaling 739 GB, collected from a large university campus network, along with traces generated in a lab-controlled environment for emulating user browsing behaviors. We take 70 web sites from the Alexa top 100 US sites, along with the top 26 popular web sites for users in a large university campus network, as the target web services. Our results show that the system can achieve an average false positive rate of 3.5% and false negative rate of 4.8% across these 96 web services.

The rest of the paper is organized as follows: We first present the problem statement precisely and discuss naive solutions that do not work well. We discuss our approach in Section 3, followed by implementation details of our system in Section 4. In Section 5, we evaluate each classification heuristic individually and then show the accuracy of the final combined classification algorithm.

2. PROBLEM STATEMENT

In this section, we start with some preliminaries about web services. We then clearly define our main objective in this paper, argue why the problem is hard, and show that simple solutions do not work well.

2.1 Web Service Preliminaries

A web browser interacts with a web server by sending HTTP requests to the server and then receiving response messages back. The most common requests are GET (for downloading content) and POST (for uploading content). The contents of a web page displayed to a user are usually downloaded via multiple GET/POST requests that are sent to one or more hosts. The page returned in response to a request to cnn.com may contain many links to other embedded objects, as shown in Figure 1(a). For example, the objects /.element/..../1pix.gif and /banner.html are fetched immediately after the main CNN web page is downloaded. Embedded objects may in turn trigger the download of many more embedded objects. For example, as shown in Figure 1(a), the GET request to /index.js on host cdn.turner.com leads to the download of many further embedded objects such as /cnn/.../btn_play.jpg. For this session, more than 150 additional requests are sent to 24 other hosts to acquire all the additional content. Of course, such numbers may change due to the dynamic nature of the content.

The hosts that provide content to a web page can be classified into three broad categories: original hosts, CDN hosts and third-party hosts. Original hosts are those that belong to the same root domain as the web service, e.g., cnn.com and money.cnn.com. CDN hosts are the servers that are part of the CDNs associated with that main domain (e.g., cdn.turner.com is the main CDN associated with CNN). Third-party hosts provide content such as advertisements, statistics collection, social networking, and so on (e.g., feeds.bbci.co.uk and www.facebook.com for CNN). These hosts may be contacted as a result of the main web page or because of embedded requests, as shown in Figure 1(a). For example, we can see more third-party requests to hosts such as ad.doubleclick.com originating after the ads.cnn.com object is fetched.
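To make the notion of embedded objects concrete, the short sketch below fetches a page and lists the URLs a browser would immediately request for its embedded resources. This is only a client-side illustration and not part of CobWeb, which works from observed network traffic rather than from parsing pages; the class and variable names are ours.

```python
# Illustration only: list embedded-object URLs referenced by a web page.
# Each URL printed here corresponds to an extra GET of the kind in Figure 1(a).
from html.parser import HTMLParser
from urllib.request import urlopen
from urllib.parse import urljoin

class EmbeddedLinkLister(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.embedded = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            self.embedded.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "link" and attrs.get("href"):      # stylesheets, icons, ...
            self.embedded.append(urljoin(self.base_url, attrs["href"]))

url = "http://www.cnn.com/"
html = urlopen(url).read().decode("utf-8", errors="replace")
parser = EmbeddedLinkLister(url)
parser.feed(html)
print(len(parser.embedded), "embedded objects, e.g.:", parser.embedded[:5])
```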
[Figure 1 omitted: (a) Constituents of a CNN session—GET requests to www.cnn.com, cdn.turner.com, ads.cnn.com, content.pulse360.com, www.facebook.com, ad.doubleclick.com, feeds.bbci.co.uk, s0.2mdn.net, etc.; (b) Defining the CNN web session—user clicks within cnn.com stay in the CNN tree, while a click to facebook.com starts a new tree.]

Figure 1: Subfigure (a) shows a subset of URLs downloaded when a user downloads the base cnn.com page. Subfigure (b) shows that clicks to third party websites are not part of the CNN session tree.

2.2 Objective

Our goal is to 'cobble' an entire session tree corresponding to a user accessing a given web service in the network (e.g., at an ISP border router). We define a web session more precisely as follows:

• All the content downloaded from the original hosts (e.g., *.cnn.com) responsible for the web service.
• All the content downloaded from CDN servers (e.g., *.cdn.turner.com) the web service uses as part of the session.
• All the embedded objects automatically downloaded for the web service from any servers, e.g., original hosts, CDN hosts or any third-party hosts.

Figure 1(b) shows an example navigation of the CNN web page. The root of the session tree starts at cnn.com, which, as discussed in Section 2.1, involves the browser automatically fetching embedded content from the original hosts, CDN hosts or third-party hosts. A user click on cnn.com/politics leads to another series of sessions to various hosts, and this is again considered part of the session tree since the click leads to a CNN webpage. When a user clicks on a link to a third-party website such as facebook.com, we consider it to be outside of the CNN session tree (as shown in the figure).

Though not shown in the example, any clicks to URLs which involve the CDN hosts or original hosts are considered part of the session tree. For example, if the user clicks on the URL cdn.turner.com/xyz.html (a contrived example), it would be considered a part of the session tree. While this example started with cnn.com, a user may directly enter the URL cnn.com/politics into the browser and make that URL the root. However, we do not consider session trees starting directly from a CDN URL as part of the web service, since CDNs are known to be shared across different web services. For example, cdn.turner.com may host content from tbs.com.

2.3 Why is the problem hard?

The key difficulty in cobbling web sessions in the network comes from the fact that routers only observe a stream of HTTP requests to various web services from a given client, and there is no obvious handle one can use to easily bind the requests that belong to a given session. Simple approaches such as enumerating domain names or IP addresses for these applications (e.g., for reverse billing) do not work well, as we shall discuss next. Anecdotal evidence suggests that ISPs today are already using these naive approaches for their applications, primarily because of the lack of a compelling alternative.

Naive Solution 1: Use Domain Names. One simple solution to this problem is to use domain names that are associated with the web service. Thus, for every web service of interest, the router may simply keep the domain names that it needs to match. While such an approach may have worked 10-15 years ago when web services were very simple (e.g., a few servers serving static HTML content), unfortunately it will not work well for today's web services due to their complex composition. Specifically, many modern websites are constructed as a mash-up of many different services, often relying on third-party websites as well. We illustrate this complexity in Figure 2, where we show the number of unique URLs and unique domains for about 96 web pages, comprising 70 of Alexa's top 100 web pages and the 26 most popular domains observed at a large university gateway. As we can observe from the figure, most of these web pages tend to access many different domains (up to 50), and the number of unique objects fetched is as high as 450.

Perhaps more importantly, many web services involve fetching objects from common domains. For example, the main cnn.com and nytimes.com web pages both include an embedded request to facebook.com. Thus, the request to Facebook needs to be classified as part of CNN or NYTimes or even just Facebook depending on the overall context, making it difficult to come up with a blanket rule for all websites. We show this phenomenon in Figure 3, where we quantify the overlap among these 96 websites. Specifically, suppose a website A resulted in a set of URLs; we compute the
fraction of URLs (and their total bytes) in this set that have the same domain name (e.g., facebook.com) as at least one URL in the URL set of B. We plot the maximum among all such fractions for each of the 96 web sites. As we can see from the figure, the overlap can be quite high; for half of the 96 websites, 30% of requests and approximately 20% of bytes overlap with at least one other website.

[Figure 2 omitted: per-service counts of unique URLs and unique top-level domains (left axis, number of unique requests) and total size in KB (right axis) for the 96 web services.]

Figure 2: Statistics for sessions of the web services

[Figure 3 omitted: CDF over the 96 web services of the maximum percentage of requests and bytes that overlap with another web service.]

Figure 3: Quantifying domain overlaps across web services

Naive Solution 2: Use IP Addresses. The next solution we consider is to enumerate IP addresses that correspond to a given web service. Similar to domain names above, IP addresses cannot easily be used for isolating web services, since the same web servers may be hosting content from different providers. For example, today many web services use content delivery networks (CDNs), and often they use the same CDN provider (e.g., Akamai) for hosting content. Thus, the same server IP addresses may be shared across different web services, making it difficult to filter out all requests specific to a given web site just by examining network-layer header fields alone. In addition, most CDNs adopt locality-based DNS resolution, and so the exact IP address(es) used can change depending on the location.

Other ideas. In certain settings, when we have cooperation from a given web service provider, we could potentially obtain a 'site map' of all the host names and URIs used in constructing the web service. In general, however, we cannot assume that such cooperation is the norm, since the kind of application (e.g., reverse billing) may not necessarily involve the particular website content provider and may depend only on an agreement between the customer and the ISP alone. Even if we obtain the site map of domains for a web service, it may not be sufficient because of the overlap between different services that we have already discussed. (For generality, in this paper, we assume no such cooperation.) We also cannot assume any client cooperation, since running a special agent on the client-side machine is too intrusive an approach.

Another possible idea is to collect and store all HTTP data corresponding to a user, and somehow reconstruct the web activity corresponding to that user accessing a given page. Unfortunately, modern websites heavily rely on CSS, javascript, etc.; parsing and interpreting these websites requires a full-fledged javascript engine on the router, making it complicated to keep up with line rates.

Thus, it is clear that simple enumeration of either domain names or IP addresses will not work well for our problem. Instead, we need a more sophisticated way to derive the association between web accesses, but not one as complex as parsing and interpreting javascript and other web pages in a detailed fashion. We discuss one such middle-ground approach that we propose next.

3. COBWEB DESIGN

In this section, we describe the design of CobWeb, a system for in-network cobbling of web service traffic. We first present an overview of our approach, and then describe the heuristics that form the basis for CobWeb. We assume that both directions of the traffic can be observed by CobWeb.

3.1 Overview

In our approach, we mainly leverage a key field within the HTTP headers, namely the 'Referer' field, which most browsers today set. This field indicates the (previous) page that referred to the current page. For example, when a user clicks on the www.cnn.com/politics URL on the www.cnn.com/US web page, the corresponding GET request will contain www.cnn.com/US as the Referer. Note that the Referer field contains both a referrer host (e.g., cnn.com) and a referrer URI (e.g., /US), if present. We leverage the Referer field to keep track of the navigation chains and identify the roots of the session trees.

While the Referer field lets one form an association between two different webpages A and B when A led to B, it is not always easy to establish whether B was the result of the user clicking on A or of an automated download. The reason why this is important is that automated downloads need to be counted as part of the actual web service, even if they are to third-party domains, i.e., non-origin domains. On the other hand, user clicks to third-party domains start a new session tree (as we discussed in Figure 1(b)). However, sometimes a user can click on an associated CDN domain (e.g., turner.com for CNN), which must be considered part of the session tree. Thus, to make this differentiation, it is important to first establish the set of CDNs for a given domain, after which we need methods to differentiate between the
embedded downloads and clicks to third-party sites. Note, however, that not all accesses to the CDN can be considered part of that particular web service, since the same CDN may host multiple web services.

Our overall approach therefore consists of two basic steps: CDN detection and embedded object detection. The CDN detection step is an offline process that involves identifying the CDN (or supporting) domains that play an important supporting role in delivering a given web service. We expect that for web services of interest, we separately track their associated CDNs (which rarely change) and incorporate them into the cobbling process. Embedded object detection, in contrast, is an online process whose goal is to identify the set of embedded objects fetched as part of a given web page, as opposed to those that are retrieved due to user clicks. The key metric used here to distinguish between the two types of retrievals is that embedded object retrieval has much less 'think time' than user-click based retrieval, since the embedded objects are automatically fetched by the web browser. We also use the fact that some embedded objects have standard file-types such as javascript (extensions .js, .json) that are not usually associated with objects retrieved by user clicks.

Given the importance of the Referer field in our approach, one could argue that it is easy to disable the Referer field, since many modern browsers provide the appropriate settings. In most modern browsers, however, the Referer field is turned on by default; very few people even bother to turn it off (or are savvy enough to do so). If indeed the Referer field were turned off completely by a majority of users, the whole multi-billion dollar Internet advertisement industry would crumble, since it relies heavily on the Referer field for tracking the source of clicks. Thus, companies such as Google and Microsoft have an incentive to keep the Referer field on in the Chrome and Internet Explorer browsers to support their online advertising businesses.

In the next few subsections, we first discuss each of the heuristics in detail and then present the overall algorithm.

3.2 CDN Detection

The goal of CDN detection is to identify supporting CDNs (if any) for a web service. In the example of Figure 1(a), the focus for this would be the left portion of the tree, i.e., requests with host *.cdn.turner.com that belong to the CDN for CNN. One method for CDN detection is to monitor web request traffic at a network edge router (e.g., a campus gateway or ISP border router), and use it to identify the domain that has delivered the most traffic to clients when the pages are downloaded. The main purpose of a CDN is to make sure that static and relatively infrequently changing parts of web pages (such as javascript objects and some images) are replicated and placed close to the clients. (Note that this is not to say that frequently changing parts are never part of CDNs, but generally CDNs store more static objects than dynamically changing ones.) Hence, it is reasonable to assume that most traffic for a given web site using the CDN will come from the CDN (apart from the main domain itself).

Implementing this idea is not straightforward, since we still need to identify, from the monitored traffic, the total traffic to a web site (which is the cobbling problem that we started with). We first focus on the portion of the traffic that can be clearly identified—the HTTP GET/POST requests whose referer URL belongs to the root domain. For example, to detect the CDN for cnn.com, we first look at all requests whose referer host is cnn.com or a sub-domain of cnn.com such as money.cnn.com. This is the traffic for downloading the embedded pages of the root domain (e.g., embedded content for the main cnn.com web page) and the traffic for accessing pages when the user navigates away from the page (e.g., the user clicks on nytimes.com on the cnn.com page). Note that in the second case, only the first GET or POST request has cnn.com as the referer. The rest of the requests have nytimes.com or subsequent pages as the referer. As long as the traffic for accessing the external web sites, in the second case, does not exceed the traffic for accessing the CDN for cnn.com, it will not affect the result of CDN detection. Since users in general are likely to navigate to various different pages within the origin or CDN domains, the latter traffic should not be an issue in practice.

Another issue is that web services may use multiple CDNs. For example, cnn.com uses both Level 3 and Akamai CDNs. Fortunately, in many cases, we find that the CDN host contained in HTTP requests is an alias of the canonical name (CNAME) of the actual server. For example, cnn.com uses i.cdn.turner.com and z.cdn.turner.com for different types of content. i.cdn.turner.com is an alias for CNAME cdn.cnn.com.c.footprint.net and is owned by Level 3; z.cdn.turner.com is an alias for CNAME z.cdn.turner.com.edgesuite.net and is owned by Akamai. This kind of aliasing is convenient for the web service provider since it allows web pages not to be tied to any particular CDN provider. A web site can switch to other CDNs by simply mapping the alias to other CNAMEs. Given the structure of the CDN aliases, we can detect the "CDN domain" (e.g., cdn.turner.com) instead of the specific CDN host (e.g., i.cdn.turner.com). Our algorithm looks at different levels of the host domain and tries to aggregate them. For example, the level-1 (top-level) domain of i.cdn.turner.com is com, its level-2 domain is turner.com, and so on.

CDN Detection Algorithm. We start with traffic monitored at the network edge router. The algorithm (shown in Figure 4) starts by looking for all the HTTP GET requests that contain the main host domain as the referer (e.g., cnn.com or ads.cnn.com or money.cnn.com for CNN). Let this total number be n. We also compute the breakdown of these requests individually to each and every host h (denoted by n_h). For example, if x and y GET requests with referer cnn.com were made to disqus.com and cdn.turner.com, respectively, we denote n_disqus = x and n_cdn.turner = y. Note that all counts are in number of bytes. Out of all hosts h, we pick the host
with the maximum portion of traffic (n_h / n) as the top host. Suppose we identify i.cdn.turner.com as the top host in this step; we then try the next aggregated level of the top host domain, cdn.turner.com, by repeating the counting procedure for level-3 domains. We can continue this procedure for further aggregated levels to get the top domain at each level. Note that we need to avoid trivial levels such as .com or expanded levels such as .co.uk, and so we stop at level 3. After getting the list of the top domains H_l and their traffic rates r_l for each level l = 3, 4, ..., we use the following heuristic to decide which level of the top domain to use:

Starting from the lowest level (most aggregated) l = 3, we select the top level-3 domain as the CDN domain if the difference between the proportion of traffic for the top level-3 domain and the top level-4 domain is above a pre-set threshold t (r_3 - r_4 > t). In our system we choose t to be 5%. Otherwise we do not use the level-3 domain, and check the next level l = 4. We select level 4 if r_4 - r_5 > t. We continue this process until either we find a level l such that r_l - r_{l+1} > t or l is the full host domain. We then select the top level-l domain as the CDN domain.

In the CNN example, the top level-3 and level-4 domains are cdn.turner.com and i.cdn.turner.com, respectively. We start from the level-3 domain cdn.turner.com and check if the traffic rate difference between cdn.turner.com and i.cdn.turner.com is more than 5% of all traffic with referer field cnn.com. This turns out to be true, so we choose cdn.turner.com as the CDN domain. Intuitively, when we choose a level-l CDN domain, we are essentially combining traffic from all level-(l+1) CDN domains. We choose to use an aggregated level of CDN domain only if such aggregation makes a difference, e.g., if the aggregated traffic grows by 5%.

req: HTTP GET or POST request
host(req): host name in request req
rhost(req): referer host name in request req
root: root domain
l: host domain level
n: number of bytes over all sessions s.t. rhost(req) = root
dom_l(h): level-l domain of host h
n_{h,l}: number of bytes over all sessions s.t. rhost(req) = root and dom_l(host(req)) = dom_l(h)
b_i: number of bytes for session i

for each session i s.t. rhost(req) = root
BEGIN
  n += b_i
  n_{host(req),l} += b_i
END
for each host h
BEGIN
  r_{h,l} = n_{h,l} / n
END
the dom_l(h) with the maximum r_{h,l} is the top level-l CDN domain

Figure 4: Detecting the top level-l CDN domain
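The sketch below restates Figure 4 together with the level-selection rule in runnable form, assuming requests are available as (host, referer host, response bytes) tuples. The function names (detect_cdn_domain, level_domain, in_domain), the cap on aggregation levels, and the restriction of the ratios to non-origin traffic are our own illustrative simplifications, not CobWeb's implementation.

```python
from collections import defaultdict

def level_domain(host, level):
    # level-1 = top-level domain; public-suffix handling (.co.uk, ...) is ignored,
    # which is why the paper starts aggregation at level 3.
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-level:]) if level < len(labels) else host.lower()

def in_domain(host, root):
    host, root = host.lower(), root.lower()
    return host == root or host.endswith("." + root)

def detect_cdn_domain(requests, root, t=0.05, max_level=6):
    """requests: iterable of (host, referer_host, resp_bytes) tuples.
    Returns the detected CDN (supporting) domain for `root`, or None."""
    per_level = defaultdict(lambda: defaultdict(int))   # level -> domain -> bytes
    total = 0
    for host, rhost, nbytes in requests:
        # Only traffic referred by the root domain, excluding the origin hosts.
        if rhost and in_domain(rhost, root) and not in_domain(host, root):
            total += nbytes
            for lvl in range(3, max_level + 1):
                per_level[lvl][level_domain(host, lvl)] += nbytes
    if total == 0:
        return None                                      # no supporting domain found
    # Top domain (and its byte count) at each aggregation level.
    top = {lvl: max(counts.items(), key=lambda kv: kv[1])
           for lvl, counts in per_level.items()}
    for lvl in range(3, max_level):
        r_l = top[lvl][1] / total
        r_next = top[lvl + 1][1] / total
        if r_l - r_next > t:        # aggregating to level l makes a >t difference
            return top[lvl][0]
    return top[max_level][0]

# Toy example: most cnn.com-referred bytes go to hosts under cdn.turner.com.
reqs = [("i.cdn.turner.com", "cnn.com", 700),
        ("z.cdn.turner.com", "cnn.com", 200),
        ("disqus.com", "cnn.com", 100)]
print(detect_cdn_domain(reqs, "cnn.com"))   # cdn.turner.com
```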
Special cases. For less popular web sites, it is possible that they do not use any CDN services. One of the following may happen in such cases: (1) Most traffic comes from the origin domain, so that no other traffic passes the threshold. As a result, no CDN is detected. This is fine since detection based on the origin domain already covers the majority of traffic. (2) The web site itself does not have much content, and most users who access it navigate to another web site. This is a corner case where the web site most likely does not provide any useful service. We ignore this case as not being of practical interest.

A more important case that we need to handle involves a web site that heavily uses services from some third-party sites. For example, we found that reddit.com is a popular site (in the Alexa top 50 sites in the US) that relies on imgur.com for hosting images, although imgur.com is neither a CDN nor a main supporting domain owned by reddit.com. Given that our algorithm selects the most heavily referred site as the CDN, it will end up picking imgur.com as the CDN for reddit.com, which is not true even though its presence may be vital to the particular web service. However, we cannot allow clicks to imgur.com beyond the embedded requests as belonging to the reddit.com service. In that sense, imgur.com should be considered similar to a third-party host, where embedded object accesses from the origin web pages are considered part of the web service in question while user clicks are not.

One other issue is that multiple web sites may claim the same CDN as their own CDN. For example, both cnn.com and adultswim.com use cdn.turner.com as the CDN. This is acceptable for our method, since we can trace back to the origin domain through the chain of referer fields and separate out the requests originating at different origin domains.

3.3 Embedded Object Detection

Besides the traffic from the origin domain and CDN (or main supporting domain), there is also traffic for downloading embedded content from third-party web sites. This includes objects such as www.facebook.com/extern/login... and ad.doubleclick.com/adi/... in the example shown in Figure 1(a). In this section, we investigate two methods for detecting requests that fetch such objects—one based on file-type extensions and the other based on timing.

Classification based on file-types: Our first observation is that certain file-types are almost always embedded. This includes css, js, swf, ico, json, and xml. The download of such files is always triggered automatically by the download of another (embedding) web page, since such files are not useful on their own. Our preliminary inspection of a recent one-hour trace collected at the large university campus network gateway shows that 15% of all HTTP traffic in terms of requests and 11.3% in terms of bytes is for such files. In terms of percentage, they cover a smaller portion of traffic
than the CDN, but still a significant volume, especially for services that use these embedded objects more frequently. In addition, the almost-certainly-embedded nature of these objects makes this an accurate classification rule that is also easy to implement. It also helps in cross-checking with the other heuristics, as we discuss next. While it appears that relying on the file-type extension may allow gaming the system (e.g., by simply renaming other content with these file-types), a user can only cheat if he can collude with the web-service provider, since the URLs are managed by the service owner. Such a case is therefore highly unlikely in practice; we discuss cheating and collusion further in Section 6.

Classification based on timing: Our second observation is that embedded objects are downloaded shortly after their referer page (called the base page). The time interval between the two downloads should typically be shorter than the time it takes for a user to browse a given web page and then click on a URL in the page to navigate to a different page. For convenience, we define the time interval between downloading the base page and the embedded links as think time. More precisely, we define think time T_think = T_G - T_R, where T_G is the time at which the GET/POST request for the embedded page is sent, and T_R is the time instant at which the last response packet of the corresponding referer page arrived. For example, the think time for URL cdn.turner.com/index.jpeg in Figure 1(a) is the interval between the time when the last response packet for its referer cnn.com is received and the time when the GET request for cdn.turner.com/index.jpeg is sent. Note that it is possible to have T_think < 0, since browsers can start downloading an embedded URL even before finishing the download of the entire base page. Intuitively, think time is the time it takes for the browser to process the web page, extract any embedded URLs, and then send subsequent requests to download these embedded objects.

Naive timing heuristic: We may use the following naive heuristic for detecting embedded objects: a session is classified as an embedded URL download if T_think < T_thresh, where T_thresh is a timing threshold, e.g., 1 second. Unlike the file-type heuristic, the timing heuristic can generate both false negatives (missing requests that are part of the target web service) and false positives (including requests after the user navigates to third-party web pages), depending on the value of T_thresh. To understand how practical the timing heuristic is, we take advantage of the file-type heuristic described in the previous section.

Results from a real packet trace. We calculate the think time for all embedded file downloads identified by the file-type heuristic in a full HTTP trace we collected at the large university gateway. Figure 5 shows the think time distribution. We observe that think time varies across a wide range. Although about 60% of think times fall below 1 second, 10% of think times are above 10 seconds. If we naively set T_thresh to 10s, we will be able to capture almost 90% of all the embedded objects. But this has the negative effect of increasing the false positives, since 10 seconds is sufficient time for a user to click on a third-party link, which will then be classified as an embedded object. Clearly, it is not supposed to be counted as part of this web service, but will be because of the relatively high value of T_thresh. If we set T_thresh to, say, 100ms or even 1s, we will miss a large fraction of embedded objects (80% with T_thresh = 100ms and 50% with T_thresh = 1s). It is clear that finding a fixed threshold that will work for a large number of web services is not easy. Next, we discuss how to improve this further.

[Figure 5 omitted: CDF of think time (msec, log scale) for the naive and refined timing heuristics.]

Figure 5: Think time distribution for embedded object downloads with both naive and refined heuristics.

[Figure 6 omitted: think time (msec) versus GET request index for embedded objects with referers cnn.com/, turner.com/common.css and turner.com/main.css.]

Figure 6: Think time for multiple embedded objects with the same referer

Refined timing heuristic: To better understand why the browser think times are sometimes exceedingly long, we inspect the time sequence of web page downloads more carefully. Figure 6 shows the timing for downloading multiple embedded URLs following the referer URL cnn.com/XXX. The X-axis shows the index of the URLs, sorted according to the time when the GET/POST request is sent. URL 0 is the referer URL and the other URLs are all embedded URLs. The Y-axis shows the think time for each URL. We observe that the think time increases almost linearly, though the requests towards the end are spaced farther apart than at the beginning. The increased spacing is because third-party embedded objects or advertisements are among the last to be requested, and they take more time to load as well. Except for this artifact, think times seem to accumulate over consecutive embedded URL downloads. The reason is that
the browser typically processes each web page sequentially, and instead of sending out requests for all embedded URLs at the same time, the requests are spaced out over time. Browsers also restrict the number of parallel downloads, so downloads tend to be somewhat sequential.

Examination of the figure further reveals that the time offset at which a GET request for an embedded object is made, relative to the request for the base page, is proportional to the number of GET requests for embedded object downloads. This is why choosing one fixed threshold is hard. Notice, however, that the gap between two adjacent requests is more constant and predictable than the time difference between the original page and the embedded objects. Hence, we propose the following refined timing heuristic: For each referer page R, maintain a time T_A as the "latest activity time". The activity can be either the last response packet being received for this referer page (T_R), or a GET request being sent for an embedded URL of this referer page (T_G). When a new request is sent with referer R at time T_G', we check if T_G' - T_A < T_th, where T_th is a chosen threshold. If the condition holds, we classify this request as an embedded URL for R and also update T_A = T_G'.

The think time distribution under the refined timing heuristic is shown in Figure 5. Most adjacent requests (almost 90%) are within about 100-500ms, which is much less than human think time. Of course, if the chain of requests is long, the chance of false positives increases, since a user may click on some link. However, the chance of a user clicking on a link before the page completely loads is quite small, and this approach therefore works for almost all practical cases. The problem with the naive timing heuristic was that it tried to choose one threshold for all web sites, whereas the refined heuristic adapts to the number of objects and to the time taken to download all the previous objects.

Note that the tail still contains a small percentage of requests that were sent almost 1000s (about 16.67 minutes) after the previous request. While this may seem like a user click, our file-type heuristic indicates otherwise. On further investigation, we found that the browser was requesting the same embedded object several times (with the same referer). This happens every so often, as in an auto-refresh. If the browser refreshes certain objects automatically after a long time, it could be difficult for us to correctly classify these refreshes as embedded requests. However, for the refresh requests present in the trace, we observed that the file type was almost always of the embedded variety. So, our file-type heuristic would have correctly flagged them as embedded.

3.4 Overall Algorithm

The cobble tree construction algorithm combines the multiple heuristics discussed above: CDN detection, and embedded object detection based on both file-types and refined timing. To combine them, we run them as two separate procedures. In the first procedure, we detect the CDNs or main supporting domains for each target web site by using the CDN detection algorithm described in Section 3.2. Note that the administrator can choose to include multiple CDNs or supporting domains here, based on the few top domains flagged by the algorithm. In the second procedure, we use the detected CDN domains along with the file-type heuristic and the refined timing heuristic to classify the traffic. Given a list of target origin domains, the goal of the algorithm is to classify each connection either as belonging to one of the target domains or as NULL when it does not belong to any target domain.

Algorithm:
for each request req
BEGIN
  if host(req) ∈ Origin
    root(url(req)) = Origin
  else if (host(req) ∈ CDN and root(referer(req)) ∈ Origin)
    root(url(req)) = Origin
  else if (url(req) is an embedded file-type and root(referer(req)) ∈ Origin)
    root(url(req)) = Origin
  else if (req passes the refined timing test and root(referer(req)) ∈ Origin)
    root(url(req)) = Origin
  else
    root(url(req)) = NULL
END

Figure 7: Overall classification algorithm for one domain

We classify each HTTP session based on the request message that the client sends to the server. Figure 7 shows the procedure for classifying a request. Each URL is associated with a "root" domain, which can be either NULL or one of the target domains. During processing, the heuristics are applied according to the specified precedence. Note that although conceptually we are isolating the session tree for each target origin domain, we do not need to maintain one unified data structure for the entire tree. Instead, we can just maintain the root for each URL so that we know which tree this URL belongs to.

The precedence rules in the algorithm in Figure 7 are intuitive. If the host belongs to the origin domain, or to the CDN domain provided that the referer's root belongs to the origin domain, then the request belongs to the origin domain. Then, we perform the file-type and timing checks, coupled with whether the referer's root belongs to the origin domain. Thus, if there is a false positive in the timing heuristic (i.e., a GET request was sent to a third-party host as a result of a user click and was not automatically fetched by the browser, but was misclassified by the timing heuristic), the overall algorithm is robust enough to stop further misclassification. For example, while the CNN page is loading, suppose a user clicks on some third-party link, say facebook.com, on the CNN page. Because the page is still being loaded, this user click may be inadvertently classified as an embedded object by the
timing heuristic. The algorithm will set root(facebook.com) to the origin domain (CNN). Further accesses to links from this third-party page, however, will not match any of the rules, because their referer would be facebook.com. For this one false positive to cascade into including an entire browsing tree, the user must repeatedly click on one link after another while the page is loading, with virtually no think time. We did not encounter such a scenario in our trace.
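To summarize Sections 3.3 and 3.4 in one place, the sketch below combines a per-referer "latest activity" tracker for the refined timing test with the precedence rules of Figure 7. It is a simplified illustration rather than CobWeb's C++ implementation; the names, the threshold value and the request representation are assumptions made for the example.

```python
# Illustrative sketch only: refined timing test (Section 3.3) plus the
# precedence rules of Figure 7, for a single target origin domain.

T_TH = 0.5  # seconds; assumed per-gap threshold for the refined timing test
EMBEDDED_EXTS = (".css", ".js", ".json", ".swf", ".ico", ".xml")

def in_domain(host, domain):
    host, domain = host.lower(), domain.lower()
    return host == domain or host.endswith("." + domain)

class RefinedTimingTest:
    """Tracks T_A, the latest activity time, per referer page."""
    def __init__(self):
        self.latest_activity = {}            # referer URL -> T_A (seconds)

    def note_response_packet(self, referer_url, t):
        self.latest_activity[referer_url] = t

    def passes(self, referer_url, t_get):
        t_a = self.latest_activity.get(referer_url)
        if t_a is None or t_get - t_a >= T_TH:
            return False                     # long gap: looks like a user click
        self.latest_activity[referer_url] = t_get   # embedded: extend the chain
        return True

def classify_request(req, origin, cdn_domains, root_of, timing):
    """Assign req to `origin`'s session tree or to NULL (None).
    req: dict with 'url', 'host', 'referer_url', 'time'."""
    referer_in_tree = (req["referer_url"] is not None and
                       root_of.get(req["referer_url"]) == origin)
    if in_domain(req["host"], origin):
        root = origin                        # origin host
    elif referer_in_tree and any(in_domain(req["host"], c) for c in cdn_domains):
        root = origin                        # CDN host referred from the tree
    elif referer_in_tree and req["url"].lower().endswith(EMBEDDED_EXTS):
        root = origin                        # almost-certainly-embedded file type
    elif referer_in_tree and timing.passes(req["referer_url"], req["time"]):
        root = origin                        # refined timing heuristic
    else:
        root = None                          # not part of this session tree
    root_of[req["url"]] = root
    return root
```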
The algorithm also handles URL shorteners flawlessly. A URL shortener redirects a short URL to the actual long URL using HTTP's 301 return code. When a browser fetches the long URL, the referer field remains NULL, and so the cobble tree for the long URL can be formed in its entirety as if the redirect never happened. In a similar fashion, near domain names (e.g., nyt.com and nytimes.com) can also be handled, since usually the shorter name redirects the browser to the longer one. The algorithm can be made even more robust by filtering out the links to embedded objects by parsing the content within the HTTP response. This not only helps reduce the number of false classifications of URLs, but also guards against auto-refresh and other fraudulent activities trying to circumvent the cobbling process. We, however, chose not to implement that in our current algorithm due to the large overhead involved in storing, uncompressing and parsing the content within an HTTP response.

4. IMPLEMENTATION

In this section, we describe the implementation of the CobWeb system. The system is designed to operate on any network edge router with access to bi-directional traffic. It can either run directly in a gateway as a service blade or as a stand-alone server that is directly linked to gateways via port mirroring. The system needs configuration information such as the target web domains, their CDN names, timing thresholds and so on; it generates periodic reports about the cobbled session trees as output.

The CobWeb system consists of a standard off-the-shelf packet sniffer (e.g., libpcap) and a simple HTTP parser that allows us to extract various fields within HTTP packets. Our system can operate on live packet captures or, in passive mode, on pcap traces. It maintains some minimal state regarding each HTTP session, and tracks requests and corresponding responses. It also stores some timing information to implement the embedded object detection heuristic.

We implement the CobWeb system in C++. Our total source code consists of 4,460 SLOC (Source Lines of Code). All our experiments were conducted on an Endace NinjaBox network monitoring appliance [2]. Internally, it consists of a 2.5 GHz 8-core Intel Xeon E5420 processor with a total memory of 16 GB, running the Linux operating system with a 10 Gbps capture card. The current version of CobWeb is based on a single-process model, but it can easily be parallelized to take advantage of multi-core architectures. In our evaluation, we found that our unoptimized system can keep up with a 10 Gbps line rate. Specifically, we found that the system took 2 hours 10 minutes to process a 5-hour trace, suggesting that it can easily keep up with line rates. The scalability is mainly because it processes less than 0.1% of the actual data.

5. EXPERIMENTAL EVALUATION

In this section, we evaluate the performance of the CobWeb system. Our evaluation mainly focuses on measuring classification accuracy. We first explain the basic methodology for measuring the accuracy of our system, and then show results for each component of the algorithm. We also show some performance benchmarks for our unoptimized system prototype.

5.1 Evaluation Methodology

We use mainly two metrics to evaluate the efficacy of any classification algorithm—false negatives and false positives. False negatives occur if the algorithm misses part of the target web site's traffic. False positives occur when the algorithm inadvertently includes traffic from web sites that are not part of the target web site's session tree. In the context of reverse billing, false negatives cause over-billing for the user and false positives cause loss of revenue for the service provider; it is therefore important to minimize both so that the traffic accounting can be accurate.

Packet trace and ground truth. In order to evaluate the web service cobbling algorithm, we take 70 out of the top 100 US web sites listed in Alexa, along with the top 26 popular web sites for users in the campus network. Note that we exclude the sites that heavily rely on HTTPS (e.g., gmail.com), since our algorithm is not applicable there. We also exclude the sites that require user login, since such sessions cannot be emulated by manual download and hence it is very difficult to obtain ground truth. These 96 sites cover a wide range of service categories such as shopping, news, social networking, searching and so on. They also tend to use sophisticated web technologies, and hence are good for testing the effectiveness and robustness of the algorithms.

We have collected two packet traces from a 10 Gbps large-university network gateway link that connects the campus to the rest of the Internet, both listed in Table 1. Note that 512 GB is the size of the compressed Field2012 trace.

Trace      Size    Date           Duration
Field2011  227 GB  July 29, 2011  1 hour
Field2012  512 GB  Jan. 19, 2012  5 hours
Manual     108 MB  Jan. 19, 2012  96 min total

Table 1: Traces used in evaluation

Real traces are very useful for understanding how users navigate through the web sites and what kinds of false positives and false negatives we can encounter when we deploy our CobWeb system in the field. However, such traces are not sufficient by themselves, since we do not have access to the ground truth. Obtaining ground truth from traces in the field is challenging, since there is no easy way to distinguish between user clicks and embedded downloads—a key
From From Others knowledge of user browsing behavior. For example, concur-
CDN domain Origin domain Main Self Others
2mdn.net doubleclick.net 0.247 0.149 0.604 rent or overlapping HTTP sessions may be caused by down-
gstatic google 0.854 0.006 0.140 loading embedded content on the same web page, or caused
images-amazon amazon 0.523 0.207 0.269 by user opening multiple tabs in the browser, or caused by
imgur reddit 0.749 0.107 0.144
imwx weather 0.373 0.263 0.364
user quickly clicking through multiple URLs.
mzstatic apple 0.919 0.001 0.080 We address this problem by combining campus trace with
turner    cnn       0.428   0.215   0.357
twimg     twitter   0.489   0.011   0.500
yimg      yahoo     0.544   0.105   0.351

Table 2: CDN traffic based on referrer domains

quirement for measuring false positives and false negatives. Normally, one would assume that we could simply parse all the links from the base page and identify all the clickable links in the web page. If we then see a GET request from the web page for one of the clickable links, we should be able to assume that it is a user click. The difficulty is identifying the clickable links in the base page. Modern web sites rely heavily on JavaScript, and it is not easy to just search for specific patterns such as "<a href=.* >" as in traditional plain HTML web pages. Indeed, we tried searching for such URLs and found very few links in the entire trace that some user actually clicked on.

Thus, in order to make sure we evaluate our algorithm against credible ground truth, we also simulate a real user clicking on the 96 web sites and collect these traces. We use a web browser to open each of the main web pages and wait for 1 minute to record the traffic trace for each session. Hence we obtain 96 manual traces, also listed in Table 1. Given no interference from any other source, these packets constitute pure ground truth. This of course gives us a limited evaluation, since we cannot completely capture all the subtleties and complexities that are prevalent in the field. The combined evaluation using both manually clicked ground truth and real packet traces covers more cases, leading to a more thorough evaluation than either one can achieve individually. (Note that, with virtually no prior papers in this space, there are no clearly established metrics, methodologies, or data sources that we can directly use for evaluation.)

False negatives. To evaluate false negatives, we mainly rely on manual downloads in a controlled environment. This ensures that we have access to credible ground truth. We run the algorithm on the manual traces and compute the false negative rate, the proportion of traffic that is in the manual trace but is missed by the algorithm. For convenience, we also use its complement, the true positive rate, in our discussion.

False positives. Unlike false negatives, it is difficult to evaluate false positives using manually downloaded web pages, because by construction we are not injecting any interference into the user click emulation (hence, every GET request is part of the ground truth). Here is where the real campus traces prove beneficial. The issue with the real trace, however, is that it is difficult to reliably isolate traffic between different web sites even with manual inspection, since we do not have any ground truth for it. We therefore estimate false positives by comparing field sessions against controlled browsing as follows. First, we use our cobble algorithm to generate the cobbled tree for each user browsing session of each target web site in the campus trace. Recall that the root of each session tree is the starting point of each browsing session for a web page. The leaves of the tree include both the embedded objects that are automatically fetched from any server, and user navigations to other pages within the same web site (as in Figure 1(b)). In order to find out the accuracy of this tree, we use the manual trace as the ground truth (we call this the ground tree) and compare it against the cobbled tree. However, one issue is that it is not easy to emulate the user opening other pages within the web site; for ease of evaluation, therefore, we only consider root page downloads (called the pruned cobbled tree).

In the ideal case, the two trees, namely the ground tree and the pruned cobbled tree, should overlap perfectly; branches of the pruned cobbled tree that are missing from the ground tree imply false positives. However, there is one additional complexity we need to grapple with: many web sites have very dynamic content that changes for almost every download. This can cause the trees from two different sessions to differ and introduce "noise" into the comparison result. Such content is typically advertisements that are randomly selected or customized according to the user's profile or browsing history. For example, when we make two consecutive downloads of cnn.com, the downloads generate 150 and 148 GET requests, respectively. Among the 150 requests generated in the first download, 33 requests do not appear in the second download, with 25 being ads. If we computed false positives naively, several extraneous sessions not part of the ground tree would appear in the cobbled tree, even though this would not be an accurate classification.

Hence we mark a node as a false positive if the following two conditions hold: (1) it is in the pruned cobbled tree, but not in the ground tree; and (2) its URL belongs to a third-party non-ads domain. If a node is marked as a false positive, then the entire sub-tree below this node is also marked as false positive.

Intuitively, false positives can be classified into two categories: (a) the user clicking on a third-party non-ads link; and (b) the user clicking on an ads link. Obviously, case (a) is covered by the above heuristic. For case (b), when a user clicks on an ads link to navigate to a third-party web site, e.g., clicking an ad for nytimes.com served by doubleclick.net on a cnn.com page, although the first download may contain the URL of the ads link (and hence is not covered by the above heuristic), the majority of the traffic will be for downloading pages on nytimes.com along with their embedded objects, which is covered by the heuristic. As a result, we can capture all false positives involving non-ads links and the majority of false positive traffic for ads.
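To make the marking rule concrete, the sketch below shows one possible implementation of this false-positive marking over a pruned cobbled tree. It is illustrative only, written under assumed data structures: the Node class, the is_ads_domain helper, and the field names are hypothetical stand-ins for whatever representation the cobbled and ground trees actually use.

# Illustrative sketch of the false-positive marking rule described above.
# Node, is_ads_domain(), and the url/domain fields are hypothetical.

class Node:
    def __init__(self, url, domain, children=None):
        self.url = url
        self.domain = domain
        self.children = children or []

def is_ads_domain(domain):
    # Placeholder: in practice this would consult an ads-domain list.
    return domain in {"doubleclick.net", "2mdn.net"}

def mark_false_positives(node, ground_urls, target_domains, marked):
    """Mark a node (and its whole sub-tree) as false positive if it is in the
    pruned cobbled tree but not in the ground tree, and its URL belongs to a
    third-party non-ads domain."""
    in_ground = node.url in ground_urls
    third_party = node.domain not in target_domains
    if not in_ground and third_party and not is_ads_domain(node.domain):
        mark_subtree(node, marked)
        return
    for child in node.children:
        mark_false_positives(child, ground_urls, target_domains, marked)

def mark_subtree(node, marked):
    marked.add(node.url)
    for child in node.children:
        mark_subtree(child, marked)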
5.2 CDN Detection

In order to test the CDN detection heuristic, we first run the CDN detection algorithm on the Field2011 trace and then verify the detection results using tools such as whois, dig, and Google search. Table 2 shows an example of the detection results. CDNs for most web sites are correctly detected, except for reddit.com, for which imgur.com was mistakenly identified as the CDN or main supporting domain. The reason was that many reddit users were browsing imgur image links during the measurement period.

Table 2 also shows the traffic break-down for each CDN (or main supporting) domain (for a subset of web sites, for space reasons). Traffic for each CDN domain can be referred by either the main domain, the CDN domain itself, or other domains. We observe that although in most cases the majority of traffic for a CDN is referred by its main domain, significant fractions of traffic for many CDNs are referred by external web sites. This confirms our intuition that classification based on host domains alone is not sufficient. For example, even though 2mdn.net is the main supporting domain for doubleclick.net, the majority of its traffic is referred by external web sites to show ads, and hence should be classified as part of the corresponding external web site traffic. The same can be said for twitter.com's supporting domain twimg.com.

We further run the CDN detection algorithm on the Field2012 trace. For all 96 top sites, we find that the algorithm correctly identifies CDNs for 89 sites (92.7% accuracy). In the other cases, another third-party site is incorrectly marked as the CDN, since there is no separate CDN domain for the web site and the third-party domain exceeds the 5% threshold. There are also two web sites that use more than one CDN domain, so the algorithm just picks the top one.

Once the CDN or main supporting domain is determined for the target web site, we can apply the CDN heuristic to classify traffic. In order to find out the proportion of traffic classified by the CDN heuristic, we construct the cobbled tree as follows: starting from the target root domain with an empty Referer field, we include an HTTP request in the tree if its host belongs to the root domain or CDN domain and its referrer also belongs to the tree. Here we only evaluate the true positive rate, since there are no false positives for the CDN heuristic. We use the 96 manual traces for this evaluation. Figure 8(a) (the green line) shows the proportion of traffic correctly identified by the CDN heuristic. We observe that the true positive rate varies across a wide range, between 5.09% and 100%. The average true positive rate over the 96 sites is 75.68%. This indicates that the CDN heuristic is very effective for many web sites, but if used as the only heuristic, it is not robust enough to classify traffic for all web sites.

Figure 8: Proportion of traffic correctly classified by individual heuristics and the overall algorithm. Webservices are sorted based on the true positive rate for the overall algorithm. (a) True positives for the CDN and file-type heuristics; (b) true positives for the refined timing heuristic.

5.3 File-Type Heuristic

We next investigate how much the file-type heuristic can further improve the classification result. We construct the cobbled tree in a similar way as before, and in addition, include a session if the URI is for an embedded file type and its referrer host belongs to the tree. Similar to the CDN heuristic, the file-type heuristic does not generate false positives, and we use the 96 manual traces for evaluation. Figure 8(a) (the blue line) shows the proportion of traffic classified correctly by combining the file-type heuristic with the CDN heuristic. Note that such embedded files can be downloaded from either the CDN or a third party; as a result, the two detection methods have a certain overlap in their classification results. We observe that the effectiveness of the file-type heuristic varies across web sites. On average, the method with combined heuristics correctly classifies 83.75% of the total traffic, an 8.08% increase over the CDN heuristic alone. In other words, the file-type heuristic classifies an additional 8.08% of the traffic, namely embedded file downloads from third-party web sites.
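As a concrete illustration of the two inclusion rules used in Sections 5.2 and 5.3, the sketch below builds the set of accepted hosts from a time-ordered request list. The request fields and the EMBEDDED_TYPES list are simplified assumptions; the real system tracks individual requests rather than just hosts, so that whole sub-trees can be attributed, but the acceptance conditions are the same two checks.

# Illustrative sketch of the CDN and file-type inclusion rules described in
# Sections 5.2 and 5.3. Request fields and EMBEDDED_TYPES are hypothetical.

EMBEDDED_TYPES = (".js", ".css", ".png", ".jpg", ".gif", ".ico")

def build_cobbled_tree(requests, root_domain, cdn_domain):
    """requests: time-ordered dicts with 'host', 'uri', and 'referer_host'.
    Returns the set of hosts accepted into the cobbled tree."""
    tree_hosts = set()
    for req in requests:
        # Root of the tree: a request to the target domain with no Referer.
        if req["host"].endswith(root_domain) and not req["referer_host"]:
            tree_hosts.add(req["host"])
            continue
        # Only consider requests whose referrer is already in the tree.
        if req["referer_host"] not in tree_hosts:
            continue
        in_known_domain = (req["host"].endswith(root_domain)
                           or req["host"].endswith(cdn_domain))    # CDN heuristic
        embedded = req["uri"].lower().endswith(EMBEDDED_TYPES)     # file-type heuristic
        if in_known_domain or embedded:
            tree_hosts.add(req["host"])
    return tree_hosts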
5.4 Refined Timing Heuristic

To evaluate the effectiveness of the refined timing heuristic, we construct the cobbled tree using this heuristic, and use the method discussed in Section 5.1 to evaluate mainly the true positive rate. We use a Tthresh value of 500ms for these results. Since this heuristic is only part of our overall algorithm, we later study the sensitivity of the Tthresh value on the performance of the overall algorithm.
We evaluate the true positive rate using the 96 manual traces as before. Figure 8(b) shows the true positive rate for each target web site. We observe that the true positive rate varies across a wide range, between 14.98% and 100%. The average true positive rate over all web sites is 87.11%. Again, as shown in Figure 8, the combination of all these heuristics has the best chance of correctly classifying most of the traffic. We discuss this further in the next subsection.

5.5 Overall Algorithm

We now evaluate the overall algorithm (shown in Figure 7) that combines all the heuristics. We first use the manual traces to evaluate false negatives, and then use both the Field2012 campus trace and the manual traces to evaluate false positives. In the latter case, there are typically multiple browsing sessions for each of the 96 sites, and we take the average false positive rate of all the sessions for the same site. Figure 9 shows both the false negative and false positive rates for the overall algorithm. The timing threshold is 500ms. The result is shown in two rows, sorted based on false negative rate.

Figure 9: False positive and false negative rates for all the webservices with timing threshold 500 ms.

We observe that the overall algorithm performs very well for most web sites. The average false positive rate across all web sites is 3.54%. With very few exceptions, the false positive rate is below 10%. One anomaly is www.reddit.com, with a false positive rate of 65.12%. We conjecture that this is caused by very frequent content updates due to active user postings. Since we compare all sessions within the 5 hour field trace with just one ground truth trace, the "ground truth" is outdated for most sessions. To verify this hypothesis, we collect a shorter 20 min trace on the same campus gateway along with a ground truth trace. For the 48 reddit sessions that we capture, the average false positive rate is now 5.2%, indeed much lower. This confirms that the high false positive rate we have observed in Figure 9 is indeed caused by the evaluation artifact rather than by the cobbling algorithm.

The average false negative rate is 4.14% across all web services. Except for the last 8 sites, the false negatives for all other sites are below 10%. Further investigation of the trace shows that the high false negatives for the last 8 sites are mostly caused by flash video players. Unlike the browser, the video players often leave the Referer field empty in their GET requests. Recall that our algorithm relies on the Referer field for cobbling the session tree; if the Referer field is not set, our algorithm assumes the request is not related to this session tree and will miss the traffic completely. Not setting the Referer field is acceptable for CDN or main domain accesses, since they will still be accounted for, although as part of a different cobbled tree.

We think this problem can be addressed by correlating the sessions without a Referer field with those that have one. For example, when we see a GET request with an empty Referer field, we take its URL and find the "best matching" URL of another request among all the cobbled trees of the same user within a certain time period, say 5 sec. The intuition here is that if the flash player retrieves video from a domain, the browser is likely to retrieve some other content from the same domain. We have tested this approach on the four domains with the highest false negative rates, and found that their false negative rate is reduced to 19.1%. Work is still needed to further validate this approach and to explore other correlation methods.
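One plausible realization of this correlation step is sketched below; the request representation, the string-similarity measure, and the 5-second window are illustrative assumptions rather than the system's actual design.

# Illustrative sketch of correlating empty-Referer requests with existing
# cobbled trees by "best matching" URL, as discussed above. The similarity
# measure and data layout are assumptions made for this example.

from difflib import SequenceMatcher

WINDOW_SEC = 5.0  # look for candidate requests within this time window

def best_matching_tree(orphan, recent_requests):
    """orphan: (timestamp, url) of a GET with an empty Referer field.
    recent_requests: list of (timestamp, url, tree_id) for the same user.
    Returns the tree_id whose recent request URL is most similar, or None."""
    ts, url = orphan
    best_tree, best_score = None, 0.0
    for req_ts, req_url, tree_id in recent_requests:
        if abs(ts - req_ts) > WINDOW_SEC:
            continue
        score = SequenceMatcher(None, url, req_url).ratio()
        if score > best_score:
            best_tree, best_score = tree_id, score
    return best_tree

# Example: attribute a flash player's video request to the browsing session
# that recently fetched content from the same domain.
orphan = (101.2, "http://video.example-cdn.com/clip/123.flv")
recent = [(100.5, "http://video.example-cdn.com/player.swf", "tree-42"),
          (97.0, "http://www.example.com/index.html", "tree-42")]
print(best_matching_tree(orphan, recent))  # -> "tree-42"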
5.6 Sensitivity to the Timing Threshold

The only parameter in our algorithm is the timing threshold Tthresh. Clearly, setting Tthresh too high leads to fewer false negatives, but also increases the number of false positives. It is therefore important to identify a good threshold that balances the two. We vary the timing threshold and observe its impact on the false positives and false negatives;
results for the top 17 web services (in terms of traffic volume) in the Field2012 trace are shown in Figure 10.

Figure 10: Sensitivity to the timing threshold Tthresh for the top 17 web services, with the threshold varied from 100ms to 8000ms. (a) False negatives; (b) false positives.

In Figure 10(a), we show that we can decrease the false negative percentage of some of the web sites by increasing the timing threshold. For example, bing's false negative percentage comes down from 35% to almost zero when the threshold is increased from 100ms to 250ms. But we also find that the false negatives of several web sites are not affected by the threshold. One reason is that if the Referer field is missing, there is not much gain in accuracy from increasing the timing threshold.

On the other hand, false positives generally increase with the timing threshold (shown in Figure 10(b)). google is most sensitive to the timing threshold, since its search results contain many third-party web sites, and users also tend to click on search results faster than when they browse individual web sites. But web sites such as cnn and wikipedia are much less sensitive to increases in the timing threshold. One reason is that many web sites are designed to attract users to their own pages and hence have very few clickable third-party URLs. Such differences between web sites in their sensitivity to the timing threshold suggest that it may be beneficial to adopt a different timing threshold for different web sites. For the web sites we have studied, we find that a timing threshold of 250ms to 500ms seems to be a good trade-off between the false positive and false negative rates.
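Acting on this observation could be as simple as a per-site threshold table with a conservative default, as in the illustrative sketch below; the specific sites and values are examples drawn from the ranges discussed above, not a tuned configuration.

# Illustrative sketch of a per-site timing threshold, as suggested above.
# The site list and values are examples only (250-500ms range, 500ms default).

DEFAULT_TTHRESH_MS = 500

PER_SITE_TTHRESH_MS = {
    "www.google.com": 250,   # search pages: users click through quickly
    "www.cnn.com": 500,      # content site: few clickable third-party links
    "www.wikipedia.org": 500,
}

def tthresh_for(root_domain):
    """Return the click-timing threshold (in ms) to use for a target site."""
    return PER_SITE_TTHRESH_MS.get(root_domain, DEFAULT_TTHRESH_MS)

print(tthresh_for("www.google.com"))   # 250
print(tthresh_for("www.reddit.com"))   # falls back to 500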
6. DISCUSSION

Mobile Apps. Browsers on mobile phones behave in a fashion similar to desktop browsers in terms of filling in the Referer field. However, mobile apps often ignore optional fields such as User-Agent and Referer. To investigate this, we ran several mobile apps on an Android phone, randomly navigated within the apps, and captured traces. We found that the Referer field is empty in the majority of the requests. Thus, we believe that our cobbling algorithm, in its current form, may not be directly applicable to mobile apps. However, further investigation reveals that mobile apps send requests to a significantly smaller number of domains than web browsers do (refer to Figure 2). We found that the average number of top-level domains per app is just 5. Moreover, as Table 3 shows, for the majority of the apps more than 90% of the traffic belongs to the main and supporting domains for the app. Hence it is likely that CDN detection and simple static traffic rules may work well for mobile apps, although a more careful study is still needed. We plan to investigate such techniques as part of our future work.

App name         Main    Supporting   Ads     Third party
cnn              58.69   41.31        0.00    0.00
yelp             30.37   69.63        0.00    0.00
washingtonpost   90.21   0.02         9.77    0.00
amazon           62.98   37.02        0.00    0.00
nytimes          98.90   0.00         0.87    0.23
facebook         11.15   87.73        0.00    1.11
twitter          0.00    98.13        0.00    1.87
imdb             0.00    97.68        0.29    2.03
bestbuy          80.56   16.67        0.00    2.78
dictionary       56.96   0.00         39.04   4.00
bbc              17.82   72.80        0.38    9.01
huffingtonpost   18.33   67.56        4.85    9.26
abc              81.01   9.72         0.00    9.27
engadget         0.28    79.90        1.42    18.39
ebay             4.32    84.18        0.00    11.51
wunderground     69.80   0.00         0.28    29.93
weather          13.94   19.64        30.44   35.99

Table 3: Mobile app traffic break-down per domain in percentage of bytes
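The kind of static rule suggested above, supported by the domain concentration visible in Table 3, could be as simple as a per-app table of main and supporting domains, as in the illustrative sketch below; the app-to-domain mapping is a hypothetical example rather than measured data.

# Illustrative sketch of static per-app traffic rules for mobile apps.
# The domain sets are hypothetical examples; a real table would come from
# CDN detection or offline app profiling.

APP_DOMAINS = {
    "twitter": {"twitter.com", "twimg.com"},
    "cnn": {"cnn.com", "turner.com"},
}

def classify_app_request(host):
    """Attribute an HTTP request to an app if its host falls under one of the
    app's main or supporting domains; otherwise leave it unclassified."""
    for app, domains in APP_DOMAINS.items():
        if any(host == d or host.endswith("." + d) for d in domains):
            return app
    return None

print(classify_app_request("api.twitter.com"))  # twitter
print(classify_app_request("ads.example.net"))  # None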
Video Services. There are several popular video service sites that also use HTTP streaming to deliver video content. Examples include Youtube, Netflix, and Hulu. Our algorithm does not directly apply to such web sites since they typically have multiple domains supporting content delivery. For example, both Netflix and Hulu use three CDNs (Akamai, Level3, and LimeLight). Youtube also has several domains to support video and image download, although they do not use third-party CDNs. However, since there are very few such large-scale video service sites, it is affordable to make specific rules for these web services. Such rules can be based on CDN host or domain names as well as signatures in the messages [4].

Cheating and Collusion. The cobbling algorithm can potentially be vulnerable to cheating by a user, since the client browser determines what Referer field to set. For instance, a
user can disable the Referer field so that the traffic cannot be easily associated with any web service. Or a user can modify the Referer field to falsely associate the traffic in question with a different web service. However, unless most Internet users cheat, we can always correlate individual user sessions with the common and vast majority, and hence can detect anomalies and blacklist such individuals. Collusion between users and web service providers makes detection more difficult. However, this should be very rare since there is not much incentive to do so in practice. For instance, in the case of reverse billing, it is unlikely that the web service provider will collude with a user, since one of the parties has to pay for the traffic.

Prefetching and Auto-Refresh. Prefetching at the origin or CDN server does not affect our algorithm. In addition, prefetching by the client browser is similar to the user accessing such content, so it will be classified accordingly. There are several scenarios for auto-refresh. First, auto-refresh of the entire page is similar to the user accessing the page again, and hence it is classified correctly. Second, auto-refresh of individual objects such as ads is also fine if the same link has been classified before. Last but not least, if the auto-refresh is for a different link, then we can search for similar links in the cobbled tree, similar to what we have done for sessions with missing Referer fields. Further work is needed to fully evaluate the impact of such a strategy.

7. RELATED WORK

Major search engines and portals such as Yahoo and Bing have long offered catalogs of web sites based on the type of their services. Research has also been done in the past to catalog web services using techniques such as machine learning [17, 9]. The focus of our paper is not to catalog web services, but to isolate each target web service's traffic to assist better billing and service management. To the best of our knowledge, this is the first work to consider this problem, and as such very few related works exist.

There is related work on traffic classification and identification in general, especially at the application level [14, 19, 8]. Moore and Zuev proposed a machine learning approach based on Bayesian analysis to classify Internet traffic into categories such as P2P, multimedia, WWW, etc. [15]. Karagiannis et al. propose the BLINC system [8] for traffic classification in the dark, using communication graphlets to characterize different types of applications. A survey of traffic classification techniques based on machine learning is given in [16]. Prior research has also been done to identify specific applications such as Skype [5, 7] based on payload signatures and packet timing characteristics.

There have been some measurement studies to understand new web technologies such as Ajax (e.g., [20, 12, 10]). Our goal in this paper, however, is to provide a mechanism to identify the sessions corresponding to a web service.

8. CONCLUSIONS

We have presented the CobWeb system for in-network cobbling of traffic associated with a given set of web services. Such a system can enable new types of monitoring and measurement capabilities, and has the potential to even enable new revenue models such as reverse billing for service providers. While the classification algorithm is based on a combination of multiple heuristics using CDN information and HTTP request timing, the association between different web requests is made possible by the HTTP Referer field, which is an essential component of the Internet advertising industry today. Our extensive evaluation suggests that CobWeb can achieve low false positive and false negative rates, and can potentially sustain a 10Gbps link.

We view CobWeb only as the first step towards solving the complex problem of web service classification. Specifically, CobWeb cannot currently handle encrypted traffic (e.g., using HTTPS) since it relies on information within the HTTP requests that may not be visible in the network. While most traffic today is not HTTPS, extensively covering all such cases is a big challenge.

9. REFERENCES
[1] Allot. http://www.allot.com.
[2] Endace NinjaBox. http://www.endace.com.
[3] Sandvine. http://www.sandvine.com.
[4] V. K. Adhikari, Y. Guo, F. Hao, M. Varvello, V. Hilt, M. Steiner, and Z.-L. Zhang. Unreeling Netflix: Understanding and improving multi-CDN movie delivery. In IEEE INFOCOM, 2012.
[5] D. Bonfiglio, M. Mellia, M. Meo, D. Rossi, and P. Tofanelli. Revealing Skype traffic: When randomness plays with you. In ACM SIGCOMM, 2007.
[6] C. Estan, K. Keys, D. Moore, and G. Varghese. Building a better NetFlow. In ACM SIGCOMM, Aug. 2004.
[7] F. Hao, M. Kodialam, and T. Lakshman. On-line detection of real-time multimedia traffic. In ICNP, 2009.
[8] T. Karagiannis, K. Papagiannaki, and M. Faloutsos. BLINC: Multilevel traffic classification in the dark. In ACM SIGCOMM, 2005.
[9] I. Katakis, G. Meditskos, G. Tsoumakas, N. Bassiliades, and I. Vlahavas. On the combination of textual and semantic descriptions for automated semantic web service classification. In AIAI, 2009.
[10] E. Kiciman and B. Livshits. AjaxScope: A platform for remotely monitoring the client-side behavior of web 2.0 applications. SIGOPS Operating Systems Review, 41(6):17-30, 2007.
[11] R. R. Kompella and C. Estan. The power of slicing in internet flow measurement. In IMC, 2005.
[12] M. Lee, R. R. Kompella, and S. Singh. Active measurement system for high-fidelity characterization of modern cloud applications. In Proceedings of the USENIX Conference on Web Applications, 2010.
[13] Y. Lu, A. Montanari, B. Prabhakar, S. Dharmapurikar, and A. Kabbani. Counter braids: A novel counter architecture for per-flow measurement. In ACM SIGMETRICS, 2008.
[14] J. Ma, K. Levchenko, C. Kreibich, S. Savage, and G. M. Voelker. Unexpected means of protocol inference. In ACM IMC, pages 313-326, October 2006.
[15] A. W. Moore and D. Zuev. Internet traffic classification using Bayesian analysis techniques. In ACM SIGMETRICS, 2005.
[16] T. T. Nguyen and G. Armitage. A survey of techniques for Internet traffic classification using machine learning. In IEEE Communications Surveys and Tutorials, 2008.
[17] N. Oldham, C. Thomas, A. Sheth, and K. Verma. METEOR-S web service annotation framework with machine learning classification. In Int. Workshop on Semantic Web Services and Web Process Composition, 2004.
[18] A. Ramachandran, S. Seetharaman, N. Feamster, and V. Vazirani. Fast monitoring of traffic subpopulations. In ACM IMC, 2008.
[19] M. Roughan, S. Sen, O. Spatscheck, and N. Duffield. Class-of-service mapping for QoS: A statistical signature-based approach to IP traffic classification. In ACM IMC, 2004.
[20] F. Schneider, S. Agarwal, T. Alpcan, and A. Feldmann. The new web: Characterizing AJAX traffic. In International Conference on Passive and Active Network Measurement, April 2008.
[21] L. Yuan, C.-N. Chuah, and P. Mohapatra. ProgME: Towards programmable network measurement. In ACM SIGCOMM, 2007.