Revisiting Cacheability in Times of User
Abstract: Today’s Internet traffic is dominated by users’ demand for exchanging content. In particular, multi-media content, i.e., photos, music, and video, as well as software downloads and updates contribute substantially to today’s Internet traffic. One option for reducing network costs is to use caches, exploiting the observation that content popularity is consistent with Zipf’s law. Yet, Web caching became unprofitable due to the increase in popularity of dynamic Web content. However, since rich content is at this point not very dynamic, caching appears to be worthwhile again.

We base our analysis on anonymized traces from a large European ISP connecting more than 20,000 residential DSL customers to the Internet, collected in 2009. We focus on the most prominent protocols in this environment, namely HTTP, BitTorrent (BT), eDonkey, and NNTP, and estimate the potential of caching for traffic reduction. On the one hand, our results show that the potential for caching most client/server-based applications like HTTP and NNTP is small. On the other hand, P2P-based applications such as BitTorrent and certain HTTP-based applications have high content duplication ratios.

I. INTRODUCTION

The Internet has evolved into a system where users can easily share content with their friends and/or other users via applications such as wikis, blogs, online social networks, P2P file-sharing applications, One-Click Hosters, or video portals, to name a few of the most well-known user generated content (UGC) services. In terms of volume, multi-media content including photos, music, and videos, as well as software downloads and updates, are major contributors and together responsible for most of the Internet traffic [11], [12], [16], [17]. Indeed, HTTP is again accounting for more than 50 % of the traffic [11], [12], [16], [17] and is hardly (mis-)used as a transport protocol for other applications [12]. Among the causes for the increase of HTTP traffic are One-Click Hosters such as rapidshare.com or uploaded.to and the increase of streaming content, e.g., offered by youtube.com.

In the early stages of the Internet Web caches were very popular. However, the efficiency of Web caches [7], [15] decreased drastically as the popularity of advanced features increased: dynamic/personalized Web pages (via cookies), AJAX-based applications, etc. Reexamining today’s content we find that a large fraction, especially multi-media content, is static and therefore it might be rewarding to re-evaluate the caching potential. This is confirmed by recent caching efficiency studies for specific applications, e.g., Gnutella [8], Fasttrack [20], YouTube [22], BitTorrent [9], and Web [6]. Rather than focusing on a specific application, in this paper we study a set of application protocols. For these we investigate the potential of caches, traffic redirection for sharing content, as well as causes of non-cacheability.

In this paper we present observations based on passive packet-level monitoring of more than 20,000 residential DSL lines from a major European ISP. This unique vantage point coupled with our application protocol analysis capabilities enables a more comprehensive and detailed characterization than previously possible. We focus on the cacheability of multiple applications that are predominantly used for sharing content in our environment¹: (i) HTTP, (ii) BitTorrent, (iii) eDonkey, and (iv) NNTP. Note that HTTP and NNTP are client/server protocols while BitTorrent and eDonkey are P2P protocols. It might seem odd to include NNTP, the protocol used by Usenet, but we surprisingly find that NNTP accounts for more than 2 % of the total traffic volume and seems to be used as an alternative for file-sharing [10]. In previous work, Karagiannis et al. [9] study the cacheability of BitTorrent before it became one of the dominant file-sharing applications, based on data from a residential university environment. Erman et al. [6] only focus on the cacheability of Web traffic.

We find that the story for caching is ambivalent. For client/server-based applications, including NNTP and some Web domain classes, e.g., One-Click Hosters, caching is ineffective. For other HTTP services, e.g., Software/Updates, we observe caching efficiencies up to 90 %. In addition, for some domains, using opportunistic cache heuristics improves cacheability substantially. For P2P protocols, especially BitTorrent, there is substantial potential for caching if the cache actively participates in the protocol. Moreover, traffic localization via mechanisms such as those proposed by, e.g., Aggarwal et al. [1], Xie et al. [21], Choffnes and Bustamante [3], and currently under discussion within the IETF ALTO [18] working group, is promising². Traffic localization in effect uses local peers as cache.

¹ In previous work, Maier et al. [12] found that at our vantage point HTTP is responsible for more than 50 % of the traffic while P2P (BitTorrent and eDonkey) contributes less than 15 %. Even assuming all unclassified traffic to be P2P, in total it accounts for less than 30 %.
² The key idea is that ISPs and P2P users collaborate to locate close-by peers.

II. DATA SETS

We base our study on multiple sets of anonymized packet-level observations of residential DSL connections collected at
TABLE I
OVERVIEW OF ANONYMIZED PACKET TRACES AND SUMMARIES.

                                                       Application volume
Name            Start date          Dur    Size       HTTP   BitTorrent  eDonkey  NNTP
APR09           Wed 01 Apr’09 2am   24 h   >4 TB      58 %   9 %         3 %      2 %
AUG09           Fri 21 Aug’09 2am   48 h   >11 TB     63 %   9 %         3 %      2 %
HTTP-14d        Wed 09 Sep’09 3am   14 d   >200 GB    corresponds to >40 TB HTTP
NNTP-15d        Wed 05 Aug’09 3am   15 d   >2 GB      corresponds to >2 TB NNTP
BitTorrent-14d  Sat 20 Jun’09 3am   14 d   >80 GB     corresponds to >5 TB BitTorrent
[Figure 2, three panels with a logarithmic CCDF axis: (a) Peers per torrent; (b) Torrents per peer; (c) Downloads per block. Legend entries: overall, within AS, extern, from AS, from extern.]
Fig. 2. Complementary cumulative distribution functions (CCDFs) of BitTorrent analysis for BitTorrent-14d (note the distinct scales).
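To make the quantity plotted in Fig. 2 concrete, the following minimal Python sketch computes an empirical CCDF from a list of counts. It assumes the convention P[X >= x], which keeps the largest observation at a non-zero probability on a logarithmic axis; the convention used for the figure is not stated, and the sample values are hypothetical rather than taken from the BitTorrent-14d trace.

```python
from collections import Counter


def empirical_ccdf(samples):
    """Return sorted (x, P[X >= x]) pairs for a list of numeric samples."""
    n = len(samples)
    if n == 0:
        return []
    counts = Counter(samples)
    ccdf = []
    remaining = n  # number of samples with value >= current x
    for x in sorted(counts):
        ccdf.append((x, remaining / n))
        remaining -= counts[x]
    return ccdf


# Hypothetical peers-per-torrent counts (illustration only, not trace data).
peers_per_torrent = [1, 1, 1, 2, 2, 3, 5, 8]
ccdf = empirical_ccdf(peers_per_torrent)
for x, p in ccdf:
    print(x, p)
```

Plotting these pairs on log-log axes gives a curve of the kind shown in the three panels.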
TABLE III
CACHEABILITY OF NNTP ARTICLES
[Figure (caption missing): bar chart of % of total BitTorrent volume by region. Categories: Africa, Asia, Lang. region, Europe (rest), North America, Oceania, South America.]
none 12.0 %
TABLE V
CACHEABILITY OF TOP 15 DOMAINS (BY BYTES)

type of service  UGC  fraction  complete  full     realistic
OCH1             ✔    12.6 %    2.6 %     0.0 %    1.5 %
OCH2             ✔    1.3 %     6.0 %     1.1 %    2.0 %
OCH3             ✔    1.0 %     7.1 %     0.0 %    0.2 %
OCH4             ✔    0.9 %     2.6 %     0.0 %    0.1 %
Video1           ✔    10.8 %    13.0 %    0.0 %    3.8 %
Video2           ✔    2.2 %     43.4 %    0.1 %    5.2 %
Video3           ✔    1.4 %     7.0 %     0.0 %    0.2 %
Video4           ✔    1.4 %     14.7 %    1.5 %    3.6 %
Video5           ✔    1.1 %     11.7 %    2.5 %    9.0 %
Software1             2.8 %     63.0 %    68.6 %   64.8 %
Software2             1.8 %     22.0 %    2.9 %    87.4 %
Software3             0.9 %     12.9 %    3.3 %    54.7 %
Software4             0.8 %     56.9 %    42.7 %   65.4 %
CDN1             ?    1.5 %     34.8 %    12.8 %   25.4 %
Search           ?    1.0 %     56.0 %    5.3 %    32.7 %
overall               100.0 %   21.0 %    9.5 %    21.7 %

[Figure 4: cacheability in % (y-axis, ticks 16 to 22) versus fraction of population (x-axis: 1/8, 1/4, 1/2, 1).]
Fig. 4. Cacheability dependent on fraction of population

General Observations: In principle, there is substantial potential for caching HTTP (see Table VI). In the ideal scenario 71 % of the requests and 28 % of the bytes are cacheable. This is consistent with previous results [7], which also observe a significantly higher request hit rate than byte hit rate. Disabling caching across different second-level domains, scenario domain, does not decrease the caching efficiency by much. It is still 71 % for requests and 27 % for bytes. Disabling caching across different URLs for the same object, scenario complete, causes the efficiency to drop to 57 % and 21 %. This shows that identical objects are usually not hosted by different providers, while for each provider it is quite common to host the same object on different hosts or with different paths. Including cache control headers, the scenario full, reduces cache efficiency drastically to 16 % and 9.5 %. However, in the realistic scenario overall cacheability increases to 47 % and 22 %. The omission of the object ID is responsible for this increase in cacheability: (i) the cache may be allowed to serve an object that has changed on the server; (ii) aborted downloads and partial requests lead to different object IDs and are thus only cacheable in the realistic scenario.

The results of Erman et al. [6] for the ideal scenario are even more promising. They found a cacheability of 92 % for requests and 68 % for bytes. Their final results after considering cache control also show a substantial drop but again indicate a better cache hit rate with 32 % of the bytes. One reason for the more promising results is that they assume that the size of the download is equal to the Content-Length header. However, we find that many downloads are interrupted prematurely. In particular, large downloads are often aborted. We find that this can lead to a measurement error of over 40 % in download volume. Moreover, Erman et al. only consider the Cache-Control header. However, one third of the replies with cache control are controlled by a different header, e.g., Expires, as listed in Table IV.

Individual Sites: Table V shows the cacheability for the top 15 domains (by bytes), classified according to the Web service that they offer, for the scenarios complete, full, and realistic. In column UGC we mark if a domain is dominated by user generated content. When comparing scenarios complete and full we see that respecting cache control headers has a devastating effect on cacheability for some domains. When comparing scenarios full and realistic we see that the negative impact of the object ID also differs across sites. The site Software1 differs: realistic cacheability is lower than predicted by the full scenario. This can occur if an intermediate HTTP request invalidates the cache before allowing access to a resource, e.g., when delivering a login page instead of the requested object upon missing authentication cookies.

There are significant differences among the Web hosters. Sites offering software appear to ensure good cacheability (e.g., > 50 %) and can take advantage of caching (e.g., > 60 % for the realistic scenario). Some sites hosting videos have substantial potential for caching (> 40 %) but do not take advantage of it. CDNs also have potential but realize it only partially (34.8 vs. 12.8 %). One-Click Hosters (OCHs) have hardly any caching potential (< 8 %) and do not even take advantage of the little potential. The caching hit rate for the realistic scenario is less than 2 %. We observe that sites dominated by user generated content exhibit considerably lower cacheability than other sites, e.g., software hosters.

Population Size: Next, we explore the impact of the population size on cacheability. We randomly subdivide our population into smaller sub-populations and recompute the cacheability for the realistic scenario. We observe that cacheability appears to increase with population size. When doubling the population size the increase in cacheability ranges from 1.6 % to 2.9 %, cf. Figure 4. We presume that the caching potential further increases with an increase in population. However,
there may be saturation effects. Also note that the variability of cacheability increases with decreasing population size.

Cache Optimizations: Next we explore why the potential for HTTP caching is not used. For this purpose we allow violations of the strict caching semantics for two high-volume sites. For Video1 we study the impact of personalization and load balancing over servers with different host names. We start at a baseline of 3.8 %. Removing personalization, i.e., parameters, from URLs yields an increased cacheability of 20.1 %. Unifying host names increases cacheability to 24.6 %. Thus, we conclude that personalization can be a major cause for non-cacheability of objects.

Some objects do not include any information regarding their cacheability. Thus in principle they cannot be cached. This may be either intentional or due to negligence of the operator. We now explore if opportunistic caching, meaning setting artificial expiry times and thereby violating strict cache semantics, can help for such objects. More specifically, we examine expiry times of 10 s, 10 min, 1 h, 1 d, and infinite.

The overall effect of opportunistic expiries is only small: a 2.6 % increase for a ten-minute timeout and 4.0 % for infinite caching. However, OCH1 and CDN1 show a large increase in cacheability. For OCH1 even ten minutes is sufficient to gain most of the benefits. Further investigation shows that misconfigured download accelerators are responsible. Such accelerators download large objects across multiple parallel connections. While this is not per se harmful, these accelerators issue partial requests for overlapping regions. As soon as the desired data is fetched, the accelerator closes the connection. However, it takes time to cancel the transaction, and therefore additional data is downloaded. We observe clients that open up to 300 parallel connections, resulting in an increase of the download volume by a factor of three. With some cache tuning such extra downloads can be eliminated with a small opportunistic timeout.

TABLE VII
OPPORTUNISTIC CACHING FOR TOP 15 DOMAINS (SCENARIO realistic).
DOMAINS WITH NO IMPROVEMENT ARE NOT SHOWN.

                      improvement (percentage points)
type       baseline   10 s    10 min   1 h      1 d      ∞
OCH1       1.5 %      8.1 %   16.6 %   17.0 %   17.8 %   18.4 %
OCH2       2.0 %      0.0 %   0.1 %    0.1 %    0.1 %    0.1 %
Video4     3.6 %      0.2 %   1.0 %    1.2 %    1.3 %    1.3 %
Video5     9.0 %      0.0 %   0.1 %    0.1 %    0.3 %    0.4 %
Software3  54.6 %     0.0 %   0.0 %    0.0 %    0.0 %    0.1 %
CDN1       25.4 %     0.1 %   3.0 %    9.5 %    19.5 %   23.7 %
Search     32.7 %     0.0 %   0.3 %    0.5 %    0.7 %    0.7 %
overall    21.7 %     1.1 %   2.6 %    2.9 %    3.5 %    4.0 %

V. SUMMARY

Our analysis of more than 20,000 residential broadband DSL lines of a large European ISP shows that, contrary to recent work, caching is not necessarily beneficial. For NNTP and some Web domain classes, including sites dominated by user generated content, we hardly find any potential. However, some Web service providers can take advantage of caches: software download providers and CDNs, for example, have substantial potential, even if it remains unused due to cache control, personalization, and load balancing.

Caching for P2P protocols is in principle very promising, especially when combined with P2P neighbor selection strategies. However, taking advantage of the potential is non-trivial as the cacheability for simple chunk downloads drops to 9 %. Therefore, we plan to explore how to take advantage of the potential caching rates of more than 95 % for BitTorrent in future work.

REFERENCES

[1] Aggarwal, V., Feldmann, A., and Scheideler, C. Can ISPs and P2P users cooperate for improved performance? SIGCOMM Comput. Commun. Rev. 37, 3 (2007).
[2] Azureus messaging protocol. http://www.azureuswiki.com/index.php/Azureus_messaging_protocol, Apr 2009.
[3] Choffnes, D. R., and Bustamante, F. E. Taming the torrent: a practical approach to reducing cross-ISP traffic in peer-to-peer systems. In Proc. ACM SIGCOMM (2008), pp. 363–374.
[4] Cohen, B. The BitTorrent Protocol Specification. http://bittorrent.org/beps/bep_0003.html, 2008.
[5] Dreger, H., Feldmann, A., Mai, M., Paxson, V., and Sommer, R. Dynamic application-layer protocol analysis for network intrusion detection. In Proc. Usenix Security Symp. (2006).
[6] Erman, J., Gerber, A., Hajiaghayi, M. T., Pei, D., and Spatscheck, O. Network-aware forward caching. In Proc. World Wide Web Conference (2009).
[7] Feldmann, A., Caceres, R., Douglis, F., Glass, G., and Rabinovich, M. Performance of web proxy caching in heterogeneous bandwidth environments. In Proc. IEEE INFOCOM (1999).
[8] Hefeeda, M., and Saleh, O. Traffic Modeling and Proportional Partial Caching of Peer-to-Peer Systems. IEEE/ACM Trans. Networking 16, 6 (2008).
[9] Karagiannis, T., Rodriguez, P., and Papagiannaki, D. Should Internet Service Providers fear peer-assisted content distribution? In Proc. ACM Internet Measurement Conference (2005).
[10] Kim, J., Schneider, F., Ager, B., and Feldmann, A. Today’s usenet usage: Characterizing NNTP traffic. In Proceedings of the 13th IEEE Global Internet Symposium (March 2010).
[11] Labovitz, C., McPherson, D., and Iekel-Johnson, S. NANOG 47: 2009 internet observatory report. http://www.nanog.org/meetings/nanog47/abstracts.php?pt=MTQ1MyZuYW5vZzQ3&nm=nanog47.
[12] Maier, G., Feldmann, A., Paxson, V., and Allman, M. On dominant characteristics of residential broadband internet traffic. In Proc. ACM Internet Measurement Conference (Nov 2009).
[13] Paxson, V. Bro: A system for detecting network intruders in real-time. Computer Networks 31, 23–24 (1999).
[14] Plissonneau, L., Costeux, J.-L., and Brown, P. Detailed analysis of eDonkey transfers on ADSL. In Next Generation Internet Design and Engineering, 2006. NGI ’06. 2006 2nd Conference on (2006).
[15] Rabinovich, M., and Spatscheck, O. Web Caching and Replication. Addison-Wesley Professional, 2001.
[16] Sandvine Inc. 2009 global broadband phenomena. http://www.sandvine.com/news/global_broadband_trends.asp, 2009.
[17] Schulze, H., and Mochalski, K. Internet study 2008/2009. http://www.ipoque.com/resources/internet-studies/ (need to register), 2009.
[18] Seedorf, J., and Burger, E. W. Application-layer traffic optimization (ALTO) problem statement. RFC 5693, Oct 2009.
[19] Stutzbach, D., and Rejaie, R. Understanding churn in peer-to-peer networks. In Proc. ACM Internet Measurement Conference (2006).
[20] Wierzbicki, A., Leibowitz, N., Ripeanu, M., and Woźniak, R. Cache Replacement Policies Revisited: The Case of P2P Traffic. In Cluster Computing and the Grid (2004), pp. 182–189.
[21] Xie, H., Yang, Y. R., Krishnamurthy, A., Liu, Y. G., and Silberschatz, A. P4P: Provider portal for applications. In Proc. ACM SIGCOMM (2008), pp. 351–362.
[22] Zink, M., Kyoungwon, S., Yu, G., and Kurose, J. Watch Global, Cache Local: YouTube Network Traffic at a Campus Network. In Proc. SPIE (2008), vol. 6818.
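As a companion to the HTTP cacheability scenarios and the opportunistic expiry experiment, the following minimal Python sketch replays a time-ordered request trace through an unbounded cache and reports the request hit rate and the byte hit rate. It is an illustration under stated assumptions, not the analysis code used for the study: the `Request` record, the three key functions (one plausible reading of the ideal, domain, and complete scenarios), the naive second-level-domain split, and the six-request trace are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Callable, Hashable, Iterable, Optional, Tuple


@dataclass(frozen=True)
class Request:
    time: float                 # seconds since trace start
    host: str                   # full host name of the request
    path: str                   # URL path including any parameters
    content_id: str             # hypothetical fingerprint of the payload
    size: int                   # bytes actually transferred
    cache_info: Optional[bool]  # True: headers allow caching, False: forbid, None: no headers


def ideal_key(r: Request) -> Hashable:
    # Assumed "ideal": identical content is one object, wherever it is hosted.
    return r.content_id


def domain_key(r: Request) -> Hashable:
    # Assumed "domain": no sharing across second-level domains.
    # Naive split on the last two labels (wrong for suffixes such as "co.uk").
    return (".".join(r.host.split(".")[-2:]), r.content_id)


def complete_key(r: Request) -> Hashable:
    # Assumed "complete": no sharing across different URLs.
    return (r.host, r.path)


def replay(requests: Iterable[Request],
           key: Callable[[Request], Hashable],
           respect_headers: bool = False,
           opportunistic_ttl: Optional[float] = None) -> Tuple[float, float]:
    """Return (request hit rate, byte hit rate) for a time-ordered trace."""
    expires = {}  # cache key -> time until which the stored copy is served
    n = n_bytes = hits = hit_bytes = 0
    for r in requests:
        n += 1
        n_bytes += r.size
        k = key(r)
        if k in expires and r.time <= expires[k]:
            hits += 1
            hit_bytes += r.size
            continue
        # Miss: decide whether the fetched object may be stored.
        if not respect_headers or r.cache_info is True:
            expires[k] = math.inf
        elif r.cache_info is None and opportunistic_ttl is not None:
            # Opportunistic caching: artificial expiry for objects without headers.
            expires[k] = r.time + opportunistic_ttl
        else:
            expires.pop(k, None)
    if n == 0 or n_bytes == 0:
        return 0.0, 0.0
    return hits / n, hit_bytes / n_bytes


# Hypothetical trace: one large object served from three hosts, one small
# object without any caching headers requested twice within five seconds.
trace = [
    Request(0, "a.example.com", "/v1", "X", 100, True),
    Request(10, "b.example.com", "/v1", "X", 100, True),
    Request(20, "mirror.example.org", "/v1", "X", 100, True),
    Request(30, "a.example.com", "/v1", "X", 100, True),
    Request(40, "a.example.com", "/p", "Y", 10, None),
    Request(45, "a.example.com", "/p", "Y", 10, None),
]

for name, key in [("ideal", ideal_key), ("domain", domain_key), ("complete", complete_key)]:
    print(name, replay(trace, key))
print("headers", replay(trace, complete_key, respect_headers=True))
print("headers + 10 s", replay(trace, complete_key, respect_headers=True, opportunistic_ttl=10))
```

Narrowing the key from content to URL lowers both rates, and the request hit rate can differ from the byte hit rate because objects differ in size. The sketch ignores cache capacity, aborted downloads, and partial requests, all of which matter in the measurements reported above.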