
Seventh IEEE International Symposium on Network Computing and Applications

Experiments of Large File Caching and Comparisons of Caching Algorithms

Brad Whitehead1, Chung-Horng Lung2, Amogelang Tapela, Gopinath Sivarajah
Department of Systems and Computer Engineering
Carleton University, Ottawa, Canada
{1bwhitehe, 2chlung}@sce.carleton.ca

Abstract

File sizes have grown tremendously over the past years for music/video applications, and the trend is still growing. As a result, large ISPs are facing increasing demand for bandwidth driven by the growth of file sizes. A main contributor to this bandwidth demand problem is inefficient use of bandwidth caused by many ISP customers downloading the same large files multiple times. This paper first reports real experiments conducted on Carleton's Internet backbone using the large file caching technique. Various cache replacement algorithms are then simulated and compared using traces of large file transfers. The results reveal that least recently used (LRU) performs better than the others.

1. Introduction

The Internet today is inefficient at distributing large files, which have become very common. This leads to long wait times for downloading popular files. Whitehead [28] studied the wait times for a popular download site after a major release of a 210MB file. The file was downloaded 35,000 times in 18 days, and within an hour of its release the wait time could reach an hour or even exceed 100 minutes.

An efficient caching solution for large files could mitigate wait times and speed up downloads. Further, efficient caching of large files reduces ISPs' bandwidth usage. The concept of caching has been used for many years in computing systems, the Web [20,23,27], and networks [7,11]. However, no explicit large file caching (LFC) solutions or experimental results have been reported in the literature. The currently available caching techniques are more suitable for caching Web pages and small files, and there are many technical challenges in dealing with large file transfers [19]. A comparison of Web caching and LFC is presented in Section 2. The first objective of this paper is to study the effect of large file caching on general Internet traffic.

Another aspect of the network inefficiency problem is that the size of a "large" download is continually increasing. To save money, ISPs can extend Web caching to LFC, yet very few companies are extending these tools to larger files. Large file transfers are not perceived to be a major factor in an ISP's bandwidth usage. However, based on real data traces, our study will show that large file transfers account for about 20% of the Internet traffic on Carleton's Internet backbone, and that the amount is still growing.

The second objective is to evaluate various caching algorithms. Many caching algorithms have been used in computing systems and Web caching. Some of them are applicable to LFC; some need modifications. We have applied and adapted several caching algorithms and compared their performance based on the data traces collected from the real experiments.

2. Background and Related Work

Web caching has been used since the early 90s. It allows ISPs to save bandwidth by storing frequently accessed files locally [20,23,27]. Currently available caching systems focus almost exclusively on Web pages. Downloading large files, on the other hand, is a much more challenging problem [19]. Table 1 summarizes the comparison between LFC and existing Web caching products. In short, LFC has to deal with any file type and various protocols. In addition, because large files are typically stored in multiple locations, LFC techniques need to be able to identify the same file even when it can be downloaded from different locations (one possible approach is sketched after Table 1).

Table 1. Large File Caching vs. Web Caching

                 Large File Caching                    | Web Caching
File types       Any file type, large Web pages        | Small files and Web pages
File sizes       1MB-10GB                              | 0-10MB
Protocols        Any protocol: HTTP, FTP, FastTrack,   | HTTP only
                 Gnutella, BitTorrent, CoBlitz, etc.   |
File locations   Files, in chunks or in whole, mostly  | Files reside on a single
                 or always reside on multiple sites    | site or multiple sites
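The paper does not specify the mechanism for recognizing the same file across download locations. Purely as an illustration (the digest-based scheme below is our assumption, not the authors' design), a cache could key entries on the file content itself rather than on the download URL, so that the same bytes fetched from two mirrors resolve to a single cache entry:

import hashlib

def content_key(file_bytes):
    # Key the cache on a digest of the bytes themselves, not the URL,
    # so an identical file served from different mirrors maps to one entry.
    return hashlib.sha256(file_bytes).hexdigest()

# The same payload downloaded from two different sites yields the same key:
payload = b"example release contents"
assert content_key(payload) == content_key(payload)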
Content distribution networks (CDNs) are closely related to our technique. Many approaches to peer-to-peer (P2P) content distribution have been reported in the literature or used in practice, such as BitTorrent [8], CHORD [5,14], CoBlitz [19], Coral [9], Fast-Replica [4], Gnutella [10], Kazaa [15], Shark [1],

etc. Some focus on content search and sharing using keywords, e.g., [5,14,16,29]. Some techniques focus on reducing the download time of popular files.

However, downloading large files is a qualitatively different problem for CDNs, as [19] reported based on Akamai's experience. Some techniques break a large file into pieces and exchange those pieces among clients instead of always downloading from the origin server [4,8,19]. Other techniques, e.g., FatNemo [2], SplitStream [3], ESM [6], Bullet [17], and Astrolabe [21], mainly deal with load balancing and link utilization, but they require clients to simultaneously transmit the content.

BitTorrent is widely used to support file downloads and it scales well. CoBlitz has been explicitly proposed to support large files, and it outperforms BitTorrent [19]. One of CoBlitz's design goals is to trade bandwidth for disk seek time, because the price of bandwidth is continually dropping while disk seek times remain high and can make a solution unscalable [19]. However, high bandwidth alone does not guarantee high performance, because every element of the delivery chain (intermediate nodes and links) can affect the overall performance. Also, if the cache hit ratio is high, caching can perform better than simply provisioning more bandwidth [22].

The cache in our approach resides on the local ISP side, where many clients using the same ISP may try to access the same large files. The chance that many clients share similar interests is high in some environments, such as universities. Moreover, our approach can be used together with other large file distribution services such as BitTorrent and CoBlitz. In other words, large files can be downloaded using those techniques from other peers or servers when the files are not in the local cache.

3. Experiments and Empirical Results

The main functionalities required to support LFC are cache lookup, redirection of connections, identification of large files, and tracking of all file transfers on the backbone server for every TCP connection, which presents a significant design and implementation challenge. Tracking the TCP connections on the backbone server requires that information be stored for every computer on the network and for each of that computer's connections.
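The paper does not describe the bookkeeping structures used for this tracking. As a rough illustration of the per-host, per-connection state just described (the names and fields below are our assumptions, not taken from the authors' tool), one might keep a nested map keyed by client host and TCP connection:

from collections import defaultdict

# Hypothetical tracking state: client host -> TCP connection tuple -> bytes seen.
connections = defaultdict(dict)

def record_segment(src_ip, src_port, dst_ip, dst_port, payload_len):
    # Accumulate payload bytes per connection for the sending host.
    conn = connections[src_ip].setdefault((src_port, dst_ip, dst_port), {"bytes": 0})
    conn["bytes"] += payload_len

def is_large_transfer(src_ip, conn_key, threshold=500 * 1024):
    # Mirrors the experiment's rule of logging only transfers over 500KB.
    conn = connections[src_ip].get(conn_key)
    return conn is not None and conn["bytes"] > threshold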
Filestats is a network-sniffing program that records a log of all file transfers over a network. Filestats was used to validate the need for LFC in two ways: first, that large-file traffic makes up a large enough percentage of the total network traffic to be significant; and second, that enough of the transferred files are the same (redundant) that a caching solution is viable.

All network traffic of the university going through the backbone server was processed, but only file transfers over 500KB in size were logged. The logs from Jan. 5 to Jan. 23, 2004 were processed to form Figures 1 and 2, which show the correlation between file size and network impact. Figure 1 shows the expected decrease in the number of files transferred as the file size increases. Figure 2 shows that large files, over 1MB in size, account for a large percentage of network traffic.

Figure 1: No. of Files Transferred vs. File Size

By comparing Figure 1 and Figure 2, several conclusions can be drawn that can improve the efficiency of LFC. The number of requests (files transferred) is an important factor determining caching system performance. Figure 1 shows that files between 0.5-1MB account for almost the greatest number of transfers at 71,226, second only to 1-5MB files at 76,328. However, Figure 2 shows that these 0.5-1MB transfers make up only 55GB of traffic, far below the 180GB for 1-5MB transfers. Therefore, for our study, we focus on files over 1MB in size. This also reduces the number of connections that need to be monitored. The specific file size threshold is configurable depending on needs or environments.

Figure 2: Bytes Transferred vs. File Size
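As a sketch of how such a log reduces to the statistics behind Figures 1 and 2 (the record format and exact bucket edges below are our assumptions, not the Filestats output format), transfers are counted and their bytes summed per file-size bucket:

MB = 1024 * 1024
# Illustrative size buckets like those in the figures.
BUCKETS = [(0.5 * MB, 1 * MB), (1 * MB, 5 * MB),
           (5 * MB, 10 * MB), (10 * MB, float("inf"))]

def bucket_stats(log):
    # log: iterable of (file_name, size_in_bytes) records
    counts = {b: 0 for b in BUCKETS}    # Figure 1: number of transfers
    traffic = {b: 0 for b in BUCKETS}   # Figure 2: bytes transferred
    for _name, size in log:
        for lo, hi in BUCKETS:
            if lo <= size < hi:
                counts[(lo, hi)] += 1
                traffic[(lo, hi)] += size
                break
    return counts, traffic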
4. Large File Caching Algorithms

This section presents a comparison of various LFC algorithms with respect to hit ratio, byte hit ratio, bandwidth saved, and cache size. The comparison was conducted based on simulations driven by the data traces collected through the experiments.

The total number of requests for these files (at least 1MB in size) was 130,866, and their total size was 380.81GBytes. There were 62,530 unique requests, with a total size of 205.86GBytes. The files in the trace were collected between January 5 and January 14, 2004, inclusive.

Table 2 summarizes the implemented caching algorithms and the rationale each uses to determine the file to replace. Each algorithm is further explained as follows.

Table 2. Summary of the Implemented Algorithms

Algorithm            | Replacement Policy
LRU                  | Least recently accessed first
LRU Size             | LRU, and file size in a specified range with respect to the incoming file size
LRU Threshold        | LRU, and file size less than a certain threshold
LFU                  | Least frequently accessed first
LFU Size             | LFU, and file size in a specified range with respect to the incoming file
LFU Threshold        | LFU, and file size less than a certain threshold
GreedyDual SizeFreq  | Least key value first, with Key(f) = L + (F*B)/S
Largest Size         | Largest file first
Static               | No files are evicted; the cache is reset at the end of each day based on the previous day's request values (frequency / file size)
LRU – Least Recently Used. LRU removes the file that was least recently accessed; the time of last access determines the file to be removed. More than one file can be replaced if the incoming file is larger than the least recently used object.
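A minimal sketch of this policy for variable-size files (a dict-based illustration under our own assumptions; the paper does not give implementation details):

from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.files = OrderedDict()  # name -> size, least recently used first

    def access(self, name, size):
        if name in self.files:
            self.files.move_to_end(name)   # hit: refresh recency
            return True
        # Miss: evict least recently used files until the new file fits.
        while self.used + size > self.capacity and self.files:
            _old, old_size = self.files.popitem(last=False)
            self.used -= old_size
        if size <= self.capacity:
            self.files[name] = size
            self.used += size
        return False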
LRU Size – Least Recently Used with Size. This method determines the file to be removed by looking at both the time of last access and the size of the file. The file to be removed must be at least as large as the incoming file to be cached. The main aim of the algorithm is to replace only one file when the cache is full.

LRU Threshold – LRU with Threshold. In this algorithm, a threshold value and the time of last access are used to choose the file to evict. When a new file to be cached cannot fit in the current cache, the least recently accessed file whose size is less than the threshold value is evicted. Files larger than the threshold are never evicted; the idea is to keep larger files in the cache to save bandwidth.

LFU – Least Frequently Used. This algorithm removes the file that has been accessed the least often. Frequency is the key parameter used to determine the object to be replaced.

LFU Size – Least Frequently Used with Size. This algorithm also replaces the file that has been accessed the least often, but the file to be removed must be at least as large as the incoming file to be cached. Similar to LRU Size, the main aim is to replace just one file when the cache is full.

LFU Threshold – LFU with Threshold. This algorithm removes the file that has been accessed the least often and whose size is less than or equal to the threshold value. The idea is similar to LRU Threshold.
Greedy Dual Size Frequency [13]. The key parameters in this method are the file size, the number of references (frequency) to the file, and the cost. The cost of downloading the file from the remote server is set to the average bandwidth usage of the file. Each cached file is assigned a key calculated from these parameters. When the cache is full, the file with the least key is evicted. The key is calculated using the formula shown below:

Key(f) = Lvalue + freq(f) * bandwidth(f) / size(f)   ...(A)

The value of Lvalue starts from 0 and is updated every time a file is evicted: the new Lvalue is the key value of the removed file. If a request is a miss, the file is fetched from the remote server. If the free space in the cache is smaller than the size of the incoming file, one or more files with the minimum key are removed. The file is then cached and:
- the frequency count freq(f) is set to 1;
- the key is recalculated from formula (A);
- cache used = cache used + size(f).
If a request is a hit:
- the file frequency is increased by 1;
- the key is recalculated from formula (A);
- the cache usage does not change.
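A compact sketch of formula (A) and the eviction rule (our own illustration; bandwidth(f) is assumed to be supplied per file as its measured average transfer bandwidth):

class GreedyDualSizeFreq:
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.L = 0.0               # Lvalue: starts at 0, inflated on eviction
        self.files = {}            # name -> [key, freq, size, bandwidth]

    def key(self, freq, bandwidth, size):
        return self.L + freq * bandwidth / size   # formula (A)

    def access(self, name, size, bandwidth):
        if name in self.files:     # hit: bump frequency and recompute key
            f = self.files[name]
            f[1] += 1
            f[0] = self.key(f[1], f[3], f[2])
            return True
        # Miss: evict minimum-key files until the incoming file fits.
        while self.used + size > self.capacity and self.files:
            victim = min(self.files, key=lambda n: self.files[n][0])
            self.L = self.files[victim][0]   # Lvalue := key of evicted file
            self.used -= self.files[victim][2]
            del self.files[victim]
        if size <= self.capacity:
            self.files[name] = [self.key(1, bandwidth, size), 1, size, bandwidth]
            self.used += size
        return False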
Static Cache. The static cache algorithm is a simple algorithm that caches all the popular requests made on the current day. The cache is deleted and reset at the end of each day.

Static Algorithm:
Gather all the requests for the day
For each requested file {
    Set value = #references / size of file
}
Sort the files in descending order of value
Populate the cache from the top of the sorted list
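The pseudocode maps directly to a few lines of Python; a sketch, assuming the day's requests arrive as (name, size) records:

def build_static_cache(day_requests, capacity_bytes):
    # day_requests: list of (file_name, size_in_bytes), one per request
    refs, size_of = {}, {}
    for name, size in day_requests:
        refs[name] = refs.get(name, 0) + 1
        size_of[name] = size
    # value = #references / file size, sorted in descending order
    ranked = sorted(refs, key=lambda n: refs[n] / size_of[n], reverse=True)
    cache, used = set(), 0
    for name in ranked:            # populate from the top of the sorted list
        if used + size_of[name] <= capacity_bytes:
            cache.add(name)
            used += size_of[name]
    return cache                   # rebuilt from scratch each day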
Largest Size. This algorithm determines the victim by removing the largest file from the cache. Size is the key parameter in this method. The rationale is that replacing the largest file frees the most disk space.

5. Results and Analysis

Hit Ratios. Figure 3 shows the hit ratios for each of the implemented algorithms.

Figure 3. Hit Ratio (hit ratio vs. cache size, 1-150GBytes, for each algorithm)

LRU Size has the highest hit ratio for all cache sizes. The hit ratios for the Static Cache were the lowest for all the cache sizes simulated. The hit ratio only counts the files that were found in the cache. However, it is possible to have a high hit ratio and still save little bandwidth, especially if most of the hits were smaller files.

Byte Hit Ratios. Figure 4 shows the byte hit ratios for each of the implemented algorithms. The byte hit ratios were calculated for each of the cache sizes used in the simulation.

Figure 4. Byte Hit Ratio (byte hit ratio vs. cache size, 1-150GBytes, for each algorithm)

The byte hit ratio says more about the bandwidth saved, because this parameter accounts for the actual size of each file: a high byte hit ratio means more bandwidth saved. The LRU algorithm has the highest byte hit ratios for all the cache sizes.
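Both metrics can be computed in one pass over the request trace; a sketch (the record format is our assumption), reusing the access interface of the LRU sketch from Section 4:

def cache_metrics(requests, cache):
    # requests: iterable of (file_name, size_in_bytes)
    # cache: any object with access(name, size) returning True on a hit,
    # e.g., the LRUCache sketch in Section 4.
    hits = hit_bytes = total = total_bytes = 0
    for name, size in requests:
        total += 1
        total_bytes += size
        if cache.access(name, size):
            hits += 1
            hit_bytes += size
    hit_ratio = hits / total                  # fraction of requests served locally
    byte_hit_ratio = hit_bytes / total_bytes  # weights each hit by its size
    return hit_ratio, byte_hit_ratio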
Bytes Saved. Figure 5 shows the bytes saved for the different cache sizes used in the simulation. LRU has the highest number of saved bytes for each cache size simulated. The bytes saved parameter and the byte hit ratio are related; both are good indicators of how effective an algorithm is at saving bandwidth.

Figure 5. Bytes Saved (saved bytes vs. cache size, 1-150GBytes, for each algorithm)

Bandwidth Saved. Here we show the actual amount of bandwidth used without a cache and the amount of bandwidth saved using a cache. Only one case, LRU with a 64GB cache, is presented in Figure 6 as a demonstration. The first curve is the amount of bandwidth used without a cache (labeled NoCache). The second, bottom curve represents the bandwidth used with a cache in place (labeled 64GB). Finally, the third, middle curve represents the amount of bandwidth saved by having a cache. The difference between the NoCache curve and the 64GB curve is the amount of bandwidth saved for that particular algorithm and cache size.

Figure 6. Bandwidth Saved with a 64GB Cache (LRU: bandwidth in KBits/sec vs. time, averaged over 30-minute intervals)

6. Conclusion

This paper presented an LFC technique and a comparison of various cache algorithms to support LFC. As file sizes continue to increase, caching large files at the local ISP can reduce download times and bandwidth usage, especially for popular files.

The comparison was based on data traces obtained from real experiments [28] in a university environment. The experiments revealed that many redundant large files had been downloaded from external servers. By caching those large files on the local ISP server, download times and bandwidth usage can be reduced. Generally speaking, LRU performs the best based on the data

collected in our experiments. GreedyDualSize-Freq performed better than LFU, because the GreedyDualSize-Freq method improves both the byte hit ratio and the hit ratio relative to the LFU results.

LFU can be used if the user is only interested in the popular files that are accessed frequently. If the hit ratio is the main concern, then LRU Size could be the appropriate algorithm to use. For maximum byte hit ratio and bandwidth saved, LRU generates the highest values.

We are planning to repeat the experimental work, as the nature of our network traffic has changed significantly since 2004. When the experiments were conducted in January 2004, the university had 30Mbps of Internet bandwidth. For the 2006/2007 academic year the bandwidth was expanded to 60Mbps in September, and by April it had been doubled again to 120Mbps [24]. It is expected that even more bandwidth is now consumed by identical large files.

References

[1] S. Annapureddy, M. J. Freedman, and D. Mazières, "Shark: Scaling File Servers via Cooperative Caching", Proc. of 2nd USENIX/ACM Symp. on Networked Systems Design and Implementation, May 2005.
[2] S. Birrer, D. Lu, F. E. Bustamante, Y. Qiao, and P. Dinda, "FatNemo: Building a Resilient Multi-source Multicast Fat-tree", Proc. of 9th Int'l Workshop on Web Content Caching and Distribution, October 2004.
[3] M. Castro, P. Druschel, A. Kermarrec, A. Nandi, A. Rowstron, and A. Singh, "SplitStream: High-bandwidth Content Distribution in a Cooperative Environment", Proc. of SOSP, October 2003.
[4] L. Cherkasova and J. Lee, "FastReplica: Efficient Large File Distribution within Content Delivery Networks", Proc. of the 4th USITS, March 2003.
[5] "Open Chord Specification", http://open-chord.sourceforge.net.
[6] Y. Chu, S. G. Rao, S. Seshan, and H. Zhang, "A Case for End System Multicast", IEEE J. on Selected Areas in Communication, 2002.
[7] Cisco Systems, "Network Caching Technologies", http://www.cisco.com/univercd/cc/td/doc/cisintwk/ito_doc/net_cach.htm.
[8] B. Cohen, BitTorrent, 2003, http://bitconjurer.org/BitTorrent.
[9] "The Coral Content Distribution Network", http://www.coralcdn.org/.
[10] "Gnutella Protocol Definition", http://rfcgnutella.sourceforge.net/.
[11] Internet Caching Resource Center, http://www.caching.com/caching101.htm.
[12] Internet Systems Consortium Inc., "ISC Domain Survey: Number of Internet Hosts", 2004, http://www.isc.org/index.pl?/ops/ds/host-count history.php.
[13] S. Jin and A. Bestavros, "Popularity-aware GreedyDual-Size Web Proxy Caching Algorithms", Proc. of ICDCS, 2000.
[14] L. Karsten and K. Sven, "Open-CHORD: Distributed and Mobile Systems Group", http://www.lspi.wiai.unibamberg.de/dmsg/software/open_chord/.
[15] Kazaa's Architecture, "How a Kazaa Client Finds a Song", http://computer.howstuffworks.
[16] B. Kim and K. Kim, "Keyword Search in DHT-Based Peer-to-Peer Networks", Proc. of 7th Int'l Conf. on Algorithms and Architectures for Parallel Processing, 2007, pp. 338-347.
[17] D. Kostić, A. Rodriguez, J. Albrecht, and A. Vahdat, "Bullet: High Bandwidth Data Dissemination Using an Overlay Mesh", Proc. of 19th ACM SOSP, 2003.
[18] T. Leighton, Akamai Technologies Inc., "The Challenges of Delivering Content on the Internet", Keynote Speech, IASTED Conf. on Parallel and Distributed Systems, 2002.
[19] K. Park and V. S. Pai, "Scale and Performance in the CoBlitz Large-File Distribution Service", Proc. of the 3rd Symp. on Networked Systems Design and Implementation, 2006.
[20] M. Rabinovich and O. Spatscheck, Web Caching and Replication, Addison Wesley, 2002.
[21] R. van Renesse, K. Birman, and W. Vogels, "Astrolabe: A Robust and Scalable Technology for Distributed System Monitoring, Management and Data Mining", ACM Transactions on Computer Systems, May 2003.
[22] S. van Rompaey, K. Spaey, and C. Blondia, "Bandwidth versus Storage Trade-off in a Content Distribution Network and a Single Server System", Proc. of the 7th Int'l Conf. on Telecommunications, 2003, pp. 315-320.
[23] W. Shi and Y. Mao, "Performance Evaluation of Peer-to-Peer Web Caching Systems", Journal of Systems and Software, May 2005, pp. 714-726.
[24] J. Stewart, personal communication (email) with C.-H. Lung, May 29, 2007.
[25] D. C. Verma, "Overview of Content Distribution Networks" and "Site Design and Scalability Issues in Content Distribution Networks", in Content Distribution Networks, New York, NY: John Wiley & Sons, Inc., 2002, pp. 1-61.
[26] D. C. Verma, "CDN Data Sharing Schemes", in Content Distribution Networks, John Wiley & Sons, Inc., 2002.
[27] "Web Caching: Making the Most of Your Internet Connection", http://www.web-cache.com/.
[28] B. Whitehead, A Scalable Anycast Technology for Caching Content Distribution Networks, Project Report, Dept. of Systems and Computer Engineering, Carleton University, Canada, 2004.
[29] K.-H. Yang and J.-M. Ho, "Proof: A Novel DHT-Based Peer-to-Peer Search Engine", IEICE Trans. on Communications, E90-B4, 2007, pp. 817-825.

Acknowledgements:
We are grateful to John Stewart at Computer Computing Services, and to Narendra Mehta and Dave Sword in the Department of Systems and Computer Engineering, for their trust, support, and encouragement.

