
Dissertation

ON

BENEFITS & DRAWBACKS


Of
HTTP DATA COMPRESSION

IN PARTIAL FULFILMENT

OF

MASTER’S DEGREE IN COMPUTER APPLICATIONS (M.C.A.)

GUJARAT UNIVERSITY

[GUIDE: Mr. Vrutik Shah]

INDUS INSTITUTE OF COMPUTER TECHNOLOGY & ENGINEERING

SUBMITTED BY:

Chintan Parikh

Nihar Dave

ACKNOWLEDGEMENT
This dissertation gives us a feeling of fulfilment and deep gratitude towards all our teachers, friends and colleagues. As the final frontier towards the achievement of our master's degrees, the activity of going through the dissertation has bridged the gap between academics and research work for us. It has prepared us to apply ourselves better and to become good computer professionals. Since we were beginners, it required a lot of support from different people. We acknowledge all the help we have received from so many people in accomplishing this project and wish to thank them.

We take this opportunity to thank Mr. H. K. Desai, director of I.I.T.E., for taking personal interest in the project and for guidance which often resulted in valuable tips. We also thank our guide, Mr. Vrutik Shah, for his regular guidance and his encouragement.

Our sincere thanks to our batch mates, who provided us with innumerable discussions on many technicalities and friendly tips; without their cordial support this activity would have been much tougher.


Table of Contents
1. Overview of HTTP Compression
1.1 Abstract
1.2 Introduction
1.3 Benefits of HTTP Compression
2.0 Process Flow
2.1 Negotiation Process
2.2 Client Sends a Request to the Server
2.3 Server Sends a Response to the Client
3.0 Compression Techniques
3.1 Popular Compression Techniques
3.2 Modem Compression
3.3 GZIP
3.3.1 Introduction
HTTP Request and Response (Uncompressed)
So what's the problem?
3.3.2 Purpose
3.4 HTTP Compression
3.5 Static Compression
3.6 Content and Transfer Encoding
4.0 HTTP's Support for Compression
4.1 HTTP/1.0
4.2 HTTP/1.1
4.3 Content Coding Values
5.0 Approaches to HTTP Compression
5.1 HTTP Compression on Servers
5.2 HTTP Compression on Software-based Load Balancers
5.3 HTTP Compression on ASIC-based Load Balancers
5.4 HTTP Compression on a Purpose-built HTTP Compression Device
6.0 Browser Support for Compression
7.0 Client Side Compression Issues
8.0 Web Server Support for Compression
8.1 IIS
8.2 Apache


9.0 Proxy Support for Compression
10.0 Related Work
11.0 Experiments
11.1 Compression Ratio Measurements
11.2 Web Server Performance Test
11.2.1 Apache Performance Benchmark
11.2.2 IIS Performance Benchmark
12.0 Summary / Suggestion
References


Table of Figures
3.1 HTTP Request & Response (uncompressed)
3.2 HTTP Request & Response (compressed)
11.1 The total page size (including HTML and embedded resources) for the top ten web sites
11.2 Benchmarking results for the retrieval of the Google HTML file from the Apache server
11.3 Benchmarking results for the retrieval of the Yahoo HTML file from the Apache server
11.4 Benchmarking results for the retrieval of the AOL HTML file from the Apache server
11.5 Benchmarking results for the retrieval of the eBay HTML file from the Apache server
11.6 Benchmarking results for the retrieval of the Google HTML file from the IIS server
11.7 Benchmarking results for the retrieval of the Yahoo HTML file from the IIS server
11.8 Benchmarking results for the retrieval of the AOL HTML file from the IIS server
11.9 Benchmarking results for the retrieval of the eBay HTML file from the IIS server
11.10 HTTP response header from an IIS server after a request for an uncompressed .asp resource
11.11 HTTP response header from an IIS server after a request for a compressed .asp resource


Table of Tables
11.1 Comparison of the total compression ratios of level 1 and level 9 gzip encoding for the indicated URLs
11.2 Estimated saturation points for the Apache web server based on repeated client requests for the indicated document
11.3 Average response time (in milliseconds) for the Apache server to respond to requests for compressed and uncompressed static documents


Chapter 1
1. Overview of HTTP Compression

1.1 Abstract
HTTP compression addresses some of the performance problems of the
Web by attempting to reduce the size of resources transferred between a server and client,
thereby conserving bandwidth and reducing user-perceived latency.

Currently, most modern browsers and web servers support some form of
content compression. Additionally, a number of browsers are able to perform streaming
decompression of gzipped content. Despite this existing support for HTTP compression,
it remains an underutilized feature of the Web today. This can perhaps be explained, in
part, by the fact that there currently exists little proxy support for the Vary header, which
is necessary for a proxy cache to correctly store and handle compressed content.

To demonstrate some of the quantitative benefits of compression, we
conducted a test to determine the potential byte savings for a number of popular web
sites. Our results show that, on average, 27% byte reductions are possible for these sites.

Finally, we provide an in-depth look at the compression features that exist
in Microsoft's Internet Information Services (IIS) 5.0 and the Apache 1.3.22 web server
and perform benchmarking tests on both applications.


1.2 Introduction
User-perceived latency is one of the main performance problems plaguing
the World Wide Web today. At one point or another, every Internet user has experienced
just how painfully slow the “World Wide Wait” can be. As a result, there has been a
great deal of research and development focused on improving Web performance.

Currently there exist a number of techniques designed to bring content
closer to the end user in the hopes of conserving bandwidth and reducing user-perceived
latency, among other things. Such techniques include prefetching, caching and content
delivery networks. However, one area that seems to have drawn only a modest amount
of attention involves HTTP compression.

Many Web resources, such as HTML, JavaScript, CSS and XML
documents, are simply ASCII text files. Given that such files often contain many
repeated sequences of identical information, they are ideal candidates for compression.
Other resources, such as JPEG and GIF images and streaming audio and video files, are
precompressed and hence would not benefit from further compression. As such, when
dealing with HTTP compression, focus is typically limited to text resources, which stand
to gain the most byte savings from compression.

Encoding schemes for such text resources must provide lossless data
compression. As the name implies, a lossless data compression algorithm is one that can
recreate the original data, bit-for-bit, from a compressed file. One can easily imagine how
the loss or alteration of a single bit in an HTML file could affect its meaning.

The goal of HTTP compression is to reduce the size of certain resources
that are transferred between a server and client. By reducing the size of web resources,
compression can make more efficient use of network bandwidth. Compressed content can
also provide monetary savings for those individuals who pay a fee based on the amount
of bandwidth they consume. More importantly, though, since fewer bytes are transmitted,
clients typically receive the resource in less time than if it had been sent
uncompressed. This is especially true for narrowband clients (that is, those users who are
connected to the Internet via a modem). Modems typically present what is referred to as
the weakest link, or longest mile, in a data transfer; hence methods to reduce download
times are especially pertinent to these users.

Furthermore, compression can potentially alleviate some of the burden
imposed by the TCP slow start phase. The TCP slow start phase is a means of controlling
the amount of congestion on a network. It works by forcing a small initial congestion
window on each new TCP connection, thereby limiting the number of maximum-size
packets that can initially be transmitted by the sender. Upon the reception of an ACK
packet, the sender's congestion window is increased. This continues until a packet is lost,
at which point the size of the congestion window is decreased. This process of increasing
and decreasing the congestion window continues throughout the connection in order to
constantly maintain an appropriate transmission rate. In this way, a new TCP connection
avoids overburdening a network with large bursts of data. Due to this slow start phase,
the first few packets that are transferred on a connection are relatively more expensive
than subsequent ones. Also, for the transfer of small files, a connection may never reach
its maximum transfer rate because the transfer may complete before it has the chance to
get out of the TCP slow start phase. So, by compressing a resource, more data effectively
fits into each packet. This in turn results in fewer packets being transferred, thereby
lessening the effects of slow start (reducing the number of server stalls) [3].
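To make the slow-start effect concrete, the following small sketch (ours, not part of any cited experiment) estimates how many round trips a simplified TCP sender needs to deliver a payload, assuming an initial congestion window of two 1,460-byte segments that doubles every round trip. Under these assumptions, a 30 KB page needs about four round trips, while its 6 KB gzipped form needs only about two.

public class SlowStartDemo {
    // Simplified model: cwnd starts at `initialCwnd` segments and doubles
    // each round trip; real TCP implementations differ in detail.
    static int roundTrips(int payloadBytes, int segmentSize, int initialCwnd) {
        int segmentsLeft = (payloadBytes + segmentSize - 1) / segmentSize;
        int cwnd = initialCwnd;
        int rtts = 0;
        while (segmentsLeft > 0) {
            segmentsLeft -= cwnd; // send one full window, then wait for ACKs
            cwnd *= 2;            // slow start: the window doubles per RTT
            rtts++;
        }
        return rtts;
    }

    public static void main(String[] args) {
        System.out.println("30 KB: " + roundTrips(30_000, 1460, 2) + " RTTs");
        System.out.println(" 6 KB: " + roundTrips(6_000, 1460, 2) + " RTTs");
    }
}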

In the case where an HTML document is sent in a compressed format, it is
probable that the first few packets of data will contain more HTML code, and hence a
greater number of inline image references, than if the same document had been sent
uncompressed. As a result, the client can issue requests for these embedded resources
more quickly, easing some of the slow start burden. Also, inline objects are likely to be
on the same server as the HTML document, so an HTTP/1.1 compliant browser may be
able to pipeline these requests onto the same TCP connection [4]. Thus, not only does
the client receive the HTML file in less time, he/she is also able to expedite the process
of requesting embedded resources [3].


Currently, most modern browsers and web servers support some form of
content compression. Additionally, a number of browsers are able to perform streaming
decompression of gzipped content. This means that, for instance, such a browser could
decompress and parse a gzipped HTML file as each successive packet of data arrives
rather than having to wait for the entire file to be retrieved before decompressing. Despite
all of the aforementioned benefits and the existing support for HTTP compression, it
remains an underutilized feature of the Web today. This can perhaps be explained, in
part, by the fact that there currently exists little proxy support for the Vary header
which, as we shall see later, is necessary for a proxy cache to correctly store and handle
compressed content.

1.3 Benefits of HTTP Compression


Many Internet users are concerned about speed and complain that the
Internet is too slow. They want Web pages to load faster. Modem users want the Web to
be faster, but aren't yet ready or able to upgrade to broadband service. Broadband users
enjoy the speed they have, but as new bandwidth-intensive content becomes more
available, they're looking for still more performance. For site operators, the answer to
these customer demands may be HTTP compression, which can offer modem and
broadband users 2 to 4 times their current page-load performance.

Most Web sites would benefit from serving compressed HTTP data. In fact,
most site images in GIF and JPEG formats are already compressed and do not compress
much more using HTTP compression. But what about the HTML code? Currently, the
base page is compressible text, as are JavaScript, cascading style sheets, XML, and
more. Using HTTP compression, static or on-the-fly-generated HTML content can be
reduced. A typical 30 KB HTML home page can be compressed to 6 KB, with no loss of
fidelity. This lossless compression provides for the exact same content to be rendered on
a user's browser, but the content is represented with fewer bits. This results in less time
being required to transfer fewer bits.


Using HTTP compression in production environments, outbound
bandwidth can typically be reduced by 40 to 60 percent. This compression level
is achieved by leveraging browsers' tendency to cache images, especially those with
expiration dates set in the HTTP headers. However, browsers do not typically cache the
base content of a page. When a user frequents a site, the images are usually pre-cached
by the browser, but the base-page content must be requested again, as it changes rapidly.
Pages tend to be much more dynamic and change far more often than images.

With a reduction in outbound data comes a corresponding cost
decrease. For example, if the reduction is 50 percent, then instead of peaking at 50 Mb/sec,
the peak is at 25 Mb/sec, freeing up bandwidth.

Clearly, HTTP compression offers sites and users tremendous advantages,
which is part of the reason that HTTP compression has been a W3C standard since 1999.
Browsers have also supported compression for years. In fact, most browsers, from IE 4.0
and Netscape 4.7 to the latest AOL, Mozilla, IE, and Netscape browsers, automatically
support compression in one form or another. The standards are set and the user base is in
place to take further advantage of HTTP compression, but the challenges are many.


Chapter 2
2.0 Process Flow

2.1 Negotiation Process


In order for a client to receive a compressed resource from a server via
content coding, a negotiation process similar to the following occurs:

First, the client (a web browser) includes with every HTTP request a
header field indicating its ability to accept compressed content. The header may look
similar to the following: “Accept-Encoding: gzip, deflate”. In this example the client is
indicating that it can handle resources compressed in either the gzip or deflate format.
The server, upon receiving the HTTP request, examines the Accept-Encoding field. If
the server supports none of the indicated encoding formats, it simply sends the
requested resource uncompressed. However, in the case where the server can handle such
encoding, it then decides whether it should compress the requested resource based on its
Content-Type or filename extension. In most cases it is up to the site administrator to
specify which resources the server should compress.

Usually one would only want to compress text resources, such as HTML
and CGI output, as opposed to images, as explained above. For this example we will
assume that the server supports gzip encoding and successfully determines that it should
compress the resource being requested. The server would include a field similar to
“Content-Encoding: gzip” in the reply header and then include the gzipped resource in
the body of the reply message. The client would receive the reply from the server,
analyze the headers, see that the resource is gzipped and then perform the necessary
decoding in order to produce the original, uncompressed resource [9].
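As a rough server-side sketch of this negotiation (ours, written against the standard Java servlet API; isCompressible is a hypothetical stand-in for the administrator's Content-Type rules), the decision described above might look like this:

import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class GzipNegotiation {
    // Hypothetical policy hook: typically only text resources are compressed.
    static boolean isCompressible(String contentType) {
        return contentType.startsWith("text/");
    }

    static void send(HttpServletRequest req, HttpServletResponse resp,
                     byte[] body, String contentType) throws IOException {
        resp.setContentType(contentType);
        String accept = req.getHeader("Accept-Encoding");
        if (accept != null && accept.contains("gzip") && isCompressible(contentType)) {
            // Client advertised gzip and the resource is worth compressing.
            resp.setHeader("Content-Encoding", "gzip");
            try (OutputStream out = new GZIPOutputStream(resp.getOutputStream())) {
                out.write(body);
            }
        } else {
            // Fall back to sending the resource uncompressed.
            resp.getOutputStream().write(body);
        }
    }
}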


2.2 Client Sends a Request to the Server


When a Web browser loads a Web page, it opens a connection to the Web server
and sends an HTTP request to the Web server. A typical HTTP request looks like this:

GET /index.html HTTP/1.1
Host: www.http-compression.com
Accept-Encoding: gzip
User-Agent: Firefox/3.6

With this request, the Web browser asks for the object "/index.html" on the
host "www.http-compression.com". The browser identifies itself as "Firefox/3.6" and
claims that it can understand HTTP responses in gzip format.

2.3 Server Sends a Response to the Client


After parsing and processing the client's request, the Web server may send
the HTTP response in compressed format. In that case, a typical HTTP response looks like this:

HTTP/1.1 200 OK
Server: Apache
Content-Type: text/html
Content-Encoding: gzip
Content-Length: 26395

With this response, the Web server tells the browser with status code 200
that it could fulfill the request. In the next line, the Web server identifies itself as Apache.
The "Content-Type" line says that the body is an HTML document. The "Content-Encoding"
response header informs the browser that the following data is compressed with gzip.
Finally, the length of the compressed data is stated.
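On the receiving side, a minimal client sketch (ours, using only standard JRE classes) shows what the browser effectively does with such a response: advertise gzip support, then inflate the body if the Content-Encoding header says it was compressed.

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.zip.GZIPInputStream;

public class GzipClient {
    public static void main(String[] args) throws Exception {
        HttpURLConnection conn = (HttpURLConnection)
                new URL("http://www.http-compression.com/index.html").openConnection();
        conn.setRequestProperty("Accept-Encoding", "gzip");

        InputStream body = conn.getInputStream();
        if ("gzip".equalsIgnoreCase(conn.getContentEncoding())) {
            body = new GZIPInputStream(body); // transparently inflate the entity body
        }
        try (BufferedReader in = new BufferedReader(new InputStreamReader(body))) {
            for (String line; (line = in.readLine()) != null; ) {
                System.out.println(line);
            }
        }
    }
}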


Chapter 3
3.0 Compression Techniques

3.1 Popular Compression Techniques


Though there exist many different lossless compression algorithms
today, most are variations of two popular schemes: Huffman encoding and the Lempel-
Ziv algorithm.

Huffman encoding works by assigning a binary code to each of the
symbols (characters) in an input stream (file). This is accomplished by first building a
binary tree of symbols based on their frequency of occurrence in the file. Binary codes
are then assigned to symbols in such a way that the most frequently occurring symbols
receive the shortest codes and the least frequently occurring symbols the longest codes.
This in turn creates a smaller compressed file.

The Lempel-Ziv algorithm, also known as LZ-77, exploits the redundant
nature of data to provide compression. The algorithm utilizes what is referred to as a
sliding window to keep track of the last n bytes of data seen. Each time a phrase is
encountered that exists in the sliding window buffer, it is replaced with a pointer to the
starting position of the previously occurring phrase in the sliding window, along with the
length of the phrase.

The main metric for data compression algorithms is the compression ratio,
which refers to the ratio of the size of the original data to the size of the compressed data.
For example, if we had a 100 kilobyte file and were able to compress it down to only 20
kilobytes, we would say the compression ratio is 5-to-1, or an 80% space savings. The
contents of a file, particularly the redundancy and orderliness of the data, can strongly
affect the compression ratio.
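As a small illustration (ours) of both ideas, the JRE's Deflater class implements DEFLATE, which combines the LZ-77 and Huffman techniques described above; running it over highly redundant, HTML-like input shows how redundancy drives the compression ratio. (String.repeat requires Java 11 or later.)

import java.util.zip.Deflater;

public class RatioDemo {
    public static void main(String[] args) {
        // Highly redundant input, similar in character to typical HTML markup.
        String html = "<table><tr><td>row</td></tr>".repeat(200) + "</table>";
        byte[] input = html.getBytes();

        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[input.length]; // ample for redundant input
        int compressedLen = deflater.deflate(buffer);
        deflater.end();

        System.out.printf("original: %d bytes, compressed: %d bytes, ratio: %.1f:1%n",
                input.length, compressedLen, (double) input.length / compressedLen);
    }
}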


3.2 Modem Compression


Users with low-bandwidth and/or high-latency connections to the Internet,
such as dialup modem users, are the most likely to perceive the benefits of data
compression. Modems currently implement compression on their own; however, it is
neither optimal nor as effective as HTTP compression. Most modems implement
compression at the hardware level, though it can also be built into software that interfaces
with the modem. The V.44 protocol, which is the current modem compression standard,
uses the Lempel-Ziv-Jeff-Heath (LZJH) algorithm, a variant of LZ-77, to perform
compression.

The LZJH algorithm works by constructing a dictionary, represented by a
tree, which contains frequently repeated groups of characters and strings. For each
occurrence of a string that appears in the dictionary, one need only output the dictionary
index, referred to as the codeword, rather than each individual character in the string.
When data transfer begins, the encoder (sender) and decoder (receiver) begin building
identical dictionary trees. Thus, whenever the decoder receives a codeword sent by the
encoder, it can use it as an index into the dictionary tree in order to rebuild the
original string.

Modem compression only works over small blocks of fixed-length data,
called frames, rather than over a large chunk of a file, as is the case with high-level
algorithms such as gzip. An algorithm constantly monitors the compressibility of these
frames to determine whether it should send the compressed or uncompressed version of
each frame. Modems can quickly switch between compressed and transparent mode
through use of a special escape code.
Furthermore, modems only offer point-to-point compression. It is only
when the data is transferred on the analog line, usually between an end user and an
Internet Service Provider, that the data is compressed. HTTP compression would thus
generally be considered superior, since it provides an end-to-end solution between the
origin server and the end user.

Mogul et al. carried out an experiment to compare the performance of a
high-level compression algorithm with that of modem compression. To do this, several
plaintext and gzipped HTML files were transferred over a 28.8 Kbps modem that
supported the V.42bis compression protocol, a predecessor to the V.44 protocol.
The algorithm used in the V.42bis protocol is similar to that of the V.44 protocol
described above; however, it is less efficient in terms of speed and memory requirements
and usually achieves a lower compression ratio for an identical resource compared to
V.44. The size of the plaintext documents in the experiment ranged from approximately
six to 365 kilobytes. Seven trials were run for each HTML file, and the average transfer
time in every case was lower for the gzipped file than for the plaintext file. The authors
concluded that, while the tests showed that modem compression does work, it is not
nearly as effective as a high-level compression algorithm for reducing transfer time.

In [3], a 42 kilobyte HTML file, consisting of data combined from the
Microsoft and Netscape home pages, was transferred in uncompressed and compressed
form over a 28.8 kbps modem. The results showed that using high-level compression
rather than standard modem compression resulted in a 68% reduction in the total number
of packets transferred and a 64% reduction in total transfer time.

An experiment was also conducted to compare the performance of modems
that support the V.42bis and V.44 compression protocols with that of PKZIP, a high-
level compression program. The experiment involved measuring the throughput
achieved by several modems in transferring files of varying type, such as a text file, an
uncompressed graphics file, and an executable file, via FTP. Each of these files was also
compressed using the PKZIP program. The results showed the high-level compression
program, PKZIP, to be significantly more effective than either of the modem
compression algorithms. Specifically, the performance gain of the V.44 modem over the
V.42bis modem was approximately 29%, whereas the performance gain of PKZIP over
the V.44 modem was approximately 94%. Clearly, there was a more drastic performance
difference between PKZIP and the V.44 modem than between the V.44 and V.42bis
modems. Thus, the author concluded that while the V.44 modems provided better
performance than the V.42bis modems, the high-level compression algorithm, PKZIP,
was significantly more effective than modem compression.


3.3 GZIP

3.3.1 Introduction
This specification defines a lossless compressed data format that is
compatible with the widely used GZIP utility. The format includes a cyclic redundancy
check value for detecting data corruption. The format presently uses the DEFLATE
method of compression but can be easily extended to use other compression methods.
The format can be implemented readily in a manner not covered by patents.

Before we start, I should explain what content encoding is. When you
request a file like http://www.yahoo.com/index.html, your browser talks to a web server.
The conversation goes a little like this:

HTTP Request and Response (Uncompressed)

So what’s the problem?


Well, the system works, but it's not that efficient. 100 KB is a lot of text,
and frankly, HTML is redundant. Every <html>, <table> and <div> tag has a closing tag
that's almost the same. Words are repeated throughout the document. Any way you slice
it, HTML (and its beefy cousin, XML) is not lean.

And what's the plan when a file's too big? Zip it! [5]

If we could send a .zip file to the browser (index.html.zip) instead of plain
old index.html, we'd save on bandwidth and download time. The browser could
download the zipped file, extract it, and then show it to the user, who's in a good mood
because the page loaded quickly [5]. The browser-server conversation might look like
this:

Compressed HTTP Request and Response

3.3.2 Purpose
The purpose of this specification is to define a lossless compressed data format that:

 Is independent of CPU type, operating system, file system, and character set, and
hence can be used for interchange;
 Can compress or decompress a data stream (as opposed to a randomly accessible
file) to produce another data stream, using only an a priori bounded amount of
intermediate storage, and hence can be used in data communications or similar
structures such as Unix filters;
 Compresses data with efficiency comparable to the best currently available
general-purpose compression methods, and in particular considerably better than
the “compress” program;
 Can be implemented readily in a manner not covered by patents, and hence can be
practiced freely;
 Is compatible with the file format produced by the current widely used gzip
utility, in that conforming decompressors will be able to read data produced by
the existing gzip compressor.


The data format defined by this specification does not attempt to:

 Provide random access to compressed data;
 Compress specialized data (e.g., raster graphics) as well as the best currently
available specialized algorithms.

GZIP is a freely available compressor, available within the JRE and the SDK as
java.util.zip.GZIPInputStream and java.util.zip.GZIPOutputStream.
Command-line versions are available with most Unix operating systems and Windows
Unix toolkits (Cygwin and MKS), and they are downloadable for a plethora of operating
systems at http://www.gzip.org/.
One can get a higher degree of compression by using gzip to compress an uncompressed
jar file rather than an already-compressed jar file; the downside is that the file may be stored
uncompressed on the target systems.
Here is an example:
Compressing with gzip a jar file containing individually deflated entries:
Notepad.jar       46.25 kb
Notepad.jar.gz    43.00 kb
Compressing with gzip a jar file containing "stored" (uncompressed) entries:
Notepad.jar      987.47 kb
Notepad.jar.gz    32.47 kb
As you can see, the download size can be reduced by 14% using an uncompressed jar, versus
3% using a compressed jar file.
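For completeness, here is a brief sketch (ours) of the JRE classes named above, gzip-compressing a file much as the command-line gzip utility would:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

public class GzipFile {
    public static void main(String[] args) throws IOException {
        try (FileInputStream in = new FileInputStream("Notepad.jar");
             GZIPOutputStream out = new GZIPOutputStream(
                     new FileOutputStream("Notepad.jar.gz"))) {
            byte[] buf = new byte[8192];
            for (int n; (n = in.read(buf)) > 0; ) {
                out.write(buf, 0, n); // DEFLATE-compressed, gzip-framed output
            }
        }
    }
}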

3.4 HTTP Compression

HTTP compression is the technology used to compress content from a Web
server (also known as an HTTP server). The Web server content may be in the form of
any of the many available MIME types: HTML, plain text, image formats, PDF files,
and more. HTML and image formats are the most widely used MIME formats in a Web
application.


Most images used in Web applications (for example, GIF and JPG) are
already in compressed format and do not compress much further; certainly no discernible
performance gain results from another incremental compression of these files. However,
static or on-the-fly-generated HTML content contains only plain text and is ideal for
compression.

The focus of HTTP compression is to enable the Web site to serve fewer bytes of
data. For this to work effectively, a couple of things are required:

 The Web server should compress the data

 The browser should decompress the data and display the pages in the usual
manner

This is obvious. Of course, the process of compression and decompression should
not consume a significant amount of time or resources.

So what's the hold-up in this seemingly simple process? The recommendations for
HTTP compression were stipulated by the IETF (Internet Engineering Task Force) while
specifying the protocol specifications of HTTP 1.1. The publicly available gzip
compression format was intended to be the compression algorithm. Popular browsers
implemented the decompression feature early on and were ready to receive the
encoded data (as per the HTTP 1.1 protocol specifications), but HTTP compression on
the Web server side was not implemented as quickly, nor in as serious a manner.

3.5 Static Compression

If the Web content is pre-generated and requires no server-side dynamic
interaction with other systems, the content can be pre-compressed and placed on the Web
server, with these compressed pages being delivered to the user. Publicly available
compression tools (gzip, Unix compress) can be used to compress the static files.

Static compression, though, is not useful when the content has to be generated
dynamically, such as on e-commerce sites or on sites which are driven by applications
and databases. The better solution is to compress the data on the fly.


3.6 Content and Transfer Encoding

The IETF's standard for compressing HTTP contents includes two levels of
encoding: content encoding and transfer encoding. Content encoding applies to methods
of encoding and compression that have already been applied to documents before the
Web user requests them. This is also known as pre-compressing pages, or static
compression. This concept never really caught on because of the complex file-
maintenance burden it represents, and few Internet sites use pre-compressed pages.

On the other hand, transfer encoding applies to methods of encoding during the
actual transmission of the data.

In modern practice the difference between content and transfer encoding is
blurred, since the pages requested often do not exist until after they are requested (they are
created in real time). Therefore the encoding always has to happen in real time.

The browsers, taking the cue from IETF recommendations, implemented the
Accept-Encoding feature by 1998-99. This allows browsers to receive and decompress
files compressed using the public algorithms. In this case, the HTTP request header fields
sent from the browser indicate that the browser is capable of receiving encoded
information. When the Web server receives this request, it can:

1. Send pre-compressed files as requested. If they are not available, then it can:

2. Compress the requested static files, send the compressed data, and keep the
compressed file in a temporary directory for further requests; or

3. If transfer encoding is implemented, compress the Web server output on the fly.

As mentioned, pre-compressing files, as well as real-time compression of static
files by the Web server (the first two options above), never caught on because of the
complexities of file maintenance, though some Web servers supported these functions to
an extent.

The feature of compressing Web server dynamic output on the fly wasn't
seriously considered until recently, since its importance is only now being realized. So,
sending dynamically compressed HTTP data over the network has remained a dream,
even though many browsers were ready to receive the compressed formats.


Chapter 4
4.0 HTTP’s Support for Compression

HTTP compression has been around since the days of HTTP/1.0. However,
it was not until the last two years or so that support for this feature was added to most
major web browsers and servers.

4.1 HTTP/1.0

In HTTP/1.0, the manner in which the client and server negotiate which
compression format, if any, to use when transferring a resource is not well defined.
Support for compression exists in HTTP/1.0 by way of content-coding values in the
Content-Encoding and, to an extent, the Accept-Encoding header fields.

The Content-Encoding entity header was included in the HTTP/1.0
specification as a way for the message sender to indicate what transformation, if any, had
been applied to an entity and hence what decoding must be performed by the receiver in
order to obtain the original resource. Content-Encoding values apply only to end-to-end
encoding.

The Accept-Encoding header was included as part of the Additional
Header Field Definitions in the appendix of HTTP/1.0. Its purpose is presumably to
restrict the content codings that the server can apply to a client's response; however, this
header field is explained in a single sentence and there is no specification as to how the
server should handle it.

4.2 HTTP/1.1

HTTP/1.1 extended support for compression by expanding on the content-
coding values and adding transfer-coding values. Much like in HTTP/1.0, content-coding
values are specified within the Accept-Encoding and Content-Encoding headers in
HTTP/1.1 [6].


The Accept-Encoding header is more thoroughly defined in HTTP/1.1 and
provides a way for the client to indicate to the server which encoding formats it supports.
HTTP/1.1 also specifies how the server should handle the Accept-Encoding field if it is
present in a request [6].

Much like content-coding values, transfer-coding values, as defined in
HTTP/1.1, provide a way for communicating parties to indicate the type of encoding
transformation that can be, or has been, applied to a resource. The difference is that
transfer-coding values are a property of the message and not of the entity. That is,
transfer-coding values can be used to indicate the hop-by-hop encoding that has been
applied to a message [6].

Transfer-coding values are specified in HTTP/1.1 by the TE and Transfer-
Encoding header fields. The TE request header field provides a way for the client to
specify which encoding formats it supports. The Transfer-Encoding general-header field
indicates what transformation, if any, has been applied to the message body [6].
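For illustration (our own example, not reproduced from the specification), a hop-by-hop compressed exchange using these header fields might look like this:

GET /index.html HTTP/1.1
Host: www.example.com
TE: gzip, deflate

HTTP/1.1 200 OK
Content-Type: text/html
Transfer-Encoding: gzip, chunked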

To summarize, HTTP/1.1 allows compression to occur on either an
end-to-end basis, via content-coding values, or on a hop-by-hop basis, via transfer-coding
values [6]. In either case, compression is transparent to the end user.

4.3 Content Coding Values

The Internet Assigned Numbers Authority (IANA) was designated in
HTTP/1.1 as a registry for content-coding value tokens. The HTTP/1.1 specification
defined the initial tokens to be: deflate, compress, gzip and identity [6].

The deflate algorithm is based on both Huffman coding and LZ-77
compression [7].

“Compress” was a popular UNIX file compression program. It uses the
LZW algorithm, which is covered by patents held by UNISYS and IBM [6].


Gzip is an open-source, patent-free compression utility that was designed
to replace “compress”. Gzip currently uses a variation of the LZ-77 algorithm by default,
though it was designed to handle several compression algorithms [7]. Gzip works by
replacing duplicated patterns with a (distance, length) pointer pair. The distance pointer is
limited to the previous 32 kilobytes, while the length is limited to 258 bytes [7].

The identity token indicates the use of default encoding, that is, no
transformation at all.


Chapter 5
5.0 Approaches to HTTP Compression

5.1 HTTP Compression on Servers

At first glance, the server level seems like a logical place to perform
HTTP compression. After all, servers have already implemented the HTTP protocol and
know how to communicate with clients and handle retransmits, chunking, HTTP/1.0,
HTTP/1.1, and more. But a Web server is not the best-performing piece of equipment in
the network.

Typically, each server operates rather slowly and at a relatively low
capacity. HTTP compression is hard, computationally intensive work, and far too much
work for servers. Tasking servers to perform HTTP compression results in even longer
response times and still lower throughput, demonstrating that HTTP compression on the
server is not the optimal solution.

5.2 HTTP Compression on Software-based Load Balancers

In software-based load balancers, the TCP/IP stack is optimized for store-
and-forward operations. The device gets a packet in, inspects it as little as possible (often
just the destination address information), and then forwards the unmodified packet to an
origin server. The origin server then communicates with the client over the HTTP
protocol, while the load balancer simply shuttles packets back and forth between the
origin server and client, without regard to the complex elements of the HTTP protocol.

This store-and-forward design contrasts with that of a server or cache device, which
must know the complete HTTP protocol and be able to conduct an HTTP dialog with the
client. While a software-based server load balancer may have impressive specifications
for storing and forwarding packets, it does not know how to communicate with clients,
handle re-transmits, chunking, HTTP/1.0, HTTP/1.1, and other functions. Software load
balancer vendors have acknowledged this fundamental incongruity between the two
operations of load balancing and doing full HTTP protocol work. In response, they have
offered standalone caching devices rather than attempting to integrate caching
functionality into the load balancer.

HTTP compression is even more demanding than HTTP caching. In
caching, a device merely needs to store the exact contents delivered to it by the origin
server. Some caches even store the whole packets provided by the server. In contrast,
HTTP compression demands that new data be generated for individual user devices. This
is necessary because not every user device supports compression, and devices that do
support compression do not necessarily support it in every particular instance.

5.3 HTTP Compression on ASIC-based Load Balancers

ASIC-based load balancers were developed to extend the capacity and
performance of store-and-forward load balancing. In practice, even though these devices
are ASIC-based, they rely on general-purpose CPUs to perform the Layer 7 functionality.
Capacity and performance when performing Layer 7 functions are greatly decreased
compared to Layer 4 load balancing.

ASIC-based load balancers use general-purpose CPUs for Layer 7
functionality because the HTTP/1.1 protocol is more complex than the TCP/IP protocol
and would require an unrealistic number of gates to execute on an ASIC. The ever-
changing demands of browsers and servers, and the potential security vulnerabilities that
require regular modifications, make it impractical to implement the HTTP/1.1 protocol in
an ASIC design.

Onboard memory limitations further restrict the amount of Layer 7
information an ASIC-based load balancer can gather and process. For example, some load
balancers support processing only 256 bytes of a cookie, but a cookie can be up to 4,000
bytes long. The fundamental architecture of these devices is designed to process only
header information while ignoring the actual data portion of the packet.


Even if a packet-level compression engine were to be developed, this
design would work with a maximum of 1460 data bytes, which could be compressed at a
ratio of roughly 2:1, rather than the 5:1 or even 10:1 ratios that can be achieved by
processing at the application layer. Furthermore, any compression at the packet level
would simply lead to smaller packets, not fewer packets. Given TCP's reliance on
packet acknowledgements (ACKs) and dependence on round-trip times (RTTs), by not
reducing the packet count, RTT and ACK requirements remain constant. Therefore, the
total download time would be virtually identical whether the packets were filled at 100
percent or at 50 percent. The minimal Layer 7 functionality and intentionally limited
architectural design of ASIC-based load balancers do not provide the platform
necessary for HTTP compression technology.

5.4 HTTP Compression on a Purpose-built HTTP Compression Device

A dedicated HTTP compression device is the best place to perform HTTP
compression. The device must be a full proxy server, able to speak the HTTP/1.0
and HTTP/1.1 protocols, and have enough resources available to handle the
computationally intensive compression work. The device should also own the connection
to the end user, taking responsibility for resending dropped packets so that the origin
server is completely offloaded.

Some first-generation approaches use caching techniques and attempt to
cache the content to overcome the challenge of compressing content “on the fly.” Other
early efforts cannot overcome the computational latency involved in compressing data,
and thus do not compress for high-speed users. An unfortunate side effect of this
technique is that users coming through proxy servers, such as AOL users, appear to be
high-speed users and, as such, the device will not perform any compression despite the
reality that the end user is connected via a slow link. Thus, in first-generation approaches,
slow users do not experience a faster download and sites do not realize bandwidth
savings.
Despite the limitations of these initial attempts, a dedicated solution is still
the best approach. The right solution consists of a core platform that can:


 Handle tens or hundreds of thousands of simultaneous, persistent connections,
completely owning both the TCP and HTTP interaction with the client, including
re-transmits
 Manage client TCP connections in such a way as to eliminate unnecessary control
packets and further delays caused by TCP’s slow start mechanisms
 Deliver equal benefits for static as well as personalized data by avoiding the use
of caching technology
 Deliver equal benefits for slow users as well as high-speed users by eliminating
the computational latency of HTTP compression
 Request content from multiple origin servers and guarantee even workload across
all servers
 Employ high-speed SSL capabilities to provide benefits for all secure transactions
 Guarantee user stickiness in a total end-to-end SSL environment

With this platform, HTTP compression can be conducted at wire speed,
and origin Web servers can get data out to the HTTP compression device faster, resulting
in a net decrease in latency for the entire transaction. In this way, adding another box in
the network actually decreases every step of the transaction time, starting with the very
first byte of data received by the client. It also becomes possible for an HTTP
compression device to deliver SSL content faster than a Web server can deliver clear text
to both modem and broadband users.


Chapter 6
6.0 Browser Support for Compression
Support for content coding is inherent in most of today's popular web
browsers. Support for content coding has existed in the following browsers since the
indicated version: Netscape 4.5+ (including Netscape 6), all versions of Mozilla, Internet
Explorer 4.0+, Opera 4.0+ and Lynx 2.8+. However, there have been reports that Internet
Explorer for the Macintosh cannot correctly handle gzip-encoded content. Unfortunately,
support for transfer coding is lacking in most browsers.

In order to verify some of these claims regarding browser support for
content coding, a test was conducted using a Java program called BrowserSpy. When
executed, this program binds itself to an open port on the user's machine. The user can
then issue HTTP requests from a web browser to this program. The program then returns
an HTML file to the browser indicating the HTTP request headers it had received from
the browser.
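A rough sketch (ours, not the actual BrowserSpy source) of such a tool: it accepts one HTTP request on a local port and echoes the received request headers back as an HTML page.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

public class HeaderEcho {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(8080);
             Socket client = server.accept()) {
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(client.getInputStream()));
            StringBuilder headers = new StringBuilder();
            // An empty line marks the end of the HTTP request headers.
            for (String line; (line = in.readLine()) != null && !line.isEmpty(); ) {
                headers.append(line).append("<br>\n");
            }
            PrintWriter out = new PrintWriter(client.getOutputStream(), true);
            out.print("HTTP/1.0 200 OK\r\nContent-Type: text/html\r\n\r\n");
            out.println("<html><body>" + headers + "</body></html>");
            out.flush();
        }
    }
}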

For this test we used all of the web browsers that were accessible to us in
order to see exactly what HTTP request headers these browsers were sending. The header
fields and values of greatest interest include the HTTP version number and the
Accept-Encoding and TE fields.

First, Lynx version 2.8.4dev.16 running under Linux sends:

GET / HTTP/1.0
Accept-Encoding: gzip, compress


Next, Opera, Netscape and Internet Explorer were tested under various versions
of Windows. The Opera 5.11 request header contained the following relevant fields:
GET / HTTP/1.1
Accept-Encoding: deflate, gzip, x-gzip, identity, *;q=0
Connection: Keep-Alive, TE
TE: deflate, gzip, chunked, identity, trailers

Internet Explorer 5.01, 5.5 SP2 and 6.0 all issued the same relevant
request information:
GET / HTTP/1.1
Accept-Encoding: gzip, deflate

Finally, the Netscape 4.78 request header contained the following relevant
fields:
GET / HTTP/1.0
Accept-Encoding: gzip

Whereas Netscape 6.2.1 issued:

GET / HTTP/1.1
Accept-Encoding: gzip, deflate, compress; q=0.9;

Notice that gzip is the most common content-coding value and, in fact, the
only value to appear in the request header of every browser tested. Also, observe that
Lynx and Netscape 4.x only support the HTTP/1.0 protocol. For reasons that will be
explained later, some compression-enabled servers may choose not to send compressed
content to such clients. Finally, note that Opera appears to be the only browser that
supports transfer coding, as it is the only one that includes the TE header with each HTTP
request.


Chapter 7
7.0 Client Side Compression Issues

When receiving compressed data in either the gzip or deflate format, most
web browsers are capable of performing streaming decompression. That is,
decompression can be performed as each successive packet arrives at the client end,
rather than having to wait for the entire resource to be retrieved before performing the
decompression [8].

This is possible in the case of gzip because, as was mentioned above, gzip
performs compression by replacing previously encountered phrases with pointers. Thus,
each successive packet the client receives can only contain pointers to phrases in the
current or previous packets, never future packets.

The CPU time necessary for the client to uncompress data is minimal,
usually only a fraction of the time needed to compress the data. Thus the
decompression process adds only a small amount of latency on the client's end.
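A sketch (ours) of streaming decompression using the GZIPInputStream class mentioned earlier: each read returns as soon as some decompressed bytes are available, so a browser-like client can begin parsing HTML before the transfer completes.

import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

public class StreamingDecompress {
    static void consume(InputStream networkStream) throws IOException {
        try (GZIPInputStream gzip = new GZIPInputStream(networkStream)) {
            byte[] chunk = new byte[4096];
            for (int n; (n = gzip.read(chunk)) != -1; ) {
                // Hand each decompressed chunk to the parser immediately,
                // rather than buffering the whole document first.
                parse(chunk, n);
            }
        }
    }

    static void parse(byte[] data, int len) {
        System.out.println("parsed " + len + " decompressed bytes");
    }
}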

A test was conducted to verify that some of the major web browsers do in
fact perform streaming decompression. To begin, a single web page consisting of an
extremely large amount of text was constructed. Along with the text, four embedded
images were included. Three of the images were placed near the top of the web page and
the fourth at the bottom. The web page was designed to be as simple as possible, consisting
only of a few basic HTML tags and image references. The page, which was uploaded to a
compression-enabled web server, ended up containing over 550 kilobytes of text.

Three popular web browsers, Opera 5.11, Netscape 4.78 and Internet
Explorer 5.01, all running in Windows 2000, were each used to issue an HTTP request
for the web page. Using Ethereal, a packet-sniffing program, an analysis of the packet log
for each of the three browsers was performed. In all three cases, the browsers, after
receiving a few dozen packets containing the compressed HTML file, issued requests for
the first three embedded image files. Clearly, these requests were issued well before the
entire compressed HTML file had been transferred, proving that the major
browsers, running in Windows, do indeed perform streaming decompression of gzip
content-encoded data.


Chapter 8
8.0 Web Server Support for Compression
The analyses in this report focus exclusively on the two most popular web
servers, namely Apache and IIS. Based on a survey of over 38 million web sites, Netcraft
reported that Apache and IIS comprise over 85% of the hosts on the World Wide Web as
of March 2002. Two alternate web servers, Flash and TUX, were also initially considered
for inclusion in this report; however, based on the information provided in their
respective specification documents, Flash does not appear to support HTTP compression
in any form, and TUX can only serve static resources that exist as a precompressed file on
the web server. As a result, neither program was selected for further analysis.

8.1 IIS
Microsoft’s Internet Information Services (IIS) 5.0 web servers provides
built in support for compression; however it is not enabled by default. The process of
enabling compression is straightforward and involves changing a single configuration
setting in IIS. An IIS server can be configured to compress both static and dynamic
documents. However, compression must be applied to the entire server and cannot be
activated on a directory-by directory basis. IIS has built in support for the gzip and
deflate compression standards and includes a mechanism through which customized
compression filters can be added. Also, only the Server families of the Windows
operating system have compression capability, as the Professional family is intended for
use as a development platform and not as a production server...

In IIS, static content is compressed in two passes. The first time a static
document is requested it is sent uncompressed to the client and also added to a
compression queue where a background process will compress the file and store it in a
temporary directory. Then, on subsequent accesses, the compressed file can be retrieved
from this temporary directory and sent to the client.
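
The scheme amounts to something like the following sketch (our own simplification in Python; the directory and queue names are illustrative, not IIS internals):

import gzip
import os
import queue
import threading

CACHE_DIR = "/tmp/compression_cache"   # stand-in for the IIS temporary directory
os.makedirs(CACHE_DIR, exist_ok=True)
to_compress = queue.Queue()

def serve_static(path):
    """Return (body, is_compressed) for a static file request."""
    cached = os.path.join(CACHE_DIR, os.path.basename(path) + ".gz")
    if os.path.exists(cached):             # second and later accesses
        return open(cached, "rb").read(), True
    to_compress.put(path)                  # first access: queue background work
    return open(path, "rb").read(), False  # ...and serve it uncompressed now

def background_compressor():
    # Drains the queue, compressing each file into the temporary directory.
    while True:
        path = to_compress.get()
        out = os.path.join(CACHE_DIR, os.path.basename(path) + ".gz")
        with open(path, "rb") as src, open(out, "wb") as dst:
            dst.write(gzip.compress(src.read()))

threading.Thread(target=background_compressor, daemon=True).start()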


Dynamic content, on the other hand, is always compressed on-demand,
since IIS does not cache such content.

Thus, in IIS the cost of compressing a static document is negligible, as this
cost is incurred only once for each document. Since dynamic content is compressed
on demand, it imposes a slightly greater overhead on the server.

By default, IIS does not send compressed content to HTTP/1.0 clients.
Additionally, all compressed content is sent by IIS with an Expires header indicating an
expiration date of January 1, 1997. By always setting the expiration date in the past,
proxies are forced to validate the object on every request before it can be served to the
client.
However, IIS does allow the site administrator to set the Max-Age header for dynamic,
compressed documents. The HTTP/1.1 specification stipulates that the Max-Age header
overrides the Expires header, even if the Expires header is more restrictive [6]. As we
will see in Chapter 9, this scheme may have been designed intentionally so as to allow
HTTP/1.1 clients to get a cached version of a compressed object and at the same time to
prevent HTTP/1.0 clients from getting stale or bad content.

8.2 Apache

Apache does not provide a built-in mechanism for HTTP compression.
However, there is a fairly popular and extensively tested open-source module, called
mod_gzip, which provides HTTP compression for Apache.

Enabling compression is fairly straightforward, as mod_gzip can either be
loaded as an external Apache module or compiled directly into the Apache web server.

Since mod_gzip is a standard Apache module, it runs on any platform supported by the
Apache server. Much like IIS, mod_gzip uses existing content-coding standards as
described in HTTP/1.1. Unlike IIS, mod_gzip allows compression to be activated
on a per-directory basis, thus giving the website administrator greater control over the
compression functionality for his/her site(s). Mod_gzip, like IIS, can compress both static
and dynamic documents.


In the case of static documents, mod_gzip can first check to see if a
precompressed version of the file exists and, if it does, send this version. Otherwise
mod_gzip will compress the document on-the-fly. In this way, mod_gzip differs from IIS,
because mod_gzip can compress a static document on its first access. Also, mod_gzip can
be configured to save the compressed files to a temporary directory; if such a
directory is not specified, the static document will simply be compressed on every access.
Mod_gzip is purported to support nearly every type of CGI output, including Perl, PHP,
ColdFusion, compiled C code, etc. Both IIS and mod_gzip allow the administrator to
specify which static and dynamic content should and should not be compressed, based on
the MIME type or file name extension of the resource. Also, unlike IIS, mod_gzip by
default sends compressed content to clients regardless of the HTTP version they support.
This can easily be changed so that only HTTP/1.1-compliant clients receive
compressed content.


Chapter 9
9.0 Proxy Support for Compression
Currently one of the main problems with HTTP compression is the lack of
proxy cache support. Many proxies cannot handle the Content-Encoding header and
hence simply forward the response to the client without caching the resource. As was
mentioned above, IIS attempts to ensure compressed documents are not served stale by
setting the Expires time in the past.

Caching was handled in HTTP/1.0 by storing and retrieving resources
based on the URI. This, of course, proves inadequate when multiple versions of the same
resource exist - in this case, a compressed and uncompressed representation.

This problem was addressed in HTTP/1.1 with the inclusion of the Vary
response header. A cache could then store both a compressed and uncompressed version
of the same object and use the Vary header to distinguish between the two. The Vary
header is used to indicate which response headers should be analyzed in order to
determine the appropriate variant of the cached resource to return to the client.
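
In effect, the cache keys each stored variant on the URI plus the request-header values named by Vary. A simplified model of this lookup (our own illustration, not any particular proxy's code):

def cache_key(uri, request_headers, vary):
    # One cache slot per combination of URI and the request-header
    # values listed in the response's Vary header.
    varied = tuple(
        request_headers.get(name.strip().lower(), "")
        for name in vary.split(",")
    )
    return (uri, varied)

# With "Vary: Accept-Encoding", the compressed and uncompressed
# variants of the same URI occupy different cache slots:
plain  = cache_key("/index.html", {}, "Accept-Encoding")
packed = cache_key("/index.html", {"accept-encoding": "gzip"}, "Accept-Encoding")
assert plain != packed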

Looking ahead to Figure 11.11, we can see that IIS already sends the Vary
header with its compressed content. The most recent version of mod_gzip does not set the
Vary header. However, a patch was posted on a mod_gzip discussion board in April 2001
that incorporates support for the Vary header into mod_gzip. A September 2001 message
posted to another discussion stated that mod_gzip had been ported to Apache 2.0 and
includes support for the Vary header.

Based on a quick search of the Web, the latest stable release of Squid,
version 2.4, was the only proxy cache that appeared to support the Vary header. Since
such support was only recently included in Squid, it is likely that many proxy caches are
not running this latest version and hence cannot cache negotiated content. Thus the full
potential of HTTP compression cannot be realized until more proxies are able to cache
compressed content.


Chapter 10
10.0 Related Work
Mogul et al. quantified the potential benefits of delta encoding and data
compression for HTTP by analyzing live traces from Digital Equipment Corporation
(DEC) and an AT&T Research Lab. The traces were filtered in an attempt to remove
requests for precompressed content; for example, references to GIF, JPEG and MPEG
files. The authors then estimated the time and byte savings that could have been achieved
had the HTTP responses to the clients been delta encoded and/or compressed. The
authors determined that in the case of the DEC trace, of the 2465 MB of data analyzed,
965 MB, or approximately 39%, could have been saved had the content been gzip
compressed. For the AT&T trace, 1054 MB, or approximately 17%, of the total 6216 MB
of data could have been saved. Furthermore, retrieval times could have been reduced
22% and 14% in the DEC and AT&T traces, respectively. The authors remarked that they
felt their results demonstrated a significant potential improvement in response size and
response delay as a result of delta encoding and compression.

In [8], the authors attempted to determine the performance benefits of
HTTP compression by simulating a realistic workload environment. This was done by
setting up a web server and replicating the CNN site on this machine. The authors then
accessed the replicated CNN main page and ten subsections within this page (e.g. World
News, Weather, etc.), emptying the cache before each test. Analysis of the total time to
load all of these pages showed that when accessing the site over a 28.8 kbps modem, gzip
content coding resulted in 30% faster page loads. They also experienced 35% faster page
loads when using a 14.4 kbps modem.

Finally, in [4] the authors attempted to determine the performance effects
of HTTP/1.1. Their tests included an analysis of the benefits of HTTP compression via
the deflate content coding. The authors created a test web site that combined data from the
Netscape and Microsoft home pages into a page called “MicroScape”. The HTML for
this new page totaled 42 KB, with 42 inline GIF images totaling 125 KB. Three different


network environments were used to perform the test: a Local Area Network (high
bandwidth, low latency), a Wide Area Network (high bandwidth, high latency) and a 28.8
kbps modem (low bandwidth, high latency). The test involved measuring the time
required for the client to retrieve the MicroScape web page from the server, parse it, and, if
necessary, decompress the HTML file on-the-fly and retrieve the 42 inline images. The
results showed significant improvements for those clients on low bandwidth and/or high
latency connections. In fact, looking at the results from all of the test environments,
compression reduced the total number of packets transferred by 16% and the download
time for the first time retrieval of the page by 12%.


Chapter 11
11.0 Experiments
We will now analyze the results from a number of tests that were
performed in order to determine the potential benefits and drawbacks of HTTP
compression.

11.1 Compression Ratio Measurements


The first test that was conducted was designed to provide a basic idea of
the compression ratio that could be achieved by compressing some of the more popular
sites on the Web. The objective was to determine how many fewer bytes would need to
be transferred across the Internet if web pages were sent to the client in a compressed
form. To determine this we first found a web page that ranks the Top 99 sites on the Web
based on the number of unique visitors. Although the rankings had not been updated
since March 2001 most of the indicated sites are still fairly popular. Besides, the intent
was not to find a definitive list of the most popular sites but rather to get a general idea of
some of the more highly visited ones. A freely available program called wget was used
to retrieve pages from the Web and Perl scripts were written to parse these files and
extract relevant information.

The steps involved in carrying out this test consisted of first fetching the
web page containing the list of Top 99 web sites. This HTML file was then parsed in
order to extract all of the URLs for the Top 99 sites. A pre-existing CGI program on the
Web that allows a user to submit a URL for analysis was then utilized. The program
determines whether or not the indicated site utilizes gzip compression and, if not, how
many bytes could have been saved were the site to implement compression. These byte
savings are calculated for all 10 levels of gzip encoding. Level 0 corresponds to no gzip
encoding. Level 1 encoding uses the least aggressive form of phrase matching but is also
the fastest, as it uses the least amount of CPU time when compared to the other levels,
excluding level 0. Conversely, level 9 encoding performs the most aggressive form of
pattern matching but also takes the longest, utilizing the most CPU resources.
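
The trade-off between levels is easy to observe with zlib, which implements the same deflate algorithm and level scale that gzip uses (an illustrative input; actual ratios depend entirely on the document):

import zlib

html = b"<tr><td>cell</td><td>cell</td></tr>\n" * 2000   # repetitive, HTML-like input

fast = zlib.compress(html, 1)   # least aggressive phrase matching, least CPU
best = zlib.compress(html, 9)   # most aggressive phrase matching, most CPU

print(len(html), len(fast), len(best))
# Typically level 9 is only modestly smaller than level 1, mirroring the
# small gap between the two levels in the measurements reported below.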


A Perl script was employed to parse the HTML file returned by this CGI
program, with all of the relevant information being dumped to a file that could be easily
imported into a spreadsheet. Unfortunately, the CGI program can only determine the byte
savings for the HTML file. While this information is useful it does not give the user an
idea of the compression ratio for the entire page - including the images and other
embedded resources. Therefore, we set our web browser to go through a proxy cache and
subsequently retrieved each of the top 99 web pages. We then used the trace log from the
proxy to determine the total size of all of the web pages. After filtering out the web sites
that could not be handled by the CGI program, wget, or the Perl scripts, we were left with 77
URLs. One of the problems encountered by the CGI program and wget involved the
handling of server redirection replies. Also, a number of the URLs referenced sites that
either no longer existed or were inaccessible at the time the tests were run.

The results of this experiment were encouraging. First, if we consider the
savings for the HTML document alone, the average compression ratio for level 1 gzip
encoding turns out to be 74%, and for level 9 this figure is 78%. This clearly shows that
HTML files are prime candidates for compression. Next, we factor into the equation the
size of all of the embedded resources for each web page. We will refer to this as the total
compression ratio and define it as the reduction in total page size, relative to the original,
achieved by compressing the HTML, where the total page size includes both the
embedded resources and the HTML.
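
Concretely, the calculation looks like this (the sizes are hypothetical; only the HTML shrinks, since the embedded images are already compressed):

# Hypothetical page: 30 KB of HTML plus 70 KB of embedded images.
html_size, embedded_size = 30_000, 70_000
compressed_html = 7_500                    # assuming ~75% HTML-only savings

original_page = html_size + embedded_size
encoded_page = compressed_html + embedded_size

total_ratio = 1 - encoded_page / original_page
print(f"total compression ratio: {total_ratio:.1%}")   # 22.5% in this example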

The results show that the average total compression ratio comes to about
27% for level 1 encoding and 29% for level 9 encoding. This still represents a significant
amount of savings, especially in the case where the content is being served to a modem
user.


Figure 11.1 – The total page size (including HTML and embedded resources) for
the top ten web sites.

Figure 11.1 shows the difference in the total number of bytes transferred
for an uncompressed web page versus one transferred with level 1 and level 9 gzip
compression. Note the small difference in compression ratios between level 1 and level
9 encoding.

Table 11.1 shows a comparison of the total compression ratios for the top
ten web sites. You can see that there is only a slight difference in the total compression
ratios for levels 1 and 9 of gzip encoding. Thus, if a site administrator were to decide to
enable gzip compression on a web server but wanted to devote as few CPU cycles as
possible to the compression process, he/she could set the encoding to level 1
and still maintain favorable byte savings.

Ultimately, what these results show is that, on average, a compression-
enabled server could send approximately 27% fewer bytes yet still transmit the exact same
web page to supporting clients. Despite this potential savings, out of all of the URLs
examined, www.excite.com was the only site that supported gzip content coding. We
believe this is indicative of HTTP compression’s current popularity, or lack thereof.

URL                  Level 1 (%)    Level 9 (%)
www.yahoo.com        36.353         38.222
www.aol.com          26.436         25.697
www.msn.com          35.465         37.624
www.microsoft.com    38.850         40.189
www.passport.com     25.193         26.544
www.geocities.com    43.316         45.129
www.ebay.com         29.030         30.446
www.lycos.com        40.170         42.058
www.amazon.com       31.334         32.755
www.angelfire.com    34.537         36.427

Table 11.1 – Comparison of the total compression ratios of level 1 and level 9 gzip encoding for
the indicated URLs.

11.2 Web Server Performance Test


The next set of tests that were conducted involved gathering performance
statistics from both the IIS and Apache web servers.


The test environment consisted of a client and a server PC, both with 10
Mbps Ethernet Network Interface Cards, connected together directly via a crossover
cable. The first computer, a Pentium II 266 MHz PC with 128 MB of RAM, was set up as
a dual-boot server running Windows 2000 Server with IIS 5.0 and Red Hat Linux 7.1
(kernel 2.4.2-2) with Apache 1.3.22 and mod_gzip 1.3.19.1a. Compression was enabled
for static and dynamic content in both IIS and Apache. All of the compression caching
options were disabled for both web servers. The client machine was a Pentium II 266 MHz
PC with 190 MB of RAM running Red Hat Linux 7.1 (kernel 2.4.2-2).

Two programs, httperf and Autobench, were used to perform the
benchmarking tests. Httperf actually generates the HTTP workloads and measures server
performance. Autobench is simply a Perl script that acts as a wrapper around httperf,
automating its execution and extracting and formatting its output in such a way that it can
easily be imported into a spreadsheet program. Httperf was one of the few benchmarking
programs that fully supports the HTTP/1.1 protocol. Httperf was helpful for these tests as
it allows the insertion of additional HTTP request header fields via a command line
option. Thus, if an httperf user wanted to receive compressed content they could easily
append the “Accept-Encoding: gzip” header to each HTTP request.
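
For example, an invocation along the following lines requests a page with compression negotiated (illustrative values; the host and URI are placeholders):

./httperf --server 192.168.0.106 --uri /google.html --num-conns 1000 \
    --num-calls 1 --add-header 'Accept-Encoding: gzip\n'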

The HTML code from the main page of four major web sites was
downloaded for use in the next set of tests. The file sizes for these sites, Google, Yahoo,
AOL and eBay, were 2528, 20508, 29898, and 50187 bytes respectively. Note that these
were real-world files; they were not padded with repeated strings in order to improve
compression ratios. Also note that these tests only involved retrieval of the HTML file for
each page and not any of the embedded images. Since images are not further compressible
they would provide no indication of the performance effects of compression on a web
server.

The tests were designed to determine the maximum throughput of the two
servers by issuing a series of requests for compressed and uncompressed documents.
Using Autobench we started by issuing a low rate of requests per second to the
server and then increased this rate by a specified step until a given high rate of requests
per second was reached. An example of the command line options used to run some of
the tests is as follows:


./autobench_gzip_on --low_rate 10 --high_rate 150 --rate_step 10 \
    --single_host --host1 192.168.0.106 --num_conn 1000 --num_call 1 \
    --output_fmt csv --quiet --timeout 10 --uri1 /google.html \
    --file google_compr.csv

These command line options indicate that initially requests for the
google.html file will be issued at a rate of 10 requests per second. Requests will continue
at this rate until 1000 connections have been made. For these tests each connection makes
only one call. In other words no persistent connections were used. The rate of requests is
then increased by the rate step, which is 10. So, now 20 requests will be attempted per
second until 1000 connections have been made. This will continue until a rate of 150
requests per second is attempted.

Keep in mind when looking at the results that the client may not be
capable of issuing 150 requests per second to the server. Thus a distinction is made
between the desired and actual number of requests per second.

11.2.1 Apache Performance Benchmark

We will first take a look at the results of the tests when run against the
Apache server. Figures 11.2 through 11.5 present graphs of some of the results from the
respective test cases. Referring to the graphs, we can see that for each test case a
saturation point was reached. This saturation point reflects the maximum number of
requests the server could handle for the given resource. Looking at the graphs, the
saturation point can be recognized as the point at which the server’s average response
time increases significantly, often jumping from a few milliseconds up to hundreds
or thousands of milliseconds. The response time corresponds to the time between when
the client sends the first byte of the request and receives the first byte of the reply.


So, if we were to look at Yahoo (Figure 11.3), for instance, we would notice
that the server reaches its saturation point at about the time when the client issues 36
requests per second for uncompressed content. This figure falls slightly, to about 33
requests per second, when compressed content is requested.

Refer to Table 11.2 for a comparison of the estimated saturation points for
each test case. These estimates were obtained by calculating the average number of
connections per second handled by the server using data available from the benchmarking
results. One interesting thing to note from the graphs is that, aside from the Google page,
the server maintained almost the same average reply rate for a page up until the saturation
point, regardless of whether the content was being served compressed or uncompressed.

Figure 11.2 – Benchmarking results for the retrieval of the Google HTML file from the Apache
Server.


Figure 11.3 – Benchmarking results for the retrieval of the Yahoo HTML file from the Apache
Server.


Figure 11.4 – Benchmarking results for the retrieval of the AOL HTML file from the Apache
Server.


Figure 11.5 – Benchmarking results for the retrieval of the eBay HTML file from the Apache
Server.

After the saturation point the numbers diverge slightly, as is noticeable in
the graphs. What this means is that the server was able to serve almost the same number
of requests per second for both compressed and uncompressed documents.

The Google test case shows it is beneficial to impose a limit on the
minimum file size necessary to compress a document. Both mod_gzip and IIS allow the
site administrator to set a lower and an upper bound on the size of compressible resources.
Thus, if the size of a resource falls outside of these bounds it will be sent uncompressed.
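
A minimal sketch of such a policy (our own illustration; the bound values are arbitrary, not the servers' defaults):

# Illustrative bounds; mod_gzip and IIS expose equivalent settings.
MIN_BYTES = 1_000        # below this, gzip overhead can outweigh the savings
MAX_BYTES = 2_000_000    # above this, compression latency may be unacceptable

def should_compress(size, client_accepts_gzip):
    # Compress only when the client can decode it and the size is in range.
    return client_accepts_gzip and MIN_BYTES <= size <= MAX_BYTES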

For these tests all such bounds were disabled, which caused all resources
to be compressed regardless of their size.

When calculating results we will only look at those cases where the
demanded number of requests per second is less than or equal to the saturation point. Not
surprisingly, compression greatly reduced the network bandwidth required for server
replies. The factor by which network bandwidth was reduced roughly corresponds to the
compression ratio of the document.

Web Site    Uncompressed    Compressed
Google      215             105
Yahoo       36              33
AOL         27              25
eBay        16              15

Table 11.2 – Estimated saturation points, in requests per second, for the Apache web server
based on repeated client requests for the indicated document.

Next we will see the performance effects that on-the-fly compression
imposed on the server. To do so we will compare the server’s average response time in
serving the compressed and uncompressed documents. The findings are summarized in
Table 11.3. The results are not particularly surprising. We can see that the size of a static
document does not affect response time when it is requested in an uncompressed form. In
document does not affect response time when it is requested in an uncompressed form. In
the case of compression, however, we can see that as the file size of the resource
increases so too does the average response time. We would certainly expect to see such
results because it takes a slightly longer time to compress larger documents. Keep in
mind that the time to compress a document will likely be far smaller for a faster, more
powerful computer. The machine running as the web server for these tests has a modest
amount of computing power, especially when compared to the speed of today’s average
web server.

Web Site    Uncompressed    Compressed
Google      3.2             10.2
Yahoo       3.3             27.5
AOL         3.4             34.7
eBay        3.4             51.4


Table 11.3 – Average response time (in milliseconds) for the Apache server to respond to
requests for compressed and uncompressed static documents.

11.2.2 IIS Performance Benchmark


We attempted to repeat the same tests as above using the IIS server.
However, our test had to be slightly altered since, unlike mod_gzip and Apache, IIS does
not compress static documents on the fly. To overcome this we simply changed the
extension of all of the .html files to .asp. Active Server Pages (ASP) are generated
dynamically, hence IIS will not cache them.

Figures 11.6 through 11.9 represent graphs of some of the results from each of
the respective test cases. Note from the graphs that, as with Apache, the actual request
rate and the average reply rate are identical for the compressed and uncompressed formats
up until the saturation point. Note also that the average response times present an
interesting value to examine. Recall that with Apache the average reply time was
significantly greater when serving compressed content. In IIS, on the other hand, the
average response time when serving compressed dynamic content is the same as, and
sometimes even a few milliseconds less than, when serving uncompressed content. At
first we thought this was happening because the content was somehow being cached, even
though the IIS documentation states that dynamic content is never cached, so we spent
hours searching through every possible compression and cache setting, to no avail. Finally,
we stumbled across what we believe to be the answer: the use of chunked
transfer coding. By default IIS does not chunk the response for an uncompressed
document that is dynamically generated via ASP. Figure 11.10 shows the HTTP response
header generated by the IIS server for a client request for an
uncompressed ASP resource. In this case, IIS generates the entire page dynamically
before returning it to the client.


Figure 11.6 – Benchmarking results for the retrieval of the Google HTML file from the IIS
Server.


Figure 11.7 – Benchmarking results for the retrieval of the Yahoo HTML file from the IIS Server.

Figure 11.8 – Benchmarking results for the retrieval of the AOL HTML file from the IIS Server.


Figure 11.9 – Benchmarking results for the retrieval of the eBay HTML file from the IIS Server.

However, as we can see in Figure 11.11, when a client requests an ASP file
in compressed form, IIS appears to send the compressed content to the client in chunks.
Thus, the server can immediately compress and send chunks of data as they are
dynamically generated without having to wait for the entire document to be generated
before performing this process. Such a process appears to significantly reduce response
time latency. In this way IIS is able to achieve average response times that are identical to
the response times for uncompressed content.
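
A sketch of this compress-as-you-generate idea (our own simplification in Python, not IIS code; zlib's sync flush stands in for whatever flushing strategy IIS actually uses):

import zlib

def chunked_gzip(parts):
    # Compress and frame output incrementally: each dynamically generated
    # piece is flushed through the compressor and emitted as one HTTP/1.1
    # chunk as soon as it is produced.
    comp = zlib.compressobj(6, zlib.DEFLATED, 16 + zlib.MAX_WBITS)  # gzip framing
    for part in parts:
        data = comp.compress(part) + comp.flush(zlib.Z_SYNC_FLUSH)
        if data:
            yield b"%x\r\n%s\r\n" % (len(data), data)   # one chunk
    tail = comp.flush()                                 # finish the gzip stream
    if tail:
        yield b"%x\r\n%s\r\n" % (len(tail), tail)
    yield b"0\r\n\r\n"                                  # last-chunk marker

# The client can begin decompressing after the first chunk arrives,
# well before the whole page has been generated.
for chunk in chunked_gzip([b"<html>", b"<body>dynamic rows</body>", b"</html>"]):
    print(chunk)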

HTTP/1.1 200 OK
Server: Microsoft-IIS/5.0
Date: Tue, 11 Dec 2001 18:01:09 GMT
Connection: close
Content-Length: 13506
Content-Type: text/html
Set-Cookie: ASPSESSIONIDGGQQQJHK=EJKABPBDEDGHBPLPLMDPEAHA; path=/
Cache-control: private

Figure 11.10 – HTTP response header from an IIS Server after a request for an uncompressed .asp
resource

HTTP/1.1 200 OK
Server: Microsoft-IIS/5.0
Date: Tue, 11 Dec 2001 18:01:01 GMT
Connection: close
Content-Type: text/html
Set-Cookie: ASPSESSIONIDGGQQQJHK=DJKABPBDPEAMPJHEBNIHCBHB; path=/
Content-Encoding: gzip
Transfer-Encoding: chunked
Expires: Wed, 01 Jan 1997 12:00:00 GMT
Cache-Control: private, max-age=86400
Vary: Accept-Encoding

Figure 11.11 – HTTP response header from an IIS Server after a request for a compressed .asp
resource

The graphs show that the lines for the average response times follow
nearly identical paths for the Google, Yahoo! and AOL test cases, though there is a slight
divergence in the AOL test as you near the saturation point. In the eBay test, however,
the server achieves much quicker average response times. Again, this can be explained by
the server’s use of chunked transfers. Looking at the graphs we can see that for the
Google, Yahoo! and AOL test cases, the saturation point is roughly equivalent regardless
of whether the content is served in a compressed form or not.


Based on these results we conclude that the use of chunked transfer coding
for compressing dynamic content provides significant performance benefits.


Chapter 12
12.0 Summary / Suggestion
So, should a web server use HTTP compression? Well, that’s not such an
easy question to answer. There are a number of things that must first be considered. For
instance, if the server generates a large amount of dynamic content one must consider
whether the server can handle the additional processing costs of on-the-fly compression
while still maintaining acceptable performance. Thus it must be determined whether the
price of a few extra CPU cycles per request is an acceptable trade-off for reduced
network bandwidth. Also, compression currently comes at the price of cacheability.

Much Internet content is already compressed, such as GIF and JPEG
images and streaming audio and video. However, a large portion of the Internet is text-
based and is currently being transferred uncompressed. As we have seen, HTTP
compression is an underutilized feature on the Web today, despite the fact that
support for compression is built into most modern web browsers and servers.
Furthermore, the fact that most browsers running in the Windows environment perform
streaming decompression of gzipped content is beneficial, because a client receiving a
compressed HTML file can decompress the file as new packets of data arrive rather than
having to wait for the entire object to be retrieved. Our tests indicated that byte
reductions of roughly 27% are possible for the average web site, demonstrating the
practicality of HTTP compression. However, in order for HTTP compression to gain
popularity a few things need to occur.

First, the design of a new patent-free algorithm tailored specifically
towards compressing web documents, such as HTML and CSS, could be helpful. After
all, gzip and deflate are simply general-purpose compression schemes and do not take
into account the content type of the input stream. Therefore, an algorithm that, for
instance, has a predefined library of common HTML tags could provide a much higher
compression ratio than gzip or deflate.
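
One patent-free way to approximate this idea today is zlib's preset-dictionary feature: seeding the compressor with common HTML phrases lets even a short document be encoded against tags it has not yet emitted. A toy sketch (the dictionary below is illustrative; both ends would have to agree on it, which HTTP content coding currently provides no mechanism for):

import zlib

# Toy preset dictionary of common HTML phrases; a real design would use
# a carefully chosen corpus of tags and attributes.
HTML_DICT = (b'<html><head><title></title></head>'
             b'<body><div><span><a href="</a></span></div></body></html>')

page = (b'<html><head><title>Hi</title></head>'
        b'<body><a href="/x">x</a></body></html>')

def compressed_size(zdict=b""):
    # The decompressor must be constructed with the same dictionary.
    comp = zlib.compressobj(9, zdict=zdict) if zdict else zlib.compressobj(9)
    return len(comp.compress(page) + comp.flush())

print("without dictionary:", compressed_size())
print("with HTML dictionary:", compressed_size(HTML_DICT))  # typically smaller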


Secondly, we believe expanded support for compressed transfer coding is
essential. Currently, support for this feature is scarce among browsers, proxies and
servers. Based on the browser test in Section 6 we can see that, as of Netscape 6.2 and
Internet Explorer 6.0, neither browser fully supports transfer coding. Opera 5 was the
only browser tested that indicated its ability to handle transfer coding by issuing a TE
header field along with each HTTP request. As far as proxies are concerned, Squid
appears only to support compressed content coding, but not transfer coding, in its current
version. According to the Squid development project web site a beta version of a patch
was developed to extend Squid to handle transfer coding. However, the patch has not
been updated recently and the status of this particular project is listed as idle and in need
of developers. Also, in our research we found no evidence of support for compressed
transfer coding in either Apache or IIS.

The most important issue in regards to HTTP compression, in our opinion,
is the need for expanded proxy support. As of now compression comes at the price of
uncacheability in most instances. As we saw, outside of the latest version of Squid, little
proxy support exists for the Vary header. So, even though a given resource may be
compressible by a large factor, this effectiveness is negated if the server has to constantly
retransmit the compressed document to clients who should otherwise have been served
by a proxy cache.

Until these issues can be resolved HTTP compression will likely continue
to be overlooked as a way to reduce user perceived latency and improve Web
performance.


References

[1] http://www.http-compression.com

[2] http://www.google.com

[3] H. F. Nielsen. The Effect of HTML Compression on a LAN.
    http://www.w3.org/Protocols/HTTP/Performance/Compression/LAN.html

[4] http://www.w3.org/Protocols/HTTP/Performance/Pipeline.html

[5] P. Peterlin. Why Compression? Why Gzip?
    http://sizif.mf.uni-lj.si/gzip.html

[6] R. Fielding et al. Hypertext Transfer Protocol – HTTP/1.1. RFC 2616.
    http://www.ietf.org/rfc/rfc2616.txt

[7] http://www.gzip.org/zlib/feldspar.html

[8] Mozilla. Performance: HTTP Compression.
    http://www.mozilla.org/projects/apache/gzip
