Demystifying Cloud Benchmarking

Uploaded by

Luciana Ferreira Miranda

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

147 views11 pages

Demystifying Cloud Benchmarking

Uploaded by

Luciana Ferreira Miranda

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 11

Demystifying Cloud Benchmarking

Tapti Palit Yongming Shen Michael Ferdman

Department of Computer Science Department of Computer Science Department of Computer Science
Stony Brook University Stony Brook University Stony Brook University
[email protected] [email protected] [email protected]

Abstract—The popularity of online services has grown expo- inherently more challenging because it must take into account
nentially, spurring great interest in improving server hardware the quality of service (latency of requests) and not just the peak
and software. However, conducting research on servers has tra- raw throughput.
ditionally been challenging due to the complexity of setting up
representative server configurations and measuring their perfor- Several recent projects [1], [2] have begun to address the
mance. Recent work has eased the effort of benchmarking servers challenges of benchmarking servers faced by the research
by making benchmarking software and benchmarking instructions community. These efforts have made great strides in identifying
readily available to the research community. the relevant server workloads and codifying performance mea-
Unfortunately, the existing benchmarks are a black box; their surement methodologies. They collected the necessary bench-
users are expected to trust the design decisions made in the
construction of these benchmarks with little justification and few marking tools and have been disseminating instructions for
cited sources. In this work, we have attempted to overcome this setting up server systems under test, configuring software that
problem by building new server benchmarks for three popular simulates client requests, and generating synthetic data sets and
network-intensive workloads: video streaming, web serving, and statistical distributions of request patterns. As a result, the effort
object caching. This paper documents the benchmark construction needed to benchmark server systems for a typical researcher has
process, describes the software, and provides the resources we
used to justify the design decisions that make our benchmarks gone down drastically.
representative for system-level studies. Unfortunately, although acknowledging their benefits, we
identified a core drawback of the existing server benchmarking
I. I NTRODUCTION tools. While they provide the software and installation direc-
tions, the existing benchmark suites do not readily provide
The past two decades have seen a proliferation of online
justification for the myriad decisions made in the construction
services. The Internet has transitioned from being merely a
of the benchmarks. The existing benchmark tools are essentially
useful tool to becoming a dominant part of life and culture.
a black box; users of these tools are called upon to implicitly
To support this phenomenal growth, the handful of computers
trust the decisions that were made in their design. The design
that were once used to service the entire online community have
choices for the existing benchmarks and the justifications for
transformed into clouds comprising hundreds of thousands of
these choices are not clearly cited and therefore do not allow
servers in stadium-sized data centers. Servers operate around
users to make a decision regarding the actual relevance of
the clock, handling millions of requests and providing access
these benchmarks. Moreover, when some of the design choices
to petabytes of data to users across the globe.
become dated, a revision of the benchmarks to make them rep-
Driven by the constantly rising demand for more servers and
resentative of the new server environment becomes necessary.
larger data centers, academia and industry have directed sig-
However, in the existing benchmarks, these decisions remain
nificant efforts toward improving the state-of-the-art in server
undiscovered without deep investigation of the tools.
hardware architecture and software design. However, these
This paper chronicles our experience in setting up new
efforts often face a number of obstacles that arise due to the
server benchmarks with explicitly justified design choices. In
difficulty in measuring the performance of server systems.
this work, we concentrate on benchmarking three of the most
While traditional benchmarks used to measure computer per-
popular network-intensive online applications:
formance are widely accepted and easy to use, benchmarking
of server systems calls for considerably more expertise. Server • Video Streaming services dominating the Internet network
software is complex to configure and operate, and requires traffic [3].
tuning the server’s operating system and server software to • Web Serving of dynamic Web2.0 pages performed by the
achieve peak performance. Whereas typical datasets for tra- most popular websites in the world [4].
ditional benchmarks are intuitive to identify, the datasets for • Object Caching of data used extensively in all popular
server systems are diverse and include not only the contents cloud services [5].
of the data, but also the frequency and the access patterns To the extent practical, we document the tools that we use,
to that data. Moreover, while the performance of traditional leveraging prior work and expertise. Wherever possible, we
benchmarks is easily defined as the time taken to complete explicitly describe the decisions that were made in the construc-
a unit of work, quantifying the performance of a server is tion of our benchmarks and cite the motivation and the sources

978-1-5090-1953-3/16/$31.00 ©2016 IEEE 122

that led to those decisions. Finally, we have integrated our workload from the CloudSuite [1] was built for RTSP, both the
benchmarks with the CloudSuite [1] and led the re-engineering server and the client infrastructure of the benchmark needed
of all benchmarks in CloudSuite 3 using Dockers [6], a mech- to be replaced to make the workload representative of modern
anism that makes it easy to use the benchmarks and documents video-streaming services.
all of the configuration choices in clear machine-readable form.
The rest of this paper is organized as follows. We motivate B. Web Serving
replacing the existing benchmarks in Section II and describe the In older generation websites, the user’s interaction consisted
key considerations in server benchmark design in Section III. In primarily of reading content and periodically clicking on URLs
Section IV, we describe our benchmarks and detail the design that changed the reading content of the entire page. Although
decisions we made in their creation. Section V present the submission of web forms was supported, this was a relatively
overview of our Docker-based benchmark deployment setup. rare operation, and the submitted data rarely if at all affected
We present several surprising results and compare our bench- the website content, both for the user submitting the form and
marks to prior work in Section VI. We review the related work for other users. On the other hand, Web2.0 websites are more
in Section VII and conclude in Section VIII. responsive and provide a richer user experience, similar to
II. M OTIVATING THE C HANGE GUI-style applications. These websites cause the browser to
dynamically fetch small chunks of data and integrate it into
We begin the presentation of our workloads by motivating
parts of the webpage, instead of loading the entire webpage
our decision to create new benchmarks. After using the existing
content at once. Because of this, the amount of data transferred
CloudSuite benchmarks [1] in several studies, we realized that
per request is significantly less than older web applications
a number of decisions made in these benchmarks do not match
where the whole page had to be reloaded for each request.
the current state of practice. These realizations initially moti-
The CloudSuite [1] Web Serving workload was intended
vated us to fix some of the problems we observed. However,
to represent the new Web2.0 paradigm. However, the Olio
after fixing a number of problems with the Web Serving and
project [10] used for this benchmark was written as a bench-
Data Caching benchmarks, we realized that it is easier to use the
mark and was never in use in a production environment. In fact,
knowledge we gained in this process (and some of the existing
a large fraction of the requests made in this benchmark were
tools) to re-implement the benchmarks, forcing us through the
requests for static images, a task that is commonly outsourced
process of considering many of the implicit design decisions
to content distribution networks (CDNs) in production systems.
along the way.
Moreover, the fraction of dynamic requests in the benchmark
It is worth noting that the focus of our new benchmarks is
corresponded to the situation circa 2008. We replaced the
system-level evaluation and performance measurement. Micro-
benchmark website with a more representative package of a
architectural study of the CloudSuite workloads demonstrated
production system, with a larger fraction of small dynamic
that vastly different cloud workloads have similar micro-
requests, that aligns more closely with modern practices. Impor-
architectural characteristics [1]. In this work, we target mak-
tantly, while we deemed the server-side software inappropriate,
ing the system-level behavior representative of real-world be-
the Faban [11] framework used to emulate web clients in the
havior, while the micro-architectural behavior is expected to
original CloudSuite benchmark is a good fit and we leveraged
remain largely similar to the original CloudSuite workloads.
it in the creation of our own web serving benchmark.
In Section VI-D we confirms that, from a micro-architecture
perspective, the workloads are practically the same.
C. Object Caching
A. Video Streaming The Memcached server software used for the Data Caching
In the past, various streaming application layer protocols benchmark in the CloudSuite [1] remains one of the most
were used for delivering audio and video content over IP popular and most-widely used cloud applications [5]. However,
networks. Protocols such as the Real Time Streaming Protocol as we started heavily using this benchmark, we discovered
(RTSP) were designed for media streaming and specifically tar- a number of drawbacks in the software that simulated client
geted the needs of this application. However, RTSP is a stateful requests to the server. As we delved deeper into understanding
protocol that keeps track of user sessions on the server, making the benchmark behavior, we realized that its behavior was
it difficult to scale out the service horizontally. Moreover, the not representative of real Memcached deployments. For ex-
common RTSP implementations used UDP, which presented ample, while UDP-based GET requests are common in real
problems for clients behind network address translation (NAT) deployments [12], the client included in the benchmark suite
proxies [7]. Because of this, video-streaming websites such threw an “Unimplemented” exception when configured for
as YouTube and Netflix built their streaming services on the UDP. We were interested in using the benchmark because it
Hypertext Transfer Protocol (HTTP) [8]. Using HTTP allows included the “Twitter” dataset, which was appealing because it
stateless servers that can be easily scaled out in modern appeared to represent a real-world use case of the application.
environments and permits the use of existing battle-tested high- However, upon further investigation, we discovered that the
performance web servers such as NGINX [9] as the core of client included a bug that performed incorrect scaling of the
these streaming services. Because the existing video-streaming dataset. We therefore decided to pursue implementing our own

123
client that included support for UDP requests and correct For benchmarking video-streaming applications, we deploy a
dataset scaling. video-streaming server on NGINX with video files in different
video resolutions. We ensure that the distribution of video
III. K EY C ONSIDERATIONS FOR B ENCHMARK D ESIGN
files and video resolutions is similar to that of popular video-
Benchmarking cloud applications has unique challenges over streaming services such as YouTube. We use the methodology
traditional benchmarking. Traditional benchmarking measures described in [16] for our work, basing our client on httperf [17],
wall-clock time, which is the time needed to complete an a tool from HP Labs. Accesses to videos follow a popularity
operation, with no other considerations. On the other hand, distribution [18]; we therefore take into account the difference
because the ultimate goal of cloud applications is end-user in the popularity of the videos when simulating the clients.
satisfaction, measuring the performance of server applications Elgg is a production-ready, actively used and developed
requires also taking into account the Quality-of-Service (QoS) social networking engine, which has similar functionality to
as a proxy for end-user satisfaction. QoS is typically specified Facebook. The bulk of the workload that dynamically generates
as a maximum latency L for a percentile P of all requests, web responses is performed by PHP, one of the most-popular
meaning that P% of all requests must be completed within platforms for developing dynamic websites [4]. Inspired by
latency L. As a result, measuring performance under QoS the CloudSuite [1], we use the Faban framework to develop
constraints implies iteratively applying different loads to find a benchmark for Elgg. Faban uses a light-weight Java thread
the peak load under which the QoS requirements are met. for each client, allowing the simulation of thousands of clients
Importantly, such performance measurements imply that the on a single machine with relatively low memory requirements.
measured system is under-utilized at its peak performance Memcached is a key-value store server used by some of
(higher utilization would yield higher throughput, but the QoS the most popular websites [12] and is a representative object
requirements would not be met). caching application. We developed a new Memcached bench-
Moreover, benchmarking cloud applications requires faith- marking client called Memloader. Memloader allows users
fully mimicking the behavior of real clients. In case of video- to specify arbitrary key-size, value-size and item-popularity
streaming servers, some videos are accessed more frequently distributions using a dataset file and provides first-class support
than others; Web2.0 client requests are dominated by small for UDP requests. With these features, Memloader can accu-
dynamic AJAX requests; object cache systems handle a wide rately emulate real-world clients. In addition to Memloader, we
range of key and object sizes and requests, all of which developed a set of automation scripts to simplify the task of
can affect the observed server throughput [13]. To properly benchmarking a Memcached server.
measure a server’s performance, the statistical distribution of
request sizes and popularity emulated by the benchmarking tool A. Video-streaming benchmark
must be representative of real-world setups [12], [14]. When Streaming video is a content delivery method in which
designing benchmarks for cloud applications, one must ensure videos are continuously delivered to the client, instead of being
that the request mix closely resembles real clients. downloaded at once. The video is progressively transferred to
Finally, we note that benchmarking is typically performed in the user as it is being watched.
a lab environment, where client machines connect to the server Two techniques – pacing and chunking – are used to stream
over a high speed network. Each client machine simulates the video over HTTP. The video-streaming client’s requests to the
behavior of hundreds or thousands of real clients. Therefore, server are “paced,” ensuring that the video-streaming client
care must be taken to ensure that each of the simulated clients retrieves only the part of the video that will be played back in
is operating in an environment similar to that of real clients. the near future. HTTP/1.1 Range Requests are used to request
We found this of particular concern for video streaming, where one “chunk” of the video at a time. At the start of playback,
network traffic is bursty because of different media-streaming the streaming video client prefetches a few chunks of video
clients requesting chunks of the file at different points in time. content. It then waits for these chunks to be viewed by the
Without artificially-imposed limits, it is possible for a single user. As the fetched chunks are consumed and the buffers are
simulated client to gain access to the entire bandwidth of the emptied, the next chunk in the sequence is fetched.
high-speed link. Such behavior is not representative of real Millions of videos are uploaded to video-streaming services,
client environments that experience a range of limited and but only a small fraction of them are popular. The popularity of
varying bandwidth capabilities. The benchmarking environment videos follows a Zipf distribution [18]. A small percentage of
must provide a realistic setup for simulated clients to ensure the videos are very popular and are accessed regularly, while the
that the behavior that the server exhibits resembles the behavior majority of the videos are accessed rarely. The request patterns
observed in real deployments. of our benchmark reflect this popularity distribution.
The bitrate at which content is encoded determines the stream
IV. O UR BENCHMARKS quality. Encoding content at a higher bitrate yields a higher
We base our benchmarks on several production applications: quality stream, increasing the amount of data that needs to
the NGINX-based HTTP video-streaming server, the Elgg be downloaded to play the video. If a low-bandwidth client
social networking engine [15], and the Memcached key-value attempts to view a high-quality video, the speed at which the
store. buffers are replenished could be lower than the speed at which

124
the video is viewed. The client will experience pauses during rent web users. For our benchmark, we developed a workload
viewing due to “buffering,” causing the viewer to become generator videosesslog, based on the design of wsesslog.
frustrated and potentially stop the video playback. Wsesslog allows the specification of the behavior of individ-
To solve this problem, the quality of the video stream is ual client sessions through its workload generator. A session
varied depending on the available network bandwidth of the comprises of a sequence of bursts, spaced out by the user
client. Videos uploaded to the video-streaming service are think-time. Each burst is one or more requests to the server.
stored encoded at different bitrates, creating different quality The wsesslog workload generator allows the specification of
versions of the same video, each of a different size. parameters of these client sessions, such as the sequence of
1) System to benchmark – NGINX based Video streaming: URIs to access and the think-time between requests. In our
Although our infrastructure can benchmark any HTTP-based workload, the think-time parameter is used to simulate “pacing”
video-streaming server, we select the NGINX server. NGINX is by separating requests for consecutive chunks of a video,
an event-driven HTTP server built around an asynchronous I/O simulating gradual viewing.
architecture. Event-based I/O systems allow a single application We specify a time-out period for each request. If the response
thread to handle multiple file descriptors, unlike the thread- to the request is not received within the timeout period, the re-
based I/O models (employed by servers such as Apache) which quest is considered as causing buffering. Buffering corresponds
require one thread or process per client. A single NGINX thread to frustrated users of the video-streaming service – in other
can service many client connections, making NGINX highly words, these are violations of Quality-of-Service.
scalable and performant. For these reasons, prominent video- In addition to the features supported by the wsesslog work-
streaming services use NGINX for content distribution [19]. load generator, the videosesslog workload generator supports
We enable the following NGINX configuration options: multiple input log files. The user can specify multiple input logs
and a probability distribution. Videosesslog generates requests
1) Sendfile: Sendfile is a system call that directly copies from each of these input logs, according to the specified prob-
contents from one file descriptor to another within the ability distribution, enabling the user to specify the percentage
kernel. Enabling sendfile speeds up copying of data of requests that will be generated from each of the input logs.
between the video file descriptor and the network socket The user can specify requests for different video qualities in
descriptor by avoiding unnecessary memory copies and different input logs and then specify the probability distribution,
context switches. described in Table I, for the ratio of accesses from each of
2) Epoll event mechanism: Under Linux, NGINX supports these input logs. Videosesslog allows binding each input log to
the select, poll, and epoll event mechanisms. We use the a different local IP on the client machine. Doing this enables
Epoll mechanism because it has the best performance and dummynet [21] rules that limit network bandwidth at each local
scalability [20]. IP to simulate realistic network conditions.
3) Persistent connections: HTTP persistent connections en- The httperf tool gathers statistics, such as the percentage of
able one TCP connection to be reused for multiple connections which timed-out, average reply rate, and average
requests. This mitigates the overhead of initiating and request rate, and summarizes and displays these statistics at the
tearing down TCP connections during video playback. By end of the benchmark execution.
default, in HTTP/1.1, all connections are persistent for a
b) Request mix generation tool: As discussed in Sec-
fixed duration specified by a timeout value. The default
tion IV-A, video popularity follows a Zipf distribution. A small
installation of NGINX specifies this value as five seconds.
subset of videos are more popular and are accessed more
We set the timeout to 60 seconds, which exceeds the inter-
frequently. The make-zipf [17] program is used to generate the
chunk interval at all bitrates and allows the server to shut
list of videos and corresponding videosesslog logs, such that
down connections only if they are no longer used by the
the requests reflect the Zipf distribution, with a configurable
client for video streaming.
Zipf exponent. Similar to [16], we use Zipf exponent -0.8 for
2) Benchmarking Tools: We base our benchmarking work our experiments. This list contains the name of the video files,
on the methodologies and tools presented in [16]. The bench- their duration, and popularity rank. The list of videos is then
marking tools consist of an enhanced httperf client that acts read by the gen-fileset tool, which creates the video files.
as the benchmark driver, a file-set generator, and a httperf log c) File-set generation tool: The file-set generator, gen-
generator. The file-set generator generates the videos on the fileset [17], reads the list of videos and creates the files on
video-streaming server. The httperf log generator generates a the video-streaming server. The number of files generated, and
log simulating the sequence of URLs accessed by the clients. therefore the dataset size, are configurable. Video-streaming
The httperf client executes the benchmark by replaying the log servers store different quality versions of the same video. We
of requests against the server under test. support Low Definition (240p), two Standard Definition (360p,
a) Benchmarking Driver: Our benchmarking client driver 480p), and High Definition (720p) resolutions. We generate all
is based on httperf [17], which was enhanced in [16]. The videos at Low Defintion (240p) and two Standard Definition
httperf tool, originally developed at HP Labs, measures web (360p, 480p) resolutions. Additionally, High Definition (720p)
server performance by simulating the behavior of many concur- resolution is generated for 20% of the videos. The size of a

125
TABLE I other users. Posting content on this microblogging platform
B ROADBAND CLIENTS - BANDWIDTH DISTRIBUTION makes it available to be read by other users. This is similar
Bandwidth Percentage of users to Facebook’s popular Wall functionality. Every user has a live
Above 15 Mbps 19% feed of content shared by their network of friends. This live
10 Mbps - 15 Mbps 20% feed is called Elgg River. Several plugins exist to custom-tailor
4 Mbps - 10 Mbps 34%
1 Mbps - 4 Mbps 27% the base functionality with additional features desired for a
particular installation.
The Elgg platform and the available plugins allow the user to
video file depends on its bitrate and its duration. We use the carry out a variety of operations, such as sending and receiving
bitrates suggested by YouTube [22] for our calculations. chat messages, posting on Elgg Wire, and retrieving the latest
3) Benchmark Setup: We simulate clients with different posts. These operations are AJAX based, sending and receiving
bandwidth capabilities. From Akamai [23], we obtain the dis- many small requests. The workload is dominated by these
tribution of the network speed for worldwide broadband users frequent AJAX requests.
in 2015. The percentage of clients with different bandwidth Elgg uses PHP as its server-side scripting language and uses
capabilities are presented in Table I. MySQL as its database backend. Similar to the setup used at
We use dummynet [21] to emulate clients with different Facebook [12], we enable Memcached to cache the results of
bandwidth. Dummynet is a network emulation tool that can database queries. We enable the Zend Opcache, which is a PHP
perform network bandwidth shaping. It can filter packets “accelerator” commonly used in production environments. We
based on any combination of parameters that identify a TCP change the default storage engine of MySQL to InnoDB to
connection (Source/Destination MAC-address/IP-address/Port- support a large number of concurrent reads and writes, needed
number). These filters can be used to forward the packets to support many concurrent users to the website.
through a virtual pipe, which is configured with attributes such 2) Benchmarking Tools: We use the Faban [11] framework
as a bandwidth cap or traversal latency. We configure multiple to develop our benchmark for Elgg. Faban is a Java-based
IP aliases on each of the client machines and use dummynet benchmark development and execution tool with two main
to filter packets on the basis of these IP aliases into different components: Faban Driver framework and Faban Harness.
pipes. On each pipe, we configure a bandwidth limit for each Faban Driver is a framework that provides an API that can
TCP connection that passes through the pipe. be used to quickly develop a benchmark. A benchmark driver
B. Web2.0 benchmark is defined by the operations it runs. The request mix for the
Web2.0 is a set of principles that guided the shift in the benchmark is specified by the list of operations to be performed
direction of web development after the year 2001 [24]. Web2.0 and the probability of each operation.
websites have certain characteristics that cause these workloads The Faban Harness comprises an Apache Tomcat instance
to be different from the workloads of older generation websites. that hosts a web application which automates deploying and
Older generation websites typically served static content, while running the benchmark. At the end of the benchmark run,
Web2.0 websites serve dynamic content. Web2.0 websites have a report is generated that contains statistics such as the suc-
richer user-interfaces that engage the user more frequently than cess/failure count of each operation, the response time, and the
older websites. number of quality-of-service violations.
A Web2.0 website delivers a service or platform, unlike 3) Faban Benchmark Details: Our benchmark takes into
traditional websites. For example, the social networking site account the fact that different operations occur with different
Facebook delivers a social networking platform to the users. frequencies. In the Faban driver for our benchmark, we specify
Most of the content on these websites consists of data provided a function for each of the operations in Table II. In the mix,
by the users of the website and not by the web developer. we assign higher probability for more common operations,
The content is dynamically generated from the actions of other such as updating the live feed, posting on walls, and sending
users and from external sources, such as news feeds from other and receiving chat messages. We assign a lower probability
websites. Because of this, writes to the backend database are for operations such as login and logout, reloading the home
frequent and the data written is consumed by other users. page, and creation of new users, as these are carried out less
1) System to benchmark – Elgg: The Elgg social networking frequently. Also, each operation is assigned an individual QoS
engine is a Web2.0 application developed in PHP, similar in latency limit. Table II shows our request mix and the QoS
functionality to the popular social networking engine Facebook. latency limit for each operation. We derive these values by
Elgg is currently used by the Australian Government, the extrapolating Facebook’s page load time, which is reported as
New Zealand Ministry of Education, Wiley Publishing, the 2.93 seconds by Alexa [26].
University of Florida, and many other organizations [25]. We specify a Quality-of-Service (QoS) requirement of 95%
Elgg allows users to build a network of associations or for our benchmark. 95% of all operations performed must meet
friends. It provides a platform for the users to share content. the QoS limit specified for that operation. If less than 95% of
Elgg includes a microblogging platform, called Elgg Wire, the operations meet the QoS latency limit, the Faban driver
which can be used to share text, image, or video content with deems the benchmark run as failed.

126
TABLE II 2) Benchmarking Tools – Memloader: Memloader emulates
E LGG – R EQUEST M IX , Q UALITY- OF -S ERVICE L IMITS a large number of virtual clients that perform requests to a
Request Percentage QoS (in seconds) Memcached server. Each virtual client independently generates
Create new user 0.5% 3 requests and examines responses. If a cluster of servers or
Login existing user 2.5% 3 server processes are benchmarked, a virtual client can send
Logout logged in user 2.5% 3
requests to multiple servers. By default, Memloader spawns
Access home page 5% 1
Wall post 20% 1 one worker thread per CPU core. Half of the worker threads
Send friend request 10% 1 are dedicated to sending requests. The other half are dedicated
Send chat message 17% 1 to receiving responses. The separation reduces interference
Receive chat message 17% 1
Update live feed 25.5% 1
between request and response activities, enabling precise timing
of request generation and accurate statistics of the response
latencies. A request-sending thread can send requests on be-
Our benchmark clears the transactional data between each half of multiple virtual clients. Similarly, a response-receiving
run to avoid any possible performance degradation due to large thread can receive responses on behalf of multiple virtual
database tables and to ensure that the execution environment is clients. Memloader threads are pinned to CPU cores to avoid
similar from one run to the next. overhead from unnecessary thread migrations. Memloader can
4) User prepopulation tool: Before running the benchmark, send requests using ether TCP or UDP.
we must prepopulate the database with Elgg users. These are Key-size, value-size and item-popularity distributions can be
simulated clients who will log in to the system and perform specified by providing Memloader a dataset file, where each
operations. We developed the UserSetup utility for this purpose. line represents a data item and specifies that item’s key size,
This utility can create a configurable number of users and value size, and popularity. By populating a dataset file with ap-
forward their login credentials to the Faban benchmark driver. propriate records, any key-size, value-size and item-popularity
When the benchmark is launched, each benchmark client thread distributions can be achieved. Memloader can synthesize a large
logs in with one of these users’ credentials and proceeds to dataset from a small dataset file. Conceptually, this is done
perform the operations described in Table II as that user. The by replicating the same dataset multiple times for use by the
number of pre-generated users determines the maximum num- virtual clients. The actual implementation stores the small file
ber of client threads that can be launched, in turn determining in memory and performs the replication on the fly, ensuring a
the maximum scale of the benchmark. small memory footprint. All virtual clients use the same dataset.
C. Object Caching benchmark The performance measured by Memloader is based on the
specified target latency (e.g., 1ms). Memloader reports the
In order to improve performance, web servers use object
percentage of requests that completed within the target latency.
caching systems to cache the results of expensive computations,
Memloader also reports the average latency, the throughput, the
thus making object caching an important workload to study. In
number of outstanding requests, and the hit-ratio. If detailed
this section, we present a benchmark for Memcached, which is
analysis of a server’s performance is needed, Memloader can
a popular, open-source, object caching system.
1) System to benchmark – Memcached: Memcached is a output the complete response latency histogram.
popular object caching system, which is typically used by web 3) Benchmarking Harness: To automate the benchmarking
servers to cache the results of expensive database queries. It is task, we provide a benchmarking harness, which is a set
a completely in-memory key-value store. It supports both TCP of scripts. Its main functions are system configuration, peak
and UDP protocols. Memcached typically acts as an object- throughput seeking, and multi-client control.
cache for web servers, and a single web request can result in The Memcached server under test is likely to use a high-
many Memcached requests. Therefore, to avoid delaying web performance NIC. In this case, the configuration of the server
requests, Memcached requests need to be serviced with low OS and NIC driver can have a significant impact on the
latency. The latency requirements for Memcached are typically measured performance. For example, to get the most out of
1 to 2ms. A single Memcached request requires very little RSS [27] support in the NIC, all CPU cores should partic-
processing. So a Memcached server can serve over a million ipate in interrupt handling. If the NIC supports a TCP flow
requests per second, which is a significantly higher throughput director [28], out-going packets of a TCP connection should be
than those of other workloads. The benchmark is capable sent out through a single queue, so that the flow director can
of benchmarking a single Memcached server or a cluster of correctly associate the TCP connection with the queue. Also,
Memcached servers. The dataset can be either replicated or the CPU core responsible for handling that TCP connection’s
sharded across multiple servers. incoming data should handle the associated queue’s interrupts.
Our Memcached benchmarking tool comprises two parts. If the purpose of the test is to measure a server’s maximum
The first is Memloader, an efficient C++ program for traffic performance, the frequencies of all server cores should be set
generation and performance statistics collection. The second is to the maximum. Similarly, proper system configuration on the
the benchmarking harness, which is a collection of scripts for client side is necessary to avoid overloading the client ma-
automating the benchmarking task. chines, which would yield erroneous traffic generation patterns

127
and latency measurements. Managing these settings manually is benchmark setups for each installation. Based on the positive
a tedious and error prone task, especially when many machines experience we gained in bringing up our new workloads with
are used in the benchmarking setup. To address this problem, Dockers, this approach to benchmark distribution has been
the benchmarking harness automatically configures server and adopted in CloudSuite 3 for all of the benchmarks.
client machines with peak performance settings. We provide
VI. R ESULTS
an example configuration harness for Intel Xeon E5v3 Linux
servers with Intel 82599ES NICs. In this section, we demonstrate several key aspects of
When benchmarking a high performance Memcached server our benchmarks (CloudSuite 3) by comparing them to their
or a cluster of Memcached servers, a single client machine counterparts from CloudSuite 2. First, we examine the results
may be insufficient to drive the requisite load. To address of running the benchmarks on a typical server and compare
this problem, the benchmarking harness supports deploying three notable metrics: request mix, I/O utilization, and in-
instances of Memloader across multiple client machines and terrupt distribution. Next, we present two case studies that
coordinates their simultaneous execution. The harness parses demonstrate how the key considerations for cloud benchmark
outputs from all Memloader instances and aggregates the per- design, outlined in section III, affect server characteristics and
formance statistics. measured performance. Finally, using the Video-streaming and
A common goal for benchmarking a Memcached server is to Memcached workloads, we show that, although the system-
find the server’s peak throughput within QoS constraints. Man- level behavior of these workloads is radically different, the
ually re-running experiments with different throughputs to find micro-architectural behavior when running these benchmarks
the best performance that does not violate QoS requirements is similar to the previous generation CloudSuite workloads.
is a tedious task. The benchmarking harness can automatically
A. Comparison of Request Mix
perform a binary search for the peak throughput by repeatedly
running the benchmark at different loads and automatically We contrast the request mixes of our video-streaming and
monitoring the QoS. Web2.0 workloads with the CloudSuite 2 workloads for the
respective applications to highlight the key system-level differ-
V. D OCKER DEPLOYMENT MECHANISM ences between them.
Working with the CloudSuite 2 benchmarks, we found that 1) Request Mix for Video Streaming: As described in Sec-
the installation and configuration process is extremely complex tion IV-A, different video-streaming clients have different band-
and error-prone. The benchmarking software, system libraries, width capabilities and the video stream quality is selected based
and Linux kernel often create complicated dependencies that on the bandwidth of each client. Moreover, only a fraction of
must be maintained for the software to run correctly. However, all videos are available in a High Definition format. Because
as newer Linux distributions are released, support for older video resolution has a direct impact on the file size of the video,
system libraries and kernels are gradually phased out, requiring the request mix of the benchmarking tool must be similar to
the benchmark user to manually resolve dependencies. More- real-world situations in order to accurately represent the server
over, not only did the process of following the installation and network characteristics.
instructions require significant time and effort, the provided Figure 1 compares the request mix of our NGINX-based
instructions were outdated and often not applicable to the new benchmark with the request mix of the CloudSuite 2 RTSP-
Linux distributions. based benchmark. Our new benchmark supports four video
To remedy this situation, we implemented our new bench- qualities: 240p, 360p, 480p and 720p, whereas the older Cloud-
marks using Docker containers [6]. Docker containers parcel Suite 2 benchmark supports three video qualities: Low (160p),
the benchmarks along with the complete filesystem needed to Medium (240p), and High (360p). In the default configuration,
execute them. This includes all dependencies – the benchmark- the CloudSuite 2 benchmark accesses videos of all three
ing software, system tools, and system libraries. Moreover, qualities with equal probability. In contrast, our workload takes
the Dockerfile not only serves as a quick and painless way into account the varying bandwidth availability of broadband
to automatically recreate the benchmark setup, but it simulta- clients, as described in Table I, and mimics real-world behavior
neously serves as pedantic and precise documentation of the where not all videos are available in High Definition.
exact dependencies, installation procedures, and configuration 2) Request Mix for Web2.0 benchmarks: As discussed in
settings of the benchmark. section III, Web2.0 clients perform many small AJAX requests.
The Docker containers for our benchmarks are released Thus, for a large fraction of requests, the response comprises
on Docker Hub [29], a free globally-accessible repository only a few bytes, which update a small part of the webpage, in-
for Docker containers. The benchmark user can perform a stead of kilobytes of data required to update the entire webpage.
“docker pull” to download these containers into their local Because the number of bytes transferred per request determines
machines. Once downloaded, the user can run the containers the network characteristics, a workload to benchmark these
by issuing a “docker run” command. Dockers greatly simplify applications must be representative of this reality.
the benchmark deployment process, distilling it into two simple We compare the request mix of a run of our Elgg benchmark
commands. Dockers enable benchmark installation and bring- with the request mix of CloudSuite 2 Olio benchmark. Figure 2
up in seconds instead of days and ensure stable and consistent shows the comparison of the AJAX requests and requests that

128
Fig. 1. Request Mix – Video Quality TABLE III
E XPERIMENTAL S ETUP

CPU Intel® Xeon® E5-2620 v3 @ 2.40GHz

Number of cores 12 (2 sockets, 6 per socket)
Hyperthreading Off
RAM 64 GB
NIC Intel 82599ES 10 Gbps
SAS Disk Array 10 disks, 15K RPM
OS Ubuntu 14.04.3 (kernel 3.19.0-30)

Fig. 3. IO-Wait % for Popularity-Aware and Popularity-Unaware Mix

Fig. 2. Request Mix – AJAX vs Full Page Refresh

popularity distribution of videos, as described in IV-A, and

the popularity-unaware request mix accesses all videos with
an equal probability.
Figure 3 shows the I/O-Wait% for these configurations.
The average I/O-Wait% for the popularity-aware request mix
is 22.6%, while that of the popularity-unaware request mix
result in full page refresh. In Olio, 77% of all operations in the is 61.2%. As a result of this, in case of the popularity-
request mix result in full page refreshes. On the other hand, unaware request mix, 55% of the requests fail to meet the QoS
our Elgg benchmark consists of 89% of small, AJAX requests, requirements. In case of the popularity-aware request mix, only
which update a small part of the webpage, and only 11% of 4% of the requests fail to meet QoS. The system-level behavior
requests perform full page refresh. is therefore greatly dependent on this parameter and should be
B. Video-Streaming – Effect of Request Mix on I/O Wait% selected to match real-world conditions.

In this section, we study the effect of the popularity dis- C. Memcached – Effect of Interrupt Distribution
tribution of a video-streaming request mix on the disk I/O Prior work on Memcached servers has shown the importance
wait time, to show how the request mix of a workload has an of factors like value size, item popularity, using UDP, and
impact on server characteristics. As discussed in Section IV-A, others [13], [12]. In addition to having a realistic workload,
a small percentage of videos on the video-streaming server it is important to configure the server under test in a way that
are frequently accessed. This allows the operating system to ensures there are no artificial bottlenecks in the system. In this
cache these “hot” files, thus reducing the total number of disk section, we demonstrate the effect of interrupt distribution on
accesses and speeding up reads of the less frequently accessed the server machine on the benchmark’s result. A Memcached
videos. Ultimately, this affects the Quality-of-Service of the server handles hundreds of thousands of requests per second,
video-streaming system. which corresponds to a massive number of network interrupts
For our experiments, we use a server and client machine with on the server. To make sure a multi-core system is not bot-
the configuration described in Table III. Our fileset consists of tlenecked on interrupt handling, the server NIC requires RSS
8000 videos, each available in 240p, 360p, and 480p video support. However, having an appropriate NIC by itself is not
resolutions. Additionally, 20% of the videos are available in sufficient. To properly configure a Memcached server with an
720p quality. The total size of the fileset is 1.1 TB. RSS-capable NIC, interrupts must be evenly distributed among
For this experiment, we generate two request mixes – a cores, which is often not the default behavior.
popularity-aware request mix and a popularity-unaware request We demonstrate the impact of interrupt distribution using two
mix. The popularity-aware request mix takes into account the machines with identical hardware and software configuration

129
Fig. 4. Memcached Performance and IRQ Handling Fig. 5. IPC Comparison

Fig. 6. Micro-architectural Features - Video-Streaming Workloads

(Table III), connected by a 10G network. The server runs Mem-

cached 1.4.25. The client sends 860K requests per second to
the server through 600 TCP connections. The dataset comprises
five million items with key size 40 bytes, value size 500 bytes,
and uniform popularity. 80% of the requests are GET and 20%
are SET. We use this simple workload to isolate and highlight
the effect of interrupt distribution. The key size and value size
are chosen according to [14]. On the server, only 6 cores (one
socket) are used, the other 6 cores are halted.
Figure 4 shows the system performance, measured as a
percentage of requests meeting the 1ms latency requirement, Thus from the results described in sections VI-B, VI-C,
as we vary the number of server cores that handle receive and VI-D, we see that even though the micro-architectural
interrupts. We find that the performance difference exceeds characteristics of CloudSuite 2 and our new (CloudSuite 3)
33% as the number of interrupt-handling cores is varied from benchmarks are similar, the system-level behavior differs sig-
2 to 6, demonstrating the magnitude of the impact that this nificantly. Thus, for the purpose of micro-architectual studies,
parameter has on overall system throughput. Although the both CloudSuite 2 and CloudSuite 3 workloads are equally
default configuration may perform some interrupt distribution suitable. However, for the purpose of system-level evaluation,
to partialy mitigate this problem, our object caching benchmark CloudSuite 3 workloads are more suitable because they are
demonstrates the configuration needed to equally distribute more representative of real-life situations.
interrupts to achieve peak performance from the hardware.
Fig. 7. Micro-architectural Features - Memcached Workloads
D. Micro-architectural Features of Workloads
We demonstrated major differences in the system-level be-
havior of our benchmarks compared to CloudSuite 2. We now
examine the micro-architectural behavior of two of the work-
loads, video-streaming and Memcached, with their CloudSuite
2 counterparts. Similar to [1], we concentrate on IPC, L1-
Dcache and L1-Icache misses, and D-TLB and I-TLB misses.
For all experiments, we use one client and one server
machine. The configuration for both the client and server
are described in Table III. To ensure a fair comparison, we
examine the server systems under identical CPU load. Figure 5
shows the comparison of the IPCs for the video-streaming VII. R ELATED W ORK
and Memcached workloads. Prior work found that the IPC There has been considerable work related to benchmarking
achieved by cloud workloads is relatively low [1]. We find that cloud workloads. We present a survey of the popular bench-
the IPC of our workloads and their CloudSuite 2 counterparts marking tools for video streaming, Web2.0, and object caching
are practically the same, differing by less than 3%. Figures 6 (Memcached) applications.
and 7 further compare the micro-architectural behavior of the
workloads with respect to the caches and TLBs. From these A. Video streaming
results, we conclude that the micro-architectural characteristics The CloudSuite 2 includes a video-streaming benchmark
of the CloudSuite 2 workloads and our workloads are similar. based on the Darwin video-streaming server. Darwin serves

130
content using the Real Time Streaming Protocol (RTSP). How- VIII. C ONCLUSIONS
ever, today, popular video-streaming services, such as YouTube With the increase in popularity of the cloud as a platform for
and Netflix, stream video using the HTTP protocol [8]. delivering global-scale online services, it has become important
Benchlab [30] is a web application benchmarking framework to benchmark cloud workloads, to continue to improve the state
that uses trace-replays on real web browsers and gathers of art of server systems. We attempted to use the existing
statistics on both the client and the server. The Benchlab cloud benchmarks, but found drawbacks in them – the most
tool was extended into the Video-Benchlab suite [31] for important one being that the benchmark design choices are not
benchmarking video-streaming servers. Unfortunately, because transparent to the end-user of the benchmark. In this work,
the tool launches a separate browser instance for each simulated we chronicled our experience of developing benchmarks for
user, and each browser instance uses a significant amount of three network-intensive cloud applications and documented our
memory capacity and CPU time, the number of clients that design choices and rationale behind them. Specifically, we
can be simulated by a single machine in a benchmarking setup described three cloud applications: video-streaming, Web2.0,
is severely limited. The per-client resource requirements make and object caching.
Video-Benchlab impractical for studying servers under high We compared our benchmarks with existing tools and
throughput conditions. demonstrated a number of distinct differences, concentrating on
Methodologies for generating HTTP streaming video work- the aspects that have an impact on the results of the system-level
loads were presented by [16]. Our video-streaming benchmark measurements. In particular, we highlighted how the request
makes extensive use of these methodologies and our video- mix and machine setup can have a significant impact on the
streaming benchmarking tool is based on the source code performance of the cloud application under test. Finally, we
provided as part of the work. showed that, despite major system-level performance differ-
B. Web2.0 ences, the micro-architectural behavior of the new benchmarks
is similar to the CloudSuite 2 workloads.
Olio was developed to benchmark a “typical” Web2.0 ap-
plication, a social event calendar that allows multiple users to ACKNOWLEDGEMENTS
post social events, browse events, and “friend” other members. This material is based upon work supported by the National
The CloudSuite 2 [1] benchmarking suite includes a Faban Science Foundation (NSF) under Grant No. 1452904 and by a
driver that measures the performance of Olio. However, the gift from Cavium, Inc. The experiments were conducted using
Olio application is retired, no longer supported, and remains equipment purchased through NFS CISE Research Infrastruc-
an example of an outdated Web2.0 benchmark application that ture Grant No. 1513028.
was never used in a production environment. We thank Dr. Tim Brecht and Jim Summers for their help
Benchlab [30] (mentioned above) was originally designed in the development of the video-streaming benchmark and for
to benchmark websites, but suffers from the problem of high providing us the source code from their prior work [16].
per-client resource requirements, which makes it unsuitable for
launching thousands of simulated clients on a single machine. R EFERENCES
SPECweb [32] is a popular SPEC benchmark for evaluating [1] M. Ferdman, A. Adileh, O. Kocberber, S. Volos, M. Alisafaee, D. Jevdjic,
C. Kaynak, A. D. Popescu, A. Ailamaki, and B. Falsafi, “Clearing
web server performance. The benchmark contains three work- the clouds: a study of emerging scale-out workloads on modern
loads: Banking, E-commerce, and Support. These workloads hardware,” in Proceedings of the seventeenth international conference
are traditional web applications and don’t have the characteris- on Architectural Support for Programming Languages and Operating
Systems, ser. ASPLOS ’12. New York, NY, USA: ACM, 2012, pp.
tics of a modern Web2.0 website. Moreover, similarly to Olio, 37–48. [Online]. Available: http://doi.acm.org/10.1145/2150976.2150982
SPECweb is officially retired and is no longer maintained. [2] “Googlecloudplatform – perfkitbenchmarker.” [Online]. Available: https:
//github.com/GoogleCloudPlatform/PerfKitBenchmarker
C. Object Caching [3] “The zettabyte era: Trends and analysis.” [Online]. Avail-
able: http://www.cisco.com/c/en/us/solutions/collateral/service-provider/
CloudSuite 2 contains a Data Caching benchmarking visual-networking-index-vni/VNI Hyperconnectivity WP.html
client [1] that can generate a workload based on a “Twitter” [4] “Php: Hypertext preprocessor.” [Online]. Available: http://php.net
dataset. However, it only supports TCP, while large-scale de- [5] “Memcached - a distributed memory object caching system.” [Online].
Available: http://memcached.org/
ployments of Memcached use UDP for GET requests [12]. [6] “Docker - build, ship, and run any app, anywhere.” [Online]. Available:
Like the CloudSuite 2 client, Mutilate [33] also only supports https://www.docker.com
TCP. Furthermore, it does not have the ability to control item [7] “A network address translator (nat) traversal mechanism for media
controlled by real-time streaming protocol (rtsp).” [Online]. Available:
popularity. Memaslap [34] is a Memcached benchmarking tool https://tools.ietf.org/html/draft-ietf-mmusic-rtsp-nat-22
that comes with the libmemcached library. Libmemcached is a [8] A. C. Begen, T. Akgul, and M. Baugher, “Watching video over the web:
C/C++ library that facilitates the development of clients for the Part 1: Streaming protocols,” Internet Computing, IEEE, vol. 15, no. 2,
pp. 54–63, 2011.
Memcached server. Memaslap supports both TCP and UDP, but [9] “Nginx – high performance load balancer, web server, and reverse
it only reports the minimum, maximum, mean, and standard proxy.” [Online]. Available: https://www.nginx.com/
deviation for latency, without reporting the quality-of-service [10] “Olio – a web 2.0 toolkit.” [Online]. Available: http://incubator.apache.
org/projects/olio.html
percentage. Like Mutilate, Memslap does not have the ability [11] “Faban - helping measure performance.” [Online]. Available: http:
to control item popularity. //faban.org

131
[12] R. Nishtala, H. Fugal, S. Grimm, M. Kwiatkowski, H. Lee, H. C. Li,
R. McElroy, M. Paleczny, D. Peek, P. Saab et al., “Scaling memcache at
facebook.” in nsdi, vol. 13, 2013, pp. 385–398.
[13] K. Lim, D. Meisner, A. G. Saidi, P. Ranganathan, and T. F. Wenisch,
“Thin servers with smart pipes: designing soc accelerators for mem-
cached,” ACM SIGARCH Computer Architecture News, vol. 41, no. 3,
pp. 36–47, 2013.
[14] B. Atikoglu, Y. Xu, E. Frachtenberg, S. Jiang, and M. Paleczny, “Work-
load analysis of a large-scale key-value store,” in ACM SIGMETRICS
Performance Evaluation Review, vol. 40, no. 1. ACM, 2012, pp. 53–64.
[15] “Elgg - open source social networking engine.” [Online]. Available:
https://elgg.org
[16] J. Summers, T. Brecht, D. Eager, and B. Wong, “Methodologies for
generating http streaming video workloads to evaluate web server per-
formance,” in Proceedings of the 5th Annual International Systems and
Storage Conference. ACM, 2012, p. 2.
[17] D. Mosberger and T. Jin, “httperf – a tool for measuring web server
performance,” vol. 26, no. 3. ACM, 1998, pp. 31–37.
[18] P. Gill, M. Arlitt, Z. Li, and A. Mahanti, “Youtube traffic characterization:
a view from the edge,” in Proceedings of the 7th ACM SIGCOMM
conference on Internet measurement. ACM, 2007, pp. 15–28.
[19] “Why netflix chose nginx as the heart of its cdn.” [Online].
Available: https://www.nginx.com/blog/why-netflix-chose-nginx-as-the-
heart-of-its-cdn
[20] L. Gammo, T. Brecht, A. Shukla, and D. Pariag, “Comparing and
evaluating epoll, select, and poll event mechanisms,” in Linux Symposium,
vol. 1, 2004.
[21] M. Carbone and L. Rizzo, “Dummynet revisited,” ACM SIGCOMM
Computer Communication Review, vol. 40, no. 2, pp. 12–20, 2010.
[22] “Recommended upload encoding settings.” [Online]. Available: https:
//support.google.com/youtube/answer/1722171
[23] “Akamai’s state of the internet: Q1 2015 report.” [Online]. Avail-
able: https://www.stateoftheinternet.com/resources-connectivity-2015-
q1-state-of-the-internet-report.html
[24] T. o’Reilly, What is web 2.0. ” O’Reilly Media, Inc.”, 2009.
[25] “Powered by elgg.” [Online]. Available: https://elgg.org/powering.php
[26] “Alexa statistics for facebook.” [Online]. Available: http://www.alexa.
com/siteinfo/facebook.com
[27] “Scaling in the linux networking stack.” [Online]. Available: https:
//www.kernel.org/doc/Documentation/networking/scaling.txt
[28] “Introduction to intel ethernet flow director
and memcached performance,” 2014. [Online]. Avail-
able: http://www.intel.com/content/dam/www/public/us/en/documents/
white-papers/intel-ethernet-flow-director.pdf
[29] “Dockerhub.” [Online]. Available: https://hub.docker.com/
[30] E. Cecchet, V. Udayabhanu, T. Wood, and P. Shenoy, “Benchlab: an open
testbed for realistic benchmarking of web applications,” in Proceedings of
the 2nd USENIX conference on Web application development. USENIX
Association, 2011.
[31] P. Pegus, E. Cecchet, and P. Shenoy, “Video benchlab: an open platform
for realistic benchmarking of streaming media workloads,” in Proc. ACM
Multimedia Systems Conference (MMSys), Portland, OR, 2015.
[32] “Standard performance evaluation corporation.” [Online]. Available:
http://www.spec.org/web2009
[33] “Mutilate: high-performance memcached load generator.” [Online].
Available: https://github.com/leverich/mutilate
[34] “Memaslap – an open source c/c++ client library and tools for
the memcached server.” [Online]. Available: http://libmemcached.org/
libMemcached.html