
Benchmarking Cloud Serving Systems with YCSB

Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, Russell Sears

Yahoo! Research
Santa Clara, CA, USA
{cooperb,silberst,etam,ramakris,sears}@yahoo-inc.com

ABSTRACT

While the use of MapReduce systems (such as Hadoop) for large scale data analysis has been widely recognized and studied, we have recently seen an explosion in the number of systems developed for cloud data serving. These newer systems address "cloud OLTP" applications, though they typically do not support ACID transactions. Examples of systems proposed for cloud serving use include BigTable, PNUTS, Cassandra, HBase, Azure, CouchDB, SimpleDB, Voldemort, and many others. Further, they are being applied to a diverse range of applications that differ considerably from traditional (e.g., TPC-C like) serving workloads. The number of emerging cloud serving systems and the wide range of proposed applications, coupled with a lack of apples-to-apples performance comparisons, makes it difficult to understand the tradeoffs between systems and the workloads for which they are suited. We present the Yahoo! Cloud Serving Benchmark (YCSB) framework, with the goal of facilitating performance comparisons of the new generation of cloud data serving systems. We define a core set of benchmarks and report results for four widely used systems: Cassandra, HBase, Yahoo!'s PNUTS, and a simple sharded MySQL implementation. We also hope to foster the development of additional cloud benchmark suites that represent other classes of applications by making our benchmark tool available via open source. In this regard, a key feature of the YCSB framework/tool is that it is extensible—it supports easy definition of new workloads, in addition to making it easy to benchmark new systems.

Categories and Subject Descriptors: H.3.4 [Systems and Software]: Performance evaluation

General Terms: Measurement, Performance

1. INTRODUCTION

There has been an explosion of new systems for data storage and management "in the cloud." Open source systems include Cassandra [2, 24], HBase [4], Voldemort [9] and others [3, 5, 7, 8]. Some systems are offered only as cloud services, either directly in the case of Amazon SimpleDB [1] and Microsoft Azure SQL Services [11], or as part of a programming environment like Google's AppEngine [6] or Yahoo!'s YQL [13]. Still other systems are used only within a particular company, such as Yahoo!'s PNUTS [17], Google's BigTable [16], and Amazon's Dynamo [18]. Many of these "cloud" systems are also referred to as "key-value stores" or "NoSQL systems," but regardless of the moniker, they share the goals of massive scaling "on demand" (elasticity) and simplified application development and deployment.

The large variety has made it difficult for developers to choose the appropriate system. The most obvious differences are between the various data models, such as the column-group oriented BigTable model used in Cassandra and HBase versus the simple hashtable model of Voldemort or the document model of CouchDB. However, the data models can be documented and compared qualitatively. Comparing the performance of various systems is a harder problem. Some systems have made the decision to optimize for writes by using on-disk structures that can be maintained using sequential I/O (as in the case of Cassandra and HBase), while others have optimized for random reads by using a more traditional buffer-pool architecture (as in the case of PNUTS). Furthermore, decisions about data partitioning and placement, replication, transactional consistency, and so on all have an impact on performance.

Understanding the performance implications of these decisions for a given type of application is challenging. Developers of various systems report performance numbers for the "sweet spot" workloads for their system, which may not match the workload of a target application. Moreover, an apples-to-apples comparison is hard, given numbers for different systems based on different workloads. Thus, developers often have to download and manually evaluate multiple systems. Engineers at Digg [20] reported evaluating eight different data stores in order to implement one feature (the Green Badge, or "what have my friends dugg" feature). There have been multiple similar examples at Yahoo!. This process is time-consuming and expensive.

Our goal is to create a standard benchmark and benchmarking framework to assist in the evaluation of different cloud systems. We focus on serving systems, which are systems that provide online read/write access to data. That is, usually a web user is waiting for a web page to load, and reads and writes to the database are carried out as part of the page construction and delivery. In contrast, batch or analytical systems such as Hadoop or relational OLAP systems
provide only near-line or offline queries, and are not typically used to support serving workloads (though the result of a batch computation may be cached into a serving store for low-latency access). Other researchers [26] are doing work to benchmark analytical systems such as these. Similarly, there are existing benchmarks for a variety of data storage systems (such as SQL databases [22] and filesystems [12, 10]). However, the novel interfaces (usually neither SQL nor POSIX), elasticity, and new use cases of cloud serving systems motivate a new benchmark.

We present the Yahoo! Cloud Serving Benchmark (YCSB) framework. We are using this framework to benchmark our own PNUTS system and to compare it to other cloud databases. The framework consists of a workload generating client and a package of standard workloads that cover interesting parts of the performance space (read-heavy workloads, write-heavy workloads, scan workloads, etc.). An important aspect of the YCSB framework is its extensibility: the workload generator makes it easy to define new workload types, and it is also straightforward to adapt the client to benchmark new data serving systems. The YCSB framework and workloads are available in open source so that developers can use it to evaluate systems, and contribute new workload packages that model interesting applications (YCSB can be obtained from http://research.yahoo.com/Web Information Management/YCSB).

In this paper, we describe the YCSB benchmark, and report performance results for four systems: Cassandra, HBase, PNUTS, and a simple sharded MySQL implementation. Although our focus in this paper is on performance and elasticity, the framework is intended to serve as a tool for evaluating other aspects of cloud systems such as availability and replication, and we discuss approaches to extending it for these purposes.

The paper is organized as follows. Section 2 provides an overview of cloud data serving systems. Section 3 discusses benchmark tiers for performance and scaling. Section 4 discusses the core workloads of the benchmark in detail, while Section 5 examines the architecture and extensibility of the YCSB tool. Section 6 presents benchmarking results on several systems. We propose future benchmark tiers covering availability and replication in Section 7. Section 8 examines related work, and Section 9 presents our conclusions.

2. CLOUD SYSTEM OVERVIEW

2.1 Cloud Serving System Characteristics

Cloud serving systems share common goals, despite their different architectures and design decisions. In general, these systems aim for:

• Scale-out: To support huge datasets (multiple terabytes or petabytes) and very high request rates, cloud systems are architected to scale out, so that large scale is achieved using large numbers of commodity servers, each running copies of the database software. An effective scale-out system must balance load across servers and avoid bottlenecks.

• Elasticity: While scale-out provides the ability to have large systems, elasticity means that we can add more capacity to a running system by deploying new instances of each component, and shifting load to them.

• High availability: Cloud systems must provide high levels of availability. In particular, they are often multi-tenant systems, which means that an outage affects many different applications. Moreover, the use of commodity hardware means that failures are relatively common, and automated recovery must be a first-class operation of the system.

The main motivation for developing new cloud serving systems is the difficulty in providing all of these features (especially scale-out and elasticity) using traditional database systems. As a tradeoff, cloud systems typically sacrifice the complex query capabilities and strong, sophisticated transaction models found in traditional systems. Without the need for complex planning and processing of joins and aggregates, scale-out and elasticity become significantly easier to achieve. Similarly, scale-out (especially to multiple datacenters) is easier to achieve without strong transaction protocols like two-phase commit or Paxos. In particular, it is impossible to simultaneously guarantee availability, consistency and partition tolerance [21]. Because network partitions (or delays and failures which mimic partitions) are unavoidable, systems must prioritize either availability or consistency, and most cloud systems choose availability. As a result, cloud systems typically provide a consistency model that is weaker in various ways than traditional ACID databases.

2.2 Classification of Systems and Tradeoffs

We now examine the different architectural decisions made by cloud systems. As with many types of computer systems, no one system can be best for all workloads, and different systems make different tradeoffs in order to optimize for different applications. The main tradeoffs facing cloud serving systems are:

• Read performance versus write performance

In a serving system, it is difficult to predict which record will be read or written next. Unless all data fits in memory, this means that random I/O to the disk is needed to serve reads (e.g., as opposed to scans). Random I/O can be used for writes as well, but much higher write throughput can be achieved by appending all updates to a sequential disk-based log. However, log-structured systems that only store update deltas can be very inefficient for reads if the data is modified over time, as typically multiple updates from different parts of the log must be merged to provide a consistent record. Writing the complete record to the log on each update avoids the cost of reconstruction at read time, but there is a correspondingly higher cost on update. Log-structured merge trees [29] avoid the cost of reconstructing on reads by using a background process to merge updates and cluster records by primary key, but the disk cost of this process can reduce performance for other operations. Overall, then, there is an inherent tradeoff between optimizing for reads and optimizing for writes.

A particular case of this tradeoff is seen in systems such as HBase that are based on filesystems optimized for batch processing (for example, HBase is built on HDFS, which is the data store for Hadoop). For a system to excel at batch processing, it requires high throughput sequential reads and writes, rather than fast random accesses; thus, Hadoop only supports append-only files. Updates to existing records must be handled by using a differential files scheme that shares
the same disadvantages as a log-structured file system with respect to reads.

• Latency versus durability

Writes may be synched to disk before the system returns success to the user, or they may be stored in memory at write time and synched later. The advantages of the latter approach are that avoiding disk greatly improves write latency, and potentially improves throughput (if multiple writes to the same record can be serviced by a single I/O operation or can be condensed in memory). The disadvantage is risk of data loss if a server crashes and loses unsynched updates.

• Synchronous versus asynchronous replication

Replication is used to improve system availability (by directing traffic to a replica after a failure), avoid data loss (by recovering lost data from a replica), and improve performance (by spreading load across multiple replicas and by making low-latency access available to users around the world). However, there are different approaches to replication. Synchronous replication ensures all copies are up to date, but potentially incurs high latency on updates. Furthermore, availability may be impacted if synchronously replicated updates cannot complete while some replicas are offline. Asynchronous replication avoids high write latency (in particular, making it suitable for wide area replication) but allows replicas to be stale. Furthermore, data loss may occur if an update is lost due to failure before it can be replicated.

• Data partitioning

Systems may be strictly row-based, or allow for column storage. In row-based storage all of a record's fields are stored contiguously on disk. With column storage, different columns or groups of columns can be stored separately (possibly on different servers). Row-based storage supports efficient access to an entire record (including low latency reads and insertion/update in a serving-oriented system), and is ideal if we typically access a few records in their entirety. Column-based storage is more efficient for accessing a subset of the columns, particularly when multiple records are accessed.

2.3 A Brief Survey of Cloud Data Systems

To illustrate these tradeoffs, Table 1 lists several cloud systems and the choices they have made for each dimension. We now examine some of these decisions.

An application developer must match their workload requirements to the best suited cloud database system. Consider the read-optimized versus write-optimized tradeoff. BigTable-like systems such as Cassandra and HBase attempt to always perform sequential I/O for updates. Records on disk are never overwritten; instead, updates are written to a buffer in memory, and the entire buffer is written sequentially to disk. Multiple updates to the same record may be flushed at different times to different parts of the disk. The result is that to perform a read of a record, multiple I/Os are needed to retrieve and combine the various updates. Since all writing is sequential, it is very fast; but reads are correspondingly de-optimized. In contrast, a more traditional buffer-pool architecture, such as that in MySQL and PNUTS, overwrites records when they are updated. Because updates require random I/O, they are slower than the BigTable-like systems; but reads are fast because a single I/O can retrieve the entire, up-to-date record.

Latency versus durability is another important axis. If developers know they can lose a small fraction of writes (for example, web poll votes), they can return success to writes without waiting for them to be synched to disk. Cassandra allows the client to specify on a per-call basis whether the write is durably persisted. MySQL and PNUTS always force log updates to disk when committing a transaction, although this log force can be disabled. HBase does not sync log updates to disk, which provides low latency updates and high throughput. This is appropriate for HBase's target use cases, which are primarily to run batch analytics over serving data, rather than to provide guaranteed durability for such data. For such a system, high throughput sequential reads and writes are favored over durability for random updates.

Synchronous replication ensures freshness of replicas, and is used in HBase and Cassandra. Cassandra also supports asynchronous replication, as do MySQL and PNUTS. Asynchronous replication supports wide-area replication without adding significant overhead to the update call itself.

Column storage is advantageous for applications that need only access a subset of columns with each request, and know these subsets in advance. BigTable, HBase, and Cassandra all provide the ability to declare column groups or families, and add columns to any of them. Each group/family is physically stored separately. On the other hand, if requests typically want the entire row, or arbitrary subsets of it, partitioning that keeps the entire row physically together is best. This can be done with row storage (as in PNUTS), or by using a single column group/family in a column store like Cassandra.

The systems we discuss here are representative, rather than comprehensive. A variety of other systems make different decisions. While we cannot survey them all here, we will point out a few interesting characteristics. Dynamo, Voldemort and Cassandra use eventual consistency to balance replication and availability. In this model, writes are allowed anywhere, and conflicting writes to the same object are resolved later. Amazon SimpleDB and Microsoft Azure are hosted cloud serving stores. Both provide transactional functions not found in other serving stores. The caveat is that the user must partition their data into different containers, both in terms of size and request rate. SimpleDB calls these containers domains, while Azure calls them databases. The wide variance in design decisions has significant performance implications, which motivated us to develop a benchmark for quantitatively evaluating those implications.

3. BENCHMARK TIERS

In this section, we propose two benchmark tiers for evaluating the performance and scalability of cloud serving systems. Our intention is to extend the benchmark to deal with more tiers for availability and replication, and we discuss our initial ideas for doing so in Section 7.

3.1 Tier 1—Performance

The Performance tier of the benchmark focuses on the latency of requests when the database is under load. Latency is very important in serving systems, since there is usually an impatient human waiting for a web page to load. However, there is an inherent tradeoff between latency and throughput: on a given hardware setup, as the amount of
System          Read/Write optimized   Latency/durability   Sync/async replication   Row/column
PNUTS           Read                   Durability           Async                    Row
BigTable        Write                  Durability           Sync                     Column
HBase           Write                  Latency              Async                    Column
Cassandra       Write                  Tunable              Tunable                  Column
Sharded MySQL   Read                   Tunable              Async                    Row

Table 1: Design decisions of various systems.

load increases, the latency of individual requests increases as well since there is more contention for disk, CPU, network, and so on. Typically application designers must decide on an acceptable latency, and provision enough servers to achieve the desired throughput while preserving acceptable latency. A system with better performance will achieve the desired latency and throughput with fewer servers.

The Performance tier of the benchmark aims to characterize this tradeoff for each database system by measuring latency as we increase throughput, until the point at which the database system is saturated and throughput stops increasing. In the terminology of the Wisconsin Benchmark, a popular early benchmark used for parallel database systems, our metric is similar to sizeup [19], where the hardware is kept constant but the size of the workload increases.

To conduct this benchmark tier, we need a workload generator which serves two purposes: first, to define the dataset and load it into the database; and second, to execute operations against the dataset while measuring performance. We have implemented the YCSB Client (described in more detail in Sections 4 and 5) for both purposes. A set of parameter files defines the nature of the dataset and the operations (transactions) performed against the data. The YCSB Client allows the user to define the offered throughput as a command line parameter, and reports the resulting latency, making it straightforward to produce latency versus throughput curves.
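One way to produce such a curve is to repeat the run at a sequence of increasing offered throughputs and record the measured latency at each step, stopping once achieved throughput no longer rises. The sketch below shows the shape of that outer loop; runWorkload() is a hypothetical stand-in for a single benchmark run at a given target rate, and the numbers it returns here are fabricated placeholders rather than measurements.

    // Sketch of the outer loop used to produce a latency versus throughput curve:
    // sweep the offered throughput upward and record the average latency at each
    // step, stopping once the achieved throughput stops increasing (saturation).
    public class ThroughputSweep {

        // Result of one benchmark run at a given offered throughput.
        record RunResult(double achievedOpsPerSec, double avgLatencyMs) {}

        // Hypothetical stand-in for invoking the benchmark client once.
        static RunResult runWorkload(int targetOpsPerSec) {
            double achieved = Math.min(targetOpsPerSec, 9000 + Math.random() * 500); // fake saturation point
            double latency = 5 + 40 * (achieved / 9500) * (achieved / 9500);          // fake latency growth
            return new RunResult(achieved, latency);
        }

        public static void main(String[] args) {
            double lastAchieved = 0;
            for (int target = 1000; target <= 14000; target += 1000) {
                RunResult r = runWorkload(target);
                System.out.printf("target=%d achieved=%.0f avgLatencyMs=%.1f%n",
                                  target, r.achievedOpsPerSec, r.avgLatencyMs);
                if (r.achievedOpsPerSec <= lastAchieved * 1.01) break; // throughput stopped increasing
                lastAchieved = r.achievedOpsPerSec;
            }
        }
    }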
3.2 Tier 2—Scaling

A key aspect of cloud systems is their ability to scale elastically, so that they can handle more load as applications add features and grow in popularity. The Scaling tier of the benchmark examines the impact on performance as more machines are added to the system. There are two metrics to measure in this tier:

Scaleup—How does the database perform as the number of machines increases? In this case, we load a given number of servers with data and run the workload. Then, we delete the data, add more servers, load a larger amount of data on the larger cluster, and run the workload again. If the database system has good scaleup properties, the performance (e.g., latency) should remain constant, as the number of servers, amount of data, and offered throughput scale proportionally. This is equivalent to the scaleup metric from [19].

Elastic speedup—How does the database perform as the number of machines increases while the system is running? In this case, we load a given number of servers with data and run the workload. As the workload is running, we add one or more servers, and observe the impact on performance. A system that offers good elasticity should show a performance improvement when the new servers are added, with a short or non-existent period of disruption while the system is reconfiguring itself to use the new server. This is similar to the speedup metric from [19], with the added twist that the new server is added while the workload is running.

4. BENCHMARK WORKLOADS

We have developed a core set of workloads to evaluate different aspects of a system's performance, called the YCSB Core Package. In our framework, a package is a collection of related workloads. Each workload represents a particular mix of read/write operations, data sizes, request distributions, and so on, and can be used to evaluate systems at one particular point in the performance space. A package, which includes multiple workloads, examines a broader slice of the performance space. While the core package examines several interesting performance axes, we have not attempted to exhaustively examine the entire performance space. Users of YCSB can develop their own packages either by defining a new set of workload parameters, or if necessary by writing Java code. We hope to foster an open-source effort to create and maintain a set of packages that are representative of many different application settings through the YCSB open source distribution. The process of defining new packages is discussed in Section 5.

To develop the core package, we examined a variety of systems and applications to identify the fundamental kinds of workloads web applications place on cloud data systems. We did not attempt to exactly model a particular application or set of applications, as is done in benchmarks like TPC-C. Such benchmarks give realistic performance results for a narrow set of use cases. In contrast, our goal was to examine a wide range of workload characteristics, in order to understand in which portions of the space of workloads systems performed well or poorly. For example, some systems may be highly optimized for reads but not for writes, or for inserts but not updates, or for scans but not for point lookups. The workloads in the core package were chosen to explore these tradeoffs directly.

The workloads in the core package are a variation of the same basic application type. In this application, there is a table of records, each with F fields. Each record is identified by a primary key, which is a string like "user234123". Each field is named field0, field1 and so on. The values of each field are a random string of ASCII characters of length L. For example, in the results reported in this paper, we construct 1,000 byte records by using F = 10 fields, each of L = 100 bytes.
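A small sketch of this record construction follows, assuming a simple map-of-fields representation; the class and helper names are illustrative, not the tool's actual ones.

    // Sketch: build a 1,000 byte record as described above, with a key such as
    // "user234123" and F = 10 fields (field0..field9), each a random ASCII
    // string of L = 100 bytes.
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Random;

    public class RecordBuilder {
        private static final int FIELD_COUNT = 10;   // F
        private static final int FIELD_LENGTH = 100; // L
        private static final Random RANDOM = new Random();

        static Map<String, String> buildRecord() {
            Map<String, String> record = new HashMap<>();
            for (int f = 0; f < FIELD_COUNT; f++) {
                record.put("field" + f, randomAscii(FIELD_LENGTH));
            }
            return record;
        }

        static String randomAscii(int length) {
            StringBuilder sb = new StringBuilder(length);
            for (int i = 0; i < length; i++) {
                sb.append((char) (' ' + RANDOM.nextInt(95))); // printable ASCII
            }
            return sb.toString();
        }

        public static void main(String[] args) {
            String key = "user" + RANDOM.nextInt(1_000_000);
            System.out.println(key + " -> " + buildRecord().keySet());
        }
    }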
Each operation against the data store is randomly chosen to be one of:

• Insert: Insert a new record.

• Update: Update a record by replacing the value of one field.

• Read: Read a record, either one randomly chosen field or all fields.

• Scan: Scan records in order, starting at a randomly chosen record key. The number of records to scan is randomly chosen.

For scan specifically, the distribution of scan lengths is chosen as part of the workload. Thus, the scan() method takes an initial key and the number of records to scan. Of course, a real application may instead specify a scan interval (i.e., from February 1st to February 15th). The number of records parameter allows us to control the size of these intervals, without having to determine and specify meaningful endpoints for the scan. (All of the database calls, including scan(), are described in Section 5.2.1.)

4.1 Distributions

The workload client must make many random choices when generating load: which operation to perform (Insert, Update, Read or Scan), which record to read or write, how many records to scan, and so on. These decisions are governed by random distributions. YCSB has several built-in distributions:

• Uniform: Choose an item uniformly at random. For example, when choosing a record, all records in the database are equally likely to be chosen.

• Zipfian: Choose an item according to the Zipfian distribution. For example, when choosing a record, some records will be extremely popular (the head of the distribution) while most records will be unpopular (the tail).

• Latest: Like the Zipfian distribution, except that the most recently inserted records are in the head of the distribution.

• Multinomial: Probabilities for each item can be specified. For example, we might assign a probability of 0.95 to the Read operation, a probability of 0.05 to the Update operation, and a probability of 0 to Scan and Insert. The result would be a read-heavy workload.

Figure 1 illustrates the difference between the uniform, zipfian and latest distributions. The horizontal axes in the figure represent the items that may be chosen (e.g., records) in order of insertion, while the vertical bars represent the probability that the item is chosen. Note that the last inserted item may not be inserted at the end of the key space. For example, Twitter status updates might be clustered by user, rather than by timestamp, meaning that two recently inserted items may be far apart in the key space.

[Figure 1: Probability distributions (Uniform, Zipfian, Latest). Horizontal axes represent items in order of insertion, and vertical axes represent the probability of being chosen.]

A key difference between the Latest and Zipfian distributions is their behavior when new items are inserted. Under the Latest distribution, the newly inserted item becomes the most popular, while the previously popular items become less so. Under the Zipfian distribution, items retain their popularity even as new items are inserted, whether or not the newly inserted item is popular. The Latest distribution is meant to model applications where recency matters; for example, only recent blog posts or news stories are popular, and the popularity decays quickly. In contrast, the Zipfian distribution models items whose popularity is independent of their newness; a particular user might be extremely popular, with many views of her profile page, even though she has joined many years ago.
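As a small illustration of how such distributions drive the client's choices, the sketch below selects the next operation according to fixed probabilities, in the spirit of the multinomial example above; the class and method names are hypothetical.

    import java.util.Random;

    // Sketch: choose the next operation according to a fixed multinomial
    // distribution (95% Read, 5% Update, 0% Scan/Insert gives a read-heavy mix).
    public class OperationChooser {
        private final String[] operations = {"READ", "UPDATE", "SCAN", "INSERT"};
        private final double[] weights   = {0.95,   0.05,     0.0,    0.0};
        private final Random random = new Random();

        public String nextOperation() {
            double r = random.nextDouble();
            double cumulative = 0.0;
            for (int i = 0; i < operations.length; i++) {
                cumulative += weights[i];
                if (r < cumulative) return operations[i];
            }
            return operations[operations.length - 1]; // guard against rounding error
        }

        public static void main(String[] args) {
            OperationChooser chooser = new OperationChooser();
            for (int i = 0; i < 10; i++) System.out.println(chooser.nextOperation());
        }
    }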
4.2 The Workloads

We defined the workloads in the core package by assigning different distributions to the two main choices we must make: which operation to perform, and which record to read or write. The various combinations are shown in Table 2. Although we do not attempt to model complex applications precisely (as discussed above), we list a sample application that generally has the characteristics of the workload.

Loading the database is likely to take longer than any individual experiment. In our tests, loads took between 10-20 hours (depending on the database system), while we ran each experiment (e.g., a particular workload at a particular target throughput against a particular database) for 30 minutes. All the core package workloads use the same dataset, so it is possible to load the database once and then run all the workloads. However, workloads A and B modify records, and D and E insert records. If database writes are likely to impact the operation of other workloads (e.g., by fragmenting the on-disk representation) it may be necessary to re-load the database. We do not prescribe a particular database loading strategy in our benchmark, since different database systems have different loading mechanisms (including some that have no special bulk load facility at all).

5. DETAILS OF THE BENCHMARK TOOL

We have developed a tool, called the YCSB Client, to execute the YCSB benchmarks. A key design goal of our tool is extensibility, so that it can be used to benchmark new cloud database systems, and so that new workloads can be developed. We have used this tool to measure the performance of several cloud systems, as we report in the next section. This tool is also available under an open source license, so that others may use and extend the tool, and contribute new workloads and database interfaces.
Workload          Operations              Record selection    Application example
A—Update heavy    Read: 50%, Update: 50%  Zipfian             Session store recording recent actions in a user session
B—Read heavy      Read: 95%, Update: 5%   Zipfian             Photo tagging; add a tag is an update, but most operations are to read tags
C—Read only       Read: 100%              Zipfian             User profile cache, where profiles are constructed elsewhere (e.g., Hadoop)
D—Read latest     Read: 95%, Insert: 5%   Latest              User status updates; people want to read the latest statuses
E—Short ranges    Scan: 95%, Insert: 5%   Zipfian/Uniform*    Threaded conversations, where each scan is for the posts in a given thread (assumed to be clustered by thread id)

*Workload E uses the Zipfian distribution to choose the first key in the range, and the Uniform distribution to choose the number of records to scan.

Table 2: Workloads in the core package

[Figure 2: YCSB client architecture. The YCSB Client's Workload Executor drives multiple Client Threads, which issue operations through a DB Interface Layer to the cloud serving store, while a Stats module collects measurements. Inputs are a workload file (read/write mix, record size, popularity distribution, ...) and command line properties (DB to use, workload to use, target throughput, number of threads, ...).]

In this section we describe the architecture of the YCSB client, and examine how it can be extended. We also describe some of the complexities in producing distributions for the workloads.

5.1 Architecture

The YCSB Client is a Java program for generating the data to be loaded to the database, and generating the operations which make up the workload. The architecture of the client is shown in Figure 2. The basic operation is that the workload executor drives multiple client threads. Each thread executes a sequential series of operations by making calls to the database interface layer, both to load the database (the load phase) and to execute the workload (the transaction phase). The threads throttle the rate at which they generate requests, so that we may directly control the offered load against the database. The threads also measure the latency and achieved throughput of their operations, and report these measurements to the statistics module. At the end of the experiment, the statistics module aggregates the measurements and reports average, 95th and 99th percentile latencies, and either a histogram or time series of the latencies.

The client takes a series of properties (name/value pairs) which define its operation. By convention, we divide these properties into two groups:

• Workload properties: Properties defining the workload, independent of a given database or experimental run. For example, the read/write mix of the database, the distribution to use (zipfian, latest, etc.), and the size and number of fields in a record.

• Runtime properties: Properties specific to a given experiment. For example, the database interface layer to use (e.g., Cassandra, HBase, etc.), properties used to initialize that layer (such as the database service hostnames), the number of client threads, etc.

Thus, there can be workload property files which remain static and are used to benchmark a variety of databases (such as the YCSB core package described in Section 4). In contrast, runtime properties, while also potentially stored in property files, will vary from experiment to experiment, as the database, target throughput, etc., change.
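For concreteness, a pair of parameter files in this scheme might look like the following; the property names are illustrative stand-ins rather than the tool's exact keys.

    # workload.properties: static description of the data and operation mix,
    # independent of any particular database or experiment
    recordcount=120000000
    fieldcount=10
    fieldlength=100
    readproportion=0.95
    updateproportion=0.05
    requestdistribution=zipfian

    # runtime.properties: varies from experiment to experiment
    db=CassandraClient
    hosts=server1,server2,server3
    target=6000
    threadcount=200

The same workload file can then be reused unchanged while the runtime settings (database interface layer, its connection properties, target throughput, thread count) are varied per experiment, for example on the command line.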

5.2 Extensibility

A primary goal of YCSB is extensibility. In fact, one of our motivations was to make it easy for developers to benchmark the increasing variety of cloud serving systems. The shaded boxes in Figure 2 show the components which can be easily replaced. The Workload Executor contains code to execute both the load and transaction phases of the workload. The YCSB package includes CoreWorkload, a standard workload executor for the core package described in Section 4. Users of YCSB can define new packages in two ways. The most straightforward is to define a set of workloads that use CoreWorkload but define different workload parameters. This allows users to vary several axes of the core package: which operation to perform, the skew in record popularity, and the size and number of records. The second approach is to define a new workload executor class (e.g., by writing Java) and associated parameters. This approach allows for introducing more complex operations, and exploring different tradeoffs, than the core package does; but involves greater effort compared to the former approach.

The Database Interface Layer translates simple requests (such as read()) from the client threads into calls against the database (such as Thrift calls to Cassandra or REST requests to PNUTS). The Workload Executor and Database Interface Layer classes to use for an experiment are specified as properties, and those classes are loaded dynamically when the client starts. Of course, as an open source package, any class in the YCSB tool can be replaced, but the Workload Executor and Database Interface Layer can be replaced most easily. We now discuss in more detail how the YCSB client can be extended with new database backends and workloads.
5.2.1 New Database Backends

The YCSB Client can be used to benchmark new database systems by writing a new class to implement the following methods:

• read()—read a single record from the database, and return either a specified set of fields or all fields.

• insert()—insert a single record into the database.

• update()—update a single record in the database, adding or replacing the specified fields.

• delete()—delete a single record in the database.

• scan()—execute a range scan, reading a specified number of records starting at a given record key.

These operations are quite simple, representing the standard "CRUD" operations: Create, Read, Update, Delete, with Read operations to read one record or to scan records. Despite its simplicity, this API maps well to the native APIs of many of the cloud serving systems we examined.
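To make the shape of this interface concrete, the following is a minimal sketch of such a database interface layer, backed by an in-memory sorted map so that the example is self-contained; the class name and exact signatures are illustrative assumptions rather than the framework's actual API.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;
    import java.util.TreeMap;
    import java.util.Vector;

    // Illustrative sketch of a database interface layer exposing the five
    // CRUD-style operations described above, backed by an in-memory sorted map.
    public class InMemoryStoreClient {
        private final TreeMap<String, Map<String, String>> store = new TreeMap<>();

        // read(): return either the requested fields or the whole record.
        public Map<String, String> read(String table, String key, Set<String> fields) {
            Map<String, String> row = store.get(key);
            if (row == null) return null;
            if (fields == null) return new HashMap<>(row);
            Map<String, String> result = new HashMap<>();
            for (String f : fields) result.put(f, row.get(f));
            return result;
        }

        // insert(): add a new record.
        public void insert(String table, String key, Map<String, String> values) {
            store.put(key, new HashMap<>(values));
        }

        // update(): add or replace the specified fields of an existing record.
        public void update(String table, String key, Map<String, String> values) {
            store.computeIfAbsent(key, k -> new HashMap<>()).putAll(values);
        }

        // delete(): remove a single record.
        public void delete(String table, String key) {
            store.remove(key);
        }

        // scan(): read recordCount records in key order, starting at startKey.
        public Vector<Map<String, String>> scan(String table, String startKey, int recordCount) {
            Vector<Map<String, String>> result = new Vector<>();
            for (Map<String, String> row : store.tailMap(startKey).values()) {
                if (result.size() >= recordCount) break;
                result.add(new HashMap<>(row));
            }
            return result;
        }
    }

A real backend would replace the in-memory map with calls to the store's native client, for example Thrift calls for Cassandra or REST requests for PNUTS, as described in Section 5.2.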
5.2.2 New Workload Executors

A user can define a new workload executor to replace CoreWorkload by extending the Workload class of the YCSB framework. One instance of the workload object is created and shared among the worker threads, which allows the threads to share common distributions, counters and so on. For example, the workload object can maintain a counter used to generate new unique record ids when inserting records. Similarly the workload object can maintain a common LatestGenerator object, which assigns high popularity to the latest record ids generated by the counter. For each operation, the thread will either execute the workload object's doInsert() method (if the client is in the load phase) or the workload object's doTransaction() method (if the client is in the transaction phase).
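The sketch below shows the rough shape of such a workload executor; the class, the embedded DB stand-in, and the crude recency skew are simplified assumptions modeled on the description above, not the framework's actual classes.

    import java.util.Map;
    import java.util.Random;
    import java.util.concurrent.atomic.AtomicLong;

    // Simplified sketch of a custom workload executor. One workload object is
    // shared by all client threads, so shared state must be thread-safe.
    public class ReadLatestWorkload {

        // Minimal stand-in for the database interface layer (see Section 5.2.1).
        public interface DB {
            Map<String, String> read(String table, String key);
            void insert(String table, String key, Map<String, String> values);
        }

        private final AtomicLong insertCounter = new AtomicLong(0); // new unique record ids
        private final Random random = new Random();

        // Load phase: insert the next record.
        public boolean doInsert(DB db) {
            long id = insertCounter.getAndIncrement();
            db.insert("usertable", "user" + id,
                      Map.of("field0", "payload")); // field payload elided for brevity
            return true;
        }

        // Transaction phase: 95% reads skewed toward recently inserted ids, 5% inserts.
        public boolean doTransaction(DB db) {
            if (random.nextDouble() < 0.95) {
                long latest = Math.max(insertCounter.get() - 1, 0);
                long offset = (long) Math.abs(random.nextGaussian() * 100); // crude recency skew
                db.read("usertable", "user" + Math.max(latest - offset, 0));
            } else {
                doInsert(db);
            }
            return true;
        }
    }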
5.3 Distributions

One unexpectedly complex aspect of implementing the YCSB tool involved implementing the Zipfian and Latest distributions. In particular, we used the algorithm for generating a Zipfian-distributed sequence from Gray et al [23]. However, this algorithm had to be modified in several ways to be used in our tool. The first problem is that the popular items are clustered together in the keyspace. In particular, the most popular item is item 0; the second most popular item is item 1, and so on. For the Zipfian distribution, the popular items should be scattered across the keyspace. In real web applications, the most popular user or blog topic is not necessarily the lexicographically first item.

To scatter items across the keyspace, we hashed the output of the Gray generator. That is, we called a nextItem() method to get the next (integer) item, then took a hash of that value to produce the key that we use. The choice of hash function is critical: the Java built-in String.hashCode() function tended to leave the popular items clustered. Furthermore, after hashing, collisions meant that only about 80 percent of the keyspace would be generated in the sequence. This was true even as we tried a variety of hash functions (FNV, Jenkins, etc.). One approach would be to use perfect hashing, which avoids collisions, with a downside that more setup time is needed to construct the perfect hash (multiple minutes for hundreds of millions of records) [15]. The approach that we took was to construct a Zipfian generator for a much larger keyspace than we actually needed; apply the FNV hash to each generated value; and then take mod N (where N is the size of the keyspace, that is, the number of records in the database). The result was that 99.97% of the keyspace is generated, and the generated keys continued to have a Zipfian distribution.

The second issue was dealing with changing numbers of items in the distribution. For some workloads, new records are inserted into the database. The Zipfian distribution should result in the same records being popular, even after insertions, while in the Latest distribution, popularity should shift to the new keys. For the Latest, we computed a new distribution when a record was inserted; to do this cheaply we modified the Gray algorithm of [23] to compute its constants incrementally. For Zipfian, we expanded the initial keyspace to the expected size after inserts. If a data set had N records, and the workload had T total operations, with an expected fraction I of inserts, then we constructed the Zipfian generator to draw from a space of size N + T × I + δ. We added an additional slack factor δ since the actual number of inserts depends on the random choice of operations during the workload according to a multinomial distribution. While running the workload, if the generator produced an item which had not been inserted yet, we skipped that value and drew another. Then, the popularity distribution did not shift as new records were inserted.
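A minimal sketch of this key-scattering step follows; the FNV-1a constants are the standard 64-bit parameters, while the surrounding plumbing, including the crude stand-in for the Gray et al. generator, is illustrative only.

    import java.util.Random;

    // Sketch: scatter the output of a skewed generator across the keyspace by
    // hashing each generated item (FNV-1a, 64-bit) and taking it mod N, where N
    // is the number of records in the database.
    public class ScatteredZipfianExample {

        // Standard 64-bit FNV-1a over the bytes of a long value.
        static long fnv1a64(long value) {
            long hash = 0xcbf29ce484222325L;          // FNV offset basis
            for (int i = 0; i < 8; i++) {
                hash ^= (value >>> (i * 8)) & 0xffL;
                hash *= 0x100000001b3L;               // FNV prime
            }
            return hash;
        }

        // Map a skewed item number drawn from an expanded keyspace onto a record key.
        static String itemToKey(long item, long recordCount) {
            long scattered = Long.remainderUnsigned(fnv1a64(item), recordCount);
            return "user" + scattered;
        }

        public static void main(String[] args) {
            long recordCount = 1000;
            Random rnd = new Random(42);
            for (int i = 0; i < 5; i++) {
                // Placeholder for the Zipfian generator of Gray et al.: small item
                // numbers are the most popular before scattering.
                long skewedItem = (long) Math.floor(Math.pow(rnd.nextDouble(), 3) * recordCount * 2);
                System.out.println(skewedItem + " -> " + itemToKey(skewedItem, recordCount));
            }
        }
    }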
6. RESULTS

We present benchmarking results for four systems: Cassandra, HBase, PNUTS and sharded MySQL. While both Cassandra and HBase have a data model similar to that of Google's BigTable [16], their underlying implementations are quite different—HBase's architecture is similar to BigTable (using synchronous updates to multiple copies of data chunks), while Cassandra's is similar to Dynamo [18] (e.g., using gossip and eventual consistency). PNUTS has its own data model, and also differs architecturally from the other systems. Our implementation of sharded MySQL (like other implementations we have encountered) does not support elastic growth and data repartitioning. However, it serves well as a control in our experiments, representing a conventional distributed database architecture, rather than a cloud-oriented system designed to be elastic. More details of these systems are presented in Section 2.

In our tests, we ran the workloads of the core package described in Section 4, both to measure performance (benchmark tier 1) and to measure scalability and elasticity (benchmark tier 2). Here we report the average latency of requests. The 95th and 99th percentile latencies are not reported, but followed the same trends as average latency. In summary, our results show:

• The hypothesized tradeoffs between read and write optimization are apparent in practice: Cassandra and HBase have higher read latencies on a read heavy workload than PNUTS and MySQL, and lower update latencies on a write heavy workload.

• PNUTS and Cassandra scaled well as the number of servers and workload increased proportionally. HBase's performance was more erratic as the system scaled.

• Cassandra, HBase and PNUTS were able to grow elastically while the workload was executing. However, PNUTS
[Figure 3: Workload A—update heavy: (a) read operations, (b) update operations. Read and update latency (ms) versus throughput (ops/sec) for Cassandra, HBase, PNUTS and MySQL. Throughput in this (and all figures) represents total operations per second, including reads and writes.]

provided the best, most stable latency while elastically repartitioning data.

It is important to note that the results we report here are for particular versions of systems that are undergoing continuous development, and the performance may change and improve in the future. Even during the interval from the initial submission of this paper to the camera ready version, both HBase and Cassandra released new versions that significantly improved the throughput they could support. We provide results primarily to illustrate the tradeoffs between systems and demonstrate the value of the YCSB tool in benchmarking systems. This value is both to users and developers of cloud serving systems: for example, while trying to understand one of our benchmarking results, the HBase developers uncovered a bug and, after simple fixes, nearly doubled throughput for some workloads.

6.1 Experimental Setup

For most experiments, we used six server-class machines (dual 64-bit quad core 2.5 GHz Intel Xeon CPUs, 8 GB of RAM, 6 disk RAID-10 array and gigabit ethernet) to run each system. We also ran PNUTS on a 47 server cluster to successfully demonstrate that YCSB can be used to benchmark larger systems. PNUTS required two additional machines to serve as a configuration server and router, and HBase required an additional machine called the "master server." These servers were lightly loaded, and the results we report here depend primarily on the capacity of the six storage servers. The YCSB Client ran on a separate 8 core machine. The Client was run with up to 500 threads, depending on the desired offered throughput. We observed in our tests that the client machine was not a bottleneck; in particular, the CPU was almost idle as most time was spent waiting for the database system to respond.

We ran Cassandra 0.5.0, HBase 0.20.3, and MySQL 5.1.24 (for PNUTS) and 5.1.32 (for sharded MySQL). For one experiment (elastic speedup), we used Cassandra 0.6.0-beta2, at the suggestion of the Cassandra developers. For Cassandra, we used the OrderedPartitioner with node tokens evenly spaced around the key space. Our sharded MySQL implementation used client-side hashing to determine which server a given record should be stored on.

We configured and tuned each system as well as we knew how. In particular, we received extensive tuning assistance from members of the development teams of the Cassandra, HBase and PNUTS systems. For HBase, we allocated 1GB of heap to Hadoop, and 5GB to HBase. For PNUTS and sharded MySQL, we allocated 6 GB of RAM to the MySQL buffer pool. For Cassandra, we allocated 3GB of heap to the JVM, at the suggestion of Cassandra developers, so the rest of RAM could be used for the Linux filesystem buffer. We disabled replication on each system so that we could benchmark the baseline performance of the system itself. In ongoing work we are examining the impact of replication. For Cassandra, sharded MySQL and PNUTS, all updates were synched to disk before returning to the client. HBase does not sync to disk, but relies on in-memory replication across multiple servers for durability; this increases write throughput and reduces latency, but can result in data loss on failure. We ran HBase experiments with and without client-side buffering; since buffering gave a significant throughput benefit, we mainly report on those numbers. Cassandra, and possibly PNUTS and sharded MySQL, may have benefited if we had given them a dedicated log disk. However, to ensure a fair comparison, we configured all systems with a single RAID-10 array and no dedicated log disk. Users of YCSB are free to set up alternative hardware configurations to see if they can get better performance.

HBase performance is sensitive to the number of log structured files per key range, and the number of writes buffered in memory. HBase shrinks these numbers using compactions and flushes, respectively, and they can be system or user-initiated. We periodically applied these operations during our experiments; but HBase users must evaluate how often such operations are needed in their own environment.

Our database consisted of 120 million 1 KB records, for a total size of 120 GB. Each server thus had an average of 20 GB of data, more than it could cache entirely in RAM. Read operations retrieved an entire record, while updates modified one of the fields (out of ten).

6.2 Workload A—Update Heavy

First, we examined Workload A, which has 50 percent reads and 50 percent updates. Figure 3 shows latency versus throughput curves for each system for both the read and update operations. In each case, we increased the offered throughput until the actual throughput stopped increasing. As the figure shows, for all systems, operation latency increased as offered throughput increased. Cassandra, which is optimized for write-heavy workloads, achieved the
[Figure 4: Workload B—read heavy: (a) read operations, (b) update operations. Read and update latency (ms) versus throughput (ops/sec) for Cassandra, HBase, PNUTS and MySQL.]

best throughput and the lowest latency for reads. At high throughputs, Cassandra's efficient sequential use of disk for updates reduces contention for the disk head, meaning that read latency is lower than for the other systems. PNUTS has slightly higher latency than MySQL, because it has extra logic on top of MySQL for distributed consistency. HBase had very low latency for updates, since updates are buffered in memory. With buffering off, writes are committed to memory on the HBase server, and latency was only 10-50% lower than read latency. Because of efficient sequential updates, Cassandra also provides the best throughput, peaking at 11978 operations/sec, compared to 7904 operations/sec for HBase, 7283 operations/sec for sharded MySQL, and 7448 operations/sec for PNUTS.

6.3 Workload B—Read Heavy

Workload B, the read heavy workload, provides a different picture than workload A. The results for workload B are shown in Figure 4. As the figure shows, PNUTS and sharded MySQL are now able to provide lower latencies on reads than either Cassandra or HBase. Cassandra and HBase still perform better for writes. The extra disk I/Os being done by Cassandra to assemble records for reading dominates its performance on reads. Note that Cassandra only begins to show higher read latency at high throughputs, indicating that the effects matter primarily when the disk is close to saturation. HBase also has to reconstruct fragments of records from multiple disk pages. However, the read latency is relatively higher because of HBase's log-structured storage implementation. HBase flushes its memtables to disk in separate files, and potentially must search each such file for fragments of the record, even if only some contain relevant fragments. In fact, we observe worse latency and throughput as the number of files grows, and improvement when the number shrinks through compaction. When there is a great deal of fragmentation (for example, after a large number of writes), throughput drops to as low as 4800 operations/sec due to the expense of reconstruction. Future improvements, such as Bloom filters, may cut down such false positive searches.

6.4 Workload E—Short Ranges

We ran workload E (short ranges) using HBase, Cassandra and PNUTS, with ranges up to 100 records. Our sharded MySQL implementation is a hash table and does not support range scans. The results are shown in Figure 5.

[Figure 5: Workload E—short scans. Read latency (ms) versus throughput (ops/sec) for Cassandra, HBase and PNUTS.]

As the results show, both HBase and PNUTS can sustain similar maximum throughputs (1519 ops/sec for HBase, and 1440 ops/sec for PNUTS) with roughly equivalent latency. Cassandra performs much worse. Cassandra's range scan support is relatively new in version 0.5.0, and is not yet heavily optimized; future versions may or may not be able to provide better scan latency. HBase and PNUTS have similar scan performance, but only for short scans. We ran an experiment where we varied the average range query from 25 to 800 records. The results (not shown) demonstrate that HBase has better scan performance when the range scans retrieve more than 100 records on average. MySQL's on disk storage used a B-Tree to support low latency reads; but B-trees inherently have some fragmentation, so the cost of reading pages with empty space increases the cost of a range scan in PNUTS. HBase stores data more compactly on disk, improving performance for large ranges. For example, when the average range scan is for 800 records, HBase's response time is 3.5 times faster than PNUTS.

6.5 Other Workloads

Next, we ran the other core YCSB workloads. The results of workload C (read only) were similar to those of the read-heavy workload B: PNUTS and sharded MySQL achieved the lowest latency and highest throughput for the read operations. Workload D (read latest) also showed similar results to workload B. Although the popularity of items shifted over time, the dominant effect was that PNUTS and MySQL were most efficient for reads, and workload D is dominated by read requests. Note that in workload D the recently inserted records are not necessarily clustered on the
70
Cassandra
percent of that achievable with 6 servers. This models a
60 HBase situation where we attempt to elastically expand an over-
PNUTS
loaded cluster to a size where it can handle the offered load,
Read latency (ms)
50
a situation that often occurs in practice.
40
First, we show a slice of the time series of read latencies for
30 Cassandra in Figure 7(a). In this figure, the first ten minutes
20 represent five servers, after performance has stabilized; then,
10 a sixth server is added. As the figure shows, this results in
0 a sharp increase in latency, as well as a wide variance in the
0 2 4 6 8 10 12 latency. This performance degradation results from moving
Servers data to the 6th server; regular serving requests compete for
disk and network resources with the data repartitioning pro-
cess, resulting in high and highly variable latency. In fact,
Figure 6: Read performance as cluster size in- under load, it takes many hours for Cassandra to stabilize.
creases. In our test, we had to stop the YCSB workload after 5.5
hours to allow the system to complete its repartitioning and
same server. Such a clustering scheme would not likely be quiesce. The high cost of repartitioning is a known issue with
used in practice, as it would result in a severe hotspot on Cassandra 0.5.0 and is being optimized in ongoing develop-
one server while the other servers were underutilized. “Read ment. After completing its repartitioning, the performance
latest” applications instead typically construct the record of the system under load matched that of a system that had
keys to cluster records by some other attribute to avoid this started with 6 servers, indicating that eventually, the elastic
hotspot, and we did this as well. This mimics the design expansion of the cluster will result in good performance.
of an application like blogging or Twitter, where recent up- Results for HBase are shown in Figure 7(b). As before,
dates are most popular but are likely clustered by user rather this figure represents just one slice of the total experiment.
than by time. We have omitted the graphs for workloads C As the figure shows, the read latency spikes initially after
and D for lack of space. the sixth server is added, before the latency stabilizes at a
We also ran a “read-modify-write” workload that reflects value slightly lower than the latency for five servers. This re-
the frequently used pattern of reading a record, modifying sult indicates that HBase is able to shift read and write load
it, and writing the changes back to the database. This work- to the new server, resulting in lower latency. HBase does
load is similar to workload A (50/50 read/write ratio) except not move existing data to the new server until compactions
that the updates are “read-modify-write” rather than blind occur2 . The result is less latency variance compared to Cas-
writes. The results (not shown) showed the same trends as sandra since there is no repartitioning process competing
workload A. for the disk. However, the new server is underutilized, since
existing data is served off the old servers.
6.6 Scalability A similar slice of the timeseries (adding a sixth server
So far, all experiments have used six storage servers. (As at time=10 minutes) is shown for PNUTS in Figure 7(c).
mentioned above, we did run one experiment with PNUTS PNUTS also moves data to the new server, resulting in
on 47 servers, to verify the scalability of YCSB itself). How- higher latency after the sixth server is added, as well as
ever, cloud serving systems are designed to scale out: more latency variability. However, PNUTS is able to stabilize
load can be handled when more servers are added. We more quickly, as its repartitioning scheme is more optimized.
first tested the scaleup capability of Cassandra, HBase and After stabilizing at time=80 minutes, the read latency is
PNUTS by varying the number of storage servers from 2 to comparable to a cluster that had started with six servers,
12 (while varying the data size and request rate proportion- indicating that PNUTS provides good elastic speedup.
ally). The resulting read latency for workload B is shown
in Figure 6. As the results show, latency is nearly constant 7. FUTURE WORK
for Cassandra and PNUTS, indicating good elastic scaling
In addition to performance comparisons, it is important
properties. In contrast, HBase’s performance varies a lot as
to examine other aspects of cloud serving systems. In this
the number of servers increases; in particular, performance
section, we propose two more benchmark tiers which we are
is better for larger numbers of servers. A known issue in
developing in ongoing work.
HBase is that its behavior is erratic for very small clusters
(less than three servers.)
6.7 Elastic Speedup
We also examined the elastic speedup of Cassandra, HBase and PNUTS. Sharded MySQL is inherently inelastic. In this case, we started two servers, loaded with 120 GB of data. We then added more servers, one at a time, until there were six servers running. After adding each server, we attempted to run long enough to allow the system to stabilize before adding the next server (although in some cases it was difficult to determine if the system had truly stabilized.) During the entire run, we set the YCSB client to offer the same throughput; in particular, the offered throughput was 80

Cassandra 0.5.0 and is being optimized in ongoing development. After completing its repartitioning, the performance of the system under load matched that of a system that had started with 6 servers, indicating that eventually, the elastic expansion of the cluster will result in good performance.

Results for HBase are shown in Figure 7(b). As before, this figure represents just one slice of the total experiment. As the figure shows, the read latency spikes initially after the sixth server is added, before the latency stabilizes at a value slightly lower than the latency for five servers. This result indicates that HBase is able to shift read and write load to the new server, resulting in lower latency. HBase does not move existing data to the new server until compactions occur². The result is less latency variance compared to Cassandra since there is no repartitioning process competing for the disk. However, the new server is underutilized, since existing data is served off the old servers.

² It is possible to run the HDFS load balancer to force data to the new servers, but this greatly disrupts HBase's ability to serve data partitions from the same servers on which they are stored.

A similar slice of the timeseries (adding a sixth server at time=10 minutes) is shown for PNUTS in Figure 7(c). PNUTS also moves data to the new server, resulting in higher latency after the sixth server is added, as well as latency variability. However, PNUTS is able to stabilize more quickly, as its repartitioning scheme is more optimized. After stabilizing at time=80 minutes, the read latency is comparable to a cluster that had started with six servers, indicating that PNUTS provides good elastic speedup.

7. FUTURE WORK
In addition to performance comparisons, it is important to examine other aspects of cloud serving systems. In this section, we propose two more benchmark tiers which we are developing in ongoing work.

7.1 Tier 3—Availability
A cloud database must be highly available despite failures, and the Availability tier measures the impact of failures on the system. The simplest way to measure availability is to start a workload, kill a server while the workload is running, and observe any resulting errors and performance impact. However, in a real deployed system, a variety of failures can occur at various other levels, including the disk, the network,

Figure 7: Elastic speedup: Time series showing impact of adding servers online. Panels (a) Cassandra, (b) HBase, and (c) PNUTS each plot read latency (ms) against duration of test (min).

and the whole datacenter. These failures may be the result of hardware failure (for example, a faulty network interface card), power outage, software faults, and so on. A proper availability benchmark would cause (or simulate) a variety of different faults and examine their impact.

At least two difficulties arise when benchmarking availability. First, injecting faults is not straightforward. While it is easy to ssh to a server and kill the database process, it is more difficult to cause a network fault, for example. One approach is to allow each database request to carry a special fault flag that causes the system to simulate a fault. This approach is particularly amenable to benchmarking because the workload generator can add the appropriate flag for different tests to measure the impact of different faults. A "fault-injection" header can be inserted into PNUTS requests, but to our knowledge, a similar mechanism is not currently available in the other systems we benchmarked. An alternate approach to fault injection that works well at the network layer is to use a system like ModelNet [34] or Emulab [33] to simulate the network layer and insert the appropriate faults.
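To illustrate the fault-flag idea, the sketch below shows how a workload generator could attach a simulated-fault flag to each request it issues. The header name, Fault values, and Store interface are hypothetical; they are not the actual PNUTS fault-injection header, nor an interface offered by any of the systems we benchmarked.

    // Sketch: tag each request with a simulated-fault flag so that a
    // server-side hook (if one exists) can emulate the fault while the
    // benchmark measures error rates and latency. All names are hypothetical.
    import java.util.Collections;
    import java.util.Map;

    enum Fault { NONE, DROP_REQUEST, SLOW_DISK }

    interface Store {
        Map<String, String> read(String key, Map<String, String> headers);
    }

    class FaultInjectingClient {
        private final Store store;
        private final Fault fault;

        FaultInjectingClient(Store store, Fault fault) {
            this.store = store;
            this.fault = fault;
        }

        Map<String, String> read(String key) {
            // Different runs set different flags to measure different faults.
            return store.read(key,
                    Collections.singletonMap("x-simulate-fault", fault.name()));
        }
    }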
The second difficulty is that different systems have different components, and therefore different, unique failure modes. Thus, it is difficult to design failure scenarios that cover all systems.

Despite these difficulties, evaluating the availability of systems is important and we are continuing to work on developing benchmarking approaches.

7.2 Tier 4—Replication
Cloud systems use replication for both availability and performance. Replicas can be used for failover, and to spread out read load. Some systems are also designed to split write requests across replicas to improve performance, although a consistency mechanism (such as eventual consistency in Cassandra, or timeline consistency in PNUTS) is needed to avoid corruption due to write-write conflicts.

Replication may be synchronous or asynchronous. HBase, for example, writes synchronously to multiple replicas, while PNUTS performs replication asynchronously. Thus, the following measures are important when evaluating replication:

• Performance cost or benefit—what is the impact to performance as we increase the replication factor on a constant amount of hardware? The extra work to maintain replicas may hurt performance, but the extra replicas can potentially be used to spread out load, improving performance.

• Availability cost or benefit—what is the impact to availability as we increase the replication factor on a constant amount of hardware? In some systems, the read availability may increase but the write availability may decrease if all replicas must be live to commit an update.

• Freshness—are replicas consistent, or are some replicas stale? How much of the data is stale and how stale is it? This is primarily an issue for systems that use asynchronous replication.

• Wide area performance—how does replication perform between datacenters in geographically separate locations? Some replication schemes are optimized to be used within the datacenter or between nearby datacenters, while others work well in globally distributed datacenters.

8. RELATED WORK

Benchmarking
Benchmarking is widely used for evaluating computer systems, and benchmarks exist for a variety of levels of abstraction, from the CPU, to the database software, to complete enterprise systems. Our work is most closely related to database benchmarks. Gray surveyed popular database benchmarks, such as the TPC benchmarks and the Wisconsin benchmark, in [22]. He also identified four criteria for a successful benchmark: relevance to an application domain, portability to allow benchmarking of different systems, scalability to support benchmarking large systems, and simplicity so the results are understandable. We have aimed to satisfy these criteria by developing a benchmark that is relevant to serving systems, portable to different backends through our extensibility framework, scalable to realistic data sizes, and employing simple transaction types.

Despite the existence of database benchmarks, we felt it was necessary to define a new benchmark for cloud serving systems. First, most cloud systems do not have a SQL interface, and support only a subset of relational operations (usually, just the CRUD operations), so that the complex queries of many existing benchmarks were not applicable. Second, the use cases of cloud systems are often different from traditional database applications, so that narrow domain benchmarks (such as the debit/credit style benchmarks
like TPC-A or E-commerce benchmarks like TPC-W) may not match the intended usage of the system. Furthermore, our goal was to develop a benchmarking framework that could be used to explore the performance space of different systems, rather than to measure a single performance number representing a particular application. It is for similar reasons that new benchmarks have been developed for other non-traditional database systems (such as XMark for XML systems [28] and Linear Road for stream systems [14]).

Designing an accurate and fair benchmark, and using it to gather accurate results, is non-trivial. Seltzer et al. [30] argue that many micro and macrobenchmarks do not effectively model real workloads. One approach they propose (the vector approach) is to measure the performance of system operations, and compute the expected performance for a particular application that uses some specified combination of those operations. Our approach is similar to this, except that we directly measure the performance of a particular combination of operations; this allows us to accurately measure the impact of things like disk or cache contention when the operations are used together. Shivam et al. [31] describe a workbench tool for efficiently running multiple benchmark tests to achieve high confidence results. Their tool interfaces with a workload generator, like the YCSB Client, to execute each run. We are examining the possibility of using their workbench to run our benchmark.
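As a rough illustration of the difference, the vector approach estimates application performance as a mix-weighted combination of per-operation measurements, whereas our benchmark measures the mixed workload directly. The snippet below is a minimal sketch of the weighted-combination estimate; the operation names and latencies are hypothetical.

    // Sketch of the "vector approach": estimate expected latency for a workload
    // as the mix-weighted sum of individually measured operation latencies.
    // Values are hypothetical; a direct measurement of the mixed workload can
    // differ because of disk and cache contention between operations.
    import java.util.Map;

    public class VectorEstimate {
        public static void main(String[] args) {
            Map<String, Double> measuredLatencyMs = Map.of("read", 8.0, "update", 12.0);
            Map<String, Double> workloadMix = Map.of("read", 0.5, "update", 0.5);

            double expected = 0.0;
            for (Map.Entry<String, Double> op : workloadMix.entrySet()) {
                expected += op.getValue() * measuredLatencyMs.get(op.getKey());
            }
            System.out.printf("Estimated average latency: %.1f ms%n", expected);
        }
    }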
Cloud systems
The term "cloud" has been used for a variety of different kinds of systems and architectures. A special issue of the Data Engineering Bulletin [25] showcased several aspects of data management in the cloud. We have focused on serving systems like PNUTS, Cassandra, and others. In contrast, batch systems provide near-line or offline analysis, but are not appropriate for online serving. Pavlo et al. [26] have benchmarked cloud systems like Hadoop against more traditional relational systems and relational column stores like Vertica/C-Store [32]. Some batch systems might use the same database systems that would be used in a serving environment. For example, HBase can be used both as a serving store and as a storage backend for Hadoop, and is reportedly used this way at StumbleUpon, one of the major developers of HBase [27].

9. CONCLUSIONS
We have presented the Yahoo! Cloud Serving Benchmark. This benchmark is designed to provide tools for apples-to-apples comparison of different serving data stores. One contribution of the benchmark is an extensible workload generator, the YCSB Client, which can be used to load datasets and execute workloads across a variety of data serving systems. Another contribution is the definition of five core workloads, which begin to fill out the space of performance tradeoffs made by these systems. New workloads can be easily created, including generalized workloads to examine system fundamentals, as well as more domain-specific workloads to model particular applications. As an open-source package, the YCSB Client is available for developers to use and extend in order to effectively evaluate cloud systems.

We have used this tool to benchmark the performance of four cloud serving systems, and observed that there are clear tradeoffs between read and write performance that result from each system's architectural decisions. These results highlight the importance of a standard framework for examining system performance so that developers can select the most appropriate system for their needs.

10. ACKNOWLEDGEMENTS
We would like to thank the system developers who helped us tune the various systems: Jonathan Ellis from Cassandra, Ryan Rawson and Michael Stack from HBase, and the Sherpa Engineering Team in Yahoo! Cloud Computing.

11. REFERENCES
[1] Amazon SimpleDB. http://aws.amazon.com/simpledb/.
[2] Apache Cassandra. http://incubator.apache.org/cassandra/.
[3] Apache CouchDB. http://couchdb.apache.org/.
[4] Apache HBase. http://hadoop.apache.org/hbase/.
[5] Dynomite Framework. http://wiki.github.com/cliffmoon/dynomite/dynomite-framework.
[6] Google App Engine. http://appengine.google.com.
[7] Hypertable. http://www.hypertable.org/.
[8] MongoDB. http://www.mongodb.org/.
[9] Project Voldemort. http://project-voldemort.com/.
[10] Solaris FileBench. http://www.solarisinternals.com/wiki/index.php/FileBench.
[11] SQL Data Services/Azure Services Platform. http://www.microsoft.com/azure/data.mspx.
[12] Storage Performance Council. http://www.storageperformance.org/home.
[13] Yahoo! Query Language. http://developer.yahoo.com/yql/.
[14] A. Arasu et al. Linear Road: a stream data management benchmark. In VLDB, 2004.
[15] F. C. Botelho, D. Belazzougui, and M. Dietzfelbinger. Compress, hash and displace. In Proc. of the 17th European Symposium on Algorithms, 2009.
[16] F. Chang et al. Bigtable: A distributed storage system for structured data. In OSDI, 2006.
[17] B. F. Cooper et al. PNUTS: Yahoo!'s hosted data serving platform. In VLDB, 2008.
[18] G. DeCandia et al. Dynamo: Amazon's highly available key-value store. In SOSP, 2007.
[19] D. J. DeWitt. The Wisconsin Benchmark: Past, present and future. In J. Gray, editor, The Benchmark Handbook. Morgan Kaufmann, 1993.
[20] I. Eure. Looking to the future with Cassandra. http://blog.digg.com/?p=966.
[21] S. Gilbert and N. Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33(2):51-59, 2002.
[22] J. Gray, editor. The Benchmark Handbook for Database and Transaction Processing Systems. Morgan Kaufmann, 1993.
[23] J. Gray et al. Quickly generating billion-record synthetic databases. In SIGMOD, 1994.
[24] A. Lakshman, P. Malik, and K. Ranganathan. Cassandra: A structured storage system on a P2P network. In SIGMOD, 2008.
[25] B. C. Ooi and S. Parthasarathy. Special issue on data management on cloud computing platforms. IEEE Data Engineering Bulletin, vol. 32, 2009.
[26] A. Pavlo et al. A comparison of approaches to large-scale data analysis. In SIGMOD, 2009.
[27] R. Rawson. HBase intro. In NoSQL Oakland, 2009.
[28] A. Schmidt et al. XMark: A benchmark for XML data management. In VLDB, 2002.
[29] R. Sears, M. Callaghan, and E. Brewer. Rose: Compressed, log-structured replication. In VLDB, 2008.
[30] M. Seltzer, D. Krinsky, K. A. Smith, and X. Zhang. The case for application-specific benchmarking. In Proc. HotOS, 1999.
[31] P. Shivam et al. Cutting corners: Workbench automation for server benchmarking. In Proc. USENIX Annual Technical Conference, 2008.
[32] M. Stonebraker et al. C-store: a column-oriented DBMS. In VLDB, 2005.
[33] B. White et al. An integrated experimental environment for distributed systems and networks. In OSDI, 2002.
[34] K. Yocum et al. Scalability and accuracy in a large-scale network emulator. In OSDI, 2002.
