Bigtable: A Distributed Storage System For Structured Data
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach
Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber
{fay,jeff,sanjay,wilsonh,kerr,m3b,tushar,fikes,gruber}@google.com
Google, Inc.

Figure 1: A slice of an example table that stores Web pages. The row name is a reversed URL. The contents column family contains the page contents, and the anchor column family contains the text of any anchors that reference the page. CNN's home page is referenced by both the Sports Illustrated and the MY-look home pages, so the row contains columns named anchor:cnnsi.com and anchor:my.look.ca. Each anchor cell has one version; the contents column has three versions, at timestamps t3, t5, and t6.
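The row in Figure 1 can be manipulated through Bigtable's C++ client interface. The sketch below, modeled on the API excerpts in the paper, writes the two anchor cells shown in the figure; treat names such as OpenOrDie, RowMutation, Operation, and Apply as illustrative rather than as a definitive client library.

// Open the table (names modeled on the paper's C++ API excerpts; illustrative).
Table *T = OpenOrDie("/bigtable/web/webtable");

// Update the row for CNN's home page. The row key is a reversed URL, so pages
// from the same domain sort next to each other.
RowMutation r1(T, "com.cnn.www");
r1.Set("anchor:cnnsi.com", "CNN");        // anchor text from Sports Illustrated
r1.Set("anchor:my.look.ca", "CNN.com");   // anchor text from MY-look
Operation op;
Apply(&op, &r1);  // Atomically applies the mutation to the single row.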
Compression

Clients can control whether or not the SSTables for a locality group are compressed, and if so, which compression format is used. The user-specified compression format is applied to each SSTable block (whose size is controllable via a locality group specific tuning parameter). Although we lose some space by compressing each block separately, we benefit in that small portions of an SSTable can be read without decompressing the entire file. Many clients use a two-pass custom compression scheme. The first pass uses Bentley and McIlroy's scheme [6], which compresses long common strings across a large window. The second pass uses a fast compression algorithm that looks for repetitions in a small 16 KB window of the data. Both compression passes are very fast: they encode at 100-200 MB/s, and decode at 400-1000 MB/s on modern machines.

Even though we emphasized speed instead of space reduction when choosing our compression algorithms, this two-pass compression scheme does surprisingly well. For example, in Webtable, we use this compression scheme to store Web page contents. In one experiment, we stored a large number of documents in a compressed locality group. For the purposes of the experiment, we limited ourselves to one version of each document instead of storing all versions available to us. The scheme achieved a 10-to-1 reduction in space. This is much better than typical Gzip reductions of 3-to-1 or 4-to-1 on HTML pages because of the way Webtable rows are laid out: all pages from a single host are stored close to each other. This allows the Bentley-McIlroy algorithm to identify large amounts of shared boilerplate in pages from the same host. Many applications, not just Webtable, choose their row names so that similar data ends up clustered, and therefore achieve very good compression ratios. Compression ratios get even better when we store multiple versions of the same value in Bigtable.
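The structure of this scheme can be sketched as follows. LongRangeCompress (standing in for the Bentley-McIlroy long-common-string pass) and FastLocalCompress (standing in for the fast 16 KB-window pass) are hypothetical placeholders; only the shape of the pipeline, two passes applied to each SSTable block independently, is taken from the text.

#include <string>
#include <vector>

// Hypothetical placeholders for the two compression passes described above.
// LongRangeCompress models Bentley-McIlroy-style compression of long common
// strings across a large window; FastLocalCompress models a fast compressor
// that only looks for repetitions within a small 16 KB window.
std::string LongRangeCompress(const std::string& in);
std::string FastLocalCompress(const std::string& in);

// Compress an SSTable one block at a time. Because each block is compressed
// independently, a reader can decompress just the blocks it needs instead of
// the entire file.
std::vector<std::string> CompressSSTableBlocks(
    const std::vector<std::string>& blocks) {
  std::vector<std::string> out;
  out.reserve(blocks.size());
  for (const std::string& block : blocks) {
    std::string pass1 = LongRangeCompress(block);   // Pass 1: long common strings.
    out.push_back(FastLocalCompress(pass1));        // Pass 2: small-window repetitions.
  }
  return out;
}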
Bloom filters

As described in Section 5.3, a read operation has to read from all SSTables that make up the state of a tablet. If these SSTables are not in memory, we may end up doing many disk accesses. We reduce the number of accesses by allowing clients to specify that Bloom filters [7] should be created for SSTables in a particular locality group. A Bloom filter allows us to ask whether an SSTable might contain any data for a specified row/column pair. For certain applications, a small amount of tablet server memory used for storing Bloom filters drastically reduces the number of disk seeks required for read operations. Our use of Bloom filters also implies that most lookups for non-existent rows or columns do not need to touch disk.
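The read path this enables can be sketched as follows. SSTable, BloomFilter, and the row/column key encoding are simplified stand-ins for the real data structures; only the control flow (skip any SSTable whose filter rules out the row/column pair) is taken from the text.

#include <optional>
#include <string>
#include <vector>

// Simplified stand-ins for the real structures.
struct BloomFilter {
  // May return a false positive, but never a false negative.
  bool MayContain(const std::string& key) const;
};

struct SSTable {
  std::optional<BloomFilter> bloom;  // Present only if the locality group asked for filters.
  std::optional<std::string> ReadValue(const std::string& key) const;  // May hit disk.
};

// Look up a row/column pair across all SSTables that make up a tablet's state.
// SSTables whose Bloom filter says "definitely not present" are skipped, which
// is what turns most lookups of non-existent rows/columns into memory-only work.
std::optional<std::string> TabletLookup(const std::vector<SSTable>& sstables,
                                        const std::string& row,
                                        const std::string& column) {
  const std::string key = row + ":" + column;  // Illustrative key encoding.
  for (const SSTable& t : sstables) {
    if (t.bloom && !t.bloom->MayContain(key)) continue;  // No disk seek needed.
    if (auto value = t.ReadValue(key)) return value;
  }
  return std::nullopt;
}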
Commit-log implementation

If we kept the commit log for each tablet in a separate log file, a very large number of files would be written concurrently in GFS. Depending on the underlying file system implementation on each GFS server, these writes could cause a large number of disk seeks to write to the different physical log files. In addition, having separate log files per tablet also reduces the effectiveness of the group commit optimization, since groups would tend to be smaller. To fix these issues, we append mutations to a single commit log per tablet server, co-mingling mutations for different tablets in the same physical log file [18, 20].

Using one log provides significant performance benefits during normal operation, but it complicates recovery. When a tablet server dies, the tablets that it served will be moved to a large number of other tablet servers: each server typically loads a small number of the original server's tablets. To recover the state for a tablet, the new tablet server needs to reapply the mutations for that tablet from the commit log written by the original tablet server. However, the mutations for these tablets were co-mingled in the same physical log file with mutations for the other tablets served by the failed server.
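A minimal sketch of the logging structure described here: one append-only log per tablet server, each entry tagged with the tablet it belongs to so that mutations for many tablets can be co-mingled in one physical file, flushed with group commit. The types and the GFS write placeholder are illustrative, not the actual implementation.

#include <mutex>
#include <string>
#include <vector>

// One record in the shared commit log; the tablet name lets recovery code
// pick out the mutations that belong to a particular tablet later.
struct LogEntry {
  std::string tablet;
  std::string mutation;  // Serialized mutation.
};

class TabletServerLog {
 public:
  // All tablets served by this tablet server append to the same log.
  void Append(const std::string& tablet, const std::string& mutation) {
    std::lock_guard<std::mutex> l(mu_);
    pending_.push_back({tablet, mutation});
  }

  // Group commit: flush many pending entries to GFS with one large write,
  // instead of issuing a separate small write (and disk seek) per tablet.
  void Flush() {
    std::vector<LogEntry> batch;
    {
      std::lock_guard<std::mutex> l(mu_);
      batch.swap(pending_);
    }
    WriteBatchToGFS(batch);  // Illustrative placeholder for the GFS append.
  }

 private:
  void WriteBatchToGFS(const std::vector<LogEntry>& batch);
  std::mutex mu_;
  std::vector<LogEntry> pending_;
};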
Row-key ranges were handed out by a central scheduler that assigned the next available range to a client as soon as the client finished processing the previous range assigned to it. This dynamic assignment helped mitigate the effects of performance variations caused by other processes running on the client machines. We wrote a single string under each row key. Each string was generated randomly and was therefore uncompressible. In addition, strings under different row keys were distinct, so no cross-row compression was possible. The random write benchmark was similar except that the row key was hashed modulo R immediately before writing so that the write load was spread roughly uniformly across the entire row space for the entire duration of the benchmark.
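The difference between the sequential and random write benchmarks comes down to how the row key is chosen. The sketch below shows the two key generators; the zero-padded key format and the choice of hash function are illustrative, not the ones used in the actual benchmark.

#include <cstdint>
#include <cstdio>
#include <functional>
#include <string>

// Sequential writes: row keys are used in order, 0 .. R-1.
std::string SequentialRowKey(int64_t i) {
  char buf[32];
  snprintf(buf, sizeof(buf), "%016lld", static_cast<long long>(i));
  return std::string(buf);
}

// Random writes: hash the index modulo R immediately before writing, so the
// write load is spread roughly uniformly across the whole row space for the
// duration of the benchmark.
std::string RandomRowKey(int64_t i, int64_t R) {
  int64_t hashed = static_cast<int64_t>(std::hash<int64_t>{}(i) % R);
  return SequentialRowKey(hashed);
}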
The sequential read benchmark generated row keys in exactly the same way as the sequential write benchmark, but instead of writing under the row key, it read the string stored under the row key (which was written by an earlier invocation of the sequential write benchmark). Similarly, the random read benchmark shadowed the operation of the random write benchmark.

The scan benchmark is similar to the sequential read benchmark, but uses support provided by the Bigtable API for scanning over all values in a row range. Using a scan reduces the number of RPCs executed by the benchmark since a single RPC fetches a large sequence of values from a tablet server.
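The scanning support referred to here is the scanner interface of the Bigtable C++ API. The sketch below, modeled on the scanner excerpt in the paper and reusing the table handle T from the earlier example, iterates over all anchor cells of one row; the class and method names should again be read as illustrative.

// Iterate over all versions of all anchor cells in one row using a scanner.
// A single RPC can return a large sequence of values, which is why scans
// amortize RPC overhead so well.
Scanner scanner(T);
ScanStream *stream = scanner.FetchColumnFamily("anchor");
stream->SetReturnAllVersions();
scanner.Lookup("com.cnn.www");
for (; !stream->Done(); stream->Next()) {
  printf("%s %s %lld %s\n",
         scanner.RowName(), stream->ColumnName(),
         stream->Timestamp(), stream->Value());
}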
The random reads (mem) benchmark is similar to the random read benchmark, but the locality group that contains the benchmark data is marked as in-memory, and therefore the reads are satisfied from the tablet server's memory instead of requiring a GFS read. For just this benchmark, we reduced the amount of data per tablet server from 1 GB to 100 MB so that it would fit comfortably in the memory available to the tablet server.

Figure 6 shows two views on the performance of our benchmarks when reading and writing 1000-byte values to Bigtable. The table shows the number of operations per second per tablet server; the graph shows the aggregate number of operations per second.

Single tablet-server performance

Let us first consider performance with just one tablet server. Random reads are slower than all other operations by an order of magnitude or more. Each random read involves the transfer of a 64 KB SSTable block over the network from GFS to a tablet server, out of which only a single 1000-byte value is used. The tablet server executes approximately 1200 reads per second, which translates into approximately 75 MB/s of data read from GFS. This bandwidth is enough to saturate the tablet server CPUs because of overheads in our networking stack, SSTable parsing, and Bigtable code, and is also almost enough to saturate the network links used in our system. Most Bigtable applications with this type of an access pattern reduce the block size to a smaller value, typically 8 KB.
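As a rough consistency check on these numbers (assuming each random read transfers one full 64 KB block, as described above): 1200 reads/s x 64 KB is approximately 75 MB/s fetched from GFS, of which only 1200 x 1000 bytes, about 1.2 MB/s, is actually returned to clients. This large read amplification is why shrinking the block size to 8 KB helps so much for this access pattern.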
Random reads from memory are much faster since each 1000-byte read is satisfied from the tablet server's local memory without fetching a large 64 KB block from GFS.

Random and sequential writes perform better than random reads since each tablet server appends all incoming writes to a single commit log and uses group commit to stream these writes efficiently to GFS. There is no significant difference between the performance of random writes and sequential writes; in both cases, all writes to the tablet server are recorded in the same commit log.

Sequential reads perform better than random reads since every 64 KB SSTable block that is fetched from GFS is stored into our block cache, where it is used to serve the next 64 read requests.

Scans are even faster since the tablet server can return a large number of values in response to a single client RPC, and therefore RPC overhead is amortized over a large number of values.

Scaling

Aggregate throughput increases dramatically, by over a factor of a hundred, as we increase the number of tablet servers in the system from 1 to 500. For example, the performance of random reads from memory increases by almost a factor of 300 as the number of tablet servers increases by a factor of 500.
Table 2: Characteristics of a few tables in production use. Table size (measured before compression) and # Cells indicate approximate sizes. Compression ratio is not given for tables that have compression disabled.
Each row in the imagery table corresponds to a single geographic segment. Rows are named to ensure that adjacent geographic segments are stored near each other. The table contains a column family to keep track of the sources of data for each segment. This column family has a large number of columns: essentially one for each raw data image. Since each segment is only built from a few images, this column family is very sparse.

The preprocessing pipeline relies heavily on MapReduce over Bigtable to transform data. The overall system processes over 1 MB/s of data per tablet server during some of these MapReduce jobs.

The serving system uses one table to index data stored in GFS. This table is relatively small (approximately 500 GB), but it must serve tens of thousands of queries per second per datacenter with low latency. As a result, this table is hosted across hundreds of tablet servers and contains in-memory column families.
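The text does not say how imagery rows are named, only that the naming keeps adjacent geographic segments near each other in row order. One hypothetical scheme with that property is a Z-order (Morton) key that interleaves the bits of a segment's discretized latitude and longitude, sketched below purely as an illustration.

#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical row-key scheme for geographic segments: interleave the bits of
// the discretized latitude and longitude so that segments that are close on
// the map tend to be close in row-key order. This illustrates the stated
// property; it is not Google Earth's actual scheme.
std::string GeoSegmentRowKey(uint32_t lat_index, uint32_t lng_index) {
  uint64_t key = 0;
  for (int bit = 31; bit >= 0; --bit) {
    key = (key << 1) | ((lat_index >> bit) & 1);
    key = (key << 1) | ((lng_index >> bit) & 1);
  }
  char buf[32];
  snprintf(buf, sizeof(buf), "%016llx", static_cast<unsigned long long>(key));
  return std::string(buf);  // Fixed-width hex keeps lexicographic order == numeric order.
}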
8.3 Personalized Search

Personalized Search (www.google.com/psearch) is an opt-in service that records user queries and clicks across a variety of Google properties such as web search, images, and news. Users can browse their search histories to revisit their old queries and clicks, and they can ask for personalized search results based on their historical Google usage patterns.

Personalized Search stores each user's data in Bigtable. Each user has a unique userid and is assigned a row named by that userid. All user actions are stored in a table. A separate column family is reserved for each type of action (for example, there is a column family that stores all web queries). Each data element uses as its Bigtable timestamp the time at which the corresponding user action occurred. Personalized Search generates user profiles using a MapReduce over Bigtable. These user profiles are used to personalize live search results.
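A sketch of what recording one user action might look like under this schema, reusing the illustrative RowMutation-style interface from the earlier examples. The column family name web_query, the empty qualifier, and the three-argument Set overload that supplies an explicit cell timestamp are assumptions made for the illustration.

// Record one user action: the row is the userid, the column family names the
// action type, and the cell's Bigtable timestamp is the time the action
// occurred. The three-argument Set overload is a hypothetical extension of
// the illustrative RowMutation interface used earlier.
void RecordWebQuery(Table* users, const std::string& userid,
                    const std::string& query, int64_t action_time_micros) {
  RowMutation m(users, userid);
  m.Set("web_query:", query, action_time_micros);  // Family "web_query" is assumed.
  Operation op;
  Apply(&op, &m);
}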
The Personalized Search data is replicated across several Bigtable clusters to increase availability and to reduce latency due to distance from clients. The Personalized Search team originally built a client-side replication mechanism on top of Bigtable that ensured eventual consistency of all replicas. The current system now uses a replication subsystem that is built into the servers.

The design of the Personalized Search storage system allows other groups to add new per-user information in their own columns, and the system is now used by many other Google properties that need to store per-user configuration options and settings. Sharing a table amongst many groups resulted in an unusually large number of column families. To help support sharing, we added a simple quota mechanism to Bigtable to limit the storage consumption by any particular client in shared tables; this mechanism provides some isolation between the various product groups using this system for per-user information storage.

9 Lessons

In the process of designing, implementing, maintaining, and supporting Bigtable, we gained useful experience and learned several interesting lessons.

One lesson we learned is that large distributed systems are vulnerable to many types of failures, not just the standard network partitions and fail-stop failures assumed in many distributed protocols. For example, we have seen problems due to all of the following causes: memory and network corruption, large clock skew, hung machines, extended and asymmetric network partitions, bugs in other systems that we are using (Chubby for example), overflow of GFS quotas, and planned and unplanned hardware maintenance. As we have gained more experience with these problems, we have addressed them by changing various protocols. For example, we added checksumming to our RPC mechanism. We also handled this class of problems by removing the assumptions made by one part of the system about another part: for example, we stopped assuming a given Chubby operation could return only one of a fixed set of errors.
11 Conclusions

We have described Bigtable, a distributed system for storing structured data at Google. Bigtable clusters have been in production use since April 2005, and we spent roughly seven person-years on design and implementation before that date. As of August 2006, more than sixty projects are using Bigtable. Our users like the performance and high availability provided by the Bigtable implementation, and that they can scale the capacity of their clusters by simply adding more machines to the system as their resource demands change over time.

References

[1] Abadi, D. J., Madden, S. R., and Ferreira, M. C. Integrating compression and execution in column-oriented database systems. In Proc. of SIGMOD (2006).

[2] Ailamaki, A., DeWitt, D. J., Hill, M. D., and Skounakis, M. Weaving relations for cache performance. In The VLDB Journal (2001), pp. 169-180.

[3] Banga, G., Druschel, P., and Mogul, J. C. Resource containers: A new facility for resource management in server systems. In Proc. of the 3rd OSDI (Feb. 1999), pp. 45-58.

[4] Baru, C. K., Fecteau, G., Goyal, A., Hsiao, H., Jhingran, A., Padmanabhan, S., Copeland,