Google: Designs, Lessons and Advice From Building Large Distributed Systems
Jeff Dean
Google Fellow
[email protected]
Computing shifting to really small and really big devices
• UI-centric devices
• Servers: CPUs, DRAM, disks
• Racks: 40-80 servers, Ethernet switch
• Clusters: 30+ racks
Architectural view of the storage hierarchy
• One server: processors with L1 caches, shared L2 caches, local DRAM, local disk
  – Local DRAM: 16GB, 100ns, 20GB/s
  – Local disk: 2TB, 10ms, 200MB/s
• Local rack (80 servers), connected by a rack switch
  – Rack DRAM: 1TB, 300us, 100MB/s
• Cluster (30+ racks), connected by a cluster switch
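
A quick way to internalize these numbers is to compare how long the same transfer takes at each level. A minimal back-of-the-envelope sketch in Python, using the latency and bandwidth figures quoted above (the 1 GB payload is an assumption for illustration):

# Back-of-the-envelope: time to fetch 1 GB at each level of the hierarchy,
# using the latency/bandwidth figures quoted above (payload size is assumed).

LEVELS = {
    # name: (latency_seconds, bandwidth_bytes_per_second)
    "local DRAM": (100e-9, 20e9),    # 100 ns, 20 GB/s
    "local disk": (10e-3, 200e6),    # 10 ms, 200 MB/s
    "rack DRAM":  (300e-6, 100e6),   # 300 us, 100 MB/s (over the rack switch)
}

PAYLOAD = 1 << 30  # 1 GB, illustrative

for name, (latency, bandwidth) in LEVELS.items():
    seconds = latency + PAYLOAD / bandwidth
    print(f"{name:10s}: {seconds * 1e3:9.1f} ms")

Local DRAM is dominated by bandwidth; anything behind a disk arm or a rack switch pays both a large latency and a much lower transfer rate.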
Typical first year for a new cluster:
~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)
~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)
~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours)
~1 network rewiring (rolling ~5% of machines down over 2-day span)
~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
~5 racks go wonky (40-80 machines see 50% packet loss)
~8 network maintenances (4 might cause ~30-minute random connectivity losses)
~12 router reloads (takes out DNS and external VIPs for a couple of minutes)
~3 router failures (have to immediately pull traffic for an hour)
~dozens of minor 30-second blips for DNS
~1000 individual machine failures
~thousands of hard drive failures
slow disks, bad memory, misconfigured machines, flaky machines, etc.
Long-distance links: wild dogs, sharks, dead horses, drunken hunters, etc.
Understanding downtime behavior matters
Understanding fault statistics matters
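
One way the fault statistics feed into design: with on the order of a thousand machine failures per year in a cluster, a job that holds thousands of machines for hours will see failures routinely, so recovery has to be automatic rather than exceptional. A hedged Python sketch of that arithmetic (the ~1000 failures/year figure is from the list above; cluster size, job footprint, and duration are assumptions):

import math

# Rough failure-rate arithmetic. The ~1000 machine failures/year figure is
# from the list above; cluster size, job size, and duration are assumptions.

FAILURES_PER_YEAR = 1000
CLUSTER_MACHINES = 10_000          # assumed cluster size
per_machine_rate = FAILURES_PER_YEAR / CLUSTER_MACHINES  # failures/machine/year

JOB_MACHINES = 2000                # assumed job footprint
JOB_HOURS = 6                      # assumed job duration
years = JOB_HOURS / (24 * 365)

# Poisson approximation: P(no failure) = exp(-rate * machines * time)
p_at_least_one = 1 - math.exp(-per_machine_rate * JOB_MACHINES * years)
print(f"P(at least one machine failure during the job) ≈ {p_at_least_one:.1%}")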
[Cluster software figure: Machine 1 … Machine N each run a GFS chunkserver and a scheduling slave hosting tasks from many jobs (job 1, job 3, job 5, job 7, job 12, …); a central scheduling master and a GFS master coordinate them.]
GFS Design
[Figure: clients talk to the GFS master (with replicas) for metadata, and read/write chunk data (C0, C1, C2, C3, C5, …) directly from chunkservers; each chunk is replicated across several chunkservers.]
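
To make the split between metadata and data paths concrete, here is a purely illustrative Python sketch of a GFS-style read: the client asks the master only for chunk locations, then fetches bytes directly from a chunkserver replica. All class, function, and server names are invented for the sketch; the real protocol is in the SOSP 2003 paper listed under further reading.

# Illustrative GFS-style read path (all names are invented for this sketch).
# Control flow: client -> master for chunk locations, then client -> chunkserver
# for the actual bytes; the master never sits on the data path.

import random

CHUNK_SIZE = 64 * 1024 * 1024  # GFS uses 64 MB chunks

class Master:
    """Holds only metadata: which chunkservers hold which chunk of which file."""
    def __init__(self, chunk_locations):
        self.chunk_locations = chunk_locations  # (path, chunk_index) -> [replica names]

    def lookup(self, path, chunk_index):
        return self.chunk_locations[(path, chunk_index)]

def read(master, chunkservers, path, offset, length):
    """Read `length` bytes of `path` starting at `offset`."""
    chunk_index = offset // CHUNK_SIZE
    replicas = master.lookup(path, chunk_index)        # metadata op
    server = chunkservers[random.choice(replicas)]     # pick any replica
    chunk_offset = offset % CHUNK_SIZE
    return server[(path, chunk_index)][chunk_offset:chunk_offset + length]  # data op

# Tiny usage example with in-memory "chunkservers".
chunkservers = {
    "cs1": {("/logs/a", 0): b"hello world" + b"\0" * 100},
    "cs2": {("/logs/a", 0): b"hello world" + b"\0" * 100},
}
master = Master({("/logs/a", 0): ["cs1", "cs2"]})
print(read(master, chunkservers, "/logs/a", 6, 5))  # b'world'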
Protocol Buffers
message SearchResult {
  required int32 estimated_results = 1;  // (1 is the tag number)
  optional string error_message = 2;
  repeated group Result = 3 {
    required float score = 4;
    required fixed64 docid = 5;
    optional message<WebResultDetails> = 6;
    …
  }
};
Protocol Buffers (cont)
• Automatically generated language wrappers
• Graceful client and server upgrades
– systems ignore tags they don't understand, but pass the information through
(no need to upgrade intermediate servers)
• Serialization/deserialization
– high performance (200+ MB/s encode/decode)
– fairly compact (uses variable length encodings)
– format used to store data persistently (not just for RPCs)
• Also allow service specifications:
service Search {
  rpc DoSearch(SearchRequest) returns (SearchResponse);
  rpc DoSnippets(SnippetRequest) returns (SnippetResponse);
  rpc Ping(EmptyMessage) returns (EmptyMessage) {
    { protocol=udp; };
  };
};
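
For illustration, this is roughly what the generated wrappers look like from Python, assuming the message above is compiled with protoc into a module named search_pb2 (the module name is an assumption, and a nested message stands in for the legacy repeated group syntax):

# Sketch of generated-wrapper usage, assuming a proto2 analogue of the
# SearchResult message above was compiled with protoc into search_pb2
# (module name is an assumption; a nested message stands in for the
# legacy `repeated group` syntax).

from search_pb2 import SearchResult  # hypothetical generated module

msg = SearchResult()
msg.estimated_results = 42
r = msg.result.add()       # add one repeated Result entry
r.score = 0.9
r.docid = 123456789

data = msg.SerializeToString()          # compact, varint-based wire format
parsed = SearchResult.FromString(data)  # unknown tags are skipped but carried through
print(parsed.result[0].score)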
Given a basic problem definition, how do you choose the "best" solution?
• Best could be simplest, highest performance, easiest to extend, etc.
Lots of variations:
– caching (single images? whole sets of thumbnails?)
– pre-computing thumbnails
–…
Back of the envelope helps identify most promising…
Write Microbenchmarks!
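
As a worked example of the back-of-the-envelope step, here is a rough Python estimate for serving a page of image thumbnails from disk, using the disk figures quoted earlier (10 ms seek, 200 MB/s transfer); the page shape of 30 thumbnails at 256 KB each is an assumption for illustration:

# Back-of-the-envelope: render a page of image thumbnails.
# Disk numbers (10 ms seek, 200 MB/s read) are the ones quoted earlier;
# the page shape (30 thumbnails, 256 KB each) is assumed for illustration.

SEEK_S = 10e-3
READ_BPS = 200e6
THUMBS = 30
THUMB_BYTES = 256 * 1024

# Design A: read each thumbnail with its own seek.
random_reads = THUMBS * (SEEK_S + THUMB_BYTES / READ_BPS)

# Design B: thumbnails for a set pre-computed and stored contiguously,
# so one seek plus one sequential read.
one_big_read = SEEK_S + THUMBS * THUMB_BYTES / READ_BPS

print(f"30 random reads  : {random_reads * 1e3:6.1f} ms")
print(f"1 contiguous read: {one_big_read * 1e3:6.1f} ms")

Reading the thumbnails one seek at a time is dominated by seek latency; pre-computing the set and storing it contiguously cuts the disk time by several times, which is exactly the kind of variation the envelope math surfaces before any code is written.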
Canary requests (sketched after this list)
Failover to other replicas/datacenters
Bad backend detection:
stop using for live requests until behavior gets better
More aggressive load balancing when imbalance is more severe
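
A hedged sketch of the canary-request idea: before fanning a request out to thousands of servers, try it on one backend and only proceed if that backend survives, so a request that triggers a crash bug takes out one machine instead of all of them. Function and parameter names are assumptions for illustration:

# Sketch of canary requests (names and failure signal are assumptions):
# try the request on one backend before fanning out to all of them.

def fan_out(request, backends, send, max_canary_attempts=3):
    # send(backend, request) -> response, raises on crash/timeout
    for backend in backends[:max_canary_attempts]:
        try:
            send(backend, request)    # canary: one backend at a time
            break
        except Exception:
            continue                  # that backend died/timed out; try another
    else:
        raise RuntimeError("request rejected: failed all canary attempts")
    # Canary survived: now safe(r) to send to everyone.
    return [send(b, request) for b in backends]

# Example: a send() that would crash one backend on a "poison" request.
def send(backend, request):
    if request == "poison" and backend == "b0":
        raise RuntimeError("backend crashed")
    return f"{backend}: ok"

print(fan_out("normal", ["b0", "b1", "b2"], send))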
[MapReduce example figure: intermediate (key, value) pairs such as (0, WA-520), (1, I-90), (1, I-5), (1, Lake Wash.), (1, I-90), …]
Parallel MapReduce
[Figure: input data is split across many map tasks; a master coordinates the map and reduce workers.]
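
The shape of the computation is easy to show in miniature. The sketch below is a single-process stand-in for the parallel version, reusing the highway names from the example above as toy input; function names are mine, and the real system partitions the map output across many reduce workers under the master's control.

# Miniature, single-process stand-in for MapReduce-style counting.
# The real system runs map and reduce tasks on many workers under a master,
# with the map output partitioned and shuffled between them.

from collections import defaultdict

def map_phase(records, map_fn):
    intermediate = defaultdict(list)           # shuffle: group values by key
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            intermediate[out_key].append(out_value)
    return intermediate

def reduce_phase(intermediate, reduce_fn):
    return {key: reduce_fn(key, values) for key, values in intermediate.items()}

# Word count expressed as map/reduce functions.
def word_map(_doc_id, text):
    for word in text.split():
        yield word, 1

def word_reduce(_word, counts):
    return sum(counts)

docs = [(0, "I-90 I-5"), (1, "I-90 WA-520")]
print(reduce_phase(map_phase(docs, word_map), word_reduce))
# {'I-90': 2, 'I-5': 1, 'WA-520': 1}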
[BigTable data model figure: a sparse, distributed, persistent (row, column, timestamp) → value map. Rows such as "aaa.com", "cnn.com", "cnn.com/sports.html", "website.com", "zuppa.com/menu.html"; columns such as "contents:" and "language:"; multiple timestamped versions per cell (t3, t11, t17), e.g. "www.cnn.com" has "contents:" = "<html>…" and "language:" = EN.]
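
A minimal in-memory Python sketch of that data model: values are keyed by (row, column, timestamp), and a read without an explicit timestamp returns the newest version. Distribution, column families, and garbage collection of old versions are omitted, and all names are mine.

# Minimal in-memory sketch of the (row, column, timestamp) -> value model.
# Distribution, column families, and old-version GC are omitted; names are mine.

class TinyTable:
    def __init__(self):
        self.cells = {}                        # (row, column) -> {timestamp: value}

    def put(self, row, column, timestamp, value):
        self.cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row, column, timestamp=None):
        versions = self.cells.get((row, column), {})
        if not versions:
            return None
        if timestamp is None:                  # default: most recent version
            return versions[max(versions)]
        # otherwise: latest version at or before the requested timestamp
        eligible = [t for t in versions if t <= timestamp]
        return versions[max(eligible)] if eligible else None

t = TinyTable()
t.put("www.cnn.com", "contents:", 3, "<html>v1…")
t.put("www.cnn.com", "contents:", 11, "<html>v2…")
t.put("www.cnn.com", "contents:", 17, "<html>v3…")
t.put("www.cnn.com", "language:", 17, "EN")
print(t.get("www.cnn.com", "contents:"))       # latest version: "<html>v3…"
print(t.get("www.cnn.com", "contents:", 5))    # as of t=5: "<html>v1…"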
Tablets & Splitting
[Figure: the sorted row space ("aaa.com" … "cnn.com" … "cnn.com/sports.html" … "website.com" … "yahoo.com/kids.html" … "zuppa.com/menu.html") is dynamically partitioned into tablets; a large tablet splits at a row boundary, e.g. just after "yahoo.com/kids.html" (at "yahoo.com/kids.html\0").]
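
With sorted split points, finding the tablet responsible for a row is just a binary search. A minimal Python sketch using the split shown above (the two-tablet layout and function names are assumptions for illustration):

# Sketch of mapping a row key to its tablet given sorted split keys.
# The split point is taken from the example above; names are mine.

import bisect

# Tablets partition the sorted row space. A tablet's start key is inclusive,
# so splitting at "yahoo.com/kids.html\0" (the smallest key strictly greater
# than "yahoo.com/kids.html") keeps that row in the earlier tablet.
SPLIT_KEYS = ["yahoo.com/kids.html\0"]   # start keys of every tablet except the first

def tablet_for_row(row_key, splits=SPLIT_KEYS):
    # Number of split keys <= row_key == index of the tablet serving the row.
    return bisect.bisect_right(splits, row_key)

print(tablet_for_row("cnn.com"))               # 0
print(tablet_for_row("yahoo.com/kids.html"))   # 0  (kept in the first tablet)
print(tablet_for_row("zuppa.com/menu.html"))   # 1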
BigTable System Structure
[Figure: a Bigtable cell contains a Bigtable master, which performs metadata ops and load balancing, plus tablet servers that serve the data; clients Open() a table via the metadata path and then read/write directly to tablet servers, bypassing the master.]
Further reading:
• Ghemawat, Gobioff, & Leung. The Google File System. SOSP 2003.
• Barroso, Dean, & Hölzle. Web Search for a Planet: The Google Cluster Architecture. IEEE Micro, 2003.
• Dean & Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004.
• Chang, Dean, Ghemawat, Hsieh, Wallach, Burrows, Chandra, Fikes, & Gruber. Bigtable: A Distributed Storage System for Structured Data. OSDI 2006.
• Burrows. The Chubby Lock Service for Loosely-Coupled Distributed Systems. OSDI 2006.
• Pinheiro, Weber, & Barroso. Failure Trends in a Large Disk Drive Population. FAST 2007.
• Brants, Popat, Xu, Och, & Dean. Large Language Models in Machine Translation. EMNLP 2007.
• Malewicz et al. Pregel: A System for Large-Scale Graph Processing. PODC 2009.
• Schroeder, Pinheiro, & Weber. DRAM Errors in the Wild: A Large-Scale Field Study. SIGMETRICS 2009.