11: Distributed Operating Systems
Brad Campbell – [email protected]
https://www.cs.virginia.edu/~bjc8c/class/cs6456-f19/
Conservative Interpretation of E2E
• Don't implement a function at the lower levels of the system unless it can be completely implemented at that level
Moderate Interpretation
• Think twice before implementing functionality in the
network
• If hosts can implement functionality correctly, implement
it in a lower layer only as a performance enhancement
• But do so only if it does not impose burden on
applications that do not require that functionality
• This is the interpretation we are using
Remote Procedure Call (RPC)
• Raw messaging is a bit too low-level for
programming
• Must wrap up information into message at source
• Must decide what to do with message at destination
• May need to sit and wait for multiple messages to
arrive
RPC Information Flow
[Diagram: on Machine A, the client (caller) calls its client stub, which bundles the args and hands the request to the packet handler to send over the network; on Machine B, the packet handler receives the message into mbox1, the server stub unbundles the args and calls the server (callee); the return values flow back the same way, bundled by the server stub and delivered to the client's return mailbox mbox2, where the client stub unbundles them and returns.]
RPC Details
• Equivalence with regular procedure call
• Parameters → Request message
• Result → Reply message
• Name of procedure: passed in request message
• Return address: mbox2 (client return mailbox)
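As a rough illustration of this correspondence (not code from any particular RPC package), the sketch below bundles the procedure name, arguments, and return mailbox into a request message and unbundles the reply; the in-memory mailbox "transport" and the add procedure are invented for the example.

import pickle
import queue

# Sketch of RPC marshalling with an in-memory "network": one queue per
# mailbox. mbox1 is the server's request mailbox, mbox2 the client's reply
# mailbox (names follow the diagram; the transport itself is invented here).
mailboxes = {"mbox1": queue.Queue(), "mbox2": queue.Queue()}

def rpc_call(proc_name, *args):
    # Client stub: bundle procedure name, args, and return mailbox.
    request = pickle.dumps({"proc": proc_name, "args": args, "reply_to": "mbox2"})
    mailboxes["mbox1"].put(request)                 # "send" over the network
    serve_one()                                     # stand-in for the remote machine
    reply = pickle.loads(mailboxes["mbox2"].get())  # wait for reply, unbundle
    return reply["ret"]

def serve_one():
    # Server stub: unbundle args, call the procedure, bundle the result.
    req = pickle.loads(mailboxes["mbox1"].get())
    procedures = {"add": lambda a, b: a + b}        # the "server" (callee)
    ret = procedures[req["proc"]](*req["args"])
    mailboxes[req["reply_to"]].put(pickle.dumps({"ret": ret}))

print(rpc_call("add", 2, 3))  # 5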
Problems with RPC: Non-Atomic Failures
• Different failure modes in dist. system than on a
single machine
• Consider many different types of failures
• User-level bug causes address space to crash
• Machine failure, kernel bug causes all
processes on same machine to fail
• Some machine is compromised by malicious
party
• Before RPC: whole system would crash/die
• After RPC: One machine crashes/compromised
while others keep working
• Can easily result in inconsistent view of the world
• Did my cached data get written back or not?
• Did server do what I requested or not?
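To make the ambiguity concrete, here is a toy simulation (the balance, failure probabilities, and the use of TimeoutError are all invented): the client cannot tell whether a timed-out debit was executed before the reply was lost, so retrying risks double-charging while giving up risks not charging at all.

import random

# Toy simulation of why RPC failures are non-atomic. The "server" keeps a
# balance; the "network" sometimes crashes the server before it executes,
# and sometimes drops the reply after it has already executed.
balance = {"acct": 100}

def rpc_debit(acct, amount):
    if random.random() < 0.25:
        raise TimeoutError("server crashed before executing")
    balance[acct] -= amount                   # server executes the request...
    if random.random() < 0.25:
        raise TimeoutError("reply lost")      # ...but the reply never arrives

def debit_with_retries(acct, amount, retries=3):
    for _ in range(retries):
        try:
            rpc_debit(acct, amount)
            return
        except TimeoutError:
            # Did the server apply the debit or not? The client can't tell.
            # Retrying can double-charge; giving up may mean no charge at all.
            continue

debit_with_retries("acct", 10)
print(balance)  # anywhere from 100 down to 70, depending on which failures hit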
Problems with RPC: Performance
• Cost of procedure call ≪ same-machine RPC ≪ network RPC
Important “ilities”
The Promise of Dist. Systems
• Availability: One machine goes down, overall
system stays up
Distributed: Worst-Case Reality
• Availability: Failure in one machine brings
down entire system
Distributed Systems Goal
• Transparency: Hide "distributed-ness" from any
external observer, make system simpler
• Types
• Location: Location of resources is invisible
• Migration: Resources can move without user knowing
• Replication: Invisible extra copies of resources (for
reliability, performance)
• Parallelism: Job split into multiple pieces, but looks
like a single task
• Fault Tolerance: Components fail without users
knowing
Challenge of Coordination
• Components communicate over the
network
• Send messages between machines
CAP Theorem
• Originally proposed by Eric Brewer (Berkeley)
• When the network partitions, the system must choose between staying Consistent (C) and staying Available (A)
[Diagram: a network partition splits the nodes into separate groups (e.g., Partition A).]
Consistency Preferred
• CP choice: when a partition occurs, nodes on at least one side stop serving requests so that replicas cannot diverge (availability is sacrificed)
What about AP Systems?
• Partition occurs, but both groups of nodes
continue to accept requests
• Consequence: State might diverge between
the two groups (e.g., different updates are
executed)
• When communication is restored, there
needs to be an explicit recovery process
• Resolve conflicting updates so everyone agrees
on system state once again
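One deliberately simple recovery policy is a last-writer-wins merge keyed on timestamps, sketched below; real AP systems often use richer mechanisms (vector clocks, CRDTs), so treat this as an illustration only.

# Sketch: last-writer-wins merge of two replicas' key-value state after a
# partition heals. Each replica maps key -> (value, timestamp). This is an
# illustrative policy, not what any specific AP system does.

def merge_lww(replica_a, replica_b):
    merged = dict(replica_a)
    for key, (value, ts) in replica_b.items():
        if key not in merged or ts > merged[key][1]:
            merged[key] = (value, ts)
    return merged

# Example: both sides accepted conflicting updates during the partition.
a = {"x": ("1", 10), "y": ("hello", 12)}
b = {"x": ("2", 11), "z": ("new", 9)}
print(merge_lww(a, b))  # {'x': ('2', 11), 'y': ('hello', 12), 'z': ('new', 9)}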
General’s Paradox
• Two generals located on opposite sides of
their enemy’s position
• Can only communicate via messengers
• Messengers go through enemy territory:
might be captured
• "Yeah, but what if you don't get this ack?"
• No finite exchange of acknowledgements can make both generals certain they agree
Two-Phase Commit
Commit Transaction
• Coordinator learns all machines have agreed
to commit
• Apply transaction, inform voters
• Record decision in local log
Abort Transaction
• Coordinator learns at least one machine has
voted to abort
• Do not apply transaction, inform voters
• Record decision in local log
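A minimal sketch of this decision rule, with an invented Worker class and no timeouts, retransmission, or log-based crash recovery:

# Minimal two-phase commit sketch. The Worker class and its voting rule are
# invented for illustration; real implementations must also handle timeouts
# and replay their logs after a crash.

class Worker:
    def __init__(self, will_commit=True):
        self.will_commit = will_commit
        self.log = []

    def vote(self, txn):
        v = "VOTE-COMMIT" if self.will_commit else "VOTE-ABORT"
        self.log.append((v, txn))
        return v

    def commit(self, txn): self.log.append(("COMMIT", txn))
    def abort(self, txn):  self.log.append(("ABORT", txn))

def two_phase_commit(txn, workers, coordinator_log):
    votes = [w.vote(txn) for w in workers]              # phase 1: collect votes
    if all(v == "VOTE-COMMIT" for v in votes):
        coordinator_log.append(("GLOBAL-COMMIT", txn))  # record decision in local log
        for w in workers: w.commit(txn)                 # phase 2: inform voters
        return True
    coordinator_log.append(("GLOBAL-ABORT", txn))       # at least one abort vote
    for w in workers: w.abort(txn)
    return False

log = []
print(two_phase_commit("txn1", [Worker(), Worker(), Worker(False)], log))  # False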
Example: Failure-Free 2PC (commit)
[Timeline: the coordinator sends VOTE-REQ to workers 1-3; each worker replies VOTE-COMMIT; the coordinator then sends GLOBAL-COMMIT to all workers.]
Example: Failure-Free 2PC (abort)
[Timeline: the coordinator sends VOTE-REQ; at least one worker replies VOTE-ABORT while another replies VOTE-COMMIT; the coordinator sends GLOBAL-ABORT to all workers.]
Example of Worker Failure
[Timeline: the coordinator (INIT → WAIT) sends VOTE-REQ; worker 2 replies VOTE-COMMIT, but another worker never votes, leaving the coordinator waiting in WAIT.]
Example of Coordinator Failure
[Timeline: after VOTE-REQ the workers move from INIT to READY; worker 2 sends VOTE-COMMIT; the coordinator crashes and is restarted; on recovery it sends GLOBAL-ABORT.]
Google Spanner
James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, J. J. Furman,
Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian
Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle,
Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth
Wang, and Dale Woodford. 2012. Spanner: Google's globally-distributed database. In Proceedings of
the 10th USENIX conference on Operating Systems Design and Implementation (OSDI'12). USENIX
Association, Berkeley, CA, USA, 251-264.
Basic Spanner Operation
• Data replicated across
datacenters
• Paxos groups support
transactions
• On commit:
• Grab Paxos lock
• Paxos algorithm decides
consensus
• If all agree, transaction is committed
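As a very rough sketch of the idea (not Spanner's actual code path), a group leader can hold a lock, ask its replicas to accept a write, and apply it once a majority agrees; the Replica class and its accept() call below are invented stand-ins for a real Paxos round.

import threading

class Replica:
    def __init__(self):
        self.store = {}
    def accept(self, key, value):
        self.store[key] = value
        return True                          # a real replica could reject or fail

class GroupLeader:
    def __init__(self, replicas):
        self.replicas = replicas
        self.lock = threading.Lock()         # "grab Paxos lock"
        self.committed = {}

    def commit_write(self, key, value):
        with self.lock:
            acks = sum(1 for r in self.replicas if r.accept(key, value))
            if acks > len(self.replicas) // 2:   # a majority of the group agrees
                self.committed[key] = value      # transaction is committed
                return True
            return False

leader = GroupLeader([Replica() for _ in range(3)])
print(leader.commit_write("row1", "value"))  # True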
Spanner Operation
[Diagram: a multi-shard transaction runs two-phase commit (2PC) across multiple Paxos groups.]
Base operation great for writes…
• What about reads?
• Reads are dominant operations
• e.g., FB’s TAO had 500 reads : 1 write [ATC 2013]
• e.g., Google Ads (F1) on Spanner from 1? DC in 24h:
21.5B reads
31.2M single-shard transactions
32.1M multi-shard transactions
• Want efficient read transactions
Make Read-Only Txns Efficient
• Ideal: Read-only transactions that are non-blocking
• Arrive at shard, read data, send data back
TrueTime
• TT.now() returns an interval [earliest, latest] of width 2ε
• Consider event e_now which invoked tt = TT.now():
• Guarantee: tt.earliest ≤ t_abs(e_now) ≤ tt.latest
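A toy model of this interface, with an invented 4 ms uncertainty bound (real TrueTime derives ε from GPS and atomic-clock references):

import time
from collections import namedtuple

# Toy model of the TrueTime interface. EPSILON is an invented uncertainty
# bound; real TrueTime computes it from its clock references.
TTInterval = namedtuple("TTInterval", ["earliest", "latest"])
EPSILON = 0.004  # assume a 4 ms bound

def tt_now():
    t = time.time()                    # local clock reading
    # Modeled guarantee: earliest <= t_abs(e_now) <= latest
    return TTInterval(t - EPSILON, t + EPSILON)

tt = tt_now()
print(tt.latest - tt.earliest)         # interval width is 2 * EPSILON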
TrueTime for Read-Only Txns
[Diagram: commit wait interval of roughly 2ε (average ε on either side of the commit timestamp).]
• Key: Need to ensure that all future transactions will get a higher timestamp
• Commit wait ensures this
Commit wait
• What does this mean for performance?
• Larger TrueTime uncertainty bound → longer commit wait
• Longer commit wait → locks held longer → can't process conflicting transactions → lower throughput
• i.e., if time is less certain, Spanner is slower!
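Commit wait itself is short to state in code: after choosing commit timestamp s, keep locks held and delay the acknowledgement until TT.now().earliest has passed s. A sketch using the same toy TrueTime model as above (restated here so the block stands alone); the numbers and the apply_txn callback are illustrative only.

import time
from collections import namedtuple

TTInterval = namedtuple("TTInterval", ["earliest", "latest"])
EPSILON = 0.004  # invented 4 ms uncertainty bound

def tt_now():
    t = time.time()
    return TTInterval(t - EPSILON, t + EPSILON)

def commit_with_wait(apply_txn):
    s = tt_now().latest                # chosen commit timestamp s = TT.now().latest
    apply_txn(s)                       # apply writes at timestamp s (locks still held)
    while tt_now().earliest <= s:      # commit wait: until s is definitely in the past
        time.sleep(0.0005)
    # Only now acknowledge and release locks; any transaction that starts after
    # this point sees TT.now().earliest > s, so it gets a larger timestamp.
    return s

print(commit_with_wait(lambda ts: None))  # waits roughly 2 * EPSILON before returning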