
CS6456: Graduate Operating Systems
Brad Campbell – [email protected]
https://www.cs.virginia.edu/~bjc8c/class/cs6456-f19/

Some slides modified from CS162 at UCB


1
End-to-End Principle
Implementing complex functionality in the network:
• Doesn't reduce host implementation complexity
• Does increase network complexity
• Probably imposes delay and overhead on all applications, even if they don't need functionality
• However, implementing in the network can enhance performance in some cases
• e.g., very lossy link

2
Conservative Interpretation of E2E
• Don't implement a function at the lower levels of the system unless it can be completely implemented at this level
• Or: Unless you can relieve the burden from hosts, don't bother

3
Moderate Interpretation
• Think twice before implementing functionality in the
network
• If hosts can implement functionality correctly, implement
it in a lower layer only as a performance enhancement
• But do so only if it does not impose burden on
applications that do not require that functionality
• This is the interpretation we are using

• Is this still valid?
• What about Denial of Service?
• What about privacy against intrusion?
• Perhaps there are things that must be in the network?

4
Remote Procedure Call (RPC)
• Raw messaging is a bit too low-level for
programming
• Must wrap up information into message at source
• Must decide what to do with message at destination
• May need to sit and wait for multiple messages to
arrive

• Another option: Remote Procedure Call (RPC)
• Calls a procedure on a remote machine
• Client calls: remoteFileSystemRead("rutabaga");
• Translated automatically into call on server: fileSysRead("rutabaga"); (see the sketch below)
5
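To make the call translation concrete, here is a minimal sketch of the same pattern using Python's standard xmlrpc module. This is purely illustrative, not the RPC system from the slides; the host/port and the open()-based body of fileSysRead are assumptions, and the names mirror the slide's remoteFileSystemRead/fileSysRead example.

```python
# --- server side (machine B) ---
from xmlrpc.server import SimpleXMLRPCServer

def fileSysRead(path):
    # Hypothetical local file-system read; returns file contents.
    with open(path) as f:
        return f.read()

server = SimpleXMLRPCServer(("localhost", 8000), allow_none=True)
# Expose the local procedure under the name the client will call.
server.register_function(fileSysRead, "remoteFileSystemRead")
# server.serve_forever()   # run the server loop

# --- client side (machine A) ---
import xmlrpc.client

proxy = xmlrpc.client.ServerProxy("http://localhost:8000/")
# Looks like a local call, but is translated into a request message,
# sent to the server, executed there, and the reply unmarshalled here.
# data = proxy.remoteFileSystemRead("rutabaga")
```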
RPC Implementation
• Request-response message passing (under covers!)
• “Stub” provides glue on client/server
• Client stub is responsible for “marshalling” arguments and
“unmarshalling” the return values
• Server-side stub is responsible for “unmarshalling”
arguments and “marshalling” the return values.

• Marshalling involves (depending on system)
• Converting values to a canonical form, serializing objects, copying arguments passed by reference, etc. (see the sketch below)

6
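As a minimal sketch of what a stub's marshalling/unmarshalling step might look like, the fragment below bundles a procedure name and arguments into a byte string and recovers them on the other side. JSON is just one possible wire format, chosen here for brevity; real RPC systems use their own encodings.

```python
import json

def marshal_call(proc_name, args):
    # Client stub: bundle the procedure name and arguments into a
    # canonical wire format (JSON here, purely for illustration).
    return json.dumps({"proc": proc_name, "args": args}).encode("utf-8")

def unmarshal_call(message):
    # Server stub: recover the procedure name and arguments.
    request = json.loads(message.decode("utf-8"))
    return request["proc"], request["args"]

msg = marshal_call("fileSysRead", ["rutabaga"])
proc, args = unmarshal_call(msg)     # ("fileSysRead", ["rutabaga"])
```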
RPC Information Flow
[Figure: RPC information flow. On Machine A, the client (caller) calls the client stub, which bundles the arguments and hands the message to the packet handler to send over the network; the reply arrives in mbox2, the stub unbundles the return values, and the call returns to the client. On Machine B, the packet handler receives the request in mbox1, the server stub unbundles the arguments and calls the server (callee); the return values are bundled and sent back.]

7
RPC Details
• Equivalence with regular procedure call
• Parameters → Request message
• Result → Reply message
• Name of Procedure: passed in request message
• Return Address: mbox2 (client return mailbox)
• Stub generator: compiler that generates stubs
• Input: interface definitions in an "interface definition language (IDL)"
• Contains, among other things, types of arguments/return values
• Output: stub code in the appropriate source language
• Code for client to pack message, send it off, wait for result, unpack result and return to caller
• Code for server to unpack message, call procedure, pack results, send them off
8
RPC Details
• Cross-platform issues:
• What if client/server machines are
different architectures/ languages?
• Convert everything to/from some canonical form
• Tag every item with an indication of how it is
encoded (avoids unnecessary conversions)

9
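One common way to get a canonical form is to send fixed-size fields in network (big-endian) byte order, with a type tag in front of each item so the receiver knows how it was encoded. The fragment below is a toy illustration of that idea using Python's struct module; the tag values and the two supported types are assumptions made up for the example.

```python
import struct

TAG_INT32, TAG_FLOAT64 = 1, 2   # illustrative type tags

def encode_int32(value):
    # '!' = network (big-endian) byte order, no padding.
    return struct.pack("!Bi", TAG_INT32, value)

def encode_float64(value):
    return struct.pack("!Bd", TAG_FLOAT64, value)

def decode(buf):
    # First byte tells us how the item was encoded.
    tag = buf[0]
    if tag == TAG_INT32:
        return struct.unpack("!i", buf[1:5])[0]
    if tag == TAG_FLOAT64:
        return struct.unpack("!d", buf[1:9])[0]
    raise ValueError("unknown tag")

assert decode(encode_int32(42)) == 42
assert decode(encode_float64(3.5)) == 3.5
```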
Problems with RPC: Non-Atomic Failures
• Different failure modes in dist. system than on a
single machine
• Consider many different types of failures
• User-level bug causes address space to crash
• Machine failure, kernel bug causes all
processes on same machine to fail
• Some machine is compromised by malicious
party
• Before RPC: whole system would crash/die
• After RPC: One machine crashes/compromised
while others keep working
• Can easily result in inconsistent view of the world
• Did my cached data get written back or not?
• Did server do what I requested or not?
10
Problems with RPC: Performance
• Cost of procedure call ≪ same-machine RPC ≪ network RPC
• Means programmers must be aware that RPC is not free
• Caching can help, but may make failure handling complex

11
12
Important “ilities”
• Availability: probability that the system can accept and process requests
• Durability: the ability of a system to recover data despite faults
• Reliability: the ability of a system or component to perform its required functions under stated conditions for a specified period of time (IEEE definition)
13
Distributed: Why?
• Simple, cheaper components

• Easy to add capability incrementally

• Let multiple users cooperate (maybe)


• Physical components owned by different users
• Enable collaboration between diverse users

17
The Promise of Dist. Systems
• Availability: One machine goes down, overall system stays up
• Durability: One machine loses data, but system does not lose anything
• Security: Easier to secure each component of the system individually?

18
Distributed: Worst-Case Reality
• Availability: Failure in one machine brings down entire system
• Durability: Any machine can lose your data
• Security: More components means more points of attack

19
Distributed Systems Goal
• Transparency: Hide "distributed-ness" from any
external observer, make system simpler
• Types
• Location: Location of resources is invisible
• Migration: Resources can move without user knowing
• Replication: Invisible extra copies of resources (for
reliability, performance)
• Parallelism: Job split into multiple pieces, but looks
like a single task
• Fault Tolerance: Components fail without users
knowing
20
Challenge of Coordination
• Components communicate over the network
• Send messages between machines
• Need to use messages to agree on system state
• This issue does not exist in a centralized system

21
CAP Theorem
• Originally proposed by Eric Brewer (Berkeley)
1. Consistency – changes appear to everyone in the same sequential order
2. Availability – can get a result at any time
3. Partition Tolerance – system continues to work even when one part of the network can't communicate with the other
• Impossible to achieve all 3 at the same time (pick two)
22
CAP Theorem Example
• What do we do if a network partition occurs?
• Prefer Availability: Allow the state at some nodes to disagree with the state at other nodes (AP)
• Prefer Consistency: Reject requests until the partition is resolved (CP)
[Figure: nodes split into Partition A and Partition B, which cannot communicate with each other]
23
Consistency Preferred

• Block writes until all nodes able to agree (see the sketch below)
• Consistent: Reads never return wrong values
• Not Available: Writes block until partition is resolved and unanimous approval is possible

24
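A minimal sketch of this CP choice, assuming a hypothetical send_to_replica transport that raises ConnectionError for unreachable replicas: the write is rejected unless every replica acknowledges, so consistency is preserved at the cost of availability during a partition.

```python
class NotAvailable(Exception):
    pass

def write(key, value, replicas, send_to_replica):
    # Try to push the write to every replica.
    acks = 0
    for r in replicas:
        try:
            send_to_replica(r, ("write", key, value))
            acks += 1
        except ConnectionError:
            pass                # replica unreachable (partitioned or down)
    if acks < len(replicas):
        # CP choice: reject the write instead of letting state diverge.
        raise NotAvailable("partition suspected; write rejected")
    return "committed"
```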
What about AP Systems?
• Partition occurs, but both groups of nodes
continue to accept requests
• Consequence: State might diverge between
the two groups (e.g., different updates are
executed)
• When communication is restored, there
needs to be an explicit recovery process
• Resolve conflicting updates so everyone agrees
on system state once again
25
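As one concrete (and deliberately simple) example of such a recovery process, the sketch below uses a last-writer-wins merge over timestamped values. This is only one possible reconciliation policy, not the one any particular system uses; real systems may instead keep both versions and ask the application to merge them.

```python
def lww_merge(store_a, store_b):
    """Merge two replicas of {key: (timestamp, value)} after a partition.

    Last-writer-wins: for each key, the higher-timestamped value survives.
    """
    merged = dict(store_a)
    for key, (ts, val) in store_b.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, val)
    return merged

a = {"x": (10, "set in partition A")}
b = {"x": (12, "set in partition B"), "y": (5, "only in B")}
print(lww_merge(a, b))   # x comes from B (newer write), plus y
```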
General’s Paradox
• Two generals located on opposite sides of
their enemy’s position
• Can only communicate via messengers
• Messengers go through enemy territory:
might be captured

• Problem: Need to coordinate time of attack
• Two generals lose unless they attack at same time
• If they attack at same time, they win
26
General’s Paradox
• Can messages over an unreliable network
be used to guarantee two entities do
something simultaneously?
• No, even if all messages go through
[Figure: message exchange between General 1 and General 2: "11 am OK?", "Yes, 11 works", "So, 11 it is?", "Yeah, but what if you don't get this ack?" – no finite chain of acknowledgments lets both generals be certain the other will attack.]

27
Two-Phase Commit

• We can't solve the General's Paradox
• No simultaneous action
• But we can solve a related problem
• Distributed Transaction: Two (or more) machines agree to do something or not do it atomically
• Extra tool: Persistent log
• If machine fails, it will remember what happened
28
Two-Phase Commit: Setup
• One machine (coordinator) initiates the
protocol
• It asks every machine to vote on transaction
• Two possible votes:
• Commit
• Abort
• Commit transaction only if unanimous
29
Two-Phase Commit: Preparing
Agree to Commit
• Machine has guaranteed that it will accept
transaction
• Must be recorded in log so machine will remember
this decision if it fails and restarts
Agree to Abort
• Machine has guaranteed that it will never accept
this transaction
• Must be recorded in log so machine will remember
this decision if it fails and restarts
30
Two-Phase Commit: Finishing

Commit Transaction
• Coordinator learns all machines have agreed
to commit
• Apply transaction, inform voters
• Record decision in local log
Abort Transaction
• Coordinator learns at least one machine has voted to abort
• Do not apply transaction, inform voters
• Record decision in local log
31
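The sketch below condenses the decision logic from the last few slides into a single-process illustration. The workers and log objects are hypothetical stand-ins (each worker is assumed to log its vote durably before replying); this is not a full 2PC implementation with timeouts or crash recovery.

```python
def two_phase_commit(txn, workers, log):
    # Phase 1: collect votes.
    votes = []
    for w in workers:
        vote = w.vote(txn)          # worker logs its vote before replying
        votes.append(vote)
        if vote == "abort":
            break                   # one abort vote is enough to abort

    # Decision: commit only if every worker voted commit.
    unanimous = len(votes) == len(workers) and all(v == "commit" for v in votes)
    decision = "commit" if unanimous else "abort"
    log.append((txn, decision))     # coordinator records the outcome first

    # Phase 2: tell everyone the outcome.
    for w in workers:
        w.finish(txn, decision)
    return decision
```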
Example: Failure-Free 2PC

[Figure: timeline of a failure-free commit. The coordinator sends VOTE-REQ to workers 1-3, each worker replies VOTE-COMMIT, and the coordinator then sends GLOBAL-COMMIT to all workers.]
36
Example: Failure-Free 2PC

[Figure: timeline of a failure-free abort. The coordinator sends VOTE-REQ; one worker replies VOTE-ABORT while the others reply VOTE-COMMIT, so the coordinator sends GLOBAL-ABORT to all workers.]
37
Example of Worker Failure
[Figure: coordinator state machine INIT → WAIT → ABORT/COMMIT. The coordinator sends VOTE-REQ; some workers reply VOTE-COMMIT, but one worker never responds. The coordinator times out in the WAIT state and sends GLOBAL-ABORT.]
38
Example of Coordinator Failure
[Figure: worker state machine INIT → READY → ABORT/COMMIT. The coordinator sends VOTE-REQ and then crashes; a worker replies VOTE-COMMIT and blocks in READY waiting for the coordinator. Once the coordinator is restarted, it sends GLOBAL-ABORT.]
40
Paxos: fault tolerant agreement
• Paxos lets all nodes agree on the same value despite node failures, network failures, and delays
• High-level process:
• One (or more) nodes decide to be the leader
• Leader proposes a value and solicits acceptance from the others
• Leader announces the result or tries again (see the sketch below)

45
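To make the propose-and-accept step more concrete, here is a heavily simplified, single-process sketch of one round of single-decree Paxos. All class and variable names are illustrative; a real implementation must also handle message loss, retries with higher ballot numbers, and letting every node learn the chosen value.

```python
class Acceptor:
    def __init__(self):
        self.promised_n = -1        # highest ballot number promised
        self.accepted = None        # (ballot, value) accepted so far

    def prepare(self, n):
        # Phase 1b: promise not to accept ballots below n.
        if n > self.promised_n:
            self.promised_n = n
            return ("promise", self.accepted)
        return ("nack", None)

    def accept(self, n, value):
        # Phase 2b: accept if no higher ballot has been promised.
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted = (n, value)
            return "accepted"
        return "nack"

def propose(acceptors, n, value):
    # Phase 1a: ask everyone to promise ballot n; need a majority.
    promises = [a.prepare(n) for a in acceptors]
    granted = [p for p in promises if p[0] == "promise"]
    if len(granted) <= len(acceptors) // 2:
        return None                                 # retry with a higher n
    # If any acceptor already accepted a value, we must propose that one.
    prior = [p[1] for p in granted if p[1] is not None]
    if prior:
        value = max(prior, key=lambda av: av[0])[1]
    # Phase 2a: ask acceptors to accept (n, value); majority => chosen.
    acks = [a.accept(n, value) for a in acceptors]
    if acks.count("accepted") > len(acceptors) // 2:
        return value
    return None

acceptors = [Acceptor() for _ in range(3)]
print(propose(acceptors, n=1, value="commit txn 42"))
```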
Google Spanner
James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, J. J. Furman,
Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian
Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle,
Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth
Wang, and Dale Woodford. 2012. Spanner: Google's globally-distributed database. In Proceedings of
the 10th USENIX conference on Operating Systems Design and Implementation (OSDI'12). USENIX
Association, Berkeley, CA, USA, 251-264.

46
Basic Spanner Operation
• Data replicated across datacenters
• Paxos groups support transactions
• On commit:
• Grab Paxos lock
• Paxos algorithm decides consensus
• If all agree, the transaction is committed

47
Spanner Operation

[Figure: a two-phase commit coordinated across two Paxos groups; each 2PC participant is itself a Paxos-replicated group.]
48
Base operation great for writes…
• What about reads?
• Reads are dominant operations
• e.g., FB’s TAO had 500 reads : 1 write [ATC 2013]
• e.g., Google Ads (F1) on Spanner from 1? DC in
24h:
21.5B reads
31.2M single-shard transactions
32.1M multi-shard transactions
• Want efficient read transactions

49
Make Read-Only Txns Efficient
• Ideal: Read-only transactions that are non-
blocking
• Arrive at shard, read data, send data back

• Goal 1: Lock-free read-only transactions

• Goal 2: Non-blocking stale read-only txns

50
TrueTime

• "Global wall-clock time" with bounded uncertainty
• ε is the worst-case clock divergence
• Timestamps become intervals, not single values
[Figure: TT.now() returns an interval [earliest, latest] on the time axis, of width 2ε]
Consider an event e_now which invoked tt = TT.now():
Guarantee: tt.earliest <= t_abs(e_now) <= tt.latest
51
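Below is a toy model of the interval-valued clock API the slide describes, assuming a fixed worst-case divergence EPSILON. Real TrueTime derives its uncertainty from GPS and atomic-clock time masters; this sketch only shows the shape of the interface.

```python
import time
from collections import namedtuple

EPSILON = 0.004                 # assumed 4 ms worst-case clock divergence
TTInterval = namedtuple("TTInterval", ["earliest", "latest"])

def tt_now():
    t = time.time()             # this machine's (possibly skewed) clock
    return TTInterval(t - EPSILON, t + EPSILON)

tt = tt_now()
# Property being modeled: tt.earliest <= absolute time <= tt.latest
print(tt.latest - tt.earliest)  # interval width is 2 * EPSILON
```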
TrueTime for Read-Only Txns

• Assign all transactions a wall-clock commit time (s)
• All replicas of all shards track how up-to-date they are with t_safe: all transactions with s < t_safe have committed on this machine
• Goal 1: Lock-free read-only transactions
• Current time ≤ TT.now().latest
• s_read = TT.now().latest
• Wait until s_read < t_safe
• Read data as of s_read
• Goal 2: Non-blocking stale read-only txns
• Similar to the above, except explicitly choose a time in the past
• (Trades away consistency for better performance, e.g., lower latency)
52
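A minimal sketch of the Goal 1 recipe above, under an assumed clock uncertainty EPSILON; the replica object (with a t_safe attribute and a read_at snapshot-read method) is a made-up stand-in, not Spanner's actual interface.

```python
import time

EPSILON = 0.004                          # assumed clock uncertainty (s)

def tt_now_latest():
    return time.time() + EPSILON         # upper edge of the TrueTime interval

def read_only_txn(replica, keys):
    s_read = tt_now_latest()             # no locks taken anywhere
    while replica.t_safe <= s_read:      # wait until the replica catches up
        time.sleep(0.001)
    # Every transaction with commit timestamp < s_read has already been
    # applied on this replica, so the snapshot read is consistent.
    return {k: replica.read_at(k, s_read) for k in keys}
```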
Timestamps and TrueTime

[Figure: timeline of a read-write transaction. Locks are acquired, the leader picks a commit timestamp s > TT.now().latest, then performs a commit wait until TT.now().earliest > s before releasing the locks; the figure marks an average of ε on either side of s.]
• Key: Need to ensure that all future transactions will get a higher timestamp
• Commit wait ensures this

53
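The sketch below shows the commit-wait rule from the figure, again with an assumed EPSILON and hypothetical apply_writes/release_locks callbacks: pick s above TT.now().latest, then hold the locks until TT.now().earliest has passed s, so no correct clock can still show a time earlier than s once the commit is visible.

```python
import time

EPSILON = 0.004                          # assumed clock uncertainty (s)

def commit_with_wait(apply_writes, release_locks):
    s = time.time() + EPSILON + 1e-6     # commit timestamp s > TT.now().latest
    apply_writes(s)                      # record the transaction at time s
    while time.time() - EPSILON <= s:    # commit wait: TT.now().earliest <= s
        time.sleep(0.001)
    release_locks()                      # s is now in the past everywhere
    return s
```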
Commit wait
• What does this mean for performance?
• Larger TrueTime uncertainty bound → longer commit wait
• Longer commit wait → locks held longer → can't process conflicting transactions → lower throughput
• i.e., if time is less certain, Spanner is slower!

54
