CS2510 00 Distributed Storage Overview
Federated Storage
How to store things… in… many places... (maybe)
CS2510
Presented by: wilkie
[email protected]
University of Pittsburgh
Recommended Reading (or Skimming)
• NFS: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.473
• WAFL: https://dl.acm.org/citation.cfm?id=1267093
• Hierarchical File Systems are Dead (Margo Seltzer, 2009): https://www.eecs.harvard.edu/margo/papers/hotos09/paper.pdf
• Chord (Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, Hari Balakrishnan, 2001): https://pdos.csail.mit.edu/papers/chord:sigcomm01/chord_sigcomm.pdf
• Kademlia (Petar Maymounkov, David Mazières, 2002): https://pdos.csail.mit.edu/~petar/papers/maymounkov-kademlia-lncs.pdf
• BitTorrent Overview: http://web.cs.ucla.edu/classes/cs217/05BitTorrent.pdf
• IPFS (Juan Benet, 2014): https://ipfs.io/ipfs/QmR7GSQM93Cx5eAg6a6yRzNde1FQv7uL6X1o4k7zrJa3LX/ipfs.draft3.pdf (served via IPFS, neat)
Network File System
NFS: A Traditional and Classic Distributed File System
Problem
• Storage is cheap.
• YES. This is a problem in a classical sense.
• People are storing more stuff and want very strong storage guarantees.
• Networked (web) applications are global and people want strong availability and stable speed/performance (wherever in the world they are). Yikes!
[Image slide: "Most Reliable??"]
NFS System Model
• Each client connects directly to the server. Files could be duplicated on the client side.
[Diagram: several clients each connect directly to a single server.]
NFS Stateless Protocol
Set of common operations clients can issue: (where is open? close?)
[Diagram: the client issues lookup (which returns a file handle) and read/write requests (which return success); the same operations apply to a local file or a remote file, and there are no open/close round-trips.]
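To make the statelessness concrete, here is a minimal sketch (Python, with invented names; this is not the real NFS RPC interface) of a file server where every request carries an opaque handle and an explicit offset. The server never tracks which files a client has "open", so a retried request simply repeats the same work.

import hashlib, os

class StatelessFileServer:
    def __init__(self, root):
        self.root = root
        self.handles = {}                      # handle -> path (recoverable cache, not protocol state)

    def lookup(self, name):
        # Return an opaque, stable handle rather than a server-side fd.
        handle = hashlib.sha256(name.encode()).hexdigest()[:16]
        self.handles[handle] = os.path.join(self.root, name)
        return handle

    def read(self, handle, offset, count):
        with open(self.handles[handle], "rb") as f:
            f.seek(offset)
            return f.read(count)

    def write(self, handle, offset, data):
        # The file is assumed to already exist for this sketch.
        with open(self.handles[handle], "r+b") as f:
            f.seek(offset)
            f.write(data)
            os.fsync(f.fileno())               # commit before replying (slow! see the next slides)
        return len(data)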
Server-side Writes Are Slow
Problem: Writes are really slow…
(Did the server crash?? Should I try again?? Delay… delay… delay)
[Sequence diagram: the client sends lookup and receives an fd, then issues a write; one second passes… two seconds?… before the server finally replies success.]
Time relates to the amount of data we want to write… is there a good block size? 1 KiB? 4 KiB? 1 MiB? (Bigger blocks mean slower replies and harsher failures; smaller blocks mean faster replies but more messages.)
Server-side Write Cache?
Solution: Cache writes and commit them when we have time.
(Client gets a response much more quickly… but at what cost? There's always a trade-off)
[Sequence diagram: lookup returns an fd; write(fd, 0, 15) is acknowledged with success after about 400 milliseconds, because the server only adds the block to its write cache ("need to write this block at some point!") rather than committing it to disk first.]
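A rough sketch of the write-cache idea, with assumed names and an arbitrary 1-second flush interval: the server acknowledges a write as soon as the block is buffered in memory and commits it to disk in the background. The trade-off the slide hints at is visible in the code: if the server crashes before the flush, writes it already acknowledged are lost (real appliances, e.g. WAFL-style filers, use NVRAM to close this gap).

import threading, time

class CachingWriteServer:
    def __init__(self, backing_file):
        self.backing_file = backing_file       # assumed to already exist
        self.cache = {}                        # offset -> bytes not yet on disk
        self.lock = threading.Lock()
        threading.Thread(target=self._flusher, daemon=True).start()

    def write(self, offset, data):
        with self.lock:
            self.cache[offset] = data          # "need to write this block at some point!"
        return "success"                       # fast reply; the data is NOT yet durable

    def _flusher(self):
        while True:
            time.sleep(1)                      # commit "when we have time"
            with self.lock:
                pending, self.cache = self.cache, {}
            with open(self.backing_file, "r+b") as f:
                for offset, data in sorted(pending.items()):
                    f.seek(offset)
                    f.write(data)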
Fault Tolerance
• So, we can allow failure, but only if we know whether an operation succeeded. (We are assuming strong eventual consistency.)
• In this case, writes… but those are really slow. Hmm.
• Hey! We’ve seen this all before…
• This is all fault tolerance basics.
• But this is our chance to see it in practice.
• [a basic conforming implementation of] NFS makes a trade-off.
It gives you distributed data that is reliably stored at the cost of
slow writes.
• When distributing a file, the receiver can verify it got the correct data by simply hashing what it received.
• Since our hash function is deterministic, the hash will be the same.
• If it isn't, our file is corrupted.
[Figure: vacation_video.mov split into chunks A B C D E F G H, each labeled by its hash.]
Distribution (Detecting Failure)
• The client requests chunks by the hashes it was given, but the chunks it receives hash to:
[Figure: vacation_video.mov; the received chunks hash to A B C D F G H; the chunk that should hash to E does not match, so the client detects the corruption and can re-request just that chunk.]
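A small sketch of that check, assuming SHA-256 and that the letters on the slides stand for the expected chunk hashes:

import hashlib

def detect_bad_chunks(expected_hashes, received_chunks):
    # Re-hash what we actually received and compare with what we were promised.
    bad = []
    for i, chunk in enumerate(received_chunks):
        if chunk is None or hashlib.sha256(chunk).hexdigest() != expected_hashes[i]:
            bad.append(i)                      # corrupt or missing: re-request just these
    return bad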
Merkle Tree/DAG
• We can organize a file such that it can be referred to by a single hash, but also be divided up into more easily shared chunks.
• The hash of each node is the hash of the hashes it points to.
N0 = hash(A + B)   N1 = hash(C + D)   N2 = hash(E + F)   N3 = hash(G + H)
N4 = hash(N0 + N1)   N5 = hash(N2 + N3)
N6 = hash(N4 + N5)
[Figure: Merkle tree over the chunks A–H of vacation_video.mov; leaves N0–N3, internal nodes N4 and N5, root N6.]
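A minimal sketch of this construction, assuming SHA-256, simple concatenation for combining children, and a power-of-two number of chunks (as in the eight-chunk example above):

import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(chunks):
    # First level hashes pairs of raw chunks (N0 = hash(A + B), ...),
    # then each higher level hashes pairs of child hashes (N4 = hash(N0 + N1), ...).
    level = [h(chunks[i] + chunks[i + 1]) for i in range(0, len(chunks), 2)]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

chunks = [bytes([c]) * 1024 for c in b"ABCDEFGH"]   # toy 1 KiB chunks
root = merkle_root(chunks)                          # plays the role of N6: one hash names the whole file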
Merkle-based Deduplication
• Updating a chunk ripples: replacing chunk E with R produces a new leaf-level node N7, a new internal node N8, and a new root N9.
• But it leaves intact parts alone! N0, N1, N3, and N4 (and their chunks) are unchanged.
[Figure: the trees for vacation_video.mov (v1, chunks A B C D E F G H) and the updated version (chunks A B C D R F G H) side by side; unchanged nodes such as N1 carry the same hash in both versions (01774f1d8f6621ccd7a7a845525e4157 on the slide), so they only need to be stored once.]
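A short, self-contained demo of the deduplication claim, under the same assumptions as the previous sketch: after replacing chunk E with R, only the nodes on the path to the root change, so two versions of the file share everything else.

import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_nodes(chunks):
    # Return every node hash in the tree (leaf pairs first, root last).
    nodes = []
    level = [h(chunks[i] + chunks[i + 1]) for i in range(0, len(chunks), 2)]
    nodes.extend(level)                                   # N0..N3
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        nodes.extend(level)                               # N4, N5, then the root
    return nodes

v1 = [bytes([c]) * 1024 for c in b"ABCDEFGH"]
v2 = list(v1)
v2[4] = b"R" * 1024                                       # replace chunk E with R

shared = set(merkle_nodes(v1)) & set(merkle_nodes(v2))
print(len(shared))   # 4 of the 7 nodes (N0, N1, N3, N4) are identical and need not be stored twice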
Distribution
[Diagram: peers downloading vacation_video.mov learn from a client/server "tracker" which subtrees (e.g. {N4, N5}) other peers hold; possibly they also gossip about D to other nodes downloading this file.]
BitTorrent Block Sharing
• Files are divided into chunks (blocks) and
traded among the different peers.
• There are many possible strategies for deciding which blocks to request and share. Most are VERY interesting and some are slightly counter-intuitive (hence interesting!)
Distributed Hash Tables (DHT)
• A distributed system devoted to decentralized key/value storage
across a (presumably large or global) network.
• Many find a way to relate the key to the location of the server that
holds the value.
• Two "neighbors" in the ID space (e.g. nodes 00110 and 00111) may be entirely across the planet, or right next door!
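A tiny sketch of that key-to-node idea (not any particular system's API; the names are invented): hash the key into the same ID space as the node IDs and pick the node with the smallest XOR distance, as Kademlia does. Other DHTs, e.g. Chord, use distance around a ring instead.

import hashlib

def node_id(name: str, bits: int = 16) -> int:
    return int.from_bytes(hashlib.sha256(name.encode()).digest(), "big") % (1 << bits)

def responsible_node(key: str, node_ids):
    k = node_id(key)
    return min(node_ids, key=lambda n: n ^ k)    # smallest XOR distance wins

nodes = [node_id(f"node-{i}") for i in range(8)]
print(responsible_node("vacation_video.mov", nodes))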
Kademlia Network Topology
• Each node knows about nodes at successively larger distances from itself.
• Recall that XOR is the distance, so the largest distance occurs when the most significant bit (MSB) differs.
• It maintains buckets of nodes whose IDs share a prefix of k bits with its own ID (matching MSBs).
• There are a certain number of entries in each bucket. (The table below is not exhaustive.)
• The number of entries relates to the replication amount.
• The overall network is a trie.
• The buckets are subtrees of that trie.
Routing table k-buckets for node 00110:
• 0-bit: 10001, 10100, 10110, 11001
• 1-bit: 01001, 01100, 01010, 01001
• 2-bit: 00011, 00010, 00001, 00000
• 3-bit: 00100, 00101
• 4-bit: 00111
Note: the 0-bit list covers half of the overall network!
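A small sketch of the bucket rule for the 5-bit IDs on this slide: XOR the two IDs and count the matching leading bits; that count is the bucket index. Running it reproduces the table above.

BITS = 5

def bucket_index(my_id: int, other_id: int) -> int:
    distance = my_id ^ other_id
    return BITS - distance.bit_length()          # length of the shared prefix, in bits

me = 0b00110
for other in (0b10001, 0b01001, 0b00011, 0b00100, 0b00111):
    print(f"{other:05b} -> {bucket_index(me, other)}-bit bucket")
# 10001 -> 0, 01001 -> 1, 00011 -> 2, 00100 -> 3, 00111 -> 4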
Kademlia Routing (bucket visualization)
[Figure: the binary trie of IDs from node 00110's point of view; the subtree that branches off at the first bit is the 0-bit bucket ("far away"), and each deeper branch (1-bit, 2-bit, 3-bit, ...) is a smaller, "closer" subtree.]
Kademlia Routing Algorithm
• Ask the nodes we know that are "close" to k to tell us about nodes that are "close" to k.
• Repeat by asking those nodes which nodes are "close" to k, until we get a set that says "I know k!!"
• Because of our k-bucket scheme, each step looks at nodes that share an increasing number of bits with k.
• And because of our binary tree, each step essentially divides the search space in half.
• Search: O(log N) queries.
[Routing table k-buckets for node 00110, as on the previous slide; note again that the 0-bit list covers half of the overall network.]
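A simplified sketch of that loop (no parallel queries, timeouts, or value storage; using a plain dict network[node_id] of each node's known contacts as a stand-in for FIND_NODE RPCs is an assumption for illustration):

def lookup(k, start_id, network, alpha=3):
    # Keep a short list of the closest nodes seen so far (by XOR distance to k)
    # and repeatedly ask the closest not-yet-queried one for the nodes it knows.
    shortlist = sorted(network[start_id], key=lambda n: n ^ k)[:alpha]
    queried = {start_id}
    while True:
        candidates = [n for n in shortlist if n not in queried]
        if not candidates:
            return shortlist                     # the closest nodes we could find to k
        node = candidates[0]
        queried.add(node)
        closer = network[node]                   # "tell us about nodes close to k"
        shortlist = sorted(set(shortlist) | set(closer), key=lambda n: n ^ k)[:alpha]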
Kademlia Routing Algorithm
• Finding k = 00111 from node 00110:
  • Easy! The key starts with a prefix similar to our own ID.
  • The value is hopefully at our own node, at node 00111, or maybe at node 00100…
• Finding k = 11011 from node 00110:
  • Worst case! No matching prefix!
  • Ask several nodes with IDs starting with 1.
  • That bucket covers, at worst, half of our network… so we have to rely on the algorithm to narrow it down.
  • It hopefully returns nodes that start with 11 or better (which eliminates another half of our network from consideration).
[Routing table k-buckets for node 00110, as before.]
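To make the two cases concrete, the XOR distances work out as follows:
distance(00110, 00111) = 00110 XOR 00111 = 00001 = 1 (four shared prefix bits: very close)
distance(00110, 11011) = 00110 XOR 11011 = 11101 = 29 (no shared prefix bits: worst case)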