01 en Principles of Distributed Systems
Flynn's taxonomy: classify computer architectures by the number of instruction streams and the number of data streams
SISD — Single Instruction, Single Data stream
Traditional uniprocessor systems
SIMD — Single Instruction, Multiple Data streams
Array (vector) processors
Examples:
GPUs – Graphics Processing Units for computer graphics, GPGPU (General Purpose GPU): AMD/ATI, NVIDIA
AVX: Intel’s Advanced Vector Extensions
Memory
Shared memory systems: multiprocessors
No shared memory: networks of computers, multicomputers
Interconnect
Bus
Switch
Delay/bandwidth
Tightly coupled systems
Loosely coupled systems
MULTIPROCESSORS AND MULTICOMPUTERS
Multiprocessors
Shared memory
Shared clock
Shared operating system
All-or-nothing failure
Scale
Collaboration
Reduced latency
Mobility
High availability & Fault tolerance
Incremental cost
Delegated infrastructure & operations
SCALE – INCREASING PERFORMANCE
Computers are getting faster
Moore's Law prediction: performance doubles approximately every 18 months because of faster transistors and more transistors per chip
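To make the doubling claim concrete, here is a minimal sketch of the arithmetic, assuming a steady doubling every 18 months as stated above. The function name `growth_factor` and its parameters are illustrative, not from the slides.

```python
def growth_factor(years: float, doubling_period_years: float = 1.5) -> float:
    """Performance multiplier after `years`, assuming one doubling per period.

    The 1.5-year default is the 18-month figure quoted on the slide.
    """
    return 2.0 ** (years / doubling_period_years)


if __name__ == "__main__":
    # Ten doublings fit into 15 years, so the factor is 2**10.
    for years in (1.5, 3, 15):
        print(f"{years} years: x{growth_factor(years):,.0f}")
```

Exponential growth of this kind cannot continue indefinitely on one machine, which is the point of the next slide.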
SCALING A SINGLE SYSTEM HAS LIMITS
Google
Over 63,000 search queries per second on average
Over 130 trillion pages indexed
Uses hundreds of thousands of servers to do this
In 1999, it took Google one month to crawl and build an index of about 50 million pages
In 2012, the same task was accomplished in less than one minute.
16% to 20% of queries that get asked every day have never been asked before
Every query has to travel on average 1,500 miles to a data center and back to return the answer to the user
A single Google query uses 1,000 computers in 0.2 seconds to retrieve an answer
Facebook
Approximately 100M requests per second with 4B users
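The 1,500-mile figure puts a physical floor under latency. The sketch below estimates the propagation delay alone, assuming 1,500 miles is the total distance traveled and that signals in optical fiber move at roughly two thirds of the speed of light in vacuum. Both are assumptions made for illustration, and the function name is hypothetical.

```python
MILES_TO_KM = 1.609344
C_VACUUM_KM_PER_S = 299_792.458  # speed of light in vacuum


def propagation_delay_ms(distance_miles: float, fraction_of_c: float = 2 / 3) -> float:
    """Lower bound on one-way travel time over `distance_miles`, in milliseconds.

    `fraction_of_c` models the slower signal speed in fiber (assumed ~2/3 c).
    Queuing, routing, and processing delays are deliberately ignored.
    """
    distance_km = distance_miles * MILES_TO_KM
    speed_km_per_s = C_VACUUM_KM_PER_S * fraction_of_c
    return distance_km / speed_km_per_s * 1000.0


if __name__ == "__main__":
    print(f"fiber:  {propagation_delay_ms(1500):.1f} ms")
    print(f"vacuum: {propagation_delay_ms(1500, fraction_of_c=1.0):.1f} ms")
```

Under these assumptions, propagation accounts for only a small part of the 0.2 seconds quoted above. The rest is spent in the network and in the computers serving the query.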
COLLABORATION AND CONTENT
Fault tolerance
Identify & recover from component failures
Recoverability
Software can restart and function
May involve restoring state
INCREMENTAL COST
Offload responsibility
Let someone else manage systems
Use third-party services
Speed deployment
Don’t buy & configure your own systems
Don’t build your own data center
Modularize services on different systems
Dedicated systems for storage, email, etc.
Use cloud, network attached storage
Let someone else figure out how to expand storage and do backups
TRANSPARENCY
Location transparency
Users don’t care where resources are
Migration transparency
Resources move at will
Replication transparency
Users cannot tell whether there are copies of resources
Concurrency transparency
Users share resources transparently
Parallelism transparency
Operations take place in parallel without user’s knowledge
CORE CHALLENGES IN DISTRIBUTED SYSTEMS DESIGN
Concurrency
Latency
Partial Failure
Security
CONCURRENCY
Fail-stop
Failed component stops functioning
Halting = stop without notice
Detect failed components via timeouts
But you can’t count on timeouts in asynchronous networks
And what if the network isn’t reliable?
Sometimes we guess
Fail-restart
Component stops but then restarts
Danger: stale state
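Timeout-based detection can be sketched as a heartbeat table. This is a minimal illustration, not a production failure detector: the class name, the `clock` parameter, and the node identifiers are all hypothetical. The result is called "suspected" rather than "failed" because, as noted above, a timeout in an asynchronous network is only a guess.

```python
import time


class TimeoutFailureDetector:
    """Suspects a node if no heartbeat arrived within `timeout_s` seconds."""

    def __init__(self, timeout_s: float, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock      # injectable so the sketch can be tested
        self.last_seen = {}     # node id -> time of last heartbeat

    def heartbeat(self, node: str) -> None:
        """Record that a heartbeat from `node` just arrived."""
        self.last_seen[node] = self.clock()

    def suspected(self, node: str) -> bool:
        """True if `node` was never heard from or has been silent too long.

        A slow network or a slow node looks exactly like a dead one here.
        """
        last = self.last_seen.get(node)
        if last is None:
            return True
        return self.clock() - last > self.timeout_s
```

A node that restarts and resumes sending heartbeats stops being suspected, but nothing in this sketch tells its peers that its state may be stale. Handling that is a separate problem.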
NETWORK FAILURE TYPES
Omission
Failure to send or receive messages
Due to queue overflow in router, corrupted data, receive buffer overflow
Timing
Messages take longer than expected
We may assume a system is dead when it isn't
Partition
Network fragments into two or more sub-networks that cannot communicate with each other
NETWORK AND SYSTEM FAILURE TYPES
Fail-silent
A failed component (process or hardware) does not produce any output
Byzantine failures
Instead of stopping, a component produces faulty data
Due to bad hardware, software, network problems, or malicious interference
State
Information about some component that cannot be reconstructed
Network connection info, process memory, list of clients with open files, lists of which clients finished their tasks
Replicas
Redundant copies of data → used to address fault tolerance
Cache
Local storage of frequently-accessed data to reduce latency → used to address latency
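A cache can be sketched as a local dictionary in front of a slow fetch. The version below adds a time-to-live so that cached entries cannot stay stale forever. The TTL policy, the class name, and the `fetch` and `clock` parameters are assumptions made for illustration, not something the slides prescribe.

```python
import time


class TTLCache:
    """Keeps fetched values locally and reuses them until they expire."""

    def __init__(self, fetch, ttl_s: float, clock=time.monotonic):
        self._fetch = fetch     # slow function, e.g. a remote lookup
        self._ttl_s = ttl_s
        self._clock = clock     # injectable so the sketch can be tested
        self._entries = {}      # key -> (value, time stored)

    def get(self, key):
        """Return the cached value if still fresh, otherwise fetch and store it."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[1] < self._ttl_s:
            return entry[0]             # hit: no remote round trip
        value = self._fetch(key)        # miss or expired: pay the latency
        self._entries[key] = (value, now)
        return value
```

The trade-off is the usual one: a longer TTL means fewer slow fetches but a larger window in which the cached copy can disagree with the source.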
NO GLOBAL KNOWLEDGE
The environment
Public networks, remotely-managed services, 3rd party services
Some issues
Malicious interference, bad user input, impersonation of users & services
Protocol attacks, input validation attacks, time-based attacks, replay attacks
Rely on authentication, cryptography (hashes, encryption) … and good defensive programming!
Users also want convenience
Single sign-on, no repeated entering of login credentials
Controlled access to services
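As one illustration of using hashes defensively, the sketch below authenticates each message with an HMAC and rejects replays by remembering nonces. It assumes the sender and receiver already share a secret key. The names `make_request` and `Receiver` are hypothetical, and the scheme is a simplified example rather than a complete protocol.

```python
import hashlib
import hmac
import secrets

NONCE_LEN = 16  # fixed length so nonce + message splits unambiguously


def make_request(key: bytes, message: bytes):
    """Attach a fresh random nonce and an HMAC-SHA256 tag to `message`."""
    nonce = secrets.token_bytes(NONCE_LEN)
    tag = hmac.new(key, nonce + message, hashlib.sha256).digest()
    return nonce, message, tag


class Receiver:
    """Accepts a request only if its tag is valid and its nonce is new."""

    def __init__(self, key: bytes):
        self._key = key
        self._seen = set()  # grows without bound in this sketch

    def accept(self, nonce: bytes, message: bytes, tag: bytes) -> bool:
        if len(nonce) != NONCE_LEN:
            return False
        expected = hmac.new(self._key, nonce + message, hashlib.sha256).digest()
        if not hmac.compare_digest(expected, tag):
            return False    # forged or tampered message
        if nonce in self._seen:
            return False    # replay of an earlier valid message
        self._seen.add(nonce)
        return True
```

A real system would also bound the set of remembered nonces, for example by pairing them with timestamps, and would protect the message contents with encryption if confidentiality matters.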
KEY APPROACHES IN DISTRIBUTED SYSTEMS
Replication
For high availability, caching, and sharing data
Challenge: keep replicas consistent even if systems go down and come up
Quorum/consensus
Enable a group to reach agreement
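The idea behind a quorum can be sketched as a majority count. Any two majorities of the same group share at least one member, so two conflicting values cannot both win. The function names below are illustrative. This is only the counting rule, not a full consensus protocol, which must also cope with failures and message delays.

```python
from collections import Counter


def majority_quorum(group_size: int) -> int:
    """Smallest number of members that forms a majority of the group."""
    return group_size // 2 + 1


def decide(votes, group_size: int):
    """Return the value backed by a majority of the whole group, else None.

    `votes` holds the values received so far; members that are down or
    unreachable simply contribute nothing.
    """
    if not votes:
        return None
    value, count = Counter(votes).most_common(1)[0]
    return value if count >= majority_quorum(group_size) else None
```

With five members, three matching votes are enough to decide even if the other two are unreachable, while a two-to-two split with one member silent yields no decision.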
QUESTIONS?
NOW, BY E-MAIL, …