Lecture 06
Anatomy of Cloud
HARNESSING BIG DATA
Old Model: Few companies are generating data, all others are consuming data
New Model: all of us are generating data, and all of us are consuming data
WHAT’S DRIVING BIG DATA
[Figure: data technologies arranged along two axes, Scale and Speed]
BI Reporting: OLAP & data warehouse (Business Objects, SAS, Informatica, Cognos, other SQL reporting tools)
Interactive Business Intelligence & in-memory RDBMS (QlikView, Tableau, HANA)
Big Data, Real Time & Single View: graph databases
Big Data, Batch Processing & Distributed Data Store (Hadoop/Spark; HBase/Cassandra)
Virtualized Stack
[Figure labels: Hardware, Hypervisor, Operating System, and multiple guest OS instances]
EVERYTHING AS A SERVICE
Microsoft Azure
GoGrid
AppNexus
Lecture 6 – Part 3
Web applications
Client/server paradigm
Request/response messaging pattern
Interactive communication
Processing pipelines
Examples: Indexing, data mining, image processing,
video transcoding, document processing
Batch processing systems
Example: report generation, fraud detection, analytics,
backups, automated testing
MANY STYLES OF SYSTEM
Near the edge of the application, the focus is on vast numbers
of clients and rapid response
Clients talk to the application using Web browsers or the
Web services standards
But this only gets us to the outer “skin” of
the data center, not the interior
Consider Amazon: it can host entire
company web sites (like Netflix.com), data
(S3), servers (EC2), databases (RDS) and
even virtual desktops!
BIG PICTURE OVERVIEW
[Figure: web servers and application servers in front of a data store]
LOAD BALANCER
[Figure: a load balancer in front of web servers, application servers, and a data store]
SCALING: STATELESS, CACHING, AND SHARDING
STATELESS SERVERS ARE EASIEST TO SCALE
A stateless server views each client request as an independent
transaction and responds to it
Advantages:
Simpler and easier to scale: the server does not maintain state
More robust: tolerating instance failures does not require the
overhead of restoring state
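A minimal sketch of the stateless idea, under stated assumptions: the `Request` type, the `handle` function, and the shopping-cart data are hypothetical and not from the lecture. The handler keeps nothing between calls; everything it needs arrives in the request or lives in a shared store, so any instance can serve any request.

```python
from dataclasses import dataclass

# Hypothetical shared backing store standing in for the data tier.
# Server instances themselves hold no per-client state.
DATA_STORE = {"alice": 3}


@dataclass(frozen=True)
class Request:
    user: str
    add_items: int


def handle(request: Request, store: dict) -> dict:
    """Serve one request using only the request and the shared store."""
    total = store.get(request.user, 0) + request.add_items
    store[request.user] = total
    return {"user": request.user, "cart_size": total}


# Two calls stand in for two different server instances; because neither
# keeps local state, either one could have served either request.
first = handle(Request("alice", 2), DATA_STORE)
second = handle(Request("alice", 1), DATA_STORE)
```

If an instance fails, a replacement can start serving immediately because there is no local state to restore.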
CACHING
[Figure: load balancer, web servers, application servers, a caching layer, and the data store]
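One common way to place a cache in front of a data store is the cache-aside pattern. The sketch below is an illustration only; the slides do not say which caching policy is meant, and the `CachedStore` class and its counters are hypothetical names.

```python
class CachedStore:
    """Cache-aside sketch: check the cache first, fall back to the store."""

    def __init__(self, backing: dict):
        self.backing = backing  # stands in for the (slow) data store
        self.cache = {}         # stands in for the caching layer
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        value = self.backing.get(key)
        if value is not None:
            self.cache[key] = value  # populate the cache on a miss
        return value

    def put(self, key, value):
        self.backing[key] = value
        self.cache.pop(key, None)  # invalidate so readers do not see old data


store = CachedStore({"a": 1})
first_read = store.get("a")   # miss: goes to the data store
second_read = store.get("a")  # hit: served from the cache
store.put("a", 2)
third_read = store.get("a")   # miss again after invalidation
```

Invalidating on write is one design choice among several; updating the cache in place or using expiry times are alternatives with different staleness trade-offs.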
[Figure: the data store replicated as a master and slaves]
STATEFUL SERVERS REQUIRE ATTENTION
Cons:
Master becomes the write bottleneck
Master is a single point of failure
As load increases, cost of replication increases
Slaves may fall behind and serve stale data
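The stale-read problem can be illustrated with a toy model of asynchronous replication. This is a sketch under assumptions: the `Master` and `Slave` classes and the explicit `sync` step are hypothetical simplifications, not a description of any particular database.

```python
class Master:
    """Accepts all writes and records them in an ordered log."""

    def __init__(self):
        self.data = {}
        self.log = []  # ordered (key, value) writes, shipped to slaves later

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))


class Slave:
    """Serves reads from a local copy that lags behind the master."""

    def __init__(self, master: Master):
        self.master = master
        self.data = {}
        self.applied = 0  # number of log entries applied so far

    def read(self, key):
        return self.data.get(key)

    def sync(self):
        # Apply every log entry not yet seen, in order.
        for key, value in self.master.log[self.applied:]:
            self.data[key] = value
        self.applied = len(self.master.log)


master = Master()
slave = Slave(master)
master.write("x", 1)
slave.sync()
master.write("x", 2)          # the slave has not replicated this write yet
stale_value = slave.read("x")
slave.sync()
fresh_value = slave.read("x")
```

Every write goes through the single master, which is why it is both the write bottleneck and a single point of failure.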
SHARDING
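A minimal sketch of hash-based sharding, one common way to split a data store across servers. The slides do not specify a partitioning scheme, so the `shard_for` function and the `ShardedStore` class are assumptions made for illustration.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Each shard holds a disjoint subset of the keys."""

    def __init__(self, num_shards: int):
        self.shards = [{} for _ in range(num_shards)]

    def put(self, key: str, value):
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key: str):
        return self.shards[shard_for(key, len(self.shards))].get(key)


store = ShardedStore(num_shards=4)
for name in ["bob", "alice", "carol", "dave"]:
    store.put(name, name.upper())
```

With plain modulo hashing, changing the number of shards remaps most keys; schemes such as consistent hashing reduce that movement.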
the job?
Glimpse of an answer
When you run a search on Bing, the query is processed in
parallel, sometimes by thousands of servers working on your
request in real time!
Parallel actions must focus on the critical path
WHAT DOES “CRITICAL PATH” MEAN?
[Figure: a request goes to a service instance, which returns a response]
Response delay seen by the end user would include Internet latencies
Service response delay
PARALLEL SPEEDUP
In this example of a parallel read-only request, the critical path
centers on the middle “subservice”
[Figure: parallel subservices, with the critical path and the response marked]
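The effect of the critical path on response delay can be sketched with a small model. The subservice names and the delays in milliseconds are made-up numbers for illustration, and the model ignores fan-out and merge overheads.

```python
def sequential_delay(delays_ms: dict) -> int:
    """Delay if the subservices are called one after another."""
    return sum(delays_ms.values())


def parallel_delay(delays_ms: dict) -> int:
    """Delay if all subservices run in parallel: the slowest one dominates."""
    return max(delays_ms.values())


def critical_path(delays_ms: dict) -> str:
    """Name of the subservice that determines the parallel delay."""
    return max(delays_ms, key=delays_ms.get)


# Hypothetical delays for three subservices of one read-only request.
delays = {"left": 20, "middle": 50, "right": 30}

slowest = critical_path(delays)
speedup = sequential_delay(delays) / parallel_delay(delays)
```

In this model, speeding up "left" or "right" does not change the response delay at all; only work on the critical path does.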
WITH REPLICAS WE JUST LOAD BALANCE
[Figure: requests are load balanced across service replicas]
Response delay seen by the end user would include Internet latencies
Service response delay
WHAT IF A REQUEST TRIGGERS UPDATES?
What if the leader replies to the end user but then crashes and it
turns out that the updates were lost in the network?
Data center networks can be surprisingly lossy at times
Also, bursts of updates can queue up
Operating system
Fixed-size blocks
- read
- write
Web service
Key/value store
- read, write
- delete
(bob, [email protected])
(gettysburg, "Four score and seven years ago...")
(29ck2dxa1, 0128ckso1$9#*!!8349e)
(windows, )
Delete(key)
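The key/value interface above can be sketched as a tiny in-memory store. This is an illustration only: the method names `put` and `get` are assumed here (the slide text shows only read, write, and delete), and real stores add persistence, replication, and networking.

```python
from typing import Optional


class KeyValueStore:
    """In-memory sketch: values are opaque bytes looked up by key."""

    def __init__(self):
        self._items = {}

    def put(self, key: str, value: bytes) -> None:
        self._items[key] = value

    def get(self, key: str) -> Optional[bytes]:
        return self._items.get(key)

    def delete(self, key: str) -> bool:
        """Remove the key; report whether it was present."""
        return self._items.pop(key, None) is not None


kvs = KeyValueStore()
kvs.put("gettysburg", b"Four score and seven years ago...")
speech = kvs.get("gettysburg")
removed = kvs.delete("gettysburg")
after_delete = kvs.get("gettysburg")
```

The store never interprets the value, which is what lets the same interface hold text, ciphertext, or binary images.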
EXAMPLES OF KVS
Where have you seen this concept before?
Dynamo [SOSP’07]
Many services only store and retrieve data by primary key
Examples: user preferences, shopping cart, best seller lists
Don’t require the querying and management functionality of an RDBMS
Simple Storage Service (S3)
Need to store large objects that change infrequently
Examples: virtual machines, pictures
SPECIALIZED DATA STORES
Example: Google’s solutions
Bigtable [OSDI’06]
Distributed storage system for structured data
Data model is a sparse multi-dimensional sorted map indexed
by row and column keys and a timestamp
Each value in the map is opaque to the storage system
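The data model described above, a map from (row key, column key, timestamp) to an opaque value, can be sketched as follows. This is not Bigtable's API; the `SparseTable` class and its methods are hypothetical and only illustrate the shape of the map.

```python
from typing import Optional


class SparseTable:
    """Sketch of a sparse map: (row, column, timestamp) -> opaque bytes."""

    def __init__(self):
        # Only cells that were written exist, which keeps the map sparse.
        self._cells = {}  # (row, column) -> {timestamp: value}

    def put(self, row: str, column: str, timestamp: int, value: bytes) -> None:
        self._cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row: str, column: str,
            timestamp: Optional[int] = None) -> Optional[bytes]:
        """Return the newest version at or before `timestamp` (or the latest)."""
        versions = self._cells.get((row, column))
        if not versions:
            return None
        candidates = [t for t in versions if timestamp is None or t <= timestamp]
        if not candidates:
            return None
        return versions[max(candidates)]

    def rows(self) -> list:
        """Row keys in sorted order, mirroring the sorted-map property."""
        return sorted({row for row, _ in self._cells})


table = SparseTable()
table.put("com.example", "contents", 1, b"v1")
table.put("com.example", "contents", 2, b"v2")
table.put("com.acme", "contents", 1, b"home")
latest = table.get("com.example", "contents")
older = table.get("com.example", "contents", timestamp=1)
```

The table stores the values as bytes and never inspects them, matching the point that each value is opaque to the storage system.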
SPECIALIZED DATA STORES
Example: Facebook’s solutions
Haystack [OSDI’10]
Object store system optimized for photos
In 2010, over 260 billion images; 20 PB of data; 60 TB/week
Data written once, read often, never modified, rarely deleted
TAO [ATC’13]
A read-optimized graph data store to serve the social graph
Sustains 1 billion reads/s on a changing data set of many PBs
Explicitly favors availability over consistency
SPECIALIZED DATA STORES
Example: LinkedIn’s solutions
Kafka [NetDB’11]
A high-throughput distributed messaging system
Pub/sub architecture designed for aggregating log data
Messages are persisted on disk for durability and replicated for fault
tolerance; guarantees at-least-once delivery
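What at-least-once delivery means for a consumer can be sketched with an in-memory log. This is not the Kafka client API; the `Log` and `Consumer` classes and the `crash_before_commit_at` parameter are hypothetical, used only to simulate a crash between processing a message and recording progress.

```python
class Log:
    """Append-only message log, standing in for a persisted topic."""

    def __init__(self):
        self.messages = []

    def append(self, message):
        self.messages.append(message)


class Consumer:
    """Commits its offset only after a message has been processed."""

    def __init__(self, log: Log):
        self.log = log
        self.committed = 0  # offset of the next message to process on restart

    def poll(self, handler, crash_before_commit_at=None):
        offset = self.committed
        while offset < len(self.log.messages):
            handler(self.log.messages[offset])
            if offset == crash_before_commit_at:
                return  # simulated crash: processed but never committed
            offset += 1
            self.committed = offset


log = Log()
for message in ["a", "b", "c"]:
    log.append(message)

seen = []
consumer = Consumer(log)
consumer.poll(seen.append, crash_before_commit_at=1)  # crashes after handling "b"
consumer.poll(seen.append)                            # restart from last commit
```

No message is lost, but "b" is handled twice, so handlers in such a system generally need to tolerate duplicates.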
Voldemort
A distributed key-value store supporting only get/put/delete
Inspired by Amazon’s Dynamo: tunable consistency, highly available
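Tunable consistency in Dynamo-style stores is usually expressed with three numbers: N replicas, R replicas contacted per read, and W replicas contacted per write. When R + W > N, every read set overlaps every write set. The sketch below is a heavily simplified illustration with a hypothetical `QuorumStore` class; it omits vector clocks, failures, and repair, and it deliberately picks the worst-case read set.

```python
class QuorumStore:
    """Toy N/R/W model: writes reach W replicas, reads consult R replicas."""

    def __init__(self, n: int, r: int, w: int):
        self.n, self.r, self.w = n, r, w
        self.replicas = [{} for _ in range(n)]
        self.version = 0  # global counter standing in for real versioning

    def read_sees_latest_write(self) -> bool:
        """True when read and write sets must share at least one replica."""
        return self.r + self.w > self.n

    def put(self, key, value):
        self.version += 1
        for replica in self.replicas[:self.w]:  # first W replicas get the write
            replica[key] = (self.version, value)

    def get(self, key):
        # Read the last R replicas: the worst case for overlapping the writers.
        answers = [rep[key] for rep in self.replicas[self.n - self.r:] if key in rep]
        if not answers:
            return None
        return max(answers, key=lambda pair: pair[0])[1]  # newest version wins


strict = QuorumStore(n=3, r=2, w=2)
strict.put("cart", "book")
relaxed = QuorumStore(n=3, r=1, w=1)
relaxed.put("cart", "book")
```

Lower R and W reduce latency and keep the store available with fewer live replicas, at the cost of reads that can miss recent writes.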