
Cloud Computing

Professor Mangal Sain


Lecture 6

Cloud Basics & Storage


Lecture 6 – Part 1

Anatomy of Cloud
HARNESSING BIG DATA

 OLTP: Online Transaction Processing (DBMSs)
 OLAP: Online Analytical Processing (Data Warehousing)
 RTAP: Real-Time Analytics Processing (Big Data Architecture & Technology)
THE MODEL HAS CHANGED…

 The Model of Generating/Consuming Data has Changed

Old Model: few companies are generating data, all others are consuming data

New Model: all of us are generating data, and all of us are consuming data
WHAT’S DRIVING BIG DATA

Big data analytics:
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More of a real-time nature

Traditional business intelligence:
- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets
THE EVOLUTION OF BUSINESS INTELLIGENCE

[Diagram: a timeline plotted along Speed and Scale axes]

1990’s – BI Reporting, OLAP & Data Warehouse:
  Business Objects, SAS, Informatica, Cognos & other SQL reporting tools

2000’s – Big Data (Scale) – Batch Processing & Distributed Data Stores:
  Hadoop/Spark; HBase/Cassandra

2010’s – Interactive Business Intelligence & In-memory RDBMS (Speed, Single View):
  QlikView, Tableau, HANA
  Big Data (Real Time): Graph Databases


BIG DATA ANALYTICS

 Big data is more real-time in nature than traditional DW applications
 Traditional DW architectures (e.g. Exadata, Teradata) are not well-suited for big data apps
 Shared-nothing, massively parallel processing, scale-out architectures are well-suited for big data apps
BIG DATA TECHNOLOGY
CLOUD COMPUTING

 IT resources provided as a service
 Compute, storage, databases, queues
 Clouds leverage economies of scale of commodity hardware
 Cheap storage, high-bandwidth networks & multicore processors
 Geographically distributed data centers
 Offerings from Microsoft, Amazon, Google, …

wikipedia:Cloud Computing
BENEFITS
 Cost Savings
 Security
 Flexibility
 Mobility
 Insight
 Increased Collaboration
 Quality Control
 Disaster Recovery
 Loss Prevention
 Automatic Software Updates
 Competitive Edge
 Sustainability
Lecture 6– Part 2

Anatomy of Cloud contd.


TYPES OF CLOUD COMPUTING

 Public Cloud: Computing infrastructure is hosted at the vendor’s premises.
 Private Cloud: Computing architecture is dedicated to the customer and is not shared with other organisations.
 Hybrid Cloud: Organisations host some critical, secure applications in private clouds; the less critical applications are hosted in the public cloud.
 Cloud bursting: the organisation uses its own infrastructure for normal usage, but the cloud is used for peak loads.
 Community Cloud
CLASSIFICATION OF CLOUD COMPUTING BASED ON
SERVICE PROVIDED
 Infrastructure as a Service (IaaS)
 Offering hardware-related services using the principles of cloud computing. These could include storage services (database or disk storage) or virtual servers.
 Examples: Amazon EC2, Amazon S3, Rackspace Cloud Servers and Flexiscale.
 Platform as a Service (PaaS)
 Offering a development platform on the cloud.
 Examples: Google’s App Engine, Microsoft’s Azure, Salesforce.com’s Force.com.
 Software as a Service (SaaS)
 A complete software offering on the cloud. Users can access a software application hosted by the cloud vendor on a pay-per-use basis. This is a well-established sector.
 Examples: Salesforce.com’s offering in the online Customer Relationship Management (CRM) space, Google’s Gmail, Microsoft’s Hotmail, and Google Docs.
INFRASTRUCTURE AS A SERVICE (IAAS)
PLATFORM AS A SERVICE
SOFTWARE AS A SERVICE (SAAS)
MORE REFINED CATEGORIZATION
 Storage-as-a-service
 Database-as-a-service
 Information-as-a-service
 Process-as-a-service
 Application-as-a-service
 Platform-as-a-service
 Integration-as-a-service
 Security-as-a-service
 Management/
Governance-as-a-service
 Testing-as-a-service
 Infrastructure-as-a-service

InfoWorld Cloud Computing Deep Dive


KEY INGREDIENTS IN CLOUD COMPUTING

 Service-Oriented Architecture (SOA)
 Utility Computing (on demand)
 Virtualization (P2P Network)
 SaaS (Software as a Service)
 PaaS (Platform as a Service)
 IaaS (Infrastructure as a Service)
 Web Services in the Cloud


ENABLING TECHNOLOGY: VIRTUALIZATION

Traditional stack: Apps → Operating System → Hardware

Virtualized stack: Apps → Guest OSes → Hypervisor → Hardware
EVERYTHING AS A SERVICE

 Utility computing = Infrastructure as a Service (IaaS)
 Why buy machines when you can rent cycles?
 Examples: Amazon’s EC2, Rackspace

 Platform as a Service (PaaS)
 Give me a nice API and take care of the maintenance, upgrades, …
 Example: Google App Engine

 Software as a Service (SaaS)
 Just run it for me!
 Examples: Gmail, Salesforce
CLOUD VERSUS CLOUD

 Amazon Elastic Compute Cloud


 Google App Engine

 Microsoft Azure

 GoGrid

 AppNexus
Lecture 6 – Part 3

Anatomy of Cloud and Cloud Storage


RECAP: CLOUD BENEFITS
 Elastic, just-in-time infrastructure
 More efficient resource utilization
 Pay for what you use
 Potential to reduce processing time
 Parallelization
 Leverage multiple data centers
 High availability, lower response times
TODAY’S CLOUD APPLICATIONS

 Web applications
 Client/server paradigm
 Request/response messaging pattern
 Interactive communication

 Processing pipelines
 Examples: Indexing, data mining, image processing,
video transcoding, document processing
 Batch processing systems
 Example: report generation, fraud detection, analytics,
backups, automated testing
MANY STYLES OF SYSTEM
 Near the edge of the application, the focus is on vast numbers of clients and rapid response

 Inside, we find data-intensive services that operate in a pipelined manner, asynchronously

 Deep inside the application we see a world of virtual computer clusters that are scheduled to share resources and on which applications like MapReduce (Hadoop) are very popular
HOW ARE CLOUD APPS STRUCTURED?

 Clients talk to the application using Web browsers or the Web services standards
 But this only gets us to the outer “skin” of the data center, not the interior
 Consider Amazon: it can host entire company web sites (like Netflix.com), data (S3), servers (EC2), databases (RDS) and even virtual desktops!
BIG PICTURE OVERVIEW

 Client requests are handled by front-end Web servers
 Application servers are invoked for dynamic Web content generation and run the app logic
 PHP, Java, Python, …
 Back-end databases manage and provide access to the data

[Diagram: Web servers → App servers → Caches → DBs, indexes and shards]
APPLICATIONS WITH MULTIPLE TIERS

Web servers

Application servers

Data store (or database)


REDUNDANCY AT EACH TIER

Web servers

Application servers

Data store
LOAD BALANCER

Load balancer

Web servers

Application servers

Data store
SCALING: STATELESS, CACHING, AND SHARDING
STATELESS SERVERS ARE EASIEST TO SCALE
 A stateless server views each client request as an independent transaction and responds to it
 Advantages:
 Simpler and easier to scale: does not maintain state
 More robust: tolerating instance failures does not require the overhead of restoring state

Stateless servers
CACHING
Load balancer

Web servers

Application servers

Caching

Data store
CACHING

 Caching is central to responsiveness
 The basic idea is to always use cached data if at all possible, so the inner services (data stores) are shielded from “online” load (a minimal cache-aside sketch follows this list)
 Caching is only temporary storage, hence it is stateless
 We can add multiple cache servers to spread the load
 Must think hard about patterns of data access
 Some data needs to be heavily replicated to offer very fast access on vast numbers of nodes
 In principle, the level of replication should match the level of load and the degree to which the data is needed
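
A minimal cache-aside sketch of this idea, in Python. The cache and database here are plain dictionaries standing in for a cache server (e.g., memcached) and a data store; the key format "user:<id>" is just an illustrative convention.

# Cache-aside: try the cache first, fall back to the data store on a miss,
# then populate the cache so later requests never reach the inner service.
cache = {}                                        # stands in for a cache server
database = {"user:42": {"name": "Alice"}}         # stands in for the data store

def get_user(user_id):
    key = f"user:{user_id}"
    value = cache.get(key)         # 1. check the cache
    if value is None:
        value = database.get(key)  # 2. miss: read from the data store
        if value is not None:
            cache[key] = value     # 3. populate the cache for later requests
    return value

print(get_user(42))   # first call reads the database; repeated calls hit the cache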
STATEFUL SERVERS REQUIRE ATTENTION

 Scaling a relational database is challenging
 The traditional approach is replication
 Data is written to a master server and then replicated to one or more slave servers (synchronously or asynchronously)
 Read operations can be handled by the slaves
 All writes happen on the master (a small routing sketch follows below)

[Diagram: a data store with one master and one slave replica]
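
A minimal sketch of this read/write routing, assuming the master and slaves are plain dictionaries rather than real database connections; replication is done inline here for clarity, whereas real systems often do it asynchronously.

import random

class ReplicatedStore:
    def __init__(self, master, replicas):
        self.master = master          # all writes go to the master
        self.replicas = replicas      # reads are spread across the slaves

    def write(self, key, value):
        self.master[key] = value
        for replica in self.replicas:   # replication; often asynchronous in practice
            replica[key] = value

    def read(self, key):
        return random.choice(self.replicas).get(key)   # may serve stale data

store = ReplicatedStore(master={}, replicas=[{}, {}])
store.write("user:42", {"name": "Alice"})
print(store.read("user:42"))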
STATEFUL SERVERS REQUIRE ATTENTION

 Cons:
 Master becomes the write bottleneck
 Master is a single point of failure
 As load increases, cost of replication increases
 Slaves may fall behind and serve stale data
SHARDING

 Data partitioning strategy
 Basic idea: split the data between multiple machines and have a way to make sure you always access the data from the right place
 Typically define a sharding key and create a shard mapping (e.g., shard_idx = hash(key) mod N); see the sketch after the diagram below
 Other partitioning schemes exist: e.g., allocate whole tables on the same machine

[Diagram: data split across Partition 1, Partition 2, Partition 3]
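
A minimal sketch of the hash(key) mod N shard mapping above. The shards are plain dictionaries standing in for separate database servers, and md5 is used only because Python's built-in hash() is randomized per process.

import hashlib

N_SHARDS = 3
shards = [{} for _ in range(N_SHARDS)]     # one dict per (hypothetical) shard server

def shard_for(key):
    # Stable hash so every application server routes a key to the same shard.
    digest = hashlib.md5(key.encode()).hexdigest()
    return shards[int(digest, 16) % N_SHARDS]

def put(key, value):
    shard_for(key)[key] = value

def get(key):
    return shard_for(key).get(key)

put("user:42", {"name": "Alice"})
print(get("user:42"))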
BENEFITS OF SHARDING

 Increased read and write throughput

 High availability

 Possibility of doing more work in parallel within the application server

 Challenge: picking a good partitioning scheme
 Otherwise there is a risk of hotspots in the system due to load imbalance
SHARDING USED IN MANY WAYS

 Sharding is not only for partitioning data within a database
 It applies essentially to every application tier
 The notion of sharding is cross-cutting

 Example: partition data across caching servers
 Two popular in-memory caching systems:
 memcached: distributed object caching system
 redis: distributed data structure server (also works as a store)
AND IT ISN’T JUST ABOUT UPDATES

 We should also be thinking about patterns that arise when doing reads (“queries”)
 Some can just be performed by a single representative of a service
 But others might need the parallelism of having several (or even a huge number of) machines do parts of the work concurrently
 The term sharding is used for data, but here we might talk about “parallel computation on a shard”
FIRST-TIER PARALLELISM
 Parallelism is vital for fast interactive services
 Key question:
 A request has reached some service instance X
 Will it be faster…
 … for X to just compute the response
 … or for X to subdivide the work by asking subservices to do parts of the job?
 Glimpse of an answer
 When you make a search on Bing, the query is processed in parallel by as many as 1000s of servers that run in real time on your request!
 Parallel actions must focus on the critical path
WHAT DOES “CRITICAL PATH” MEAN?

 Focus on the delay until a client receives a reply
 The critical path is the set of actions that contribute to this delay

[Diagram: a request fans out from a service instance to several subservices;
the response delay seen by the end user also includes Internet latencies]
PARALLEL SPEEDUP
 In this example of a parallel read-only request, the critical path centers on the middle “subservice”

[Diagram: the same fan-out, with the slowest subservice marked as the critical
path; the response delay seen by the end user also includes Internet latencies]
WITH REPLICAS WE JUST LOAD BALANCE

[Diagram: with replicated subservices, the service instance load-balances the
request across replicas; the end-user delay still includes Internet latencies]
WHAT IF A REQUEST TRIGGERS UPDATES?

 If updates are done “asynchronously” we might not experience much delay on the critical path
 Cloud systems often work this way
 Avoids waiting for slow services to process the updates, but may force the tier-one service to “guess” the outcome
 For example, store in the master database and replicate to the slave in the background (a small sketch follows below)
 Many cloud systems use these sorts of “tricks” to speed up response time
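
A minimal sketch of such background replication, using an in-process queue and worker thread as stand-ins for a real replication channel (illustrative only):

import queue
import threading

master, slave = {}, {}
replication_log = queue.Queue()

def replicator():
    while True:
        key, value = replication_log.get()   # apply updates in arrival order
        slave[key] = value
        replication_log.task_done()

threading.Thread(target=replicator, daemon=True).start()

def write(key, value):
    master[key] = value                  # synchronous: on the critical path
    replication_log.put((key, value))    # asynchronous: off the critical path

write("item:7", {"units_available": 12})
replication_log.join()   # in a real system the slave simply lags behind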
WHAT IF WE SEND UPDATES WITHOUT WAITING?

 Several issues now arise
 Are all the replicas applying updates in the same order?
 Might not matter unless the same data item is being changed
 But then clearly we do need some “agreement” on order

 What if the leader replies to the end user but then crashes, and it turns out that the updates were lost in the network?
 Data center networks can be surprisingly lossy at times
 Also, bursts of updates can queue up

 Such issues result in inconsistency
IS INCONSISTENCY A BAD THING?
 How much consistency is really needed in the first tier of the cloud?
 Think about YouTube videos. Would consistency be an issue here?
 What about the Amazon “number of units available” counters? Will people notice if those are a bit off?
 Puzzle: can you come up with a general policy for knowing how
much consistency a given thing needs?
CLOUD STORAGE
COMPLEX SERVICE, SIMPLE STORAGE

[Diagram: applications see variable-size files (read, write, append, move,
rename, lock, unlock, ...); the operating system translates these into reads
and writes of fixed-size blocks on the storage device]

 PC users see a rich, powerful interface
 Hierarchical namespace (directories); can move, rename, append to, truncate, (de)compress, view, delete files, ...
 But the actual storage device is very simple
 The HDD only knows how to read and write fixed-size data blocks
 Translation is done by the operating system
ANALOGY TO CLOUD STORAGE

[Diagram: users see shopping carts, friend lists, user accounts, profiles, ...;
the web service translates these into read/write/delete operations on a
key/value store]

 Many cloud services have a similar structure
 Users see a rich interface (shopping carts, product categories, searchable index, recommendations, ...)
 But the actual storage service is very simple
 Read/write 'blocks', similar to a giant hard disk
 Translation is done by the web service
WHAT’S WRONG WITH RELATIONAL DBS?
 Most applications interact through a database
 Recall RDBMS:
 Manage data access, enforce data integrity, control concurrency, support
recovery after a failure
 Many applications push traditional RDBMS solutions to the limit
by demanding:
 High scalability
 Very large amounts of data
 Minimal latency
 High availability
 Solution is far from ideal
IDEAL DATA STORES ON THE CLOUD
 Many situations need hosting of large data sets
 Examples: Amazon catalog, eBay listings, Facebook pages, …
 Ideal: Abstraction of a 'big disk in the clouds', which
would have:
 Perfect durability – nothing would ever disappear in a
crash
 100% availability – we could always get to the service
 Zero latency from anywhere on earth – no delays!
 Minimal bandwidth utilization – we only send across
the network what we absolutely need
 Isolation under concurrent updates – make sure data
stays consistent
THE INCONVENIENCES OF THE REAL WORLD

 Why isn't this feasible?

 The “cloud” exists over a physical network


 Communication takes time, esp. across the globe
 Bandwidth is limited, both on the backbone and endpoint

 The “cloud” has imperfect hardware


 Hard disks crash
 Servers crash
 Software has bugs
FINDING THE RIGHT TRADEOFF
 In practice, we can’t have everything
 ... but most applications don’t really need ‘everything’!
 Some observations:
 1. Read-only (or read-mostly) data is easiest to support
 Replicate it everywhere! No concurrency issues!
 But only some kinds of data fit this pattern – examples?

 2. Granularity matters: “few large-object” tasks generally tolerate longer latencies than “many small-object” tasks
 Fewer requests, often more processing at the client
 But it’s much more expensive to replicate or to update!

 3. Maybe it makes sense to develop separate solutions for large read-mostly objects vs. small read-write objects!
 Different requirements → different technical solutions
KVS AND CURRENT SYSTEMS
KEY-VALUE STORES
Keys → Values

(bob, <email address>)
(gettysburg, "Four score and seven years ago...")
(29ck2dxa1, 0128ckso1$9#*!!8349e)
(windows, <image>)

 The key-value store (KVS) is a simple abstraction for managing persistent state
 Data is organized as (key, value) pairs
 Only three basic operations (a minimal sketch follows below):
 PUT(key, value)
 GET(key) → value
 DELETE(key)
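
A minimal in-memory sketch of these three operations, for illustration only; real stores such as Dynamo or Voldemort persist and replicate the data across machines.

class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value       # PUT(key, value)

    def get(self, key):
        return self._data.get(key)    # GET(key) -> value (None if absent)

    def delete(self, key):
        self._data.pop(key, None)     # DELETE(key)

kvs = KeyValueStore()
kvs.put("gettysburg", "Four score and seven years ago...")
print(kvs.get("gettysburg"))
kvs.delete("gettysburg")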
EXAMPLES OF KVS
 Where have you seen this concept before?

 Conventional examples outside the cloud:
 In-memory associative arrays and hash tables – limited to a single application, only persistent until the program ends
 On-disk indices (like BerkeleyDB)
 "Inverted indices" behind search engines
 Database management systems – multiple KVSs++
 Distributed hash tables
 Decentralized distributed systems inspired by P2P (see LSINF2345)
 Examples: Chord/Pastry
SUPPORTING AN INTERNET SERVICE WITH A KVS
 We’ll do this through a central server S, e.g., a Web or application server, serving clients A and B
 Two main issues:
 1. There may be multiple concurrent requests from different clients
 These might be GETs, PUTs, DELETEs, etc.
 2. These requests may come from different parts of the network, with message propagation delays
 It takes a while for a request to make it to the server!
 We’ll have to handle requests in the order received (why?)
CONCURRENCY CONTROL

 Most systems use locks on individual items
 Each requestor asks for the lock
 A lock manager processes these requests (typically in FIFO order) as follows:
 The lock manager grants the lock to a requestor
 The requestor makes its modifications
 Then it releases the lock when it’s done (a minimal sketch follows below)
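
A minimal per-key lock manager sketch in Python, illustrative only: a real KVS grants locks over the network and must handle client failures and timeouts, and Python's threading.Lock does not guarantee strict FIFO ordering of waiters.

import threading

class LockManager:
    """Grants at most one holder per key; other requestors block until release."""
    def __init__(self):
        self._meta = threading.Lock()   # protects creation of new per-key locks
        self._locks = {}

    def _lock_for(self, key):
        with self._meta:
            return self._locks.setdefault(key, threading.Lock())

    def acquire(self, key):
        self._lock_for(key).acquire()   # blocks until the key's lock is free

    def release(self, key):
        self._lock_for(key).release()

locks = LockManager()
locks.acquire("cart:42")
# ... requestor modifies the value stored under "cart:42" ...
locks.release("cart:42")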


LIMITATIONS OF PER-KEY CONCURRENCY CONTROL

 Suppose I want to transfer credits from my WoW account to my friend’s?
 … while someone else is doing a GET on my (and her) credit amounts to see if they want to trade?
 This is where one needs a database management system (DBMS) or transaction processing manager (app server)
 Allows for “locking” at a higher level, across keys and possibly even systems (see LINGI2172 for more details)
 Could you implement higher-level locks within the KVS? If so, how? (One possible approach is sketched below.)
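
One possible answer, sketched under assumptions: reuse the per-key LockManager and KeyValueStore from the sketches above, acquire the locks of all keys involved in a consistent (sorted) order to avoid deadlock, and only then read and write the values. Credit balances are assumed to be stored as plain numbers.

def transfer_credits(kvs, locks, src_key, dst_key, amount):
    keys = sorted([src_key, dst_key])   # fixed order prevents deadlock between transfers
    for key in keys:
        locks.acquire(key)
    try:
        src = kvs.get(src_key)
        dst = kvs.get(dst_key)
        if src is not None and dst is not None and src >= amount:
            kvs.put(src_key, src - amount)
            kvs.put(dst_key, dst + amount)
    finally:
        for key in keys:
            locks.release(key)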
SPECIALIZED DATA STORES

 Example: Amazon’s solutions

 Dynamo [SOSP’07]
 Many services only store and retrieve data by primary key
 Examples: user preferences, shopping cart, best-seller lists
 They don’t require the querying and management functionality of an RDBMS
 Simple Storage Service (S3)
 Need to store large objects that change infrequently
 Examples: virtual machines, pictures
SPECIALIZED DATA STORES
 Example: Google’s solutions

 The Google File System [SOSP’03]


 Distributed file system for large data-intensive applications
 No POSIX API; focus on multi-GB files divided into fixed-size chunks (64 MB); mostly mutated by appending new data
 Single master node maintains all file metadata

 Bigtable [OSDI’06]
 Distributed storage system for structured data
 Data model is a sparse multi-dimensional sorted map indexed
by row and column keys and a timestamp
 Each value in the map is opaque to the storage system
SPECIALIZED DATA STORES

 Example: Facebook’s solutions


 Cassandra [Ladis’09]
 A distributed storage system for large sets of structured data
 Optimized for very high write throughput; no master nodes

 Haystack [OSDI’10]
 Object store system optimized for photos
 In 2010, over 260 billion images; 20 PB of data; 60 TB/week
 Data written once, read often, never modified, rarely deleted

 TAO [ATC’13]
 A read-optimized graph data store to serve the social graph
 Sustains 1 billion reads/s on a changing data set of many PBs
 Explicitly favors availability over consistency
SPECIALIZED DATA STORES
 Example: LinkedIn’s solutions

 Kafka [NetDB’11]
 A high-throughput distributed messaging system
 Pub/sub architecture designed for aggregating log data
 Messages are persisted on disk for durability and replicated for fault
tolerance; guarantees at-least-once delivery
 Voldemort
 A distributed key-value store supporting only get/put/delete
 Inspired by Amazon’s Dynamo: tunable consistency, highly available
