Lecture 2 A

This document discusses data management in the cloud. It begins by defining cloud computing and cloud data management. It then explores the challenges of data management in the cloud, as well as new solutions like NoSQL databases. It provides examples of using graph data and algorithms in the cloud. Finally, it discusses how data management applications are well-suited for deployment in the cloud and how the cloud enables new models for scientific data management.

Uploaded by

Momina

Data Management in the Cloud

Outline
• Motivation
– what is cloud computing?
– what is cloud data management?
• Challenges, opportunities and limitations
– what makes data management in the cloud difficult?
• New solutions
– key/value, document, column family, graph, array, and object databases
– scalable SQL databases
• Application
– graph data and algorithms
– usage scenarios

What is Cloud Computing?
• Different definitions for “Cloud Computing” exist
– http://tech.slashdot.org/article.pl?sid=08/07/17/2117221
• Common ground of many definitions
– processing power, storage and software are commodities that are
readily available from large infrastructure
– service-based view: “everything as a service (*aaS)”, where only
“Software as a Service (SaaS)” has a precise and agreed-upon definition
– utility computing: pay-as-you-go model

Service-Based View on Computing

[Figure: layered service stack, from top to bottom]
• Client Software
• Software (SaaS): User Interface, Machine Interface – for the End User
• Platform (PaaS): Components, Services – for the Application Developer
• Infrastructure (IaaS): Computation, Network, Storage – for the System Administrator
• Server Hardware
Source: Wikipedia (http://www.wikipedia.org)

Terminology
• Term cloud computing usually refers to both
– SaaS: applications delivered over the Internet as services
– The Cloud: data center hardware and systems software
• Public clouds
– available in a pay-as-you-go manner to the public
– service being sold is utility computing
– Amazon Web Services, Microsoft Azure, Google AppEngine
• Private clouds
– internal data centers of businesses or organizations
– normally not included under cloud computing

Based on: “Above the Clouds: A Berkeley View of Cloud Computing”, RAD Lab, UC Berkeley
Utility Computing
• Illusion of infinite computing resources
– available on demand
– no need for users to plan ahead for provisioning
• No up-front cost or commitment by users
– companies can start small (demand unknown in advance)
– increase resources only when there is an increase in need (demand
varies with time)
• Pay for use on short-term basis as needed
– processors by the hour and storage by the day
– release them as needed, reward conservation
• “Cost associativity”
– 1000 EC2 machines for 1 hour = 1 EC2 machine for 1000 hours
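
Cost associativity can be checked with simple arithmetic. The sketch below assumes a purely hypothetical flat price per machine-hour (not an actual EC2 rate) and strictly linear pay-as-you-go billing; `cost_cents` is an illustrative helper, not part of any provider API.

```python
# Hypothetical flat price per machine-hour, in cents; not an actual EC2 rate.
PRICE_CENTS_PER_MACHINE_HOUR = 10

def cost_cents(machines: int, hours: int) -> int:
    """Total cost in cents, assuming strictly linear pay-as-you-go billing."""
    return machines * hours * PRICE_CENTS_PER_MACHINE_HOUR

wide = cost_cents(machines=1000, hours=1)    # 1000 machines for 1 hour
narrow = cost_cents(machines=1, hours=1000)  # 1 machine for 1000 hours

# Same bill either way, so a parallelizable job can finish sooner at no extra cost.
print(wide == narrow)
```

Under this assumption only the product machines × hours matters, which is why finishing a parallelizable job 1000 times faster costs nothing extra.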

Based on: “Above the Clouds: A Berkeley View of Cloud Computing”, RAD Lab, UC Berkeley
Cloud Computing Users and Providers

Picture credit: “Above the Clouds: A Berkeley View of Cloud Computing”, RAD Lab, UC Berkeley
Virtualization
• Virtual resources abstract from physical resources
– hardware platform, software, memory, storage, network
– fine-granular, lightweight, flexible and dynamic
• Relevance to cloud computing
– centralize and ease administrative tasks
– improve scalability and workloads
– increase stability and fault-tolerance
– provide standardized, homogeneous computing platform through
hardware virtualization, i.e. virtual machines

Spectrum of Virtualization
• Computation virtualization
– Instruction set VM (Amazon EC2, 3Tera)
– Byte-code VM (Microsoft Azure)
– Framework VM (Google AppEngine, Force.com)
• Storage virtualization
• Network virtualization

• Spectrum from lower-level (less management) to higher-level (more management)
– EC2, Azure, AppEngine, Force.com

Slide Credit: RAD Lab, UC Berkeley
Table credit: “Above the Clouds: A Berkeley View of Cloud Computing”, RAD Lab, UC Berkeley
Economics of Cloud Users
• Pay by use instead of provisioning for peak

[Figure: two panels, “Static data center” and “Data center in the cloud”, each plotting Resources over Time with Capacity and Demand curves; shaded area labeled “Unused resources”]
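
The difference between provisioning for peak and paying by use can be put into numbers. The sketch below uses a made-up hourly demand series (in servers); all values are assumptions for illustration only.

```python
# Hypothetical hourly demand, in number of servers needed.
demand = [20, 35, 80, 100, 60, 25]

# Static data center: capacity is fixed at the peak for the whole period.
static_capacity = max(demand)
static_server_hours = static_capacity * len(demand)

# Cloud: capacity follows demand, so only the used server-hours are paid for.
cloud_server_hours = sum(demand)

# Server-hours that the static data center pays for but never uses.
unused_server_hours = static_server_hours - cloud_server_hours
utilization = cloud_server_hours / static_server_hours

print(static_server_hours, cloud_server_hours, unused_server_hours)
print(f"utilization of the static data center: {utilization:.0%}")
```

The more the demand varies over time, the larger the unused share of a peak-provisioned data center becomes.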

Slide Credit: RAD Lab, UC Berkeley
Economics of Cloud Users
• Risk of over-provisioning: underutilization

[Figure: “Static data center”, plotting Resources over Time with Capacity and Demand curves; shaded area labeled “Unused resources”]

Slide Credit: RAD Lab, UC Berkeley
Economics of Cloud Users
• Heavy penalty for under-provisioning

[Figure: three panels, each plotting Resources over Time (days 1 to 3) with Capacity and Demand curves; the second and third panels are labeled “Lost revenue” and “Lost users”]

Slide Credit: RAD Lab, UC Berkeley
Economics of Cloud Providers
Resource        Cost in Medium Data Center   Cost in Very Large Data Center   Ratio
Network         $95/Mbps/month               $13/Mbps/month                   7.1x
Storage         $2.20/GB/month               $0.40/GB/month                   5.7x
Administration  ≈140 servers/admin           >1000 servers/admin              7.1x
Source: James Hamilton (http://perspectives.mvdirona.com)
• Cloud computing is 5-7x cheaper than traditional in-house
computing
• Power/cooling costs: approx double cost of storage, CPU, network
• Added benefits (to cloud providers)
– utilize off-peak capacity (Amazon)
– sell .NET tools (Microsoft)
– reuse existing infrastructure (Google)

Slide Credit: RAD Lab, UC Berkeley
What is Cloud Data Management?
• Data management applications are potential candidates for
deployment in the cloud
– industry: enterprise database systems have significant up-front cost that
includes both hardware and software costs
– academia: manage, process and share mass-produced data in the cloud
• Many “Cloud Killer Apps” are in fact data-intensive
– Batch Processing as with map/reduce
– Online Transaction Processing (OLTP) as in automated business
applications
– Online Analytical Processing (OLAP) as in data mining or machine
learning

Scientific Data Management Applications
• Old model
– “Query the world”
– data acquisition coupled to a specific hypothesis
• New model
– “Download the world”
– data acquired en masse, in support of many hypotheses
• E-science examples
– astronomy: high-resolution, high-frequency sky surveys, …
– oceanography: high-resolution models, cheap sensors, satellites, …
– biology: lab automation, high-throughput sequencing, ...

Slide Credit: Bill Howe, U Washington
Scaling Databases
• Flavors of database scalability
– lots of (small) transactions
– lots of copies of the data
– lots of processors running on a single query (compute intensive tasks)
– extremely large data set for one query (data intensive tasks)
• Data replication
– move data to where it is needed
– managed replication for availability and reliability

Revisit Cloud Characteristics
• Compute power is elastic, but only if workload is parallelizable
– transactional database management systems do not typically use a
shared-nothing architecture
– shared-nothing is a good match for analytical data management
– some things parallelize well (e.g. sum), some do not (e.g. median)
– Think about: Google Gmail, Amazon web site – easy? Difficult?
– Google App Engine – the API forces applications to run in a shared-nothing fashion
• Scalability
– in the past: out-of-core, works even if data does not fit in main memory
– in the present: exploits thousands of (cheap) nodes in parallel
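
Why a sum parallelizes well and a median does not can be seen on a tiny example. The sketch below assumes data split across three hypothetical shared-nothing nodes; the partitions and values are made up.

```python
from statistics import median

# Data split across three hypothetical shared-nothing nodes.
partitions = [[1, 2, 3], [4, 5, 100], [6, 7]]
all_values = [v for part in partitions for v in part]

# Sum: each node computes a partial sum locally; adding the partial
# results gives exactly the global sum.
partial_sums = [sum(part) for part in partitions]
total = sum(partial_sums)

# Median: combining per-node medians does not, in general, give the
# global median, so nodes cannot simply work in isolation.
median_of_medians = median(median(part) for part in partitions)
true_median = median(all_values)

print(total == sum(all_values))
print(median_of_medians, true_median)
```

An exact median needs extra coordination between nodes (for example exchanging counts or sorted ranges), which limits how much elastic compute power helps.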

Based on: “Data Management in the Cloud: Limitations and Opportunities”, IEEE, 2009.
Parallel Database Architectures
Shared nothing, shared disk, shared memory
[Figure: three architectures, each built from processors, memories and disks joined by an interconnect]
Source: D. DeWitt and J. Gray: “Parallel Database Systems: The Future of High Performance Database Processing”, CACM 35(6), pp. 85-98, 1992.
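
In a shared-nothing architecture each node owns its own processor, memory and disk, so data has to be partitioned across nodes. A minimal sketch of hash partitioning follows; the cluster size, the use of CRC32 as the hash, and the `put`/`get` helpers are assumptions for illustration and not taken from any particular system.

```python
import zlib

NUM_NODES = 3  # hypothetical cluster size

def node_for(key: str) -> int:
    """Map a key to a node with a deterministic hash (CRC32, as an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_NODES

# Each node has private storage; nothing is shared between nodes.
nodes = [dict() for _ in range(NUM_NODES)]

def put(key: str, value) -> None:
    """Store the value on the one node that owns the key."""
    nodes[node_for(key)][key] = value

def get(key: str):
    """Only the owning node is contacted; all other nodes stay idle."""
    return nodes[node_for(key)].get(key)

for i in range(10):
    put(f"user{i}", i)

print(get("user3"), [len(n) for n in nodes])
```

Single-key operations touch one node only, while queries that span many keys have to be split up and sent to several nodes.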
Revisit Cloud Characteristics
• Data is stored at an untrusted host
– there are risks with respect to privacy and security in storing
transactional data on an untrusted host
– particularly sensitive data can be left out of analysis or anonymized
– sharing and enabling access is often precisely the goal of using the
cloud for scientific data sets
– where exactly is your data? and what are that country’s laws?
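
Leaving out or anonymizing sensitive fields before upload could look like the sketch below. It is a sketch under assumptions: the record layout, the key and the helper names are hypothetical, and a keyed hash only yields pseudonyms, which is weaker than full anonymization.

```python
import hashlib
import hmac

# Hypothetical secret that stays on premises and is never uploaded.
SECRET_KEY = b"replace-with-a-secret-kept-off-the-cloud"

def pseudonymize(value: str) -> str:
    """Replace a value by a keyed hash (a pseudonym, not full anonymization)."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

def prepare_for_upload(record: dict) -> dict:
    """Drop particularly sensitive fields and pseudonymize identifying ones."""
    out = dict(record)
    del out["ssn"]                                 # left out of the analysis
    out["patient"] = pseudonymize(out["patient"])  # still joinable, not readable
    return out

# Made-up example record.
record = {"patient": "Alice", "ssn": "123-45-6789", "diagnosis": "flu"}
uploaded = prepare_for_upload(record)
print(sorted(uploaded))
```

The same input always maps to the same pseudonym, so analyses that group by patient still work on the uploaded data.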

Based on: “Data Management in the Cloud: Limitations and Opportunities”, IEEE, 2009.
Revisit Cloud Characteristics
• Data is replicated, often across large geographic distances
– it is hard to maintain ACID guarantees in the presence of large-scale
replication
– full ACID guarantees are typically not required in analytical applications
• Virtualizing large data collections is challenging
– data loading takes more time than starting a VM
– storage cost vs. bandwidth cost
– online vs. offline replication
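
One reason ACID guarantees are hard to keep under large-scale replication: if a distant replica is updated asynchronously to avoid long-distance latency on every write, reads at that replica can be stale. The toy model below illustrates this; the `Primary` and `Replica` classes are hypothetical and do not model any real system.

```python
class Replica:
    """A distant copy of the data that is updated only when the log arrives."""
    def __init__(self):
        self.data = {}

class Primary:
    """Accepts writes locally and ships them to the replica later."""
    def __init__(self, replica: Replica):
        self.data = {}
        self.replica = replica
        self.pending = []  # writes not yet shipped to the replica

    def write(self, key, value):
        self.data[key] = value
        self.pending.append((key, value))  # asynchronous: do not wait for the replica

    def ship_log(self):
        """Apply all pending writes at the replica, in order."""
        for key, value in self.pending:
            self.replica.data[key] = value
        self.pending.clear()

replica = Replica()
primary = Primary(replica)
primary.write("balance", 100)

stale = replica.data.get("balance")  # replica has not seen the write yet
primary.ship_log()
fresh = replica.data.get("balance")  # now the replica has caught up

print(stale, fresh)
```

Waiting for every replica before acknowledging a write would remove the stale read, at the price of paying wide-area latency on each update.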

Based on: “Data Management in the Cloud: Limitations and Opportunities”, IEEE, 2009.
Challenges
• Trade-off between functionality and operational cost
– restricted interface, minimalist query language, limited consistency
guarantees
– pushes more programming burden on developers
– enables predictable services and service level agreements
• Manageability
– limited human intervention, high-variance workloads, and a variety of
shared infrastructures
– need for self-managing and adaptive database techniques

Based on: “The Claremont Report on Database Research”, 2008
Challenges
• Scalability
– today’s SQL databases cannot scale to the thousands of nodes deployed
in the cloud context
– hard to support multiple, distributed updaters to the same data set
– hard to replicate huge data sets for availability, due to capacity (storage,
network bandwidth, …)
– storage: different transactional implementation techniques, different
storage semantics, or both
– query processing and optimization: limitations on either the plan
space or the search will be required
– programmability: express programs in the cloud

Based on: “The Claremont Report on Database Research”, 2008
Challenges
• Data privacy and security
– protect from other users and cloud providers
– specifically target usage scenarios in the cloud with practical incentives
for providers and customers
• New applications: “mash up” interesting data sets
– expect services pre-loaded with large data sets, stock prices, web
crawls, scientific data
– data sets from private or public domain
– might give rise to federated cloud architectures

Based on: “The Claremont Report on Database Research”, 2008
Transactional Data Management – Cloud or not?
• Transactional Data Management
– Banking, airline reservation, e-commerce, etc…
– Require ACID, write-intensive
• Features
– Do not typically use shared-nothing architectures (changing a bit)
– Hard to maintain ACID guarantees in the face of data replication over
large geographic distances
– There are risks in storing transactional data on an untrusted host
• Conclusion: not appropriate for the cloud

Based on: “Data Management in the Cloud: Limitations and Opportunities”, IEEE, 2009.
Analytical Data Management – Cloud or not?
• Analytical Data Management
– Query data from a data store for planning, problem solving, decision
support
– Large scale
– Read-mostly
• Features
– Shared-nothing architecture is a good match for analytical data
management (Teradata, Greenplum, Vertica…)
– ACID guarantees typically not needed
– Particularly sensitive data left out of analysis
• Conclusion: appropriate for the cloud

Based on: “Data Management in the Cloud: Limitations and Opportunities”, IEEE, 2009.
Cloud DBMS Wish List
• Efficiency
• Fault tolerance (query restart not required, commodity hw)
• Heterogeneous environment (performance of compute nodes
not consistent)
• Operate on encrypted data
• Interface with business intelligence products

Based on: “Data Management in the Cloud: Limitations and Opportunities”, IEEE, 2009.
Option 1: MapReduce-like software
• Fault tolerance (yes, commodity hw)
• Heterogeneous environment (yes, by design)
• Operate on encrypted data (no)
• Interface with business intelligence products (no, not SQL-
compliant, no standard)
• Efficiency (up for debate)
– Questionable results in the MapReduce paper
– Absence of a loading phase (no indices, materialized views)
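
For readers unfamiliar with the programming model, a single-process sketch of map/reduce word counting follows. It only mimics the three phases in plain Python; the function names are illustrative and this is not the Hadoop or Google MapReduce API.

```python
from collections import defaultdict

def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word in the raw input."""
    for word in document.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)

# Raw text is read directly; nothing is loaded or indexed beforehand.
documents = ["the cloud scales", "the cloud is elastic"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

print(counts)
```

The input is scanned as-is on every run, which reflects the absence of a loading phase: there are no indices or materialized views to exploit.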

Based on: “Data Management in the Cloud: Limitations and Opportunities”, IEEE, 2009.
Option 2: Shared-Nothing Parallel Database
• Interface with business intelligence products (yes, by design)
• Efficiency (yes)
• Fault tolerance (no - query restart required)
• Heterogeneous environment (no)
• Operate on encrypted data (no)

Based on: “Data Management in the Cloud: Limitations and Opportunities”, IEEE, 2009.
Option 3: A Hybrid Solution
• HadoopDB (http://db.cs.yale.edu/hadoopdb/hadoopdb.html)
– A hybrid of DBMS and MapReduce technologies that targets analytical
workloads
– Designed to run on a shared-nothing cluster of commodity machines,
or in the cloud
– An attempt to fill the gap in the market for a free and open source
parallel DBMS
– Much more scalable than currently available parallel database systems
and DBMS/MapReduce hybrid systems.
– As scalable as Hadoop, while achieving superior performance on
structured data analysis workload
• Commercialized as Hadapt (hadapt.com)

Why NoSQL?
• Value of relational databases
– Persistent data
– Concurrency/transactions
– Integration
– (Mostly) Standard Model
• Impedance Mismatch
• Application vs. Integration databases
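
The impedance mismatch is the gap between nested in-memory structures and flat relational tables. The sketch below uses Python's built-in `sqlite3` module; the order object and the table layout are made up for illustration.

```python
import sqlite3

# In-memory application object: one nested structure.
order = {"id": 1, "customer": "Ada", "lines": [("book", 2), ("pen", 5)]}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT)")
conn.execute("CREATE TABLE order_lines (order_id INTEGER, item TEXT, qty INTEGER)")

# Saving: the single object has to be split across two tables.
conn.execute("INSERT INTO orders VALUES (?, ?)", (order["id"], order["customer"]))
conn.executemany(
    "INSERT INTO order_lines VALUES (?, ?, ?)",
    [(order["id"], item, qty) for item, qty in order["lines"]],
)

# Loading: the rows have to be reassembled into the nested structure.
row = conn.execute("SELECT id, customer FROM orders WHERE id = ?", (1,)).fetchone()
lines = conn.execute(
    "SELECT item, qty FROM order_lines WHERE order_id = ? ORDER BY rowid", (1,)
).fetchall()
loaded = {"id": row[0], "customer": row[1], "lines": lines}

print(loaded == order)
```

The translation code in both directions is the cost of the mismatch; stores that keep the whole aggregate under one key avoid it.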

Based on: “NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence”, 2013
Attack of the Clusters
• Data growth (links, social networks, logs, users)
• Need to scale to accommodate growth
• Traditional RDBMS (Oracle / Microsoft SQL Server) – shared
disk – don’t scale well
• “technical issues are exacerbated by licensing costs”
– Google, Amazon influential
• “The interesting thing about Cloud Computing is that we’ve
redefined Cloud Computing to include everything that we
already do… I don’t understand what we would do differently
in the light of Cloud Computing other than change the wording
of some of our ads.” – Larry Ellison

Based on: “NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence”, 2013
Emergence of NoSQL
• No strong definition, but…
– Do not use SQL
– Typically open-source
– Typically oriented towards clusters (but not all)
– Operate without a schema
• Various types (in order of complexity)
– Key-value stores
– Document Stores
– Extensible Record Stores
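
The difference between the first two types can be sketched with plain Python dictionaries. This is a toy model under assumptions, not the API of any real key-value or document store; extensible record stores are not covered here.

```python
import json

# Key-value store: the value is an opaque blob; lookups are by key only.
kv_store = {}
kv_store["user:1"] = json.dumps({"name": "Ada", "city": "Zurich"})

# Document store: the store can see inside the documents, so it can
# answer queries on fields. Documents need not share a schema.
doc_store = {
    "user:1": {"name": "Ada", "city": "Zurich"},
    "user:2": {"name": "Bob", "city": "Portland", "tags": ["admin"]},
}

def find(store: dict, field: str, value):
    """Return the keys of all documents whose field equals the given value."""
    return [key for key, doc in store.items() if doc.get(field) == value]

# Key-value: the application fetches by key and decodes the blob itself.
ada = json.loads(kv_store["user:1"])

# Document: the store evaluates the field condition.
in_portland = find(doc_store, "city", "Portland")

print(ada["name"], in_portland)
```

Both stores operate without a schema; what grows from one type to the next is how much of the value's structure the store itself can use.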

Based on: “NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence”, 2013
What were we talking about?
• Cloud Computing
– Utility Computing
– Virtualization
– Economics (pay as you go)
• Data management in the cloud
– Cloud characteristics (elasticity if parallelizable, untrusted host, large
distances)
– Transactional vs. Analytical
– Wish List
– Map Reduce vs. Shared-Nothing -> Hybrid
• DB vs. NoSQL in two lines…
– Database: complex / concurrent
– NoSQL: simple / scalable

References
• M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. H. Katz, A. Konwinski,
G. Lee, D. A. Patterson, A. Rabkin, I. Stoica, M. Zaharia: Above the
Clouds: A Berkeley View of Cloud Computing. Tech. Rep. No.
UCB/EECS-2009-28, 2009.
• D. J. Abadi: Data Management in the Cloud: Limitations and
Opportunities. IEEE Data Eng. Bull. 32(1), pp. 3-12, 2009.
• R. Agrawal, A. Ailamaki, P. A. Bernstein, E. A. Brewer, M. J. Carey, S.
Chaudhuri, A. Doan, D. Florescu, M. J. Franklin, H. Garcia-Molina, J.
Gehrke, L. Gruenwald, L. M. Haas, A. Y. Halevy, J. M. Hellerstein, Y. E.
Ioannidis, H. F. Korth, D. Kossmann, S. Madden, R. Magoulas, B. Chin
Ooi, T. O’Reilly, R. Ramakrishnan, S. Sarawagi, M. Stonebraker, A. S.
Szalay, G. Weikum: The Claremont Report on Database Research.
2008.
• P. Sadalage, M. Fowler. NoSQL Distilled: A Brief Guide to the
Emerging World of Polyglot Persistence. 2013