Chapter 10: Big Data: Database System Concepts, 7 Ed


Chapter 10: Big Data

Database System Concepts, 7th Ed.


©Silberschatz, Korth and Sudarshan
See www.db-book.com for conditions on re-use
Motivation

 Very large volumes of data being collected


• Driven by growth of web, social media, and more recently internet-of-things
• Web logs were an early source of data
 Analytics on web logs has great value for advertisements, web
site structuring, what posts to show to a user, etc.
 Big Data: differentiated from data handled by earlier generation
databases
• Volume: much larger amounts of data stored
• Velocity: much higher rates of insertions
• Variety: many types of data, beyond relational data
• Veracity: the data might not be correct
• Value: the data provides benefit to the user

Database System Concepts - 7th Edition 10.2 ©Silberschatz, Korth and Sudarshan
Querying Big Data

 Transaction processing systems that need very high scalability


• Many applications willing to sacrifice ACID properties and other
database features, if they can get very high scalability
• Accept BASE
 Basically Available
 Soft state
 Eventually consistent
• Examples – Facebook, Stack Overflow, Google
 Query processing systems that
• Need very high scalability, and/or
• Need to support non-relational data
• Examples – Gene analysis, Pharmacology, Literature analysis

Distributed File Systems

 A distributed file system stores data across a large collection of machines,
but provides a single file-system view
 Highly scalable distributed file system for large data-intensive applications.
• E.g., 10K nodes, 100 million files, 10 PB
 Provides redundant storage of massive amounts of data on cheap and
unreliable computers
• Files are replicated to handle hardware failure
• Failures are detected and recovered from
 Frequently, data is immutable (write once/read many)
 Examples:
• Google File System (GFS)
• Hadoop Distributed File System (HDFS)

Key Value Storage Systems

 Also called Columnar data stores


 Key-value storage systems store large numbers (billions or even more) of
small (KB-MB) sized records
 Records are partitioned across multiple machines, and queries are
routed by the system to the appropriate machine
 Records are also replicated across multiple machines, to ensure
availability even if a machine fails
• Key-value stores ensure that updates are applied to all replicas, to
ensure that their values are consistent
• On immutable DFS
 Versions for updates
 Tombstones for deletions
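
The partitioning, replication, versioning, and tombstone ideas above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how any particular system is built: the class name TinyKVCluster, the hash-based routing, the single global version counter, and the `skip` parameter used to simulate failed machines are all hypothetical.

```python
import hashlib

TOMBSTONE = object()  # marker meaning "this key was deleted"


class TinyKVCluster:
    """Hypothetical toy cluster: each 'machine' is an append-only list."""

    def __init__(self, num_machines=4, replicas=2):
        self.machines = [[] for _ in range(num_machines)]
        self.replicas = replicas
        self.version = 0  # one global counter (a simplification)

    def _machines_for(self, key):
        # Route by hashing the key; replicas live on the following machines.
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        first = h % len(self.machines)
        return [(first + i) % len(self.machines) for i in range(self.replicas)]

    def _append(self, key, value):
        # Storage is immutable: never overwrite, only append a newer version.
        self.version += 1
        for m in self._machines_for(key):
            self.machines[m].append((key, self.version, value))

    def put(self, key, value):
        self._append(key, value)

    def delete(self, key):
        # A deletion is recorded as a new version holding a tombstone.
        self._append(key, TOMBSTONE)

    def get(self, key, skip=()):
        # Read from the first replica not listed in `skip` (simulated failures).
        for m in self._machines_for(key):
            if m in skip:
                continue
            versions = [(ver, val) for k, ver, val in self.machines[m] if k == key]
            if not versions:
                return None
            latest = max(versions, key=lambda t: t[0])[1]
            return None if latest is TOMBSTONE else latest
        raise RuntimeError("all replicas unavailable")
```

Because every update is applied to all replicas of a key, a read still succeeds when the first replica is skipped. A real system would also compact old versions and tombstones in the background.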

Key Value Storage Systems

 Key-value stores may store


• uninterpreted bytes, with an associated key
 E.g., Amazon S3, Amazon Dynamo
• Wide-table (can have arbitrarily many attribute names) with
associated key
 E.g., Google BigTable, Apache Cassandra, Apache HBase, Amazon
DynamoDB
 Allows some operations (e.g., filtering) to execute on storage
node
• JSON
 MongoDB, CouchDB (document model)
 Document stores store semi-structured data, typically JSON

Key Value Storage Systems

 Key-value stores support


• put(key, value): stores a value with an associated key
• get(key): retrieves the stored value associated with the
specified key
• delete(key): removes the key and its associated value
• CRUD (Create, Read, Update, Delete) interface
• NoSQL (Not Only SQL)
 Some systems also support range queries on key values
 Document stores also support queries on non-key attributes
• See book for MongoDB queries
 Key value stores are not full database systems
• Have no/limited support for transactional updates
• Applications must manage query processing on their own
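
As a rough illustration of the interface described above, the following single-machine sketch supports put, get, delete, and a range query on keys. The class SortedKVStore, the key naming scheme, and the sample records are all assumptions made for the example. Note that filtering on a non-key attribute (city) is done by the application, not by the store.

```python
import bisect


class SortedKVStore:
    """Hypothetical in-memory store that keeps keys sorted for range queries."""

    def __init__(self):
        self._keys = []  # sorted list of keys
        self._data = {}  # key -> value

    def put(self, key, value):
        if key not in self._data:
            bisect.insort(self._keys, key)
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def delete(self, key):
        if key in self._data:
            del self._data[key]
            self._keys.pop(bisect.bisect_left(self._keys, key))

    def range(self, lo, hi):
        """Yield (key, value) pairs with lo <= key < hi, in key order."""
        i = bisect.bisect_left(self._keys, lo)
        j = bisect.bisect_left(self._keys, hi)
        for k in self._keys[i:j]:
            yield k, self._data[k]


store = SortedKVStore()
store.put("user:001", {"name": "Ana", "city": "Pune"})
store.put("user:002", {"name": "Bo", "city": "Oslo"})
store.put("user:003", {"name": "Cy", "city": "Pune"})
store.put("video:9", {"title": "intro"})

# Range query on the key: all keys that start with "user:".
users = list(store.range("user:", "user:~"))

# The store knows nothing about "city"; the application filters on its own.
in_pune = [k for k, v in users if v["city"] == "Pune"]
print(in_pune)
```

A document store would push the city filter down to the storage layer; a plain key-value store leaves it to the caller, as here.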

Streaming Data and Applications

 Streaming data refers to data that arrives in a continuous fashion


• Contrast to data-at-rest
 Applications include:
• Stock market: stream of trades
• e-commerce site: purchases, searches
• Sensors: sensor readings
 Internet of things
• Network monitoring data
• Social media: tweets and posts can be viewed as a stream
 Queries on streams can be very useful
• Monitoring, alerts, automated triggering of actions

Querying Streaming Data

Approaches to querying streams:


 Windowing: Break up stream into windows, and queries are run on
windows
• Stream query languages support window operations
• Windows may be based on time or tuples
• Must figure out when all tuples in a window have been seen
 Easy if stream totally ordered by timestamp
 Punctuations assert that all future tuples will satisfy a given
predicate (e.g., timestamp greater than some value)
 Continuous Queries: Queries written e.g. in SQL, output partial results
based on stream seen so far; query results updated continuously
• Have some applications, but can lead to flood of updates
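
To make the windowing idea concrete, here is a minimal sketch of time-based (tumbling) windows over an out-of-order stream. It assumes integer timestamps and a hypothetical event encoding: ("tuple", ts) for data and ("punct", t) for a punctuation promising that every later tuple has timestamp greater than t. The function name tumbling_counts is also an assumption.

```python
from collections import defaultdict


def tumbling_counts(events, width):
    """Count tuples per window [s, s + width) and emit completed windows.

    events: iterable of ("tuple", ts) or ("punct", ts) pairs.
    Yields (window_start, count) once a punctuation shows that no more
    tuples can arrive for that window.
    """
    open_windows = defaultdict(int)  # window_start -> count so far
    for kind, ts in events:
        if kind == "tuple":
            open_windows[ts - ts % width] += 1
        else:
            # Later tuples have timestamp > ts, so a window is complete
            # when its last timestamp (s + width - 1) is <= ts.
            done = sorted(s for s in open_windows if s + width - 1 <= ts)
            for s in done:
                yield s, open_windows.pop(s)


events = [
    ("tuple", 1), ("tuple", 12), ("tuple", 3),  # arrives out of order
    ("punct", 9),                               # window [0, 10) is now closed
    ("tuple", 15),
    ("punct", 19),                              # window [10, 20) is now closed
]
print(list(tumbling_counts(events, width=10)))
```

Without the punctuations, the operator could never be sure that window [0, 10) was complete, since the tuple with timestamp 3 arrived after the one with timestamp 12.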

Querying Streaming Data (Cont.)

Approaches to querying streams (Cont.):


 Algebraic operators on streams:
• Each operator consumes tuples from a stream and outputs tuples
• Operators can be written e.g., in an imperative language
• Operator may maintain state
 Pattern matching:
• Queries specify patterns, system detects occurrences of patterns
and triggers actions
• Complex Event Processing (CEP) systems
• E.g., Microsoft StreamInsight, Flink CEP, Oracle Event Processing
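
The operator and pattern-matching approaches can be sketched with Python generators, each of which consumes a stream of tuples and outputs tuples. This is only an illustration: the operator names (select, running_avg, detect_rising), the dictionary tuple format, and the "three rising values" pattern are assumptions, not the API of any CEP system.

```python
def select(stream, predicate):
    """Stateless operator: pass through tuples that satisfy the predicate."""
    for t in stream:
        if predicate(t):
            yield t


def running_avg(stream, key, value):
    """Stateful operator: keeps a per-key count and sum."""
    state = {}
    for t in stream:
        n, total = state.get(t[key], (0, 0.0))
        n, total = n + 1, total + t[value]
        state[t[key]] = (n, total)
        yield {key: t[key], "avg": total / n}


def detect_rising(stream, key, value, length=3):
    """Pattern operator: emit the key when `length` consecutive values rise."""
    last, run = {}, {}
    for t in stream:
        k, v = t[key], t[value]
        # Extend the run if the value rose, otherwise start a new run.
        run[k] = run.get(k, 1) + 1 if k in last and v > last[k] else 1
        last[k] = v
        if run[k] == length:
            yield k


trades = [
    {"sym": "A", "price": 10},
    {"sym": "B", "price": 5},
    {"sym": "A", "price": 11},
    {"sym": "A", "price": 12},
    {"sym": "B", "price": 4},
]

only_a = select(trades, lambda t: t["sym"] == "A")
print([r["avg"] for r in running_avg(only_a, "sym", "price")])
print(list(detect_rising(trades, "sym", "price")))
```

Operators compose by passing one generator to the next, which mirrors how a stream-processing system chains operators into a pipeline.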

Parallel Graph Processing

 Very large graphs (billions of nodes, trillions of edges)


• Web graph: web pages are nodes, hyper links are edges
• Social network graph: people are nodes, friend/follow links are edges
 Two popular approaches for parallel processing on such graphs
• Map-reduce and algebraic frameworks
• Bulk synchronous processing (BSP) framework
 Multiple iterations are required for any computations on graphs
• Map-reduce/algebraic frameworks often have high overheads per
iteration
• BSP frameworks have much lower per-iteration overheads
 Google’s Pregel system popularized the BSP framework
 Apache Giraph is an open-source version of Pregel
 Apache Spark’s GraphX component provides a Pregel-like API
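
A bulk synchronous computation can be sketched as a loop of supersteps in which each vertex reads its incoming messages, updates its state, and sends messages for the next superstep. The example below labels each vertex with the smallest vertex id in its connected component. It is a sequential simulation under assumptions of our own (the function name bsp_min_label and the edge-list input are hypothetical), not the Pregel, Giraph, or GraphX API.

```python
def bsp_min_label(edges):
    """Label each vertex with the smallest vertex id in its component.

    edges: iterable of undirected (u, v) pairs.
    """
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    label = {v: v for v in neighbors}

    # Superstep 0: every vertex sends its own label to all neighbors.
    inbox = {v: [] for v in neighbors}
    for v in neighbors:
        for n in neighbors[v]:
            inbox[n].append(label[v])

    # Later supersteps: run until no messages are in flight.
    while any(inbox.values()):
        outbox = {v: [] for v in neighbors}
        for v, msgs in inbox.items():  # conceptually runs in parallel
            if msgs and min(msgs) < label[v]:
                label[v] = min(msgs)
                for n in neighbors[v]:
                    outbox[n].append(label[v])
            # Otherwise the vertex sends nothing ("votes to halt").
        inbox = outbox  # barrier: messages are delivered in the next superstep
    return label


print(bsp_min_label([(1, 2), (2, 3), (4, 5)]))
```

Each vertex only touches its own state and its inbox within a superstep, which is what lets a real BSP system run the inner loop across many machines with a single synchronization barrier per iteration.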

Replication and Consistency

 Availability (system can run even if parts have failed) is essential for
parallel/distributed databases
• Via replication, so even if a node has failed, another copy is available
 Consistency (atomicity) is important for replicated data
• All live replicas have same value, and each read sees latest version
• Often implemented using majority protocols
 Network partitions (network can break into two or more parts, each with
active systems that can’t talk to other parts)
 In presence of partitions, cannot guarantee both availability and
consistency
• Brewer’s CAP “Theorem”
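
One way to picture a majority protocol is the sketch below: reads and writes each contact a majority of replicas, and since any two majorities overlap, a read always sees the latest written version. When no majority is reachable, the operation is refused, which gives up availability to preserve consistency. The classes Replica and MajorityRegister and the `up` flag for simulating failures are hypothetical, and the sketch ignores concurrency entirely.

```python
class Replica:
    def __init__(self):
        self.version, self.value, self.up = 0, None, True


class MajorityRegister:
    """Hypothetical single data item replicated on n nodes."""

    def __init__(self, n=3):
        self.replicas = [Replica() for _ in range(n)]
        self.quorum = n // 2 + 1

    def _live_majority(self):
        live = [r for r in self.replicas if r.up]
        if len(live) < self.quorum:
            # No majority reachable: refuse rather than risk inconsistency.
            raise RuntimeError("no majority reachable")
        return live[: self.quorum]

    def write(self, value):
        group = self._live_majority()
        # The majority overlaps every earlier write, so max + 1 is newest.
        new_version = max(r.version for r in group) + 1
        for r in group:
            r.version, r.value = new_version, value

    def read(self):
        group = self._live_majority()
        # At least one replica in any majority holds the latest version.
        return max(group, key=lambda r: r.version).value


reg = MajorityRegister(3)
reg.write("a")               # stored on replicas 0 and 1
reg.replicas[0].up = False
reg.write("b")               # stored on replicas 1 and 2
reg.replicas[0].up = True
reg.replicas[1].up = False
print(reg.read())            # replicas 0 and 2 answer; replica 2 is newest
```

Replica 0 still holds the stale value "a" at the end, but the version number lets the read pick the value from replica 2.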

The MapReduce Paradigm

 Platform for reliable, scalable parallel computing


 Abstracts issues of distributed and parallel environment from programmer
• Programmer provides core logic (via map() and reduce() functions)
• System takes care of parallelization of computation, coordination, etc.
 Paradigm dates back many decades
• But very large scale implementations running on clusters with 10^3 to
10^4 machines are more recent
• E.g., Google MapReduce, Hadoop, ..
 Data storage/access typically done using distributed file systems
(HDFS) or key-value stores (HBase)
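
The division of labor described above can be illustrated with the classic word-count example. The programmer writes only map_fn and reduce_fn; the function run_mapreduce below is a hypothetical single-process stand-in for the system, which in a real deployment would partition the input, run the functions on many machines, and handle failures.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Programmer-supplied: emit (word, 1) for each word in a line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Programmer-supplied: combine all values emitted for one key."""
    yield word, sum(counts)


def run_mapreduce(records, map_fn, reduce_fn):
    """Toy stand-in for the framework: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for k, v in map_fn(key, value):
            groups[k].append(v)  # the "shuffle": group values by key
    out = []
    for k in sorted(groups):
        out.extend(reduce_fn(k, groups[k]))
    return out


lines = [(0, "big data big"), (1, "data stores")]
print(run_mapreduce(lines, map_fn, reduce_fn))
```

Because map_fn looks at one record at a time and reduce_fn at one key at a time, the framework is free to run many copies of each in parallel.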

Algebraic Operations

 Current generation execution engines


• Natively support algebraic operations such as joins, aggregation, etc.
• Allow users to create their own algebraic operators
• Support trees of algebraic operators that can be executed on multiple
nodes in parallel
 E.g. Apache Tez, Spark
• Tez provides low level API; Hive on Tez compiles SQL to Tez
• Spark provides more user-friendly API
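
A tree of algebraic operators that runs on partitioned data might look like the sketch below, which computes aggregate(join(customers, orders)) partition by partition. It does not use the Tez or Spark API; the helper names (partition, hash_join, sum_by, plan), the sample tables, and the sequential loop standing in for parallel nodes are all assumptions for illustration.

```python
from collections import defaultdict


def partition(rows, key, n):
    """Split rows into n partitions by hashing the given attribute."""
    parts = [[] for _ in range(n)]
    for r in rows:
        parts[hash(r[key]) % n].append(r)
    return parts


def hash_join(left, right, key):
    """Join operator: build a hash index on left, probe with right."""
    index = defaultdict(list)
    for l in left:
        index[l[key]].append(l)
    return [{**l, **r} for r in right for l in index.get(r[key], [])]


def sum_by(rows, group, value):
    """Aggregation operator: sum `value` per `group`."""
    totals = defaultdict(int)
    for r in rows:
        totals[r[group]] += r[value]
    return dict(totals)


def plan(customers, orders, n=2):
    # Both inputs are partitioned on the join key, so matching rows
    # land in the same partition and each join can run independently.
    c_parts = partition(customers, "cid", n)
    o_parts = partition(orders, "cid", n)
    joined = [hash_join(c, o, "cid") for c, o in zip(c_parts, o_parts)]
    # Partial aggregates per partition are merged into the final result.
    final = defaultdict(int)
    for part in joined:
        for city, amount in sum_by(part, "city", "amount").items():
            final[city] += amount
    return dict(final)


customers = [
    {"cid": 1, "city": "Pune"},
    {"cid": 2, "city": "Oslo"},
    {"cid": 3, "city": "Pune"},
]
orders = [
    {"cid": 1, "amount": 10},
    {"cid": 3, "amount": 5},
    {"cid": 2, "amount": 7},
    {"cid": 1, "amount": 1},
]
print(plan(customers, orders))
```

In a real engine the per-partition joins and partial aggregates would execute on different nodes, and user-defined operators would plug into the same tree.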
