Chapter 10: Big Data: Database System Concepts, 7 Ed

The document discusses different approaches for handling large volumes of data including distributed file systems, key-value storage systems, streaming data and applications, parallel graph processing, and replication and consistency challenges. It covers concepts like MapReduce, CAP theorem, and NoSQL databases.

Chapter 10: Big Data

Database System Concepts, 7th Ed.


©Silberschatz, Korth and Sudarshan
See www.db-book.com for conditions on re-use
Motivation

■ Very large volumes of data being collected
  • Driven by growth of web, social media, and more recently internet-of-things
  • Web logs were an early source of data
    ■ Analytics on web logs has great value for advertisements, web site structuring, what posts to show to a user, etc.
■ Big Data: differentiated from data handled by earlier generation databases
  • Volume: much larger amounts of data stored
  • Velocity: much higher rates of insertions
  • Variety: many types of data, beyond relational data
  • Veracity: the data might not be correct
  • Value: the data provides benefit to the user

Database System Concepts - 7th Edition 10.2 ©Silberschatz, Korth and Sudarshan
Querying Big Data

■ Transaction processing systems that need very high scalability
  • Many applications are willing to sacrifice ACID properties and other database features, if they can get very high scalability
  • Accept BASE
    ■ Basically Available
    ■ Soft state
    ■ Eventually consistent
  • Examples – Facebook, Stack Overflow, Google
■ Query processing systems that
  • Need very high scalability, and/or
  • Need to support non-relational data
  • Examples – Gene analysis, Pharmacology, Literature analysis

Distributed File Systems

■ A distributed file system stores data across a large collection of machines, but provides a single file-system view
■ Highly scalable distributed file system for large data-intensive applications
  • E.g., 10K nodes, 100 million files, 10 PB
■ Provides redundant storage of massive amounts of data on cheap and unreliable computers
  • Files are replicated to handle hardware failure
  • Detects failures and recovers from them
■ Frequently, data is immutable (write once/read many)
■ Examples:
  • Google File System (GFS)
  • Hadoop Distributed File System (HDFS)
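
The replication idea above can be illustrated with a minimal sketch. It assumes a toy round-robin placement policy and a replication factor of 3; the names place_blocks and read_file are hypothetical and do not correspond to any GFS or HDFS API.

```python
def place_blocks(data, block_size, nodes, replication=3):
    """Split data into blocks and assign each block to `replication` nodes.

    Returns {block_index: (block_bytes, [node names])}.
    Round-robin placement is an assumption made for illustration only.
    """
    placement = {}
    for offset in range(0, len(data), block_size):
        idx = offset // block_size
        holders = [nodes[(idx + r) % len(nodes)] for r in range(replication)]
        placement[idx] = (data[offset:offset + block_size], holders)
    return placement


def read_file(placement, failed_nodes=()):
    """Reassemble the file, failing only if every replica of a block is lost."""
    out = b""
    for idx in sorted(placement):
        block, holders = placement[idx]
        if all(h in failed_nodes for h in holders):
            raise IOError(f"block {idx} lost: all replicas are on failed nodes")
        out += block
    return out


nodes = ["n0", "n1", "n2", "n3"]
placement = place_blocks(b"abcdefghij", block_size=4, nodes=nodes)
recovered = read_file(placement, failed_nodes={"n1", "n2"})
```

With four nodes and three replicas per block, the file remains readable even when two nodes are down, because every block still has at least one surviving replica.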

Key Value Storage Systems

■ Sometimes loosely called columnar data stores (more precisely, wide-column stores are one kind of key-value store)
■ Key-value storage systems store large numbers (billions or even more) of small (KB-MB) sized records
■ Records are partitioned across multiple machines (horizontally, by key), and queries are routed by the system to the appropriate machine
■ Records are also replicated across multiple machines, to ensure availability even if a machine fails
  • Key-value stores ensure that updates are applied to all replicas, so that their values are consistent
  • On an immutable DFS
    ■ Versions for updates
    ■ Tombstones for deletions
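
How versions and tombstones allow updates and deletions on top of immutable storage can be sketched as follows. This is a toy in-memory model under stated assumptions: the class name AppendOnlyStore, the TOMBSTONE sentinel, and the compact() step are hypothetical and not taken from any particular system.

```python
import itertools

TOMBSTONE = object()  # hypothetical sentinel marking a deleted key


class AppendOnlyStore:
    """Toy key-value store layered on an append-only log."""

    def __init__(self):
        self._log = []                       # existing entries are never modified
        self._versions = itertools.count(1)  # monotonically increasing versions

    def put(self, key, value):
        # An update is a new entry with a higher version number.
        self._log.append((key, next(self._versions), value))

    def delete(self, key):
        # A deletion is just another appended entry: a tombstone.
        self._log.append((key, next(self._versions), TOMBSTONE))

    def get(self, key):
        # The newest entry for the key wins, so scan from the end of the log.
        for k, _version, value in reversed(self._log):
            if k == key:
                return None if value is TOMBSTONE else value
        return None

    def compact(self):
        # Write a fresh log holding only the newest live entry per key.
        latest = {}
        for k, version, value in self._log:
            latest[k] = (version, value)
        self._log = [(k, ver, val) for k, (ver, val) in latest.items()
                     if val is not TOMBSTONE]


store = AppendOnlyStore()
store.put("user:1", "alice")
store.put("user:1", "alice-v2")
store.delete("user:1")
store.put("user:2", "bob")
```

Reads never need in-place modification of old data; stale versions and tombstones are only discarded when a new, compacted log is written.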

Key Value Storage Systems (Cont.)

■ Key-value stores may store
  • Uninterpreted bytes, with an associated key
    ■ E.g., Amazon S3, Amazon Dynamo
  • Wide-table (can have arbitrarily many attribute names) with associated key
    ■ E.g., Google BigTable, Apache Cassandra, Apache HBase, Amazon DynamoDB
    ■ Allows some operations (e.g., filtering) to execute on the storage node
  • JSON
    ■ E.g., MongoDB, CouchDB (document model)
■ Document stores store semi-structured data, typically JSON

Key Value Storage Systems (Cont.)

■ Key-value stores support
  • put(key, value): stores a value with an associated key
  • get(key): retrieves the stored value associated with the specified key
  • delete(key): removes the key and its associated value
  • Often described as a CRUD (Create, Read, Update, Delete) interface
  • Such systems are commonly labelled NoSQL (Not Only SQL)
■ Some systems also support range queries on key values
■ Document stores also support queries on non-key attributes
  • See book for MongoDB queries
■ Key-value stores are not full database systems
  • Have no/limited support for transactional updates
  • Applications must manage query processing on their own
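
The put/get/delete interface, together with routing and replication across machines, can be sketched roughly as below. This is an in-process toy, assuming hash partitioning and a fixed number of replicas; the class PartitionedKVStore and its failed parameter are hypothetical and not the API of any real key-value store.

```python
import hashlib


class PartitionedKVStore:
    """Toy key-value store: each dict stands in for one storage machine."""

    def __init__(self, num_nodes=4, replicas=2):
        self.nodes = [dict() for _ in range(num_nodes)]
        self.replicas = replicas

    def _owners(self, key):
        # Route by a stable hash of the key; replicas go to the next nodes in order.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        first = int.from_bytes(digest[:8], "big") % len(self.nodes)
        return [(first + i) % len(self.nodes) for i in range(self.replicas)]

    def put(self, key, value):
        # Apply the update to every replica so their values stay consistent.
        for n in self._owners(key):
            self.nodes[n][key] = value

    def get(self, key, failed=()):
        # Read from the first replica that is not marked as failed.
        for n in self._owners(key):
            if n not in failed:
                return self.nodes[n].get(key)
        raise RuntimeError("all replicas unavailable")

    def delete(self, key):
        for n in self._owners(key):
            self.nodes[n].pop(key, None)


kv = PartitionedKVStore()
kv.put("order:42", {"item": "book", "qty": 1})
owners = kv._owners("order:42")
value_after_failure = kv.get("order:42", failed={owners[0]})
```

Note what is missing: there are no transactions and no query language, so anything beyond single-key access is left to the application, as the slide points out.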

Streaming Data and Applications

■ Streaming data refers to data that arrives in a continuous fashion
  • Contrast to data-at-rest
■ Applications include:
  • Stock market: stream of trades
  • E-commerce site: purchases, searches
  • Sensors: sensor readings
    ■ Internet of things
  • Network monitoring data
  • Social media: tweets and posts can be viewed as a stream
■ Queries on streams can be very useful
  • Monitoring, alerts, automated triggering of actions

Querying Streaming Data

Approaches to querying streams:


■ Windowing: break up the stream into windows, and run queries on the windows
  • Stream query languages support window operations
  • Windows may be based on time or tuples
  • Must figure out when all tuples in a window have been seen
    ■ Easy if the stream is totally ordered by timestamp
    ■ Punctuations specify that all future tuples will have a timestamp greater than some value, so windows ending at or before that value are complete
■ Continuous queries: queries written e.g. in SQL, output partial results based on the stream seen so far; query results are updated continuously
  • Have some applications, but can lead to a flood of updates
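
Time-based windowing with punctuations can be sketched as follows. The sketch assumes fixed-width (tumbling) windows, a sum aggregate, and a stream encoded as (kind, timestamp, value) triples; the function name and the encoding are hypothetical choices for illustration.

```python
from collections import defaultdict


def tumbling_window_sums(stream, width):
    """Yield (window_start, sum) once a window is known to be complete.

    A window covers timestamps in [window_start, window_start + width).
    A ("punctuation", ts, None) item promises that all later tuples have
    a timestamp greater than ts.
    """
    open_windows = defaultdict(int)  # window_start -> running sum
    for kind, ts, value in stream:
        if kind == "tuple":
            start = (ts // width) * width
            open_windows[start] += value
        elif kind == "punctuation":
            # Windows ending at or before ts can receive no more tuples.
            closed = sorted(w for w in open_windows if w + width <= ts)
            for start in closed:
                yield start, open_windows.pop(start)


stream = [
    ("tuple", 1, 10),
    ("tuple", 7, 5),
    ("tuple", 3, 2),          # arrives out of timestamp order
    ("punctuation", 5, None),
    ("tuple", 8, 1),
    ("punctuation", 10, None),
]
results = list(tumbling_window_sums(stream, width=5))
```

Because the tuple with timestamp 3 arrives after the tuple with timestamp 7, the window starting at 0 cannot be emitted on arrival order alone; the punctuation is what makes it safe to close.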

Querying Streaming Data (Cont.)

Approaches to querying streams (Cont.):


■ Algebraic operators on streams:
  • Each operator consumes tuples from a stream and outputs tuples
  • Operators can be written, e.g., in an imperative language
  • An operator may maintain state
■ Pattern matching:
  • Queries specify patterns; the system detects occurrences of the patterns and triggers actions
  • Complex Event Processing (CEP) systems
  • E.g., Microsoft StreamInsight, Flink CEP, Oracle Event Processing
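
A stateless operator feeding a stateful pattern-matching operator can be sketched with Python generators. The operators select and detect_repeated_failures, the event layout, and the "three consecutive failed logins" pattern are all hypothetical; real CEP systems provide their own pattern languages.

```python
def select(stream, predicate):
    """Stateless operator: pass through the tuples that satisfy the predicate."""
    for event in stream:
        if predicate(event):
            yield event


def detect_repeated_failures(stream, threshold=3):
    """Stateful operator: alert when a user reaches `threshold` consecutive failures."""
    consecutive = {}  # operator state: user -> current run of failures
    for user, _action, outcome in stream:
        if outcome == "fail":
            consecutive[user] = consecutive.get(user, 0) + 1
            if consecutive[user] == threshold:
                yield ("ALERT", user)
        else:
            consecutive[user] = 0


events = [
    ("ann", "login", "fail"),
    ("bob", "login", "fail"),
    ("ann", "login", "fail"),
    ("ann", "view", "ok"),
    ("ann", "login", "ok"),
    ("bob", "login", "fail"),
    ("bob", "login", "fail"),
]
logins = select(events, lambda e: e[1] == "login")
alerts = list(detect_repeated_failures(logins))
```

Each operator consumes one stream and produces another, so operators compose by simple chaining; only the second one needs to remember anything between tuples.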

Parallel Graph Processing

■ Very large graphs (billions of nodes, trillions of edges)
  • Web graph: web pages are nodes, hyperlinks are edges
  • Social network graph: people are nodes, friend/follow links are edges
■ Two popular approaches for parallel processing on such graphs
  • Map-reduce and algebraic frameworks
  • Bulk synchronous processing (BSP) framework
■ Most computations on graphs require multiple iterations
  • Map-reduce/algebraic frameworks often have high overheads per iteration
  • BSP frameworks have much lower per-iteration overheads
■ Google’s Pregel system popularized the BSP framework
■ Apache Giraph is an open-source version of Pregel
■ Apache Spark’s GraphX component provides a Pregel-like API
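
The BSP style of computation can be sketched on a single machine as below: in each superstep every vertex reads the messages sent to it in the previous superstep, updates its value, and sends messages to its neighbours. The example computes connected components by propagating the smallest vertex id; the function name and graph encoding are hypothetical, and this is not the Pregel, Giraph, or GraphX API.

```python
def bsp_connected_components(adjacency):
    """adjacency: dict mapping each vertex to a list of neighbours (undirected)."""
    label = {v: v for v in adjacency}

    # Superstep 0: every vertex sends its own label to its neighbours.
    inbox = {v: [] for v in adjacency}
    for v, neighbours in adjacency.items():
        for n in neighbours:
            inbox[n].append(label[v])

    # Later supersteps: run until no messages are in flight.
    while any(inbox.values()):
        outbox = {v: [] for v in adjacency}
        for v, messages in inbox.items():
            if messages and min(messages) < label[v]:
                label[v] = min(messages)
                for n in adjacency[v]:
                    outbox[n].append(label[v])
        inbox = outbox  # barrier: messages become visible only in the next superstep
    return label


graph = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
components = bsp_connected_components(graph)
```

Vertices that receive no improving message stay silent, so the amount of work per superstep shrinks as the labels converge.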

Replication and Consistency

■ Availability (system can run even if parts have failed) is essential for parallel/distributed databases
  • Via replication, so even if a node has failed, another copy is available
■ Consistency (atomicity) is important for replicated data
  • All live replicas have the same value, and each read sees the latest version
  • Often implemented using majority protocols
■ Network partitions (network can break into two or more parts, each with active systems that can’t talk to other parts)
■ In the presence of partitions, a system cannot guarantee both availability and consistency
  • Brewer’s CAP “Theorem”
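
A majority protocol and its behaviour under a partition can be sketched as follows. This is a simplified single-value register under stated assumptions: the class names, the reachable argument (the set of replica indexes on the caller's side of the partition), and the version scheme are hypothetical, and concurrency is ignored.

```python
class Replica:
    def __init__(self):
        self.version = 0
        self.value = None


class MajorityRegister:
    """Single replicated value; reads and writes must reach a majority of replicas."""

    def __init__(self, n=3):
        self.replicas = [Replica() for _ in range(n)]
        self.majority = n // 2 + 1

    def _live(self, reachable):
        live = [self.replicas[i] for i in reachable]
        if len(live) < self.majority:
            # Consistency is preserved by giving up availability on this side.
            raise RuntimeError("no majority reachable: operation refused")
        return live

    def write(self, value, reachable):
        live = self._live(reachable)
        new_version = max(r.version for r in live) + 1
        for r in live:
            r.version, r.value = new_version, value

    def read(self, reachable):
        # Any two majorities overlap, so the highest version seen is the latest.
        live = self._live(reachable)
        return max(live, key=lambda r: r.version).value


reg = MajorityRegister(n=3)
reg.write("a", reachable={0, 1, 2})
reg.write("b", reachable={0, 1})      # replica 2 is cut off and misses this write
latest = reg.read(reachable={1, 2})   # still sees "b" via replica 1
```

The minority side of a partition (for example, replica 2 alone) is refused service, which is the availability cost that the CAP result describes.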

The MapReduce Paradigm

■ Platform for reliable, scalable parallel computing
■ Abstracts issues of the distributed and parallel environment from the programmer
  • Programmer provides core logic (via map() and reduce() functions)
  • System takes care of parallelization of computation, coordination, etc.
■ Paradigm dates back many decades
  • But very large scale implementations running on clusters with 10^3 to 10^4 machines are more recent
  • Google MapReduce, Hadoop, ...
■ Data storage/access typically done using distributed file systems (e.g., HDFS) or key-value stores (e.g., HBase)
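
The division of labour between the programmer and the system can be sketched with the classic word-count example. The functions map_fn and reduce_fn are what the programmer would supply; run_mapreduce is a hypothetical single-process stand-in for the framework and is not the Hadoop API.

```python
from collections import defaultdict


def map_fn(_doc_id, text):
    """Programmer-supplied map(): emit (word, 1) for every word in a document."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Programmer-supplied reduce(): combine all values emitted for one key."""
    return word, sum(counts)


def run_mapreduce(records, map_fn, reduce_fn):
    """Toy driver: a real system would run map and reduce tasks on many machines."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)  # "shuffle": group values by key
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))


documents = [("d1", "big data big"), ("d2", "data stream")]
word_counts = run_mapreduce(documents, map_fn, reduce_fn)
```

Neither map_fn nor reduce_fn mentions machines, failures, or data placement; those concerns live entirely in the driver, which is the point of the abstraction.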

Algebraic Operations

■ Current generation execution engines
  • Natively support algebraic operations such as joins, aggregation, etc.
  • Allow users to create their own algebraic operators
  • Support trees of algebraic operators that can be executed on multiple nodes in parallel
■ E.g., Apache Tez, Spark
  • Tez provides a low-level API; Hive on Tez compiles SQL to Tez
  • Spark provides a more user-friendly API
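
A tree of algebraic operators can be sketched as nested function calls, with a join feeding an aggregation. The operators scan, hash_join, and aggregate_sum and the sample tables are hypothetical, and everything runs in one process; this is not the Tez or Spark API, which would distribute each operator across nodes.

```python
from collections import defaultdict


def scan(rows):
    """Leaf operator: produce the rows of an input relation."""
    yield from rows


def hash_join(left, right, left_key, right_key):
    """Join operator: build a hash index on the right input, probe with the left."""
    index = defaultdict(list)
    for row in right:
        index[row[right_key]].append(row)
    for l_row in left:
        for r_row in index.get(l_row[left_key], []):
            yield {**l_row, **r_row}


def aggregate_sum(rows, group_key, value_key):
    """Aggregation operator: sum value_key for each distinct group_key."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[group_key]] += row[value_key]
    return dict(totals)


orders = [
    {"cust_id": 1, "amount": 30},
    {"cust_id": 2, "amount": 5},
    {"cust_id": 1, "amount": 12},
]
customers = [{"cust_id": 1, "city": "Pune"}, {"cust_id": 2, "city": "Delhi"}]

# Operator tree: aggregate_sum( hash_join( scan(orders), scan(customers) ) )
totals_by_city = aggregate_sum(
    hash_join(scan(orders), scan(customers), "cust_id", "cust_id"),
    group_key="city",
    value_key="amount",
)
```

The nesting of the calls mirrors the shape of the operator tree, which is what an engine would parallelize by partitioning each operator's input.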
