Chapter 10: Big Data

Database System Concepts, 7th Ed.

©Silberschatz, Korth and Sudarshan
See www.db-book.com for conditions on re-use
Motivation

 Very large volumes of data being collected

• Driven by growth of web, social media, and more recently internet-of-things
• Web logs were an early source of data
 Analytics on web logs has great value for advertising, web
site structuring, choosing which posts to show to a user, etc.
 Big Data: differentiated from data handled by earlier generation
databases
• Volume: much larger amounts of data stored
• Velocity: much higher rates of insertions
• Variety: many types of data, beyond relational data
• Veracity: the data might not be correct
• Value: the data provides benefit to the user

Database System Concepts - 7th Edition 10.2 ©Silberschatz, Korth and Sudarshan
Querying Big Data

 Transaction processing systems that need very high scalability

• Many applications willing to sacrifice ACID properties and other
database features, if they can get very high scalability
• Accept BASE
 Basically Available
 Soft state
 Eventually consistent
• Examples – Facebook, Stack Overflow, Google
 Query processing systems that
• Need very high scalability, and/or
• Need to support non-relational data
• Examples – Gene analysis, Pharmacology, Literature analysis

Distributed File Systems

 A distributed file system stores data across a large collection of machines,
but provides a single file-system view
 Highly scalable distributed file system for large data-intensive applications.
• E.g., 10K nodes, 100 million files, 10 PB
 Provides redundant storage of massive amounts of data on cheap and
unreliable computers
• Files are replicated to handle hardware failure
• Detects failures and recovers from them
 Frequently, data is immutable (write once/read many)
 Examples:
• Google File System (GFS)
• Hadoop Distributed File System (HDFS)
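
The replication idea above can be illustrated with a toy model. This is only a sketch under stated assumptions: the class name TinyDFS, the round-robin placement, and the tiny block size are hypothetical and are not how GFS or HDFS actually place blocks.

```python
class TinyDFS:
    """Toy model of a distributed file system with replicated blocks."""

    def __init__(self, node_names, block_size=4, replication=3):
        self.nodes = {name: {} for name in node_names}  # node -> {block_id: bytes}
        self.block_size = block_size
        self.replication = min(replication, len(self.nodes))
        self.files = {}      # file name -> [(block_id, [holder nodes])]
        self.failed = set()  # nodes currently considered down

    def write(self, name, data):
        """Write once: split data into blocks and copy each block to several nodes."""
        if name in self.files:
            raise FileExistsError(name)  # files are immutable in this model
        names = sorted(self.nodes)
        blocks = []
        for offset in range(0, len(data), self.block_size):
            index = offset // self.block_size
            block_id = f"{name}#{index}"
            # Assumed round-robin placement; real systems also weigh racks and load.
            holders = [names[(index + r) % len(names)]
                       for r in range(self.replication)]
            for holder in holders:
                self.nodes[holder][block_id] = data[offset:offset + self.block_size]
            blocks.append((block_id, holders))
        self.files[name] = blocks

    def read(self, name):
        """Read many: fetch each block from any replica that is still alive."""
        chunks = []
        for block_id, holders in self.files[name]:
            live = [h for h in holders if h not in self.failed]
            if not live:
                raise IOError(f"no live replica for {block_id}")
            chunks.append(self.nodes[live[0]][block_id])
        return b"".join(chunks)
```

With four nodes and three copies per block, marking any two nodes as failed still leaves one live copy of every block, so read() returns the full file.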

Key Value Storage Systems

 Wide-column variants are sometimes loosely called columnar data stores

 Key-value storage systems store large numbers (billions or even more) of
small (KB-MB) sized records
 Records are partitioned by key across multiple machines
 Queries are routed by the system to the appropriate machine
 Records are also replicated across multiple machines, to ensure
availability even if a machine fails
• Key-value stores ensure that updates are applied to all replicas, to
ensure that their values are consistent
• On immutable DFS
 Versions for updates
 Tombstones for deletions
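
Routing and replication can be sketched in a few lines. The sketch assumes that records are assigned to machines by hashing the key and that replicas live on the next machines in list order. Both choices, and the function names replicas_for and route_read, are illustrative assumptions rather than the scheme of any particular system.

```python
import hashlib


def replicas_for(key, nodes, copies=3):
    """Return the nodes that should hold `key`: a primary chosen by hashing
    the key, followed by the next nodes in list order (wrapping around)."""
    copies = min(copies, len(nodes))
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(copies)]


def route_read(key, nodes, failed, copies=3):
    """Route a read to the first replica that is not known to have failed."""
    for node in replicas_for(key, nodes, copies):
        if node not in failed:
            return node
    raise RuntimeError(f"all replicas of {key!r} are unavailable")
```

Because the hash is deterministic, every client computes the same replica list for a key, and a read falls through to the next replica when the primary is down.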

Key Value Storage Systems

 Key-value stores may store

• uninterpreted bytes, with an associated key
 E.g., Amazon S3, Amazon Dynamo
• Wide-table (can have arbitrarily many attribute names) with
associated key
 E.g., Google BigTable, Apache Cassandra, Apache HBase, Amazon
DynamoDB
 Allows some operations (e.g., filtering) to execute on storage
node
• JSON
 MongoDB, CouchDB (document model)
 Document stores store semi-structured data, typically JSON

Key Value Storage Systems

 Key-value stores support

• put(key, value): stores a value with an associated key
• get(key): retrieves the stored value associated with the
specified key
• delete(key): removes the key and its associated value
• CRUD (Create, Read, Update, Delete) interface
• NoSQL (Not Only SQL)
 Some systems also support range queries on key values
 Document stores also support queries on non-key attributes
• See book for MongoDB queries
 Key value stores are not full database systems
• Have no/limited support for transactional updates
• Applications must manage query processing on their own
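
The put/get/delete interface, together with versions for updates and tombstones for deletions, can be sketched as a single-process toy. The class name TinyKVStore and its append-only list are assumptions made for illustration. A real store would persist the log and replicate it.

```python
_TOMBSTONE = object()  # marker written in place of a value when a key is deleted


class TinyKVStore:
    """In-memory key-value store over an append-only log (nothing is overwritten)."""

    def __init__(self):
        self._log = []      # append-only entries: (key, version, value)
        self._latest = {}   # key -> position in the log of its newest entry
        self._version = 0

    def put(self, key, value):
        """Store a new version of `key`; older versions stay in the log."""
        self._version += 1
        self._log.append((key, self._version, value))
        self._latest[key] = len(self._log) - 1

    def get(self, key):
        """Return the newest value, or None if the key is missing or deleted."""
        position = self._latest.get(key)
        if position is None:
            return None
        value = self._log[position][2]
        return None if value is _TOMBSTONE else value

    def delete(self, key):
        """Deletion appends a tombstone instead of removing earlier entries."""
        self.put(key, _TOMBSTONE)

    def range(self, low, high):
        """Range query on keys (inclusive bounds), skipping deleted keys."""
        keys = sorted(k for k in self._latest if low <= k <= high)
        return [(k, self.get(k)) for k in keys if self.get(k) is not None]
```

Note that the store offers no multi-key transactions: each call touches one key, which matches the limited transactional support described above.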

Streaming Data and Applications

 Streaming data refers to data that arrives in a continuous fashion

• Contrast to data-at-rest
 Applications include:
• Stock market: stream of trades
• e-commerce site: purchases, searches
• Sensors: sensor readings
 Internet of things
• Network monitoring data
• Social media: tweets and posts can be viewed as a stream
 Queries on streams can be very useful
• Monitoring, alerts, automated triggering of actions

Approaches to querying streams:

 Windowing: Break up stream into windows, and queries are run on
windows
• Stream query languages support window operations
• Windows may be based on time or tuples
• Must figure out when all tuples in a window have been seen
 Easy if stream totally ordered by timestamp
 Punctuations specify a predicate that no future tuple will satisfy
(e.g., timestamp at most some value, so all later tuples have a greater timestamp)
 Continuous Queries: Queries written e.g. in SQL, output partial results
based on stream seen so far; query results updated continuously
• Have some applications, but can lead to flood of updates
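
Time-based windowing can be sketched for the easy case mentioned above, where the stream is totally ordered by timestamp. The function name tumbling_counts and the (timestamp, value) tuple layout are assumptions made for this sketch. Windows with no tuples are simply not reported.

```python
def tumbling_counts(events, width):
    """Group a timestamp-ordered stream into fixed-width, non-overlapping windows.

    `events` yields (timestamp, value) pairs in non-decreasing timestamp order.
    Yields (window_start, count, total) as soon as a window is known to be complete.
    """
    current_start = None
    count = 0
    total = 0.0
    for timestamp, value in events:
        start = (timestamp // width) * width
        if current_start is None:
            current_start = start
        elif start != current_start:
            # Ordering guarantees no later tuple belongs to the old window,
            # so its result can be emitted now.
            yield (current_start, count, total)
            current_start, count, total = start, 0, 0.0
        count += 1
        total += value
    if current_start is not None:
        yield (current_start, count, total)  # flush the last window at end of input
```

If the stream were not ordered, the arrival of a later timestamp would not prove that a window is complete. That is the gap punctuations are meant to close.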

Approaches to querying streams (Cont.):

 Algebraic operators on streams:
• Each operator consumes tuples from a stream and outputs tuples
• Operators can be written e.g., in an imperative language
• Operator may maintain state
 Pattern matching:
• Queries specify patterns, system detects occurrences of patterns
and triggers actions
• Complex Event Processing (CEP) systems
• E.g., Microsoft StreamInsight, Flink CEP, Oracle Event Processing
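
Both ideas on this slide can be sketched with Python generators. The first function is a stateful operator and the second is a very small pattern detector. The names running_average and rising_runs and the (symbol, price) tuple layout are hypothetical. Real CEP systems offer far richer pattern languages.

```python
def running_average(stream):
    """Stateful operator: for each (symbol, price) tuple, output the
    average price seen so far for that symbol."""
    totals = {}  # operator state: symbol -> (sum of prices, number of tuples)
    for symbol, price in stream:
        price_sum, n = totals.get(symbol, (0.0, 0))
        price_sum, n = price_sum + price, n + 1
        totals[symbol] = (price_sum, n)
        yield (symbol, price_sum / n)


def rising_runs(stream, length=3):
    """Pattern operator: output (symbol, price) each time a symbol's price
    has risen `length` times in a row, then start counting again."""
    last = {}    # symbol -> previous price
    streak = {}  # symbol -> number of consecutive rises so far
    for symbol, price in stream:
        if symbol in last and price > last[symbol]:
            streak[symbol] = streak.get(symbol, 0) + 1
        else:
            streak[symbol] = 0
        last[symbol] = price
        if streak[symbol] >= length:
            yield (symbol, price)
            streak[symbol] = 0
```

Because each operator consumes one stream and produces another, they compose: the output of one generator can be passed directly as the input of the next.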

Parallel Graph Processing

 Very large graphs (billions of nodes, trillions of edges)

• Web graph: web pages are nodes, hyper links are edges
• Social network graph: people are nodes, friend/follow links are edges
 Two popular approaches for parallel processing on such graphs
• Map-reduce and algebraic frameworks
• Bulk synchronous processing (BSP) framework
 Multiple iterations are required for any computations on graphs
• Map-reduce/algebraic frameworks often have high overheads per
iteration
• BSP frameworks have much lower per-iteration overheads
 Google’s Pregel system popularized the BSP framework
 Apache Giraph is an open-source version of Pregel
 Apache Spark’s GraphX component provides a Pregel-like API
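
The bulk synchronous processing model can be sketched on one machine: in each superstep every vertex with incoming messages updates its value and sends messages, and a barrier separates supersteps. The function name bsp_shortest_hops and the adjacency-dict input are assumptions for illustration. This is not the Pregel or Giraph API.

```python
def bsp_shortest_hops(edges, source):
    """Compute hop counts from `source` using vertex-centric supersteps.

    `edges` maps each vertex to a list of its out-neighbours.
    Returns (distances, number_of_supersteps).
    """
    vertices = set(edges) | {v for nbrs in edges.values() for v in nbrs}
    dist = {v: float("inf") for v in vertices}
    inbox = {source: [0]}  # messages to be delivered in the next superstep
    supersteps = 0
    while inbox:
        outbox = {}
        # Within a superstep, vertices only read their own inbox,
        # so this loop could run in parallel across machines.
        for vertex, messages in inbox.items():
            best = min(messages)
            if best < dist[vertex]:
                dist[vertex] = best
                for nbr in edges.get(vertex, []):
                    outbox.setdefault(nbr, []).append(best + 1)
        inbox = outbox  # barrier: messages become visible only now
        supersteps += 1
    return dist, supersteps
```

The computation stops when no vertex sends a message, which mirrors the way vertices in a BSP system vote to halt.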

Replication and Consistency

 Availability (system can run even if parts have failed) is essential for
parallel/distributed databases
• Via replication, so even if a node has failed, another copy is available
 Consistency (atomicity) is important for replicated data
• All live replicas have same value, and each read sees latest version
• Often implemented using majority protocols
 Network partitions (network can break into two or more parts, each with
active systems that can’t talk to other parts)
 In presence of partitions, cannot guarantee both availability and
consistency
• Brewer’s CAP “Theorem”
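
A majority protocol, and the trade-off it forces during a partition, can be sketched as follows. The class ReplicatedRegister and the exception Unavailable are hypothetical names. The sketch holds all replicas in one process and passes the set of reachable replicas explicitly to stand in for the network.

```python
class Unavailable(Exception):
    """Raised when a majority of replicas cannot be reached."""


class ReplicatedRegister:
    """A single value replicated n times, read and written by majority."""

    def __init__(self, n):
        self.replicas = [(0, None)] * n  # each replica holds (version, value)
        self.majority = n // 2 + 1

    def _check(self, reachable):
        reachable = set(reachable)
        if len(reachable) < self.majority:
            # Consistency is kept by refusing service: availability is lost.
            raise Unavailable(f"need {self.majority}, reached {len(reachable)}")
        return reachable

    def write(self, value, reachable):
        """Write to a majority with a version higher than any it has seen."""
        reachable = self._check(reachable)
        version = max(self.replicas[i][0] for i in reachable) + 1
        for i in reachable:
            self.replicas[i] = (version, value)
        return version

    def read(self, reachable):
        """Read from a majority and return the value with the highest version."""
        reachable = self._check(reachable)
        newest = max((self.replicas[i] for i in reachable), key=lambda r: r[0])
        return newest[1]
```

Any two majorities share at least one replica, so a read always sees the latest completed write. The side of a partition holding only a minority cannot serve requests at all, which is the availability cost described above.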

The MapReduce Paradigm

 Platform for reliable, scalable parallel computing

 Abstracts issues of distributed and parallel environment from programmer
• Programmer provides core logic (via map() and reduce() functions)
• System takes care of parallelization of computation, coordination, etc.
 Paradigm dates back many decades
• But very large scale implementations running on clusters with 10^3 to
10^4 machines are more recent
• Google MapReduce, Hadoop, ..
 Data storage/access typically done using distributed file systems
(HDFS) or key-value stores (HBase)
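
The division of labour between programmer and system can be sketched with the classic word-count example. Here map_fn and reduce_fn are what the programmer writes, and run_mapreduce is a single-process stand-in for the framework. All three names are assumptions and are not the Hadoop API.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Programmer-supplied map(): emit (word, 1) for every word in a line."""
    for word in line.lower().split():
        yield (word, 1)


def reduce_fn(word, counts):
    """Programmer-supplied reduce(): sum the counts collected for one word."""
    yield (word, sum(counts))


def run_mapreduce(records, map_fn, reduce_fn):
    """Stand-in for the system: apply map, group by key (shuffle), apply reduce.

    A real framework would run the map and reduce calls on many machines
    and handle coordination and failures. This sketch runs them in one loop.
    """
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)  # shuffle: group values by key
    output = []
    for out_key in sorted(groups):
        output.extend(reduce_fn(out_key, groups[out_key]))
    return output
```

For example, run_mapreduce([(0, "big data big"), (1, "data systems")], map_fn, reduce_fn) returns [("big", 2), ("data", 2), ("systems", 1)].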

Algebraic Operations

 Current generation execution engines

• Natively support algebraic operations such as joins, aggregation, etc.
• Allow users to create their own algebraic operators
• Support trees of algebraic operators that can be executed on multiple
nodes in parallel
 E.g. Apache Tez, Spark
• Tez provides low-level API; Hive on Tez compiles SQL to Tez
• Spark provides more user-friendly API
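
A tree of algebraic operators can be sketched with plain Python generators: a join feeding an aggregation. The operator names scan, hash_join and aggregate_sum and the dictionary row format are assumptions for illustration. This is neither the Tez nor the Spark API, and it runs in a single process rather than in parallel.

```python
def scan(rows):
    """Leaf operator: produce the rows of an input relation."""
    yield from rows


def hash_join(left, right, key):
    """Join operator: build a hash table on `right`, then probe it with `left`."""
    table = {}
    for row in right:
        table.setdefault(row[key], []).append(row)
    for row in left:
        for match in table.get(row[key], []):
            yield {**row, **match}


def aggregate_sum(rows, group_key, value_key):
    """Aggregation operator: sum `value_key` for each value of `group_key`."""
    totals = {}
    for row in rows:
        totals[row[group_key]] = totals.get(row[group_key], 0) + row[value_key]
    for group in sorted(totals):
        yield {group_key: group, "total": totals[group]}


def total_by_city(orders, customers):
    """Operator tree: aggregate_sum(hash_join(scan(orders), scan(customers)))."""
    joined = hash_join(scan(orders), scan(customers), "cust_id")
    return list(aggregate_sum(joined, "city", "amount"))
```

Each operator only consumes and produces row streams, so a parallel engine could partition the inputs by cust_id and run copies of the same tree on many nodes.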
