Lecture 2 - Relational Data Processing
What should you be able to do after this week?
Query a relational database consisting of several tables with SQL (joins & aggregations)
Distinguish the benefits & drawbacks of different types of distributed relational databases
Summarise basic techniques in distributed query processing (e.g., different partitioning strategies and join types)
Relational Database Systems - A Refresher
Relational Data
Relational Operators - Projection
The Projection operator modifies each row of a table individually
Remove columns
Add new columns by evaluating expressions
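A minimal SQL sketch of both points, using a hypothetical sales table (table and column names are assumptions):

```sql
-- Keep only some of the columns (remove the rest) and add a new
-- column by evaluating an expression on each row.
SELECT
    sale_id,
    quantity,
    quantity * price AS revenue   -- new column computed from an expression
FROM sales;
```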
Relational Operators - (Grouped) Aggregation
The Aggregation operator aggregates information across multiple rows
Compute an aggregate value (e.g., a sum) across the rows of each group
Groups are defined by a grouping key (otherwise, the whole table is aggregated)
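For example (same hypothetical sales table as above):

```sql
-- Grouped aggregation: one output row per distinct region,
-- with the sum computed across the rows of each group.
SELECT region, SUM(amount) AS total_amount
FROM sales
GROUP BY region;

-- Without a grouping key, the whole table is aggregated into a single row.
SELECT SUM(amount) AS total_amount
FROM sales;
```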
Relational Operators - Join
The Join operator combines information from two tables
Tables are typically joined by a key (= combine rows with matching keys)
If no key is given, a join produces the Cartesian product (all pairs of rows)
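A short SQL sketch, assuming hypothetical sales and customers tables that share a customer_id key:

```sql
-- Key-based join: combine rows from both tables whose customer_id matches.
SELECT c.name, s.amount
FROM sales s
JOIN customers c ON s.customer_id = c.customer_id;

-- Without a join condition, the result is the Cartesian product (all pairs of rows).
SELECT *
FROM sales CROSS JOIN customers;
```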
The life of a Relational Database Query
Distributed Database Fundamentals
What is a Distributed Database?
Simply put:
It is a database that is spread (distributed) across multiple machines.
Also important:
For an end-user, interacting with a distributed database should be indistinguishable from a non-distributed one.
Why do we distribute data?
Performance
With data sizes growing exponentially, the need for fast data processing is outgrowing individual machines
Elasticity
The database can be quickly & flexibly scaled to fit the requirements by adding (or removing) resources
Fault-Tolerance
Running on more than one node allows the system to better recover from hardware failures
How do we classify distributed databases?
Multiple (often overlapping) dimensions:
Scalability: Scale-up vs Scale-Out
Implementation: Parallel vs Distributed
Parallel Database:
Runs on tightly-coupled nodes (e.g., a cluster or a multi-processor/multi-core system)
Implementation focus on multi-threading, inter-process communication
Main goal is usually to achieve peak performance
⇒ Typically a scale-up architecture
Distributed Database:
Runs on loosely-coupled nodes (e.g., individual machines, cloud resources)
Implementation focus on data distribution, network efficiency, distributed algorithms
Main goal is usually to achieve scalability, fault-tolerance, or elasticity
⇒ Typically a scale-out architecture
⇒ Often not a clear-cut distinction: Most distributed databases are also parallel!
Application: Analytical vs Operational
Online Analytical Processing (OLAP):
Focus on a few, complex, long-running analytical queries
Data changes slowly, typically via bulk inserts or trickle loading
Think: Market Research, Scientific Databases, Data Mining
Online Transactional Processing (OLTP):
Focus on multiple concurrent, simple, short-running transactional queries
Data changes rapidly, typically via point updates
Think: Account Management, Financial Transactions, Store Inventory
Architecture: Shared Memory vs Shared Disk vs Shared Nothing
Shared Memory:
All nodes have shared access to both memory & disk
Essentially: Multi-core Server
Typical architecture found in scale-up, parallel databases:
Postgres, Oracle, SQL Server
Main-Memory DBMSs like Apache Ignite, Hyper, SAP Hana
Can achieve very high performance, but is hard to scale when running out of resources
Shared Disk:
Nodes have their own CPU & memory, but share the same disk.
Example: Enterprise Mainframe with NAS (network-attached storage)
Most commonly found in traditional, enterprise-grade RDBMS systems:
Oracle, MS SQL Server
Shared Nothing:
Data is spread across independent nodes that only communicate via the network
Typical architecture found in “web scale”, scale-out systems:
Dataflow systems like Apache Hadoop / Spark / Flink
Distributed Databases, Key-Value Stores
Robust architecture that offers availability & scalability, but can be slower than shared-memory / shared-disk
Distributed Query Processing
In theory, distributed query processing is straightforward:
Step 1: Shuffle your data around so you have the required parts available on the nodes
Step 2: Run the local algorithm to evaluate the operator on the nodes
In fact, here’s a super-simple algorithm to run any operator distributed:
Step 1: Send all of the data to a single central node
Step 2: Run the local algorithm on the central node
Obviously, this naïve approach has major drawbacks:
We send a lot of data across the network to the central node & serialize the execution
It’s not scalable, and effectively eliminates the advantage of running on multiple nodes
Still: This can be a useful strategy in some cases!
Intuition
Data Shuffling Primitives: Broadcasting
Each node sends a copy of all their data to all other nodes
Data Shuffling Primitives: Range Partitioning
Each node receives a predefined range of the key space
Data Shuffling Primitives: Hash Partitioning
Each node receives a portion of the key space determined by a hash function
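As a rough sketch, both partitioning strategies boil down to an expression that assigns each row to a target node. The example below assumes two nodes and a hypothetical table r with an integer column part_key, and uses MOD as a simple stand-in for a real hash function:

```sql
-- Range partitioning: keys below the split point go to node 0, the rest to node 1.
SELECT part_key, payload,
       CASE WHEN part_key < 1000 THEN 0 ELSE 1 END AS target_node
FROM r;

-- Hash partitioning: the (stand-in) hash of the key determines the node.
SELECT part_key, payload,
       MOD(part_key, 2) AS target_node
FROM r;
```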
Task: Shuffling
How is the data distributed over node 0 and node 1 when we apply different data shuffling strategies?
Fill the grey boxes with the data items designated for a particular node.
Distributed Selection/Projection
Easiest operators to run distributed:
Operators process each row individually - No need to shuffle data around!
Distributed GroupBy/Aggregation
(1) Hash partition on grouping key to collect all tuples with same key
(2) Compute aggregation locally on each node
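A hedged sketch of step (2): after hash-partitioning the hypothetical sales table on region, every row of a given region lives on exactly one node, so each node runs the plain local aggregation on its own fragment, and the global result is just the union of the per-node outputs:

```sql
-- Executed independently on every node; sales_fragment stands for that
-- node's local partition (the name is an assumption).
SELECT region, SUM(amount) AS total_amount
FROM sales_fragment
GROUP BY region;
```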
Distributed Joins
Complex operator to run distributed
Need to ensure that matching rows end up on the same node
General Strategy:
Shuffle data around to ensure that matching pairs are on the same node
Then run a local join algorithm
Optimal strategy depends on:
How data is partitioned / distributed across the nodes
The size of the individual tables
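The strategies below all evaluate the same logical join and differ only in which shuffling step precedes the local join. As a running example (same hypothetical tables as before):

```sql
-- The logical query is identical under every strategy; only how sales and
-- customers are shuffled across the nodes beforehand changes.
SELECT c.name, s.amount
FROM sales s
JOIN customers c ON s.customer_id = c.customer_id;
```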
Co-Located Join
Best case:
Both tables are partitioned by the join keys — no need to reshuffle data, just run join locally!
Asymmetric Repartition Join
If only one of the tables is partitioned by the join key: Hash-partition the other one by the join key, then run the join locally
Symmetric Repartition Join
General case: The tables are partitioned differently
If both tables are roughly the same size, we hash-partition both by the join key, then run the join locally
Broadcast Join
General case: The tables are partitioned differently
If one table is a lot smaller than the other, broadcast the small table, then run the join locally
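Distributed engines usually pick this strategy automatically based on size estimates; some also accept an explicit hint. For instance, Spark SQL supports a broadcast hint on the running example above (sketch; table names remain assumptions):

```sql
-- Asks Spark SQL to broadcast the (small) customers table to every node,
-- so the large sales table never has to be reshuffled for this join.
SELECT /*+ BROADCAST(c) */ c.name, s.amount
FROM sales s
JOIN customers c ON s.customer_id = c.customer_id;
```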
Task: Joins
Examples for Distributed and Parallel Database Architectures
Let’s go over a few explicit examples of distributed / parallel database systems:
In-Memory Databases
Distributed Key-Value Stores
Data Warehousing Systems
Cloud DBMS
In-Memory Databases
Scale-up, shared-memory, parallel database engine
Usually targeting both analytical, as well as operational workloads
Data is kept in memory, allowing extremely fast access
Often on a single, beefy node with multiple TBs of main memory
Focus on CPU efficiency / multi-threading
Columnar data layout, Compressed Execution, Vectorized (SIMD) operations, Lock-free algorithms
Typical applications are time-critical systems
Real-time systems, Critical Business Intelligence Solutions, Dashboarding Backends, Trading Systems, ...
Examples:
SAP Hana, Hyper, Apache Ignite
Distributed Key-Value Stores
Scale-out, shared-nothing, distributed, operational database engine
Provide transactional access to key-value pairs:
User provides a key to read/write a given value (think: hash table)
Focus on fault-tolerance and transaction speed
Keys are often mapped to nodes via a consistent hash function
Allows concurrent access to thousands of different keys per second
Replication is used to guarantee fault-tolerance
Typical use-cases are backends for web applications, web stores, caches
Examples:
Amazon DynamoDB, Apache Cassandra, FoundationDB
Data Warehousing Systems
Shared-nothing, scale-out, distributed, analytical database engine
Data is partitioned across multiple nodes of a cluster
“Star Schema”
User often has to provide an explicit partitioning strategy (see the sketch at the end of this section)
Focus on read / IO-performance:
Columnar data layout, compressed storage, exploiting data partitioning, aggressive utilization of metadata to avoid scans
Typical use cases are Business Intelligence (BI), Reporting, Operational Management, …
Examples:
Redshift, Teradata, Vertica, Oracle Exadata, Postgres
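As the sketch for the explicit partitioning strategy mentioned above: in Amazon Redshift, for example, the distribution key is declared in the table definition (hypothetical table; syntax abbreviated):

```sql
-- DISTKEY hash-partitions the rows across the cluster nodes by customer_id;
-- SORTKEY controls the on-disk sort order, letting the engine use block
-- metadata to skip data during scans.
CREATE TABLE sales (
    sale_id     BIGINT,
    customer_id BIGINT,
    region      VARCHAR(32),
    amount      DECIMAL(12,2)
)
DISTKEY (customer_id)
SORTKEY (region);
```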
Cloud RDBMS
Architectural evolution of Data Warehousing Systems for modern Cloud Environments
Builds on Shared Nothing, but keeps data in cloud storage
Nodes do not “own” data, they only access what they need to process the query from cloud storage.
Transactions and access consistency are handled centrally via a distributed key value store.
Implementation focus on extreme elasticity:
Cloud Resources are “infinite”, can be provisioned within seconds.
Allows accessing the data from 1000s of nodes concurrently
Scale resources up & down exactly as and when needed.
Use cases are similar to Data Warehousing Systems, but often with a focus on larger enterprise deployments