Lecture 2 - Relational Data Processing

This lecture covers relational data processing, focusing on querying relational databases using SQL, the benefits and drawbacks of distributed relational databases, and techniques in distributed query processing. Key topics include relational operators like projection, aggregation, and join, as well as the fundamentals of distributed databases, their classification, and examples of distributed and parallel database architectures. The lecture emphasizes the importance of scalability, fault-tolerance, and efficient data processing strategies in distributed systems.


Lecture 2 - Relational Data Processing

What should you be able to do after this week?


Query a relational database consisting of several tables with SQL (joins & aggregations)
Distinguish the benefits & drawbacks of different types of distributed relational databases
Summarise basic techniques in distributed query processing (e.g., different partitioning strategies and join types)

Relational Database Systems - A Refresher


Relational Data

Relational Operators - Projection


The Projection operator modifies each row of a table individually

Remove columns

Add new columns by evaluating expressions
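The projection operator can be tried out directly with SQLite from Python's standard library. The `employees` table and its columns below are made up purely for illustration; the lecture does not fix a schema.

```python
import sqlite3

# Hypothetical table for illustration only.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employees (name TEXT, salary REAL, bonus REAL)")
con.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                [("Ann", 50000, 5000), ("Bob", 45000, 2000)])

# Projection: keep only some columns, and add a new column
# computed from an expression (salary + bonus).
rows = con.execute(
    "SELECT name, salary + bonus AS total_pay FROM employees").fetchall()
print(rows)  # [('Ann', 55000.0), ('Bob', 47000.0)]
```

Note that each input row produces exactly one output row; projection never looks at other rows.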

Relational Operators - (Grouped) Aggregation


The Aggregation operator aggregates information across multiple rows

Compute an aggregate value (e.g., a sum) across the rows of each group

Groups are defined by a grouping key (otherwise, the whole table is aggregated)
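Grouped and ungrouped aggregation can be sketched the same way; the `orders` table here is again a made-up example, not part of the lecture.

```python
import sqlite3

# Hypothetical orders table, purely for illustration.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [("Ann", 10.0), ("Bob", 5.0), ("Ann", 20.0)])

# Grouped aggregation: one SUM per grouping key (customer).
grouped = con.execute(
    "SELECT customer, SUM(amount) FROM orders "
    "GROUP BY customer ORDER BY customer").fetchall()

# Without a grouping key, the whole table collapses to one row.
total = con.execute("SELECT SUM(amount) FROM orders").fetchone()
print(grouped, total)  # [('Ann', 30.0), ('Bob', 5.0)] (35.0,)
```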

Relational Operators - Join


The Join operator combines information from two tables

Tables are typically joined by a key (= combine rows with matching keys)

If no key is given, a join produces the Cartesian product (all pairs of rows)
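Both behaviours, a key join and the Cartesian product, can be seen in one small SQLite session. The tables and key names are illustrative.

```python
import sqlite3

# Hypothetical tables; the schema is illustrative only.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employees (name TEXT, dept_id INTEGER)")
con.execute("CREATE TABLE departments (dept_id INTEGER, dept TEXT)")
con.executemany("INSERT INTO employees VALUES (?, ?)",
                [("Ann", 1), ("Bob", 2)])
con.executemany("INSERT INTO departments VALUES (?, ?)",
                [(1, "Sales"), (2, "R&D")])

# Key join: combine rows whose join keys match.
joined = con.execute(
    "SELECT e.name, d.dept FROM employees e "
    "JOIN departments d ON e.dept_id = d.dept_id "
    "ORDER BY e.name").fetchall()

# Without a join condition we get the Cartesian product: all row pairs.
pairs = con.execute(
    "SELECT COUNT(*) FROM employees CROSS JOIN departments").fetchone()[0]
print(joined, pairs)  # [('Ann', 'Sales'), ('Bob', 'R&D')] 4
```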

The life of a Relational Database Query

Distributed Database Fundamentals


What is a Distributed Database?
Simply spoken:

It is a database that is spread (distributed) across multiple machines

Also important:
For an end-user, interacting with a distributed database should be indistinguishable from interacting with a non-distributed one.

Why do we distribute data?


Performance

With data sizes growing exponentially, the need for fast data processing is outgrowing individual machines

Elasticity

The database can be quickly & flexibly scaled to fit the requirements by adding (or removing) resources

Fault-Tolerance

Running on more than one node allows the system to better recover from hardware failures

How do we classify distributed databases?


Multiple (often overlapping) dimensions:

Scalability: Scale-up vs Scale-Out


Implementation: Parallel vs Distributed

Parallel Database:

Runs on tightly-coupled nodes (e.g., a cluster, or a multi-processor/multi-core system)

Implementation focus on multi-threading, inter-process communication

Main goal is usually to achieve peak performance

⇒ Typically a scale-up architecture

Distributed Database:

Runs on loosely-coupled nodes (e.g., individual machines, cloud resources)

Implementation focus on data distribution, network efficiency, distributed algorithms

Main goal is usually to achieve scalability, fault-tolerance, or elasticity

⇒ Typically a scale-out architecture

⇒ Often not a clear cut: Most distributed databases are also parallel!

Application: Analytical vs Operational

Online Analytical Processing (OLAP):

Focus on a few, complex, long-running analytical queries

Data changes slowly, typically via bulk inserts or trickle loading

Think: Market Research, Scientific Databases, Data Mining

Online Transactional Processing (OLTP):

Focus on multiple concurrent, simple, short-running transactional queries

Data changes rapidly, typically via point updates

Think: Account Management, Financial Transactions, Store Inventory

Architecture: Shared Memory vs Shared Disk vs Shared Nothing

Shared Memory:

All nodes have shared access to both memory & disk

Essentially: Multi-core Server

Typical architecture found in scale-up, parallel databases

Postgres, Oracle, SQL Server

Main-Memory DBMS like Apache Ignite, Hyper, SAP Hana

Can achieve very high performance, but is hard to scale when running out of resources

Shared Disk:

Nodes have their own CPU & memory, but share the same disk.

Example: Enterprise Mainframe with NAS (network-attached storage)

Most commonly found in traditional, enterprise-grade RDBMSs

Oracle, MS SQL Server

Shared Nothing:

Data is spread across independent nodes that only communicate via the network

Typical architecture found in “web scale”, scale-out systems:

Dataflow systems like Apache Hadoop / Spark / Flink

Distributed Databases, Key-Value Stores

Robust architecture that offers availability & scalability, but can be slower than shared-memory / shared-disk

Distributed Query Processing


In theory, distributed query processing is straightforward:

Step 1: Shuffle your data around so you have the required parts available on the nodes

Step 2: Run the local algorithm to evaluate the operator on the nodes

In fact, here’s a super-simple algorithm to run any operator distributed:

Step 1: Send all of the data to a single central node

Step 2: Run the local algorithm on the central node

Obviously, this naïve approach has major drawbacks:

We send a lot of data across the network to the central node & serialize the execution

It’s not scalable, and effectively eliminates the advantage of running on multiple nodes


Still: This can be a useful strategy in some cases!

Intuition

Data Shuffling Primitives: Broadcasting


Each node sends a copy of all their data to all other nodes

Data Shuffling Primitives: Range Partitioning


Each node receives a predefined range of the key space

Data Shuffling Primitives: Hash Partitioning


Each node receives a portion of the key space determined by a hash function
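The three shuffling primitives above can be simulated in a few lines of plain Python, with each "node" modelled as a list. The data values, the range boundary of 5, and the `k % 2` hash are all made up for illustration.

```python
# Toy sketch of the three shuffling primitives on two "nodes".
NUM_NODES = 2
data = [3, 8, 1, 6, 4, 7]

# Broadcasting: every node receives a full copy of the data.
broadcast = [list(data) for _ in range(NUM_NODES)]

# Range partitioning: each node gets a predefined key range
# (here: keys < 5 go to node 0, keys >= 5 to node 1).
range_part = [[k for k in data if k < 5], [k for k in data if k >= 5]]

# Hash partitioning: a hash function decides the target node
# (here: k % 2 serves as a trivial hash function).
hash_part = [[], []]
for k in data:
    hash_part[k % NUM_NODES].append(k)

print(broadcast)   # both nodes hold all six values
print(range_part)  # [[3, 1, 4], [8, 6, 7]]
print(hash_part)   # [[8, 6, 4], [3, 1, 7]]
```

Note the trade-off: broadcasting copies everything everywhere, while range and hash partitioning split the data but differ in how evenly the keys spread.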

Lecture 2 - Relational Data Processing 5


Task: Shuffling
How is the data distributed over node 0 and node 1 when we apply different data shuffling strategies?

Fill the grey boxes with the data items designated for a particular node.

Distributed Selection/Projection
Easiest operators to run distributed:

Operators process each row individually - No need to shuffle data around!

Distributed GroupBy/Aggregation


(1) Hash partition on grouping key to collect all tuples with same key
(2) Compute aggregation locally on each node
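The two steps above can be sketched in plain Python, with nodes modelled as partitions of a tuple list. The input tuples are illustrative.

```python
from collections import defaultdict

# Sketch of distributed grouped aggregation (SUM) on two "nodes".
NUM_NODES = 2
tuples = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]

# Step 1: hash-partition on the grouping key, so all tuples with
# the same key land on the same node.
partitions = [[] for _ in range(NUM_NODES)]
for key, value in tuples:
    partitions[hash(key) % NUM_NODES].append((key, value))

# Step 2: each node aggregates its own partition locally.
local_sums = []
for part in partitions:
    sums = defaultdict(int)
    for key, value in part:
        sums[key] += value
    local_sums.append(dict(sums))

# Because step 1 co-located each key, the union of the local
# results is already the global result: no merge step needed.
result = {k: v for sums in local_sums for k, v in sums.items()}
# result == {'a': 4, 'b': 7, 'c': 4}
```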

Distributed Joins
Complex operator to run distributed

Need to ensure that matching rows end up on the same node

General Strategy:

Shuffle data around to ensure that matching pairs are on the same node

Then run a local join algorithm

Optimal strategy depends on:

How data is partitioned / distributed across the nodes

The size of the individual tables

Co-Located Join
Best case:

Both tables are partitioned by the join keys — no need to reshuffle data, just run join locally!

Asymmetric Repartition Join


If only one of the tables is partitioned by the join key: Hash-partition the other one by the join key, run
join locally


Symmetric Repartition Join
General case: The tables are partitioned differently

If both tables are roughly the same size, then we hash-partition both by the join key, then run the join
locally
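A symmetric repartition join can be sketched as follows; the tables, the join key `customer_id`, and the `% NUM_NODES` hash are made up for illustration.

```python
# Sketch: hash-partition BOTH tables by the join key, then join
# each pair of co-located partitions with a local hash join.
NUM_NODES = 2
orders = [(1, "pen"), (2, "ink"), (1, "pad")]  # (customer_id, item)
customers = [(1, "Ann"), (2, "Bob")]           # (customer_id, name)

def hash_partition(table, num_nodes):
    parts = [[] for _ in range(num_nodes)]
    for row in table:
        parts[row[0] % num_nodes].append(row)  # partition on join key
    return parts

o_parts = hash_partition(orders, NUM_NODES)
c_parts = hash_partition(customers, NUM_NODES)

# Matching keys are now guaranteed to sit on the same node, so each
# node runs an ordinary local hash join on its two partitions.
result = []
for node in range(NUM_NODES):
    lookup = {cid: name for cid, name in c_parts[node]}
    for cid, item in o_parts[node]:
        if cid in lookup:
            result.append((lookup[cid], item))
# sorted(result) == [('Ann', 'pad'), ('Ann', 'pen'), ('Bob', 'ink')]
```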

Broadcast Join
General case: The tables are partitioned differently

If one table is a lot smaller than the other, broadcast the small table, then run the join locally
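The broadcast variant avoids reshuffling the big table entirely; only the small table travels. The data below, including the pre-existing partitioning of the big table, is illustrative.

```python
# Sketch of a broadcast join on two "nodes": the small table is
# copied to every node, the big table stays partitioned in place.
big = [[(1, "pen"), (2, "ink")],   # node 0's partition (key, item)
       [(1, "pad"), (3, "toy")]]   # node 1's partition
small = [(1, "Ann"), (2, "Bob"), (3, "Cay")]  # (key, name)

result = []
for node_rows in big:
    # Each node receives its own full copy of the small table ...
    lookup = dict(small)
    # ... and joins its local big-table partition against it.
    for key, item in node_rows:
        result.append((lookup[key], item))
# sorted(result) ==
#   [('Ann', 'pad'), ('Ann', 'pen'), ('Bob', 'ink'), ('Cay', 'toy')]
```

The network cost is (size of small table) × (number of nodes), which beats repartitioning whenever the small table really is small.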

Task: Joins


Examples for Distributed and Parallel Database Architectures
Let’s go over a few explicit examples of distributed / parallel database systems:

In-Memory Database

Distributed Key Value Stores

Data Warehousing Systems

Cloud DBMS

In-Memory Databases
Scale-up, shared-memory, parallel database engine

Usually targeting both analytical, as well as operational workloads

Data is kept in memory, allowing extremely fast access

Often on a single, beefy node with multiple TBs of main memory

Focus on CPU efficiency / multi-threading

Columnar data layout, Compressed Execution, Vectorized (SIMD) operations, Lock-free algorithms

Typical applications are time-critical systems

Real-time systems, Critical Business Intelligence Solutions, Dashboarding Backends, Trading Systems,
...


Examples:

SAP Hana, Hyper, Apache Ignite

Distributed Key-Value Stores


Scale-out, shared-nothing, distributed, operational database engine
Provide transactional access to key-value pairs:

User provides a key to read/write a given value (think: hash table)

Focus on fault-tolerance and transaction speed

Keys are often mapped to nodes via a consistent hash function

Allows concurrent access to thousands of different keys per second

Replication is used to guarantee fault-tolerance

Typical use-cases are backends for web applications, web stores, caches
Examples:

Amazon DynamoDB, Apache Cassandra, FoundationDB
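The consistent-hashing idea mentioned above can be sketched with a minimal hash ring. Node names and the choice of MD5 are illustrative; real stores add virtual nodes and replication on top of this.

```python
import hashlib
from bisect import bisect

# Minimal consistent-hashing sketch: nodes sit at positions on a
# ring; each key belongs to the first node clockwise from its hash.
def ring_pos(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

nodes = ["node-a", "node-b", "node-c"]
ring = sorted((ring_pos(n), n) for n in nodes)

def lookup(key):
    positions = [p for p, _ in ring]
    idx = bisect(positions, ring_pos(key)) % len(ring)  # wrap around
    return ring[idx][1]

# Each key maps deterministically to one node; adding or removing a
# node only moves the keys in that node's segment of the ring.
owner = {k: lookup(k) for k in ["alice", "bob", "carol"]}
```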

Data Warehousing Systems


Shared-nothing, scale-out, distributed, analytical database engine
Data is partitioned across multiple nodes of a cluster

“Star Schema”

User often has to provide an explicit partitioning strategy

Focus on read / IO-performance:

Columnar data layout, compressed storage, exploiting data partitioning, aggressive utilization of
metadata to avoid scans

Typical use cases are Business Intelligence (BI), Reporting, Operational Management, …
Examples:

Redshift, Teradata, Vertica, Oracle Exadata, Postgres

Cloud RDBMS
Architectural evolution of Data Warehousing Systems for modern Cloud Environments
Builds on Shared Nothing, but keeps data in cloud storage

Nodes do not “own” data, they only access what they need to process the query from cloud storage.

Transactions and access consistency are handled centrally via a distributed key value store.

Implementation focus on extreme elasticity:

Cloud Resources are “infinite”, can be provisioned within seconds.

Allows accessing the data from 1000s of nodes concurrently

Scale resources up & down exactly as and when needed.

Use cases are similar to Data Warehousing Systems, but often with a focus on larger enterprise
deployments
