Lecture 2 - Relational Data Processing
Remove columns
With data sizes growing exponentially, the need for fast data processing is outgrowing individual
machines
Elasticity
The database can be quickly & flexibly scaled to fit the requirements by adding (or removing)
resources
Fault-Tolerance
Running on more than one node allows the system to better recover from hardware failures
Parallel Database:
Distributed Database:
⇒ Often not clear-cut: most distributed databases are also parallel!
Shared Disk:
Nodes have their own CPU & memory, but share the
same disk.
Shared Nothing:
Step 1: Shuffle your data around so you have the required parts available on the nodes
Step 2: Run the local algorithm to evaluate the operator on the nodes
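The two steps above can be sketched in a few lines. This is a minimal single-process illustration, not a real distributed engine: the "nodes" are plain Python lists, and the names `shuffle`, `local_count` and the sample `city` rows are hypothetical. Hash partitioning is assumed as the shuffle strategy, and a per-key count stands in for the operator.

```python
from collections import defaultdict

def shuffle(nodes, key, num_nodes):
    """Step 1: send every row to the node selected by hashing its key."""
    out = [[] for _ in range(num_nodes)]
    for rows in nodes:
        for row in rows:
            out[hash(row[key]) % num_nodes].append(row)
    return out

def local_count(rows, key):
    """Step 2: the ordinary single-node algorithm, here a count per key."""
    counts = defaultdict(int)
    for row in rows:
        counts[row[key]] += 1
    return dict(counts)

# Hypothetical input: rows are spread over three nodes in no particular order.
nodes = [
    [{"city": "Berlin"}, {"city": "Paris"}],
    [{"city": "Berlin"}, {"city": "Rome"}],
    [{"city": "Paris"}, {"city": "Berlin"}],
]

shuffled = shuffle(nodes, "city", num_nodes=3)
per_node = [local_count(rows, "city") for rows in shuffled]

# After the shuffle each key lives on exactly one node,
# so the partial results can simply be unioned.
result = {}
for part in per_node:
    result.update(part)
```

Because all rows with the same key land on the same node, no node needs to see another node's data during Step 2.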
We send a lot of data across the network to the central node & serialize the execution
It’s not scalable, and effectively eliminates the advantage of running on multiple nodes
Intuition
Fill the grey boxes with the data items designated for a particular node.
Distributed Selection/Projection
Easiest operators to run distributed:
Distributed GroupBy/Aggregation
Distributed Joins
Complex operator to run distributed
General Strategy:
Shuffle data around to ensure that matching pairs are on the same node
Co-Located Join
Best case:
Both tables are partitioned by the join keys — no need to reshuffle data, just run join locally!
If both tables are roughly the same size, then we hash-partition both by the join key, then run the join
locally
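As a rough sketch of hash-partitioning both tables and then joining locally: the code below simulates the nodes as list partitions inside one process. The function names and the `orders`/`customers` sample data are assumptions made for illustration, and a simple in-memory hash join is assumed as the local algorithm.

```python
def hash_partition(rows, key, num_nodes):
    """Assign each row to a node by hashing the join key."""
    parts = [[] for _ in range(num_nodes)]
    for row in rows:
        parts[hash(row[key]) % num_nodes].append(row)
    return parts

def local_hash_join(left, right, key):
    """Single-node hash join: build an index on left, probe with right."""
    index = {}
    for l in left:
        index.setdefault(l[key], []).append(l)
    return [{**l, **r} for r in right for l in index.get(r[key], [])]

def repartition_join(left, right, key, num_nodes):
    """Partition both inputs the same way, then join node by node."""
    left_parts = hash_partition(left, key, num_nodes)
    right_parts = hash_partition(right, key, num_nodes)
    joined = []
    for lp, rp in zip(left_parts, right_parts):  # one pair per node
        joined.extend(local_hash_join(lp, rp, key))
    return joined

# Hypothetical sample tables.
orders = [
    {"cust_id": 1, "item": "pen"},
    {"cust_id": 2, "item": "ink"},
    {"cust_id": 1, "item": "pad"},
]
customers = [
    {"cust_id": 1, "name": "Ada"},
    {"cust_id": 2, "name": "Bob"},
    {"cust_id": 3, "name": "Cy"},
]

joined = repartition_join(orders, customers, "cust_id", num_nodes=2)
```

Using the same hash function and node count for both tables is what guarantees that matching rows meet on the same node.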
Broadcast Join
General case: The tables are partitioned differently
If one table is a lot smaller than the other, broadcast the small table, then run the join locally
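A broadcast join can be sketched along the same lines. In this simplified, single-process illustration the large table stays in whatever partitions it already has, and every node receives a full copy of the small table. The name `broadcast_join` and the sample data are hypothetical.

```python
def broadcast_join(large_parts, small_table, key):
    """Join a partitioned large table with a small table copied to every node."""
    joined = []
    for part in large_parts:  # each iteration plays the role of one node
        # Every node builds a lookup from its own copy of the small table.
        lookup = {}
        for s in small_table:
            lookup.setdefault(s[key], []).append(s)
        # The node's share of the large table is probed locally; it never moves.
        for row in part:
            for s in lookup.get(row[key], []):
                joined.append({**row, **s})
    return joined

# Hypothetical data: the large table is partitioned arbitrarily (not by the join key).
sales_parts = [
    [{"country": "DE", "amount": 10}, {"country": "FR", "amount": 7}],
    [{"country": "DE", "amount": 3}, {"country": "IT", "amount": 5}],
]
countries = [
    {"country": "DE", "name": "Germany"},
    {"country": "FR", "name": "France"},
]

joined = broadcast_join(sales_parts, countries, "country")
```

Only the small table crosses the network, which is why this strategy pays off when one side is much smaller than the other.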
Task: Joins
In-Memory Database
Cloud DBMS
In-Memory Databases
Scale-up, shared-memory, parallel database engine
Columnar data layout, Compressed Execution, Vectorized (SIMD) operations, Lock-free algorithms
Real-time systems, Critical Business Intelligence Solutions, Dashboarding Backends, Trading Systems,
...
Typical use-cases are backends for web applications, web stores, caches
Examples:
“Star Schema”
Columnar data layout, compressed storage, exploiting data partitioning, aggressive utilization of
metadata to avoid scans
Typical use cases are Business Intelligence (BI), Reporting, Operational Management, …
Examples:
Cloud RDBMS
Architectural evolution of Data Warehousing Systems for modern Cloud Environments
Builds on Shared Nothing, but keeps data in cloud storage
Nodes do not “own” data, they only access what they need to process the query from cloud storage.
Transactions and access consistency are handled centrally via a distributed key value store.
Use cases are similar to Data Warehousing Systems, but often with a focus on larger enterprise
deployments