
00 Introduction


SCALABLE DATA MANAGEMENT SYSTEMS

INTRODUCTION TO COURSE
SYSTEMS GROUP @ TU DARMSTADT http://tuda.systems/

Focus: Efficient and Modern Data Systems


LECTURES OF SYSTEMS GROUP
• InfMan: Database Introduction (Summer Term). Bachelor-level, basics of DBMSs
• SDMS: Scalable Data Systems & Cloud (Winter Term). Bachelor-/Master-level, some research
• FOMO: Foundations of Modern Data Systems (Winter Term). Master-level, research-heavy
• ADMS: Advanced Data Systems (Summer Term). Master-level, research-heavy
• AIDM: AI for Data Management (Winter Term). Master-level, research-heavy
Systems@TUDa http://tuda.systems/
THE TEAM FOR THE COURSE

Lecturers: Carsten Binnig, Zsolt Istvan

Teaching Assistants: Muhammad El-Hindi, Nils Boeschen, Adrian Lutsch

THE COURSE IN A NUTSHELL

The focus of this course is on the systems-oriented internals of scalable data systems for storing & processing large amounts of data.

TODAY’S AGENDA
Course Overview

Course Logistics
• Organization
• Grading

Volume of data/information created … worldwide from 2010 to 2020, with forecasts from 2021 to 2025
Source: https://www.statista.com/statistics/871513/worldwide-data-created/

Scalable Databases to Analyse Petabytes of Structured (Tabular) Data: Modern Cloud DBMSs

LARGE SCALE AI: GENERATIVE MODELS / LLMs
LLMs (e.g., GPT-4) are trained on large collections of text and images

WHY NOW?
Game changer: Exponential growth in data & in the technology to process data
[Figure: Growth over Time, with curves for DATA and Compute; $/GB and $/FLOP have been decreasing]

EFFECTS OF EXPONENTIAL GROWTH?

The legend of the king and the chessboard:
"1 grain on the first square, 2 on the second, 4 on the third, 8 on the fourth, and so on"

→ Covers all of India with 15 cm of rice (9 times the area of Germany)
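
The doubling rule makes the totals easy to check. The following is a minimal Python sketch (the function name `grains_on_board` is illustrative and not from the lecture) that sums the grains square by square and compares the first 32 squares with the whole board:

```python
def grains_on_board(squares: int = 64) -> int:
    """Total grains when square k (1-based) holds 2**(k-1) grains."""
    return sum(2 ** (k - 1) for k in range(1, squares + 1))


total = grains_on_board(64)        # equals 2**64 - 1
first_half = grains_on_board(32)   # equals 2**32 - 1
second_half = total - first_half   # grains on squares 33..64

print(f"total grains:       {total:,}")
print(f"first half (1-32):  {first_half:,}")
print(f"second half share:  {second_half / total:.10f}")
```

The first 32 squares hold roughly 4.3 billion grains, a vanishing fraction of the total of about 1.8 x 10^19.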

THE "SECOND HALF" OF THE CHESSBOARD

The "second half of the chessboard" is a phrase coined by futurist Ray Kurzweil: once exponential growth reaches the second half of the board, the growth factor has a significant impact on every business & technology.

Example: Exponential speed-ups in genome sequencing brought the time to sequence the full human genome from an estimated 1000 years down to 13 years (Human Genome Project)

HOW DOES THIS TRANSLATE TO THIS COURSE?
Exponential growth in data & resources changes what we can do!
[Figure: Progress over Time for DATA]

THIS COURSE: WHAT ARE YOU LEARNING?

How to design Scalable Data (& AI) Systems on distributed resources to store & process large volumes of data:
• Distributed DBMSs (10's of machines)
• Cloud DBMSs (100's of machines)
• Big Data Systems (beyond SQL)
• Scalable AI Systems (beyond CPUs)

THE INFRASTRUCTURE: CLOUD DATA CENTERS
Data Centers in the Cloud: 1000’s of machines connected via
high-speed networks. How to use them for data processing?

COURSE SCHEDULE
Week 1: Introduction

Single-Node + Distributed DBMS
Week 2: DBMS Storage - Single Node
Week 3: DBMS Storage - Distributed
Week 4: DBMS Query Processing - Single Node
Week 5: DBMS Query Processing - Distributed
Week 6: DBMS Query Optimization - Single Node & Distributed
Week 7: DBMS Transaction Processing - Single Node
Week 8: DBMS Transaction Processing - Distributed

Cloud DBMSs
Week 9: Cloud DBMS - Data Centers & DBMS Architectures
Week 10: Cloud DBMS - Scalable Query Processing
Week 11: Cloud DBMS - Scalable Transaction Processing
Week 12: Cloud DBMS - Secure DBMSs

Other Workloads
Week 13: Other Workloads - MapReduce / Streaming
Week 14: Other Workloads - Distributed AI

THIS COURSE: WHAT ARE YOU LEARNING?

How to design Scalable Data (& AI) Systems on distributed resources to store & process large volumes of data:
• Distributed DBMSs (10's of machines)
• Cloud DBMSs (100's of machines)
• Big Data Systems (beyond SQL)
• Scalable AI Systems (beyond CPUs)

DISTRIBUTED DBMS: QUERY PROCESSING

Distributed Query Processing of SQL in DBMSs:
• SQL-Query: SELECT * FROM orders WHERE amount>50
• Compilation & Optimization: on the Coordinator
• Query Execution (in parallel): on the Workers

Coordinator compiles/optimizes query & workers run query plans in parallel

DISTRIBUTED DBMS: QUERY PROCESSING

SQL-Query: SELECT * FROM orders WHERE amount>50

Distributed Execution: SQL query is executed across multiple smaller partitions
• σ amount>50 on Partition1 of Orders Table (on Worker1)
• σ amount>50 on Partition2 of Orders Table (on Worker2)
• ∪ (union) of the partial results (e.g., on Worker1)

How to enable this for more complex queries (e.g., with joins)?
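
For the simple selection query, the split between coordinator and workers can be sketched in Python. This is a minimal illustration under assumptions that are not part of the lecture: the sample rows and the names `PARTITIONS`, `select_amount_gt`, and `run_distributed_select` are hypothetical, and threads in a single process stand in for separate worker machines.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions of the orders table, one per worker.
PARTITIONS = {
    "worker1": [{"id": 1, "amount": 20}, {"id": 2, "amount": 75}],
    "worker2": [{"id": 3, "amount": 50}, {"id": 4, "amount": 120}],
}


def select_amount_gt(rows, threshold):
    """Local selection: the filter each worker runs on its own partition."""
    return [row for row in rows if row["amount"] > threshold]


def run_distributed_select(partitions, threshold):
    """Coordinator: run the selection per partition in parallel, then union."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partial_results = pool.map(
            lambda rows: select_amount_gt(rows, threshold),
            partitions.values(),
        )
        # Union: concatenate the partial results (partitions are disjoint).
        return [row for part in partial_results for row in part]


result = run_distributed_select(PARTITIONS, 50)
print(sorted(row["id"] for row in result))
```

A selection needs no data exchange between workers because every row can be checked on its own. A join may need matching rows that live on different workers, which is why it is harder to distribute.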
THIS COURSE: WHAT ARE YOU LEARNING?

How to design Scalable Data & AI Systems on distributed resources to store & process large volumes of data:
• Distributed DBMSs (10's of machines)
• Cloud DBMSs (100's of machines)
• Big Data Systems (beyond SQL)
• Scalable AI Systems (beyond CPUs)

DBMS MARKET: THE CLOUD IS TAKING OVER

Source: https://blogs.gartner.com/merv-adrian/2022/04/16/dbms-market-transformation-2021-the-big-picture/
CLOUD DBMS: DISAGGREGATED SYSTEMS
A SQL query passes through three layers, each of which can scale:
• Service Layer (Virtual Machines): Optimizer, Metadata, …, Security. Produces Parallel Query Plans (one per node)
• Compute Layer (Virtual Machines): Compute Node 1, Compute Node 2, Compute Node 3, …, Compute Node N. Each node reads files from the storage layer
• Storage Layer (e.g., S3): Table1_File1, Table1_File2, …, Table2_File1, Table2_File2
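
The separation of the three layers can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: an in-memory dictionary stands in for the object store, the file `Table1_File3` and the values are invented, the round-robin assignment is just one simple planning policy, and the compute nodes run one after another instead of in parallel.

```python
# Storage layer stand-in: file name -> values (a real system would use e.g. S3).
STORAGE = {
    "Table1_File1": [10, 60, 70],
    "Table1_File2": [5, 55],
    "Table1_File3": [90],
}


def plan_scan(files, num_nodes):
    """Service layer: build one scan plan (a list of files) per compute node."""
    plans = [[] for _ in range(num_nodes)]
    for i, name in enumerate(sorted(files)):
        plans[i % num_nodes].append(name)  # round-robin assignment
    return plans


def run_plan(plan, threshold):
    """Compute node: read its files from storage and filter the values."""
    return [v for name in plan for v in STORAGE[name] if v > threshold]


def query(num_nodes, threshold=50):
    """Plan the scan, run every node's plan, and merge the partial results."""
    plans = plan_scan(STORAGE.keys(), num_nodes)
    partials = [run_plan(plan, threshold) for plan in plans]
    return sorted(v for part in partials for v in part)


# The result does not depend on how many compute nodes are used.
print(query(num_nodes=1))
print(query(num_nodes=3))
```

Because the compute nodes keep no data of their own, the number of nodes can change between queries without moving any files.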

THIS COURSE: WHAT ARE YOU LEARNING?

Understand how to design Scalable Data Management Systems that can make use of current technology to process large volumes of data:
• Distributed DBMSs (10's of machines)
• Cloud DBMSs (100's of machines)
• Big Data Systems (beyond SQL)
• Scalable AI Systems (beyond CPUs)

BIG DATA: BEYOND SQL
Jeffrey Dean (Lead of Google AI): Keynote 2008

Challenge: Build a Search Index for Google over ~20 billion web pages

Main Problem:
• 20 billion x 20KB (per web page) = 400TB of raw data
• Average read rate of a commodity disk is 30-35MB/s => ~4 months to just process the web crawl with 1 machine!

The good news: the same problem with 1000 machines takes less than 3 hours to scan all the data
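
The back-of-the-envelope numbers can be reproduced with a few lines of Python. The sketch assumes decimal units (1 KB = 1000 bytes, 1 MB = 10^6 bytes) and perfect scaling across machines with no coordination overhead; the names `total_bytes` and `scan_seconds` are illustrative.

```python
PAGES = 20e9        # ~20 billion web pages
PAGE_SIZE = 20e3    # 20 KB per page, in bytes
MACHINES = 1000

total_bytes = PAGES * PAGE_SIZE  # 4e14 bytes = 400 TB


def scan_seconds(read_rate_mb_s, machines=1):
    """Time to read all data once, split evenly across the machines."""
    return total_bytes / (read_rate_mb_s * 1e6 * machines)


for rate in (30, 35):
    days_single = scan_seconds(rate) / 86_400
    hours_cluster = scan_seconds(rate, MACHINES) / 3_600
    print(f"{rate} MB/s: {days_single:.0f} days on 1 machine, "
          f"{hours_cluster:.1f} hours on {MACHINES} machines")
```

Under these assumptions a single machine needs roughly 130 to 155 days, and 1000 machines need roughly 3 to 4 hours, which is the same order of magnitude as the rounded figures on the slide.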

BIG DATA: BEYOND SQL
Example: Compute Word Frequencies

Input-File:
the quick brown fox
the fox ate the mouse
how now brown cow

Output-File:
brown, 2
fox, 2
how, 1
now, 1
the, 3
ate, 1
cow, 1
mouse, 1
quick, 1

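BIG DATA: MAP REDUCE
MapReduce was developed by Google to run scalable data processing beyond SQL

Input-File (Distributed) → Map → Shuffle → Reduce → Output-File (Distributed)

Input-File (Distributed), split into blocks (64MB):
• Block 1: the quick brown fox
• Block 2: the fox ate the mouse
• Block 3: how now brown cow

Map (one per block):
• Block 1: the, 1 | quick, 1 | brown, 1 | fox, 1
• Block 2: the, 1 | fox, 1 | ate, 1 | the, 1 | mouse, 1
• Block 3: how, 1 | now, 1 | brown, 1 | cow, 1

Shuffle (group by word):
• Reduce 1: brown, {1,1} | fox, {1,1} | …
• Reduce 2: ate, {1} | cow, {1} | …

Reduce → Output-File (Distributed):
• Reduce 1: brown, 2 | fox, 2 | how, 1 | now, 1 | the, 3
• Reduce 2: ate, 1 | cow, 1 | mouse, 1 | quick, 1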
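
The three phases can be traced on the example input with a single-process Python sketch. It only illustrates the data flow: a real MapReduce system runs the map and reduce functions on many machines and moves data between them during the shuffle. The function names `map_phase`, `shuffle_phase`, and `reduce_phase` are illustrative.

```python
from collections import defaultdict

# The three input blocks from the word-frequency example.
BLOCKS = [
    "the quick brown fox",
    "the fox ate the mouse",
    "how now brown cow",
]


def map_phase(block):
    """Map: emit a (word, 1) pair for every word in one block."""
    return [(word, 1) for word in block.split()]


def shuffle_phase(mapped_pairs):
    """Shuffle: group all emitted values by key (the word)."""
    groups = defaultdict(list)
    for word, count in mapped_pairs:
        groups[word].append(count)
    return groups


def reduce_phase(word, counts):
    """Reduce: sum the grouped counts for one word."""
    return word, sum(counts)


mapped = [pair for block in BLOCKS for pair in map_phase(block)]
grouped = shuffle_phase(mapped)
word_counts = dict(reduce_phase(w, c) for w, c in grouped.items())

for word in sorted(word_counts):
    print(f"{word}, {word_counts[word]}")
```

Each map call only sees its own block and each reduce call only sees one word, which is what allows both phases to run in parallel on different machines.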

STRUCTURE OF COURSE
Part 1: Single-Node & Distributed Database Architectures

Part 2: Cloud Database Architectures

Part 3: Other Workloads (e.g., MapReduce, Streaming, AI)

COURSE LOGISTICS
Organization:
• Lectures: Tuesday 15:20-17:00
• Exercises: Friday 9:50-11:30 (Exercise Sheets + Programming Labs)
• Exercise Sheets: Preparation for Final Exam
• Programming Labs (Graded): Practical Implementation of Concepts in Lecture

Pre-requisites:
• Solid programming skills required
• Lecture “Information Management” or equivalent (Intro to Databases/SQL)
• Sufficient time to work on course assignments (i.e., labs)

COURSE GRADING AND REGISTRATION
Grading:
• 60 Points for Final Exam
• 40 Points for Programming Labs (3 Labs: 15P + 15P + 10P )
• Up to 6 Bonus Points for Leaderboard (3 Labs x 0-2P each)
• 2 Bonus Points for Warm-up Lab
• 50 Points overall required to pass

Registration in TUCaN for Exam/Labs ("Prüfungsanmeldung", i.e., exam registration):
• You need to register for the exam before you hand in the first lab
• Registration deadline: 12.12.2024
• No de-registration possible after the deadline
PROGRAMMING LABS
Implement major building blocks of a scalable DBMS (e.g., DBMS storage, query processing, …)
• We will have 3 programming labs with 5-6 weeks per lab
• Labs have to be solved individually!
• Labs are solved "at home", not in the exercise sessions
• We provide a code framework and tests to check your implementation

Details of programming labs will be discussed in the exercise
• Attending the exercise is highly recommended
PROGRAMMING LABS: RUST (NEW!)

Programming Language of Labs: Rust

Why Rust?
• Alternatives for scalable DBMSs: C++, Java (not anymore)
• Rust: As efficient as C++ but with enhanced safety and modern features

Use this as a chance to learn an important server-side language: we provide tutorials to get into the language
PROGRAMMING LABS: RUST
Source: https://survey.stackoverflow.co/2024/technology/#most-popular-technologies

FINAL EXAM
Exam questions based on lecture and exercises

Duration / Points: 60 Minutes = 60 Points

Closed-book

No additional notes

COURSE MATERIAL & INFRASTRUCTURE
Moodle (of CS Department): https://moodle.informatik.tu-darmstadt.de/course/view.php?id=1663
• Lecture (Slides) and Exercise/Lab Material
• Forum for Q&A
Details on http://tuda.systems/

Lab Infrastructure:
• What you get: Code framework / Automated testing (in Rust)
• What you hand in: Your code via GitLab (will be automatically tested)
• Warm-up lab: get to know Rust + lab/hand-in setup
Details in exercise session this Friday! (you need to come for the setup)
QUESTIONS
