02 Big Data - Lessons Learnt
Ghislain Fourny
Big Data
2. Lessons Learnt
Fall 2021
Mr. Databases: Edgar Codd
Wikipedia
Data Independence (Edgar Codd)
Physical storage
Data Shapes
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam vel erat nec dui
aliquet vulputate sed quis nulla. Donec eget ultricies magna, eu dignissim elit.
Nullam sed urna nec nisl rhoncus ullamcorper placerat et enim. Integer varius
ornare libero quis consequat. Lorem ipsum dolor sit amet, consectetur adipiscing
elit. Aenean eu efficitur orci. Aenean ac posuere tellus. Ut id commodo turpis.
Praesent nec libero metus. Praesent at turpis placerat, congue ipsum eget,
scelerisque justo. Ut volutpat, massa ac lacinia cursus, nisl dui volutpat arcu, quis
interdum sapien turpis in tellus. Suspendisse potenti. Vestibulum pharetra justo
massa, ac venenatis mi condimentum nec. Proin viverra tortor non orci suscipit
rutrum. Phasellus sit amet euismod diam. Nullam convallis nunc sit amet diam
suscipit dapibus. Integer porta hendrerit nunc. Quisque pharetra congue porta.
Suspendisse vestibulum sed mi in euismod. Etiam a purus suscipit, accumsan nibh
vel, posuere ipsum. Nulla nec tempor nibh, id venenatis lectus. Duis lobortis id urna
eget tincidunt.
Overall architecture
Layers, shown for the old stack (with SQL as the language) and for the new one: Language, Data model, Compute, Storage
Take-away Concepts
Table / Collection
Attribute / Column / Field / Property
Primary Key / Row ID / Name
Row / Business Object / Item / Entity / Document / Record
Relational Algebra
Table as a relation
$R \subseteq \mathcal{D}_1 \times \mathcal{D}_2 \times \mathcal{D}_3$
Relations (the math)
A relation R is made of
1. A set of attributes: $\mathit{Attributes}_R \subseteq \mathbb{S}$
2. An extension (set of tuples): $\mathit{Extension}_R \subseteq \mathbb{S} \nrightarrow \mathbb{V}$
Tuple: example
Name ↦ Einstein
First name ↦ Albert
Physicist ↦ true
Year ↦ 1905
(attribute names on the left are in 𝕊, values on the right are in 𝕍)
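As a rough illustration of the tuple above (a sketch, not taken from the slides), a tuple can be modeled in Python as a dict, that is, a partial function from attribute names to values. The helper name `value_of` is hypothetical.

```python
# A tuple maps attribute names (strings, in S) to values (in V).
einstein = {
    "Name": "Einstein",
    "First name": "Albert",
    "Physicist": True,
    "Year": 1905,
}

# The attributes on which this tuple is defined (domain of the partial function).
attributes = set(einstein.keys())


def value_of(tup, attribute):
    """Return the value of an attribute, or None where the tuple is undefined."""
    return tup.get(attribute)


print(value_of(einstein, "Year"))     # defined attribute
print(value_of(einstein, "Country"))  # undefined attribute: the function is partial
```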
Tuple: more intuitive display
Name First name Physicist Year
Einstein Albert true 1905
Country Alan false 1936
A Gödel Kurt 1931
Rule #2: Atomic integrity (1st normal form)
32-000-000 Alan Turing 263-3010-00L Big Data Bletchley Park UK MK3 6EB
32-000-000 Alan Turing Lecture ID Lecture Name Bletchley Park UK MK3 6EB
xxx-xxxx-xxX Cryptography
Domain integrity
Atomic integrity
SQL
Relational integrity
From SQL to NoSQL
Domain integrity
Atomic integrity
SQL
NoSQL
Relational integrity
Relational Algebra
Summary of relational queries
Set queries: Union, Intersection, Subtraction
Filter queries: Selection, Projection
Renaming queries: Relation renaming, Attribute renaming
Joining queries: Cartesian product, Natural join, Theta join
Selection
R (input)
A       B        C
string  integer  boolean
foo     1        true
bar     2        false
foo     3        false
foobar  4        true
S (output)
A       B        C
string  integer  boolean
foo     1        true
bar     2        false
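A minimal sketch of selection in Python, using the rows of R from this slide. The slide text does not state the predicate; `B <= 2` is an assumption, chosen only because it yields the S shown. The function name `select` is hypothetical.

```python
# Relation R from the slide, one dict per tuple.
R = [
    {"A": "foo", "B": 1, "C": True},
    {"A": "bar", "B": 2, "C": False},
    {"A": "foo", "B": 3, "C": False},
    {"A": "foobar", "B": 4, "C": True},
]


def select(relation, predicate):
    """Selection: keep the tuples for which the predicate holds."""
    return [t for t in relation if predicate(t)]


# Assumed predicate (not given on the slide): B <= 2.
S = select(R, lambda t: t["B"] <= 2)
print(S)
```

Selection never changes the attributes: S has the same columns as R, only fewer rows.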
Projection
R (input)
A       B        C
string  integer  boolean
foo     1        true
bar     2        false
foo     3        false
foobar  4        true
S (output)
A       C
string  boolean
foo     true
bar     false
foo     false
foobar  true
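Projection on the same data can be sketched as follows. The function name `project` is hypothetical, and this version keeps duplicates if any arise (bag behaviour), which is an assumption rather than something the slide specifies.

```python
# Relation R from the slide, one dict per tuple.
R = [
    {"A": "foo", "B": 1, "C": True},
    {"A": "bar", "B": 2, "C": False},
    {"A": "foo", "B": 3, "C": False},
    {"A": "foobar", "B": 4, "C": True},
]


def project(relation, attributes):
    """Projection: keep only the listed attributes of every tuple."""
    return [{a: t[a] for a in attributes} for t in relation]


# Keep columns A and C, drop B, as in the S table of the slide.
S = project(R, ["A", "C"])
print(S)
```

Projection keeps every row and removes columns, the mirror image of selection.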
Grouping
R (input)
G       A
string  integer
foo     19
bar     28
bar     265
foo     4
foobar  54
foo     46
bar     245
foobar  3456
bar     139
R (tuples with the same G brought together)
G       A
string  integer
foo     19
foo     4
foo     46
bar     28
bar     265
bar     245
bar     139
foobar  54
foobar  3456
R (one group per value of G)
G       A
string  integer
foo     19, 4, 46
bar     28, 265, 245, 139
foobar  54, 3456
R (output: sum of A per group)
G       A
string  integer
foo     69
bar     677
foobar  3510
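The grouping steps on this slide can be sketched in Python: collect the A values per value of G, then aggregate each group. The sum is the aggregate that matches the slide's final numbers; the function name `group_and_sum` is hypothetical.

```python
from collections import defaultdict

# Relation R from the slide as (G, A) pairs.
R = [
    ("foo", 19), ("bar", 28), ("bar", 265), ("foo", 4), ("foobar", 54),
    ("foo", 46), ("bar", 245), ("foobar", 3456), ("bar", 139),
]


def group_and_sum(relation):
    """Group the A values by G, then sum each group."""
    groups = defaultdict(list)
    for g, a in relation:
        groups[g].append(a)  # step 1: bring tuples with the same G together
    return {g: sum(values) for g, values in groups.items()}  # step 2: aggregate


result = group_and_sum(R)
print(result)
```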
Sorting
Cartesian product
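A small sketch of the Cartesian product. The two relations below are made up for illustration (the slide's own figure did not survive extraction), and the function name `cartesian_product` is hypothetical.

```python
from itertools import product

# Two illustrative relations with disjoint attribute names (assumed data).
R = [{"A": 1}, {"A": 2}]
S = [{"B": "x"}, {"B": "y"}, {"B": "z"}]


def cartesian_product(left, right):
    """Pair every tuple of left with every tuple of right.

    Assumes the two relations share no attribute name.
    """
    return [{**l, **r} for l, r in product(left, right)]


T = cartesian_product(R, S)
print(len(T))  # number of rows is len(R) * len(S)
```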
Join
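A natural join can be sketched as a Cartesian product that keeps only the pairs agreeing on all shared attributes. The data is borrowed from the lecture tables that appear later in these slides; the function name `natural_join` and the variable names are hypothetical.

```python
# (Legi, Lecture ID) pairs and (Lecture ID, Lecture Name) pairs.
attendance = [
    {"Legi": "32-000-000", "Lecture ID": "263-3010-00L"},
    {"Legi": "62-000-000", "Lecture ID": "123-4567-89L"},
]
lectures = [
    {"Lecture ID": "xxx-xxxx-xxX", "Lecture Name": "Cryptography"},
    {"Lecture ID": "263-3010-00L", "Lecture Name": "Big Data"},
    {"Lecture ID": "123-4567-89L", "Lecture Name": "Set theory"},
]


def natural_join(left, right):
    """Combine tuples that agree on every attribute name they share."""
    result = []
    for l in left:
        for r in right:
            common = l.keys() & r.keys()
            if all(l[a] == r[a] for a in common):
                result.append({**l, **r})
    return result


joined = natural_join(attendance, lectures)
print(joined)
```

If the two relations share no attribute, the condition is trivially true and this sketch degenerates into a Cartesian product.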
Update anomaly
Insert anomaly
1st Normal Form (tabular) – The Key
1st Normal Form: counter-example
32-000-000 Alan Turing Lecture ID Lecture Name Bletchley Park UK MK3 6EB
xxx-xxxx-xxX Cryptography
32-000-000 Alan Turing 263-3010-00L Big Data Bletchley Park UK MK3 6EB
32-000-000 Alan Turing 263-3010-00L Big Data Bletchley Park UK MK3 6EB
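One way to picture the fix for the counter-example is to unnest the table-valued cell, producing one flat row per nested row. This is a sketch under assumptions: the attribute name `Lectures` for the nested cell and the function name `unnest` are hypothetical; the other column names follow the slides.

```python
# A record that violates 1st normal form: one cell holds a whole table.
nested = {
    "Legi": "32-000-000",
    "Name": "Alan Turing",
    "Lectures": [  # hypothetical name for the nested cell
        {"Lecture ID": "xxx-xxxx-xxX", "Lecture Name": "Cryptography"},
        {"Lecture ID": "263-3010-00L", "Lecture Name": "Big Data"},
    ],
    "City": "Bletchley Park",
    "State": "UK",
    "PLZ": "MK3 6EB",
}


def unnest(record, nested_attribute):
    """Emit one flat row per nested row, repeating the outer values."""
    outer = {k: v for k, v in record.items() if k != nested_attribute}
    return [{**outer, **inner} for inner in record[nested_attribute]]


flat_rows = unnest(nested, "Lectures")
for row in flat_rows:
    print(row)
```

Every value in the flat rows is atomic, at the price of repeating the student's name and address on each row.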
Legi        Name          City            State  PLZ
32-000-000  Alan Turing   Bletchley Park  UK     MK3 6EB
32-000-000  Alan Turing   Bletchley Park  UK     MK3 6EB
62-000-000  Georg Cantor  Pfäffikon       SZ     8808
62-000-000  Georg Cantor  Pfäffikon       SZ     8808
25-000-000  Felix Bloch   Pfäffikon       ZH     8330

Legi        Lecture ID
32-000-000  xxx-xxxx-xxX
32-000-000  263-3010-00L
62-000-000  263-3010-00L
62-000-000  123-4567-89L
25-000-000  123-4567-89L

Lecture ID    Lecture Name
xxx-xxxx-xxX  Cryptography
263-3010-00L  Big Data
123-4567-89L  Set theory
3rd Normal Form – Nothing But The Key
3rd Normal Form: Counter-Example
Data Denormalization
From 3NF down to 2NF, 1NF, 0NF
SQL Brush-Up
SQL History
System R + SEQUEL
First commercial relational query language
1982
SEQUEL: Structured English QUEry Language
Declarative language
Set-based (manipulates entire relations with a single command)
Renaming: SEQUEL → SQL (trademark issue)
Pronounced ESS-kew-EL or SEE-kwəl
SQL is a declarative language
The query says what to compute; the engine decides how:
Physical execution
Query Plan
Parallelism
SQL is a functional language
persons
name middle_initial last_name birth_date gender passport_scan
varchar(30) char(1) text date boolean bytea
James T Kirk 2233-03-22 M AD10E7
Beverly C Crusher 2324-10-13 F AD234F7
Spock NULL NULL 2230-01-06 M AD234F7
SELECT *
FROM persons
WHERE last_name = 'Crusher'
persons
name middle_initial last_name birth_date gender passport_scan
varchar(30) char(1) text date boolean bytea
Beverly C Crusher 2324-10-13 F AD234F7
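The selection query above can be tried with Python's built-in `sqlite3` module. This is a sketch under assumptions: only four of the slide's columns are created, and SQLite's loosely typed `TEXT` columns stand in for the slide's types.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced version of the persons table (assumption: four columns are enough here).
conn.execute(
    "CREATE TABLE persons (name TEXT, middle_initial TEXT, last_name TEXT, birth_date TEXT)"
)
conn.executemany(
    "INSERT INTO persons VALUES (?, ?, ?, ?)",
    [
        ("James", "T", "Kirk", "2233-03-22"),
        ("Beverly", "C", "Crusher", "2324-10-13"),
        ("Spock", None, None, "2230-01-06"),  # None is stored as NULL
    ],
)

rows = conn.execute(
    "SELECT * FROM persons WHERE last_name = 'Crusher'"
).fetchall()
print(rows)
```

Spock's row is not returned: comparing a NULL last_name with 'Crusher' does not evaluate to true.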
A projecting query
persons
name middle_initial last_name birth_date gender passport_scan
varchar(30) char(1) text date boolean bytea
James T Kirk 2233-03-22 M AD10E7
Beverly C Crusher 2324-10-13 F AD234F7
Spock NULL NULL 2230-01-06 M AD234F7
SELECT name, birth_date
FROM persons
name birth_date
varchar(30) date
James 2233-03-22
Beverly 2324-10-13
Spock 2230-01-06
A renaming query
persons
name middle_initial last_name birth_date gender passport_scan
varchar(30) char(1) text date boolean bytea
James T Kirk 2233-03-22 M AD10E7
Beverly C Crusher 2324-10-13 F AD234F7
Spock NULL NULL 2230-01-06 M AD234F7
persons
name middle_initial last_name birth_date gender passport_scan
varchar(30) char(1) text date boolean bytea
James T Kirk 2233-03-22 M AD10E7
Beverly C Crusher 2324-10-13 F AD234F7
Spock NULL NULL 2230-01-06 M AD234F7
SELECT *
FROM persons
ORDER BY birth_date
persons
name middle_initial last_name birth_date gender passport_scan
varchar(30) char(1) text date boolean bytea
Spock NULL NULL 2230-01-06 M AD234F7
James T Kirk 2233-03-22 M AD10E7
Beverly C Crusher 2324-10-13 F AD234F7
Sorting options: NULLs first
persons
name middle_initial last_name birth_date gender passport_scan
varchar(30) char(1) text date boolean bytea
James T Kirk 2233-03-22 M AD10E7
Beverly C Crusher 2324-10-13 F AD234F7
Spock NULL NULL 2230-01-06 M AD234F7
SELECT *
FROM persons
ORDER BY last_name NULLS FIRST
persons
name middle_initial last_name birth_date gender passport_scan
varchar(30) char(1) text date boolean bytea
Spock NULL NULL 2230-01-06 M AD234F7
Beverly C Crusher 2324-10-13 F AD234F7
James T Kirk 2233-03-22 M AD10E7
A grouping query
persons
name middle_initial last_name century captain
varchar(30) char(1) text integer boolean
James T Kirk 23 TRUE
Beverly C Crusher 24 FALSE
Jean-Luc NULL Picard 24 TRUE
Kathryn NULL Janeway 24 TRUE
SELECT century AS c
FROM persons
GROUP BY century
HAVING COUNT(*) > 2
c
integer
24
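The grouping query can be reproduced with Python's built-in `sqlite3` module. A sketch under assumptions: the middle_initial column is left out, and the boolean captain column is stored as 0/1 because SQLite has no separate boolean type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE persons (name TEXT, last_name TEXT, century INTEGER, captain INTEGER)"
)
conn.executemany(
    "INSERT INTO persons VALUES (?, ?, ?, ?)",
    [
        ("James", "Kirk", 23, 1),
        ("Beverly", "Crusher", 24, 0),
        ("Jean-Luc", "Picard", 24, 1),
        ("Kathryn", "Janeway", 24, 1),
    ],
)

# Groups are formed per century; HAVING then filters whole groups.
rows = conn.execute(
    "SELECT century AS c FROM persons GROUP BY century HAVING COUNT(*) > 2"
).fetchall()
print(rows)
```

Century 23 has a single person and is dropped by HAVING; century 24 has three and is kept.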
A union query
spaceships1
warp name last_name code successor
numeric varchar(30) text varchar varchar
5 USS Enterprise A Kirk NCC-1701-A NCC-1701-B
6 USS Enterprise B Kirk NCC-1701-B NCC-1701-C
7 USS Enterprise C Kirk NCC-1701-C NCC-1701-D
spaceships2
warp name last_name code successor
numeric varchar(30) text varchar varchar
7 USS Enterprise C Kirk NCC-1701-C NCC-1701-D
4 USS Enterprise Kirk NCC-1701 NCC-1701-A
9.2 USS Enterprise D Picard NCC-1701-D NCC-1701-E
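The query of this slide is not present in the text, so the sketch below assumes the usual form, `SELECT * FROM spaceships1 UNION SELECT * FROM spaceships2`, and runs it with Python's built-in `sqlite3` module. The helper name `count` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
columns = "(warp NUMERIC, name TEXT, last_name TEXT, code TEXT, successor TEXT)"
conn.execute("CREATE TABLE spaceships1 " + columns)
conn.execute("CREATE TABLE spaceships2 " + columns)
conn.executemany("INSERT INTO spaceships1 VALUES (?, ?, ?, ?, ?)", [
    (5, "USS Enterprise A", "Kirk", "NCC-1701-A", "NCC-1701-B"),
    (6, "USS Enterprise B", "Kirk", "NCC-1701-B", "NCC-1701-C"),
    (7, "USS Enterprise C", "Kirk", "NCC-1701-C", "NCC-1701-D"),
])
conn.executemany("INSERT INTO spaceships2 VALUES (?, ?, ?, ?, ?)", [
    (7, "USS Enterprise C", "Kirk", "NCC-1701-C", "NCC-1701-D"),
    (4, "USS Enterprise", "Kirk", "NCC-1701", "NCC-1701-A"),
    (9.2, "USS Enterprise D", "Picard", "NCC-1701-D", "NCC-1701-E"),
])


def count(query):
    """Number of rows returned by a query (helper for this sketch only)."""
    return conn.execute("SELECT COUNT(*) FROM (" + query + ")").fetchone()[0]


# UNION removes duplicate rows, UNION ALL keeps them.
union_count = count("SELECT * FROM spaceships1 UNION SELECT * FROM spaceships2")
union_all_count = count("SELECT * FROM spaceships1 UNION ALL SELECT * FROM spaceships2")
intersect_count = count("SELECT * FROM spaceships1 INTERSECT SELECT * FROM spaceships2")
print(union_count, union_all_count, intersect_count)
```

USS Enterprise C is in both tables, so it appears once under UNION, twice under UNION ALL, and is the only row of the intersection.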
spaceships1
warp name last_name code successor
numeric varchar(30) text varchar varchar
5 USS Enterprise A Kirk NCC-1701-A NCC-1701-B
6 USS Enterprise B Kirk NCC-1701-B NCC-1701-C
7 USS Enterprise C Kirk NCC-1701-C NCC-1701-D
spaceships2
warp name last_name code successor
numeric varchar(30) text varchar varchar
7 USS Enterprise C Kirk NCC-1701-C NCC-1701-D
4 USS Enterprise Kirk NCC-1701 NCC-1701-A
persons spaceships
SELECT *
FROM persons LEFT OUTER JOIN spaceships
ON persons.last_name = spaceships.captain_name
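The left outer join can be sketched with Python's built-in `sqlite3` module. The two tiny tables below are assumptions made for illustration (the slide's tables did not survive extraction); only the join condition and the column name `captain_name` come from the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Illustrative tables (assumed schemas, reduced to two columns each).
conn.execute("CREATE TABLE persons (name TEXT, last_name TEXT)")
conn.execute("CREATE TABLE spaceships (name TEXT, captain_name TEXT)")
conn.executemany("INSERT INTO persons VALUES (?, ?)", [
    ("James", "Kirk"),
    ("Beverly", "Crusher"),
])
conn.executemany("INSERT INTO spaceships VALUES (?, ?)", [
    ("USS Enterprise A", "Kirk"),
    ("USS Enterprise D", "Picard"),
])

rows = conn.execute(
    """
    SELECT *
    FROM persons LEFT OUTER JOIN spaceships
    ON persons.last_name = spaceships.captain_name
    ORDER BY persons.last_name
    """
).fetchall()
for row in rows:
    print(row)
```

Every person is kept: Crusher has no matching spaceship, so her spaceship columns are NULL (None in Python). The Picard spaceship has no matching person and does not appear in a left outer join.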
persons spaceships
SELECT *
FROM persons RIGHT OUTER JOIN spaceships
ON persons.last_name = spaceships.captain_name
NULL NULL NULL NULL NULL 9.2 USS Enterprise D Picard NCC-1701-D
persons spaceships
SELECT *
FROM persons FULL OUTER JOIN spaceships
ON persons.last_name = spaceships.captain_name
NULL NULL NULL NULL NULL 9.2 USS Enterprise D Picard NCC-1701-D
persons spaceships
SELECT *
FROM persons NATURAL FULL OUTER JOIN spaceships
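SELECT statement in relational algebra terms:
σ (pre-grouping selection, WHERE)
𝜸 (grouping, GROUP BY)
σ (post-grouping selection, HAVING)
π (projection, SELECT)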
Three-valued logics: OR
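A sketch of the three-valued OR in Python, where `None` stands for SQL's NULL (unknown). The function name `or3` is hypothetical.

```python
def or3(a, b):
    """Three-valued OR over True, False and None (unknown)."""
    if a is True or b is True:
        return True   # one true operand is enough, whatever the other is
    if a is None or b is None:
        return None   # no true operand and at least one unknown
    return False      # both operands are false


# Print the full truth table.
for a in (True, False, None):
    for b in (True, False, None):
        print(a, "OR", b, "=", or3(a, b))
```

The key case is unknown OR true, which is true: the result does not depend on the missing value.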
Data
Schema
DDL: Data Definition Language
(Create a table/schema, drop it)
Language landscape
Proto-imperative language
Imperative language
Functional/declarative language
Proto-here-is-an-example language
Here-is-an-example language
Databases
Software engineering
Transactions
The good old times of databases: ACID
Atomicity
Consistency
Isolation
Durability
Atomicity
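Atomicity can be illustrated with Python's built-in `sqlite3` module: either all updates of a transaction are applied, or none. The `accounts` table, the account names and the `transfer` function are hypothetical, chosen only for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK constraint forbids negative balances.
conn.execute(
    "CREATE TABLE accounts (owner TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money in one transaction: both updates happen, or neither."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE owner = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE owner = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 150)  # would overdraw alice, so it must fail
balances = dict(conn.execute("SELECT owner, balance FROM accounts"))
print(ok, balances)
```

The credit to bob is executed first and succeeds, but it is undone when the debit violates the constraint, so no partial transfer is ever visible.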
OLTP: OnLine Transaction Processing, write-intensive
OLAP: OnLine Analytical Processing, read-intensive
No such thing as "one size fits all"
Mind data shapes!
Data Scale-Up
Data can have...
Lots of rows
Lots of columns
Lots of nesting
The rest of the lecture: Scaling up