06 QueryProcessing-noblanks
Weeks 6 - 8
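[Figure: query processing overview. A query plan (σuosCode=‘DATA3404’ on Enrolled, joined ⋈ with Student) is run by the evaluation engine against the DATABASE (Index Files, Data Files, System Catalog with statistics data about data) and produces the query output (Name: John Smith, Sally Waters).]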
DATA3404 ”Scalable Data Management" - 2023 (Roehm) 7
Relational Algebra
(abbreviation: RA)
– Why is it important?
– Helps us understand the precise meaning of declarative SQL queries.
– Intermediate language used within DBMS (cf. chapter on db tuning)
[Figure: the set operators illustrated: cross-product (×), union (∪), difference (−), intersection (∩)]
Student
sid   name       gender  country
1001  Ian        M       AUS
1002  Ha Tschi   F       ROK
1003  Grant      M       AUS
1004  Simon      M       GBR
1005  Jesse      F       CHN
1006  Franzisca  F       GER

Enrolled
sid   uos_code  semester
1001  COMP5138  2005-S2
1002  COMP5702  2005-S2
1003  COMP5138  2005-S2
1006  COMP5318  2005-S2
1001  INFS6014  2004-S1
1003  ISYS3207  2005-S2

UnitOfStudy
uos_code  title                  creditPoints
COMP5138  Relational DBMS        6
COMP5318  Data Mining            6
INFO6007  IT Project Management  6
SOFT1002  Algorithms             12
ISYS3207  IS Project             4
COMP5702  MIT Research Project   18
– Result relation can be the input for another relational algebra operation!
(Operator composition)
– Example: πname ( σcountry=‘AUS’ (Student) )
Result:
name
Ian
Grant
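The composition above can be sketched in a few lines of Python. This is only an illustration, not how a DBMS implements the operators: the list-of-dicts representation and the helper names `select` and `project` are assumptions made for the example; the data is the Student relation from the slide.

```python
# Hypothetical in-memory model: a relation is a list of dicts (one dict per tuple).
student = [
    {"sid": 1001, "name": "Ian", "gender": "M", "country": "AUS"},
    {"sid": 1002, "name": "Ha Tschi", "gender": "F", "country": "ROK"},
    {"sid": 1003, "name": "Grant", "gender": "M", "country": "AUS"},
    {"sid": 1004, "name": "Simon", "gender": "M", "country": "GBR"},
    {"sid": 1005, "name": "Jesse", "gender": "F", "country": "CHN"},
    {"sid": 1006, "name": "Franzisca", "gender": "F", "country": "GER"},
]


def select(relation, predicate):
    """Selection (sigma): keep the tuples that satisfy the predicate."""
    return [t for t in relation if predicate(t)]


def project(relation, attrs):
    """Projection (pi): keep only attrs and drop duplicate tuples (set semantics)."""
    seen, out = set(), []
    for t in relation:
        key = tuple(t[a] for a in attrs)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(attrs, key)))
    return out


# The output of select is itself a relation, so it can be fed straight into project.
result = project(select(student, lambda t: t["country"] == "AUS"), ["name"])
print(result)  # [{'name': 'Ian'}, {'name': 'Grant'}]
```

Because every operator takes relations in and returns a relation, arbitrarily deep expressions can be built the same way.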
1001  Cho Chung  M  AUS  47112344  Vera Chung  321
…     …          …  …    …         …           …
– The result schema is similar to that of the cross-product, but contains only one copy of the fields for which equality is specified.
Enrolled:     1002  COMP5702
UnitOfStudy:  COMP5318  Data Mining  6
result:       1002  COMP5702  MIT Research Project  18
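A minimal sketch of such a join, assuming a list-of-dicts representation and a hypothetical helper `equi_join` (neither is from the slides). It uses a plain nested loop and emits the join field only once, which is the schema property described above. The rows are a subset of the Enrolled and UnitOfStudy examples.

```python
# Hypothetical in-memory relations (subset of the slide data).
enrolled = [
    {"sid": 1001, "uos_code": "COMP5138", "semester": "2005-S2"},
    {"sid": 1002, "uos_code": "COMP5702", "semester": "2005-S2"},
    {"sid": 1006, "uos_code": "COMP5318", "semester": "2005-S2"},
]
unit_of_study = [
    {"uos_code": "COMP5138", "title": "Relational DBMS", "credit_points": 6},
    {"uos_code": "COMP5318", "title": "Data Mining", "credit_points": 6},
    {"uos_code": "COMP5702", "title": "MIT Research Project", "credit_points": 18},
]


def equi_join(left, right, attr):
    """Nested-loops join on left[attr] == right[attr].

    The join attribute appears only once in each output tuple.
    """
    out = []
    for l in left:
        for r in right:
            if l[attr] == r[attr]:
                merged = dict(l)
                # Copy every field of r except the join attribute itself.
                merged.update({k: v for k, v in r.items() if k != attr})
                out.append(merged)
    return out


joined = equi_join(enrolled, unit_of_study, "uos_code")
row = next(t for t in joined if t["sid"] == 1002)
print(row)
```

The output tuple for sid 1002 has five fields rather than the six a cross-product followed by a selection would produce.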
RA: πuosCode, sem(Assessment)
RA: σuosCode=‘INFO2120’(Assessment)
[Figure: query plan with ⋈ (NESTED LOOPS) over the inputs Student and Enrolled, using the rename operator ρE]
– Rename operator ρ (not essential)
• File Scan
• Access Paths
• Selections
• Index Scan
• Filter
[Figure: logical operators and their physical counterparts: σuosCode=‘DATA3404’ on Enrolled as FILTER(uosCode=‘DATA3404’), πname as PROJECT(name), and σsid=1234 on Student]
Access Paths
– Table Scan: Retrieve all pages of relation
– Table Scan + Filter (table scans typically include filtering)
– Index Scan: Use index with matching search key to find matching records (if available)
  - tree index?
  - hash index?
– Index-only scan: Use a covering index without accessing records (if it exists)
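The three access paths can be contrasted with a toy model. Everything here is an assumption made for illustration: the page layout (a list of pages, each a list of records), the function names, and the simplification that only data-page reads are counted (reads of the index itself are ignored).

```python
from collections import defaultdict

# Hypothetical storage: three data pages holding (sid, name, country) records.
pages = [
    [(1001, "Ian", "AUS"), (1002, "Ha Tschi", "ROK")],
    [(1003, "Grant", "AUS"), (1004, "Simon", "GBR")],
    [(1005, "Jesse", "CHN"), (1006, "Franzisca", "GER")],
]


def table_scan(pages, predicate):
    """Table scan + filter: read every page, keep matching records."""
    out, pages_read = [], 0
    for page in pages:
        pages_read += 1
        out.extend(rec for rec in page if predicate(rec))
    return out, pages_read


def build_hash_index(pages, key_pos):
    """Hash index: search key -> list of record ids (page number, slot)."""
    index = defaultdict(list)
    for page_no, page in enumerate(pages):
        for slot, rec in enumerate(page):
            index[rec[key_pos]].append((page_no, slot))
    return index


def index_scan(pages, index, key):
    """Index scan: look up record ids, then fetch only the pages that hold them."""
    rids = index.get(key, [])
    pages_read = len({page_no for page_no, _ in rids})
    return [pages[p][s] for p, s in rids], pages_read


def index_only_count(index, key):
    """Index-only scan: count matches from index entries, touching no data page."""
    return len(index.get(key, []))


sid_index = build_hash_index(pages, key_pos=0)
print(table_scan(pages, lambda rec: rec[0] == 1004))  # reads all 3 pages
print(index_scan(pages, sid_index, 1004))             # reads 1 page
print(index_only_count(sid_index, 1004))              # reads 0 data pages
```

A hash index like this one only supports equality lookups; a tree index would also support range predicates.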
Projections
SELECT R.sid, R.bid FROM Reserves R
– Trivial operation: just return subset of columns
[Figure: pages of the Person relation being read from disk into buffer frames (2 I/O shown). Step 1) Read relation into memory.]
General External Merge Sort
– To sort a file with N pages using B buffer pages:
– Pass 0: use B buffer pages.
Produce ⌈N / B⌉ sorted runs of B pages each.
– Pass 1, 2, …, etc.: merge B-1 runs.
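The two phases can be sketched as a small simulation. This is an assumption-laden toy, not a real implementation: the "file" is a list of pages (each a list of keys), disk is simulated with Python lists, each merged run is materialised in memory for simplicity, and the function name `external_merge_sort` is made up for the example. What it does preserve is the run structure: pass 0 sorts B pages at a time, and every later pass merges up to B-1 runs.

```python
import heapq
import random


def external_merge_sort(pages, B):
    """Sort a list of pages using the pass structure of external merge sort.

    Returns (sorted_pages, number_of_passes).
    """
    if B < 3:
        raise ValueError("need at least 3 buffer pages (2 inputs + 1 output)")
    if not pages:
        return [], 0
    page_size = max(1, max(len(p) for p in pages))

    def to_pages(records):
        return [records[i:i + page_size] for i in range(0, len(records), page_size)]

    # Pass 0: read B pages at a time, sort them in memory, write one sorted run.
    runs = []
    for i in range(0, len(pages), B):
        chunk = sorted(rec for page in pages[i:i + B] for rec in page)
        runs.append(to_pages(chunk))
    passes = 1

    # Passes 1, 2, ...: merge up to B-1 runs at a time (one buffer is the output).
    while len(runs) > 1:
        merged_runs = []
        for i in range(0, len(runs), B - 1):
            group = runs[i:i + B - 1]
            streams = [(rec for page in run for rec in page) for run in group]
            merged_runs.append(to_pages(list(heapq.merge(*streams))))
        runs = merged_runs
        passes += 1
    return runs[0], passes


# Usage: 36 pages of 2 keys each, sorted with B = 3 buffer pages.
keys = list(range(72))
random.Random(0).shuffle(keys)
file_pages = [keys[i:i + 2] for i in range(0, len(keys), 2)]
sorted_pages, passes = external_merge_sort(file_pages, B=3)
print(passes)
```

With B = 3 only two runs can be merged at a time, so the number of runs halves in each merge pass.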
[Figure: B main memory buffers between disk and disk: B-1 input buffers (INPUT 1, INPUT 2, ..., INPUT B-1) and one OUTPUT buffer]
[Figure: sort phase on Person with buffer size B = 3: three unsorted frames are read into the buffer in RAM (3 I/O), sorted, and written back as one sorted run (3 I/O)]
Sort Phase:
Read all N pages into memory, B pages at a time.
Save back to disk as N/B = 12 sorted runs of B = 3 pages.
Total cost is N reads and N writes = 2N I/O
N=36 pages
[Figure: merge step in the RAM buffer: two input frames (Input 1, Input 2) holding pages of sorted Person runs are merged into one output frame (Output)]
Merge pass 2:
6 sorted runs of 6 pages merged
to 3 sorted runs of 12 pages
N=36 (whole relation) pages read
N=36 (whole relation) pages written
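The worked example (N = 36, B = 3) can be tabulated pass by pass. The helper `sort_passes` below is a hypothetical name, and the sketch assumes the cost model used on the slides: every pass, including the last, reads and writes all N pages.

```python
from math import ceil


def sort_passes(N, B):
    """Return [(pass_no, runs_after_pass, io_of_pass), ...] for N pages, B buffers."""
    if B < 3:
        raise ValueError("need at least 3 buffer pages")
    runs = ceil(N / B)           # pass 0 produces ceil(N/B) sorted runs
    history = [(0, runs, 2 * N)]
    pass_no = 0
    while runs > 1:
        runs = ceil(runs / (B - 1))  # each merge pass combines B-1 runs into one
        pass_no += 1
        history.append((pass_no, runs, 2 * N))
    return history


history = sort_passes(36, 3)
for pass_no, runs, io in history:
    print(f"pass {pass_no}: {runs} run(s), {io} I/O")
total_io = sum(io for _, _, io in history)
print("total I/O:", total_io)  # 360
```

For this example the run counts are 12, 6, 3, 2, 1, so pass 2 ends with 3 runs, matching the figure above, and five passes of 72 I/O each are needed in total.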
4) (Harder) Can you derive a general formula for the amount of I/O?
– Textbook:
– Garcia-Molina/Ullman/Widom: Chapter 15
– Ramakrishnan/Gehrke: Chapter 14
– Kifer/Bernstein/Lewis: Chapter 10.3-10.7