Solution 03
a) Instead of counting all page accesses, we now want to estimate the number of sequential page
accesses (e.g., loading the first 128 pages of a table into the buffer counts as 1 sequential page
access).
In the Block-Nested-Loop join from the lecture, as many tuples of the outer relation as possible
are stored in the buffer. Now we fill the buffer with 50% inner tuples and 50% outer tuples instead.
Calculate the number of sequential page accesses required to compute the join.
Solution
One page in the buffer is reserved for the result tuples. The remaining 127 pages are split into 64
pages for the outer and 63 pages for the inner relation (or the other way around). For each 64-page
chunk of the outer relation we have to read the whole inner relation in 63-page chunks.
The outer relation has $\lceil 100\,000/64 \rceil = 1\,563$ of these chunks, while the inner relation has
$\lceil 200\,000/63 \rceil = 3\,175$.
Thus, the total number of sequential page accesses is $1\,563 + 1\,563 \cdot 3\,175 = 4\,964\,088$.
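The arithmetic above can be checked with a short Python sketch. The function name and parameters are illustrative choices, not part of the exercise; the page counts (100 000 and 200 000) and the 64/63 buffer split are taken from the solution.

```python
import math

def bnl_sequential_accesses(outer_pages, inner_pages, outer_buf, inner_buf):
    """Sequential accesses of a block-nested-loop join with a split buffer."""
    # One sequential access per chunk of the outer relation.
    outer_chunks = math.ceil(outer_pages / outer_buf)
    # One full scan of the inner relation costs this many sequential accesses.
    inner_chunks = math.ceil(inner_pages / inner_buf)
    # Each outer chunk is read once and triggers one full inner scan.
    return outer_chunks + outer_chunks * inner_chunks

print(bnl_sequential_accesses(100_000, 200_000, 64, 63))  # 4964088
```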
b) We now want to perform a hash join with the hash function mod k. Determine k such that
the buffer is utilized optimally, and calculate the number of sequential page accesses for the join.
You can ignore the write operations after partitioning and assume that all relevant attributes are
uniformly distributed natural numbers ≥ 1.
Solution
From the sizes of the relations, the uniform distribution, and the lossless join we can determine
that each invoice is joined with 2 items. We use the buffer optimally if it is filled with pages
from Invoice and Item in a 1 : 2 ratio. 127 pages are available for tuples in the buffer. With
$\lfloor 127/3 \rfloor = 42$, we should store $42 \cdot 1 = 42$ pages of Invoice and $42 \cdot 2 = 84$ pages of Item in the buffer.
In total, 126 of the 127 available pages are used.
Since we want to end up with 42 pages of Invoice in each bucket, k is calculated as
$\lceil 100\,000/42 \rceil = 2\,381$, resulting in k = 2 381.
To calculate the sequential reads, we first have to partition both relations by reading them with
$100\,000/127 + 200\,000/127$ sequential accesses. The k = 2 381 partitions of both relations are then
read and joined, resulting in another $2 \cdot 2\,381$ accesses.
In total, $100\,000/127 + 200\,000/127 + 2 \cdot 2\,381$ sequential read operations are performed.
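As a rough numeric check, the following Python sketch evaluates k and the total number of sequential reads. The function name is illustrative. Rounding the two partitioning terms up to whole accesses is an assumption of this sketch, since the solution leaves the fractions unevaluated.

```python
import math

def hash_join_sequential_reads(invoice_pages, item_pages, buffer_pages):
    usable = buffer_pages - 1   # one page is reserved for result tuples
    unit = usable // 3          # 1 : 2 split, i.e. 42 Invoice and 84 Item pages
    k = math.ceil(invoice_pages / unit)
    # Assumption: partial chunks are rounded up to whole sequential accesses.
    partition_reads = math.ceil(invoice_pages / usable) + math.ceil(item_pages / usable)
    # Each of the k partitions is read once per relation during the join.
    join_reads = 2 * k
    return k, partition_reads + join_reads

k, total = hash_join_sequential_reads(100_000, 200_000, 128)
print(k, total)  # 2381 7125
```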
Database Systems WS 2024/25
Prof. Dr.-Ing. Sebastian Michel
M.Sc. Angjela Davitkova
Exercise 3: Handout 04.11.2024, Due 11.11.2024 12:00 CET https://dbis.cs.uni-kl.de
Lecture content: Videos up to #016
First you should load the required values from the TPC-H database into lists. You then execute a
nested loop join and an index-based nested loop join once. The index should be a suitable HashMap.
Solution
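Pseudo code nested loop join:
result = []
for o in outer:
    for i in inner:
        if o.key == i.key:          # join condition
            result << new Tuple(o, i)
return result
Pseudo code index-based nested loop join (HashMap on the join key of the inner relation):
map = {}
for i in inner:
    map[i.key] = i                  # assumes the key is unique in the inner relation
result = []
for o in outer:
    if o.key in map:
        result << new Tuple(o, map[o.key])
return result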
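A minimal runnable sketch of both variants in Python is given below. The `Row` type and the sample data are hypothetical stand-ins for the TPC-H tuples, and the index variant assumes that the join key is unique in the inner relation (as an order key would be).

```python
from collections import namedtuple

# Hypothetical tuple layout: a join key plus some payload.
Row = namedtuple("Row", ["key", "payload"])

def nested_loop_join(outer, inner):
    """Compare every outer tuple with every inner tuple: O(|outer| * |inner|)."""
    result = []
    for o in outer:
        for i in inner:
            if o.key == i.key:
                result.append((o, i))
    return result

def index_nested_loop_join(outer, inner):
    """Build a hash index on the inner relation, then probe it once per outer tuple."""
    index = {i.key: i for i in inner}  # assumes unique keys in the inner relation
    result = []
    for o in outer:
        match = index.get(o.key)
        if match is not None:
            result.append((o, match))
    return result

orders = [Row(1, "order-1"), Row(2, "order-2")]
lineitems = [Row(1, "item-a"), Row(1, "item-b"), Row(3, "item-c")]
print(index_nested_loop_join(lineitems, orders))
```

Both functions return the same pairs for this data; the index variant replaces the inner loop with a dictionary lookup.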
b) Measure the execution time of your implementations (including the index creation time) and
compare it to the time Postgres requires for the same query. How do you explain the differences?
Solution
We collected statistics from multiple systems and observed that the nested loop join requires
approx. 4 s, the hash join 20 ms, and Postgres 500 ms. The hash join is much faster than the nested
loop join because we only have to iterate over the outer relation once and find the join partner in
the inner relation in expected constant time.
The slower time of Postgres is most probably due to the fact that we also measure the time Postgres
needs to load the tuples from disk.
Footnote 1: In the example code we only load tuples where orderkey < 50000; you may try larger values if you want.
c) Which join algorithm did Postgres choose? Make Postgres perform a merge join and measure the
execution time.
Solution
Postgres performs a hash join by hashing the Orders tuples. To perform a merge join, the data
has to be readable in sorted order.
We can use the following two solutions to enable a sorted read:
• Create two indices, enabling an index-only scan.
CREATE INDEX index1 ON lineitem(l_orderkey, l_shipdate);
CREATE INDEX index2 ON orders(o_orderkey, o_orderdate);
Solution
Merge Join:
The merge join is naturally symmetric. The only thing to take care of is ties (several tuples with the
same join key): a group of tied tuples either has to be processed entirely first, or extra caution is needed.
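One way to handle ties is to collect the whole group of equal keys on both sides before advancing. The Python sketch below illustrates this under the assumption that both inputs are lists already sorted by the join key; the function name and the tuple layout (key in position 0) are illustrative.

```python
def merge_join(left, right, key=lambda t: t[0]):
    """Join two lists that are already sorted by key; duplicate keys are allowed."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        kl, kr = key(left[i]), key(right[j])
        if kl < kr:
            i += 1
        elif kl > kr:
            j += 1
        else:
            # Find the end of the tie group on each side.
            i_end = i
            while i_end < len(left) and key(left[i_end]) == kl:
                i_end += 1
            j_end = j
            while j_end < len(right) and key(right[j_end]) == kl:
                j_end += 1
            # Process the tie group entirely before moving on.
            for l in left[i:i_end]:
                for r in right[j:j_end]:
                    result.append((l, r))
            i, j = i_end, j_end
    return result

left = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
right = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]
print(len(merge_join(left, right)))  # 5
```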
Nested Loop Join:
The nested loop join is not symmetric, but whenever the inner loop is completed for one specific tuple of the
outer relation, we can swap the inputs. We then have to make sure not to join tuples a second time.
E.g.: If |R| = 10, |S| = 20, and we just completed the join for i_R = 2, then after swapping the inputs,
the nested loop has to skip the tuples of R at positions [0, 2], which have already been joined:
for each s in S
    for each r in R[3,|R|]
        ...
a) A file consisting of 100 000 blocks is to be sorted. The sorting should take no more than 4 merge
phases. Is this possible using standard external sort with an in-memory buffer of 20 pages, and if it
is possible, what is the minimum buffer size required?
Solution
With $N = 100\,000$ and $n_B = 20$ we calculate:
$$n_R = \left\lceil \frac{N}{n_B} \right\rceil = \left\lceil \frac{100\,000}{20} \right\rceil = 5\,000$$
$$p = \lceil \log_{n_B - 1}(n_R) \rceil = \lceil \log_{19}(5\,000) \rceil = 3$$
Thereby, it is possible to merge the blocks in the specified number of phases. To get the minimum
number we may try sensible numbers or calculate it by using the phases formula.
$$p = \left\lceil \log_{n_B - 1}\left( \left\lceil \frac{N}{n_B} \right\rceil \right) \right\rceil = 4$$
By solving it for $n_B$, we get $n_B \approx 10.807$. Thereby, we need a minimum of 11 pages.
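The "try sensible numbers" approach can be sketched in Python as follows. The function name is illustrative, and the sketch counts merge phases by repeated ceiling division instead of a floating-point logarithm, which gives the same result as the phases formula.

```python
import math

def merge_phases(n_blocks, n_buffer):
    """Number of merge phases of standard external sort (fan-in n_buffer - 1)."""
    runs = math.ceil(n_blocks / n_buffer)  # initial sorted runs
    fan_in = n_buffer - 1                  # one buffer page is used for output
    phases = 0
    while runs > 1:
        runs = math.ceil(runs / fan_in)
        phases += 1
    return phases

print(merge_phases(100_000, 20))  # 3

# Smallest buffer size (starting at 3 pages, so that the fan-in is at least 2)
# that needs at most 4 merge phases.
min_buffer = next(n for n in range(3, 100) if merge_phases(100_000, n) <= 4)
print(min_buffer)  # 11
```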
b) Apply external sort with and without blocked I/O, for N and nB as specified in the previous
question. For blocked I/O assume a buffer block of b = 2. Specify the number of runs in each pass
of the algorithm. Which algorithm requires fewer phases?
Solution
The number of initial runs is given by $\lceil N / n_B \rceil$, where $N$ is the number of blocks of the file and $n_B$ is
the available size of main memory used for sorting (buffer size). In the merge phase, external
sort without blocked I/O merges $n_B - 1$ runs at a time, whereas with blocked I/O it merges
$\lfloor n_B / b \rfloor - 1$ runs at a time (one block is reserved for the output).
External sort without blocked I/O:
• Pass 0 (Initial Sorting): $\lceil 100\,000 / 20 \rceil = 5\,000$ runs.
• Pass 1: $\lceil 5\,000 / 19 \rceil = 264$ runs.
• Pass 2: $\lceil 264 / 19 \rceil = 14$ runs.
• Pass 3: merges the 14 runs.
External sort with blocked I/O:
• Pass 0 (Initial Sorting): $\lceil 100\,000 / 20 \rceil = 5\,000$ runs.
• Pass 1: $\lceil 5\,000 / 9 \rceil = 556$ runs.
• Pass 2: $\lceil 556 / 9 \rceil = 62$ runs.
• Pass 3: $\lceil 62 / 9 \rceil = 7$ runs.
• Pass 4: merges the 7 runs.
Clearly, external sort without blocked I/O requires fewer merge phases (3 instead of 4).
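The run counts of both variants can be reproduced with a small Python sketch. The function name and the `block_size` parameter (the buffer block b) are illustrative; the last list entry is the single run left after the final merge.

```python
import math

def runs_per_pass(n_blocks, n_buffer, block_size=1):
    """Number of runs after pass 0, pass 1, ... until a single run remains."""
    fan_in = n_buffer // block_size - 1    # one (blocked) buffer slot is the output
    runs = math.ceil(n_blocks / n_buffer)  # pass 0: initial sorted runs
    passes = [runs]
    while runs > 1:
        runs = math.ceil(runs / fan_in)
        passes.append(runs)
    return passes

print(runs_per_pass(100_000, 20))                # [5000, 264, 14, 1]
print(runs_per_pass(100_000, 20, block_size=2))  # [5000, 556, 62, 7, 1]
```

The number of merge phases is the list length minus one, i.e. 3 without and 4 with blocked I/O.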
a) Given the following table of values and frequencies, calculate the v-optimal histogram for B = 2
cells. Use the algorithm that was discussed in the lecture.
Please write down P, PP and a table with all calculated SSE*(i, k) values and note when and how
each value was updated. As the final result, the histogram bounds should be provided.
Value Frequency
1 9
2 15
3 2
4 17
5 15
Solution
k 0 1 2 3 4 5
P 0 9 24 26 43 58
PP 0 81 306 310 599 824
In the lecture we saw that we can calculate the SSE with the following formula:
$$SSE([i, j]) = PP[j] - PP[i-1] - (j - i + 1) \cdot \left( \frac{P[j] - P[i-1]}{j - i + 1} \right)^2$$
For k = 1 we do not have a choice on how to distribute the values. So we only have to calculate the
errors with SSE(1, i):
k,i 1 2 3 4 5
1 0.00 18.00 84.67 136.75 151.20
2 - - - - -
k,i 1 2 3 4 5
1 0.00 18.00 84.67 136.75 151.20
2 0.00 0.00 18.00 84.67 86.67
By looking at the best choices made, we can see that the buckets are [1, 3] and [4, 5], with a total
error of 86.667.
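The table can be reproduced with a short dynamic-programming sketch in Python. This is one possible formulation using the prefix sums P and PP and the SSE formula from above, not necessarily the exact variant from the lecture; the function and variable names are illustrative, and only entries with at least as many values as buckets are computed.

```python
def v_optimal(freqs, buckets):
    """Return (minimal total SSE, bucket bounds) using 1-based inclusive positions."""
    n = len(freqs)
    P = [0] * (n + 1)   # prefix sums of the frequencies
    PP = [0] * (n + 1)  # prefix sums of the squared frequencies
    for idx, f in enumerate(freqs, start=1):
        P[idx] = P[idx - 1] + f
        PP[idx] = PP[idx - 1] + f * f

    def sse(i, j):
        length = j - i + 1
        s = P[j] - P[i - 1]
        return PP[j] - PP[i - 1] - s * s / length

    inf = float("inf")
    best = [[inf] * (n + 1) for _ in range(buckets + 1)]
    split = [[0] * (n + 1) for _ in range(buckets + 1)]
    best[0][0] = 0.0
    for k in range(1, buckets + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):  # the last bucket covers positions j+1..i
                cand = best[k - 1][j] + sse(j + 1, i)
                if cand < best[k][i]:
                    best[k][i] = cand
                    split[k][i] = j

    # Backtrack the chosen split points to recover the bucket bounds.
    bounds, i = [], n
    for k in range(buckets, 0, -1):
        j = split[k][i]
        bounds.append((j + 1, i))
        i = j
    return best[buckets][n], bounds[::-1]

error, bounds = v_optimal([9, 15, 2, 17, 15], 2)
print(round(error, 3), bounds)  # 86.667 [(1, 3), (4, 5)]
```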
Solution
\begin{align}
SSE([i, j]) &= \sum_{i \le k \le j} \left( F[k] - AVG([i, j]) \right)^2 \\
&= \sum_{i \le k \le j} \left( F[k]^2 - 2 \, AVG([i, j]) \cdot F[k] + AVG([i, j])^2 \right) \\
&= \sum_{i \le k \le j} F[k]^2 - \sum_{i \le k \le j} 2 \, AVG([i, j]) \cdot F[k] + \sum_{i \le k \le j} AVG([i, j])^2 \\
&= \sum_{i \le k \le j} F[k]^2 - 2 \, AVG([i, j]) \cdot \sum_{i \le k \le j} F[k] + \sum_{i \le k \le j} AVG([i, j])^2 \\
&= \sum_{i \le k \le j} F[k]^2 - 2 \, AVG([i, j]) \cdot (j - i + 1) \, AVG([i, j]) + \sum_{i \le k \le j} AVG([i, j])^2 \\
&= \sum_{i \le k \le j} F[k]^2 - 2 \, AVG([i, j]) \cdot (j - i + 1) \, AVG([i, j]) + (j - i + 1) \, AVG([i, j])^2 \\
&= \sum_{i \le k \le j} F[k]^2 - 2 \, (j - i + 1) \, AVG([i, j])^2 + (j - i + 1) \, AVG([i, j])^2 \\
&= \sum_{i \le k \le j} F[k]^2 - (j - i + 1) \, AVG([i, j])^2
\end{align}
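The identity between the first and the last line of the derivation can be spot-checked numerically. The Python sketch below compares both forms on the first three frequencies of the example table; the function names are illustrative.

```python
import math

def sse_direct(freqs):
    """Sum of squared deviations from the average (first line of the derivation)."""
    avg = sum(freqs) / len(freqs)
    return sum((f - avg) ** 2 for f in freqs)

def sse_closed(freqs):
    """Sum of squares minus (j - i + 1) * AVG^2 (last line of the derivation)."""
    avg = sum(freqs) / len(freqs)
    return sum(f * f for f in freqs) - len(freqs) * avg ** 2

values = [9, 15, 2]
print(math.isclose(sse_direct(values), sse_closed(values)))  # True
```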