
Solution 03


Database Systems WS 2024/25

Prof. Dr.-Ing. Sebastian Michel


M.Sc. Angjela Davitkova
Exercise 3: Handout 04.11.2024, Due 11.11.2024 12:00 CET https://dbis.cs.uni-kl.de
Lecture content: Videos up to #016

Question 1: Nested-Loop-Join and Hash-Join (1 P.)


We want to join the relations Invoice and Item. There are 100 000 invoices and 200 000 items. Each
tuple of each relation fits in exactly one page. The join buffer has 128 pages and the result tuples of the
join also fit into one page (We perform the projection during the join). For this join Item is the inner
relation and Invoice the outer.

a) Instead of counting all page accesses, we now estimate the number of sequential page
accesses (e.g., loading the first 128 pages of a table into the buffer counts as 1 sequential page
access).
The Block-Nested-Loop join from the lecture stores as many tuples of the outer relation as possible
in the buffer. Here, we instead fill the buffer with 50% inner tuples and 50% outer tuples.
Calculate the number of sequential page accesses required to compute the join.

Solution
One page in the buffer is reserved for the result tuples. The remaining 127 pages are split into 64
pages for the outer and 63 pages for the inner relation (or the other way around). For each 64-page
chunk of the outer relation we have to read the whole inner relation in 63-page chunks.
The outer relation has $\lceil 100\,000/64 \rceil = 1\,563$ of these chunks, while the inner relation has
$\lceil 200\,000/63 \rceil = 3\,175$.
Thereby, the total number of sequential page accesses is $1\,563 + 1\,563 \cdot 3\,175 = 4\,964\,088$.
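
The chunk arithmetic above can be checked with a few lines of Python. This is a minimal sketch; the names `ceil_div` and `bnl_sequential_accesses` are illustrative and not taken from the lecture.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b."""
    return -(-a // b)

def bnl_sequential_accesses(outer_pages, inner_pages, outer_buf, inner_buf):
    """Sequential accesses of a block nested loop join with a split buffer."""
    outer_chunks = ceil_div(outer_pages, outer_buf)  # one sequential read per outer chunk
    inner_chunks = ceil_div(inner_pages, inner_buf)  # inner reads needed per outer chunk
    return outer_chunks + outer_chunks * inner_chunks

# Invoice (outer): 100 000 pages, Item (inner): 200 000 pages, 64/63 buffer split
total = bnl_sequential_accesses(100_000, 200_000, outer_buf=64, inner_buf=63)
# The other way around: 63 pages for the outer, 64 pages for the inner relation
total_swapped = bnl_sequential_accesses(100_000, 200_000, outer_buf=63, inner_buf=64)
print(total, total_swapped)
```

For these particular sizes both ways of splitting the 127 pages yield the same number of sequential accesses.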

b) We now want to perform a hash join with the hash function mod k. Determine k such that
the buffer is utilized optimally, and calculate the number of sequential page accesses for the join.
You can ignore the write operations after partitioning and assume that all relevant attributes are
uniformly distributed natural numbers ≥ 1.

Solution
From the sizes of the relations, the uniform distribution, and the lossless join we can determine
that each invoice is joined with 2 items. We use the buffer optimally if it is filled with tuples
from Invoice and Item in a 1 : 2 ratio. 127 pages are available for tuples in the buffer. Thereby,
$\lfloor 127/3 \rfloor = 42$, so $42 \cdot 1 = 42$ Invoice tuples and $42 \cdot 2 = 84$ Item tuples should be stored in the buffer.
In total, 126 of the 127 available pages are used.
Since we want to end up with 42 Invoice pages in each bucket, k follows from
$\lceil 100\,000/42 \rceil = 2\,381$, resulting in k = 2 381.
To calculate the sequential reads, we first have to partition both relations by reading them with
$100\,000/127 + 200\,000/127$ sequential accesses. The k = 2 381 partitions of both relations are then
read and joined, resulting in another $2 \cdot 2\,381$ accesses.
In total, $100\,000/127 + 200\,000/127 + 2 \cdot 2\,381$ sequential read operations are performed.
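
To make the final expression concrete, the following sketch evaluates it numerically. The solution leaves the two quotients unrounded; rounding them up to whole accesses is an assumption made here, and all variable names are illustrative.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b."""
    return -(-a // b)

BUFFER_PAGES = 127                 # 128 buffer pages minus one output page
INVOICES, ITEMS = 100_000, 200_000

unit = BUFFER_PAGES // 3           # 42: the buffer is split 1 : 2
invoice_pages = unit * 1           # 42 pages for one Invoice partition
item_pages = unit * 2              # 84 pages for one Item partition
k = ceil_div(INVOICES, invoice_pages)

# Assumption: a partially filled buffer load still counts as one sequential access.
partition_reads = ceil_div(INVOICES, BUFFER_PAGES) + ceil_div(ITEMS, BUFFER_PAGES)
join_reads = 2 * k                 # one read per partition of each relation
total_reads = partition_reads + join_reads
print(k, partition_reads, join_reads, total_reads)
```

Under this rounding assumption the total evaluates to 788 + 1 575 + 4 762 = 7 125 sequential reads.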


Question 2: Join Implementation (1 P.)


This question requires you to implement code in any programming language you want. The code has to
compile and return the correct result. Submit the code as a separate file (or archive, if you have
more than one source file). If you use a different language than Java, please provide instructions on
how to compile and run your code. In OLAT, we provide a Java template with most of the boilerplate
code already in place.
Please delete all indices that were created on the tables lineitem and orders.

a) Implement the following query¹:

SELECT l_orderkey, l_shipdate, o_orderdate
FROM lineitem JOIN orders ON l_orderkey = o_orderkey

First you should load the required values from the TPC-H database into lists. You then execute a
nested loop join and an index-based nested loop join once. The index should be a suitable HashMap.

Solution
Pseudo code nested loop join:

result = []
for o in outer:
    for i in inner:
        # emit only pairs that satisfy the join condition
        if o.key == i.key:
            result << new Tuple(o, i)
return result

Pseudo code index-based:

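map = {}
for i in inner:
    # one entry per key; assumes the join key is unique in inner (as o_orderkey is in orders)
    map[i.key] = i

result = []
for o in outer:
    # constant-time probe instead of scanning the inner relation
    if o.key in map:
        result << new Tuple(o, map[o.key])
return result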
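
The two pseudo code variants can be turned into a small runnable Python sketch. It is not the Java template from OLAT; the function names and the toy rows (including the dates) are invented for illustration, and the index maps each key to a list so that non-unique keys would also work.

```python
from collections import defaultdict

def nested_loop_join(outer, inner, key_outer, key_inner):
    """Compare every outer tuple with every inner tuple."""
    result = []
    for o in outer:
        for i in inner:
            if o[key_outer] == i[key_inner]:
                result.append((o, i))
    return result

def hash_index_join(outer, inner, key_outer, key_inner):
    """Index-based nested loop join using a hash map on the inner relation."""
    index = defaultdict(list)
    for i in inner:                       # build phase: one pass over inner
        index[i[key_inner]].append(i)
    result = []
    for o in outer:                       # probe phase: one pass over outer
        for i in index.get(o[key_outer], []):
            result.append((o, i))
    return result

# Hypothetical toy rows: (l_orderkey, l_shipdate) and (o_orderkey, o_orderdate)
lineitem = [(1, "1996-03-13"), (1, "1996-04-12"), (2, "1997-01-28"), (4, "1996-01-10")]
orders = [(1, "1996-01-02"), (2, "1996-12-01"), (3, "1993-10-14")]

nlj = nested_loop_join(lineitem, orders, 0, 0)
hj = hash_index_join(lineitem, orders, 0, 0)
print(len(nlj), nlj == hj)
```

The nested loop join performs |outer| · |inner| comparisons, whereas the index-based variant touches each tuple of both relations only once.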

b) Measure the execution time of your implementations (including the index creation time) and compare
it to the time Postgres requires for the same query. How do you explain the differences?

Solution
We collected statistics from multiple systems and observed the following: the nested loop join requires
approx. 4 s, the hash join 20 ms, and Postgres 500 ms. The hash join is much faster than the nested loop
join, as we only have to iterate over the outer relation once and find the join partner in the inner
relation in constant time.
The slower time of Postgres is most probably due to the fact that we also measure the time Postgres
needs to load the tuples from disk.
¹ In the example code we only load tuples where orderkey < 50000; you may try bigger values if you want.


c) Which join algorithm did Postgres choose? Make Postgres perform a merge join and measure the
execution time.

Solution
Postgres performs a hash join by hashing the Orders tuples. To perform a merge join, the data
has to be readable in sorted order.
We can use either of the following two solutions to enable a sorted read:
• Create two indices, enabling an index-only scan:
CREATE INDEX index1 ON lineitem(l_orderkey, l_shipdate);
CREATE INDEX index2 ON orders(o_orderkey, o_orderdate);

• Or cluster the tables by orderkey:
CREATE INDEX index3 ON lineitem(l_orderkey);
CLUSTER lineitem USING index3;
CREATE INDEX index4 ON orders(o_orderkey);
CLUSTER orders USING index4;
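
Why sorted input matters can be seen in a small merge join sketch. This is not how Postgres implements its merge join; it is an illustrative Python version with made-up toy rows, which assumes both lists are already sorted by the join key.

```python
def merge_join(left, right, key=lambda row: row[0]):
    """Merge join of two lists that are already sorted by the join key."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        kl, kr = key(left[i]), key(right[j])
        if kl < kr:
            i += 1                        # left key too small, advance left
        elif kl > kr:
            j += 1                        # right key too small, advance right
        else:
            # Collect the whole group of equal keys (ties) on both sides.
            i_end = i
            while i_end < len(left) and key(left[i_end]) == kl:
                i_end += 1
            j_end = j
            while j_end < len(right) and key(right[j_end]) == kl:
                j_end += 1
            for l in left[i:i_end]:
                for r in right[j:j_end]:
                    result.append((l, r))
            i, j = i_end, j_end
    return result

# Hypothetical rows, sorted by orderkey (first component)
lineitem = [(1, "a"), (1, "b"), (2, "c"), (4, "d")]
orders = [(1, "x"), (2, "y"), (3, "z")]
joined = merge_join(lineitem, orders)
print(joined)
```

Both cursors only move forward, so each input is read once. That is only correct if the keys arrive in sorted order, which the indices or the clustering above provide.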

Question 3: Swapping Inputs of Join Operators at Runtime (1 P.)


Consider a natural join between two relations R(A, B) and S(B, C). As we know, the join operator is
associative and commutative, so regarding the results it does not matter whether we execute R ⋈ S or S ⋈ R.
Assume that the join algorithm started executing R ⋈ S. Discuss whether it is possible to swap
the two inputs during execution, and at which points of the algorithm this can be done. Do so for Nested
Loop Join as well as Merge Join (relations are already properly sorted).

Solution
Merge Join:
The merge join is naturally symmetric. The only thing to take care of is ties (runs of equal join keys):
they either have to be processed entirely before the swap, or the swap requires extra caution.
Nested Loop Join:
The nested loop join is not symmetric, but whenever the inner loop is completed for one specific tuple of the
outer relation, we can swap the inputs. However, we have to make sure not to join tuples a second time.
E.g.: If |R| = 10, |S| = 20, and we just completed the join for i_R = 2, then after swapping the inputs,
the nested loop would have to make sure not to join the first tuples R[0, 2] again:

for each s in S
    for each r in R[3,|R|]
        ...
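
The swap can be simulated to check that no pair is produced twice and none is lost. The following Python sketch uses invented relation contents and a hypothetical function `nlj_with_swap`; `swap_after=3` corresponds to having completed the outer tuples with index 0 to 2.

```python
def nlj_with_swap(R, S, swap_after, match):
    """Nested loop join of R and S that swaps its inputs after `swap_after` outer tuples."""
    result = []
    # Phase 1: R is the outer relation for its first `swap_after` tuples.
    for r in R[:swap_after]:
        for s in S:
            if match(r, s):
                result.append((r, s))
    # Phase 2: S becomes the outer relation; only the unprocessed part of R is scanned.
    for s in S:
        for r in R[swap_after:]:
            if match(r, s):
                result.append((r, s))
    return result

R = [(a, a % 4) for a in range(10)]      # R(A, B) with |R| = 10
S = [(b % 4, b) for b in range(20)]      # S(B, C) with |S| = 20

def match(r, s):
    """Natural join condition on the shared attribute B."""
    return r[1] == s[0]

swapped = nlj_with_swap(R, S, swap_after=3, match=match)
plain = [(r, s) for r in R for s in S if match(r, s)]
print(len(swapped), sorted(swapped) == sorted(plain))
```

The result contains the same pairs as the plain nested loop join, only in a different order.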


Question 4: External Sort (1 P.)

a) A file consisting of 100 000 blocks is to be sorted. The sorting should take no more than 4 merge
phases. Is this possible using standard external sort with an in-memory buffer of 20 pages? If it
is possible, what is the minimum buffer size required?

Solution
With $N = 100\,000$ and $n_B = 20$ we calculate:
$$n_R = \left\lceil \frac{N}{n_B} \right\rceil = \left\lceil \frac{100\,000}{20} \right\rceil = 5\,000$$
$$p = \lceil \log_{n_B - 1}(n_R) \rceil = \lceil \log_{19}(5\,000) \rceil = 3$$
Thereby, it is possible to merge the blocks in the specified number of phases. To get the minimum
number we may try sensible numbers or calculate it by using the phases formula.
$$p = \left\lceil \log_{n_B - 1}\left( \left\lceil \frac{N}{n_B} \right\rceil \right) \right\rceil = 4$$
By solving it for $n_B$, we get $n_B \approx 10.807$. Thereby, we need a minimum of 11 pages.
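
The "try sensible numbers" approach can be automated. The sketch below counts merge phases by repeated ceiling division, which is equivalent to the logarithm formula but avoids floating-point rounding; the function names are illustrative.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b."""
    return -(-a // b)

def merge_phases(n_blocks, buffer_pages):
    """Number of merge phases of standard external sort (pass 0 is not counted)."""
    runs = ceil_div(n_blocks, buffer_pages)          # runs after the initial sort
    phases = 0
    while runs > 1:
        runs = ceil_div(runs, buffer_pages - 1)      # merge n_B - 1 runs at a time
        phases += 1
    return phases

def min_buffer(n_blocks, max_phases):
    """Smallest buffer size that needs at most `max_phases` merge phases."""
    buffer_pages = 3                                 # smallest buffer that can merge two runs
    while merge_phases(n_blocks, buffer_pages) > max_phases:
        buffer_pages += 1
    return buffer_pages

phases_20 = merge_phases(100_000, 20)
smallest = min_buffer(100_000, 4)
print(phases_20, smallest)
```

With 10 pages the sort would need 5 merge phases, so 11 pages is indeed the first buffer size that stays within 4.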

b) Apply external sort with and without blocked I/O, for $N$ and $n_B$ as specified in the previous
question. For blocked I/O assume a buffer block size of b = 2. Specify the number of runs in each pass
of the algorithm. Which algorithm requires fewer phases?

Solution
The number of initial runs is given by $\lceil N/n_B \rceil$, where $N$ is the number of blocks of the file and $n_B$ is
the available size of main memory used for sorting (buffer size). In the merge phase, external
sort without blocked I/O merges the next $n_B - 1$ runs at a time, whereas with blocked I/O it merges
$\lfloor n_B/b \rfloor - 1$ runs (one block is reserved for the output block). For $n_B = 20$ and $b = 2$ this
gives 19 and 9 runs per merge step, respectively.
External sort without blocked I/O:
• Pass 0 (Initial Sorting): $\lceil 100\,000/20 \rceil = 5\,000$ runs.
• Pass 1: $\lceil 5\,000/19 \rceil = 264$ runs.
• Pass 2: $\lceil 264/19 \rceil = 14$ runs.
• Pass 3: merges the 14 runs.
External sort with blocked I/O:
• Pass 0 (Initial Sorting): $\lceil 100\,000/20 \rceil = 5\,000$ runs.
• Pass 1: $\lceil 5\,000/9 \rceil = 556$ runs.
• Pass 2: $\lceil 556/9 \rceil = 62$ runs.
• Pass 3: $\lceil 62/9 \rceil = 7$ runs.
• Pass 4: merges the 7 runs.
Clearly, external sort without blocked I/O requires fewer phases.
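
The run counts per pass can be reproduced with a short Python sketch. The function `runs_per_pass` and its `block_size` parameter are illustrative names; `block_size=1` stands for external sort without blocked I/O.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b."""
    return -(-a // b)

def runs_per_pass(n_blocks, buffer_pages, block_size=1):
    """Number of runs after each pass; index 0 is the initial sorting pass."""
    fan_in = buffer_pages // block_size - 1      # one buffer block is kept for output
    runs = [ceil_div(n_blocks, buffer_pages)]
    while runs[-1] > 1:
        runs.append(ceil_div(runs[-1], fan_in))
    return runs

plain = runs_per_pass(100_000, 20)                   # fan-in 19
blocked = runs_per_pass(100_000, 20, block_size=2)   # fan-in 9
print(plain)
print(blocked)
```

The last entry of each list is the single sorted run produced by the final merge, so the list length minus one is the number of merge phases.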


Question 5: V-Optimal Histograms (1 P.)

a) Given the following table of values and frequencies, calculate the V-optimal histogram for B = 2
cells. Use the algorithm that was discussed in the lecture.
Please write down P, PP, and a table with all calculated SSE*(i, k) values, and note when and how
each value was updated. As the final result, the histogram bounds should be provided.

Value  Frequency
1      9
2      15
3      2
4      17
5      15

Solution

k     0    1    2    3    4    5
P     0    9   24   26   43   58
PP    0   81  306  310  599  824

In the lecture we saw that we can calculate the SSE with the following formula:

$$SSE([i, j]) = PP[j] - PP[i-1] - (j-i+1) \cdot \left( \frac{P[j] - P[i-1]}{j-i+1} \right)^2$$

For k = 1 we do not have a choice on how to distribute the values, so we only have to calculate the
errors SSE(1, i):

k \ i     1      2      3       4       5
1      0.00  18.00  84.67  136.75  151.20
2         -      -      -       -       -
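
The prefix sums and the first table row can be recomputed with a few lines of Python. This is a sketch using the frequencies from the exercise; the helper name `sse` and the convention that an empty bucket has error 0 are choices made here for illustration.

```python
freq = [9, 15, 2, 17, 15]          # frequencies of the values 1..5

# P[i] = sum of the first i frequencies, PP[i] = sum of their squares
P, PP = [0], [0]
for f in freq:
    P.append(P[-1] + f)
    PP.append(PP[-1] + f * f)

def sse(i, j):
    """SSE of the bucket [i, j] (1-based, inclusive); 0 for an empty bucket."""
    n = j - i + 1
    if n <= 0:
        return 0.0
    s = P[j] - P[i - 1]
    return PP[j] - PP[i - 1] - s * s / n

first_row = [round(sse(1, i), 2) for i in range(1, 6)]
print(P, PP, first_row)
```

Because of the prefix sums, every call to `sse` takes constant time regardless of the bucket width.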

For k > 1 we now execute the algorithm as follows:

k  i  j  besterror[j][k−1]  SSE(j+1, i)  besterror[i][k]  Action
2  1  1               0.00         0.00             0.00  Initial
2  2  1               0.00         0.00             0.00  Initial
2  2  2              18.00         0.00             0.00  -
2  3  1               0.00        84.50             0.00  Initial
2  3  2              18.00         0.00            84.50  Replace
2  3  3              84.67         0.00            18.00  -
2  4  1               0.00       132.67             0.00  Initial
2  4  2              18.00       112.50           132.67  Replace
2  4  3              84.67         0.00           130.50  Replace
2  4  4             136.75         0.00            84.67  -
2  5  1               0.00       142.75             0.00  Initial
2  5  2              18.00       132.67           142.75  -
2  5  3              84.67         2.00           142.75  Replace
2  5  4             136.75         0.00            86.67  -
2  5  5             151.20         0.00            86.67  -


The full SSE ∗ (i, k) table now looks like:

k \ i     1      2      3       4       5
1      0.00  18.00  84.67  136.75  151.20
2      0.00   0.00  18.00   84.67   86.67

By looking at the best choices made, we can see that the buckets are [1, 3][4, 5] and have an error
of 86.667.
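
The whole computation can be sketched as a small dynamic program in Python. It is not necessarily identical to the lecture's algorithm: this version only considers non-empty buckets, so entries with fewer values than buckets stay at infinity instead of 0.00. The names `v_optimal`, `best`, and `split` are illustrative.

```python
def v_optimal(freq, buckets):
    """Return (minimal total SSE, list of 1-based inclusive bucket bounds)."""
    n = len(freq)
    P, PP = [0], [0]
    for f in freq:
        P.append(P[-1] + f)
        PP.append(PP[-1] + f * f)

    def sse(i, j):
        s = P[j] - P[i - 1]
        return PP[j] - PP[i - 1] - s * s / (j - i + 1)

    inf = float("inf")
    # best[k][i]: minimal error for the first i values using k non-empty buckets
    best = [[inf] * (n + 1) for _ in range(buckets + 1)]
    split = [[0] * (n + 1) for _ in range(buckets + 1)]
    best[0][0] = 0.0
    for k in range(1, buckets + 1):
        for i in range(1, n + 1):
            for j in range(k - 1, i):            # last bucket is [j + 1, i]
                cand = best[k - 1][j] + sse(j + 1, i)
                if cand < best[k][i]:
                    best[k][i] = cand
                    split[k][i] = j
    # Follow the stored split points backwards to recover the bounds.
    bounds, i = [], n
    for k in range(buckets, 0, -1):
        j = split[k][i]
        bounds.append((j + 1, i))
        i = j
    return best[buckets][n], bounds[::-1]

error, bounds = v_optimal([9, 15, 2, 17, 15], buckets=2)
print(round(error, 3), bounds)
```

For the frequencies of this exercise the sketch arrives at the same buckets and the same total error as the table above.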

b) Formally show that

$$SSE([i, j]) = \sum_{i \le k \le j} F[k]^2 - (j-i+1) \cdot AVG([i, j])^2$$

Solution

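\begin{align}
SSE([i, j]) &= \sum_{i \le k \le j} \left( F[k] - AVG([i, j]) \right)^2 \tag{1} \\
&= \sum_{i \le k \le j} \left( F[k]^2 - 2\, AVG([i, j]) \cdot F[k] + AVG([i, j])^2 \right) \tag{2} \\
&= \sum_{i \le k \le j} F[k]^2 - \sum_{i \le k \le j} 2\, AVG([i, j]) \cdot F[k] + \sum_{i \le k \le j} AVG([i, j])^2 \tag{3} \\
&= \sum_{i \le k \le j} F[k]^2 - 2\, AVG([i, j]) \cdot \sum_{i \le k \le j} F[k] + \sum_{i \le k \le j} AVG([i, j])^2 \tag{4} \\
&= \sum_{i \le k \le j} F[k]^2 - 2\, AVG([i, j]) \cdot (j-i+1)\, AVG([i, j]) + \sum_{i \le k \le j} AVG([i, j])^2 \tag{5} \\
&= \sum_{i \le k \le j} F[k]^2 - 2\, AVG([i, j]) \cdot (j-i+1)\, AVG([i, j]) + (j-i+1)\, AVG([i, j])^2 \tag{6} \\
&= \sum_{i \le k \le j} F[k]^2 - 2\, (j-i+1)\, AVG([i, j])^2 + (j-i+1)\, AVG([i, j])^2 \tag{7} \\
&= \sum_{i \le k \le j} F[k]^2 - (j-i+1)\, AVG([i, j])^2 \tag{8}
\end{align}

Step (5) uses $\sum_{i \le k \le j} F[k] = (j-i+1) \cdot AVG([i, j])$, which follows directly from the definition of the average.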
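
As a numeric sanity check of the identity (not a substitute for the formal proof), the following Python sketch compares the definition (1) with the closed form (8) on every interval of the frequencies from part a). The function names are illustrative.

```python
freq = [9, 15, 2, 17, 15]          # frequencies from part a)

def sse_definition(values):
    """Sum of squared deviations from the average, as in line (1)."""
    avg = sum(values) / len(values)
    return sum((f - avg) ** 2 for f in values)

def sse_closed_form(values):
    """Sum of squares minus count times squared average, as in line (8)."""
    avg = sum(values) / len(values)
    return sum(f * f for f in values) - len(values) * avg ** 2

# Largest deviation between both forms over all intervals [i, j] (0-based, inclusive)
max_diff = max(
    abs(sse_definition(freq[i:j + 1]) - sse_closed_form(freq[i:j + 1]))
    for i in range(len(freq))
    for j in range(i, len(freq))
)
print(max_diff < 1e-9)
```

Both forms agree up to floating-point rounding on all 15 intervals.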
