2019 Spring Final Sol
Question Booklet
100 points (20% of course grade) + 20 points extra credit
Friday, May 3, 2019
●This exam booklet is printed single-sided. You may use the back of the pages as scratch space or
extra space (add a pointer if you are writing your answer there).
● The exam is open-book and open-notes; any written materials may be used. No electronic materials
can be used.
● You have 180 minutes to complete the exam.
● The problems do not necessarily come in increasing order of difficulty. You might want to look
through the entire exam before getting started, in order to plan your strategy.
● There is no penalty for guessing answers to questions. However, for short-answer questions,
simplicity and clarity of solutions will count. You may get as few as 0 points for a problem if your
solution is far more complicated than necessary.
● No explanations are needed unless asked for explicitly. But you are welcome to write 1-2 sentences if
you think the question does not specify all assumptions, to clarify your answer for possible partial/full
credit.
NAME (please print):
In accordance with both the letter and spirit of the Duke Community Standard, I have neither given nor
received assistance on this exam.
SIGNATURE:
Page 1 of 19
Problem 0 (1 point)
We fell a little short of the number of course evaluations needed for the promised free points, but as a thank
you to all of you who did complete the course evaluation, you will all receive one free point!
Problem 1: For each statement below, indicate whether it is “T” (true) or “F” (false).
No explanations needed. But if you are unsure or think either answer may be correct depending on the
context, you can give a 1 line explanation.
Problem 1a ( 2 points)
Given a table R(A, B, C) that has a B+ tree index on R(A, B), then there is no need to create a B+ tree index
on R(B).
ANS: F.
For answering queries like select * from R where B > X, index on R(A, B) does not help.
Problem 1b ( 3 points)
Given three tables R(A, B), S(B, C), and T(C, D), consider Selinger’s dynamic programming algorithm that
optimizes the natural join of the three tables. Suppose there are two sub query plans P1 and P2 for joining R
and S: P1 sorts the join result by B, while P2 sorts the join result by C, and P2 costs more than P1. We should
always discard P2 for better performance.
ANS: T. In Selinger’s algorithm we only consider optimal subplans so we should discard P2.
NOTE: We also gave points for F with the explanation that we cannot discard P2 from further consideration
because P1 and P2 produce relevant yet different interesting orders.
Problem 1c ( 4 points)
(2 points) If a DBMS uses steal and force policy, it never has to redo changes of a committed transaction.
Ans: T
(2 points) If a DBMS uses steal and force policy, it never has to undo changes of an uncommitted transaction.
Ans: F
Problem 1d ( 10 points)
Consider the following schedule involving three transactions T1, T2, T3.
False:
T1 releases lock on A after w1(A), because T2 needs it; and T3 acquires lock on B before r3(B), then T1 has
to acquire lock on B after w3(B), which is not allowed in 2PL.
If T1 retains the lock on A until it is done with w1(B), w2(A) cannot execute.
Problem 1e (2 points)
Given relations R(A, B), S(B, C), the following relational algebra expression is valid: πB(σA=5(R)) − σB=5(S).
Answer: False. The left operand has schema (B) while the right operand has schema (B, C), so the set
difference is not defined.
Problem 1f (6 points)
(3 points) Given relations R and S (without duplicates), R∩S = R − ((R − S ) ⋃ (S − R)) , where ⋃ , ∩ ,
and − denote set union, set intersection, and set difference
True
(3 points) Given relations R and S (with possible duplicates, bag semantics), if R has m copies of 1 and S
has n copies of 1, then for all values of m and n such that m >= n,
R ⋒ S = R − ((R − S) ⋓ (S − R)), where ⋓, ⋒, and − denote bag union, bag intersection, and bag
difference.
Answer: True. (R − S has m − n copies of 1 and S − R has none, so the RHS removes m − n copies from R,
leaving n = min(m, n) copies.)
(3 points) In the above question, for all values of m and n such that m < n,
R ⋒ S = R − ((R − S) ⋓ (S − R)), where ⋓, ⋒, and − denote bag union, bag intersection, and bag
difference.
Answer: False.
e.g., assume R is {1}, S is {1, 1}: LHS = {1}, but RHS = {1} − {1} = emptyset.
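The bag identities above can be checked mechanically; here is a small Python sketch (not part of the exam) that models bags as `collections.Counter` objects, where `+` plays the role of bag union, `-` bag difference (clamped at zero), and `&` bag intersection:

```python
from collections import Counter

# Bags (multisets) as Counters: '+' is bag union (adds multiplicities),
# '-' is bag difference (clamped at zero), '&' is bag intersection (min).

def rhs(R, S):
    # R - ((R - S) union (S - R)), the right-hand side of the identity
    return R - ((R - S) + (S - R))

# Case m >= n: R has two copies of 1, S has one.
R, S = Counter({1: 2}), Counter({1: 1})
assert rhs(R, S) == (R & S) == Counter({1: 1})   # identity holds

# Case m < n (the counterexample above): R = {1}, S = {1, 1}.
R, S = Counter({1: 1}), Counter({1: 2})
assert (R & S) == Counter({1: 1})    # LHS
assert rhs(R, S) == Counter()        # RHS is empty: identity fails
```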
Problem 1g (6 points)
Consider the following two XPath queries:
(3 points) Every element returned by the first query will be returned by the second.
Answer: False
The first query will return this element; the second query will not.
(3 points) Every element returned by the second query will be returned by the first.
Answer: True
The C and D elements that make an A element satisfy the condition in the second query will also make the
same A element satisfy the condition in the first query.
Problem 2: SQL etc. ( points 10)
Consider the schema of a database for course registration:
• Course: C(cid, name, year, capacity) (capacity denotes the maximum possible number of students). Note:
we clarified that cid is not the key here.
The following SQL query outputs the yearly enrollment rate (as e_rate) of the courses across different years.
SOLUTION:
(i) (3 points) How will you modify the given SQL query, if for each output tuple (c, y, r), course id c
should have enrollment rate r >= 0.9 in year y? (No need to rewrite the query, just state the changes).
SELECT C.cid, C.year, COUNT(S.sid) / C.capacity as e_rate
FROM Course C, Student S, Registration R
WHERE C.cid = R.cid AND R.sid = S.sid
GROUP BY C.cid, C.year
HAVING (COUNT(S.sid) / C.capacity) >= 0.9
or use the given query as a subquery and select on e_rate
(ii) (4 points) Using the given SQL query, output course ids c whose enrollment rate did not reach 0.9 in any
year. (You can assume that the given query result has been materialized as a view TEMP and use TEMP in
your new query.)
SELECT TEMP.cid FROM TEMP GROUP BY TEMP.cid HAVING
MAX(TEMP.e_rate) < 0.9
or
(iii) (3 points) Draw a query plan tree for the given SQL query that IS NOT CONSIDERED by Selinger’s
dynamic programming algorithm for the joins of C, S, and R.
NOTE: We did not deduct points if someone did not include the top-most aggregate step.
ta_id and student_id denote the ids of the TA and the student. For each homework, we have multiple
problems. Each answer may include the solutions to multiple problems (as in Gradescope). For each problem
in each homework submitted by a student, there is a grade, and if it is late then overdue = 1.
We denote this schema by DTAHPSGO by taking the first letter from each attribute. Suppose there are
following functional dependencies.
(i) (3 points): For the following example table, are there any rows violating the stated functional
dependencies? If so, identify any one violation, and state which functional dependency is violated,
and which duty_id’s violate them.
duty_id (D) TA_id (T) answer_id (A) homework_id (H) problem_id (P) student_id (S) grade (G) overdue (O)
1002 Jesse 105 2 3 5002 85 No
(ii) (4 points): Compute a BCNF decomposition of this table schema DTAHPSGO using the three FDs.
Show the steps.
Option 1:
From A -> SH and HT -> P we have AT -> SHT -> P, so AT -> P.
Option 2:
Using A -> SH, decompose DTAHSGO and get ASH and DTGO
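The AT -> P inference in Option 1 is an attribute-closure computation. A minimal Python sketch (an illustration, using only the two FDs that appear explicitly in the solution text, A -> SH and HT -> P):

```python
def closure(attrs, fds):
    """Closure of attribute set `attrs` under FDs given as (lhs, rhs) set pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If the whole LHS is in the closure, pull in the RHS.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

# Only the two FDs that appear explicitly above: A -> SH and HT -> P.
fds = [(set("A"), set("SH")), (set("HT"), set("P"))]

assert "P" in closure(set("AT"), fds)   # so AT -> P, as Option 1 derives
```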
(iii) (3 points): Consider a simplified version of this schema with only the attributes DTH, and one
functional dependency D -> TH. If we decompose the table into DT and TH, is it a lossless join
decomposition? Explain your answer.
No. Consider the instance:
DTH:
duty_id TA_id homework_id
1001 Ray 1
1002 Ray 2
DT:
duty_id TA_id
1001 Ray
1002 Ray
TH:
TA_id homework_id
Ray 1
Ray 2
DT join TH:
1001 Ray 1
1001 Ray 2
1002 Ray 1
1002 Ray 2
Rows 2 and 3 are spurious tuples not in the original DTH instance, so the decomposition is not lossless.
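The counterexample can be replayed in a few lines of Python (an illustrative sketch; values as in the solution's tables):

```python
# Decompose the DTH instance from the solution into DT and TH,
# then natural-join them back on TA_id.
dth = [("1001", "Ray", "1"), ("1002", "Ray", "2")]
dt = {(d, t) for d, t, h in dth}
th = {(t, h) for d, t, h in dth}

rejoined = sorted((d, t, h) for d, t in dt for t2, h in th if t == t2)
assert len(rejoined) == 4                  # two spurious tuples appeared
assert ("1001", "Ray", "2") in rejoined    # spurious row 2
assert ("1002", "Ray", "1") in rejoined    # spurious row 3
```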
<!DOCTYPE Registration [
<!ELEMENT Registration (Course+)>
<!ELEMENT Course (Name, Student* )>
<!ATTLIST Course Capacity CDATA #REQUIRED>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Student (Name, Grade)>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Grade (#PCDATA)>
]>
(i) (4 points) Find XPath expressions that are equivalent to the XQuery below. Select ALL OPTIONS that
are correct (or say that none are correct).
for $c in /Registration/Course
return
if (exists($c/Student[Grade >= 90 and Grade < 95])) then $c/Name
A. /Registration/Course[Student[Grade >= 90 and Grade < 95]]/Name
B. /Registration/Course[./Student/Grade >= 90 and ./Student/Grade < 95]/Name
C. /Registration/Course[Student[Grade >= 90][Grade < 95]]/Name
D. /Registration/Course[count(./Student[Grade < 90 or Grade >= 95]) = 0]/Name
ANS: AC
B is wrong because there can be multiple students in a course, and its condition only checks that some
student's grade is >= 90 and some (possibly different) student's grade is < 95.
D is wrong because the count of students outside the range [90, 95) need not be 0; it suffices that one
student falls inside the range.
(ii) (5 points) Write a query using XQuery to find courses where at least half of the enrolled students
received their highest score among all courses they took. You may use some of the aggregate functions:
count(), min(), max(), where aggr(nodeset) returns a single value. Assume that the names of students are
unique.
for $c in /Registration/Course
let $cnt := count(
for $s in $c/Student
let $g := max(//Student[Name = $s/Name]/Grade)
where $g <= $s/Grade
return $s/Name
)
let $cnt2 := count(for $s in $c/Student return $s)
where $cnt >= 0.5 * $cnt2
return $c/Name
Problem 5: LOGS (8 points)
Consider UNDO/REDO logging with fuzzy checkpointing. Recall that update record (T, A, o, n) denotes
that transaction T changed the value of A from o to n.
At the time of a system crash, let the log segment involving four transactions S, T, U, V be as follows.
1. (START S)
3. (S, Y, 5, 10)
4. (COMMIT S)
5. (START T)
7. (START CKPT(T))
9. (START U)
10. (COMMIT T)
14. (START V)
(i) (4 pts) Fill out the table below by writing “Guaranteed”, “Maybe”, or “Impossible” for each update,
indicating whether that updated value of the variable was written to disk by the time of the crash:
2 X = 20
3 Y = 10
6 X = 30
8 Y = 20
11 X = 40
13 Y = 30
17 Y = 40
18 Y = 50
(ii) (2 pts) After the REDO stage of recovery, what will be the value of X and Y?
X =
Y =
(iii) (2 pts) After the UNDO stage of recovery, what will be the value of X and Y?
X =
Y =
Solution:
(i) Since the second checkpoint did not complete, we only have a guarantee for the updates before
the first START CKPT. All other updates might have been written to disk while the second
checkpoint was running.
2 X = 20 Guaranteed
3 Y = 10 Guaranteed
6 X = 30 Guaranteed
8 Y = 20 Maybe
11 X = 40 Maybe
13 Y = 30 Maybe
17 Y = 40 Maybe
18 Y = 50 Maybe
(ii) X = 40
Y = 50
S committed before the first START CKPT, so it is complete. T and U committed as well by the time
of the crash, and all changes made prior to the START CKPT are guaranteed to be on disk, so we
do not need to go beyond the START CKPT at step 7. From this point, we start “repeating
history”, i.e., make all updates in order; as a result we get X = 40 (line 11) and Y = 50 (line 18).
(iii)
The only uncommitted transaction is V, so the UNDO step focuses on V and undoes its changes in
reverse order. It first reverts Y from 50 to 40 (line 18), then reverts Y from 40 to 30
(line 17). So the final values will be
X = 40
Y = 30.
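The REDO/UNDO arithmetic can be replayed in Python. The update records below are a reconstruction from the table in part (i): the transaction labels and old values are inferred from the log order (the excerpt omits these log lines), and the initial value of X is unknown but never needed.

```python
# Replay REDO/UNDO for Problem 5. Update records reconstructed from
# part (i); transaction labels and old values are inferred assumptions.
updates = [
    (2,  "S", "X", None, 20),   # initial value of X unknown
    (3,  "S", "Y", 5,    10),
    (6,  "T", "X", 20,   30),
    (8,  "T", "Y", 10,   20),
    (11, "U", "X", 30,   40),
    (13, "U", "Y", 20,   30),
    (17, "V", "Y", 30,   40),
    (18, "V", "Y", 40,   50),
]
committed = {"S", "T", "U"}   # V had not committed at the crash

db = {}
# REDO: repeat history, applying every update in log order.
for line, txn, var, old, new in updates:
    db[var] = new
assert (db["X"], db["Y"]) == (40, 50)   # values after the REDO stage

# UNDO: revert updates of uncommitted transactions in reverse log order.
for line, txn, var, old, new in reversed(updates):
    if txn not in committed:
        db[var] = old
assert (db["X"], db["Y"]) == (40, 30)   # values after the UNDO stage
```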
FROM R, S
Suppose that all attributes in R(A, B), S(B, C) are integers, the range of S.C is [11, 210], and the tuples are
uniformly distributed.
(i). (9 points) Assume that we have infinite memory, no indexes, and our only join algorithm is a nested
loops join. The database optimizer decides to execute the query with the following query plan. When X =
208, for each table and operator, estimate the number of output tuples and the I/O Cost (in terms of number
of pages) of each operator in the query plan.
The tables have the following number of tuples and pages on disk:
   tuples   pages
R  1,000    100
S  50,000   5,000
tuples: ________
cost: ________
S
Ans:
cost: ___0_____
For selection on S, the estimated number of tuples in the selection result is (210-208) / (210 - 11+1)
* 50000 = 500. And we have to scan the entire table.
For R join S, the result can be stored in memory so there is no I/O cost in this phase.
Now the solution needed some additional assumptions. If you assume that B is the primary key of R and a
foreign key in S, then since there are 50,000 tuples in S and 1,000 tuples in R, the estimated number of
matching tuples in R for every tuple in S is 0.02. So the estimated number of tuples in the final result is
500 * 0.02 = 10.
(note the uniformity assumption.)
If you assume some distinct number of values of B in S and R, the answer would be different.
Since the question was under-specified, everyone got 1.5 points for the top-most sub-question on #tuples.
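The estimates above, spelled out in a short Python sketch (the PK/FK step follows the solution's own arithmetic and its stated assumptions):

```python
lo, hi, x = 11, 210, 208          # S.C uniform over [11, 210], X = 208
n_s = 50_000                      # tuples in S

# Selectivity of S.C > X: 2 of the 200 equally likely values qualify.
sel = (hi - x) / (hi - lo + 1)
est_selected = sel * n_s
assert est_selected == 500

# PK/FK assumption from the solution: 1000/50000 = 0.02 matching
# R tuples per selected S tuple.
n_r = 1_000
est_join = est_selected * (n_r / n_s)
assert est_join == 10
```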
(ii) (8 points) Now assume that the memory has only 10 pages, one index lookup requires 4 I/O, and after
the selection S.C > X, the results are written back to disk. Suppose there is a clustered B+tree index on S(B).
Ignore page boundaries. Suppose the options are
● hash join
● index nested loop join
a. (4 points) Which join algorithm will you choose when X = 208 to have less I/O? Write costs or
explanations to justify your answer.
Ans:
2.a: Use hash join. When X = 208, there are 500 tuples in the selected part of S, fitting in 50 pages; R has
1,000 tuples fitting in 100 pages. In total we need about sqrt(min(50, 100)) + 2 ≈ 10 pages in memory, so
hash join can be performed; cost = 3 * (50 + 100) = 450.
The cost of indexed nested loop join is B(R) + |R|* (4 + 1) = 100 + 1,000 * 5 = 5,100
1 R tuple joins with at most 5 S tuples, which would fit in one page (ignoring page boundary).
When X = 110, there are 25,000 tuples in the selected part of S, fitting in 2,500 pages; R has 1,000 tuples
fitting in 100 pages. In total we need sqrt(min(2500, 100)) + 2 = 12 > 10 pages in memory, so hash join
cannot be performed.
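The cost comparison can be sketched numerically in Python (the feasibility rule sqrt(min(B1, B2)) + 2 <= M and the cost formulas follow the solution: 3 passes for hash join; 4 I/Os per index lookup plus 1 to fetch matching tuples for INLJ):

```python
import math

M = 10  # memory pages available

def hash_join_feasible(b1, b2):
    # Grace hash join needs roughly sqrt(size of the smaller input) + 2 pages.
    return math.sqrt(min(b1, b2)) + 2 <= M

def hash_join_cost(b1, b2):
    return 3 * (b1 + b2)   # partition both inputs, then read partitions back

def inlj_cost(b_outer, n_outer, lookup_io=4):
    # Scan the outer once; one index lookup + one fetch per outer tuple.
    return b_outer + n_outer * (lookup_io + 1)

# X = 208: selection leaves 50 pages of S; R is 100 pages, 1,000 tuples.
assert hash_join_feasible(50, 100)            # sqrt(50) + 2 < 10
assert hash_join_cost(50, 100) == 450
assert inlj_cost(100, 1_000) == 5_100         # so hash join wins

# X = 110: selection leaves 2,500 pages of S.
assert not hash_join_feasible(2500, 100)      # sqrt(100) + 2 = 12 > 10
```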
Suppose you have a table R(A, B, C) in London, and the table S(B, E, F) in Paris and we want to do an
equijoin. Here are the steps:
Semi-join, step 1: At London, project R onto the join column B. Send the projection R1(B) to Paris.
Bloom-join:
1. At London: compute a bit-vector of some size k. Hash R.B values into the range 0 to k-1; if
some tuple hashes to the p-th bucket, set bit p to 1 (p from 0 to k-1). Ship the bit-vector to Paris.
2. At Paris: hash each S.B value similarly and discard the S tuples that hash to a 0 in R's bit-vector.
Let the resulting subset of S be S1. Ship S1(B, E, F) to London.
(i) (3 points) State one reason why bloom-join can be more efficient than semi-join.
(i) Sending a bit vector is more efficient than sending a subset of R in step 1.
(ii) (3 points) State one reason why bloom-join can be less efficient than semi-join.
Solution:
(ii) Since hashing may have collisions, bloom-join can ship S tuples that do not appear in the final join
result. Semi-join only sends the S tuples that appear in the join result.
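A toy Python sketch of the bit-vector step (the value of `k`, the hash function, and the sample values are all my own illustrations). It also shows the collision effect behind answer (ii): value 11 hashes to the same bucket as 3, so an S tuple with B = 11 survives the filter even though it has no match in R.

```python
k = 8                       # bit-vector size (toy choice)

def h(v):
    return v % k            # toy hash into buckets 0 .. k-1

r_b = [3, 19, 42]           # R.B values at London
bits = [0] * k
for b in r_b:               # build the bit-vector: 3 -> 3, 19 -> 3, 42 -> 2
    bits[h(b)] = 1

s = [(3, "e1", "f1"), (7, "e2", "f2"), (11, "e3", "f3")]   # S at Paris
s1 = [t for t in s if bits[h(t[0])]]   # survivors shipped back to London
assert [t[0] for t in s1] == [3, 11]   # 11 is a false positive; 7 is dropped
```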
Consider four tables: S(A, B), T(B, C, D), U(C, D, E), V(B, C, F) and a natural join among them. Instead of
computing the result of the join, the goal is to compute “fully reduced relations”: (1) subsets of tuples from
S, T, U, V such that each of these tuples participates in at least one result tuple, and (2) these subsets should
be maximal, i.e., all tuples of S, T, U, V that appear in the original join result should stay.
S(A, B): (10, 9), (12, 9)
T(B, C, D): (9, 1, 7), (9, 1, 4)
U(C, D, E): (1, 7, 6), (1, 4, 1)
V(B, C, F): (9, 1, 5)
Note that there are four result tuples: (A, B, C, D, E, F) = (10, 9, 1, 7, 6, 5), (10, 9, 1, 4, 1, 5), (12, 9, 1, 7, 6, 5),
(12, 9, 1, 4, 1, 5), and all tuples from all relations participate in this result set.
S(A, B): (10, 9), (12, 9)
T(B, C, D): (9, 1, 7), (9, 2, 4)
U(C, D, E): (1, 7, 6), (1, 4, 1)
V(B, C, F): (9, 1, 5)
Here T(9, 2, 4) does not participate in the join result and is a “dangling tuple”. Similarly U(1, 4, 1). The fully
reduced relations for S, T, U, V will get rid of (only) these two tuples and produce the following output.
S(A, B): (10, 9), (12, 9)
T(B, C, D): (9, 1, 7)
U(C, D, E): (1, 7, 6)
V(B, C, F): (9, 1, 5)
(i) (4 points) Can you construct an instance where each of S, T, U, V has n tuples, and the result of the join
is n^2?
T has (1, c1, 1), (1, c2, 1), … , (1, cn, 1)
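One way to complete the construction (the S, U, V instances here are my own filling-in of the sketch, with everything funnelled through B = 1) can be verified in Python:

```python
# T has (1, c_i, 1) as in the solution sketch; S, U, V are chosen so that
# every S tuple joins with every T tuple, and U, V each match exactly once.
n = 4
S = [(f"a{i}", 1) for i in range(n)]           # S(A, B), all with B = 1
T = [(1, f"c{i}", 1) for i in range(n)]        # T(B, C, D)
U = [(f"c{i}", 1, 1) for i in range(n)]        # U(C, D, E)
V = [(1, f"c{i}", 1) for i in range(n)]        # V(B, C, F)

result = [
    (a, b, c, d, e, f)
    for a, b in S
    for b2, c, d in T if b2 == b               # n x n pairs from S x T
    for c2, d2, e in U if (c2, d2) == (c, d)   # exactly one match each
    for b3, c3, f in V if (b3, c3) == (b, c)   # exactly one match each
]
assert len(result) == n * n
```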
(ii) (10 points) Given arbitrary S, T, U, V, can you design an O(n log n) algorithm to compute the fully
reduced relations? Note that the join size can be Θ(n^2), so you cannot compute the join and project onto
individual relations! A brief description in English with some analysis is fine (no need for pseudocode etc.)
Hint1: One concept from class learnt in another context would help!
S(A, B)
   |
T(B, C, D)
  /        \
U(C, D, E)   V(B, C, F)
We need to do a semi-join pass bottom-up, then another pass top-down, and use sorting to keep rows
that actually join.
Bottom-up pass:
Step 1: U and T
-- See which tuples of T match with U1, delete the rest from T (O(n))
Step 2: V and T
-- See which tuples of T match with V1, delete the rest from T (O(n))
Step 3: T and S
-- See which tuples of S match with T1, discard the rest from S (O(n))
Top-down pass:
Now do the same trick top-down, i.e. first (new) S with (new) T, then (newer) T with U, and then with V to
keep only tuples in T, U, V that would participate in the join result.
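The two passes can be sketched in Python on the second example instance (relation values from the problem; hash-based semi-joins make each step O(n), while sorting instead gives the stated O(n log n) bound):

```python
def semijoin(keep, other, keep_key, other_key):
    """Keep the tuples of `keep` whose key appears in `other` (a semi-join)."""
    present = {other_key(t) for t in other}
    return [t for t in keep if keep_key(t) in present]

S = [(10, 9), (12, 9)]                 # S(A, B)
T = [(9, 1, 7), (9, 2, 4)]             # T(B, C, D); (9, 2, 4) is dangling
U = [(1, 7, 6), (1, 4, 1)]             # U(C, D, E); (1, 4, 1) is dangling
V = [(9, 1, 5)]                        # V(B, C, F)

# Bottom-up pass: reduce T by its children U and V, then S by T.
T = semijoin(T, U, lambda t: (t[1], t[2]), lambda u: (u[0], u[1]))  # on (C, D)
T = semijoin(T, V, lambda t: (t[0], t[1]), lambda v: (v[0], v[1]))  # on (B, C)
S = semijoin(S, T, lambda s: s[1], lambda t: t[0])                  # on B

# Top-down pass: reduce T by S, then U and V by T.
T = semijoin(T, S, lambda t: t[0], lambda s: s[1])
U = semijoin(U, T, lambda u: (u[0], u[1]), lambda t: (t[1], t[2]))
V = semijoin(V, T, lambda v: (v[0], v[1]), lambda t: (t[0], t[1]))

assert T == [(9, 1, 7)]   # dangling (9, 2, 4) removed
assert U == [(1, 7, 6)]   # dangling (1, 4, 1) removed
```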
Trivia: These queries are called “acyclic queries” for which such trees (called join trees) would always exist.
e.g., you cannot do this trick for queries like R(A, B), S(B, C), T(C, A), which is not acyclic and intuitively has
a cycle A -> B -> C -> A!
For the following three problems, assume we are considering replacing a hard disk drive (HDD) with a solid
state drive (SSD). Assume that our database can be held on both disks and pages from each relation are
stored in consecutive disk blocks initially. Suppose sequential scans over every table in the database on HDD
and SSD take roughly the same amount of time, but a random access on SSD is much faster. Write True/False;
no explanations are needed.
Ans: T
Ans: F
Problem X2.c (2 points)
Consider an external sort where the table being sorted has M2 pages (M is the number of pages in the
memory available for sorting), replacing HDD with SSD will improve the performance significantly.
Ans: T
INLJ with unclustered index will access pages from random locations.
BNLJ will load multiple pages of the outer relation from consecutive locations sequentially.
Sorting accesses next page from different runs that are not consecutively stored necessarily.