
CompSci 316 Spring 2019: Final Exam 

Question Booklet 
100 points (20% of course grade) + 20 points extra credit 
Friday, May 3, 2019 

●This exam booklet is printed single-sided. You may use the back of the pages as scratch space or 
extra space (add a pointer if you are writing your answer there). 
● The exam is open-book and open-notes; any written materials may be used. No electronic materials 
can be used. 
● You have 180 minutes to complete the exam. 
● The problems do not necessarily come in increasing order of difficulty. You might want to look 
through the entire exam before getting started, in order to plan your strategy. 
● There is no penalty for guessing answers to questions. However, for short-answer questions,
simplicity and clarity of solutions will count. You may get as few as 0 points for a problem if your
solution is far more complicated than necessary.
● No explanations are needed unless asked explicitly. But you are welcome to write 1-2 sentences if you
think the question does not specify all assumptions, to clarify your answer for possible partial/full
credit.
NAME (please print):      

In accordance with both the letter and spirit of the Duke Community Standard, I have neither given nor 
received assistance on this exam. 

SIGNATURE:     

0. (bonus)  1/1  1. (Short Q/A)  /40 


2. (SQL etc.)  /10  3. (Normalization)  /10 
4. (XML)  /9  5. (Logs)  /8 
6. (Index/Joins)  /17  7. (Distributed)  /6 
X1. (extra credit)  /13  X2. (extra credit)  /7 
Total   

Page 1​ of 19 
 
 

 
 

Problem 0 (1 point) 
We fell a little short of the number of course evaluations needed for the promised free points, but as a thank
you to all of you who did complete the course evaluation, you will all receive one free point!

Problem 1: For each statement below, indicate whether it is “T” (true) or “F” (false). 
No explanations needed.​ But if you are unsure or think either answer may be correct depending on the 
context, you can give a 1 line explanation. 

Problem 1a ( 2 points) 
Given a table R(A, B, C) that has a B+ tree index on R(A, B), there is no need to create a B+ tree index
on R(B).

ANS: F.  

For answering queries like select * from R where B > X, index on R(A, B) does not help. 

Problem 1b ( 3 points) 
Given three tables R(A, B), S(B, C), and T(C, D), consider Selinger’s dynamic programming algorithm that 
optimizes the natural join of the three tables. Suppose there are two sub query plans P1 and P2 for joining R 
and S: P1 sorts the join result by B, while P2 sorts the join result by C, and P2 costs more than P1. We should 
always discard P2 for better performance. 

ANS: T. In Selinger’s algorithm we only consider optimal subplans so we should discard P2. 

NOTE: We also gave points for F when the explanation said that we cannot discard P2 from further
consideration because P1 and P2 produce relevant yet different interesting orders.

Problem 1c ( 4 points) 
(2 points)​ If a DBMS uses steal and force policy, it never has to redo changes of a committed transaction. 

Ans: T 

(2 points) If a DBMS uses steal and force policy, it never has to undo changes of an uncommitted transaction.

Ans: F 

 
Problem 1d ( 10 points) 
Consider the following schedule involving three transactions T1, T2, T3. 

(recall that r1(A) means T1 reads A etc.) 

r1(A), w1(A), r3(B), w2(A), w3(B), w1(B) 

(3 points)​ The schedule is conflict serializable. 

True: draw the precedence graph: T3 -> T1 -> T2 

(3 points)​ The schedule is possible under two-phase locking (2PL). 

False: 

T1 must release its lock on A after w1(A), because T2 needs it for w2(A); but T1 must also acquire a lock on
B after w3(B) in order to do w1(B). Acquiring a lock after releasing one is not allowed in 2PL.

If instead T1 retained the lock on A until it is done with w1(B), w2(A) could not execute.

(2 points)​ The schedule avoids cascading rollback. 

True: No dirty read from data written by another uncommitted transaction.

(2 points)​ The schedule has two equivalent serial schedules. 

False: from the precedence graph, there is only one. 
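The precedence-graph reasoning used throughout this problem can be checked mechanically. The following Python sketch (an illustration, not part of the official solution) builds the precedence graph from the schedule and counts the equivalent serial schedules as topological orders:

```python
from itertools import permutations

# Schedule r1(A), w1(A), r3(B), w2(A), w3(B), w1(B) as (txn, op, item) triples
schedule = [(1, 'r', 'A'), (1, 'w', 'A'), (3, 'r', 'B'),
            (2, 'w', 'A'), (3, 'w', 'B'), (1, 'w', 'B')]

# Precedence graph: edge Ti -> Tj if an op of Ti conflicts with a later op
# of Tj (same item, different transactions, at least one write)
edges = set()
for i, (t1, o1, x1) in enumerate(schedule):
    for t2, o2, x2 in schedule[i + 1:]:
        if t1 != t2 and x1 == x2 and 'w' in (o1, o2):
            edges.add((t1, t2))

# Conflict serializable iff the graph is acyclic; the equivalent serial
# schedules are exactly the topological orders of the graph
txns = {t for t, _, _ in schedule}
serial = [p for p in permutations(sorted(txns))
          if all(p.index(a) < p.index(b) for a, b in edges)]
print(edges)   # contains (3, 1) and (1, 2): T3 -> T1 -> T2
print(serial)  # [(3, 1, 2)] -- exactly one equivalent serial schedule
```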

Problem 1e (2 points) 
Given relations R(A, B), S(B, C), the following relational algebra expression is valid: π_B(σ_{A=5}(R)) − σ_{B=5}(S),

where − denotes set difference.

Answer: False 

The schemas are different (B vs. B, C), so the two sides are not union-compatible.

Problem 1f (6 points) 

 
(3 points) Given relations R and S (without duplicates), R ∩ S = R − ((R − S) ⋃ (S − R)), where ⋃, ∩,
and − denote set union, set intersection, and set difference.

True 

(3 points) Given relations R and S (with possible duplicates, bag semantics), if R has m copies of 1 and S
has n copies of 1, then for all values of m and n such that m >= n:

R ⋒ S = R − ((R − S) ⋓ (S − R)), where ⋓, ⋒, and − denote bag union, bag intersection, and bag
difference.

True: Suppose R has m copies of 1, S has n copies of 1, and m >= n. LHS = n. RHS = m − ((m − n) + 0) = n.

(3 points) In the above question, for all values of m and n such that m < n:

R ⋒ S = R − ((R − S) ⋓ (S − R)), where ⋓, ⋒, and − denote bag union, bag intersection, and bag
difference.

False: If m < n, LHS = m, while RHS = m − (0 + (n − m)) = 2m − n (or 0 if that is negative), which may differ from m.

e.g. assume R is {1}, S is {1, 1}: LHS = {1}, RHS = {1} − {1} = emptyset.
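The bag-semantics identity above can be sanity-checked with Python's `collections.Counter`, whose `&`, `-`, and `+` operators behave as bag intersection, bag difference, and additive bag union (a sketch, not part of the exam solution):

```python
from collections import Counter

def identity_holds(m, n):
    """Check R bag-intersect S == R - ((R - S) bag-union (S - R))
    when R has m copies of 1 and S has n copies of 1."""
    R, S = Counter({1: m}), Counter({1: n})
    lhs = R & S                    # bag intersection: min(m, n) copies of 1
    rhs = R - ((R - S) + (S - R))  # Counter '+' adds multiplicities (bag union)
    return lhs == rhs

print(identity_holds(3, 2))  # True  (m >= n)
print(identity_holds(1, 2))  # False (m < n: LHS = {1}, RHS = empty)
```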

Problem 1g (6 points) 
Consider the following two XPath queries: 

//A[B/C = "foo" and B/D = "bar"] 

//A[B[C = "foo" and D = "bar"]] 

(3 points) ​Every element returned by the first query will be returned by the second. 

Answer:

False. Consider the following XML document: 


<A> 
<B><C>foo</C></B> 
<B><D>bar</D></B> 
</A> 

The first query will return this element; the second query will not. 

(3 points)​ Every element returned by the second query will be returned by the first. 

Answer:

True. Any C and D elements that make an A element satisfy the condition in the second query will also make the
same A element satisfy the condition in the first query.

 
Problem 2: SQL etc. (10 points)
Consider the schema of a database for course registration: 

• Course: C(cid, name, year, capacity) (capacity denotes the maximum possible number of students). Note
that we clarified during the exam that cid is not the key here.

• Student: S(sid, name, major)  

• Registration: R(sid, cid) 

The following SQL query outputs the yearly enrollment rate (as e_rate) of the courses across different years. 
 

SELECT C.cid, C.year, COUNT(S.sid) / C.capacity as e_rate

FROM Course C, Student S, Registration R

WHERE C.cid = R.cid AND R.sid = S.sid

GROUP BY C.cid, C.year 


 

SOLUTION: 

(i) ​(3 points) ​How will you modify the given SQL query, if for each output tuple (c, y, r), course id c 
should have enrollment rate r >= 0.9 in year y? (No need to rewrite the query, just state the changes). 
SELECT C.cid, C.year, COUNT(S.sid) / C.capacity as e_rate
FROM Course C, Student S, Registration R
WHERE C.cid = R.cid AND R.sid = S.sid
GROUP BY C.cid, C.year
HAVING (COUNT(S.sid) / C.capacity) >= 0.9 
or use the given query as a subquery and select on e_rate 

(ii) (4 points) Using the given SQL query, output course ids c whose enrollment rate did not reach 0.9 in any
year. (You can assume that the given query result has been materialized as a view TEMP and use TEMP in
your new query.)
SELECT TEMP.cid FROM TEMP GROUP BY TEMP.cid HAVING
MAX(TEMP.e_rate) < 0.9

or 

SELECT TEMP.cid FROM TEMP


EXCEPT
SELECT TEMP.cid FROM TEMP WHERE TEMP.e_rate >= 0.9

 
(iii) ​(3 points)​ Draw a query plan tree for the given SQL query that IS NOT CONSIDERED by Selinger’s 
dynamic programming algorithm for the joins of C, S, and R.

Any query plan which is not a left-deep plan.

Gamma cid, year, COUNT(S.sid) / C.capacity -> e_rate
                 |
                JOIN
               /    \
              C     JOIN
                   /    \
                  S      R
NOTE: We did not deduct point if someone did not include the top-most aggregate step. 

Problem 3: Normalization (10 points)


Suppose you have the following schema representing the grading duties of TAs in CS 316 in Spring 2020 : 

grade_duty(duty_id, ta_id, answer_id, homework_id, problem_id, student_id, grade, overdue) 

ta_id and student_id denote the ids of the TA and the student. For each homework, we have multiple
problems. Each answer may include the solutions to multiple problems (as in Gradescope). For each problem
in each homework submitted by a student, there is a grade, and if it is late then overdue = 1.

We denote this schema by DTAHPSGO by taking the first letter from each attribute. Suppose there are 
following functional dependencies. 

(1) D -> TAHPSGO -- Duty id is the key for the table 


(2) A -> SH -- Each answer applies to one student, one homework 
(3) HT -> P -- One TA grades only one problem for each homework 

(i) (3 points): ​For the following example table, are there any rows violating the stated functional 
dependencies? If so, identify any one violation, and state which functional dependency is violated, 
and which duty_id’s violate them. 

duty_id (D)  TA_id (T)  answer_id (A)  homework_id (H)  problem_id (P)  student_id (S)  grade (G)  overdue (O)

1000         John       101            1                1               5001            90         Yes
1001         Mary       104            2                3               5005            92         Yes
1002         Jesse      105            2                3               5002            85         No
1003         John       103            4                2               5005            100        No
1004         Mary       105            2                5               5002            95         No
1005         Ray        103            4                3               5003            70         No


1001 and 1004, violate HT -> P 

1003 and 1005, violate A -> SH 

(ii) (4 points)​: Compute a BCNF decomposition of this table schema DTAHPSGO using the three FDs. 
Show the steps. 

Option 1: 

From A -> SH, decompose DTAHPSGO into DTAPGO and ASH. 

From A -> SH and HT -> P we have AT -> SHT -> P, so AT -> P. 

From AT -> P, decompose DTAPGO into DATGO and ATP.

Option 2: 

Using HT->P, get HTP and DTAHSGO 

Using A -> SH, decompose DTAHSGO and get ASH and DTAGO
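The key and BCNF checks above all reduce to attribute closures. A minimal closure sketch in Python (illustrative only; the FDs are the ones stated in the problem):

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds; each FD is a (lhs, rhs) pair of sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

# FDs (1)-(3) on DTAHPSGO
fds = [(set('D'), set('TAHPSGO')),
       (set('A'), set('SH')),
       (set('HT'), set('P'))]
R = set('DTAHPSGO')

print(closure('D', fds) == R)          # True: D is a key
print(sorted(closure('A', fds)))       # ['A', 'H', 'S'] != R: A -> SH violates BCNF
print(closure('AT', fds) >= set('P'))  # True: the derived FD AT -> P used in Option 1
```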

(iii) (3 points): Consider a simplified version of this schema with only the attributes DTH, and one
functional dependency D -> TH. If we decompose the table into DT and TH, is it a lossless join
decomposition? Explain your answer.

No. See counterexample below: 

DTH: 

duty_id   TA_id  homework_id 

1001  Ray  1 

1002  Ray  2 
 

 
 

DT: 

duty_id   TA_id 

1001  Ray 

1002  Ray 
 

TH: 

TA_id  homework_id 

Ray  1 

Ray  2 
 

DT join TH: 

duty_id   TA_id  homework_id 

1001  Ray  1 

1001  Ray  2 

1002  Ray  1 

1002  Ray  2 
Rows 2 and 3 are spurious tuples not in the original table, so the decomposition is lossy.
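The counterexample can be replayed directly. This small Python sketch (an illustration, not part of the official solution) projects the table onto DT and TH and joins them back, exposing the spurious tuples:

```python
# The counterexample table DTH and its projections onto DT and TH
DTH = {(1001, 'Ray', 1), (1002, 'Ray', 2)}
DT = {(d, t) for d, t, h in DTH}
TH = {(t, h) for d, t, h in DTH}

# Natural join of DT and TH on the shared attribute TA_id
joined = {(d, t, h) for d, t in DT for t2, h in TH if t == t2}
print(sorted(joined))
# [(1001, 'Ray', 1), (1001, 'Ray', 2), (1002, 'Ray', 1), (1002, 'Ray', 2)]
print(joined == DTH)  # False: two spurious tuples, so the decomposition is lossy
```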

Problem 4: XML and XQuery (9 points) 


Consider a course registration XML DTD: 

<!DOCTYPE Registration [
<!ELEMENT Registration (Course+)>
<!ELEMENT Course (Name, Student* )>
<!ATTLIST Course Capacity CDATA #REQUIRED>
<!ELEMENT Name (#PCDATA)>

 
<!ELEMENT Student (Name, Grade)>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Grade (#PCDATA)>
]>

(i) ​(4 points)​ Find XPath expressions that are equivalent to the XQuery below. Select ​ALL OPTIONS​ that 
are correct (or say that none are correct). 

for $c in /Registration/Course 
return 
if (exists($c/Student[Grade >= 90 and Grade < 95])) then $c/Name 
   
A. /Registration/Course[Student[Grade >= 90 and Grade < 95]]/Name 
B. /Registration/Course[./Student/Grade >= 90 and ./Student/Grade < 95]/Name 
C. /Registration/Course[Student[Grade >= 90][Grade < 95]]/Name 
D. /Registration/Course[count(./Student[Grade < 90 or Grade >= 95]) = 0]/Name 

ANS: AC

B is wrong because there can be multiple students in a course, and its condition can be satisfied by two
different students: one whose grade is >= 90 and another whose grade is < 95.

D is wrong because it requires that no student falls outside the range [90, 95), rather than that some student falls inside it.

(ii) (5 points) ​Write a query using XQuery to find courses where at least half of the enrolled students 
received their highest score among all courses they took. You may use some of the aggregate functions: 
count(), min(), max(), where aggr(nodeset) returns a single value. Assume that the names of students are 
unique. 

for $c in /Registration/Course 
let $cnt := count( 
  for $s in $c/Student 
  let $g := max(//Student[Name = $s/Name]/Grade) 
  where $g <= $s/Grade 
  return $s/Name 
) 
let $cnt2 := count($c/Student) 
where $cnt >= 0.5 * $cnt2 
return $c/Name 
 
 
 
 
 

 
Problem 5: LOGS (8 points) 
 

Consider UNDO/REDO logging with fuzzy checkpointing. Recall that update record (T, A, o, n) denotes 
that transaction T changed the value of A from o to n. 

At the time of a system crash, let the log segment involving four transactions S, T, U, V be as follows.

1. (START S) 

2. (S, X, 10, 20) 

3. (S, Y, 5, 10)  

4. (COMMIT S) 

5. (START T) 

6. (T, X, 20, 30)  

7. (START CKPT(T))  

8. (T, Y, 10, 20)  

9. (START U) 

10. (COMMIT T)  

11. (U, X, 30, 40)  

12. (END CKPT) 

13. (U, Y, 20, 30)  

14. (START V)  

15. (START CKPT(U,V))  

16. (COMMIT U) 

17. (V, Y, 30, 40)  

18. (V, Y, 40, 50)  

(i) (4 pts) Fill out the table below by writing “Guaranteed”, “Maybe”, or “Impossible” for each update,
saying whether that updated value of the variable was written to disk by the time of the crash:

● Guaranteed: this update is guaranteed to be written to disk,  


● Maybe: this update might have been written to disk, 



 
● Impossible: this update could not have been written to disk.

Log sequence number  Object and its value  Guaranteed/Maybe/Impossible

2  X = 20   

3  Y = 10   

6  X = 30   

8  Y = 20   

11  X = 40   

13  Y = 30   

17  Y = 40   

18  Y = 50   
 

(ii) (2 pts)​ After the REDO stage of recovery, what will be the value of X and Y? 

X =  

Y =  

(iii) (2 pts) ​After the UNDO stage of recovery, what will be the value of X and Y? 

X =  

Y =  

Solution: 

(i) The first checkpoint completed (END CKPT at line 12), which guarantees that all updates made
before its START CKPT (line 7) are on disk. The second checkpoint did not complete, so all later
updates may or may not have been written to disk while it was running.



 
Log sequence number  Object and its value  Guaranteed/Maybe/Impossible

2  X = 20  Guaranteed 

3  Y = 10  Guaranteed 

6  X = 30  Guaranteed 

8  Y = 20  Maybe 

11  X = 40  Maybe 

13  Y = 30  Maybe 

17  Y = 40  Maybe 

18  Y = 50  Maybe 


 

(ii) X = 40 

Y = 50  

S committed before the first START CKPT, so it is complete. T and U committed as well by the time of
the crash. All changes made by T prior to the START CKPT are guaranteed to be on disk, so redo does
not need to go back beyond the START CKPT at line 7. From this point, we start “repeating the
history”, i.e., make all updates in order; as a result we get X = 40 (line 11) and Y = 50 (line 18).

(iii) 

The only uncommitted transaction is V, so the UNDO step focuses on V and undoes its changes in
reverse order. It first reverts Y from 50 to 40 (line 18), then reverts Y from 40 to 30 (line 17).
So the final values will be

X = 40 

Y = 30.  
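The redo/undo result can be simulated. The sketch below (illustrative; it repeats history from the start of the log segment rather than from the START CKPT, which yields the same final state here) applies the update records in order, then rolls back the loser:

```python
# Update records (txn, object, old, new) from the log, in log order
updates = [('S', 'X', 10, 20), ('S', 'Y', 5, 10),
           ('T', 'X', 20, 30), ('T', 'Y', 10, 20),
           ('U', 'X', 30, 40), ('U', 'Y', 20, 30),
           ('V', 'Y', 30, 40), ('V', 'Y', 40, 50)]
committed = {'S', 'T', 'U'}  # V had not committed at the time of the crash

db = {}
# REDO: repeat history by applying every update in log order
for txn, obj, old, new in updates:
    db[obj] = new
assert db == {'X': 40, 'Y': 50}  # values after the REDO stage

# UNDO: roll back the loser (V) in reverse log order using the old values
for txn, obj, old, new in reversed(updates):
    if txn not in committed:
        db[obj] = old
print(db)  # {'X': 40, 'Y': 30}
```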

   



 
Problem 6: Join Algorithm/Index (17 points) 
Consider the following query:

SELECT R.A, S.B, S.C

FROM R, S

WHERE R.B = S.B AND S.C > X

Suppose that all attributes in R(A, B), S(B, C) are integers, the range of S.C is [11, 210], and the tuples are 
uniformly distributed. 

(i). (9 points) ​Assume that we have ​infinite memory​, ​no indexes​, and our only join algorithm is a nested 
loops join. The database optimizer decides to execute the query with the following query plan. When X = 
208, for each table and operator, estimate the number of output tuples and the I/O Cost (in terms of number 
of pages) of each operator in the query plan. 

The tables have the following number of tuples and pages on disk: 

Table  Tuples  Pages 

R  1,000  100 

S  50,000  5,000 
 

                  tuples: ________
                  cost: ________
           Nested Loops Join R.B = S.B
            /                      \
tuples: ________              tuples: ________
cost: ________                cost: ________
      R                       select(S.C > 208)



 
 

Ans: 

                  tuples: (see below; everyone got 1.5 for this)
                  cost: 0
           Nested Loops Join R.B = S.B
            /                      \
tuples: 1000                  tuples: 500
cost: 100                     cost: 5000
     Scan R                   Scan S and select(S.C > 208)

For selection on S, the estimated number of tuples in the selection result is (210-208) / (210 - 11+1) 
* 50000 = 500. And we have to scan the entire table.   

For R join S, the result can be stored in memory so there is no I/O cost in this phase.  

Now the solution needed some additional assumptions. If you assume that R has the primary key and S the
foreign key, then since there are 50,000 tuples in S and 1,000 tuples in R, the estimated number of matching
tuples in R for every tuple in S is 0.02. So the estimated number of tuples in the final result is 500 * 0.02 = 10.
(Note the uniformity assumption.)
(note the uniformity assumption.) 

If you assume some distinct number of values of B in S and R, the answer would be different.  

Since the question was under-specified, everyone got 1.5 points for the top-most sub-question on #tuples. 
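The uniformity estimate used above is easy to express as a helper; a tiny Python sketch (illustrative; the range and table size come from the problem, and integer arithmetic is used to keep the counts exact):

```python
def est_tuples(x, lo=11, hi=210, total=50_000):
    """Estimated size of the selection S.C > x under the uniformity assumption."""
    return (hi - x) * total // (hi - lo + 1)

print(est_tuples(208))  # 500   -> 50 pages at 10 tuples/page
print(est_tuples(110))  # 25000 -> 2500 pages, used in part (ii) below
```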

(ii) (8 points) Now assume that the memory has only 10 pages, one index lookup requires 4 I/Os, and after
the selection S.C > X, the results are written back to disk. Suppose there is a clustered B+tree index on S(B).
Ignore page boundaries. Suppose the options are

● hash join 
● index nested loop join 

a. (4 points) ​Which join algorithm will you choose when X = 208 to have less I/O? Write costs or 
explanations to justify your answer. 



 
b. (4 points) Which join algorithm will you choose when X = 110 to have less I/O? Write costs or
explanations to justify your answer.

Ans: 

2.a: Use hash join. When X = 208, there are 500 tuples on the S side, fitting in 50 pages, and 1,000 tuples
on the R side, fitting in 100 pages. Hash join needs about sqrt(min(50, 100)) + 2 ≈ 10 pages of memory, so it
can be performed; cost = 3(50 + 100) = 450.

The cost of indexed nested loop join is B(R) + |R|* (4 + 1) = 100 + 1,000 * 5 = 5,100 

1 R tuple joins with at most 5 S tuples, which would fit in one page (ignoring page boundary). 

2.b: Use index nested-loop join. 

When X = 110, there are 25,000 tuples on the S side, fitting in 2,500 pages, and 1,000 tuples on the R side,
fitting in 100 pages. Hash join would need about sqrt(min(2500, 100)) + 2 = 12 pages of memory, so it cannot
be performed.

Problem 7: Distributed Joins (6 points)  


Recall the semijoin algorithm for distributed join processing. 

Suppose you have a table R(A, B, C) in London, and the table S(B, E, F) in Paris and we want to do an 
equijoin. Here are the steps: 

1. At London: Project R onto the join column B. Send the projection R1(B) to Paris. 

2. At Paris, equijoin R1 with S, send the result S1(B, E, F) to London 

3. At London, join S1 with R. Output final answer A(A, B, C, E, F). 

Here is how a slight modification of semijoin, called Bloom join, works.

1. At London: compute a bit-vector of some size k: hash R.B values into the range 0 to k-1; if
some tuple hashes to the p-th bucket, set bit p to 1 (p from 0 to k-1). Ship the bit-vector to Paris.
2. At Paris: hash each tuple of S.B similarly; discard S tuples that hash to a 0 in R's bit-vector.
Let the resulting subset of S be S1. Ship S1(B, E, F) to London.

3. At London, join S1 with R. Output final answer A(A, B, C, E, F).  

Step 3 is the same as semi-join. Steps 1 and 2 are different.  
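Steps 1-3 can be sketched in a few lines of Python (a toy illustration: a single hash function, tiny hypothetical tables, and Python's built-in `hash` standing in for a real hash family):

```python
K = 16  # bit-vector size shipped from London to Paris (hypothetical)

def bit_vector(values, k=K):
    v = [0] * k
    for x in values:
        v[hash(x) % k] = 1
    return v

# Tiny hypothetical instances of R(A, B, C) at London and S(B, E, F) at Paris
R = [(1, 'b1', 'c'), (2, 'b2', 'c'), (3, 'b3', 'c')]
S = [('b1', 'e', 'f'), ('b9', 'e', 'f'), ('b2', 'e', 'f')]

# Step 1: London ships K bits instead of the projection R1(B)
bv = bit_vector(b for _, b, _ in R)
# Step 2: Paris keeps only S tuples whose B hashes to a set bit
#         (collisions may let non-joining tuples like ('b9', ...) through)
S1 = [s for s in S if bv[hash(s[0]) % K]]
# Step 3: London joins S1 with R, which filters out any false positives
result = [(a, b, c, e, f) for a, b, c in R for b2, e, f in S1 if b == b2]
print(result)  # [(1, 'b1', 'c', 'e', 'f'), (2, 'b2', 'c', 'e', 'f')]
```

Note that even if a false positive survives step 2, the final join in step 3 still produces the correct answer; the collision only costs extra shipping.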

(i) (3 points) State one reason why bloom-join can be ​more efficient ​than semi-join. 



 
Solution: 

(i) Sending a bit-vector is more efficient than sending the projection R1(B) in step 1.

(ii) (3 points) State one reason why bloom-join can be ​less efficient​ than semi-join. 

Solution: 

(ii) Since hashing may have collisions, Bloom join can ship S tuples in step 2 that do not appear in the
final join result. Semijoin ships only the S tuples that appear in the join result.

Problem X1: Advanced join algorithm. (14 points) 


So far we have only considered I/O cost for join processing. Now we will look at an advanced join algorithm 
that aims to improve the CPU cost, i.e., we will try to design a better join algorithm with better running time 
complexity considering standard CPU cost (you can assume all relations are in memory). 

Consider four tables: S(A, B), T(B, C, D), U(C, D, E), V(B, C, F) and a natural join among them. Instead of
computing the result of the join, the goal is to compute “fully reduced relations”: (1) subsets
of tuples from S, T, U, V such that each of these tuples participates in at least one result tuple, i.e.

π_{A,B}(S ⨝ T ⨝ U ⨝ V) = S,  π_{B,C,D}(S ⨝ T ⨝ U ⨝ V) = T, etc.

(2) Further, these subsets should be maximal, i.e. all tuples in S, T, U, V that appear in the original join result 
should stay.  

Example of fully reduced relations: 

S(A, B): (10, 9), (12, 9) 
T(B, C, D): (9, 1, 7), (9, 1, 4) 
U(C, D, E): (1, 7, 6), (1, 4, 1) 
V(B, C, F): (9, 1, 5) 

Note that there are four result tuples: (A, B, C, D, E, F) = (10, 9, 1, 7, 6, 5), (10, 9, 1, 4, 1, 5), (12, 9, 1, 7, 6, 5), 
(12, 9, 1, 4, 1, 5), and all tuples from all relations participate in this result set.

Example of NOT fully reduced relations: 

S(A, B): (10, 9), (12, 9) 
T(B, C, D): (9, 1, 7), (9, 2, 4) 
U(C, D, E): (1, 7, 6), (1, 4, 1) 
V(B, C, F): (9, 1, 5) 

Here T(9, 2, 4) does not participate in the join result and is a “dangling tuple”. Similarly U(1, 4, 1). The fully
reduced relations for S, T, U, V will get rid of (only) these two tuples and produce the following output.



 
S(A, B): (10, 9), (12, 9) 
T(B, C, D): (9, 1, 7) 
U(C, D, E): (1, 7, 6) 
V(B, C, F): (9, 1, 5)

Answer the following questions. 

(i) (4 points) Can you construct an instance where each of S, T, U, V has n tuples, and the result of the join
has n² tuples?

Ans: Have A: a1, …, an. Have B = 1. Have C: c1, …, cn. 

D = E = F = 1 for all tuples. 

S, T, U, V each have n tuples, and the join has n² tuples, i.e.,

S has (a1, 1), (a2, 1) …., (an, 1) 

T has (1, c1, 1), (1, c2, 1), … , (1, cn, 1) 

U has (c1, 1, 1), …. (cn, 1, 1) 

V has (1, c1, 1), ... , (1, cn, 1) 

Several other answers are possible! 

(ii) (10 points) Given arbitrary S, T, U, V, can you design an O(n log n) algorithm to compute the fully
reduced relations? Note that the join size can be Θ(n²), so you cannot compute the join and project onto the
individual relations! A brief description in English with some analysis is fine (no need for pseudocode etc.)

Hint1: One concept from class learnt in another context would help! 

Hint 2: does this tree give you some idea? 

S(A, B) 

T(B, C, D) 

/ \ 

U(C, D, E) V(B, C, F) 



 
Ans: The concept we use here is semi-join! I.e., project onto the join column, join with sorting, and see which
tuples participate. Basically it is projection and sorting, and the tree should give us the idea of how to project
and compare relations.

We need to do a semi-join pass bottom-up, then another pass top-down, and use sorting to keep rows
that actually join.

So here is the full algo:  

Bottom-up pass: 
Step 1: U and T 

-- Project U to CD to U1 and keep U1 in sorted order (O(n log n)) 

-- Sort T by CD (O(n log n)) 

-- See which tuples of T match with U1, delete the rest from T (O(n)) 

Step 2: V and T 

-- Project V to BC to V1 and keep V1 in sorted order (O(n log n)) 

-- Sort (new) T by BC (O(n log n)) 

-- See which tuples of T match with V1, delete the rest from T (O(n))   

Step 3: T and S 

-- Project (new) T to B to T1 and keep T1 in sorted order (O(n log n)) 

-- Sort S by B (O(n log n)) 

-- See which tuples of S match with T1, discard the rest from S (O(n))   

After this bottom up pass, every tuple in S participate in a join tuple! 

Top-down pass: 

Now do the same trick top-down, i.e. first (new) S with (new) T, then (newer) T with U, and then with V to 
keep only tuples in T, U, V that would participate in the join result. 

Total cost = still O(n log n) 
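The two semijoin passes can be replayed on the running example. This Python sketch is illustrative only: it uses hash sets for the membership tests (expected linear time) rather than the sorting-based comparisons that give the O(n log n) bound described above:

```python
# The second example above: T(9, 2, 4) and U(1, 4, 1) are dangling
S = {(10, 9), (12, 9)}      # S(A, B)
T = {(9, 1, 7), (9, 2, 4)}  # T(B, C, D)
U = {(1, 7, 6), (1, 4, 1)}  # U(C, D, E)
V = {(9, 1, 5)}             # V(B, C, F)

def semijoin(rel, pos, other, other_pos):
    """Keep tuples of rel whose projection on pos appears in other's projection."""
    seen = {tuple(t[i] for i in other_pos) for t in other}
    return {t for t in rel if tuple(t[i] for i in pos) in seen}

# Bottom-up pass over the join tree S - T - {U, V}
T = semijoin(T, (1, 2), U, (0, 1))  # keep T tuples whose CD appears in U
T = semijoin(T, (0, 1), V, (0, 1))  # ... and whose BC appears in V
S = semijoin(S, (1,), T, (0,))      # keep S tuples whose B appears in T
# Top-down pass
T = semijoin(T, (0,), S, (1,))
U = semijoin(U, (0, 1), T, (1, 2))
V = semijoin(V, (0, 1), T, (0, 1))

print(T)  # {(9, 1, 7)}  -- the dangling T tuple is gone
print(U)  # {(1, 7, 6)}  -- and so is the dangling U tuple
```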

Trivia: These queries are called “acyclic queries” for which such trees (called join trees) would always exist. 
e.g., you cannot do this trick for queries like R(A, B), S(B, C), T(C, A), which is not acyclic and intuitively has 
a cycle A -> B -> C -> A! 



 
Problem X2: 7 points
 

For the following three problems, assume we are considering replacing a hard disk drive (HDD) with a solid 
state drive (SSD). Assume that our database can be held on both disks and pages from each relation are 
stored in consecutive disk blocks initially. Suppose sequential scans over every table in the database on HDD 
and SSD take roughly the same amount of time, but a random access on an SSD is much faster. Write True/False;
no explanations are needed.

Problem X2.a (2 points) 


Consider an indexed nested loop join using an unclustered index, where the inner table cannot fit in the 
memory, replacing HDD with SSD will improve the performance significantly. 

Ans: T 

Problem X2.b (2 points) 


Consider a block nested loop join where the smaller table fits in the memory and multiple pages are available 
to load pages from the outer relation, replacing HDD with SSD will improve the performance significantly. 

Ans: F 
 
Problem X2.c (2 points) 
Consider an external sort where the table being sorted has M² pages (M is the number of pages in the
memory available for sorting); replacing HDD with SSD will improve the performance significantly.

Ans: T 
 

Problem X2.d (1 point)


Explain your answer for above three questions in 1-2 sentences. 
 
Ans: The first and third one need random access, the second one needs sequential access. 

INLJ with an unclustered index will access pages from random locations.

BNLJ will load multiple pages of the outer relation from consecutive locations sequentially.

Sorting accesses the next page from different runs, which are not necessarily stored consecutively.



 
