Midterm Exam: Introduction To Database Systems: Solutions: Below Is The Preferred Solution
College of Engineering
Department of EECS, Computer Science Division
CS186 J. Hellerstein
Spring 2010 Midterm #1
a. [12 points] Complete the diagram above to be a valid E-R diagram reflecting the following
constraints. (Be sure to make your bold lines very bold!)
• A student, uniquely identified by her SID, takes an exam (for example, Midterm #1) on exactly one
date. A student may take any number of exams (for example, Midterm #1, Midterm #2, and a final
exam), and every exam is taken by at least one student. An exam is uniquely identified by the
combination of a course and a semester.
Points for question 1(a) were assigned according to the following rubric:
+1 “name” is underlined with a dotted line and connected to “Overseer” with a regular line.
+1 “Overseer” and “has” are bolded, and there is a bold arrow from “Overseer” to “has”
+1 “Exam” and “has” are connected with a bold line
+1 “Exam” and “takes” are connected with a bold line
+1 “Exam” and “on” are connected with a bold line
+1 “course” and “semester” are underlined with solid lines. “Exam” and “course” are
connected with a solid line. “Exam” and “semester” are connected with a solid line.
+1 “Question” is connected to “on” with a regular arrow
+1 “date” is connected to “takes” (preferred) or to “Exam” with a regular line, and is not
underlined.
+1 “takes” is connected to “Student” with a regular line
+1 “sid” is underlined and connected to “Student” with a regular line
+1 “Student” is connected to “answers” with either a regular or a bold line
+1 “answers” is either connected to “Exam”, “Question”, or to an aggregate surrounding
“Exam”, “on”, and “Question”, with a regular line.
-1 Extraneous markings were included, such as bolding a relation other than Overseer or
underlining a relation.
• [2 points] Consider the following E-R diagram, which is a fragment of a simple board game schema
that captures the legal moves available in each position on a board:
We want to translate “Moves” into an SQL table and maintain the constraints in the ER diagram.
Correctly complete the SQL statement below (note that “--” begins a comment in SQL):
Assume there are no indexes available, and both relations are in arbitrary order on disk. Assume that
we use the refinement for sort-merge join that joins during the final merge phase. However, assume
that our implementation of hash join is simple: it cannot perform recursive partitioning, and does not
perform hybrid hash join. The optimizer will not choose hash join if it would require recursive
partitioning.
For each of these questions, 2 points were given for the correct algorithm and 3 points for the correct
cost. If the algorithm was incorrect but the cost was correct for the given algorithm, 1 point was given.
One point was deducted if the name of an algorithm wasn’t quite right. If the cost calculation
contained a few minor errors, or was incomplete, 1 or 2 points were deducted.
• [5 points] Assume you have B=3 memory buffers, enough to hold 3 disk pages in memory at
once. (Remember that one buffer must be used to buffer the output of a join). What is the best
join algorithm to compute the result of this query? What is its cost, as measured in the number
of I/Os (i.e., pages requested)? Do not include the cost of writing the final output. You may
ignore buffer pool effects in this question.
• [5 points] Suppose we raise the number of memory buffers to B=52, and increase the size of the
Departments relation to 500 pages. What is the best join algorithm now, and what is its cost (no
writing final output, ignoring buffer hits)?
• Hash Join: Since B^2 = 2704 > 500 = min(|Employee|, |Department|), we can use hash join
in this problem. Since there is no recursive partitioning, total cost is
3(|Department| + |Employee|) = 3(1100 + 500) = 4800.
• Sort-Merge Join: The number of buffers is now large enough to sort each relation in two
passes, with enough room to do the refinement for both relations (ceiling(1100/52) +
ceiling(500/52) = 22 + 10 = 32 < 52). So cost is 3(1100 + 500) = 4800, same as Hash Join.
• Doubly-nested loop join: cost is NumTuples(Department)*|Employee| + |Department| or
about (10000)(1100) + 500 = 11000500 > 4800
• Page-oriented doubly-nested loop join: cost is |Department|*|Employee| + |Department| =
(500)(1100) + 500 = 550500 > 4800
• Block nested loops join: Cost is (ceiling(|Department|/(B-2)) * |Employee|) +
|Department| = (500/(52-2))(1100) + 500 = 11500 > 4800.
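The cost comparison above can be reproduced with a short Python sketch. The function names are hypothetical, and the feasibility checks simply encode the rules of thumb used in this solution (B^2 > min(|R|, |S|) for hash join without recursive partitioning, and total initial runs < B for the sort-merge refinement); the page and tuple counts are the ones quoted in the bullets.

```python
import math

def hash_join_cost(r_pages, s_pages, buffers):
    # Simple hash join: partition both relations (read + write), then read
    # the partitions once. Returns None if recursive partitioning would be
    # needed, using the B^2 > min(|R|, |S|) rule from the solution.
    if buffers ** 2 <= min(r_pages, s_pages):
        return None
    return 3 * (r_pages + s_pages)

def sort_merge_join_cost(r_pages, s_pages, buffers):
    # Sort-merge join that joins during the final merge phase. Returns None
    # if the initial runs of both relations cannot be merged in one pass.
    runs = math.ceil(r_pages / buffers) + math.ceil(s_pages / buffers)
    if runs >= buffers:
        return None
    return 3 * (r_pages + s_pages)

def tuple_nested_loops_cost(outer_pages, outer_tuples, inner_pages):
    # Scan the inner relation once per outer tuple.
    return outer_tuples * inner_pages + outer_pages

def page_nested_loops_cost(outer_pages, inner_pages):
    # Scan the inner relation once per outer page.
    return outer_pages * inner_pages + outer_pages

def block_nested_loops_cost(outer_pages, inner_pages, buffers):
    # Scan the inner relation once per block of (B - 2) outer pages.
    return math.ceil(outer_pages / (buffers - 2)) * inner_pages + outer_pages

EMPLOYEE_PAGES = 1100
DEPARTMENT_PAGES = 500
DEPARTMENT_TUPLES = 10000  # tuple count used in the solution's estimate
B = 52

costs = {
    "hash join": hash_join_cost(EMPLOYEE_PAGES, DEPARTMENT_PAGES, B),
    "sort-merge join": sort_merge_join_cost(EMPLOYEE_PAGES, DEPARTMENT_PAGES, B),
    "tuple nested loops": tuple_nested_loops_cost(
        DEPARTMENT_PAGES, DEPARTMENT_TUPLES, EMPLOYEE_PAGES),
    "page nested loops": page_nested_loops_cost(DEPARTMENT_PAGES, EMPLOYEE_PAGES),
    "block nested loops": block_nested_loops_cost(
        DEPARTMENT_PAGES, EMPLOYEE_PAGES, B),
}

for name, cost in costs.items():
    print(f"{name}: {cost} I/Os")
```

Department is used as the outer relation in all three nested-loops variants, matching the formulas in the bullets.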
1, 99, 100. Initially the pool fills with 1, 2, 3. Then the 3 position is overwritten until the end
of the first pass, at which point it is 1, 2, 100. During the second pass, 1 and 2 are hits, then
the 2 position is overwritten until 99 is reached. Finally 100 is a hit.
• [1 point] What is the hit rate (#hits/#requests) in the scenario of part (a)?
3/200 or 1.5%. Only 1, 2, and 100 are hit, once each.
• [3 points] To save a random I/O, your friend suggests that we scan the file once from pageID
1 to pageID 100, and then switch into reverse and scan from pageID 100 back down to 1.
Again starting with an empty buffer pool and using MRU, what pages will be in memory at
the end of this scan?
1, 2, 3. Initially the pool fills with 1, 2, 3. Then the 3 position is overwritten until the end of
the first pass, at which point it is 1, 2, 100. During the second pass, 100 is a hit, then the 100
position is overwritten until 3 is reached. Finally 1 and 2 are hits.
• [1 point] What is the hit rate (#hits/#requests) in the scenario of part (c)?
3/200 or 1.5%. Only 100, 2, and 1 are hit, once each.
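The MRU traces for the two scan patterns can be checked with a small simulation. This is a minimal sketch, assuming a pool of 3 frames and a 100-page file as described in the answers; `simulate_mru` is a hypothetical helper, not part of any course code.

```python
def simulate_mru(requests, pool_size):
    """Replay page requests against an MRU buffer pool.

    Returns the sorted pool contents at the end and the number of hits.
    """
    pool = []    # page IDs currently in memory
    mru = None   # most recently used page ID
    hits = 0
    for page in requests:
        if page in pool:
            hits += 1
        elif len(pool) < pool_size:
            pool.append(page)             # free frame available
        else:
            pool[pool.index(mru)] = page  # evict the most recently used page
        mru = page
    return sorted(pool), hits

forward = list(range(1, 101))
backward = list(range(100, 0, -1))

# Two forward scans of pages 1..100.
pool_a, hits_a = simulate_mru(forward + forward, pool_size=3)
# One forward scan followed by one reverse scan.
pool_c, hits_c = simulate_mru(forward + backward, pool_size=3)

print(pool_a, hits_a)
print(pool_c, hits_c)
```

Both patterns issue 200 requests, so the hit rate in each case is the reported hit count divided by 200.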
• [1 point] Consider a sorted file organization as we studied in class, with N tightly packed
pages of records. Write an expression for expected (average case) number of I/O requests for
an equality lookup in the file.
log_2 N. This can be achieved with binary search on the pages, followed by searching the page
containing the record in memory.
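The log_2 N figure can be checked empirically by counting page reads during a binary search over pages. The sketch below is illustrative only: the file layout (4 integer keys per page, N = 1024 pages) and the helper name `page_reads_for_lookup` are assumptions made for the example.

```python
import math

def page_reads_for_lookup(pages, key):
    """Binary search over sorted pages; returns the number of pages read."""
    lo, hi = 0, len(pages) - 1
    reads = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        page = pages[mid]  # fetching this page is one I/O request
        reads += 1
        if key < page[0]:
            hi = mid - 1
        elif key > page[-1]:
            lo = mid + 1
        else:
            break          # the key, if present, is on this page
    return reads

N = 1024               # number of pages (assumed for the example)
RECORDS_PER_PAGE = 4   # assumed page capacity

pages = [[p * RECORDS_PER_PAGE + i for i in range(RECORDS_PER_PAGE)]
         for p in range(N)]

reads = [page_reads_for_lookup(pages, key) for page in pages for key in page]
worst = max(reads)
average = sum(reads) / len(reads)

print(f"worst case: {worst} reads, average: {average:.2f} reads, "
      f"log2(N) = {math.log2(N):.0f}")
```

The final in-memory search within the located page costs no additional I/O, which is why only page fetches are counted.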
• [1 point] Again using MRU and starting with an empty buffer pool, what is the expected hit
rate in the buffer pool for the scenario of part (e)?
• [5 points] The B+-tree drawn below has order 3 (i.e. max 6 pointers per internal node), and contains
a number of errors. Circle the errors and draw a new correct B+-tree over the same data entries (the
entries in the leaves).
• (1 pt) The internal node containing only “4” is underfull. It has 2 pointers, fewer than the
minimum of 3 (which is half the maximum of 6).
• (1 pt) The value “7” is in the left subtree of the root, but should be in the right subtree,
since the key in the root node is 6 and 7 > 6.
• (1 pt) The internal node containing “11 15” is not underfull, because it contains 3
pointers to child nodes, which is exactly half the maximum (6 pointers).
Additionally, the presence of the key “11” which is not present in any leaf is
not an error (internal nodes may contain values which were once present in
leaves but have since been deleted). This point was deducted if this internal
node was marked erroneous.
Additionally, the root node is not underfull because the root node is specially permitted to be less
than half full. Following is a correct new B+-tree over the same data entries:
This is the unique correct B+-tree with the exact same leaf nodes. It was also valid to compact the
data values into a smaller number of leaf nodes, if the values in the root are adjusted accordingly. It
is not possible to use more than one internal node.
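The occupancy rule applied in the grading points above can be written as a small check. This is a sketch under stated assumptions: `is_underfull` is a hypothetical helper, an internal node of order d holds at most 2d pointers and (unless it is the root) at least d, and an internal root is assumed to need only 2 pointers.

```python
def is_underfull(num_pointers, order, is_root=False):
    """Return True if an internal B+-tree node has too few child pointers."""
    max_pointers = 2 * order
    if num_pointers > max_pointers:
        raise ValueError("node is overfull, not underfull")
    # Assumption: the root is exempt from the half-full rule and only needs
    # two children; every other internal node needs at least `order`.
    min_pointers = 2 if is_root else order
    return num_pointers < min_pointers

ORDER = 3  # max 6 pointers per internal node, as in the question

print(is_underfull(2, ORDER))                # node containing only "4"
print(is_underfull(3, ORDER))                # node containing "11 15"
print(is_underfull(2, ORDER, is_root=True))  # root node
```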
When 9 is inserted, it goes into the node containing “5 6 7 8”. However, this node is now overfull and
must be split into two nodes “5 6 7” and “8 9”. The value 7 is added to the parent internal node,
which is now in turn overfull, so it is split into “5 7” and “10 12”. Finally a new root node is created
and “7” is moved to the root node.
Alternatively, we can make all our splits with the smaller side on the left. In this case, the overfull leaf
is split as “5 6” and “7 8 9”, 8 is added to the parent node, and “10” is moved to the new root node
instead of the new value. Both trees are shown below.
The new root node uses a total of 4(4) + 2 + 3 + 2 = 23 < 30 bytes (assuming string prefixes are
null-terminated or length-prefixed with a single byte), so it all fits in one node.