
Database Management Systems (COP 5725)

Homework 3

Instructor: Dr. Daisy Zhe Wang

TAs:
Yang Chen, Kun Li, Yang Peng
yang, kli, [email protected]

November 26, 2013

Name:
UFID:
Email Address:

Pledge (must be signed according to the UF Honor Code)


On my honor, I have neither given nor received unauthorized aid in doing this assignment.

Signature

For grading use only:

Question: I II III IV Total


Points: 26 25 24 25 100
Score:

COP5725, Fall 2013 Homework 3 Page 1 of 7

I. [26 points] Indexing.


Make the following assumptions for questions (1) and (2):
• A bucket can hold two keys and a pointer.
• The initial database D contains one object with key 10100.
Six objects with the following keys are inserted to D in the following order: 00110, 11010,
10011, 01010, 10110, and 01011.
(1) [3 points] Assume that an extensible hash table is used to index the database. Show
the index structure after the insertions.
(2) [3 points] Assume that a linear hash table is used to index the database with the
restriction that at most 80% of the hash table can be full at any time. Show the index
structure after the insertions.
Solution:

(1) Extensible hash table (hashing on the leading bits of the key), global depth i = 3:

    000, 001 -> [00110]          (local depth 2)
    010, 011 -> [01010, 01011]   (local depth 2)
    100      -> [10011]          (local depth 3)
    101      -> [10100, 10110]   (local depth 3)
    110, 111 -> [11010]          (local depth 2)

(2) Linear hash table (hashing on the trailing bits of the key), with i = 3, n = 5
buckets, and r = 7 records (occupancy 7/10 = 70% ≤ 80%):

    000 -> [ ]
    001 -> [ ]
    010 -> [00110, 11010] -> overflow [01010, 10110]
    011 -> [10011, 01011]
    100 -> [10100]
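As a cross-check, here is a minimal extensible-hash simulation (illustrative, not part of the graded solution) that reproduces the structure in (1); it assumes a bucket capacity of 2 keys and hashing on the leading bits of the key:

```python
class ExtensibleHash:
    """Toy extensible hash table; each bucket is a (keys, local_depth) pair."""

    def __init__(self, capacity=2):
        self.capacity = capacity
        self.i = 1                              # global depth
        self.dir = [([], 1), ([], 1)]           # directory over leading i bits

    def _idx(self, key):
        return int(key[: self.i], 2)

    def insert(self, key):
        keys, _ = self.dir[self._idx(key)]
        if len(keys) < self.capacity:
            keys.append(key)
            return
        old = self.dir[self._idx(key)]
        if old[1] == self.i:                    # directory must double
            self.i += 1
            self.dir = [b for b in self.dir for _ in (0, 1)]
        # split the overflowing bucket on bit position old[1] (0-based, MSB first)
        b0, b1 = ([], old[1] + 1), ([], old[1] + 1)
        for j, b in enumerate(self.dir):
            if b is old:
                self.dir[j] = b1 if (j >> (self.i - old[1] - 1)) & 1 else b0
        for k in old[0]:                        # redistribute old keys, then retry
            self.insert(k)
        self.insert(key)

h = ExtensibleHash()
for k in ["10100", "00110", "11010", "10011", "01010", "10110", "01011"]:
    h.insert(k)
```

Running the inserts in the order given ends with global depth 3 and the five buckets shown above.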

(3) [3 points] Describe a scenario when extensible hash tables are preferred over linear
hash tables. Explain your answer.
Solution: Two possibilities:
• When insertions are very frequent: a linear hash table is reorganized every time a
new bucket is added, and frequent insertions cause buckets to be added frequently.
• When the key values the index is built on are uniformly distributed (i.e., the hash
table fills uniformly).

For (4)-(6), consider a B+ tree whose nodes contain up to 4 keys (5 pointers).


(4) [3 points] Bulkload the B+ tree with values 46, 10, 70, 49, 23, 40, 59, 29, 34, 54, 75,
30.
Solution:

Root: [34 | 54]
Leaves: [10 23 29 30] [34 40 46 49] [54 59 70 75]
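The leaf-packing step of the bulkload can be sketched as follows (assuming the usual convention of filling leaves and copying up the first key of each leaf after the first, which matches the tree above):

```python
def bulkload_leaves(keys, fanout=4):
    """Pack sorted keys into full leaves of at most `fanout` keys; the
    separators are the first key of every leaf after the first."""
    keys = sorted(keys)
    leaves = [keys[i:i + fanout] for i in range(0, len(keys), fanout)]
    return [leaf[0] for leaf in leaves[1:]], leaves

seps, leaves = bulkload_leaves([46, 10, 70, 49, 23, 40, 59, 29, 34, 54, 75, 30])
# seps == [34, 54]
# leaves == [[10, 23, 29, 30], [34, 40, 46, 49], [54, 59, 70, 75]]
```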

(5) [3 points] Show the result B+ tree after inserting values 80, 24, 42.
Solution:
Root: [42]
Internal: [24 | 34]  [54 | 70]
Leaves: [10 23] [24 29 30] [34 40] | [42 46 49] [54 59] [70 75 80]

(6) [3 points] Based on the B+ tree in (5), show the result B+ tree after deleting values
10, 40.
Solution:
Root: [29 | 42 | 54 | 70]
Leaves: [23 24] [29 30 34] [42 46 49] [54 59] [70 75 80]

(7) [8 points] Fill in the cost table below for “Alternative 1” ISAM and B+ tree indices.
Assume each index takes P pages on disk, has height H, and fanout F at each internal
node. Assume there are R tuples in the relation, and B tuples fit on a leaf (or overflow)
page. In each case, assume infinite buffer pool size, but the buffer pool starts out empty.
For each page that gets dirty, add 1 to your I/O cost since it will eventually have to
be flushed to disk. For ISAM, assume that a leaf node maintains only a pointer to
the beginning of an overflow list. Given the constraints of a B+ Tree/ISAM, assume
whatever data you want in the tree for each case below.
Solution:

ISAM, worst-case # I/Os for range query:
    P (the index consists of a root with a linear string of overflow pages; we must
    look at all overflow pages since they are not sorted), or
    H + R/B, or
    H + (F^H − 1) + R/B (look at the whole leaf level and all data in the last
    leaf's overflow list).

ISAM, worst-case # I/Os for insert:
    P + 2 (the index consists of a root with a string of overflow pages; scan to the
    end, add a new overflow page in the worst case, and update the previous last
    overflow page with a pointer), or
    H + R/B + 2.

B+ tree, worst-case # I/Os for range query:
    H + F^H (the range query covers the whole table), or
    H + R/B.
    (P was not accepted here, as it would imply only 2 I/Os, given the structure of
    the index.)

B+ tree, worst-case # I/Os for insert:
    3H + 1 (every node on the path splits, plus 1 for the new root; we read the pages
    we are going to split on the way down, so we need not read them again).

II. [25 points] Query Evaluation.


Suppose we want to compute (R(a, b) ⋈ S(a, c)) ⋈ T(a, d) in the order indicated. We
have M = 101 main-memory buffers, and the number of disk blocks (pages) for R and S is
B(R) = B(S) = 2000. Now we decide to use one-pass or two-pass sort-merge-join algorithms
to implement the query.
(1) [2 points] Would you use a one- or two-pass sort-merge-join for R ⋈ S? Explain.
Solution: Two-pass sort-merge-join, since both operands are larger than main memory.
(2) We shall use the appropriate number of passes for the second join, first dividing T into
some number of sublists sorted on a, and merging them with the sorted and pipelined
stream of tuples from the join R ⋈ S. For what values of B(T) should we choose for
the join of T with R ⋈ S:
i. [3 points] A one-pass join; i.e., we read T into memory, and compare its tuples with
the tuples of R ⋈ S as they are generated.
Solution: B(T) ≤ 60. (Merging the 20 sorted sublists of R and the 20 of S occupies 40 of
the 101 buffers, and one buffer is needed for output, leaving 60 blocks to hold T.)
ii. [3 points] A two-pass join; i.e., we create sorted sublists for T and keep one buffer in
memory for each sorted sublist, while we generate tuples of R ⋈ S.
Solution: B(T) > 60.
iii. [4 points] For cases i. and ii., what is the total number of disk I/Os (in terms of
B(T))?
Solution: For i. we need 3 × (2000 + 2000) = 12,000 I/Os to perform the two-pass sort-
merge-join of R and S, and B(T) I/Os to read T in the one-pass join of (R ⋈ S) ⋈ T. The
total number of I/Os is 12,000 + B(T).
COP5725, Fall 2013 Homework 3 Page 4 of 7

For ii. we need:

• 2B(T) disk I/Os to sort T into sublists;
• 12,000 disk I/Os to join R ⋈ S;
• B(T) disk I/Os to read the sorted sublists of T.
The total number of disk I/Os is 12,000 + 3B(T).
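The I/O counts in i. and ii. can be sanity-checked with a couple of lines (B(R) = B(S) = 2000 as given):

```python
B_R = B_S = 2000

def cost_one_pass(B_T):
    # 3(B(R) + B(S)) for the two-pass sort-merge-join, plus one read of T
    return 3 * (B_R + B_S) + B_T

def cost_two_pass(B_T):
    # ...plus 2 B(T) to create T's sorted sublists and B(T) to read them back
    return 3 * (B_R + B_S) + 3 * B_T

# cost_one_pass(60) == 12060; cost_two_pass(100) == 12300
```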
(3) [4 points] Consider the query (R(a, b) ⋈ S(a, c)) ⋈ T(c, d), i.e., the second join is
on attribute c instead of a. How would you choose the join algorithms? Provide a
new cost estimate if your choices differ from (2).
Solution: We need to re-sort the intermediate result R ⋈ S on attribute c. The new
cost is 12,000 + 3B(T) + 2B(R ⋈ S), where the 2B(R ⋈ S) term comes from writing out the
sublists of R ⋈ S and reading them in again while joining (R ⋈ S) ⋈ T.

For (4)-(6), you are given M memory blocks and a relation R.


(4) [3 points] Describe a two-pass hash-based algorithm for duplicate elimination, δ(R).
(Hint: review the aggregation algorithm with grouping).
Solution:
• Hash R into M − 1 buckets based on all attributes.
• Perform δ on each bucket in isolation, using M memory blocks.
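The two passes above can be sketched as follows (illustrative Python, not the graded answer's code; tuples stand in for records and Python's built-in hash stands in for the bucket hash function):

```python
def two_pass_delta(R, M):
    """Two-pass hash-based duplicate elimination, delta(R)."""
    # Pass 1: hash R on all attributes into M - 1 buckets
    # (in a real DBMS each bucket is written back to disk).
    buckets = [[] for _ in range(M - 1)]
    for t in R:
        buckets[hash(t) % (M - 1)].append(t)
    # Pass 2: read each bucket back and perform delta on it in isolation.
    out = []
    for b in buckets:
        seen = set()
        for t in b:
            if t not in seen:
                seen.add(t)
                out.append(t)
    return out
```

For example, `two_pass_delta([1, 2, 2, 3, 1, 4], 5)` returns each distinct value exactly once.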
(5) [3 points] What is the largest relation your algorithm can handle given M blocks of
main memory?
Solution: M(M − 1). (Each of the M − 1 buckets must fit in M blocks of memory for the
second pass, so B(R) ≤ M(M − 1).)
(6) [3 points] What is the number of disk I/O’s of your algorithm?
Solution:
• B(R) for reading R and hashing;
• B(R) for writing out the buckets;
• B(R) for reading the buckets back and performing the actual duplicate elimination.
3B(R) in total.

III. [24 points] Query Optimization.


Consider the following database schema:
Employees(eid: integer, ename: string, sal: integer, title: string,
age: integer)

Suppose that the following indexes, all using Alternative (2) for data entries, exist: a hash
index on eid, a B+ tree index on sal, a hash index on age, and a clustered B+ tree index on
(age, sal). Each Employees record is 100 bytes long, and you can assume that each index
data entry is 20 bytes long. The Employees relation contains 10,000 pages.
(1) Consider each of the following selection conditions and, assuming that the reduction factor
(RF) for each term that matches an index is 0.1, compute the cost of the most selective access
path for retrieving all Employees tuples that satisfy the condition (in terms of the number
of I/O’s):

i. [4 points] age=25.
Solution: The clustered B+ tree index is the best option here, with a cost of 2 (lookup)
+ 10000 × 0.2 × 0.1 (index pages; data entries are 20/100 the record size) + 10000 × 0.1
(data pages) = 1202. Although the hash index has a smaller lookup cost, the potential
number of record fetches (10000 × 0.1 × 20 tuples per page = 20000) makes the clustered
index more efficient.
ii. [4 points] sal>200 AND age>30 AND title=’CFO’.
Solution: An age condition is present, so the clustered B+ tree index on (age, sal)
can be used. The cost is 2 + 10000 × 0.2 × 0.1 (all index pages satisfying age>30 must
be fetched) + 10000 × 0.1 × 0.1 (data pages) = 302.
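The arithmetic for both access paths, as a quick check (index pages are 20/100 of the 10,000 data pages; RF = 0.1 per matching term):

```python
DATA_PAGES = 10_000
INDEX_PAGES = DATA_PAGES * 20 // 100      # data entries are 20 of 100 bytes

# i.  age = 25: descend (2 I/Os) + matching index pages + matching data pages
cost_i = 2 + round(INDEX_PAGES * 0.1) + round(DATA_PAGES * 0.1)

# ii. the age term selects the index pages; the extra matching term applies a
#     second RF of 0.1 to the data pages
cost_ii = 2 + round(INDEX_PAGES * 0.1) + round(DATA_PAGES * 0.1 * 0.1)
```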

Consider the following relational schema and SQL query:

Emp(eid, did, sal, hobby)


Dept(did, dname, floor, phone)
Finance(did, budget, sales, expences)

SELECT D.dname, F.budget


FROM Emp E, Dept D, Finance F
WHERE E.did = D.did AND D.did = F.did AND
D.floor = 1 AND E.sal >= 59000 AND E.hobby = 'yodelling';

(2) [5 points] Identify a query plan that a decent query optimizer would choose.
Solution: Push the selections and projections below the joins:

    π_{D.dname, F.budget}(
        ( π_{E.did}( σ_{E.sal ≥ 59000 ∧ E.hobby = 'yodelling'}(E) )
          ⋈ π_{D.did, D.dname}( σ_{D.floor = 1}(D) ) )
        ⋈ π_{F.did, F.budget}(F) )
(3) Suppose that the following additional information is available:
• Unclustered B+ tree indexes exist on Emp.did, Emp.sal, Dept.did, and Finance.did
(each leaf page contains up to 200 entries).
• The system statistics indicate that employee salaries range from 10,000 to 60,000, and
employees enjoy 200 different hobbies.
• The company owns two floors in the building.
• There are a total of 50,000 employees and 5,000 departments (each with corresponding
financial information) in the database.
• The DBMS used by the company has just one join method available, namely index nested
loops.

i. [3 points] For each of the query’s base relations, estimate the number of tuples that would
be initially selected from that relation if all of the non-join predicates on that relation were
applied to it before any join processing begins.
Solution:
• Emp: 50000 × (1000/50000) × (1/200) = 5.
• Dept: 5000 × (1/2) = 2500.
• Finance: 5000.
ii. [8 points] Under the System R approach, determine a join order that has the least estimated
cost. Compute the cost of your plan (in terms of the number of disk I/O’s).
Solution: ((D ⋈ E) ⋈ F).
• First, we use the B+ tree index on sal to retrieve the tuples of E with E.sal >= 59000.
We estimate that 50000 × (1000/50000) = 1000 such tuples are selected, at a cost of 1
tree traversal (say 3 I/Os to reach a leaf) + the cost of scanning the leaf pages
(1000/200 = 5) + the cost of retrieving the 1000 tuples (since the index is unclustered,
each tuple is potentially 1 disk I/O) = 3 + 5 + 1000 = 1008. Of these 1000 retrieved
tuples, an on-the-fly selection keeps only those with hobby = 'yodelling'; we estimate
there will be 1000/200 = 5 such tuples.
• Pipeline these 5 tuples from E one at a time to D. Using the B+ tree index on D.did
and the fact that D.did is a key, we find the matching tuples for the join by searching
the D.did B+ tree and retrieving at most 1 matching tuple per tuple from E. The cost of
E ⋈ D is hence the cost of an index nested-loops join: 5 × (tree traversal of the D.did
B+ tree + record retrieval) = 5 × (3 + 1) = 20.
• Now select, on the fly, the ⌈5/2⌉ = 3 tuples with D.floor = 1 and pipeline them to F
(this happens as E ⋈ D produces tuples). Use the B+ tree index on F.did and the fact
that F.did is a key to retrieve at most 1 tuple for each of the 3 pipelined tuples. This
costs at most 3 × (3 + 1) = 12.
• Ignoring the cost of writing out the final result, we get a total cost of 1008 + 20 +
12 = 1040.
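The plan's cost adds up as follows (all constants copied from the steps above):

```python
retrieve_e = 3 + 5 + 1000   # B+ tree descent + leaf scan + unclustered fetches
e_join_d = 5 * (3 + 1)      # 5 yodelling tuples each probe the D.did index
ed_join_f = 3 * (3 + 1)     # 3 floor-1 tuples each probe the F.did index
total = retrieve_e + e_join_d + ed_join_f
# total == 1040
```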

IV. [25 points] Transactions and Concurrency Control.


(1) For each of the following schedules:
a) r1(A); r2(B); w1(B); w2(C); r3(C); w3(A);
b) r1(A); r2(A); r1(B); r2(B); r3(A); r4(B); w1(A); w2(B);
Answer the following questions:
i. [4 points] What is the precedence graph for the schedule?
ii. [4 points] Is the schedule conflict-serializable? If so, what is an equivalent serial schedules?
Solution: a) i. Edges: T2 → T1, T2 → T3, T1 → T3.
ii. Yes; an equivalent serial schedule is T2, T1, T3.
b) i. Edges: T2 → T1, T3 → T1, T1 → T2, T4 → T2.
ii. No; the precedence graph contains the cycle T1 → T2 → T1.
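The graphs above can be derived mechanically; a sketch (illustrative, not part of the assignment) that builds the precedence graph from a schedule and tests it for cycles:

```python
def precedence_edges(schedule):
    """schedule: list of (txn, op, item) with op 'r' or 'w'.
    Emit an edge Ti -> Tj for every conflicting pair where Ti acts first."""
    edges = set()
    for i, (t1, o1, x1) in enumerate(schedule):
        for t2, o2, x2 in schedule[i + 1:]:
            if t1 != t2 and x1 == x2 and "w" in (o1, o2):
                edges.add((t1, t2))
    return edges

def has_cycle(edges):
    """True iff some node can reach itself along the directed edges."""
    def reach(a, b, seen=()):
        return any(u == a and (v == b or (v not in seen and
                   reach(v, b, seen + (v,)))) for u, v in edges)
    return any(reach(n, n) for n in {n for e in edges for n in e})

a = [(1, "r", "A"), (2, "r", "B"), (1, "w", "B"),
     (2, "w", "C"), (3, "r", "C"), (3, "w", "A")]
b = [(1, "r", "A"), (2, "r", "A"), (1, "r", "B"), (2, "r", "B"),
     (3, "r", "A"), (4, "r", "B"), (1, "w", "A"), (2, "w", "B")]
```

Applied to schedules a) and b), this yields exactly the edge sets above: a) is acyclic, b) is not.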
(2) [17 points] Consider the following two transactions:

T1: w1(C); r1(A); w1(A); r1(B); w1(B);
T2: r2(B); w2(B); r2(A); w2(A);

Say our scheduler performs exclusive locking only (i.e., no shared locks). For each of the
following three instances of transactions T1 and T2 annotated with lock and unlock actions,
say whether the annotated transactions:

1. obey two-phase locking,


2. will necessarily result in a conflict serializable schedule (if no deadlock occurs),
3. will necessarily result in a strict schedule (if no deadlock occurs),
4. will necessarily result in a serial schedule (if no deadlock occurs), and
5. may result in a deadlock.

a) T1: l1(B); l1(C); w1(C); l1(A); r1(A); w1(A); r1(B); w1(B); Commit; u1(A); u1(C); u1(B);
   T2: l2(B); r2(B); w2(B); l2(A); r2(A); w2(A); Commit; u2(A); u2(B);
b) T1: l1(C); l1(A); r1(A); w1(C); w1(A); l1(B); r1(B); w1(B); u1(A); u1(C); u1(B); Commit;
   T2: l2(B); r2(B); w2(B); l2(A); r2(A); w2(A); Commit; u2(A); u2(B);
c) T1: l1(C); w1(C); l1(A); r1(A); w1(A); l1(B); r1(B); w1(B); Commit; u1(A); u1(C); u1(B);
   T2: l2(B); r2(B); w2(B); l2(A); r2(A); w2(A); Commit; u2(A); u2(B);

Format your answer in a table with Yes(Y)/No(N) entries.


Solution:

         2PL   Necessarily            Necessarily   Necessarily   May result
               conflict-serializable  strict        serial        in deadlock
    a)    Y             Y                 Y             Y              N
    b)    Y             Y                 N             Y              Y
    c)    Y             Y                 Y             N              Y
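Condition 1 (two-phase locking) can be checked mechanically for a single transaction; a sketch (illustrative, ignoring Commit actions): a transaction obeys 2PL iff it requests no lock after its first unlock.

```python
def obeys_2pl(actions):
    """actions: list of (op, item) with op in {'l', 'u', 'r', 'w'}.
    True iff no lock request follows an unlock."""
    unlocked = False
    for op, _ in actions:
        if op == "u":
            unlocked = True
        elif op == "l" and unlocked:
            return False
    return True

# T1 from instance a): every lock precedes every unlock, so 2PL holds.
t1_a = [("l", "B"), ("l", "C"), ("w", "C"), ("l", "A"), ("r", "A"),
        ("w", "A"), ("r", "B"), ("w", "B"), ("u", "A"), ("u", "C"), ("u", "B")]
```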
